SMS-Based Server Downtime
Alert System
using the 46elks API · one npm dependency · zero platforms
── Problem ──
Every production system goes down eventually. The question is whether you find out in thirty seconds or thirty minutes.
I had a small VPS running a few personal services — nothing critical, but things I wanted to stay up. I was already paying for uptime monitoring via a third-party dashboard, but the alert channel was email. Email requires me to have a client open. It requires me to be at a desk. For a 3am incident, it's effectively silent.
PagerDuty and similar platforms solve this, but they introduce a managed layer between my system and my phone. I don't want a platform. I want an SMS that fires the moment a threshold is crossed.
The 46elks SMS API is a single authenticated POST request. The question was: can I build a reliable, low-noise downtime alerting system on top of it in an afternoon, with no external dependencies beyond the API itself?
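That single POST is small enough to sketch up front. A minimal version with Node 18+'s built-in `fetch`, assuming credentials in environment variables (`buildSmsRequest` is a hypothetical helper name, not part of the 46elks API; the endpoint, auth scheme, and `from`/`to`/`message` fields are the real API surface):

```javascript
// Build the one request the 46elks SMS API needs: a Basic-auth POST
// with three form-encoded fields. Split out so the shape is visible.
function buildSmsRequest(user, password, to, message) {
  return {
    url: 'https://api.46elks.com/a1/sms',
    options: {
      method: 'POST',
      headers: {
        Authorization:
          'Basic ' + Buffer.from(`${user}:${password}`).toString('base64'),
        'Content-Type': 'application/x-www-form-urlencoded',
      },
      body: new URLSearchParams({ from: 'Monitor', to, message }).toString(),
    },
  };
}

// Fire the request; credentials are placeholders read from the environment.
async function sendSms(to, message) {
  const { url, options } = buildSmsRequest(
    process.env.ELKS_API_USER,
    process.env.ELKS_API_PASSWORD,
    to,
    message,
  );
  const res = await fetch(url, options);
  return res.json();
}
```

That is the entire integration: no SDK, no token refresh, no webhook registration.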
── Hypothesis ──
A Node.js health check loop with a consecutive-failure threshold, combined with a direct call to the 46elks SMS API, can deliver a downtime alert to my phone within 90–120 seconds of a real outage — with a low enough false-positive rate to be trustworthy.
The threshold matters. A single failed check shouldn't send an alert. Three consecutive failures should.
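The threshold behaves like a small state machine: alert only on the transition into the "down" state. A sketch of that logic in isolation (`makeFailureGate` is a hypothetical helper, not code from the monitor itself):

```javascript
// Consecutive-failure gate: returns true exactly once, on the check
// that crosses the threshold while the service was considered up.
function makeFailureGate(threshold = 3) {
  let failures = 0;
  let down = false;
  return function record(ok) {
    if (ok) {
      failures = 0;
      down = false;
      return false; // healthy, or just recovered
    }
    failures++;
    if (failures >= threshold && !down) {
      down = true;
      return true; // third consecutive failure: fire the alert
    }
    return false; // transient blip absorbed, or already alerted
  };
}
```

A single timeout never fires; three in a row fire exactly one alert.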
── Architecture ──
┌───────────────────────────────────────────────────┐
│ Monitoring Process (Node.js, separate VPS) │
│ │
│ setInterval(30s) │
│ │ │
│ ▼ │
│ HTTP GET → target/health │
│ │ │
│ ├── 200 OK ──────────────────► reset state │
│ │ │
│ └── timeout / 5xx / error │
│ │ │
│ ▼ │
│ consecutiveFailures++ │
│ │ │
│ ┌─────────┴─────────┐ │
│ │ failures < 3 │ failures >= 3 │
│ │ do nothing │ AND isDown === false │
│ └───────────────────┘ │ │
│ ▼ │
│ POST api.46elks.com/a1/sms │
│ │ │
│ isDown = true │
│ downSince = Date.now() │
└───────────────────────────────────────────────────┘
│
▼
46elks SMS gateway
│
▼
+46 XXX XXX XXX (my phone)
SMS delivered ~1.5–3s
Recovery path:
Next successful check after isDown === true
│
▼
POST /a1/sms — "[RECOVERED] back online after N min"
isDown = false, consecutiveFailures = 0
| Component | Choice | Reason |
|---|---|---|
| Runtime | Node.js 20 LTS | No dependency overhead, built-in HTTPS |
| HTTP client | Node built-in `https` | Zero dependencies — the API doesn't need a library |
| Scheduler | `setInterval` | No cron daemon needed, restarts cleanly |
| Config | dotenv | Only external dependency. No credentials in code. |
| Alert channel | 46elks SMS API | Direct REST, basic auth, no platform |
| Monitor host | Separate VPS | Monitor must survive what it monitors |
| Process manager | pm2 | Auto-restart, log rotation, startup on boot |
── Implementation ──
- Provision the monitor host on separate infrastructure. Running the monitor on the same server as the monitored service is the first mistake. I used a $4/mo VPS from a different provider. If the primary host dies, the monitor survives.
- Define a `/health` endpoint on the target server. Don't ping the root path. A `/health` endpoint returns a deliberate 200 with no side effects. Mine returns `{"status":"ok","ts":1736330400}`. If the endpoint is slow or broken, that's a real signal.
- Write the check loop with hysteresis. The failure threshold absorbs transient blips — a DNS hiccup, a brief network stall. Three consecutive failures over 90 seconds means something is genuinely wrong. Without this, false positives make the alert untrustworthy within 24 hours.
- Implement a cooldown on repeat alerts. If the server stays down, you don't want an SMS every 30 seconds. A 10-minute cooldown sends a "still down" reminder after sustained outages without flooding your phone.
- Always send a recovery alert. The down alert opens a loop. The recovery alert closes it. Without it you're left manually checking whether the server came back, or whether your intervention actually worked.
- Run with pm2: `pm2 start monitor.js --name monitor`, then `pm2 save && pm2 startup`. Auto-restart on crash, log rotation included, survives server reboots.
── Code ──
```js
// monitor.js — adham46elks.com/experiments/001
require('dotenv').config();
const https = require('https');
const { URL } = require('url');

const CONFIG = {
  targetUrl: process.env.TARGET_URL,
  alertPhone: process.env.ALERT_PHONE,
  elksUser: process.env.ELKS_API_USER,
  elksPassword: process.env.ELKS_API_PASSWORD,
  fromName: process.env.ALERT_FROM || 'Monitor',
  checkIntervalMs: Number(process.env.CHECK_INTERVAL_MS) || 30_000,
  failureThreshold: Number(process.env.FAILURE_THRESHOLD) || 3,
  timeoutMs: Number(process.env.TIMEOUT_MS) || 5_000,
  cooldownMs: Number(process.env.COOLDOWN_MS) || 10 * 60_000,
};

let consecutiveFailures = 0;
let isDown = false;
let downSince = null;
let lastAlertAt = 0;

// ── HTTP check ──────────────────────────────────────────────
function httpCheck(targetUrl, timeoutMs) {
  return new Promise((resolve, reject) => {
    const parsed = new URL(targetUrl);
    const lib = parsed.protocol === 'https:' ? https : require('http');
    const req = lib.get(targetUrl, (res) => {
      resolve(res.statusCode);
      res.resume(); // drain to free socket
    });
    const timer = setTimeout(() => {
      req.destroy(new Error('timeout'));
    }, timeoutMs);
    req.on('error', (err) => {
      clearTimeout(timer);
      reject(err);
    });
    req.on('close', () => clearTimeout(timer));
  });
}

// ── 46elks SMS ──────────────────────────────────────────────
function sendSms(message) {
  return new Promise((resolve, reject) => {
    const body = new URLSearchParams({
      from: CONFIG.fromName,
      to: CONFIG.alertPhone,
      message,
    }).toString();
    const auth = Buffer
      .from(`${CONFIG.elksUser}:${CONFIG.elksPassword}`)
      .toString('base64');
    const req = https.request({
      hostname: 'api.46elks.com',
      path: '/a1/sms',
      method: 'POST',
      headers: {
        'Authorization': `Basic ${auth}`,
        'Content-Type': 'application/x-www-form-urlencoded',
        'Content-Length': Buffer.byteLength(body),
      },
    }, (res) => {
      let raw = '';
      res.on('data', (chunk) => { raw += chunk; });
      res.on('end', () => resolve(JSON.parse(raw)));
    });
    req.on('error', reject);
    req.write(body);
    req.end();
  });
}

// ── Main check loop ─────────────────────────────────────────
async function check() {
  const ts = new Date().toISOString();
  let statusCode;
  let ok = false;
  try {
    statusCode = await httpCheck(CONFIG.targetUrl, CONFIG.timeoutMs);
    ok = statusCode >= 200 && statusCode < 400;
  } catch (err) {
    statusCode = err.message;
  }

  if (!ok) {
    consecutiveFailures++;
    console.log(`[${ts}] FAIL — ${statusCode} (${consecutiveFailures}/${CONFIG.failureThreshold})`);
    const shouldAlert = consecutiveFailures >= CONFIG.failureThreshold;
    const cooldownExpired = Date.now() - lastAlertAt > CONFIG.cooldownMs;

    if (shouldAlert && !isDown) {
      isDown = true;
      downSince = new Date();
      const msg = `[DOWN] ${CONFIG.targetUrl} is unreachable. Detected ${ts}.`;
      console.log('→ Sending alert:', msg);
      await sendSms(msg).catch(console.error);
      lastAlertAt = Date.now();
    } else if (isDown && cooldownExpired) {
      const mins = Math.round((Date.now() - downSince.getTime()) / 60_000);
      const msg = `[STILL DOWN] ${CONFIG.targetUrl} unreachable for ${mins} min.`;
      console.log('→ Sending reminder:', msg);
      await sendSms(msg).catch(console.error);
      lastAlertAt = Date.now();
    }
  } else {
    if (isDown) {
      const mins = Math.round((Date.now() - downSince.getTime()) / 60_000);
      const msg = `[RECOVERED] ${CONFIG.targetUrl} is back after ${mins} min.`;
      console.log('→ Sending recovery:', msg);
      await sendSms(msg).catch(console.error);
    }
    consecutiveFailures = 0;
    isDown = false;
    downSince = null;
    console.log(`[${ts}] OK — ${statusCode}`);
  }
}

// ── Start ───────────────────────────────────────────────────
console.log(`Monitor starting. Target: ${CONFIG.targetUrl}`);
console.log(`Interval: ${CONFIG.checkIntervalMs / 1000}s | Threshold: ${CONFIG.failureThreshold} failures`);
check();
setInterval(check, CONFIG.checkIntervalMs);
```
`.env`:

```
# Target
TARGET_URL=https://yourserver.com/health
ALERT_PHONE=+46700000000

# 46elks credentials — dashboard.46elks.com
ELKS_API_USER=your_api_user
ELKS_API_PASSWORD=your_api_password

# Tuning
ALERT_FROM=Monitor
CHECK_INTERVAL_MS=30000
FAILURE_THRESHOLD=3
TIMEOUT_MS=5000
COOLDOWN_MS=600000
```
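One guard worth adding near the top of monitor.js: fail fast when a required variable is missing, rather than discovering at alert time that the credentials were never loaded. A sketch (`validateConfig` is a hypothetical helper; the variable names match the `.env` file above):

```javascript
// Fail fast on missing configuration. Called once at startup with
// process.env, before the first check runs.
function validateConfig(env) {
  const required = [
    'TARGET_URL',
    'ALERT_PHONE',
    'ELKS_API_USER',
    'ELKS_API_PASSWORD',
  ];
  const missing = required.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing required env vars: ${missing.join(', ')}`);
  }
}

// usage at startup:
// validateConfig(process.env);
```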
── Results ──
Tested over 14 days continuous operation · 2 real incidents captured
| Measurement | Observed |
|---|---|
| Threshold breach → API call | < 100ms |
| 46elks API response time | 180–320ms |
| SMS delivery to handset (Swedish number) | 1.4–3.2s |
| Total: outage detected → SMS received | ~92–95s worst case |
| False positives over 14 days | Zero |
| Missed real alerts | Zero |
| npm dependencies | 1 (dotenv) |
── INCIDENT LOG ──
Incident 1 — 2026-01-11 03:17 UTC
Cause: VPS OOM kill, nginx process died
Detected: 03:18:47 (91s after nginx stopped responding)
Alert SMS received: 03:18:49
Resolved: 03:24:12 (manual: pm2 restart nginx)
Recovery SMS received: 03:24:25
Total downtime: 6m 25s

Incident 2 — 2026-01-19 14:02 UTC
Cause: Brief network partition on host provider
Duration: ~45 seconds
Result: Never hit failure threshold — resolved between checks
Alert: none sent (correct behaviour)

The 46elks API leg is not the bottleneck. The bottleneck is the intentional 90-second detection window — 3 checks at 30-second intervals. That's a design decision, not an API limitation. The API call itself completes in under 500ms.
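The ~92–95s worst-case figure falls out of the configuration arithmetic. A quick check, using the default tuning values from the `.env` file (an outage that begins just after a successful check needs three more checks to trip the threshold, and the final failed check can take the full timeout to resolve):

```javascript
// Worst-case detection window from the default configuration.
const intervalMs = 30_000; // CHECK_INTERVAL_MS
const threshold = 3;       // FAILURE_THRESHOLD
const timeoutMs = 5_000;   // TIMEOUT_MS

// three intervals until the third check starts, plus its timeout
const worstCaseMs = intervalMs * threshold + timeoutMs;
console.log(worstCaseMs / 1000); // 95
```

SMS delivery adds another 1.4–3.2s on top, which matches the observed totals.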
── Edge Cases ──
01. Server flapping (up/down faster than check interval). If the server crashes and recovers within the 30-second window between checks, the monitor never sees it. No alert fires. This is a known trade-off of polling-based monitoring. Reducing the interval helps, but increases load and false-positive sensitivity.

02. DNS resolution failure vs server down. To the monitor, `ENOTFOUND` and a dead server look identical. If your DNS fails, you get a downtime alert for a server that's actually fine. The alert is still correct from the monitor's perspective — the service is unreachable, for whatever reason.

03. 5xx responses and partial failures. A 503 from nginx in front of a dead upstream is caught. A 200 from a health endpoint that's lying — returning OK while the database is down — is not. Your `/health` endpoint needs to actually check its dependencies to be meaningful.

04. International numbers. Tested only against a Swedish number (`+46`). 46elks supports international delivery, but latency varies by carrier and country. Allow an extra 2–8 seconds for non-Swedish recipients.

05. Monitor process crash. If pm2 fails to restart the monitor, you go blind silently. There's no self-monitoring here. Addressed in Experiment #002.
── Honest Limitations ──
- No acknowledgment mechanism. When the SMS arrives, there's no way to reply and signal "I'm handling this." That requires inbound SMS and webhook handling — a different experiment.
- Polling is not event-driven. The 90-second worst-case detection window is a direct consequence of polling. A dead server can't announce itself. This is an inherent constraint, not an implementation flaw.
- SMS delivery is best-effort by protocol. 46elks has strong delivery rates, but SMS is not a guaranteed delivery protocol at the network level. Not suitable as a sole alerting channel for life-critical systems.
- Single point of failure on the monitor host. If the monitoring VPS goes down, all alerts stop. You need either a second independent monitor or a periodic heartbeat SMS to catch this.
- No escalation path. If the first SMS goes unread, nothing follows up except the 10-minute cooldown reminder. Escalation to a voice call is the subject of Experiment #002.
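The heartbeat idea from the single-point-of-failure limitation is a few lines if bolted onto the existing loop. A sketch of a hypothetical addition (not part of this experiment; `heartbeatDue` is an invented helper, and the SMS call reuses `sendSms()` from monitor.js):

```javascript
// Daily heartbeat: an SMS that proves the monitor itself is alive.
// Silence past the expected time becomes the failure signal.
const HEARTBEAT_MS = 24 * 60 * 60 * 1000; // once a day
let lastHeartbeatAt = 0;

function heartbeatDue(now, last, intervalMs = HEARTBEAT_MS) {
  return now - last >= intervalMs;
}

// called at the end of the existing check() loop:
// if (heartbeatDue(Date.now(), lastHeartbeatAt)) {
//   await sendSms('[HEARTBEAT] monitor alive').catch(console.error);
//   lastHeartbeatAt = Date.now();
// }
```

This inverts the detection problem: instead of the monitor noticing a dead server, you notice a missing message, which is weaker but better than nothing.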
── Lessons Learned ──
— The 46elks API is genuinely simple. One authenticated POST with three form fields. No SDK, no wrapper, no client library needed. The entire SMS-sending logic is 25 lines of Node with no npm dependencies beyond dotenv. When an API requires no library, that's a sign the abstraction is at the right level.

— Hysteresis is more important than the alerting mechanism. Without the failure threshold, dozens of false-positive alerts would have arrived in the first 24 hours. The 3-failure threshold was the most important design decision in the whole system — not the API call.

— Cooldown logic is easy to forget and critical to include. During an extended outage, a naive implementation sends an SMS every 30 seconds. One reminder every 10 minutes is enough to stay informed without creating noise.

— Recovery alerts are as valuable as down alerts. Without them, you're left manually checking whether the server came back. Closing the loop automatically frees mental overhead during an incident.

— Run the monitor on different infrastructure. If the monitor lives on the server it monitors, you lose both simultaneously and get no alert at all.