SMS-Based Server Downtime
Alert System
using the 46elks API · one npm dependency · zero platforms
── Problem ──
Every production system goes down eventually. The question is whether you find out in thirty seconds or thirty minutes.
I had a small VPS running a few personal services — nothing critical, but things I wanted to stay up. I was already paying for uptime monitoring via a third-party dashboard, but the alert channel was email. Email requires me to have a client open. It requires me to be at a desk. For a 3am incident, it's effectively silent.
PagerDuty and similar platforms solve this, but they introduce a managed layer between my system and my phone. I don't want a platform. I want an SMS that fires the moment a threshold is crossed.
The 46elks SMS API is a single authenticated POST request. The question was: can I build a reliable, low-noise downtime alerting system on top of it in an afternoon, with no external dependencies beyond the API itself?
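That single POST is small enough to sketch up front. A minimal version with Node 18+'s built-in `fetch`, assuming credentials in environment variables (`buildSmsRequest` is a hypothetical helper name, not part of the 46elks API; the endpoint, auth scheme, and `from`/`to`/`message` fields are the real API surface):

```javascript
// Build the one request the 46elks SMS API needs: a Basic-auth POST
// with three form-encoded fields. Split out so the shape is visible.
function buildSmsRequest(user, password, to, message) {
  return {
    url: 'https://api.46elks.com/a1/sms',
    options: {
      method: 'POST',
      headers: {
        Authorization:
          'Basic ' + Buffer.from(`${user}:${password}`).toString('base64'),
        'Content-Type': 'application/x-www-form-urlencoded',
      },
      body: new URLSearchParams({ from: 'Monitor', to, message }).toString(),
    },
  };
}

// Fire the request; credentials are placeholders read from the environment.
async function sendSms(to, message) {
  const { url, options } = buildSmsRequest(
    process.env.ELKS_API_USER,
    process.env.ELKS_API_PASSWORD,
    to,
    message,
  );
  const res = await fetch(url, options);
  return res.json();
}
```

That is the entire integration: no SDK, no token refresh, no webhook registration.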
── Hypothesis ──
A Node.js health check loop with a consecutive-failure threshold, combined with a direct call to the 46elks SMS API, can deliver a downtime alert to my phone within 90–120 seconds of a real outage — with a low enough false-positive rate to be trustworthy.
The threshold matters. A single failed check shouldn't send an alert. Three consecutive failures should.
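The threshold behaves like a small state machine: alert only on the transition into the "down" state. A sketch of that logic in isolation (`makeFailureGate` is a hypothetical helper, not code from the monitor itself):

```javascript
// Consecutive-failure gate: returns true exactly once, on the check
// that crosses the threshold while the service was considered up.
function makeFailureGate(threshold = 3) {
  let failures = 0;
  let down = false;
  return function record(ok) {
    if (ok) {
      failures = 0;
      down = false;
      return false; // healthy, or just recovered
    }
    failures++;
    if (failures >= threshold && !down) {
      down = true;
      return true; // third consecutive failure: fire the alert
    }
    return false; // transient blip absorbed, or already alerted
  };
}
```

A single timeout never fires; three in a row fire exactly one alert.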
── Architecture ──
┌───────────────────────────────────────────────────┐
│ Monitoring Process (Node.js, separate VPS) │
│ │
│ setInterval(30s) │
│ │ │
│ ▼ │
│ HTTP GET → target/health │
│ │ │
│ ├── 200 OK ──────────────────► reset state │
│ │ │
│ └── timeout / 5xx / error │
│ │ │
│ ▼ │
│ consecutiveFailures++ │
│ │ │
│ ┌─────────┴─────────┐ │
│ │ failures < 3 │ failures >= 3 │
│ │ do nothing │ AND isDown === false │
│ └───────────────────┘ │ │
│ ▼ │
│ POST api.46elks.com/a1/sms │
│ │ │
│ isDown = true │
│ downSince = Date.now() │
└───────────────────────────────────────────────────┘
│
▼
46elks SMS gateway
│
▼
+46 XXX XXX XXX (my phone)
SMS delivered ~1.5–3s
Recovery path:
Next successful check after isDown === true
│
▼
POST /a1/sms — "[RECOVERED] back online after N min"
isDown = false, consecutiveFailures = 0
| Component | Choice | Reason |
|---|---|---|
| Runtime | Node.js 20 LTS | No dependency overhead, built-in HTTPS |
| HTTP client | Node built-in `https` | Zero dependencies — the API doesn't need a library |
| Scheduler | `setInterval` | No cron daemon needed, restarts cleanly |
| Config | dotenv | Only external dependency. No credentials in code. |
| Alert channel | 46elks SMS API | Direct REST, basic auth, no platform |
| Monitor host | Separate VPS | Monitor must survive what it monitors |
| Process manager | pm2 | Auto-restart, log rotation, startup on boot |
── Implementation ──
- Provision the monitor host on separate infrastructure. Running the monitor on the same server as the monitored service is the first mistake. I used a $4/mo VPS from a different provider. If the primary host dies, the monitor survives.
- Define a `/health` endpoint on the target server. Don't ping the root path. A `/health` endpoint returns a deliberate 200 with no side effects. Mine returns `{"status":"ok","ts":1736330400}`. If the endpoint is slow or broken, that's a real signal.
- Write the check loop with hysteresis. The failure threshold absorbs transient blips — a DNS hiccup, a brief network stall. Three consecutive failures over 90 seconds means something is genuinely wrong. Without this, false positives make the alert untrustworthy within 24 hours.
- Implement a cooldown on repeat alerts. If the server stays down, you don't want an SMS every 30 seconds. A 10-minute cooldown sends a "still down" reminder after sustained outages without flooding your phone.
- Always send a recovery alert. The down alert opens a loop. The recovery alert closes it. Without it you're left manually checking whether the server came back, or whether your intervention actually worked.
- Run with pm2: `pm2 start monitor.js --name monitor`, then `pm2 save && pm2 startup`. Auto-restart on crash, log rotation included, survives server reboots.
── Code ──
```js
// monitor.js — adham46elks.com/experiments/001
require('dotenv').config();
const https = require('https');
const { URL } = require('url');

const CONFIG = {
  targetUrl: process.env.TARGET_URL,
  alertPhone: process.env.ALERT_PHONE,
  elksUser: process.env.ELKS_API_USER,
  elksPassword: process.env.ELKS_API_PASSWORD,
  fromName: process.env.ALERT_FROM || 'Monitor',
  checkIntervalMs: Number(process.env.CHECK_INTERVAL_MS) || 30_000,
  failureThreshold: Number(process.env.FAILURE_THRESHOLD) || 3,
  timeoutMs: Number(process.env.TIMEOUT_MS) || 5_000,
  cooldownMs: Number(process.env.COOLDOWN_MS) || 10 * 60_000,
};

let consecutiveFailures = 0;
let isDown = false;
let downSince = null;
let lastAlertAt = 0;

// ── HTTP check ──────────────────────────────────────────────
function httpCheck(targetUrl, timeoutMs) {
  return new Promise((resolve, reject) => {
    const parsed = new URL(targetUrl);
    const lib = parsed.protocol === 'https:' ? https : require('http');
    const req = lib.get(targetUrl, (res) => {
      resolve(res.statusCode);
      res.resume(); // drain to free socket
    });
    const timer = setTimeout(() => {
      req.destroy(new Error('timeout'));
    }, timeoutMs);
    req.on('error', (err) => {
      clearTimeout(timer);
      reject(err);
    });
    req.on('close', () => clearTimeout(timer));
  });
}

// ── 46elks SMS ──────────────────────────────────────────────
function sendSms(message) {
  return new Promise((resolve, reject) => {
    const body = new URLSearchParams({
      from: CONFIG.fromName,
      to: CONFIG.alertPhone,
      message,
    }).toString();
    const auth = Buffer
      .from(`${CONFIG.elksUser}:${CONFIG.elksPassword}`)
      .toString('base64');
    const req = https.request({
      hostname: 'api.46elks.com',
      path: '/a1/sms',
      method: 'POST',
      headers: {
        'Authorization': `Basic ${auth}`,
        'Content-Type': 'application/x-www-form-urlencoded',
        'Content-Length': Buffer.byteLength(body),
      },
    }, (res) => {
      let raw = '';
      res.on('data', (chunk) => { raw += chunk; });
      res.on('end', () => resolve(JSON.parse(raw)));
    });
    req.on('error', reject);
    req.write(body);
    req.end();
  });
}

// ── Main check loop ─────────────────────────────────────────
async function check() {
  const ts = new Date().toISOString();
  let statusCode;
  let ok = false;
  try {
    statusCode = await httpCheck(CONFIG.targetUrl, CONFIG.timeoutMs);
    ok = statusCode >= 200 && statusCode < 400;
  } catch (err) {
    statusCode = err.message;
  }

  if (!ok) {
    consecutiveFailures++;
    console.log(`[${ts}] FAIL — ${statusCode} (${consecutiveFailures}/${CONFIG.failureThreshold})`);
    const shouldAlert = consecutiveFailures >= CONFIG.failureThreshold;
    const cooldownExpired = Date.now() - lastAlertAt > CONFIG.cooldownMs;

    if (shouldAlert && !isDown) {
      isDown = true;
      downSince = new Date();
      const msg = `[DOWN] ${CONFIG.targetUrl} is unreachable. Detected ${ts}.`;
      console.log('→ Sending alert:', msg);
      await sendSms(msg).catch(console.error);
      lastAlertAt = Date.now();
    } else if (isDown && cooldownExpired) {
      const mins = Math.round((Date.now() - downSince.getTime()) / 60_000);
      const msg = `[STILL DOWN] ${CONFIG.targetUrl} unreachable for ${mins} min.`;
      console.log('→ Sending reminder:', msg);
      await sendSms(msg).catch(console.error);
      lastAlertAt = Date.now();
    }
  } else {
    if (isDown) {
      const mins = Math.round((Date.now() - downSince.getTime()) / 60_000);
      const msg = `[RECOVERED] ${CONFIG.targetUrl} is back after ${mins} min.`;
      console.log('→ Sending recovery:', msg);
      await sendSms(msg).catch(console.error);
    }
    consecutiveFailures = 0;
    isDown = false;
    downSince = null;
    console.log(`[${ts}] OK — ${statusCode}`);
  }
}

// ── Start ───────────────────────────────────────────────────
console.log(`Monitor starting. Target: ${CONFIG.targetUrl}`);
console.log(`Interval: ${CONFIG.checkIntervalMs / 1000}s | Threshold: ${CONFIG.failureThreshold} failures`);
check();
setInterval(check, CONFIG.checkIntervalMs);
```
`.env`:

```
# Target
TARGET_URL=https://yourserver.com/health
ALERT_PHONE=+46700000000

# 46elks credentials — dashboard.46elks.com
ELKS_API_USER=your_api_user
ELKS_API_PASSWORD=your_api_password

# Tuning
ALERT_FROM=Monitor
CHECK_INTERVAL_MS=30000
FAILURE_THRESHOLD=3
TIMEOUT_MS=5000
COOLDOWN_MS=600000
```
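One guard worth adding near the top of monitor.js: fail fast when a required variable is missing, rather than discovering at alert time that the credentials were never loaded. A sketch (`validateConfig` is a hypothetical helper; the variable names match the `.env` file above):

```javascript
// Fail fast on missing configuration. Called once at startup with
// process.env, before the first check runs.
function validateConfig(env) {
  const required = [
    'TARGET_URL',
    'ALERT_PHONE',
    'ELKS_API_USER',
    'ELKS_API_PASSWORD',
  ];
  const missing = required.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing required env vars: ${missing.join(', ')}`);
  }
}

// usage at startup:
// validateConfig(process.env);
```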
── Results ──
Tested over 14 days continuous operation · 2 real incidents captured
| Measurement | Observed |
|---|---|
| Threshold breach → API call | < 100ms |
| 46elks API response time | 180–320ms |
| SMS delivery to handset (Swedish number) | 1.4–3.2s |
| Total: outage detected → SMS received | ~92–95s worst case |
| False positives over 14 days | Zero |
| Missed real alerts | Zero |
| npm dependencies | 1 (dotenv) |
── INCIDENT LOG ──
Incident 1 — 2026-01-11 03:17 UTC
Cause: VPS OOM kill, nginx process died
Detected: 03:18:47 (91s after nginx stopped responding)
Alert SMS received: 03:18:49
Resolved: 03:24:12 (manual: pm2 restart nginx)
Recovery SMS received: 03:24:25
Total downtime: 6m 25s

Incident 2 — 2026-01-19 14:02 UTC
Cause: Brief network partition on host provider
Duration: ~45 seconds
Result: Never hit failure threshold — resolved between checks
Alert: none sent (correct behaviour)

The 46elks API leg is not the bottleneck. The bottleneck is the intentional 90-second detection window — 3 checks at 30-second intervals. That's a design decision, not an API limitation. The API call itself completes in under 500ms.
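The ~92–95s worst-case figure falls out of the configuration arithmetic. A quick check, using the default tuning values from the `.env` file (an outage that begins just after a successful check needs three more checks to trip the threshold, and the final failed check can take the full timeout to resolve):

```javascript
// Worst-case detection window from the default configuration.
const intervalMs = 30_000; // CHECK_INTERVAL_MS
const threshold = 3;       // FAILURE_THRESHOLD
const timeoutMs = 5_000;   // TIMEOUT_MS

// three intervals until the third check starts, plus its timeout
const worstCaseMs = intervalMs * threshold + timeoutMs;
console.log(worstCaseMs / 1000); // 95
```

SMS delivery adds another 1.4–3.2s on top, which matches the observed totals.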
── Edge Cases ──
01. Server flapping (up/down faster than check interval). If the server crashes and recovers within the 30-second window between checks, the monitor never sees it. No alert fires. This is a known trade-off of polling-based monitoring. Reducing the interval helps, but increases load and false-positive sensitivity.

02. DNS resolution failure vs server down. To the monitor, `ENOTFOUND` and a dead server look identical. If your DNS fails, you get a downtime alert for a server that's actually fine. The alert is still correct from the monitor's perspective — the service is unreachable, for whatever reason.

03. 5xx responses and partial failures. A 503 from nginx in front of a dead upstream is caught. A 200 from a health endpoint that's lying — returning OK while the database is down — is not. Your `/health` endpoint needs to actually check its dependencies to be meaningful.

04. International numbers. Tested only against a Swedish number (`+46`). 46elks supports international delivery, but latency varies by carrier and country. Allow an extra 2–8 seconds for non-Swedish recipients.

05. Monitor process crash. If pm2 fails to restart the monitor, you go blind silently. There's no self-monitoring here. Addressed in Experiment #002.
── Honest Limitations ──
- No acknowledgment mechanism. When the SMS arrives, there's no way to reply and signal "I'm handling this." That requires inbound SMS and webhook handling — a different experiment.
- Polling is not event-driven. The 90-second worst-case detection window is a direct consequence of polling. A dead server can't announce itself. This is an inherent constraint, not an implementation flaw.
- SMS delivery is best-effort by protocol. 46elks has strong delivery rates, but SMS is not a guaranteed delivery protocol at the network level. Not suitable as a sole alerting channel for life-critical systems.
- Single point of failure on the monitor host. If the monitoring VPS goes down, all alerts stop. You need either a second independent monitor or a periodic heartbeat SMS to catch this.
- No escalation path. If the first SMS goes unread, nothing follows up except the 10-minute cooldown reminder. Escalation to a voice call is the subject of Experiment #002.
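The heartbeat idea from the single-point-of-failure limitation is a few lines if bolted onto the existing loop. A sketch of a hypothetical addition (not part of this experiment; `heartbeatDue` is an invented helper, and the SMS call reuses `sendSms()` from monitor.js):

```javascript
// Daily heartbeat: an SMS that proves the monitor itself is alive.
// Silence past the expected time becomes the failure signal.
const HEARTBEAT_MS = 24 * 60 * 60 * 1000; // once a day
let lastHeartbeatAt = 0;

function heartbeatDue(now, last, intervalMs = HEARTBEAT_MS) {
  return now - last >= intervalMs;
}

// called at the end of the existing check() loop:
// if (heartbeatDue(Date.now(), lastHeartbeatAt)) {
//   await sendSms('[HEARTBEAT] monitor alive').catch(console.error);
//   lastHeartbeatAt = Date.now();
// }
```

This inverts the detection problem: instead of the monitor noticing a dead server, you notice a missing message, which is weaker but better than nothing.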
── Lessons Learned ──
— The 46elks API is genuinely simple. One authenticated POST with three form fields. No SDK, no wrapper, no client library needed. The entire SMS-sending logic is 25 lines of Node with no npm dependencies beyond dotenv. When an API requires no library, that's a sign the abstraction is at the right level.

— Hysteresis is more important than the alerting mechanism. Without the failure threshold, dozens of false-positive alerts would have arrived in the first 24 hours. The 3-failure threshold was the most important design decision in the whole system — not the API call.

— Cooldown logic is easy to forget and critical to include. During an extended outage, a naive implementation sends an SMS every 30 seconds. One reminder every 10 minutes is enough to stay informed without creating noise.

— Recovery alerts are as valuable as down alerts. Without them, you're left manually checking whether the server came back. Closing the loop automatically frees mental overhead during an incident.

— Run the monitor on different infrastructure. If the monitor lives on the server it monitors, you lose both simultaneously and get no alert at all.