Global Egress Orchestration
There is a fundamental tension in high-throughput data aggregation: datacenter compute is powerful, stable, and cheap — but datacenter IP blocks are inherently distrusted by strict external APIs that distinguish commercial traffic from genuine residential users. Residential networks carry that trust, but lack the compute and stability for heavy orchestration workloads.
I designed a system that separates these two concerns entirely. The Oracle Cloud orchestrator handles all coordination logic: task scheduling, queue management, response processing, and observability. The residential edge nodes handle only one thing: executing the actual outbound HTTP requests from a trusted IP space. The two layers communicate over a reverse WebSocket tunnel that requires zero inbound firewall rules on the residential side.
Architecture
The system has two planes: a data plane (task execution at the residential edge) and a management plane (orchestration, health, and remote access via Cloudflare Zero Trust). They are intentionally decoupled — a failure in the management plane never affects live task execution.
[ Oracle Cloud Orchestrator ]
 │ → maintains task queue in Redis
 │ → dispatches payloads over WebSocket to available nodes
 │ → Python SRE service monitors node health via heartbeat
 │
 ├──── wss://[orchestrator]:443 ◄──── [ Edge Node — Region A ]
 │                                        Residential ISP
 │                                              │
 │                                          HTTP Egress
 │                                              ▼
 │                                       [ Target API ]
 │
 └──── wss://[orchestrator]:443 ◄──── [ Edge Node — Region B ]
                                          Residential ISP
                                                │
                                            HTTP Egress
                                                ▼
                                         [ Target API ]
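To make the dispatch path concrete, here is a condensed sketch of the orchestrator's accept-and-dispatch loop. It is illustrative rather than the production code: the Redis key names and message shapes are assumptions, node selection is a naive placeholder, and TLS termination for wss:// is omitted. It assumes the websockets and redis Python packages.

import asyncio
import json
import time

import redis.asyncio as aioredis
import websockets

nodes = {}  # node_id -> live WebSocket (nodes dial in; we never dial out)
r = aioredis.Redis()

async def process_response(msg):
    ...  # response processing and task-completion bookkeeping (omitted)

async def handle_node(ws):
    # The first frame a node sends is its self-registration.
    node_id = json.loads(await ws.recv())["node_id"]
    nodes[node_id] = ws
    await r.sadd("nodes:active", node_id)
    try:
        async for raw in ws:
            msg = json.loads(raw)
            if msg.get("event") == "PING":
                # Heartbeat timestamps feed the SRE health checker.
                await r.hset("node:heartbeats", node_id, time.time())
            else:
                await process_response(msg)
    finally:
        nodes.pop(node_id, None)
        await r.srem("nodes:active", node_id)

async def dispatch_loop():
    while True:
        _, task = await r.blpop("tasks:pending")  # blocks until work exists
        if nodes:
            _, ws = next(iter(nodes.items()))  # naive node selection
            await ws.send(task)
        else:
            await r.rpush("tasks:pending", task)  # no node free: put it back
            await asyncio.sleep(1)

async def main():
    async with websockets.serve(handle_node, "0.0.0.0", 8443):
        await dispatch_loop()

asyncio.run(main())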
Friction Log — Engineering Challenges in Production
Building a distributed network across consumer-grade ISPs in multiple geographies surfaces failure modes that don't exist in a tidy datacenter. Below are the primary engineering challenges encountered in production and the pragmatic solutions implemented for each.
I initially explored deploying the edge client as a background application on consumer mobile devices — iOS and Android — to leverage their residential IP addresses without dedicated hardware. Apple's iOS is fundamentally hostile to this use case. The OS aggressively suspends WebSocket threads to preserve battery life, and App Store guidelines strictly prohibit background egress routing.
Solution
Strategic abandonment. Rather than engineering fragile, battery-draining workarounds or fighting the OS scheduler, I dropped iOS entirely. Engineering is about knowing what not to build as much as what to build. The deployment strategy pivoted exclusively to headless Alpine Linux containers and repurposed, always-plugged-in ARM hardware, which keeps sockets up continuously instead of fighting a consumer OS designed to do exactly the opposite.
What was considered
Background fetch APIs, silent push notifications to wake the process, jailbroken device sandbox escapes. All evaluated, all rejected.
Why abandonment was right
Any workaround would have required maintaining a brittle compatibility layer against every iOS update. The operational cost exceeded the benefit indefinitely.
Residential ISPs frequently reassign IP addresses and deploy Carrier-Grade NAT (CGNAT), making inbound connections to edge nodes impossible. You cannot open a port on a CGNAT address — the NAT gateway doesn't belong to you.
Solution
The architecture inverts the connection direction entirely. Edge nodes initiate an outbound WebSocket connection to the Oracle orchestrator on port 443. Because the connection originates from inside the residential network, NAT passes it through without any port-forwarding configuration. The orchestrator never connects to the nodes; it waits for nodes to connect to it, then multiplexes task payloads over the persistent socket.
Each node registers itself with a unique ID on connection and is immediately available for task assignment. The Systemd unit ensures the daemon reconnects with a 5-second backoff if the socket drops:
[Unit]
Description=Egress WebSocket Daemon
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=egress
ExecStart=/opt/egress-node/client --orchestrator wss://[orchestrator-host]
Restart=always
RestartSec=5
ReadOnlyPaths=/
ReadWritePaths=/tmp

[Install]
WantedBy=multi-user.target
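On the wire, the client's job is small. A minimal sketch of the node-side loop, assuming the websockets and aiohttp libraries (the message shapes, node ID, and orchestrator host are illustrative; this is not the production binary):

import asyncio
import json

import aiohttp
import websockets

ORCHESTRATOR = "wss://orchestrator.example.com"  # placeholder host
NODE_ID = "egress-node-region-a-01"              # illustrative ID

async def heartbeat(ws):
    # Application-layer liveness signal, independent of TCP keepalive.
    while True:
        await ws.send(json.dumps({"event": "PING", "node_id": NODE_ID}))
        await asyncio.sleep(15)

async def run():
    async with websockets.connect(ORCHESTRATOR) as ws:
        # Register on connect; the node is then eligible for task assignment.
        await ws.send(json.dumps({"event": "REGISTER", "node_id": NODE_ID}))
        ping_task = asyncio.create_task(heartbeat(ws))
        async with aiohttp.ClientSession() as http:
            async for raw in ws:
                task = json.loads(raw)
                # The only real work: egress HTTP from the residential IP.
                async with http.get(task["url"]) as resp:
                    body = await resp.text()
                await ws.send(json.dumps({
                    "task_id": task["task_id"],
                    "status": resp.status,
                    "body": body,
                }))
        ping_task.cancel()

asyncio.run(run())  # on socket drop the process exits; systemd restarts it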
Consumer ISPs drop packets silently. If a residential node loses its connection, the TCP socket on the Oracle server remains "half-open": the OS doesn't learn the peer is gone until a pending write exhausts its retransmission timer or the kernel's TCP keepalive finally fires. Either way, that takes minutes. During that window, the orchestrator assumes the node is alive, assigns it HTTP payloads, and blocks waiting for a response that will never arrive. Tasks pile up in the queue; the system grinds to a halt.
Solution
Application-layer heartbeat at 15-second intervals, independent of TCP keepalive. Each node sends a PING frame every 15 seconds, and the orchestrator records the arrival timestamp in Redis. A Python SRE service polls Redis continuously; any node whose heartbeat timestamp has not been updated within the timeout window is declared dead, immediately purged from the active pool, and all of its locked tasks are requeued to healthy nodes.
{
"event": "HEARTBEAT_TIMEOUT",
"node_id": "egress-node-region-b-04",
"last_ping_ms": 15042,
"status": "DEAD",
"action": "PURGE_AND_REQUEUE",
"recovered_tasks": [
{ "task_id": "8f9a-4b2c", "requeued_to": "egress-node-region-a-02" },
{ "task_id": "3c1d-9e7f", "requeued_to": "egress-node-region-b-01" }
],
"timestamp": "2026-05-01T14:22:10Z"
}
The requeue operation is idempotent — tasks carry a unique ID and the orchestrator checks for duplicates before processing, so a node that reconnects after a false-positive timeout cannot cause double-execution.
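A sketch of that reaper loop, under assumed Redis key names (heartbeat timestamps in a hash, each node's locked tasks in a per-node list; illustrative, not the production schema):

import time

import redis

HEARTBEAT_TIMEOUT_S = 30  # two missed 15-second pings (illustrative window)

r = redis.Redis()

def reap_dead_nodes():
    now = time.time()
    for raw_id, last_ping in r.hgetall("node:heartbeats").items():
        if now - float(last_ping) <= HEARTBEAT_TIMEOUT_S:
            continue
        node = raw_id.decode()
        # Purge from the active pool so the dispatcher stops selecting it.
        r.srem("nodes:active", node)
        r.hdel("node:heartbeats", node)
        # Requeue everything the dead node had locked. This is safe to
        # repeat: the orchestrator deduplicates on task_id before
        # processing, so a false-positive timeout cannot double-execute.
        while (task := r.rpoplpush(f"node:{node}:tasks", "tasks:pending")):
            print(f"requeued {task!r} from {node}")

while True:
    reap_dead_nodes()
    time.sleep(5)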
The edge nodes use repurposed ARM hardware running in residential environments — no server room, no active cooling, ambient room temperature with no airflow management. Running continuous asynchronous HTTP requests at full CPU clock speeds causes thermal runaway: core temperatures climb into throttling territory within minutes, triggering kernel panics and, over weeks, causing permanent hardware degradation.
Solution
Rather than relying on the kernel's default thermal governor, which reacts to high temperatures after the fact, I implemented a proactive approach: unconditionally underclock the CPU to a ceiling that keeps core temperatures stable regardless of load. A cron-executed shell script overrides the kernel's frequency governor and hard-caps the maximum clock speed. The tradeoff is roughly 20% of burst throughput, which is fully acceptable: these nodes are I/O-bound (waiting on network responses), not CPU-bound.
#!/bin/sh
# Proactive thermal management: underclock before heat builds, not after.
# Hard-cap the max frequency on every core to prevent thermal runaway
# on fanless hardware.
MAX_FREQ=1200000

for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
    if [ -f "$cpu/cpufreq/scaling_governor" ]; then
        echo "powersave" > "$cpu/cpufreq/scaling_governor"
    fi
    if [ -f "$cpu/cpufreq/scaling_max_freq" ]; then
        echo "$MAX_FREQ" > "$cpu/cpufreq/scaling_max_freq"
    fi
done

echo "[INFO] Thermal ceiling applied. Node stable for continuous operation."
The edge nodes are physically inaccessible — deployed in private residential environments across multiple geographies. SSH over a standard public IP is not possible (CGNAT blocks it). A VPN would require a persistent tunnel with its own reliability concerns and exposure. Any management approach requiring physical access is operationally unacceptable.
Solution
The management plane runs entirely through Cloudflare Zero Trust. The cloudflared daemon on each node establishes an outbound tunnel to Cloudflare's network, the same reverse-tunnel pattern used for the data plane. SSH access to any node is granted through the Zero Trust dashboard: the engineer authenticates via an identity provider, and Cloudflare proxies the SSH session inbound through the existing outbound tunnel. No exposed SSH port. No VPN. No static IP dependency.
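On the engineer's machine this reduces to a short SSH client config, assuming cloudflared is installed locally (the hostname pattern is a placeholder for however the nodes are named in Zero Trust):

# ~/.ssh/config: route node hostnames through the Cloudflare tunnel
Host *.nodes.example.com
    ProxyCommand cloudflared access ssh --hostname %h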
Data Plane (Task Execution)
Outbound WebSocket → Oracle orchestrator. Carries HTTP task payloads bidirectionally. Always-on, restart-on-failure via Systemd.
Management Plane (SSH / Config)
Outbound Cloudflare tunnel → Zero Trust SSH. On-demand. An engineer authenticates to Cloudflare, not to the node directly. The node never exposes port 22.
FinOps — The Cost Model
The orchestrator runs on Oracle Cloud's perpetually free ARM instance — a 4-core, 24GB RAM VM that costs $0/month indefinitely. Redis runs on the same instance. The residential edge nodes are repurposed hardware with zero additional cost — existing electricity and ISP connections, already paid for.
The only marginal cost is the edge nodes' power consumption — approximately 4 watts each at the underclocked frequency, equivalent to leaving an LED light on. The total infrastructure bill for a multi-region distributed egress network is effectively the electricity for a few nightlights.
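To put a number on that: 4 W of continuous draw is 4 W × 720 h ≈ 2.9 kWh per node per month, which at a typical residential rate of $0.12–0.15/kWh comes to roughly $0.35–0.45 per node.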
Engineering Takeaways
The architecture's central insight is that a distributed system doesn't require symmetric connectivity. The standard assumption — that a server needs to be able to reach its clients — can be inverted. Clients can connect to the server, and the server can use those persistent connections for bidirectional communication without ever needing to know the client's IP address or punch through any NAT. This is the same pattern used by Cloudflare Tunnels, by Tailscale's relay network, and by any modern NAT traversal system.
The second takeaway is the value of separating the data plane from the management plane. On day one, it would have been simpler to route SSH over the same WebSocket as task data. Keeping them separate meant that an outage in task execution never blocked access to the node for diagnosis, and a misconfigured management tunnel never disrupted live task traffic.