Global Egress Orchestration
There is a fundamental tension in high-throughput data aggregation: datacenter compute is powerful, stable, and cheap — but datacenter IP blocks are inherently distrusted by strict external APIs that distinguish commercial traffic from genuine residential users. Residential networks carry that trust, but lack the compute and stability for heavy orchestration workloads.
I designed a system that separates these two concerns entirely. The Oracle Cloud orchestrator handles all coordination logic: task scheduling, queue management, response processing, and observability. The residential edge nodes handle only one thing: executing the actual outbound HTTP requests from a trusted IP space. The two layers communicate over a reverse WebSocket tunnel that requires zero inbound firewall rules on the residential side.
Architecture
The system has two planes: a data plane (task execution at the residential edge) and a management plane (orchestration, health, and remote access via Cloudflare Zero Trust). They are intentionally decoupled — a failure in the management plane never affects live task execution.
[ Oracle Cloud Orchestrator ]
 │ → maintains task queue in Redis
 │ → dispatches payloads over WebSocket to available nodes
 │ → Python SRE service monitors node health via heartbeat
 │
 ├──── wss://[orchestrator]:443 ◄──── [ Edge Node — Region A ]
 │                                        Residential ISP
 │                                              │
 │                                          HTTP Egress
 │                                              ▼
 │                                       [ Target API ]
 │
 └──── wss://[orchestrator]:443 ◄──── [ Edge Node — Region B ]
                                          Residential ISP
                                                │
                                            HTTP Egress
                                                ▼
                                         [ Target API ]
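To make the dispatch path concrete, here is a condensed sketch of the orchestrator's accept-and-dispatch loop. It is illustrative rather than the production code: the Redis key names and message shapes are assumptions, node selection is a naive placeholder, and TLS termination for wss:// is omitted. It assumes the websockets and redis Python packages.

import asyncio
import json
import time

import redis.asyncio as aioredis
import websockets

nodes = {}  # node_id -> live WebSocket (nodes dial in; we never dial out)
r = aioredis.Redis()

async def process_response(msg):
    ...  # response processing and task-completion bookkeeping (omitted)

async def handle_node(ws):
    # The first frame a node sends is its self-registration.
    node_id = json.loads(await ws.recv())["node_id"]
    nodes[node_id] = ws
    await r.sadd("nodes:active", node_id)
    try:
        async for raw in ws:
            msg = json.loads(raw)
            if msg.get("event") == "PING":
                # Heartbeat timestamps feed the SRE health checker.
                await r.hset("node:heartbeats", node_id, time.time())
            else:
                await process_response(msg)
    finally:
        nodes.pop(node_id, None)
        await r.srem("nodes:active", node_id)

async def dispatch_loop():
    while True:
        _, task = await r.blpop("tasks:pending")  # blocks until work exists
        if nodes:
            _, ws = next(iter(nodes.items()))  # naive node selection
            await ws.send(task)
        else:
            await r.rpush("tasks:pending", task)  # no node free: put it back
            await asyncio.sleep(1)

async def main():
    async with websockets.serve(handle_node, "0.0.0.0", 8443):
        await dispatch_loop()

asyncio.run(main())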
Friction Log — Engineering Challenges in Production
Building a distributed network across consumer-grade ISPs in multiple geographies surfaces failure modes that don't exist in a tidy datacenter. Below are the primary engineering challenges encountered in production and the pragmatic solutions implemented for each.
I initially explored deploying the edge client as a background application on consumer mobile devices — iOS and Android — to leverage their residential IP addresses without dedicated hardware. Apple's iOS is fundamentally hostile to this use case. The OS aggressively suspends WebSocket threads to preserve battery life, and App Store guidelines strictly prohibit background egress routing.
Solution
Strategic abandonment. Rather than engineering fragile, battery-draining workarounds or fighting the OS scheduler, I dropped iOS entirely. Engineering is about knowing what not to build as much as what to build. The deployment strategy pivoted exclusively to headless Alpine Linux containers and repurposed, always-plugged-in ARM hardware, which keeps sockets up continuously instead of fighting a consumer OS designed to do exactly the opposite.
What was considered
Background fetch APIs, silent push notifications to wake the process, jailbroken device sandbox escapes. All evaluated, all rejected.
Why abandonment was right
Any workaround would have required maintaining a brittle compatibility layer against every iOS update. The operational cost exceeded the benefit indefinitely.
Residential ISPs frequently reassign IP addresses and deploy Carrier-Grade NAT (CGNAT), making inbound connections to edge nodes impossible. You cannot open a port on a CGNAT address — the NAT gateway doesn't belong to you.
Solution
The architecture inverts the connection direction entirely. Edge nodes initiate an outbound WebSocket connection to the Oracle orchestrator on port 443. Because the connection originates from inside the residential network, NAT passes it through without any port-forwarding configuration. The orchestrator never connects to the nodes; it waits for nodes to connect to it, then multiplexes task payloads over the persistent socket.
Each node registers itself with a unique ID on connection and is immediately available for task assignment. The Systemd unit ensures the daemon reconnects with a 5-second backoff if the socket drops:
[Unit]
Description=Egress WebSocket Daemon
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=egress
ExecStart=/opt/egress-node/client --orchestrator wss://[orchestrator-host]
Restart=always
RestartSec=5
ReadOnlyPaths=/
ReadWritePaths=/tmp

[Install]
WantedBy=multi-user.target
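On the wire, the client's job is small. A minimal sketch of the node-side loop, assuming the websockets and aiohttp libraries (the message shapes, node ID, and orchestrator host are illustrative; this is not the production binary):

import asyncio
import json

import aiohttp
import websockets

ORCHESTRATOR = "wss://orchestrator.example.com"  # placeholder host
NODE_ID = "egress-node-region-a-01"              # illustrative ID

async def heartbeat(ws):
    # Application-layer liveness signal, independent of TCP keepalive.
    while True:
        await ws.send(json.dumps({"event": "PING", "node_id": NODE_ID}))
        await asyncio.sleep(15)

async def run():
    async with websockets.connect(ORCHESTRATOR) as ws:
        # Register on connect; the node is then eligible for task assignment.
        await ws.send(json.dumps({"event": "REGISTER", "node_id": NODE_ID}))
        ping_task = asyncio.create_task(heartbeat(ws))
        async with aiohttp.ClientSession() as http:
            async for raw in ws:
                task = json.loads(raw)
                # The only real work: egress HTTP from the residential IP.
                async with http.get(task["url"]) as resp:
                    body = await resp.text()
                await ws.send(json.dumps({
                    "task_id": task["task_id"],
                    "status": resp.status,
                    "body": body,
                }))
        ping_task.cancel()

asyncio.run(run())  # on socket drop the process exits; systemd restarts it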
Consumer ISPs drop packets silently. If a residential node loses its connection, the TCP socket on the Oracle server remains "half-open": the OS doesn't learn the peer is gone until a pending write exhausts its retransmission timer or the kernel's TCP keepalive finally fires. Either way, that takes minutes. During that window, the orchestrator assumes the node is alive, assigns it HTTP payloads, and blocks waiting for a response that will never arrive. Tasks pile up in the queue; the system grinds to a halt.
Solution
Application-layer heartbeat at 15-second intervals, independent of TCP keepalive. Each node sends a PING frame every 15 seconds, and the orchestrator records the arrival timestamp in Redis. A Python SRE service polls Redis continuously; any node whose heartbeat timestamp has not been updated within the timeout window is declared dead, immediately purged from the active pool, and all of its locked tasks are requeued to healthy nodes.
{
"event": "HEARTBEAT_TIMEOUT",
"node_id": "egress-node-region-b-04",
"last_ping_ms": 15042,
"status": "DEAD",
"action": "PURGE_AND_REQUEUE",
"recovered_tasks": [
{ "task_id": "8f9a-4b2c", "requeued_to": "egress-node-region-a-02" },
{ "task_id": "3c1d-9e7f", "requeued_to": "egress-node-region-b-01" }
],
"timestamp": "2026-05-01T14:22:10Z"
}
The requeue operation is idempotent — tasks carry a unique ID and the orchestrator checks for duplicates before processing, so a node that reconnects after a false-positive timeout cannot cause double-execution.
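A sketch of that reaper loop, under assumed Redis key names (heartbeat timestamps in a hash, each node's locked tasks in a per-node list; illustrative, not the production schema):

import time

import redis

HEARTBEAT_TIMEOUT_S = 30  # two missed 15-second pings (illustrative window)

r = redis.Redis()

def reap_dead_nodes():
    now = time.time()
    for raw_id, last_ping in r.hgetall("node:heartbeats").items():
        if now - float(last_ping) <= HEARTBEAT_TIMEOUT_S:
            continue
        node = raw_id.decode()
        # Purge from the active pool so the dispatcher stops selecting it.
        r.srem("nodes:active", node)
        r.hdel("node:heartbeats", node)
        # Requeue everything the dead node had locked. This is safe to
        # repeat: the orchestrator deduplicates on task_id before
        # processing, so a false-positive timeout cannot double-execute.
        while (task := r.rpoplpush(f"node:{node}:tasks", "tasks:pending")):
            print(f"requeued {task!r} from {node}")

while True:
    reap_dead_nodes()
    time.sleep(5)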
The edge nodes use repurposed ARM hardware running in residential environments — no server room, no active cooling, ambient room temperature with no airflow management. Running continuous asynchronous HTTP requests at full CPU clock speeds causes thermal runaway: core temperatures climb into throttling territory within minutes, triggering kernel panics and, over weeks, causing permanent hardware degradation.
Solution
Rather than relying on the kernel's default thermal governor, which reacts to high temperatures after the fact, I implemented a proactive approach: unconditionally underclock the CPU to a ceiling that keeps core temperatures stable regardless of load. A cron-executed shell script overrides the kernel's frequency governor and hard-caps the maximum clock speed. The tradeoff is roughly 20% of burst throughput, which is fully acceptable: these nodes are I/O-bound (waiting on network responses), not CPU-bound.
#!/bin/sh
# Proactive thermal management: underclock before heat builds, not after.
# Hard-cap the max frequency on every core to prevent thermal runaway
# on fanless hardware.
MAX_FREQ=1200000

for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
    if [ -f "$cpu/cpufreq/scaling_governor" ]; then
        echo "powersave" > "$cpu/cpufreq/scaling_governor"
    fi
    if [ -f "$cpu/cpufreq/scaling_max_freq" ]; then
        echo "$MAX_FREQ" > "$cpu/cpufreq/scaling_max_freq"
    fi
done

echo "[INFO] Thermal ceiling applied. Node stable for continuous operation."
The edge nodes are physically inaccessible — deployed in private residential environments across multiple geographies. SSH over a standard public IP is not possible (CGNAT blocks it). A VPN would require a persistent tunnel with its own reliability concerns and exposure. Any management approach requiring physical access is operationally unacceptable.
Solution
The management plane runs entirely through Cloudflare Zero Trust. The cloudflared daemon on each node establishes an outbound tunnel to Cloudflare's network, the same reverse-tunnel pattern used for the data plane. SSH access to any node is granted through the Zero Trust dashboard: the engineer authenticates via an identity provider, and Cloudflare proxies the SSH session inbound through the existing outbound tunnel. No exposed SSH port. No VPN. No static IP dependency.
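On the engineer's machine this reduces to a short SSH client config, assuming cloudflared is installed locally (the hostname pattern is a placeholder for however the nodes are named in Zero Trust):

# ~/.ssh/config: route node hostnames through the Cloudflare tunnel
Host *.nodes.example.com
    ProxyCommand cloudflared access ssh --hostname %h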
Data Plane (Task Execution)
Outbound WebSocket → Oracle orchestrator. Carries HTTP task payloads bidirectionally. Always-on, restart-on-failure via Systemd.
Management Plane (SSH / Config)
Outbound Cloudflare tunnel → Zero Trust SSH. On-demand. An engineer authenticates to Cloudflare, not to the node directly. The node never exposes port 22.
FinOps — The Cost Model
The orchestrator runs on Oracle Cloud's perpetually free ARM instance — a 4-core, 24GB RAM VM that costs $0/month indefinitely. Redis runs on the same instance. The residential edge nodes are repurposed hardware with zero additional cost — existing electricity and ISP connections, already paid for.
The only marginal cost is the edge nodes' power consumption — approximately 4 watts each at the underclocked frequency, equivalent to leaving an LED light on. The total infrastructure bill for a multi-region distributed egress network is effectively the electricity for a few nightlights.
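To put a number on that: 4 W of continuous draw is 4 W × 720 h ≈ 2.9 kWh per node per month, which at a typical residential rate of $0.12–0.15/kWh comes to roughly $0.35–0.45 per node.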
Engineering Takeaways
The architecture's central insight is that a distributed system doesn't require symmetric connectivity. The standard assumption — that a server needs to be able to reach its clients — can be inverted. Clients can connect to the server, and the server can use those persistent connections for bidirectional communication without ever needing to know the client's IP address or punch through any NAT. This is the same pattern used by Cloudflare Tunnels, by Tailscale's relay network, and by any modern NAT traversal system.
The second takeaway is the value of separating the data plane from the management plane. On day one, it would have been simpler to route SSH over the same WebSocket as task data. Keeping them separate meant that an outage in task execution never blocked access to the node for diagnosis, and a misconfigured management tunnel never disrupted live task traffic.