
Self-Healing SRE Trading Platform

Architecture: Hybrid Cloud (Proxmox + GKE) | Role: SRE / Platform Engineering | Pattern: Incident-Controlled Automation
Kubernetes (K3s / GKE) · Apache Kafka · MySQL GTID Replication · Terraform · Ansible · Prometheus & Grafana · HashiCorp Vault · HAProxy / NGINX · Chaos Engineering · Python / Bash

A financial trading system is arguably the most unforgiving workload for an SRE: stale data costs money, downtime costs more, and an unsafe automated action at the wrong moment can cause more damage than the incident itself. I built this platform not just to handle failure — but to handle failure correctly.

This project is a full-stack SRE demonstration: a production-grade, self-healing system with automated failover, real-time observability, error budget tracking, chaos validation, and a safety-governed incident control plane.

~10s Automated DB Failover
~70% MTTR Reduction
99.9% SLO Target
43 min Monthly Error Budget
0 Unsafe Promotions
3-tier Cluster Architecture

Architecture Overview

The system is built as a hybrid cloud, three-tier architecture with a strict separation of concerns between the traffic layer, application layer, and data layer.

Traffic Layer
[ Internet / Load Balancer ]
     │
  [ NGINX Ingress ]──[ HAProxy (TCP L4) ]
Application Layer — Kubernetes (K3s/GKE)
[ Trade Ingestion API ] ──► [ Kafka Topic: trades.raw ]
                                 │
                ┌────────────────┴────────────────┐
   [ Risk Engine Consumer ]             [ Order Book Consumer ]
Data Layer
[ MySQL Primary ] ──(GTID replication)──► [ MySQL Replica × 2 ]
                  │
    [ Incident Control Plane ]──[ HashiCorp Vault ]
Observability
[ Prometheus ] ──► [ Grafana Dashboards ] ──► [ Alertmanager ]

The Database Replication Layer

Trading systems write hard financial data. The database layer uses MySQL GTID (Global Transaction ID) based replication — every transaction carries a globally unique ID that survives a failover, making it impossible for a newly promoted replica to re-apply already-committed transactions.

Under normal operation, the primary handles all writes and the replicas serve read traffic. HAProxy performs TCP-level health checks every two seconds. When the primary becomes unresponsive, the incident control plane activates the automated failover sequence.

Why GTID over Traditional Binlog Position?

Classic replication tracks position by a binary log file name and offset. If two replicas have slightly different offsets at the moment of failure, promoting the wrong one can cause data divergence — a split-brain scenario. GTID removes this entirely: the control plane can identify which replica is most current by comparing executed GTID sets and safely promote it without guesswork.
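To make the election step concrete, here is a minimal sketch of comparing executed GTID sets using MySQL's own GTID_SUBSET() function so no range parsing happens client-side. It assumes mysql-connector-python; the hostnames and the failover user are placeholders, not the platform's actual configuration.

# Sketch: pick the most-advanced replica by comparing executed GTID sets.
# Hostnames and credentials are placeholders; GTID_SUBSET() does the set math server-side.
import mysql.connector

REPLICAS = ["replica-1.db.internal", "replica-2.db.internal"]

def query_one(host: str, sql: str, params=()) -> tuple:
    conn = mysql.connector.connect(host=host, user="failover", password="<from-vault>")
    try:
        cur = conn.cursor()
        cur.execute(sql, params)
        return cur.fetchone()
    finally:
        conn.close()

def executed_gtid_set(host: str) -> str:
    return query_one(host, "SELECT @@GLOBAL.gtid_executed")[0]

def elect_candidate() -> str:
    # The winner's executed set must contain every other replica's executed set.
    sets = {h: executed_gtid_set(h) for h in REPLICAS}
    for candidate, candidate_set in sets.items():
        if all(query_one(candidate, "SELECT GTID_SUBSET(%s, %s)", (other_set, candidate_set))[0]
               for other, other_set in sets.items() if other != candidate):
            return candidate
    raise RuntimeError("No replica dominates the others; stop and page a human")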

The Incident Control Plane

Fully autonomous remediation is seductive but dangerous. A system that blindly promotes a replica under an ambiguous network partition may promote the wrong node, causing data loss. I built a rule-governed incident control plane that separates what the system detects automatically from what it is allowed to do automatically.

Safety Rule: An automated promotion is only permitted when the control plane can verify that the candidate replica's executed GTID set contains the primary's last known executed set (so no transaction the primary is known to have committed would be lost), AND that the primary has been unreachable for a sustained period exceeding the heartbeat threshold — not just a momentary network blip.

If any safety condition is unmet, the control plane halts automated remediation, fires a PagerDuty-style alert, and presents the on-call engineer with a pre-validated runbook rather than leaving them to diagnose from scratch.
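A minimal sketch of that gate; the timeout value, the context object, and the GTID-containment helper are illustrative assumptions rather than the actual control-plane code.

# Illustrative promotion gate: automation proceeds only when every safety condition holds.
import time

HEARTBEAT_TIMEOUT_S = 6  # assumed sustained-failure threshold, not a single missed probe

def may_auto_promote(candidate_gtid, primary_last_known_gtid, primary_last_seen_ts, gtid_contains):
    # Condition 1: the candidate already holds everything the primary is known to have committed.
    if not gtid_contains(primary_last_known_gtid, candidate_gtid):
        return False
    # Condition 2: the outage is sustained, not a momentary network blip.
    if time.time() - primary_last_seen_ts < HEARTBEAT_TIMEOUT_S:
        return False
    return True

def remediate(ctx):
    if may_auto_promote(ctx.candidate_gtid, ctx.primary_gtid, ctx.primary_last_seen, ctx.gtid_contains):
        ctx.promote(ctx.candidate)                          # automated path
    else:
        ctx.page_oncall(runbook="mysql-failover-manual")    # human judgment path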

T+0s — Primary Failure Detected
HAProxy health check fails. Prometheus fires DBPrimaryDown alert.
T+2s — Sustained Failure Confirmed
Control plane validates: heartbeat timeout exceeded. Network partition rules evaluated.
T+4s — Replica Election
GTID sets compared. Most-advanced replica selected. Safety conditions checked.
T+8s — Promotion Executed
Replica promoted to primary. HAProxy backend reconfigured via API. DNS TTL propagates.
T+10s — Traffic Restored
Write traffic flowing to new primary. Kafka consumers reconnect. Alert resolved.
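The T+8s reconfiguration step can be driven through HAProxy's runtime API. A sketch of that call over the admin socket follows; the socket path and the mysql_write/primary backend naming are assumptions about the local setup.

# Sketch: repoint the HAProxy write backend at the newly promoted primary via the runtime API.
import socket

HAPROXY_ADMIN_SOCKET = "/var/run/haproxy/admin.sock"  # placeholder path

def haproxy_cmd(command: str) -> str:
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(HAPROXY_ADMIN_SOCKET)
        s.sendall(command.encode() + b"\n")
        return s.recv(4096).decode()

def repoint_writes(new_primary_ip: str) -> None:
    haproxy_cmd("set server mysql_write/primary state maint")                       # drain the dead primary
    haproxy_cmd(f"set server mysql_write/primary addr {new_primary_ip} port 3306")  # re-address the slot
    haproxy_cmd("set server mysql_write/primary state ready")                       # admit write traffic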

Event-Driven Trade Processing with Kafka

The application layer is decoupled via Apache Kafka. The Trade Ingestion API writes raw trade events to a Kafka topic. Downstream consumers — the Risk Engine and Order Book — process events independently. This means the ingestion path stays available even when a consumer is slow or down, each consumer scales and fails on its own, and events can be replayed from the topic during recovery.
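A minimal Risk Engine consumer sketch, assuming kafka-python; the broker address is a placeholder and evaluate_risk stands in for the real domain logic.

# Sketch: Risk Engine consumer on trades.raw with manual commits for at-least-once processing.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "trades.raw",
    bootstrap_servers=["kafka.internal:9092"],  # placeholder broker
    group_id="risk-engine",                     # separate group: offsets independent of the Order Book
    enable_auto_commit=False,                   # commit only after the trade is fully evaluated
    value_deserializer=lambda v: json.loads(v),
)

for msg in consumer:
    evaluate_risk(msg.value)   # hypothetical domain logic
    consumer.commit()          # crash before this line and the event is simply replayed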

Observability Stack

Every layer emits metrics scraped by Prometheus on a 15-second interval. Grafana dashboards are structured around the Four Golden Signals: latency, traffic, errors, and saturation. Alerting thresholds are derived from the error budget, not arbitrary numbers.
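The arithmetic behind the 43-minute budget and a burn-rate-derived threshold, as a small worked example; the 14.4x fast-burn factor is a common SRE convention used here for illustration, not a value lifted from this platform's alert rules.

# Error budget arithmetic for a 99.9% monthly SLO.
SLO = 0.999
MINUTES_PER_MONTH = 30 * 24 * 60                      # 43,200

error_budget_min = (1 - SLO) * MINUTES_PER_MONTH
print(error_budget_min)                               # ~43.2 minutes of allowed unavailability

# Fast-burn alerting example: a 14.4x burn rate sustained for one hour
# consumes about 2% of the monthly budget, which is worth a page.
burn_rate = 14.4
budget_consumed_1h = burn_rate * (1 - SLO) * 60 / error_budget_min
print(f"{budget_consumed_1h:.1%}")                    # ~2.0%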

SLO tracking — measured availability against target, by service:

Trade API: 99.97%
Risk Engine: 99.91%
DB Write Path: 99.9%
Kafka: 99.99%

Infrastructure as Code: Terraform + Ansible

No component of this platform is configured by hand. Terraform provisions the cloud resources (GKE node pools, VPC networking, firewall rules, load balancers) while Ansible handles in-node configuration: MySQL replication setup, Kafka broker tuning, Vault agent injection, and HAProxy backend templating.

Immutability principle: Any configuration change goes through code review. Running terraform plan before apply generates a diff visible in CI. A node that drifts from its Ansible playbook is treated as a misconfigured node, not a special snowflake.
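One way to enforce that in CI is a drift gate around terraform plan. A sketch follows; the pipeline wiring is assumed, while the -detailed-exitcode convention (0 = clean, 2 = pending changes) is Terraform's own.

# Sketch: fail the pipeline if the live infrastructure has drifted from the code.
import subprocess
import sys

result = subprocess.run(
    ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
    capture_output=True, text=True,
)
if result.returncode == 0:
    print("No drift: live infrastructure matches code.")
elif result.returncode == 2:
    print(result.stdout)
    sys.exit("Drift or pending changes detected: review the diff and apply through code review.")
else:
    sys.exit(f"terraform plan failed:\n{result.stderr}")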

Chaos Engineering & Validation

An SRE platform that has never been tested under failure is just a theory. I implemented a chaos engineering suite that deliberately injects failure conditions to validate that every alert fires, every automated action executes correctly, and every runbook step actually works, before a real incident puts them to the test.
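A representative experiment, sketched in Python: kill the MySQL primary and assert that the alert fires and a new primary serves writes within the recovery budget. The kubectl label selector, Prometheus URL, and metric names are illustrative assumptions.

# Chaos experiment sketch: primary-kill with automated verification against Prometheus.
import subprocess, time, requests

PROM = "http://prometheus.internal:9090"  # placeholder

def prom_query(expr: str) -> list:
    r = requests.get(f"{PROM}/api/v1/query", params={"query": expr}, timeout=5)
    r.raise_for_status()
    return r.json()["data"]["result"]

def primary_kill_experiment(max_recovery_s: int = 30) -> None:
    # 1. Inject the failure and start the clock.
    subprocess.run(["kubectl", "delete", "pod", "-l", "role=mysql-primary", "--wait=false"], check=True)
    start = time.time()

    # 2. The DBPrimaryDown alert must fire.
    while not prom_query('ALERTS{alertname="DBPrimaryDown",alertstate="firing"}'):
        assert time.time() - start < max_recovery_s, "DBPrimaryDown never fired"
        time.sleep(2)

    # 3. A new primary must be serving writes inside the recovery budget.
    while not prom_query('mysql_global_status_threads_running{role="primary"} > 0'):
        assert time.time() - start < max_recovery_s, "No new primary within the budget"
        time.sleep(2)
    print(f"Recovered in {time.time() - start:.1f}s")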

Secrets Management with HashiCorp Vault

All database credentials, Kafka SASL passwords, and API keys are injected at runtime via HashiCorp Vault's Kubernetes Auth method. The Vault Agent sidecar reads a service account token, authenticates to Vault, fetches dynamic secrets, and writes them to a shared in-memory volume. The application never touches a static credential file.

Dynamic database credentials from Vault's database secrets engine are rotated every hour. Even if a credential leaks, its lifetime is bounded.
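On the application side, picking up those rotations can be as simple as watching the Agent-rendered file. A sketch follows; the mount path and JSON layout are assumptions about the Agent template, not Vault defaults.

# Sketch: reload DB credentials whenever the Vault Agent sidecar rewrites the shared file.
import json, os, time

CREDS_PATH = "/vault/secrets/db-creds.json"   # assumed in-memory volume mount

def load_creds() -> dict:
    with open(CREDS_PATH) as f:
        return json.load(f)                   # e.g. {"username": "...", "password": "..."}

def connection_loop(connect):
    """connect() is a caller-supplied factory; rebuild the connection on every rotation."""
    last_mtime = 0.0
    conn = None                               # live handle the application uses
    while True:
        mtime = os.stat(CREDS_PATH).st_mtime
        if mtime != last_mtime:               # Vault rotated the dynamic credential
            creds = load_creds()
            conn = connect(creds["username"], creds["password"])
            last_mtime = mtime
        time.sleep(5)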

Engineering Takeaways

The most important engineering decision in this project wasn't a technology choice — it was the choice of where not to automate. Full automation in a financial system creates systemic risk: an automated action that fires on a false positive can be worse than the original incident. The incident control plane's safety rules encode the on-call engineer's institutional knowledge into the system, making automation trustworthy rather than reckless.

Result: ~70% reduction in MTTR across simulated incidents. The on-call engineer transitions from firefighter to supervisor — the system handles detection, diagnosis, and safe remediation; the human handles judgment calls and post-incident review.