Self-Healing SRE Trading Platform
A financial trading system is arguably the most unforgiving workload for an SRE: stale data costs money, downtime costs more, and an unsafe automated action at the wrong moment can cause more damage than the incident itself. I built this platform not just to handle failure — but to handle failure correctly.
This project is a full-stack SRE demonstration: a production-grade, self-healing system with automated failover, real-time observability, error budget tracking, chaos validation, and a safety-governed incident control plane.
Architecture Overview
The system is built as a hybrid cloud, three-tier architecture with a strict separation of concerns between the traffic layer, application layer, and data layer.
                     [ NGINX Ingress ]
                             │
                   [ Trade Ingestion API ]
                             │
                      [ Apache Kafka ]
                       │            │
     [ Risk Engine Consumer ]   [ Order Book Consumer ]

                    [ HAProxy (TCP L4) ]
                             │
      [ MySQL Primary ]──────[ MySQL Replicas (GTID) ]

     [ Incident Control Plane ]──[ HashiCorp Vault ]
The Database Replication Layer
Trading systems write hard financial data. The database layer uses MySQL GTID (Global Transaction ID)-based replication: every transaction carries a globally unique ID that survives a failover, making it impossible for a newly promoted replica to re-apply already-committed transactions.
Under normal operation, the primary handles all writes and the replicas serve read traffic. HAProxy performs TCP-level health checks every two seconds. When the primary becomes unresponsive, the incident control plane activates the automated failover sequence.
Why GTID over Traditional Binlog Position?
Classic replication tracks position by a binary log file name and offset. If two replicas have slightly different offsets at the moment of failure, promoting the wrong one can cause data divergence — a split-brain scenario. GTID removes this entirely: the control plane can identify which replica is most current by comparing executed GTID sets and safely promote it without guesswork.
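To make that promotion decision concrete, here's a minimal sketch of how a control plane can compare executed GTID sets and pick the replica that has already applied everything the others have. The hostnames, credentials, and the PyMySQL client are illustrative assumptions; the comparison itself leans on MySQL's @@GLOBAL.gtid_executed variable and the built-in GTID_SUBSET() function.

```python
# Sketch: choose the most-current replica by comparing executed GTID sets.
# Hostnames and credentials below are placeholders, not the platform's real values.
import pymysql

REPLICAS = ["replica-1.db.internal", "replica-2.db.internal"]
DB_ARGS = {"user": "failover", "password": "change-me", "database": "mysql"}

def executed_gtid_set(host: str) -> str:
    """Return the GTID set this replica has already applied."""
    conn = pymysql.connect(host=host, **DB_ARGS)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT @@GLOBAL.gtid_executed")
            return cur.fetchone()[0]
    finally:
        conn.close()

def is_subset(conn, subset: str, superset: str) -> bool:
    """GTID_SUBSET(a, b) returns 1 when every transaction in a is also in b."""
    with conn.cursor() as cur:
        cur.execute("SELECT GTID_SUBSET(%s, %s)", (subset, superset))
        return cur.fetchone()[0] == 1

def most_current_replica() -> str:
    """Pick the replica whose executed set is a superset of every other replica's."""
    sets = {host: executed_gtid_set(host) for host in REPLICAS}
    conn = pymysql.connect(host=REPLICAS[0], **DB_ARGS)  # any server can evaluate GTID_SUBSET
    try:
        for candidate, candidate_set in sets.items():
            if all(is_subset(conn, other_set, candidate_set)
                   for other, other_set in sets.items() if other != candidate):
                return candidate
    finally:
        conn.close()
    raise RuntimeError("no replica dominates the others; halt and page a human")
```

If no replica's set dominates the others, the function refuses to guess, which is exactly the behaviour the control plane's safety rules demand.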
The Incident Control Plane
Fully autonomous remediation is seductive but dangerous. A system that blindly promotes a replica under an ambiguous network partition may promote the wrong node, causing data loss. I built a rule-governed incident control plane that separates what the system detects automatically from what it is allowed to do automatically.
If any safety condition is unmet, the control plane halts automated remediation, fires a PagerDuty-style alert, and presents the on-call engineer with a pre-validated runbook for the triggering alert, such as DBPrimaryDown, rather than leaving them to diagnose from scratch.
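Here's a minimal sketch of the gate itself. The specific checks shown (replication lag bound, probe quorum, no active maintenance window) are illustrative stand-ins rather than the platform's exact rule set; the point is the shape: automation proceeds only when every condition holds, and otherwise degrades to an alert plus a runbook.

```python
# Sketch of the safety gate in front of automated failover. The concrete checks
# below are illustrative; the pattern is what matters.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SafetyCheck:
    name: str
    passed: Callable[[], bool]

def attempt_automated_failover(checks: List[SafetyCheck],
                               promote_replica: Callable[[], None],
                               page_oncall: Callable[[str], None]) -> bool:
    failed = [check.name for check in checks if not check.passed()]
    if failed:
        # Automation stops here: page a human and hand over the pre-validated runbook.
        page_oncall("DBPrimaryDown: automated failover halted. "
                    f"Unmet safety conditions: {', '.join(failed)}.")
        return False
    promote_replica()   # every condition held: promote the most-current replica
    return True

# Example wiring with illustrative checks and stubbed actions:
checks = [
    SafetyCheck("replica lag under 5s", lambda: True),
    SafetyCheck("2-of-3 probes agree primary is down", lambda: True),
    SafetyCheck("no maintenance window active", lambda: True),
]
attempt_automated_failover(
    checks,
    promote_replica=lambda: print("promoting most-current replica"),
    page_oncall=lambda msg: print(f"ALERT: {msg}"),
)
```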
Event-Driven Trade Processing with Kafka
The application layer is decoupled via Apache Kafka. The Trade Ingestion API writes raw trade events to a Kafka topic. Downstream consumers — the Risk Engine and Order Book — process events independently. This means:
- A spike in trade volume doesn't directly pressure the database; Kafka absorbs the burst
- A slow consumer (e.g., a complex risk calculation) doesn't block fast consumers
- Events are replayed on consumer restart, giving the system natural crash recovery without data loss (see the consumer sketch below)
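A minimal consumer sketch shows why that replay property holds: offsets are committed only after an event has been processed, so a consumer that crashes mid-batch re-reads the uncommitted events when it restarts. The topic name, group id, broker addresses, and the kafka-python client are assumptions for illustration.

```python
# Minimal at-least-once consumer sketch (kafka-python). Topic, group, and broker
# names are placeholders. Offsets are committed only after processing succeeds.
import json
from kafka import KafkaConsumer

def evaluate_risk(trade: dict) -> None:
    """Stand-in for the Risk Engine's real position and exposure checks."""
    print(f"risk-checked trade {trade.get('id')}")

consumer = KafkaConsumer(
    "trades",
    bootstrap_servers=["kafka-1:9092", "kafka-2:9092"],
    group_id="risk-engine",
    enable_auto_commit=False,      # commit manually, only after processing
    auto_offset_reset="earliest",  # on first start, begin from the oldest retained event
    value_deserializer=lambda raw: json.loads(raw),
)

for message in consumer:
    evaluate_risk(message.value)
    consumer.commit()              # mark this event done; a crash before this line means replay
```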
Observability Stack
Every layer emits metrics scraped by Prometheus on a 15-second interval. Grafana dashboards are structured around the Four Golden Signals: latency, traffic, errors, and saturation. Alerting thresholds are derived from the error budget, not arbitrary numbers.
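As a concrete example of budget-derived alerting, the sketch below assumes a 99.9% availability SLO over a 30-day window (the actual SLO isn't stated here) and computes a burn rate: how many times faster than "exactly exhausting the budget" the service is currently failing. The 14.4x paging threshold follows the common multi-window burn-rate pattern, where roughly 2% of a monthly budget is consumed in a single hour.

```python
# Sketch: turn an SLO into an alert threshold. The 99.9% SLO, 30-day window, and
# the sample error ratio are assumptions for illustration.
SLO = 0.999                       # target fraction of successful requests
ERROR_BUDGET = 1 - SLO            # fraction of requests allowed to fail per window
WINDOW_HOURS = 30 * 24

def burn_rate(observed_error_ratio: float) -> float:
    """How many times faster than break-even the error budget is being spent."""
    return observed_error_ratio / ERROR_BUDGET

# Burning at ~14.4x for one hour consumes about 2% of the 30-day budget
# (14.4 / 720 hours ≈ 0.02), a common paging threshold.
PAGE_THRESHOLD = 14.4

observed = 0.015                  # e.g. 1.5% of requests failing over the last hour
if burn_rate(observed) > PAGE_THRESHOLD:
    print(f"page: burn rate {burn_rate(observed):.1f}x exceeds {PAGE_THRESHOLD}x")
```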
Infrastructure as Code: Terraform + Ansible
No component of this platform is configured by hand. Terraform provisions the cloud resources (GKE node pools, VPC networking, firewall rules, load balancers) while Ansible handles in-node configuration: MySQL replication setup, Kafka broker tuning, Vault agent injection, and HAProxy backend templating.
Running terraform plan before every apply generates a diff that is reviewed in CI. A node that drifts from its Ansible playbook is treated as a misconfigured node, not a special snowflake.
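Here's a sketch of how that CI gate can work: terraform plan with -detailed-exitcode exits 0 when live infrastructure matches the code, 2 when there is a diff to review, and 1 on error. The working directory and plan-file name are assumptions.

```python
# Sketch of the CI drift gate around `terraform plan`. Paths are placeholders.
import subprocess
import sys

result = subprocess.run(
    ["terraform", "plan", "-detailed-exitcode", "-no-color", "-out=plan.tfplan"],
    cwd="infra/terraform",
    capture_output=True,
    text=True,
)

if result.returncode == 0:
    print("No changes: live infrastructure matches the code.")
elif result.returncode == 2:
    print(result.stdout)               # surface the diff for reviewers in the CI log
else:
    print(result.stderr, file=sys.stderr)
    sys.exit(1)                        # the plan itself failed; block the pipeline
```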
Chaos Engineering & Validation
An SRE platform that has never been tested under failure is just a theory. I implemented a chaos engineering suite that deliberately injects failure conditions to validate that every alert fires, every automated action executes correctly, and every runbook step actually works before a real incident puts them to the test.
- DB Kill Test: Sends SIGKILL to the MySQL primary process. Validates the 10-second failover path end-to-end (sketched after this list)
- Network Partition Test: Uses iptables rules to partition the primary from replicas without killing the process, verifying split-brain detection
- Kafka Broker Eviction: Removes a broker from the cluster mid-stream, confirming consumer group rebalancing and zero message loss
- Pod Drain Test: Cordons and drains a Kubernetes node while traffic is flowing, validating PodDisruptionBudget enforcement
- Vault Seal Test: Seals Vault mid-operation, confirming that cached leases remain valid and that secret renewal fails gracefully with alerting
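To show what these tests look like in practice, here's a sketch of the DB Kill Test: kill the primary's mysqld, then poll the write endpoint behind HAProxy until a promoted replica accepts writes, failing the test if that takes longer than the 10-second budget. Hostnames, credentials, and the SSH-based kill are assumptions for illustration.

```python
# Sketch of the DB Kill Test. Hostnames, credentials, and the SSH-based SIGKILL
# are placeholders; the assertion is the point: a writable primary must be back
# behind the HAProxy endpoint within the 10-second budget.
import subprocess
import time
import pymysql

FAILOVER_BUDGET_S = 10
WRITE_HOST, WRITE_PORT = "db-write.haproxy.internal", 3306

def primary_accepts_writes() -> bool:
    """True when the endpoint answers and the node behind it is not read-only."""
    try:
        conn = pymysql.connect(host=WRITE_HOST, port=WRITE_PORT, user="chaos",
                               password="change-me", database="trading",
                               connect_timeout=1)
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT @@GLOBAL.read_only")
                return cur.fetchone()[0] == 0
        finally:
            conn.close()
    except Exception:
        return False

# 1. Inject the failure: SIGKILL the MySQL primary process.
subprocess.run(["ssh", "db-primary.internal", "sudo", "pkill", "-9", "mysqld"], check=True)

# 2. Measure recovery: the control plane promotes a replica and HAProxy repoints writes.
start = time.monotonic()
while not primary_accepts_writes():
    if time.monotonic() - start > FAILOVER_BUDGET_S:
        raise AssertionError(f"failover exceeded the {FAILOVER_BUDGET_S}s budget")
    time.sleep(0.5)

print(f"failover completed in {time.monotonic() - start:.1f}s")
```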
Secrets Management with HashiCorp Vault
All database credentials, Kafka SASL passwords, and API keys are injected at runtime via HashiCorp Vault's Kubernetes Auth method. The Vault Agent sidecar reads a service account token, authenticates to Vault, fetches dynamic secrets, and writes them to a shared in-memory volume. The application never touches a static credential file.
Dynamic database credentials from Vault's database secrets engine are rotated every hour. Even if a credential leaks, its lifetime is bounded.
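Here's a sketch of the consuming side, under the assumption that the Vault Agent template renders the dynamic credentials as JSON at /vault/secrets/db-creds.json on the shared in-memory volume (both the path and the format are illustrative): the application reads the file and reconnects when the hourly rotation rewrites it.

```python
# Sketch of reading Vault-Agent-rendered credentials. The path and JSON shape are
# assumptions; the agent template controls the actual format.
import json
import os
import pymysql

CREDS_PATH = "/vault/secrets/db-creds.json"   # shared in-memory volume with the sidecar

def connect_with_current_creds():
    with open(CREDS_PATH) as f:
        creds = json.load(f)                  # e.g. {"username": "...", "password": "..."}
    return pymysql.connect(host="db-write.haproxy.internal",
                           user=creds["username"],
                           password=creds["password"],
                           database="trading")

_last_mtime = os.stat(CREDS_PATH).st_mtime
_conn = connect_with_current_creds()

def get_connection():
    """Return a live connection, reconnecting if the sidecar rotated the credentials."""
    global _conn, _last_mtime
    mtime = os.stat(CREDS_PATH).st_mtime
    if mtime != _last_mtime:                  # hourly rotation rewrote the file
        _conn.close()
        _conn = connect_with_current_creds()
        _last_mtime = mtime
    return _conn
```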
Engineering Takeaways
The most important engineering decision in this project wasn't a technology choice — it was the choice of where not to automate. Full automation in a financial system creates systemic risk: an automated action that fires on a false positive can be worse than the original incident. The incident control plane's safety rules encode the on-call engineer's institutional knowledge into the system, making automation trustworthy rather than reckless.