I build the platforms production runs on.
Senior platform engineer with 6+ years owning Kafka, Kubernetes and the LGTM stack across multi-tenant enterprise SaaS — turning distributed chaos into observable, cost-efficient infrastructure.
- Apache Kafka
- Kubernetes
- Terraform
- OpenTelemetry
- Grafana LGTM
- ArgoCD
- Spring Boot
- AWS · Azure · GCP
- $2M+ annual cost saved
- 50K+ events / day
- 45% lower MTTR
- 100+ microservices
Illumio observability rebuild
real-time pipelines
SLO-driven alerting
multi-tenant SaaS
layer 01
Apache Kafka
layer 02
Kubernetes · GitOps
layer 03
LGTM Observability
About
Reliability is a feature. Someone has to ship it.
I’m a platform engineer who treats production like a product — every service traced, every alert intentional, every dollar of cloud spend defensible.
Over the past six years at Sarvaha, I’ve owned the design of multi-tenant SaaS platforms for Illumio, ApexaIQ, and Tesla’s connected-car program — turning Apache Kafka, Kubernetes and the LGTM stack into systems teams actually trust at 2am.
The work I’m proudest of isn’t the scale (though we hit 50K+ events/day sub-second into BigQuery). It’s the silence — alerts that only fire when something real happens, dashboards engineers actually open, runbooks that cut MTTR by 45%.
I work end-to-end across architecture, infrastructure, and the long tail of operational glue that keeps distributed systems honest. Java and Spring Boot for event-driven backends, Terraform and ArgoCD for GitOps, Prometheus/Loki/Tempo for the parts you only notice when they’re broken.
Core expertise
- Distributed Systems
- Event-Driven Architecture
- Kafka Streaming
- Kubernetes & GitOps
- Observability Engineering
- Cloud Infrastructure
- Real-Time Data Pipelines
- Platform Engineering
- Multi-Tenant SaaS
- DevOps Automation
- SRE & Reliability
- Microservices
- Based: India · remote
- Experience: 6+ years
- Focus: Platform · SRE
- Open to: Senior / Staff roles
Experience
Six years. One mission.
Senior platform engineer at Sarvaha Systems — embedded with enterprise clients to design, ship, and operate the distributed systems they bet their products on.
Senior Software Engineer
Sarvaha Systems Pvt. Ltd.
Dec 2019 — Present · India · Remote
Trusted platform partner across six enterprise products — from Illumio’s observability rebuild to Tesla’s real-time fleet telemetry. Own architecture, rollout and reliability for distributed systems running in production at global scale.
- Designed multi-tenant SaaS platforms across 100+ microservices, 20+ tenant deployments.
- Drove $2M+ in annual cost savings through a self-hosted observability platform migration.
- Owned Kafka-on-Kubernetes architecture, GitOps rollouts, and SLO-based reliability.
- Cross-functional lead on architecture decisions and platform rollouts for global enterprise clients.
Selected work
Production case studies. Real impact, real metrics.
Six engagements over six years — observability, streaming pipelines, event-driven backends, and the multi-tenant platforms underneath them.
Reliability · SRE
01
Enterprise Observability Platform Migration
Illumio
A self-hosted LGTM rebuild that retired a $2M SaaS bill. Led the migration from Observe SaaS to a fully self-hosted Prometheus / Loki / Tempo / Grafana stack across three production environments, with SLO-driven alerting on top.
- Eliminated $2M+ in annual licensing spend, cut environment-specific incidents by 40%.
- Built 15+ production dashboards and PromQL / LogQL / TraceQL queries across 50+ services.
- Migrated 100+ alerts to Helm-managed Alertmanager — 60% faster incident response.
- SLO-based alerting reduced alert volume by 70% and improved MTTR by 45%.
Platform · Security
02
Asset Management & Cybersecurity SaaS
ApexaIQ
A real-time vulnerability pipeline on Kafka, hardened for 20+ tenants. Architected a sub-second vulnerability detection pipeline on Kafka, fully automated on EKS with GitOps and tier-1 multi-tenant observability.
- 10K+ records/day processed at sub-second latency across 20+ multi-tenant deployments.
- Terraform + Helm cut Kafka-on-EKS spin-up from 2 days to 30 minutes.
- ArgoCD GitOps shipped 50+ zero-downtime releases per month.
- 4-tier Grafana dashboards (Global / Tenant / Security / Ops) across 6 metric domains.
- 35% faster queries via per-accelerator partitioning; auto-routed 500+ tickets/month via Workato.
Event-Driven · Java
03
Google Integration Service
Multi-Tenant Command Orchestration
Event-driven command orchestration with 99.5% delivery reliability. Designed and shipped a Java + Kafka command-execution platform orchestrating concurrent device commands across isolated tenants, integrated with Google GAC APIs.
- 500+ concurrent device commands across 10+ isolated tenants.
- Configurable retry & expiry orchestration — 80% lower command failure rate.
- 99.5% message delivery reliability under production load.
- Indexed schema scaling to 100K+ batch, device and command records.
Streaming · IoT
04
Real-Time Fleet Telemetry Platform
Connected Cars (Tesla EV)
50K+ telemetry events/day from a Tesla EV fleet, into BigQuery sub-second. Architected the ingestion pipeline for a connected-car program — streaming Tesla EV telemetry into BigQuery in real time, with Strimzi-managed Kafka on AKS and GitOps rollouts.
- 50K+ events/day streamed sub-second into BigQuery.
- Terraform + ArgoCD provisioning cut cluster setup time by 75%.
- 30+ automated deployments per month with zero downtime.
- 8+ analytics REST APIs powering vehicle performance and driver-behavior insights.
AI · Observability
05
Personalized AI Customer Support Platform
Agentic-AI
Full LGTM stack instrumenting an AI agent fleet end-to-end. Modernised diagnostics for a distributed AI customer-support platform — full LGTM stack on Kubernetes with OpenTelemetry tracing across every microservice.
- MTTD reduced by 50% across distributed AI services.
- 100% trace coverage on 5+ instrumented microservices.
- End-to-end signal correlation cut debug time by 60%.
Data · Healthcare
06
Clinical Data Pipeline
OMOP ETL
Customer databases → OMOP Common Data Model, 10+ DBT mappings. Modelled DBT-driven ETL pipelines converting heterogeneous customer databases into the OMOP Common Data Model, authoring technical specs across 20+ modules.
- 10+ DBT-driven mappings into OMOP CDM.
- Technical specs authored across 20+ modules.
- Standards-aligned pipeline ready for OHDSI tooling.
Toolkit
The stack I ship with. End-to-end.
Languages and frameworks I reach for daily — from event-driven backends to the observability tooling that makes them honest.
Languages
01
- Java
- TypeScript
- JavaScript
- Python
Backend
02
- Spring Boot
- Node.js
- Express.js
- NestJS
Frontend
03
- React
- Vue.js
- Angular
Cloud & DevOps
04
- AWS (EKS)
- Azure (AKS)
- GCP
- Kubernetes
- Terraform
- Docker
- Helm
- ArgoCD
Streaming & Data
05
- Apache Kafka
- Strimzi
- PostgreSQL
- MongoDB
- BigQuery
Observability
06
- Grafana
- Prometheus
- Loki
- Tempo
- Mimir
- OpenTelemetry
- Alertmanager
- PagerDuty
AI / ML
07
- LangChain
- OpenAI
- TensorFlow
Education
Where the foundations were laid.
B.Tech, Computer Science and Engineering
SGGSIET, Nanded
Autonomous Institute of the Government of India
Numbers, the boring kind
Six years, in production receipts.
Every metric below is owned, shipped, and measured against a real workload — no rounded marketing numbers.
- $2M+ annual licensing saved (self-hosted LGTM stack at Illumio)
- 50K+ telemetry events / day (sub-second Kafka → BigQuery)
- 99.5% message delivery reliability (event-driven command platform)
- 70% alert volume cut (noise-reduction & SLO strategy)
- 60% faster incident response (100+ alerts on Helm-managed Alertmanager)
- 75% faster cluster setup (Terraform + ArgoCD GitOps)
- 50% lower MTTD (full LGTM rollout on Kubernetes)
- 100% trace coverage (OpenTelemetry across 5+ services)
Let’s talk
Have a platform that needs to scale quietly?
Open to senior and staff-level platform, SRE and backend roles. Happy to talk about observability rebuilds, Kafka migrations, or anything multi-tenant.