available for senior + staff platform roles

I build the platforms production runs on.

Senior platform engineer with 6+ years owning Kafka, Kubernetes and the LGTM stack across multi-tenant enterprise SaaS — turning distributed chaos into observable, cost-efficient infrastructure.

  • Apache Kafka
  • Kubernetes
  • Terraform
  • OpenTelemetry
  • Grafana LGTM
  • ArgoCD
  • Spring Boot
  • AWS · Azure · GCP
annual cost saved
$2M+

Illumio observability rebuild

events / day
50K+

real-time pipelines

lower MTTR
45%

SLO-driven alerting

microservices
100+

multi-tenant SaaS

layer 01

Apache Kafka

50K+ events/day · sub-second
p99 · 412ms

layer 02

Kubernetes · GitOps

Terraform · ArgoCD · Helm
20+ tenants · 0 downtime

layer 03

LGTM Observability

Prometheus · Loki · Tempo · Grafana
MTTR · -45%
↘ production · live
uptime · 99.95%

About

Reliability is a feature. Someone has to ship it.

I’m a platform engineer who treats production like a product — every service traced, every alert intentional, every dollar of cloud spend defensible.

Over the past six years at Sarvaha, I’ve owned the design of multi-tenant SaaS platforms for Illumio, ApexaIQ, and Tesla’s connected-car program — turning Apache Kafka, Kubernetes and the LGTM stack into systems teams actually trust at 2am.

The work I’m proudest of isn’t the scale (though we hit 50K+ events/day sub-second into BigQuery). It’s the silence — alerts that only fire when something real happens, dashboards engineers actually open, runbooks that cut MTTR by 45%.

I work end-to-end across architecture, infrastructure, and the long tail of operational glue that keeps distributed systems honest. Java and Spring Boot for event-driven backends, Terraform and ArgoCD for GitOps, Prometheus/Loki/Tempo for the parts you only notice when they’re broken.

Core expertise

  • Distributed Systems
  • Event-Driven Architecture
  • Kafka Streaming
  • Kubernetes & GitOps
  • Observability Engineering
  • Cloud Infrastructure
  • Real-Time Data Pipelines
  • Platform Engineering
  • Multi-Tenant SaaS
  • DevOps Automation
  • SRE & Reliability
  • Microservices
Based
India · remote
Experience
6+ years
Focus
Platform · SRE
Open to
Senior / Staff roles

Experience

Six years. One mission.

Senior platform engineer at Sarvaha Systems — embedded with enterprise clients to design, ship, and operate the distributed systems they bet their products on.

  1. Senior Software Engineer

    Sarvaha Systems Pvt. Ltd.

    Dec 2019 — Present · India · Remote

    Trusted platform partner across six enterprise products — from Illumio’s observability rebuild to Tesla’s real-time fleet telemetry. Own architecture, rollout and reliability for distributed systems running in production at global scale.

    • Designed multi-tenant SaaS platforms across 100+ microservices, 20+ tenant deployments.
    • Drove $2M+ in annual cost savings through a self-hosted observability platform migration.
    • Owned Kafka-on-Kubernetes architecture, GitOps rollouts, and SLO-based reliability.
    • Cross-functional lead on architecture decisions and platform rollouts for global enterprise clients.

Selected work

Production case studies. Real impact, real metrics.

Six engagements over six years — observability, streaming pipelines, event-driven backends, and the multi-tenant platforms underneath them.

Reliability · SRE

01

Enterprise Observability Platform Migration

Illumio

A self-hosted LGTM rebuild that retired a $2M SaaS bill. Led the migration from Observe SaaS to a fully self-hosted Prometheus / Loki / Tempo / Grafana stack across three production environments, with SLO-driven alerting on top.

  • Eliminated $2M+ in annual licensing spend, cut environment-specific incidents by 40%.
  • Built 15+ production dashboards and PromQL / LogQL / TraceQL queries across 50+ services.
  • Migrated 100+ alerts to Helm-managed Alertmanager — 60% faster incident response.
  • SLO-based alerting reduced alert volume by 70% and improved MTTR by 45%.
Prometheus · Grafana · Loki · Tempo · Alertmanager · PromQL · LogQL · PagerDuty
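
A minimal sketch of the idea behind SLO-based alerting, assuming a multi-window burn-rate rule. The function names, the 99.9% target and the 14.4 threshold are illustrative assumptions, not the rules deployed in this engagement.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Speed at which the error budget is spent; 1.0 means exactly on budget."""
    budget = 1.0 - slo_target  # allowed error ratio, e.g. 0.001 for a 99.9% SLO
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_errors: float, short_window_errors: float,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only when both the long and the short window burn fast.

    Requiring both windows suppresses alerts for brief spikes (short window
    only) and for incidents that have already recovered (long window only).
    """
    return (burn_rate(long_window_errors, slo_target) >= threshold
            and burn_rate(short_window_errors, slo_target) >= threshold)
```

With these assumed defaults, a 2% error ratio in both windows pages, while a 2% long-window ratio with a recovered short window does not. Filtering out those transient and already-resolved cases is how this style of rule reduces alert volume.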

Platform · Security

02

Asset Management & Cybersecurity SaaS

ApexaIQ

A real-time vulnerability pipeline on Kafka, hardened for 20+ tenants. Architected a sub-second vulnerability detection pipeline on Kafka, fully automated on EKS with GitOps and tier-1 multi-tenant observability.

  • 10K+ records/day processed at sub-second latency across 20+ multi-tenant deployments.
  • Terraform + Helm cut Kafka-on-EKS spin-up from 2 days to 30 minutes.
  • ArgoCD GitOps shipped 50+ zero-downtime releases per month.
  • 4-tier Grafana dashboards (Global / Tenant / Security / Ops) across 6 metric domains.
  • 35% faster queries via per-accelerator partitioning; auto-routed 500+ tickets/month via Workato.
Kafka · Kubernetes · AWS EKS · Terraform · ArgoCD · OpenTelemetry · Vue.js · Spring Boot · PostgreSQL

Event-Driven · Java

03

Google Integration Service

Multi-Tenant Command Orchestration

Event-driven command orchestration with 99.5% delivery reliability. Designed and shipped a Java + Kafka command-execution platform orchestrating concurrent device commands across isolated tenants, integrated with Google GAC APIs.

  • 500+ concurrent device commands across 10+ isolated tenants.
  • Configurable retry & expiry orchestration — 80% lower command failure rate.
  • 99.5% message delivery reliability under production load.
  • Indexed schema scaling to 100K+ batch, device and command records.
Java 17 · Spring Boot 6 · Apache Kafka · Google GAC API · PostgreSQL
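
A minimal sketch of what configurable retry and expiry can look like, assuming exponential backoff capped by a per-command deadline. `RetryPolicy`, `next_action` and the default values are hypothetical and shown in Python for brevity; they are not the platform's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical per-tenant policy; every default is an assumption."""
    max_attempts: int = 3      # total attempts allowed per command
    base_delay_s: float = 1.0  # delay before the first retry
    expiry_s: float = 30.0     # command is dropped once it is this old


def next_action(policy: RetryPolicy, attempt: int, age_s: float) -> tuple[str, float]:
    """Decide what to do after failed attempt number `attempt` (1-based).

    Returns (action, delay_seconds), where action is "retry", "fail" or "expire".
    """
    if age_s >= policy.expiry_s:
        return ("expire", 0.0)  # too old: stop instead of sending a stale command
    if attempt >= policy.max_attempts:
        return ("fail", 0.0)    # attempts exhausted
    delay = policy.base_delay_s * 2 ** (attempt - 1)  # exponential backoff
    delay = min(delay, policy.expiry_s - age_s)       # never schedule past expiry
    return ("retry", delay)
```

For example, `next_action(RetryPolicy(), 1, 0.0)` returns `("retry", 1.0)`, and any call with `age_s` at or beyond `expiry_s` returns `("expire", 0.0)` regardless of the attempt count.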

Streaming · IoT

04

Real-Time Fleet Telemetry Platform

Connected Cars (Tesla EV)

50K+ telemetry events/day from a Tesla EV fleet, into BigQuery sub-second. Architected the ingestion pipeline for a connected-car program — streaming Tesla EV telemetry into BigQuery in real time, with Strimzi-managed Kafka on AKS and GitOps rollouts.

  • 50K+ events/day streamed sub-second into BigQuery.
  • Terraform + ArgoCD provisioning cut cluster setup time by 75%.
  • 30+ automated deployments per month with zero downtime.
  • 8+ analytics REST APIs powering vehicle performance and driver-behavior insights.
Apache Kafka · Strimzi · Azure AKS · Terraform · ArgoCD · BigQuery · Node.js · Flutter

AI · Observability

05

Personalized AI Customer Support Platform

Agentic-AI

Full LGTM stack instrumenting an AI agent fleet end-to-end. Modernised diagnostics for a distributed AI customer-support platform — full LGTM stack on Kubernetes with OpenTelemetry tracing across every microservice.

  • MTTD reduced by 50% across distributed AI services.
  • 100% trace coverage on 5+ instrumented microservices.
  • End-to-end signal correlation cut debug time by 60%.
OpenTelemetry · Grafana LGTM · Winston-Loki · Kubernetes · Next.js · NestJS · Python

Data · Healthcare

06

Clinical Data Pipeline

OMOP ETL

Customer databases → OMOP Common Data Model, 10+ DBT mappings. Modelled DBT-driven ETL pipelines converting heterogeneous customer databases into the OMOP Common Data Model, and authored technical specs across 20+ modules.

  • 10+ DBT-driven mappings into OMOP CDM.
  • Technical specs authored across 20+ modules.
  • Standards-aligned pipeline ready for OHDSI tooling.
Google BigQuery · OMOP CDM · DBT · OHDSI · Athena

Toolkit

The stack I ship with. End-to-end.

Languages and frameworks I reach for daily — from event-driven backends to the observability tooling that makes them honest.

Languages

01
  • Java
  • TypeScript
  • JavaScript
  • Python

Backend

02
  • Spring Boot
  • Node.js
  • Express.js
  • NestJS

Frontend

03
  • React
  • Vue.js
  • Angular

Cloud & DevOps

04
  • AWS (EKS)
  • Azure (AKS)
  • GCP
  • Kubernetes
  • Terraform
  • Docker
  • Helm
  • ArgoCD

Streaming & Data

05
  • Apache Kafka
  • Strimzi
  • PostgreSQL
  • MongoDB
  • BigQuery

Observability

06
  • Grafana
  • Prometheus
  • Loki
  • Tempo
  • Mimir
  • OpenTelemetry
  • Alertmanager
  • PagerDuty

AI / ML

07
  • LangChain
  • OpenAI
  • TensorFlow

Education

Where the foundations were laid.

B.Tech, Computer Science and Engineering

SGGSIET, Nanded

Autonomous Institute of the Government of India

Numbers, the boring kind

Six years, in production receipts.

Every metric below is owned, shipped, and measured against a real workload — no rounded marketing numbers.

$2M+

annual licensing saved

self-hosted LGTM stack at Illumio

50K+

telemetry events / day

sub-second Kafka → BigQuery

99.5%

message delivery

event-driven command platform

70%

alert volume cut

noise-reduction & SLO strategy

60%

faster incident response

100+ alerts on Helm-managed Alertmanager

75%

faster cluster setup

Terraform + ArgoCD GitOps

50%

lower MTTD

full LGTM rollout on Kubernetes

100%

trace coverage

OpenTelemetry across 5+ services

Let’s talk

Have a platform that needs to scale quietly?

Open to senior and staff-level platform, SRE and backend roles. Happy to talk about observability rebuilds, Kafka migrations, or anything multi-tenant.