Innovative Monitoring Systems by Google Engineer Bhasker Goel

Synopsis

Google Ads engineer Bhasker Goel is developing monitoring systems to detect subtle failures in complex, distributed infrastructure. His comparative monitoring approach contrasts system segments in real time, identifying divergences that signal regressions before traditional alerts trigger. This method moves beyond static thresholds to a live validation system, which matters for preventing costly outages in modern software development.

Spotlight Wire
When a major platform ships a broken update, the damage rarely arrives with a clear warning. A core metric drifts imperceptibly, a graph tilts a few degrees, and engineers argue over whether the signal is real. Bhasker Goel, a software engineer at Google Ads in Bengaluru, works on the narrow but consequential gap between a system beginning to fail and a monitoring system recognising that failure.

Goel, an alumnus of IIT Guwahati, built financial infrastructure at DE Shaw before joining Google, where he now designs reliability systems for Google Ads. In environments of this size, small anomalies can cascade rapidly. The regressions his systems are built to catch tend to be gradual and silent; left undetected, they can become expensive quickly.

He recognised early on that traditional monitoring struggles to keep up with modern infrastructure. Older systems were designed for predictable, monolithic applications.

Today’s distributed infrastructure involves shifting traffic, layered dependencies and continuous deployment pipelines. A fixed threshold calibrated for normal conditions might miss a regression affecting only a specific user segment, or it might fire false alerts so frequently that engineers suffer alarm fatigue and simply ignore them.
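To make the segment problem concrete, here is a minimal Python sketch. The segment names, traffic counts and the 1% error-rate threshold are hypothetical numbers chosen for illustration; they are not figures from Goel's systems. The sketch shows how a fleet-wide error rate can stay under a fixed threshold while one small segment is badly degraded.

```python
# Hypothetical traffic snapshot: segment -> (requests, errors).
# All names and numbers here are illustrative assumptions.
SEGMENTS = {
    "desktop": (900_000, 4_500),     # 0.5% errors
    "mobile_web": (80_000, 400),     # 0.5% errors
    "legacy_app": (20_000, 2_000),   # 10% errors: the regressed segment
}

THRESHOLD = 0.01  # assumed static alert line: 1% error rate


def error_rate(requests: int, errors: int) -> float:
    """Fraction of requests that failed; 0.0 when there is no traffic."""
    return errors / requests if requests else 0.0


def aggregate_rate(segments: dict) -> float:
    """Error rate across all segments combined, as a fleet-wide dashboard would show."""
    total_requests = sum(r for r, _ in segments.values())
    total_errors = sum(e for _, e in segments.values())
    return error_rate(total_requests, total_errors)


def segments_over(segments: dict, threshold: float) -> list:
    """Names of segments whose own error rate exceeds the threshold."""
    return [
        name
        for name, (requests, errors) in segments.items()
        if error_rate(requests, errors) > threshold
    ]


if __name__ == "__main__":
    print(f"aggregate error rate: {aggregate_rate(SEGMENTS):.2%}")
    print(f"aggregate alert fires: {aggregate_rate(SEGMENTS) > THRESHOLD}")
    print(f"segments over threshold: {segments_over(SEGMENTS, THRESHOLD)}")
```

In this made-up snapshot the combined error rate is 0.69%, so the fleet-wide alert stays silent even though one segment is failing on one request in ten.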

The pace of software development is compounding the problem. AI-assisted coding tools are increasing the volume of code moving through production systems, which means more changes, more edge cases and less time for engineers to distinguish a real regression from routine variation. In that environment, monitoring systems built for slower, more predictable release cycles are under growing strain [1].

“The default industry model is still to pick a threshold, watch a dashboard, and alert when a number crosses a line,” said Bhasker Goel. “That worked well when systems were smaller and traffic was predictable. At current scale, relying solely on static thresholds can become a liability,” he added.

Goel’s work centres on an approach known as comparative monitoring. Rather than measuring metrics against a static line, his systems compare two segments of the same environment in real time. If functionally identical parts of the system begin diverging under the same external conditions, that divergence surfaces a regression well before a conventional alert fires. Because both sides share the same live traffic patterns, the ambient noise that makes conventional monitoring unreliable is largely filtered out.
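The general idea can be illustrated with a minimal Python sketch. It is not Goel's or Google's implementation: the function name, the 5% tolerance, the three-window persistence rule and the sample numbers are all assumptions made for illustration. Two series of the same metric, one per segment, are compared window by window, and a divergence is reported only when the relative gap stays above the tolerance for several consecutive windows.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class DivergenceResult:
    diverged: bool
    first_window: Optional[int]  # index where the sustained divergence began


def detect_divergence(
    control: Sequence[float],
    candidate: Sequence[float],
    tolerance: float = 0.05,   # assumed: 5% relative gap counts as a mismatch
    min_consecutive: int = 3,  # assumed: mismatch must persist for 3 windows
) -> DivergenceResult:
    """Compare two segments' metric values over the same time windows."""
    if len(control) != len(candidate):
        raise ValueError("both segments need one value per time window")

    streak = 0
    for i, (base, value) in enumerate(zip(control, candidate)):
        if base == 0:
            # Avoid dividing by zero: equal zeros agree, anything else does not.
            gap = 0.0 if value == 0 else float("inf")
        else:
            gap = abs(value - base) / abs(base)

        if gap > tolerance:
            streak += 1
            if streak >= min_consecutive:
                return DivergenceResult(True, i - min_consecutive + 1)
        else:
            streak = 0  # a single agreeing window resets the count

    return DivergenceResult(False, None)


# Hypothetical per-window values of one metric for two segments under the same traffic.
CONTROL = [100, 140, 90, 120, 150, 110]
CANDIDATE = [101, 139, 91, 108, 134, 98]  # drifts about 10% low from window 3 on

if __name__ == "__main__":
    print(detect_divergence(CONTROL, CANDIDATE))
```

In this made-up data both series swing between 90 and 150 as traffic changes, so a fixed floor of, say, 80 would never fire on the candidate segment. The pairwise comparison flags it because the swings are common to both sides and cancel out, leaving only the gap between them.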

The logic extends beyond advertising. Cloud infrastructure rollouts, algorithmic trading platforms and global consumer apps all face the same fundamental engineering challenge: distinguishing genuine degradation from normal production variation fast enough to prevent an outage.

Goel’s work aims to shift monitoring from a reactive reporting layer to a system of live validation, where every change is tested against a dynamic baseline before it spreads.

Vinay Kakade, co-founder of Infino AI and former senior staff engineer at Lyft, noted that Goel’s contributions address a widespread industry challenge. “Once you operate distributed systems at scale, you stop trusting static thresholds as your primary defence because the baseline is always moving,” Kakade said.

“What Goel is working on asks a much more robust question: did two identical parts of the system suddenly stop agreeing? That is often the fastest way to separate a real regression from production noise,” he added.

“You stop asking what happened after the fact, and start asking whether what is happening right now matches what was supposed to happen,” Goel said.

As software infrastructure grows more complex, approaches like these are becoming harder to ignore. Catching failures before they become incidents rarely makes product release headlines, but it remains some of the most critical engineering work happening behind the scenes [2].

References:
  1. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4945566
  2. https://sre.google/workbook/reaching-beyond/

Disclaimer: This article is generated and published by the ET Spotlight team.

This editorial summary reflects ET Tech and other public reporting on Innovative Monitoring Systems by Google Engineer Bhasker Goel.

Reviewed by WTGuru editorial team.