What Took Down Gmail Is Also Happening in Your ICU
Key takeaways:
- The system was designed to wait for a human to decide it’s their problem, and that human is already managing forty other things.
- Knowing where a problem is doesn’t resolve it if the information still has to travel through three phone calls to reach someone who can act.
- The same failure mode that took down Gmail happens in hospitals every time a resource goes missing and three people search for it without knowing the others already did.
When I was running engineering at Jet, a critical service went down overnight. The alarm fired correctly. Three separate people got woken up. Each one looked at the alert, assumed it belonged to another team, and went back to sleep.
In the morning, the site had no products to sell. We noticed immediately.
We did a post-mortem. We cleaned up the alert noise. We ran a re-education session on on-call ownership. All reasonable responses. All of them addressed the symptoms.
The design flaw underneath stayed intact: we were still building systems that required a human to decide it was their problem before anything happened.
We got away with it. The outage ran from 4am to 7am, the lowest-traffic window we had. In a different context (say, a nurse trying to locate a ventilator at 3am), you don’t get to learn that lesson cheaply.
The dashboard problem
The last decade of healthcare technology investment built excellent dashboards. EHR integrations, RTLS location tracking, staffing tools, bed management systems. All of them generate real data. All of them require a human to convert that data into action.
Same design flaw. Different stakes.
What we hear consistently from clinicians and ops teams: a patient is discharged and leaves the hospital. The discharge takes about an hour to show up in the EHR. EVS and transport dispatch then see a terminal clean and a transport request on their dashboards and work them first in, first out. The next patient has already been waiting in the ED for hours, and the bed sits empty for nearly two hours more.
No one failed. The data was right. The design assumption — that someone would be watching, and free, and consider it their job — was wrong.
A charge nurse running a twelve-hour shift isn’t watching a dashboard. Patient placement is evaluating 40 open bed requests at once, and prioritizing the ones they get phone calls about. Insight lives in a screen room. The problem lives on the floor.
Visibility tells you there is a problem. It does not fix the problem.
What “agentic” actually means
The word gets used loosely. Here’s what I mean: an agent closes the loop without waiting for a human to decide it’s their problem.
The first piece, and the hardest, is synthesis. The EHR sees the orders. RTLS knows where the equipment is. The staffing tool knows who’s available. These systems were never designed to talk to each other. They generate real data in different formats, managed by different teams, on different refresh cycles.
That’s what we’ve built in the Intelligent Orchestration Platform: a layer that connects those systems, cleans the data, and creates a unified operational picture of what’s actually happening in the building right now.
That matters on its own. Most hospitals are running on EHR-plus-phone-calls and calling it operations. A unified real-time picture is a material improvement over rearview reports.
But synthesis is still passive. It’s a better dashboard.
What we’re building on top of it is the action layer. Here’s what that looks like in practice.
The discharge occurs. The EHR updates. The platform sees that the right EVS team member is two floors up, finishing a task that ends in four minutes, and that the next patient in the ED has been waiting forty minutes. The task routes. The bed prep starts. The charge nurse doesn’t need to call patient placement. Patient placement prioritizes the right patient. The patient occupies the bed hours sooner.
Nobody coordinated it. The system did.
That’s what closing the loop actually means. Not a better alert. Not a smarter dashboard. A system that acts on the full picture without waiting for a human to carry information from one system to another.
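To make the routing step concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the names (`Worker`, `BedRequest`, `route_bed_prep`), the travel-time field, and the two selection rules are hypothetical, not the platform's actual logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Worker:
    name: str
    minutes_until_free: int  # time left on the current task
    travel_minutes: int      # estimated walk to the vacated bed


@dataclass
class BedRequest:
    patient_id: str
    minutes_waiting: int


def pick_worker(workers: list[Worker]) -> Optional[Worker]:
    """Pick whoever can reach the bed soonest, not whoever is idle right now."""
    if not workers:
        return None
    return min(workers, key=lambda w: w.minutes_until_free + w.travel_minutes)


def pick_patient(requests: list[BedRequest]) -> Optional[BedRequest]:
    """Pick the longest-waiting patient instead of the loudest phone call."""
    if not requests:
        return None
    return max(requests, key=lambda r: r.minutes_waiting)


def route_bed_prep(workers: list[Worker], requests: list[BedRequest]) -> Optional[dict]:
    """Turn a discharge event into one task with one owner (illustrative rule)."""
    worker, patient = pick_worker(workers), pick_patient(requests)
    if worker is None or patient is None:
        return None
    return {
        "task": "terminal_clean",
        "owner": worker.name,
        "next_patient": patient.patient_id,
    }


# Hypothetical data: one worker finishing in 4 minutes nearby, one idle but far away.
evs = [Worker("evs_nearby", 4, 2), Worker("evs_far", 0, 15)]
ed_queue = [BedRequest("patient_a", 40), BedRequest("patient_b", 10)]
assignment = route_bed_prep(evs, ed_queue)
```

The point of the sketch is the shape of the decision, not the scoring: the discharge event produces an assignment directly, and no one has to notice a dashboard first.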
What happens when it breaks
The question every hospital CIO asks in the second meeting.
When I ran the storage SRE team at Google, Gmail went down. All fingers pointed at Colossus — our distributed storage infrastructure. Ben Treynor, who built Google’s SRE practice, got pulled into the war room.
We diagnosed the first part quickly: an intercontinental cable failure had cut transatlantic bandwidth in half. That explained the capacity problem. It didn’t explain what we were actually seeing. Traffic wasn’t dropping. It was going up. Systems were failing and generating more load, not less.
We went deep into the packets. The answer was retries. Every layer of the stack — storage, application, transport — was independently retrying failed requests. Each layer assumed the problem was below it. None of them knew the others were doing the same thing. We barely had enough capacity to handle one retry per layer. We were running four or five.
The cable didn’t take Gmail down. The retry storm did.
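The arithmetic behind a storm like that, as a rough sketch. It assumes the worst case, in which every layer independently retries each failed call, so attempts multiply rather than add. The layer and retry counts below are illustrative, not measurements from the incident.

```python
def worst_case_attempts(layers: int, retries_per_layer: int) -> int:
    """Attempts that reach the bottom of the stack for one user request
    when each layer makes 1 original call plus `retries_per_layer` retries."""
    return (1 + retries_per_layer) ** layers


print(worst_case_attempts(3, 1))  # 8
print(worst_case_attempts(3, 4))  # 125
```

Under that assumption, three layers with one retry each turn a single request into eight attempts. With four retries per layer it becomes 125, all aimed at a link that has just lost half its capacity.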
The fix was circuit breakers — software that could recognize a capacity constraint and shed load gracefully rather than amplify it. Systems designed to run locally blind rather than burn the network down screaming for confirmation.
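For readers who haven’t written one, here is a minimal circuit breaker in Python. It is a generic sketch of the pattern, not the code we ran at Google; the class name, thresholds, and cool-down are assumptions chosen for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is shedding load instead of calling the backend."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock        # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None     # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Open: fail fast locally rather than add load to a constrained backend.
                raise CircuitOpenError("shedding load; backend is constrained")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Trip on repeated failures, or re-trip if the half-open trial failed.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Once the breaker trips, callers get an immediate local error and the backend gets no traffic from them until the cool-down passes. That is the "run locally blind" behavior: the caller stops asking instead of asking louder.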
What hospital ops and tech companies have in common
I think about that incident every time I walk through a hospital managing a resource shortage.
A ventilator is urgently needed. Three nurses are independently searching. Two have separately called central supply. Every one of them is doing the right thing. Every one of them is adding load to a system that’s already constrained. Each individual retry is reasonable. The aggregate is catastrophic.
The Intelligent Orchestration Platform is the circuit breaker. It sees the constraint forming before the stack collapses. It coordinates the response — single task, single owner, single path to resolution — instead of letting every actor retry independently. When a floor runs short on equipment, the platform sees what’s in transit, what’s available two units over, and which request is most urgent. It routes once. The rest of the retries never happen.
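The "single task, single owner" idea can be sketched as request coalescing. This is a minimal Python illustration, assuming a hypothetical `Coordinator` and a caller-supplied owner rule; it is not the platform's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    resource: str
    unit: str
    owner: str
    requesters: list = field(default_factory=list)


class Coordinator:
    def __init__(self):
        self._open = {}  # (resource, unit) -> the one open Task for that need

    def request(self, resource, unit, requester, assign_owner):
        """Attach the requester to the existing task, or open one if none exists."""
        key = (resource, unit)
        task = self._open.get(key)
        if task is None:
            # First request for this need: assign exactly one owner.
            task = Task(resource, unit, owner=assign_owner(resource, unit))
            self._open[key] = task
        if requester not in task.requesters:
            task.requesters.append(requester)
        return task

    def resolve(self, resource, unit):
        """Close the task so a later shortage opens a fresh one."""
        return self._open.pop((resource, unit), None)

    def open_count(self):
        return len(self._open)


# Hypothetical scenario: three nurses ask for the same ventilator on the same unit.
coordinator = Coordinator()
owner_rule = lambda resource, unit: "central_supply_runner"  # assumed rule
for nurse in ["nurse_1", "nurse_2", "nurse_3"]:
    task = coordinator.request("ventilator", "ICU-4", nurse, owner_rule)
```

All three nurses end up attached to the same task with the same owner. The second and third searches, and the extra calls to central supply, have nothing left to do.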
Resilience in this context isn’t redundancy for its own sake. It’s a system that understands the whole before any individual part does.
Why I keep building this
Every system I’ve built that failed this way taught the same lesson: the data was there. The failure was the assumption that data plus a human equals action.
It doesn’t. Not when the humans are busy. Not when ownership is diffuse. Not when the gap between seeing a problem and acting on it is three workflow steps and two phone calls wide.
We fixed the alert noise at Jet. We wrote the circuit breakers at Google. What we didn’t have then — couldn’t have — was a system that saw the cascade forming before the first retry fired.
That’s what we’re building now. Not another dashboard for your teams to babysit. A system that closes the loop, coordinates the response, and understands the whole — so the people who chose a career in healthcare focus on taking care of patients.
