Six indicators your deploy infrastructure has crossed from working to dangerous
By Anthony Odole · April 25, 2026 · 10 min read
She told me their deploy pipeline was "mostly fine."
She was the CTO of a sixty-engineer Series C, and "mostly fine" was, on its face, a defensible answer. Their deploys went out roughly on schedule. Their incidents were rare. Their engineering team wasn't visibly miserable. The dashboards were green more often than not.
Six weeks later — after I'd interviewed four of her senior engineers, read the last ninety days of deploy logs, and audited the configurations across her three environments — we had a list of seventeen recurring failure modes that her team had quietly worked around, and a category of incidents I called "invisible firefighting" that her dashboards had no way to see.
She wasn't wrong that things were "mostly fine." She was wrong about what "fine" meant. Most CTOs are.
The pattern is structural, not personal. The engineers who could tell you how bad it really is have been desensitized to the pain. They've normalized it. Six months ago a deploy that required SSH-into-host intervention was a story they'd tell at lunch. Today it's a Tuesday. They've stopped escalating things they've been working around for half a year, and you've stopped hearing about a category of work your team is silently absorbing.
The absence of complaint isn't a signal of organizational health. It's a signal of acceptance — and acceptance is the most expensive failure mode in engineering, because it removes the very feedback signal you need to decide where to invest.
The six indicators below are diagnostic precisely because they don't depend on your team's perception. They are observable facts about your system, retrievable from logs and configurations and the answers to specific questions. Read each one. Count yours.
Indicator 1: Your deploys have started failing in different ways for the same underlying reason
Your senior engineer reports a failed deploy. She reads the log, identifies the cause, fixes it, retries. The retry fails too — but with a different error. She fixes that. The third retry succeeds. She moves on, satisfied.
What she didn't notice — what no one on your team has the vantage point to notice — is that all three errors traced to a single architectural condition. An orphan helper container holding a cache-mount lock. A BuildKit worker in a degraded state. A Docker daemon that needed a restart nobody knew to give it. The "different errors" weren't different problems. They were different symptoms of the same underlying state, and the symptom shifted because the state evolved between retries.
Why teams normalize it: Each retry that succeeds feels like a win. The engineer who got the deploy through is the hero of the moment. Nobody is incentivized to ask "but why did it take three different fixes for one problem?"
What it actually means: Your infrastructure is in a state where the failure manifestation is decoupled from the failure cause. This is the strongest possible signal of compound state — multiple unaddressed conditions stacking until any one of them can produce an incident. The fact that retries eventually work is what makes it dangerous: the team builds confidence in a process that's hiding the underlying issue.
Self-test: Pull your deploy logs from the last 30 days. For each failed deploy that eventually succeeded, look at how many retry attempts were needed and whether the retries failed for the same reason or different reasons. If more than one in five "successful" deploys took two or more retries with non-identical failure modes, you have this indicator.
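If your deploy platform can export attempt-level records, the tally is scriptable. Here's a minimal sketch, assuming a hypothetical JSONL export with deploy_id, attempt, status, and error fields; substitute whatever field names your platform actually emits.

```python
# Minimal sketch of the Indicator 1 self-test. Hypothetical input format:
# one JSON record per deploy attempt, e.g.
#   {"deploy_id": "d-123", "attempt": 2, "status": "failed",
#    "error": "cache mount locked by another process"}
import json
from collections import defaultdict

attempts_by_deploy = defaultdict(list)
with open("deploy_attempts.jsonl") as f:   # hypothetical export file
    for line in f:
        rec = json.loads(line)
        attempts_by_deploy[rec["deploy_id"]].append(rec)

successful = 0   # deploys that eventually went out
flagged = 0      # ...after failing in two or more distinct ways
for attempts in attempts_by_deploy.values():
    attempts.sort(key=lambda r: r["attempt"])
    if attempts[-1]["status"] != "succeeded":
        continue   # never recovered; a different problem than this indicator
    successful += 1
    distinct_errors = {r["error"] for r in attempts if r["status"] == "failed"}
    if len(distinct_errors) >= 2:   # non-identical failure modes
        flagged += 1

if successful:
    ratio = flagged / successful
    print(f"{flagged}/{successful} successful deploys ({ratio:.0%}) "
          f"recovered only after non-identical failures")
    print("indicator present" if ratio > 0.20 else "indicator absent")
```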
Indicator 2: Some part of your platform requires SSH-into-host intervention to recover
This one is binary. Either your platform abstracts the underlying host enough that engineers never need to touch it, or it doesn't.
The instant any deploy in a recent quarter required someone to SSH into the host — to kill an orphan container, to restart a daemon, to manually clear a cache, to remove a stuck lock file — your platform abstraction has officially leaked. The platform exists to make this unnecessary. Every SSH session your engineers have opened is a debt your platform owes back to your team.
Why teams normalize it: SSH access feels powerful, especially to senior engineers who came up before container orchestration was mature. They reach for it because it works. They forget that "needing to SSH" is a signal that the abstraction has failed.
What it actually means: You have implicit reliance on individual engineers' host-level expertise to keep your deploys functioning. That expertise is unwritten, untransferable, and tied to the people who happen to know which `docker ps | grep helper` incantation works on which host. When those people are unavailable, the platform doesn't recover on its own. That's not a platform — that's a manual operation with platform branding.
Self-test: Ask your senior engineers, individually, when they last SSH'd into a host to fix a deploy issue. If any of them answer with a date in the last quarter, you have this indicator. If all of them answer with "I don't remember the last time," you don't.
Indicator 3: Your deploy logs go silent for periods longer than any individual build step should take
Most build steps complete in seconds or single-digit minutes. Pulling a base image, installing dependencies, running tests, exporting layers — these all have known characteristic durations. A reasonable Node.js application's build is a stack of maybe twenty steps, most individually under five minutes, with the longest (typically `pnpm install` or the test suite) running ten to fifteen minutes.
If your deploy logs go quiet for thirty minutes, sixty minutes, ninety minutes — with no output between two consecutive log lines — your build process is not running. It's stuck. Something downstream of the visible logging is wedged: a daemon, a worker, a lock, a network call to a registry that's hung. The build doesn't know it's hung. Your platform doesn't know it's hung. Your engineers find out at the timeout.
Why teams normalize it: "Sometimes the build takes a long time" is what gets said. The team learns to budget extra time for deploys. The deploys take longer. The team budgets more time. Nobody questions whether the long deploys are actually long building, or just long waiting.
What it actually means: You have failure modes that your observability can't see. The platform's output stream isn't a reliable signal of progress. Time-based heuristics ("if the build takes more than 45 minutes, retry") are your team's substitute for actual visibility into what's happening.
Self-test: Pull deploy logs from your last 30 days. For any deploy that took longer than your typical build time, find the longest gap between consecutive log lines. If any deploy had a gap longer than fifteen minutes between log entries, you have this indicator. The gap itself is the evidence.
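The gap-hunting is scriptable too. A minimal sketch, assuming each log line leads with an ISO-8601 timestamp; adjust the parsing to whatever your platform actually writes.

```python
# Minimal sketch of the Indicator 3 self-test: the longest silence between
# consecutive lines of a deploy log. Hypothetical line format:
#   2026-04-25T14:03:07Z pulling base image...
import sys
from datetime import datetime

def leading_timestamp(line: str):
    """Parse the timestamp at the start of a log line, or return None."""
    try:
        return datetime.fromisoformat(line.split()[0].replace("Z", "+00:00"))
    except (ValueError, IndexError):
        return None

prev = longest = None
with open(sys.argv[1]) as f:   # usage: python gaps.py deploy.log
    for line in f:
        ts = leading_timestamp(line)
        if ts is None:
            continue
        if prev is not None and (longest is None or ts - prev > longest):
            longest = ts - prev
        prev = ts

print(f"longest gap between consecutive log lines: {longest}")
# Anything over ~15 minutes is the indicator; the gap itself is the evidence.
```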
Indicator 4: Your "deployed" status lags reality
A container can be running and crash-looping at the same time. Your deploy dashboard reports "running" because the platform's health check polls every 30 or 60 seconds, and a 90-second crash cycle can keep landing its live window on exactly the moments the check fires. The container is technically up whenever the platform looks at it. The fact that it's also down for 30 seconds out of every 90 is invisible to the layer that reports status.
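The aliasing is easy to demonstrate. Below is a minimal simulation using the hypothetical numbers above, plus one assumption of mine: that status only flips after three consecutive failed checks, which is a common default (it's what Kubernetes ships with).

```python
# Minimal simulation: a crash-looping container that a polling health
# check never flags. Numbers are hypothetical: up 60s, down 30s, probed
# every 30s. The three-consecutive-failures rule is my assumption
# (Kubernetes' default failureThreshold, for example).

CYCLE_UP, CYCLE_DOWN = 60, 30   # seconds live, then seconds crashed
PROBE_INTERVAL = 30             # seconds between health checks
FAILURE_THRESHOLD = 3           # consecutive failures before "unhealthy"

def is_up(t: int) -> bool:
    """True if the container is live at second t."""
    return t % (CYCLE_UP + CYCLE_DOWN) < CYCLE_UP

down_seconds = sum(1 for t in range(3600) if not is_up(t))

consecutive = 0
ever_flagged = False
for t in range(0, 3600, PROBE_INTERVAL):   # one hour of probing
    if is_up(t):
        consecutive = 0
    else:
        consecutive += 1
        ever_flagged = ever_flagged or consecutive >= FAILURE_THRESHOLD

print(f"container down {down_seconds / 3600:.0%} of the hour")
print(f"dashboard ever showed unhealthy: {ever_flagged}")
# -> container down 33% of the hour
# -> dashboard ever showed unhealthy: False
```

The service is down a third of every hour, and the dashboard never notices, because the failed probes never land back-to-back.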
The same logic applies more subtly to applications where the front-end serves traffic but a background worker has crashed. The dashboard reports green. The newsletter generation queue isn't draining. Your customers don't notice for an hour. Your team notices when someone happens to look at the right log.
Why teams normalize it: The status reported by the platform is what most teams trust. Engineers learn over time that "deployed" doesn't necessarily mean "working" — but they don't formalize that knowledge into monitoring or alerting. They keep checking by hand.
What it actually means: Your monitoring is measuring liveness, not correctness. Liveness is the easy thing to measure. Correctness is the thing you actually care about. The gap between them is where production incidents live.
Self-test: For any production incident in the last 90 days, calculate the time between the incident's actual start and the moment your dashboard reflected the problem. If any of those gaps exceeded ten minutes — and "actual start" is ascertained by reading logs, not by checking the dashboard — you have this indicator.
Indicator 5: Your senior engineers reach for Stack Overflow before reaching for your runbook
This one is a culture indicator that maps directly to an infrastructure indicator.
Watch what happens the next time a deploy fails. Does the engineer responsible go to your internal documentation? Or does she paste the error message into Google?
If she's going to Google first, your runbook is either stale, incomplete, or non-existent. If she's going to Google exclusively — never opening internal docs at all — your team has stopped trusting the internal documentation as a source of truth. They've learned that the answers there are out of date or incorrect, and they've stopped looking.
Why teams normalize it: Documentation is usually someone's side project. It rots when it isn't anyone's job. Engineers find that public knowledge sources are more current than internal ones, and they adapt. They don't tell anyone they've stopped using the runbook. They just stop.
What it actually means: Every time an engineer solves a deploy problem by Googling, you're paying for a knowledge transfer that your organization has already paid for once before — and you're paying for it because the first time wasn't captured. Multiply by team size, by frequency of issues, by years of operation, and you have a substantial line item in undocumented operational knowledge that lives in individuals' heads.
Self-test: Ask your three most senior engineers, individually, when they last opened your internal deploy runbook. If two or more of them say they don't remember or "we don't really keep one updated," you have this indicator.
Indicator 6: You don't know which of your environments share infrastructure with which others
This one's the simplest test and the one most likely to come back as "I'd have to ask."
Right now, without looking anything up: which of your environments — QA, staging, production, demo, sandbox, whatever you call them — share a host? A network? A Docker daemon? A BuildKit cache? A Kubernetes cluster? A database server (even if separate databases)?
If you can answer that confidently, you know your blast radius. If you can't, you don't.
Why teams normalize it: The infrastructure was set up months or years ago, often by an engineer who's no longer at the company. The configurations are spread across deploy-platform settings, the cloud console, and IaC repos. Nobody has needed to map it end-to-end recently, so nobody has.
What it actually means: A failure in staging that takes down production because they share a daemon is a one-paragraph postmortem and a board-level conversation about whether your engineering organization knows what it's doing. The mitigation isn't complicated, but you can't mitigate what you haven't mapped.
Self-test: Try to draw — physically, on a whiteboard — the infrastructure topology of your three highest-stakes environments. Hosts, networks, daemons, shared services. If you can't do it from memory, and your senior engineers also can't, you have this indicator. (The full seven-dimensional staging-vs-production mapping is here.)
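One dimension of that map, whether two environments answer to the same Docker daemon, is checkable in a few lines if your hosts allow SSH and run a reasonably recent Docker CLI. A minimal sketch with hypothetical hostnames; it says nothing about shared networks, clusters, or databases, but it's the fastest single check.

```python
# Minimal sketch: do any two environments share a Docker daemon?
# Hostnames are hypothetical; requires SSH access to each host and a
# Docker CLI new enough to speak ssh:// (18.09+). `docker info` reports
# the daemon's unique ID; matching IDs mean a shared daemon, and
# therefore a shared blast radius.
import subprocess

HOSTS = {                       # substitute your own environment map
    "production": "deploy@prod-host-1",
    "staging":    "deploy@staging-host-1",
    "qa":         "deploy@qa-host-1",
}

envs_by_daemon = {}
for env, host in HOSTS.items():
    result = subprocess.run(
        ["docker", "-H", f"ssh://{host}", "info", "--format", "{{.ID}}"],
        capture_output=True, text=True, check=True,
    )
    envs_by_daemon.setdefault(result.stdout.strip(), []).append(env)

for daemon_id, envs in envs_by_daemon.items():
    if len(envs) > 1:
        print(f"SHARED daemon {daemon_id}: {', '.join(envs)}")
    else:
        print(f"dedicated daemon: {envs[0]}")
```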
Score yourself
For each indicator: yes, no, or "I'd have to ask."
"I'd have to ask" counts as yes. It is, itself, the indicator. The fact that you'd have to ask means the answer isn't readily available, which means nobody has audited it, which means the indicator is present whether or not the underlying condition is.
Tally:
- 0-1 yes answers: Your deploy infrastructure is in healthy shape. Keep doing what you're doing. The thing that makes you healthy is the explicit attention being paid — don't let it lapse.
- 2-3 yes answers: Working with friction. You haven't crossed into architectural debt yet, but you're paying a higher tax than necessary. A focused review of the specific indicators where you scored yes is usually enough to course-correct without a full intervention.
- 4-5 yes answers: You've crossed the threshold. Compound infrastructure debt is now a measurable cost on your engineering organization. The longer you wait, the worse the math gets — and the math compounds nonlinearly past this point. This is where structured intervention starts to pay back inside the first year.
- 6 yes answers: You knew before you finished reading. The pattern is no longer subtle and the cost is no longer manageable inside normal sprint cadence. Intervention is overdue rather than optional.
The reason these six indicators work is that none of them require you to interpret your team's behavior. Each one is a question with a binary answer that exists in your logs, your configurations, or your conversations with senior engineers. The CTOs who score themselves honestly are the ones who get useful answers.
The CTOs who score themselves generously are the ones who keep firefighting at 2am.
Considering an audit of your own? I run four-to-six-week deploy infrastructure audits for engineering organizations carrying compound infrastructure debt. The output is a current-state map, a prioritized risk register, and an intervention sequence with cost-benefit per item — written for your CFO. Investment: $55,000–$75,000, fixed fee.
— Anthony Odole · ex-IBM Senior Managing Consultant · 18 years in enterprise architecture