We built the tool we needed at 2am on a Friday
How this started
For four years, the two of us were part of the platform team at a mid-sized SaaS company shipping roughly 40 deploys a day across 22 microservices. The infrastructure was solid. The CI was fast. But incidents kept taking longer to resolve than they should — not because we lacked monitoring, but because turning monitoring data into a causal explanation required too many manual steps.
The thing that finally broke us was a Friday evening in November. A config key in one service was renamed from SESSION_TTL to SESSION_TIMEOUT_MS as part of a cleanup PR. The PR was reviewed and CI passed everywhere. But three downstream services still referenced the old key as an environment variable, and nobody had explicitly updated them, because the contract between those services was informal. On deploy, auth-proxy read SESSION_TTL, got undefined, and crashed. The cascading failure took out the session layer for 90 minutes.
Tracing from the cascading failure back to the config rename took two engineers and two and a half hours. The data was all there — in GitHub, in Datadog, in the Kubernetes event log, in Slack. Nobody had connected it. On Monday morning, we started building an internal tool that would connect it automatically. A Slack bot, at first. Then a proper service with a UI.
Within six weeks, the rest of the engineering team had started using it without being asked. The weekly ops review started referencing it. We began getting questions from friends at other companies about whether they could use it too.
A year later, we left to build it properly. That's NessForge.
Engineering principles
Evidence over alerts
A tool that tells you something is broken is less useful than one that tells you why. We build toward the latter. Every hypothesis NessForge surfaces comes with the specific evidence behind it and a confidence score, so you can evaluate the reasoning — not just accept the conclusion.
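One way to picture a hypothesis that carries its own evidence is the minimal sketch below. It is not NessForge's actual data model: the type names, field names, and the assumption that confidence is a number between 0 and 1 are all hypothetical, and the sample values are illustrative, loosely based on the incident described earlier.

```typescript
// Hypothetical shapes for illustration only; not NessForge's real schema.
interface Evidence {
  source: string;  // where the signal came from, e.g. "github" or "kubernetes"
  summary: string; // human-readable description of what was observed
}

interface Hypothesis {
  cause: string;
  confidence: number; // assumed to lie in [0, 1]
  evidence: Evidence[];
}

// Return a new array ordered from most to least confident.
function rankHypotheses(hypotheses: Hypothesis[]): Hypothesis[] {
  return [...hypotheses].sort((a, b) => b.confidence - a.confidence);
}

// Render a hypothesis together with the evidence behind it,
// so a reader can check the reasoning and not only the conclusion.
function describe(h: Hypothesis): string {
  const lines = h.evidence.map((e) => `  - [${e.source}] ${e.summary}`);
  return [`${h.cause} (confidence ${h.confidence.toFixed(2)})`, ...lines].join("\n");
}

// Illustrative values only; the confidence scores are made up.
const example: Hypothesis[] = [
  {
    cause: "Unrelated node restart",
    confidence: 0.15,
    evidence: [{ source: "kubernetes", summary: "Node event shortly before the crash" }],
  },
  {
    cause: "Config key SESSION_TTL renamed to SESSION_TIMEOUT_MS",
    confidence: 0.82,
    evidence: [
      { source: "github", summary: "Cleanup PR renamed the key" },
      { source: "kubernetes", summary: "auth-proxy crashed on deploy" },
    ],
  },
];

console.log(describe(rankHypotheses(example)[0]));
```

Keeping the evidence attached to each hypothesis, instead of reporting a bare score, is what lets an engineer disagree with the ranking when the evidence does not hold up.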
Your stack, your terms
NessForge models your environment as it actually is, not as an idealized version of it. If you have three different ways to deploy the same service depending on the environment, it handles that. If your job names are creative and your Docker tags are inconsistent, it handles that too. We don't require you to restructure your pipeline to fit our assumptions.
No false precision
If NessForge can't determine why something failed with sufficient confidence, it says so. It shows you the evidence it collected — the services involved, the changes that landed, the failure timeline — and lets your engineers reason from it. Confident wrong answers are worse than honest uncertainty during an incident.
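The behaviour described above can be sketched as a function that either names a probable cause or explicitly declines to, returning the collected evidence in its place. This is an illustration under stated assumptions, not NessForge's implementation: the Candidate and Finding types are hypothetical, and the 0.7 threshold is an arbitrary value chosen for the example.

```typescript
// Hypothetical types for illustration only.
interface Candidate {
  cause: string;
  confidence: number; // assumed to lie in [0, 1]
  evidence: string[];
}

// The result is either a conclusion or an explicit "undetermined";
// callers have to handle both cases.
type Finding =
  | { kind: "probable-cause"; cause: string; confidence: number; evidence: string[] }
  | { kind: "undetermined"; evidence: string[] };

// The default threshold of 0.7 is an arbitrary illustrative choice.
function summarize(candidates: Candidate[], threshold = 0.7): Finding {
  // Pick the most confident candidate, if there is one.
  const best = candidates.reduce<Candidate | undefined>(
    (top, c) => (top === undefined || c.confidence > top.confidence ? c : top),
    undefined,
  );

  if (best !== undefined && best.confidence >= threshold) {
    return { kind: "probable-cause", ...best };
  }

  // Not confident enough: hand back all collected evidence
  // (deduplicated) and draw no conclusion.
  const all = candidates.reduce<string[]>((acc, c) => acc.concat(c.evidence), []);
  const evidence = all.filter((item, i) => all.indexOf(item) === i);
  return { kind: "undetermined", evidence };
}
```

Modelling "undetermined" as its own variant, not as a low-confidence guess, means a consumer cannot accidentally present a weak hypothesis as an answer.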
Read-only by default
NessForge reads from your CI provider, source control, and orchestration layer. It never modifies your pipeline, your repository, or your infrastructure. Write access is requested only for the specific output channels you configure (Slack, PagerDuty, PR status checks), and only with the minimum scope required for each.
Who's building this
Co-founder · CEO
Previously: Platform Engineering
Co-founder · CTO
Previously: Infrastructure Engineering
Early Engineer
Graph systems & data pipeline
We're a small team and we're hiring. If you've spent years staring at CI pipelines and have strong opinions about how deployment data should be modeled, get in touch.
We're working with early teams now
If your team ships frequently enough that a 45-minute post-deploy investigation is a regular occurrence, we'd like to talk.