Keep - Open-source AIOps platform - Building a new shift-left approach for alerting

Building a new shift-left approach for alerting

August 4, 2024


Keep is an open-source alerting CLI tool that @shaharglazner and I wrote out of a pain we felt throughout our careers as developers and development managers. Alerting (aka monitors/alarms) has always felt like a second-class citizen in monitoring, observability, and infrastructure tools. The feature set is very narrow, which in turn results in poor alerts, alert fatigue (yes, your muted Slack channel), an unreliable product, and complete alerting hell.

It's not only that we couldn't create better application and infrastructure alerts; it's also tough to maintain them and make sure they keep working over time.

Organizations today have so many tools they use for alerting that it's becoming an absolute nightmare.


Alerting as a first-class citizen

The best way to describe what we had in mind when we first built Keep is how one of our first users puts it:


"Keep is doing to alerting what GitHub Actions did to CI/CD"


There were three main guidelines when we started coding:

Good alerts are more than thresholds over metrics or logs; they should be treated as workflows with multiple "tests" (steps/actions).
The tool should be 100% data agnostic: it should not care where the data resides (not only "traditional" data sources but also, for example, a database). There's no real reason why this shouldn't be abstracted away from developers.
Alerts should be maintained in, and live in, your code, so they can be integrated with all your CI/CD processes (imagine a gate that fails your PR when you break alerts).
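
To make the three guidelines concrete, here is a minimal Python sketch. It is not Keep's actual syntax or API; every name in it (`Provider`, `InMemoryProvider`, `Step`, `Alert`, `validate`, `evaluate`) is hypothetical and only illustrates the idea: an alert is a workflow of steps, each step reads from an abstract data source, and the definition can be checked statically in CI.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Protocol


class Provider(Protocol):
    """Hypothetical data-source abstraction: anything that can answer a query."""

    def query(self, query: str) -> Any: ...


class InMemoryProvider:
    """Stand-in for a real source (metrics backend, log store, SQL database)."""

    def __init__(self, data: dict[str, Any]) -> None:
        self._data = data

    def query(self, query: str) -> Any:
        return self._data[query]


@dataclass
class Step:
    """One "test" in the alert: fetch a value and check a condition on it."""

    name: str
    provider: str                     # key into the providers mapping
    query: str
    condition: Callable[[Any], bool]  # True means this step is firing


@dataclass
class Alert:
    alert_id: str
    steps: list[Step] = field(default_factory=list)

    def validate(self, providers: dict[str, Provider]) -> list[str]:
        """Static checks a CI gate could run without querying any data."""
        errors = []
        if not self.steps:
            errors.append(f"{self.alert_id}: alert has no steps")
        for step in self.steps:
            if step.provider not in providers:
                errors.append(
                    f"{self.alert_id}/{step.name}: unknown provider {step.provider!r}"
                )
        return errors

    def evaluate(self, providers: dict[str, Provider]) -> bool:
        """Fire only if every step's condition holds."""
        return all(
            step.condition(providers[step.provider].query(step.query))
            for step in self.steps
        )


# Two different kinds of sources behind the same interface (assumed sample data).
providers = {
    "metrics": InMemoryProvider({"error_rate": 0.07}),
    "db": InMemoryProvider({"pending_orders": 1200}),
}

alert = Alert(
    alert_id="checkout-degraded",
    steps=[
        Step("high-error-rate", "metrics", "error_rate", lambda v: v > 0.05),
        Step("orders-piling-up", "db", "pending_orders", lambda v: v > 1000),
    ],
)

# An alert that references a provider nobody configured: a CI gate should catch it.
broken = Alert("broken", [Step("s1", "logs", "errors", lambda v: v > 0)])
```

In this sketch a CI job would call `validate` on every alert in the repository and fail the PR if any error list is non-empty, while a scheduler would call `evaluate` against the real providers. Because the steps only see the `Provider` interface, swapping the in-memory stand-ins for a metrics backend or a database would not change the alert definition.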


What's Ahead?

We constantly try to deliver better on our promise:


Try our first mock alert and get it up and running in <5 minutes.


To that end, we're adding plenty more deployment options, providers, and functions. We're also working on simplifying the syntax further.

What do you think about the need for this kind of "abstraction"? What do you think about alerts as post-production tests? How do you manage and control your alerting chaos right now?

We would love to hear your thoughts; feel free to comment here, on our GitHub repo, or in our Slack.