In this blog post, we will demonstrate the strength of a unified API for consolidating and managing alerts. We will create a workflow that, when an alert triggers, generates a ServiceNow ticket, enriches it with data from a production database, and notifies the stakeholders.
This technical blog post will guide you on how to:
Before we delve into the technicalities, let's have a brief introduction.
Keep is an open-source alert management and automation platform that integrates with your monitoring tools' alerts and provides an abstraction layer.
Despite a trend towards consolidation in the observability space, many organizations still utilize multiple tools to generate alerts.
Grafana's 2023 Observability Survey indicates that over 52% of companies employ more than six observability tools, often due to legacy systems, cost considerations, and specific functionalities.
# Clone Keep's repo and install Keep CLI using poetry
gh repo clone keephq/keep
cd keep && poetry install
# or just install it using pip
pip install keepcli
# for other installation options (e.g. docker) see https://docs.keephq.dev/cli/installation
You can easily start using Keep's managed platform without any other prerequisites by running:
# This will launch an OAuth2 flow that will create a tenant for you and set you up
keep auth login
If you are using Keep's open source, run keep config to configure the CLI:
You can start using Keep without an API key (the default docker-compose configuration).
Once you deploy Keep to production, read about how to add authentication.
keep config
Enter your keep url [http://localhost:8080]:
Enter your api key (leave blank for localhost) []:
Config file created at .keep.yaml
Verify everything is OK
keep whoami
Api key valid{'tenant_id': 'XXXXXX-YYYY-ZZZZ-8b5a-939af9d7f63b'}
Now we are going to connect all the providers we need: Datadog to get the alerts, ServiceNow to create and track the tickets, MySQL to enrich alerts with production data, and Slack to notify the relevant people.
# no providers
keep provider list
+----+------+------+--------------+-------------------+
| ID | Type | Name | Installed by | Installation time |
+----+------+------+--------------+-------------------+
+----+------+------+--------------+-------------------+
# list available providers
keep provider list --available
+-----------------+-------------------------------------------------------+
| Provider | Description |
+-----------------+-------------------------------------------------------+
| aks | Enrich alerts using data from AKS. |
...
| zabbix | Pull/Push alerts from Zabbix into Keep. |
| zenduty | Create incident in Zenduty. |
+-----------------+-------------------------------------------------------+
Now, let's connect Datadog, MySQL, ServiceNow, and Slack.
# For every provider, you can see which authentication details are needed
keep provider connect datadog --help
+----------+--------------+----------+-----------------+
| Provider | Config Param | Required | Description |
+----------+--------------+----------+-----------------+
| datadog | api_key | True | Datadog Api Key |
| | app_key | True | Datadog App Key |
+----------+--------------+----------+-----------------+
# Connect Slack
keep provider connect slack --provider-name slack-prod --webhook-url https://hooks.slack.com/services/T03PMXXXXX/B0656YYYY/yQ7zncdkuhzrGDWILtuZZZZZ
Provider slack-prod installed successfully
Provider id: 82a2c69d26e64d3f8ec81eb25d13f972
# Connect datadog
keep provider connect datadog --provider-name datadog-prod --api-key XXXXXXX --app-key YYYYYYY
Provider datadog-prod installed successfully
Provider id: e33c9960d862453dace829f6a8aecbcf
# Connect mysql
keep provider connect mysql --provider-name mysql-prod --username dbuser --password dbpass --host keepdb
Provider mysql-prod installed successfully
Provider id: d1c3a24621254565970ac6fab74697b7
# Connect Service Now
keep provider connect servicenow --provider-name servicenow-prod --service-now-base-url https://dev123456.service-now.com --username user --password password
# Verify the providers connected
keep provider list
+----------------------------------+------------+-----------------+-------------------+----------------------------+
| ID | Type | Name | Installed by | Installation time |
+----------------------------------+------------+-----------------+-------------------+----------------------------+
| e33c9960d862453dace829f6a8aecbcf | datadog | datadog-prod | apikey@keephq.dev | 2023-11-08T13:23:29.531775 |
| d1c3a24621254565970ac6fab74697b7 | mysql | mysql-prod | apikey@keephq.dev | 2023-11-08T13:26:12.249923 |
| 066f2a02326c41819c19d61ed6976b65 | servicenow | servicenow-prod | apikey@keephq.dev | 2023-11-08T13:28:35.930792 |
| 82a2c69d26e64d3f8ec81eb25d13f972 | slack | slack-prod | apikey@keephq.dev | 2023-11-08T13:19:00.539780 |
+----------------------------------+------------+-----------------+-------------------+----------------------------+
If we go to the UI at http://localhost:3000, we can see that the providers are installed:
In this section, we are going to review the alerts, show how the alert looks in Keep, and demonstrate enrichment and filtering capabilities.
# list all alerts
keep alert list
+---------------------+------------------------------------------------------------------+--------------------------------+----------+-----------+-------------+---------+-------------+---------------------+
| ID | Fingerprint | Name | Severity | Status | Environment | Service | Source | Last Received |
+---------------------+------------------------------------------------------------------+--------------------------------+----------+-----------+-------------+---------+-------------+---------------------+
| 7308482322424796476 | 5bcafb4ea94749f36871a2e1169d5252ecfb1c589d7464bd8bf863cdeb76b864 | Unauthorized access to API | high | Recovered | undefined | None | ['datadog'] | 2023-11-13T15:32:38 |
| 7308433771057253905 | 39f3a0d2cfe87885be0283c94ffd1cc35be1fd1bdd108c86ddf8e9db5d3bd7f0 | Test Alert | critical | Recovered | undefined | None | ['datadog'] | 2023-11-13T14:44:24 |
...
more alerts
...
+-----------+----------------------------+----------------------------+----------+--------+-------------+----------+-------------+---------------------------+
# Filter by attribute
keep alert list --filter service=keep-api
+-----------+----------------------------+----------------------------+----------+--------+-------------+----------+-------------+---------------------------+
| ID | Fingerprint | Name | Severity | Status | Environment | Service | Source | Last Received |
+-----------+----------------------------+----------------------------+----------+--------+-------------+----------+-------------+---------------------------+
| 120458754 | 5bcafb4ea94749f36871a2e1169d5252ecfb1c589d7464bd8bf863cdeb76b864 | 4xx-5xx Status Code Alert | medium | OK | production | keep-api | ['datadog'] | 2023-05-31T10:59:29+00:00 |
| 122655180 | 5bcafb4ea94749f36871a2e1169d5252ecfb1c389d7464bd8bf863cdeb76b864 | Unauthorized access to API | high | OK | production | keep-api | ['datadog'] | 2023-11-08T13:29:31+00:00 |
+-----------+----------------------------+----------------------------+----------+--------+-------------+----------+-------------+---------------------------+
keep alert list --filter severity=critical
+-----------+-------------+------------+----------+--------+-------------+----------+-------------+---------------------------+
| ID | Fingerprint | Name | Severity | Status | Environment | Service | Source | Last Received |
+-----------+-------------+------------+----------+--------+-------------+----------+-------------+---------------------------+
| 117493674 | 5bcafb4ea94749f36871a2e1169d5252ecfb1c589d7464bd8bf863cdeb76b862 | Prod Alert | critical | OK | production | tal-test | ['datadog'] | 2023-09-13T11:20:25+00:00 |
But what's even cooler is that we can filter on ANY alert attribute. Combined with Keep's ability to enrich alerts with attributes from different sources, this lets you achieve very cool things.
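To illustrate what filtering on any attribute means, here is a minimal Python sketch. It is not Keep's implementation: the `parse_filter` and `filter_alerts` helpers and the sample alerts are hypothetical, and alerts are modeled as plain dictionaries.

```python
from typing import Any


def parse_filter(expr: str) -> tuple[str, str]:
    """Split a 'key=value' expression into its key and value."""
    key, sep, value = expr.partition("=")
    if not sep or not key:
        raise ValueError(f"expected key=value, got {expr!r}")
    return key, value


def filter_alerts(alerts: list[dict[str, Any]], expr: str) -> list[dict[str, Any]]:
    """Keep only the alerts that have the attribute and whose value matches."""
    key, value = parse_filter(expr)
    return [a for a in alerts if key in a and str(a[key]) == value]


# Hypothetical alerts; built-in and enriched attributes live side by side.
alerts = [
    {"name": "Unauthorized access to API", "severity": "high", "service": "keep-api"},
    {"name": "Prod Alert", "severity": "critical", "service": "tal-test",
     "ticket_id": "INC00001"},
]

print([a["name"] for a in filter_alerts(alerts, "severity=critical")])
print([a["name"] for a in filter_alerts(alerts, "ticket_id=INC00001")])
```

Because the filter only looks at keys and values, an attribute added by enrichment (such as `ticket_id` here) is filterable in exactly the same way as a built-in one.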
To make this concrete, let's say we created a ticket in our ticketing system (we will, of course, automate this later).
We want to correlate the alert with the ticket, so we will be able to sync any further changes to the ticket.
We also want information about the customer, which is stored in our customers database. We can get this information by running:
select * from customers where customer_id = %customer_id%
+----+---------------------+------------+---------------------+--------------+---------------+-----------------------------+--------------------------------------+
| id | name | tier | email | phone_number | address | notes | customer_id |
+----+---------------------+------------+---------------------+--------------+---------------+-----------------------------+--------------------------------------+
| 1 | ABC Corporation | Enterprise | abc@example.com | 123-456-7890 | 123 Main St | Customer since 2010 | 05bc71af-820a-11ee-b23f-0242ac110002 |
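As a rough sketch of this lookup, the snippet below runs the same kind of query from Python with a bound parameter instead of string substitution. It uses an in-memory SQLite database as a stand-in for the production MySQL instance, and the `get_customer` helper is hypothetical; only the column names and the sample row come from the table above.

```python
import sqlite3

# In-memory SQLite stands in for the production MySQL database (an assumption).
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets us read columns by name
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT, tier TEXT, email TEXT, customer_id TEXT)"
)
conn.execute(
    "INSERT INTO customers (name, tier, email, customer_id) VALUES (?, ?, ?, ?)",
    ("ABC Corporation", "Enterprise", "abc@example.com",
     "05bc71af-820a-11ee-b23f-0242ac110002"),
)


def get_customer(conn: sqlite3.Connection, customer_id: str):
    """Return the customer's name, tier, and email as a dict, or None if absent."""
    row = conn.execute(
        "SELECT name, tier, email FROM customers WHERE customer_id = ?",
        (customer_id,),  # bound parameter, never concatenated into the SQL text
    ).fetchone()
    return dict(row) if row else None


print(get_customer(conn, "05bc71af-820a-11ee-b23f-0242ac110002"))
```

Binding the customer ID as a parameter avoids quoting problems with string IDs such as UUIDs, and prevents an alert attribute from being interpreted as SQL.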
Assuming we want to enrich the alert with the customer ID, customer email, and ticket ID:
keep alert enrich --fingerprint 39f3a0d2cfe87885be0283c94ffd1cc35be1fd1bdd108c86ddf8e9db5d3bd7f0 customer_id=1234 ticket_id=INC00001 customer_email=abd@example.com
# Now we can filter by ticket id:
keep alert list --filter ticket_id=INC00001
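Conceptually, enrichment attaches extra key/value pairs to the alert identified by a fingerprint. The following Python sketch shows that idea under simple assumptions: alerts are stored in a dictionary keyed by fingerprint, and the `parse_enrichments` and `enrich_alert` helpers are hypothetical names, not part of Keep.

```python
def parse_enrichments(pairs: list[str]) -> dict[str, str]:
    """Turn ['key=value', ...] arguments into a dictionary."""
    enrichments = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        enrichments[key] = value
    return enrichments


def enrich_alert(alerts_by_fingerprint: dict[str, dict], fingerprint: str,
                 pairs: list[str]) -> dict:
    """Merge the enrichments into the alert with the given fingerprint."""
    alert = alerts_by_fingerprint[fingerprint]
    alert.update(parse_enrichments(pairs))
    return alert


fingerprint = "39f3a0d2cfe87885be0283c94ffd1cc35be1fd1bdd108c86ddf8e9db5d3bd7f0"
store = {fingerprint: {"name": "Test Alert", "severity": "critical"}}

enriched = enrich_alert(store, fingerprint,
                        ["customer_id=1234", "ticket_id=INC00001"])
print(enriched["ticket_id"])
```

Keying on the fingerprint rather than the alert ID means the enrichment follows the alert across repeated firings, which is what makes the later ticket correlation possible.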
So far, we connected the providers, reviewed our Datadog alerts, and enriched them with customer data and ServiceNow tickets.
Now we will wrap it up and automate the whole process using Keep Workflows.
Before diving into the CLI commands, let's review the workflow we are going to run. Keep Workflows are very similar to GitHub Actions workflows. We didn't want to reinvent the wheel here, so the syntax should feel familiar.
The full workflow YAML can be found here.
workflow:
  # some metadata
  id: example-workflow
  description: Enriches the alert and creates a ServiceNow ticket
  # The first part is the triggers. We want this workflow to execute only on critical alerts. We can filter on any alert attribute and also use regex.
  triggers:
    - type: alert
      filters:
        - key: severity
          value: critical
  steps:
    # The first step enriches the alert based on the SQL query. We want to add the customer name, email, and tier.
    - name: get-more-details
      provider:
        type: mysql
        config: " {{ providers.mysql-prod }} "
        # {{ alert.customer_id }} will be rendered at runtime
        with:
          query: "select * from customers where customer_id = '{{ alert.customer_id }}'"
          # Add those fields to the alert so we can use them later
          enrich_alert:
            - key: customer_name
              value: results[0].name
            - key: customer_email
              value: results[0].email
            - key: customer_tier
              value: results[0].tier
  # second part - the actions
  actions:
    # create the servicenow ticket
    - name: create-service-now-ticket
      # If the alert already has a ticket id, don't create a new one (imagine the alert was triggered and then resolved; we don't want another ticket for the resolved alert). Also, we want to create a ticket only for Enterprise customers.
      if: "not '{{ alert.ticket_id }}' and '{{ alert.customer_tier }}' == 'Enterprise'"
      provider:
        type: servicenow
        config: " {{ providers.servicenow-prod }} "
        with:
          table_name: INCIDENT
          payload:
            short_description: "{{ alert.name }} - {{ alert.description }} [created by Keep]"
            description: "{{ alert.description }}"
          # Enrich the alert with these fields so we have a correlation between the alert and the ticket
          enrich_alert:
            - key: ticket_type
              value: servicenow
            - key: ticket_id
              value: results.sys_id
            - key: ticket_url
              value: results.link
            - key: ticket_status
              value: results.stage
            - key: table_name
              value: "{{ alert.annotations.ticket_type }}"
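To make the moving parts of the workflow easier to follow, here is a simplified Python sketch of three ideas it relies on: matching trigger filters, rendering `{{ alert.<key> }}` placeholders, and deciding whether to open a ticket. This is an illustration only, not Keep's engine. The `render`, `matches_trigger`, and `should_create_ticket` helpers are hypothetical, treating every filter value as a regular expression is a simplifying assumption, and the tier is read from the `customer_tier` key written by the enrichment step.

```python
import re
from typing import Any

# Matches placeholders such as {{ alert.customer_id }}
PLACEHOLDER = re.compile(r"\{\{\s*alert\.(\w+)\s*\}\}")


def render(template: str, alert: dict[str, Any]) -> str:
    """Replace each {{ alert.<key> }} with the alert's value ('' if missing)."""
    return PLACEHOLDER.sub(lambda m: str(alert.get(m.group(1), "")), template)


def matches_trigger(alert: dict[str, Any], filters: list[dict[str, str]]) -> bool:
    """True when every filter value fully matches the alert attribute."""
    return all(
        re.fullmatch(f["value"], str(alert.get(f["key"], ""))) for f in filters
    )


def should_create_ticket(alert: dict[str, Any]) -> bool:
    """Open a ticket only if none exists yet and the customer is Enterprise."""
    return not alert.get("ticket_id") and alert.get("customer_tier") == "Enterprise"


alert = {
    "severity": "critical",
    "customer_id": "05bc71af-820a-11ee-b23f-0242ac110002",
    "customer_tier": "Enterprise",
}
filters = [{"key": "severity", "value": "critical"}]

if matches_trigger(alert, filters):
    print(render("select * from customers where customer_id = '{{ alert.customer_id }}'", alert))
    print(should_create_ticket(alert))
```

Once the alert carries a `ticket_id`, `should_create_ticket` returns `False`, which mirrors the intent of the `if` condition: a re-triggered or resolved alert does not produce a second ticket.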
Now that we have the workflow, let's apply and run it.
# no workflows
keep workflow list
+--------------------------------------+--------------------------------------+----------------------------+-------------------------------------------------+--------------------------+----------------+
| ID | Workflow ID | Start Time | Triggered By | Status | Execution Time |
+--------------------------------------+--------------------------------------+----------------------------+-------------------------------------------------+--------------------------+----------------+
+--------------------------------------+--------------------------------------+----------------------------+-------------------------------------------------+--------------------------+----------------+
# Apply it:
keep workflow apply -f workflow.yaml
Workflow examples/workflows/blogpost.yml applied successfully
Workflow id: 652fe84e-5239-425b-8271-40accb1af72f
Workflow revision: 1
keep workflow list
+--------------------------------------+-------------------+-----------------------------------+----------+--------------+----------------------------+----------------------------+----------------------------+-----------------------+
| ID | Name | Description | Revision | Created By | Creation Time | Update Time | Last Execution Time | Last Execution Status |
+--------------------------------------+-------------------+-----------------------------------+----------+--------------+----------------------------+----------------------------+----------------------------+-----------------------+
| 652fe84e-5239-425b-8271-40accb1af72f | blogpost-workflow | Enrich the alerts and open ticket | 10 | keep | 2023-11-12T08:08:43.585226 | 2023-11-12T14:34:07.544301 | None | None |
+--------------------------------------+-------------------+-----------------------------------+----------+--------------+----------------------------+----------------------------+----------------------------+-----------------------+
# Run it with alert as input
keep workflow run --workflow-id blogpost-workflow --fingerprint 39f3a0d2cfe87885be0283c94ffd1cc35be1fd1bdd108c86ddf8e9db5d3bd7f0
Workflow blogpost-workflow run successfully
Workflow Run ID 33e71955-81f4-4118-9771-7b638f8c59b0
# Let's review the run
keep workflow runs logs 33e71955-81f4-4118-9771-7b638f8c59b0
+-----+----------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ID | Timestamp | Message |
+-----+----------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 733 | 2023-11-13T16:11:40.462000 | Running step get-more-details |
| 734 | 2023-11-13T16:11:40.463000 | Action get-more-details evaluated to run! Reason: no condition, hence true. |
| 735 | 2023-11-13T16:11:40.524000 | Step get-more-details ran successfully |
| 736 | 2023-11-13T16:11:40.525000 | Running action create-service-now-ticket |
| 737 | 2023-11-13T16:11:40.525000 | Action create-service-now-ticket evaluated to run! Reason: no condition, hence true. |
| 738 | 2023-11-13T16:11:44.784000 | Created ticket: {'result': {'parent': '', 'made_sla': 'true', 'caused_by': '', 'watch_list': '', 'upon_reject': 'cancel', 'sys_updated_on': '2023-11-13 14:11:41', 'child_incidents': '0', 'hold_reason': '', 'origin_table': '', 'task_effective_number': 'INC' |
| 740 | 2023-11-13T16:12:47.552000 | Enriching alert |
| 741 | 2023-11-13T16:12:47.572000 | Alert enriched |
| 742 | 2023-11-13T16:12:47.573000 | Action create-service-now-ticket ran successfully |
| 743 | 2023-11-13T16:12:47.574000 | Finish to run workflow blogpost-workflow |
+-----+----------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
keep workflow runs list
+--------------------------------------+--------------------------------------+----------------------------+-------------------------------+-------------+----------------------------------------------------+----------------+
| ID | Workflow ID | Start Time | Triggered By | Status | Error | Execution Time |
+--------------------------------------+--------------------------------------+----------------------------+-------------------------------+-------------+----------------------------------------------------+----------------+
| 103df0aa-d6be-4290-9938-1563f8005e55 | 75c7eba2-51dc-411d-b39c-a500c98e3893 | 2023-11-13T14:11:37.911898 | manually by apikey@keephq.dev | success | None | 69 |
+--------------------------------------+--------------------------------------+----------------------------+-------------------------------+-------------+----------------------------------------------------+----------------+
# Let's make sure the alert was enriched with the ticket id
keep alert get 39f3a0d2cfe87885be0283c94ffd1cc35be1fd1bdd108c86ddf8e9db5d3bd7f0 | jq .ticket_id
"0f9982ec97667110beb0f0571153afa1"
# :)
Voila! Now, whenever an alert is triggered, it will be automatically enriched with data from our production database, and appropriate actions will be taken. If the alert is of high or critical severity, a ServiceNow ticket will be created and the alert will be updated with the ticket ID. For less severe alerts, the relevant individual will simply be notified.
1. Join our Slack and start talking about alerting and monitoring.
2. ⭐️ Keep repo.
3. Start playing with Keep (no credit card needed!) at https://platform.keephq.dev
4. Missing a provider or feature? Just open an issue at https://github.com/keephq/keep and we will add it ASAP (and of course, contributions are welcome!)