Cloud Platform Engineering, DevOps, SRE, Kubernetes, KOPS

Kubernetes, Prometheus and PagerDuty: engineering 24/7 ops capabilities for larger teams

Engineering 24/7 ops capabilities for larger teams comes with its own scalability challenges. Enabling teams to self-serve can work quite well, especially for organisations with small platform engineering or SRE teams. In this post I'm going to show you an example setup suitable for an organisation with a similar profile.

Monitoring, Kubernetes, Prometheus, PagerDuty

Know when things go wrong

Running a complex IT system 24/7 that’s developed and supported by multiple teams can be hard to scale from an ops perspective. An important aspect of running a production system is monitoring and alerting, which, like anything else, needs a good setup. Some organisations have large SRE teams doing much of the setup and configuration of monitoring and alerting systems. When an organisation has only a small platform team supporting a much larger development organisation, this becomes challenging, especially if development teams are spinning up a new service every week or so. You can turn this challenge into an opportunity by giving developers the right tools to self-serve: simply enable them to define alerting rules and label their services in such a way that the right team is notified on failure.

With a few simple steps you can improve site reliability and empower developers to be more self-sufficient, while also decreasing the load on the platform or SRE team.

The problem

You have 50 or so services in your system; some alerting rules are common to all of them, others are specific to a few. Your platform/SRE team is small and doesn’t have the capacity to set up and manage alerting rules, or to ensure alerts are routed to the correct PagerDuty services. You want PagerDuty to notify a team or support person who knows the faulty component. You want to minimise MTTR (mean time to repair). You want your teams to be notified accurately.

The approach

How can we go about this? How can we empower engineering teams to self-serve monitoring and alerting? How can we spread the load of setting up and maintaining the alerting system without a massive platform team? First, some assumptions: the system is deployed on Kubernetes with a good level of deployment automation, each team has control over what’s deployed and owns its Kubernetes manifests, a working Prometheus is deployed (ideally through the Prometheus Operator), and a PagerDuty account is available to the teams. Second, let’s set some standards, such as:

  • each component/service of the system has Kubernetes labels set, for example team, system, subsystem, etc - whatever reasonably divides a bunch of services into smaller, manageable groups (see the first sketch after this list)
  • Prometheus jobs map those Kubernetes metadata labels onto Prometheus labels
  • each Prometheus alerting rule carries the labels over onto the alert’s labels (see the second sketch after this list)
  • when labels cannot be mapped onto an alert automatically (for example because an aggregation drops them), the values are set directly on the alerting rule
  • Alertmanager is configured to send alerts to PagerDuty’s Event Rules (see the last sketch at the end of this post)
  • an Event Rule is configured to forward incidents to the PagerDuty service representing a team, system, subsystem or component
  • each PagerDuty service has on-call support with a rota shared among the development teams
  • development teams have control over defining alerting rules, and the process is fully automated through CI/CD (i.e. Prometheus Operator with the PrometheusRule CRD, pipelines auto-deploying approved rules, etc.)
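
To make the first two points concrete, here is a minimal sketch assuming the Prometheus Operator is in place and configured to pick up monitors from team namespaces. All the names are hypothetical: the service checkout-api, the namespace shop and the label values payments/shop/checkout are just placeholders. The idea is that the team labels its own workload, and a PodMonitor copies those labels onto every scraped series.

```yaml
# The team labels its own workload with team/system/subsystem.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-api               # hypothetical service
  namespace: shop
  labels:
    team: payments
    system: shop
    subsystem: checkout
spec:
  replicas: 2
  selector:
    matchLabels:
      app: checkout-api
  template:
    metadata:
      labels:
        app: checkout-api
        team: payments
        system: shop
        subsystem: checkout
    spec:
      containers:
        - name: checkout-api
          image: registry.example.org/checkout-api:1.0.0
          ports:
            - name: http-metrics   # assumes the app exposes /metrics on this port
              containerPort: 8080
---
# PodMonitor (Prometheus Operator): copies the pod labels onto the scraped
# metrics, so every series carries team/system/subsystem.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: checkout-api
  namespace: shop
spec:
  selector:
    matchLabels:
      app: checkout-api
  podMetricsEndpoints:
    - port: http-metrics
  podTargetLabels:
    - team
    - system
    - subsystem
```

With plain Prometheus (no operator), the same mapping can be done in the scrape job itself, using kubernetes_sd_configs plus relabel_configs on the __meta_kubernetes_pod_label_* meta labels.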

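And here is a sketch of a team-owned alerting rule, again assuming the Prometheus Operator and the hypothetical labels above. The first rule inherits team/system/subsystem from the series it fires on; the second aggregates those labels away, so the values are set directly on the rule, as described in the list. The metric http_requests_total and the 5% threshold are purely illustrative.

```yaml
# PrometheusRule (Prometheus Operator): lives next to the service's other
# manifests and is deployed by the team's own pipeline.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: checkout-api-alerts
  namespace: shop
  labels:
    release: prometheus        # must match the ruleSelector of your Prometheus
spec:
  groups:
    - name: checkout-api
      rules:
        # `up` still carries the mapped team/system/subsystem labels,
        # so the firing alert inherits them automatically.
        - alert: CheckoutApiTargetDown
          expr: up{team="payments", subsystem="checkout"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "A checkout-api target has been down for 5 minutes"
        # sum() drops the mapped labels, so they are set directly on the rule.
        - alert: CheckoutApiHighErrorRate
          expr: |
            sum(rate(http_requests_total{subsystem="checkout", code=~"5.."}[5m]))
              /
            sum(rate(http_requests_total{subsystem="checkout"}[5m])) > 0.05
          for: 10m
          labels:
            severity: critical
            team: payments
            system: shop
            subsystem: checkout
          annotations:
            summary: "checkout-api 5xx rate above 5% for 10 minutes"
```
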
How does this make life easier for the platform/DevOps/SRE team? It shifts the responsibility for defining alerts, and for making sure they exist, to the developers writing and maintaining the components of the system. Give developers the right tools, guidance and a bit of help, and they will maintain and support the system. When the development teams grow and more services are created, there is no extra load on the platform team to define or look after alerting rules.
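
For completeness, here is the platform-side glue: a minimal Alertmanager configuration sketch, assuming a single Events API v2 integration key attached to a PagerDuty global ruleset. Alertmanager forwards every alert to that one integration and surfaces the team/system/subsystem labels in the event payload, so Event Rules in PagerDuty can route each incident to the right service and its on-call rota.

```yaml
# alertmanager.yml: one receiver pointing at a PagerDuty global ruleset
# (Events API v2); the Event Rules route incidents to per-team services.
route:
  receiver: pagerduty-event-rules
  group_by: ['alertname', 'team', 'system', 'subsystem']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: pagerduty-event-rules
    pagerduty_configs:
      - routing_key: <events-api-v2-integration-key>   # from the PagerDuty ruleset
        # Rules should always set a severity label; PagerDuty expects one of
        # critical/error/warning/info.
        severity: '{{ .CommonLabels.severity }}'
        # Surface the routing labels so PagerDuty Event Rules can match on them.
        group: '{{ .CommonLabels.system }}'
        component: '{{ .CommonLabels.subsystem }}'
        details:
          team: '{{ .CommonLabels.team }}'
          system: '{{ .CommonLabels.system }}'
          subsystem: '{{ .CommonLabels.subsystem }}'
```

On the PagerDuty side, an Event Rule matching on, say, the team field in the custom details forwards the incident to that team's service, which carries the on-call rota.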