Running a complex IT system 24/7 that is developed and supported by multiple teams can be challenging to scale from an ops perspective. An important aspect of running a production system is monitoring and alerting, which, like anything else, needs a good setup. Some organisations have large SRE teams that handle much of the setup and configuration of monitoring and alerting systems. When an organisation has only a small platform team alongside a much larger development team, this can be challenging, especially if the development team ships a new service every week or so. You can turn this challenge into an opportunity by giving developers the right tools to self-serve. Simply put, enable them to define alerting rules and label services in such a way that the right team is notified on failure.
With a few simple steps you can improve site reliability and empower developers to be more self-sufficient. This also decreases the load on a platform or SRE team.
Say you have 50 or so services in your system. Some alerting rules are common to all of them; others are specific to particular services. Your platform/SRE team is small and doesn’t have the capacity to set up and manage alerting rules, or to ensure alerts are routed to the correct PagerDuty services. You want PagerDuty to notify a team or support person with knowledge of the faulty component. You want to minimise MTTR (mean time to repair). You want your team to be notified accurately.
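To make the routing goal concrete, here is a minimal sketch of label-based routing. It assumes every alert carries a `team` label; the team names, the routing keys and the `routing_key_for` helper are all hypothetical placeholders, not real PagerDuty values. In a real setup this decision would normally live in Alertmanager's routing configuration rather than in application code, so treat this only as an illustration of the idea.

```python
from typing import Dict

# Hypothetical mapping from a `team` label value to a PagerDuty routing key.
# Both the team names and the keys are placeholders for illustration.
ROUTING_KEYS: Dict[str, str] = {
    "payments": "PD_KEY_PAYMENTS",
    "search": "PD_KEY_SEARCH",
}

# Catch-all key (assumed to belong to the platform team) so that an
# unlabelled alert is still delivered to someone.
FALLBACK_KEY = "PD_KEY_PLATFORM"


def routing_key_for(alert_labels: Dict[str, str]) -> str:
    """Pick the PagerDuty routing key for an alert from its `team` label."""
    team = alert_labels.get("team", "")
    return ROUTING_KEYS.get(team, FALLBACK_KEY)
```

An alert labelled `team: payments` would go to the payments service, while an alert with no `team` label falls back to the platform team instead of being dropped.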
How can we go about this? How can we empower the engineering team to self-serve monitoring and alerting? How can you spread the load of setting up and maintaining an alerting system without a massive platform team? First, some assumptions: the system is deployed on Kubernetes with a good level of deployment automation, and the team has control over what is deployed. The team owns the Kubernetes manifests. A working Prometheus is deployed (ideally through Prometheus Operator) and there is a PagerDuty account available to the team. Second, let’s set some standards, such as:
subsystem, etc - whatever reasonably divides a bunch of services into smaller manageable groups.
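Standards like these are easier to keep when they are checked automatically, for example in a CI step. The sketch below assumes the convention is a set of required labels on every alerting rule; the label names `team`, `subsystem` and `severity`, the `missing_labels` helper and the example rule are assumptions made for illustration. The manifest is shaped like a Prometheus Operator `PrometheusRule` resource, represented here as a plain Python dictionary.

```python
from typing import Any, Dict, List

# Assumed labelling convention; adjust to whatever your team agrees on.
REQUIRED_LABELS = ("team", "subsystem", "severity")


def missing_labels(rule_manifest: Dict[str, Any]) -> List[str]:
    """Return one 'AlertName: label' entry per required label a rule lacks."""
    problems: List[str] = []
    for group in rule_manifest.get("spec", {}).get("groups", []):
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules are not alerts, so skip them
            labels = rule.get("labels", {})
            for name in REQUIRED_LABELS:
                if not labels.get(name):
                    problems.append(f"{rule['alert']}: {name}")
    return problems


# Hypothetical manifest owned by a development team.
example = {
    "apiVersion": "monitoring.coreos.com/v1",
    "kind": "PrometheusRule",
    "metadata": {"name": "checkout-alerts"},
    "spec": {"groups": [{"name": "checkout", "rules": [
        {
            "alert": "CheckoutHighErrorRate",
            "expr": 'rate(http_requests_total{status=~"5.."}[5m]) > 1',
            "for": "10m",
            # `subsystem` is deliberately missing here.
            "labels": {"team": "payments", "severity": "page"},
        },
    ]}]},
}

print(missing_labels(example))
```

Failing the build when the returned list is non-empty keeps the convention enforced without anyone on the platform team reviewing each rule by hand.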
How does this make life easier for the Platform/DevOps/SRE team? It shifts responsibility for defining alerts, and for ensuring they are defined, to the developers who write and maintain the components of the system. Give developers the right tools, guidance and some help, and they will maintain and support the system. When the development team grows and more services are created, there won’t be extra load on the Platform team to define or take care of alerting rules.