There is a story recounted in Richard Feynman's book Surely You're Joking, Mr. Feynman! about what Feynman dubbed cargo cult science:

In the South Seas there is a cargo cult of people. During the war they saw airplanes land with lots of good materials, and they want the same thing to happen now. So they've arranged to imitate things like runways, to put fires along the sides of the runways, to make a wooden hut for a man to sit in, with two wooden pieces on his head like headphones and bars of bamboo sticking out like antennas (he's the controller), and they wait for the airplanes to land. They're doing everything right. The form is perfect. It looks exactly the way it looked before. But it doesn't work. No airplanes land. So I call these things cargo cult science, because they follow all the apparent precepts and forms of scientific investigation, but they're missing something essential, because the planes don't land.

Over the years, this observation from the science community has come to be applied to software engineering and system administration as well: teams adopt the tools and procedures of more successful teams and companies in the misguided belief that the tools and procedures are what made those teams successful, and so will make their own team successful in the same ways. Sadly, the cause and effect are backward: the success those teams experienced led them to create the tools and procedures, not the other way around.
When it comes to monitoring, there's a problem with this approach: how can you build monitoring for a thing you don't understand? Thus, the anti-pattern: monitoring is not a job; it's a skill, and it's a skill everyone on your team should have to some degree.
Are there high-level checks you can perform to verify it's working? For example, if we're talking about a webapp, the first check I would set up is an HTTP GET against /. I would record the request latency and verify that the response code is HTTP 200 OK and that specific text appears on the page.
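A minimal sketch of such a check in Python, using only the standard library (the URL and expected text are placeholders you would replace with your own):

```python
import time
import urllib.request

def evaluate_response(status, body, latency, expected_text):
    """Judge a response against the three checks described above."""
    return {
        "status_ok": status == 200,           # expect HTTP 200 OK
        "text_ok": expected_text in body,     # expect specific text on the page
        "latency_seconds": latency,           # record request latency
    }

def check_homepage(url, expected_text):
    """Perform an HTTP GET against / and evaluate the result."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = resp.read().decode("utf-8", errors="replace")
        status = resp.status
    return evaluate_response(status, body, time.monotonic() - start, expected_text)
```

In practice you would run this from your monitoring system on a schedule and alert when `status_ok` or `text_ok` comes back false, or when latency crosses a threshold.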
Opt for collecting metrics at least every 60 seconds. If you have a high-traffic system, opt for more often, such as every 30 seconds or even every 10 seconds. Some people have argued that collecting metrics more often places too much load on the system, which I call baloney. Modern servers and network gear have very high performance and can easily handle the minuscule load more monitoring will place on them. Of course, keeping high-granularity metrics around on disk for a long period of time can get expensive. You probably don’t need to store a year of CPU metric data at 10-second granularity. Make sure you configure a roll-up period that makes sense for your metrics.
Stop Using Email for Alerts

An email isn't going to wake someone up, nor should you expect it to. Sending alerts to email is also a great way to overwhelm everyone with noise, which will lead to alert fatigue.
Response/action required immediately

Send this to your pager, whether that's an SMS, PagerDuty, or what-have-you. This is an actual alert.
Awareness needed, but immediate action not required

I like to send these to internal chat rooms. Some teams have built small webapps to receive and store these for review, with great success. You could send these to email, but be careful: it's easy to overwhelm an inbox. The other options are usually better.
Write Runbooks

A runbook is a great way to quickly orient yourself when an alert fires. In more complex environments, not everyone on the team is going to have knowledge about every system, and runbooks are a great way to spread that knowledge around.
A good runbook is written for a particular service and answers several questions:

- What is this service, and what does it do?
- Who is responsible for it?
- What dependencies does it have?
- What does the infrastructure for it look like?
- What metrics and logs does it emit, and what do they mean?
- What alerts are set up for it, and why?

For every alert, include a link to your runbook for that service. When someone responds to the alert, they will open the runbook and understand what's going on, what the alert means, and potential remediation steps.
As with many good things, runbooks can be easy to abuse. If your remediation steps for an alert are as simple as copy-pasting commands, then you’ve started to abuse runbooks. You should automate that fix or resolve the underlying issue, then delete the alert entirely. A runbook is for when human judgment and diagnosis is necessary to resolve something.
I strongly encourage you to put software engineers into the on-call rotation as well. The idea behind this is to avoid the “throw-it-over-the-wall” version of software engineering. If software engineers are aware of the struggles that come up during on-call, and they themselves are part of that rotation, then they are incentivized to build better software. There’s also a more subtle reason here: empathy. Putting software engineers and operations engineers together in some way increases empathy for each other, and it’s awfully hard to be upset at someone you genuinely understand and like.
Pay your team extra for their on-call shifts. It's standard practice in the medical profession for on-call staff to receive additional pay, ranging from an extra $2/hr for nurses up to $2,000/day for neurosurgeons.
What about incidents that are actual outages and last longer than a few minutes? In that case, a well-defined set of roles becomes crucial. Each of these roles has a singular function, and no one should be doing double duty:

Incident commander (IC)

This person's job is to make decisions. Notably, they are not performing any remediation, customer or internal communication, or investigation. Their job is to oversee the outage investigation, and that's it. Often, the on-call person adopts the IC role at the start of the incident. Sometimes the IC role is handed off to someone else, especially if the person on call is better suited to another role.

Scribe

The scribe's job is to write down what's going on: who's saying what and when, what decisions are being made, and what follow-up items are being identified. Again, this role should not be performing any investigation or remediation.

Communication liaison

This role communicates status updates to stakeholders, whether internal or external. In a sense, they are the sole communication point between the people working on the incident and the people demanding to know what's going on. One facet of this role is to prevent stakeholders (e.g., managers) from interfering with the incident by directly asking those working on resolving it for status updates.

Subject matter experts (SMEs)

These are the people actually working on the incident.
When dealing with datasets that are highly skewed in one direction, the median can often be more representative of the dataset than the mean. To calculate the median, first sort the dataset in ascending order, then find the middle position using the formula (n + 1) / 2, where n is the number of entries in the dataset. If your dataset contains an odd number of entries, the median is the exact middle entry. If it contains an even number of entries, the two middle entries are averaged, which can produce a median value that is not found in the original dataset. For example, consider the dataset 0, 1, 1, 2, 3, 5, 8, 13, 21. The median is 3. If we add a 10th number so the dataset becomes 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, the median becomes 4.
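Python's standard library computes this directly; the snippet below reproduces both examples:

```python
from statistics import median

odd_dataset = [0, 1, 1, 2, 3, 5, 8, 13, 21]       # 9 entries (odd)
even_dataset = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]  # 10 entries (even)

print(median(odd_dataset))   # the exact middle entry: 3
print(median(even_dataset))  # the average of the two middle entries, 3 and 5: 4.0
```

Note that the even-length result (4.0) is a number that does not appear anywhere in the dataset itself.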
Write to Disk or Write to Network?

Write to disk, with a service that comes along at regular intervals to send logs to an external location. Many log services support writing from inside the app directly to a network location. This makes it easy to ship your logs off for storage and analysis, but as your app's traffic increases, it can become a significant and troublesome bottleneck: you're making a network connection every time you send a log entry, which can get expensive in terms of resource utilization. Instead, it's better to write the log entry to a file on disk and have a service that comes along at regular intervals (even near real time) to send the entries to an external location. This allows log shipping to be done asynchronously from the app, potentially saving a lot of resources. rsyslog's forwarding functionality can do this; alternatively, many of the SaaS logging services have agents that perform the same job.
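A sketch of the write-to-disk half in Python (the file path and logger name are placeholders; shipping the file onward is left to rsyslog forwarding or a SaaS agent):

```python
import logging
from logging.handlers import WatchedFileHandler

def make_file_logger(path, name="myapp"):
    """Write log entries to a local file on disk; a separate shipper
    (e.g., rsyslog forwarding or a SaaS logging agent) tails this file
    and sends entries to an external location asynchronously."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # WatchedFileHandler reopens the file if an external tool rotates it.
    handler = WatchedFileHandler(path)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

The app only ever pays the cost of a local file write; the network hop happens out-of-band in the shipper.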
Distributed tracing is a methodology and toolchain for monitoring the complex interactions inherent in a microservice architecture. Popularized by the Google Dapper paper and first implemented outside of Google with Zipkin, distributed tracing is becoming an integral component of the monitoring toolset for teams running microservice architectures. How it works is straightforward: for every request that comes in, “tag” it with a unique request ID. This request ID stays with the request and resulting requests throughout its life, allowing you to see what services a request touches and how much time is spent in each service. One important distinction of tracing versus metrics is that tracing is more concerned with individual requests than the aggregate (though it can also be used for that).
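A minimal sketch of the tagging idea in Python. The X-Request-ID header name here is illustrative; real tracing systems such as Zipkin define their own propagation headers:

```python
import uuid

# Illustrative header name; tracing systems have their own conventions
# (e.g., Zipkin's B3 propagation headers).
HEADER = "X-Request-ID"

def incoming_request_id(headers):
    # Reuse the caller's ID if present so the trace continues across
    # services; otherwise this service starts the trace and mints a new ID.
    return headers.get(HEADER) or str(uuid.uuid4())

def outgoing_headers(request_id, extra=None):
    # Every downstream call carries the same ID, tying together all the
    # services a single request touches.
    headers = dict(extra or {})
    headers[HEADER] = request_id
    return headers
```

With every service logging and forwarding the same ID, the tracing backend can reassemble the request's full path and the time spent in each hop.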
Distributed tracing is far and away the most challenging and time-consuming monitoring technique to implement properly, and it's useful for only a small segment of the industry. It is not for the faint of heart or the understaffed-and-overworked engineering team. If you've already got all the metrics and logs you want but still find yourself struggling to understand inter-service performance and troubleshooting in a distributed architecture, then distributed tracing might be for you (and the same goes for those of you with significant serverless infrastructures). Otherwise, effectively instrumenting your apps with metrics and logs is going to yield much better (and quicker!) outcomes.
Another way to watch for serious memory issues is to monitor your logs for the OOM killer, the kernel mechanism responsible for terminating processes in an effort to free memory when the system is under high memory pressure. Grepping your syslog for Killed process will spot it. I recommend creating an alert in your log management system for any occurrence of the OOM killer: any time it comes into the picture, you've got a problem somewhere, especially because the OOM killer is unpredictable in its choice of which processes to terminate.
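A sketch of such a scan in Python. The regex matches the kernel's "Killed process <pid> (<name>)" log format, though the exact message wording varies somewhat by kernel version:

```python
import re

# Matches kernel OOM-killer entries of the form:
#   "Out of memory: Killed process 4321 (mysqld) total-vm:..."
OOM_PATTERN = re.compile(r"Killed process (\d+) \(([^)]+)\)")

def find_oom_kills(syslog_lines):
    """Return (pid, process_name) for each OOM-killer entry found."""
    hits = []
    for line in syslog_lines:
        match = OOM_PATTERN.search(line)
        if match:
            hits.append((int(match.group(1)), match.group(2)))
    return hits
```

In practice your log management system would run an equivalent match continuously and fire an alert on any hit, rather than you grepping after the fact.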
Monitoring SSL certificates is simple: you want to know how long you have until they expire, and you want something to let you know before that happens.
If the SSL certificate is in use externally, you can use external site monitoring tools (e.g., Pingdom and StatusCake) to check and alert you on the certificate expiration.
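For certificates you can reach yourself, a check is straightforward to sketch in Python with only the standard library (the hostname is whatever endpoint you want to monitor):

```python
import socket
import ssl
from datetime import datetime, timezone

def parse_not_after(not_after):
    # Certificate expiry strings look like "Jun  1 12:00:00 2030 GMT".
    expires = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    return expires.replace(tzinfo=timezone.utc)

def days_until_expiry(hostname, port=443):
    """Fetch the peer certificate and return the days until it expires."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    remaining = parse_not_after(cert["notAfter"]) - datetime.now(timezone.utc)
    return remaining.days
```

Run this on a schedule and alert when the remaining days drop below your renewal lead time (say, 30 days).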
When it comes to web servers, there is one golden metric for assessing performance and traffic level: requests per second (req/sec). Fundamentally, req/sec is a measurement of throughput. Less critical to performance, but still important for overall visibility, is monitoring your HTTP response codes.
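Web servers typically expose a cumulative request counter rather than a rate (for example, via nginx's stub_status or Apache's mod_status), so req/sec is derived by sampling the counter and dividing the delta by the sampling interval. A minimal sketch:

```python
def requests_per_second(previous_count, current_count, interval_seconds):
    # The server's request counter only ever increases; throughput is the
    # change between two samples divided by the time elapsed between them.
    return (current_count - previous_count) / interval_seconds

# Example: two samples of the counter taken 60 seconds apart.
print(requests_per_second(10_000, 10_600, 60))  # 10.0 req/sec
```

Most metrics tools do this derivation for you when you record the raw counter, but it's worth knowing what the resulting number means.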
There are a great many interesting things you'll find in your logs, most of which will depend entirely on your infrastructure. To get you started, I recommend logging and paying attention to these:

- HTTP responses
- sudo usage
- SSH logins
- cron job results
- MySQL/PostgreSQL slow queries

Analyzing logs is largely a matter of which tool you use, whether it's Splunk, the ELK stack, or some SaaS tool. I strongly encourage you to use a log aggregation tool for analyzing and working with your log data.
auditd is great for tracking user actions and other events through its high level of configurability. For example, some of the types of events it can report on:

- all sudo executions, the command executed, and who ran it
- file access or changes to specific files, when, and by whom
- user authentication attempts and failures
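A sketch of audit rules covering the first two event types, using auditd's watch-rule syntax (the file paths and key names here are illustrative, not required values):

```
# /etc/audit/rules.d/monitoring.rules (illustrative)

# Record every execution of the sudo binary, tagged with a searchable key:
-w /usr/bin/sudo -p x -k sudo_exec

# Record writes and attribute changes to /etc/passwd, and who made them:
-w /etc/passwd -p wa -k passwd_changes
```

The keys make retrieval easy later, e.g., ausearch -k sudo_exec to pull all matching events.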