Writing a Postmortem Report
504 Error while accessing a given URL
Any software system will eventually fail, and that failure can stem from a wide range of factors: bugs, traffic spikes, security issues, hardware failures, natural disasters, human error, and more. Failing is normal, and it is actually a great opportunity to learn and improve. Any great software engineer must learn from their mistakes to make sure they won't happen again. Failing is fine, but failing twice because of the same issue is not.
A postmortem is a tool widely used in the tech industry. After any outage, the team(s) in charge of the system will write a summary that has 2 main goals:
- To provide the rest of the company’s employees easy access to information detailing the cause of the outage. Often outages can have a huge impact on a company, so managers and executives have to understand what happened and how it will impact their work.
- To ensure that the root cause(s) of the outage have been discovered and that measures are taken to fix them.
Below is a template example of a postmortem report, in case you are asked to report on what went wrong with a server or a development process at your company.
Incident report for 504 error / Site Outage
On April 24th, 2022, at midnight, the server became unreachable, resulting in a 504 error for anyone trying to access the website. For background, the server runs a LAMP stack.
- 00:00 PST: The website started returning a 500 error to anyone trying to access it.
- 00:05 PST: Confirmed that Apache and MySQL were up and running.
- 00:10 PST: The website was still not loading properly, although a background check showed that both the server and the database were working.
- 00:12 PST: After a quick restart of Apache, the server returned a 200 OK status when the website was requested with curl.
- 00:18 PST: Reviewed the error logs to find where the error was coming from.
- 00:25 PST: Checked /var/log and saw that the Apache server was being shut down prematurely. The PHP error log was nowhere to be found.
- 00:30 PST: Checked the php.ini settings, which revealed that all error logging had been turned off. Turned error logging on.
- 00:32 PST: Restarted the Apache server and checked what was being written to the PHP error log.
- 00:36 PST: The PHP error log revealed a mistyped file name, which was causing incorrect loading and the premature shutdown of Apache.
- 00:38 PST: Fixed the file name and restarted the Apache server.
- 00:40 PST: The server was running normally and the website was loading properly.
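The curl check used in the timeline can also be scripted, so the same test is easy to repeat after every restart. The following is a minimal Python sketch, not the tooling used during this incident; the function names and the `http://localhost/` URL are assumptions made for illustration.

```python
import urllib.error
import urllib.request


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, including 4xx/5xx responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the code is still what we want.
        return err.code


def is_healthy(status: int) -> bool:
    """Treat any 2xx response as healthy."""
    return 200 <= status < 300


if __name__ == "__main__":
    # Hypothetical URL; replace it with the site being checked.
    status = fetch_status("http://localhost/")
    print(status, "OK" if is_healthy(status) else "FAILING")
```

A connection failure (for example, Apache not running at all) raises `urllib.error.URLError` rather than returning a status code, which is itself a useful signal when debugging.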
Root Cause and Resolution
The issue was caused by a wrong file name being referenced in the wp-settings.php file. The error surfaced when the server was requested with curl and responded with a 500 error.
Checking the error logs showed that no log file was being created for PHP errors, and the default Apache error log gave little information about the premature shutdown of the server. Once it was clear that PHP errors were not being directed anywhere, the engineer reviewed the error logging settings in the php.ini file and found that all error logging was turned off. After error logging was turned on, the Apache server was restarted to check whether any errors were being written to the log. As suspected, the PHP log showed that a file with a .php extension referenced in wp-settings.php could not be found. This was clearly a spelling error in the file reference, and it was what broke access to the site.
Because the error was found on one server, it had likely been replicated on the other servers as well. Making the fix through Puppet meant it could be applied to all of them at once. A quick deployment of the Puppet code replaced every misspelled file extension with the correct one, and after a restart the site and server loaded properly.
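The php.ini review described above can be automated so that disabled logging is caught before it hides an error. This is a rough sketch under stated assumptions: it only inspects the `log_errors` and `error_log` directives, it uses a simplified parser that treats everything after a `;` as a comment, and it flags a missing directive as a problem on the assumption that an explicit setting is wanted. The function names and the sample log path are hypothetical.

```python
OFF_VALUES = {"", "0", "off", "false", "no", "none"}


def parse_ini(text: str) -> dict[str, str]:
    """Collect key = value pairs, ignoring ; comments and [section] lines."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split(";", 1)[0].strip()  # simplified comment handling
        if "=" not in line or line.startswith("["):
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip('"')
    return settings


def logging_problems(ini_text: str) -> list[str]:
    """Return a list of findings about PHP error logging settings."""
    settings = parse_ini(ini_text)
    problems = []
    if settings.get("log_errors", "").lower() in OFF_VALUES:
        problems.append("log_errors is off")
    if not settings.get("error_log"):
        problems.append("error_log is not set")
    return problems


if __name__ == "__main__":
    # Hypothetical php.ini content resembling the state found in the incident.
    sample = "log_errors = Off\n;error_log = /var/log/php_errors.log\n"
    print(logging_problems(sample))
```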
Corrective and Preventive Measures
- All servers and sites should have error logging turned on to easily identify errors if anything goes wrong.
- All servers and sites should be tested locally before being deployed to a multi-server setup. This catches errors before going live and shortens the time needed to fix the site if it does go down.
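One way to make the local testing step concrete is a pre-deployment check that looks for include targets that do not exist, which is the class of error behind this outage. The sketch below is an illustration only. It assumes paths are written as plain string literals and resolves them relative to the file that contains them; paths built from constants or variables are skipped, and PHP's own include path resolution is not modelled. The function name is hypothetical.

```python
import re
from pathlib import Path

# Matches require/include(_once) followed by a single quoted literal path.
INCLUDE_RE = re.compile(
    r"""\b(?:require|include)(?:_once)?\s*\(?\s*(['"])([^'"]+)\1\s*\)?\s*;"""
)


def missing_includes(php_file: Path) -> list[str]:
    """Return literal include targets in php_file that are not on disk."""
    source = php_file.read_text(encoding="utf-8", errors="replace")
    base = php_file.parent  # assumption: paths are relative to this file
    return [
        target
        for _quote, target in INCLUDE_RE.findall(source)
        if not (base / target).is_file()
    ]
```

Running a check like this over every PHP file before a rollout, and refusing to deploy when the returned list is not empty, would stop a misspelled file name from reaching multiple servers.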