Jonas Brømsø

Downtime 2025-02-22

I was working on a blog post about calculating allowed downtime for human consumption. I had a draft ready and was about to publish it as the weekly post for week 8. Then I received several alerts from my monitoring system, so I went from somewhat related theory to practice.

I have rescheduled the post about downtime for later and decided to publish this one instead, since it felt almost too good to be true when a post practically drops in your lap.

The alerts came on Saturday and were for a hobby project of mine that I have written about a few times (1, 2). I have had alerts before, but this was a real downtime issue, not just a minor glitch.

Over the course of the last week I organized and held two postmortems at work, so you could say I was in the right mindset. Let me begin by giving you a bit of background on the project and its architecture.

The project is basically a URL shortener. Unlike generic URL shorteners, this one is dedicated to a specific set of URLs for a single website.

The architecture is simple, consisting of:

  • A reverse proxy (NginX)
    • Hosted with DigitalOcean in a Droplet, since their load balancers are not within the limits of my budget for a hobby project
  • A serverless function (Apache OpenWhisk)
    • Hosted with DigitalOcean in their functions platform, well within the limits of my budget for a hobby project, due to the generous free tier
  • A short domain name: (pxy.fi)
    • Registered with the registry for Finland (.fi), because it was short, cheap and available and sounded close to what I wanted to convey with the project

The monitoring I use checks 2 endpoints:

  • The reverse proxy
  • and the serverless function

The first uses a /health endpoint, which simply returns a 200 OK. The second simulates a basic user request and expects it to succeed with a 200 OK.
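A check like this can be sketched in a few lines of shell. This is only an illustration, assuming curl is installed; the function names and the ten-second timeout are hypothetical and not taken from the actual monitoring setup.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the real monitoring setup.

# Print the HTTP status code for a URL.
# curl prints 000 when no HTTP response arrives at all.
status_of() {
    curl --silent --output /dev/null --max-time 10 --write-out '%{http_code}' "$1"
}

# Succeed only when the given status code is exactly 200.
is_healthy() {
    [ "$1" = "200" ]
}
```

For example, `is_healthy "$(status_of https://pxy.fi/health)" || echo "reverse proxy is down"` would flag a failing health endpoint, assuming /health is served on that host.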

Since both had emitted alerts, I started investigating from a browser, emulating a basic use case. From https://jonasbn.github.io/til/clang/diagnostic_flags.html I opened the following URL: https://pxy.fi/4/rmodule-build. That failed by being unresponsive, so I checked the DNS setup.

First the registration of the domain, which was fine, then the DNS setup, which was also fine.

whois -c fi pxy.fi

domain.............: pxy.fi
status.............: Registered
created............: 18.10.2022 19:00:22
expires............: 18.10.2025 19:00:22
available..........: 18.11.2025 19:00:22
modified...........: 1.10.2024 17:16:25
RegistryLock.......: no

Nameservers

nserver............: ns1.digitalocean.com [Technical Error]
nserver............: ns3.digitalocean.com [Technical Error]

... snip ...

The [Technical Error] marker was a bit odd, but I now assume it is a glitch in the WHOIS tool.

I checked the DNS setup with dig. The domain was pointing to the correct IP address.

dig pxy.fi

; <<>> DiG 9.10.6 <<>> pxy.fi
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 11073
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;pxy.fi.                                IN      A

;; ANSWER SECTION:
pxy.fi.                 2759    IN      A       104.248.91.106

;; Query time: 25 msec
;; SERVER: 212.10.10.5#53(212.10.10.5)
;; WHEN: Sun Feb 23 14:00:38 CET 2025
;; MSG SIZE  rcvd: 51

I also checked the 3 DNS servers using dig as 3 separate queries:

dig @ns1.digitalocean.com pxy.fi
dig @ns2.digitalocean.com pxy.fi
dig @ns3.digitalocean.com pxy.fi

And all returned the same IP address, so the DNS setup was fine.
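The three queries can also be compared automatically. The following is a minimal sketch, assuming dig is installed; the helper names are hypothetical, and `+short` is used so that only the address is printed for each nameserver.

```shell
#!/bin/sh
# Hypothetical helpers for comparing answers across nameservers.

# Ask each authoritative nameserver for the A record of pxy.fi.
query_nameservers() {
    for ns in ns1.digitalocean.com ns2.digitalocean.com ns3.digitalocean.com; do
        dig +short "@${ns}" pxy.fi A
    done
}

# Read one answer per line on stdin and succeed only if
# exactly one distinct, non-empty answer was seen.
answers_agree() {
    [ "$(sort -u | grep -c .)" -eq 1 ]
}
```

Running `query_nameservers | answers_agree && echo "nameservers agree"` then confirms the result in one step.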

I logged into the DigitalOcean console and checked the status of the droplet and the function; both were running. I then logged into the reverse proxy and checked the NginX logs. Logging had simply stopped, with no indication of a reason.

Do note that I am by no means a DevOps or SRE person, so I am not an expert in this area. I have some experience from my background as a developer, so I can do some basic troubleshooting, but professionally I work as a product manager. This is not something I get exposed to on a daily basis, so it is always a learning experience and always based on a lot of searching and reading.

Anyway, I suspected the reverse proxy, so I tried something I had not tried before: I asked ChatGPT for help diagnosing the issue.

my nginx is unresponsive what should I check? please help

Yes, I know some people might think that being polite is a waste of prompt, but I just finished reading "Scary Smart" by Mo Gawdat, so I am trying to follow his advice on being more polite and respectful.

Anyway, polite or not, the answer was very helpful.

I checked the status of NginX:

systemctl status nginx

And it was not running, so I restarted it:

systemctl restart nginx

Having been through two postmortems at work, I was a bit more ambitious and wanted to know the root cause. I was, however, on my own, since this is a hobby project, so I prompted ChatGPT a second time.

This was great, it was the nginx that had stopped responding, how do I find the root cause for this?

Again I got a wealth of very useful information, and I tried out some of the suggestions:

dmesg | grep nginx

And I got a lot of output, but in a format I was not familiar with, so I asked ChatGPT again.

How do I interpret the information from dmesg?

The answer here was good, but just not what I needed, and perhaps a bit over my head at the time. So I turned to searching and reading, based on the limited information I had found using dmesg as ChatGPT suggested.

And I found the following resource:

Yes, I found others, but this one seemed to be the most relevant. The logging did not indicate a clear root cause. Based on my limited knowledge and basic interpretation, I believe it to be something minor, since it does not seem to be logged at a deeper level.

Based on that I asked ChatGPT one final question:

How can I see if systemd has nginx configured for automatic restarting?

Again a very helpful answer.

If this is something minor, as I suspect, I do not see a problem with NginX recovering automatically. So, based on the suggestion from ChatGPT, I added the following to the [Service] section of the systemd unit file for NginX (/etc/systemd/system/nginx.service):

Restart=always
RestartSec=5s
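An alternative to editing the unit file directly is a systemd drop-in, which leaves the packaged unit untouched. The fragment below is a sketch of how that could look, not the setup used here; the path is the one `systemctl edit nginx` creates by default, and the two settings are the same ones shown above.

```ini
# /etc/systemd/system/nginx.service.d/override.conf
# Drop-in sketch: only the restart policy is overridden,
# everything else still comes from the original unit file.
[Service]
Restart=always
RestartSec=5s
```

After changing unit files, `systemctl daemon-reload` makes systemd pick up the change, and `systemctl show nginx --property=Restart` prints the restart policy that is actually in effect.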

NginX was set up using the default configuration; all of my own changes were in the configuration for the reverse proxy site. This will hopefully catch this kind of issue in the future, but let's see how that works out.

ChatGPT was a very helpful colleague in this situation, and I could get back to enjoying my weekend. The wealth of information from ChatGPT was a bit overwhelming, but I will go back to the step of finding the root cause, since I am curious and I do want to learn. I do like that ChatGPT has saved my prompt history, so I can go back and review the information and the additional suggestions I did not try out.
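For that later root-cause pass, the systemd journal for the service is a natural place to look next to dmesg. The sketch below assumes a systemd-based Droplet with journald enabled; the helper names are hypothetical, and the out-of-memory filter is only one common cause worth ruling out, not a finding from this incident.

```shell
#!/bin/sh
# Hypothetical helpers for narrowing down why a service stopped.

# Show what systemd logged for the nginx unit since a given date.
nginx_journal_since() {
    journalctl --unit nginx --since "$1" --no-pager
}

# Keep only lines that look like kernel out-of-memory kills (reads stdin).
oom_lines() {
    grep -i -E 'out of memory|oom-killer|killed process'
}
```

Typical use would be `nginx_journal_since 2025-02-22` to read the unit's own log around the outage, and `dmesg | oom_lines` to see whether the kernel killed any process for lack of memory.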

I could have just restarted NginX to begin with, but I am glad that I got the configuration improved.

Top comments (1)

Jonas Brømsø • Edited

I did some additional reading and added a "today I learned" entry to my TIL collection on configuring systemd to restart NginX automatically, which of course also applies to other services relying on systemd.

Based on that I am thinking about changing the configuration to:

  • on-failure