Yury Bushmelev

An overview of logging pipelines: Lazada 2017 vs Cloudflare 2023

Recently, I finished reading the Cloudflare blog post about their logging pipelines. Guess what? I found quite a lot of similarities with what we did at Lazada in 2017! Thanks to the layoff, I got some free time to write this long-enough read about logging pipelines. I'm going to compare what we built at Lazada with what Cloudflare published recently. Let's dive into the history together!

Just for your information, Lazada is an e-commerce platform that was quite popular in Southeast Asia (and still is, I believe). That said, Lazada in 2017 and Lazada now are completely different: Alibaba replaced everything with their own in-house implementations in 2018. So what I'm going to describe below no longer exists.

Below is a bird’s-eye view of the logging pipeline we’ve implemented.

Lazada 2017 logging pipeline overview

Local delivery pipeline

On every host, we have an rsyslog instance listening on a Unix datagram socket in a dedicated directory. This directory is mapped into every container running on the host, and every microservice running in a container writes its logs into that socket.

Why do we do this? Well… let me elaborate a bit. The famous "Twelve-Factor App" methodology says that "each running process writes its event stream, unbuffered, to stdout". That sounds nice, and there are reasons behind it. But at scale, things are not that simple. Who is reading that stdout? What if your infra engineers need to restart the reader process for some reason? What if the reader process is stuck on disk I/O, for example? In the best case, your writer will receive a signal. In the worst case, it'll block in write().

What can we do about this? Either we accept the risk, or we implement a wrapper that writes log messages from a separate thread and never blocks on write(). Then we have to push that wrapper into every microservice in the product.
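Here is a minimal sketch of such a wrapper in Python, assuming a bounded in-memory queue. It only illustrates the idea; it is not our actual library:

```python
# Illustration only: log writes go through a bounded queue and a background
# thread, so the calling service never blocks even if the log sink stalls.
import queue
import sys
import threading


class NonBlockingLogWriter:
    def __init__(self, stream=sys.stdout, maxsize=10000):
        self._stream = stream
        self._queue = queue.Queue(maxsize=maxsize)
        self.dropped = 0  # messages dropped because the queue was full
        threading.Thread(target=self._drain, daemon=True).start()

    def write(self, line: str) -> None:
        try:
            self._queue.put_nowait(line)  # never block the caller
        except queue.Full:
            self.dropped += 1             # accept the loss instead of blocking

    def _drain(self) -> None:
        while True:
            self._stream.write(self._queue.get() + "\n")
            self._stream.flush()


log = NonBlockingLogWriter()
log.write('{"level":"info","msg":"hello"}')
```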

So now we have a library that can write to stdout without blocking or crashing. Why the Unix socket, then? The answers are speed and reliability. There are more moving parts involved in the stdout logging pipeline: someone has to read the logs from the pipe and then deliver them somewhere. At the time, I saw no software that could do this fast and reliably. Also, we were trying to avoid local disk writes in the local delivery pipeline as much as possible. That's why we decided to use rsyslog with a Unix socket instead.

Honestly speaking, our library was configurable. By default, it logged to stdout to simplify the developer experience. At the same time, it had an option to log to a Unix socket, which was enabled in our staging and production configurations.
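A rough sketch of that switch, with a hypothetical socket path (again, just an illustration of the approach, not the real library code):

```python
# Sketch of the configurable sink: stdout by default, a Unix datagram socket
# (the one rsyslog listens on) in staging/production. The path is hypothetical.
import socket
import sys


def make_sink(socket_path=None):
    if socket_path is None:
        return lambda line: print(line, file=sys.stdout, flush=True)

    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    sock.connect(socket_path)
    # SOCK_DGRAM keeps message boundaries: one send() call is one log record.
    return lambda line: sock.send(line.encode("utf-8"))


sink = make_sink("/var/run/logging/rsyslog.sock")  # mounted into the container
sink('{"msg":"order created","service":"cart-service"}')
```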

Now let’s see what Cloudflare did:

This means that for our heavier workloads, where every millisecond of delay counts, we provide a more direct path into our pipeline. In particular, we offer a Unix Domain Socket that syslog-ng listens on. This socket operates as a separate source of logs into the same pipeline that the journald logs follow, but allows greater throughput by eschewing the need for a global ordering that journald enforces. Logging in this manner is a bit more involved than outputting logs to the stdout streams, as services have to have a pipe created for them and then manually open that socket to write to. As such, this is generally reserved for services that need it, and don’t mind the management overhead it requires.

They have a different setup, but the overall idea is mostly the same. In our case, almost every microservice was critical enough to adopt the library.

Thanks to rsyslog lookup tables and Puppet+MCO, we were able to enable or disable log collection per microservice/environment/DC. For example, it was possible to stop collecting logs from a noisy microservice on staging within a few minutes. It was also possible to enable or disable writing a local log file in case we really needed one.

Global logs delivery pipeline

As you may see in the picture above, logs from every server in a datacenter are sent to a pair of the datacenter's log collectors via syslog-over-TCP. The collector (conditionally) forwards each message to the datacenter's long-term file storage and to the Main DC's log collector. It has an internal queue that can hold messages for about a day in case the Main DC is inaccessible for some reason. The stream between a DC and the Main DC uses the compressed syslog-over-TCP protocol, which introduces some delay but dramatically saves bandwidth. Unlike Cloudflare, though, we used just a single DC as the Main DC. Also, the log collectors (per-DC and Main DC) were active-passive with manual failover. So there was definitely room for improvement.

Message routing

I omitted some details in the description above, though. One of those details is message routing. A log collector needs some information to decide what to do with a message. Part of it is available in the syslog message fields (severity, facility, hostname), but most of the required information is JSON-encoded in the message body, including the microservice name, the venture (the country the microservice is serving), the environment, etc.

The easy way here is to parse the JSON in the message and use its fields to make the decision. But JSON parsing is quite an expensive operation. Instead, we add a syslog tag to the message in the following format: $microservice/$venture/$environment[/$optional…]. This way, we only have to parse a short, slash-separated string instead of a full JSON message. Combined with the usual syslog message fields, this gives us enough information to route the message any way we may need.
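To show why this is cheap, here is a sketch of the tag parsing in Python (the service and venture names are made up; the real routing happened inside rsyslog itself):

```python
# Routing on the tag means splitting a short string instead of decoding JSON.
# Tag layout: $microservice/$venture/$environment[/$optional...]
def parse_tag(tag: str) -> dict:
    parts = tag.split("/")
    if len(parts) < 3:
        raise ValueError(f"unexpected tag format: {tag!r}")
    microservice, venture, environment, *extra = parts
    return {
        "microservice": microservice,
        "venture": venture,
        "environment": environment,
        "extra": extra,  # optional trailing components
    }


print(parse_tag("cart-service/sg/production"))
# {'microservice': 'cart-service', 'venture': 'sg', 'environment': 'production', 'extra': []}
```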

Kafka

When a message arrives in the Main DC, its syslog tag is stripped and the JSON is injected into a Kafka topic according to its source environment. There are two big reasons to use Kafka here. Firstly, it allows other teams (BI, Security, Compliance, etc.) to access the logs. Secondly, it gives us some flexibility in infrastructure operations.
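As a sketch of that fan-in step, here is what topic-per-environment publishing could look like with the kafka-python client; the broker address and topic naming scheme are assumptions, not the actual production setup:

```python
# Illustration of publishing the stripped JSON payload into a per-environment
# topic. Broker and topic names are hypothetical.
from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(bootstrap_servers=["kafka-1:9092"])


def publish(environment: str, json_payload: bytes) -> None:
    topic = f"logs.{environment}"          # e.g. logs.production, logs.staging
    producer.send(topic, value=json_payload)


publish("production", b'{"msg":"order created","service":"cart-service"}')
producer.flush()  # make sure buffered messages are actually sent
```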

Onwards to storage

Developers were able to access logs in two ways. The first (and most popular) was a shell box on the per-DC file storage. Log files were laid out in a directory tree, which allowed developers to use the good old Unix tools (tail/grep/sed/awk/sort/etc.) to deal with them. The second was the Graylog UI, backed by ElasticSearch v5.5.

We had 16 powerful servers for the ES cluster. That allowed us to implement a hot/warm architecture with 16 hot (SSD) and 32 warm (HDD) instances. We also created three dedicated masters to coordinate the cluster. These days, I understand it would have been a good idea to dedicate a subset of the data nodes to serving HTTP traffic only (as Cloudflare did).

Pipeline metrics

So we had a log delivery pipeline working, one way or another. The next question was how to measure the quality of log delivery. To answer it, we implemented a few things.

On the microservice side, our logging library exposed some metrics as a subset of the microservice's metrics. Those included the number of log messages written and the number of log messages dropped.
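In Python terms, that boils down to a couple of counters next to the writer from the sketches above; the metric names here are illustrative, not the ones we actually exposed:

```python
from prometheus_client import Counter

# Illustrative metric names; the real library exposed its counters through
# the microservice's own metrics endpoint.
LOG_WRITTEN = Counter("logging_messages_written_total",
                      "Log messages successfully written")
LOG_DROPPED = Counter("logging_messages_dropped_total",
                      "Log messages dropped (queue full or socket error)")


def record_write(ok: bool) -> None:
    (LOG_WRITTEN if ok else LOG_DROPPED).inc()
```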

On the rsyslog side, I implemented a simple Prometheus exporter, which exposed rsyslog's internal metrics (impstats) and our custom counters (dyn_stats). This way, we could monitor rsyslog's queues and count how many messages were received from each microservice.
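The exporter itself is long gone, but the idea can be sketched roughly like this, assuming impstats is configured to emit one JSON object per line and those lines are piped into the script; the field handling and port number are assumptions for illustration:

```python
# Rough sketch of an impstats-to-Prometheus bridge, not the exporter we ran:
# every numeric field of each impstats JSON line becomes a gauge labelled with
# the counter block's name (e.g. a queue or action name).
import json
import sys

from prometheus_client import Gauge, start_http_server

gauges = {}  # metric name -> Gauge


def export(line: str) -> None:
    stats = json.loads(line)
    name = stats.pop("name", "unknown")
    for key, value in stats.items():
        if not isinstance(value, (int, float)):
            continue  # skip non-numeric fields such as "origin"
        metric = key.replace(".", "_").replace("-", "_")
        gauge = gauges.setdefault(
            metric,
            Gauge(f"rsyslog_{metric}", f"rsyslog impstats field '{key}'", ["name"]),
        )
        gauge.labels(name=name).set(value)


if __name__ == "__main__":
    start_http_server(9101)        # hypothetical exporter port
    for line in sys.stdin:         # impstats JSON lines piped in by rsyslog
        if line.strip():
            export(line.strip())
```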

With all that implemented, we were able to monitor the log delivery pipeline and its quality of service. A few alerts were configured to fire when, for example, inter-DC log delivery got stuck.

What was next?

With all the above in place, we found that the Graylog cluster was able to process up to almost 200k msg/s, which I personally consider a bit on the low side for the hardware underneath it. Graylog was definitely the bottleneck there. So my next idea to try in 2018 was to set up rsyslog to write to ES directly, then drop Graylog and use Kibana instead.

On the other hand, I had the impression that ElasticSearch was the wrong tool here. Nobody actually needs full-text search in logs. So I added ClickHouse to my list of things to explore, though I was concerned about the lack of a UI. I filed a feature request for an omclickhouse module in the rsyslog GitHub repo, and it was implemented a year later.

Other items on my to-do list for 2018 were:

  • to work on High Availability issues
  • to configure rate limits on every microservice and per-DC log collectors
  • to define the log delivery SLA with our developers

But… reality turned out differently. The company's management and the tech stack were replaced completely. I left the company in June 2018 and had no chance to work on logs at scale anymore.

P.S. You might also want to read this article: https://dev.to/jay7x/random-thoughts-about-logs-delivery-pipelines-and-everything-2h6e
