What is High Scale Application Monitoring?
Let's assume we have been working on an application for months. It grows more complex by the day and must be managed at a large scale to keep its infrastructure operational. To ensure the application is running well, you have to answer some questions:
- Is your application up or down?
- Are the resources being utilized well?
- How much does resource demand grow after each release?
So it is important to have a centralized view of the system to pinpoint the source of a problem.
Typically you have multiple servers running containers. As user traffic grows, it makes sense to split these services out individually, which gets us to a microservice infrastructure. Now, if services want to communicate with each other, there needs to be some way for them to be interconnected.
Why Not Debug Our Code? Why Monitoring?
Let's say you have completed your thousand-dollar project after days and nights of hard work, and one morning you wake up to find that your application has stopped working: some of its components or microservices have failed, and the errors are so numerous that you cannot tell which component caused the failure. Or perhaps your application is responding very slowly because all the traffic is being directed to just a few servers. That is a place no one wants to be in. Debugging this manually would be very time-consuming, so this is where monitoring and alerting play an important role.
What is the Solution?
So how do you ensure that your application is maintained properly and runs with no downtime? We need some sort of automated tool that constantly monitors our application and alerts us when something goes wrong (or right, depending on the use case). In our previous example, we would be notified as soon as a service started failing, and hence we could prevent the application from going down.
What is Prometheus?
Prometheus is a free software application used for event monitoring and alerting. It records real-time metrics in a time series database (allowing for high dimensionality) built using an HTTP pull model, with flexible queries and real-time alerting. Prometheus was originally built at SoundCloud and is now an open-source project hosted by the Cloud Native Computing Foundation (CNCF), where it graduated in 2018.
Prometheus Architecture and Metrics
Prometheus Terminologies
Target - This is what Prometheus monitors; it can be your microservice, application, or Docker container.
Metric - For each target, we want to monitor particular things. Say we have some Docker containers (targets) running and we want to monitor CONTAINER_MEMORY_USAGE (a metric) for every running container.
The Prometheus server consists of three main components:
Time Series Database (TSDB) - Stores the metric data, ingests it (append-only), compacts it, and allows efficient querying. Time series data are simply measurements or events that are tracked, monitored, downsampled, and aggregated over time: server metrics or application performance data, for example.
Scrape Engine - Pulls the metrics (described above) from our target resources and sends them to the TSDB. (Prometheus pulls are called scrapes.)
Server - Used to query the data stored in the TSDB with a very powerful query language called PromQL. The results can be displayed in a dashboard using Grafana or the Prometheus UI.
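To give a taste of PromQL, here are a few illustrative queries. The metric names (`http_requests_total`, `container_memory_usage_bytes`) are examples commonly exposed by exporters, used here only for illustration:

```promql
# Total HTTP requests, filtered by labels
http_requests_total{job="api-server", status="500"}

# Per-second request rate averaged over the last 5 minutes
rate(http_requests_total[5m])

# Top 3 containers by memory usage
topk(3, container_memory_usage_bytes)
```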
A More Detailed Querying Structure:
More About Metrics
Metrics are defined with two major attributes, HELP and TYPE, to make the output more readable:
- HELP: Describes the metric with a short description.
- TYPE: Prometheus offers four core metric types so that different kinds of metrics can be classified easily; custom metrics for specific use cases are built from these existing types:
- Counter: A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart. For example, you can use a counter to represent the number of requests served, tasks completed, or errors.
- Gauge: A gauge is a metric that represents a single numerical value that can arbitrarily go up and down. For example: CPU_MEMORY_USAGE.
- Histogram: A histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values.
- Summary: Similar to a histogram, a summary samples observations (usually things like request durations and response sizes). While it also provides a total count of observations and a sum of all observed values, it calculates configurable quantiles over a sliding time window.
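To make the HELP and TYPE attributes concrete, here is a small sketch in plain Python (no Prometheus client library) that renders metrics in the text exposition format Prometheus scrapes. The metric names and values are illustrative, not from a real exporter:

```python
# Render metric families in the Prometheus text exposition format,
# showing how HELP and TYPE lines precede the samples.

def render_metric(name, mtype, help_text, samples):
    """Render one metric family: HELP line, TYPE line, then samples."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {mtype}"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in labels.items())
        label_part = f"{{{label_str}}}" if label_str else ""
        lines.append(f"{name}{label_part} {value}")
    return "\n".join(lines)

# A counter: only ever goes up (or resets to zero on restart).
counter = render_metric(
    "http_requests_total", "counter", "Total HTTP requests served.",
    [({"method": "get"}, 1027)])

# A gauge: can go up and down arbitrarily.
gauge = render_metric(
    "cpu_memory_usage_bytes", "gauge", "Current memory usage.",
    [({}, 52428800)])

print(counter)
print(gauge)
```

Running this prints exactly the kind of text you will later see on a target's /metrics endpoint.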
Understanding Prometheus Backend
How does it Pull the data from Targets?
The Data Retrieval Worker pulls the data from the targets' HTTP endpoints at the path /metrics. Here we notice two things:
- The endpoint must be exposed at the /metrics path.
- The data provided by the endpoint must be in a format that Prometheus understands.
Q. How do we make sure that the target services expose /metrics and that the data is in the correct format?
A. Some of them expose the endpoint by default. Those that do not need a component to do it for them, known as an Exporter. An Exporter does the following:
1. Fetches data from the target
2. Converts the data into a format that Prometheus understands
3. Exposes the /metrics endpoint (which can then be scraped by the Data Retrieval Worker)
For many types of services (APIs, databases, storage, HTTP servers, etc.), Prometheus provides ready-made Exporters you can use.
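The three steps above can be sketched with nothing but the Python standard library. Everything here (the metric name, the port) is an assumption for illustration; a real exporter would use an official Prometheus client library:

```python
# A minimal, illustrative exporter: fetch data from the target, convert it
# to the Prometheus text format, and expose it at /metrics.
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

def collect_metrics():
    # Step 1: fetch data from the target (here, this machine's load average).
    load1, _, _ = os.getloadavg()
    # Step 2: convert it into the text format Prometheus understands.
    return (
        "# HELP node_load1 1m load average.\n"
        "# TYPE node_load1 gauge\n"
        f"node_load1 {load1}\n"
    )

class MetricsHandler(BaseHTTPRequestHandler):
    # Step 3: expose the /metrics endpoint for the Data Retrieval Worker.
    def do_GET(self):
        if self.path == "/metrics":
            body = collect_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

# To serve: HTTPServer(("", 9100), MetricsHandler).serve_forever()
```

Point a scrape config at this port and Prometheus will happily ingest the gauge on every scrape.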
How Can You Monitor Your Own Application?
Let's say you have written your application in Python and want it to expose metrics at the HTTP endpoint /metrics on your application instance. For that you need a client library (effectively an exporter) which can then serve data to the Data Retrieval Worker. The official Prometheus documentation lists all available client libraries.
PUSH VS PULL Model for Metric Collection
Metrics are one of the "go-to" standards for any monitoring system, and they come in a variety of types. At its core, a metric is a measurement of a property of a portion of an application or system. Metrics make observations by keeping track of the state of an object: some value, or a series of related values, combined with a timestamp describing the observation. The output is commonly called time-series data.
Prometheus is a pull-based system that pulls data from configured sources at regular intervals.
As mentioned above, Prometheus uses a pull mechanism to get data from targets, while most other monitoring systems use a push mechanism. How is this different, and what makes Prometheus special?
Q. What do you mean by push mechanism?
A. Instead of the monitoring tool's server making requests to fetch the data, the application's servers push the data to a database.
Q. Why is Prometheus better?
A. Multiple Prometheus instances can simply pull the data from a target's endpoint. This also lets Prometheus detect whether an application is responsive at all, rather than waiting for the target to push data.
(Check out the official comparison documentation)
NOTE: But what happens if a target is too short-lived to be scraped? For such cases Prometheus provides the Pushgateway. These services push their metrics to the Pushgateway, which the Data Retrieval Worker then scrapes as usual. Using this, you get the best of both approaches!
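As a sketch of what such a push looks like, here is how a short-lived batch job could build the HTTP request for a Pushgateway using only the standard library (the gateway address, job name, and metric are assumptions; official client libraries provide a ready-made push_to_gateway helper for this):

```python
# Build the request a batch job would send to a Pushgateway. The gateway
# accepts the text exposition format at /metrics/job/<job_name>.
import urllib.request

def build_push_request(gateway, job, metric, value):
    """Construct a PUT request carrying one gauge sample to the gateway."""
    body = f"# TYPE {metric} gauge\n{metric} {value}\n".encode()
    url = f"http://{gateway}/metrics/job/{job}"
    return urllib.request.Request(url, data=body, method="PUT")

# Hypothetical nightly job reporting how long its run took.
req = build_push_request("localhost:9091", "nightly_backup",
                         "backup_duration_seconds", 42.5)
# urllib.request.urlopen(req) would perform the push; Prometheus then
# scrapes the gateway on its normal interval.
```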
How to Use Prometheus?
So far we have seen what Prometheus is and what its architecture looks like. Now let's see how we can set up Prometheus locally.
Q. When you define in the file what targets you want to collect data from, how does Prometheus find these targets?
A. Using Service Discovery. Prometheus can also discover services automatically based on the applications running.
Prometheus.yml File
The most important file in Prometheus is the config (YAML) file, prometheus.yml, where we define all the instructions that build the Prometheus server:
global:
  # How frequently to scrape targets by default.
  [ scrape_interval: <duration> | default = 1m ]
  # How long until a scrape request times out.
  [ scrape_timeout: <duration> | default = 10s ]
  # How frequently to evaluate rules.
  [ evaluation_interval: <duration> | default = 1m ]
  # The labels to add to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    [ <labelname>: <labelvalue> ... ]
  # File to which PromQL queries are logged.
  # Reloading the configuration will reopen the file.
  [ query_log_file: <string> ]

# Rule files specify a list of globs. Rules and alerts are read from
# all matching files.
rule_files:
  [ - <filepath_glob> ... ]

# A list of scrape configurations.
scrape_configs:
  [ - <scrape_config> ... ]

# Alerting specifies settings related to the Alertmanager.
alerting:
  alert_relabel_configs:
    [ - <relabel_config> ... ]
  alertmanagers:
    [ - <alertmanager_config> ... ]

# Settings related to the remote write feature.
remote_write:
  [ - <remote_write> ... ]

# Settings related to the remote read feature.
remote_read:
  [ - <remote_read> ... ]
(Check the official documentation for configuration)
A More Simplified prometheus.yml!
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  # - "first.rules"
  # - "second.rules"

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
- global - The scrape_interval defines how often Prometheus collects data from the targets mentioned in the file. This can of course be overridden per job.
- rule_files - This allows us to set rules for metrics and alerts. These files can be reloaded at runtime by sending SIGHUP to the Prometheus process. The evaluation_interval defines how often these rules are evaluated. Prometheus supports two types of such rules:
  - Recording Rules - If you perform some operations frequently, they can be precomputed and saved as a new set of time series. This makes the monitoring system a bit faster.
  - Alerting Rules - These let you define conditions under which alerts are sent to external services, for example when a particular threshold is crossed.
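A rules file referenced from rule_files could look like the following. The group name, metric names, and thresholds are illustrative:

```yaml
# example.rules.yml (illustrative)
groups:
  - name: example
    rules:
      # Recording rule: precompute the 5m request rate as a new series.
      - record: job:http_requests:rate5m
        expr: rate(http_requests_total[5m])
      # Alerting rule: fire when an instance has been down for 5 minutes.
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
```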
- scrape_configs - Here we define the services/targets that we need Prometheus to monitor. In this example file, the job_name is prometheus, meaning that Prometheus monitors itself as a target: it gets data from the /metrics endpoint exposed by the Prometheus server. The target here defaults to localhost:9090, which is where Prometheus serves its own metrics, at /metrics.
Alerting in Prometheus
Prometheus has an Alertmanager that is used to define alerts and send them via email, webhooks, Slack, and other methods. As mentioned above, the Prometheus server uses the Alerting Rules to fire alerts.
Where is the data stored?
The data collected by the Data Retrieval Worker is stored in a TSDB and queried using PromQL query language. You can use a Web UI to request data from the Prometheus server via PromQL.
A combined Prometheus Architecture
Running Prometheus Using Docker
For demonstration purposes we will use Docker to make things easy and reproducible anywhere. Here is a simple Dockerfile which sets the stage for us. It:
- Starts from the official Ubuntu 18.04 (bionic) image
- Installs some of the tools we will be using, like wget, screen, and vim
- Downloads the latest binary releases of Prometheus, node_exporter, and Alertmanager
- Also downloads Grafana, which will be used later for visualization
- Exposes the default ports of the respective services
Dockerfile
FROM ubuntu:bionic
LABEL Name="Ritesh" Mail="daydreamingguy941@gmail.com"
RUN apt-get update \
&& apt-get install -y wget \
&& apt-get install -y screen \
&& apt-get install -y vim
WORKDIR /root
RUN wget -nv https://github.com/prometheus/prometheus/releases/download/v2.22.2/prometheus-2.22.2.linux-amd64.tar.gz
RUN wget -nv https://github.com/prometheus/node_exporter/releases/download/v0.18.1/node_exporter-0.18.1.linux-amd64.tar.gz
RUN wget -nv https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.linux-amd64.tar.gz
RUN wget -nv https://dl.grafana.com/oss/release/grafana-7.3.3.linux-amd64.tar.gz
# node_exporter expose port
EXPOSE 9100
# prometheus server expose port
EXPOSE 9090
# grafana
EXPOSE 3000
# alertmanager
EXPOSE 9093
Now we will create a simple docker-compose file to run our Docker image. The only thing it does is give us a handy service name to work with and expose multiple ports to the host, so we can access them from the host machine. I generally prefer docker-compose over writing long docker run commands with options.
docker-compose.yml
version: '3'
services:
prometheus_demo:
build: .
ports:
- "9100:9100"
- "9090:9090"
- "3000:3000"
- "9093:9093"
Once these two files are there, go to the folder containing them and run the service from docker-compose in interactive mode. Note that --service-ports is a very important option: it enables the port binding, which is disabled by default for docker-compose run.
docker-compose run --service-ports prometheus_demo
Once you run the command, you will see a bunch of output corresponding to the build steps in the Dockerfile. After all the steps are done, you will be dropped into an interactive shell inside the container. To check the running container built from the prometheus_demo service, use docker-compose ps or docker ps -a; you will see a single container that will run multiple services (Grafana, Prometheus, node_exporter, and Alertmanager) on the ports mentioned in the Dockerfile.
Now let's extract all the packages we downloaded in the Dockerfile.
tar xvf prometheus-2.22.2.linux-amd64.tar.gz
tar xvf node_exporter-0.18.1.linux-amd64.tar.gz
tar xvf grafana-7.3.3.linux-amd64.tar.gz
Similarly, you can extract other packages like alertmanager.
The next step is to move into the package directory. I am showing this with Prometheus; you can follow the same steps with the other packages.
MOVE TO DIRECTORY
cd prometheus-2.22.2.linux-amd64
Now you can run the Prometheus binary to get your Prometheus monitoring system running.
RUN
./prometheus
You will now see a message in the logs that the server is up and ready to receive web requests. With the server running at localhost:9090, you'll get the following Prometheus UI dashboard that you can now configure:
In the above Prometheus UI Dashboard we're monitoring the Docker Containers.
Now you can connect your Prometheus server to node_exporter (an exporter that reports a Linux server's health by exposing a /metrics endpoint to the Data Retrieval Worker) by adding another job_name to your prometheus.yml file, as discussed in the prometheus.yml section. node_exporter will fetch metrics like MEMORY_USAGE_INFO or CPU_LOAD into your Prometheus dashboard, and you can also see node_exporter's /metrics HTTP endpoint at localhost:9100.
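For example, the scrape_configs section of prometheus.yml could gain a second job for node_exporter (the job name is up to you):

```yaml
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
  # Add node_exporter as a second scrape target
  - job_name: node
    static_configs:
      - targets: ['localhost:9100']
```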
Connect To Grafana
You can set up Grafana by following the steps below:
cd grafana-7.3.3.linux-amd64
cd bin
./grafana-server
You will find your Grafana server running at localhost:3000. Now you can fetch the metrics and see beautiful visualizations in the Grafana dashboard:
- Monitor Prometheus Health
- Monitor HOST Machine Health
Thanks for Reading!
In the next blog, we will discuss more advanced visualization with Grafana and also monitor Kubernetes (K8s) using Prometheus!
Connect with Me
- LinkedIn
- Shoot me a Mail