“Houston, we have a problem.”
You hear this statement filtering through from a client, except replace “Houston” with your name and “problem” with “bug”. As you open your logs in fear, you think back to all those times when, while writing code, you said…
“Ah, I’ll write some logs for that later”
“Let’s just fire the full response object into this log, I’m sure there’ll be something useful in it”
“I’ll just log ‘Failure’ here and that should be enough to know something's up”
…a bead of sweat appears on your forehead and you frown angrily, thinking of all those articles you skipped over discussing logging and its extreme importance to a system. (distant screams)
Fear not, you wake up from your nightmare and rush to your computer to begin reading up on every article discussing logging (while also continually being frustrated by websites about tree cutting) and land here - a paradise of logging advice.
Why log?
Well, hopefully my box-office-breaking movie script above gives you a taste of the importance of not only “Why log?” but also “Why log correctly?”. Here are a few reasons you will often come across for why logging is beneficial:
Analysing errors - A simple one, you have an issue in your application and you need to understand what’s going on. Logging should provide context and traceability of a user/flow of data in a system and allow you to pinpoint what and where something may have gone wrong and action it.
Alerting issues - This is not the logs themselves but the ability to track these logs as metrics and perform actions upon log patterns whether erroneous or not. These can be simple remediation actions or P0 alerts waking up your support engineer in the night informing them the application is imploding.
Assessing threats - Similar to alerting issues, you can assess potential threats to your system by analysing logs and the metrics they produce before the threats affect other services. Additionally, you can produce reports on logs to trace potential bottlenecks or high volume areas that need improving before they fail.
Amplifying business intelligence - I’m stretching it on the “A” theme here, but logs do help provide business insights to the developers and, more importantly, the client, which is vital. With logging they can see popular areas of their application or potential downfalls causing frustration among customers.
These are just a few of the benefits of logging in your application. You might be thinking “OK then, let’s just log like crazy and make sure we have everything outputted to its fullest”, but now the pendulum swings the other way. You are being overwhelmed with bloated, useless information and all you want to know is...what’s going on! So it’s important to also think about what to log and not just why log.
What to log?
This is one that can be so hard to get right, and it’s best to put in as much thought as you can when writing the logs. You really want to ensure your future self or colleague is not left baffled by what a log means when it is read as part of a debugging session, and that it’s not filled with blinding information distracting them from the real issue.
You want to give your logs meaning and context: a reader should be able to understand pretty quickly what the log is saying and where it sits in relation to a particular process in your application.
For example, this log has a little bit of meaning but not much context:
"Failed to find user"
You’ll read this and understand your application failed to find a user but you’re left thinking:
- What user was it looking for?
- Where was it looking for this user?
- What part of the system was doing the looking which failed?
An improved log should be much clearer - it should provide meaning (i.e. the user ID of the item that was not found) but also context (that the lookup was in the User table specifically). See below:
"Failed to find item in 'User' table for user with id: [user-id-1]"
Additionally, as covered later in this blog under “Structured vs Unstructured Logs”, we can enhance these logs even more to include additional context such as session IDs, timestamps or even the exact function being invoked.
The key point to remember here is to ensure there is meaning and context behind your logs: they need to be readable quickly and understood immediately. You don’t want someone guessing what the log means or what it relates to in the application; it needs to be clear and contain all the information required to successfully debug.
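To make that concrete, here’s a minimal sketch in Python (the table, function and IDs are hypothetical stand-ins) showing how such a log might be written in code:

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Hypothetical in-memory stand-in for a real 'User' table
USER_TABLE = {"user-id-2": {"name": "Ada"}}

def find_user(user_id):
    user = USER_TABLE.get(user_id)
    if user is None:
        # Meaning (what failed) plus context (where, and for which ID)
        logger.error("Failed to find item in 'User' table for user with id: [%s]", user_id)
    return user

find_user("user-id-1")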
What not to log?
This is where it can get scary. You might have followed the process very easily for “What to log?” and included all the relevant information in your new log, providing both meaning and context. But you need to be extremely careful here, as there is information you definitely don’t want revealed in your logs - some of it is obvious, and some you really need to watch out for. Here are some examples:
Tokens/Secrets - This should hopefully be obvious, but watching out for passwords, tokens and secrets is critical to ensuring you don’t easily reveal a way into your application to the outside world if they are intercepted.
PII - Personally Identifiable Information could easily have a blog post of its own. There are many different categories of identifiable information which, if revealed together or pieced together from different logs, can cause extreme damage to a person. Some of these include:
- Unique ID numbers, e.g. passport number
- Bank account information
- Medical records
- Full name
- Phone number
- Date of Birth
- IP address
Information the user has asked you not to track - This could be anything which the user has explicitly opted out of having recorded, or for which the consent to collect has expired.
Note: Sometimes sensitive information may be required in a log, but this should be a rare occasion and very carefully considered when done. You can also apply some form of masking, hashing or encryption to ensure the information is not revealed in full (see the sketch below).
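As a rough sketch of the masking idea in Python (using a deliberately simple email regex - real PII detection needs far more than this), you could attach a logging filter that redacts sensitive-looking values before the log is written:

import logging
import re

EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

class MaskingFilter(logging.Filter):
    # Redact anything that looks like an email address from the message
    def filter(self, record):
        record.msg = EMAIL_PATTERN.sub("[REDACTED]", str(record.msg))
        return True

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
logger.addFilter(MaskingFilter())

logger.info("Sending welcome email to ada@example.com")
# Logs: Sending welcome email to [REDACTED]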
I’d recommend having a look at the OWASP Cheat Sheet Series for logging to find an extensive list of what to watch out for when writing your logs.
There are also some tools available to help with this; for example, AWS provides automatic detection, masking and alerting for sensitive information when using their CloudWatch Logs service:
Help protect sensitive log data with masking - AWS Cloudwatch Logs
How to log?
So you’ve got the reason for why you are logging and you’ve got the contents of what you are logging, but how best do you log to get the information you want in a clear and usable format? Here are a couple of key concepts to consider when logging.
Log Levels
Log levels are common in most logging applications and serve as a filter for you or your system, allowing you to easily sort through logs and act on particular log levels, focussing on ones with a higher priority or more relevance to your issue. The most common log levels that you will come across are TRACE, DEBUG, INFO, WARN, ERROR & FATAL, and since TRACE/FATAL are used in very few cases you’ll mostly be using the four in the middle. There’s a decent breakdown provided here which I have mainly used to form the following log level rule of thumb:
TRACE - Used for extremely detailed and potentially high-volume logs. Useful if you want complete visibility of code execution for extended debugging and, if used, should only be turned on for a small period of time due to the sheer amount of log data it would produce (which can negatively affect application performance and your wallet). This would be nearly line-by-line or decision-by-decision logging, which would be excessive for basic debugging but useful for finding a very specific occurrence or issue in the code.
DEBUG - Messages that are helpful in tracking flow through a system but not important enough to make it into INFO. Useful for Development & QA phases and will be diagnostically helpful in finding issues through logging interesting events, decision points and entry/exits of non-trivial methods. DEBUG logs are generally not enabled in Production due to the volume of logs and should only be enabled if required to do a small period of debugging and then disabled again. These could include database queries, external API call request/responses and logging of configuration values.
INFO - Stepping up from DEBUG, these are informative events which are generally useful in production; you would usually ignore them in normal circumstances but lean on them when you want to analyse an issue with some context. If your event is a common “normal” occurrence, then in production logs it’s likely it would be INFO. This would include system lifecycle events (e.g. start/stop), session lifecycle events (e.g. login/logout) and significant boundary events (e.g. database calls, remote API calls).
WARN - An unexpected technical or business event, but with no immediate human intervention required. It’s often still a log we would like to investigate, but not urgently, and it shouldn’t be causing major issues to customers. Likely the application can automatically recover, and you would just like this warning logged for later tracing, investigation or metric analysis.
ERROR - This is a log to signal the system is in distress. An unexpected occurrence in which customers are likely being affected, and you will want to alert someone urgently to begin investigating the issue and intervene with a manual fix. It’s often an error which affects a particular operation rather than the whole system, e.g. missing data or an external API not responding in time.
FATAL - Big uh oh. This is rarely used but is reserved for large application failures in which the whole system doesn’t fulfil its intended functionality, a major service is completely unavailable, or there is going to be major data corruption or loss. You might use this to alert systems to shut down applications to prevent more damage occurring.
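To make the levels concrete, here’s a minimal sketch using Python’s standard logging module (note Python has no TRACE level and names FATAL as CRITICAL). With the threshold set to INFO, the DEBUG line is filtered out and everything else is emitted:

import logging

# Only records at INFO and above pass this threshold
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("payments")

logger.debug("Entering calculate_fees()")  # filtered out
logger.info("Session started for user [user-id-1]")
logger.warning("Fee lookup slow, fell back to cached value")
logger.error("Payment provider did not respond in time")
logger.critical("Database unreachable, shutting down")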
Structured vs Unstructured Logs
Now you’ve got your level of log, you’ll need to decide how you want the information to be presented within it. You’ve got two main options: structured or unstructured. You can probably already make a good guess at the difference between these, but there are key points to clarify. I’ll start with what the industry recommends you not do, and then work my way to the best practices.
Unstructured Logs
Unstructured logs are usually human-readable plain text with no consistent or predetermined message format. They are difficult to treat as data sets and are not easily queried or searched, and since they are designed for a human to read, a computer of course struggles to understand them without some form of Natural Language Processing - and that just gets complex.
Here’s an example of an unstructured log:
2023-09-01T08:36:00Z INFO Successfully did big maths with debited amount: [£45.00]
They really aren’t the best, and that’s the honest truth. Ultimately they are the plainest form of logging, often fired into your code at first to make following the flow of data easy and to perform debugging in a small capacity. Once the application grows or the team scales, the logs become a little messier, more inconsistent and start to look real scary when an issue occurs…are you getting the point?
Quickly moving on…
Structured Logs
Structured logging is basically the opposite of the above and much better to use. Structured logs are formed with a consistent and predetermined message format that can be queried much more quickly and easily by a computer. By using structured logging you are making sure you can utilise the large number of logs you may be producing to form key metrics, quickly follow traces through logs, or simply ensure you have a consistent form with all the information you need in every log produced.
With structured logs you can standardise certain data points within them, such as the timestamp, actor IP address, session ID, log level, function name, message etc., which makes querying so much easier across all your logs. The message attribute is often the more human-readable part which we can get the most context from (this kind of trumps unstructured logging, as you have your human-readable message in here amongst the structured format):
{
"timestamp": "2023-09-01T08:36:00Z"
"level": "INFO",
"sessionId": "9ccd503c-973f-4f17-bce9-3431875cd94a",
"functionName": "BigMathFunction",
"message": "Successfully did big maths with debited amount: [£45.00]"
}
You can see in this example above that the structured format of the attributes means you can easily query with the right tools, sorting and filtering exactly what you need to debug effectively. With a predefined structure you have consistency, and with consistency you have efficiency - simply marvellous.
Ultimately, the industry recommends against plain old unstructured logging from the start. Get your predetermined message format defined at the beginning of your project, include what you think you’ll need as the application expands, and keep it consistent. There are many good standard logging packages out there that will make this a lot easier for you too. Your future self will thank you a million times over.
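There are plenty of packages that do this for you, but as a minimal sketch of the idea using only Python’s standard library (the attribute names mirror the example above and are just illustrative), a custom formatter can emit each record as one JSON object per line:

import json
import logging

class JsonFormatter(logging.Formatter):
    # Render each log record as a single JSON object
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "functionName": record.funcName,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger(__name__).info("Successfully did big maths with debited amount: [£45.00]")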
Securing logs & using logs for security
So now you’re logging, but you want to make sure those logs are securely stored and transmitted, while also being useful in maintaining and monitoring the security of your application. It’s something that is of critical importance to any application and should be considered heavily when adding logs to your system. Similarly to some other sections in this blog, security could easily be its own blog post, but I will briefly cover some key concepts which are good to keep in your head and which should prompt further investigation when implementing logs yourself in an application.
Securing logs
Log Protection
Protection of logs is extremely important, especially as they may contain required sensitive information, application source code or valuable company data that would be extremely useful to a bad actor. This can be helped with the masking or encryption mentioned earlier in the “What not to log?” section, giving you the initial assurance that even if logs are leaked in a worst-case scenario, there is no sensitive information included.
But you of course still want good log protection and here are some considerations you should make:
- Ensure logs have tamper protection so you can identify if logs have been altered or removed maliciously
- Access to view logs should be monitored by auditing user access events, which helps prevent or alert on access by bad actors
- Ensure logs are sent securely when transmitted outside of your own network
- Consider Denial of Service (DoS) protection, as logs may be subject to floods of information from an attacker trying to cover their tracks or severely cripple an application with log overload
Log Retention
This is hopefully simple enough to understand: ensure that your logs are only retained for the specified time required, which is often set by the client as a legal or regulatory amount. So ensure the logs are retained for that length of time and deleted once the time has been reached (no earlier, no later).
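If you’re on AWS, for example, this can be enforced per log group. A minimal sketch with boto3 (the log group name and 90-day period are hypothetical) might look like:

import boto3

logs = boto3.client("logs")

# Hypothetical client requirement: keep logs for exactly 90 days,
# after which CloudWatch deletes them automatically
logs.put_retention_policy(
    logGroupName="/my-app/production",  # hypothetical log group
    retentionInDays=90,
)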
Logs for security
As well as ensuring your logs are secure, you will also want to use logs for security, tracking not only customer flows but developer activity too. It’s important to log changes of settings, movements and actions throughout the application made by owners/developers, and to track their access to logs and sensitive information.
There are some good guidelines available on what you should be logging for security, including setting changes and owner/developer changes.
Logging in a Distributed System
I’ve decided to diverge a little from the basics of logging to briefly cover logging in a distributed system, as there are some key points here which are important to think about when deciding how to log in your application.
If you remember, above I talked about the importance of context in “What to log?”; this becomes ever more important in distributed systems. Within a distributed system, a flow completed from start to finish will likely pass through multiple microservices, each producing its own logs but only logging within the context of itself (as that’s all it knows about!). The logs may be clear for a single microservice, but if you want to analyse the complete flow at a higher level it becomes incredibly difficult to link these logs together and understand them as a whole. To help with this, you can use a Correlation ID.
Correlation IDs
A Correlation ID is a unique identifier that can be generated and added to the first request going into your system as part of a flow, and subsequently passed along to each microservice within the flow. The Correlation ID can then be placed into each log produced from that flow, no matter the microservice. This provides a unique link between all of the events logged, allowing for that high-level understanding of a flow within your system previously mentioned - now you’ve got context.
Correlation IDs also back the importance of using “Structured Logs”; with this ID placed in every related log in a flow, as long as the logs are structured, querying can be relatively simple using that specific ID.
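As a minimal sketch of the idea within a single Python service (assuming inbound requests carry an X-Correlation-ID header, a common but not universal convention), you can hold the ID in a context variable and have a logging filter stamp it onto every record:

import contextvars
import logging
import uuid

correlation_id = contextvars.ContextVar("correlation_id", default="-")

class CorrelationFilter(logging.Filter):
    # Stamp the current flow's Correlation ID onto every record
    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True

logging.basicConfig(level=logging.INFO, format="%(levelname)s [%(correlation_id)s] %(message)s")
logger = logging.getLogger(__name__)
logger.addFilter(CorrelationFilter())

def handle_request(headers):
    # Reuse the inbound ID if present, otherwise start a new flow
    correlation_id.set(headers.get("X-Correlation-ID", str(uuid.uuid4())))
    logger.info("Processing payment request")
    # ...and pass correlation_id.get() along in calls to downstream services

handle_request({"X-Correlation-ID": "9ccd503c-973f-4f17-bce9-3431875cd94a"})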
Distributed logging products
Some examples of products which can help you implement distributed logging and tracing include OpenTelemetry, Jaeger, Zipkin and AWS X-Ray. With these tools you can not only trace and view the logs of a particular request, but you can also hook into observability tooling to give a better picture of performance within your application.
Logging vs Metrics
A brief area to finish on is the difference between logs and metrics. You’ll have noticed both mentioned already in this blog, and it can often be difficult to know which to use in particular scenarios and how they are beneficial to you and your application.
Logs
If we take the case of having some structured logs being produced from your application, then in most cases there is going to be a significantly high volume of these. They can be read by some form of filtering or querying solution and are extremely useful in debugging error cases or following transactional flows within your application. So your main use cases here are (with a small query sketch after the list):
- One-off querying of logs to investigate a particular situation, e.g. an error case
- Following a user or data flow right through your application
- Identifying anomalies to understand potential bottlenecks
- Identifying potential security threats
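As that small query sketch, here’s a hypothetical one-off investigation in Python (assuming your structured logs are written one JSON object per line to a file called app.log): pulling every ERROR for a single session:

import json

SESSION = "9ccd503c-973f-4f17-bce9-3431875cd94a"

# One-off query: every ERROR message for one session
with open("app.log") as logfile:
    for line in logfile:
        record = json.loads(line)
        if record["level"] == "ERROR" and record["sessionId"] == SESSION:
            print(record["timestamp"], record["message"])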
You may though have some regular searches you want to perform on particular logs, whether that be for producing alerts to identify any issues or monitoring system performance at a higher level - this can be where metrics come in.
Metrics
Using metrics can be extremely useful when you want to provide high observability of a system based on planned events occurring over time. Metrics are numbers generated based on particular patterns of events which can be used to monitor performance of a system but also then alert based on particular metric thresholds being hit by your application.
You have millions of metrics available to you based on logs produced from your application and it’s important to pick the right ones that benefit you and your application.
Some key use cases for metrics would be (with a toy sketch after the list):
- Monitoring health and performance of a particular component or the system as a whole
- Automated alerting based on metrics
- High-level observability and key metric gathering for clients
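And here’s the toy sketch mentioned above (a real system would lean on something like CloudWatch metric filters or a time-series database rather than a script): deriving an errors-per-minute metric from the same structured logs and checking it against an alert threshold:

import json
from collections import Counter

ALERT_THRESHOLD = 5  # hypothetical errors-per-minute limit

errors_per_minute = Counter()
with open("app.log") as logfile:
    for line in logfile:
        record = json.loads(line)
        if record["level"] == "ERROR":
            # Truncate "2023-09-01T08:36:00Z" to its minute bucket
            errors_per_minute[record["timestamp"][:16]] += 1

for minute, count in sorted(errors_per_minute.items()):
    if count > ALERT_THRESHOLD:
        print(f"ALERT: {count} errors in minute {minute}")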
So what do we use?
As briefly explained above you’ll hopefully see that Logs and Metrics have different use cases within an application and are both incredibly important to the observability of a system. So ultimately it’s a mix of both to get the best out of your application. Use metrics to monitor performance, specific events and assist in alerting - then have logs for one off debugging sessions and following a flow right through your system.
There’s a lot to think about regarding both, hence why it’s stressed here and in many other articles that logging is of such high importance to an application - don’t let it fall through the cracks.
“Houston, we have a problem.”
You hear this statement filtering through from a client, except instead of waking up from your nightmare sweating and hyperventilating, your brain reminds you that, after reading those many articles about the importance of logging, you said…
“Yes, let’s write that log now, I have the context so I’ll get it in before I forget”
“I’ll take only what I think I need from this response object and include it in the wonderfully structured and easily queryable log”
“I’ll not only log ‘Failure’ here but include the relevant Correlation ID, timestamp, usefully human readable error message and anything I think might be useful for the support engineer reading the logs at 2am to gain the full meaning and context of what they are looking at”
…and you drift peacefully back to sleep.