Introduction
In Kubernetes, monitoring node health is crucial for maintaining a reliable cluster. While Kubernetes provides built-in node conditions, these basic health checks might not be sufficient for production environments. This is where Node Problem Detector (NPD) comes in, extending the default monitoring capabilities with rich system-level problem detection.
This article covers the features and benefits of NPD, showing how it goes beyond the default Kubernetes node health monitoring to proactively detect and surface potential node issues.
Default Kubernetes Node Conditions
By default, Kubernetes nodes come with several built-in conditions that provide basic health information about the nodes in the cluster.
These conditions are:
- Ready: Is the node healthy and able to accept pods?
- MemoryPressure: Is the node running low on memory?
- DiskPressure: Is the node running low on disk space?
- PIDPressure: Is the node overloaded with too many processes?
- NetworkUnavailable: Are network configurations causing connectivity issues?
Each condition is represented by status indicators that describe the current health or operational state of a node. There are three possible statuses:
- True: The condition is currently happening. For instance, if MemoryPressure is True, it means the node is experiencing memory pressure at the moment.
- False: The condition is not happening. For example, if DiskPressure is False, the node has sufficient disk space.
- Unknown: The system cannot determine the status of the condition, often due to a lack of communication or incomplete data from the node.
These conditions can be viewed using the command:
kubectl describe node <node-name>
The output includes a Conditions table that lists each node condition along with its status:
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Mon, 13 Jan 2025 21:19:43 +0100   Sun, 01 Dec 2024 01:03:13 +0100   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Mon, 13 Jan 2025 21:19:43 +0100   Sun, 01 Dec 2024 01:03:13 +0100   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Mon, 13 Jan 2025 21:19:43 +0100   Sun, 01 Dec 2024 01:03:13 +0100   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Mon, 13 Jan 2025 21:19:43 +0100   Sun, 01 Dec 2024 01:03:33 +0100   KubeletReady                 kubelet is posting ready status
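For scripting, the same conditions can be read in a compact form and filtered for anything unhealthy. The following is a minimal sketch, not a definitive tool: the helper name unhealthy_conditions is hypothetical, and it assumes that Ready is healthy when True while every other condition is healthy when False.

```shell
# Print every condition whose status deviates from the healthy value.
# Input: one "Type=Status" pair per line on stdin.
# Assumption: Ready is healthy when True; all other conditions when False.
unhealthy_conditions() {
  awk -F= '
    NF < 2                         { next }         # skip blank or malformed lines
    $1 == "Ready" && $2 != "True"  { print; next }  # Ready must be True
    $1 != "Ready" && $2 != "False" { print }        # pressure-style conditions must be False
  '
}

# Usage against a live cluster (needs kubectl access; replace <node-name>):
#   kubectl get node <node-name> \
#     -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' \
#     | unhealthy_conditions
```

An empty result means every condition is at its healthy value. An Unknown status is printed as well, since it is neither True nor False.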
Based on these statuses, Kubernetes adds taints that match the condition affecting the node (for example, node.kubernetes.io/memory-pressure when MemoryPressure is True), which keeps new pods from being scheduled onto it. While these default conditions offer a quick glimpse into a node’s health, they may miss deeper, system-level issues. This is where Node Problem Detector steps in to fill the gap.
Node Problem Detector: Enhanced Node Monitoring
Node Problem Detector extends Kubernetes' native node monitoring capabilities by detecting and reporting various system-level issues. It runs as a daemon on the node, detects node problems, and reports them to the apiserver.
How Node Problem Detector Works
The problem daemon is the core component that monitors and detects node problems. Its function is to identify and report specific node problems to the node problem detector. NPD supports several types of problem daemons:
- SystemLogMonitor: Watches system logs (journald, syslog, etc.) for predefined patterns and reports problems and metrics accordingly. The node conditions reported by this daemon are:
  - KernelDeadlock
  - ReadonlyFilesystem
  - FrequentDockerRestart
  - FrequentKubeletRestart
  - FrequentContainerdRestart
- CustomPluginMonitor: Executes custom scripts for specific problem detection.
- HealthChecker: Performs periodic health checks. The node conditions reported by this daemon are KubeletUnhealthy and ContainerRuntimeUnhealthy.
Upon detection of problems, NPD makes the problem visible to the Kubernetes management stack through the apiserver. Problems are reported as NodeCondition (if it is a permanent problem that will make the node unavailable for pod scheduling) or Event (if it is a temporary problem that has limited impact).
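Both kinds of report can be inspected with kubectl once NPD is running. The sketch below assumes kubectl access to the cluster. The helper node_event_selector is a hypothetical convenience function, not part of kubectl or NPD; it only builds the field selector string.

```shell
# Build the field selector that restricts `kubectl get events` to one node.
# node_event_selector is a hypothetical helper, not part of kubectl or NPD.
node_event_selector() {
  printf 'involvedObject.kind=Node,involvedObject.name=%s\n' "$1"
}

# Usage (needs cluster access; replace <node-name>):
#   kubectl get events --all-namespaces \
#     --field-selector "$(node_event_selector <node-name>)"
#   kubectl describe node <node-name>   # NodeConditions appear under "Conditions:"
```

Temporary problems should show up in the event list, while permanent problems appear as extra rows in the node's Conditions table.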
Deploying Node Problem Detector
Method 1: Using Helm
NPD can be deployed with the Node Problem Detector Helm chart published in the Delivery Hero chart repository:
helm repo add deliveryhero https://charts.deliveryhero.io/
helm install node-problem-detector deliveryhero/node-problem-detector \
--namespace kube-system
Method 2: As a System Service
For environments without DaemonSet support, NPD can run as a system service. To achieve this:
- Download the Node Problem Detector binaries.
- Create a systemd service file.
- Enable the service using systemd commands.
- Start the Node Problem Detector service.
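The steps above can be sketched as follows. This is an illustration under stated assumptions, not a tested deployment: the helper render_npd_unit is hypothetical, and the binary path, config path, and API server URL are placeholders you would replace with your own values. The flags used (--logtostderr, --config.system-log-monitor, --apiserver-override) are NPD command-line options. When NPD runs outside the cluster, the override URL carries inClusterConfig=false.

```shell
# Print a systemd unit for NPD to stdout (hypothetical helper).
# $1: path to the node-problem-detector binary (assumed install location)
# $2: path to a system-log-monitor config file (assumed)
# $3: API server URL passed to --apiserver-override (assumed)
render_npd_unit() {
  cat <<EOF
[Unit]
Description=Node Problem Detector
Wants=network-online.target
After=network-online.target

[Service]
ExecStart=$1 --logtostderr --config.system-log-monitor=$2 --apiserver-override=$3
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF
}

# Usage (as root; every path and the URL are placeholders):
#   render_npd_unit /usr/local/bin/node-problem-detector \
#     /etc/node-problem-detector/kernel-monitor.json \
#     '<apiserver-url>?inClusterConfig=false' \
#     > /etc/systemd/system/node-problem-detector.service
#   systemctl daemon-reload
#   systemctl enable --now node-problem-detector
```

Here systemctl enable --now both enables the service at boot and starts it immediately, covering the last two steps in one command.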
Customizing Node Problem Detector
One of NPD’s standout features is its ability to adapt to your specific needs. By leveraging the CustomPluginMonitor problem daemon, you can define custom node conditions and rules to monitor exactly what matters most to your workloads.
1. Adding Custom Conditions and Detection Rules
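This example shows a custom plugin monitor configuration in JSON. It declares a custom condition, NTPProblem, together with the default reason and message reported while the check passes, and two rules that both run check_ntp.sh: a temporary rule, which produces an Event, and a permanent rule, which updates the NTPProblem condition when the script reports a failure.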
{
  "plugin": "custom",
  "pluginConfig": {
    "invoke_interval": "30s",
    "timeout": "5s",
    "max_output_length": 80,
    "concurrency": 3,
    "enable_message_change_based_condition_update": false
  },
  "source": "ntp-custom-plugin-monitor",
  "metricsReporting": true,
  "conditions": [
    {
      "type": "NTPProblem",
      "reason": "NTPIsUp",
      "message": "ntp service is up"
    }
  ],
  "rules": [
    {
      "type": "temporary",
      "reason": "NTPIsDown",
      "path": "./config/plugin/check_ntp.sh",
      "timeout": "3s"
    },
    {
      "type": "permanent",
      "condition": "NTPProblem",
      "reason": "NTPIsDown",
      "path": "./config/plugin/check_ntp.sh",
      "timeout": "3s"
    }
  ]
}
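Every rule points at a script through its path key, so a mistyped path is an easy mistake to make. As a rough pre-flight check, the sketch below lists the distinct paths in a config file and reports whether each one exists. The helper check_plugin_paths is hypothetical and is not part of NPD. It uses grep and sed rather than a JSON parser, so it assumes each "path" key and its value sit on one line and that paths contain no double quotes.

```shell
# Report whether each plugin path referenced in a custom-plugin config exists.
# check_plugin_paths is a hypothetical helper, not part of NPD.
# Relative paths are resolved against the current directory, so run it from
# the directory NPD is started in.
check_plugin_paths() {
  local config=$1 paths path status=0
  # Pull out the value of every "path" key and de-duplicate.
  paths=$(grep -o '"path"[[:space:]]*:[[:space:]]*"[^"]*"' "$config" \
            | sed 's/.*"\([^"]*\)"$/\1/' | sort -u)
  while IFS= read -r path; do
    [ -n "$path" ] || continue
    if [ -f "$path" ]; then
      echo "found: $path"
    else
      echo "missing: $path"
      status=1
    fi
  done <<EOF
$paths
EOF
  return $status   # 0 if every path exists, 1 otherwise
}

# Usage (the file name is a placeholder for wherever the JSON above is saved):
#   check_plugin_paths ./config/custom-plugin-monitor.json
```

The script must also be executable by the user NPD runs as; that could be checked by replacing -f with -x in the test.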
2. Writing Custom Plugin Script
The custom plugin script is the executable that performs the actual health check. NPD interprets its exit code (0 for healthy, 1 for a problem, any other value for unknown) and uses its standard output as the status message, so keep the output within the max_output_length set in the JSON file.
#!/bin/bash

readonly OK=0
readonly NONOK=1
readonly UNKNOWN=2

readonly SERVICE='ntp.service'

# Check systemd cmd present
if ! command -v systemctl >/dev/null; then
  echo "Could not find 'systemctl' - require systemd"
  exit $UNKNOWN
fi

# Return success if service active (i.e. running)
if systemctl -q is-active "$SERVICE"; then
  echo "$SERVICE is running"
  exit $OK
else
  # Does not differentiate stopped/failed service from non-existent
  echo "$SERVICE is not running"
  exit $NONOK
fi
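Before wiring a plugin into NPD, it can help to run it by hand and see how its result would be read. The sketch below follows the exit-code convention the script itself uses (0 for OK, 1 for NonOK, anything else for Unknown). The helper plugin_status is hypothetical, and it runs the plugin with bash, which is an assumption that fits the script above but not plugins written in other languages.

```shell
# Run a plugin and print the status its exit code maps to, plus its output.
# plugin_status is a hypothetical helper, not part of NPD.
# Assumption: the plugin is a bash script; stderr is discarded.
plugin_status() {
  local script=$1 output code=0
  output=$(bash "$script" 2>/dev/null) || code=$?
  case $code in
    0) echo "OK: $output" ;;
    1) echo "NonOK: $output" ;;
    *) echo "Unknown: $output" ;;
  esac
}

# Usage (path taken from the JSON rules; adjust to where the script lives):
#   plugin_status ./config/plugin/check_ntp.sh
```

Running it on a node where the service is stopped should print a NonOK line carrying the script's message, which is the text that would end up in the Event or condition.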
Conclusion
Node Problem Detector is more than a monitoring tool: it is a safety net for your Kubernetes clusters. By extending the default node conditions with system-level checks and custom plugins, NPD helps you catch node problems early and keep the cluster highly available.
Embrace NPD and take a proactive approach to node monitoring in your Kubernetes journey!