Hritik Raj

Learning by Doing: Building a System Health Script for the DevOps SRE Challenge

Day 1: Diving into System Health Checks with a Menu-Driven Script!

Hey DevOps enthusiasts! 👋

I'm super excited to kick off the DevOps SRE Daily Challenge, and today's task was a fantastic introduction to system health checks. We were tasked with creating a menu-driven script to monitor disk, services, memory, and CPU usage, and then send a report via email every four hours.

Here's a breakdown of my experience and what I learned:

The Challenge:

The goal was to build a script that:

  • Provides a user-friendly menu for selecting system health checks.
  • Performs disk usage, running services, memory usage, and CPU usage checks.
  • Sends a comprehensive report via email every four hours.
  • Includes exception handling and debugging features.
  • Is well-documented for beginners.

My Approach:

I decided to use Python for this challenge, as it's versatile and has excellent libraries for system monitoring and email sending.

  1. Menu Implementation: I used a simple while loop and input() to create the menu, allowing users to select the desired health check.
  2. System Monitoring: I utilized the psutil library to gather system information (disk, memory, CPU) and the systemctl command for service checks.
  3. Email Reporting: I used the smtplib module to send email reports, and the datetime module to detect the four-hour marks at which a report is due.
  4. Exception Handling: I implemented try-except blocks to handle potential errors, such as incorrect user input or failed service checks.
  5. Debugging: I added print statements and logging to track the script's execution and identify any issues.
  6. Scheduling: I used a while loop with time.sleep to schedule the email reports.
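
The menu and input-handling steps above can also be written as a lookup table instead of a chain of if/elif branches. This is a minimal sketch, not the script's actual structure: MENU, handle_choice, and the two placeholder check functions are hypothetical names used only so the example runs on its own.

```python
from typing import Callable, Dict, Tuple

# Hypothetical stand-ins for the real check functions, so the sketch runs alone.
def fake_disk_check() -> str:
    return "Disk Usage: (placeholder)\n"

def fake_cpu_check() -> str:
    return "CPU Usage: (placeholder)\n"

# Map each menu key to a (label, function) pair.
MENU: Dict[str, Tuple[str, Callable[[], str]]] = {
    "1": ("Check Disk Usage", fake_disk_check),
    "2": ("Check CPU Usage", fake_cpu_check),
}

def handle_choice(choice: str) -> str:
    """Return the output of the selected check, or an error message."""
    entry = MENU.get(choice.strip())
    if entry is None:
        return "Invalid choice. Please try again."
    label, action = entry
    try:
        return action()
    except Exception as exc:  # keep the menu alive if a check fails
        return f"Error running '{label}': {exc}"
```

With this shape, adding a new check means adding one entry to the table, and invalid input is handled in a single place.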

Key Learnings:

  • psutil is a powerful library: I was impressed by how easy it was to gather system information using psutil.
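  • systemctl is very useful: Checking the status of services from a script with systemctl is handy. Note that systemctl is-active exits with a non-zero code when a service is not active, so the script should read the printed state rather than treat that as a failure.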
  • Email automation is essential: Automating email reports is crucial for proactive system monitoring.
  • Exception handling is vital: Robust error handling ensures that the script continues to run even when unexpected issues occur.
  • Debugging is your friend: Taking the time to add debugging statements can save you a lot of time in the long run.
  • Scheduling is simple: time.sleep is an easy way to schedule tasks, though it blocks the script while it waits.
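
Because time.sleep only waits for a fixed duration, small delays can add up and push a report off the four-hour mark. One way around that is to compute how long to sleep until the next slot. This is a minimal sketch, assuming reports should land on hours divisible by four (00:00, 04:00, and so on). The helper name seconds_until_next_slot is hypothetical, and it works on naive local time, so daylight-saving shifts are ignored.

```python
import datetime

def seconds_until_next_slot(now: datetime.datetime, interval_hours: int = 4) -> float:
    """Seconds from `now` until the next hour that is a multiple of interval_hours.

    Assumes interval_hours divides 24, so slots line up across midnight.
    """
    # Start of the current hour.
    top_of_hour = now.replace(minute=0, second=0, microsecond=0)
    # Hours to add to reach the next multiple; a full interval if we are exactly on one.
    hours_ahead = interval_hours - (now.hour % interval_hours)
    next_slot = top_of_hour + datetime.timedelta(hours=hours_ahead)
    return (next_slot - now).total_seconds()
```

A scheduling loop could then call `time.sleep(seconds_until_next_slot(datetime.datetime.now()))` before each report instead of polling every minute.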

Challenges Faced:

  • Initially, I struggled with the email sending part, but I found some great resources online that helped me resolve the issues.
  • Scheduling the emails to run every 4 hours took some testing to get right.
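
For the email step, one thing that can make troubleshooting easier is building the message separately from sending it, so the headers and body can be inspected without touching an SMTP server. This is a minimal sketch using the standard library's email.message.EmailMessage. build_report_message and the example addresses are hypothetical, not part of the script below.

```python
from email.message import EmailMessage

def build_report_message(report_text: str, sender: str, receiver: str) -> EmailMessage:
    """Build the report email without sending it."""
    message = EmailMessage()
    message["Subject"] = "System Health Report"
    message["From"] = sender
    message["To"] = receiver
    message.set_content(report_text)  # plain-text body
    return message

if __name__ == "__main__":
    # Print the full message (headers and body) for a quick visual check.
    print(build_report_message("CPU Usage: 5%\n", "a@example.com", "b@example.com"))
```

The resulting object can then be passed to `server.send_message(message)` inside an `smtplib.SMTP_SSL` block.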

Code Snippet (Illustrative):

import psutil
import subprocess
import smtplib
from email.mime.text import MIMEText
import time
import datetime
import logging
import os

# Configure logging
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(levelname)s - %(message)s')

def check_disk_usage():
    try:
        disk_usage = psutil.disk_usage('/')
        report = f"Disk Usage:\nTotal: {disk_usage.total / (1024**3):.2f} GB\nUsed: {disk_usage.used / (1024**3):.2f} GB\nFree: {disk_usage.free / (1024**3):.2f} GB\nPercentage: {disk_usage.percent}%\n"
        logging.debug(report)
        return report
    except Exception as e:
        logging.error(f"Error checking disk usage: {e}")
        return f"Error checking disk usage: {e}\n"

def monitor_services():
    try:
        services = ["sshd", "nginx", "crond"]  # Example services
        report = "Service Status:\n"
        for service in services:
            try:
                # "systemctl is-active" exits non-zero for inactive services,
                # so don't pass check=True; read the printed state instead.
                result = subprocess.run(["systemctl", "is-active", service], capture_output=True, text=True)
                if result.stdout.strip() == "active":
                    report += f"{service}: Running\n"
                else:
                    report += f"{service}: Stopped\n"
            except OSError as e:  # e.g. systemctl is not installed on this host
                report += f"{service}: Error checking status ({e})\n"
        logging.debug(report)
        return report
    except Exception as e:
        logging.error(f"Error monitoring services: {e}")
        return f"Error monitoring services: {e}\n"

def check_memory_usage():
    try:
        memory = psutil.virtual_memory()
        report = f"Memory Usage:\nTotal: {memory.total / (1024**3):.2f} GB\nAvailable: {memory.available / (1024**3):.2f} GB\nUsed: {memory.used / (1024**3):.2f} GB\nPercentage: {memory.percent}%\n"
        logging.debug(report)
        return report
    except Exception as e:
        logging.error(f"Error checking memory usage: {e}")
        return f"Error checking memory usage: {e}\n"

def check_cpu_usage():
    try:
        cpu_percent = psutil.cpu_percent(interval=1)
        report = f"CPU Usage: {cpu_percent}%\n"
        logging.debug(report)
        return report
    except Exception as e:
        logging.error(f"Error checking CPU usage: {e}")
        return f"Error checking CPU usage: {e}\n"

def send_email_report(report_text):
    try:
        sender_email = os.environ.get("SENDER_EMAIL") # use environment variables for sensitive data.
        sender_password = os.environ.get("SENDER_PASSWORD")
        receiver_email = os.environ.get("RECEIVER_EMAIL")

        if not all([sender_email, sender_password, receiver_email]):
            logging.error("Email credentials not set in environment variables.")
            return

        message = MIMEText(report_text)
        message["Subject"] = "System Health Report"
        message["From"] = sender_email
        message["To"] = receiver_email

        with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
            server.login(sender_email, sender_password)
            server.sendmail(sender_email, receiver_email, message.as_string())
        logging.info("Email report sent successfully.")

    except Exception as e:
        logging.error(f"Error sending email report: {e}")

def main():
    while True:
        print("\nSystem Health Check Menu:")
        print("1. Check Disk Usage")
        print("2. Monitor Running Services")
        print("3. Check Memory Usage")
        print("4. Check CPU Usage")
        print("5. Send Report Now")
        print("6. Exit")

        choice = input("Enter your choice: ")

        if choice == "1":
            print(check_disk_usage())
        elif choice == "2":
            print(monitor_services())
        elif choice == "3":
            print(check_memory_usage())
        elif choice == "4":
            print(check_cpu_usage())
        elif choice == "5":
            report = check_disk_usage() + monitor_services() + check_memory_usage() + check_cpu_usage()
            send_email_report(report)
        elif choice == "6":
            break
        else:
            print("Invalid choice. Please try again.")

    # Note: this scheduling loop only starts after the user picks "6. Exit" above.
    while True:
        now = datetime.datetime.now()
        if now.hour % 4 == 0 and now.minute == 0:
            report = check_disk_usage() + monitor_services() + check_memory_usage() + check_cpu_usage()
            send_email_report(report)
            time.sleep(60)  # leave the current minute so the report is sent only once
        else:
            time.sleep(30)  # poll twice a minute so minute 0 is not skipped

if __name__ == "__main__":
    main()
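
In the snippet above, the four-hour loop only starts once the menu loop ends, so reports are not sent while the menu is open. One possible way to run both is to start the periodic task in a daemon thread and keep a threading.Event to stop it. This is a sketch that assumes a background thread is acceptable for this script. start_periodic is a hypothetical helper, not part of the original code.

```python
import threading
from typing import Callable

def start_periodic(task: Callable[[], None], interval_seconds: float) -> threading.Event:
    """Run `task` every interval_seconds in a daemon thread; return a stop event."""
    stop = threading.Event()

    def loop() -> None:
        # Event.wait returns False on timeout (time to run the task)
        # and True once stop.set() has been called (time to quit).
        while not stop.wait(interval_seconds):
            try:
                task()
            except Exception as exc:  # keep the scheduler alive if one run fails
                print(f"Scheduled task failed: {exc}")

    threading.Thread(target=loop, daemon=True).start()
    return stop
```

A script could call `stop = start_periodic(send_full_report, 4 * 60 * 60)` before entering the menu loop and `stop.set()` on exit, where `send_full_report` stands for a hypothetical function that builds and emails the combined report. The first run happens one full interval after the call.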


Overall:

This challenge was a great way to reinforce my understanding of system monitoring and scripting. I'm looking forward to the challenges to come and continuing to learn and grow in the DevOps and SRE space!

What were your experiences with this challenge? Share your thoughts and code snippets in the comments below!

Remember to join the conversation using #getfitwithsagar, #SRELife, and #DevOpsForAll!

Happy coding! 🚀
