Introduction
Why AWS SageMaker?
Cyber threats are growing more sophisticated, and traditional rule-based security systems often fail to detect advanced attacks. Machine learning (ML) on AWS SageMaker provides a scalable, automated way to analyze large volumes of security logs and detect anomalies in real time. However, setting up an ML environment can be challenging, requiring:
- Powerful compute resources (GPUs, high-memory instances)
- Proper data storage and management
- Scalability for real-world applications
- Model deployment pipelines for making real-time predictions
Enter AWS SageMaker
AWS SageMaker is a fully managed service that simplifies ML by providing:
- Pre-configured environments – No need to install ML libraries manually.
- Built-in algorithms – Use optimized ML models like XGBoost, TensorFlow, and PyTorch.
- Scalability – Train on multiple GPUs or CPUs without managing infrastructure.
- Easy Deployment – Deploy models as APIs with a few clicks or lines of code.
- Seamless Integration – Works with S3 (for data), Lambda (for automation), and other AWS services.
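To make the "Easy Deployment" point concrete, here is a minimal sketch of the typical SageMaker Python SDK flow. Everything below (bucket names, data paths, algorithm choice) is placeholder illustration, not part of this tutorial's setup:
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # IAM role attached to the notebook

# Train a built-in algorithm, then expose it as a real-time HTTPS endpoint
image_uri = sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.7-1")
estimator = Estimator(image_uri=image_uri, role=role,
                      instance_count=1, instance_type="ml.m5.large",
                      output_path="s3://your-bucket/output")  # placeholder bucket
estimator.fit({"train": "s3://your-bucket/train.csv"})  # placeholder data path
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")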
With AWS SageMaker, security teams can:
✅ Identify suspicious network traffic and insider threats
✅ Detect malware patterns from system logs
✅ Predict potential security breaches before they occur
✅ Automate security response using real-time ML-based alerts
What We’ll Cover in This Guide
In this blog, we’ll take a hands-on approach to AWS SageMaker. We’ll walk through how to train and deploy an ML model on AWS SageMaker to detect cybersecurity threats. We’ll:
- Set up AWS SageMaker with a notebook instance 📂
- Train an anomaly detection model on network security data 🔍
- Deploy the model as an endpoint for real-time threat analysis 🚀
- Make predictions to classify normal vs. malicious network activity 🎯
By the end, you’ll have a fully operational cybersecurity threat detection model running in the cloud. Let’s get started!
Step 1: Set Up AWS SageMaker
1.1 Create a SageMaker Notebook Instance
1) Log in to the AWS Management Console and navigate to Amazon SageMaker AI.
2) In the left menu, select Notebook Instances → Click Create notebook instance.
3) Name your instance (e.g., cybersecurity-detection).
4) Choose an instance type (ml.t2.medium for free tier or ml.m5.large for better performance).
5) Create an IAM role with the AmazonSageMakerFullAccess and AmazonS3FullAccess policies attached.
6) Click Create and wait for the instance to be ready (InService).
7) Click Open Jupyter Lab to launch the notebook.
8) Create a new notebook in the JupyterLab view: from the File menu, choose New, then Notebook. Rename the file to cybersecurity_detection.ipynb.
9) For Select Kernel, choose conda_python3. This preinstalled environment includes the default Anaconda installation and Python 3.
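Before moving on, it's worth running a quick check in the first cell to confirm the SDK, region, and IAM role are all wired up:
import boto3
import sagemaker

# Confirm the environment that the later steps depend on
print("sagemaker SDK:", sagemaker.__version__)
print("region:", boto3.session.Session().region_name)
print("execution role:", sagemaker.get_execution_role())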
Step 2: Prepare Cybersecurity Data
For this example, we’ll use a network intrusion detection dataset (like NSL-KDD or CICIDS2017) that contains network traffic labeled as normal or malicious.
2.1 Download the Dataset
In your Jupyter Notebook, run:
import pandas as pd
# Load NSL-KDD dataset (training set)
url = "https://raw.githubusercontent.com/defcom17/NSL_KDD/master/KDDTrain+.txt"
df = pd.read_csv(url, header=None)
# Define column names based on NSL-KDD documentation
column_names = [
"duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes", "land",
"wrong_fragment", "urgent", "hot", "num_failed_logins", "logged_in", "num_compromised",
"root_shell", "su_attempted", "num_root", "num_file_creations", "num_shells",
"num_access_files", "num_outbound_cmds", "is_host_login", "is_guest_login",
"count", "srv_count", "serror_rate", "srv_serror_rate", "rerror_rate",
"srv_rerror_rate", "same_srv_rate", "diff_srv_rate", "srv_diff_host_rate",
"dst_host_count", "dst_host_srv_count", "dst_host_same_srv_rate",
"dst_host_diff_srv_rate", "dst_host_same_src_port_rate", "dst_host_srv_diff_host_rate",
"dst_host_serror_rate", "dst_host_srv_serror_rate", "dst_host_rerror_rate",
"dst_host_srv_rerror_rate", "label", "difficulty_level"
]
# Assign column names
df.columns = column_names
# Display first few rows
df.head()
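As an optional sanity check before preprocessing, confirm the shape and label variety; KDDTrain+ should load as 125,973 rows and 43 columns, with a couple dozen distinct label values and no missing data:
# Quick sanity check on the loaded frame
print("shape:", df.shape)
print("distinct labels:", df["label"].nunique())
print("missing values:", df.isnull().sum().sum())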
2.2 Identify Feature Types
We separate features into continuous (numerical) and symbolic (categorical) types.
1) Symbolic (Categorical) Features:
- protocol_type
- service
- flag
- label (the attack type, which we will convert to binary)
2) Continuous (Numerical) Features:
- All other columns.
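You can recover the same split programmatically: pandas loads the symbolic columns as object dtype, so select_dtypes separates them cleanly.
# Derive the categorical/numerical split from dtypes (run this before encoding)
symbolic_cols = df.select_dtypes(include="object").columns.tolist()
continuous_cols = df.select_dtypes(exclude="object").columns.tolist()
print("Symbolic:", symbolic_cols)
print("Continuous count:", len(continuous_cols))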
2.3 Encode Categorical Features
Since machine learning models cannot process categorical (symbolic) data, we need to convert protocol_type, service, and flag into numerical representations.
from sklearn.preprocessing import LabelEncoder
# Apply label encoding to categorical features
categorical_cols = ["protocol_type", "service", "flag"]
encoder = LabelEncoder()
for col in categorical_cols:
    df[col] = encoder.fit_transform(df[col])
print("Categorical encoding complete!")
df.head()
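One caveat worth noting: LabelEncoder maps each category to an arbitrary integer, which implies an ordering (e.g., tcp < udp) that doesn't really exist. Tree-based models tolerate this well, but for linear models one-hot encoding is usually safer. A sketch of that alternative, not applied in this tutorial:
# Alternative: one-hot encode instead of label-encoding (creates one 0/1
# column per category value, so the service column expands substantially)
df_onehot = pd.get_dummies(df, columns=["protocol_type", "service", "flag"])
print("One-hot encoded shape:", df_onehot.shape)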
2.4 Convert Attack Labels into Binary Classes
The label column contains different attack types. To simplify, we will convert it into a binary classification problem:
- normal → 0 (Normal Traffic)
- Any other attack type (neptune, smurf, etc.) → 1 (Attack)
# Convert attack types into binary labels
df["label"] = df["label"].apply(lambda x: 0 if x == "normal" else 1)
print("Labels converted to binary format (0: Normal, 1: Attack)")
df["label"].value_counts()
2.5 Normalize Continuous Features
Since machine learning models perform better with normalized numerical data, we will scale all continuous features between 0 and 1 using MinMaxScaler.
from sklearn.preprocessing import MinMaxScaler
# Select numerical columns
numerical_cols = [
"duration", "src_bytes", "dst_bytes", "land", "wrong_fragment", "urgent", "hot",
"num_failed_logins", "logged_in", "num_compromised", "root_shell", "su_attempted",
"num_root", "num_file_creations", "num_shells", "num_access_files", "num_outbound_cmds",
"is_host_login", "is_guest_login", "count", "srv_count", "serror_rate", "srv_serror_rate",
"rerror_rate", "srv_rerror_rate", "same_srv_rate", "diff_srv_rate", "srv_diff_host_rate",
"dst_host_count", "dst_host_srv_count", "dst_host_same_srv_rate", "dst_host_diff_srv_rate",
"dst_host_same_src_port_rate", "dst_host_srv_diff_host_rate", "dst_host_serror_rate",
"dst_host_srv_serror_rate", "dst_host_rerror_rate", "dst_host_srv_rerror_rate"
]
# Apply MinMax scaling
scaler = MinMaxScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
print("Numerical features normalized!")
2.6 Save the Preprocessed Data
Now that the dataset is cleaned, encoded, and normalized, drop the auxiliary difficulty_level column (it describes record difficulty, not traffic behavior) and save the result for AWS SageMaker.
# Drop the difficulty_level column; it is metadata, not a traffic feature
df = df.drop(columns=["difficulty_level"])
df.to_csv("cybersecurity_preprocessed.csv", index=False, header=False)
print("Preprocessed data saved!")
2.7 Upload Data to AWS S3
Before training, we need to upload the dataset to AWS S3.
import boto3
import random
# Generate a unique S3 bucket name with 5 random numbers
random_suffix = random.randint(10000, 99999) # Generate 5-digit random number
s3_bucket = f"cybersecurity-dataset-{random_suffix}" # Unique bucket name
s3_prefix = "sagemaker/cybersecurity"
# Create an S3 client
s3 = boto3.client("s3")
# Create the S3 bucket. Outside us-east-1, S3 requires an explicit
# LocationConstraint that matches the client's region.
region = boto3.session.Session().region_name
try:
    if region == "us-east-1":
        s3.create_bucket(Bucket=s3_bucket)
    else:
        s3.create_bucket(
            Bucket=s3_bucket,
            CreateBucketConfiguration={"LocationConstraint": region},
        )
    print(f"S3 bucket '{s3_bucket}' created successfully!")
except Exception as e:
    print(f"Error creating S3 bucket: {e}")
# Upload preprocessed dataset to S3
s3.upload_file("cybersecurity_preprocessed.csv", s3_bucket, f"{s3_prefix}/cybersecurity_data.csv")
# Print S3 file path
s3_data_path = f"s3://{s3_bucket}/{s3_prefix}/cybersecurity_data.csv"
print("Data uploaded to S3:", s3_data_path)
Conclusion
So far, we have completed Step 1: Setting up AWS SageMaker and Step 2: Preparing the NSL-KDD dataset for training. Here’s a quick review of our progress and what’s left:
Steps Completed:
✅ Step 1: Set Up AWS SageMaker
- Launched a SageMaker Notebook Instance in AWS.
- Created a notebook on the preinstalled conda_python3 kernel, which already includes the dependencies we need (boto3, sagemaker).
✅ Step 2: Prepare Cybersecurity Data
- Loaded and cleaned the NSL-KDD dataset (dropped the difficulty_level column).
- Encoded categorical features (protocol_type, service, flag).
- Converted labels to a binary classification format (0: Normal, 1: Attack).
- Normalized numerical features using MinMaxScaler.
- Saved and uploaded the dataset to an S3 bucket (with unique random digits).
⏳ Next Steps to Complete:
- Step 3: Train an Anomaly Detection Model on Network Security Data
- Step 4: Deploy the Model as an Endpoint for Real-Time Threat Analysis
- Step 5: Make Predictions to Classify Normal vs. Malicious Network Activity
We will complete the rest of this tutorial in Part 2.