Segmenting Customers with K-Means Clustering: A Data-Driven Approach to Targeted Marketing

Seenivasa Ramadurai

Introduction to Machine Learning

Machine learning is all about teaching machines to learn from data and make decisions without explicit programming. By analyzing data, machines can identify patterns and rules that help them predict outcomes. In simple terms, where traditional programming combines data and rules to produce output, machine learning combines data and output to learn the rules: Data + Output = Rules.


Machine learning can be divided into three main categories:

  1. Supervised Learning: The algorithm is trained on input data paired with the correct output labels and learns patterns that help it predict outputs for new, unseen data. This method is used in applications like linear regression for predicting continuous values, such as house prices or stock prices.

  2. Unsupervised Learning: Here, we feed the algorithm only input data without any predefined outputs. The algorithm finds patterns or groups (clusters) in the data based on similarities. This approach is often used when we want to uncover hidden structures in data, like customer segmentation, without having labeled data.

  3. Reinforcement Learning: In this type of learning, an agent learns through trial and error, receiving rewards or penalties for its actions. It’s used in fields like robotics and game playing.

In this blog post, we’ll focus on unsupervised learning and explore how the K-Means clustering algorithm can be used to segment customers based on their income and spending behavior. This method helps businesses target marketing efforts more effectively.

Why Is Customer Segmentation Important?

Customer segmentation is all about dividing your customer base into distinct groups with shared characteristics or behaviors. By doing this, businesses can create tailored strategies to improve engagement and drive revenue. Here’s why segmentation is so valuable:

  1. Personalized Marketing: By targeting specific customer groups, businesses can create highly relevant and effective marketing campaigns.

  2. Product Development: Understanding the unique needs of different customer segments helps guide product innovation and improvements.

  3. Customer Retention: Focusing on high-value customers allows businesses to strengthen relationships and reduce churn.

  4. Resource Allocation: With clear customer groups, businesses can allocate their marketing budget and resources more efficiently.

Step-by-Step Process for Customer Segmentation Using K-Means Clustering

Let’s walk through how to segment customers using the K-Means clustering algorithm.

1. Load and Explore the Dataset

We begin by loading the customer dataset and exploring its structure. This helps us understand the features, detect missing data, and look for any obvious patterns that could guide the segmentation. For our analysis, we'll focus on key features like Annual Income and Spending Score.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the customer dataset
# The dataset was downloaded from the Kaggle website.
customer_df = pd.read_csv('Customers.csv') 

# Display basic information about the dataset
print(customer_df.info())

2. Data Preprocessing

Before applying the K-Means algorithm, we need to preprocess the data. This involves handling missing values, selecting relevant features, and scaling the data. Scaling ensures that all features have the same weight in the clustering process, preventing one feature from dominating due to its scale.

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# We continue with the customer_df DataFrame loaded above

# Preview the first 5 records (head() returns 5 rows by default)
print(customer_df.head())

# Check the number of rows and columns in the dataset
print(customer_df.shape)

# Check the data type of each column (feature)
print(customer_df.info())

# Summary statistics (mean, standard deviation, quartiles) for the numerical
# columns, transposed for readability
print(customer_df.describe().T)

# Check for duplicated rows
print(customer_df.duplicated().sum())

# Check for missing values
print(customer_df.isnull().sum())


Next, we select the two features we will cluster on, Annual Income and Spending Score, and standardize them so that both carry equal weight in the distance calculations.

from sklearn.preprocessing import StandardScaler

# Select relevant features for clustering
features = ['Annual Income (k$)', 'Spending Score (1-100)']

# Scale the features
scaler = StandardScaler()
customer_df_scaled = scaler.fit_transform(customer_df[features])

3. Apply K-Means Clustering

Now that the data is ready, we apply the K-Means clustering algorithm. But how do we determine how many clusters (groups) to use? This is where the Elbow Method comes in. By plotting the inertia (within-cluster sum of squares) for different values of K (the number of clusters), we can identify the optimal number of clusters.

from sklearn.cluster import KMeans

# Elbow method: fit K-Means for K = 1 to 10 and record the inertia
# (within-cluster sum of squares) for each value of K
inertias = []
k_range = range(1, 11)

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(customer_df_scaled)
    inertias.append(kmeans.inertia_)

# Plot inertia against K; the "elbow" in the curve suggests the optimal K
plt.figure(figsize=(10, 6))
plt.plot(k_range, inertias, "bx-")
plt.xlabel("Number of clusters (K)")
plt.ylabel("Inertia")
plt.title("Elbow Method")
plt.show()

Based on the elbow plot, we determine that the optimal number of clusters is K=5.
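
With K=5 chosen, we fit the final K-Means model, assign each customer a cluster label, and visualize the segments on the income/spending plane.

kmeans = KMeans(n_clusters=5, random_state=42)
customer_df['Cluster'] = kmeans.fit_predict(customer_df_scaled)

# Visualize the clusters
plt.figure(figsize=(12, 8))
scatter = plt.scatter(customer_df['Annual Income (k$)'], customer_df['Spending Score (1-100)'],
                      c=customer_df['Cluster'], cmap='viridis')
plt.colorbar(scatter)
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('Customer Segments')
plt.show()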

4. Interpret the Clusters

After clustering, we analyze the characteristics of each segment. We can use box plots to visualize the distribution of features like Annual Income and Spending Score within each cluster. This allows us to interpret the nature of each segment, such as identifying:

  • Premium Customers: High income, high spending.
  • Carefree Spenders: Low income, high spending.
  • Budget Conscious: Low income, low spending.

# Visualize cluster characteristics
sns.boxplot(x='Cluster', y='Annual Income (k$)', data=customer_df)
plt.title('Annual Income by Cluster')
plt.show()
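
To put numbers behind the box plots, a per-cluster summary of the original (unscaled) features is a quick way to characterize each segment. A short sketch, assuming the Cluster column assigned earlier:

# Average income and spending score per segment, plus segment size
cluster_summary = customer_df.groupby('Cluster')[['Annual Income (k$)', 'Spending Score (1-100)']].mean()
cluster_summary['Customers'] = customer_df['Cluster'].value_counts().sort_index()
print(cluster_summary)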


5. Evaluate Clustering Performance

To evaluate how well the algorithm has clustered the data, we use the Silhouette Score. This score measures how similar each data point is to other points in its own cluster compared to other clusters. A higher score indicates better-defined clusters.

from sklearn.metrics import silhouette_score

# Evaluate the clustering
silhouette_avg = silhouette_score(customer_df_scaled, customer_df['Cluster'])
print(f"The average silhouette score is: {silhouette_avg}")

The Silhouette Score is a measure of how well each data point fits into its assigned cluster compared to other clusters. It helps assess the quality of clustering by evaluating both cohesion (how close points within a cluster are) and separation (how distinct clusters are from each other).

Silhouette Score Range:

1: A score of 1 indicates that the data point is well-clustered. It is far from the other clusters and very close to the center of its assigned cluster. This means the clustering is of high quality, and the point is placed in the correct cluster.

0: A score of 0 indicates that the data point is on or near the boundary between two clusters. It’s equally close to points in its own cluster as it is to points in other clusters, meaning the clustering could be improved, as the data point is not clearly assigned to one cluster.

-1: A score of -1 indicates that the data point is likely placed in the wrong cluster. It is closer to points in other clusters than to points in its own cluster. This suggests poor clustering, and the algorithm might have assigned the point to the wrong cluster.

In summary:

  • Silhouette Score = 1: Perfectly clustered, well-separated from other clusters.
  • Silhouette Score = 0: Indifferent or ambiguous assignment, possibly between clusters.
  • Silhouette Score = -1: Poorly clustered, likely misclassified.

Higher silhouette scores (closer to 1) indicate better clustering, where points are tightly grouped within their clusters and well-separated from others. Lower scores (closer to -1) suggest that the clustering might need improvement.
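
Beyond the single averaged score, the per-point silhouette values can show which segments are well separated and which are borderline. A minimal sketch, using the scaled features and cluster labels from the steps above:

from sklearn.metrics import silhouette_samples
import numpy as np

# One silhouette value per customer
sample_scores = silhouette_samples(customer_df_scaled, customer_df['Cluster'])

# Average silhouette per cluster; low values flag poorly separated segments
for cluster_id in np.unique(customer_df['Cluster']):
    cluster_scores = sample_scores[customer_df['Cluster'] == cluster_id]
    print(f"Cluster {cluster_id}: mean silhouette = {cluster_scores.mean():.3f}")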


6. Potential Improvements and Future Work

While K-Means clustering provides valuable insights, there are several ways to improve and refine the results:

  • Feature Engineering: Create new features to capture deeper insights into customer behavior.
  • Advanced Clustering Algorithms: Try algorithms like DBSCAN or Gaussian Mixture Models for more flexibility in handling complex clusters (a short sketch follows this list).
  • Dimensionality Reduction: For datasets with many features, use techniques like PCA (Principal Component Analysis) to reduce the number of dimensions, which can help improve clustering performance.
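
A Gaussian Mixture Model, for example, can be swapped in with very little code change, since it follows the same fit/predict pattern as KMeans in scikit-learn. A minimal sketch on the same scaled features (the resulting labels will differ from the K-Means clusters):

from sklearn.mixture import GaussianMixture

# Soft clustering: each customer gets a probability of belonging to each component
gmm = GaussianMixture(n_components=5, random_state=42)
gmm_labels = gmm.fit_predict(customer_df_scaled)

# Membership probabilities can highlight customers sitting between two segments
membership_probs = gmm.predict_proba(customer_df_scaled)
print(membership_probs[:5].round(3))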

Conclusion

In this blog post, we’ve demonstrated how to segment customers using the K-Means clustering algorithm based on their Annual Income and Spending Score. By identifying distinct customer groups, businesses can tailor their marketing strategies, enhance product development, and allocate resources more effectively.

Customer segmentation is a powerful way to personalize your approach, increasing customer satisfaction and loyalty. Using unsupervised learning techniques like K-Means clustering, businesses can uncover actionable insights that drive growth and profitability.

By making data-driven decisions, businesses can stay competitive and better meet the needs of their customer base.

Thanks
Sreeni Ramadorai

Top comments (2)

Naga Maddineni
Very detailed explanation. Code snippets really help to get hands-on as well. Once again, well written. Thank you.

Seenivasa Ramadurai
Thanks Naga.