🚀 A Data Science Approach to Identifying Key Customer Groups
Introduction
Customer segmentation is essential for businesses to understand and engage with their customers effectively. In this project, we apply Recency, Frequency, and Monetary (RFM) analysis to an Online Retail dataset to segment customers based on their purchasing behavior.
We’ll walk through data cleaning, feature engineering, clustering using K-Means, and customer insights to drive better business decisions.
Dataset Used: Online Retail II Dataset
📊 Data Exploration & Cleaning
Understanding the Dataset
The dataset contains transactions from a UK-based online retailer from 2009 to 2011. Before diving into analysis, we explore and clean the data.
Key Issues Identified:
- ✅ Missing values in Customer ID
- ✅ Negative values in Quantity & Price
- ✅ Invalid entries in Invoice & StockCode
Cleaning Steps:
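cleaned_df = df.copy()
# Cast to string so .str.match works even if the columns were parsed as mixed types
cleaned_df["Invoice"] = cleaned_df["Invoice"].astype("str")
cleaned_df["StockCode"] = cleaned_df["StockCode"].astype("str")
# Remove invalid invoices (keep six-digit invoice numbers only)
cleaned_df = cleaned_df[cleaned_df["Invoice"].str.match(r"^\d{6}$")]
# Remove non-standard stock codes (keep five-digit codes only)
cleaned_df = cleaned_df[cleaned_df["StockCode"].str.match(r"^\d{5}$")].copy()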
- ✅ 27% of records were removed after cleaning to ensure data accuracy.
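The snippet above covers the Invoice and StockCode filters only. The other two issues (missing Customer ID, negative Quantity or Price) could be handled along these lines. This is a minimal sketch on made-up rows, not the project's actual code: the helper name `drop_unusable_rows` is hypothetical, and treating a zero price as invalid is an assumption.

```python
import pandas as pd

# Toy stand-in for the raw data; column names follow the Online Retail II dataset.
df = pd.DataFrame({
    "Invoice": ["489434", "C489449", "489435", "489436"],
    "StockCode": ["85048", "22087", "POST", "21232"],
    "Quantity": [12, -6, 1, 24],
    "Price": [6.95, 2.95, 18.0, 0.0],
    "Customer ID": [13085.0, 16321.0, None, 13078.0],
})

def drop_unusable_rows(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with no customer ID or with non-positive quantity/price.

    Hypothetical helper: the > 0 thresholds are an assumption.
    """
    out = frame.dropna(subset=["Customer ID"])
    out = out[(out["Quantity"] > 0) & (out["Price"] > 0)]
    return out.copy()

usable_df = drop_unusable_rows(df)
print(f"Removed {1 - len(usable_df) / len(df):.0%} of rows")
```

Printing the removed share after each step makes it easy to see which filter accounts for most of the dropped records.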
🔎 Feature Engineering: RFM Metrics
RFM analysis categorizes customers based on:
- Recency (R): Days between a customer's last purchase and the most recent purchase in the dataset
- Frequency (F): Number of purchases (distinct invoices)
- Monetary (M): Total spending
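# Line total per transaction row (assumed here to be Quantity * Price)
cleaned_df = cleaned_df.assign(SalesLineTotal=cleaned_df["Quantity"] * cleaned_df["Price"])
aggregated_df = cleaned_df.groupby("Customer ID").agg(
    MonetaryValue=("SalesLineTotal", "sum"),
    Frequency=("Invoice", "nunique"),
    LastInvoiceDate=("InvoiceDate", "max"),
)
# Recency: days between each customer's last purchase and the latest purchase in the data
aggregated_df["Recency"] = (aggregated_df["LastInvoiceDate"].max() - aggregated_df["LastInvoiceDate"]).dt.days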
- ✅ RFM values help us group customers based on their buying behavior.
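To make the three metrics concrete, here is a small worked example on made-up transactions for two customers. The values are illustrative only, and computing the line total as Quantity times Price is an assumption.

```python
import pandas as pd

# Made-up transactions for two customers (illustrative values only).
toy = pd.DataFrame({
    "Customer ID": [1, 1, 1, 2],
    "Invoice": ["100001", "100001", "100002", "100003"],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-01", "2011-01-01", "2011-03-01", "2011-03-11"]
    ),
    "Quantity": [2, 1, 5, 10],
    "Price": [3.0, 4.0, 2.0, 1.5],
})
# Assumed definition of the line total.
toy["SalesLineTotal"] = toy["Quantity"] * toy["Price"]

rfm = toy.groupby("Customer ID").agg(
    MonetaryValue=("SalesLineTotal", "sum"),
    Frequency=("Invoice", "nunique"),
    LastInvoiceDate=("InvoiceDate", "max"),
)
# Recency is measured against the latest purchase date in the data.
rfm["Recency"] = (rfm["LastInvoiceDate"].max() - rfm["LastInvoiceDate"]).dt.days
print(rfm[["Recency", "Frequency", "MonetaryValue"]])
```

Customer 1 has two distinct invoices totalling 20.0 and last bought 10 days before the latest date in the data. Customer 2 has one invoice worth 15.0 and a Recency of 0.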
📈 Data Visualization & Outlier Handling
Analyzing RFM Distributions
We plotted histograms and boxplots to understand the spread of Recency, Frequency, and Monetary values.
import matplotlib.pyplot as plt

plt.hist(aggregated_df["MonetaryValue"], bins=10, color="skyblue", edgecolor="black")
plt.title("Monetary Value Distribution")
plt.xlabel("Monetary Value")
plt.ylabel("Count")
plt.show()
📌 Key Observations:
- 🔹 Significant outliers exist in Frequency & Monetary values.
- 🔹 Customers with extreme spending patterns need special treatment.
Outlier Handling Using IQR
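# IQR fences on MonetaryValue; only the upper fence is applied here,
# so unusually high spenders are excluded from filtered_df
Q1 = aggregated_df["MonetaryValue"].quantile(0.25)
Q3 = aggregated_df["MonetaryValue"].quantile(0.75)
IQR = Q3 - Q1
filtered_df = aggregated_df[aggregated_df["MonetaryValue"] <= Q3 + 1.5 * IQR]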
- ✅ Extreme spenders were categorized separately for better insights.
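One way to keep the extreme customers aside rather than discard them is to build a boolean mask per metric and split the table in two. The sketch below uses a toy table and a hypothetical helper, `iqr_outlier_mask`. Applying both the lower and upper fence, and combining the Monetary and Frequency masks, are assumptions rather than the project's confirmed approach.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy RFM table; the last customer is an obvious extreme spender.
rfm = pd.DataFrame(
    {
        "MonetaryValue": [100.0, 120.0, 90.0, 110.0, 105.0, 5000.0],
        "Frequency": [2, 3, 1, 2, 2, 40],
    },
    index=pd.Index([1, 2, 3, 4, 5, 6], name="Customer ID"),
)

is_outlier = iqr_outlier_mask(rfm["MonetaryValue"]) | iqr_outlier_mask(rfm["Frequency"])
outliers_df = rfm[is_outlier]        # kept aside for separate analysis
non_outliers_df = rfm[~is_outlier]   # candidates for clustering
print(outliers_df)
```

Keeping `outliers_df` instead of dropping those rows lets the extreme customers be described as their own group later.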
⚡ K-Means Clustering for Customer Segmentation
Finding the Optimal K (Elbow & Silhouette Method)
We applied K-Means clustering to segment customers. The Elbow method and the Silhouette score helped us determine the ideal number of clusters.
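from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# scaled_data is assumed to hold the scaled RFM features
k_values = range(2, 12)
silhouette_scores = []
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(scaled_data)
    silhouette_scores.append(silhouette_score(scaled_data, labels))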
- ✅ The optimal K = 4 was selected.
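The loop above relies on a `scaled_data` array and shows only the silhouette half of the selection. A self-contained sketch of both halves follows, on synthetic data standing in for the real RFM table. Using `StandardScaler` for the scaling step is an assumption, and the synthetic groups are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the RFM table: three loose groups of customers
# with columns (Recency, Frequency, MonetaryValue).
rng = np.random.default_rng(42)
centers = np.array([[10, 20, 5000], [200, 2, 300], [60, 8, 1500]], dtype=float)
spread = np.array([5, 1, 100], dtype=float)
rfm_values = np.vstack([c + rng.normal(size=(50, 3)) * spread for c in centers])

# Standardize so the large Monetary values do not dominate the distances.
scaled_data = StandardScaler().fit_transform(rfm_values)

inertias, silhouette_scores = {}, {}
for k in range(2, 8):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = kmeans.fit_predict(scaled_data)
    inertias[k] = kmeans.inertia_  # within-cluster sum of squares (Elbow method)
    silhouette_scores[k] = silhouette_score(scaled_data, labels)

best_k = max(silhouette_scores, key=silhouette_scores.get)
print("k with the highest silhouette score:", best_k)
```

Plotting `inertias` against k shows the elbow, and `silhouette_scores` gives a second opinion. When the two disagree, the final choice of K is a judgment call.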
🚀 Customer Segments & Business Insights
After clustering, we analyzed customer groups and their business implications.
Cluster | Label | Characteristics | Business Strategy |
---|---|---|---|
0 | Loyal Retainers | High-value, frequent buyers | Retention programs, exclusive discounts |
1 | Re-Engagement Targets | Infrequent buyers, lower spending | Re-targeting ads, special promotions |
2 | Growth Potential | Recent buyers, low spending | Upsell strategies, better recommendations |
3 | Top Performers | High-frequency, high-value buyers | VIP programs, premium services |
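Labels like the ones in this table usually come from comparing average RFM values per cluster. The sketch below shows that profiling step on synthetic data. The data is randomly generated for illustration, and `StandardScaler` is again an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic RFM table standing in for the real per-customer aggregates.
rng = np.random.default_rng(0)
rfm = pd.DataFrame({
    "Recency": rng.integers(0, 365, size=200),
    "Frequency": rng.integers(1, 30, size=200),
    "MonetaryValue": rng.gamma(shape=2.0, scale=500.0, size=200),
})

scaled = StandardScaler().fit_transform(rfm)
rfm["Cluster"] = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(scaled)

# Per-cluster averages and sizes: the basis for naming each segment.
profile = rfm.groupby("Cluster").agg(
    Recency=("Recency", "mean"),
    Frequency=("Frequency", "mean"),
    MonetaryValue=("MonetaryValue", "mean"),
    Customers=("Recency", "size"),
)
print(profile.round(1))
```

The cluster numbers K-Means returns are arbitrary, so any mapping from number to segment name should be checked against a profile like this after every refit.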
📌 Key Actions:
- ✔ Offer personalized marketing for Loyal Retainers & Top Performers
- ✔ Use discount strategies to re-engage dormant customers
- ✔ Implement recommendation engines for Growth Potential customers
📌 Conclusion
🎯 Takeaways:
- ✅ RFM analysis effectively segments customers based on behavior.
- ✅ K-Means clustering identifies distinct customer groups for better engagement.
- ✅ Business insights help improve marketing & retention strategies.
🚀 Future Work:
- 🔹 Apply hierarchical clustering for better segmentation.
- 🔹 Integrate predictive modeling for dynamic customer targeting.
🔗 Check out the full project on GitHub:
👉 GitHub Repository