🫠 REAL CUSTOMER CHURN PREDICTION vs KAGGLE VERSION

#datascience #python #mrzaizai2k

"It's 5X more expensive to get a new customer than to retain an existing one."

This saying succinctly captures why identifying potential customer churn matters.

While the issue is important, the approach taken in the Kaggle version of the problem looks far too simple. The reality is much more complex.

Data:

As expected, real-world data differs significantly from educational sources. The Kaggle data is quite clean despite some missing values, whereas the data from Vietnamese banks is notorious for its poor quality. For example, date formats vary across sources (e.g. d_m_yy, m_d_yy, yy/m/d). I only discovered this when calculating customers' ages. Additionally, the data team confirmed that features were extracted from scroll boxes, which in theory eliminates spelling errors, yet some errors still persist.
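To make the date problem concrete, here is a minimal sketch of one way to handle it, assuming each record carries the name of the system it came from. The source names and the mapping below are hypothetical, not the bank's real ones. The point is that d_m_yy and m_d_yy cannot be told apart from the value alone, so the format has to be fixed per source.

```python
from datetime import date, datetime

# Hypothetical mapping from source system to its date format.
# These names are assumptions for illustration only.
SOURCE_DATE_FORMATS = {
    "core_banking": "%d_%m_%y",  # d_m_yy
    "mobile_app": "%m_%d_%y",    # m_d_yy
    "card_system": "%y/%m/%d",   # yy/m/d
}


def parse_birth_date(raw: str, source: str, today: date) -> date:
    """Parse a birth date using the format known for its source system."""
    parsed = datetime.strptime(raw, SOURCE_DATE_FORMATS[source]).date()
    # %y maps 00-68 to 2000-2068, so a birth date in the future
    # must belong to the previous century.
    if parsed > today:
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed


def age_in_years(born: date, today: date) -> int:
    """Full years between the birth date and today."""
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    return today.year - born.year - (0 if had_birthday else 1)


today = date(2024, 1, 1)
# The same raw string means two different dates depending on the source.
print(parse_birth_date("5_3_90", "core_banking", today))  # 5 March 1990
print(parse_birth_date("5_3_90", "mobile_app", today))    # 3 May 1990
```

A sanity check on the resulting ages (for example, flagging anyone younger than 0 or older than 120) is a cheap way to catch a source that was mapped to the wrong format.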

I have read hundreds of code samples and papers online and, seriously, I don't believe them. What kind of bank gives you its data to experiment on and lets you publicly share all the results? In our bank, I can't access these datasets directly. That's where problems appear: does the data team understand my requirements accurately, and is the SQL code error-free?

Algorithm:

Most Kaggle datasets are based on static data, including attributes like age, gender, and account balance, rather than time series data. This approach is not inherently wrong, but it is not enough. Do women churn more frequently than men? Does the amount of money held in an account affect churn rates? Ultimately, all customers churn at some point. To some extent, these static factors do hold true (e.g., individuals with outstanding debts are less likely to churn). However, they are insufficient on their own.

Detecting churn requires analyzing time series data. For example, if a customer intends to churn, they first need to settle their debts and withdraw their money, which takes time. A snapshot of the data at a single moment cannot provide conclusive insights. Some customers may still have money in their accounts, yet they have not used our services for 3 months. In essence, these customers have already stopped using our services, but the snapshot alone is not enough to confirm their churn status.
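As a rough illustration of what a snapshot misses, here is a minimal sketch of a recency feature built from activity history. The function names are hypothetical, and the 90-day threshold is only an approximation of the "3 months" mentioned above, not a rule used at the bank.

```python
from datetime import date
from typing import List, Optional


def days_since_last_activity(activity_dates: List[date], snapshot: date) -> Optional[int]:
    """Days between the last activity on or before the snapshot and the snapshot."""
    past = [d for d in activity_dates if d <= snapshot]
    if not past:
        return None  # no recorded activity at all
    return (snapshot - max(past)).days


def looks_silently_churned(balance: float, activity_dates: List[date],
                           snapshot: date, inactive_days: int = 90) -> bool:
    """Money is still in the account, but the customer has gone quiet."""
    gap = days_since_last_activity(activity_dates, snapshot)
    return balance > 0 and (gap is None or gap >= inactive_days)


snapshot = date(2024, 6, 30)
quiet = [date(2024, 1, 10), date(2024, 3, 1)]
active = [date(2024, 3, 1), date(2024, 6, 20)]

# Both customers have the same balance in a static snapshot,
# but their activity history tells them apart.
print(looks_silently_churned(500.0, quiet, snapshot))
print(looks_silently_churned(500.0, active, snapshot))
```

In the same spirit, balance trend or debt repayment over the last few months could be added as features, since both take time and leave a trail before the customer actually leaves.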

Labeling:

As mentioned above, determining churn is hard because the label is delayed: without explicit confirmation from customers (e.g., app uninstallation), we cannot be sure they have churned. However, waiting for confirmation is too late for retention efforts. (My boss even said that predicting 1 month in advance is still too late.)
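One common workaround for delayed labels, sketched here under my own assumptions rather than as the method used at the bank, is a proxy label: pick a cutoff date, build features only from data up to the cutoff, and call a customer churned if they show no activity in an outcome window that starts some lead time after the cutoff. The `lead_days` and `outcome_days` values below are hypothetical and would need to be agreed with the business.

```python
from datetime import date, timedelta
from typing import List


def proxy_churn_label(activity_dates: List[date], cutoff: date,
                      lead_days: int = 60, outcome_days: int = 90) -> int:
    """Return 1 if there is no activity in the outcome window, else 0.

    The window starts lead_days after the cutoff, so a prediction made
    at the cutoff leaves that much time for retention efforts.
    """
    start = cutoff + timedelta(days=lead_days)
    end = start + timedelta(days=outcome_days)
    return int(not any(start <= d < end for d in activity_dates))


def feature_history(activity_dates: List[date], cutoff: date) -> List[date]:
    """Only activity up to the cutoff may be used to build features."""
    return [d for d in activity_dates if d <= cutoff]


cutoff = date(2024, 1, 1)
history = [date(2023, 12, 15), date(2024, 1, 20)]
print(proxy_churn_label(history, cutoff))   # no activity in the outcome window
print(feature_history(history, cutoff))     # the January activity is excluded
```

The label is only a proxy: a customer who is quiet during the window may come back later, so the window lengths directly change who counts as a churner.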

Interestingly, the Kaggle data provides labels for all customers, which raises concerns about their validity. Moreover, its churn rate is around 10-20%. In contrast, my bank's churn rate (confirmed churners) stands at 0.0025% (fewer than 100 people out of 1.3 million users last month). I wonder how they came up with the labels.
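To see why a rate this low changes the problem, here is a small worked example. It assumes 32 confirmed churners, which is roughly 0.0025% of 1.3 million, and evaluates a "model" that always predicts "no churn". The function name is hypothetical.

```python
def always_no_churn_baseline(total_users: int, churners: int) -> dict:
    """Metrics of a trivial model that never predicts churn."""
    true_negatives = total_users - churners
    accuracy = true_negatives / total_users
    recall = 0.0  # no churner is ever flagged
    return {"accuracy": accuracy, "recall": recall}


total_users = 1_300_000
churners = 32  # assumption: roughly 0.0025% of 1.3 million

metrics = always_no_churn_baseline(total_users, churners)
print(f"accuracy: {metrics['accuracy']:.6%}")
print(f"recall:   {metrics['recall']:.0%}")
```

The trivial model is almost perfectly accurate and completely useless, so accuracy says nothing here. Metrics focused on the churners themselves, such as recall and precision, are the ones that matter.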

It's 3000 characters now. If you're a student trying to enhance your CV with Kaggle projects, good luck!!

Top comments (1)

Mai Chi Bao • Feb 10

It's always cheaper to keep a customer than to find a new one! Understanding churn is critical, but real-world data challenges make it a tough problem to crack.
