It's 5X more expensive to get a new customer than to retain an existing one
This saying succinctly captures why identifying potential customer churn matters.
While the issue is important, the approach taken in the Kaggle version of the problem looks very simple. The reality is far more complex.
Data:
As expected, real-world data differs significantly from educational sources. The Kaggle data is quite clean despite some missing values, whereas data from Vietnamese banks is notorious for its poor quality. For example, date formats vary across sources (e.g. d_m_yy, m_d_yy, yy/m/d); I only discovered this when calculating customers' ages. Additionally, the data team confirmed that features were extracted from scroll boxes, which should in theory eliminate spelling errors, yet some errors still persist.
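To make the date problem concrete, here is a minimal sketch of how mixed formats could be normalized before computing ages. It assumes each record carries the name of the system it came from, because a value like 3_5_90 cannot be told apart as d_m_yy or m_d_yy from the string alone. The source names, the function names, and the century rule are my assumptions for illustration, not what our data team actually runs.

```python
from datetime import date, datetime

# Hypothetical mapping from source system to the date format it uses.
# The source names are made up; only the three formats come from the post.
SOURCE_DATE_FORMATS = {
    "core_banking": "%d_%m_%y",  # d_m_yy
    "card_system": "%m_%d_%y",   # m_d_yy
    "mobile_app": "%y/%m/%d",    # yy/m/d
}


def parse_birth_date(raw: str, source: str, today: date) -> date:
    """Parse a date of birth using the format declared for its source."""
    parsed = datetime.strptime(raw.strip(), SOURCE_DATE_FORMATS[source]).date()
    # %y maps 00-68 to 2000-2068. A birth date in the future is impossible,
    # so assume the customer was born one century earlier.
    if parsed > today:
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed


def age_in_years(birth: date, today: date) -> int:
    """Whole years between birth and today."""
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    return today.year - birth.year - (0 if had_birthday else 1)


today = date(2024, 6, 1)
# The same birthday, written three different ways by three sources.
print(parse_birth_date("5_3_90", "core_banking", today))
print(parse_birth_date("3_5_90", "card_system", today))
print(parse_birth_date("90/3/5", "mobile_app", today))
```

Keeping the format per source, instead of guessing per value, is the main design choice here: a guessed format will silently swap day and month for roughly the first twelve days of every month.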
I have read hundreds of code samples and papers online and, seriously, I don't believe them. What kind of bank gives you data to experiment on and lets you publicly share all the results? In our bank, I can't access these datasets directly. Consequently, I depend on the data team, and that's where problems appear, such as whether the data team understood my requirements correctly and whether the SQL code is error-free.
Algorithm:
Most Kaggle datasets are based on static data, including attributes like age, gender, and account balance, rather than time series data. While this approach is not inherently incorrect, it is not enough. Do women churn more frequently than men? Does the amount of money held in an account affect churn rates? Ultimately, all customers churn at some point. To some extent, these static factors do hold (e.g., individuals with outstanding debts are less likely to churn), but they are insufficient on their own.
Detecting churn requires analyzing time series data. For example, if a customer intends to churn, they first need to settle their debts and withdraw their money, which takes time. A snapshot of the data at a single moment cannot provide conclusive insights. Some customers may still have money in their accounts, yet have not used our services for 3 months. In essence, these customers have already stopped using our services, but the available data is insufficient to confirm their churn status.
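The "money in the account but silent for 3 months" case can be sketched roughly like this. It only uses the standard library, and the input shapes (a dict of balances and a list of transaction pairs), the function name, and the 90-day threshold are assumptions for illustration, not our production logic.

```python
from datetime import date, timedelta


def find_dormant_customers(balances, transactions, as_of, inactive_days=90):
    """Return IDs of customers who hold money but show no recent activity.

    balances: dict of customer_id -> current balance (hypothetical shape)
    transactions: iterable of (customer_id, transaction_date) pairs
    """
    cutoff = as_of - timedelta(days=inactive_days)

    # Latest transaction date per customer, ignoring anything after as_of.
    last_seen = {}
    for customer_id, tx_date in transactions:
        if tx_date > as_of:
            continue
        if customer_id not in last_seen or tx_date > last_seen[customer_id]:
            last_seen[customer_id] = tx_date

    dormant = []
    for customer_id, balance in balances.items():
        last = last_seen.get(customer_id)
        # A positive balance with no activity since the cutoff is the pattern
        # a static snapshot of balances cannot reveal.
        if balance > 0 and (last is None or last < cutoff):
            dormant.append(customer_id)
    return sorted(dormant)


balances = {"a": 100, "b": 50, "c": 0, "d": 10}
transactions = [
    ("a", date(2024, 5, 20)),
    ("b", date(2024, 1, 10)),
    ("c", date(2024, 1, 1)),
]
print(find_dormant_customers(balances, transactions, as_of=date(2024, 6, 1)))
```

A flag like this is a suspicion, not a label: the customer may simply be saving, which is exactly why the snapshot alone cannot confirm churn.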
Labeling:
As mentioned above, determining churn is hard because labels are delayed: without explicit confirmation from customers (e.g., app uninstallation), we cannot be sure they have churned. However, waiting for confirmation is too late for retention efforts. (My boss even said that 1 month in advance is still too late.)
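One common way to work around delayed labels is a proxy label with a lead gap: features stop at a cutoff date, and the outcome is read from a later window. The sketch below illustrates that idea only. The function name, the 60-day lead (chosen because 1 month is reportedly too late), and the 90-day outcome window are all assumptions, not a definition my bank uses.

```python
from datetime import date, timedelta


def proxy_churn_label(activity_dates, cutoff, data_end, lead_days=60, outcome_days=90):
    """Return 1 (churned), 0 (still active), or None (label not known yet).

    Features may only use data up to `cutoff`. The outcome window starts
    `lead_days` after the cutoff, so a model trained on these labels has to
    predict churn with at least that much notice.
    """
    window_start = cutoff + timedelta(days=lead_days)
    window_end = window_start + timedelta(days=outcome_days)

    # Delayed label: until the whole outcome window has been observed,
    # we cannot say whether the customer came back.
    if window_end > data_end:
        return None

    active = any(window_start <= d < window_end for d in activity_dates)
    return 0 if active else 1


cutoff = date(2024, 1, 1)
print(proxy_churn_label([date(2024, 2, 15)], cutoff, data_end=date(2024, 6, 30)))
print(proxy_churn_label([date(2024, 4, 1)], cutoff, data_end=date(2024, 6, 30)))
print(proxy_churn_label([date(2024, 4, 1)], cutoff, data_end=date(2024, 4, 1)))
```

The trade-off is visible in the parameters: a longer lead gives the retention team more time, but it also pushes back the date at which any label becomes available.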
Interestingly, the Kaggle data provides labels for all customers, which raises concerns about their validity. Moreover, its churn rate is around 10-20%. In contrast, my bank's churn rate (confirmed churners) stands at 0.0025% (fewer than 100 people out of 1.3 million users last month). I wonder how they came up with the labels.
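To put those numbers side by side, here is a quick back-of-the-envelope calculation. The 15% Kaggle figure is just the midpoint of the 10-20% range above, and the "predict nobody churns" baseline is my own illustration of why such a low rate makes plain accuracy a poor yardstick.

```python
total_customers = 1_300_000
bank_churn_rate = 0.0025 / 100   # 0.0025% as a fraction
kaggle_churn_rate = 15 / 100     # assumed midpoint of the 10-20% range

# Confirmed churners implied by the bank's rate.
expected_churners = total_customers * bank_churn_rate

# Accuracy of a "model" that predicts no churn for every customer.
bank_baseline_accuracy = 1 - bank_churn_rate
kaggle_baseline_accuracy = 1 - kaggle_churn_rate

print(f"Expected churners: about {expected_churners:.1f}")
print(f"Do-nothing accuracy, bank: {bank_baseline_accuracy:.4%}")
print(f"Do-nothing accuracy, Kaggle-like: {kaggle_baseline_accuracy:.0%}")
```

With a few dozen positives per month, the do-nothing baseline already looks near-perfect on accuracy, so a model trained on Kaggle-style class balance tells you little about this setting.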
It's 3000 characters now. If you're a student trying to enhance your CV with Kaggle projects, good luck!!