Import Data & Libraries

Investigate Dataset

One can deduce that Customer ID will be dropped; MonthlyCharges, TotalCharges, and tenure are numerical, while the rest are categorical and, due to their low cardinality, will be one-hot encoded. PCA could potentially be applied to reduce the dimensionality of the new columns.
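A minimal sketch of this split, assuming the data is loaded into a pandas DataFrame (the two-row miniature below is hypothetical, with column names from the Telco churn dataset):

```python
import pandas as pd

# Hypothetical miniature of the Telco churn data; in the notebook df is
# the full DataFrame loaded above.
df = pd.DataFrame({
    "customerID": ["0001-A", "0002-B"],
    "tenure": [1, 34],
    "MonthlyCharges": [29.85, 56.95],
    "TotalCharges": ["29.85", "1889.5"],  # read as strings in the raw CSV
    "Contract": ["Month-to-month", "One year"],
    "Churn": ["No", "Yes"],
})

df = df.drop(columns=["customerID"])  # the ID carries no signal
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

numerical = ["tenure", "MonthlyCharges", "TotalCharges"]
categorical = [c for c in df.columns if c not in numerical + ["Churn"]]
print(numerical)    # ['tenure', 'MonthlyCharges', 'TotalCharges']
print(categorical)  # ['Contract']
```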

Identify Categorical and Numerical Features

Remove Missing and Duplicates

Due to the low number of missing values (11), we drop them instead of imputing a value (mean/mode/median).
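A sketch of this cleanup step on a hypothetical frame (one NaN stands in for the 11 missing values, assumed to surface in TotalCharges after the numeric coercion):

```python
import pandas as pd

# Hypothetical frame with one missing TotalCharges value.
df = pd.DataFrame({
    "tenure": [1, 34, 0, 45],
    "TotalCharges": [29.85, 1889.5, None, 1840.75],
})
print(df.isna().sum())  # missing count per column
df = df.dropna().drop_duplicates().reset_index(drop=True)
print(df.shape)         # (3, 2)
```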

Explore Train Dataset

Potentially apply SMOTE to increase the proportion of 'Yes' Churn.

Distribution of Target in Numerical Attributes

Distribution of Categorical Features

Correlation of Numerical Columns

One Hot Encoding
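A minimal one-hot encoding sketch with `pandas.get_dummies` (encoding only a single `Contract` column here is an illustrative assumption; the notebook encodes all low-cardinality categoricals):

```python
import pandas as pd

df = pd.DataFrame({"Contract": ["Month-to-month", "One year", "Two year"]})
encoded = pd.get_dummies(df, columns=["Contract"])
print(list(encoded.columns))
# ['Contract_Month-to-month', 'Contract_One year', 'Contract_Two year']
```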

SMOTE

Let's train a baseline model on the raw data with XGBoost

The evaluation metric is based on the Area Under the ROC Curve (AUC).

If the model outputs predicted probabilities, the AUC can be computed directly from them.
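For example, with scikit-learn's `roc_auc_score` (the labels and scores below are made up for illustration):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]                     # 1 = Churn 'Yes'
y_prob = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]   # model's P(Churn = 'Yes')
print(round(roc_auc_score(y_true, y_prob), 3))  # 0.889
```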

Let's apply SMOTE

SMOTE

SMOTE alone does not improve the result.

SMOTE and UnderSample

Better result with SMOTE and undersampling, so we will generate a new DataFrame with this data.

Feature Selection

Now we are going to select the top features for the model (after one-hot encoding there are too many variables). There are different ways to do this, such as: manually selecting by correlation (e.g. tenure, two-year contract, internet service; see above); PCA, to keep only the top components; or training a model and selecting the most important variables, e.g. from a simple tree.

We are going to train a Random Forest model for this purpose and use its feature importance information to select the best columns.

Feature Importance Random Forest

We select the 10 best features.
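A sketch of this selection on a synthetic stand-in for the one-hot-encoded training set (the 30 placeholder columns `f0…f29` are assumptions, not the real encoded names):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 30-column stand-in for the encoded, resampled training data.
X, y = make_classification(n_samples=400, n_features=30, n_informative=8,
                           random_state=1)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(30)])

rf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns)
top10 = importances.sort_values(ascending=False).head(10).index.tolist()
print(top10)  # the 10 columns kept for the final model
```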

Train the model

Now, with the resampled dataset and the selected features, we are going to train an XGBoost model and, due to the lack of time, search for the best parameters with H2O AutoML. We used XGBoost, but with more time we could try other supervised classification algorithms such as logistic regression, Random Forest, CatBoost, LightGBM, SVM, or AdaBoost.

We execute AutoML with H2O on the resampled dataset, using only the final variables.