Most credit scoring models fail for one reason: bad variable selection. You pick variables that work on your training data. They fall apart on new data. The model looks great in development and breaks in production.
There is a better way. This article shows you how to select variables that are stable, interpretable, and robust, no matter how you split the data.
The Core Idea: Stability Over Performance
A variable is robust if it matters on every subset of your data, not just on the full dataset.
To check this, we split the training data into 4 folds using stratified cross-validation. We stratify by the default variable and the year to ensure each fold is representative of the full population.
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
train_imputed["fold"] = -1
for fold, (_, test_idx) in enumerate(skf.split(train_imputed, train_imputed["def_year"])):
    train_imputed.loc[test_idx, "fold"] = fold
We then build four pairs (train, test). Each pair uses three folds for training and one fold for testing. We apply every selection rule on the training set only, never on the test set. This prevents data leakage.
folds = build_and_save_folds(train_imputed, fold_col="fold", save_dir="folds/")
A variable survives selection only if it passes the criteria on all four folds. One weak fold is enough to eliminate it.
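This all-folds rule is easy to express in code. A minimal sketch, assuming the per-fold p-values for a variable have already been computed:

```python
# Hypothetical helper: a variable survives only if it is significant
# (p-value at or below alpha) on every single fold.
def survives_all_folds(pvalues_by_fold, alpha=0.05):
    return all(p <= alpha for p in pvalues_by_fold)

# Significant on all four folds -> kept.
stable_var = [0.001, 0.004, 0.010, 0.020]
# Weak on one fold (p = 0.08) -> eliminated, despite three strong folds.
fragile_var = [0.001, 0.004, 0.010, 0.080]
```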
The Dataset
We use the Credit Scoring Dataset from Kaggle. It contains 32,581 loans issued to individual borrowers.
The loans cover medical, personal, educational, and professional needs — as well as debt consolidation. Loan amounts range from $500 to $35,000.
The dataset has two types of variables:
- Contract characteristics: loan amount, interest rate, loan purpose, credit grade, time since origination
- Borrower characteristics: age, income, years of experience, housing status
We identified 7 continuous variables:
- person_income
- person_age
- person_emp_length
- loan_amnt
- loan_int_rate
- loan_percent_income
- cb_person_cred_hist_length
We identified 4 categorical variables:
- person_home_ownership
- cb_person_default_on_file
- loan_intent
- loan_grade
The target is default: 1 if the borrower defaulted, 0 otherwise.
We handled missing values and outliers in a previous article. Here, we focus on variable selection.
The Filter Method: Four Rules
The filter method uses statistical measures of association. It does not need a predictive model. It is fast, auditable, and easy to explain to non-technical stakeholders.
We apply four rules in sequence. Each rule feeds its output into the next.
Rule 1: Drop continuous variables not linked to the default
We run a Kruskal-Wallis test between each continuous variable and the default target. If the p-value exceeds 5% on at least one fold, we drop the variable. It is not reliably linked to default.
rule1_vars = filter_uncorrelated_with_target(
    folds=folds,
    variables=continuous_vars,
    target="def_year",
    pvalue_threshold=0.05,
)
Result: All continuous variables pass Rule 1. Every continuous variable shows a significant association with default in all four folds.
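`filter_uncorrelated_with_target` is a project helper whose implementation is not shown here. A minimal sketch of the same logic with `scipy.stats.kruskal`, using hypothetical column names and toy data:

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal

def kruskal_pvalue(df, var, target):
    """Kruskal-Wallis p-value for `var` across the groups defined by `target`."""
    groups = [g[var].dropna().to_numpy() for _, g in df.groupby(target)]
    return kruskal(*groups).pvalue

def rule1_keep(folds, variables, target, alpha=0.05):
    """Keep a variable only if it is significant on the train set of every fold."""
    kept = []
    for var in variables:
        pvals = [kruskal_pvalue(train, var, target) for train, _ in folds]
        if all(p <= alpha for p in pvals):
            kept.append(var)
    return kept

# Toy data: `income` is strongly shifted by the target, `noise` is unrelated.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 800)
df = pd.DataFrame({
    "def_year": y,
    "income": rng.normal(50 + 20 * y, 5),
    "noise": rng.normal(0, 1, 800),
})
# Four overlapping training subsets stand in for the real fold structure.
folds = [(df.sample(frac=0.75, random_state=i), None) for i in range(4)]
kept = rule1_keep(folds, ["income", "noise"], "def_year")
```

On this toy data, `income` passes on every subset while `noise` does not survive the all-folds requirement.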
Rule 2: Drop categorical variables weakly linked to default
We compute Cramér’s V between each categorical variable and the default target. Cramér’s V measures the association between two categorical variables. It ranges from 0 (no link) to 1 (perfect link).
We drop a variable if its Cramér’s V falls below 10% on at least one fold. A strong association requires a V above 50%.
rule2_vars = filter_categorical_variables(
    folds=folds,
    cat_variables=categorical_vars,
    target="def_year",
    low_threshold=0.10,
    high_threshold=0.50,
)
Result: We keep 3 of the 4 categorical variables. The variable loan_intent is dropped; its link to default is too weak in at least one fold.
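`filter_categorical_variables` relies on Cramér's V, which can be computed by hand from the chi-squared statistic. A minimal sketch (the toy data and names are illustrative, not the article's pipeline):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V between two categorical series: 0 = no link, 1 = perfect link."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    k = min(table.shape) - 1  # normalization so V stays in [0, 1]
    return float(np.sqrt(chi2 / (n * k)))

# Toy check: a copy of the target has V close to 1, pure noise close to 0.
rng = np.random.default_rng(1)
target = rng.integers(0, 2, 1000)
mirror = target                   # perfect association with the target
noise = rng.integers(0, 4, 1000)  # no association with the target

v_mirror = cramers_v(mirror, target)
v_noise = cramers_v(noise, target)
```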
Rule 3: Drop redundant continuous variables
Two continuous variables that carry the same information hurt the model. They create multicollinearity.
We compute the Spearman correlation between every pair of continuous variables. If the correlation reaches 60% or more on at least one fold, we drop one variable from the pair. We keep the one with the stronger link to default, measured by the lower Kruskal-Wallis p-value.
selected_continuous = filter_correlated_variables_kfold(
    folds=folds,
    variables=rule1_vars,
    target="def_year",
    threshold=0.60,
)
Result: We keep 5 continuous variables. We drop loan_amnt and cb_person_cred_hist_length; both were strongly correlated with other retained variables. This matches the correlation analysis from our previous article.
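`filter_correlated_variables_kfold` is not shown in this article; the core dedup step can be sketched like this, with the Kruskal-Wallis p-values passed in and all names and data purely illustrative:

```python
from itertools import combinations

import numpy as np
import pandas as pd

def drop_redundant(df, variables, target_pvalues, threshold=0.60):
    """For each pair with |Spearman| >= threshold, drop the variable with
    the weaker link to default (i.e. the higher Kruskal-Wallis p-value)."""
    corr = df[variables].corr(method="spearman").abs()
    dropped = set()
    for a, b in combinations(variables, 2):
        if a in dropped or b in dropped:
            continue
        if corr.loc[a, b] >= threshold:
            dropped.add(a if target_pvalues[a] > target_pvalues[b] else b)
    return [v for v in variables if v not in dropped]

# Toy data: `loan_amnt` is nearly a rescaled copy of `income`.
rng = np.random.default_rng(2)
income = rng.normal(50, 10, 500)
df = pd.DataFrame({
    "income": income,
    "loan_amnt": income * 0.3 + rng.normal(0, 1, 500),  # highly correlated
    "age": rng.normal(40, 12, 500),                     # independent
})
# Assumed per-variable Kruskal-Wallis p-values against default.
pvals = {"income": 1e-6, "loan_amnt": 1e-3, "age": 1e-4}
selected = drop_redundant(df, ["income", "loan_amnt", "age"], pvals)
```

Here `income` and `loan_amnt` correlate far above 60%, and `loan_amnt` loses because its p-value against default is higher.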
Rule 4: Drop redundant categorical variables
We apply the same logic to categorical variables. We compute Cramér’s V between every pair of categorical variables retained after Rule 2. If the V reaches 50% or more on at least one fold, we drop the variable least linked to default.
selected_categorical = filter_correlated_categorical_variables(
    folds=folds,
    cat_variables=rule2_vars,
    target="def_year",
    high_threshold=0.50,
)
Result: We keep 2 categorical variables. We drop loan_grade: it is strongly correlated with another retained variable and has the weaker link to default.
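The same pairwise logic applies here, with Cramér's V in place of Spearman correlation. A self-contained sketch under the same caveats (the project helper's real implementation may differ):

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V between two categorical series (0 = no link, 1 = perfect)."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    return float(np.sqrt(chi2 / (table.to_numpy().sum() * (min(table.shape) - 1))))

def drop_redundant_categorical(df, variables, target, high_threshold=0.50):
    """For pairs with V >= threshold, drop the variable less linked to the target."""
    dropped = set()
    for a, b in combinations(variables, 2):
        if a in dropped or b in dropped:
            continue
        if cramers_v(df[a], df[b]) >= high_threshold:
            weaker = a if cramers_v(df[a], df[target]) < cramers_v(df[b], df[target]) else b
            dropped.add(weaker)
    return [v for v in variables if v not in dropped]

# Toy data: `grade` is a noisy copy of `intent`; default is driven by `intent`.
rng = np.random.default_rng(3)
intent = rng.integers(0, 3, 1000)
grade = np.where(rng.random(1000) < 0.9, intent, rng.integers(0, 3, 1000))
df = pd.DataFrame({"intent": intent, "grade": grade,
                   "def_year": (intent == 0).astype(int)})
kept_cat = drop_redundant_categorical(df, ["intent", "grade"], "def_year")
```

The two toy variables are redundant (V well above 50%), so the one less linked to default, `grade`, is dropped.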
Final Selection: 7 Variables
The filter method selects 7 variables in total, 5 continuous and 2 categorical. Each one is significantly linked to default. None of them are redundant. And they all hold up on every fold.
This selection is auditable. You can show every decision to a regulator or a business stakeholder. You can explain why each variable was kept or dropped. That matters in credit scoring.
Each rule runs on the training set of each fold. A variable is dropped if it fails on any single fold. This is what makes the selection robust.
In the next article, we will study the monotonicity and temporal stability of these 7 variables. A variable can be significant today and unstable over time. Both properties matter in production scoring models.
Key points from the article:
- Most data scientists select variables once, on the full training data; those variables break on new data. Rule 1 fixes this: we run a Kruskal-Wallis test on every fold separately. The association between each continuous variable and default must be significant in all four folds.
- Categorical variables are the silent killers of scoring models. They look correlated with default on the full dataset. They fall apart on a subset. Rule 2 catches them: we compute Cramér’s V on each fold independently. Below 10% on any single fold, it’s gone.
- Two continuous variables that say the same thing don’t double your signal. They destroy your model. Rule 3 detects every correlated pair (Spearman ≥ 60%) across all folds. When two variables fight, the one with the weakest link to default loses.
- Categorical redundancy is invisible until your model fails an audit. Rule 4 surfaces it: we compute Cramér's V between every pair of categorical variables. Above 50% on any fold, one goes. We keep the one more strongly linked to default.
Found this useful? Star the repo on GitHub and stay tuned for the next post on monotonicity and temporal stability.
How do you select variables robustly in your own models?
Image Credits
All images and visualizations in this article were created by the author using Python (pandas, matplotlib, seaborn, and plotly) and Excel, unless otherwise stated.
Data & Licensing
The dataset used in this article is released under the CC0: Public Domain dedication.
This allows anyone to share and adapt the dataset for any purpose, including commercial use.
For more details, see the official license text: CC0: Public Domain.
Disclaimer
Any remaining errors or inaccuracies are the author’s responsibility. Feedback and corrections are welcome.
