Most credit scoring models fail for one reason: bad variable selection. You pick variables that work on your training data. They fall apart on new data. The model looks great in development and breaks in production.
There is a better way. This article shows you how to select variables that are stable, interpretable, and robust, no matter how you split the data.
The Core Idea: Stability Over Performance
A variable is robust if it matters on every subset of your data, not just on the full dataset.
To check this, we split the training data into 4 folds using stratified cross-validation. We stratify by the default variable and the year to ensure each fold is representative of the full population.
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
train_imputed["fold"] = -1
for fold, (_, test_idx) in enumerate(skf.split(train_imputed, train_imputed["def_year"])):
    train_imputed.loc[test_idx, "fold"] = fold
We then build four pairs (train, test). Each pair uses three folds for training and one fold for testing. We apply every selection rule on the training set only, never on the test set. This prevents data leakage.
folds = build_and_save_folds(train_imputed, fold_col="fold", save_dir="folds/")
A variable survives selection only if it passes the criteria on all four folds. One weak fold is enough to eliminate it.
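This all-folds rule is easy to express in code. A minimal sketch, assuming the per-fold p-values for a variable have already been computed:

```python
# Hypothetical helper: a variable survives only if it is significant
# (p-value at or below alpha) on every single fold.
def survives_all_folds(pvalues_by_fold, alpha=0.05):
    return all(p <= alpha for p in pvalues_by_fold)

# Significant on all four folds -> kept.
stable_var = [0.001, 0.004, 0.010, 0.020]
# Weak on one fold (p = 0.08) -> eliminated, despite three strong folds.
fragile_var = [0.001, 0.004, 0.010, 0.080]
```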
The Dataset
We use the Credit Scoring Dataset from Kaggle. It contains 32,581 loans issued to individual borrowers.
The loans cover medical, personal, educational, and professional needs — as well as debt consolidation. Loan amounts range from $500 to $35,000.
The dataset has two types of variables:
- Contract characteristics: loan amount, interest rate, loan purpose, credit grade, time since origination
- Borrower characteristics: age, income, years of experience, housing status
We identified 7 continuous variables:
- person_income
- person_age
- person_emp_length
- loan_amnt
- loan_int_rate
- loan_percent_income
- cb_person_cred_hist_length
We identified 4 categorical variables:
- person_home_ownership
- cb_person_default_on_file
- loan_intent
- loan_grade
The target is default: 1 if the borrower defaulted, 0 otherwise.
We handled missing values and outliers in a previous article. Here, we focus on variable selection.
The Filter Method: Four Rules
The filter method uses statistical measures of association. It does not need a predictive model. It is fast, auditable, and easy to explain to non-technical stakeholders.
We apply four rules in sequence. Each rule feeds its output into the next.
Rule 1: Drop continuous variables not linked to the default
We run a Kruskal-Wallis test between each continuous variable and the default target. If the p-value exceeds 5% on at least one fold, we drop the variable. It is not reliably linked to default.
rule1_vars = filter_uncorrelated_with_target(
    folds=folds,
    variables=continuous_vars,
    target="def_year",
    pvalue_threshold=0.05,
)
Result: All continuous variables pass Rule 1. Every continuous variable shows a significant association with default in all four folds.
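`filter_uncorrelated_with_target` is a project helper whose implementation is not shown here. A minimal sketch of the same logic with `scipy.stats.kruskal`, using hypothetical column names and toy data:

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal

def kruskal_pvalue(df, var, target):
    """Kruskal-Wallis p-value for `var` across the groups defined by `target`."""
    groups = [g[var].dropna().to_numpy() for _, g in df.groupby(target)]
    return kruskal(*groups).pvalue

def rule1_keep(folds, variables, target, alpha=0.05):
    """Keep a variable only if it is significant on the train set of every fold."""
    kept = []
    for var in variables:
        pvals = [kruskal_pvalue(train, var, target) for train, _ in folds]
        if all(p <= alpha for p in pvals):
            kept.append(var)
    return kept

# Toy data: `income` is strongly shifted by the target, `noise` is unrelated.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 800)
df = pd.DataFrame({
    "def_year": y,
    "income": rng.normal(50 + 20 * y, 5),
    "noise": rng.normal(0, 1, 800),
})
# Four overlapping training subsets stand in for the real fold structure.
folds = [(df.sample(frac=0.75, random_state=i), None) for i in range(4)]
kept = rule1_keep(folds, ["income", "noise"], "def_year")
```

On this toy data, `income` passes on every subset while `noise` does not survive the all-folds requirement.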
Rule 2: Drop categorical variables weakly linked to default
We compute Cramér’s V between each categorical variable and the default target. Cramér’s V measures the association between two categorical variables. It ranges from 0 (no link) to 1 (perfect link).
We drop a variable if its Cramér’s V falls below 10% on at least one fold. A strong association requires a V above 50%.
rule2_vars = filter_categorical_variables(
    folds=folds,
    cat_variables=categorical_vars,
    target="def_year",
    low_threshold=0.10,
    high_threshold=0.50,
)
Result: We keep 3 of the 4 categorical variables. The variable loan_intent is dropped; its link to default is too weak in at least one fold.
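`filter_categorical_variables` relies on Cramér's V, which can be computed by hand from the chi-squared statistic. A minimal sketch (the toy data and names are illustrative, not the article's pipeline):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V between two categorical series: 0 = no link, 1 = perfect link."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    k = min(table.shape) - 1  # normalization so V stays in [0, 1]
    return float(np.sqrt(chi2 / (n * k)))

# Toy check: a copy of the target has V close to 1, pure noise close to 0.
rng = np.random.default_rng(1)
target = rng.integers(0, 2, 1000)
mirror = target                   # perfect association with the target
noise = rng.integers(0, 4, 1000)  # no association with the target

v_mirror = cramers_v(mirror, target)
v_noise = cramers_v(noise, target)
```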
Rule 3: Drop redundant continuous variables
Two continuous variables that carry the same information hurt the model. They create multicollinearity.
We compute the Spearman correlation between every pair of continuous variables. If the correlation reaches 60% or more on at least one fold, we drop one variable from the pair. We keep the one with the stronger link to default, measured by the lower Kruskal-Wallis p-value.
selected_continuous = filter_correlated_variables_kfold(
    folds=folds,
    variables=rule1_vars,
    target="def_year",
    threshold=0.60,
)
Result: We keep 5 continuous variables. We drop loan_amnt and cb_person_cred_hist_length; both were strongly correlated with other retained variables. This matches the correlation analysis from our previous article.
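`filter_correlated_variables_kfold` is not shown in this article; the core dedup step can be sketched like this, with the Kruskal-Wallis p-values passed in and all names and data purely illustrative:

```python
from itertools import combinations

import numpy as np
import pandas as pd

def drop_redundant(df, variables, target_pvalues, threshold=0.60):
    """For each pair with |Spearman| >= threshold, drop the variable with
    the weaker link to default (i.e. the higher Kruskal-Wallis p-value)."""
    corr = df[variables].corr(method="spearman").abs()
    dropped = set()
    for a, b in combinations(variables, 2):
        if a in dropped or b in dropped:
            continue
        if corr.loc[a, b] >= threshold:
            dropped.add(a if target_pvalues[a] > target_pvalues[b] else b)
    return [v for v in variables if v not in dropped]

# Toy data: `loan_amnt` is nearly a rescaled copy of `income`.
rng = np.random.default_rng(2)
income = rng.normal(50, 10, 500)
df = pd.DataFrame({
    "income": income,
    "loan_amnt": income * 0.3 + rng.normal(0, 1, 500),  # highly correlated
    "age": rng.normal(40, 12, 500),                     # independent
})
# Assumed per-variable Kruskal-Wallis p-values against default.
pvals = {"income": 1e-6, "loan_amnt": 1e-3, "age": 1e-4}
selected = drop_redundant(df, ["income", "loan_amnt", "age"], pvals)
```

Here `income` and `loan_amnt` correlate far above 60%, and `loan_amnt` loses because its p-value against default is higher.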
Rule 4: Drop redundant categorical variables
We apply the same logic to categorical variables. We compute Cramér’s V between every pair of categorical variables retained after Rule 2. If the V reaches 50% or more on at least one fold, we drop the variable least linked to default.
selected_categorical = filter_correlated_categorical_variables(
    folds=folds,
    cat_variables=rule2_vars,
    target="def_year",
    high_threshold=0.50,
)
Result: We keep 2 categorical variables. We drop loan_grade: it is strongly correlated with another retained variable and has the weaker link to default.
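The same pairwise logic applies here, with Cramér's V in place of Spearman correlation. A self-contained sketch under the same caveats (the project helper's real implementation may differ):

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V between two categorical series (0 = no link, 1 = perfect)."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    return float(np.sqrt(chi2 / (table.to_numpy().sum() * (min(table.shape) - 1))))

def drop_redundant_categorical(df, variables, target, high_threshold=0.50):
    """For pairs with V >= threshold, drop the variable less linked to the target."""
    dropped = set()
    for a, b in combinations(variables, 2):
        if a in dropped or b in dropped:
            continue
        if cramers_v(df[a], df[b]) >= high_threshold:
            weaker = a if cramers_v(df[a], df[target]) < cramers_v(df[b], df[target]) else b
            dropped.add(weaker)
    return [v for v in variables if v not in dropped]

# Toy data: `grade` is a noisy copy of `intent`; default is driven by `intent`.
rng = np.random.default_rng(3)
intent = rng.integers(0, 3, 1000)
grade = np.where(rng.random(1000) < 0.9, intent, rng.integers(0, 3, 1000))
df = pd.DataFrame({"intent": intent, "grade": grade,
                   "def_year": (intent == 0).astype(int)})
kept_cat = drop_redundant_categorical(df, ["intent", "grade"], "def_year")
```

The two toy variables are redundant (V well above 50%), so the one less linked to default, `grade`, is dropped.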
Final Selection: 7 Variables
The filter method selects 7 variables in total, 5 continuous and 2 categorical. Each one is significantly linked to default. None of them are redundant. And they all hold up on every fold.
This selection is auditable. You can show every decision to a regulator or a business stakeholder. You can explain why each variable was kept or dropped. That matters in credit scoring.
Each rule runs on the training set of each fold. A variable is dropped if it fails on any single fold. This is what makes the selection robust.
In the next article, we will study the monotonicity and temporal stability of these 7 variables. A variable can be significant today and unstable over time. Both properties matter in production scoring models.
Key points from the article:
- Most data scientists select variables once, on the full training data; those variables break on new data. Rule 1 fixes this: we run a Kruskal-Wallis test on every fold separately. The association between each continuous variable and default must be significant in all four folds.
- Categorical variables are the silent killers of scoring models. They look correlated with default on the full dataset. They fall apart on a subset. Rule 2 catches them: we compute Cramér’s V on each fold independently. Below 10% on any single fold, it’s gone.
- Two continuous variables that say the same thing don’t double your signal. They destroy your model. Rule 3 detects every correlated pair (Spearman ≥ 60%) across all folds. When two variables fight, the one with the weakest link to default loses.
- Categorical redundancy is invisible until your model fails an audit. Rule 4 surfaces it: we compute Cramér's V between every pair of categorical variables. Above 50% on any fold, one goes. We keep the one more strongly linked to default.
Found this useful? Star the repo on GitHub and stay tuned for the next post on monotonicity and temporal stability.
How do you select variables robustly in your own models?
Image Credits
All images and visualizations in this article were created by the author using Python (pandas, matplotlib, seaborn, and plotly) and Excel, unless otherwise stated.
Data & Licensing
The dataset used in this article is released under the CC0: Public Domain dedication.
This allows anyone to share and adapt the dataset for any purpose, including commercial use.
For more details, see the official license text: CC0: Public Domain.
Disclaimer
Any remaining errors or inaccuracies are the author’s responsibility. Feedback and corrections are welcome.
