Credit Risk Estimation with Non-Financial Features: Evidence from a Synthetic Istanbul Dataset
ArXiv ID: 2512.12783
Authors: Atalay Denknalbant, Emre Sezdi, Zeki Furkan Kutlu, Polat Goktas
Abstract
Financial exclusion constrains entrepreneurship, increases income volatility, and widens wealth gaps. Underbanked consumers in Istanbul often have no bureau file because their earnings and payments flow through informal channels. To study how such borrowers can be evaluated, we create a synthetic dataset of one hundred thousand Istanbul residents that reproduces first-quarter 2025 TÜİK census marginals and telecom usage patterns. Retrieval-augmented generation feeds these public statistics into the OpenAI o3 model, which synthesises realistic yet private records. Each profile contains seven socio-demographic variables and nine alternative attributes that describe phone specifications, online shopping rhythm, subscription spend, car ownership, monthly rent, and a credit card flag. To test the impact of the alternative financial data, CatBoost, LightGBM, and XGBoost are each trained in two versions. Demo models use only the socio-demographic variables; Full models include both socio-demographic and alternative attributes. Across five-fold stratified validation, the alternative block raises area under the curve by about 1.3 percentage points and lifts balanced $F_1$ from roughly 0.84 to 0.95, a fourteen percent gain. We contribute an open Istanbul 2025 Q1 synthetic dataset, a fully reproducible modeling pipeline, and empirical evidence that a concise set of behavioural attributes can approach bureau-level discrimination power while serving borrowers who lack formal credit records. These findings give lenders and regulators a transparent blueprint for extending fair and safe credit access to the underbanked.
Keywords: Credit Scoring, Alternative Data, Underbanked, CatBoost, Synthetic Data
Complexity vs Empirical Score
- Math Complexity: 2.0/10
- Empirical Rigor: 7.0/10
- Quadrant: Street Traders
- Why: The paper uses standard machine learning models (CatBoost, LightGBM, XGBoost) with relatively simple implementation and presents clear empirical results (AUC, F1 scores) from a reproducible pipeline on a synthetic dataset, but does not involve advanced mathematical derivations or complex statistical theory.
flowchart TD
A["Research Goal:\nEstimate credit risk for underbanked\nusing alternative data"] --> B["Data Synthesis:\nCreate Istanbul 2025 Q1 synthetic dataset\nvia RAG + OpenAI o3"]
B --> C["Features & Splits:\nSocio-demographic + Alternative attributes\n5-fold stratified validation"]
C --> D["Modeling:\nTrain CatBoost, LightGBM, XGBoost\n(Demo vs Full models)"]
D --> E["Outcomes:\nAUC +1.3 pp, F1 0.84→0.95\nOpen dataset & pipeline released"]
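
The Demo-versus-Full comparison under stratified five-fold validation can be illustrated with a small, dependency-free sketch. This is not the authors' pipeline: a plain logistic regression stands in for CatBoost, LightGBM, and XGBoost, the data are toy values drawn from a random generator, and the column roles (two "socio-demographic" columns, two "alternative" columns) are hypothetical. It only shows the mechanics of splitting by class, fitting on each training fold, and averaging AUC for two feature sets.

```python
import math
import random


def stratified_folds(labels, k=5, seed=0):
    """Assign row indices to k folds, spreading each class evenly across folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def auc(y_true, scores):
    """AUC via the rank-sum (Mann-Whitney) formulation; ties get average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tied group
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def fit_logistic(X, y, epochs=200, lr=0.1):
    """Batch gradient descent for logistic regression (stand-in model)."""
    w = [0.0] * (len(X[0]) + 1)  # w[0] is the intercept
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for row, target in zip(X, y):
            z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], row))
            err = 1 / (1 + math.exp(-z)) - target
            grad[0] += err
            for j, xi in enumerate(row):
                grad[j + 1] += err * xi
        w = [wi - lr * g / n for wi, g in zip(w, grad)]
    return w


def cv_auc(X, y, cols, k=5):
    """Mean AUC over k stratified folds using only the columns in `cols`."""
    fold_aucs = []
    for fold in stratified_folds(y, k):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        w = fit_logistic([[X[i][c] for c in cols] for i in train],
                         [y[i] for i in train])
        scores = [w[0] + sum(w[j + 1] * X[i][c] for j, c in enumerate(cols))
                  for i in fold]
        fold_aucs.append(auc([y[i] for i in fold], scores))
    return sum(fold_aucs) / len(fold_aucs)


# Toy data (assumption): columns 0-1 play the socio-demographic role,
# columns 2-3 play the alternative-attribute role and carry most of the signal.
rng = random.Random(42)
X, y = [], []
for _ in range(400):
    row = [rng.gauss(0, 1) for _ in range(4)]
    logit = 0.5 * row[0] + 1.5 * row[2] + 1.0 * row[3]
    y.append(1 if rng.random() < 1 / (1 + math.exp(-logit)) else 0)
    X.append(row)

demo_auc = cv_auc(X, y, cols=[0, 1])        # "Demo": socio-demographic only
full_auc = cv_auc(X, y, cols=[0, 1, 2, 3])  # "Full": both blocks
print(f"Demo AUC: {demo_auc:.3f}  Full AUC: {full_auc:.3f}")
```

Because the toy labels are generated mostly from the alternative columns, the Full variant scores higher here by construction; the gap says nothing about the magnitudes reported in the paper.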