Modern approaches to building interpretable models of the property market using machine learning on the base of mass cadastral valuation
ArXiv ID: 2506.15723
Authors: Irina G. Tanashkina, Alexey S. Tanashkin, Alexander S. Maksimchuik, Anna Yu. Poshivailo
Abstract
In this article, we review modern approaches to building interpretable models of property markets using machine learning, based on the mass valuation of property in the Primorye region, Russia. A researcher who lacks expertise in this topic encounters numerous difficulties when trying to build a good model. The main source of these difficulties is the large gap between noisy real market data and the idealized data that is common in machine learning tutorials. This paper covers all stages of modeling: the collection of initial data, the identification of outliers, the search for and analysis of patterns in the data, the formation and final choice of price factors, the building of the model, and the evaluation of its efficiency. For each stage, we highlight potential issues and describe sound methods for overcoming them, using actual examples. We show that combining classical linear regression with geostatistical interpolation methods makes it possible to build an effective model for land parcels. For flats, where many objects are attributed to a single spatial point, geostatistical methods are difficult to apply. We therefore suggest linear regression with automatic generation and selection of additional rules based on decision trees, the so-called RuleFit method. Thus we show that, despite a restriction as strong as the requirement of interpretability, which is important in practice (for example, in legal matters), it is still possible to build effective models of real property markets.
Keywords: mass valuation, RuleFit, geostatistics, regression modeling, feature engineering, real estate
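The land-parcel approach in the abstract (linear regression combined with geostatistical interpolation) can be illustrated with a minimal regression-kriging sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the exponential variogram without nugget, the fixed correlation range `corr_range`, the function names, and the synthetic data. The idea is to fit ordinary least squares on the price factors, then interpolate the residuals over space with ordinary kriging and add them back.

```python
import numpy as np


def variogram(h, sill, corr_range):
    """Exponential variogram without nugget (an assumed, unfitted model)."""
    return sill * (1.0 - np.exp(-h / corr_range))


def fit_ols(X, y):
    """Least-squares coefficients for y ~ intercept + X."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta


def predict_ols(beta, X):
    return np.column_stack([np.ones(len(X)), X]) @ beta


def krige(coords, values, targets, sill, corr_range):
    """Ordinary kriging of `values` observed at `coords` onto `targets`."""
    n = len(coords)
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=2)
    # Kriging system: variogram matrix bordered by the unbiasedness constraint.
    K = np.ones((n + 1, n + 1))
    K[:n, :n] = variogram(d, sill, corr_range)
    K[n, n] = 0.0
    d0 = np.linalg.norm(coords[:, None] - targets[None, :], axis=2)  # (n, m)
    rhs = np.ones((n + 1, len(targets)))
    rhs[:n] = variogram(d0, sill, corr_range)
    weights = np.linalg.solve(K, rhs)[:n]  # drop the Lagrange multiplier row
    return weights.T @ values


def regression_kriging(X, coords, y, X_new, coords_new, corr_range=3.0):
    """Linear trend from OLS plus kriged OLS residuals."""
    beta = fit_ols(X, y)
    resid = y - predict_ols(beta, X)
    spatial = krige(coords, resid, coords_new, resid.var(), corr_range)
    return predict_ols(beta, X_new) + spatial


# Synthetic example: all numbers below are made up for illustration.
gen = np.random.default_rng(0)
coords = gen.uniform(0, 10, size=(80, 2))   # hypothetical parcel centroids
area = gen.uniform(5, 30, size=80)          # hypothetical parcel area
price = 100 + 4 * area + 30 * np.sin(coords[:, 0] / 2) + gen.normal(0, 1, 80)

train, test = np.arange(60), np.arange(60, 80)
pred = regression_kriging(area[train], coords[train], price[train],
                          area[test], coords[test])
print("hold-out RMSE:", np.sqrt(np.mean((pred - price[test]) ** 2)))
```

With a zero nugget, two observations at identical coordinates produce identical rows in the kriging matrix and the system becomes singular. This is one way to see why the abstract calls geostatistical methods difficult to apply to flats, where many objects share a single spatial point.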
Complexity vs Empirical Score
- Math Complexity: 6.0/10
- Empirical Rigor: 7.5/10
- Quadrant: Holy Grail
- Why: The paper uses advanced statistical methods (regression-kriging, RuleFit) for spatial modeling and mentions graph theory, indicating moderate-to-high math complexity. It is heavily focused on data and implementation, covering the full real-world pipeline from data collection and outlier detection to model evaluation for specific property types (flats and land parcels), which demonstrates high empirical rigor.
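The RuleFit idea referred to above (linear regression plus rules generated from decision trees and then selected automatically) can be sketched as follows. This is a simplified illustration, not the authors' implementation. It assumes scikit-learn is available, grows shallow trees on bootstrap samples instead of using a boosted ensemble, and uses a fixed Lasso penalty `alpha`. The helper names `node_rules` and `rulefit` and the synthetic flat data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor


def node_rules(tree, names):
    """Map each non-root node of a fitted tree to a readable rule string."""
    t = tree.tree_
    rules, stack = {}, [(0, [])]
    while stack:
        node, conds = stack.pop()
        if node != 0:
            rules[node] = " & ".join(conds)
        left, right = t.children_left[node], t.children_right[node]
        if left != right:  # internal node (a leaf has both children set to -1)
            name, thr = names[t.feature[node]], t.threshold[node]
            stack.append((left, conds + [f"{name} <= {thr:.2f}"]))
            stack.append((right, conds + [f"{name} > {thr:.2f}"]))
    return rules


def rulefit(X, y, names, n_trees=10, depth=2, alpha=0.1, seed=0):
    """Generate rules from bootstrapped trees, then select terms with Lasso."""
    rng = np.random.default_rng(seed)
    rules = {}  # rule text -> 0/1 indicator column (duplicate texts collapse)
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), len(X))  # bootstrap sample
        tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
        tree.fit(X[idx], y[idx])
        path = tree.decision_path(X).toarray()  # path[i, k] = 1 if row i visits node k
        for node, text in node_rules(tree, names).items():
            rules.setdefault(text, path[:, node])
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # standardized linear terms
    Z = np.column_stack([Xs] + list(rules.values()))
    terms = list(names) + list(rules)
    model = Lasso(alpha=alpha, max_iter=50_000).fit(Z, y)
    return model, terms, Z


# Synthetic flats: price grows with area, with a discount on the first floor.
gen = np.random.default_rng(42)
area = gen.uniform(30, 100, 400)
floor = gen.integers(1, 11, 400).astype(float)
X = np.column_stack([area, floor])
y = 1.0 * area - 20.0 * (floor == 1) + gen.normal(0, 2, 400)

model, terms, Z = rulefit(X, y, ["area", "floor"])
for term, coef in zip(terms, model.coef_):
    if abs(coef) > 1e-6:
        print(f"{coef:+8.2f}  {term}")
```

Each surviving term is either a standardized price factor or a plain-language condition with an additive coefficient, which is what keeps a model of this kind interpretable. In practice the penalty would be tuned, for example by cross-validation, rather than fixed.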
flowchart TD
A["Research Goal: Build interpretable ML models<br>for mass property valuation in Primorye"] --> B{"Data Processing &<br>Feature Engineering"}
B --> C["Geostatistical Interpolation<br>(for Land Parcels)"]
B --> D["RuleFit (Linear Regression +<br>Decision Tree Rules)<br>(for Flats)"]
C --> E["Computational Process<br>Model Building & Validation"]
D --> E
E --> F["Key Findings/Outcomes<br>Effective interpretable models achieved despite data noise.<br>Land: Linear regression + Geostatistics.<br>Flats: RuleFit allows interpretability with complex data."]