Classification algorithms for modeling economic choice

Anton Gerunov — Sofia University St. Kliment Ohridski
Published: 01.02.2020

Abstract

The article shows how novel machine learning algorithms can be applied to economic problems of binary discrete choice. It examines three typical business tasks: classifying overdraft applications, credit risk management, and marketing segmentation. Each task is modelled with two traditional econometric methods (logistic regression and linear discriminant analysis) and five more advanced machine learning algorithms (neural networks, k-nearest neighbours, naïve Bayes classifier, random forest, and support vector machine). Across all three classification tasks, the random forest algorithm consistently achieves higher forecasting accuracy than the traditional approaches. This underlines the need to supplement the classical econometric toolbox with newer methods, with the random forest, the support vector machine, and the neural network being prime candidates.
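The kind of comparison described above can be illustrated with a minimal sketch. It is not the article's implementation or data: the data set is synthetic, all function names (`make_choices`, `fit_logistic`, `fit_knn`, `accuracy`) are hypothetical, and only two of the seven methods are shown, one traditional (logistic regression) and one machine learning method (k-nearest neighbours), each scored by hold-out accuracy.

```python
import math
import random


def make_choices(n, seed=0):
    """Synthetic binary-choice data (an assumption, not the article's data).

    Each row is ([feature_1, feature_2], label), with label 1 = 'accept'.
    """
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        label = i % 2
        centre = 1.5 if label == 1 else -1.5
        rows.append(([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)], label))
    rng.shuffle(rows)
    return rows


def fit_logistic(train, epochs=200, lr=0.1):
    """Logistic regression fitted by batch gradient descent on the log-loss."""
    w = [0.0, 0.0]
    b = 0.0
    n = len(train)
    for _ in range(epochs):
        gw = [0.0, 0.0]
        gb = 0.0
        for x, y in train:
            p = 1.0 / (1.0 + math.exp(-(w[0] * x[0] + w[1] * x[1] + b)))
            err = p - y  # gradient of the log-loss w.r.t. the linear score
            gw[0] += err * x[0]
            gw[1] += err * x[1]
            gb += err
        w = [w[0] - lr * gw[0] / n, w[1] - lr * gw[1] / n]
        b -= lr * gb / n
    # Predict class 1 when the fitted probability exceeds 0.5.
    return lambda x: 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0


def fit_knn(train, k=5):
    """k-nearest neighbours: majority vote among the k closest training rows."""
    def predict(x):
        def sq_dist(row):
            return (row[0][0] - x[0]) ** 2 + (row[0][1] - x[1]) ** 2
        nearest = sorted(train, key=sq_dist)[:k]
        votes = sum(label for _, label in nearest)
        return 1 if 2 * votes > k else 0
    return predict


def accuracy(predict, test):
    """Share of hold-out observations classified correctly."""
    hits = sum(1 for x, y in test if predict(x) == y)
    return hits / len(test)


data = make_choices(400)
train, test = data[:300], data[300:]  # simple hold-out split

results = {
    "logistic regression": accuracy(fit_logistic(train), test),
    "k-nearest neighbours": accuracy(fit_knn(train), test),
}
for name, acc in results.items():
    print(f"{name}: hold-out accuracy = {acc:.3f}")
```

On data this cleanly separated both methods should score similarly, so the sketch shows only the mechanics of a like-for-like comparison, not the ranking reported in the abstract.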
