Data augmentation method for training a classification model based on the imposition of random noise, taking into account the values of the target feature
Authors: Artyukhin N.P.
Published in issue: #4(99)/2025
DOI:
Category: Informatics, Computer Engineering and Control | Chapter: Information Technology. Computer Technologies. Theory of Computers and Systems
Keywords: machine learning, classification model, training dataset augmentation, data transformation, random data generation, data mixing, structured data, tables, Gini index |
Published: 28.08.2025
The paper investigates the problem of insufficient training data for a classification model and the methods used to address it. It reviews the subject area and the existing data augmentation methods for classification models, and formulates criteria for comparing these methods. The paper then describes a new data augmentation algorithm that imposes random noise on numerical features and replaces the values of categorical features with the most common ones, taking into account the value of the target feature of each record in the original dataset. The impact of the developed algorithm on the quality of the classification model is studied by comparing the results of training on the initial dataset with the results obtained after applying each of the considered sample-expansion methods: adding real data, adding randomly generated data, and adding mixed initial data. The Gini coefficient is used to assess the quality of the trained classification models. The results show that applying the developed algorithm to the initial training dataset improves the accuracy of the classification model's predictions, and that the algorithm surpasses similar methods of adding synthetic data.
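The abstract describes the augmentation step only in outline: add random noise to numerical features and replace categorical values with the most common value among records sharing the same target label. The following is a minimal sketch of such a step, not the paper's actual implementation; the Gaussian noise model, the per-feature scaling of its amplitude, and all function, column, and dataset names are illustrative assumptions.

```python
import random
from collections import Counter

def augment(records, numeric_keys, categorical_keys, target_key,
            noise_scale=0.05, seed=0):
    """Create one synthetic copy of each record: add random noise to
    numeric features and replace each categorical value with the most
    common value among records sharing the same target label.

    Sketch only: the noise distribution and scaling are assumptions,
    not taken from the paper.
    """
    rng = random.Random(seed)

    # Most frequent categorical value per (target label, feature) pair.
    modes = {}
    for key in categorical_keys:
        for label in {r[target_key] for r in records}:
            values = [r[key] for r in records if r[target_key] == label]
            modes[(label, key)] = Counter(values).most_common(1)[0][0]

    synthetic = []
    for r in records:
        new = dict(r)  # the target label is carried over unchanged
        for key in numeric_keys:
            # Noise amplitude proportional to the feature's value range.
            span = (max(x[key] for x in records)
                    - min(x[key] for x in records))
            new[key] = r[key] + rng.gauss(0.0, noise_scale * (span or 1.0))
        for key in categorical_keys:
            new[key] = modes[(r[target_key], key)]
        synthetic.append(new)
    return records + synthetic

# Illustrative toy dataset (not from the paper).
data = [
    {"income": 30.0, "city": "A", "default": 0},
    {"income": 32.0, "city": "B", "default": 0},
    {"income": 70.0, "city": "C", "default": 1},
]
expanded = augment(data, ["income"], ["city"], "default")
print(len(expanded))  # → 6, the dataset size is doubled
```

For the evaluation metric mentioned above, the Gini coefficient of a binary classifier is commonly computed from the ROC AUC as Gini = 2·AUC − 1, so a perfect model scores 1 and a random one scores 0.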
