Analysis and design of scalable pre-processing techniques of instances for imbalanced Big Data problems: Applications in humanitarian emergencies situations

Abstract

Big Data refers to the enormous volume of data that comes from different sources, varies widely in type, and is generated and processed at great speed. The value of data lies in the knowledge that can be extracted from it: being able to exploit a large amount of data allows us to explore and better understand problems and, in principle, to provide higher-quality solutions. To do this, applying Machine Learning to generate models is essential, as is Smart Data, so that these models reflect reality and support decision-making. However, the Machine Learning techniques that have offered good results so far cannot always handle Big Data because of scalability issues. They therefore need to be adapted to work in distributed environments, or new techniques or strategies need to be created for this scenario. In addition, datasets often have undesired characteristics or complexities that interfere with the effectiveness of the knowledge extraction process, so they must be preprocessed, because most learning models assume that the data are free of those characteristics. Since few scalable solutions capable of handling Big Data exist for this topic, this thesis addresses the distributed and scalable pre-processing of Big Data sets in order to obtain good-quality data, known as Smart Data. In particular, it focuses on classification problems and on addressing the following characteristics: (a) imbalanced data; (b) redundancy; (c) high dimensionality; and (d) overlapping.

Summary of the thesis defended by the author in May 2022 at the UNLP.

Facultad de Informátic
