9 research outputs found

    A Survey of Methods for Handling Disk Data Imbalance

    Class imbalance exists in many classification problems. Because classifiers are typically designed to maximize overall accuracy, imbalance between classes makes classification difficult, and the few minority classes tend to carry the higher misclassification costs. The Backblaze dataset, a widely used hard-disk dataset, contains a small amount of failure data and a large amount of healthy data, and so exhibits serious class imbalance. This paper provides a comprehensive overview of research on imbalanced data classification. The discussion is organized into three main aspects: data-level methods, algorithm-level methods, and hybrid methods. For each type of method, we summarize and analyze the existing problems, algorithmic ideas, strengths, and weaknesses. The challenges of imbalanced data classification are also discussed, along with strategies to address them, so that researchers can choose the appropriate method for their needs.
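
As one illustration of the algorithm-level family named in the abstract above, a cost-sensitive learner can weight each class inversely to its frequency. The sketch below is not taken from the survey. The function name `balanced_class_weights` and the 990/10 split are hypothetical, and the formula `n_samples / (n_classes * count)` is the common "balanced" heuristic, assumed here for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return a weight per class, inversely proportional to class frequency.

    Uses the heuristic n_samples / (n_classes * class_count), so rare
    classes receive proportionally larger weights in a weighted loss.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical disk-health labels: 990 healthy drives, 10 failed drives.
disk_labels = ["healthy"] * 990 + ["failed"] * 10
weights = balanced_class_weights(disk_labels)
print(weights)
```

With these made-up counts, an error on a failed drive is weighted roughly a hundred times more heavily than an error on a healthy one.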

    A HYBRID DEEP LEARNING APPROACH FOR SENTIMENT ANALYSIS IN PRODUCT REVIEWS

    Product reviews play a crucial role in providing valuable insights to consumers and producers. Analyzing the vast amount of data generated around a product, such as posts, comments, and views, can be challenging for business intelligence purposes. Sentiment analysis of this content helps both consumers and producers better understand the state of the market, enabling them to make informed decisions. In this study, we propose a novel hybrid approach based on deep neural networks (DNNs) for sentiment analysis in product reviews, focusing on the classification of the sentiments expressed. Our approach uses the recursive neural network (RNN) algorithm for sentiment classification. To address the imbalanced distribution of positive and negative samples in social network data, we employ a resampling technique that balances the dataset by increasing samples from the minority class and decreasing samples from the majority class. We evaluate our approach using Amazon data comprising four product categories: clothing, cars, luxury goods, and household appliances. Experimental results demonstrate that the proposed approach performs well in sentiment analysis for product reviews, particularly in the context of digital marketing. Furthermore, the attention-based RNN algorithm outperforms the baseline RNN by approximately 5%. Notably, the study reveals variations in consumer sentiment across different products, particularly in relation to appearance and price.
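
The resampling step described in the abstract above (growing the minority class while shrinking the majority class) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `hybrid_resample`, the `target_per_class` parameter, and the toy review data are all hypothetical, and the over-sampling here is plain random duplication.

```python
import random
from collections import Counter


def hybrid_resample(samples, labels, target_per_class, seed=0):
    """Bring every class to exactly target_per_class examples.

    Classes larger than the target are randomly under-sampled (without
    replacement); smaller classes keep all their examples and are padded
    with random duplicates (over-sampling with replacement).
    """
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    out_samples, out_labels = [], []
    for y, xs in by_class.items():
        if len(xs) >= target_per_class:
            chosen = rng.sample(xs, target_per_class)
        else:
            chosen = xs + rng.choices(xs, k=target_per_class - len(xs))
        out_samples.extend(chosen)
        out_labels.extend([y] * target_per_class)
    return out_samples, out_labels


# Hypothetical toy data: 90 positive reviews and 10 negative reviews.
reviews = [f"review {i}" for i in range(100)]
sentiments = ["positive"] * 90 + ["negative"] * 10
balanced_reviews, balanced_sentiments = hybrid_resample(
    reviews, sentiments, target_per_class=50
)
print(Counter(balanced_sentiments))
```

Resampling of this kind is normally applied to the training split only, so that duplicated minority examples do not leak into the evaluation data.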

    Predictive analytics applied to firefighter response, a practical approach

    Time is a crucial factor in the outcome of emergencies, especially those that involve human lives. This paper looks at firefighter occurrences in Lisbon and presents a model, based on city characteristics and climatic data, to predict whether there will be an occurrence at a certain location given the weather forecast. Three algorithms were considered: Logistic Regression, Decision Tree, and Random Forest. Measured by AUC, the best-performing model was a random forest with random under-sampling, at 0.68. This model was well adjusted across the city and showed that precipitation and size of the subsection are the most relevant features in predicting firefighter occurrences. The work has clear implications for firefighters' decision-making regarding vehicle allocation, as they can now make an informed decision that considers the predicted occurrences.

    Optimization of firefighter response with predictive analytics : practical application to Lisbon, Portugal

    Dissertation presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Knowledge Management and Business Intelligence. Time is a crucial factor in the outcome of emergencies, especially those that involve human lives. This paper looks at firefighter occurrences in Lisbon and presents a model, based on city characteristics and climatic data, to predict whether there will be an occurrence at a certain location given the weather forecast. Three algorithms were considered (Logistic Regression, Decision Tree, and Random Forest), along with four techniques to balance the data (random over-sampling, SMOTE, random under-sampling, and Near Miss), which were compared against the baseline of the imbalanced data. Measured by AUC, the best-performing model was a random forest with random under-sampling, at 0.68. This model was well adjusted across the city and showed that precipitation and size of the subsection are the most relevant features in predicting firefighter occurrences. The work has clear implications for firefighters' decision-making regarding vehicle allocation, as they can now make an informed decision that considers the predicted occurrences.
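
To make the AUC figure quoted above easier to interpret, the sketch below computes AUC as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, counting ties as one half. The function name `roc_auc` and the labels and scores are hypothetical illustrations, not data from the dissertation.

```python
def roc_auc(labels, scores):
    """Compute ROC AUC by comparing every positive/negative score pair.

    labels: iterable of 0/1 ground-truth values.
    scores: iterable of model scores, higher meaning "more likely positive".
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical predictions: 1 = an occurrence happened, 0 = none.
example_labels = [0, 0, 0, 0, 1, 1]
example_scores = [0.10, 0.40, 0.35, 0.80, 0.70, 0.45]
example_auc = roc_auc(example_labels, example_scores)
print(example_auc)
```

Under this reading, an AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking. Because AUC depends only on the ordering of scores, it is a common choice for comparing models trained on differently balanced data.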

    Data driven approach for predicting student dropout in secondary schools

    A Thesis Submitted in Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Information and Communication Science and Engineering of the Nelson Mandela African Institution of Science and Technology. Student dropout is among the challenges facing most schools in developing countries, particularly in Africa. In Tanzania alone, student dropout in secondary schools is reported to be around 36%. Addressing the problem requires a thorough understanding of the fundamental factors that cause student dropout. Several researchers have identified and proposed causes, methods, and strategies to reduce or stop student dropout; however, most of the proposed solutions did not show promising results, and the student dropout trend continues to increase over time. This study focused on developing a data-driven approach to identify and predict students who are at risk of dropping out of school, in order to facilitate an intervention program as an active measure for eliminating dropout in Tanzania. To that end, (a) 122 research articles were examined, (b) 4 focus group discussions and 2 round table surveys with 38 respondents from 5 districts (Arusha, Mbeya, Kisarawe, Rufiji and Nzega) were conducted, and (c) 3 datasets from Tanzania and India were used to identify the factors that contribute significantly to student dropout, to find the best classifier among the commonly used ones (Logistic Regression, Random Forest, K-nearest Neighbor and Multilayer Perceptron), and to assess the effect of data balancing techniques on the predictive performance of the model. Results revealed that most of the respondents mentioned students' gender, age, parents' income, number of qualified teachers, and remoteness as the main factors contributing to student dropout in secondary schools.
Furthermore, the examined articles indicated that most studies conducted in developing countries focused on the social aspects of student dropout, and only a few mentioned other approaches such as machine learning. Results from developing the data-driven approach show that Logistic Regression and Multilayer Perceptron achieved the highest performance when an over-sampling technique was employed. In addition, hyperparameter tuning improved the algorithms' performance compared to their baseline settings, and stacking the classifiers improved the overall predictive performance of the developed approach. The study therefore recommends that relevant authorities consider the developed approach for identifying and predicting students at risk of dropping out, to support early intervention, planning, and informed decision-making on the student dropout problem.

    Definición de una metodología para análisis de discurso basado en lingüística computacional y técnicas de aprendizaje de máquina

    The actions carried out by a state regulatory body generate multiple opinions among citizens, and these opinions give rise to debates in which people agree, disagree, or partially agree with the decisions or strategies proposed. To learn citizens' opinions, a project called "Tenemos que hablar Chile" (We have to talk Chile) was created in Chile. It asked structured questions to a group of citizens, and each person's answer was classified by the moderator. This label was used for different discourse analyses that were developed without any specific order. The project was replicated in Colombia under the same dynamics, again to learn citizens' opinions; however, the techniques used were different from those of the Chilean project. As a result, although both projects had the same dynamics and sought a similar result, the techniques developed in the Chilean project could not be reused in Colombia. For this reason, this master's project proposes a methodology that allows the use of different discourse analysis techniques based on computational linguistics and machine learning, and that provides the team of analysts with a scheme of stages, each equipped with Natural Language Processing (NLP) tools and techniques, to improve the efficiency of this type of project.
Among the project's strengths are the director, who has extensive experience in Machine Learning (ML) and NLP; the co-director, who has a broad understanding of the "Tenemos que Hablar Colombia" (TQHC) project; and the student, whose background in the Master's in Data Science and Analytics supports research on NLP techniques.

    Predicting Driver Takeover Performance and Designing Alert Systems in Conditionally Automated Driving

    With the Society of Automotive Engineers Level 3 automation, drivers are no longer required to actively monitor driving environments, and can potentially engage in non-driving related tasks. Nevertheless, when the automation reaches its operational limits, drivers will have to take over control of vehicles at a moment’s notice. Drivers have difficulty with takeover transitions, as they become increasingly decoupled from the operational level of driving. In response to the takeover difficulty, existing literature has investigated various factors affecting takeover performance. However, not all the factors were studied comprehensively, and the results of some factors were mixed. Meanwhile, there is a lack of research on the development of computational models that predict drivers’ takeover performance using their physiological and driving environment data. Furthermore, current research on the design of in-vehicle alert systems suffers from methodological shortcomings and presents identical takeover warnings regardless of event criticality. To address these shortcomings, the goals of this dissertation were to (1) examine the effects of drivers' cognitive load, emotions, traffic density, and takeover request lead time on their driving behavioral (takeover timeliness and quality) and psychophysiological responses (eye movements, galvanic skin responses, and heart rate activities) to takeover requests; (2) develop computational models to predict drivers’ takeover performance using their physiological and driving environment data via machine learning algorithms; and (3) design in-vehicle alert systems with different display modalities and information types and evaluate the displays in different event criticality conditions via human-subject experiments. The results of three human-subject experiments showed that positive emotional valence led to smoother takeover behaviors. 
Only when drivers had low cognitive load did they have shorter takeover reaction times in high oncoming traffic conditions. High oncoming traffic led to higher collision risk. High speed led to higher collision risk and harsher takeover behaviors in lane changing scenarios, but engendered longer takeover reaction time and smoother takeover behaviors in lane keeping scenarios. Meanwhile, we developed a random forest model to predict drivers' takeover performance with an accuracy of 84.3% and an F1-score of 64.0%. Our model had finer granularity than, and outperformed, other machine learning models used in prior studies. The findings of the alert system design studies showed that drivers had more anxiety with the "why only" information than with the "why + what will" information when information was presented in the speech modality. In high event criticality situations, drivers felt more prepared to take over control of the vehicle and preferred the combination of augmented reality and speech over the other conditions. This dissertation can add to the knowledge base about takeover response investigation, takeover performance prediction, and in-vehicle alert system design. The results will enhance the understanding of how drivers' emotions, cognitive load, traffic density, and scenario type influence their takeover responses. The computational models for takeover performance prediction are underlying algorithms of in-vehicle monitoring systems in real-world applications. The findings will provide design recommendations to automated vehicle manufacturers on in-vehicle alert systems. This will ultimately enhance the interaction between drivers and automated vehicles and improve driving safety in intelligent transportation systems.
PhD, Industrial & Operations Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/169727/1/nadu_1.pd
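
The abstract above reports both accuracy and F1-score, and the two can diverge sharply when classes are imbalanced. The sketch below shows how each is derived from confusion-matrix counts. It assumes a binary setting for simplicity, which may not match the dissertation's setup, and the counts and the function name `accuracy_and_f1` are hypothetical.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Return (accuracy, F1) for the positive class from confusion counts.

    tp/fp/fn/tn are true positives, false positives, false negatives,
    and true negatives respectively.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1


# Hypothetical counts for 100 takeover events, 25 of them "poor" takeovers
# (treated here as the positive class).
acc, f1 = accuracy_and_f1(tp=10, fp=5, fn=15, tn=70)
print(f"accuracy={acc:.2f}, f1={f1:.2f}")
```

In this made-up case the classifier is right on 80 of 100 events but recovers only 10 of the 25 positive events, so the F1-score is much lower than the accuracy.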