403 research outputs found

    ANALYZING THE IMPACT OF RESAMPLING METHOD FOR IMBALANCED DATA TEXT IN INDONESIAN SCIENTIFIC ARTICLES CATEGORIZATION

    Extremely skewed data in artificial intelligence, machine learning, and data mining often produce misleading results, because machine learning algorithms are designed to work best with balanced data. In practice, however, imbalanced data are common. The most popular technique for handling imbalanced data is resampling the dataset to adjust the number of instances in the majority and minority classes toward a balanced distribution. Many resampling techniques (oversampling, undersampling, or a combination of both) have been proposed and continue to be developed. Resampling may increase or decrease classifier performance. Comparative research on resampling methods for structured data has been widely carried out, but studies comparing resampling methods on unstructured data are rare. This raises many questions, one of which is whether these methods can be applied to unstructured data such as text, which has high dimensionality and very diverse characteristics. To understand how different resampling techniques affect classifier learning on imbalanced text data, we perform an experimental analysis using various resampling methods with several classification algorithms to classify articles in the Indonesian Scientific Journal Database (ISJD). The experiment shows that resampling techniques on imbalanced text data generally improve classifier performance, but the improvement is not significant, because text data are very diverse and high-dimensional.
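
    As an illustration of the simplest of the resampling families named in this abstract, the sketch below shows random oversampling: minority-class documents are duplicated at random until every class is as large as the largest one. It is a minimal sketch, not the procedure used in the paper; the function name `random_oversample`, the toy documents, and the class labels are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen minority-class items until all classes
    reach the size of the largest class (hypothetical helper)."""
    rng = random.Random(seed)

    # Group documents by class label.
    by_class = defaultdict(list)
    for text, label in zip(texts, labels):
        by_class[label].append(text)

    # Every class is grown to the size of the largest class.
    target = max(len(items) for items in by_class.values())

    out_texts, out_labels = [], []
    for label, items in by_class.items():
        # Sample with replacement to fill the gap; originals are all kept.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


# Toy, made-up data: four documents in one class, one in the other.
texts = ["doc a", "doc b", "doc c", "doc d", "doc e"]
labels = ["physics", "physics", "physics", "physics", "law"]

bal_texts, bal_labels = random_oversample(texts, labels)
print(Counter(bal_labels))
```

    Resampling of this kind is normally applied to the training split only, so that duplicated documents do not leak into the evaluation data.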

    Applications of Mining Arabic Text: A Review

    Since the emergence of text mining, there has been growing interest in applying text mining tasks to text written in Arabic, and researchers face several challenges. These tasks include Arabic text summarization, which is one of the challenging open research areas in natural language processing (NLP) and text mining, Arabic text categorization, and Arabic sentiment analysis. This chapter reviews past and current research and trends in these areas, along with future challenges that need to be tackled. It also presents case studies for two of the reviewed approaches.

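    A Study on the Reduction of Casualty Crashes Following Installation of Section Speed Enforcement Systems Using Interpretable Machine Learning (Interpretable Machine Learning을 활용한 구간단속시스템 설치에 따른 인명피해사고 감소 효과 연구)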

    Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Civil and Environmental Engineering, College of Engineering, August 2020. 김동규.
    In this study, a prediction model for casualty crash occurrence was developed to quantify the effect of installing a Section Speed Enforcement System (SSES), and the effect of SSES installation was divided into direct and indirect effects through mediation effect analysis. The study also recommends what should be considered when selecting candidate sites for SSES installation. The crash prediction model was developed using machine learning for binary classification, with the occurrence of a casualty crash as the response variable, and the effects of SSES installation were analyzed based on crash and speed-related variables. In particular, the Interpretable Machine Learning (IML) methodology was applied, which treats the interpretability of the predictions as being as important as predictive performance. The IML consists of a black-box model and an interpretable model; KNN, RF, and SVM were reviewed as black-box models, and DT and BLR were reviewed as interpretable models. During model development, the tunable hyper-parameters of each method were optimized through k-fold cross validation. The SVM with a polynomial kernel was selected as the black-box model and the BLR as the interpretable model for predicting the probability of casualty crash occurrence. The developed IML model was evaluated against the typical BLR from the perspective of the PDR (Predictive accuracy, Descriptive accuracy and Relevancy) framework. The evaluation confirmed that the IML model outperformed the typical BLR in predictive accuracy, descriptive accuracy, and relevancy from the viewpoint of a human in the loop. Using the IML model, the effect of SSES installation was quantified based on the probability equation of casualty crash occurrence, a logistic function of SSES, SOR, SV, TVL, HVR, and CR.
    The analysis confirmed that SSES installation reduced the probability of casualty crash occurrence by about 28%. In addition, mediation effects were analyzed for the variables affected by SSES installation (SOR and SV) to separate the direct and indirect effects on the reduction in casualty crash probability. The proportion of the indirect effect through reducing the ratio of vehicles exceeding the speed limit (SOR) was about 30%, while the indirect effect through reduction of speed variance (SV) was not statistically significant at the 95% confidence level. Finally, the probability equation was applied to sections of the Yeongdong Expressway, and the predicted crash-risk sections were compared with actual crash data to examine the applicability of the model. The results verified that the equation was reasonable. It is therefore suggested that dangerous sites first be selected based on casualty crashes and speeding, and that SSES then be installed in sections where traffic volume (TVL), heavy vehicle ratio (HVR), and curve ratio (CR) are higher than in other sections.
    Contents: 1. Introduction; 1.1. Background of research; 1.2. Objective of research; 1.3. Research Flow; 2. Literature Review; 2.1. Research related to SSES; 2.1.1. Effectiveness of SSES; 2.1.2. Installation criteria of SSES; 2.2. Machine learning about transportation; 2.2.1. Machine learning algorithm; 2.2.2. Machine learning algorithm about transportation; 2.3. Crash prediction model; 2.3.1. Frequency of crashes; 2.3.2. Severity of crash; 2.4. Interpretable Machine Learning (IML); 2.4.1. Introduction; 2.4.2. Application of IML; 3. Model Specification; 3.1. Analysis of SSES effectiveness; 3.1.1. Crashes analysis; 3.1.2. Speed analysis; 3.2. Data collection & pre-analysis; 3.2.1. Data collection; 3.2.2. Basic statistics of variables; 3.3. Response variable selection; 3.4. Model selection; 3.4.1. Binary classification; 3.4.2. Accuracy vs. Interpretability; 3.4.3. Overview of IML; 3.4.4. Process of model specification; 4. Model development; 4.1. Black-box and interpretable model; 4.1.1. Consists of IML; 4.1.2. Black-box model; 4.1.3. Interpretable model; 4.2. Model development; 4.2.1. Procedure; 4.2.2. Measures of effectiveness; 4.2.3. K-fold cross validation; 4.3. Result of model development; 4.3.1. Result of black-box model; 4.3.2. Result of interpretable model; 5. Evaluation & Application; 5.1. Evaluation; 5.1.1. The PDR framework for IML; 5.1.2. Predictive accuracy; 5.1.3. Descriptive accuracy; 5.1.4. Relevancy; 5.2. Impact of Casualty Crash Reduction; 5.2.1. Quantification of the effectiveness; 5.2.2. Mediation effect analysis; 5.3. Application for the Korean expressway; 6. Conclusion; 6.1. Summary and Findings; 6.2. Further Research.
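
    The abstract of this thesis describes its crash probability equation only as a logistic function of SSES, SOR, SV, TVL, HVR, and CR. A generic form of such an equation is sketched below; the coefficients \beta_0 through \beta_6 are placeholder symbols, since the abstract reports neither their values nor their signs.

```latex
P(\text{casualty crash}) =
  \frac{1}{1 + \exp\!\bigl(-(\beta_0
      + \beta_1\,\mathrm{SSES}
      + \beta_2\,\mathrm{SOR}
      + \beta_3\,\mathrm{SV}
      + \beta_4\,\mathrm{TVL}
      + \beta_5\,\mathrm{HVR}
      + \beta_6\,\mathrm{CR})\bigr)}
```

    Under this form, the reported reduction of about 28% would correspond to comparing the predicted probability with and without SSES installed while the other variables are held fixed; that reading is an assumption about how the figure was obtained.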

    An Optimized Approach for Maximizing Business Intelligence using Machine Learning

    Business intelligence is the field of study concerned with developing techniques and tools for analyzing business information in order to assist corporate management and decision-making. In the current climate, business intelligence is essential to formulating strategy and carrying out data-driven operations. Throughout the many stages of its operations, an organization needs assistance in evaluating data and making decisions; a decision support system can provide this assistance by including business intelligence as an essential component. However, because this enormous quantity of data is distributed across many different types of platforms, it is difficult to identify the information that is actually relevant and to use it efficiently for business intelligence. Maximizing business intelligence through the application of machine learning is one of the most important challenges facing modern society. This work offers a full, prediction-based analysis of business intelligence techniques along with their current application fields. The gap is identified, and solutions and future research directions are proposed to overcome it in order to create effective business strategies.

    Selection of Projects for Project Portfolio Using Fuzzy TOPSIS and Machine Learning

    Project portfolio management (PPM) is extremely important nowadays due to increasingly severe competition, accelerated product development, project complexity, uncertainty, and challenges from global competitors. Businesses involved in many (dozens or even hundreds of) projects therefore need to formulate tactics and strategies to secure their competencies and, most importantly, increase their productivity. In this context of globalization, PPM aims to optimize the value provided to customers while minimizing risks and the resources committed to projects, while critical success factors (CSFs) are applied to anticipate a project's risk and financial value through early assessment, helping to predict performance at the organizational level. Despite its importance, the literature on PPM and CSFs at the project level is rather limited, which calls for deeper knowledge about the assessment, ranking, and prioritization of projects at an early stage. This study addresses two research questions: do CSFs vary according to project category, and how can a supporting method be established to help portfolio managers select projects for a portfolio? This research therefore focuses on the multi-project context in order to fill these research gaps. Its contributions are to (1) verify the hypothesis that different project categories have different CSFs, and (2) explore how machine learning technology can be utilized for project selection.

    Incident duration time prediction using a supervised topic modeling method

    Precisely predicting incident duration is one of the most important components of proactive management strategies for incident-induced traffic congestion. This thesis presents a novel method for predicting incident duration in a timely manner using an emerging supervised topic modeling method. Based on Natural Language Processing (NLP) techniques, it performs semantic text analysis on a text-based incident dataset to train the model. The model is trained on 1,466 actual incident records collected by Korea Expressway Corporation from 2016 to 2019, applying a Labeled Latent Dirichlet Allocation (L-LDA) approach. For training, incident durations are divided into two groups, shorter than 2 hours and longer than 2 hours, based on the MUTCD incident management guideline. The model is tested on randomly selected incident records that were not used for training. The results show overall prediction accuracies of approximately 74% and 82% for incidents shorter and longer than 2 hours, respectively.

    DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text

    This paper describes the development of a multilingual, manually annotated dataset for three under-resourced Dravidian languages generated from social media comments. The dataset was annotated for sentiment analysis and offensive language identification for a total of more than 60,000 YouTube comments. The dataset consists of around 44,000 comments in Tamil-English, around 7,000 comments in Kannada-English, and around 20,000 comments in Malayalam-English. The data was manually annotated by volunteer annotators and has high inter-annotator agreement as measured by Krippendorff's alpha. The dataset contains all types of code-mixing phenomena since it comprises user-generated content from a multilingual country. We also present baseline experiments to establish benchmarks on the dataset using machine learning methods. The dataset is available on GitHub (https://github.com/bharathichezhiyan/DravidianCodeMix-Dataset) and Zenodo (https://zenodo.org/record/4750858#.YJtw0SYo_0M). Comment: 36 pages

    Deep learning for religious and continent-based toxic content detection and classification

    Over time, numerous online communication platforms have emerged that allow people to express themselves, increasing the dissemination of toxic language, such as racism, sexual harassment, and other negative behaviors that are not accepted in polite society. As a result, toxic language identification in online communication has emerged as a critical application of natural language processing. Numerous academic and industrial researchers have recently studied toxic language identification using machine learning algorithms. However, in several machine learning models, nontoxic comments that include particular identity descriptors, such as Muslim, Jewish, White, and Black, were assigned unrealistically high toxicity ratings. This research analyzes and compares modern deep learning algorithms for multilabel toxic comment classification. We explore two scenarios: the first is multilabel classification of religious toxic comments, and the second is multilabel classification of race- or ethnicity-related toxic comments, with various word embeddings (GloVe, Word2vec, and FastText) and without pretrained word embeddings, using an ordinary embedding layer. Experiments show that the CNN model produced the best results for classifying multilabel toxic comments in both scenarios. We compared the performance of these deep learning models in terms of multilabel evaluation metrics.