
    Reducing the Effects of Detrimental Instances

    Not all instances in a data set are equally beneficial for inducing a model of the data. Some instances (such as outliers or noise) can be detrimental. However, at least initially, machine learning algorithms generally treat all instances in a data set equally. Many current approaches for handling noisy and detrimental instances make a binary decision about whether an instance is detrimental or not. In this paper, we 1) extend this paradigm by weighting the instances on a continuous scale and 2) present a methodology for measuring how detrimental an instance may be for inducing a model of the data. We call our method of identifying and weighting detrimental instances reduced detrimental instance learning (RDIL). We examine RDIL on a set of 54 data sets and 5 learning algorithms and compare RDIL with other weighting and filtering approaches. RDIL is especially useful for learning algorithms where every instance can affect the classification boundary and the training instances are considered individually, such as multilayer perceptrons trained with backpropagation (MLPs). Our results also suggest that a more accurate estimate of which instances are detrimental can have a significant positive impact on how they are handled. Comment: 6 pages, 5 tables, 2 figures. arXiv admin note: substantial text overlap with arXiv:1403.189
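
    The abstract does not give the RDIL weighting formula, so the following is only a hypothetical sketch of continuous instance weighting: each training instance is weighted by the out-of-fold predicted probability of its own label, so instances that look detrimental (mislabeled or outlying) receive low weight. The data set and the two learners are placeholders for illustration, not the paper's setup.

```python
# Hypothetical sketch of continuous instance weighting (not the exact RDIL
# procedure): weight each training instance by the cross-validated predicted
# probability of its own label, so likely-detrimental instances get low weight.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, n_informative=5, flip_y=0.1,
                           random_state=0)

# Out-of-fold class probabilities approximate how plausible each label is.
proba = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                          cv=5, method="predict_proba")
weights = proba[np.arange(len(y)), y]          # low weight = likely detrimental

# Any learner that accepts sample_weight can use the continuous weights.
clf = GradientBoostingClassifier(random_state=0).fit(X, y, sample_weight=weights)
```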

    Learning to Auto Weight: Entirely Data-driven and Highly Efficient Weighting Framework

    Example weighting is an effective solution to the training bias problem; however, most previous methods rely on human knowledge and require laborious tuning of hyperparameters. In this paper, we propose a novel example weighting framework called Learning to Auto Weight (LAW). The proposed framework finds step-dependent weighting policies adaptively and can be jointly trained with target networks without any assumptions or prior knowledge about the dataset. It consists of three key components: a Stage-based Searching Strategy (3SM) is adopted to shrink the huge search space over a complete training process; a Duplicate Network Reward (DNR) gives more accurate supervision by removing randomness during the search; and a Full Data Update (FDU) further improves updating efficiency. Experimental results demonstrate the superiority of the weighting policies explored by LAW over the standard training pipeline. Compared with baselines, LAW finds better weighting schedules that achieve substantially higher accuracy on both biased CIFAR and ImageNet. Comment: Accepted by AAAI 202
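
    LAW learns its weighting policy by search; the sketch below only illustrates the underlying idea of step-dependent example weighting with a hand-written schedule. The annealing schedule, model, and data are invented stand-ins, not part of LAW.

```python
# Hypothetical illustration of step-dependent example weighting (not the LAW
# search itself): per-example losses are reweighted by a policy that depends
# on the current training stage, here a simple hand-written schedule.
import torch
import torch.nn as nn

def stage_weights(per_example_loss, step, total_steps):
    """Early in training, down-weight high-loss (possibly noisy) examples;
    later, weight examples more uniformly. LAW would *learn* this policy."""
    progress = step / total_steps
    temperature = 1.0 + 4.0 * (1.0 - progress)    # assumption: linear anneal
    w = torch.softmax(-per_example_loss / temperature, dim=0)
    return w * len(per_example_loss)              # keep the mean weight near 1

model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss(reduction="none")  # keep per-example losses
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
for step in range(100):
    losses = criterion(model(x), y)
    loss = (stage_weights(losses.detach(), step, 100) * losses).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```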

    An Easy to Use Repository for Comparing and Improving Machine Learning Algorithm Usage

    The results from most machine learning experiments are used for a specific purpose and then discarded. This results in a significant loss of information and requires rerunning experiments to compare learning algorithms. It also requires reimplementing other algorithms for comparison, which may not always be done correctly. By storing the results from previous experiments, machine learning algorithms can be compared easily and the knowledge gained from them can be used to improve their performance. The purpose of this work is to provide easy access to previous experimental results for learning and comparison. These stored results are comprehensive -- storing the prediction for each test instance as well as the learning algorithm, hyperparameters, and training set that were used. Previous results are particularly important for meta-learning, which, in a broad sense, is the process of learning from previous machine learning results such that the learning process is improved. While other experiment databases do exist, one of our focuses is on easy access to the data. We provide meta-learning data sets that are ready to be downloaded for meta-learning experiments. In addition, queries to the underlying database can be made if specific information is desired. We also differ from previous experiment databases in that our database is designed at the instance level, where an instance is an example in a data set. We store the predictions of a learning algorithm trained on a specific training set for each instance in the test set. Data set level information can then be obtained by aggregating the results from the instances. The instance level information can be used for many tasks, such as determining the diversity of a classifier or algorithmically determining the optimal subset of training instances for a learning algorithm. Comment: 7 pages, 1 figure, 6 table
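
    A minimal sketch of what an instance-level results store could look like; the table and column names below are assumptions for illustration and do not reflect the repository's actual schema.

```python
# Hypothetical instance-level results store: each row of instance_prediction
# records one learning algorithm's prediction for one test instance, and
# data-set-level metrics are obtained by aggregating over instances.
import sqlite3

conn = sqlite3.connect("ml_results.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS experiment (
    experiment_id   INTEGER PRIMARY KEY,
    algorithm       TEXT NOT NULL,
    hyperparameters TEXT,              -- e.g. JSON-encoded settings
    training_set    TEXT NOT NULL      -- identifier/hash of the training data
);
CREATE TABLE IF NOT EXISTS instance_prediction (
    experiment_id INTEGER REFERENCES experiment(experiment_id),
    instance_id   INTEGER NOT NULL,    -- index of the test instance
    true_label    TEXT,
    predicted     TEXT
);
""")

# Data-set-level accuracy as a simple aggregation over stored instances.
accuracy_sql = """
SELECT experiment_id,
       AVG(predicted = true_label) AS accuracy
FROM instance_prediction
GROUP BY experiment_id;
"""
```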

    Combining Cluster Validation Indices for Detecting Label Noise

    In this paper, we show that cluster validation indices can be used to filter mislabeled instances or class outliers prior to training in supervised learning problems. We propose a technique, entitled Cluster Validation Index (CVI)-based Outlier Filtering, in which mislabeled instances are identified and eliminated from the training set, and a classification hypothesis is then built from the set of remaining instances. The proposed approach assigns each instance several cluster validation scores representing its potential of being an outlier with respect to the clustering properties assessed by the chosen validation measures. We examine CVI-based Outlier Filtering and compare it against the Local Outlier Factor (LOF) detection method on ten data sets from the UCI repository using five well-known learning algorithms and three different cluster validation indices. In addition, we study and compare three different approaches for combining the selected cluster validation measures. Our results show that for most learning algorithms and data sets, the proposed CVI-based outlier filtering algorithm outperforms the baseline method (LOF). The greatest increase in classification accuracy was achieved by combining the cluster validation indices with the union or rank-based median strategies and applying global filtering of mislabeled instances.
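
    A hypothetical sketch of the filtering idea: score every instance against its own class with cluster-validation-style measures and drop instances flagged by either one (the "union" strategy mentioned above). The two indices, the 5% threshold, and the data set are assumptions, not the paper's exact configuration.

```python
# Hypothetical CVI-based outlier filtering: per-instance scores computed with
# the class labels treated as clusters; instances flagged by either index are
# removed before training the final classifier.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_samples
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Index 1: per-instance silhouette with class labels used as "clusters".
sil = silhouette_samples(X, y)

# Index 2: ratio of distance to the nearest other-class centroid vs the
# own-class centroid; small values suggest a mislabeled or outlying instance.
centroids = np.stack([X[y == c].mean(axis=0) for c in np.unique(y)])
d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
own = d[np.arange(len(y)), y]
other = np.where(np.eye(len(centroids))[y].astype(bool), np.inf, d).min(axis=1)
centroid_score = other / (own + 1e-12)

# Union strategy: flag the lowest-scoring 5% under either index.
flagged = (sil < np.quantile(sil, 0.05)) | \
          (centroid_score < np.quantile(centroid_score, 0.05))

clf = KNeighborsClassifier().fit(X[~flagged], y[~flagged])
```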

    A data mining approach to guide students through the enrollment process based on academic performance

    Student academic performance at universities is crucial for education management systems. Many actions and decisions are made based on it, specifically the enrollment process. During enrollment, students have to decide which courses to sign up for. This research presents the rationale behind the design of a recommender system to support the enrollment process using the students' academic performance records. To build this system, the CRISP-DM methodology was applied to data from students of the Computer Science Department at the University of Lima, PerĂș. One of the main contributions of this work is the use of two synthetic attributes to improve the relevance of the recommendations made. The first attribute estimates the inherent difficulty of a given course. The second attribute, named potential, is a measure of the competence of a student for a given course based on the grades obtained in related courses. Data was mined using C4.5, KNN (k-nearest neighbor), NaĂŻve Bayes, Bagging, and Boosting, and a set of experiments was carried out to determine the best algorithm for this application domain. Results indicate that Bagging is the best method with regard to predictive accuracy. Based on these results, the "Student Performance Recommender System" (SPRS) was developed, including a learning engine. SPRS was tested with a sample group of 39 students during the enrollment process. Results showed that the system performed very well under real-life conditions.
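
    A toy illustration of how the two synthetic attributes could be computed; the column names, the 0-20 grade scale, the prerequisite map, and the formulas are assumptions, since the abstract does not give exact definitions.

```python
# Hypothetical computation of the "difficulty" and "potential" attributes
# described above, on an invented grade table.
import pandas as pd

grades = pd.DataFrame({                 # historical grade records (0-20 scale)
    "student": ["s1", "s1", "s2", "s2", "s3"],
    "course":  ["calc1", "prog1", "calc1", "prog1", "calc1"],
    "grade":   [11, 15, 8, 13, 17],
})
related = {"calc2": ["calc1"], "prog2": ["prog1", "calc1"]}  # prerequisite map

# Difficulty: lower historical average grade -> harder course (scaled to [0, 1]).
difficulty = 1 - grades.groupby("course")["grade"].mean() / 20

def potential(student, course):
    """Competence estimate: mean grade the student obtained in related courses."""
    prior = grades[(grades.student == student) &
                   (grades.course.isin(related.get(course, [])))]
    return prior.grade.mean() if not prior.empty else None

print(difficulty)
print(potential("s1", "prog2"))   # average of s1's prog1 and calc1 grades
```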

    Automated Analysis of Customer Contacts – a Fintech Based Case Study

    The rapid development of information technologies has brought with it unprecedented amounts of data being generated on a daily basis and the need to analyse it automatically to gain a competitive advantage. Traditional data mining techniques have been applied efficiently in a variety of commercial applications, yet they are only applicable to structured data. However, an overwhelming amount of existing data is in an unstructured (e.g. textual) form, so it is crucial for companies to build solutions that automatically extract useful information from it. This master's thesis is of a practical nature, and its purpose was to implement an automated text analysis model, using data from TransferWise Ltd., that can be used to efficiently prioritise and measure incoming customer contacts. To achieve this, the author conducted numerous experiments employing classical as well as novel natural language processing techniques; notably, the novel methods did not yield a noticeably better outcome for this task. The resulting model is important for both the company and its customers, since it can be used to prioritise incoming contacts based on their complexity or urgency. This ensures a convenient customer experience and is likely to accelerate growth by making operational procedures more efficient. Besides its practical value, the thesis also provides an extensive comparison of numerous natural language processing techniques, their suitability, and the opportunities they offer.
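
    A minimal baseline in the spirit of the classical techniques compared in the thesis (not the thesis' actual model): TF-IDF features with a linear classifier for routing contacts by urgency. The example texts and labels are invented.

```python
# Hypothetical contact-prioritisation baseline: TF-IDF + logistic regression.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

contacts = ["My transfer has been stuck for three days",
            "How do I change my email address?",
            "Money was deducted twice, please help urgently",
            "What currencies do you support?"]
urgency = ["high", "low", "high", "low"]   # invented labels for illustration

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=1),
                      LogisticRegression(max_iter=1000))
model.fit(contacts, urgency)
print(model.predict(["My payment is missing and I need it today"]))
```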

    Design and development of a students' performance predicting LMS utilizing Machine Learning based on mental stress level measured through a Bluetooth enabled smart watch

    Stress and academic anxiety can negatively impact numerous aspects of students' lives, degrading their academic achievement, quality of life, and social behaviour. Various research suggests that depression is associated with lower academic performance. The aim of this research is twofold. Firstly, in order to establish a correlation between students' mental stress levels and their academic performance, a dataset was compiled by conducting a survey at a university located in Punjab, Pakistan. The questionnaires measured students' stress levels using the Perceived Stress Scale (PSS) and a cognitive performance assessment scale, in addition to some demographic questions. This dataset was then analysed using various machine learning algorithms. The second objective was to develop an innovative, affordable, and smart performance-predicting Learning Management System that takes students' mental stress into account when predicting their performance using machine learning models. The technique used to measure students' mental stress was based on a phenomenon known as Heart Rate Variability (HRV). A smartwatch was used to measure the students' Heart Rate Variability, which was then used to assess their stress levels in academics. A Machine Learning (ML) model was trained using various parameters derived from the Heart Rate Variability. The model was trained on the SWELL dataset, which consists of HRV indices computed from the multimodal SWELL knowledge work dataset for research on stress and user modelling. The ML model predicted the students' stress levels with an accuracy of 98.1%. Sustainable Development Goals: 3 - Good Health and Well-being; 4 - Quality Education
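
    A hypothetical sketch of an HRV-based stress classifier; the feature set, the toy RR-interval windows, and the choice of RandomForest are assumptions, and the real SWELL data would have to be obtained separately.

```python
# Hypothetical HRV pipeline: derive standard time-domain HRV indices from RR
# intervals, then train a classifier to separate rest from stress windows.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def hrv_features(rr_ms):
    """Common time-domain HRV indices from a window of RR intervals (ms)."""
    rr = np.asarray(rr_ms, dtype=float)
    diff = np.diff(rr)
    return {
        "mean_rr": rr.mean(),
        "sdnn":    rr.std(ddof=1),                    # overall variability
        "rmssd":   np.sqrt(np.mean(diff ** 2)),       # short-term variability
        "pnn50":   np.mean(np.abs(diff) > 50) * 100,  # % successive diffs > 50 ms
    }

# Invented toy windows: stressed subjects tend to show lower variability.
rng = np.random.default_rng(0)
windows = [rng.normal(800, 60, 120) for _ in range(50)] + \
          [rng.normal(700, 25, 120) for _ in range(50)]
labels = [0] * 50 + [1] * 50                          # 0 = rest, 1 = stress

X = np.array([list(hrv_features(w).values()) for w in windows])
clf = RandomForestClassifier(random_state=0).fit(X, labels)
```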

    Searching for Needles in the Cosmic Haystack

    Searching for pulsar signals in radio astronomy data sets is a difficult task. The data sets are extremely large, approaching the petabyte scale, and are growing larger as instruments become more advanced. Big Data brings with it big challenges. Processing the data to identify candidate pulsar signals is computationally expensive and must utilize parallelism to be scalable. Labeling benchmarks for supervised classification is costly. To compound the problem, pulsar signals are very rare, e.g., only 0.05% of the instances in one data set represent pulsars. Furthermore, there are many different approaches to candidate classification with no consensus on a best practice. This dissertation is focused on identifying and classifying radio pulsar candidates from single pulse searches. First, to identify and classify Dispersed Pulse Groups (DPGs), we developed a supervised machine learning approach that consists of RAPID (a novel peak identification algorithm), feature extraction, and supervised machine learning classification. We tested six algorithms for classification with four imbalance treatments. Results showed that classifiers with imbalance treatments had higher recall values. Overall, classifiers using multiclass RandomForests combined with the Synthetic Minority Oversampling Technique (SMOTE) were the most efficient; they identified additional known pulsars not in the benchmark, with fewer false positives than other classifiers. Second, we developed a parallel single pulse identification method, D-RAPID, and introduced a novel automated multiclass labeling (ALM) technique that we combined with feature selection to improve execution performance. D-RAPID improved execution performance over RAPID by a factor of 5. We also showed that the combination of ALM and feature selection sped up the execution of RandomForest by 54% on average with less than a 2% average reduction in classification performance. Finally, we proposed CoDRIFt, a novel classification algorithm that is distributed for scalability and employs semi-supervised learning to leverage unlabeled data to inform classification. We evaluated and compared CoDRIFt against eleven other classifiers. The results showed that CoDRIFt excelled at classifying candidates in imbalanced benchmarks with a majority of non-pulsar signals (>95%). Furthermore, CoDRIFt models created with very limited sets of labeled data (as few as 22 labeled minority class instances) were able to achieve high recall (mean = 0.98). In comparison to the other algorithms trained on similar sets, CoDRIFt outperformed them all, with recall 2.9% higher than the next best classifier and a 35% average improvement over all eleven classifiers. CoDRIFt is customizable for other problem domains with very large, imbalanced data sets, such as fraud detection and cyber attack detection.
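
    A brief sketch of the imbalance treatment named above (SMOTE combined with RandomForest), using the imbalanced-learn library on a synthetic stand-in for a heavily skewed candidate set; the parameters are assumptions, not the dissertation's settings.

```python
# Hypothetical imbalance treatment: oversample the rare positive class with
# SMOTE inside a pipeline (so it is applied only to training folds), then
# train a RandomForest and evaluate recall on the minority class.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a single-pulse candidate set: ~0.5% positives.
X, y = make_classification(n_samples=20000, n_features=20, weights=[0.995],
                           random_state=0)

pipeline = make_pipeline(SMOTE(random_state=0),
                         RandomForestClassifier(n_estimators=200, random_state=0))
recall = cross_val_score(pipeline, X, y, cv=5, scoring="recall")
print(recall.mean())
```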

    Instance-specific and Model-adaptive Supervision for Semi-supervised Semantic Segmentation

    Recently, semi-supervised semantic segmentation has achieved promising performance with a small fraction of labeled data. However, most existing studies treat all unlabeled data equally and barely consider the differences and training difficulties among unlabeled instances. Differentiating unlabeled instances can promote instance-specific supervision that adapts to the model's evolution dynamically. In this paper, we emphasize the importance of instance differences and propose an instance-specific and model-adaptive supervision for semi-supervised semantic segmentation, named iMAS. Relying on the model's performance, iMAS employs a class-weighted symmetric intersection-over-union to evaluate the quantitative hardness of each unlabeled instance and supervises the training on unlabeled data in a model-adaptive manner. Specifically, iMAS learns from unlabeled instances progressively by weighting their corresponding consistency losses based on the evaluated hardness. Besides, iMAS dynamically adjusts the augmentation of each instance such that the distortion degree of augmented instances is adapted to the model's generalization capability across the training course. Without integrating additional losses or training procedures, iMAS obtains remarkable performance gains over current state-of-the-art approaches on segmentation benchmarks under different semi-supervised partition protocols.
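
    A hypothetical sketch of instance-specific weighting of consistency losses in the spirit of iMAS; the hardness measure here (teacher-student agreement) and the direction of the weighting are stand-ins for the paper's class-weighted symmetric intersection-over-union, and the shapes and numbers are illustrative only.

```python
# Hypothetical instance-specific consistency weighting: instances where the
# teacher and student agree more (a stand-in "easiness" score) contribute
# more to the unsupervised consistency loss.
import torch
import torch.nn.functional as F

def weighted_consistency_loss(student_logits, teacher_logits):
    """student/teacher logits: (batch, classes, H, W) for unlabeled images."""
    pseudo = teacher_logits.argmax(dim=1)                 # pseudo-labels
    per_pixel = F.cross_entropy(student_logits, pseudo, reduction="none")

    # Stand-in hardness/easiness score: teacher-student agreement per instance.
    agree = (student_logits.argmax(dim=1) == pseudo).float().mean(dim=(1, 2))
    weights = agree                                       # values in [0, 1]

    per_instance = per_pixel.mean(dim=(1, 2))
    return (weights * per_instance).mean()

s = torch.randn(4, 21, 64, 64, requires_grad=True)
t = torch.randn(4, 21, 64, 64)
loss = weighted_consistency_loss(s, t)
loss.backward()
```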
    • 

    corecore