198 research outputs found

    The NoiseFiltersR Package: Label Noise Preprocessing in R

    In data mining, the value of extracted knowledge is directly related to the quality of the data used. This makes data preprocessing one of the most important steps in the knowledge discovery process. A common problem affecting data quality is the presence of noise. A training set with label noise can reduce the predictive performance of classification learning techniques and increase the overfitting of classification models. In this work we present the NoiseFiltersR package. It contains the first extensive R implementation of classical and state-of-the-art label noise filters, which are the most common techniques for preprocessing label noise. The algorithms used to implement the label noise filters are appropriately documented and referenced. They can be called in an R-user-friendly manner, and their results are unified by means of the "filter" class, which also benefits from adapted print and summary methods. This work was supported by the Spanish Research Project TIN2014-57251-P, the Andalusian Research Plan P11-TIC-7765, the Brazilian CeMEAI-FAPESP grant 2013/07375-0, and FAPESP grants 2012/22608-8 and 2011/14602-7.
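
    As a quick illustration of the formula-based interface and the unified "filter" class described above, here is a minimal R sketch. The specific filter function (IPF) and the result fields used (cleanData, remIdx) are assumptions based on the package description, not a verified listing.

```r
# Minimal sketch of calling a label-noise filter from NoiseFiltersR.
# The filter function (IPF) and the result fields used below (cleanData,
# remIdx) are assumed from the package description.
library(NoiseFiltersR)

data(iris)
set.seed(1)
# Inject 10% label noise for illustration
noisy <- iris
idx <- sample(nrow(noisy), size = round(0.1 * nrow(noisy)))
noisy$Species[idx] <- sample(levels(noisy$Species), length(idx), replace = TRUE)

# Apply the Iterative Partitioning Filter via the formula interface
out <- IPF(Species ~ ., data = noisy)

print(out)              # adapted print method for the "filter" class
summary(out)            # adapted summary method
clean <- out$cleanData  # training set with the detected noisy instances removed
length(out$remIdx)      # indices of the examples flagged and removed as noisy
```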

    Adaptive Normalization in Streaming Data

    In today's digital era, data are everywhere, from the Internet of Things to health care and financial applications. This leads to potentially unbounded, ever-growing Big Data streams that need to be utilized effectively. Data normalization is an important preprocessing technique for data analytics. It helps prevent mismodeling and reduces the complexity inherent in the data, especially for data integrated from multiple sources and contexts. Normalization of Big Data streams is challenging because of evolving inconsistencies, time and memory constraints, and the non-availability of the whole data beforehand. This paper proposes a distributed approach to adaptive normalization for Big Data streams. Using sliding windows of fixed size, it provides a simple mechanism to adapt the statistics for normalizing changing data in each window. Implemented on Apache Storm, a distributed real-time stream data framework, our approach exploits distributed data processing for efficient normalization. Unlike other existing adaptive approaches, ours does not normalize data for one specific use (e.g., classification). Moreover, our adaptive mechanism allows flexible control, via user-specified thresholds, of the normalization tradeoff between time and precision. The paper illustrates the proposed approach along with a few other techniques and reports experiments on both synthesized and real-world data. On 160,000 instances of a data stream, the normalized data obtained from our approach improves over the baseline by 89%, with a root-mean-square error of 0.0041 compared with the actual data.
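
    The sliding-window mechanism can be sketched in a few lines. The single-machine R sketch below is only illustrative: the window size, the min-max scaling, and the threshold semantics are assumptions, and the paper's actual method runs distributed on Apache Storm.

```r
# Illustrative single-machine sketch of sliding-window adaptive normalization.
# Window size, threshold semantics, and min-max scaling are assumptions; this
# is not the distributed Apache Storm implementation described in the paper.
adaptive_normalize <- function(stream, window_size = 100, threshold = 0.05) {
  stats <- list(min = NULL, max = NULL)
  out <- numeric(length(stream))
  for (start in seq(1, length(stream), by = window_size)) {
    end <- min(start + window_size - 1, length(stream))
    win <- stream[start:end]
    new_min <- min(win); new_max <- max(win)
    # Recompute the statistics only when the window's range drifts beyond the
    # user-specified threshold, trading precision for update time.
    if (is.null(stats$min) ||
        abs(new_min - stats$min) > threshold * (stats$max - stats$min) ||
        abs(new_max - stats$max) > threshold * (stats$max - stats$min)) {
      stats$min <- new_min; stats$max <- new_max
    }
    out[start:end] <- (win - stats$min) / (stats$max - stats$min)
  }
  out
}

set.seed(1)
x <- cumsum(rnorm(1000))              # synthetic drifting stream
head(adaptive_normalize(x), 5)        # normalized values from the first window
```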

    SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary

    The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is considered the "de facto" standard in the framework of learning from imbalanced data. This is due to its simplicity in the design of the procedure, as well as its robustness when applied to different types of problems. Since its publication in 2002, SMOTE has proven successful in a variety of applications from several different domains. SMOTE has also inspired several approaches to counter the issue of class imbalance, and has also significantly contributed to new supervised learning paradigms, including multilabel classification, incremental learning, semi-supervised learning, and multi-instance learning, among others. It is the standard benchmark for learning from imbalanced data. It is also featured in a number of different software packages, from open source to commercial. In this paper, marking the fifteen-year anniversary of SMOTE, we reflect on the SMOTE journey, discuss the current state of affairs with SMOTE and its applications, and also identify the next set of challenges to extend SMOTE for Big Data problems. This work has been partially supported by the Spanish Ministry of Science and Technology under projects TIN2014-57251-P, TIN2015-68454-R and TIN2017-89517-P; the Project 887 BigDaP-TOOLS - Ayudas Fundación BBVA a Equipos de Investigación Científica 2016; and the National Science Foundation (NSF) Grant IIS-1447795.
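
    The core of SMOTE, interpolating between a minority-class example and one of its k nearest minority neighbours, can be sketched briefly in R. The function name, parameters, and the simple distance-based neighbour search below are illustrative assumptions, not the reference implementation.

```r
# Minimal base-R sketch of SMOTE's core idea: synthesize a new minority-class
# point on the segment between a minority example and one of its k nearest
# minority neighbours. Illustrative only, not the reference implementation.
smote_sketch <- function(X_min, n_new = 100, k = 5) {
  d <- as.matrix(dist(X_min))                # pairwise distances among minority points
  diag(d) <- Inf                             # a point is never its own neighbour
  synth <- matrix(NA_real_, nrow = n_new, ncol = ncol(X_min))
  for (i in seq_len(n_new)) {
    a <- sample(nrow(X_min), 1)              # pick a random minority example
    nn <- order(d[a, ])[seq_len(k)]          # its k nearest minority neighbours
    b <- sample(nn, 1)
    gap <- runif(1)                          # random point on the segment a -> b
    synth[i, ] <- X_min[a, ] + gap * (X_min[b, ] - X_min[a, ])
  }
  colnames(synth) <- colnames(X_min)
  synth
}

# Usage: oversample a small minority class taken from a toy dataset
X_min <- as.matrix(iris[iris$Species == "setosa", 1:4])[1:20, ]
new_points <- smote_sketch(X_min, n_new = 50, k = 5)
dim(new_points)  # 50 synthetic minority examples with the original 4 features
```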

    Efficient training procedures for multi-spectral demosaicing

    The simultaneous acquisition of multi-spectral images on a single sensor can be performed efficiently by single-shot capture using a multi-spectral filter array. This paper focuses on the demosaicing of color and near-infrared bands and relies on a convolutional neural network (CNN). To train the deep learning model robustly and accurately, it is necessary to provide enough training data with sufficient variability. We focus on the design of an efficient training procedure by discovering an optimal training dataset. We propose two data selection strategies, motivated by slightly different concepts. The general term used for the proposed models trained with data selection is data selection-based multi-spectral demosaicing (DSMD). The first idea is clustering-based data selection (DSMD-C), whose goal is to discover a representative subset with high variance so as to train a robust model. The second is adaptive data selection (DSMD-A), a self-guided approach that selects new data based on the current model accuracy. We performed a controlled experimental evaluation of the proposed training strategies, and the results show that a careful selection of data benefits both the speed and the accuracy of training. We are still able to achieve high reconstruction accuracy with a lightweight model.
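
    A rough R sketch of the clustering-based selection idea (DSMD-C) is given below. The use of k-means, the stand-in patch descriptors, the per-cluster quota, and the farthest-from-centre preference are assumptions made for illustration, not the paper's exact procedure.

```r
# Illustrative sketch of clustering-based data selection: cluster candidate
# training patches and keep a few representatives per cluster so the selected
# subset stays small but diverse. Details are assumptions, not the paper's method.
select_by_clustering <- function(features, n_clusters = 10, per_cluster = 5) {
  km <- kmeans(features, centers = n_clusters, nstart = 5)
  selected <- integer(0)
  for (cl in seq_len(n_clusters)) {
    members <- which(km$cluster == cl)
    # One reading of the "high variance" goal: prefer members farthest from
    # the cluster centre so the subset spans each cluster widely.
    dists <- rowSums((features[members, , drop = FALSE] -
                      matrix(km$centers[cl, ], nrow = length(members),
                             ncol = ncol(features), byrow = TRUE))^2)
    selected <- c(selected,
                  members[order(-dists)][seq_len(min(per_cluster, length(members)))])
  }
  selected
}

set.seed(1)
patch_features <- matrix(rnorm(2000 * 8), ncol = 8)   # stand-in patch descriptors
idx <- select_by_clustering(patch_features, n_clusters = 10, per_cluster = 5)
length(idx)   # indices of the selected training patches
```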