7,868 research outputs found

    PMLB: A Large Benchmark Suite for Machine Learning Evaluation and Comparison

    Full text link
    The selection, development, or comparison of machine learning methods in data mining can be a difficult task based on the target problem and goals of a particular study. Numerous publicly available real-world and simulated benchmark datasets have emerged from different sources, but their organization and adoption as standards have been inconsistent. As such, selecting and curating specific benchmarks remains an unnecessary burden on machine learning practitioners and data scientists. The present study introduces an accessible, curated, and developing public benchmark resource to facilitate identification of the strengths and weaknesses of different machine learning methodologies. We compare meta-features among the current set of benchmark datasets in this resource to characterize the diversity of available data. Finally, we apply a number of established machine learning methods to the entire benchmark suite and analyze how datasets and algorithms cluster in terms of performance. This work is an important first step towards understanding the limitations of popular benchmarking suites and developing a resource that connects existing benchmarking standards to more diverse and efficient standards in the future.Comment: 14 pages, 5 figures, submitted for review to JML

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Get PDF
    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues

    Interpretable Categorization of Heterogeneous Time Series Data

    Get PDF
    Understanding heterogeneous multivariate time series data is important in many applications ranging from smart homes to aviation. Learning models of heterogeneous multivariate time series that are also human-interpretable is challenging and not adequately addressed by the existing literature. We propose grammar-based decision trees (GBDTs) and an algorithm for learning them. GBDTs extend decision trees with a grammar framework. Logical expressions derived from a context-free grammar are used for branching in place of simple thresholds on attributes. The added expressivity enables support for a wide range of data types while retaining the interpretability of decision trees. In particular, when a grammar based on temporal logic is used, we show that GBDTs can be used for the interpretable classi cation of high-dimensional and heterogeneous time series data. Furthermore, we show how GBDTs can also be used for categorization, which is a combination of clustering and generating interpretable explanations for each cluster. We apply GBDTs to analyze the classic Australian Sign Language dataset as well as data on near mid-air collisions (NMACs). The NMAC data comes from aircraft simulations used in the development of the next-generation Airborne Collision Avoidance System (ACAS X).Comment: 9 pages, 5 figures, 2 tables, SIAM International Conference on Data Mining (SDM) 201

    Detection of Lying Electrical Vehicles in Charging Coordination Application Using Deep Learning

    Full text link
    The simultaneous charging of many electric vehicles (EVs) stresses the distribution system and may cause grid instability in severe cases. The best way to avoid this problem is by charging coordination. The idea is that the EVs should report data (such as state-of-charge (SoC) of the battery) to run a mechanism to prioritize the charging requests and select the EVs that should charge during this time slot and defer other requests to future time slots. However, EVs may lie and send false data to receive high charging priority illegally. In this paper, we first study this attack to evaluate the gains of the lying EVs and how their behavior impacts the honest EVs and the performance of charging coordination mechanism. Our evaluations indicate that lying EVs have a greater chance to get charged comparing to honest EVs and they degrade the performance of the charging coordination mechanism. Then, an anomaly based detector that is using deep neural networks (DNN) is devised to identify the lying EVs. To do that, we first create an honest dataset for charging coordination application using real driving traces and information revealed by EV manufacturers, and then we also propose a number of attacks to create malicious data. We trained and evaluated two models, which are the multi-layer perceptron (MLP) and the gated recurrent unit (GRU) using this dataset and the GRU detector gives better results. Our evaluations indicate that our detector can detect lying EVs with high accuracy and low false positive rate
    corecore