
    Classification with Large Sparse Datasets: Convergence Analysis and Scalable Algorithms

    Large and sparse datasets, such as user ratings over a large collection of items, are common in the big data era. Many applications need to classify the users or items based on the high-dimensional and sparse data vectors, e.g., to predict the profitability of a product or the age group of a user. Linear classifiers are popular choices for classifying such datasets because of their efficiency. In order to classify large sparse data more effectively, the following important questions need to be answered. 1. Sparse data and convergence behavior. How do different properties of a dataset, such as the sparsity rate and the missing data mechanism, systematically affect the convergence behavior of classification? 2. Handling sparse data with non-linear models. How can non-linear data structures be learned efficiently when classifying large sparse data? This thesis attempts to address these questions with empirical and theoretical analysis on large and sparse datasets. We begin by studying the convergence behavior of popular classifiers on large and sparse data. It is known that a classifier gains better generalization ability as it learns from more training examples. Eventually, it will converge to the best generalization performance with respect to a given data distribution. In this thesis, we focus on how the sparsity rate and the missing data mechanism systematically affect such convergence behavior. Our study covers different types of classification models, including generative classifiers and discriminative linear classifiers. To systematically explore the convergence behaviors, we use synthetic data sampled from statistical models of real-world large sparse datasets. We consider different types of missing data mechanisms that are common in practice. From the experiments, we make several useful observations about the convergence behavior of classifying large sparse data. 
Based on these observations, we further investigate the theoretical reasons and come to a series of useful conclusions. We also provide guidelines for applying our results in practice. Our study helps to answer whether obtaining more data or obtaining the missing values in the data is worthwhile in different situations, which is useful for efficient data collection and preparation. Despite being efficient, linear classifiers cannot learn non-linear structures such as the low-rankness in a dataset. As a result, their accuracy may suffer. Meanwhile, most non-linear methods such as kernel machines cannot scale to very large and high-dimensional datasets. The third part of this thesis studies how to efficiently learn non-linear structures in large sparse data. Towards this goal, we develop novel scalable feature mappings that can achieve better accuracy than linear classification. We demonstrate that the proposed methods not only outperform linear classification but also scale to large and sparse datasets with moderate memory and computation requirements. The main contribution of this thesis is to answer important questions on classifying large and sparse datasets. On the one hand, we study the convergence behavior of widely used classifiers under different missing data mechanisms; on the other hand, we develop efficient methods to learn the non-linear structures in large sparse data and improve classification accuracy. Overall, the thesis not only provides practical guidance on the convergence behavior of classifying large sparse datasets, but also develops highly efficient algorithms for classifying large sparse datasets in practice.
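
The abstract above mentions synthetic data with a controlled sparsity rate and different missing data mechanisms. As a rough illustration only, the sketch below masks a dense synthetic matrix in two ways. Both mechanisms (entries dropped completely at random, and entries kept with probability proportional to their value) are assumptions chosen for this example, not necessarily the ones studied in the thesis, and all function names are hypothetical.

```python
import random
from typing import Dict, List

SparseRow = Dict[int, float]  # feature index -> observed value


def mask_mcar(rows: List[List[float]], sparsity: float,
              rng: random.Random) -> List[SparseRow]:
    """Assumed mechanism 1: drop each entry independently with
    probability `sparsity`, regardless of its value."""
    return [{j: v for j, v in enumerate(row) if rng.random() >= sparsity}
            for row in rows]


def mask_value_dependent(rows: List[List[float]], sparsity: float,
                         rng: random.Random) -> List[SparseRow]:
    """Assumed mechanism 2: keep an entry with probability proportional
    to its value (values are assumed to lie in [0, 1))."""
    # For uniform values the mean is 0.5, so this scale gives an
    # average keep rate of about 1 - sparsity.
    scale = 2.0 * (1.0 - sparsity)
    return [{j: v for j, v in enumerate(row)
             if rng.random() < min(1.0, scale * v)}
            for row in rows]


def observed_fraction(sparse_rows: List[SparseRow], n_features: int) -> float:
    """Share of matrix entries that survived masking."""
    return sum(len(r) for r in sparse_rows) / (len(sparse_rows) * n_features)


def observed_mean(sparse_rows: List[SparseRow]) -> float:
    """Mean of the values that survived masking."""
    values = [v for r in sparse_rows for v in r.values()]
    return sum(values) / len(values)


rng = random.Random(0)  # fixed seed so the run is repeatable
dense = [[rng.random() for _ in range(50)] for _ in range(200)]
mcar = mask_mcar(dense, 0.9, rng)
biased = mask_value_dependent(dense, 0.9, rng)
```

Both masked datasets have roughly the same sparsity rate, but the value-dependent one over-represents large values. A difference of this kind in what the classifier gets to see is one way a missing data mechanism could influence learning.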

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.

    FSS++ Workshop Report: Handling Uncertainty for Data Quality Management

    This report describes the results of the eSCF Awareness Workshop on Handling Uncertainty for Data Quality Management - Challenges from Transport and Supply Chain Management that was held on June 5, 2018 in Heeze, The Netherlands. The goal of this workshop was to create and enhance awareness of data quality management issues that are encountered in practice, for business organizations that aim to integrate a data-analytical mindset into their operations.

    Toward a Standardized Strategy of Clinical Metabolomics for the Advancement of Precision Medicine

    Despite the tremendous success, pitfalls have been observed in every step of a clinical metabolomics workflow, which impede the internal validity of the study. Furthermore, the demand for logistics, instrumentation, and computational resources for metabolic phenotyping studies has far exceeded our expectations. In this conceptual review, we will cover inclusive barriers of a metabolomics-based clinical study and suggest potential solutions in the hope of enhancing study robustness, usability, and transferability. The importance of quality assurance and quality control procedures is discussed, followed by a practical rule containing five phases, including two additional "pre-pre-" and "post-post-" analytical steps. Besides, we will elucidate the potential involvement of machine learning and demonstrate that the need for automated data mining algorithms to improve the quality of future research is undeniable. Consequently, we propose a comprehensive metabolomics framework, along with an appropriate checklist refined from current guidelines and our previously published assessment, in the attempt to accurately translate achievements in metabolomics into clinical and epidemiological research. Furthermore, the integration of multifaceted multi-omics approaches with metabolomics as the pillar member is in urgent need. When combined with other social or nutritional factors, we can gather complete omics profiles for a particular disease. Our discussion reflects the current obstacles and potential solutions toward the progressing trend of utilizing metabolomics in clinical research to create the next-generation healthcare system.

    Smart Asset Management for Electric Utilities: Big Data and Future

    This paper discusses future challenges in terms of big data and new technologies. Utilities have been collecting data in large amounts, but the data are hardly utilized because of their volume and the uncertainty associated with them. Condition monitoring of assets collects large amounts of data during daily operations. The question arises: "How to extract information from a large chunk of data?" The concept of "rich data and poor information" is being challenged by big data analytics with the advent of machine learning techniques. Along with technological advancements like the Internet of Things (IoT), big data analytics will play an important role for electric utilities. In this paper, challenges are answered by pathways and guidelines to make the current asset management practices smarter for the future. Comment: 13 pages, 3 figures, Proceedings of 12th World Congress on Engineering Asset Management (WCEAM) 201

    Overcoming Barriers in Supply Chain Analytics—Investigating Measures in LSCM Organizations

    While supply chain analytics shows promise regarding value, benefits, and increase in performance for logistics and supply chain management (LSCM) organizations, those organizations are often either reluctant to invest or unable to achieve the returns they aspire to. This article systematically explores the barriers LSCM organizations experience in employing supply chain analytics that contribute to such reluctance and unachieved returns, as well as measures to overcome these barriers. This article therefore aims to systemize the barriers and measures and allocate measures to barriers in order to provide organizations with directions on how to cope with their individual barriers. By using Grounded Theory through 12 in-depth interviews and Q-Methodology to synthesize the intended results, this article derives core categories for the barriers and measures, and their impacts and relationships are mapped based on empirical evidence from various actors along the supply chain. As a result, the article presents the core categories of barriers and measures, including their effect on different phases of the analytics solutions life cycle, the explanation of these effects, and accompanying examples. Finally, to address the intended aim of providing directions to organizations, the article provides recommendations for overcoming the identified barriers in organizations.

    Finding Top-k Dominance on Incomplete Big Data Using Map-Reduce Framework

    Incomplete data is one major kind of multi-dimensional dataset that has randomly distributed missing values in its dimensions. It is very difficult to retrieve information from this type of dataset when it becomes huge. Finding top-k dominant values in this type of dataset is a challenging procedure. Some algorithms exist to enhance this process, but most are efficient only when dealing with small incomplete datasets. One of the algorithms that makes the top-k dominance (TKD) query possible is the Bitmap Index Guided (BIG) algorithm. This algorithm strongly improves performance on incomplete data, but it was not designed for, and is not capable of, finding top-k dominant values in incomplete big data. Several other algorithms have been proposed to answer the TKD query, such as the Skyband Based and Upper Bound Based algorithms, but their performance is also questionable. Algorithms developed previously were among the first attempts to apply the TKD query to incomplete data; however, all of them performed weakly or were not compatible with incomplete data. This thesis proposes the MapReduced Enhanced Bitmap Index Guided Algorithm (MRBIG) for dealing with the aforementioned issues. MRBIG uses the MapReduce framework to enhance the performance of applying top-k dominance queries on huge incomplete datasets. The framework separates the tasks between several computing nodes that independently and simultaneously work to find the result. This method is up to two times faster at finding the TKD query result than previously presented algorithms.
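
To make the TKD query above concrete, here is a minimal brute-force sketch. It is not the BIG or MRBIG algorithm. It assumes a commonly used dominance definition for incomplete data that the abstract itself does not state: one object dominates another if, over the dimensions observed in both, it is never worse and is strictly better at least once (smaller values are treated as better here). The split into chunks only imitates a map step and a merge step in a single process, and all names are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Row = Sequence[Optional[float]]  # None marks a missing value


def dominates(a: Row, b: Row) -> bool:
    """Assumed definition: compare only dimensions observed in both rows;
    `a` must never be worse and must be strictly better at least once."""
    strictly_better = False
    for x, y in zip(a, b):
        if x is None or y is None:
            continue  # dimension not comparable, skip it
        if x > y:
            return False
        if x < y:
            strictly_better = True
    return strictly_better


def map_partial_scores(data: List[Row], chunk: List[int]) -> Dict[int, int]:
    """Map-like step: for each candidate index in `chunk`, count how many
    other rows it dominates."""
    return {
        i: sum(dominates(data[i], other)
               for j, other in enumerate(data) if j != i)
        for i in chunk
    }


def top_k_dominating(data: List[Row], k: int,
                     n_chunks: int = 2) -> List[Tuple[int, int]]:
    """Return the k (index, score) pairs with the highest dominance scores."""
    indices = list(range(len(data)))
    chunks = [indices[c::n_chunks] for c in range(n_chunks)]
    scores: Dict[int, int] = {}
    for chunk in chunks:  # each chunk could be handled by a separate worker
        scores.update(map_partial_scores(data, chunk))
    # Reduce-like step: rank by score, breaking ties by smaller index.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]


data = [
    (1, None, 3),
    (2, 2, None),
    (None, 1, 4),
    (3, 3, 5),
]
result = top_k_dominating(data, k=2)  # [(0, 3), (2, 2)]
```

The brute-force scoring compares every pair of objects, so its cost grows quadratically with the dataset size. This is presumably why index-guided and parallel approaches such as the ones named in the abstract are of interest for huge datasets.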