
    Classification with Large Sparse Datasets: Convergence Analysis and Scalable Algorithms

    Large and sparse datasets, such as user ratings over a large collection of items, are common in the big data era. Many applications need to classify users or items based on high-dimensional and sparse data vectors, e.g., to predict the profitability of a product or the age group of a user. Linear classifiers are popular choices for classifying such datasets because of their efficiency. To classify large sparse data more effectively, two important questions need to be answered. 1. Sparse data and convergence behavior: how do properties of a dataset, such as the sparsity rate and the missing-data mechanism, systematically affect the convergence behavior of classification? 2. Handling sparse data with non-linear models: how can non-linear data structures be learned efficiently when classifying large sparse data? This thesis addresses these questions with empirical and theoretical analysis of large and sparse datasets. We begin by studying the convergence behavior of popular classifiers on large and sparse data. It is known that a classifier gains better generalization ability as it learns more and more training examples, eventually converging to the best generalization performance with respect to a given data distribution. In this thesis, we focus on how the sparsity rate and the missing-data mechanism systematically affect such convergence behavior. Our study covers different types of classification models, including generative classifiers and discriminative linear classifiers. To systematically explore the convergence behaviors, we use synthetic data sampled from statistical models of real-world large sparse datasets, and we consider different types of missing-data mechanisms that are common in practice. From the experiments, we make several useful observations about the convergence behavior when classifying large sparse data. Based on these observations, we investigate the theoretical reasons behind them and reach a series of useful conclusions. For better applicability, we provide practical guidelines for applying our results in practice. Our study helps answer whether obtaining more data, or recovering missing values in the data, is worthwhile in different situations, which is useful for efficient data collection and preparation. Despite being efficient, linear classifiers cannot learn non-linear structures, such as low-rankness, in a dataset, so their accuracy may suffer. Meanwhile, most non-linear methods, such as kernel machines, cannot scale to very large and high-dimensional datasets. The third part of this thesis studies how to efficiently learn non-linear structures in large sparse data. Towards this goal, we develop novel scalable feature mappings that can achieve better accuracy than linear classification. We demonstrate that the proposed methods not only outperform linear classification but are also scalable to large and sparse datasets with moderate memory and computation requirements. The main contribution of this thesis is to answer important questions on classifying large and sparse datasets: on the one hand, we study the convergence behavior of widely used classifiers under different missing-data mechanisms; on the other hand, we develop efficient methods to learn the non-linear structures in large sparse data and improve classification accuracy. Overall, the thesis not only provides practical guidance on the convergence behavior of classifiers on large sparse datasets, but also develops highly efficient algorithms for classifying such datasets in practice.
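    The abstract does not spell out the thesis's feature mappings; as a rough illustration of the general idea, the sketch below contrasts a plain linear classifier with one trained on an approximate non-linear (random Fourier) feature map over a synthetic sparse dataset. The data shapes, the XOR-style labels, and the choice of RBFSampler are assumptions made purely for illustration, not the thesis's method.

```python
# Minimal sketch: linear classification vs. a scalable non-linear feature map
# on synthetic large sparse data (all sizes and models are illustrative).
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import LogisticRegression
from sklearn.kernel_approximation import RBFSampler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic high-dimensional sparse data: roughly 1% of entries are non-zero.
n_samples, n_features = 5000, 2000
X = sp.random(n_samples, n_features, density=0.01, format="csr", random_state=0)
w1 = rng.normal(size=n_features)
w2 = rng.normal(size=n_features)
y = (((X @ w1) * (X @ w2)) > 0).astype(int)   # XOR-like, not linearly separable

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Baseline: an efficient linear classifier applied directly to the sparse matrix.
linear = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Non-linear alternative: map into an approximate RBF feature space first.
mapper = RBFSampler(n_components=500, random_state=0)
Z_tr = mapper.fit_transform(X_tr)
Z_te = mapper.transform(X_te)
nonlinear = LogisticRegression(max_iter=1000).fit(Z_tr, y_tr)

print("linear accuracy    :", accuracy_score(y_te, linear.predict(X_te)))
print("non-linear accuracy:", accuracy_score(y_te, nonlinear.predict(Z_te)))
```

    The appeal of such approximate feature maps is that training cost stays linear in the number of samples, unlike exact kernel machines.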

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.
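    The review itself is methodological; as a small, hedged illustration of how several of the challenges it lists can be touched in one workflow, the sketch below builds a scikit-learn pipeline over synthetic "multi-omics" matrices. The matrix shapes, the early-integration strategy (simple concatenation), and the specific estimators are assumptions, not approaches endorsed by the review.

```python
# Minimal sketch: imputation (missing data), PCA (dimensionality),
# and class weighting (imbalance) on synthetic multi-omics-like data.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200                                    # samples (e.g., patients)
genomics = rng.normal(size=(n, 1000))      # high-dimensional omics layer
proteomics = rng.normal(size=(n, 300))     # second omics layer
proteomics[rng.random(proteomics.shape) < 0.1] = np.nan   # ~10% missing values
y = (rng.random(n) < 0.15).astype(int)     # imbalanced outcome (~15% positives)

# Early integration: concatenate the omics layers feature-wise.
X = np.hstack([genomics, proteomics])

clf = Pipeline([
    ("impute", SimpleImputer(strategy="median")),    # missing data
    ("scale", StandardScaler()),
    ("reduce", PCA(n_components=20)),                # curse of dimensionality
    ("model", LogisticRegression(class_weight="balanced", max_iter=1000)),  # imbalance
])

print("mean CV ROC-AUC:", cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean())
```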

    FSS++ Workshop Report: Handling Uncertainty for Data Quality Management

    This report describes the results of the eSCF Awareness Workshop on Handling Uncertainty for Data Quality Management - Challenges from Transport and Supply Chain Management, held on June 5, 2018 in Heeze, The Netherlands. The goal of this workshop was to create and enhance awareness of data quality management issues encountered in practice, for business organizations that aim to integrate a data-analytical mindset into their operations.

    Toward a Standardized Strategy of Clinical Metabolomics for the Advancement of Precision Medicine

    Despite the tremendous success, pitfalls have been observed in every step of a clinical metabolomics workflow, which impedes the internal validity of the study. Furthermore, the demand for logistics, instrumentation, and computational resources for metabolic phenotyping studies has far exceeded our expectations. In this conceptual review, we cover the full range of barriers in a metabolomics-based clinical study and suggest potential solutions in the hope of enhancing study robustness, usability, and transferability. The importance of quality assurance and quality control procedures is discussed, followed by a practical rule containing five phases, including two additional "pre-pre-" and "post-post-" analytical steps. In addition, we elucidate the potential involvement of machine learning and demonstrate that the need for automated data mining algorithms to improve the quality of future research is undeniable. Consequently, we propose a comprehensive metabolomics framework, along with an appropriate checklist refined from current guidelines and our previously published assessment, in an attempt to accurately translate achievements in metabolomics into clinical and epidemiological research. Furthermore, the integration of multifaceted multi-omics approaches with metabolomics as the pillar member is urgently needed. When metabolomics is combined with other social or nutritional factors, we can gather complete omics profiles for a particular disease. Our discussion reflects the current obstacles and potential solutions toward the progressing trend of utilizing metabolomics in clinical research to create the next-generation healthcare system.
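    The review discusses quality assurance and quality control procedures in general terms; as one small, hedged illustration of the kind of automated QC step alluded to, the sketch below corrects injection-order signal drift using pooled QC samples. The per-feature linear drift model, QC spacing, and intensity simulation are illustrative assumptions and not the framework proposed in the review.

```python
# Minimal sketch: QC-sample-based drift correction across injection order
# (a generic illustration of one common metabolomics QC procedure).
import numpy as np

rng = np.random.default_rng(0)
n_injections, n_features = 60, 5
injection_order = np.arange(n_injections)
is_qc = (injection_order % 10 == 0)            # pooled QC injected every 10 runs

# Simulated feature intensities with a slow linear drift over the run.
true = rng.lognormal(mean=5.0, sigma=0.2, size=(n_injections, n_features))
drift = 1.0 + 0.005 * injection_order[:, None]
intensities = true * drift

corrected = np.empty_like(intensities)
for j in range(n_features):
    # Fit the drift on QC injections only, then normalise every sample by it.
    coeffs = np.polyfit(injection_order[is_qc], intensities[is_qc, j], deg=1)
    fitted = np.polyval(coeffs, injection_order)
    corrected[:, j] = intensities[:, j] / (fitted / fitted.mean())

# The relative standard deviation of the QC samples should shrink after correction.
rsd = lambda a: a.std(axis=0) / a.mean(axis=0)
print("QC RSD before:", rsd(intensities[is_qc]).round(3))
print("QC RSD after :", rsd(corrected[is_qc]).round(3))
```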

    Smart Asset Management for Electric Utilities: Big Data and Future

    This paper discusses future challenges in terms of big data and new technologies. Utilities have been collecting data in large amounts, but these data are hardly utilized because of their sheer volume and the uncertainty associated with them. Condition monitoring of assets collects large amounts of data during daily operations, and the question arises: how can information be extracted from such large volumes of data? The concept of "rich data and poor information" is being challenged by big data analytics with the advent of machine learning techniques. Along with technological advancements like the Internet of Things (IoT), big data analytics will play an important role for electric utilities. In this paper, these challenges are addressed with pathways and guidelines for making current asset management practices smarter for the future.
    Comment: 13 pages, 3 figures, Proceedings of 12th World Congress on Engineering Asset Management (WCEAM) 201
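    The paper stays at the level of pathways and guidelines; as a small, hedged illustration of the kind of machine-learning step it alludes to for condition-monitoring data, the sketch below flags anomalous asset readings with an unsupervised model. The features (oil temperature and load), the simulated fault, and the choice of IsolationForest are assumptions, not the paper's method.

```python
# Minimal sketch: unsupervised anomaly flagging on condition-monitoring readings.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Simulated hourly readings from one asset: [oil temperature (degC), load (%)].
normal = np.column_stack([rng.normal(65, 3, 1000), rng.normal(70, 8, 1000)])
faulty = np.column_stack([rng.normal(95, 4, 10), rng.normal(72, 8, 10)])
readings = np.vstack([normal, faulty])

model = IsolationForest(contamination=0.01, random_state=0).fit(readings)
flags = model.predict(readings)            # -1 marks a suspected anomaly

print("flagged readings:", int((flags == -1).sum()), "out of", len(readings))
```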

    Overcoming Barriers in Supply Chain Analytics—Investigating Measures in LSCM Organizations

    While supply chain analytics shows promise in terms of value, benefits, and performance gains for logistics and supply chain management (LSCM) organizations, those organizations are often either reluctant to invest or unable to achieve the returns they aspire to. This article systematically explores the barriers LSCM organizations experience when employing supply chain analytics, which contribute to such reluctance and unachieved returns, as well as measures to overcome these barriers. It therefore aims to systematize the barriers and measures, and to allocate measures to barriers, in order to provide organizations with directions on how to cope with their individual barriers. Using Grounded Theory through 12 in-depth interviews and Q-Methodology to synthesize the intended results, this article derives core categories for the barriers and measures, and their impacts and relationships are mapped based on empirical evidence from various actors along the supply chain. As a result, the article presents the core categories of barriers and measures, including their effect on different phases of the analytics solution life cycle, the explanation of these effects, and accompanying examples. Finally, to address the intended aim of providing directions to organizations, the article provides recommendations for overcoming the identified barriers in organizations.

    Finding Top-k Dominance on Incomplete Big Data Using Map-Reduce Framework

    Incomplete data is a major kind of multi-dimensional dataset with randomly distributed missing values across its dimensions. It is very difficult to retrieve information from this type of dataset when it becomes huge, and finding the top-k dominant values in it is a challenging procedure. Some algorithms exist to enhance this process, but they are mostly efficient only when dealing with small incomplete datasets. One of the algorithms that makes the top-k dominance (TKD) query possible is the Bitmap Index Guided (BIG) algorithm. This algorithm strongly improves performance on incomplete data, but it is not capable of finding top-k dominant values in incomplete big data, nor was it designed to do so. Several other algorithms have been proposed to answer the TKD query, such as the Skyband Based and Upper Bound Based algorithms, but their performance is also questionable. These earlier algorithms were among the first attempts to apply the TKD query to incomplete data; however, they all had weak performance or were not compatible with incomplete data. This thesis proposes the MapReduce Enhanced Bitmap Index Guided algorithm (MRBIG) for dealing with the aforementioned issues. MRBIG uses the MapReduce parallel computing framework to enhance the performance of top-k dominance queries on huge incomplete datasets: the framework separates the tasks between several computing nodes that work independently and simultaneously to find the result. This method achieves up to two times faster processing in finding the TKD query result compared to previously presented algorithms.
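    To make the query itself concrete, the sketch below evaluates a top-k dominance query on a tiny incomplete dataset, split into map and reduce phases in the MapReduce spirit. It illustrates only the query semantics, not the bitmap-index (BIG/MRBIG) algorithm; the dominance rule over commonly observed dimensions and the two-partition split are assumptions for illustration.

```python
# Minimal sketch: map/reduce-style dominance counting for a TKD query on incomplete data.
from collections import Counter
from itertools import product

# None marks a missing value; higher values are assumed better.
data = {
    "a": (5, None, 3),
    "b": (4, 2, None),
    "c": (5, 1, 4),
    "d": (None, 2, 2),
}

def dominates(p, q):
    """p dominates q on dimensions observed in both: >= everywhere, > somewhere."""
    common = [(x, y) for x, y in zip(p, q) if x is not None and y is not None]
    return bool(common) and all(x >= y for x, y in common) and any(x > y for x, y in common)

def map_phase(partition, candidates):
    """Count, within one data partition, how many records each candidate dominates."""
    counts = Counter()
    for cid, qid in product(candidates, partition):
        if cid != qid and dominates(candidates[cid], partition[qid]):
            counts[cid] += 1
    return counts

def reduce_phase(partial_counts):
    """Sum the per-partition counts into global dominance scores."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

# Split the data into two partitions, as two independent mappers would see it.
ids = list(data)
partitions = [{i: data[i] for i in ids[:2]}, {i: data[i] for i in ids[2:]}]
scores = reduce_phase(map_phase(p, data) for p in partitions)

k = 2
print(scores.most_common(k))   # the k objects that dominate the most others
```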