42,434 research outputs found

Type prediction in RDF knowledge bases using hierarchical multilabel classification

Author: Melo André
Paulheim Heiko
Völker Johanna
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2016

Large Semantic Web knowledge bases are often noisy, incorrect, and incomplete with respect to type information. Automatic type prediction can help reduce such incompleteness, and, as previous works show, statistical methods are well-suited for this kind of data. Since most Semantic Web knowledge bases come with an ontology defining a type hierarchy, in this paper we rephrase the type prediction problem as a hierarchical multilabel classification problem. We propose SLCN, a modification of the local classifier per node approach, which performs feature selection, instance sampling, and class balancing for each local classifier. Our approach improves scalability, facilitating its application to large Semantic Web datasets with high-dimensional feature and label spaces. We compare the performance of our proposed method with a state-of-the-art type prediction approach and popular hierarchical multilabel classifiers, and report on experiments with large-scale RDF datasets.

MAnnheim DOCument Server

SCALABLE ALGORITHMS FOR HIGH DIMENSIONAL STRUCTURED DATA

Author: Ren Shaogang
Publication date: 16/01/2019

Emerging technologies and digital devices provide us with increasingly large volumes of data, in terms of both sample size and number of features. To exploit the benefits of massive data sets, scalable statistical models and machine learning algorithms are increasingly important across research disciplines. For robust and accurate prediction, prior knowledge regarding dependency structures within the data needs to be formulated appropriately in these models. At the same time, the scalability and computational complexity of existing algorithms may not meet the needs of analyzing massive high-dimensional data. This dissertation presents several novel methods to scale up sparse learning models to analyze massive data sets. We first present our novel safe active incremental feature (SAIF) selection algorithm for LASSO (least absolute shrinkage and selection operator), with a time complexity analysis showing its advantages over existing state-of-the-art methods. As SAIF targets general convex loss functions, it can potentially be extended to many learning models and big-data applications, and we show how support vector machines (SVM) can be scaled up based on the idea of SAIF. Second, we propose screening methods for generalized LASSO (GL), which specifically considers the dependency structure among features. We also propose a scalable feature selection method for non-parametric, non-linear models based on sparse structures and kernel methods. Theoretical analysis and experimental results in this dissertation show that model complexity can be significantly reduced under the sparsity and structure assumptions.

Texas A&M Repository

Flexible Sparse Learning of Feature Subspaces

Author: Ma Yuting
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2017

It is widely observed that the performance of many traditional statistical learning methods degrades when confronted with high-dimensional data. One promising approach to prevent this is to identify the intrinsic low-dimensional spaces in which the true signals are embedded and to carry out the learning process on these informative feature subspaces. This thesis focuses on the development of flexible sparse learning methods of feature subspaces for classification. Motivated by the success of some existing methods, we aim at learning informative feature subspaces for high-dimensional data of complex nature with better flexibility, sparsity and scalability. The first part of this thesis is inspired by the success of distance metric learning in casting flexible feature transformations by utilizing local information. We propose a nonlinear sparse metric learning algorithm, named sDist, which uses a boosting-based nonparametric solution to address the metric learning problem for high-dimensional data. Leveraging a rank-one decomposition of the symmetric positive semi-definite weight matrix of the Mahalanobis distance metric, we restructure a hard global optimization problem into forward stage-wise learning of weak learners through a gradient boosting algorithm. In each step, the algorithm progressively learns a sparse rank-one update of the weight matrix by imposing an L-1 regularization. Nonlinear feature mappings are adaptively learned by a hierarchical expansion of interactions integrated within the boosting framework. Meanwhile, an early stopping rule is imposed to control the overall complexity of the learned metric. As a result, without relying on computationally intensive tools, our approach automatically guarantees three desirable properties of the final metric: positive semi-definiteness, low rank and element-wise sparsity.
Numerical experiments show that our learning model compares favorably with the state-of-the-art methods in the current metric learning literature. The second problem arises from the observation of high instability and feature selection bias when applying online methods to highly sparse, high-dimensional data in sparse learning problems. Due to the heterogeneity in feature sparsity, existing truncation-based methods incur slow convergence and high variance. To mitigate this problem, we introduce a stabilized truncated stochastic gradient descent algorithm. We employ a soft-thresholding scheme on the weight vector in which the imposed shrinkage is adaptive to the amount of information available in each feature. The variability in the resulting sparse weight vector is further controlled by stability selection integrated with the informative truncation. To facilitate better convergence, we adopt an annealing strategy on the truncation rate. We show that, when the true parameter space is of low dimension, stabilization with the annealing strategy helps to achieve a lower regret bound in expectation.

Columbia University Academic Commons

Adaptive multi-view feature selection for human motion retrieval

Author: Jian J. Zhang
Tian Qi
Xiaosong Yang
Yinfu Feng
Zhao Wang
Publication venue: 'Elsevier BV'
Publication date: 01/03/2016

Human motion retrieval plays an important role in many applications based on motion data. In the past, many researchers tended to use a single type of visual feature as the data representation. Because different visual features describe different aspects of motion data, and differ in discriminative power with respect to any particular class of human motion, this led to poor retrieval performance. Thus, it would be beneficial to combine multiple visual features for motion data representation. In this article, we present an Adaptive Multi-view Feature Selection (AMFS) method for human motion retrieval. Specifically, we first use a local linear regression model to automatically learn multiple view-based Laplacian graphs for preserving the local geometric structure of motion data. Then, these graphs are combined with a non-negative view-weight vector to exploit the complementary information between different features. Finally, in order to discard the redundant and irrelevant feature components from the original high-dimensional feature representation, we formulate the objective function of AMFS as a general trace ratio optimization problem, and design an effective algorithm to solve it. Extensive experiments on two public human motion databases, i.e., HDM05 and MSR Action3D, demonstrate the effectiveness of the proposed AMFS over state-of-the-art methods for motion data retrieval. Its scalability to large motion datasets and its insensitivity to the algorithm parameters make our method widely applicable in real-world applications.

Elsevier - Publisher Connector

Crossref

Bournemouth University Research Online

Feature selection in high-dimensional dataset using MapReduce

Author: Bontempi Gianluca
Borgne Yann-Aël Le
Reggiani Claudio
Publication date: 07/09/2017

This paper describes a distributed MapReduce implementation of the minimum Redundancy Maximum Relevance algorithm, a popular feature selection method in bioinformatics and network inference problems. The proposed approach handles both tall/narrow and wide/short datasets. We further provide an open source implementation based on Hadoop/Spark, and illustrate its scalability on datasets involving millions of observations or features.

arXiv.org e-Print Archive

DI-fusion

Machine Learning and Integrative Analysis of Biomedical Big Data.

Author: Choi Howard
Chung Neo Christopher
Mirza Bilal
Ping Peipei
Wang Jie
Wang Wei
Publication venue: eScholarship, University of California
Publication date: 01/01/2019

Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.

Multidisciplinary Digital Publishing Institute

Ezid

Directory of Open Access Journals

eScholarship - University of California