143 research outputs found
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues
GENO -- GENeric Optimization for Classical Machine Learning
Although optimization is the longstanding algorithmic backbone of machine
learning, new models still require the time-consuming implementation of new
solvers. As a result, there are thousands of implementations of optimization
algorithms for machine learning problems. A natural question is, if it is
always necessary to implement a new solver, or if there is one algorithm that
is sufficient for most models. Common belief suggests that such a
one-algorithm-fits-all approach cannot work, because this algorithm cannot
exploit model specific structure and thus cannot be efficient and robust on a
wide variety of problems. Here, we challenge this common belief. We have
designed and implemented the optimization framework GENO (GENeric Optimization)
that combines a modeling language with a generic solver. GENO generates a
solver from the declarative specification of an optimization problem class. The
framework is flexible enough to encompass most of the classical machine
learning problems. We show on a wide variety of classical but also some
recently suggested problems that the automatically generated solvers are (1) as
efficient as well-engineered specialized solvers, (2) more efficient by a
decent margin than recent state-of-the-art solvers, and (3) orders of magnitude
more efficient than classical modeling language plus solver approaches
Bayesian factorizations of big sparse tensors
It has become routine to collect data that are structured as multiway arrays
(tensors). There is an enormous literature on low rank and sparse matrix
factorizations, but limited consideration of extensions to the tensor case in
statistics. The most common low rank tensor factorization relies on parallel
factor analysis (PARAFAC), which expresses a rank tensor as a sum of rank
one tensors. When observations are only available for a tiny subset of the
cells of a big tensor, the low rank assumption is not sufficient and PARAFAC
has poor performance. We induce an additional layer of dimension reduction by
allowing the effective rank to vary across dimensions of the table. For
concreteness, we focus on a contingency table application. Taking a Bayesian
approach, we place priors on terms in the factorization and develop an
efficient Gibbs sampler for posterior computation. Theory is provided showing
posterior concentration rates in high-dimensional settings, and the methods are
shown to have excellent performance in simulations and several real data
applications
Neural Networks for CollaborativeFiltering
Recommender systems are an integral part of almost all modern e-commerce companies. They contribute significantly to the overall customer satisfaction by helping the user discover new and relevant items, which consequently leads to higher sales and stronger customer retention. It is, therefore, not surprising that large e-commerce shops like Amazon or streaming platforms like Netflix and Spotify even use multiple recommender systems to further increase user engagement.
Finding the most relevant items for each user is a difficult task that is critically dependent on the available user feedback information. However, most users typically interact with products only through noisy implicit feedback, such as clicks or purchases, rather than providing explicit information about their preferences, such as product ratings. This usually makes large amounts of behavioural user data necessary to infer accurate user preferences. One popular approach to make the most use of both forms of feedback is called collaborative filtering. Here, the main idea is to compare individual user behaviour with the behaviour of all known users.
Although there are many different collaborative filtering techniques, matrix factorization models are among the most successful ones. In contrast, while neural networks are nowadays the state-of-the-art method for tasks such as image recognition or natural language processing, they are still not very popular for collaborative filtering tasks. Therefore, the main focus of this thesis is the derivation of multiple wide neural network architectures to mimic and extend matrix factorization models for various collaborative filtering problems and to gain insights into the connection between these models.
The basics of the proposed architecture are wide and shallow feedforward neural networks, which will be established for rating prediction tasks on explicit feedback datasets. These networks consist of large input and output layers, which allow them to capture user and item representation similar to matrix factorization models. By deriving all weight updates and comparing the structure of both models, it is proven that a simplified version of the proposed network can mimic common matrix factorization models: a result that has not been shown, as far as we know, in this form before. Additionally, various extensions are thoroughly evaluated. The new findings of this evaluation can also easily be transferred to other matrix factorization models.
This neural network architecture can be extended to be used for personalized ranking tasks on implicit feedback datasets. For these problems, it is necessary to rank products according to individual preferences using only the provided implicit feedback. One of the most successful and influential approaches for personalized ranking tasks is Bayesian Personalized Ranking, which attempts to learn pairwise item rankings and can also be used in combination with matrix factorization models.
It is shown, how the introduction of an additional ranking layer forces the network to learn pairwise item rankings. In addition, similarities between this novel neural network architecture and a matrix factorization model trained with Bayesian Personalized Ranking are proven. To the best of our knowledge, this is the first time that these connections have been shown. The state-of-the-art performance of this network is demonstrated in a detailed evaluation.
The most comprehensive feedback datasets consist of a mixture of explicit as well as implicit feedback information. Here, the goal is to predict if a user will like an item, similar to rating prediction tasks, even if this user has never given any explicit feedback at all: a problem, that has not been covered by the collaborative filtering literature yet. The network to solve this task is composed out of two networks: one for the explicit and one for the implicit feedback. Additional item features are learned using the implicit feedback, which capture all information necessary to rank items. Afterwards, these features are used to improve the explicit feedback prediction. Both parts of this combined network have different optimization goals, are trained simultaneously and, therefore, influence each other. A detailed evaluation shows that this approach is helpful to improve the network's overall predictive performance especially for ranking metrics
Sparse Modeling for Image and Vision Processing
In recent years, a large amount of multi-disciplinary research has been
conducted on sparse models and their applications. In statistics and machine
learning, the sparsity principle is used to perform model selection---that is,
automatically selecting a simple model among a large collection of them. In
signal processing, sparse coding consists of representing data with linear
combinations of a few dictionary elements. Subsequently, the corresponding
tools have been widely adopted by several scientific communities such as
neuroscience, bioinformatics, or computer vision. The goal of this monograph is
to offer a self-contained view of sparse modeling for visual recognition and
image processing. More specifically, we focus on applications where the
dictionary is learned and adapted to data, yielding a compact representation
that has been successful in various contexts.Comment: 205 pages, to appear in Foundations and Trends in Computer Graphics
and Visio
A Parallel and Efficient Algorithm for Learning to Match
Many tasks in data mining and related fields can be formalized as matching
between objects in two heterogeneous domains, including collaborative
filtering, link prediction, image tagging, and web search. Machine learning
techniques, referred to as learning-to-match in this paper, have been
successfully applied to the problems. Among them, a class of state-of-the-art
methods, named feature-based matrix factorization, formalize the task as an
extension to matrix factorization by incorporating auxiliary features into the
model. Unfortunately, making those algorithms scale to real world problems is
challenging, and simple parallelization strategies fail due to the complex
cross talking patterns between sub-tasks. In this paper, we tackle this
challenge with a novel parallel and efficient algorithm for feature-based
matrix factorization. Our algorithm, based on coordinate descent, can easily
handle hundreds of millions of instances and features on a single machine. The
key recipe of this algorithm is an iterative relaxation of the objective to
facilitate parallel updates of parameters, with guaranteed convergence on
minimizing the original objective function. Experimental results demonstrate
that the proposed method is effective on a wide range of matching problems,
with efficiency significantly improved upon the baselines while accuracy
retained unchanged.Comment: 10 pages, short version was published in ICDM 201
MLI: An API for Distributed Machine Learning
MLI is an Application Programming Interface designed to address the
challenges of building Machine Learn- ing algorithms in a distributed setting
based on data-centric computing. Its primary goal is to simplify the
development of high-performance, scalable, distributed algorithms. Our initial
results show that, relative to existing systems, this interface can be used to
build distributed implementations of a wide variety of common Machine Learning
algorithms with minimal complexity and highly competitive performance and
scalability
- …