894 research outputs found
An Open Source C++ Implementation of Multi-Threaded Gaussian Mixture Models, k-Means and Expectation Maximisation
Modelling of multivariate densities is a core component in many signal
processing, pattern recognition and machine learning applications. The
modelling is often done via Gaussian mixture models (GMMs), which use
computationally expensive and potentially unstable training algorithms. We
provide an overview of a fast and robust implementation of GMMs in the C++
language, employing multi-threaded versions of the Expectation Maximisation
(EM) and k-means training algorithms. Multi-threading is achieved through
reformulation of the EM and k-means algorithms into a MapReduce-like framework.
Furthermore, the implementation uses several techniques to improve numerical
stability and modelling accuracy. We demonstrate that the multi-threaded
implementation achieves a speedup of an order of magnitude on a recent 16 core
machine, and that it can achieve higher modelling accuracy than a previously
well-established publically accessible implementation. The multi-threaded
implementation is included as a user-friendly class in recent releases of the
open source Armadillo C++ linear algebra library. The library is provided under
the permissive Apache~2.0 license, allowing unencumbered use in commercial
products
Structure-Aware Dynamic Scheduler for Parallel Machine Learning
Training large machine learning (ML) models with many variables or parameters
can take a long time if one employs sequential procedures even with stochastic
updates. A natural solution is to turn to distributed computing on a cluster;
however, naive, unstructured parallelization of ML algorithms does not usually
lead to a proportional speedup and can even result in divergence, because
dependencies between model elements can attenuate the computational gains from
parallelization and compromise correctness of inference. Recent efforts toward
this issue have benefited from exploiting the static, a priori block structures
residing in ML algorithms. In this paper, we take this path further by
exploring the dynamic block structures and workloads therein present during ML
program execution, which offers new opportunities for improving convergence,
correctness, and load balancing in distributed ML. We propose and showcase a
general-purpose scheduler, STRADS, for coordinating distributed updates in ML
algorithms, which harnesses the aforementioned opportunities in a systematic
way. We provide theoretical guarantees for our scheduler, and demonstrate its
efficacy versus static block structures on Lasso and Matrix Factorization
BLSSpeller : exhaustive comparative discovery of conserved cis-regulatory elements
Motivation: The accurate discovery and annotation of regulatory elements remains a challenging problem. The growing number of sequenced genomes creates new opportunities for comparative approaches to motif discovery. Putative binding sites are then considered to be functional if they are conserved in orthologous promoter sequences of multiple related species. Existing methods for comparative motif discovery usually rely on pregenerated multiple sequence alignments, which are difficult to obtain for more diverged species such as plants. As a consequence, misaligned regulatory elements often remain undetected.
Results: We present a novel algorithm that supports both alignment-free and alignment-based motif discovery in the promoter sequences of related species. Putative motifs are exhaustively enumerated as words over the IUPAC alphabet and screened for conservation using the branch length score. Additionally, a confidence score is established in a genome-wide fashion. In order to take advantage of a cloud computing infrastructure, the MapReduce programming model is adopted. The method is applied to four monocotyledon plant species and it is shown that high-scoring motifs are significantly enriched for open chromatin regions in Oryza sativa and for transcription factor binding sites inferred through protein-binding microarrays in O. sativa and Zea mays. Furthermore, the method is shown to recover experimentally profiled ga2ox1-like KN1 binding sites in Z. mays
Methods to Improve Applicability and Efficiency of Distributed Data-Centric Compute Frameworks
The success of modern applications depends on the insights they collect from their data repositories. Data repositories for such applications currently exceed exabytes and are rapidly increasing in size, as they collect data from varied sources - web applications, mobile phones, sensors and other connected devices. Distributed storage and data-centric compute frameworks have been invented to store and analyze these large datasets. This dissertation focuses on extending the applicability and improving the efficiency of distributed data-centric compute frameworks
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues
- …