Petuum: A New Platform for Distributed Machine Learning on Big Data
What is a systematic way to efficiently apply a wide spectrum of advanced ML
programs to industrial scale problems, using Big Models (up to 100s of billions
of parameters) on Big Data (up to terabytes or petabytes)? Modern
parallelization strategies employ fine-grained operations and scheduling beyond
the classic bulk-synchronous processing paradigm popularized by MapReduce, or
even specialized graph-based execution that relies on graph representations of
ML programs. The variety of approaches tends to pull systems and algorithms
design in different directions, and it remains difficult to find a universal
platform applicable to a wide range of ML programs at scale. We propose a
general-purpose framework that systematically addresses data- and
model-parallel challenges in large-scale ML, by observing that many ML programs
are fundamentally optimization-centric and admit error-tolerant,
iterative-convergent algorithmic solutions. This presents unique opportunities
for an integrative system design, such as bounded-error network synchronization
and dynamic scheduling based on ML program structure. We demonstrate the
efficacy of these system designs versus well-known implementations of modern ML
algorithms, allowing ML programs to run in much less time and at considerably
larger model sizes, even on modestly-sized compute clusters.
Comment: 15 pages, 10 figures, final version in KDD 2015 under the same title
Clustering-Based Predictive Process Monitoring
Business process enactment is generally supported by information systems that
record data about process executions, which can be extracted as event logs.
Predictive process monitoring is concerned with exploiting such event logs to
predict how running (uncompleted) cases will unfold up to their completion. In
this paper, we propose a predictive process monitoring framework for estimating
the probability that a given predicate will be fulfilled upon completion of a
running case. The predicate can be, for example, a temporal logic constraint or
a time constraint, or any predicate that can be evaluated over a completed
trace. The framework takes into account both the sequence of events observed in
the current trace and the data attributes associated with these events. The
prediction problem is approached in two phases. First, prefixes of previous
traces are clustered according to control flow information. Secondly, a
classifier is built for each cluster using event data to discriminate between
fulfillments and violations. At runtime, a prediction is made on a running case
by mapping it to a cluster and applying the corresponding classifier. The
framework has been implemented in the ProM toolset and validated on a log
pertaining to the treatment of cancer patients in a large hospital.
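The two-phase approach described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' ProM implementation: the cluster key (the set of activities seen so far) is a deliberately crude stand-in for real control-flow clustering, the per-cluster "classifier" is a simple frequency table over the data attribute of the last event, and the names `control_flow_key`, `train`, `predict` and the toy log are all hypothetical.

```python
from collections import Counter, defaultdict

# An event is (activity, attribute_value); a prefix is a list of events.

def control_flow_key(prefix):
    """Phase 1 stand-in: group prefixes by the set of activities they contain.
    (Assumption: a real system would use a proper clustering algorithm.)"""
    return frozenset(activity for activity, _ in prefix)

def train(history):
    """history: list of (prefix, fulfilled) pairs taken from completed traces.
    Returns one frequency-table classifier per control-flow cluster."""
    models = defaultdict(lambda: defaultdict(Counter))
    for prefix, fulfilled in history:
        last_attribute = prefix[-1][1]
        # Phase 2 stand-in: count fulfilments and violations per attribute value.
        models[control_flow_key(prefix)][last_attribute][fulfilled] += 1
    return models

def predict(models, prefix):
    """Map a running case to its cluster and apply that cluster's classifier.
    Returns the estimated fulfilment probability, or None if nothing matches."""
    key = control_flow_key(prefix)
    if key not in models:
        return None
    counts = models[key].get(prefix[-1][1])
    if not counts:
        return None
    return counts[True] / (counts[True] + counts[False])

# Toy log (invented for illustration only).
history = [
    ([("register", "urgent"), ("exam", "high")], True),
    ([("register", "urgent"), ("exam", "high")], True),
    ([("register", "normal"), ("exam", "high")], False),
    ([("register", "normal"), ("exam", "low")], False),
    ([("register", "urgent")], True),
]
models = train(history)
running_case = [("register", "normal"), ("exam", "high")]
print(predict(models, running_case))
```

The split mirrors the abstract: control flow decides which classifier is consulted, and event data decides the prediction within that cluster. A running case whose control flow matches no cluster yields `None` here; how to handle such cases is left open in this sketch.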
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.