Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those already associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.
Embed and Conquer: Scalable Embeddings for Kernel k-Means on MapReduce
Kernel k-means is an effective method for data clustering which extends
the commonly used k-means algorithm to work on a similarity matrix over
complex data structures. The kernel k-means algorithm is, however,
computationally very complex as it requires the complete data matrix to be
calculated and stored. Further, the kernelized nature of the kernel k-means
algorithm hinders the parallelization of its computations on modern
infrastructures for distributed computing. In this paper, we define a
family of kernel-based low-dimensional embeddings that allows for scaling
kernel k-means on MapReduce via an efficient and unified parallelization
strategy. Afterwards, we propose two methods for low-dimensional embedding that
adhere to our definition of the embedding family. Exploiting the proposed
parallelization strategy, we present two scalable MapReduce algorithms for
kernel k-means. We demonstrate the effectiveness and efficiency of the
proposed algorithms through an empirical evaluation on benchmark data sets.
Comment: Appears in Proceedings of the SIAM International Conference on Data
Mining (SDM), 201
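The idea of clustering in a low-dimensional kernel embedding can be sketched on a single machine as follows. This is only an illustration under stated assumptions: it uses a Nyström embedding with an RBF kernel, a well-known technique swapped in here and not necessarily one of the two methods the paper proposes, and it does not use MapReduce. The function names (`rbf_kernel`, `nystrom_embedding`, `kmeans`) and the parameter `gamma` are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """RBF kernel matrix between the rows of A and the rows of B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystrom_embedding(X, landmarks, gamma=1.0, eps=1e-10):
    """Map each row of X to a vector y so that y_i . y_j approximates k(x_i, x_j).

    Assumption: a Nystrom embedding built from a small set of landmark points.
    Each row depends only on its own point and the landmarks, so rows could be
    computed independently (for example, one data block per map task).
    """
    W = rbf_kernel(landmarks, landmarks, gamma)   # small landmark kernel
    C = rbf_kernel(X, landmarks, gamma)           # points vs. landmarks
    vals, vecs = np.linalg.eigh(W)
    keep = vals > eps                             # drop near-null directions
    return C @ vecs[:, keep] / np.sqrt(vals[keep])

def kmeans(Y, k, n_iter=20):
    """Plain Lloyd iterations on embedded points, farthest-first initialization."""
    centers = [Y[0]]
    for _ in range(1, k):
        d = np.min([((Y - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(Y[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for j in range(k):
            if np.any(labels == j):               # leave empty clusters unchanged
                centers[j] = Y[labels == j].mean(0)
    return labels

# Usage: embed with a subset of the data as landmarks, then cluster.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (40, 2)), rng.normal(5.0, 0.3, (40, 2))])
labels = kmeans(nystrom_embedding(X, X[::4], gamma=0.5), k=2)
```

Once the points are embedded, ordinary k-means never needs the full kernel matrix; only the landmark kernel and the per-point kernel rows are formed.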
Tensor Numerical Methods in Quantum Chemistry: from Hartree-Fock Energy to Excited States
We summarize the recent successes of the grid-based tensor numerical methods and
discuss their prospects in real-space electronic structure calculations. These
methods, based on the low-rank representation of the multidimensional functions
and integral operators, led to an entirely grid-based tensor-structured 3D
Hartree-Fock eigenvalue solver. It benefits from tensor calculation of the core
Hamiltonian and two-electron integrals (TEI) in complexity using
the rank-structured approximation of basis functions, electron densities and
convolution integral operators all represented on 3D
Cartesian grids. The algorithm for calculating the TEI tensor in the form of a
Cholesky decomposition is based on multiple factorizations using an algebraic 1D
"density fitting" scheme. The basis functions are not restricted to separable
Gaussians, since the analytical integration is substituted by high-precision
tensor-structured numerical quadratures. The tensor approaches to
post-Hartree-Fock calculations for the MP2 energy correction and for the
Bethe-Salpeter excited states, based on using low-rank factorizations and the
reduced basis method, were recently introduced. Another direction is related to
the recent attempts to develop a tensor-based Hartree-Fock numerical scheme for
finite lattice-structured systems, where one of the numerical challenges is the
summation of electrostatic potentials of a large number of nuclei. The 3D
grid-based tensor method for calculating a potential sum on a lattice requires linear computational work,
instead of the usual scaling of Ewald-type approaches.
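Why a separable (low-rank) representation can reduce the cost of a lattice sum can be seen in a toy case. The sketch below is an illustration under assumptions, not the method of the paper: it sums a single Gaussian, which is exactly separable, over a cubic lattice, whereas an electrostatic potential would first need a low-rank approximation by several such terms. The names `L` (lattice cells per dimension), `a` (lattice spacing) and `alpha` (Gaussian exponent) are hypothetical.

```python
import numpy as np

def lattice_sum_direct(x, y, z, L, a=1.0, alpha=1.0):
    """Sum a Gaussian centered at every site of an L x L x L lattice: L**3 terms."""
    total = 0.0
    for i in range(L):
        for j in range(L):
            for k in range(L):
                r2 = (x - a * i) ** 2 + (y - a * j) ** 2 + (z - a * k) ** 2
                total += np.exp(-alpha * r2)
    return total

def lattice_sum_separable(x, y, z, L, a=1.0, alpha=1.0):
    """Same sum via factorization: exp(-alpha*r^2) splits into x, y and z factors,
    so the triple sum becomes a product of three 1D sums (3*L terms)."""
    shifts = a * np.arange(L)
    sx = np.exp(-alpha * (x - shifts) ** 2).sum()
    sy = np.exp(-alpha * (y - shifts) ** 2).sum()
    sz = np.exp(-alpha * (z - shifts) ** 2).sum()
    return sx * sy * sz

# Usage: both routes agree up to floating-point rounding.
point = (0.3, -0.2, 1.7)
direct = lattice_sum_direct(*point, L=4)
fast = lattice_sum_separable(*point, L=4)
```

For a function represented as a sum of R separable terms, the same factorization applies term by term, so the work grows with R times L instead of with the number of lattice sites.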
Curriculum Guidelines for Undergraduate Programs in Data Science
The Park City Math Institute (PCMI) 2016 Summer Undergraduate Faculty Program
met for the purpose of composing guidelines for undergraduate programs in Data
Science. The group consisted of 25 undergraduate faculty from a variety of
institutions in the U.S., primarily from the disciplines of mathematics,
statistics and computer science. These guidelines are meant to provide some
structure for institutions planning for or revising a major in Data Science
- …