
    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., the genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to perform integrative analysis of biomedical data acquired from diverse modalities effectively and efficiently. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.
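
    As a rough illustration of how several of these challenges interact in practice, the hedged sketch below (not taken from the review) performs a simple "early integration" of two toy omics matrices: missing values are imputed, each modality is scaled separately to cope with heterogeneity, the blocks are concatenated, and a class-weighted, regularized classifier addresses imbalance and the high feature-to-sample ratio. The synthetic data and all variable names are assumptions made for demonstration only.

    ```python
    # Illustrative sketch only (not the review's method): early integration of
    # two synthetic omics blocks with imputation, scaling, and a class-weighted model.
    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n_samples = 100
    gene_expr = rng.normal(size=(n_samples, 500))       # hypothetical transcriptome features
    methylation = rng.normal(size=(n_samples, 300))     # hypothetical epigenome features
    methylation[rng.random(methylation.shape) < 0.1] = np.nan   # simulate missing data
    labels = rng.choice([0, 1], size=n_samples, p=[0.9, 0.1])   # imbalanced classes

    blocks = []
    for block in (gene_expr, methylation):
        block = SimpleImputer(strategy="mean").fit_transform(block)   # missing data
        block = StandardScaler().fit_transform(block)                 # per-modality scaling
        blocks.append(block)
    X = np.hstack(blocks)                                             # integrate modalities

    # class_weight="balanced" reweights the loss against class imbalance;
    # the default L2 penalty regularizes the p >> n feature space.
    clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, labels)
    print(clf.score(X, labels))
    ```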

    Embed and Conquer: Scalable Embeddings for Kernel k-Means on MapReduce

    Kernel $k$-means is an effective method for data clustering that extends the commonly used $k$-means algorithm to work on a similarity matrix over complex data structures. The kernel $k$-means algorithm is, however, computationally very demanding, as it requires the complete kernel matrix to be computed and stored. Further, the kernelized nature of the algorithm hinders the parallelization of its computations on modern distributed-computing infrastructures. In this paper, we define a family of kernel-based low-dimensional embeddings that allows kernel $k$-means to be scaled on MapReduce via an efficient and unified parallelization strategy. We then propose two methods for low-dimensional embedding that adhere to our definition of the embedding family. Exploiting the proposed parallelization strategy, we present two scalable MapReduce algorithms for kernel $k$-means. We demonstrate the effectiveness and efficiency of the proposed algorithms through an empirical evaluation on benchmark data sets. Comment: Appears in Proceedings of the SIAM International Conference on Data Mining (SDM), 201
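
    To make the "embed, then cluster" idea concrete, here is a hedged sketch that uses random Fourier features as one possible low-dimensional, kernel-approximating embedding (the paper defines its own embedding family, which may differ), followed by ordinary $k$-means on the embedded points. Both steps avoid forming the full kernel matrix and operate on fixed-length vectors, which is what makes them easy to parallelize in a MapReduce setting. The function name rff_embed and its parameters are illustrative assumptions.

    ```python
    # Sketch: approximate the RBF kernel exp(-gamma * ||x - y||^2) with random
    # Fourier features, then cluster the embedded points with standard k-means.
    import numpy as np
    from sklearn.cluster import KMeans

    def rff_embed(X, dim=256, gamma=1.0, seed=0):
        """Map X to a dim-dimensional space whose inner products approximate the RBF kernel."""
        rng = np.random.default_rng(seed)
        d = X.shape[1]
        W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, dim))   # frequencies for the Gaussian kernel
        b = rng.uniform(0.0, 2.0 * np.pi, size=dim)                 # random phases
        return np.sqrt(2.0 / dim) * np.cos(X @ W + b)

    X = np.random.default_rng(1).normal(size=(1000, 20))            # toy data
    Z = rff_embed(X)                                                 # low-dimensional embedding
    labels = KMeans(n_clusters=5, n_init=10).fit_predict(Z)         # plain k-means on the embedding
    ```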

    Tensor Numerical Methods in Quantum Chemistry: from Hartree-Fock Energy to Excited States

    We review the recent successes of grid-based tensor numerical methods and discuss their prospects in real-space electronic structure calculations. These methods, based on low-rank representation of multidimensional functions and integral operators, have led to an entirely grid-based, tensor-structured 3D Hartree-Fock eigenvalue solver. It benefits from tensor calculation of the core Hamiltonian and two-electron integrals (TEI) in $O(n \log n)$ complexity, using rank-structured approximation of basis functions, electron densities, and convolution integral operators, all represented on 3D $n \times n \times n$ Cartesian grids. The algorithm for calculating the TEI tensor in the form of a Cholesky decomposition is based on multiple factorizations using an algebraic 1D "density fitting" scheme. The basis functions are not restricted to separable Gaussians, since the analytical integration is substituted by high-precision tensor-structured numerical quadratures. Tensor approaches to post-Hartree-Fock calculations for the MP2 energy correction and for Bethe-Salpeter excited states, based on low-rank factorizations and the reduced basis method, were recently introduced. Another direction is related to recent attempts to develop a tensor-based Hartree-Fock numerical scheme for finite lattice-structured systems, where one of the numerical challenges is the summation of the electrostatic potentials of a large number of nuclei. The 3D grid-based tensor method for calculating a potential sum on an $L \times L \times L$ lattice manifests computational work linear in $L$, i.e. $O(L)$, instead of the usual $O(L^3 \log L)$ scaling of Ewald-type approaches.
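
    The low-rank grid idea can be illustrated with a minimal sketch (an illustrative toy under stated assumptions, not the paper's Hartree-Fock machinery): a separable function sampled on an $n \times n \times n$ Cartesian grid is a rank-1 canonical tensor, so it can be stored and manipulated through three length-$n$ vectors instead of $n^3$ grid values, and contractions reduce to 1D operations along each axis.

    ```python
    # Sketch: a separable 3D Gaussian on an n x n x n grid is a rank-1 tensor,
    # stored as three 1D factors (3n numbers) instead of n^3 grid values.
    import numpy as np

    n = 128
    x = np.linspace(-5.0, 5.0, n)     # 1D grid shared by all three axes
    gx = np.exp(-x**2)                # 1D factor of exp(-(x^2 + y^2 + z^2))

    # Full 3D tensor (n^3 entries) versus its rank-1 factors (3n entries).
    full = np.einsum("i,j,k->ijk", gx, gx, gx)
    print(full.size, 3 * gx.size)     # 2097152 vs 384

    # Inner products factorize: a single O(n) dot product per dimension
    # replaces an O(n^3) sum over the full grid.
    inner_full = np.sum(full * full)
    inner_rank1 = np.dot(gx, gx) ** 3
    print(np.allclose(inner_full, inner_rank1))
    ```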

    Curriculum Guidelines for Undergraduate Programs in Data Science

    The Park City Math Institute (PCMI) 2016 Summer Undergraduate Faculty Program met for the purpose of composing guidelines for undergraduate programs in Data Science. The group consisted of 25 undergraduate faculty from a variety of institutions in the U.S., primarily from the disciplines of mathematics, statistics, and computer science. These guidelines are meant to provide some structure for institutions planning for or revising a major in Data Science.