188 research outputs found

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.

    Bayesian factorizations of big sparse tensors

    It has become routine to collect data that are structured as multiway arrays (tensors). There is an enormous literature on low rank and sparse matrix factorizations, but limited consideration of extensions to the tensor case in statistics. The most common low rank tensor factorization relies on parallel factor analysis (PARAFAC), which expresses a rank-k tensor as a sum of rank-one tensors. When observations are only available for a tiny subset of the cells of a big tensor, the low rank assumption is not sufficient and PARAFAC has poor performance. We induce an additional layer of dimension reduction by allowing the effective rank to vary across dimensions of the table. For concreteness, we focus on a contingency table application. Taking a Bayesian approach, we place priors on terms in the factorization and develop an efficient Gibbs sampler for posterior computation. Theory is provided showing posterior concentration rates in high-dimensional settings, and the methods are shown to have excellent performance in simulations and several real data applications.
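
To make the phrase "a sum of rank-one tensors" in the abstract above concrete, here is a minimal NumPy sketch for a three-way array. The function name `parafac_reconstruct`, the weights, the tensor shape, and the random factor matrices are all illustrative assumptions; the sketch does not reproduce the paper's Bayesian model, its priors, or its Gibbs sampler.

```python
import numpy as np

def parafac_reconstruct(weights, factors):
    """Build a 3-way tensor as sum_h weights[h] * (a_h outer b_h outer c_h).

    weights: array of shape (k,)
    factors: three arrays of shapes (I, k), (J, k), (K, k)
    """
    A, B, C = factors
    # einsum sums over the shared component index h.
    return np.einsum("h,ih,jh,kh->ijk", weights, A, B, C)

# Hypothetical sizes chosen only for illustration.
rng = np.random.default_rng(0)
k = 3
shape = (4, 5, 6)
weights = rng.random(k)
factors = [rng.random((n, k)) for n in shape]

tensor = parafac_reconstruct(weights, factors)

# The same tensor written as an explicit sum of k rank-one tensors.
A, B, C = factors
explicit = sum(
    weights[h] * np.multiply.outer(np.multiply.outer(A[:, h], B[:, h]), C[:, h])
    for h in range(k)
)
```

Both constructions give the same array, and any unfolding of `tensor` into a matrix has rank at most `k`, which is the low rank structure the abstract refers to.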

    Neural Networks for Collaborative Filtering

    Recommender systems are an integral part of almost all modern e-commerce companies. They contribute significantly to the overall customer satisfaction by helping the user discover new and relevant items, which consequently leads to higher sales and stronger customer retention. It is, therefore, not surprising that large e-commerce shops like Amazon or streaming platforms like Netflix and Spotify even use multiple recommender systems to further increase user engagement. Finding the most relevant items for each user is a difficult task that is critically dependent on the available user feedback information. However, most users typically interact with products only through noisy implicit feedback, such as clicks or purchases, rather than providing explicit information about their preferences, such as product ratings. This usually makes large amounts of behavioural user data necessary to infer accurate user preferences. One popular approach to make the most of both forms of feedback is called collaborative filtering. Here, the main idea is to compare individual user behaviour with the behaviour of all known users. Although there are many different collaborative filtering techniques, matrix factorization models are among the most successful ones. In contrast, while neural networks are nowadays the state-of-the-art method for tasks such as image recognition or natural language processing, they are still not very popular for collaborative filtering tasks. Therefore, the main focus of this thesis is the derivation of multiple wide neural network architectures to mimic and extend matrix factorization models for various collaborative filtering problems and to gain insights into the connection between these models. The proposed architecture is based on wide and shallow feedforward neural networks, which are established for rating prediction tasks on explicit feedback datasets.
    These networks consist of large input and output layers, which allow them to capture user and item representations similar to matrix factorization models. By deriving all weight updates and comparing the structure of both models, it is proven that a simplified version of the proposed network can mimic common matrix factorization models: a result that has not been shown, as far as we know, in this form before. Additionally, various extensions are thoroughly evaluated. The new findings of this evaluation can also easily be transferred to other matrix factorization models. This neural network architecture can be extended to be used for personalized ranking tasks on implicit feedback datasets. For these problems, it is necessary to rank products according to individual preferences using only the provided implicit feedback. One of the most successful and influential approaches for personalized ranking tasks is Bayesian Personalized Ranking, which attempts to learn pairwise item rankings and can also be used in combination with matrix factorization models. It is shown how the introduction of an additional ranking layer forces the network to learn pairwise item rankings. In addition, similarities between this novel neural network architecture and a matrix factorization model trained with Bayesian Personalized Ranking are proven. To the best of our knowledge, this is the first time that these connections have been shown. The state-of-the-art performance of this network is demonstrated in a detailed evaluation. The most comprehensive feedback datasets consist of a mixture of explicit as well as implicit feedback information. Here, the goal is to predict if a user will like an item, similar to rating prediction tasks, even if this user has never given any explicit feedback at all: a problem that has not yet been covered by the collaborative filtering literature.
    The network to solve this task is composed of two networks: one for the explicit and one for the implicit feedback. Additional item features are learned using the implicit feedback, which capture all information necessary to rank items. Afterwards, these features are used to improve the explicit feedback prediction. Both parts of this combined network have different optimization goals, are trained simultaneously and, therefore, influence each other. A detailed evaluation shows that this approach helps to improve the network's overall predictive performance, especially for ranking metrics.
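
As a rough illustration of the baseline this thesis abstract compares against, the sketch below trains a plain matrix factorization model with the Bayesian Personalized Ranking objective, i.e. it maximizes ln sigma(x_ui - x_uj) for an observed item i and an unobserved item j. It is not the thesis's neural network architecture. The helper `bpr_step`, the toy interaction data, the latent dimension, the learning rate, and the regularization strength are all hypothetical choices made for this example.

```python
import numpy as np

def bpr_step(P, Q, u, i, j, lr=0.05, reg=0.01):
    """One stochastic gradient ascent step on ln sigma(x_ui - x_uj) with L2 penalty.

    P: user factors (n_users, dim); Q: item factors (n_items, dim).
    u: user index, i: observed item, j: unobserved item.
    """
    p_u, q_i, q_j = P[u].copy(), Q[i].copy(), Q[j].copy()
    x_uij = p_u @ (q_i - q_j)          # score difference x_ui - x_uj
    g = 1.0 / (1.0 + np.exp(x_uij))    # sigma(-x_uij), gradient scale
    P[u] += lr * (g * (q_i - q_j) - reg * p_u)
    Q[i] += lr * (g * p_u - reg * q_i)
    Q[j] += lr * (-g * p_u - reg * q_j)

# Toy implicit feedback (assumed data): user -> items the user interacted with.
observed = {0: [0, 1], 1: [2, 3]}
n_users, n_items, dim = 2, 4, 3

rng = np.random.default_rng(0)
P = 0.1 * rng.standard_normal((n_users, dim))
Q = 0.1 * rng.standard_normal((n_items, dim))

for _ in range(2000):
    u = int(rng.integers(n_users))
    i = int(rng.choice(observed[u]))
    unobserved = [t for t in range(n_items) if t not in observed[u]]
    j = int(rng.choice(unobserved))
    bpr_step(P, Q, u, i, j)

scores = P @ Q.T  # scores[u, t] is the predicted preference of user u for item t
```

After training on this toy data, each user's observed items should score above the unobserved ones, which is the pairwise ranking behaviour the abstract attributes to Bayesian Personalized Ranking.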

    GENO -- GENeric Optimization for Classical Machine Learning

    Although optimization is the longstanding algorithmic backbone of machine learning, new models still require the time-consuming implementation of new solvers. As a result, there are thousands of implementations of optimization algorithms for machine learning problems. A natural question is whether it is always necessary to implement a new solver, or whether there is one algorithm that is sufficient for most models. Common belief suggests that such a one-algorithm-fits-all approach cannot work, because this algorithm cannot exploit model-specific structure and thus cannot be efficient and robust on a wide variety of problems. Here, we challenge this common belief. We have designed and implemented the optimization framework GENO (GENeric Optimization) that combines a modeling language with a generic solver. GENO generates a solver from the declarative specification of an optimization problem class. The framework is flexible enough to encompass most of the classical machine learning problems. We show on a wide variety of classical but also some recently suggested problems that the automatically generated solvers are (1) as efficient as well-engineered specialized solvers, (2) more efficient by a decent margin than recent state-of-the-art solvers, and (3) orders of magnitude more efficient than classical modeling language plus solver approaches.

    Statistical learning for predictive targeting in online advertising

    Data-Driven Modeling For Decision Support Systems And Treatment Management In Personalized Healthcare

    The massive amount of electronic medical records (EMRs) accumulating from patients and populations motivates clinicians and data scientists to collaborate on advanced analytics that create the knowledge needed to provide personalized insights for patients, clinicians, providers, scientists, and health policy makers. Learning from large and complicated data is used extensively in marketing and commercial enterprises to generate personalized recommendations. Recently, the medical research community has focused on taking advantage of big data analytic approaches and moving toward personalized (precision) medicine. This is therefore a significant period in healthcare and medicine for the transition to a new paradigm. There is a noticeable opportunity to implement a learning health care system and data-driven healthcare to make better medical decisions and better personalized predictions, and to discover risk factors and their interactions more precisely. In this research we focus on data-driven approaches for personalized medicine. We propose a research framework which emphasizes three main phases: 1) Predictive modeling, 2) Patient subgroup analysis and 3) Treatment recommendation. Our goal is to develop novel methods for each phase and apply them in real-world applications. In the first phase, we develop a new predictive approach based on feature representation using deep feature learning and word embedding techniques. Our method uses different deep architectures (stacked autoencoders, deep belief networks and variational autoencoders) for feature representation in higher-level abstractions to obtain effective and more robust features from EMRs, and then builds prediction models on top of them. Our approach is particularly useful when unlabeled data is abundant whereas labeled data is scarce. We investigate the performance of representation learning through a supervised approach. We apply our method to different small and large datasets.
    Finally, we provide a comparative study and show that our predictive approach leads to better results in comparison with others. In the second phase, we propose a novel patient subgroup detection method, called Supervised Biclustering (SUBIC), using convex optimization and apply our approach to detect patient subgroups and prioritize risk factors for hypertension (HTN) in a vulnerable demographic subgroup (African-American). Our approach not only finds patient subgroups with guidance of a clinically relevant target variable but also identifies and prioritizes risk factors by pursuing sparsity of the input variables and encouraging similarity among the input variables and between the input and target variables. Finally, in the third phase, we introduce a new survival analysis framework using deep learning and active learning with a novel sampling strategy. First, our approach provides a better, lower-dimensional representation of clinical features using labeled (time-to-event) and unlabeled (censored) instances and then actively trains the survival model by labeling the censored data using an oracle. As a clinical assistive tool, we propose a simple yet effective treatment recommendation approach based on our survival model. In the experimental study, we apply our approach to SEER-Medicare data related to prostate cancer among African-American and white patients. The results indicate that our approach significantly outperforms baseline models.