Developing deep learning computational tools for cancer using omics data
Master's dissertation in Computer Science

There has been increasing investment in cancer research, which has generated an enormous
amount of biological and clinical data, especially after the advent of next-generation
sequencing technologies. To analyze the large datasets provided by omics data of cancer
samples, scientists have successfully turned to machine learning algorithms, identifying
patterns and developing models that use statistical techniques to make accurate
predictions.
Deep learning is a branch of machine learning, best known for its applications in artificial
intelligence (computer vision, speech recognition, natural language processing and
robotics). In general, deep learning models differ from "shallow" machine learning methods
(a single hidden layer) in that they use multiple layers of abstraction. In this way, it
is possible to learn high-level features and complex relations in the given data.
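To make the contrast concrete, here is a minimal illustration (not drawn from any of the works below): a network with a single hidden layer of ReLU units computes XOR, a function no purely linear model can represent. The weights are hand-picked rather than learned.

```python
def relu(x):
    return max(0.0, x)

def xor_net(x1, x2):
    # hidden layer: two ReLU feature detectors
    h1 = relu(x1 + x2)        # fires when either input is on
    h2 = relu(x1 + x2 - 1)    # fires only when both inputs are on
    # output layer combines the hidden features linearly
    return h1 - 2 * h2

[xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]  # -> [0.0, 1.0, 1.0, 0.0]
```

Stacking further layers composes such learned features into progressively higher-level representations, which is the sense in which deep models go beyond single-hidden-layer methods.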
Given this context, the main goal of this work is the development and
evaluation of deep learning methods for the analysis of cancer omics datasets, covering both
unsupervised methods for feature generation from different types of data and supervised
methods to address cancer diagnosis and prognosis prediction.
We worked with a Neuroblastoma (NB) dataset from two different platforms (RNA-Seq
and microarrays) and developed both supervised (Deep Neural Networks (DNN) and Multi-Task
Deep Neural Networks (MT-DNN)) and unsupervised (Stacked Denoising Autoencoders (SDA))
deep architectures, and compared them with traditional shallow algorithms.
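As a rough sketch of the unsupervised building block mentioned above, the following pure-Python toy trains one denoising-autoencoder layer: the input is corrupted with masking noise and the weights learn to reconstruct the clean input. This is an illustrative simplification (linear units, tied weights, plain SGD), not the thesis code, which stacks several nonlinear layers.

```python
import random

def train_dae_layer(data, hidden=2, noise_p=0.3, lr=0.01, epochs=300, seed=0):
    """One denoising-autoencoder layer: corrupt the input, then learn
    tied weights W that reconstruct the clean input (linear units)."""
    rng = random.Random(seed)
    d = len(data[0])
    W = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(hidden)]
    for _ in range(epochs):
        for x in data:
            # masking noise: randomly zero out input entries
            xt = [v if rng.random() > noise_p else 0.0 for v in x]
            h = [sum(W[j][i] * xt[i] for i in range(d)) for j in range(hidden)]
            r = [sum(W[j][i] * h[j] for j in range(hidden)) for i in range(d)]
            err = [r[i] - x[i] for i in range(d)]
            for j in range(hidden):
                for i in range(d):
                    # gradient of the squared error w.r.t. the tied weight W[j][i]
                    g = err[i] * h[j] + sum(err[k] * W[j][k] for k in range(d)) * xt[i]
                    W[j][i] -= lr * g
    return W

def recon_error(W, x):
    """Squared reconstruction error of a clean (uncorrupted) input."""
    h = [sum(wj[i] * x[i] for i in range(len(x))) for wj in W]
    r = [sum(W[j][i] * h[j] for j in range(len(W))) for i in range(len(x))]
    return sum((r[i] - x[i]) ** 2 for i in range(len(x)))

data = [[0.0, 0.0, 0.0], [0.3, 0.3, 0.3], [0.6, 0.6, 0.6], [0.9, 0.9, 0.9]]
W0 = train_dae_layer(data, epochs=0)   # untrained (random-init) weights
W = train_dae_layer(data)              # trained weights
err_before = sum(recon_error(W0, x) for x in data)
err_after = sum(recon_error(W, x) for x in data)
```

In a stacked SDA, the hidden activations of one trained layer become the input of the next, and the stack is finally fine-tuned on the supervised task.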
Overall, we achieved promising results with deep learning on both platforms, showing
that the advantages of deep learning models can be carried over to cancer omics data.
At the same time, we faced difficulties related to model complexity and computational
power requirements, as well as a lack of samples sufficient to truly benefit from the deep architectures.
The code generated in this work can be applied to other datasets and is available in a
GitHub repository: https://github.com/lmpeixoto/deepl_learning [49].
Machine Learning Models for High-dimensional Biomedical Data
Abstract: Recent technological advances enable the collection of complex, heterogeneous and high-dimensional data in biomedical domains. The increasing availability of high-dimensional biomedical data creates the need for new machine learning models for effective data analysis and knowledge discovery. This dissertation introduces several unsupervised and supervised methods to help understand the data, discover patterns and improve decision making. All the proposed methods generalize to other industrial fields.
The first topic of this dissertation focuses on data clustering. Data clustering is often the first step in analyzing a dataset without label information. Clustering high-dimensional data with mixed categorical and numeric attributes remains a challenging yet important task. A clustering algorithm based on tree ensembles, CRAFTER, is proposed to tackle this task in a scalable manner.
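The published CRAFTER algorithm is not reproduced here, but the general tree-ensemble clustering idea can be sketched: points that land on the same side of many random splits are similar, and random splits handle numeric and categorical attributes uniformly. In this toy version, depth-1 "stumps" stand in for full trees, and all names are illustrative.

```python
import random

def random_stump(points, rng):
    """A depth-1 'tree': split on one randomly chosen attribute.
    Numeric attributes split at a random threshold; categorical
    attributes split on equality with a random observed value."""
    i = rng.randrange(len(points[0]))
    values = [p[i] for p in points]
    if isinstance(values[0], str):                # categorical attribute
        v = rng.choice(values)
        return lambda p: p[i] == v
    t = rng.uniform(min(values), max(values))     # numeric attribute
    return lambda p: p[i] <= t

def proximity(points, n_trees=200, seed=1):
    """Fraction of random stumps in which each pair of points co-occurs."""
    rng = random.Random(seed)
    n = len(points)
    counts = [[0] * n for _ in range(n)]
    for _ in range(n_trees):
        side_of = random_stump(points, rng)
        sides = [side_of(p) for p in points]
        for a in range(n):
            for b in range(n):
                if sides[a] == sides[b]:
                    counts[a][b] += 1
    return [[c / n_trees for c in row] for row in counts]

# Mixed numeric/categorical toy data: two obvious groups.
data = [(0.1, "a"), (0.2, "a"), (0.15, "a"), (5.0, "b"), (5.2, "b"), (4.9, "b")]
prox = proximity(data)
```

Within-group pairs share a side in nearly every stump (proximity near 1), while cross-group pairs rarely do; a standard clustering algorithm can then be run on 1 − proximity as a distance.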
The second part of this dissertation aims to develop data representation methods for genome sequencing data, a special type of high-dimensional data in the biomedical domain. The proposed data representation method, Bag-of-Segments, can summarize the key characteristics of the genome sequence into a small number of features with good interpretability.
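The exact Bag-of-Segments construction is not detailed in the abstract, but the bag-of-patterns idea it evokes can be sketched as follows; the segment length, prototypes and distance measure here are illustrative assumptions, not the dissertation's choices.

```python
def bag_of_segments(seq, prototypes, seg_len=4):
    """Cut seq into fixed-length segments, assign each to its nearest
    prototype, and summarize the sequence as prototype counts."""
    counts = [0] * len(prototypes)
    for start in range(0, len(seq) - seg_len + 1, seg_len):
        seg = seq[start:start + seg_len]
        # nearest prototype by squared Euclidean distance
        dists = [sum((a - b) ** 2 for a, b in zip(seg, p)) for p in prototypes]
        counts[dists.index(min(dists))] += 1
    return counts

# Two prototype segment shapes: "flat low" and "flat high".
protos = [[0, 0, 0, 0], [1, 1, 1, 1]]
signal = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
bag_of_segments(signal, protos)  # -> [1, 2]
```

The resulting count vector is short and interpretable (each feature means "how often does this segment shape occur"), which matches the interpretability goal stated above.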
The third part of this dissertation introduces an end-to-end deep neural network model, GCRNN, for time series classification with emphasis on both the accuracy and the interpretation. GCRNN contains a convolutional network component to extract high-level features, and a recurrent network component to enhance the modeling of the temporal characteristics. A feed-forward fully connected network with the sparse group lasso regularization is used to generate the final classification and provide good interpretability.
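The sparse group lasso regularizer mentioned above has a standard form: an L1 term that drives individual weights to zero plus a group-L2 term that drives entire feature groups to zero, which is what yields the interpretability. A minimal sketch of the penalty (the grouping and the λ values are illustrative):

```python
import math

def sparse_group_lasso(groups, lam1=0.1, lam2=0.1):
    """groups: list of weight vectors, one per feature group.
    Returns lam1 * ||w||_1 + lam2 * sum_g ||w_g||_2."""
    l1 = sum(abs(w) for g in groups for w in g)                       # element sparsity
    group_l2 = sum(math.sqrt(sum(w * w for w in g)) for g in groups)  # group sparsity
    return lam1 * l1 + lam2 * group_l2

weights = [[3.0, 4.0], [0.0, 0.0]]   # second group fully zeroed out
sparse_group_lasso(weights)          # ≈ 0.1*7 + 0.1*5 = 1.2
```

Because the group-L2 term is not squared, it behaves like an L1 penalty at the group level: during training whole groups (e.g. all weights attached to one input feature) are pushed exactly to zero, so the surviving groups identify the influential features.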
The last topic centers on dimensionality reduction methods for time series data. A good dimensionality reduction method is important for storage, decision making and pattern visualization for time series data. The CRNN autoencoder is proposed to not only achieve low reconstruction error, but also generate discriminative features. A variational version of this autoencoder has great potential for applications such as anomaly detection and process control.
Dissertation/Thesis: Doctoral Dissertation, Industrial Engineering, 201
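One generic way such an autoencoder supports anomaly detection (a sketch of the general idea, not the dissertation's CRNN model) is to flag inputs the model reconstructs poorly; here a trivial "replace everything with the mean" reconstructor stands in for a trained decoder.

```python
def reconstruction_error(x, x_hat):
    """Mean squared error between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def flag_anomalies(series, reconstruct, threshold):
    """Indices of inputs whose reconstruction error exceeds the threshold."""
    return [i for i, x in enumerate(series)
            if reconstruction_error(x, reconstruct(x)) > threshold]

def mean_reconstruct(x):
    # Toy stand-in for a trained decoder: it can only reproduce constant
    # sequences, so non-constant inputs reconstruct badly.
    m = sum(x) / len(x)
    return [m] * len(x)

series = [[1.0, 1.0, 1.0], [0.0, 5.0, 0.0]]
flag_anomalies(series, mean_reconstruct, threshold=0.5)  # -> [1]
```

With a real autoencoder trained on normal operating data, inputs from the training distribution reconstruct well and out-of-distribution inputs do not, which is what makes the reconstruction error a usable anomaly score.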
Deep Learning for Genomics: A Concise Overview
Advancements in genomic research such as high-throughput sequencing
techniques have driven modern genomic studies into "big data" disciplines. This
data explosion is constantly challenging conventional methods used in genomics.
In parallel with the urgent demand for robust algorithms, deep learning has
succeeded in a variety of fields such as vision, speech, and text processing.
Yet genomics entails unique challenges to deep learning since we are expecting
from deep learning a superhuman intelligence that explores beyond our knowledge
to interpret the genome. A powerful deep learning model should rely on
insightful utilization of task-specific knowledge. In this paper, we briefly
discuss the strengths of different deep learning models from a genomic
perspective so as to fit each particular task with a proper deep architecture,
and remark on practical considerations of developing modern deep learning
architectures for genomics. We also provide a concise review of deep learning
applications in various aspects of genomic research, as well as pointing out
potential opportunities and obstacles for future genomics applications.
Comment: Invited chapter for the Springer book Handbook of Deep Learning
Application
Review of Deep Learning Algorithms and Architectures
Deep learning (DL) is playing an increasingly important role in our lives. It has already made a huge impact in areas such as cancer diagnosis, precision medicine, self-driving cars, predictive forecasting, and speech recognition. The painstakingly handcrafted feature extractors used in traditional learning, classification, and pattern recognition systems are not scalable for large-sized data sets. In many cases, depending on the problem complexity, DL can also overcome the limitations of earlier shallow networks that prevented efficient training and abstractions of hierarchical representations of multi-dimensional training data. A deep neural network (DNN) uses multiple (deep) layers of units with highly optimized algorithms and architectures. This paper reviews several optimization methods to improve the accuracy of the training and to reduce training time. We delve into the math behind training algorithms used in recent deep networks. We describe current shortcomings, enhancements, and implementations. The review also covers different types of deep architectures, such as deep convolution networks, deep residual networks, recurrent neural networks, reinforcement learning, variational autoencoders, and others.
https://doi.org/10.1109/ACCESS.2019.291220
Machine Learning Guided Discovery and Design for Inertial Confinement Fusion
Inertial confinement fusion (ICF) experiments at the National Ignition Facility (NIF) and their corresponding computer simulations produce an immense amount of rich data. However, quantitatively interpreting that data remains a grand challenge. Design spaces are vast, data volumes are large, and the relationship between models and experiments may be uncertain.
We propose using machine learning to aid in the design and understanding of ICF implosions by integrating simulation and experimental data into a common framework. We begin by illustrating an early success of this data-driven design approach, which resulted in the discovery of a new class of high-performing ovoid-shaped implosion simulations. The ovoids achieve robust performance from the generation of zonal flows within the hotspot, revealing physics that had not previously been observed in ICF capsules.
The ovoid discovery also revealed deficiencies in common machine learning algorithms for modeling ICF data. To overcome these inadequacies, we developed a novel algorithm, deep jointly-informed neural networks (DJINN), which enables non-data scientists to quickly train neural networks on their own datasets. DJINN is routinely used for modeling ICF data and for a variety of other applications (uncertainty quantification; climate, nuclear, and atomic physics data). We demonstrate how DJINN is used to perform parameter inference tasks for NIF data, and how transfer learning with DJINN enables us to create predictive models of direct drive experiments at the Omega laser facility.
Much of this work focuses on scalar or modest-size vector data; however, many ICF diagnostics produce a variety of images, spectra, and sequential data. We end with a brief exploration of sequence-to-sequence models for emulating time-dependent multiphysics systems of varying complexity. This is a first step toward incorporating multimodal time-dependent data into our analyses to better constrain our predictive models.