Unbalanced data processing using oversampling: machine Learning
Deep learning (DL) algorithms now perform well on problems that share two characteristics: large amounts of data and high dimensionality. However, a major current challenge is the classification of high-dimensional databases with very few samples and severe class imbalance. Biomedical gene expression microarray databases exhibit all three of these traits. Class imbalance arises when the samples belonging to one class far outnumber those of the other class or classes, and it has been identified as one of the main challenges for algorithms applied in the context of Big Data. The objective of this research is to study gene expression databases using conventional under- and oversampling methods for class balancing, namely random undersampling (RUS), random oversampling (ROS) and SMOTE. The databases were modified in one case by increasing their imbalance and in another by adding artificial noise
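The three resampling methods named in this abstract can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy data, and the default neighbour count `k` are all hypothetical, and the SMOTE variant is a simplified version that interpolates between a minority sample and one of its `k` nearest minority neighbours.

```python
import math
import random
from collections import Counter, defaultdict


def _group_by_class(samples, labels):
    """Group feature vectors by their class label."""
    groups = defaultdict(list)
    for x, y in zip(samples, labels):
        groups[y].append(x)
    return groups


def random_oversample(samples, labels, seed=0):
    """ROS: duplicate randomly chosen samples until every class matches the largest."""
    rng = random.Random(seed)
    groups = _group_by_class(samples, labels)
    target = max(len(xs) for xs in groups.values())
    out_x, out_y = [], []
    for y, xs in groups.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(xs + extra)
        out_y.extend([y] * target)
    return out_x, out_y


def random_undersample(samples, labels, seed=0):
    """RUS: keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_class(samples, labels)
    target = min(len(xs) for xs in groups.values())
    out_x, out_y = [], []
    for y, xs in groups.items():
        out_x.extend(rng.sample(xs, target))
        out_y.extend([y] * target)
    return out_x, out_y


def smote(minority, n_synthetic, k=2, seed=0):
    """Simplified SMOTE: interpolate between a minority sample and a near neighbour."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy data (hypothetical): six majority samples, two minority samples.
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3],
     [0.2, 0.4], [0.4, 0.2], [5.0, 5.0], [6.0, 5.5]]
y = [0] * 6 + [1] * 2
print(Counter(random_oversample(X, y)[1]))
print(Counter(random_undersample(X, y)[1]))
print(smote(X[6:], n_synthetic=3, k=1))
```

With so few samples per class, as in microarray data, undersampling discards a large share of an already small dataset, which is one reason oversampling variants are often compared against it.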
Elephant Search with Deep Learning for Microarray Data Analysis
Even though there is a plethora of research on microarray gene expression
data analysis, analyzing large and complex gene expression data effectively
and efficiently remains a challenge for researchers. Feature (gene) selection
is of paramount importance for understanding the differences in biological and
non-biological variation between samples. To address this problem, a novel
elephant search (ES) based optimization is proposed to select the best gene
expressions from the large volume of microarray data. A machine learning
method is then envisioned to exploit such high-dimensional and complex
microarray datasets, extracting hidden patterns to make meaningful predictions
and accurate classifications. In particular, stochastic gradient descent based
deep learning (DL) with a softmax activation function is applied to the reduced
features (genes) to better classify samples according to their gene
expression levels. The experiments are carried out on nine of the most popular
cancer microarray gene selection datasets, obtained from the UCI machine
learning repository. The empirical results obtained by the proposed elephant
search based deep learning (ESDL) approach are compared with the most recently
published article to assess its suitability for future bioinformatics
research. Comment: 12 pages, 5 Tabl
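The classification step described here, stochastic gradient descent with a softmax output, can be illustrated with a minimal sketch. It assumes a single linear softmax layer rather than the deep network used in the paper, and it does not implement elephant search; the function names, learning rate, epoch count, and toy data are all hypothetical.

```python
import math
import random


def softmax(scores):
    """Convert raw class scores into probabilities (shifted for numerical stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(W, b, x):
    """Linear score for each class: W[k] . x + b[k]."""
    return [sum(w * v for w, v in zip(W[k], x)) + b[k] for k in range(len(W))]


def train_softmax_sgd(X, y, n_classes, lr=0.1, epochs=200, seed=0):
    """Fit a linear softmax classifier with per-sample SGD on cross-entropy loss."""
    rng = random.Random(seed)
    n_features = len(X[0])
    W = [[0.0] * n_features for _ in range(n_classes)]
    b = [0.0] * n_classes
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit samples in a fresh random order each epoch
        for i in order:
            probs = softmax(class_scores(W, b, X[i]))
            for k in range(n_classes):
                # Gradient of cross-entropy w.r.t. score k: p_k - 1[k == y_i]
                grad = probs[k] - (1.0 if k == y[i] else 0.0)
                for j in range(n_features):
                    W[k][j] -= lr * grad * X[i][j]
                b[k] -= lr * grad
    return W, b


def predict(W, b, x):
    """Return the index of the highest-scoring class."""
    scores = class_scores(W, b, x)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy "reduced feature" data (hypothetical): two selected genes, two classes.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
y = [0, 0, 0, 1, 1, 1]
W, b = train_softmax_sgd(X, y, n_classes=2)
print([predict(W, b, x) for x in X])
```

In the pipeline the abstract describes, the input columns would be the genes retained by the selection step, so the classifier sees far fewer features than the raw microarray provides.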
Predicting survival from colorectal cancer histology slides using deep learning: A retrospective multicenter study
BACKGROUND: For virtually every patient with colorectal cancer (CRC), hematoxylin-eosin (HE)-stained tissue slides are available. These images contain quantitative information, which is not routinely used to objectively extract prognostic biomarkers. In the present study, we investigated whether deep convolutional neural networks (CNNs) can extract prognosticators directly from these widely available images.
METHODS AND FINDINGS: We hand-delineated single-tissue regions in 86 CRC tissue slides, yielding more than 100,000 HE image patches, and used these to train a CNN by transfer learning, reaching a nine-class accuracy of >94% in an independent data set of 7,180 images from 25 CRC patients. With this tool, we performed automated tissue decomposition of representative multitissue HE images from 862 HE slides in 500 stage I-IV CRC patients in The Cancer Genome Atlas (TCGA) cohort, a large international multicenter collection of CRC tissue. Based on the output neuron activations in the CNN, we calculated a "deep stroma score," which was an independent prognostic factor for overall survival (OS) in a multivariable Cox proportional hazard model (hazard ratio [HR] with 95% confidence interval [CI]: 1.99 [1.27-3.12], p = 0.0028), while in the same cohort, manual quantification of stromal areas and a gene expression signature of cancer-associated fibroblasts (CAFs) were only prognostic in specific tumor stages. We validated these findings in an independent cohort of 409 stage I-IV CRC patients from the "Darmkrebs: Chancen der Verhütung durch Screening" (DACHS) study who were recruited between 2003 and 2007 in multiple institutions in Germany. Again, the score was an independent prognostic factor for OS (HR 1.63 [1.14-2.33], p = 0.008), CRC-specific OS (HR 2.29 [1.5-3.48], p = 0.0004), and relapse-free survival (RFS; HR 1.92 [1.34-2.76], p = 0.0004). A prospective validation is required before this biomarker can be implemented in clinical workflows.
CONCLUSIONS: In our retrospective study, we show that a CNN can assess the human tumor microenvironment and predict prognosis directly from histopathological images
Developing deep learning computational tools for cancer using omics data
Master's dissertation in Computer Science
There has been an increasing investment in cancer research that has generated an enormous
amount of biological and clinical data, especially after the advent of next-generation
sequencing technologies. To analyze the large datasets provided by omics data of cancer
samples, scientists have successfully turned to machine learning algorithms, identifying
patterns and developing models by using statistical techniques to make accurate
predictions.
Deep learning is a branch of machine learning, best known for its applications in artificial
intelligence (computer vision, speech recognition, natural language processing and
robotics). In general, deep learning models differ from "shallow" machine learning methods
(single hidden layer) because they rely on multiple layers of abstraction. In this way, it
is possible to learn high-level features and complex relations in the given data.
Given the context specified above, the main target of this work is the development and
evaluation of deep learning methods for the analysis of cancer omics datasets, covering both
unsupervised methods for feature generation from different types of data, and supervised
methods to address cancer diagnostics and prognostic predictions.
We worked with a Neuroblastoma (NB) dataset from two different platforms (RNA-Seq
and microarrays) and developed both supervised (Deep Neural Networks (DNN), Multi-Task
Deep Neural Network (MT-DNN)) and unsupervised (Stacked Denoising Autoencoders (SDA))
deep architectures, and compared them with shallow traditional algorithms.
Overall we achieved promising results with deep learning on both platforms, meaning
that it is possible to reap the advantages of deep learning models on cancer omics data.
At the same time we faced some difficulties related to the complexity and computational
power requirements, as well as the lack of samples to truly benefit from the deep architectures.
The code generated in this work can be applied to other datasets and is available in a
github repository https://github.com/lmpeixoto/deepl_learning [49].
In recent years there has been significant investment in cancer research, which has
generated an enormous amount of biological and clinical data, especially after the
emergence of so-called "next-generation" sequencing technologies. To analyze these data,
the scientific community has successfully turned to machine learning algorithms,
identifying patterns and developing models using statistical methods. These models make
it possible to predict outcomes. Deep learning, a branch of machine learning, is best
known for its applications in artificial intelligence (image and speech recognition,
natural language processing and robotics). In general, deep learning models differ from
classical machine learning methods in that they rely on several layers of abstraction.
In this way, it is possible to "learn" complex, non-linear representations, with several
degrees of freedom, of the analyzed data. In this context, the main objective of this
work is to develop and evaluate deep learning methods for analyzing cancer omics data.
The aim is to develop both supervised and unsupervised methods and to use different types
of data, building solutions for cancer diagnosis and prognosis. To this end, we worked
with a Neuroblastoma data matrix from two different platforms (RNA-seq and microarrays),
to which we applied several deep learning architectures, both supervised and
unsupervised, and compared them with traditional machine learning algorithms. Overall we
obtained promising results on both platforms, which meant it was possible to benefit from
the advantages of deep learning models on cancer omics data. At the same time we
encountered some difficulties, especially related to the complexity of the models and the
computational power required, as well as the low number of available samples. Following
this work, code was generated that can be applied to other data and is available in a
github repository https://github.com/lmpeixoto/deepl_learning
[49]
Deep generative modeling for single-cell transcriptomics.
Single-cell transcriptome measurements can reveal unexplored biological diversity, but they suffer from technical noise and bias that must be modeled to account for the resulting uncertainty in downstream analyses. Here we introduce single-cell variational inference (scVI), a ready-to-use scalable framework for the probabilistic representation and analysis of gene expression in single cells ( https://github.com/YosefLab/scVI ). scVI uses stochastic optimization and deep neural networks to aggregate information across similar cells and genes and to approximate the distributions that underlie observed expression values, while accounting for batch effects and limited sensitivity. We used scVI for a range of fundamental analysis tasks including batch correction, visualization, clustering, and differential expression, and achieved high accuracy for each task
Knowledge about the presence or absence of miRNA isoforms (isomiRs) can successfully discriminate amongst 32 TCGA cancer types.
Isoforms of human miRNAs (isomiRs) are constitutively expressed with tissue- and disease-subtype-dependencies. We studied 10 271 tumor datasets from The Cancer Genome Atlas (TCGA) to evaluate whether isomiRs can distinguish amongst 32 TCGA cancers. Unlike previous approaches, we built a classifier that relied solely on 'binarized' isomiR profiles: each isomiR is simply labeled as 'present' or 'absent'. The resulting classifier successfully labeled tumor datasets with an average sensitivity of 90% and a false discovery rate (FDR) of 3%, surpassing the performance of expression-based classification. The classifier maintained its power even after a 15× reduction in the number of isomiRs that were used for training. Notably, the classifier could correctly predict the cancer type in non-TCGA datasets from diverse platforms. Our analysis revealed that the most discriminatory isomiRs happen to also be differentially expressed between normal tissue and cancer. Even so, we find that these highly discriminating isomiRs have not been attracting the most research attention in the literature. Given their ability to successfully classify datasets from 32 cancers, isomiRs and our resulting 'Pan-cancer Atlas' of isomiR expression could serve as a suitable framework to explore novel cancer biomarkers
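The two ideas in this abstract that lend themselves to a worked example are the binarization of expression profiles and the two reported metrics. The sketch below is illustrative only: the abstract does not say how presence is decided, so the `threshold` parameter is an assumption, and the function names, labels, and values are hypothetical. Sensitivity is computed as TP / (TP + FN) and FDR as FP / (TP + FP), per class.

```python
def binarize(profile, threshold=0.0):
    """Label each feature as present (1) or absent (0); the threshold is an assumption."""
    return [1 if value > threshold else 0 for value in profile]


def sensitivity_and_fdr(true_labels, predicted_labels, positive):
    """Per-class sensitivity TP/(TP+FN) and false discovery rate FP/(TP+FP)."""
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    fdr = fp / (tp + fp) if tp + fp else 0.0
    return sensitivity, fdr


# Hypothetical expression values for four isomiRs in one sample.
print(binarize([0.0, 3.2, 0.0, 17.5]))

# Hypothetical true and predicted cancer types for four samples.
truth = ["BRCA", "BRCA", "LUAD", "LUAD"]
preds = ["BRCA", "LUAD", "LUAD", "LUAD"]
for cancer in ("BRCA", "LUAD"):
    print(cancer, sensitivity_and_fdr(truth, preds, cancer))
```

Averaging the per-class values over all cancer types would give summary figures of the kind the abstract reports, although the exact averaging scheme used by the authors is not stated here.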
Genomic Methods for Bacterial Infection Identification
Hospital-acquired infections (HAIs) have high mortality rates around the world and are a challenge to medical science due to rapid mutation rates in their pathogens. A new methodology is proposed to identify bacterial species causing HAIs based on sets of universal biomarkers for next-generation microarray designs (i.e., nxh chips), rather than a priori selections of biomarkers. This method allows arbitrary organisms to be classified based on readouts of their DNA sequences, including whole genomes. The underlying models are based on the biochemistry of DNA, unlike traditional edit-distance based alignments. Furthermore, the methodology is fairly robust to genetic mutations, which would otherwise be likely to reduce accuracy. Standard machine learning methods (neural networks, self-organizing maps, and random forests) identify HAIs on nxh chips with results that are very competitive with, if not superior to, current standards in the field. The potential feasibility of translating these techniques to a clinical test is also discussed
The importance of data classification using machine learning methods in microarray data
The detection of genetic mutations has attracted global attention. Several methods have been proposed to detect diseases such as cancers and tumours. One of them is the microarray, a representation of gene expression that is helpful in diagnosis. To unleash the full potential of microarrays, machine learning algorithms and gene selection methods can be used to facilitate microarray processing and to overcome other potential challenges. One of these challenges is high-dimensional data that are redundant, irrelevant, and noisy. To alleviate this problem, the representation should be simplified. For example, feature selection can reduce the number of features used in clustering and classification: a subset of genes is selected from the pool of gene expression data recorded on DNA microarrays. This paper reviews existing classification techniques and gene selection methods. The effectiveness of emerging techniques, such as swarm intelligence for feature selection and classification in microarrays, is reported as well. These emerging techniques can be used in detecting cancer. The swarm intelligence technique can be combined with other statistical methods to attain better results
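As a concrete illustration of selecting a subset of genes, the sketch below ranks genes by the variance of their expression across samples and keeps the top `k`. This simple filter is a stand-in chosen for brevity, not one of the swarm intelligence methods the review covers; the function name, gene names, and values are hypothetical.

```python
from statistics import pvariance


def select_top_variance_genes(matrix, gene_names, k):
    """Keep the k genes whose expression varies most across samples.

    matrix: rows are samples, columns are genes (same order as gene_names).
    """
    columns = zip(*matrix)  # one tuple of expression values per gene
    scored = [(name, pvariance(values)) for name, values in zip(gene_names, columns)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in scored[:k]]


# Hypothetical expression matrix: three samples, three genes.
expression = [
    [1.0, 5.0, 2.0],
    [1.1, 0.5, 2.0],
    [0.9, 9.0, 2.1],
]
print(select_top_variance_genes(expression, ["g1", "g2", "g3"], k=2))
```

A variance filter ignores the class labels, so it removes near-constant genes but cannot tell which of the remaining genes separate the classes; wrapper methods such as swarm-based search evaluate candidate subsets against a classifier instead.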
- …