79 research outputs found

    Automatic discovery of 100-miRNA signature for cancer classification using ensemble feature selection

    Get PDF
    Lopez-Rincon A, Martinez-Archundia M, Martinez-Ruiz GU, Schönhuth A, Tonda A. Automatic discovery of 100-miRNA signature for cancer classification using ensemble feature selection. BMC Bioinformatics. 2019;20(1):480.

    Systems Analytics and Integration of Big Omics Data

    Get PDF
    A “genotype” is essentially an organism's full hereditary information, which is obtained from its parents. A “phenotype” is an organism's actual observed physical and behavioral properties. These may include traits such as morphology, size, height, eye color, and metabolism. One of the pressing challenges in computational and systems biology is genotype-to-phenotype prediction. This is challenging given the amount of data generated by modern Omics technologies. This “Big Data” is so large and complex that traditional data processing applications are not up to the task. Challenges arise in the collection, analysis, mining, sharing, transfer, visualization, archiving, and integration of these data. In this Special Issue, there is a focus on the systems-level analysis of Omics data, recent developments in gene ontology annotation, and advances in biological pathways and network biology. The integration of Omics data with clinical and biomedical data using machine learning is explored. This Special Issue covers new methodologies in the context of gene–environment interactions, tissue-specific gene expression, and how external factors or host genetics impact the microbiome.

    Strategies For Improving Epistasis Detection And Replication

    Get PDF
    Genome-wide association studies (GWAS) have been extensively critiqued for their perceived inability to adequately elucidate the genetic underpinnings of complex disease. Of particular concern is “missing heritability,” or the difference between the total estimated heritability of a phenotype and that explained by GWAS-identified loci. There are numerous proposed explanations for this missing heritability, but a frequently ignored and potentially vastly informative alternative explanation is the ubiquity of epistasis underlying complex phenotypes. Given our understanding of how biomolecules interact in networks and pathways, it is not unreasonable to conclude that the effect of variation at individual genetic loci may non-additively depend on and should be analyzed in the context of their interacting partners. It has been recognized for over a century that deviation from expected Mendelian proportions can be explained by the interaction of multiple loci, and the epistatic underpinnings of phenotypes in model organisms have been extensively experimentally quantified. Therefore, the dearth of inspiring single locus GWAS hits for complex human phenotypes (and the inconsistent replication of these between populations) should not be surprising, as one might expect the joint effect of multiple perturbations to interacting partners within a functional biological module to be more important than individual main effects. Current methods for analyzing data from GWAS are not well-equipped to detect epistasis or replicate significant interactions. The multiple testing burden associated with testing each pairwise interaction quickly becomes nearly insurmountable with increasing numbers of loci. Statistical and machine learning approaches that have worked well for other types of high-dimensional data are appealing and may be useful for detecting epistasis, but potentially require tweaks to function appropriately. 
Biological knowledge may also be leveraged to guide the search for epistasis candidates, but requires context-appropriate application (as, for example, two loci with significant main effects may not have a significant interaction, and vice versa). Rather than renouncing GWAS, and the wealth of associated data that has been accumulated, as a failure, I propose the development of new techniques and the incorporation of diverse data sources to analyze GWAS data in an epistasis-centric framework.
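The combinatorial scale of the multiple-testing burden described above is easy to make concrete. A minimal sketch (the panel size and alpha are illustrative, not taken from this work):

```python
from math import comb

def pairwise_test_burden(n_loci, alpha=0.05):
    """Number of pairwise SNP-SNP interaction tests in an exhaustive
    scan, and the Bonferroni-corrected per-test significance threshold."""
    n_tests = comb(n_loci, 2)          # every unordered pair of loci
    return n_tests, alpha / n_tests

# A typical genotyping panel of 500,000 SNPs:
tests, per_test_alpha = pairwise_test_burden(500_000)
print(f"{tests:,} pairwise tests, per-test alpha = {per_test_alpha:.1e}")
```

With half a million loci, the scan requires roughly 1.25 × 10^11 tests and a per-test threshold on the order of 10^-13, which is why exhaustive pairwise scans quickly become statistically and computationally insurmountable.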

    Using gene and microRNA expression in the human airway for lung cancer diagnosis

    Full text link
    Lung cancer surpasses all other causes of cancer-related deaths worldwide. Gene-expression microarrays have shown that differences in the cytologically normal bronchial airway can distinguish between patients with and without lung cancer. In research reported here, we have used microRNA expression in bronchial epithelium and gene expression in nasal epithelium to advance biological understanding of the lung-cancer "field of injury" and develop new biomarkers for lung cancer diagnosis. MicroRNAs are known to mediate the airway response to tobacco smoke exposure but their role in the lung-cancer-associated field of injury was previously unknown. Microarrays can measure microRNA expression; however, they are probe-based and limited to detecting annotated microRNAs. MicroRNA sequencing, on the other hand, allows the identification of novel microRNAs that may play important biological roles. We have used microRNA sequencing to discover novel microRNAs in the bronchial epithelium. One of the predicted microRNAs, now known as miR-4423, is associated with lung cancer and airway development. This finding demonstrates for the first time a microRNA expression change associated with the lung-cancer field of injury and microRNA mediation of gene expression changes within that field. The National Lung Screening Trial showed that screening high-risk smokers using CT scans decreases lung-cancer-associated mortality. Nodules were detected in over 20% of participants; however, the overwhelming majority of screening-detected nodules were non-malignant. We therefore need biomarkers to determine which screening-detected nodules are benign and do not require further invasive testing. Given that the lung-cancer-associated field of injury extends to the bronchial epithelium, our group hypothesized that the field of injury may extend farther up in the airway. Using gene expression microarrays, we have identified a nasal epithelium gene-expression signature associated with lung cancer. 
Using samples from the bronchial epithelium and the nasal epithelium, we have established that there is a common lung-cancer-associated gene-expression signature throughout the airway. In addition, we have developed a nasal epithelium gene-expression biomarker for lung cancer together with a clinico-genomic classifier that includes both clinical factors and gene expression. Our data suggest that gene-expression profiling in nasal epithelium might serve as a non-invasive approach for lung cancer diagnosis and screening.
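The clinico-genomic classifier mentioned above combines clinical covariates and expression features in a single model. A toy sketch of that idea follows; all weights, features, and scalings here are hypothetical placeholders, not the published classifier:

```python
import math

def clinico_genomic_score(clinical, expression, w_clin, w_expr, bias=0.0):
    """Toy clinico-genomic classifier: a logistic model over the
    concatenation of clinical covariates and gene-expression values.
    Weights are illustrative placeholders, not fitted coefficients."""
    z = bias
    z += sum(w * x for w, x in zip(w_clin, clinical))      # clinical part
    z += sum(w * x for w, x in zip(w_expr, expression))    # genomic part
    return 1.0 / (1.0 + math.exp(-z))                      # P(malignant)

# Hypothetical patient: scaled [age/100, pack_years/100] plus two gene z-scores
p = clinico_genomic_score([0.65, 0.40], [1.2, -0.3],
                          w_clin=[1.5, 2.0], w_expr=[0.8, -1.1], bias=-2.0)
```

In practice such a model would be fit to labeled cohorts and the genomic part would use many more genes; the point is only that clinical and expression features enter one shared decision function.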

    Simulating and Generating pre-miRNA using Variational Auto-Encoders

    Get PDF

    Opportunities and obstacles for deep learning in biology and medicine

    Get PDF
    Deep learning describes a class of machine learning algorithms that are capable of combining raw inputs into layers of intermediate features. These algorithms have recently shown impressive results across a variety of domains. Biology and medicine are data-rich disciplines, but the data are complex and often ill-understood. Hence, deep learning techniques may be particularly well suited to solve problems in these fields. We examine applications of deep learning to a variety of biomedical problems (patient classification, fundamental biological processes, and treatment of patients) and discuss whether deep learning will be able to transform these tasks or whether the biomedical sphere poses unique challenges. Following from an extensive literature review, we find that deep learning has yet to revolutionize biomedicine or definitively resolve any of the most pressing challenges in the field, but promising advances have been made on the prior state of the art. Even though improvements over previous baselines have been modest in general, the recent progress indicates that deep learning methods will provide valuable means for speeding up or aiding human investigation. Though progress has been made linking a specific neural network's prediction to input features, understanding how users should interpret these models to make testable hypotheses about the system under study remains an open challenge. Furthermore, the limited amount of labelled data for training presents problems in some domains, as do legal and privacy constraints on work with sensitive health records. Nonetheless, we foresee deep learning enabling changes at both bench and bedside with the potential to transform several areas of biology and medicine.
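The opening definition (raw inputs combined into layers of intermediate features) can be illustrated with a minimal forward pass. The weights below are arbitrary numbers chosen only to make the sketch runnable; a real network learns them from data:

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def dense(x, weights, biases):
    """One fully connected layer: each output is a weighted sum plus bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def tiny_mlp(x):
    """Two-layer network: raw inputs -> intermediate features -> score."""
    h = relu(dense(x, [[0.5, -0.2], [0.1, 0.9]], [0.0, 0.1]))  # hidden features
    (out,) = dense(h, [[1.0, -1.0]], [0.0])                    # final combination
    return 1.0 / (1.0 + math.exp(-out))                        # sigmoid output

score = tiny_mlp([0.3, 0.7])
```

Stacking more such layers is what lets deep models build progressively more abstract features from raw measurements.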

    An Integrated, Module-based Biomarker Discovery Framework

    Get PDF
    Identification of biomarkers that contribute to complex human disorders is a principal and challenging task in computational biology. Prognostic biomarkers are useful for risk assessment of disease progression and patient stratification. Since treatment plans often hinge on patient stratification, better disease subtyping has the potential to significantly improve survival for patients. Additionally, a thorough understanding of the roles of biomarkers in cancer pathways facilitates insights into complex disease formation, and provides potential druggable targets in the pathways. Many statistical methods have been applied toward biomarker discovery, often combining feature selection with classification methods. Traditional approaches are mainly concerned with statistical significance and fail to consider the clinical relevance of the selected biomarkers. Two additional problems impede meaningful biomarker discovery: gene multiplicity (several maximally predictive solutions exist) and instability (inconsistent gene sets from different experiments or cross-validation runs). Motivated by the need for a more biologically informed, stable biomarker discovery method, I introduce an integrated module-based biomarker discovery framework for analyzing high-throughput genomic disease data. The proposed framework addresses the aforementioned challenges in three components. First, a recursive spectral clustering algorithm specifically tailored toward high-dimensional, heterogeneous data (ReKS) is developed to partition genes into clusters that are treated as single entities for subsequent analysis. Next, the problems of gene multiplicity and instability are addressed through a group variable selection algorithm (T-ReCS) based on local causal discovery methods. Guided by the tree-like partition created by the clustering algorithm, this algorithm selects gene clusters that are predictive of a clinical outcome.
We demonstrate that the group feature selection method facilitates the discovery of biologically relevant genes through their association with a statistically predictive driver. Finally, we elucidate the biological relevance of the biomarkers by leveraging available prior information to identify regulatory relationships between genes and between clusters, and deliver the information in the form of a user-friendly web server, mirConnX.
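The module-as-entity idea can be sketched simply: summarize each gene cluster into one profile, then keep the clusters whose summary tracks the clinical outcome. This stand-in uses mean summarization and a Pearson-correlation filter, not the actual ReKS spectral clustering or T-ReCS causal selection:

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def module_scores(expression, clusters):
    """Summarize each gene cluster by its mean expression per sample,
    so downstream selection treats the module as a single entity."""
    n_samples = len(next(iter(expression.values())))
    return {c: [sum(expression[g][i] for g in genes) / len(genes)
                for i in range(n_samples)]
            for c, genes in clusters.items()}

def select_modules(expression, clusters, outcome, threshold=0.8):
    """Keep clusters whose summary profile correlates with the outcome."""
    scores = module_scores(expression, clusters)
    return [c for c, s in scores.items()
            if abs(pearson(s, outcome)) >= threshold]

# Toy data: module m1 rises with the outcome, m2 does not.
expr = {"g1": [1.0, 2.0, 3.0, 4.0],
        "g2": [1.1, 2.1, 2.9, 4.2],
        "g3": [5.0, 1.0, 4.0, 2.0]}
clusters = {"m1": ["g1", "g2"], "m2": ["g3"]}
outcome = [0.0, 0.0, 1.0, 1.0]
selected = select_modules(expr, clusters, outcome)  # m1 survives the filter
```

Selecting whole modules rather than single genes is what gives this style of method its stability: any gene in a selected cluster can stand in for its neighbours without changing the selected set.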

    Learning the Non-Coding Genome

    Get PDF
    The interpretation of the non-coding genome still constitutes a major challenge in the application of whole-genome sequencing. For example, disease- and trait-associated variants represent a tiny minority of all known genetic variations, but millions of putatively neutral sites can be identified. In this context, machine learning (ML) methods for predicting disease-associated non-coding variants are faced with a chicken-and-egg problem: such variants cannot be easily found without ML, but ML cannot be applied efficiently until a sufficient number of instances have been found. Recent ML-based methods for variant prediction do not adopt specific imbalance-aware learning techniques to deal with the imbalanced data that naturally arise in several genome-wide variant scoring problems, resulting in relatively poor performance with reduced sensitivity and precision. In this work, I present an ML algorithm, called hyperSMURF, that adopts imbalance-aware learning strategies based on resampling techniques and a hyper-ensemble approach, and is able to handle extremely imbalanced datasets. It outperforms previous methods in the context of non-coding variants associated with Mendelian diseases or complex diseases. I show that imbalance-aware ML is a key issue for the design of robust and accurate prediction algorithms. Open-source implementations of hyperSMURF are available in R and Java, such that it can be applied effectively in other scientific projects to discover disease-associated variants among millions of neutral sites from whole-genome sequencing. In addition, the algorithm was used to create a new pathogenicity score for regulatory Mendelian mutations (the ReMM score), which is significantly better than other commonly used scores at ranking regulatory variants from rare genetic disorders. The score is integrated into Genomiser, an analysis framework that goes beyond scoring the relevance of variation in the non-coding genome.
The tool is able to associate regulatory variants with specific Mendelian diseases. Genomiser scores variants using pathogenicity scores, such as the ReMM score for non-coding variants, and combines them with allele frequency, regulatory sequences, chromosomal topological domains, and phenotypic relevance to discover variants associated with specific Mendelian disorders. Overall, Genomiser is able to identify causal regulatory variants, allowing effective detection and discovery of regulatory variants in Mendelian disease.
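The resampling-plus-hyper-ensemble strategy can be sketched in a few lines. This is a simplified stand-in (duplication oversampling, even partitioning of the majority class, and a nearest-centroid base learner) rather than hyperSMURF itself, which uses SMOTE oversampling and random forests:

```python
import random

def train_centroid(pos, neg):
    """Trivial base learner: score a point by relative squared distance
    to the positive and negative class centroids (higher = more positive)."""
    cp = [sum(c) / len(pos) for c in zip(*pos)]
    cn = [sum(c) / len(neg) for c in zip(*neg)]
    def score(x):
        dp = sum((a - b) ** 2 for a, b in zip(x, cp))
        dn = sum((a - b) ** 2 for a, b in zip(x, cn))
        return dn / (dp + dn + 1e-12)
    return score

def hyper_ensemble(pos, neg, n_parts=5, oversample=2, seed=0):
    """Imbalance-aware hyper-ensemble in the spirit of hyperSMURF:
    oversample the rare positives, split the large negative class into
    n_parts partitions, train one model per partition on a far more
    balanced subset, and average the scores."""
    rng = random.Random(seed)
    models = []
    for part in (neg[i::n_parts] for i in range(n_parts)):
        boosted = pos * oversample + [rng.choice(pos) for _ in pos]
        models.append(train_centroid(boosted, part))
    return lambda x: sum(m(x) for m in models) / len(models)
```

Because every partition sees all (oversampled) positives but only a slice of the negatives, no single learner is swamped by the majority class, and averaging across the ensemble recovers coverage of the full negative set.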

    Machine Learning Approaches for Healthcare Analysis

    Get PDF
    Machine learning (ML) is a division of artificial intelligence that teaches computers how to discover difficult-to-distinguish patterns from huge or complex data sets and learn from previous cases by utilizing a range of statistical, probabilistic, data-processing, and optimization methods. Nowadays, ML plays a vital role in many fields, such as finance, self-driving cars, image processing, medicine, and speech recognition. In healthcare, ML has been used in applications such as the detection, prognosis, diagnosis, and treatment of diseases due to its capability to handle large data. Moreover, ML has exceptional abilities to predict disease by uncovering patterns from medical datasets. Machine learning and deep learning are better suited for analyzing medical datasets than traditional methods because of the nature of these datasets: they are mostly large, complex, heterogeneous data coming from different sources, requiring more efficient computational techniques to handle them. This dissertation presents several machine-learning techniques to tackle medical issues such as data imbalance, classification and tumor-stage upgrading, and multi-omics integration. In the second chapter, we introduce a novel method to handle the class-imbalance dilemma, a common issue in bioinformatics datasets. In class-imbalanced data, the number of samples in each class is unequal. Since most data sets contain usual versus unusual cases, e.g., cancer versus normal or miRNAs versus other non-coding RNA, the minority class with the fewest samples is the interesting class that contains the unusual cases. Learning models based on standard classifiers, such as the support vector machine (SVM), random forest, and k-NN, are usually biased towards the majority class, which means that the classifier is most likely to predict the samples from the interesting class inaccurately. Thus, handling class-imbalanced datasets has gained researchers' interest recently.
A combination of proper feature selection, a cost-sensitive classifier, and ensembling based on the random forest method (BCECSC-RF) is proposed to handle class-imbalanced data. Random class-balanced ensembles are built individually. Each ensemble is then used as a training pool to classify the remaining out-of-bag samples. Samples in each ensemble are classified using a cost-sensitive classifier incorporating random forest. A sample is assigned the class most often voted for across all of its appearances in the formed ensembles. A set of performance measurements, including a geometric measurement, suggests that the model can improve the classification of minority-class samples. In the third chapter, we introduce a novel study to predict the upgrading of the Gleason score on confirmatory magnetic resonance imaging-guided targeted biopsy (MRI-TB) of the prostate in candidates for active surveillance based on clinical features. MRI of the prostate is not accessible to many patients because of difficulty contacting patients and insurance denials, and African-American patients are disproportionately affected by these barriers during active surveillance [6,7]. Modeling clinical variables with advanced methods, such as machine learning, could allow us to manage patients in resource-limited environments with limited technological access. Upgrading to significant prostate cancer on MRI-TB was defined as upgrading to Gleason 3+4 (definition 1, DF1) and 4+3 (DF2). For upgrading prediction, the AdaBoost model was highly predictive of upgrading by DF1 (AUC 0.952), while for prediction of upgrading by DF2, the random forest model had a lower but excellent prediction performance (AUC 0.947). In the fourth chapter, we introduce a multi-omics data integration method to analyze multi-omics data for biomedical applications, including disease prediction, disease subtyping, biomarker prediction, and others.
Multi-omics data integration yields a richer understanding than separate omics data alone. Our method combines a gene similarity network (GSN) based on Uniform Manifold Approximation and Projection (UMAP) with convolutional neural networks (CNNs). The method utilizes UMAP to embed gene expression, DNA methylation, and copy number alteration (CNA) into a lower dimension, creating two-dimensional RGB images. Gene expression is used as a reference to construct the GSN, and the other omics data are then integrated with the gene expression for better prediction. We used CNNs to predict the Gleason score levels of prostate cancer patients and the tumor stage in breast cancer patients. The results show that UMAP as an embedding technique can better integrate multi-omics maps into the prediction model than SO
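The class-balanced ensemble scheme from the second chapter can be sketched as follows. This stand-in uses equal-sized random draws from the majority class and a 1-nearest-neighbour base learner; the dissertation's BCECSC-RF instead combines feature selection with a cost-sensitive random forest:

```python
import random

def one_nn(train):
    """Base classifier: 1-nearest-neighbour over (features, label) pairs."""
    def predict(x):
        return min(train, key=lambda t: sum((a - b) ** 2
                                            for a, b in zip(t[0], x)))[1]
    return predict

def balanced_ensemble(minority, majority, n_models=7, seed=0):
    """Class-balanced ensemble: each member trains on all minority
    samples (label 1) plus an equally sized random draw from the
    majority class (label 0); prediction is a majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        draw = rng.sample(majority, k=len(minority))
        train = [(x, 1) for x in minority] + [(x, 0) for x in draw]
        models.append(one_nn(train))
    return lambda x: int(sum(m(x) for m in models) > n_models / 2)
```

Each member sees a perfectly balanced training set, so no single classifier is biased toward the majority class, and the vote across members restores coverage of the majority samples left out of any one draw.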

    Unveiling the frontiers of deep learning: innovations shaping diverse domains

    Full text link
    Deep learning (DL) enables the development of computer models that are capable of learning, visualizing, optimizing, refining, and predicting data. In recent years, DL has been applied in a range of fields, including audio-visual data processing, agriculture, transportation prediction, natural language, biomedicine, disaster management, bioinformatics, drug design, genomics, face recognition, and ecology. To explore the current state of deep learning, it is necessary to investigate its latest developments and applications in these disciplines. However, the literature is lacking in exploring the applications of deep learning across all potential sectors. This paper thus extensively investigates the potential applications of deep learning across all major fields of study, as well as the associated benefits and challenges. As evidenced in the literature, DL exhibits accuracy in prediction and analysis, making it a powerful computational tool, and it has the ability to self-optimize, making it effective at processing data with no prior training. At the same time, deep learning necessitates massive amounts of data for effective analysis and processing. To handle the challenge of compiling huge amounts of medical, scientific, healthcare, and environmental data for use in deep learning, gated architectures such as LSTMs and GRUs can be utilized. For multimodal learning, shared neurons in the neural network for all activities and specialized neurons for particular tasks are necessary.