
    Gene Expression Analysis Methods on Microarray Data: A Review

    In recent years, a new type of experiment has been changing the way that biologists and other specialists analyze many problems. These are called high-throughput experiments, and their main difference from those performed some years ago lies chiefly in the quantity of data they produce. Thanks to the technology known generically as microarrays, it is now possible to study in a single experiment the behavior of all the genes of an organism under different conditions. The data generated by these experiments may comprise from thousands to millions of variables, and they pose many challenges to the scientists who have to analyze them. Many of these challenges are of a statistical nature and will be the focus of this review. Many types of microarrays have been developed to answer different biological questions, and some of them are explained later. For the sake of simplicity, we start with the best-known type: expression microarrays.

    Multimodal Data Fusion and Quantitative Analysis for Medical Applications

    Medical big data is not only enormous in size but also heterogeneous and complex in structure, which makes it difficult for conventional systems or algorithms to process. These heterogeneous medical data include imaging data (e.g., Positron Emission Tomography (PET), Computerized Tomography (CT), and Magnetic Resonance Imaging (MRI)) and non-imaging data (e.g., laboratory biomarkers, electronic medical records, and hand-written doctor notes). Multimodal data fusion is an emerging field that addresses this challenge, aiming to process and analyze complex, diverse, and heterogeneous multimodal data. Fusion algorithms hold great potential for medical data analysis by 1) taking advantage of complementary information from different sources (such as the functional-structural complementarity of PET/CT images) and 2) exploiting consensus information that reflects the intrinsic essence of the data (such as the genetic essence underlying medical imaging and clinical symptoms). Multimodal data fusion thus benefits a wide range of quantitative medical applications, including personalized patient care, better treatment planning, and preventive public health. Although there has been extensive research on computational approaches to multimodal fusion, three major challenges remain, which can be summarized as feature-level, information-level, and knowledge-level fusion:
    • Feature-level fusion. The first challenge is to mine multimodal biomarkers from high-dimensional, small-sample multimodal medical datasets, which hinder the effective discovery of informative multimodal biomarkers. Efficient dimension-reduction algorithms are required to alleviate the "curse of dimensionality" and to meet the criteria for discovering interpretable, relevant, non-redundant, and generalizable multimodal biomarkers.
    • Information-level fusion. The second challenge is to exploit and interpret inter-modal and intra-modal information for precise clinical decisions. Although radiomics and multi-branch deep learning have been used for implicit information fusion under label supervision, methods that explicitly explore inter-modal relationships in medical applications are lacking. Unsupervised multimodal learning can mine inter-modal relationships, reduce the reliance on labor-intensive labeled data, and explore potentially undiscovered biomarkers; however, mining discriminative information without label supervision remains an open challenge. Furthermore, interpreting complex non-linear cross-modal associations, especially in deep multimodal learning, is another critical challenge in information-level fusion, which hinders the exploration of multimodal interactions in disease mechanisms.
    • Knowledge-level fusion. The third challenge is quantitative knowledge distillation from multi-focus regions in medical imaging. Although characterizing imaging features of single lesions using either feature engineering or deep learning has been investigated in recent years, both approaches neglect inter-region spatial relationships. A topological profiling tool for multi-focus regions is therefore in high demand, yet missing from current feature-engineering and deep learning methods. Incorporating domain knowledge along with knowledge distilled from multi-focus regions is a further challenge in knowledge-level fusion.
    To address these three challenges, this thesis provides a multi-level fusion framework for multimodal biomarker mining, multimodal deep learning, and knowledge distillation from multi-focus regions. Specifically, our major contributions include:
    • To address the challenges in feature-level fusion, we propose an Integrative Multimodal Biomarker Mining framework to select interpretable, relevant, non-redundant, and generalizable multimodal biomarkers from high-dimensional, small-sample imaging and non-imaging data for diagnostic and prognostic applications. The feature-selection criteria of representativeness, robustness, discriminability, and non-redundancy are addressed by consensus clustering, a Wilcoxon filter, sequential forward selection, and correlation analysis, respectively. The SHapley Additive exPlanations (SHAP) method and nomograms are employed to further enhance feature interpretability in machine learning models.
    • To address the challenges in information-level fusion, we propose an Interpretable Deep Correlational Fusion framework based on canonical correlation analysis (CCA) for 1) cohesive multimodal fusion of medical imaging and non-imaging data and 2) interpretation of complex non-linear cross-modal associations. Two novel loss functions are proposed to optimize the discovery of informative multimodal representations in both supervised and unsupervised deep learning by jointly learning inter-modal consensus and intra-modal discriminative information. An interpretation module is proposed to decipher complex non-linear cross-modal associations by leveraging interpretation methods from both deep learning and multimodal consensus learning.
    • To address the challenges in knowledge-level fusion, we propose a Dynamic Topological Analysis (DTA) framework based on persistent homology for knowledge distillation from inter-connected multi-focus regions in medical imaging and for the incorporation of domain knowledge. Unlike conventional feature engineering and deep learning, our DTA framework explicitly quantifies inter-region topological relationships, including global-level geometric structure and community-level clusters. A K-simplex Community Graph is proposed to construct the dynamic community graph representing community-level multi-scale graph structure. The constructed dynamic graph is then tracked with a novel Decomposed Persistence algorithm. Domain knowledge is incorporated into the Adaptive Community Profile, which summarizes the tracked multi-scale community topology together with customizable, clinically important factors.
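The correlational fusion described above builds on canonical correlation analysis. As a hedged illustration of the underlying idea (classical linear CCA via an SVD of the whitened cross-covariance, on synthetic data; the thesis proposes deep, loss-augmented variants, which this sketch does not reproduce), two modalities sharing one latent factor can be aligned as follows:

```python
import numpy as np

def cca(X, Y, reg=1e-6):
    """Classical linear CCA: SVD of the whitened cross-covariance.

    Returns canonical weights for X and Y and the canonical
    correlations (singular values, sorted in decreasing order)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])  # regularized covariances
    Syy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)

    def inv_sqrt(S):
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    K = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    U, s, Vt = np.linalg.svd(K)
    Wx = inv_sqrt(Sxx) @ U       # canonical weights for X
    Wy = inv_sqrt(Syy) @ Vt.T    # canonical weights for Y
    return Wx, Wy, s

# toy "imaging" and "non-imaging" modalities sharing one latent factor z
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 1))
X = z @ rng.normal(size=(1, 5)) + 0.1 * rng.normal(size=(200, 5))
Y = z @ rng.normal(size=(1, 4)) + 0.1 * rng.normal(size=(200, 4))
Wx, Wy, corrs = cca(X, Y)
print(round(float(corrs[0]), 3))  # leading canonical correlation, close to 1
```

The leading canonical pair recovers the shared factor; deep CCA variants replace the linear projections with learned networks while keeping this correlation objective.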

    Network-Based Biomarker Discovery: Development of Prognostic Biomarkers for Personalized Medicine by Integrating Data and Prior Knowledge

    Advances in genome science and technology offer a deeper understanding of biology while at the same time improving the practice of medicine. Expression profiling of some diseases, such as cancer, allows the identification of marker genes that can diagnose a disease or predict future disease outcomes. Marker genes (biomarkers) are selected by scoring how well their expression levels discriminate between different classes of disease or between groups of patients with different clinical outcomes (e.g., therapy response or survival time). A current challenge is to identify new markers that are directly related to the underlying disease mechanism.
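The marker-gene scoring described above can be sketched in a few lines. This is a generic, illustrative rank-based score (|AUC − 0.5|, equivalent up to scaling to the Mann-Whitney U statistic), not the specific scoring scheme of this work; the expression matrix and class labels are synthetic:

```python
import numpy as np

def marker_scores(expr, labels):
    """Score each gene by how well its expression separates two classes.

    expr: (n_samples, n_genes) matrix; labels: binary array.
    Score = |AUC - 0.5|, where AUC is the fraction of (case, control)
    pairs in which the case sample has higher expression."""
    pos, neg = expr[labels == 1], expr[labels == 0]
    scores = np.empty(expr.shape[1])
    for g in range(expr.shape[1]):
        auc = (pos[:, g][:, None] > neg[None, :, g]).mean()
        scores[g] = abs(auc - 0.5)
    return scores

rng = np.random.default_rng(1)
labels = np.repeat([0, 1], 30)
expr = rng.normal(size=(60, 100))   # 100 genes, 60 patients
expr[labels == 1, 0] += 2.0         # gene 0 is a true marker
print(int(np.argmax(marker_scores(expr, labels))))  # → 0
```

Because the score depends only on ranks, it is robust to outliers and monotone normalization differences, which matters when comparing expression levels across patient cohorts.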

    Statistical learning methods for multi-omics data integration in dimension reduction, supervised and unsupervised machine learning

    Over the decades, statistical learning techniques such as supervised learning, unsupervised learning, and dimension reduction have played groundbreaking roles in important tasks in biomedical research. More recently, multi-omics data integration analysis has become increasingly popular as a way to answer many otherwise intractable biomedical questions, to improve statistical power by exploiting large sample sizes and different types of omics data, and to replicate individual experiments for validation. This dissertation covers several analytic methods and frameworks for tackling practical problems in multi-omics data integration analysis.
    Supervised prediction rules have been widely applied to high-throughput omics data to predict disease diagnosis, prognosis, or survival risk. The top scoring pair (TSP) algorithm is a supervised discriminant rule that applies a robust, simple rank-based algorithm to identify rank-altered gene pairs in case/control classes. TSP usually suffers greatly reduced accuracy in inter-study prediction (i.e., when the prediction model is built on a training study and applied to an independent test study). In the first part, we introduce a MetaTSP algorithm that combines multiple transcriptomic studies and generates a robust prediction model applicable to independent test studies.
    One important objective of omics data analysis is clustering unlabeled patients to identify meaningful disease subtypes. In the second part, we propose a group-structured integrative clustering method that incorporates a sparse overlapping group lasso technique and tight clustering via regularization to integrate inter-omics regulation flow and to encourage outlier samples to scatter away from tight clusters. We show with two real examples and simulated data that the proposed method improves on existing integrative clustering in accuracy and biological interpretation, and is able to generate coherent tight clusters.
    Principal component analysis (PCA) is commonly used for projection to a low-dimensional space for visualization. In the third part, we introduce two meta-analysis frameworks for PCA (Meta-PCA) for analyzing multiple high-dimensional studies in a common principal component space. Meta-PCA identifies the meta principal component (Meta-PC) space (1) by decomposing the sum of variances and (2) by minimizing the sum of squared cosines. Applications to various simulated data show that Meta-PCA identifies the true principal component space well and is robust to noise features and outlier samples. We also propose sparse Meta-PCA, which penalizes the principal components so that only significant principal component projections are retained. In several simulated and real data applications, we found Meta-PCA effective at detecting significant transcriptomic features and at revealing visual patterns in multi-omics data sets. In the future, the success of data integration analysis will play an important role in revealing the molecular and cellular processes underlying multiple data types, and will facilitate disease subtype discovery and characterization, improving hypothesis generation towards precision medicine and potentially advancing public health research.
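The top scoring pair rule mentioned above is simple enough to sketch directly. This is the basic single-study TSP (not the MetaTSP extension proposed in the dissertation): for every gene pair it compares within-sample orderings between the two classes and keeps the pair whose reversal probability differs most. Data here are synthetic:

```python
import numpy as np
from itertools import combinations

def top_scoring_pair(expr, labels):
    """Top Scoring Pair: find the gene pair (i, j) maximizing
    |P(X_i < X_j | class 0) - P(X_i < X_j | class 1)|.

    Comparing only within-sample orderings makes the rule invariant
    to monotone per-sample normalization, which is why rank-based
    rules transfer across studies better than intensity thresholds."""
    best, best_pair = -1.0, None
    c0, c1 = expr[labels == 0], expr[labels == 1]
    for i, j in combinations(range(expr.shape[1]), 2):
        p0 = (c0[:, i] < c0[:, j]).mean()   # reversal rate in class 0
        p1 = (c1[:, i] < c1[:, j]).mean()   # reversal rate in class 1
        if abs(p0 - p1) > best:
            best, best_pair = abs(p0 - p1), (i, j)
    return best_pair, best

rng = np.random.default_rng(2)
labels = np.repeat([0, 1], 25)
expr = rng.normal(size=(50, 20))
# shift gene 3 down in class 0 and up in class 1, so its rank
# relative to the other genes flips between the classes
expr[labels == 0, 3] -= 2.0
expr[labels == 1, 3] += 2.0
pair, score = top_scoring_pair(expr, labels)
print(pair, round(score, 2))  # the winning pair involves gene 3
```

A new sample is then classified simply by checking whether gene i or gene j is expressed higher, with no fitted thresholds to transport between studies.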

    Information Theoretical Prediction of Alternative Splicing with Application to Type-2 Diabetes Mellitus.

    For basic biomedical research, it is of particular interest to determine the activity of genes in the different tissues of an organism. Gene activity is determined here by the quantity of a gene's direct products, its transcripts. Transcript abundance is quantified by experimental technologies and is referred to as gene expression. A gene does not always produce just one transcript, however; it can produce several transcripts through parallel encoding, so-called alternative splicing. Such a mechanism is necessary to explain the large number of proteins relative to the comparatively small number of genes: 25,000 genes in humans versus 20,000 in the nematode Caenorhabditis elegans. Alternative splicing controls the expression of different transcript variants under different conditions. It is not surprising that even small errors in splicing can have pathological effects, i.e., can cause disease. Since organisms such as humans possess roughly 25,000 different genes, high-throughput data-generation methods had to be developed for the analysis of global gene expression; with alternative splicing, each of these genes corresponds to several transcripts. Only recently has it become possible to generate the necessary amounts of data, through technologies such as microarrays and next-generation sequencing. Data analysis methods must keep pace with this technological progress in order to address new research questions. In this thesis, a software pipeline is presented for the analysis of alternative splicing as well as differential gene expression. It was developed and implemented in the statistical programming language R with BioConductor and comprises the steps of quality control, preprocessing, statistical analysis of expression changes, and gene set analysis.
    For the detection of alternative splicing, information theory is introduced into the field of gene expression. The proposed solution consists of an extension of Shannon entropy to the detection of altered transcript abundances and is called ARH (alternative splicing robust prediction by entropy). The utility of the developed methods and implementations is demonstrated on data for type-2 diabetes mellitus. Using data integration and meta-analysis of different data sources, marker genes are determined with a focus on differential expression. Alternative splicing is then investigated with particular attention to the marker genes and functional gene sets, i.e., metabolic pathways.
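To make the entropy idea concrete, here is a minimal, illustrative entropy-based splicing score in the spirit of ARH, not the thesis's exact formula: per-exon log-ratios between two conditions are compared with the gene-level log-ratio, and the Shannon entropy of the normalized deviations indicates whether an expression change is spread over all exons or concentrated on a few, as expected for a splicing event:

```python
import numpy as np

def splicing_entropy(exon_a, exon_b):
    """Illustrative entropy-based splicing score (sketch only).

    Deviations of per-exon log-ratios from the gene-level log-ratio
    are normalized into a distribution; its Shannon entropy (scaled
    to [0, 1]) drops when the change concentrates on few exons."""
    lr = np.log2(exon_b) - np.log2(exon_a)   # per-exon log-ratio
    dev = np.abs(lr - lr.mean())             # splicing deviation
    p = dev / dev.sum()                      # normalize to probabilities
    with np.errstate(divide="ignore", invalid="ignore"):
        h = -np.nansum(p * np.log2(p))       # Shannon entropy
    return h / np.log2(len(p))               # scale by max entropy

# gene with 8 exons; in condition B only exon 2 is skipped
a = np.full(8, 100.0)
b = np.full(8, 100.0)
b[2] = 10.0
print(round(splicing_entropy(a, b), 3))  # → 0.801
```

A uniform expression change across all exons yields maximal entropy, while a change driven by a single exon pulls the score down, flagging a candidate splicing event.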

    Cell Type-specific Analysis of Human Interactome and Transcriptome

    Cells are the fundamental building blocks of complex tissues in higher-order organisms. These cells take different forms and shapes to perform a broad range of functions. What makes a cell uniquely suited to perform a task, however, is not well understood; neither is the defining characteristic that groups similar cells together into a cell type. Even for known cell types, the underlying pathways that mediate cell type-specific functionality are not readily available. These functions, in turn, contribute to cell type-specific susceptibility in various disorders.

    Inference of biomolecular interactions from sequence data

    This thesis describes our work on the inference of biomolecular interactions from sequence data. In particular, the first part of the thesis focuses on proteins and describes computational methods that we have developed for the inference of both intra- and inter-protein interactions from genomic data. The second part of the thesis centers on protein-RNA interactions and describes a method for the inference of binding motifs of RNA-binding proteins from high-throughput sequencing data. The thesis is organized as follows. In the first part, we start by introducing a novel mathematical model for the characterization of protein sequences (chapter 1). We then show how, using genomic data, this model can be successfully applied to two different problems, namely to the inference of interacting amino acid residues in the tertiary structure of protein domains (chapter 2) and to the prediction of protein-protein interactions in large paralogous protein families (chapters 3 and 4). We conclude the first part with a discussion of potential extensions and generalizations of the methods presented (chapter 5). In the second part of this thesis, we first give a general introduction to RNA-binding proteins (chapter 6). We then describe a novel experimental method for the genome-wide identification of target RNAs of RNA-binding proteins and show how this method can be used to infer the binding motifs of RNA-binding proteins (chapter 7). Finally, we discuss a potential mechanism by which KH domain-containing RNA-binding proteins could achieve their specificity of interaction with target RNAs, and conclude the second part of the thesis by proposing a novel type of motif finding algorithm tailored to the inference of their recognition elements (chapter 8).
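The core signal behind inferring interacting residues from genomic data is that residues in contact tend to show correlated (compensatory) substitutions across a protein family. A minimal illustration on a toy alignment, using simple column-pair mutual information; the thesis's actual model is more elaborate (handling phylogeny and indirect couplings), so this is a sketch of the principle only:

```python
import numpy as np
from itertools import combinations
from collections import Counter

def column_mi(msa, i, j):
    """Mutual information (bits) between columns i and j of a
    multiple sequence alignment -- a basic covariation score."""
    n = len(msa)
    pi = Counter(s[i] for s in msa)            # marginal counts, column i
    pj = Counter(s[j] for s in msa)            # marginal counts, column j
    pij = Counter((s[i], s[j]) for s in msa)   # joint counts
    mi = 0.0
    for (a, b), c in pij.items():
        # c/n * log2( joint / product of marginals )
        mi += (c / n) * np.log2(c * n / (pi[a] * pj[b]))
    return mi

# toy alignment: columns 0 and 2 co-vary (A..D vs K..E, compensatory),
# column 1 varies independently of both
msa = ["ARD", "ACD", "KRE", "KCE", "ARD", "KCE", "ACD", "KRE"]
scores = {(i, j): column_mi(msa, i, j) for i, j in combinations(range(3), 2)}
best = max(scores, key=scores.get)
print(best)  # → (0, 2)
```

Raw MI also picks up indirect correlations (if i covaries with j and j with k, then i and k correlate too), which is precisely the problem that global statistical models of the kind developed in the first part of the thesis are designed to disentangle.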

    Hybridization biases of microarray expression data - A model-based analysis of RNA quality and sequence effects

    Modern high-throughput technologies like DNA microarrays are powerful tools that are widely used in biomedical research. They target a variety of genomics applications, ranging from gene expression profiling and DNA genotyping to gene regulation studies. However, the recent discovery of false positives among prominent research findings indicates a lack of awareness or understanding of the non-biological factors that negatively affect the accuracy of data produced with these technologies. The aim of this thesis is to study the origins, effects, and potential correction methods for selected methodical biases in microarray data. The two-species Langmuir model serves as the basic physicochemical model of microarray hybridization, describing the fluorescence signal response of oligonucleotide probes. The so-called hook method makes it possible to estimate essential model parameters and to compute summary parameters characterizing a particular microarray sample. We show that this method can be applied successfully to various types of microarrays that share the same basic mechanism of multiplexed nucleic acid hybridization. Using appropriate modifications of the model, we study RNA quality and sequence effects in publicly available data from Affymetrix GeneChip expression arrays. Varying amounts of hybridized RNA result in systematic changes of the raw intensity signals and of indicator variables computed from them. Varying RNA quality strongly affects the intensity signals of probes located at the 3' end of transcripts. We develop new methods that help assess the RNA quality of a particular microarray sample. A new metric for determining RNA quality, the degradation index, is proposed, which improves on previous RNA quality metrics. Furthermore, we present a method for the correction of the 3' intensity bias. These functionalities have been implemented in the freely available program package AffyRNADegradation.
    We show that microarray probe signals are affected by sequence effects, which we study systematically using positional-dependent nearest-neighbor models. Analysis of the resulting sensitivity profiles reveals that specific sequence patterns, such as runs of guanines at the solution end of the probes, have a strong impact on the probe signals. The sequence effects differ between chip and target types, probe types, and hybridization modes. Theoretical and practical solutions for the correction of the introduced sequence bias are provided. Assessment of RNA quality and sequence biases in a representative ensemble of over 8,000 available microarray samples reveals that RNA quality issues are prevalent: about 10% of the samples have critically low RNA quality. Sequence effects exhibit considerable variation within the investigated samples but have limited impact on the most common patterns in the expression space. Variations in RNA quality and quantity, in contrast, have a significant impact on the obtained expression measurements. These hybridization biases should be considered and controlled in every microarray experiment to ensure reliable results. Application of rigorous quality control and signal correction methods is strongly advised to avoid erroneous findings. Incremental refinement of physicochemical models is also a promising way to improve signal calibration, paralleled by the opportunity to better understand the fundamental processes in microarray hybridization.
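The two-species Langmuir model underlying the hook method can be sketched in a few lines. The rate constants and saturation intensity below are illustrative placeholders, not fitted values from the thesis; the point is the functional form: specific and non-specific targets compete for probe sites, and the signal saturates as total binding strength grows:

```python
import numpy as np

def langmuir_signal(c_specific, c_nonspecific, K_S=1.0, K_N=0.01, M=1e4):
    """Two-species Langmuir isotherm for probe fluorescence.

    X combines specific and non-specific binding strengths; the
    signal rises linearly in X at low occupancy and saturates at
    the maximum intensity M when the probe spots are filled."""
    X = K_S * c_specific + K_N * c_nonspecific
    return M * X / (1.0 + X)

conc = np.logspace(-3, 3, 7)                     # specific target concentration
signal = langmuir_signal(conc, c_nonspecific=1.0)
# on a log-log plot the response is linear at low concentration and
# flattens toward the saturation limit M at high concentration
print(signal.round(1))
```

This saturation is exactly why raw intensities cannot be read as linear expression measures at the high end, and why model-based summary parameters (as estimated by the hook method) are needed to calibrate probe signals.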

    Data driven approaches for investigating molecular heterogeneity of the brain

    It has been proposed that one of the clearest organizing principles of most sensory systems is the existence of parallel subcircuits and processing streams that form orderly, systematic mappings from stimulus space to neurons. Although the spatial heterogeneity of the early olfactory circuitry has long been recognized, we know comparatively little about the circuits that propagate sensory signals downstream. Investigating the potential modularity of the bulb's intrinsic circuits is a difficult task, as tracing the termination patterns of converging projections, as was done for the bulb's inputs, is not feasible. Thus, if such circuit motifs exist, their detection essentially relies on identifying differential gene expression, or "molecular signatures," that may demarcate functional subregions. With the arrival of comprehensive (whole-genome, cellular-resolution) datasets in biology and neuroscience, it is now possible to carry out large-scale investigations, and in particular to use the densely catalogued whole-genome expression maps of the Allen Brain Atlas for systematic study of the molecular topography of the olfactory bulb's intrinsic circuits. To address the challenges associated with high-throughput, high-dimensional datasets, a deep learning approach will form the backbone of our informatic pipeline. In the proposed work, we test the hypothesis that the bulb's intrinsic circuits are parceled into distinct, parallel modules that can be defined by genome-wide patterns of expression. In pursuit of this aim, our deep learning framework will facilitate the group registration of the mitral cell layers of ~50,000 in-situ olfactory bulb circuits to test this hypothesis.