    Accurate HLA type inference using a weighted similarity graph

    Background: The human leukocyte antigen (HLA) system contains many highly variable genes. HLA genes play an important role in the human immune system, and HLA gene matching is crucial for the success of human organ transplantation. Numerous studies have demonstrated that variation in HLA genes is associated with many autoimmune, inflammatory and infectious diseases. However, typing HLA genes by serology or PCR is time consuming and expensive, which limits large-scale studies involving HLA genes. Since single nucleotide polymorphism (SNP) genotype data are much easier and cheaper to obtain, accurate computational algorithms that infer HLA gene types from SNP genotype data are needed. The first step in inferring HLA types from SNP genotypes is to infer SNP haplotypes from the genotypes. However, for the same SNP genotype data set, the haplotype configurations inferred by different methods are usually inconsistent, and it is often difficult to decide which one is true. Results: In this paper, we design an accurate HLA gene type inference algorithm that uses SNP genotype data from pedigrees, the known HLA gene types of some individuals, and the relationship between inferred SNP haplotypes and HLA gene types. Given a set of haplotypes inferred from the genotypes of a population consisting of many pedigrees, the algorithm first constructs a weighted similarity graph based on a new haplotype similarity measure and derives constraint edges from the known HLA gene types. Based on the principle that different HLA gene alleles should have different background haplotypes, the algorithm searches for an optimal labeling of all haplotypes with unknown HLA gene types such that the total weight among haplotypes assigned the same HLA gene type is maximized. To deal with ambiguous haplotype solutions, we use a genetic algorithm to select haplotype configurations that tend to maximize the same optimization criterion. Our experiments on a previously typed subset of the HapMap data show that the algorithm is highly accurate, achieving an accuracy of 96% for gene HLA-A, 95% for HLA-B, 97% for HLA-C, 84% for HLA-DRB1, 98% for HLA-DQA1 and 97% for HLA-DQB1 in a leave-one-out test. Conclusions: Our algorithm can accurately infer HLA gene types from neighboring SNP genotype data. Compared with a recent approach on the same input data, our algorithm achieved a higher accuracy. The code of our algorithm is freely available upon request to the corresponding authors.
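
    The labeling criterion described in this abstract (maximize the total similarity weight among haplotypes that share an HLA type) can be illustrated with a small sketch. This is not the authors' algorithm: the similarity function below (fraction of matching SNP positions) is an assumed stand-in for the paper's unspecified "new haplotype similarity measure", the greedy assignment replaces the paper's optimal search, and constraint edges and the genetic algorithm are omitted. All function names are hypothetical.

```python
from itertools import combinations


def similarity(h1: str, h2: str) -> float:
    """Fraction of SNP positions at which two haplotypes agree.

    Assumed stand-in for the paper's similarity measure, which the
    abstract does not define.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have equal length")
    return sum(a == b for a, b in zip(h1, h2)) / len(h1)


def within_label_weight(haps, labels):
    """Total edge weight over all pairs of haplotypes sharing a label."""
    return sum(
        similarity(haps[i], haps[j])
        for i, j in combinations(range(len(haps)), 2)
        if labels[i] is not None and labels[i] == labels[j]
    )


def greedy_label(haps, known):
    """Greedily label untyped haplotypes (hypothetical simplification).

    known maps a haplotype index to its HLA allele. Each untyped
    haplotype receives the allele whose already-labeled haplotypes
    have the largest total similarity to it.
    """
    if not known:
        raise ValueError("at least one haplotype must have a known type")
    labels = [known.get(i) for i in range(len(haps))]
    for i in range(len(haps)):
        if labels[i] is not None:
            continue
        score = {}
        for j, lab in enumerate(labels):
            if j != i and lab is not None:
                score[lab] = score.get(lab, 0.0) + similarity(haps[i], haps[j])
        labels[i] = max(score, key=score.get)
    return labels


# Toy data: five 4-SNP haplotypes, two of which are typed.
haps = ["AAAA", "AAAT", "GGGG", "GGGC", "AAAA"]
known = {0: "A*01", 2: "A*02"}
labels = greedy_label(haps, known)
total = within_label_weight(haps, labels)
```

    A greedy pass depends on the order in which haplotypes are visited, which is one reason a global search such as the one the abstract describes may be preferable on real data.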

    Computational strategies for dissecting the high-dimensional complexity of adaptive immune repertoires

    The adaptive immune system recognizes antigens via an immense array of antigen-binding antibodies and T-cell receptors, the immune repertoire. The interrogation of immune repertoires is of high relevance for understanding the adaptive immune response in disease and infection (e.g., autoimmunity, cancer, HIV). Adaptive immune receptor repertoire sequencing (AIRR-seq) has driven the quantitative and molecular-level profiling of immune repertoires, thereby revealing the high-dimensional complexity of the immune receptor sequence landscape. Several methods for the computational and statistical analysis of large-scale AIRR-seq data have been developed to resolve immune repertoire complexity in order to understand the dynamics of adaptive immunity. Here, we review the current research on (i) diversity, (ii) clustering and network, (iii) phylogenetic and (iv) machine learning methods applied to dissect, quantify and compare the architecture, evolution, and specificity of immune repertoires. We summarize outstanding questions in computational immunology and propose future directions for systems immunology towards coupling AIRR-seq with the computational discovery of immunotherapeutics, vaccines, and immunodiagnostics. Comment: 27 pages, 2 figures

    Tissue-specific regulatory circuits reveal variable modular perturbations across complex diseases

    Mapping perturbed molecular circuits that underlie complex diseases remains a great challenge. We developed a comprehensive resource of 394 cell type- and tissue-specific gene regulatory networks for human, each specifying the genome-wide connectivity among transcription factors, enhancers, promoters and genes. Integration with 37 genome-wide association studies (GWASs) showed that disease-associated genetic variants, including variants that do not reach genome-wide significance, often perturb regulatory modules that are highly specific to disease-relevant cell types or tissues. Our resource opens the door to systematic analysis of regulatory programs across hundreds of human cell types and tissues.

    Unique networks: a method to identify disease-specific regulatory networks from microarray data

    This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. The survival of any organism is determined by the mechanisms triggered in response to the inputs it receives. The underlying mechanisms are described by graphical networks that can be inferred from different types of data, such as microarrays. Deriving robust and reliable networks is complicated by the structure of microarray data, which is characterized by bias, noise, and a discrepancy of several orders of magnitude between the number of genes and the number of samples. Researchers overcome this problem by integrating independent data sets and deriving the common mechanisms through consensus network analysis. Different conditions generate different inputs to the organism, which reacts by triggering different mechanisms that have both similarities and differences. Much effort has gone into identifying the commonalities under different conditions. Highlighting similarities, however, may overshadow the differences, which often identify the main characteristics of the triggered mechanisms. In this thesis we introduce the concept of a study-specific mechanism. We develop a pipeline to semi-automatically identify study-specific networks, called unique-networks, through a combination of a consensus approach, graphical similarities and network analysis. The main pipeline, called UNIP (Unique Networks Identification Pipeline), takes a set of independent studies, builds a gene regulatory network for each of them, calculates an adaptation of the sensitivity measure based on the networks' graphical similarities, applies clustering to group the studies that generate the most similar networks into study-clusters, and derives the consensus networks. Once each study-cluster is associated with a consensus network, we identify the links that appear only in the consensus network under consideration and not in the others (unique-connections). Using the genes involved in the unique-connections, we build Bayesian networks to derive the unique-networks. Finally, we use the inference tool to calculate the prediction accuracy of each gene across all studies to further refine the unique-networks. We validate the method biologically using several software tools and the literature. UNIP is first applied to a set of synthetic data perturbed with different levels of noise to study its performance and verify its reliability. Then, wheat under stress conditions and different types of cancer are explored. Finally, we develop a user-friendly interface to combine the set of studies using AND and NOT logic operators. Based on the findings, UNIP is a robust and reliable method for analysing large sets of transcriptomic data. It readily detects the main complex relationships between the transcriptional expression of genes specific to different conditions, and it highlights structures and nodes that could be potential targets for further research.
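
    The unique-connections step described in this abstract (links present in one consensus network and absent from all others) amounts to a set difference. The sketch below is a minimal illustration under assumptions, not UNIP itself: edges are represented as directed (regulator, target) tuples, and the function name, network names and gene names are hypothetical.

```python
def unique_connections(consensus):
    """Return, for each consensus network, the edges found in no other.

    consensus maps a study-cluster name to a set of edges. Edges are
    assumed to be directed (regulator, target) tuples; the abstract
    does not specify the representation.
    """
    result = {}
    for name, edges in consensus.items():
        # Union of the edges of every other consensus network.
        others = set().union(
            *(e for other, e in consensus.items() if other != name)
        )
        result[name] = edges - others
    return result


# Toy consensus networks for three hypothetical study-clusters.
consensus = {
    "cluster_1": {("g1", "g2"), ("g2", "g3"), ("g4", "g5")},
    "cluster_2": {("g1", "g2"), ("g6", "g7")},
    "cluster_3": {("g2", "g3"), ("g6", "g7"), ("g8", "g9")},
}
unique = unique_connections(consensus)
```

    In this toy input, the edge ("g6", "g7") is shared by two clusters and is therefore unique to neither.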

    Identifying markers of cell identity from single-cell omics data

    Single-cell omics approaches are the current frontier of computational method development in molecular biology and genetics. A single single-cell experiment provides sparse, high-dimensional data on tens of thousands of genes or hundreds of thousands of regulatory regions (i.e. features) in tens of thousands of cells (i.e. samples). These data provide researchers with an unprecedented opportunity to identify the genes and regulatory regions that determine and coordinate cell identity acquisition and maintenance. The most common strategy for identifying cell identity markers consists of clustering the cells and then identifying differential features between these clusters, assuming that cells within a cluster share the same identity. This assumption is, however, not guaranteed to hold, particularly for developmental data, where cells lie along a continuum and inferring cluster boundaries becomes non-trivial and potentially biologically arbitrary. In response, this thesis presents clustering-independent strategies for marker feature identification from single-cell omics data. The primary contribution of this thesis is SEMITONES, a linear regression-based method for marker feature identification from single-cell omics data. SEMITONES can identify markers or marker sets from diverse single-cell omics data types, identifies novel markers, and outperforms existing marker identification approaches. The thesis also describes how the identification of marker regulatory regions by SEMITONES enables the generation of novel hypotheses regarding gene regulation during cell identity acquisition. Lastly, the thesis describes the clustering-independent identification of novel marker genes for highly similar yet distinct neural progenitor cells in the Drosophila melanogaster central nervous system. Altogether, the thesis demonstrates how clustering-independent approaches aid the elucidation of as yet uncharacterised biological patterns from single-cell omics data.


    This manuscript is a technical memoir about my work at Stanford and Microsoft Research. Included are fundamental concepts central to machine learning and artificial intelligence, applications of these concepts, and stories behind their creation.

    Measures of epitope binding degeneracy from T cell receptor repertoires

    Adaptive immunity is driven by specific binding of hypervariable receptors to diverse molecular targets. The sequence diversity of receptors and that of targets are each known individually, but because multiple receptors can recognize the same target, a measure of the effective "functional" diversity of the human immune system has remained elusive. Here, we show that sequence near-coincidences within T cell receptors that bind specific epitopes provide a new window into this problem and allow the quantification of how binding probability covaries with sequence. We find that near-coincidence statistics within epitope-specific repertoires imply a measure of binding degeneracy to amino acid changes in receptor sequence that is consistent across disparate experiments. Paired data on both chains of the heterodimeric receptor are particularly revealing, since simultaneous near-coincidences are rare, and we show how they can be exploited to estimate the number of epitope responses that created the memory compartment. In addition, we find that paired-chain coincidences are strongly suppressed across donors with different human leukocyte antigens, which is evidence for a central role of antigen-driven selection in making paired-chain receptors public. These results demonstrate the power of coincidence analysis to reveal the sequence determinants of epitope binding in receptor repertoires.
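
    To make the idea of near-coincidence statistics concrete, the sketch below counts the fraction of sequence pairs in a repertoire that lie at each small distance from one another. It is an illustration under assumptions, not the paper's estimator: Hamming distance between equal-length sequences is assumed as the distance (the abstract does not name one), pairs of unequal length are counted in the denominator but never as coincidences, and the function names and toy sequences are hypothetical.

```python
from collections import Counter
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def coincidence_fractions(seqs, max_dist=2):
    """Fraction of all sequence pairs at each distance 0..max_dist.

    Distance 0 corresponds to exact coincidences; distances 1 and
    above correspond to near-coincidences. Pairs of unequal length
    are skipped (assumed convention for this sketch).
    """
    counts = Counter()
    n_pairs = 0
    for a, b in combinations(seqs, 2):
        n_pairs += 1
        if len(a) != len(b):
            continue
        d = hamming(a, b)
        if d <= max_dist:
            counts[d] += 1
    if n_pairs == 0:
        raise ValueError("need at least two sequences")
    return {d: counts[d] / n_pairs for d in range(max_dist + 1)}


# Toy repertoire of receptor-like amino acid strings (not real data).
seqs = ["CASSL", "CASSL", "CASSF", "CAWWF", "CASSLG"]
fractions = coincidence_fractions(seqs)
```

    Comparing how these fractions decay with distance inside an epitope-specific repertoire against a background repertoire is one way to see how binding probability covaries with sequence, in the spirit of the analysis the abstract describes.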

    Predicting Gene-Disease Associations with Knowledge Graph Embeddings over Multiple Ontologies

    Master's thesis, Bioinformatics and Computational Biology, Universidade de Lisboa, Faculdade de Ciências, 2021. There are still more than 1,400 Mendelian conditions whose molecular cause is unknown. In addition, almost all medical conditions are influenced in some way by human genetic variation. This challenge is also an opportunity to understand the mechanisms of diseases, thus allowing the development of better mitigation strategies and the discovery of diagnostic markers and therapeutic targets. Deciphering the link between genes and diseases is one of the most demanding tasks in biomedical research. Computational approaches to gene-disease association prediction can greatly accelerate this process, and recent developments that explore the scientific knowledge described in ontologies have achieved good results. State-of-the-art approaches that take advantage of ontologies or knowledge graphs for these predictions are typically based on semantic similarity measures that take only hierarchical relations into consideration. New developments in the area of knowledge graph embeddings support more powerful representations but are usually limited to a single ontology, which may be insufficient in multi-domain applications such as the prediction of gene-disease associations. This dissertation proposes a novel approach to gene-disease association prediction that explores both the Human Phenotype Ontology and the Gene Ontology, using knowledge graph embeddings to represent gene and disease features in a shared semantic space that covers both gene function and phenotypes. Our approach integrates different methods for building the shared semantic space, as well as multiple knowledge graph embedding algorithms and machine learning methods. The prediction performance was evaluated on curated gene-disease associations from DisGeNET and compared to classical semantic similarity measures. Our experiments demonstrate the value of employing knowledge graph embeddings based on random walks and highlight the need for closer integration of different ontologies.