95 research outputs found

    Comparative analysis of plant genomes through data integration

    When we started our research in 2008, several online resources for genomics existed, each with a different focus. TAIR (The Arabidopsis Information Resource) focuses on the plant model species Arabidopsis thaliana, and at that time offered little or no support for evolutionary or comparative genomics. Ensembl provided some basic tools and functions as a data warehouse, but it would only start incorporating plant genomes in 2010. At that time, however, no online resource provided the data content and tools for plant comparative and evolutionary genomics that we required. As such, the plant community was missing an essential component needed to bring its research to the same level as that of the biomedicine-oriented research communities. We started to work on PLAZA in order to provide a data resource that could be accessed by the plant community, and which also contained the data content needed to support our research group’s focus on evolutionary genomics. The platform for comparative and evolutionary genomics, which we named PLAZA, was developed from scratch (i.e. not based on an existing database schema, such as that of Ensembl). Gathering the data for all species, parsing this data into a common format and then uploading it into the database was the next step. We developed a processing pipeline, based on sequence similarity measurements, to group genes into gene families and subfamilies. Functional annotation was gathered both from the original data providers and through InterPro scans, combined with InterPro2GO. This primary data was then ready to be used in every subsequent analysis. Building such a database was good enough for research within our bioinformatics group, but the goal was to provide a comprehensive resource for all plant biologists with an interest in comparative and evolutionary genomics. Designing and creating a user-friendly, visually appealing web interface, connected to our database, was therefore the following step. 
While the most detailed information is commonly presented in data tables, aesthetically pleasing graphics, images and charts are often used to visualize trends and general statistics, and are also used in specific tools. Design and development of these tools and visualizations is thus one of the core elements of my PhD. The PLAZA platform was designed as a gene-centric data resource, which is easily navigated when a biologist wants to study a relatively small number of genes. However, using the default PLAZA website to retrieve information for dozens of genes quickly becomes very tedious. Therefore a ’gene set’-centric extra layer was developed in which user-defined gene sets could be quickly analyzed. This extra layer, called the PLAZA workbench, functions on top of the normal PLAZA website, which means that only gene sets from species present within the PLAZA database can be directly analyzed. The PLAZA resource for comparative and evolutionary genomics was a major success, but it still had several issues. We tried to solve at least two of these problems at the same time by creating a new platform. The first issue was the building procedure of PLAZA: adding a single species, or updating the structural annotation of an existing one, requires the total re-computation of the database content. The second issue was the restrictiveness of the PLAZA workbench: through a mapping procedure, gene sets could be entered for species not present in the PLAZA database, but for species without a phylogenetically close relative this approach did not always yield satisfying results. Furthermore, the research in question might focus precisely on the difference between a species present in PLAZA and a close relative not present in PLAZA (e.g. to study adaptation to a different ecological niche). In such a case, the mapping procedure is in itself useless. With the advent of NGS transcriptome data sets for a growing number of species, it was clear that a new challenge had presented itself. 
We designed and developed a new platform, named TRAPID, which could automatically process entire transcriptome data sets using a reference database. The goal was to have the processing done quickly, with the results containing both gene-family-oriented data (such as multiple sequence alignments and phylogenetic trees) and functional characterization of the transcripts. Major efforts went into designing the processing pipeline so that it would be reliable, fast and accurate.

    Literature mining and network analysis in Biology

    This thesis presents OnTheFly2.0, a versatile web-based tool dedicated to the extraction and subsequent analysis of biomedical terms from individual files. More specifically, OnTheFly2.0 supports different file formats and enables simultaneous file handling. The integration of the EXTRACT tagging service allows the implementation of Named Entity Recognition (NER) for genes/proteins, chemical compounds, organisms, tissues, environments, diseases, phenotypes and Gene Ontology terms, as well as the generation of popup windows which provide concise, context-related information about the identified term, accompanied by links to various databases. 
Once named entities such as proteins, genes and chemicals are identified, they can be further explored via functional and publication enrichment analysis, or be associated with diseases and with protein domains reported by protein family databases. Finally, visualization of protein-protein and protein-chemical associations is possible through the generation of interactive networks from the STRING and STITCH services, respectively. OnTheFly2.0 currently supports 197 species and is available at http://onthefly.pavlopouloslab.info

    AN INTEGRATIVE SYSTEMS BIOINFORMATICS APPROACH OF THE ENVIRONMENTAL, GENETIC AND MOLECULAR FACTORS REGULATING SLEEP

    Environmental changes and genetic variations are two important drivers of biological diversity. In complex traits, a multitude of genetic and environmental factors interact and combine in cryptic ways to direct the phenotypic variation. Sleep is a classic illustration of a complex trait that is vital and heritable but still poorly understood. Many aspects of sleep, such as its timing, duration and quality, are regulated by the interaction of two processes: circadian oscillations and sleep homeostasis. In the context of a study that aimed to uncover more clearly the molecular pathways regulating the sleep homeostat through the ambiguous relationship that exists between the sleep-wake cycle and metabolism, we built, assembled and analyzed an extensive multi-scaled dataset using the systems genetics design. Machine learning algorithms and novel high-throughput sequencing technology make it possible to appraise more precisely and broadly the plethora of physiological and molecular phenotypes that contribute to sleep under disparate circumstances and genetic backgrounds, in order to build novel hypotheses based on data-driven discoveries. This dataset is composed of 33 recombinant inbred lines (RIL) from the BXD panel that were interrogated under sleep deprivation and undisturbed conditions for 341 sleep-wake related physiological phenotypes, 124 blood plasma metabolites, and cortical and liver transcriptomics. First analyses pointed out the pervasive effects of sleep deprivation and genetics at both the molecular and behavioral levels, and the complex interaction between genetic and environmental factors at all phenotypic layers. Then, two novel integrative methods were developed: the first to prioritize candidate genes within large associated genomic regions for physiological or metabolic phenotypes, and the second to visualize the meta-dimensionality of the molecular network using the deterministic structure of hiveplots. 
Our findings led to the discovery of a bidirectional relationship between fatty acid turnover and sleep homeostasis, and also between brain slow-wave activity and ionotropic glutamate receptor transport. Using markup language and cloud-based technologies, we aimed to transform this resourceful, multidisciplinary dataset into an exploitable digital research object. The generation of dynamic analysis reports and workflow metadata promoted the reproducibility of this data object. In addition, tools were developed for the exploration and mining of the integrated data. The resulting database and associated web interface ensure the reusability of this dataset and the associated methodologies.

    Single-cell multi-omics analysis of the immune response in COVID-19

    Analysis of human blood immune cells provides insights into the coordinated response to viral infections such as severe acute respiratory syndrome coronavirus 2, which causes coronavirus disease 2019 (COVID-19). We performed single-cell transcriptome, surface proteome and T and B lymphocyte antigen receptor analyses of over 780,000 peripheral blood mononuclear cells from a cross-sectional cohort of 130 patients with varying severities of COVID-19. We identified expansion of nonclassical monocytes expressing complement transcripts (CD16+C1QA/B/C+) that sequester platelets and were predicted to replenish the alveolar macrophage pool in COVID-19. Early, uncommitted CD34+ hematopoietic stem/progenitor cells were primed toward megakaryopoiesis, accompanied by expanded megakaryocyte-committed progenitors and increased platelet activation. Clonally expanded CD8+ T cells and an increased ratio of CD8+ effector T cells to effector memory T cells characterized severe disease, while circulating follicular helper T cells accompanied mild disease. We observed a relative loss of IgA2 in symptomatic disease despite an overall expansion of plasmablasts and plasma cells. Our study highlights the coordinated immune response that contributes to COVID-19 pathogenesis and reveals discrete cellular components that can be targeted for therapy

    Protein-Coding Gene Repertoires : Annotation, Characterization, and Variability in Holometabola

    Three previously unavailable prerequisites for gene repertoire analyses are established and used within this thesis: (1) a new tool (COGNATE, written in Perl) was developed and used that records all gene structure parameters instead of relying on summary metrics; (2) it was ensured that automatically generated gene models are suitable for gene structure analyses; and (3) a unique species sample was employed, which representatively covers a younger radiation and allows the comparison of holometabolous and hemimetabolous insects. The goal of this thesis was to describe and analyze the variation of gene structures within and between repertoires of the species-rich and diverse insects as a step towards understanding and explaining the mechanisms and driving factors of insect genome evolution. COGNATE was employed to evaluate the magnitude of changes in predicted structural properties of protein-coding genes due to manual curation by comparing annotated gene sets from seven insect species sequenced by the i5k initiative. The properties of automatically generated gene models and their manually curated replacements do not differ extensively, and major correlative trends regarding gene structures can be recovered from both sets. From these results I conclude that gene models obtained from unsupervised annotation procedures are a suitable data basis for characterizing the structural gene features of a whole repertoire. The gene repertoires of twelve Hymenoptera were structurally characterized using COGNATE. The two focal species, non-apocritan "symphytans", possess small genomes and gene repertoires, but a strikingly high GC content of more than 41 %. Striking features are highlighted and lead to the conclusion that analyses of gene structure parameters can be used to direct detailed investigations, for example ones focused on the miniature introns of ants discovered here. 
The established analysis approach was used to assess the variability in protein-coding gene repertoires in a characterization of the repertoires of a large, unique species sample. Previous research suggested universal patterns of conservation within a gene repertoire in relation to others by which it could be partitioned: highly conserved genes, genes moderately conserved and unevenly distributed across taxa, and lineage- or species-specific genes. My results show that considerable differences exist in gene structural parameters between and within these gene sets beyond previously described patterns. This work provides a solid baseline of expectations on insect gene structure variation as well as manifold investigative leads prompting further research

    The metaRbolomics Toolbox in Bioconductor and beyond

    Metabolomics aims to measure and characterise the complex composition of metabolites in a biological system. Metabolomics studies involve sophisticated analytical techniques such as mass spectrometry and nuclear magnetic resonance spectroscopy, and generate large amounts of high-dimensional and complex experimental data. Open source processing and analysis tools are of major interest in light of innovative, open and reproducible science. The scientific community has developed a wide range of open source software, providing freely available advanced processing and analysis approaches. The programming and statistics environment R has emerged as one of the most popular environments in which to process and analyse metabolomics datasets. A major benefit of such an environment is the possibility of connecting different tools into more complex workflows. Combining reusable data processing R scripts with the experimental data thus allows for open, reproducible research. This review provides an extensive overview of existing packages in R for different steps in a typical computational metabolomics workflow, including data processing, biostatistics, metabolite annotation and identification, and biochemical network and pathway analysis. Multifunctional workflows, possible user interfaces and integration into workflow management systems are also reviewed. In total, this review summarises more than two hundred metabolomics-specific packages primarily available on CRAN, Bioconductor and GitHub

    Dissemination and visualisation of biological data

    With the recent waves of technological advances, the amount of biological data being generated has exploded. As a consequence of this data deluge, new challenges have emerged in the field of biological data management. In order to maximize the knowledge extracted from the huge amount of biological data produced, it is of great importance for the research community that data dissemination and visualisation challenges are tackled. Opening and sharing our data and working collaboratively will benefit the scientific community as a whole, and to move towards that end, new developments, tools and techniques are needed. Nowadays, many small research groups are capable of producing important and interesting datasets. The release of those datasets can greatly increase their scientific value. In addition, the development of new data analysis algorithms greatly benefits from the availability of a big corpus of annotated datasets for training and testing purposes, giving new and better algorithms to biomedical sciences in return. None of this would be feasible without large amounts of biological data made freely and publicly available. Dissemination: The Distributed Annotation System (DAS) is a protocol designed to publish and integrate annotations on biological entities in a distributed way. DAS is structured as a client-server system in which the client retrieves data from one or more servers in order to further process and visualise it. Nowadays, setting up a DAS server imposes some requirements not met by many research groups. With the aim of removing the hassle of setting up a DAS server, a new software platform has been developed: easyDAS. easyDAS is a hosted platform for automatically creating DAS servers. Using a simple web interface, the user can upload a data file and describe its contents; a new DAS server is then created automatically and the data becomes publicly available to DAS clients. 
Visualisation: One of the most broadly used visualisation paradigms for genomic data is the genome browser. A genome browser is capable of displaying different sets of features positioned relative to a sequence. It is possible to explore the sequence and the features by moving around and zooming in and out. When this project was started, in 2007, all major genome browsers offered quite a static experience. It was possible to browse and explore data, but this was done through a set of buttons that moved the genome a certain number of bases to the left or right, or zoomed in and out. From an architectural point of view, all web-based genome browsers were very similar: they all had a relatively thin client-side part in charge of showing images, and big backend servers taking care of everything else. Every change in the display parameters made by the user triggered a request to the server, impacting the perceived responsiveness. We created a new prototype genome browser called GenExp, an interactive web-based browser with canvas-based client-side data rendering. It offers fluid, direct interaction with the genome representation: it is possible to drag it with the mouse and to use the mouse wheel to change the zoom level. GenExp also offers some quite unique features, such as its multi-window capabilities, which allow a user to create an arbitrary number of independent or linked genome windows, and its ability to save and share browsing sessions. GenExp is a DAS client and all data is retrieved from DAS sources. It is possible to add any available DAS data source, including all data in Ensembl and UCSC, and even the custom ones created with easyDAS. In addition, we developed a JavaScript DAS client library, jsDAS. jsDAS is a complete DAS client library that takes care of everything DAS-related in a JavaScript application. jsDAS is agnostic to other JavaScript libraries and can be used to add DAS capabilities to any web application. 
All software developed in this thesis is freely available under an open source license.

    A Tale of Two Approaches: Comparing Top-Down and Bottom-Up Strategies for Analyzing and Visualizing High-Dimensional Data

    The proliferation of high-throughput and sensory technologies in various fields has led to a considerable increase in data volume, complexity, and diversity. Traditional data storage, analysis, and visualization methods are struggling to keep pace with the growth of modern data sets, necessitating innovative approaches to overcome the challenges of managing, analyzing, and visualizing data across various disciplines. One such approach is utilizing novel storage media, such as deoxyribonucleic acid (DNA), which presents an efficient, stable, compact, and energy-saving storage option. Researchers are exploring the potential use of DNA as a medium for the long-term storage of significant cultural and scientific materials. In addition to novel storage media, scientists are also focussing on developing new techniques that can integrate multiple data modalities and leverage machine learning algorithms to identify complex relationships and patterns in vast data sets. These newly developed data management and analysis approaches have the potential to unlock previously unknown insights into various phenomena and to facilitate more effective translation of basic research findings to practical and clinical applications. Addressing these challenges necessitates different problem-solving approaches. Researchers are developing novel tools and techniques that require different viewpoints. Top-down and bottom-up approaches are essential techniques that offer valuable perspectives for managing, analyzing, and visualizing complex high-dimensional multi-modal data sets. This cumulative dissertation explores the challenges associated with handling such data and highlights top-down, bottom-up, and integrated approaches that are being developed to manage, analyze, and visualize this data. The work is conceptualized in two parts, each reflecting one of the two problem-solving approaches and its use in published studies. 
The proposed work showcases the importance of understanding both approaches, the steps of reasoning about the problem within them, and their concretization and application in various domains