4 research outputs found

    A hands-on introduction to querying evolutionary relationships across multiple data sources using SPARQL [version 1; peer review: 1 approved, 2 approved with reservations]

    Get PDF
    The increasing use of Semantic Web technologies in the life sciences, in particular the use of the Resource Description Framework (RDF) and the RDF query language SPARQL, opens the path for novel integrative analyses, combining information from multiple sources. However, analyzing evolutionary data in RDF is not trivial, due to the steep learning curve required to understand both the data models adopted by different RDF data sources, as well as the SPARQL query language. In this article, we provide a hands-on introduction to querying evolutionary data across multiple sources that publish orthology information in RDF, namely: The Orthologous MAtrix (OMA), the European Bioinformatics Institute (EBI) RDF platform, the Database of Orthologous Groups (OrthoDB) and the Microbial Genome Database (MBGD). We present four protocols in increasing order of complexity. In these protocols, we demonstrate through SPARQL queries how to retrieve pairwise orthologs, homologous groups, and hierarchical orthologous groups. Finally, we show how orthology information in different sources can be compared, through the use of federated SPARQL queries

    Advances and Applications in the Quest for Orthologs

    Get PDF
    Gene families evolve by the processes of speciation (creating orthologs), gene duplication (paralogs) and horizontal gene transfer (xenologs), in addition to sequence divergence and gene loss. Orthologs in particular play an essential role in comparative genomics and phylogenomic analyses. With the continued sequencing of organisms across the tree of life, the data are available to reconstruct the unique evolutionary histories of tens of thousands of gene families. Accurate reconstruction of these histories, however, is a challenging computational problem, and the focus of the Quest for Orthologs Consortium. We review the recent advances and outstanding challenges in this field, as revealed at a symposium and meeting held at the University of Southern California in 2017. Key advances have been made both at the level of orthology algorithm development and with respect to coordination across the community of algorithm developers and orthology end-users. Applications spanned a broad range, including gene function prediction, phylostratigraphy, genome evolution, and phylogenomics. The meetings highlighted the increasing use of meta-analyses integrating results from multiple different algorithms, and discussed ongoing challenges in orthology inference as well as the next steps toward improvement and integration of orthology resources

    Enabling Complex Semantic Queries to Bioinformatics Databases through Intuitive Search Over Data

    Get PDF
    Data integration promises to be one of the main catalysts in enabling new insights to be drawn from the wealth of biological data already available publicly. However, the heterogene- ity of the existing data sources still poses significant challenges for achieving interoperability among biological databases. Furthermore, merely solving the technical challenges of data in- tegration, for example through the use of common data representation formats, leaves open the larger problem. Namely, the steep learning curve required for understanding the data models of each public source, as well as the technical language through which the sources can be queried and joined. As a consequence, most of the available biological data remain practically unexplored today. In this thesis, we address these problems jointly, by first introducing an ontology-based data integration solution in order to mitigate the data source heterogeneity problem. We illustrate through the concrete example of Bgee, a gene expression data source, how relational databases can be exposed as virtual Resource Description Framework (RDF) graphs, through relational-to-RDF mappings. This has the important advantage that the original data source can remain unmodified, while still becoming interoperable with external RDF sources. We complement our methods with applied case studies designed to guide domain experts in formulating expressive federated queries targeting the integrated data across the domains of evolutionary relationships and gene expression. More precisely, we introduce two com- parative analyses, first within the same domain (using orthology data from multiple, inter- operable, data sources) and second across domains, in order to study the relation between expression change and evolution rate following a duplication event. Finally, in order to bridge the semantic gap between users and data, we design and im- plement Bio-SODA, a question answering system over domain knowledge graphs, that does not require training data for translating user questions to SPARQL. Bio-SODA uses a novel ranking approach that combines syntactic and semantic similarity, while also incorporating node centrality metrics to rank candidate matches for a given user question. Our results in testing Bio-SODA across several real-world databases that span multiple domains (both within and outside bioinformatics) show that it can answer complex, multi-fact queries, be- yond the current state-of-the-art in the more well-studied open-domain question answering. -- L’intĂ©gration des donnĂ©es promet d’ĂȘtre l’un des principaux catalyseurs permettant d’extraire des nouveaux aperçus de la richesse des donnĂ©es biologiques dĂ©jĂ  disponibles publiquement. Cependant, l’hĂ©tĂ©rogĂ©nĂ©itĂ© des sources de donnĂ©es existantes pose encore des dĂ©fis importants pour parvenir Ă  l’interopĂ©rabilitĂ© des bases de donnĂ©es biologiques. De plus, en surmontant seulement les dĂ©fis techniques de l’intĂ©gration des donnĂ©es, par exemple grĂące Ă  l’utilisation de formats standard de reprĂ©sentation de donnĂ©es, on laisse ouvert un problĂšme encore plus grand. À savoir, la courbe d’apprentissage abrupte nĂ©cessaire pour comprendre la modĂ©li- sation des donnĂ©es choisie par chaque source publique, ainsi que le langage technique par lequel les sources peuvent ĂȘtre interrogĂ©s et jointes. Par consĂ©quent, la plupart des donnĂ©es biologiques publiquement disponibles restent pratiquement inexplorĂ©s aujourd’hui. Dans cette thĂšse, nous abordons l’ensemble des deux problĂšmes, en introduisant d’abord une solution d’intĂ©gration de donnĂ©es basĂ©e sur ontologies, afin d’attĂ©nuer le problĂšme d’hĂ©tĂ©- rogĂ©nĂ©itĂ© des sources de donnĂ©es. Nous montrons, Ă  travers l’exemple de Bgee, une base de donnĂ©es d’expression de gĂšnes, une approche permettant les bases de donnĂ©es relationnelles d’ĂȘtre publiĂ©s sous forme de graphes RDF (Resource Description Framework) virtuels, via des correspondances relationnel-vers-RDF (« relational-to-RDF mappings »). Cela prĂ©sente l’important avantage que la source de donnĂ©es d’origine peut rester inchangĂ©, tout en de- venant interopĂ©rable avec les sources RDF externes. Nous complĂ©tons nos mĂ©thodes avec des Ă©tudes de cas appliquĂ©es, conçues pour guider les experts du domaine dans la formulation de requĂȘtes fĂ©dĂ©rĂ©es expressives, ciblant les don- nĂ©es intĂ©grĂ©es dans les domaines des relations Ă©volutionnaires et de l’expression des gĂšnes. Plus prĂ©cisĂ©ment, nous introduisons deux analyses comparatives, d’abord dans le mĂȘme do- maine (en utilisant des donnĂ©es d’orthologie provenant de plusieurs sources de donnĂ©es in- teropĂ©rables) et ensuite Ă  travers des domaines interconnectĂ©s, afin d’étudier la relation entre le changement d’expression et le taux d’évolution suite Ă  une duplication de gĂšne. Enfin, afin de mitiger le dĂ©calage sĂ©mantique entre les utilisateurs et les donnĂ©es, nous concevons et implĂ©mentons Bio-SODA, un systĂšme de rĂ©ponse aux questions sur des graphes de connaissances domaine-spĂ©cifique, qui ne nĂ©cessite pas de donnĂ©es de formation pour traduire les questions des utilisateurs vers SPARQL. Bio-SODA utilise une nouvelle ap- proche de classement qui combine la similaritĂ© syntactique et sĂ©mantique, tout en incorporant des mĂ©triques de centralitĂ© des nƓuds, pour classer les possibles candidats en rĂ©ponse Ă  une question utilisateur donnĂ©e. Nos rĂ©sultats suite aux tests effectuĂ©s en utilisant Bio-SODA sur plusieurs bases de donnĂ©es Ă  travers plusieurs domaines (tantĂŽt liĂ©s Ă  la bioinformatique qu’extĂ©rieurs) montrent que Bio-SODA rĂ©ussit Ă  rĂ©pondre Ă  des questions complexes, en- gendrant multiples entitĂ©s, au-delĂ  de l’état actuel de la technique en matiĂšre de systĂšmes de rĂ©ponses aux questions sur les donnĂ©es structures, en particulier graphes de connaissances
    corecore