Towards linked open gene mutations data
Background: With the advent of high-throughput technologies, a great wealth of variation data is being produced. Such information may constitute the basis for correlation analyses between genotypes and phenotypes and, in the future, for personalized medicine. Several databases on gene variation exist, but this kind of information is still scarce in the Semantic Web framework. In this paper, we discuss issues related to the integration of mutation data in the Linked Open Data infrastructure, part of the Semantic Web framework. We present the development of a mapping from the IARC TP53 Mutation database to RDF and the implementation of servers publishing this data.
Methods: A version of the IARC TP53 Mutation database implemented in a relational database was used as a first test set. Automatic mappings to RDF were first created by using D2RQ and later manually refined by introducing concepts and properties from domain vocabularies and ontologies, as well as links to Linked Open Data implementations of various systems of biomedical interest. Since D2RQ query performance is lower than what can be achieved by using an RDF archive, the generated data was also loaded into a dedicated system based on tools from the Jena software suite.
Results: We have implemented a D2RQ Server for TP53 mutation data, providing data on a subset of the IARC database, including gene variations, somatic mutations, and bibliographic references. The server allows users to browse the RDF graph by following links both between classes and to external systems. An alternative interface offers improved performance for SPARQL queries. The resulting data can be explored by using any Semantic Web browser or application.
Conclusions: This has been the first case of a mutation database exposed as Linked Data. A revised version of our prototype, including further concepts and IARC TP53 Mutation database data sets, is under development. The publication of variation information as Linked Data opens new perspectives: the exploitation of SPARQL searches on mutation data and other biological databases may support data retrieval which is presently not possible. Moreover, reasoning on integrated variation data may support discoveries towards personalized medicine.
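The automatic relational-to-RDF step mentioned in the Methods can be pictured with a minimal sketch. This is not D2RQ's mapping language and not the authors' mapping: the base URI, table name, column names, and rows below are all hypothetical, chosen only to show how a row's primary key can name a subject and its remaining columns can become property/literal pairs serialized as N-Triples.

```python
# Hypothetical illustration only: the base URI, table layout, column names,
# and row values are assumptions, not the IARC TP53 schema or a D2RQ mapping.
BASE = "http://example.org/tp53/"


def literal(value):
    """Render a value as a plain N-Triples literal, escaping special characters."""
    text = (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{text}"'


def row_to_triples(table, key, row):
    """Map one relational row to triples.

    The primary-key column names the subject; every other non-null column
    becomes a predicate with a literal object.
    """
    subject = f"<{BASE}{table}/{row[key]}>"
    triples = []
    for column, value in row.items():
        if column == key or value is None:
            continue  # the key is already in the subject; NULLs yield no triple
        predicate = f"<{BASE}vocab/{table}_{column}>"
        triples.append(f"{subject} {predicate} {literal(value)} .")
    return triples


# Made-up rows standing in for a relational table of mutations.
rows = [
    {"id": 1, "codon": 175, "effect": "missense"},
    {"id": 2, "codon": 248, "effect": None},
]
ntriples = [t for row in rows for t in row_to_triples("mutation", "id", row)]

for line in ntriples:
    print(line)
```

A manual refinement pass of the kind the abstract describes would then replace the generated predicate URIs with terms from domain vocabularies and turn selected literals into links to external Linked Open Data resources.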
IBWS: IST Bioinformatics Web Services
The Bioinformatics group at the National Cancer Research Institute (IST) of Genoa has been involved for many years in the development and maintenance of biomedical information systems. Among them, the Common Access to Biological Resources and Information network services offer access to more than 130 000 biological resources, such as strains of micro-organisms and human and animal cell lines, included in 29 collections from some of the best-known European Biological Resource Centers. A Sequence Retrieval System (SRS) implementation of the TP53 Mutation Database of the International Agency for Research on Cancer (Lyon) was made available in order to improve the interoperability of this data with other molecular biology databases. "SRS by WS (SWS)", a system for retrieving information on public SRS sites and for directly querying them, was also implemented. In order to make this information available through application programming interfaces, we implemented a suite of free web services (WS), called the "IST Bioinformatics Web Services (IBWS)". A support web site, including a description of the system, a list of available WS together with help pages, links to the corresponding WSDLs and forms for testing services, is available at http://bioinformatics.istge.it/ibws/. WSDL definitions can also be retrieved directly at http://bioinformatics.istge.it:8080/axis/services
Improving data workflow systems with cloud services and use of open data for bioinformatics research
Data workflow systems (DWFSs) enable bioinformatics researchers to combine components for data access and data analytics, and to share the final data analytics approach with their collaborators. Increasingly, such systems have to cope with large-scale data, such as full genomes (about 200 GB each), public fact repositories (about 100 TB of data) and 3D imaging data at even larger scales. As moving the data becomes cumbersome, the DWFS needs to embed its processes into a cloud infrastructure, where the data are already hosted. As the standardized public data play an increasingly important role, the DWFS needs to comply with Semantic Web technologies. This advancement to DWFS would reduce overhead costs and accelerate the progress in bioinformatics research based on large-scale data and public resources, as researchers would require less specialized IT knowledge for the implementation. Furthermore, the high data growth rates in bioinformatics research drive the demand for parallel and distributed computing, which then imposes a need for scalability and high-throughput capabilities onto the DWFS. As a result, requirements for data sharing and access to public knowledge bases suggest that compliance of the DWFS with Semantic Web standards is necessary. In this article, we will analyze the existing DWFS with regard to their capabilities toward public open data use as well as large-scale computational and human interface requirements. We untangle the parameters for selecting a preferable solution for bioinformatics research with particular consideration to using cloud services and Semantic Web technologies. Our analysis leads to research guidelines and recommendations toward the development of future DWFS for the bioinformatics research community.
AIoTES: Setting the principles for semantic interoperable and modern IoT-enabled reference architecture for Active and Healthy Ageing ecosystems
The average life expectancy of the world's population is increasing, and healthcare systems will sooner or later be compromised by their reduced capacity and high economic cost; in addition, the age distribution of the population is shifting towards the older end of the spectrum. This trend will lead to immeasurable and unexpected economic problems and social changes. In order to face this complex economic and social challenge, it is necessary to rely on the appropriate digital tools and technological infrastructures to ensure that the elderly are properly cared for in their everyday living environments and can live independently for longer. This article presents the ACTIVAGE IoT Ecosystem Suite (AIoTES), a concrete reference architecture and its implementation process that addresses these issues and that was designed within the first European Large Scale Pilot, ACTIVAGE, an H2020 project funded by the European Commission with the objective of creating sustainable ecosystems for Active and Healthy Ageing (AHA) based on Internet of Things and big data technologies. AIoTES offers platform-level semantic interoperability, with security and privacy, as well as Big Data and Ecosystem tools. AIoTES enables and promotes the creation, exchange and adoption of cross-platform services and applications for AHA. The number of existing AHA services and solutions is quite large, especially when state-of-the-art technology is introduced; however, a concrete architecture such as AIoTES gains importance and relevance by providing a vision for establishing a complete ecosystem that seeks to support a larger variety of AHA services, rather than claiming to be a unique solution for all the AHA domain problems. AIoTES has been successfully validated by testing all of its components, individually, integrated, and in real-world environments with 4345 direct users.
Each validation is contextualized in 11 Deployment Sites (DS) with 13 Validation Scenarios covering the heterogeneity of the AHA-IoT needs. These results also show a clear path for improvement, as well as the importance of standardization efforts in the ever-evolving AHA-IoT domain.
We thank all the people who have participated in the development and validation of AIoTES. This work has been developed under the framework of the ACTIVAGE project. The project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 732679.
Valero-López, CI.; Medrano-Gil, A.; González-Usach, R.; Julián-Seguí, M.; Fico, G.; Arredondo, MT.; Stavropoulos, TG.... (2021). AIoTES: Setting the principles for semantic interoperable and modern IoT-enabled reference architecture for Active and Healthy Ageing ecosystems. Computer Communications. 177:96-111. https://doi.org/10.1016/j.comcom.2021.06.010
A deep learning approach to genomics data for population scale clustering and ethnicity prediction
The understanding of variations in genome sequences assists us in identifying
people who are predisposed to common diseases, solving rare diseases, and finding
the corresponding population group of individuals within a larger population group.
Although classical machine learning techniques allow researchers to identify groups
or clusters of related variables, the accuracy and effectiveness of these methods diminish
for large, high-dimensional datasets such as the whole human genome. On the other hand,
deep learning (DL) can make better representations of large-scale datasets and build models
that learn these representations extensively. Furthermore, Semantic Web (SW)
technologies have already acted as useful adaptors in life science research for large-scale data
integration and querying. Thus, the standardized public data created using SW plays an
increasingly important role in life sciences research. In this paper, we propose a novel and
scalable genomic data analysis approach for population-scale clustering and geographic
ethnicity prediction using SW and DL-based techniques. We used genotype data from
the 1000 Genomes Project, obtained from whole-genome sequencing of
2504 individuals and comprising 84 million variants across 26 ethnic origins.
Experimental results in terms of accuracy and scalability show the effectiveness and
superiority of the approach compared to the state of the art. In particular, our deep-learning-based
analytics technique, using classification and clustering algorithms, can predict and group
targeted populations with a prediction accuracy of 98% and an ARI of 0.92, respectively.
This publication has emanated from research conducted with the financial support of
Science Foundation Ireland (SFI) under the Grant Number SFI/12/RC/2289.
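The abstract reports clustering quality as an ARI, which we read here as the adjusted Rand index (the abbreviation is not expanded in the text). As a rough sketch of what that score measures, the standard pair-counting formula can be written in a few lines of Python; the function name and the toy label lists are illustrative assumptions and have nothing to do with the paper's data or pipeline.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items.

    1.0 means identical partitions (up to label renaming); values near 0
    are what random assignment would be expected to give.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same items")

    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree on

    # Pairs of items grouped together in both partitions, in the reference
    # partition only, and in the predicted partition only.
    together_both = sum(
        comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values()
    )
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = together_true * together_pred / total_pairs
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (together_both - expected) / (maximum - expected)


# Toy labels (hypothetical): same grouping with swapped label names.
same_grouping = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Toy labels (hypothetical): grouping that cuts across the reference one.
crossed_grouping = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index is corrected for chance, it can be negative when two groupings agree less than random assignment would, as in the second toy case.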