22 research outputs found

    Evaluating the Cassandra NoSQL Database Approach for Genomic Data Persistency

    Get PDF
    Rapid advances in high-throughput sequencing techniques have created interesting computational challenges in bioinformatics. One of these is the management of the massive amounts of data generated by automatic sequencers. We need to deal with the persistency of genomic data, particularly storing and analyzing these large-scale processed data. Finding an alternative to the commonly used relational database model has become a compelling task: other data models may be more effective for very large amounts of non-conventional data, especially for write and retrieval operations. In this paper, we discuss the Cassandra NoSQL database approach for storing genomic data. We analyze persistency and I/O operations with real data using the Cassandra database system, and we compare the results with a classical relational database system and another NoSQL database, MongoDB.
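    To make the approach concrete, below is a minimal sketch of how genomic reads might be keyed in Cassandra using the Python driver. The keyspace, table, and column names are illustrative assumptions, not the schema evaluated in the paper, and a local Cassandra node is assumed.

```python
# Minimal sketch (assumptions: a local Cassandra node; the keyspace and
# table layout below are illustrative, not the paper's actual schema).
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS genomics
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# Partition by sequencing run so all reads of a run live on the same node,
# clustered by read id for ordered retrieval.
session.execute("""
    CREATE TABLE IF NOT EXISTS genomics.reads (
        run_id text,
        read_id text,
        sequence text,
        quality text,
        PRIMARY KEY (run_id, read_id)
    )
""")

session.execute(
    "INSERT INTO genomics.reads (run_id, read_id, sequence, quality) "
    "VALUES (%s, %s, %s, %s)",
    ("run42", "read0001", "GATTACA", "IIIIIII"),
)

for row in session.execute(
    "SELECT read_id, sequence FROM genomics.reads WHERE run_id = %s", ("run42",)
):
    print(row.read_id, row.sequence)
```

    Partitioning by run keeps write paths append-like and makes whole-run retrieval a single-partition read, which suits the write-heavy workloads the paper targets.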

    Using Blockchain to support Data & Service Monetization

    Get PDF
    Two required features of a data monetization platform are query and retrieval of the metadata of the resources to be monetized. Centralized platforms rely on the maturity of traditional NoSQL database systems to support these features; such databases, for example MongoDB, allow very efficient query and retrieval of the data they store. However, centralized platforms come with a host of security and privacy concerns, making them less than ideal for a data monetization platform. On the other hand, most existing decentralized platforms are only partially decentralized. In this research, I developed Cowry, a platform for publishing metadata describing available resources (data or services) and for discovering published metadata, including fast search and filtering. My main contribution is a fully decentralized architecture that combines a blockchain with a traditional distributed database to gain additional features such as efficient query and retrieval of metadata stored on the blockchain.
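    The hybrid pattern can be illustrated with a toy sketch: the full metadata document goes to a conventional store for fast query and retrieval, while only its hash is anchored on the ledger. Cowry's actual stack is not described here; the list and dict below are stand-ins for the blockchain and the distributed database.

```python
# Toy sketch of the hybrid pattern (assumptions: a plain list stands in for
# the blockchain and a dict for the distributed database; Cowry's real
# components are not shown in the abstract).
import hashlib
import json

ledger = []        # stand-in for the blockchain: ordered, append-only hashes
metadata_db = {}   # stand-in for the distributed database: fast query path

def publish(resource_metadata: dict) -> str:
    """Store full metadata off-chain; anchor its hash on-chain."""
    record = json.dumps(resource_metadata, sort_keys=True).encode()
    digest = hashlib.sha256(record).hexdigest()
    metadata_db[digest] = resource_metadata   # efficient retrieval
    ledger.append(digest)                     # tamper-evident anchor
    return digest

def verify(digest: str) -> bool:
    """Check that off-chain metadata still matches its on-chain anchor."""
    record = json.dumps(metadata_db[digest], sort_keys=True).encode()
    return hashlib.sha256(record).hexdigest() == digest and digest in ledger

h = publish({"name": "weather-feed", "type": "service", "price": 5})
print(verify(h))  # True
```

    The split buys the best of both sides: the database answers search and filtering queries quickly, while the ledger makes any tampering with stored metadata detectable.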

    Using conceptual modeling to improve genome data management

    Full text link
    [EN] With advances in genomic sequencing technology, a large amount of data is publicly available for the research community to extract meaningful and reliable associations among risk genes and the mechanisms of disease. However, this exponential growth of data is spread over more than a thousand heterogeneous repositories, represented in multiple formats and with different levels of quality, which hinders the differentiation of clinically valid relationships from those that are less well sustained and that could lead to wrong diagnoses. This paper presents how conceptual models can play a key role in efficiently managing genomic data. These data must be accessible, informative and reliable enough to extract valuable knowledge in the context of identifying evidence that supports the relationship between DNA variants and disease. The approach presented in this paper provides a solution that helps researchers organize, store and process information, focusing only on the data that are relevant and minimizing the impact that information overload has in clinical and research contexts. A case study (epilepsy) is also presented to demonstrate its application in a real context. Supported by the Spanish State Research Agency and the Generalitat Valenciana under the projects TIN2016-80811-P and PROMETEO/2018/176; ERDF.
    Pastor López, O.; León-Palacio, A.; Reyes Román, J. F.; García-Simón, A.; Casamayor Rodenas, J. C. (2020). Using conceptual modeling to improve genome data management. Briefings in Bioinformatics, 22(1), 45-54. https://doi.org/10.1093/bib/bbaa100
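    As a rough illustration of how a conceptual model can drive genome data management, the sketch below encodes a few plausible entities as Python dataclasses. The class and attribute names, and the quality levels, are assumptions for demonstration only, not the schema published by the authors.

```python
# Illustrative sketch only: a few entities one might derive from such a
# conceptual model; names and quality levels are assumed, not the authors'.
from dataclasses import dataclass, field

@dataclass
class Variant:
    identifier: str          # e.g. a dbSNP id
    chromosome: str
    position: int
    reference: str
    alternative: str

@dataclass
class Evidence:
    source: str              # repository the assertion came from
    quality_level: str       # e.g. "clinically valid" vs "less well sustained"

@dataclass
class VariantDiseaseAssociation:
    variant: Variant
    disease: str
    evidence: list[Evidence] = field(default_factory=list)

    def is_reliable(self) -> bool:
        # Keep only associations backed by at least one clinically valid source.
        return any(e.quality_level == "clinically valid" for e in self.evidence)
```

    Making the evidence level an explicit attribute is what lets a system filter out the less well-sustained relationships the abstract warns about, instead of leaving that judgment implicit in each repository's format.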

    Implementation of Integration VaaMSN and SEMAR for Wide Coverage Air Quality Monitoring

    Get PDF
    Current air quality monitoring systems cannot cover a large area, do not operate in real time, and have not applied big data analysis technology with high accuracy. The purpose of integrating a Mobile Sensor Network with an Internet of Things system is to build an air quality monitoring system able to monitor a wide coverage area. This system consists of Vehicle as a Mobile Sensor Network (VaaMSN) as edge computing and Smart Environment Monitoring and Analytic in Real-time (SEMAR) as cloud computing. VaaMSN is a package of air quality sensors, GPS, a 4G Wi-Fi modem, and a single-board computer. The SEMAR cloud has a time-series database for real-time visualization and a Big Data environment, with analytics using the Support Vector Machine (SVM) and Decision Tree (DT) algorithms. The system outputs map, table, and graph visualizations. The evaluation of the experimental results shows that the accuracy of both algorithms exceeds 90%. The Mean Square Error (MSE) of the SVM algorithm is about 0.03076293, while the DT algorithm's MSE is roughly ten times smaller.
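    A hedged sketch of the SVM versus Decision Tree comparison follows. The paper's air quality dataset is not available here, so synthetic data stands in for it; the printed scores are therefore not the reported ones.

```python
# Sketch of the SVM vs. Decision Tree comparison (assumption: synthetic data
# stands in for the paper's air quality measurements).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, mean_squared_error

# Synthetic stand-in: 6 features playing the role of pollutant readings.
X, y = make_classification(n_samples=1000, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for name, model in [("SVM", SVC()), ("DT", DecisionTreeClassifier())]:
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(name,
          "accuracy:", accuracy_score(y_te, pred),
          "MSE:", mean_squared_error(y_te, pred))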

    Data provenance management for bioinformatics workflows in a cloud computing environment

    Get PDF
    Master's dissertation, Universidade de Brasília, Instituto de Ciências Exatas, Departamento de Ciência da Computação, 2018. Molecular biology experiments are often organized as scientific workflows: sets of activities performed by different processing entities through managed tasks. Knowledge of the trajectory of the data through a given workflow enables reproducibility by means of data provenance. To reproduce an in silico bioinformatics experiment, however, one must consider aspects beyond the tasks executed in a workflow; indeed, the computational settings in which the programs involved are executed are a requirement for reproducibility. Cloud computing technology can hide the technical details and make it easier for the user to configure such an environment on demand. NoSQL database systems have also gained popularity, particularly in the cloud. Considering this scenario, a model for the provenance of data from scientific experiments in a cloud computing environment is proposed, using PROV-DM and mapping it to three different families of NoSQL database systems. Two bioinformatics workflows involving different phases were executed and used for tests in the NoSQL databases Cassandra, MongoDB and OrientDB, followed by an analysis of these executions and tests. The results showed that provenance storage times are minimal compared to the execution times of the workflows without provenance; the proposed models for the NoSQL databases therefore proved to be a good option for storing and managing the provenance of biological data. Funded by Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES).
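    The mapping idea can be sketched for the document family: PROV-DM entities, activities, and relations become documents in separate collections. The snippet assumes a local MongoDB instance, and the collection layout is illustrative rather than the dissertation's exact model.

```python
# Minimal sketch of mapping PROV-DM to a document store (assumptions: a local
# MongoDB instance; collection layout is illustrative, not the exact model).
from datetime import datetime, timezone
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["provenance"]

# PROV-DM core types as documents: entities, activities, and the
# "used" / "wasGeneratedBy" relations between them.
entity_id = db.entities.insert_one({
    "prov_type": "entity",
    "name": "reads.fastq",
    "format": "FASTQ",
}).inserted_id

activity_id = db.activities.insert_one({
    "prov_type": "activity",
    "name": "quality_filtering",
    "started_at": datetime.now(timezone.utc),
}).inserted_id

db.relations.insert_one({
    "prov_type": "used",
    "activity": activity_id,
    "entity": entity_id,
})

# Provenance query: which inputs did a given workflow step consume?
for rel in db.relations.find({"activity": activity_id, "prov_type": "used"}):
    print(db.entities.find_one({"_id": rel["entity"]})["name"])
```

    Each provenance write is a small independent insert, which is consistent with the finding that recording provenance adds little overhead next to the workflow's own execution time.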

    NoSQL Cassandra: a case study with bioinformatics workflows

    Get PDF
    Undergraduate thesis, Universidade de Brasília, Instituto de Ciências Exatas, Departamento de Ciência da Computação, 2016. Bioinformatics projects are usually executed as scientific workflows. Biologists often execute the same workflow many times with different parameters in order to compare the results and refine the data analysis. These executions generate many files of different formats and sizes, which need to be stored for future executions. To manage such large volumes of data, new database models, known as NoSQL (Not Only SQL), have been specified. In this context, this undergraduate thesis presents an evaluation of the Cassandra database management system in bioinformatics scientific workflows.
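    As an illustration of the kind of schema such an evaluation might use, the sketch below keys workflow executions by workflow name and clusters them newest-first, so comparing recent runs with different parameters is a single-partition read. The table design is an assumption for demonstration, not the thesis's actual schema.

```python
# Sketch of one way workflow runs could be keyed in Cassandra (assumptions:
# a local node; the table design is illustrative, not the thesis's schema).
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS workflows
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# One partition per workflow; runs clustered newest-first so "compare the
# latest executions with different parameters" is a single-partition read.
session.execute("""
    CREATE TABLE IF NOT EXISTS workflows.runs (
        workflow text,
        run_at timestamp,
        parameters map<text, text>,
        output_files list<text>,
        PRIMARY KEY (workflow, run_at)
    ) WITH CLUSTERING ORDER BY (run_at DESC)
""")

session.execute(
    "INSERT INTO workflows.runs (workflow, run_at, parameters, output_files) "
    "VALUES (%s, toTimestamp(now()), %s, %s)",
    ("rna-seq", {"k": "31"}, ["transcripts.fa"]),
)

rows = session.execute(
    "SELECT run_at, parameters FROM workflows.runs WHERE workflow = %s LIMIT 5",
    ("rna-seq",),
)
for row in rows:
    print(row.run_at, dict(row.parameters or {}))
```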

    Enhancing scientific information systems with semantic annotations

    Full text link

    SILE: A Method for the Efficient Management of Smart Genomic Information

    Full text link
    [EN] In the last two decades, the data generated by next-generation sequencing technologies have revolutionized our understanding of human biology. Furthermore, they have allowed us to develop and improve our knowledge of how changes (variants) in the DNA can be related to the risk of developing certain diseases. Currently, a large amount of genomic data is publicly available and frequently used by the research community to extract meaningful and reliable associations among risk genes and the mechanisms of disease. However, managing this exponentially growing volume of data has become a challenge, and researchers are forced to delve into a lake of complex data spread across more than a thousand heterogeneous repositories, represented in multiple formats and with different levels of quality. Moreover, when these data are used to solve a concrete problem, only a small part of them is really significant; this is what we call "smart" data. The main goal of this thesis is to provide a systematic approach to efficiently manage smart genomic data by using conceptual modeling techniques and the principles of data quality assessment. The aim of this approach is to populate an information system with data that are accessible, informative and actionable enough to extract valuable knowledge. This thesis was supported by the Research and Development Aid Program (PAID-01-16) under the FPI grant 2137.
    León Palacio, A. (2019). SILE: A Method for the Efficient Management of Smart Genomic Information [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/131698
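    The "smart data" idea can be sketched as an explicit quality gate applied before records are loaded into the information system. The predicate names and criteria below are illustrative assumptions, not SILE's published criteria.

```python
# Hedged sketch: keep only records that pass explicit quality checks before
# loading them (predicates and criteria are assumed, not SILE's actual rules).
def is_smart(record: dict) -> bool:
    """Load a record only if it is accessible, informative and actionable."""
    has_provenance = bool(record.get("source"))              # traceable origin
    is_informative = record.get("clinical_significance") not in (None, "unknown")
    is_current = record.get("assembly") == "GRCh38"          # assumed criterion
    return has_provenance and is_informative and is_current

raw = [
    {"id": "rs1", "source": "ClinVar", "clinical_significance": "pathogenic",
     "assembly": "GRCh38"},
    {"id": "rs2", "source": None, "clinical_significance": "unknown",
     "assembly": "GRCh37"},
]
smart = [r for r in raw if is_smart(r)]
print([r["id"] for r in smart])  # ['rs1']
```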

    Analysis of document-oriented NoSQL data models in bioinformatics workflows

    Get PDF
    Undergraduate thesis, Universidade de Brasília, Instituto de Ciências Exatas, Departamento de Ciência da Computação, 2018. The increasing amount of data generated by several areas of knowledge is called Big Data. In this scenario, bioinformatics research needs provenance data, since it can provide the history of the information collected in the research workflow and answer questions related to the origin of the data. Big Data brought about the NoSQL (Not Only SQL) approach as an alternative to the relational database model, which shows limitations when applied to very large datasets. Focusing on MongoDB, this work proposes a program that automatically executes a workflow and stores its provenance and raw data in three different document formats: referenced, embedded and hybrid. These three designs are compared and analyzed using parameters such as time and query capabilities. The results revealed some particularities of bioinformatics data and the advantages and disadvantages of each model.
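    The three document designs can be sketched as plain documents; the field names and the size threshold are illustrative assumptions, not the exact schemas evaluated in the work.

```python
# The three document designs, sketched as plain documents (field names and
# the size threshold are assumptions, not the work's exact schemas).

# 1. Referenced: provenance and raw data live in separate collections,
#    linked by id -- small documents, but queries need an extra lookup.
provenance_ref = {"_id": "prov1", "step": "alignment", "raw_data_id": "raw1"}
raw_data = {"_id": "raw1", "content": "...aligned reads..."}

# 2. Embedded: raw data nested inside the provenance document --
#    one read per query, at the cost of large documents (MongoDB's 16 MB cap).
provenance_embedded = {
    "_id": "prov1",
    "step": "alignment",
    "raw_data": {"content": "...aligned reads..."},
}

# 3. Hybrid: embed small outputs, reference large ones.
SIZE_THRESHOLD = 1_000_000  # bytes; assumed cutoff

def hybrid_document(step: str, content: bytes):
    """Return (provenance doc, optional separate raw-data doc)."""
    if len(content) < SIZE_THRESHOLD:
        return {"step": step, "raw_data": {"content": content}}, None
    return {"step": step, "raw_data_id": "raw1"}, {"_id": "raw1", "content": content}
```

    The trade-off the comparison probes is clear from the shapes alone: embedding favors query speed for small outputs, referencing favors the large raw files bioinformatics workflows produce, and the hybrid splits by size.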

    Development and application of a platform for harmonisation and integration of metabolomics data

    Get PDF
    Integrating diverse metabolomics data for molecular epidemiology analyses provides both opportunities and challenges in the field of human health research. Combining patient cohorts may improve the power and sensitivity of analyses but is challenging due to significant technical and analytical variability. Additionally, current systems for the storage and analysis of metabolomics data suffer from scalability, queryability, and integration issues that limit their adoption for molecular epidemiological research. Here, a novel platform for integrative metabolomics is developed, which addresses issues of storage, harmonisation, querying, scaling, and analysis of large-scale metabolomics data. Its use is demonstrated through an investigation of molecular trends of ageing in an integrated four-cohort dataset, where the advantages and disadvantages of combining balanced and unbalanced cohorts are explored and robust metabolite trends are successfully identified and shown to be concordant with previous studies.
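    A toy sketch of the harmonisation step follows, under the assumption that per-cohort standardisation is one way to damp technical variability before pooling; the platform's real pipeline is not shown here, and the metabolite names and values are invented.

```python
# Toy sketch of cohort harmonisation (assumptions: per-cohort z-scoring stands
# in for the platform's actual pipeline; names and values are invented).
import pandas as pd

cohort_a = pd.DataFrame({"glucose": [5.1, 5.6, 4.9], "cohort": "A"})
cohort_b = pd.DataFrame({"Glucose": [88.0, 95.0, 79.0], "cohort": "B"})  # other units

cohort_b = cohort_b.rename(columns={"Glucose": "glucose"})  # harmonise names

pooled = pd.concat([cohort_a, cohort_b], ignore_index=True)
# Standardise within each cohort so technical offsets do not dominate the
# pooled analysis.
pooled["glucose_z"] = pooled.groupby("cohort")["glucose"].transform(
    lambda x: (x - x.mean()) / x.std()
)
print(pooled)
```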