104 research outputs found

    Big tranSMART for clinical decision making

    Patient stratification based on molecular profiling data plays a key role in clinical decision making, such as the identification of disease subgroups and the prediction of treatment responses of individual subjects. Existing knowledge management systems such as tranSMART enable scientists to perform such analyses, but in the big data era the size of molecular profiling data is increasing sharply owing to new biological techniques such as next-generation sequencing. None of the existing storage systems works well when the three "V" features of big data (Volume, Variety, and Velocity) are considered. Newer key-value data stores such as Apache HBase and Google Bigtable provide high-speed queries by key. These databases can be modeled as a Distributed Ordered Table (DOT), which horizontally partitions a table into regions and distributes the regions to region servers by key. However, none of the existing data models works well for a DOT. A Collaborative Genomic Data Model (CGDM) has been designed to solve these issues. CGDM creates three Collaborative Global Clustering Index Tables to improve data query velocity. The microarray implementation of CGDM on HBase performed up to 246, 7, and 20 times faster than the relational data model on HBase, MySQL Cluster, and MongoDB, respectively. The single nucleotide polymorphism implementation of CGDM on HBase outperformed the relational model on HBase and MySQL Cluster by up to 351 and 9 times, respectively. The raw sequence implementation of CGDM on HBase gains up to 440-fold and 22-fold speedups compared to the sequence alignment map format implemented in HBase and a binary alignment map server. The integration into tranSMART shows up to a 7-fold speedup in the data export function. In addition, a popular hierarchical clustering algorithm in tranSMART has been used as an application to indicate how CGDM can influence the velocity of the algorithm: the optimized method using CGDM performs more than 7 times faster than the same method using the relational model implemented in MySQL Cluster.
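    As an illustration of the kind of key-based access such a model enables, the sketch below queries a hypothetical HBase table whose row keys group all measurements for one gene, so a single prefix scan retrieves that gene across subjects. The table name, key layout, and column family are assumptions made for illustration, not the actual CGDM schema.

```python
# Minimal sketch (not the actual CGDM schema): querying an HBase table whose
# row-key prefixes group all measurements for one gene, so a prefix scan
# answers "all values of gene X across subjects" without a full table scan.
# Assumes a reachable HBase Thrift server and a hypothetical table name.
import happybase

connection = happybase.Connection('localhost')        # HBase Thrift endpoint
table = connection.table('expression_by_gene')        # hypothetical index table

def expression_for_gene(gene_id: str):
    """Yield (subject_id, value) pairs for one gene via a row-key prefix scan."""
    prefix = f'{gene_id}|'.encode()                    # hypothetical key layout: <gene>|<subject>
    for key, data in table.scan(row_prefix=prefix):
        subject_id = key.decode().split('|', 1)[1]
        value = float(data[b'cf:value'])               # hypothetical column family/qualifier
        yield subject_id, value

for subject, value in expression_for_gene('BRCA2'):
    print(subject, value)
```

    Because a DOT such as HBase stores rows in lexicographic key order and splits them into regions by key range, rows sharing a prefix are physically clustered, which is what makes this kind of scan fast.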

    Microarray tools and analysis methods to better characterize biological networks

    To accurately model a biological system (e.g. a cell), we first need to characterize each of its distinct networks. While omics data have given us unprecedented insight into the structure and dynamics of these networks, the associated analysis routines are involved, and the accuracy and precision of the experimental technologies have not been sufficiently examined. The main focus of our research has been to develop methods and tools to better manage and interpret microarray data. How can we improve methods to store and retrieve microarray data from a relational database? What experimental and biological factors most influence our interpretation of a microarray's measurements? By accounting for these factors, can we improve the accuracy and precision of microarray measurements? It is essential to address these last two questions before using omics data for downstream analyses, such as inferring transcription regulatory networks from microarray data. While answers to such questions are vital to microarray research in particular, they are equally relevant to systems biology in general. We designed three studies to investigate aspects of these questions when using Affymetrix expression arrays. In the first study, we developed the Data-FATE framework to improve the handling of large scientific data sets. In the next two studies, we developed methods and tools that allow us to examine the impact of physical and technical factors known or suspected to dramatically alter the interpretation of a microarray experiment. In the second study, we developed ArrayInitiative, a tool that simplifies the process of creating custom CDFs, so that we can easily re-design the array specifications for Affymetrix 3' IVT expression arrays. This tool is essential for testing the impact of the various factors, and for making the framework easy to communicate and re-use. We then used ArrayInitiative in a case study to illustrate the impact of several factors known to distort microarray signals. In the third study, we systematically and exhaustively examined the effect of physical and technical factors, both generally accepted and novel, on our interpretation of dozens of experiments using hundreds of E. coli Affymetrix microarrays.
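    For illustration only, the sketch below shows the basic remapping that a custom CDF encodes: individual probe intensities are regrouped into redefined probe sets and summarized per set. The column names, toy values, and the plain-median summary are assumptions, not the Data-FATE or ArrayInitiative implementation.

```python
# Illustrative sketch of probe-to-probe-set remapping, the kind of mapping a
# custom CDF encodes (e.g. after removing cross-hybridizing probes or
# re-annotating to current gene models). Data and names are synthetic.
import pandas as pd

probes = pd.DataFrame({
    'probe_id':  ['p1', 'p2', 'p3', 'p4', 'p5'],
    'intensity': [120.0, 95.0, 300.0, 310.0, 305.0],
})

# Hypothetical custom CDF: maps each probe to a redefined probe set.
custom_cdf = {'p1': 'geneA', 'p2': 'geneA', 'p3': 'geneB', 'p4': 'geneB', 'p5': 'geneB'}

probes['probe_set'] = probes['probe_id'].map(custom_cdf)
summary = probes.groupby('probe_set')['intensity'].median()
print(summary)   # one summarized value per redefined probe set
```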

    Alternatives to relational databases in precision medicine: comparison of NOSQL approaches for big data storage using supercomputers

    Improvements in medical and genomic technologies have dramatically increased the production of electronic data over the last decade. As a result, data management is rapidly becoming a major determinant, and an urgent challenge, for the development of Precision Medicine. Although successful data management is achievable using Relational Database Management Systems (RDBMS), exponential data growth is a significant contributor to failure scenarios. Growing amounts of data can also be observed in other sectors, such as economics and business, which, together with the previous facts, suggests that alternative database approaches (NoSQL) may soon be required for efficient storage and management of big databases. However, this hypothesis has been difficult to test in the Precision Medicine field, since alternative database architectures are complex to assess and means to integrate heterogeneous electronic health records (EHR) with dynamic genomic data are not easily available. In this dissertation, we present a novel set of experiments for identifying NoSQL database approaches that enable effective data storage and management in Precision Medicine, using patients' clinical and genomic information from The Cancer Genome Atlas (TCGA). The first experiment evaluates performance and scalability using biologically meaningful queries of differing complexity across different database sizes. The second experiment measures performance and scalability for database updates without schema changes. The third experiment assesses performance and scalability for database updates with schema modifications due to dynamic data. We have identified two NoSQL approaches, based on Cassandra and Redis, which appear to be ideal database management systems for our precision medicine queries in terms of performance and scalability. We present these NoSQL approaches and show how they can be used to manage clinical and genomic big data. Our research is relevant to public health, since we are focusing on one of the main challenges to the development of Precision Medicine and, consequently, investigating a potential solution to the progressively increasing demands on health care.
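    A minimal sketch of the kind of timing harness such a comparison relies on is shown below, issuing the same biologically motivated lookup against Cassandra and Redis and recording wall-clock time. The keyspace, table, and key layouts are hypothetical and not the TCGA schema used in the dissertation.

```python
# Sketch of a back-end comparison: time one "variants for a patient/gene pair"
# lookup in Cassandra and in Redis. Schema and key names are hypothetical.
import time
import redis
from cassandra.cluster import Cluster

def timed(fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Cassandra: parameterized CQL query against a hypothetical keyspace/table.
session = Cluster(['127.0.0.1']).connect('precision_medicine')
cql = "SELECT variant, effect FROM variants WHERE patient_id=%s AND gene=%s"
rows, cassandra_s = timed(session.execute, cql, ('TCGA-XX-0001', 'TP53'))

# Redis: the same lookup served from a hash keyed per patient/gene (hypothetical layout).
r = redis.Redis(host='127.0.0.1')
variants, redis_s = timed(r.hgetall, 'variants:TCGA-XX-0001:TP53')

print(f'Cassandra: {cassandra_s:.4f}s, Redis: {redis_s:.4f}s')
```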

    Discovering Biomarkers of Alzheimer's Disease by Statistical Learning Approaches

    In this work, statistical learning approaches are exploited to discover biomarkers for Alzheimer's disease (AD). Contributions have been made in both biomarker-driven and software-driven studies. Surprising discoveries were made in the search for blood-based biomarkers: with the inclusion of existing biological knowledge and a proposed novel feature selection method, several blood-based protein models were discovered with promising ability to separate AD patients from healthy individuals. A new statistical pattern was also discovered that could serve as a potential new guideline for diagnosis methodology. In the field of brain-based biomarkers, the positive contribution of covariates such as age, gender, and APOE genotype to an AD classifier was verified, and a panel of highly informative biomarkers comprising 26 RNA transcripts was identified. The classifier trained on this panel of genes shows excellent capacity for discriminating patients from controls. Apart from the biomarker-driven studies, statistical packages and applications were also developed. The R package metaUnion was designed and developed to provide an advanced meta-analytic approach applicable to microarray data. This package overcomes defects of previous meta-analytic packages: 1) the neglect of missing data, 2) the inflexibility of feature dimensions, and 3) the lack of functions to support post-analysis summary. metaUnion has been applied in a published study as part of an integrated genomic approach and resulted in significant findings. To provide dementia researchers with benchmark references on the significance of features, a web-based platform, AlzExpress, was built to give researchers granular-level differential expression test and meta-analysis results. A combination of modern big data technologies and robust data mining algorithms makes AlzExpress a flexible, scalable, and comprehensive platform of valuable bioinformatics for dementia research.
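    As a hedged illustration of the classification setup described above (a transcript panel plus covariates such as age, gender, and APOE feeding a classifier evaluated by cross-validation), the sketch below uses synthetic data and an ordinary logistic-regression pipeline; it is not the thesis' feature selection method or its trained model.

```python
# Minimal sketch: cross-validated classification of AD vs. control from a
# 26-feature transcript panel plus covariates. All data here are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples, n_transcripts = 200, 26               # e.g. a 26-transcript panel
X_panel = rng.normal(size=(n_samples, n_transcripts))
X_covariates = rng.normal(size=(n_samples, 3))   # stand-ins for age, gender, APOE dose
X = np.hstack([X_panel, X_covariates])
y = rng.integers(0, 2, size=n_samples)           # AD vs. control labels (synthetic)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
auc = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
print('cross-validated AUC:', auc.mean())
```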

    Applied Bioinformatics in Saccharomyces cerevisiae: Data storage, integration and analysis

    The massive amount of biological data has had a significant effect on the field of bioinformatics. This growth of data has not only led to a growing number of biological databases but has also imposed the need for additional and more sophisticated computational techniques to proficiently manage, store, and retrieve these data, as well as to help gain biological insights and contribute to novel discoveries. This thesis presents results from applying several bioinformatics approaches to yeast datasets. Three yeast databases were developed using different technologies, each emphasizing a specific aspect. yApoptosis collects and structurally organizes vital information specifically for the yeast cell death pathway, apoptosis; it includes predicted protein complexes and clustered motifs derived from the incorporation of apoptosis genes and interaction data. yStreX highlights the exploitation of transcriptome data generated by studies of stress responses and ageing in yeast; it contains a compilation of results from gene expression analyses in different contexts, making it an integrated resource that facilitates data query and comparison between different experiments. The yeast data repository is a centralized database encompassing multiple kinds of yeast data; it is built on a dedicated database system that was developed to address the data integration issues involved in managing heterogeneous datasets. Data analysis was performed in parallel using several methods and software packages, such as Limma, Piano, and metaMA. In particular, the gene expression of chronologically ageing yeast was analyzed in an integrative fashion to gain a more thorough picture of the condition, covering gene expression patterns, biological processes, transcriptional regulation, metabolic pathways, and interactions of active components. This study demonstrates extensive applications of bioinformatics in the domains of data storage, data sharing, data integration, and data analysis on various data from the yeast S. cerevisiae in order to gain biological insights. Numerous methodologies and technologies were selectively applied in different contexts, depending on the characteristics of the data and the goal of the specific biological question.
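    The sketch below is not the Limma/Piano/metaMA pipeline itself; it only illustrates the underlying idea of a per-gene differential expression test between two conditions with multiple-testing correction, run on a synthetic expression matrix with made-up ORF names.

```python
# Toy per-gene differential expression: t-test between two conditions with
# Benjamini-Hochberg correction. Synthetic data; not the R pipeline used above.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
genes = [f'YGR{n:03d}W' for n in range(50)]          # hypothetical ORF names
young = rng.normal(8.0, 1.0, size=(50, 4))           # log2 expression, 4 replicates
aged = rng.normal(8.3, 1.0, size=(50, 4))

t_stat, p_val = stats.ttest_ind(aged, young, axis=1)
reject, p_adj, _, _ = multipletests(p_val, alpha=0.05, method='fdr_bh')
for gene, p, hit in zip(genes, p_adj, reject):
    if hit:
        print(gene, f'adjusted p = {p:.3g}')
```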

    Development of novel software tools and methods for investigating the significance of overlapping transcription factor genomic interactions

    Identifying overlapping DNA binding patterns of different transcription factors is a major objective of genomic studies, but existing methods for archiving large numbers of datasets in a personalised database lack sophistication and utility. To address this need, various database systems were benchmarked and a tool, BiSA (Binding Sites Analyser), was developed for archiving genomic regions and easily identifying overlap with, or proximity to, other regions of interest. BiSA can also calculate the statistical significance of overlapping regions, and can identify genes located near binding regions of interest or genomic features near a gene or locus of interest. BiSA was populated with more than 1,000 datasets from previously published genomic studies describing transcription factor binding sites and histone modifications. Using BiSA, the relationships between binding sites for a range of transcription factors were analysed and a number of statistically significant relationships were identified. This included an extensive comparison of estrogen receptor alpha (ERα) and progesterone receptor (PR) in breast cancer cells, which revealed a statistically significant functional relationship at a subset of sites. In summary, the BiSA comprehensive knowledge base contains publicly available datasets describing transcription factor binding sites and epigenetic modifications, and provides an easy graphical interface for biologists to perform advanced analysis of genomic interactions.
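    The core operation BiSA performs, finding binding regions from one dataset that overlap regions from another, can be sketched as a simple per-chromosome sorted sweep, as below; the real tool works against its populated database and also computes statistical significance, which this toy example omits.

```python
# Toy region-overlap finder: report pairs of intervals from two datasets that
# overlap on the same chromosome, using a sort plus early exit per query region.
from collections import defaultdict

def overlaps(regions_a, regions_b):
    """regions_*: iterables of (chrom, start, end). Yields overlapping pairs."""
    by_chrom = defaultdict(list)
    for chrom, start, end in regions_b:
        by_chrom[chrom].append((start, end))
    for chrom in by_chrom:
        by_chrom[chrom].sort()                       # sort by start coordinate
    for chrom, a_start, a_end in regions_a:
        for b_start, b_end in by_chrom.get(chrom, []):
            if b_start > a_end:
                break                                # sorted, so no later region can overlap
            if b_end >= a_start:
                yield (chrom, a_start, a_end), (chrom, b_start, b_end)

era = [('chr1', 100, 200), ('chr2', 500, 600)]       # e.g. ERα peaks (toy coordinates)
pr = [('chr1', 150, 250), ('chr2', 700, 800)]        # e.g. PR peaks (toy coordinates)
print(list(overlaps(era, pr)))
```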

    Implementation and optimization of algorithms for the analysis of Biomedical Big Data

    Big Data analytics poses many challenges to the research community, which has to handle several computational problems related to the vast amounts of data involved. Interest is growing in biomedical data, with the aim of achieving so-called personalized medicine, in which therapy plans are designed for the specific genotype and phenotype of an individual patient, and algorithm optimization plays a key role in this effort. In this work we discuss several topics related to biomedical big data analytics, with special attention to the numerical issues and algorithmic solutions they raise. We introduce a novel feature selection algorithm tailored to omics datasets and demonstrate its efficiency on synthetic and real high-throughput genomic datasets, obtaining better or comparable results against other state-of-the-art methods. We also implemented and optimized different types of deep learning models, testing their efficiency on biomedical image processing tasks. Three novel frameworks for developing deep learning neural network models are discussed and used to describe the proposed numerical improvements on various topics. In the first implementation we optimize two super-resolution models, showing their results on NMR images and demonstrating their efficiency in generalization tasks without retraining. The second optimization involves a state-of-the-art object detection neural network architecture, obtaining a significant speedup in computational performance. In the third application we address the femur head segmentation problem on CT images using deep learning algorithms. The last part of this work involves the implementation of a novel biomedical database obtained by harmonizing multiple data sources, which provides network-like relationships between biomedical entities. Data related to diseases and other related biological entities were mined using web scraping methods, and a novel natural language processing pipeline was designed to maximize the overlap between the different data sources involved in this project.
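    As a baseline of the kind such a feature selection study is typically compared against (not the thesis' novel algorithm), the sketch below ranks the features of a synthetic high-dimensional dataset by mutual information, keeps the top k, and checks a downstream classifier by cross-validation.

```python
# Baseline feature-selection sketch on a synthetic "omics-like" matrix:
# many features, few informative ones; select top-k by mutual information.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=150, n_features=2000, n_informative=20,
                           random_state=0)

model = make_pipeline(
    SelectKBest(mutual_info_classif, k=50),   # keep the 50 highest-MI features
    LogisticRegression(max_iter=1000),
)
print('CV accuracy:', cross_val_score(model, X, y, cv=5).mean())
```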