22 research outputs found

    Statistical methods to improve the analysis of biological data: Benchmarking phenotypes, protein function prediction, and spatial modelling of gene expression

    Data collected in biological experiments comes in all shapes and sizes, including DNA and protein sequences, mRNA counts, spatial interactions, protein annotations, phenotypic images and so on. To make sense of this myriad of data, novel statistical methods are needed not only to model the biological data, but also to assess the accuracy of predictions. In this thesis, I present three research studies that apply statistical analysis to the benchmarking, assessment and modelling of genetic data, demonstrating the diversity of bioinformatics research. The approach taken here is to tailor statistical methods to specific data types. To provide quality benchmark data for phenotypic image processing and assessment, a generalized linear mixed-effects model was used to compare how well different groups of people (lay people recruited through Amazon Mechanical Turk versus experts) highlighted key elements in phenotypic images collected from corn fields. The analyzed images were then used as ground truth for the training and testing of automated methods. We concluded that properly managed crowdsourcing can be used to establish large volumes of viable ground truth data at a low cost and high quality, especially in the context of high-throughput plant phenotyping. To assess the quality of computational protein function predictions, the third Critical Assessment of Functional Annotation (CAFA) was launched to evaluate predictions in the form of a community challenge. Each protein is associated with multiple functions represented by Gene Ontology terms (labels). These ontological terms form a hierarchical structure, and the frequency of each term is not distributed uniformly among different proteins. Precision-recall based assessment metrics were not enough to account for the non-uniform prior distribution of this multi-label problem, so semantic-distance based methods were developed for better model assessment.
We concluded that while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not. Term-centric prediction of experimental annotations remains equally challenging; although the performance of the top methods is significantly better than expectations set by baseline methods, it leaves considerable room and need for improvement. The CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation databases, computational function prediction, and our ability to manage big data in the era of large experimental screens. To model the spatial dependency of gene expression on the 3D structure of the genome, a Poisson Hierarchical Markov Random Field model (PhiMRF) was developed for gene expression data that accounts for the pairwise spatial interactions measured in Hi-C experiments. The quantitative expression of genes on human chromosomes 1, 4, 5, 6, 8, 9, 12, 19, 20, 21 and X all showed meaningful positive intra-chromosomal spatial dependency. Moreover, the spatial dependency is much stronger than the dependency based on linear gene neighborhoods, suggesting that 3D chromosome structures such as chromatin loops and Topologically Associating Domains (TADs) are indeed strongly correlated with gene expression levels. The results both confirm and quantify the spatial correlation in gene expression. In addition, PhiMRF improves upon the stochastic modelling of gene expression that is currently widely used in differential expression analyses. PhiMRF is available as an R package at https://github.com/ashleyzhou972/PhiMRF.
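
The precision-recall based assessment mentioned in the abstract above can be illustrated with a small sketch of a protein-centric maximum F-measure. This is an illustration only, not the thesis's or CAFA's evaluation code: the function name `f_max`, the data layout, and the threshold grid are assumptions, and the sketch ignores propagation of terms through the Gene Ontology hierarchy and the term-frequency weighting that motivates semantic-distance measures.

```python
def f_max(predictions, truth, thresholds=None):
    """Maximum F-measure over score thresholds (illustrative sketch).

    predictions: dict mapping protein id -> dict of {term: score in [0, 1]}
    truth:       dict mapping protein id -> non-empty set of true terms
    """
    if thresholds is None:
        # Hypothetical grid: 0.01, 0.02, ..., 1.00
        thresholds = [i / 100 for i in range(1, 101)]

    best = 0.0
    for t in thresholds:
        precisions = []    # one entry per protein with at least one prediction
        recall_sum = 0.0   # summed over all benchmark proteins
        for protein, true_terms in truth.items():
            scores = predictions.get(protein, {})
            predicted = {term for term, s in scores.items() if s >= t}
            hits = len(predicted & true_terms)
            if predicted:
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue  # nothing predicted at this threshold
        pr = sum(precisions) / len(precisions)
        rc = recall_sum / len(truth)
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best


# Toy data; the protein and term identifiers are made up.
toy_predictions = {"P1": {"GO:A": 0.9, "GO:B": 0.4}, "P2": {"GO:C": 0.8}}
toy_truth = {"P1": {"GO:A"}, "P2": {"GO:C", "GO:D"}}
print(f_max(toy_predictions, toy_truth))
```

Every term counts equally in this score, so a rare, specific term and a frequent, general one contribute the same amount. That is the limitation the abstract attributes to precision-recall metrics under a non-uniform prior.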

    Evaluation on Transfer Efficiency at Integrated Transport Terminals through Multilevel Grey Evaluation

    Transfer efficiency at integrated transportation terminals is important to both passengers and operating companies. In this paper, we propose a set of criteria and a hierarchical index system to evaluate transfer conditions inside Beijing South Railway Station. To make the assessment more scientific, we assign a weight to each criterion using an integrated weighting method. We then apply Multi-level Grey Evaluation to calculate performance indexes for the different transfer modes in the station and compare the resulting rankings of their transfer efficiency.

    The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens

    Background: The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. Results: Here, we report on the results of the third CAFA challenge, CAFA3, which featured an expanded analysis over the previous CAFA rounds, both in the volume of data analyzed and in the types of analysis performed. In a major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for more than 1000 genes. Specifically, we performed experimental whole-genome mutation screening in Candida albicans and Pseudomonas aeruginosa genomes, which provided us with genome-wide experimental data for genes associated with biofilm formation and motility. We further performed targeted assays on selected genes in Drosophila melanogaster, which we suspected of being involved in long-term memory. Conclusion: We conclude that while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not. Term-centric prediction of experimental annotations remains equally challenging; although the performance of the top methods is significantly better than the expectations set by baseline methods in C. albicans and D. melanogaster, it leaves considerable room and need for improvement. Finally, we report that the CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation, computational function prediction, and our ability to manage big data in the era of large experimental screens.

    Intercropping with Potato-Onion Enhanced the Soil Microbial Diversity of Tomato

    Intercropping can support sustainable agricultural development by increasing plant diversity. In this study, we investigated the effects of tomato monoculture and tomato/potato-onion intercropping systems on tomato seedling growth and on changes in soil microbial communities under greenhouse conditions. Results showed that intercropping with potato-onion increased tomato seedling biomass. Compared with the monoculture system, the intercropping system increased the alpha diversity of the soil bacterial and fungal communities, as well as the beta diversity and abundance of the bacterial community. In contrast, the beta diversity and abundance of the fungal community did not differ between the intercropping and monoculture systems. The relative abundances of some taxa (e.g., Acidobacteria-Subgroup-6, Arthrobacter, Bacillus, Pseudomonas) and of several OTUs with the potential to promote plant growth were increased, while the relative abundances of some potential plant pathogens (e.g., Cladosporium) were decreased in the intercropping system. Redundancy analysis indicated that bacterial community structure was significantly influenced by soil organic carbon and pH, while fungal community structure was related to changes in soil organic carbon and available phosphorus. Overall, our results suggest that the tomato/potato-onion intercropping system altered soil microbial communities and improved the soil environment, which may be the main factor in promoting tomato growth.

    Development of an Adaptive Fuzzy Integral-Derivative Line-of-Sight Method for Bathymetric LiDAR Onboard Unmanned Surface Vessel

    Previous control methods developed by our research team cannot satisfy the high accuracy requirements of unmanned surface vessel (USV) path-tracking during bathymetric mapping because of excessive overshoot and slow convergence. For this reason, this study developed an adaptive fuzzy integral-derivative line-of-sight (AFIDLOS) method for USV path-tracking control. Integral and derivative terms were added to counteract the effect of the sideslip angle, so that the USV could be quickly guided to converge to the planned path for bathymetric mapping. A fuzzy control method was proposed to determine the look-ahead distance accurately. The proposed method was verified using simulations and outdoor experiments. The results demonstrate that, compared with the traditional guidance law, the AFIDLOS method can reduce the overshoot by 79.85% and shorten the settling time by 55.32% in simulation experiments, reduce the average cross-track error by 10.91%, and ensure a 30% overlap of neighboring strips in outdoor bathymetric LiDAR mapping.
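
For readers unfamiliar with the line-of-sight guidance that the abstract above builds on, the following is a minimal sketch of a textbook look-ahead guidance law with an optional integral term. It is not the AFIDLOS method: the function name, parameter names, and sign convention (positive cross-track error means the vessel is to the right of the path) are assumptions made for illustration, and the fuzzy adaptation of the look-ahead distance and the derivative term are omitted.

```python
import math


def los_heading(cross_track_error, path_angle, lookahead,
                integral_state=0.0, integral_gain=0.0):
    """Desired heading in radians from look-ahead line-of-sight guidance.

    cross_track_error: signed lateral distance from the path (m)
    path_angle:        direction of the path segment (rad)
    lookahead:         look-ahead distance (m), must be positive
    integral_state, integral_gain: optional integral action, intended to
        offset a steady drift such as the one caused by a sideslip angle
    """
    if lookahead <= 0:
        raise ValueError("lookahead must be positive")
    correction = math.atan(
        (cross_track_error + integral_gain * integral_state) / lookahead
    )
    # Steer back toward the path: a larger error gives a larger correction,
    # bounded by 90 degrees because of the arctangent.
    return path_angle - correction


# On the path, the desired heading equals the path direction.
print(los_heading(0.0, 0.3, lookahead=5.0))
# An error equal to the look-ahead distance gives a 45 degree correction.
print(math.degrees(los_heading(5.0, 0.0, lookahead=5.0)))
```

A shorter look-ahead distance makes the correction more aggressive and a longer one makes it smoother, which is why a guidance scheme may choose to adapt that distance online.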