
    The iPlant Collaborative: Cyberinfrastructure for Plant Biology

    The iPlant Collaborative (iPlant) is a United States National Science Foundation (NSF) funded project that aims to create an innovative, comprehensive, and foundational cyberinfrastructure in support of plant biology research (PSCIC, 2006). iPlant is developing cyberinfrastructure that uniquely enables scientists throughout the diverse fields that comprise plant biology to address Grand Challenges in new ways, to stimulate and facilitate cross-disciplinary research, to promote biology and computer science research interactions, and to train the next generation of scientists on the use of cyberinfrastructure in research and education. Meeting humanity's projected demands for agricultural and forest products and the expectation that natural ecosystems be managed sustainably will require synergies from the application of information technologies. The iPlant cyberinfrastructure design is based on an unprecedented period of research community input, and leverages developments in high-performance computing, data storage, and cyberinfrastructure for the physical sciences. iPlant is an open-source project with application programming interfaces that allow the community to extend the infrastructure to meet its needs. iPlant is sponsoring community-driven workshops addressing specific scientific questions via analysis tool integration and hypothesis testing. These workshops teach researchers how to add bioinformatics tools and/or datasets into the iPlant cyberinfrastructure, enabling plant scientists to perform complex analyses on large datasets without the need to master the command line or high-performance computing services.

    Mapping-by-sequencing in complex polyploid genomes using genic sequence capture: a case study to map yellow rust resistance in hexaploid wheat

    Previously we extended the utility of mapping-by-sequencing by combining it with sequence capture and mapping sequence data to pseudo-chromosomes that were organized using wheat-Brachypodium synteny. This, together with a bespoke haplotyping algorithm, enabled us to map the flowering time locus in the diploid wheat Triticum monococcum L., identifying a set of deleted genes (Gardiner et al., 2014). Here, we develop this combination of gene enrichment and sliding-window mapping-by-synteny analysis to map the Yr6 locus for yellow stripe rust resistance in hexaploid wheat. A 110 Mb NimbleGen capture probe set was used to enrich and sequence a doubled-haploid mapping population of hexaploid wheat derived from an Avalon and Cadenza cross. The Yr6 locus was identified by mapping to the POPSEQ chromosomal pseudomolecules using a bespoke pipeline and algorithm (Chapman et al., 2015). Furthermore, the same locus was identified using newly developed pseudo-chromosome sequences, based on the genic sequence used for sequence enrichment, as a mapping reference. The pseudo-chromosomes allow us to demonstrate the application of mapping-by-sequencing even to poorly defined polyploid genomes where chromosomes are incomplete and sub-genome assemblies are collapsed. This analysis uniquely enabled us to compare wheat genome annotations; identify the Yr6 locus, defining a smaller genic region than was previously possible; associate the interval with one wheat sub-genome; and increase the density of associated SNP markers. Finally, we built the pipeline in iPlant, making it a user-friendly community resource for phenotype mapping.
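    The sliding-window scan described above can be illustrated in miniature. The sketch below is a hedged toy, not the authors' bespoke pipeline: it averages alternate-allele frequencies of SNPs in overlapping windows along one pseudo-chromosome, so a region tightly linked to the resistance locus stands out as the window closest to fixation. All positions, frequencies, and window sizes are invented for illustration.

```python
# Minimal sliding-window allele-frequency scan (toy illustration;
# not the published pipeline). snps holds (position, alt-allele
# frequency) pairs from the phenotypically selected pool.

def sliding_window_scan(snps, window=1_000_000, step=250_000):
    """Return (window_start, mean alt-allele frequency) for each
    window that contains at least one SNP."""
    if not snps:
        return []
    last = max(pos for pos, _ in snps)
    out, start = [], 0
    while start <= last:
        freqs = [f for pos, f in snps if start <= pos < start + window]
        if freqs:
            out.append((start, sum(freqs) / len(freqs)))
        start += step
    return out

# Toy data: SNPs near 2 Mb approach fixation, mimicking tight
# linkage to a resistance locus in the selected pool.
snps = [(100_000, 0.50), (900_000, 0.55), (1_900_000, 0.95),
        (2_100_000, 0.98), (3_500_000, 0.50)]
peak = max(sliding_window_scan(snps), key=lambda w: w[1])
print(peak)  # the window starting at 2,000,000 has the highest mean
```

    In a real analysis the windowed statistic would be computed over reads mapped to the pseudo-chromosome reference, but the peak-finding logic is the same.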

    Knowledge discovery with recommenders for big data management in science and engineering communities

    Recent science and engineering research tasks are increasingly data-intensive and use workflows to automate the integration and analysis of voluminous data to test hypotheses. In particular, bold scientific advances in areas of neuroscience and bioinformatics necessitate access to multiple data archives, heterogeneous software and computing resources, and multi-site interdisciplinary expertise. Datasets are evolving, and new tools are continuously invented to achieve new state-of-the-art performance. Principled cyber and software automation approaches to data-intensive analytics, using systematic integration of cyberinfrastructure (CI) technologies and knowledge-discovery-driven algorithms, will significantly enhance research and interdisciplinary collaborations in science and engineering. In this thesis, we demonstrate a novel recommender approach to discover latent knowledge patterns from both the infrastructure perspective (i.e., measurement recommender) and the applications perspective (i.e., topic recommender and scholar recommender). From the infrastructure perspective, we identify and diagnose network-wide anomaly events to address performance bottlenecks by proposing a novel measurement recommender scheme. In cases where ground truth is lacking in network performance monitoring (e.g., perfSONAR deployments), root-cause analysis is hard to perform in a multi-domain context. To solve this problem, we define a "social plane" concept that relies on recommendation schemes to share diagnosis knowledge and work collaboratively. Our solution makes it easier for network operators and application users to quickly and effectively troubleshoot performance bottlenecks on wide-area network backbones. To evaluate our "measurement recommender", we use both real and synthetic datasets.
The results show our measurement recommender scheme has high performance in terms of precision, recall, and accuracy, as well as efficiency in terms of the time taken for large-volume measurement trace analysis. From the application perspective, our goal is to shorten time to knowledge discovery and adapt prior domain knowledge for computational and data-intensive communities. To achieve this goal, we design a novel topic recommender that leverages a domain-specific topic model (DSTM) algorithm to help scientists find the relevant tools or datasets for their applications. The DSTM is a probabilistic graphical model that extends Latent Dirichlet Allocation (LDA) and uses a Markov chain Monte Carlo (MCMC) algorithm to infer latent patterns within a specific domain in an unsupervised manner. We evaluate our scheme on large collections of data (i.e., publications, tools, datasets) from the bioinformatics and neuroscience domains. Our experimental results using the perplexity metric show that our model has better generalization performance within a domain for discovering highly specific latent topics. Lastly, to enhance collaborations among scholars to generate new knowledge, it is necessary to identify scholars with specific research interests or cross-domain expertise. We propose a "ScholarFinder" model to quantify expert knowledge based on publications and funding records using a deep generative model. Our model embeds scholars' knowledge in order to recommend suitable scholars for multi-disciplinary tasks. We evaluate our model against state-of-the-art baseline models (e.g., XGBoost, DNN), and experimental results show that our ScholarFinder model outperforms them in terms of precision, recall, F1-score, and accuracy. Includes bibliographical references (pages 113-124).
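    The DSTM described above builds on the standard LDA machinery. As a hedged illustration of that core (not the thesis's actual DSTM, corpus, or hyperparameters), the sketch below implements a minimal collapsed Gibbs sampler for plain LDA on a toy corpus; the documents, topic count, and priors are all invented for the example.

```python
# Minimal collapsed Gibbs sampler for plain LDA, the model that a
# domain-specific topic model (DSTM) would extend. Toy corpus and
# hyperparameters; not the thesis's data or settings.
import random
from collections import defaultdict

def lda_gibbs(docs, K=2, iters=200, alpha=0.1, beta=0.01, seed=0):
    """docs: list of token lists. Returns the estimated
    per-document topic mixture (one length-K row per doc)."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    ndk = [[0] * K for _ in docs]                # doc-topic counts
    nkw = [defaultdict(int) for _ in range(K)]   # topic-word counts
    nk = [0] * K                                 # tokens per topic
    z = []                                       # topic of each token
    for d, doc in enumerate(docs):               # random initialisation
        zs = []
        for w in doc:
            k = rng.randrange(K)
            zs.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zs)
    for _ in range(iters):                       # Gibbs sweeps
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                      # remove token's counts
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # full conditional p(z = t | rest), up to a constant
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta)
                           / (nk[t] + V * beta) for t in range(K)]
                r = rng.random() * sum(weights)
                k = K - 1
                for t, wt in enumerate(weights):
                    if r < wt:
                        k = t
                        break
                    r -= wt
                z[d][i] = k                      # re-add under new topic
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return [[(c + alpha) / (len(doc) + K * alpha) for c in row]
            for row, doc in zip(ndk, docs)]

docs = [["genome", "snp", "assembly", "genome", "snp"],
        ["neuron", "spike", "cortex", "neuron", "spike"],
        ["genome", "assembly", "snp", "genome"]]
theta = lda_gibbs(docs, K=2)
```

    A recommender built on such mixtures would then rank tools or datasets by the similarity of their topic vectors to a researcher's documents; the DSTM additionally conditions these distributions on domain structure.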

    Doctor of Philosophy

    The MAKER genome annotation and curation software tool was developed in response to increased demand for genome annotation services, secondary to decreased genome sequencing costs. MAKER currently has over 1000 registered users throughout the world. This wide adoption of MAKER has uncovered the need for additional functionalities. Here I addressed moving MAKER into the domain of plant annotation, expanding MAKER to include new methods of gene and noncoding RNA annotation, and improving the usability of MAKER through documentation and community outreach. To move MAKER into the plant annotation domain, I benchmarked MAKER on the well-annotated Arabidopsis thaliana genome. MAKER performs well on the Arabidopsis genome in de novo genome annotation and was able to improve the current TAIR10 gene models by incorporating mRNA-seq data not available during the original annotation efforts. In addition to this benchmarking, I annotated the genome of the sacred lotus Nelumbo nucifera. I enabled noncoding RNA annotation in MAKER by adding the ability for MAKER to run and process the outputs of tRNAscan-SE and snoscan. These functionalities were tested on the Arabidopsis genome and then used to annotate tRNAs and snoRNAs in Zea mays. The resulting version of MAKER was named MAKER-P. I added the functionality of a combiner by adding EVidence Modeler to the MAKER code base. As the number of MAKER users has grown, so have the help requests sent to the MAKER developers list. Motivated by the belief that improving the MAKER documentation would obviate the need for many of these requests, I created a media wiki linked to the MAKER download page, and the MAKER developers list was made searchable. Additionally, I have written a unit on genome annotation using MAKER for Current Protocols in Bioinformatics. Following these efforts, I have seen a corresponding decrease in help requests, even though the number of registered MAKER users continues to increase.
Taken together, these products and activities have moved MAKER into the domain of plant annotation, expanded it to include new methods of gene and noncoding RNA annotation, and improved its usability through documentation and community outreach.
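    Benchmarking an annotation set against a reference, as described above, typically reduces to comparing predicted features with curated ones. The sketch below is a hedged toy of one common metric, exact-match exon sensitivity and specificity; the coordinates are invented and this is not the TAIR10 evaluation itself.

```python
# Toy exact-match exon comparison (illustrative only; real
# benchmarks such as those used for MAKER also score nucleotide-
# and gene-level overlap, not just exact exon matches).

def exon_metrics(reference, predicted):
    """Each exon is a (chrom, start, end) tuple. Sensitivity is
    the fraction of reference exons recovered exactly; specificity
    (precision) is the fraction of predictions that are correct."""
    ref, pred = set(reference), set(predicted)
    tp = len(ref & pred)  # exact-coordinate matches
    sensitivity = tp / len(ref) if ref else 0.0
    specificity = tp / len(pred) if pred else 0.0
    return sensitivity, specificity

ref = {("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 150)}
pred = {("chr1", 100, 200), ("chr1", 310, 400), ("chr2", 50, 150)}
sn, sp = exon_metrics(ref, pred)
print(sn, sp)  # two of three exons match exactly in each direction
```

    Incorporating new evidence such as mRNA-seq raises sensitivity when it recovers exons the original annotation missed, which is how an improvement over existing gene models can be quantified.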

    Annual Report


    AgBioData consortium recommendations for sustainable genomics and genetics databases for agriculture

    The future of agricultural research depends on data. The sheer volume of agricultural biological data being produced today makes excellent data management essential. Governmental agencies, publishers and science funders require data management plans for publicly funded research. Furthermore, the value of data increases exponentially when they are properly stored, described, integrated and shared, so that they can be easily utilized in future analyses. AgBioData (https://www.agbiodata.org) is a consortium of people working at agricultural biological databases, data archives and knowledgebases who strive to identify common issues in database development, curation and management, with the goal of creating database products that are more Findable, Accessible, Interoperable and Reusable. We strive to promote authentic, detailed, accurate and explicit communication between all parties involved in scientific data. As a step toward this goal, we present the current state of biocuration, ontologies, metadata and persistence, database platforms, programmatic (machine) access to data, communication and sustainability with regard to data curation. Each section describes challenges and opportunities for these topics, along with recommendations and best practices.