3,005 research outputs found

    Overview of the gene ontology task at BioCreative IV

    Gene Ontology (GO) annotation is a common task among model organism databases (MODs) for capturing gene function data from journal articles. It is a time-consuming and labor-intensive task, and is thus often considered as one of the bottlenecks in literature curation. There is a growing need for semiautomated or fully automated GO curation techniques that will help database curators to rapidly and accurately identify gene function information in full-length articles. Despite multiple attempts in the past, few studies have proven to be useful with regard to assisting real-world GO curation. The shortage of sentence-level training data and opportunities for interaction between text-mining developers and GO curators has limited the advances in algorithm development and corresponding use in practical circumstances. To this end, we organized a text-mining challenge task for literature-based GO annotation in BioCreative IV. More specifically, we developed two subtasks: (i) to automatically locate text passages that contain GO-relevant information (a text retrieval task) and (ii) to automatically identify relevant GO terms for the genes in a given article (a concept-recognition task). With the support from five MODs, we provided teams with >4000 unique text passages that served as the basis for each GO annotation in our task data. Such evidence text information has long been recognized as critical for text-mining algorithm development but was never made available because of the high cost of curation. In total, seven teams participated in the challenge task. From the team results, we conclude that the state of the art in automatically mining GO terms from literature has improved over the past decade while much progress is still needed for computer-assisted GO curation. 
Future work should focus on addressing the remaining technical challenges to improve automatic GO concept recognition, and on bringing the practical benefits of text-mining tools into real-world GO annotation.

    Finding approximate gene clusters with GECKO 3

    Winter S, Jahn K, Wehner S, et al. Finding approximate gene clusters with GECKO 3. Nucleic Acids Research. 2016;44(20):9600-9610. Gene-order-based comparison of multiple genomes provides signals for functional analysis of genes and the evolutionary process of genome organization. Gene clusters are regions of co-localized genes on genomes of different species. The rapid increase in sequenced genomes necessitates bioinformatics tools for finding gene clusters in hundreds of genomes. Existing tools are often restricted to a few (in many cases, only two) genomes, and often make restrictive assumptions such as short perfect conservation, conserved gene order or monophyletic gene clusters. We present Gecko 3, open-source software for finding gene clusters in hundreds of bacterial genomes that comes with an easy-to-use graphical user interface. The underlying gene cluster model is intuitive, can cope with low degrees of conservation as well as misannotations, and is complemented by a sound statistical evaluation. To evaluate the biological benefit of Gecko 3 and to exemplify our method, we search for gene clusters in a dataset of 678 bacterial genomes using Synechocystis sp. PCC 6803 as a reference. We confirm detected gene clusters by reviewing the literature and comparing them to a database of operons; we detect two novel clusters, which were confirmed by publicly available experimental RNA-Seq data. The computational analysis is carried out on a laptop computer in <40 min.

    Identification of conserved gene clusters in multiple genomes based on synteny and homology

    Background: Uncovering the relationship between conserved chromosomal segments and the functional relatedness of elements within these segments is an important question in computational genomics. We build upon the series of works on gene teams and homology teams. Results: Our primary contribution is a local sliding-window SYNS (SYNtenic teamS) algorithm that refines an existing family structure into orthologous sub-families by analyzing the neighborhoods around the members of a given family with a locally sliding window. The neighborhood analysis is done by computing conserved gene clusters. We evaluate our algorithm on the existing homologous families from the Genolevures database over five genomes of the Hemiascomycete phylum. Conclusions: The result is an efficient algorithm that works on multiple genomes, considers paralogous copies of genes and is able to uncover orthologous clusters even in distant genomes. The resulting orthologous clusters are comparable to those obtained by manual curation.

    A Survey of Matrix Completion Methods for Recommendation Systems

    In recent years, recommendation systems have become increasingly popular and have been used in a broad variety of applications. Here, we investigate matrix completion techniques for recommendation systems that are based on collaborative filtering. The collaborative filtering problem can be viewed as predicting a user's preference for new items or commodities. When a rating matrix is constructed with users as rows, items as columns, and entries as ratings, the collaborative filtering problem can be modeled as a matrix completion problem: filling in the unknown elements of the rating matrix. This article presents a comprehensive survey of the matrix completion methods used in recommendation systems. We focus on the mathematical models for matrix completion and the corresponding computational algorithms, as well as their characteristics and potential issues. Several applications other than the traditional user-item association prediction are also discussed.
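
The modeling step described in this abstract (users as rows, items as columns, unknown ratings to be filled in) can be illustrated with a small sketch. It fits a low-rank factorization to the observed entries only, using stochastic gradient descent, and then multiplies the factors to predict every cell. This is one common technique, not necessarily one the survey emphasizes; the function name `complete_matrix`, the hyperparameters, and the toy ratings are all assumptions made for illustration.

```python
import random


def complete_matrix(observed, n_users, n_items, rank=2, lr=0.02,
                    reg=0.01, epochs=3000, seed=0):
    """Fill a rating matrix from its observed entries (illustrative sketch).

    observed maps (user, item) -> rating. The matrix is modeled as U @ V^T
    with `rank` latent factors, fitted by SGD on the observed cells only.
    """
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for (u, i), rating in observed.items():
            pred = sum(U[u][k] * V[i][k] for k in range(rank))
            err = rating - pred
            for k in range(rank):
                uk, vk = U[u][k], V[i][k]
                # Gradient step on squared error with L2 regularization.
                U[u][k] += lr * (err * vk - reg * uk)
                V[i][k] += lr * (err * uk - reg * vk)
    # Predicted rating for every (user, item) cell, observed or not.
    return [[sum(U[u][k] * V[i][k] for k in range(rank))
             for i in range(n_items)] for u in range(n_users)]


# Toy example: 3 users, 2 items, the rating of user 2 for item 1 is unknown.
observed = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 4.0, (2, 0): 3.0}
full = complete_matrix(observed, n_users=3, n_items=2)
print(full[2][1])  # the model's guess for the missing cell
```

The regularization weight trades fit on known ratings against the size of the factors; with very few observations per user, as in this toy case, the prediction for a missing cell should be read as a rough estimate.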

    Efficient algorithms for gene cluster detection in prokaryotic genomes

    Schmidt T. Efficient algorithms for gene cluster detection in prokaryotic genomes. Bielefeld (Germany): Bielefeld University; 2005. Research in genomics has emerged rapidly in the last few years, and the availability of completely sequenced genomes continuously increases due to the use of semi-automatic sequencing machines. These sequences, mostly prokaryotic ones, are also well annotated, which means that the positions of their genes and parts of their regulatory or metabolic pathways are known. A new task in the field of bioinformatics is now to gain gene or protein information from the comparison of genomes on a higher level. In the approach of "comparative genomics", researchers in bioinformatics attempt to locate groups or clusters of orthologous genes that may have the same function in multiple genomes. This research is often anchored on the simple but biologically verified fact that functionally related proteins are usually coded by genes placed in a region of close genomic neighborhood in different species. From an algorithmic and combinatorial point of view, the first descriptions of the concept of "closely placed genes" were only fragmentary, and sometimes confusing. The given algorithms often lack the necessary grounds to prove their correctness or assess their complexity. Within the first formal models of a conserved genomic neighborhood, genomes are often represented as permutations of their genes, and common intervals, i.e. intervals containing the same set of genes, are interpreted as gene clusters. The major disadvantage of representing genomes as permutations is that paralogous copies of the same gene inside one genome cannot be modelled. Since large genomes in particular contain numerous paralogous genes, this model is insufficient for use on real genomic data.
In this work, we consider a modified model of gene clusters that allows paralogs, simply by representing genomes as sequences rather than permutations of genes. We define common intervals based on this model, and we present a simple algorithm that finds all common intervals of two sequences in Θ(n²) time using Θ(n²) space. Another, more complicated algorithm runs in O(n²) time and uses only linear space. We also show how to extend these algorithms to more than two genomes and present the implementation of the algorithms as well as the visualization of the located clusters in the tool Gecko. Since the creation of the string representation of a set of genomes is a non-trivial task, we also present the data preparation tool GhostFam, which groups all genes from the given set of genomes into their families of homologs. In the evaluation on a set of 20 bacterial genomes, we show that with the presented approach it is possible to correctly locate gene clusters that are known from the literature, and to successfully predict new groups of functionally related genes.
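
To make the sequence-based model concrete, here is a naive sketch of common intervals of two gene sequences in which a gene family may occur more than once. It enumerates every contiguous interval of each sequence, records its set of gene families, and keeps the sets that occur in both. This is not the Θ(n²) algorithm of the thesis: building a set for each of the quadratically many intervals makes it slower, and the function names and toy genomes are assumptions for illustration only.

```python
def interval_sets(seq):
    """Map the gene set of each contiguous interval of seq to its positions.

    Positions are (start, end) index pairs, inclusive. Repeated genes
    (paralogs) are allowed because seq is a sequence, not a permutation.
    """
    sets = {}
    for start in range(len(seq)):
        genes = set()
        for end in range(start, len(seq)):
            genes.add(seq[end])
            sets.setdefault(frozenset(genes), []).append((start, end))
    return sets


def common_intervals(a, b):
    """Return gene sets that form an interval in both a and b, with locations."""
    sets_a, sets_b = interval_sets(a), interval_sets(b)
    return {genes: (sets_a[genes], sets_b[genes])
            for genes in sets_a.keys() & sets_b.keys()}


# Toy genomes written as sequences of gene-family identifiers;
# family 1 has two paralogous copies in the first genome.
genome_a = [1, 2, 3, 1]
genome_b = [3, 1, 2]
clusters = common_intervals(genome_a, genome_b)
for genes, (locs_a, locs_b) in sorted(clusters.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(genes), locs_a, locs_b)
```

In this toy input the set {1, 3} is a common interval only because the second copy of family 1 sits next to family 3 in the first genome, which a permutation model could not express.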

    Silent but Not Static: Accelerated Base-Pair Substitution in Silenced Chromatin of Budding Yeasts

    Subtelomeric DNA in budding yeasts, like metazoan heterochromatin, is gene poor, repetitive, transiently silenced, and highly dynamic. The rapid evolution of subtelomeric regions is commonly thought to arise from transposon activity and increased recombination between repetitive elements. However, we found evidence of an additional factor in this diversification. We observed a surprising level of nucleotide divergence in transcriptionally silenced regions in inter-species comparisons of Saccharomyces yeasts. Likewise, intra-species analysis of polymorphisms revealed increased SNP frequencies in both intergenic and synonymous coding positions of silenced DNA. This analysis suggested that silenced DNA in Saccharomyces cerevisiae and closely related species had an increased rate of single base-pair substitution, likely due to the effects of the silencing machinery on DNA replication or repair.

    Resource management and the effects of trade on vulnerable places and people : lessons from six case studies

    Lessons from six case studies illustrate the complex relationships between international trade, vulnerable ecologies and the poor. The studies, taken from Africa, Asia and Latin America and conducted by local researchers, are set in places where the poor live in close proximity to ecologies that are important to global conservation efforts, and focus on the cascading consequences of trade policy for local livelihoods and environmental services. Collectively, the studies show how under-valued common resources are often poorly protected and consequently subject to shifting economic incentives, including those that arise from trade. The studies provide examples where trade works to accelerate the use of natural resources and to exacerbate unsustainable dependencies by the poor, and other examples where trade has the opposite effect. An important conclusion is that local livelihood and technology choices have important consequences for how environmental resources are used and should be taken into account when designing policies to safeguard fragile ecologies. Keywords: Environmental Economics & Policies, Economic Theory & Research, Emerging Markets, Labor Policies, Population Policies

    A Fresh Insight into Transmission of Schistosomiasis: A Misleading Tale of Biomphalaria in Lake Victoria

    Lake Victoria is a known hot-spot for Schistosoma mansoni, which utilises freshwater snails of the genus Biomphalaria as intermediate hosts. Different species of Biomphalaria are associated with varying parasite compatibility, affecting local transmission. It is thought that two species, B. choanomphala and B. sudanica, inhabit Lake Victoria; despite their biomedical importance, the taxonomy of these species has not been thoroughly examined. This study combined analysis of morphological and molecular variables; the results demonstrated that molecular groupings were not consistent with morphological divisions. Habitat significantly predicted morphotype, suggesting that the different Lake Victorian forms of Biomphalaria are ecophenotypes of one species. The nomenclature should be revised accordingly; the names B. choanomphala choanomphala and B. c. sudanica are proposed. From a public health perspective, these findings can be utilised by policy-makers for better understanding of exposure risk, resulting in more effective and efficient control initiatives.

    ComPath: comparative enzyme analysis and annotation in pathway/subsystem contexts

    Background: Once a new genome is sequenced, one of the important questions is to determine the presence and absence of biological pathways. Analysis of biological pathways in a genome is a complicated task, since a number of biological entities are involved in pathways and biological pathways in different organisms are not identical. Computational pathway identification and analysis thus involves a number of computational tools and databases and is typically done in comparison with pathways in other organisms. This computational requirement is well beyond the capability of most biologists, so information systems for reconstructing, annotating, and analyzing biological pathways are much needed. We introduce a new comparative pathway analysis workbench, ComPath, which integrates various resources and computational tools using an interactive spreadsheet-style web interface for reliable pathway analyses. Results: ComPath allows users to compare biological pathways in multiple genomes using a spreadsheet-style web interface where various sequence-based analyses can be performed to compare enzymes (e.g. sequence clustering) and pathways (e.g. pathway hole identification), to search a genome for de novo prediction of enzymes, or to annotate a genome in comparison with reference genomes of choice. To fill in pathway holes or make de novo enzyme predictions, multiple computational methods such as FASTA, Whole-HMM, CSR-HMM (a method of our own introduced in this paper), and PDB-domain search are integrated in ComPath. Our experiments show that the FASTA and CSR-HMM search methods generally outperform the Whole-HMM and PDB-domain search methods in terms of sensitivity, but FASTA search performs poorly in terms of specificity, detecting more false positives as the E-value cutoff increases. Overall, the CSR-HMM search method performs best in terms of both sensitivity and specificity.
Gene neighborhood and pathway neighborhood (global network) visualization tools can be used to get context information that is complementary to the conventional KEGG map representation. Conclusion: ComPath is an interactive workbench for pathway reconstruction, annotation, and analysis where experts can perform various sequence, domain, and context analyses using an intuitive and interactive spreadsheet-style interface.

    Search beyond traditional probabilistic information retrieval

    "This thesis focuses on search beyond probabilistic information retrieval. Three ap- proached are proposed beyond the traditional probabilistic modelling. First, term associ- ation is deeply examined. Term association considers the term dependency using a factor analysis based model, instead of treating each term independently. Latent factors, con- sidered the same as the hidden variables of ""eliteness"" introduced by Robertson et al. to gain understanding of the relation among term occurrences and relevance, are measured by the dependencies and occurrences of term sequences and subsequences. Second, an entity-based ranking approach is proposed in an entity system named ""EntityCube"" which has been released by Microsoft for public use. A summarization page is given to summarize the entity information over multiple documents such that the truly relevant entities can be highly possibly searched from multiple documents through integrating the local relevance contributed by proximity and the global enhancer by topic model. Third, multi-source fusion sets up a meta-search engine to combine the ""knowledge"" from different sources. Meta-features, distilled as high-level categories, are deployed to diversify the baselines. Three modified fusion methods are employed, which are re- ciprocal, CombMNZ and CombSUM with three expanded versions. Through extensive experiments on the standard large-scale TREC Genomics data sets, the TREC HARD data sets and the Microsoft EntityCube Web collections, the proposed extended models beyond probabilistic information retrieval show their effectiveness and superiority.