10,340 research outputs found

    Identification of functionally related enzymes by learning-to-rank methods

    Enzyme sequences and structures are routinely used in the biological sciences as queries to search for functionally related enzymes in online databases. To this end, one usually starts from some notion of similarity, comparing two enzymes by looking for correspondences in their sequences, structures or surfaces. For a given query, the search operation results in a ranking of the enzymes in the database, from very similar to dissimilar enzymes, while information about the biological function of annotated database enzymes is ignored. In this work we show that rankings of that kind can be substantially improved by applying kernel-based learning algorithms. This approach enables the detection of statistical dependencies between similarities of the active cleft and the biological function of annotated enzymes. This is in contrast to search-based approaches, which do not take annotated training data into account. Similarity measures based on the active cleft are known to outperform sequence-based or structure-based measures under certain conditions. We consider the Enzyme Commission (EC) classification hierarchy for obtaining annotated enzymes during the training phase. The results of a set of sizeable experiments indicate a consistent and significant improvement for a set of similarity measures that exploit information about small cavities in the surface of enzymes.
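The search-based baseline that this abstract contrasts with its learning approach can be sketched as follows. This is only a minimal illustration of ranking a database by similarity to a query, not the paper's kernel-based method. The function names, the toy feature sets, and the use of Jaccard similarity are all assumptions made for illustration; the paper itself uses cavity-based, sequence-based and structure-based measures.

```python
def jaccard(a, b):
    """Jaccard similarity between two feature collections (stand-in measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_by_similarity(query, database, similarity=jaccard):
    """Return (name, score) pairs sorted from most to least similar.

    `database` maps a hypothetical enzyme name to its feature collection.
    Functional annotations are never consulted, which is the limitation
    of purely search-based ranking that the abstract points out.
    """
    scored = [(name, similarity(query, item)) for name, item in database.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: hypothetical enzymes described by made-up feature labels.
toy_database = {
    "enzyme_1": {"a", "b"},
    "enzyme_2": {"a", "b", "c"},
    "enzyme_3": {"x"},
}
ranking = rank_by_similarity({"a", "b"}, toy_database)
```

A supervised ranker would additionally use labelled examples (for instance EC numbers) to reweight such scores; that step is deliberately left out of this sketch.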

    Discrete Elastic Inner Vector Spaces with Application in Time Series and Sequence Mining

    This paper proposes a framework dedicated to the construction of what we call a discrete elastic inner product, allowing one to embed sets of non-uniformly sampled multivariate time series or sequences of varying lengths into inner product space structures. This framework is based on a recursive definition that covers the case of multiple embedded time elastic dimensions. We prove that such inner products exist in our general framework and show how a simple instance of this inner product class operates on some prospective applications, while generalizing the Euclidean inner product. Classification experiments on time series and symbolic sequence datasets demonstrate the benefits that we can expect by embedding time series or sequences into elastic inner spaces rather than into classical Euclidean spaces. These experiments show good accuracy when compared to the Euclidean distance or even dynamic programming algorithms, while maintaining a linear algorithmic complexity at the exploitation stage, although a quadratic indexing phase is required beforehand. Comment: arXiv admin note: substantial text overlap with arXiv:1101.431
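To make the comparison in this abstract concrete, here is a minimal sketch of the two kinds of baseline it mentions: the Euclidean distance (linear time, equal lengths only) and a dynamic programming alignment. Classic dynamic time warping (DTW) is used here as an assumed representative of the "dynamic programming algorithms"; this is not the paper's elastic inner product, whose recursive definition is not given in the abstract.

```python
from math import inf


def euclidean_distance(a, b):
    """Euclidean distance; linear time, but only defined for equal lengths."""
    if len(a) != len(b):
        raise ValueError("Euclidean distance needs sequences of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def dtw_distance(a, b):
    """Classic DTW between two numeric sequences; O(len(a) * len(b)) time."""
    n, m = len(a), len(b)
    # cost[i][j] holds the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a[i-1] aligned with an already matched b[j-1]
                cost[i][j - 1],      # b[j-1] aligned with an already matched a[i-1]
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]
```

The quadratic table in `dtw_distance` is the per-comparison cost that the abstract says its approach avoids at the exploitation stage.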

    Simple identification tools in FishBase

    Simple identification tools for fish species were included in the FishBase information system from its inception. Early tools made use of the relational model and characters like fin ray meristics. Soon pictures and drawings were added as a further help, similar to a field guide. Later came the computerization of existing dichotomous keys, again in combination with pictures and other information, and the ability to restrict possible species by country, area, or taxonomic group. Today, www.FishBase.org offers four different ways to identify species. This paper describes these tools with their advantages and disadvantages, and suggests various options for further development. It explores the possibility of a holistic and integrated computer-aided strategy.

    LALNVIEW: a graphical viewer for pairwise sequence alignments

    LALNVIEW is a graphical program for visualising local alignments between two sequences (protein or nucleic acids). Sequences are represented by coloured rectangles to give an overall picture of their similarities. LALNVIEW can display sequence features (exon, intron, active site, domain, propeptide, etc.) along with the alignment. When using LALNVIEW through our Web servers, sequence features are automatically extracted from database annotations (SWISS-PROT, GenBank, EMBL or HOVERGEN) and displayed with the alignment. LALNVIEW is a useful tool for analysing pairwise sequence alignments and for making the link between sequence homology and what is known about the structure or function of sequences. LALNVIEW executables for UNIX, Macintosh and PC computers are freely available from our server (http://expasy.hcuge.ch/sprot/lalnview.html).

    Bioinformatics: A Way Forward to Explore “Plant Omics”

    Bioinformatics, a computer-assisted science aimed at managing huge volumes of genomic data, is an emerging discipline that combines the power of computers, mathematical algorithms, and statistical concepts to solve multiple genetic/biological puzzles. This science has progressed in parallel with the evolution of genome-sequencing tools, for example the next-generation sequencing technologies, which made it possible to arrange and analyze the sequencing information of large genomes. The synergism of “plant omics” and bioinformatics set a firm foundation for deducing the ancestral karyotype of multiple plant families, predicting genes, etc. Second, the huge genomic data can be assembled to acquire maximum information from voluminous “omics” data. The science of bioinformatics is handicapped by the lack of appropriate computational procedures for assembling sequencing reads of the homologs occurring in complex genomes like cotton (2n = 4x = 52), wheat (2n = 6x = 42), etc., and by a shortage of trained, multidisciplinary manpower. In addition, the rapid expansion of sequencing data restricts the potential for acquiring, storing, distributing, and analyzing the genomic information. In future, the invention of high-tech computational tools and skills, together with improved biological expertise, would provide better insight into the genomes, and this information would be helpful in sustaining crop productivity on this planet.

    A D.C. Programming Approach to the Sparse Generalized Eigenvalue Problem

    In this paper, we consider the sparse eigenvalue problem wherein the goal is to obtain a sparse solution to the generalized eigenvalue problem. We achieve this by constraining the cardinality of the solution to the generalized eigenvalue problem and obtain sparse principal component analysis (PCA), sparse canonical correlation analysis (CCA) and sparse Fisher discriminant analysis (FDA) as special cases. Unlike the $\ell_1$-norm approximation to the cardinality constraint, which previous methods have used in the context of sparse PCA, we propose a tighter approximation that is related to the negative log-likelihood of a Student's t-distribution. The problem is then framed as a d.c. (difference of convex functions) program and is solved as a sequence of convex programs by invoking the majorization-minimization method. The resulting algorithm is proved to exhibit global convergence behavior, i.e., for any random initialization, the sequence (subsequence) of iterates generated by the algorithm converges to a stationary point of the d.c. program. The performance of the algorithm is empirically demonstrated on both sparse PCA (finding few relevant genes that explain as much variance as possible in a high-dimensional gene dataset) and sparse CCA (cross-language document retrieval and vocabulary selection for music retrieval) applications. Comment: 40 page
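The problem described in this abstract can be written down as a short sketch. The symbols $A$, $B$, $k$ and $\varepsilon$ are notation assumed here for illustration, and the logarithmic surrogate shown is one common form of a log-based approximation to the cardinality; the paper's exact formulation and constants may differ.

```latex
% Cardinality-constrained generalized eigenvalue problem (assumed notation):
\max_{x \in \mathbb{R}^n} \; x^\top A x
\quad \text{subject to} \quad x^\top B x = 1, \qquad \|x\|_0 \le k .

% The l1 relaxation replaces \|x\|_0 with \|x\|_1 = \sum_i |x_i|.
% A log-based surrogate of the kind the abstract alludes to, for small
% \varepsilon > 0:
\|x\|_0 \;\approx\; \sum_{i=1}^{n}
  \frac{\log\!\left(1 + |x_i| / \varepsilon\right)}
       {\log\!\left(1 + 1 / \varepsilon\right)} .
```

Each term of the surrogate is 0 at $x_i = 0$ and 1 at $|x_i| = 1$, and for any fixed $x_i \neq 0$ it tends to 1 as $\varepsilon \to 0$, so the sum approaches the number of nonzero entries.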