2,914 research outputs found

    AlexSys: a knowledge-based expert system for multiple sequence alignment construction and analysis

    Multiple sequence alignment (MSA) is a cornerstone of modern molecular biology and represents a unique means of investigating the patterns of conservation and diversity in complex biological systems. Many different algorithms have been developed to construct MSAs, but previous studies have shown that no single aligner consistently outperforms the rest. This has led to the development of a number of ‘meta-methods’ that systematically run several aligners and merge the output into one single solution. Although these methods generally produce more accurate alignments, they are inefficient because all the aligners need to be run first and the choice of the best solution is made a posteriori. Here, we describe the development of a new expert system, AlexSys, for the multiple alignment of protein sequences. AlexSys incorporates an intelligent inference engine to automatically select an appropriate aligner a priori, depending only on the nature of the input sequences. The inference engine was trained on a large set of reference multiple alignments, using a novel machine learning approach. Applying AlexSys to a test set of 178 alignments, we show that the expert system represents a good compromise between alignment quality and running time, making it suitable for high throughput projects. AlexSys is freely available from http://alnitak.u-strasbg.fr/~aniba/alexsys.

    Multiobjective characteristic-based framework for very-large multiple sequence alignment

    Rubio-Largo, Á., Vanneschi, L., Castelli, M., & Vega-Rodríguez, M. A. (2018). Multiobjective characteristic-based framework for very-large multiple sequence alignment. Applied Soft Computing Journal, 69, 719-736. [Advanced online publication on 27 June 2017] DOI: 10.1016/j.asoc.2017.06.022
    In the literature, we can find several heuristics for solving the multiple sequence alignment problem. The vast majority of them make use of flags in order to modify certain alignment parameters; however, if no flags are used, the aligner will run with the default parameter configuration, which is often not the optimal one. In this work, we propose a framework that, depending on the biological characteristics of the input dataset, runs the aligner with the best parameter configuration found for another dataset that has similar biological characteristics, improving the accuracy and conservation of the obtained alignment. To train the framework, we use three well-known multiobjective evolutionary algorithms: NSGA-II, IBEA, and MOEA/D. Then, we perform a comparative study between several aligners proposed in the literature and the characteristic-based version of Kalign, MAFFT, and MUSCLE, when solving widely-used benchmarks (PREFAB v4.0 and SABmark v1.65) and very-large benchmarks with thousands of unaligned sequences (HomFam).
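
The idea of reusing the best configuration found for a dataset with similar characteristics can be illustrated with a nearest-neighbour lookup. This is a minimal sketch under stated assumptions, not the authors' framework: the three features (sequence count, mean length, length spread), the distance measure, and the `knowledge_base` entries with their `gap_open` values are all hypothetical and do not correspond to real Kalign, MAFFT, or MUSCLE flags.

```python
import math
from statistics import mean, pstdev


def characteristics(seqs):
    """Describe a dataset by (number of sequences, mean length, length spread).

    These features are an assumption made for illustration only.
    """
    lengths = [len(s) for s in seqs]
    return (float(len(seqs)), float(mean(lengths)), float(pstdev(lengths)))


def nearest_config(seqs, knowledge_base):
    """Return the stored configuration whose dataset features are closest."""
    target = characteristics(seqs)

    def distance(entry):
        # Use relative differences so large-valued features do not dominate.
        return math.sqrt(sum(
            ((t - f) / (abs(t) + abs(f) + 1e-9)) ** 2
            for t, f in zip(target, entry["features"])
        ))

    return min(knowledge_base, key=distance)["config"]


# Hypothetical knowledge base: features of training datasets and the
# (made-up) best configuration found for each during training.
knowledge_base = [
    {"features": (10.0, 100.0, 5.0), "config": {"gap_open": 1.5}},
    {"features": (1000.0, 300.0, 50.0), "config": {"gap_open": 3.0}},
]

dataset = ["A" * 95] * 6 + ["A" * 105] * 6
chosen = nearest_config(dataset, knowledge_base)
```

In this toy setup the small dataset is matched to the first entry, so its stored configuration is the one that would be passed to the aligner.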

    Robust Subgraph Generation Improves Abstract Meaning Representation Parsing

    The Abstract Meaning Representation (AMR) is a representation for open-domain rich semantics, with potential use in fields like event extraction and machine translation. Node generation, typically done using a simple dictionary lookup, is currently an important limiting factor in AMR parsing. We propose a small set of actions that derive AMR subgraphs by transformations on spans of text, which allows for more robust learning of this stage. Our set of construction actions generalizes better than the previous approach, and can be learned with a simple classifier. We improve on the previous state-of-the-art result for AMR parsing, boosting end-to-end performance by 3 F1 on both the LDC2013E117 and LDC2014T12 datasets. Comment: To appear in ACL 201

    Parallel Exchange of Randomized SubGraphs for Optimization of Network Alignment: PERSONA

    The aim of Network Alignment in Protein-Protein Interaction Networks is to discover functionally similar regions between the compared organisms. One major compromise in solving a network alignment problem is the trade-off among multiple similarity objectives while applying an alignment strategy. An alignment may lose its biological relevance when it favors certain objectives over others, because the unfavored objectives are also relevant. One possible solution to this issue may be to blend the stronger aspects of various alignment strategies until mature solutions are achieved. This study proposes a parallel approach called PERSONA that allows aligners to share their partial solutions continuously while they progress. All these aligners pursue their particular heuristics as part of a particle swarm that searches for multi-objective solutions of the same alignment problem in a reactive actor environment. The actors use the stronger portion of a solution as a subgraph that they receive from leading or other actors and send their own stronger subgraphs back upon evaluation of those partial solutions. Moreover, the individual heuristics of each actor take randomized parameter values at each cycle of parallel execution so that the problem search space can be thoroughly investigated. The results achieved with PERSONA are remarkably optimized and balanced for both topological and node similarity objectives.

    The cooperative effects of channel length-bias, width asymmetry, gradient steepness, and contact-guidance on fibroblasts’ directional decision making

    Cell migration in complex micro-environments that are similar to tissue pores is important for predicting locations of tissue nucleation and optimizing scaffold architectures. Firstly, it was investigated how fibroblast cells, which are relevant to tissue engineering, affect each other’s directional decisions when they encounter a bifurcation with channels of different lengths. It was found that cell sequence and cell mitosis influence the directional choices that the cells made while chemotaxing. Specifically, the fibroblasts chose to alternate between two possible paths - one longer and the other shorter - at a bifurcation. This finding was counter-intuitive given that the shorter path had a steeper chemoattractant gradient, and would thus be expected to be the preferred path, according to classical chemotaxis theory. Hence, multiscale image-based modeling was performed in order to explain this behavior. It showed that consumption of the chemotactic signals by the neighboring cells led to the sequence-dependent directional decisions. Furthermore, it was also found that cellular division led to daughter cells making opposite directional choices from each other, even if it meant that one of the daughter cells had to move against the chemotactic gradient and overcome oncoming traffic of other cells. Secondly, the effects of the various directional cues on the migration of individual fibroblast cells were compared, including the chemoattractant concentration gradient, the channel width, and contact guidance. Simple bifurcated mazes with two branches of different widths were created and fibroblasts were allowed to travel across these geometries by introducing a gradient of PDGF-BB at the ‘exit’ of the device.
    By incorporating image-based modeling methodology into the experimental approach, insight was provided into (i) how individual cells make directional decisions in the presence of complex migration cues and (ii) how cell-cell interaction influences these decisions. It was found that a larger width ratio between the two bifurcated branches outweighs a gradient difference in attracting the cells. Also, when cells encounter a symmetric bifurcation (i.e., no difference between the branch widths), the gradient is predominant in deciding which path the cell will take. Then, in a symmetrical gradient field (i.e., inside a bifurcation of similar branch widths, and in the absence of any leading cells), contact guidance is important for guiding the cells in making directional choices. Finally, these directional cues were ranked from most to least important: vast gradient difference between the two branches, channel width bias, mild gradient difference, and contact guidance.

    Optimum Search Schemes for Approximate String Matching Using Bidirectional FM-Index

    Finding approximate occurrences of a pattern in a text using a full-text index is a central problem in bioinformatics and has been extensively researched. Bidirectional indices have opened new possibilities in this regard, allowing the search to start from anywhere within the pattern and extend in both directions. In particular, the use of search schemes (partitioning the pattern and searching the pieces in certain orders with given bounds on errors) can yield significant speed-ups. However, finding optimal search schemes is a difficult combinatorial optimization problem. Here, for the first time, we propose a mixed integer program (MIP) capable of solving this optimization problem for Hamming distance with a given number of pieces. Our experiments show that the optimal search schemes found by our MIP significantly improve the performance of search in a bidirectional FM-index over previous ad-hoc solutions. For example, approximate matching of 101-bp Illumina reads (with two errors) becomes 35 times faster than standard backtracking. Moreover, despite being performed purely in the index, the running time of search using our optimal schemes (for up to two errors) is comparable to the best state-of-the-art aligners, which benefit from combining search in the index with in-text verification using dynamic programming. As a result, we anticipate that a full-fledged aligner that employs an intelligent combination of search in the bidirectional FM-index using our optimal search schemes and in-text verification using dynamic programming will outperform today's best aligners. The development of such an aligner, called FAMOUS (Fast Approximate string Matching using OptimUm search Schemes), is ongoing as our future work.
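
To make the notion of partitioning a pattern concrete, the sketch below uses the classic pigeonhole argument for Hamming distance: if a pattern is split into k + 1 pieces and at most k mismatches are allowed, at least one piece must occur exactly. This is a deliberately simplified stand-in. It uses plain string search plus in-text verification, not a bidirectional FM-index, and it does not implement the MIP-derived optimal search schemes; the function names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def split_pattern(pattern, pieces):
    """Split pattern into contiguous parts; return (offset, part) pairs."""
    base, extra = divmod(len(pattern), pieces)
    parts, start = [], 0
    for i in range(pieces):
        size = base + (1 if i < extra else 0)
        parts.append((start, pattern[start:start + size]))
        start += size
    return parts


def approx_find(text, pattern, max_errors):
    """Start positions where pattern occurs with at most max_errors mismatches."""
    if len(pattern) <= max_errors:
        raise ValueError("pattern must be longer than max_errors")
    hits = set()
    for offset, part in split_pattern(pattern, max_errors + 1):
        pos = text.find(part)
        while pos != -1:
            start = pos - offset  # where the whole pattern would begin
            if 0 <= start <= len(text) - len(pattern):
                window = text[start:start + len(pattern)]
                if hamming(window, pattern) <= max_errors:
                    hits.add(start)
            pos = text.find(part, pos + 1)
    return sorted(hits)


def brute_force(text, pattern, max_errors):
    """Reference implementation: check every window of the text."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if hamming(text[i:i + m], pattern) <= max_errors]
```

Comparing `approx_find` against `brute_force` on small inputs is a convenient way to check that the partitioning does not miss any occurrence.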

    Automated, Systematic and Parallel Approaches to Software Testing in Bioinformatics

    Software quality assurance becomes especially critical if bioinformatics tools are to be used in a translational medical setting, such as analysis and interpretation of biological data. We must ensure that only validated algorithms are used, and that they are implemented correctly in the analysis pipeline and not disrupted by hardware or software failure. In this thesis, I review common quality assurance practice and guidelines for bioinformatics software testing. Furthermore, I present a novel cloud-based framework to enable automated testing of genetic sequence alignment programs. This framework performs testing based on gold standard simulation data sets and metamorphic testing. I demonstrate the effectiveness of this cloud-based framework using two widely used sequence alignment programs, BWA and Bowtie, and some fault-seeded ‘mutant’ versions of BWA and Bowtie. This preliminary study demonstrates that this type of cloud-based software testing framework is an effective and promising way to implement quality assurance in bioinformatics software that is used in genomic medicine.
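
Metamorphic testing checks a relation between the outputs of related inputs rather than comparing against a known correct answer. The sketch below illustrates one such relation, that reordering the input reads should not change where any read maps, and shows how a fault-seeded variant can violate it. Everything here is a hypothetical toy: `toy_align` is an exact-match stand-in rather than BWA or Bowtie, and the relation is an assumed example, not necessarily one used in the thesis.

```python
def toy_align(reference, reads):
    """Hypothetical stand-in aligner.

    Maps each read name to the first exact-match position in the
    reference, or -1 if the read does not occur.
    """
    return {name: reference.find(seq) for name, seq in reads.items()}


def mutant_align(reference, reads):
    """Fault-seeded variant: wrongly reports the first read as unmapped."""
    result = toy_align(reference, reads)
    first = next(iter(reads))
    result[first] = -1
    return result


def order_invariant(aligner, reference, reads):
    """Metamorphic relation: reversing read order must not change results."""
    baseline = aligner(reference, reads)
    reversed_reads = {name: reads[name] for name in list(reads)[::-1]}
    # Dict equality ignores insertion order, so only the mappings matter.
    return aligner(reference, reversed_reads) == baseline


reference = "ACGTTGCAAGGCT"
reads = {"r1": "ACGT", "r2": "TGCA", "r3": "AGGC"}

passes = order_invariant(toy_align, reference, reads)
mutant_detected = not order_invariant(mutant_align, reference, reads)
```

The relation holds for the stand-in aligner and fails for the mutant, which is the kind of signal a metamorphic test suite looks for when no gold-standard output is available.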

    Creation of functional viruses from non-functional cDNA clones obtained from an RNA virus population by the use of ancestral reconstruction

    RNA viruses have the highest known mutation rates. Consequently, it is likely that a high proportion of individual RNA virus genomes, isolated from an infected host, will contain lethal mutations and be non-functional. This is problematic if the aim is to clone and investigate high-fitness, functional cDNAs and may also pose problems for sequence-based analysis of viral evolution. To address these challenges we have performed a study of the evolution of classical swine fever virus (CSFV) using deep sequencing and analysis of 84 full-length cDNA clones, each representing individual genomes from a moderately virulent isolate. In addition to serving here as a model for RNA viruses generally, CSFV has high socioeconomic importance and remains a threat to animal welfare and pig production. We find that the majority of the investigated genomes are non-functional and only 12% produced infectious RNA transcripts. Full-length sequencing of cDNA clones and deep sequencing of the parental population identified substitutions important for the observed phenotypes. The investigated cDNA clones were furthermore used as the basis for inferring the sequence of functional viruses. Since each unique clone must necessarily be the descendant of a functional ancestor, we hypothesized that it should be possible to produce functional clones by reconstructing ancestral sequences. To test this we used phylogenetic methods to infer two ancestral sequences, which were then reconstructed as cDNA clones. Viruses rescued from the reconstructed cDNAs were tested in cell culture and pigs. Both reconstructed ancestral genomes proved functional, and displayed distinct phenotypes in vitro and in vivo. We suggest that reconstruction of ancestral viruses is a useful tool for experimental and computational investigations of virulence and viral evolution. Importantly, ancestral reconstruction can be done even on the basis of a set of sequences that all correspond to non-functional variants.

    Assessment of Alignment Algorithms, Variant Discovery and Genotype Calling Strategies in Exome Sequencing Data

    Advances in next generation sequencing (NGS) technologies in the past half decade have enabled many novel genomic applications and have generated unprecedented amounts of new knowledge that is quickly changing how biomedical research is being conducted, as well as how we view human diseases and diversity. As the methods, algorithms and software used to process NGS data are constantly being developed and improved, performing analysis and determining the validity of the results become complex. Moreover, as sequencing moves from being a research tool into a clinical diagnostic tool, understanding the performance and limitations of bioinformatics pipelines and the results they produce becomes imperative. This thesis aims to assess the performance of nine bioinformatics pipelines for sequence read alignment, variant calling and genotyping in a Mendelian inherited disease, parent-trio exome sequencing design. A well-characterized reference variant call set from the National Institute of Standards and Technology and the Genome in a Bottle Consortium is used for producing and comparing the analytical performance of each pipeline on the GRCh37 and GRCh38 human references.

    Framing Apache Spark in life sciences

    Advances in high-throughput and digital technologies have required the adoption of big data for handling complex tasks in life sciences. However, the drift to big data has led researchers to face technical and infrastructural challenges in storing, sharing, and analysing them. In fact, these kinds of tasks require distributed computing systems and algorithms able to ensure efficient processing. Cutting-edge distributed programming frameworks make it possible to implement flexible algorithms able to adapt the computation to the data over on-premise HPC clusters or cloud architectures. In this context, Apache Spark is a very powerful HPC engine for large-scale data processing on clusters. Thanks also to specialised libraries for working with structured and relational data, it supports machine learning, graph-based computation, and stream processing. This review article is aimed at helping life sciences researchers to ascertain the features of Apache Spark and to assess whether it can be successfully used in their research activities.