1,521 research outputs found

    Knowledge Rich Natural Language Queries over Structured Biological Databases

    Keyword, natural language, and NoSQL queries are increasingly used for information retrieval from traditional as well as non-traditional databases such as web, document, image, GIS, legal, and health databases. While their popularity is undeniable, their engineering is far from simple. For the most part, mapping a well-understood natural language query over a structured database schema to a structured query language, while preserving its semantics and intent, is still a difficult task, and research to tame the complexity is intense. In this paper, we propose a multi-level knowledge-based middleware to facilitate such mappings, separating the conceptual level from the physical level. We augment these multi-level abstractions with a concept reasoner and a query strategy engine to dynamically link arbitrary natural language queries to well-defined structured queries. We demonstrate the feasibility of our approach by presenting a Datalog-based prototype system, called BioSmart, that can compute responses to arbitrary natural language queries over arbitrary databases once a syntactic classification of the natural language query is made.

    Novel Algorithms and Methodology to Help Unravel Secrets that Next Generation Sequencing Data Can Tell

    The genome of an organism is its complete set of DNA nucleotides, spanning all of its genes as well as its non-coding regions. It contains most of the information necessary to build and maintain an organism. It is therefore no surprise that sequencing the genome provides an invaluable tool for the scientific study of an organism. Via the inference of an evolutionary (phylogenetic) tree, DNA sequences can be used to reconstruct the evolutionary history of a set of species. DNA sequences, or genotype data, have also proven useful for predicting an organism's phenotype (i.e. observed traits) from its genotype. This is the objective of association studies. While methods for finding the DNA sequence of an organism have existed for decades, the recent advent of Next Generation Sequencing (NGS) has increased the availability of such data to such an extent that the computational challenges that now form an integral part of biological studies can no longer be ignored. By focusing on phylogenetics and Genome-Wide Association Studies (GWAS), this thesis aims to help address some of these challenges. As a consequence, this thesis is in two parts, with the first centring on phylogenetics and the second on GWAS. In the first part, we present theoretical insights for reconstructing phylogenetic trees from incomplete distances. This problem is important in the context of NGS data, as incomplete pairwise distances between organisms occur frequently with such input, and ignoring taxa for which information is missing can introduce undesirable bias. In the second part, we focus on the problem of inferring population stratification between individuals in a dataset due to reproductive isolation. While powerful methods for doing this have been proposed in the literature, they tend to struggle when faced with the sheer volume of data that comes with NGS. To help address this problem, we introduce the novel PSIKO software and show that it scales very well when dealing with large NGS datasets.

    Evolution & Phylogenetic Analysis: Classroom Activities for Investigating Molecular & Morphological Concepts

    In a flexible multisession laboratory, students investigate concepts of phylogenetic analysis at both the molecular and the morphological level. Students finish by conducting their own analysis on a collection of skeletons representing the major phyla of vertebrates, a collection of primate skulls, or a collection of hominid skulls.

    PRT: Parallel program for a full backtranslation of oligopeptides

    DNA hybridization methods have become the most widely used tools in molecular biology to identify organisms and evaluate gene expression levels. PCR (Polymerase Chain Reaction)-based methods, fluorescent in situ hybridization (FISH), and the recently developed high-throughput DNA microarrays all require efficient primer or probe design. Evaluating the metabolic capacities of complex microbial communities found in terrestrial or aquatic environments requires new probe design algorithms that reflect their genetic diversity. As only a small part of the microbial diversity is known, gene sequences deposited in international databases do not reflect the entire diversity. In this context, we propose to use oligopeptide sequences to design complete sets of DNA probes that are able to target the entire genetic diversity of genes encoding enzymes. Because the genetic code is degenerate, backtranslation must be managed efficiently. To our knowledge, no software has been developed to perform a full backtranslation. The complexity is tractable since we only need to focus on short oligopeptides for DNA probe design. We propose new algorithms that perform high-performance backtranslation of an oligopeptide into all potential nucleic acid sequences. We use several efficient techniques, such as memory mapping, to perform this computation. We also propose an MPI parallel implementation that reduces the total execution time using data load balancing and network file stream distribution on a cluster architecture.
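The idea of a full backtranslation can be illustrated with a minimal, serial Python sketch. It is not the PRT implementation (which the abstract describes as memory-mapped and MPI-parallel); the names `CODONS_FOR` and `backtranslate` are hypothetical, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, written compactly: codons are enumerated in
# T, C, A, G order and paired with the one-letter amino acid they encode
# ('*' marks a stop codon). Assumption: the standard code applies.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Invert the code: amino acid -> list of codons that encode it.
CODONS_FOR = {}
for (b1, b2, b3), aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
    CODONS_FOR.setdefault(aa, []).append(b1 + b2 + b3)


def backtranslate(peptide):
    """Yield every DNA sequence that encodes the given oligopeptide.

    The number of results is the product of the codon counts of each
    residue, so this is only practical for short peptides.
    Raises KeyError for a letter that is not a standard amino acid.
    """
    codon_choices = [CODONS_FOR[aa] for aa in peptide.upper()]
    for codons in product(*codon_choices):
        yield "".join(codons)
```

Because leucine and serine each have six codons, a dipeptide such as `LS` already expands to 36 DNA sequences, which shows why the output grows quickly with peptide length.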

    DIALIGN-TX: greedy and progressive approaches for segment-based multiple sequence alignment

    BACKGROUND: DIALIGN-T is a reimplementation of the multiple-alignment program DIALIGN. Due to several algorithmic improvements, it produces significantly better alignments on locally and globally related sequence sets than previous versions of DIALIGN. However, like the original implementation of the program, DIALIGN-T uses a straightforward greedy approach to assemble multiple alignments from local pairwise sequence similarities. Such greedy approaches may be vulnerable to spurious random similarities and can therefore lead to suboptimal results. In this paper, we present DIALIGN-TX, a substantial improvement of DIALIGN-T that combines our previous greedy algorithm with a progressive alignment approach. RESULTS: Our new heuristic produces significantly better alignments, especially on globally related sequences, without excessively increasing CPU time and memory consumption. The new method is based on a guide tree; to detect possible spurious sequence similarities, it employs a vertex-cover approximation on a conflict graph. We performed benchmarking tests on a large set of nucleic acid and protein sequences. For protein benchmarks we used the benchmark database BALIBASE 3 and an updated release of the database IRMBASE 2 for assessing the quality on globally and locally related sequences, respectively. For alignment of nucleic acid sequences, we used BRAliBase II for global alignment and a newly developed database of locally related sequences called DIRM-BASE 1. IRMBASE 2 and DIRMBASE 1 are constructed by implanting highly conserved motifs at random positions in long unalignable sequences. CONCLUSION: On BALIBASE3, our new program performs significantly better than the previous program DIALIGN-T and outperforms the popular global aligner CLUSTAL W, though it is still outperformed by programs that focus on global alignment like MAFFT, MUSCLE and T-COFFEE. On the locally related test sets in IRMBASE 2 and DIRM-BASE 1, our method outperforms all other programs, while MAFFT E-INSi is the only method that comes close to the performance of DIALIGN-TX.
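As an illustration of what a vertex-cover approximation on a conflict graph can look like, here is a minimal Python sketch of the classic matching-based 2-approximation. The abstract does not say which approximation DIALIGN-TX uses, so this is a stand-in technique; the function name and the reading of vertices as local similarities and edges as pairs that cannot both appear in one alignment are assumptions.

```python
def approx_vertex_cover(edges):
    """Return a vertex cover at most twice the size of a minimum one.

    `edges` is an iterable of (u, v) pairs. Assumption for this sketch:
    each vertex is a local similarity and each edge marks two
    similarities that conflict with each other.
    """
    cover = set()
    for u, v in edges:
        # An edge with neither endpoint chosen yet is uncovered:
        # take both endpoints (the standard 2-approximation step).
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover
```

Removing the returned vertices leaves a conflict-free set; the cover may be larger than necessary, which is the price of the simple linear-time rule.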

    An Introduction to Programming for Bioscientists: A Python-based Primer

    Computing has revolutionized the biological sciences over the past several decades, such that virtually all contemporary research in the biosciences utilizes computer programs. The computational advances have come on many fronts, spurred by fundamental developments in hardware, software, and algorithms. These advances have influenced, and even engendered, a phenomenal array of bioscience fields, including molecular evolution and bioinformatics; genome-, proteome-, transcriptome- and metabolome-wide experimental studies; structural genomics; and atomistic simulations of cellular-scale molecular assemblies as large as ribosomes and intact viruses. In short, much of post-genomic biology is increasingly becoming a form of computational biology. The ability to design and write computer programs is among the most indispensable skills that a modern researcher can cultivate. Python has become a popular programming language in the biosciences, largely because (i) its straightforward semantics and clean syntax make it a readily accessible first language; (ii) it is expressive and well-suited to object-oriented programming, as well as other modern paradigms; and (iii) the many available libraries and third-party toolkits extend the functionality of the core language into virtually every biological domain (sequence and structure analyses, phylogenomics, workflow management systems, etc.). This primer offers a basic introduction to coding, via Python, and it includes concrete examples and exercises to illustrate the language's usage and capabilities; the main text culminates with a final project in structural bioinformatics. A suite of Supplemental Chapters is also provided. 
Starting with basic concepts, such as that of a 'variable', the Chapters methodically advance the reader to the point of writing a graphical user interface to compute the Hamming distance between two DNA sequences. Comment: 65 pages total, including 45 pages text, 3 figures, 4 tables, numerous exercises, and 19 pages of Supporting Information; currently in press at PLOS Computational Biology
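The Hamming distance mentioned above is the number of positions at which two equal-length sequences differ. A minimal Python sketch, without the graphical interface and not taken from the primer itself (the function name is hypothetical):

```python
def hamming_distance(seq_a: str, seq_b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Hamming distance requires sequences of equal length")
    # Compare case-insensitively so 'acgt' and 'ACGT' count as identical.
    return sum(1 for a, b in zip(seq_a.upper(), seq_b.upper()) if a != b)
```

For example, `hamming_distance("GATTACA", "GACTATA")` compares the sequences position by position and counts the two mismatches.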

    Protein multiple sequence alignment by hybrid bio-inspired algorithms

    This article presents an immune-inspired algorithm to tackle the Multiple Sequence Alignment (MSA) problem. MSA is one of the most important tasks in biological sequence analysis. Although this paper focuses on protein alignments, most of the discussion and methodology may also be applied to DNA alignments. The problem of finding the multiple alignment was investigated in the studies by Bonizzoni and Vedova and by Wang and Jiang, and proved to be an NP-hard (non-deterministic polynomial-time hard) problem. The presented algorithm, called Immunological Multiple Sequence Alignment Algorithm (IMSA), incorporates two new strategies to create the initial population and specific ad hoc mutation operators. It uses the 'weighted sum of pairs' as its objective function to evaluate a given candidate alignment. IMSA was tested using the classical benchmarks of BAliBASE (versions 1.0, 2.0 and 3.0), and experimental results indicate that it is comparable with state-of-the-art multiple alignment algorithms in terms of quality of alignments, weighted Sum-of-Pairs (SP) and Column Score (CS) values. The main novelty of IMSA is its ability to generate more than a single suboptimal alignment for every MSA instance; this behaviour is due to the stochastic nature of the algorithm and of the populations evolved during the convergence process. This feature will help the decision maker to assess and select a biologically relevant multiple sequence alignment. Finally, the designed algorithm can be used as a local search procedure to properly explore promising alignments of the search space.
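A weighted sum-of-pairs objective adds up, over every pair of aligned sequences, a per-column score multiplied by a weight for that pair. The Python sketch below is illustrative only: the match, mismatch, and gap scores are placeholder assumptions (a protein aligner would normally use a substitution matrix), and the function names are hypothetical rather than taken from IMSA.

```python
from itertools import combinations


def pair_score(a, b, match=1.0, mismatch=-1.0, gap=-2.0):
    """Score one aligned pair of symbols; '-' denotes a gap.

    The numeric values are placeholder assumptions for illustration.
    """
    if a == "-" and b == "-":
        return 0.0          # gap aligned with gap contributes nothing
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def weighted_sum_of_pairs(alignment, weights=None):
    """Sum weights[i][j] * pairwise score over all sequence pairs i < j.

    `alignment` is a list of equal-length gapped strings; `weights` is an
    optional square matrix of pair weights (default: every weight is 1).
    """
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of an alignment must have equal length")
    total = 0.0
    for i, j in combinations(range(len(alignment)), 2):
        w = 1.0 if weights is None else weights[i][j]
        total += w * sum(pair_score(a, b)
                         for a, b in zip(alignment[i], alignment[j]))
    return total
```

A search procedure can then compare candidate alignments by this single number, with a higher value indicating a better alignment under the chosen scores.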

    Formal reasoning on qualitative models of coinfection of HIV and Tuberculosis and HAART therapy.

    BACKGROUND: Several diseases, many of which are now pandemic, consist of multifactorial pathologies. Paradigmatic examples come from the immune response to pathogens, in which the effects of different infections combine, yielding complex mutual feedback, often a positive one that boosts infection progression in a scenario that can easily become lethal. HIV is one such infection: it weakens the immune system, favouring the onset of opportunistic infections, among them Tuberculosis (TB). Treatment with antiretroviral therapies has proven effective in reducing mortality. An in-depth understanding of complex systems, like the one consisting of HIV, TB and related therapies, is a great open challenge on the boundaries of bioinformatics, computational biology and systems biology. RESULTS: We present a simplified formalisation of the highly dynamic system consisting of HIV, TB and related therapies, at the cellular level. The progression of the disease (AIDS) hence depends on interactions between viruses, cells, chemokines, the high mutation rate of viruses, the immune response of individuals and the interaction between drugs and infection dynamics. We first discuss a deterministic model of dual infection (HIV and TB) which is able to capture the long-term dynamics of CD4 T cells, viruses and Tumour Necrosis Factor (TNF). We contrast this model with a stochastic approach which captures intrinsic fluctuations of the biological processes. Furthermore, we integrate automated reasoning techniques, i.e. probabilistic model checking, in our formal analysis. Beyond numerical simulations, model checking allows general properties (effectiveness of anti-HIV therapies) to be verified against the models by means of an automated procedure. Our work stresses the growing importance and flexibility of model checking techniques in bioinformatics. In this paper we i) describe HIV as a complex case of infectious disease; ii) provide a number of different formal descriptions that suitably account for aspects of interest; iii) suggest that the integration of different models together with automated reasoning techniques can improve the understanding of infections and therapies through formal analysis methodologies. CONCLUSION: We argue that the described methodology suitably supports the study of viral infections in a formal, automated and expressive manner. We envisage a long-term contribution of this kind of approach to clinical Bioinformatics and Translational Medicine. RIGHTS: This article is licensed under the BioMed Central licence at http://www.biomedcentral.com/about/license which is similar to the 'Creative Commons Attribution Licence'. In brief you may: copy, distribute, and display the work; make derivative works; or make commercial use of the work, under the following conditions: the original author must be given credit; for any reuse or distribution, it must be made clear to others what the license terms of this work are.