1,521 research outputs found
Knowledge Rich Natural Language Queries over Structured Biological Databases
Increasingly, keyword, natural language and NoSQL queries are being used for
information retrieval from traditional as well as non-traditional databases
such as web, document, image, GIS, legal, and health databases. While their
popularity is undeniable for obvious reasons, their engineering is far from
simple. For the most part, semantics- and intent-preserving mapping of a well
understood natural language query expressed over a structured database schema
to a structured query language is still a difficult task, and research to tame
the complexity is intense. In this paper, we propose a multi-level
knowledge-based middleware to facilitate such mappings that separate the
conceptual level from the physical level. We augment these multi-level
abstractions with a concept reasoner and a query strategy engine to dynamically
link arbitrary natural language querying to well defined structured queries. We
demonstrate the feasibility of our approach by presenting a Datalog based
prototype system, called BioSmart, that can compute responses to arbitrary
natural language queries over arbitrary databases once a syntactic
classification of the natural language query is made.
Novel Algorithms and Methodology to Help Unravel Secrets that Next Generation Sequencing Data Can Tell
The genome of an organism is its complete set of DNA nucleotides, spanning
all of its genes and also of its non-coding regions. It contains most of
the information necessary to build and maintain an organism. It is therefore
no surprise that sequencing the genome provides an invaluable tool for
the scientific study of an organism. Via the inference of an evolutionary
(phylogenetic) tree, DNA sequences can be used to reconstruct the evolutionary
history of a set of species. DNA sequences, or genotype data, have
also proven useful for predicting an organism's phenotype (i.e. observed
traits) from its genotype. This is the objective of association studies.
While methods for finding the DNA sequence of an organism have existed
for decades, the recent advent of Next Generation Sequencing (NGS) has
meant that the availability of such data has increased to such an extent
that the computational challenges that now form an integral part of biological
studies can no longer be ignored. By focusing on phylogenetics
and Genome-Wide Association Studies (GWAS), this thesis aims to help
address some of these challenges. As a consequence this thesis is in two
parts with the first one centring on phylogenetics and the second one on
GWAS.
In the first part, we present theoretical insights for reconstructing phylogenetic
trees from incomplete distances. This problem is important in the
context of NGS data as incomplete pairwise distances between organisms
occur frequently with such input and ignoring taxa for which information
is missing can introduce undesirable bias. In the second part we focus on
the problem of inferring population stratification between individuals in a
dataset due to reproductive isolation. While powerful methods for doing
this have been proposed in the literature, they tend to struggle when faced
with the sheer volume of data that comes with NGS. To help address this
problem we introduce the novel PSIKO software and show that it scales
very well when dealing with large NGS datasets.
Evolution & Phylogenetic Analysis: Classroom Activities for Investigating Molecular & Morphological Concepts
In a flexible multisession laboratory, students investigate concepts of phylogenetic analysis at both the molecular and the morphological level. Students finish by conducting their own analysis on a collection of skeletons representing the major phyla of vertebrates, a collection of primate skulls, or a collection of hominid skulls.
PRT: Parallel program for a full backtranslation of oligopeptides
DNA hybridization methods have become the most widely used tools in molecular biology to identify organisms and evaluate gene expression levels. PCR (Polymerase Chain Reaction)-based methods, fluorescent in situ hybridization (FISH) and the more recent DNA microarrays, a high-throughput technology, all require efficient primer or probe design. Evaluating the metabolic capacities of complex microbial communities found in terrestrial or aquatic environments requires new probe design algorithms that reflect their genetic diversity. As only a small part of the microbial diversity is known, the gene sequences deposited in international databases do not reflect the entire diversity. In this context we propose to use oligopeptide sequences to design a complete set of DNA probes able to target the entire genetic diversity of genes encoding enzymes. Because the genetic code is degenerate, backtranslation must be managed efficiently. To our knowledge, no software has been developed that offers a full backtranslation. The complexity remains tractable because DNA probe design only requires short oligopeptides. We propose new algorithms that perform a high-performance backtranslation of an oligopeptide into all potential nucleic acid sequences. We use several efficient techniques, such as memory mapping, to perform this computation. We also propose an MPI parallel implementation that reduces the overall execution time using data load balancing and network file stream distribution on a cluster architecture.
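The full backtranslation described in this abstract can be illustrated with a minimal Python sketch. It is not the PRT implementation: it uses the standard genetic code and a plain generator, with none of the memory mapping or MPI distribution the authors mention, and the names `CODONS`, `count_backtranslations` and `backtranslate` are hypothetical. It only shows why the problem stays tractable for short oligopeptides: the number of candidate sequences is the product of the codon counts of each residue.

```python
from itertools import product

# Standard genetic code: DNA codons grouped by one-letter amino acid code
# (61 sense codons, stop codons omitted).
CODONS = {
    "A": ["GCT", "GCC", "GCA", "GCG"],
    "R": ["CGT", "CGC", "CGA", "CGG", "AGA", "AGG"],
    "N": ["AAT", "AAC"],
    "D": ["GAT", "GAC"],
    "C": ["TGT", "TGC"],
    "Q": ["CAA", "CAG"],
    "E": ["GAA", "GAG"],
    "G": ["GGT", "GGC", "GGA", "GGG"],
    "H": ["CAT", "CAC"],
    "I": ["ATT", "ATC", "ATA"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
    "K": ["AAA", "AAG"],
    "M": ["ATG"],
    "F": ["TTT", "TTC"],
    "P": ["CCT", "CCC", "CCA", "CCG"],
    "S": ["TCT", "TCC", "TCA", "TCG", "AGT", "AGC"],
    "T": ["ACT", "ACC", "ACA", "ACG"],
    "W": ["TGG"],
    "Y": ["TAT", "TAC"],
    "V": ["GTT", "GTC", "GTA", "GTG"],
}


def count_backtranslations(peptide):
    """Number of DNA sequences encoding the peptide (product of codon counts)."""
    total = 1
    for residue in peptide:
        total *= len(CODONS[residue])
    return total


def backtranslate(peptide):
    """Lazily yield every DNA sequence that encodes the given oligopeptide."""
    codon_choices = (CODONS[residue] for residue in peptide)
    for combination in product(*codon_choices):
        yield "".join(combination)


if __name__ == "__main__":
    peptide = "MFW"  # hypothetical example peptide
    print(count_backtranslations(peptide))
    for dna in backtranslate(peptide):
        print(dna)
```

Using a generator keeps memory use constant regardless of how many sequences exist, which is one simple way to cope with the combinatorial growth; a peptide of six leucines alone already yields 6^6 = 46,656 sequences.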
DIALIGN-TX: greedy and progressive approaches for segment-based multiple sequence alignment
Background: DIALIGN-T is a reimplementation of the multiple-alignment program DIALIGN. Due to several algorithmic improvements, it produces significantly better alignments on locally and globally related sequence sets than previous versions of DIALIGN. However, like the original implementation of the program, DIALIGN-T uses a straightforward greedy approach to assemble multiple alignments from local pairwise sequence similarities. Such greedy approaches may be vulnerable to spurious random similarities and can therefore lead to suboptimal results. In this paper, we present DIALIGN-TX, a substantial improvement of DIALIGN-T that combines our previous greedy algorithm with a progressive alignment approach. Results: Our new heuristic produces significantly better alignments, especially on globally related sequences, without excessively increasing CPU time and memory consumption. The new method is based on a guide tree; to detect possible spurious sequence similarities, it employs a vertex-cover approximation on a conflict graph. We performed benchmarking tests on a large set of nucleic acid and protein sequences. For protein benchmarks we used the benchmark database BALIBASE 3 and an updated release of the database IRMBASE 2 for assessing the quality on globally and locally related sequences, respectively. For alignment of nucleic acid sequences, we used BRAliBase II for global alignment and a newly developed database of locally related sequences called DIRMBASE 1. IRMBASE 2 and DIRMBASE 1 are constructed by implanting highly conserved motifs at random positions in long unalignable sequences. Conclusion: On BALIBASE 3, our new program performs significantly better than the previous program DIALIGN-T and outperforms the popular global aligner CLUSTAL W, though it is still outperformed by programs that focus on global alignment like MAFFT, MUSCLE and T-COFFEE. On the locally related test sets in IRMBASE 2 and DIRMBASE 1, our method outperforms all other programs, while MAFFT E-INSi is the only method that comes close to the performance of DIALIGN-TX.
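The abstract above mentions a vertex-cover approximation on a conflict graph but does not say which approximation is used or how the graph is built. As a hedged illustration only, the sketch below shows the classic maximal-matching 2-approximation on a hypothetical conflict graph whose vertices are local alignment fragments and whose edges join fragments that cannot both appear in one alignment. The function name `approx_vertex_cover` and the fragment labels are assumptions, not part of DIALIGN-TX.

```python
def approx_vertex_cover(edges):
    """Classic 2-approximation: take both endpoints of each uncovered edge.

    `edges` is an iterable of (u, v) pairs. The returned set touches every
    edge and is at most twice the size of a minimum vertex cover.
    """
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.add(u)
            cover.add(v)
    return cover


if __name__ == "__main__":
    # Hypothetical conflicts between alignment fragments f1..f4.
    conflicts = [("f1", "f2"), ("f1", "f3"), ("f1", "f4")]
    suspicious = approx_vertex_cover(conflicts)
    # Removing the covered fragments leaves a conflict-free set.
    print(sorted(suspicious))
```

Fragments in the cover are candidates for being spurious, since discarding them resolves every recorded conflict; the real program may weight or select fragments differently.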
An Introduction to Programming for Bioscientists: A Python-based Primer
Computing has revolutionized the biological sciences over the past several
decades, such that virtually all contemporary research in the biosciences
utilizes computer programs. The computational advances have come on many
fronts, spurred by fundamental developments in hardware, software, and
algorithms. These advances have influenced, and even engendered, a phenomenal
array of bioscience fields, including molecular evolution and bioinformatics;
genome-, proteome-, transcriptome- and metabolome-wide experimental studies;
structural genomics; and atomistic simulations of cellular-scale molecular
assemblies as large as ribosomes and intact viruses. In short, much of
post-genomic biology is increasingly becoming a form of computational biology.
The ability to design and write computer programs is among the most
indispensable skills that a modern researcher can cultivate. Python has become
a popular programming language in the biosciences, largely because (i) its
straightforward semantics and clean syntax make it a readily accessible first
language; (ii) it is expressive and well-suited to object-oriented programming,
as well as other modern paradigms; and (iii) the many available libraries and
third-party toolkits extend the functionality of the core language into
virtually every biological domain (sequence and structure analyses,
phylogenomics, workflow management systems, etc.). This primer offers a basic
introduction to coding, via Python, and it includes concrete examples and
exercises to illustrate the language's usage and capabilities; the main text
culminates with a final project in structural bioinformatics. A suite of
Supplemental Chapters is also provided. Starting with basic concepts, such as
that of a 'variable', the Chapters methodically advance the reader to the point
of writing a graphical user interface to compute the Hamming distance between
two DNA sequences.
Comment: 65 pages total, including 45 pages text, 3 figures, 4 tables,
numerous exercises, and 19 pages of Supporting Information; currently in
press at PLOS Computational Biology
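The Hamming distance mentioned at the end of this abstract is simple to state in Python. The following is a minimal sketch of the distance computation only, not the primer's own code or its graphical user interface; the function name `hamming_distance` is an assumption.

```python
def hamming_distance(seq1, seq2):
    """Count the positions at which two equal-length sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(base1 != base2 for base1, base2 in zip(seq1, seq2))


if __name__ == "__main__":
    # The two sequences differ at the third and sixth positions.
    print(hamming_distance("GATTACA", "GACTATA"))  # prints 2
```

Raising an error on unequal lengths is a design choice made here because the Hamming distance is undefined in that case; an edit distance would be the usual alternative when insertions and deletions matter.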
Protein multiple sequence alignment by hybrid bio-inspired algorithms
This article presents an immune-inspired algorithm to tackle the Multiple Sequence Alignment (MSA) problem. MSA is one of the most important tasks in biological sequence analysis. Although this paper focuses on protein alignments, most of the discussion and methodology may also be applied to DNA alignments. The problem of finding the multiple alignment was investigated in the studies by Bonizzoni and Vedova and by Wang and Jiang, and proved to be an NP-hard (non-deterministic polynomial-time hard) problem. The presented algorithm, called Immunological Multiple Sequence Alignment Algorithm (IMSA), incorporates two new strategies to create the initial population and specific ad hoc mutation operators. It uses the 'weighted sum of pairs' as its objective function to evaluate a given candidate alignment. IMSA was tested on the classical BAliBASE benchmarks (versions 1.0, 2.0 and 3.0), and experimental results indicate that it is comparable with state-of-the-art multiple alignment algorithms in terms of alignment quality, as measured by weighted Sum-of-Pairs (SP) and Column Score (CS) values. The main novelty of IMSA is its ability to generate more than a single suboptimal alignment for every MSA instance; this behaviour is due to the stochastic nature of the algorithm and of the populations evolved during the convergence process. This feature will help the decision maker to assess and select a biologically relevant multiple sequence alignment. Finally, the designed algorithm can be used as a local search procedure to properly explore promising alignments of the search space.
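The 'weighted sum of pairs' objective named in this abstract can be sketched as follows. The abstract gives neither the substitution scores nor the sequence-pair weights IMSA uses, so the match, mismatch and gap values, the uniform default weights and the function names below are all assumptions chosen for illustration; a protein aligner would normally use a substitution matrix instead.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one aligned pair of symbols (assumed toy scoring scheme)."""
    if a == "-" and b == "-":
        return 0  # two gaps facing each other carry no penalty here
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def weighted_sum_of_pairs(alignment, weights=None):
    """Sum, over all row pairs (i, j), of weight * pairwise alignment score.

    `alignment` is a list of equal-length gapped strings; `weights` maps
    (i, j) with i < j to a float and defaults to 1.0 for every pair.
    """
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows of the alignment must have equal length")
    total = 0.0
    for i, j in combinations(range(len(alignment)), 2):
        weight = 1.0 if weights is None else weights[(i, j)]
        pairwise = sum(pair_score(a, b) for a, b in zip(alignment[i], alignment[j]))
        total += weight * pairwise
    return total


if __name__ == "__main__":
    candidate = ["AC-", "ACG", "A-G"]  # hypothetical candidate alignment
    print(weighted_sum_of_pairs(candidate))  # prints -3.0
```

An evolutionary or immune-inspired search would call such a function on every candidate alignment in the population and favour the higher-scoring ones.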
Formal reasoning on qualitative models of coinfection of HIV and Tuberculosis and HAART therapy.
BACKGROUND: Several diseases, many of which are now pandemic, consist of multifactorial pathologies. Paradigmatic examples come from the immune response to pathogens, where the effects of different infections combine, yielding complex mutual feedback, often a positive one that boosts infection progression in a scenario that can easily become lethal. HIV is one such infection: it weakens the immune system, favouring the onset of opportunistic infections, among them Tuberculosis (TB). Treatment with antiretroviral therapies has proven effective in reducing mortality. An in-depth understanding of complex systems, like the one consisting of HIV, TB and related therapies, is a great open challenge on the boundaries of bioinformatics, computational biology and systems biology. RESULTS: We present a simplified formalisation of the highly dynamic system consisting of HIV, TB and related therapies, at the cellular level. The progression of the disease (AIDS) depends on interactions between viruses, cells and chemokines, the high mutation rate of viruses, the immune response of individuals and the interaction between drugs and infection dynamics. We first discuss a deterministic model of dual infection (HIV and TB) which is able to capture the long-term dynamics of CD4 T cells, viruses and Tumour Necrosis Factor (TNF). We contrast this model with a stochastic approach which captures intrinsic fluctuations of the biological processes. Furthermore, we integrate automated reasoning techniques, i.e. probabilistic model checking, into our formal analysis. Beyond numerical simulations, model checking allows general properties (such as the effectiveness of anti-HIV therapies) to be verified against the models by means of an automated procedure. Our work stresses the growing importance and flexibility of model checking techniques in bioinformatics. In this paper we i) describe HIV as a complex case of infectious disease; ii) provide a number of different formal descriptions that suitably account for aspects of interest; iii) suggest that the integration of different models together with automated reasoning techniques can improve the understanding of infections and therapies through formal analysis methodologies. CONCLUSION: We argue that the described methodology suitably supports the study of viral infections in a formal, automated and expressive manner. We envisage a long-term contribution of this kind of approach to clinical Bioinformatics and Translational Medicine.
RIGHTS: This article is licensed under the BioMed Central licence at http://www.biomedcentral.com/about/license which is similar to the 'Creative Commons Attribution Licence'. In brief you may: copy, distribute, and display the work; make derivative works; or make commercial use of the work, under the following conditions: the original author must be given credit; for any reuse or distribution, it must be made clear to others what the license terms of this work are.