230 research outputs found

    A comparison of common programming languages used in bioinformatics

    BACKGROUND: The performance of different programming languages has previously been benchmarked using abstract mathematical algorithms, but not using standard bioinformatics algorithms. We compared the memory usage and speed of execution for three standard bioinformatics methods, implemented in programs using one of six different programming languages. Programs for the Sellers algorithm, the Neighbor-Joining tree construction algorithm and an algorithm for parsing BLAST file outputs were implemented in C, C++, C#, Java, Perl and Python. RESULTS: Implementations in C and C++ were fastest and used the least memory. Programs in these languages generally contained more lines of code. Java and C# appeared to be a compromise between the flexibility of Perl and Python and the fast performance of C and C++. The relative performance of the tested languages did not change from Windows to Linux and no clear evidence of a faster operating system was found. Source code and additional information are available from http://www.bioinformatics.org/benchmark/. CONCLUSION: This benchmark provides a comparison of six commonly used programming languages under two different operating systems. The overall comparison shows that a developer should choose an appropriate language carefully, taking into account the performance expected and the library availability for each language.

    MaxAlign: maximizing usable data in an alignment

    BACKGROUND: The presence of gaps in an alignment of nucleotide or protein sequences is often an inconvenience for bioinformatical studies. In phylogenetic and other analyses, for instance, gapped columns are often discarded entirely from the alignment. RESULTS: MaxAlign is a program that optimizes the alignment prior to such analyses. Specifically, it maximizes the number of nucleotide (or amino acid) symbols that are present in gap-free columns (the alignment area) by selecting the optimal subset of sequences to exclude from the alignment. MaxAlign can be used prior to phylogenetic and bioinformatical analyses as well as in other situations where this form of alignment improvement is useful. In this work we test MaxAlign's performance in these tasks and compare the accuracy of phylogenetic estimates including and excluding gapped columns from the analysis, with and without processing with MaxAlign. In this paper we also introduce a new simple measure of tree similarity, Normalized Symmetric Similarity (NSS) that we consider useful for comparing tree topologies. CONCLUSION: We demonstrate how MaxAlign is helpful in detecting misaligned or defective sequences without requiring manual inspection. We also show that it is not advisable to exclude gapped columns from phylogenetic analyses unless MaxAlign is used first. Finally, we find that the sequences removed by MaxAlign from an alignment tend to be those that would otherwise be associated with low phylogenetic accuracy, and that the presence of gaps in any given sequence does not seem to disturb the phylogenetic estimates of other sequences. The MaxAlign web-server is freely available online at http://www.cbs.dtu.dk/services/MaxAlign where supplementary information can also be found. The program is also freely available as a Perl stand-alone package.
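
The "alignment area" mentioned in the MaxAlign abstract (the number of symbols that sit in gap-free columns) can be illustrated with a short Python sketch. This is not MaxAlign's own algorithm, which the abstract does not spell out. It is a simple greedy stand-in that drops one sequence at a time while the area grows. The gap character `-` and the function names `alignment_area` and `greedy_max_align` are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

GAP = "-"  # assumption: gaps are encoded as '-'


def alignment_area(seqs: Dict[str, str]) -> int:
    """Symbols in gap-free columns: (#gap-free columns) * (#sequences)."""
    if not seqs:
        return 0
    rows = list(seqs.values())
    gap_free = sum(1 for column in zip(*rows) if GAP not in column)
    return gap_free * len(rows)


def greedy_max_align(seqs: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Greedy heuristic (hypothetical, not the published method):
    repeatedly remove the single sequence whose exclusion increases
    the alignment area the most, and stop when no removal helps."""
    kept = dict(seqs)
    removed: List[str] = []
    while len(kept) > 1:
        best_name, best_area = None, alignment_area(kept)
        for name in kept:
            trial = {k: v for k, v in kept.items() if k != name}
            area = alignment_area(trial)
            if area > best_area:
                best_name, best_area = name, area
        if best_name is None:
            break  # no single removal improves the area
        removed.append(best_name)
        del kept[best_name]
    return kept, removed


if __name__ == "__main__":
    toy = {
        "a": "ACGTACGT",
        "b": "ACGTACGT",
        "c": "AC----GT",
        "d": "ACGTAC-T",
    }
    kept, removed = greedy_max_align(toy)
    print(alignment_area(toy), alignment_area(kept), removed)
```

On the toy alignment, excluding the heavily gapped sequence `c` raises the area, while excluding `d` as well would lower it again, so the greedy loop stops. A greedy search can miss the optimal subset in general; it is used here only to make the objective concrete.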

    Evolutionary distances in the twilight zone -- a rational kernel approach

    Phylogenetic tree reconstruction is traditionally based on multiple sequence alignments (MSAs) and heavily depends on the validity of this information bottleneck. With increasing sequence divergence, the quality of MSAs decays quickly. Alignment-free methods, on the other hand, are based on abstract string comparisons and avoid potential alignment problems. However, in general they are not biologically motivated and ignore our knowledge about the evolution of sequences. Thus, it is still a major open question how to define an evolutionary distance metric between divergent sequences that makes use of indel information and known substitution models without the need for a multiple alignment. Here we propose a new evolutionary distance metric to close this gap. It uses finite-state transducers to create a biologically motivated similarity score which models substitutions and indels, and does not depend on a multiple sequence alignment. The sequence similarity score is defined in analogy to pairwise alignments and additionally has the positive semi-definite property. We describe its derivation and show in simulation studies and real-world examples that it is more accurate in reconstructing phylogenies than competing methods. The result is a new and accurate way of determining evolutionary distances in and beyond the twilight zone of sequence alignments that is suitable for large datasets. Comment: to appear in PLoS ON

    Mitochondrial phylogeography and demographic history of the Vicuña: implications for conservation

    The vicuña (Vicugna vicugna; Miller, 1924) is a conservation success story, having recovered from near extinction in the 1960s to current population levels estimated at 275 000. However, lack of information about its demographic history and genetic diversity has limited both our understanding of its recovery and the development of science-based conservation measures. To examine the evolution and recent demographic history of the vicuña across its current range and to assess its genetic variation and population structure, we sequenced mitochondrial DNA from the control region (CR) for 261 individuals from 29 populations across Peru, Chile and Argentina. Our results suggest that populations currently designated as Vicugna vicugna vicugna and Vicugna vicugna mensalis comprise separate mitochondrial lineages. The current population distribution appears to be the result of a recent demographic expansion associated with the last major glacial event of the Pleistocene in the northern (18 to 22°S) dry Andes 14–12 000 years ago and the establishment of an extremely arid belt known as the 'Dry Diagonal' to 29°S. Within the Dry Diagonal, small populations of V. v. vicugna appear to have survived showing the genetic signature of demographic isolation, whereas to the north V. v. mensalis populations underwent a rapid demographic expansion before recent anthropogenic impacts

    A new, fast algorithm for detecting protein coevolution using maximum compatible cliques

    BACKGROUND: The MatrixMatchMaker (MMM) algorithm was recently introduced to detect the similarity between phylogenetic trees and thus the coevolution between proteins. MMM finds the largest common submatrices between pairs of phylogenetic distance matrices, and has numerous advantages over existing methods of coevolution detection. However, these advantages came at the cost of a very long execution time. RESULTS: In this paper, we show that the problem of finding the maximum submatrix reduces to a multiple maximum clique subproblem on a graph of protein pairs. This allowed us to develop a new algorithm and program implementation, MMMvII, which achieved more than 600× speedup with comparable accuracy to the original MMM. CONCLUSIONS: MMMvII will thus allow for more extensive and intricate analyses of coevolution. AVAILABILITY: An implementation of the MMMvII algorithm is available at: http://www.uhnresearch.ca/labs/tillier/MMMWEBvII/MMMWEBvII.php
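
The reduction from a common-submatrix search to a maximum clique problem can be sketched in Python as follows. This is a simplified illustration, not the MMMvII implementation. It assumes that each graph node pairs one taxon from the first distance matrix with one from the second, and that two nodes are joined when the corresponding distances agree within a relative tolerance. The tolerance `tol`, the compatibility rule, and the names `build_graph` and `max_clique` are assumptions. The clique search is a plain Bron-Kerbosch enumeration with pivoting.

```python
from itertools import product
from typing import Dict, List, Set, Tuple

Node = Tuple[int, int]          # (taxon index in matrix a, taxon index in matrix b)
Matrix = List[List[float]]


def build_graph(a: Matrix, b: Matrix, tol: float = 0.1) -> Dict[Node, Set[Node]]:
    """Join (i, j) and (k, l) when a[i][k] and b[j][l] agree within a
    relative tolerance (an assumed, simplified compatibility rule)."""
    nodes = list(product(range(len(a)), range(len(b))))
    graph: Dict[Node, Set[Node]] = {n: set() for n in nodes}
    for (i, j), (k, l) in product(nodes, nodes):
        if i == k or j == l:
            continue  # each taxon may be matched at most once
        d1, d2 = a[i][k], b[j][l]
        if abs(d1 - d2) <= tol * max(d1, d2):
            graph[(i, j)].add((k, l))
    return graph


def max_clique(graph: Dict[Node, Set[Node]]) -> List[Node]:
    """Largest clique found by Bron-Kerbosch with pivoting."""
    best: List[Node] = []

    def expand(r: Set[Node], p: Set[Node], x: Set[Node]) -> None:
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = sorted(r)
            return
        pivot = max(p | x, key=lambda n: len(graph[n] & p))
        for v in list(p - graph[pivot]):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return best


if __name__ == "__main__":
    a = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
    b = [[0, 1, 2, 9], [1, 0, 3, 9], [2, 3, 0, 9], [9, 9, 9, 0]]
    print(max_clique(build_graph(a, b)))
```

In the toy input, the fourth taxon of `b` has distances that match nothing in `a`, so the largest clique pairs only the first three taxa of each matrix. Each clique corresponds to a set of taxa whose pairwise distances are mutually consistent across both matrices, which is the common submatrix the abstract refers to.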

    Genetic Differentiation of the Western Capercaillie Highlights the Importance of South-Eastern Europe for Understanding the Species Phylogeography

    The Western Capercaillie (Tetrao urogallus L.) is a grouse species of open boreal or high altitude forests of Eurasia. It is endangered throughout most mountain range habitat areas in Europe. Two major genetically identifiable lineages of Western Capercaillie have been described to date: the southern lineage at the species' southernmost range of distribution in Europe, and the boreal lineage. We address the question of genetic differentiation of capercaillie populations from the Rhodope and Rila Mountains in Bulgaria, across the Dinaric Mountains to the Slovenian Alps. The two lineages' contact zone and resulting conservation strategies in this so-far understudied area of distribution have not been previously determined. The results of analysis of mitochondrial DNA control region sequences of 319 samples from the studied populations show that Alpine populations were composed exclusively of boreal lineage; Dinaric populations of both, but predominantly (96%) of boreal lineage; and Rhodope-Rila populations predominantly (>90%) of southern lineage individuals. The Bulgarian mountains were identified as the core area of the southern lineage, and the Dinaric Mountains as the western contact zone between both lineages in the Balkans. Bulgarian populations appeared genetically distinct from Alpine and Dinaric populations and exhibited characteristics of a long-term stationary population, suggesting that they should be considered as a glacial relict and probably a distinct subspecies. Although all of the studied populations suffered a decline in the past, the significantly lower level of genetic diversity when compared with the neighbouring Alpine and Bulgarian populations suggests that the isolated Dinaric capercaillie is particularly vulnerable to continuing population decline. The results are discussed in the context of conservation of the species in the Balkans, its principal threats and legal protection status. 
    Potential conservation strategies should consider the existence of the two lineages and their vulnerable Dinaric contact zone and support the specificities of the populations

    Controlling Population Evolution in the Laboratory to Evaluate Methods of Historical Inference

    Natural populations of known detailed past demographic history are extremely valuable to evaluate methods of historical inference, yet are extremely rare. As an alternative approach, we have generated multiple replicate microsatellite data sets from laboratory-cultured populations of a gonochoric free-living nematode, Caenorhabditis remanei, that were constrained to pre-defined demographic histories featuring different levels of migration among populations or bottleneck events of different magnitudes. These data sets were then used to evaluate the performances of two recently developed population genetics methods, BayesAss+, that estimates recent migration rates among populations, and Bottleneck, that detects the occurrence of recent bottlenecks. Migration rates inferred by BayesAss+ were generally over-estimates, although these were often included within the confidence interval. Analyses of data sets simulated in-silico, using a model mimicking the laboratory experiments, produced less biased estimates of the migration rates, and showed increased efficiency of the program when the number of loci and sampled genotypes per population was higher. In the replicates for which the pre-bottleneck laboratory-cultured populations did not significantly depart from a mutation/drift equilibrium, an important assumption of the program Bottleneck, only a portion of the bottleneck events were detected. This result was confirmed by in-silico simulations mirroring the laboratory bottleneck experiments. More generally, our study demonstrates the feasibility, and highlights some of the limits, of the approach that consists in generating molecular genetic data sets by controlling the evolution of laboratory-reared nematode populations, for the purpose of validating methods inferring population history

    New Insights into the Lake Chad Basin Population Structure Revealed by High-Throughput Genotyping of Mitochondrial DNA Coding SNPs

    BACKGROUND: Located in the Sudan belt, the Chad Basin forms a remarkable ecosystem, where several unique agricultural and pastoral techniques have been developed. Both from an archaeological and a genetic point of view, this region has been interpreted to be the center of a bidirectional corridor connecting West and East Africa, as well as a meeting point for populations coming from North Africa through the Saharan desert. METHODOLOGY/PRINCIPAL FINDINGS: Samples from twelve ethnic groups from the Chad Basin (n = 542) have been high-throughput genotyped for 230 coding region mitochondrial DNA (mtDNA) Single Nucleotide Polymorphisms (mtSNPs) using Matrix-Assisted Laser Desorption/Ionization Time-Of-Flight (MALDI-TOF) mass spectrometry. This set of mtSNPs allowed for much better phylogenetic resolution than previous studies of this geographic region, enabling new insights into its population history. Notable haplogroup (hg) heterogeneity has been observed in the Chad Basin mirroring the different demographic histories of these ethnic groups. As estimated using a Bayesian framework, nomadic populations showed negative growth which was not always correlated to their estimated effective population sizes. Nomads also showed lower diversity values than sedentary groups. CONCLUSIONS/SIGNIFICANCE: Compared to sedentary population, nomads showed signals of stronger genetic drift occurring in their ancestral populations. These populations, however, retained more haplotype diversity in their hypervariable segments I (HVS-I), but not their mtSNPs, suggesting a more ancestral ethnogenesis. Whereas the nomadic population showed a higher Mediterranean influence signaled mainly by sub-lineages of M1, R0, U6, and U5, the other populations showed a more consistent sub-Saharan pattern. Although lifestyle may have an influence on diversity patterns and hg composition, analysis of molecular variance has not identified these differences. 
    The present study indicates that analysis of mtSNPs at high resolution could be a fast and extensive approach for screening variation in population studies where labor-intensive techniques such as entire genome sequencing remain unfeasible

    Genotyping of Bacillus cereus Strains by Microarray-Based Resequencing

    The ability to distinguish microbial pathogens from closely related but nonpathogenic strains is key to understanding the population biology of these organisms. In this regard, Bacillus anthracis, the bacterium that causes inhalational anthrax, is of interest because it is closely related to, and often difficult to distinguish from, other members of the B. cereus group that can cause diverse diseases. We employed custom-designed resequencing arrays (RAs) based on the genome sequence of Bacillus anthracis to generate 422 kb of genomic sequence from a panel of 41 Bacillus cereus sensu lato strains. Here we show that RAs represent a “one reaction” genotyping technology with the ability to discriminate between highly similar B. anthracis isolates and more divergent strains of the B. cereus s.l. Clade 1. Our data show that RAs can be an efficient genotyping technology for pre-screening the genetic diversity of large strain collections to select the best candidates for whole genome sequencing