
    Who Watches the Watchmen? An Appraisal of Benchmarks for Multiple Sequence Alignment

    Multiple sequence alignment (MSA) is a fundamental and ubiquitous technique in bioinformatics used to infer related residues among biological sequences. Alignment accuracy is therefore crucial to a vast range of analyses, often in ways that are difficult to assess within those analyses. To compare the performance of different aligners and help detect systematic errors in alignments, a number of benchmarking strategies have been pursued. Here we present an overview of the main strategies (based on simulation, consistency, protein structure, and phylogeny) and discuss their different advantages and associated risks. We outline a set of desirable characteristics for effective benchmarking, and evaluate each strategy in light of them. We conclude that there is currently no universally applicable means of benchmarking MSA, and that developers and users of alignment tools should choose a benchmark according to the context of application, with a keen awareness of the assumptions underlying each benchmarking strategy. Comment: Revie

    Unusual quasars from the Sloan Digital Sky Survey selected by means of Kohonen self-organising maps

    We exploit the spectral archive of the Sloan Digital Sky Survey (SDSS) Data Release 7 to select unusual quasar spectra. The selection method combines the power of self-organising maps with the visual inspection of a huge number of spectra. Self-organising maps were applied to nearly 10^5 spectra classified as quasars by the SDSS pipeline. Particular attention was paid to minimising possible contamination by rare peculiar stellar spectral types. We present a catalogue of 1005 quasars with unusual spectra. This large sample provides a useful resource both for studying the properties of, and relations between, different types of unusual quasars and for selecting particularly interesting objects. The spectra are grouped into six types. All these types turn out to be on average more luminous than comparison samples of normal quasars after a statistical correction is made for intrinsic reddening. Both the unusual broad absorption line (BAL) quasars and the strong iron emitters have significantly lower radio luminosities than normal quasars. We also confirm that strong BALs avoid the most radio-luminous quasars. Finally, we create a sample of quasars similar to the two "mysterious" objects discovered by Hall et al. (2002) and briefly discuss the quasar properties and possible explanations of their highly peculiar spectra. (Abstract modified to match the arXiv format) Comment: Added reference to section 6; a few typos corrected; corrections according to the version published in Astronomy and Astrophysic

    Dynamics of Glycoprotein Charge in the Evolutionary History of Human Influenza

    Influenza viruses show a significant capacity to evade host immunity; this is manifest both in large occasional jumps in the antigenic phenotype of viral surface molecules and in gradual antigenic changes leading to annual influenza epidemics in humans. Recent mouse studies show that avidity for host cells can play an important role in polyclonal antibody escape, and further that electrostatic charge of the hemagglutinin glycoprotein can contribute to such avidity. We test the role of glycoprotein charge on sequence data from the three major subtypes of influenza A in humans, using a simple method of calculating net glycoprotein charge. Of all subtypes, H3N2 in humans shows a striking pattern of increasing positive charge since its introduction in 1968. Notably, this trend applies to both hemagglutinin and neuraminidase glycoproteins. In the late 1980s hemagglutinin charge reached a plateau, while neuraminidase charge started to decline. We identify key groups of amino acid sites involved in this charge trend. To our knowledge these are the first indications that, for human H3N2, net glycoprotein charge covaries strongly with antigenic drift on a global scale. Further work is needed to elucidate how such charge interacts with other immune escape mechanisms, such as glycosylation, and we discuss important questions arising for future study.
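The abstract above does not spell out its "simple method" for net glycoprotein charge. As a minimal sketch, one common convention is assumed here: lysine (K) and arginine (R) count as +1, aspartate (D) and glutamate (E) as -1, and every other residue, including histidine, as 0. The names `net_charge` and `charge_by_year` are hypothetical and are not taken from the paper.

```python
# Assumed convention (not confirmed by the abstract): K and R are +1,
# D and E are -1, all other residues (including H) contribute 0.
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def net_charge(sequence: str) -> int:
    """Return (#K + #R) - (#D + #E) for an amino acid sequence."""
    return sum(CHARGE.get(residue, 0) for residue in sequence.upper())


def charge_by_year(records):
    """Average net charge per year for an iterable of (year, sequence) pairs."""
    totals = {}
    for year, seq in records:
        total, count = totals.get(year, (0, 0))
        totals[year] = (total + net_charge(seq), count + 1)
    # Sort by year so that a temporal trend is easy to read off.
    return {year: total / count for year, (total, count) in sorted(totals.items())}
```

Applied to dated hemagglutinin or neuraminidase sequences, `charge_by_year` would give the kind of per-year summary in which a trend such as the one reported for H3N2 could be looked for. The weighting scheme is the main assumption to revisit.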

    Molecular characterization of partial-open reading frames 1a and 2 of the human astroviruses in South Korea

    Human astroviruses (HAstVs) are among the major causes of gastroenteritis in South Korea. In this study, partial regions of the open reading frame (ORF) 1a and ORF2 genes of HAstVs from gastroenteritis patients in nine hospitals were sequenced and characterized at the molecular level. In total, 89 partial nucleotide sequences of ORF1a and 88 partial nucleotide sequences of ORF2 were amplified from 120 stool specimens. Phylogenetic analysis showed that most of the ORF1a and ORF2 nucleotide sequences grouped with HAstV type 1 but were genetically distant from reference sequences such as the HAstV-1 prototype, the Dresden strain, and the Oxford strain. Some nucleotide sequences, including SE0506041, SE0506043, and SE0506058, showed discrepant genotypes in the phylogenetic analysis, but there was no evidence of recombination among the HAstV types. In conclusion, this study showed that the dominant HAstV isolated from the Seoul metropolitan area in 2004-2005 was HAstV type 1, and that Korean HAstV-1 was evolutionarily distant from the HAstV reference sequences. The large number of ORF1a and ORF2 nucleotide sequences obtained will be useful for studies on the control and prevention of HAstV gastroenteritis in South Korea.

    Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs

    Background: A standard procedure in many areas of bioinformatics is to use a single multiple sequence alignment (MSA) as the basis for various types of analysis. However, downstream results may be highly sensitive to the alignment used, and neglecting the uncertainty in the alignment can lead to significant bias in the resulting inference. In recent years, a number of approaches have been developed for probabilistic sampling of alignments, rather than simply generating a single optimum. However, this type of probabilistic information is currently not widely used in the context of downstream inference, since most existing algorithms are set up to make use of a single alignment. Results: In this work we present a framework for representing a set of sampled alignments as a directed acyclic graph (DAG) whose nodes are alignment columns; each path through this DAG then represents a valid alignment. Since the probabilities of individual columns can be estimated from empirical frequencies, this approach enables sample-based estimation of posterior alignment probabilities. Moreover, due to conditional independencies between columns, the graph structure encodes a much larger set of alignments than the original set of sampled MSAs, such that the effective sample size is greatly increased. Conclusions: The alignment DAG provides a natural way to represent a distribution in the space of MSAs, and allows for existing algorithms to be efficiently scaled up to operate on large sets of alignments. As an example, we show how this can be used to compute marginal probabilities for tree topologies, averaging over a very large number of MSAs. This framework can also be used to generate a statistically meaningful summary alignment; example applications show that this summary alignment is consistently more accurate than the majority of the alignment samples, leading to improvements in downstream tree inference.
    Implementations of the methods described in this article are available at http://statalign.github.io/WeaveAlign
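The idea of a DAG whose nodes are alignment columns can be illustrated with a small sketch. This is not the WeaveAlign implementation. The column encoding is an assumption made for illustration: each column is keyed by, for every sequence, the number of residues already consumed and whether the column holds a residue or a gap. The names `columns`, `build_dag` and `count_paths` are hypothetical.

```python
from collections import Counter, defaultdict
from functools import lru_cache

START, END = "START", "END"  # sentinel nodes, an assumption of this sketch


def columns(alignment):
    """Yield one hashable key per non-empty column of a gapped alignment.

    `alignment` is a list of equal-length rows that use '-' for gaps.
    """
    consumed = [0] * len(alignment)
    for col in zip(*alignment):
        if all(ch == "-" for ch in col):
            continue  # all-gap columns carry no information
        yield tuple((consumed[i], ch != "-") for i, ch in enumerate(col))
        for i, ch in enumerate(col):
            if ch != "-":
                consumed[i] += 1


def build_dag(samples):
    """Return (column counts, adjacency sets) for a list of sampled alignments."""
    node_counts = Counter()
    edges = defaultdict(set)
    for alignment in samples:
        path = [START, *columns(alignment), END]
        node_counts.update(path[1:-1])
        for a, b in zip(path, path[1:]):
            edges[a].add(b)
    return node_counts, edges


def count_paths(edges):
    """Count START-to-END paths, i.e. alignments encoded by the DAG."""
    @lru_cache(maxsize=None)
    def paths_from(node):
        if node == END:
            return 1
        return sum(paths_from(nxt) for nxt in edges[node])
    return paths_from(START)


# Two sampled alignments of the same pair of sequences share one column.
samples = [["ABC", "ABC"], ["A-BC-", "-AB-C"]]
node_counts, edges = build_dag(samples)
n_alignments = count_paths(edges)
```

In this toy case the two samples share the middle column, so the DAG encodes four alignments rather than two. This is the sense in which the graph represents more alignments than were sampled. Dividing `node_counts` by the number of samples gives the empirical column frequencies mentioned in the abstract.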

    A Full Year's Chandra Exposure on SDSS Quasars from the Chandra Multiwavelength Project

    We study the spectral energy distributions and evolution of a large sample of optically selected quasars from the Sloan Digital Sky Survey (SDSS) that were observed in 323 Chandra images analyzed by the Chandra Multiwavelength Project (ChaMP). Our highest-confidence matched sample includes 1135 X-ray detected quasars in the redshift range 0.2<z<5.4, representing some 36 Msec of effective exposure. Spectroscopic redshifts are available for about 1/3 of the detected sample; elsewhere, redshifts are estimated photometrically. With 56 z>3 QSOs detected, we find no evidence for evolution out to z~5 for either the X-ray photon index Gamma or for the ratio of optical/UV to X-ray flux alpha_ox. About 10% of detected QSOs are obscured (Nh>1E22), but the fraction might reach ~1/3 if most non-detections are absorbed. We confirm a significant correlation between alpha_ox and optical luminosity, but it flattens or disappears for fainter AGN alone. Gamma hardens significantly both towards higher X-ray luminosity, and for relatively X-ray loud quasars. These trends may represent a relative increase in non-thermal X-ray emission, and our findings thereby strengthen analogies between Galactic black hole binaries and AGN. Comment: 28 pages, 21 figures. Accepted (26 Aug 2008) for publication in ApJS. Electronic datafiles (for tables 2 and 3) and high resolution figures available at http://hea-www.harvard.edu/CHAMP

    Identifying Changes in Selective Constraints: Host Shifts in Influenza

    The natural reservoir of influenza A is waterfowl. Normally, waterfowl viruses are not adapted to infect and spread in the human population. Sometimes, through reassortment or through whole host shift events, genetic material from waterfowl viruses is introduced into the human population, causing worldwide pandemics. Identifying which mutations allow viruses of avian origin to spread successfully in the human population is of great importance in predicting and controlling influenza pandemics. Here we describe a novel approach to identify such mutations. We use a sitewise non-homogeneous phylogenetic model that explicitly takes into account differences in the equilibrium frequencies of amino acids in different hosts and locations. We identify 172 amino acid sites with strong support and 518 sites with moderate support for different selective constraints in human and avian viruses. The sites that we identify provide an invaluable resource to experimental virologists studying adaptation of avian flu viruses to the human host. Identification of the sequence changes necessary for host shifts would help us predict the pandemic potential of various strains. The method is broadly applicable to investigating changes in selective constraints when the timing of the changes is known.

    Prevalence of Epistasis in the Evolution of Influenza A Surface Proteins

    The surface proteins of human influenza A viruses experience positive selection to escape both human immunity and, more recently, antiviral drug treatments. In bacteria and viruses, immune-escape and drug-resistant phenotypes often appear through a combination of several mutations that have epistatic effects on pathogen fitness. However, the extent and structure of epistasis in influenza viral proteins have not been systematically investigated. Here, we develop a novel statistical method to detect positive epistasis between pairs of sites in a protein, based on the observed temporal patterns of sequence evolution. The method rests on the simple idea that a substitution at one site should rapidly follow a substitution at another site if the sites are positively epistatic. We apply this method to the surface proteins hemagglutinin and neuraminidase of influenza A virus subtypes H3N2 and H1N1. Compared to a non-epistatic null distribution, we detect substantial amounts of epistasis and determine the identities of putatively epistatic pairs of sites. In particular, using sequence data alone, our method identifies epistatic interactions between specific sites in neuraminidase that have recently been demonstrated, in vitro, to confer resistance to the drug oseltamivir; these epistatic interactions are responsible for widespread drug resistance among H1N1 viruses circulating today. This experimental validation demonstrates the predictive power of our method to identify epistatic sites of importance for viral adaptation and public health. We conclude that epistasis plays a large role in shaping the molecular evolution of influenza viruses. In particular, sites with , which would normally not be identified as positively selected, can facilitate viral adaptation through epistatic interactions with their partner sites. 
The knowledge of specific interactions among sites in influenza proteins may help us to predict the course of antigenic evolution and, consequently, to select more appropriate vaccines and drugs

    Integrating sequence and array data to create an improved 1000 Genomes Project haplotype reference panel

    A major use of the 1000 Genomes Project (1000GP) data is genotype imputation in genome-wide association studies (GWAS). Here we develop a method to estimate haplotypes from low-coverage sequencing data that can take advantage of single-nucleotide polymorphism (SNP) microarray genotypes on the same samples. First the SNP array data are phased to build a backbone (or 'scaffold') of haplotypes across each chromosome. We then phase the sequence data 'onto' this haplotype scaffold. This approach can take advantage of relatedness between sequenced and non-sequenced samples to improve accuracy. We use this method to create a new 1000GP haplotype reference set for use by the human genetic community. Using a set of validation genotypes at SNP and bi-allelic indels we show that these haplotypes have lower genotype discordance and improved imputation performance into downstream GWAS samples, especially at low-frequency variants. © 2014 Macmillan Publishers Limited. All rights reserved