337 research outputs found
High-Resolution Copy-Number Variation Map Reflects Human Olfactory Receptor Diversity and Evolution
Olfactory receptors (ORs), which are involved in odorant recognition, form the largest mammalian protein superfamily. The genomic content of OR genes is considerably reduced in humans, as reflected by the relatively small repertoire size and the high fraction (∼55%) of human pseudogenes. Since several recent low-resolution surveys suggested that OR genomic loci are frequently affected by copy-number variants (CNVs), we hypothesized that CNVs may play an important role in the evolution of the human olfactory repertoire. We used high-resolution oligonucleotide tiling microarrays to detect CNVs across 851 OR gene and pseudogene loci. Examining genomic DNA from 25 individuals with ancestry from three populations, we identified 93 OR gene loci and 151 pseudogene loci affected by CNVs, generating a mosaic of OR dosages across persons. Our data suggest that ∼50% of the CNVs involve more than one OR, with the largest CNV spanning 11 loci. In contrast to earlier reports, we observe that CNVs are more frequent among OR pseudogenes than among intact genes, presumably due to both selective constraints and CNV formation biases. Furthermore, our results show an enrichment of CNVs among ORs with a close human paralog or lacking a one-to-one ortholog in chimpanzee. Interestingly, among the latter we observed an enrichment in CNV losses over gains, a finding potentially related to the known diminution of the human OR repertoire. Quantitative PCR experiments performed for 122 sampled ORs agreed well with the microarray results and uncovered 23 additional CNVs. Importantly, these experiments allowed us to uncover nine common deletion alleles that affect 15 OR genes and five pseudogenes. Comparison to the chimpanzee reference genome revealed that all of the deletion alleles are human derived, therefore indicating a profound effect of human-specific deletions on the individual OR gene content. Furthermore, these deletion alleles may be used in future genetic association studies of olfactory inter-individual differences
Integrating Sequencing Technologies in Personal Genomics: Optimal Low Cost Reconstruction of Structural Variants
The goal of human genome re-sequencing is obtaining an accurate assembly of an individual's genome. Recently, there has been great excitement in the development of many technologies for this (e.g. medium and short read sequencing from companies such as 454 and SOLiD, and high-density oligo-arrays from Affymetrix and NimbelGen), with even more expected to appear. The costs and sensitivities of these technologies differ considerably from each other. As an important goal of personal genomics is to reduce the cost of re-sequencing to an affordable point, it is worthwhile to consider optimally integrating technologies. Here, we build a simulation toolbox that will help us optimally combine different technologies for genome re-sequencing, especially in reconstructing large structural variants (SVs). SV reconstruction is considered the most challenging step in human genome re-sequencing. (It is sometimes even harder than de novo assembly of small genomes because of the duplications and repetitive sequences in the human genome.) To this end, we formulate canonical problems that are representative of issues in reconstruction and are of small enough scale to be computationally tractable and simulatable. Using semi-realistic simulations, we show how we can combine different technologies to optimally solve the assembly at low cost. With mapability maps, our simulations efficiently handle the inhomogeneous repeat-containing structure of the human genome and the computational complexity of practical assembly algorithms. They quantitatively show how combining different read lengths is more cost-effective than using one length, how an optimal mixed sequencing strategy for reconstructing large novel SVs usually also gives accurate detection of SNPs/indels, how paired-end reads can improve reconstruction efficiency, and how adding in arrays is more efficient than just sequencing for disentangling some complex SVs. Our strategy should facilitate the sequencing of human genomes at maximum accuracy and low cost
Systematic Inference of Copy-Number Genotypes from Personal Genome Sequencing Data Reveals Extensive Olfactory Receptor Gene Content Diversity
Copy-number variations (CNVs) are widespread in the human genome, but comprehensive assignments of integer locus copy-numbers (i.e., copy-number genotypes) that, for example, enable discrimination of homozygous from heterozygous CNVs, have remained challenging. Here we present CopySeq, a novel computational approach with an underlying statistical framework that analyzes the depth-of-coverage of high-throughput DNA sequencing reads, and can incorporate paired-end and breakpoint junction analysis based CNV-analysis approaches, to infer locus copy-number genotypes. We benchmarked CopySeq by genotyping 500 chromosome 1 CNV regions in 150 personal genomes sequenced at low-coverage. The assessed copy-number genotypes were highly concordant with our performed qPCR experiments (Pearson correlation coefficient 0.94), and with the published results of two microarray platforms (95–99% concordance). We further demonstrated the utility of CopySeq for analyzing gene regions enriched for segmental duplications by comprehensively inferring copy-number genotypes in the CNV-enriched >800 olfactory receptor (OR) human gene and pseudogene loci. CopySeq revealed that OR loci display an extensive range of locus copy-numbers across individuals, with zero to two copies in some OR loci, and two to nine copies in others. Among genetic variants affecting OR loci we identified deleterious variants including CNVs and SNPs affecting ∼15% and ∼20% of the human OR gene repertoire, respectively, implying that genetic variants with a possible impact on smell perception are widespread. Finally, we found that for several OR loci the reference genome appears to represent a minor-frequency variant, implying a necessary revision of the OR repertoire for future functional studies. CopySeq can ascertain genomic structural variation in specific gene families as well as at a genome-wide scale, where it may enable the quantitative evaluation of CNVs in genome-wide association studies involving high-throughput sequencing
On the power and the systematic biases of the detection of chromosomal inversions by paired-end genome sequencing
One of the most used techniques to study structural variation at a genome level is paired-end mapping (PEM). PEM has the advantage of being able to detect balanced events, such as inversions and translocations. However, inversions are still quite difficult to predict reliably, especially from high-throughput sequencing data. We simulated realistic PEM experiments with different combinations of read and library fragment lengths, including sequencing errors and meaningful base-qualities, to quantify and track down the origin of false positives and negatives along sequencing, mapping, and downstream analysis. We show that PEM is very appropriate to detect a wide range of inversions, even with low coverage data. However, % of inversions located between segmental duplications are expected to go undetected by the most common sequencing strategies. In general, longer DNA libraries improve the detectability of inversions far better than increments of the coverage depth or the read length. Finally, we review the performance of three algorithms to detect inversions -SVDetect, GRIAL, and VariationHunter-, identify common pitfalls, and reveal important differences in their breakpoint precisions. These results stress the importance of the sequencing strategy for the detection of structural variants, especially inversions, and offer guidelines for the design of future genome sequencing projects
Extensive Copy-Number Variation of Young Genes across Stickleback Populations
MM received funding from the Max Planck innovation funds for this project. PGDF was supported by a Marie Curie European Reintegration Grant (proposal nr 270891). CE was supported by German Science Foundation grants (DFG, EI 841/4-1 and EI 841/6-1). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript
A Hidden Markov Model for Copy Number Variant prediction from whole genome resequencing data
Motivation: Copy Number Variants (CNVs) are important genetic factors for studying human diseases. While high-throughput whole genome re-sequencing provides multiple lines of evidence for detecting CNVs, computational algorithms need to be tailored for different type or size of CNVs under different experimental designs. Results: To achieve optimal power and resolution of detecting CNVs at low depth of coverage, we implemented a Hidden Markov Model that integrates both depth of coverage and mate-pair relationship. The novelty of our algorithm is that we infer the likelihood of carrying a deletion jointly from multiple mate pairs in a region without the requirement of a single mate pairs being obvious outliers. By integrating all useful information in a comprehensive model, our method is able to detect medium-size deletions (200-2000bp) at low depth (<10× per sample). We applied the method to simulated data and demonstrate the power of detecting medium-size deletions is close to theoretical values. Availability: A program implemented in Java, Zinfandel, is available at http://www.cs.columbia.edu/~itsik/zinfandel
PhyloPat: phylogenetic pattern analysis of eukaryotic genes
BACKGROUND: Phylogenetic patterns show the presence or absence of certain genes or proteins in a set of species. They can also be used to determine sets of genes or proteins that occur only in certain evolutionary branches. Phylogenetic patterns analysis has routinely been applied to protein databases such as COG and OrthoMCL, but not upon gene databases. Here we present a tool named PhyloPat which allows the complete Ensembl gene database to be queried using phylogenetic patterns. DESCRIPTION: PhyloPat is an easy-to-use webserver, which can be used to query the orthologies of all complete genomes within the EnsMart database using phylogenetic patterns. This enables the determination of sets of genes that occur only in certain evolutionary branches or even single species. We found in total 446,825 genes and 3,164,088 orthologous relationships within the EnsMart v40 database. We used a single linkage clustering algorithm to create 147,922 phylogenetic lineages, using every one of the orthologies provided by Ensembl. PhyloPat provides the possibility of querying with either binary phylogenetic patterns (created by checkboxes) or regular expressions. Specific branches of a phylogenetic tree of the 21 included species can be selected to create a branch-specific phylogenetic pattern. Users can also input a list of Ensembl or EMBL IDs to check which phylogenetic lineage any gene belongs to. The output can be saved in HTML, Excel or plain text format for further analysis. A link to the FatiGO web interface has been incorporated in the HTML output, creating easy access to functional information. Finally, lists of omnipresent, polypresent and oligopresent genes have been included. CONCLUSION: PhyloPat is the first tool to combine complete genome information with phylogenetic pattern querying. Since we used the orthologies generated by the accurate pipeline of Ensembl, the obtained phylogenetic lineages are reliable. The completeness and reliability of these phylogenetic lineages will further increase with the addition of newly found orthologous relationships within each new Ensembl release
Drug-resistant genotypes and multi-clonality in Plasmodium falciparum analysed by direct genome sequencing from peripheral blood of malaria patients.
Naturally acquired blood-stage infections of the malaria parasite Plasmodium falciparum typically harbour multiple haploid clones. The apparent number of clones observed in any single infection depends on the diversity of the polymorphic markers used for the analysis, and the relative abundance of rare clones, which frequently fail to be detected among PCR products derived from numerically dominant clones. However, minority clones are of clinical interest as they may harbour genes conferring drug resistance, leading to enhanced survival after treatment and the possibility of subsequent therapeutic failure. We deployed new generation sequencing to derive genome data for five non-propagated parasite isolates taken directly from 4 different patients treated for clinical malaria in a UK hospital. Analysis of depth of coverage and length of sequence intervals between paired reads identified both previously described and novel gene deletions and amplifications. Full-length sequence data was extracted for 6 loci considered to be under selection by antimalarial drugs, and both known and previously unknown amino acid substitutions were identified. Full mitochondrial genomes were extracted from the sequencing data for each isolate, and these are compared against a panel of polymorphic sites derived from published or unpublished but publicly available data. Finally, genome-wide analysis of clone multiplicity was performed, and the number of infecting parasite clones estimated for each isolate. Each patient harboured at least 3 clones of P. falciparum by this analysis, consistent with results obtained with conventional PCR analysis of polymorphic merozoite antigen loci. We conclude that genome sequencing of peripheral blood P. falciparum taken directly from malaria patients provides high quality data useful for drug resistance studies, genomic structural analyses and population genetics, and also robustly represents clonal multiplicity
- …