30 research outputs found

    Estimation of Sequencing Error Rates in Short Reads

    Background: Short-read data from next-generation sequencing technologies are now being generated across a range of research projects. The fidelity of these data can be affected by several factors and it is important to have simple and reliable approaches for monitoring it at the level of individual experiments. Results: We developed a fast, scalable and accurate approach to estimating error rates in short reads, which has the added advantage of not requiring a reference genome. We build on the fundamental observation that there is a linear relationship between the copy number for a given read and the number of erroneous reads that differ from the read of interest by one or two bases. The slope of this relationship can be transformed to give an estimate of the error rate, both by read and by position. We present simulation studies as well as analyses of real data sets illustrating the precision and accuracy of this method, and we show that it is more accurate than alternatives that count the difference between the sample of interest and a reference genome. We show how this methodology led to the detection of mutations in the genome of the PhiX strain used for calibration of Illumina data. The proposed method is implemented in an R package, which can be downloaded from http://bcb.dfci.harvard.edu/~vwang/shadowRegression.html. Conclusions: The proposed method can be used to monitor the quality of sequencing pipelines at the level of individual experiments without the use of reference genomes. Furthermore, having an estimate of the error rates gives one the opportunity to improve analyses and inferences in many applications of next-generation sequencing data.
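
The idea in this abstract, regressing the number of near-identical erroneous reads on the copy number of each abundant read, can be illustrated with a minimal Python sketch. This is not the authors' R package. The function names, the toy data, and the slope-to-rate transform are assumptions made for illustration: if a fraction e of the reads from one template contain an error, the ratio of erroneous to error-free reads is e / (1 - e), so e = slope / (1 + slope), and a per-base rate follows only if errors are assumed independent across positions.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def shadow_counts(reads, top_k=2, max_mismatches=2):
    """Return (copy_number, shadow_count) pairs for the top_k most abundant reads.

    A "shadow" here is any read that differs from the abundant read by
    1..max_mismatches substitutions (hypothetical helper, not the package API).
    """
    counts = Counter(reads)
    pairs = []
    for seq, n in counts.most_common(top_k):
        shadows = sum(
            c
            for other, c in counts.items()
            if len(other) == len(seq) and 1 <= hamming(seq, other) <= max_mismatches
        )
        pairs.append((n, shadows))
    return pairs


def ols_slope(pairs):
    """Ordinary least-squares slope of shadow count on copy number."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    if sxx == 0:
        raise ValueError("need at least two distinct copy numbers")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    return sxy / sxx


def error_rates(slope, read_length):
    """Assumed transform: per-read rate e = slope / (1 + slope); the
    per-base rate additionally assumes independent errors per position."""
    per_read = slope / (1.0 + slope)
    per_base = 1.0 - (1.0 - per_read) ** (1.0 / read_length)
    return per_read, per_base


def make_reads(seq, n_exact, n_shadows):
    """Toy data: n_exact error-free copies plus n_shadows single-base variants."""
    reads = [seq] * n_exact
    for i in range(n_shadows):
        base = "A" if seq[i] != "A" else "C"
        reads.append(seq[:i] + base + seq[i + 1:])
    return reads


# Two unrelated templates, each with one shadow per nine error-free copies.
reads = make_reads("ACGTACGTACGT", 90, 10) + make_reads("TTGGCCAATTGG", 45, 5)
pairs = shadow_counts(reads, top_k=2)
per_read, per_base = error_rates(ols_slope(pairs), read_length=12)
print(pairs, round(per_read, 3))
```

On this toy input the two abundant reads give the points (90, 10) and (45, 5), so the fitted slope is 1/9 and the assumed transform yields a per-read error rate of 0.1. Real data would need many more abundant reads and a more careful definition of which reads count as shadows.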

    GeneSigDB—a curated database of gene expression signatures

    The primary objective of most gene expression studies is the identification of one or more gene signatures: lists of genes whose transcriptional levels are uniquely associated with a specific biological phenotype. Whilst thousands of experimentally derived gene signatures are published, their potential value to the community is limited by their computational inaccessibility. Gene signatures are embedded in published article figures, tables or in supplementary materials, and are frequently presented using non-standard gene or probeset nomenclature. We present GeneSigDB (http://compbio.dfci.harvard.edu/genesigdb), a manually curated database of gene expression signatures. GeneSigDB release 1.0 focuses on cancer and stem cell gene signatures and was constructed from more than 850 publications from which we manually transcribed 575 gene signatures. Most gene signatures (n = 560) were successfully mapped to the genome to extract standardized lists of EnsEMBL gene identifiers. GeneSigDB provides the original gene signature, the standardized gene list and a fully traceable gene mapping history for each gene from the original transcribed data table through to the standardized list of genes. The GeneSigDB web portal is easy to search, allows users to compare their own gene list to those in the database, and lets them download gene signatures in most common gene identifier formats.

    Therapeutic Implications of GIPC1 Silencing in Cancer

    GIPC1 is a cytoplasmic scaffold protein that interacts with numerous receptor signaling complexes, and emerging evidence suggests that it plays a role in tumorigenesis. GIPC1 is highly expressed in a number of human malignancies, including breast, ovarian, gastric, and pancreatic cancers. Suppression of GIPC1 in human pancreatic cancer cells inhibits in vivo tumor growth in immunodeficient mice. To better understand GIPC1 function, we suppressed its expression in human breast and colorectal cancer cell lines and human mammary epithelial cells (HMECs) and assayed both gene expression and cellular phenotype. Suppression of GIPC1 promotes apoptosis in MCF-7, MDA-MB-231, SKBR-3, SW480, and SW620 cells and impairs anchorage-independent colony formation of HMECs. These observations indicate GIPC1 plays an essential role in oncogenic transformation, and its expression is necessary for the survival of human breast and colorectal cancer cells. Additionally, a GIPC1 knock-down gene signature was used to interrogate publicly available breast and ovarian cancer microarray datasets. This GIPC1 signature statistically correlates with a number of breast and ovarian cancer phenotypes and clinical outcomes, including patient survival. Taken together, these data indicate that GIPC1 may represent a new target for therapeutic development for the treatment of human cancers.

    High-throughput genotyping and fingerprinting of Mycobacterium tuberculosis multi-drug resistant strains

    Thesis (Ph.D.)--Boston University. Multiple drug resistance in Mycobacterium tuberculosis poses both significant treatment and epidemiological challenges. Measuring drug resistance in clinical settings is time-consuming and prone to errors, problems that can lead to suboptimal treatments and the selection of further resistance to an increased number of antibacterial drugs. A fast and accurate genotyping assay, directed at mutations that are highly associated with drug resistance, would improve response time and the choice of drugs used to treat multiple drug resistant tuberculosis. From an epidemiologic perspective, tracking the origin and dynamics of drug resistant strains in outbreaks is also a challenge, and existing methods fall short because they lack resolution (spoligotyping) or are too expensive or labor-intensive to implement on a large scale (RFLP, MIRU-VNTR). In this work, I developed methods to adapt and expand a high-throughput targeted resequencing method, based on molecular inversion probes and subsequent Illumina sequencing, to cover 28 protein and rRNA-coding genes described previously as primary and secondary actors in drug resistance. I validated the method on a control set, compared it with traditional Sanger sequencing and whole-genome Illumina sequencing, and applied it to a collection of 1200 drug resistant Mycobacterium tuberculosis strains from all over the world. This project was funded by the Bill and Melinda Gates Foundation, and the result of our work will be freely available as a resource to the research community through a website hosted by the Broad Institute.
For this project, I have written, tested and optimized algorithms for large-scale molecular inversion probe design (MIPDesigner), for next-generation sequence data processing (MIPCleaner), for SNP filtering, and for quality-control metric computation. Molecular inversion probes also provide a mechanism for rapid, high-throughput, molecular fingerprinting of Mycobacterium tuberculosis strains that can be performed simultaneously with the detection of drug resistance mutations. I used my optimized MIPs pipeline to design and test a "virtual spoligotyping" method based on the capture and sequencing of the spacers in the CRISPR locus with a molecular inversion probe. This new method expands the resolution and power of the classical spoligotyping assay and provides a mechanism for the continuous improvement of Mycobacterium tuberculosis fingerprinting.

    Transcriptional profiling with a blood pressure QTL interval-specific oligonucleotide array

    Although the evidence for a genetic predisposition to human essential hypertension is compelling, the genetic control of blood pressure (BP) is poorly understood. The Dahl salt-sensitive (S) rat is a model for studying the genetic component of BP. Using this model, we previously reported the identification of 16 different genomic regions that contain one or more BP quantitative trait loci (QTLs). The proximal region of rat chromosome 1 contains multiple BP QTLs. Of these, we have localized the BP QTL1b region to a 13.5-cM (20.92 Mb) region. Interestingly, five additional independent studies in rats and four independent studies in humans have reported genetic linkage for BP control by regions homologous to QTL1b. To view the overall renal transcriptional topography of the positional candidate genes for this QTL, we sought a comparative gene expression profiling between a congenic strain containing QTL1b and control S rats by employing 1) a saturated QTL1b interval-specific oligonucleotide array and 2) a whole genome cDNA microarray representing 20,465 unique genes that are positioned outside the QTL. Results indicated that 17 of the 231 positional candidate genes for this QTL are differentially expressed between the two strains tested. Surprisingly, >1,500 genes outside of QTL1b were differentially expressed between the two rat strains. Integrating the results from the two approaches revealed at least one complex network of transcriptional control initiated by the positional candidate Nr2f2. This network appears to account for the majority of gene expression differences occurring outside of the QTL interval. Further substitution mapping is currently underway to test the validity of each of these differentially expressed positional candidate genes. These results demonstrate the importance of using a saturated oligonucleotide array for identifying and prioritizing differentially expressed positional candidate genes of a BP QTL.
Copyright © 2005 the American Physiological Society