11 research outputs found

    Inferring progression models for CGH data

    Motivation: One of the mutational processes that has been monitored genome-wide is the occurrence of regional DNA copy number alterations (CNAs), which may lead to deletion or over-expression of tumor suppressors or oncogenes, respectively. Understanding the relationship between CNAs and different cancer types is a fundamental problem in cancer studies. Results: This article develops an efficient method that can accurately model the progression of cancer markers and reconstruct the evolutionary relationships between multiple types of cancer using comparative genomic hybridization (CGH) data. Such modeling can lead to a better understanding of the commonalities and differences between multiple cancer types and of potential therapies. We have developed an automatic method to infer a graph model for the markers of multiple cancers from a large population of CGH data. Our method identifies highly related markers across different cancer types. It then builds a directed acyclic graph that shows the evolutionary history of these markers based on how common each marker is in different cancer types. We demonstrate the use of this model in determining the importance of markers in cancer evolution. We have also developed a new method to measure the evolutionary distance between different cancers based on their markers. This method employs the graph model we developed for the individual markers to measure the distance between pairs of cancers. We used this measure to create an evolutionary tree for multiple cancers. Our experiments on the Progenetix database show that our markers are largely consistent with the reported hot-spot imbalances and most frequent imbalances. The results show that our distance measure can accurately reconstruct the evolutionary relationship between multiple cancer types. Availability: All the code developed in this article is available at http://bioinformatics.cise.ufl.edu/phylogeny.html.
Supplementary information: Supplementary data are available at Bioinformatics online.

    Finding Recurrent Regions of Copy Number Variation: A Review

    Copy number variation (CNV) in genomic DNA is linked to a variety of human diseases, and array-based CGH (aCGH) is currently the main technology to locate CNVs. Although many methods have been developed to analyze aCGH data from a single array/subject, disease-critical genes are more likely to be found in regions that are common or recurrent among subjects. Unfortunately, finding recurrent CNV regions remains a challenge. We review existing methods for the identification of recurrent CNV regions. The working definition of a "common" or "recurrent" region differs between methods, leading to approaches that use different types of input (discretized output from a previous CGH segmentation analysis or intensity ratios), or that incorporate biological considerations to varying degrees (these play a role in the identification of "interesting" regions and in the details of the null models used to assess statistical significance). Very few approaches use and/or return probabilities, and code is not easily available for several methods. We suggest that finding recurrent CNVs could benefit from reframing the problem in a biclustering context. We also emphasize that, when analyzing data from complex diseases with significant among-subject heterogeneity, methods should be able to identify CNVs that affect only a subset of subjects. We make some recommendations about the choice among existing methods, and we suggest further methodological research.

    On the Adaptive Partition Approach to the Detection of Multiple Change-Points

    With an adaptive partition procedure, we can partition a “time course” into consecutive non-overlapping intervals such that the population means/proportions of the observations in two adjacent intervals are significantly different at a given significance level. However, the widely used recursive combination or partition procedures do not guarantee a global optimum. We propose a modified dynamic programming algorithm that achieves a global optimum. Our method can provide consistent estimation results. In a comprehensive simulation study, our method performs better than the recursive combination/partition procedures. In practice, the significance level can be determined based on a cross-validation procedure. As an application, we consider the well-known Pima Indian Diabetes data. We explore the relationship between diabetes risk and several important variables, including plasma glucose concentration, body mass index and age.
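The difference between a greedy recursive split and a globally optimal partition can be illustrated with a minimal dynamic-programming sketch. It is not the modified algorithm of the abstract: it assumes a fixed number of segments `k` and a least-squares cost instead of significance tests between adjacent intervals, and the function name `best_partition` is hypothetical.

```python
from itertools import accumulate


def best_partition(y, k):
    """Split y into k contiguous segments minimizing the total
    within-segment squared error (illustrative assumption, not the
    paper's test-based criterion).
    Returns (cost, list of segment start indices)."""
    n = len(y)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= len(y)")
    # Prefix sums give the squared error of any interval in O(1).
    s1 = [0.0] + list(accumulate(y))
    s2 = [0.0] + list(accumulate(v * v for v in y))

    def sse(i, j):
        # Squared error of y[i:j] around its own mean.
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[m][j]: best cost of splitting y[:j] into m segments.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = cost[m - 1][i] + sse(i, j)
                if c < cost[m][j]:
                    cost[m][j], back[m][j] = c, i
    # Follow the back pointers to recover where each segment starts.
    starts, j = [], n
    for m in range(k, 0, -1):
        j = back[m][j]
        starts.append(j)
    return cost[k][n], starts[::-1]
```

Because every split point is considered for every prefix, the result is optimal for the chosen cost, at the price of O(k n^2) time; a recursive bisection is faster but can commit early to a split that the best k-segment solution does not contain.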

    Genomic imbalances in 5918 malignant epithelial tumors: an explorative meta-analysis of chromosomal CGH data

    BACKGROUND: Chromosomal abnormalities have been associated with most human malignancies, with gains and losses on some genomic regions associated with particular entities. METHODS: Of the 15429 cases collected for the Progenetix molecular-cytogenetic database, 5918 malignant epithelial neoplasias analyzed by chromosomal Comparative Genomic Hybridization (CGH) were selected for further evaluation. For the 22 clinico-pathological entities with more than 50 cases, summary profiles for genomic imbalances were generated from case specific data and analyzed. RESULTS: With large variation in overall genomic instability, recurring genomic gains and losses were prominent. Most entities showed frequent gains involving 8q2, while gains on 20q, 1q, 3q, 5p, 7q and 17q were frequent in different entities. Loss "hot spots" included 3p, 4q, 13q, 17p and 18q among others. Related average imbalance patterns were found for clinically distinct entities, e.g. hepatocellular carcinomas (ca.) and ductal breast ca., as well as for histologically related entities (squamous cell ca. of different sites). CONCLUSION: Although considerable case-by-case variation of genomic profiles can be found by CGH in epithelial malignancies, a limited set of variously combined chromosomal imbalances may be typical for carcinogenesis. Focus on the respective regions should aid in target gene detection and pathway deduction

    Detection of recurrent copy number alterations in the genome: taking among-subject heterogeneity seriously

    A PDF file containing the research data is attached, titled "Supplementary Material for 'Detection of Recurrent Copy Number Alterations in the Genome: taking among-subject heterogeneity seriously'". Background: Alterations in the number of copies of genomic DNA that are common or recurrent among diseased individuals are likely to contain disease-critical genes. Unfortunately, defining common or recurrent copy number alteration (CNA) regions remains a challenge. Moreover, the heterogeneous nature of many diseases requires that we search for common or recurrent CNA regions that affect only some subsets of the samples (without knowledge of the regions and subsets affected), but this is neglected by most methods. Results: We have developed two methods to define recurrent CNA regions from aCGH data. Our methods are unique and qualitatively different from existing approaches: they detect regions over both the complete set of arrays and alterations that are common only to some subsets of the samples (i.e., alterations that might characterize previously unknown groups); they use probabilities of alteration as input and return probabilities of being a common region, thus allowing researchers to modify thresholds as needed; the two parameters of the methods have an immediate, straightforward, biological interpretation. Using data from previous studies, we show that we can detect patterns that other methods miss and that researchers can modify, as needed, thresholds of immediate interpretability and develop custom statistics to answer specific research questions. Conclusion: These methods represent a qualitative advance in the location of recurrent CNA regions, highlight the relevance of population heterogeneity for definitions of recurrence, and can facilitate the clustering of samples with respect to patterns of CNA.
Ultimately, the methods developed can become important tools in the search for genomic regions harboring disease-critical genes. Funding provided by Fundación de Investigación Médica Mutua Madrileña. Publication charges covered by projects CONSOLIDER: CSD2007-00050 of the Spanish Ministry of Science and Innovation and by RTIC COMBIOMED RD07/0067/0014 of the Spanish Health Ministry.

    Detection of Recurrent Copy Number Alterations in the Genome: a Probabilistic Approach

    Copy number variation (CNV) in genomic DNA is linked to a variety of human diseases (including cancer, HIV acquisition, autoimmune and neurodegenerative diseases), and array-based CGH (aCGH) is currently the main technology to locate CNVs. Several methods can analyze aCGH data at the single sample level, but disease-critical genes are more likely to be found in regions that are common or recurrent among samples. Unfortunately, defining recurrent CNV regions remains a challenge. Moreover, the heterogeneous nature of many diseases requires that we search for CNVs that affect only some subsets of the samples (without prior knowledge of which regions and subsets of samples are affected), but this is neglected by current methods. We have developed two methods to define recurrent CNV regions. Our methods are unique and qualitatively different from existing approaches: they detect both regions over the complete set of arrays and alterations that are common only to some subsets of the samples and, thus, CNV alterations that might characterize previously unknown groups; they use probabilities of alteration as input (not discretized gain/loss calls, which discard uncertainty and variability) and return probabilities of being a shared common region, thus allowing researchers to modify thresholds as needed; the two parameters of the methods have an immediate, straightforward, biological interpretation. Using data from previous studies, we show that we can detect patterns that other methods miss and, by using probabilities, that researchers can modify, as needed, thresholds of immediate interpretability to answer specific research questions. These methods are a qualitative advance in the location of recurrent CNV regions and will be instrumental in efforts to standardize definitions of recurrent CNVs and cluster samples with respect to patterns of CNV, and ultimately in the search for genomic regions harboring disease-critical genes
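The idea of working with probabilities of alteration, and of finding regions shared by only a subset of samples, can be sketched as follows. This is a toy illustration and not the authors' methods: it assumes probes are independent within an array (so the region probability is a simple product), and the function name and the two thresholds `p_min` and `min_arrays` are hypothetical stand-ins for the two parameters the abstract mentions without naming.

```python
from math import prod


def subset_sharing_region(probs, start, end, p_min, min_arrays):
    """probs[a][i] is the probability that probe i is altered in array a.

    Returns the indices of the arrays in which the region
    probes[start:end] is altered with probability at least p_min,
    or an empty list if fewer than min_arrays arrays qualify.
    Independence between probes is an illustrative assumption."""
    region_prob = {a: prod(row[start:end]) for a, row in enumerate(probs)}
    hits = [a for a, p in region_prob.items() if p >= p_min]
    return hits if len(hits) >= min_arrays else []


# Three arrays, three probes: arrays 0 and 1 share an alteration
# over the first two probes, array 2 does not.
probs = [
    [0.90, 0.95, 0.10],
    [0.99, 0.90, 0.20],
    [0.20, 0.10, 0.90],
]
shared = subset_sharing_region(probs, 0, 2, p_min=0.8, min_arrays=2)
```

Keeping probabilities until the last step means the thresholds can be changed after the fact, which is not possible once each probe has been reduced to a gain/loss call.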

    An Improved Binary Differential Evolution Algorithm to Infer Tumor Phylogenetic Trees


    Bioinformatic solutions for chromosomal copy number analysis in cancer

    Chromosomal copy number aberrations are one of the main mechanisms that give rise to the proliferative capabilities of cancer cells. These aberrations can be quantified with technologies that generate measurements genome-wide and with high resolution. Hence, they produce vast amounts of data, which requires tailored bioinformatic solutions for analysis and management. Two such high-resolution and genome-wide technologies are DNA microarrays, which are successively replaced by next-generation sequencing approaches. This dissertation describes three novel bioinformatic solutions for copy number analysis in cancer with these technologies. CanGEM is a publicly-accessible database solution for storage of raw and processed copy number data from cancer research experiments. The contents of the database can be queried based on clinical and copy number data. Clinical data is collected using appropriate controlled vocabularies. Copy number data is collected as raw microarray data and automated analysis identifies the locations of chromosomal aberrations. In order to allow integration of data measured with different microarray platforms, a copy number status is derived for every known human gene. CGHpower is a statistical power calculator for copy number experiments that compare two groups. It estimates genome complexity of a cancer type in question from a pilot data set of the sample series, and assesses the number of samples required to satisfy statistical requirements. It can be used either in the planning stages of experiments, including as a justification in grant applications, or to verify whether sufficient samples were included in past experiments. Performance of this bioinformatic solution is evaluated with real and simulated data sets. QDNAseq is a preprocessing solution to detect copy number aberrations from shallow whole-genome next-generation sequencing data. 
It corrects the observed sequencing coverage for known systematic biases and allows filtering of spurious regions in the genome. A new list of such problematic regions is derived from public data generated by the 1000 Genomes Project. Performance of the solution is evaluated relative to other similar published solutions and DNA microarrays, and also compared to theoretical statistical expectations. An application of the QDNAseq method is also presented in a translational research project with the aim to identify copy number aberrations in tumors of patients with low-grade glioma. Aberrations identified by shallow whole-genome next-generation sequencing and QDNAseq are used to evaluate associations with patient survival, and also to assess intratumoral heterogeneity and temporal evolution of these tumors. A loss in chromosome 10q is identified to be associated with poor prognosis, and the finding is validated in two independent data sets. From the assessment of intratumoral heterogeneity and temporal tumor evolution, the well-characterized co-deletion of 1p/19q is found to be the only chromosomal aberration that is consistently present or absent across the entire tumor and possible future recurrences. This is compatible with the present view of its role as an early event in the development of these tumors. The text concludes with a discussion of lessons learned from the development process and application of the three described bioinformatic solutions. Better awareness of and adherence to established best practices from the software development field would have been useful, and together with more careful consideration of implementation decisions could have resulted…

Chromosomal copy number aberrations are one of the most important mechanisms in the development of cancer. Instead of one gene copy inherited from the mother and one from the father, part of the genome may be amplified into several copies, and for some parts one or both copies may have been lost. Copy number aberrations are detected with genome-wide techniques that have high resolution. They produce large amounts of data, whose analysis and handling require tailored bioinformatic methods. These techniques include DNA microarrays and the next-generation sequencing methods that have in practice already displaced them. This dissertation describes three new bioinformatic software tools for analyzing copy number aberrations in cancer samples with these techniques. CanGEM is a public database for collecting raw and processed microarray data from individual cancer studies. Its contents can be searched by clinical variables or copy number aberrations. Clinical variables are recorded using appropriate classification systems. Copy number data are collected as raw microarray measurements, from which copy number aberrations are identified algorithmically. To make it possible to combine data measured on different microarray platforms, a copy number is determined separately for every known human gene. CGHpower is a method for statistical power analysis of copy number studies that compare two groups. The complexity of the copy number aberrations in the data is estimated from a pilot batch of samples, and the sample size needed to meet the statistical requirements is determined. The method can be used either when planning studies, for example in support of funding applications, or to assess whether completed experiments used enough samples. Performance is measured with both real and simulated data sets. QDNAseq is a preprocessing method for identifying copy number aberrations from low-coverage, genome-wide next-generation sequencing data. It corrects the observed read coverage for known sources of bias and allows regions of the genome that are problematic for copy number analysis to be filtered out of downstream processing. A new list of these problematic regions is described, derived from data published by the 1000 Genomes Project. The method's performance is evaluated against other comparable published methods and DNA microarrays, and relative to theoretical statistical expectations. In addition to the method itself, an application of QDNAseq to translational research and to the identification of copy number aberrations in low-grade gliomas is described. A loss of chromosome 10q is found to be associated with poor prognosis, and the finding is confirmed in two independent data sets. The identified copy number aberrations are also used to examine tumor heterogeneity and temporal evolution. The 1p/19q loss common in this cancer type is found to be the only copy number aberration that is consistently either present or absent throughout both the entire original tumor and any recurrences. This observation fits the current view that the aberration arises very early in the development of this cancer type. Finally, lessons learned from developing and applying the described bioinformatic software are examined. Better knowledge of established practices in software development would have been useful and, together with more careful consideration of implementation details, could have helped to produce software that better fulfills its purpose and is easier to develop and maintain…

Aberrations in the number of chromosomes, or of parts of chromosomes, are one of the mechanisms that give rise to the proliferative behavior of cancer cells. These chromosomal aberrations can be measured with high-resolution genomic techniques. These techniques generate very large amounts of data, which require tailored bioinformatic solutions for analysis and data management. The two most relevant high-resolution genomic techniques are microarrays and next-generation sequencing.
Chapter 1 of this dissertation reviews the literature on data analysis for chromosomal aberrations measured with microarrays or next-generation sequencing. It introduces relevant bioinformatic concepts and describes the analytical process from raw data to the identification of numerical chromosomal aberrations in individual tumors, as well as bioinformatic research into the significance of those aberrations in large series of tumors. Chapters 2 through 4 describe three new bioinformatic implementations developed for the analysis of these chromosomal aberrations in cancer. CanGEM (Chapter 2) is a publicly accessible database for storing raw and processed copy number data from cancer research. The contents of the database can be searched on the basis of both clinical data and experimental copy number data. Clinical data are collected using controlled vocabularies. Copy number data are collected as raw microarray data, and the start and end positions of the aberrations are determined automatically each time. To further facilitate the integration of data measured with microarrays of different makes, a copy number per gene is derived for each of the approximately 19,000 to 20,000 human genes. CGHpower (Chapter 3) is a method for calculating how many tumor samples are statistically needed to compare differences and similarities in chromosomal aberrations between two groups of tumors. The complexity of the aberrations in a given type of cancer is estimated from a limited number of samples. The number of tumors needed to satisfy the statistical requirements is then estimated.
CGHpower can be used in the planning phase of a grant application to justify the proposed sample numbers to a funder, or to check whether enough tumors were included in an experiment. CGHpower is evaluated with experimental and simulated data sets. QDNAseq (Chapter 4) is a method that performs a preprocessing step from next-generation sequencing data to copy numbers in the genome of a tumor, assuming sequencing at a depth of only 10% of the whole genome. QDNAseq corrects the observed genome-wide coverage for systematic errors and makes it possible to remove irregular regions of the genome. A list of such systematic errors and irregular regions is derived from public data released by the 1000 Genomes Project. QDNAseq is evaluated against the microarray technique and against other published software for analyzing numerical chromosomal aberrations with next-generation sequencing. Finally, the results of QDNAseq on next-generation sequencing data are compared with theoretically expected statistical results. In the penultimate chapter (Chapter 5), QDNAseq is applied in translational research that aims to identify aberrations in the number of chromosomes, or parts of chromosomes, in tumors of patients with low-grade gliomas. Chromosomal aberrations identified by next-generation sequencing and QDNAseq are used to determine associations with patient survival, the intratumoral heterogeneity of the tumors, and the evolution of these tumors over time. In this study, a loss of the distal part of chromosome 10q is associated with poor prognosis. This finding could be validated in two independent patient series.
Finally, the assessment of intratumoral heterogeneity and tumor evolution shows that loss of chromosome 1p together with 19q is the only aberration that is consistently present or absent in the tumors. As with the three described implementations for the analysis of chromosomal aberrations in cancer, much bioinformatic research is carried out in academic groups. The discussion (Chapter 6) covers the experience gained with the development process and the application of bioinformatic solutions.
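The coverage correction described for QDNAseq in the abstract above can be illustrated with a deliberately simplified sketch. It is not QDNAseq's actual procedure: it assumes GC content is the only bias, groups bins into coarse GC strata, and divides each bin's read count by the median count of its stratum. The function name, the `n_strata` parameter, and the example numbers are hypothetical.

```python
from collections import defaultdict
from statistics import median


def gc_corrected_ratios(counts, gc, n_strata=10):
    """Divide each bin's read count by the median count of bins with a
    similar GC fraction (a stand-in for a smoother bias model).

    counts: read counts per genomic bin
    gc:     GC fraction (0..1) of each bin, same order as counts
    """
    def stratum(g):
        # Clamp so that a GC fraction of exactly 1.0 falls in the last stratum.
        return min(int(g * n_strata), n_strata - 1)

    by_stratum = defaultdict(list)
    for c, g in zip(counts, gc):
        by_stratum[stratum(g)].append(c)
    medians = {s: median(v) for s, v in by_stratum.items()}

    ratios = []
    for c, g in zip(counts, gc):
        m = medians[stratum(g)]
        # A stratum with zero median coverage carries no usable signal.
        ratios.append(c / m if m > 0 else float("nan"))
    return ratios


# Bins with higher GC content get more reads here; after correction only
# the last bin, which has about twice the expected coverage, stands out.
counts = [100, 110, 90, 200, 210, 190, 400]
gc = [0.31, 0.32, 0.33, 0.51, 0.52, 0.53, 0.52]
ratios = gc_corrected_ratios(counts, gc)
```

Using the median within each stratum keeps a few genuinely amplified bins from distorting the expected coverage, which matters because those bins are exactly what the analysis is trying to find.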