612 research outputs found

    Integrative Transcriptomic Analysis of Long Intergenic Non-Coding RNAs in Cancer.

    Ph.D. Thesis. University of Hawaiʻi at Mānoa 2017

    Development of Integrated Machine Learning and Data Science Approaches for the Prediction of Cancer Mutation and Autonomous Drug Discovery of Anti-Cancer Therapeutic Agents

    Few technological ideas have captivated the minds of biochemical researchers to the degree that machine learning (ML) and artificial intelligence (AI) have. Over the last few years, advances in the ML field have driven the design of new computational systems that improve with experience and are able to model increasingly complex chemical and biological phenomena. In this dissertation, we capitalize on these achievements and use machine learning to study drug receptor sites and design drugs to target these sites. First, we analyze the significance of various single nucleotide variations and assess their rate of contribution to cancer. Following that, we use a portfolio of machine learning and data science approaches to design new drugs to target protein kinase inhibitors. We show that these techniques exhibit strong promise in aiding cancer research and drug discovery.

    The Role of Genetic Alterations in Tumor Initiation, Progression and Transformation

    Follicular lymphoma (FL) is the second most common lymphoma in the United States. Although it is generally an indolent lymphoma, FL is not curable, and in about 30% of patients it transforms into an aggressive lymphoma (tFL) with marked worsening of prognosis. To identify mutations preferentially present in tFL, we performed whole exome sequencing (WES) on paired FL and tFL arising in the same patients and developed a mutational analysis pipeline. After identifying potentially important genes mutated in our paired FL and tFL study, we constructed a custom capture platform including these genes as well as other genes known to be mutated in B-cell lymphomas. We used this focused sequencing platform to analyze additional samples at greater sequencing depth. Clonal architecture and evolution can be readily identified; however, the DNA samples were fragmented using restriction enzymes, which compromised duplicate analysis. We developed a new approach with a statistical model to solve this problem. Samples from uninvolved tissue of the same patients are commonly used to distinguish germline variants from somatic mutations; however, germline DNA was often not available for our samples. We designed a filtering-based method to limit the number of germline variants that would be mistakenly called somatic mutations and validated this approach using a dataset with paired normal samples. We also introduced a novel machine learning-based idea to predict somatic mutations from paired FL and tFL samples without healthy tissue. Five machine learning algorithms were tested on datasets with known somatic mutations, and their performance was evaluated by statistical measures. The results indicated that somatic mutations can be reliably predicted. To provide complementary information, we integrated our mutation data with copy number abnormality data and found genes more frequently mutated in tFL cases. The recurrently mutated genes are often involved in epigenetic regulation, the JAK-STAT or the NF-κB pathway, immune surveillance, and cell cycle regulation, or are transcription factors involved in B cell development. As no entirely tFL-specific mutations were found, the transformation event needs to cooperate with pre-existing alterations, and future studies will focus on identifying cooperative mutations for FL transformation.

    Identifying therapeutic targets in glioma using integrated network analysis

    Gliomas are the most common brain tumours in the adult population, with rapid progression and poor prognosis. Survival among patients diagnosed with the most aggressive histopathological subtype of glioma, glioblastoma, is a mere 12.6 months given the current standard of care. While glioblastomas mostly occur in people over 60, the lower-grade gliomas affect individuals in their third and fourth decades of life. Collectively, gliomas are one of the major causes of cancer-related death in individuals under forty in the UK. Over the past twenty years, little has changed in the standard of glioma treatment and the disease has remained incurable. This study focuses on identifying potential therapeutic targets in gliomas using systems-level approaches and large-scale data integration. I used publicly available transcriptomic data to identify gene co-expression networks associated with the progression of IDH1-mutant 1p/19q euploid astrocytomas from grade II to grade III and highlighted hub genes of these networks, which could be targeted to modulate their biological function. I also studied the changes in co-expression patterns between grade II and grade III gliomas and identified a cluster of genes with differential co-expression in different disease states (module M2). By data integration and adaptation of reverse-engineering methods, I elucidated master regulators of module M2. I then sought to counteract the regulatory activity by using a drug-induced gene expression dataset to find compounds inducing gene expression in the opposite direction of the disease signature. I proposed resveratrol as a potentially disease-modifying compound which, when administered to patients with low-grade disease, could potentially delay glioma progression. Finally, I applied an ensemble-learning algorithm to a large-scale loss-of-function viability screen in cancer cell lines with different genetic backgrounds to identify gene dependencies associated with chromosomal copy-number losses common in glioblastomas. I propose five novel target predictions to be validated in future experiments.

    A Survey of Computational Tools to Analyze and Interpret Whole Exome Sequencing Data


    Toward a new era of cancer detection: patient-friendly solutions

    High cancer mortality rates and the rising cancer burden worldwide prioritize the development of innovative methods that facilitate the early and accurate detection of cancer. Combining patient-friendly sampling methods with reliable biomarker testing offers a method that is convenient for patients and effective in detecting cancer at a curable stage, with improved patient outcomes as an ultimate goal. This thesis assessed the feasibility of DNA methylation testing in urine as a diagnostic tool for different cancer types, including endometrial, ovarian, and lung cancer. For endometrial and ovarian cancer, the value of DNA methylation testing in self-collected cervicovaginal samples and clinician-taken cervical scrapes was also investigated.
    Part 1: Endometrial and ovarian cancer detection in patient-friendly samples. Part 1 describes the detection of endometrial and ovarian cancer in urine, cervicovaginal self-samples, and clinician-taken cervical scrapes. The outcomes of Part 1 revealed the value of methylation analysis in patient-friendly sample types for endometrial cancer detection of all stages. Convenient modes of sample collection offer the possibility of at-home collection with high patient acceptability. This approach is clinically useful to screen patient populations at risk for endometrial cancer and to streamline who needs to undergo invasive endometrial tissue sampling. Although promising, the clinical effectiveness of this approach requires further confirmation in additional cohorts, including individuals presenting with postmenopausal bleeding and asymptomatic women at risk for endometrial cancer. The presence of ovarian cancer-derived DNA in the urine provides the first steps toward urine-based diagnostics for ovarian cancer. Further research is needed to explore and refine the use of urine biomarkers for ovarian cancer diagnostics.
    Part 2: Non-small cell lung cancer detection in urine. In Part 2 of this thesis, the diagnostic potential of urine as a liquid biopsy for non-small cell lung cancer (NSCLC) detection was evaluated. The outcomes of Part 2 demonstrate the technical feasibility of detecting NSCLC in the urine using DNA methylation markers. Further research, including larger patient cohorts and controls with benign pulmonary nodules, is needed to validate the clinical usefulness of this approach. The considerable variability between urine samples highlights the need for a more thorough understanding of cfDNA dynamics and enhancements in test development to ensure reliability. Upon further refinement, this test has the potential to serve as a valuable complementary diagnostic tool to low-dose CT screening to guide clinical decisions in patients with pulmonary nodules.

    Computational analysis of human genomic variants and lncRNAs from sequence data

    High-throughput sequencing technologies have been developed and applied to human genome studies for nearly 20 years. These technologies have provided numerous research applications and have significantly expanded our knowledge about the human genome. In this thesis, computational methods that utilize sequence data to study human genomic variants and transcripts were evaluated and developed. Indels, that is, insertions and deletions, are two common types of genomic variants that are widespread in the human genome. Detecting indels in human genomes is a crucial step for diagnosing indel-related genomic disorders and may identify novel indel markers for studying certain diseases. Compared with previous techniques, high-throughput sequencing technologies, especially next-generation sequencing (NGS), enable accurate and efficient indel detection across wide ranges of the genome. In the first part of the thesis, tools with indel calling abilities are evaluated with an assortment of indels and different NGS settings. The results show that the selection of tools and NGS settings significantly affects indel detection, which provides suggestions for tool selection and future development. In bioinformatics analysis, an indel's position can be marked inconsistently on the reference genome, which may result in an indel having different but equivalent representations and cause problems for downstream analysis. This problem is related to the complex sequence context of indels, for example short tandem repeats (STRs), where the same short stretch of nucleotides is amplified. In the second part of the thesis, a novel computational tool, VarSCAT, is described, which has various functions for annotating the sequence context of variants, including ambiguous positions, STRs, and other sequence context features. Analysis of several high-confidence human variant sets with VarSCAT reveals that a large number of genomic variants, especially indels, have sequence features associated with STRs. In the human genome, not all genes and their transcripts are translated into proteins. Long non-coding ribonucleic acid (lncRNA) is a typical example. Sequence recognition built with machine learning models has improved significantly in recent years. In the last part of the thesis, several machine learning-based lncRNA prediction tools were evaluated on their predictions of the coding potential of transcripts. The results suggest that tools based on deep learning identify lncRNAs best.
    Finnish abstract (Ihmisen genomivarianttien ja lncRNA:iden laskennallinen analyysi sekvenssiaineistosta), translated: High-throughput sequencing technologies have been developed and applied to human genome research for nearly 20 years. These technologies have enabled wide-ranging study of the human genome and significantly increased our knowledge of it. In this dissertation, computational methods that utilize sequence data were evaluated and developed for studying human genomic variants and transcripts. Indel is a collective term for insertion variants and deletion variants, which occur throughout the genome. Identifying indels is crucial for diagnosing genetic abnormalities and for finding new indel markers associated with various diseases. Compared with earlier technologies, high-throughput sequencing technologies, especially next-generation sequencing (NGS), make it possible to detect indels more accurately and efficiently across wider genomic regions. In the first part of the dissertation, computational tools for indel calling were evaluated using a wide assortment of indels and different NGS settings. The results showed that the choice of tools and NGS settings significantly affected indel identification, and they can therefore guide tool selection and development. In bioinformatic analysis, the position of the same indel can be marked at different locations on the reference genome, which can cause problems for downstream analysis, such as the evaluation of indel calls. This problem is related to sequence context, because variants can be located in short tandem repeats (STRs), where the same short nucleotide sequence has been amplified. In the second part of the dissertation, the computational tool VarSCAT was developed, which has various functions, including the examination of ambiguous position information, flanking regions, and STR regions. Analysis of human variant datasets assessed as reliable with VarSCAT revealed that the features of many genetic variants, and especially indels, are associated with STR regions. Not all human genes and their gene products, such as long non-coding RNAs (lncRNAs), are translated into protein. Machine learning methods and sequence recognition have improved considerably in recent years. In the last part of the dissertation, the predictions of several machine learning-based lncRNA prediction tools were evaluated. The results suggest that tools based on deep learning identify lncRNAs best.
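
The ambiguity this abstract describes, where one indel inside a tandem repeat has several equivalent positions, can be illustrated with a minimal sketch. This is not VarSCAT's implementation; the function names, the 0-based coordinates, and the toy reference sequence are assumptions made only for illustration.

```python
def left_align_deletion(ref: str, pos: int, length: int) -> int:
    """Return the leftmost 0-based start equivalent to deleting ref[pos:pos+length].

    Shifting a deletion one base to the left leaves the resulting sequence
    unchanged exactly when the base entering the window on the left equals
    the base leaving it on the right.
    """
    while pos > 0 and ref[pos - 1] == ref[pos + length - 1]:
        pos -= 1
    return pos


def equivalent_deletion_starts(ref: str, pos: int, length: int) -> list[int]:
    """Brute-force list of every start whose deletion yields the same sequence."""
    target = ref[:pos] + ref[pos + length:]
    return [
        s
        for s in range(len(ref) - length + 1)
        if ref[:s] + ref[s + length:] == target
    ]


# Hypothetical reference with a (CA)4 short tandem repeat.
reference = "GGCACACACATT"
leftmost = left_align_deletion(reference, 6, 2)
all_starts = equivalent_deletion_starts(reference, 6, 2)
```

In this toy case a 2-base deletion called at position 6 is equivalent to deletions starting anywhere from position 2 to position 8, which is why two callers can report the same event at different coordinates.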

    Ideafix: a decision tree-based method for the refinement of variants in FFPE DNA sequencing data

    Increasingly, treatment decisions for cancer patients are being made from next-generation sequencing results generated from formalin-fixed and paraffin-embedded (FFPE) biopsies. However, this material is prone to sequence artefacts that cannot be easily identified. In order to address this issue, we designed a machine learning-based algorithm to identify these artefacts using data from >1,600,000 variants from 27 paired FFPE and fresh-frozen breast cancer samples. Using these data, we assembled a series of variant features and evaluated the classification performance of five machine learning algorithms. Using leave-one-sample-out cross-validation, we found that XGBoost (extreme gradient boosting) and random forest obtained AUC (area under the receiver operating characteristic curve) values >0.86. Performance was further tested using two independent datasets that resulted in AUC values of 0.96, whereas a comparison with previously published tools resulted in a maximum AUC value of 0.92. The most discriminating features were read pair orientation bias, genomic context and variant allele frequency. In summary, our results show a promising future for the use of these samples in molecular testing. We built the algorithm into an R package called Ideafix (DEAmination FIXing) that is freely available at https://github.com/mmaitenat/ideafix.
    Funding: Departamento de Educación, Universidades e Investigación of the Basque Government [PRE 2019 2 0211 to M.T.A]; Ikerbasque, Basque Foundation for Science [to C.L.]; Starmer–Smith Memorial Fund [to C.L.]; Ministerio de Economía, Industria y Competitividad (MINECO) of the Spanish Central Government [to C.L., PID2019-104933GB-10 to B.C.]; ISCIII and FEDER Funds [PI12/00663, PIE13/00048, DTS14/00109, PI15/00275 and PI18/01710 to C.L.]; Departamento de Desarrollo Económico y Competitividad and Departamento de Sanidad of the Basque Government [to C.L.]; Asociación Española Contra el Cancer (AECC) [to C.L.]; Diputación Foral de Guipuzcoa (DFG) [to C.L.]; Departamento de Industria of the Basque Government [ELKARTEK Programme, project code: KK-2018/00038 to C.L., ELKARTEK Programme, project code: KK-2020/00049 to B.C., IT-1244-19 to B.C.]
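
The evaluation scheme named in this abstract, leave-one-sample-out cross-validation scored by AUC, can be sketched as follows. This is not the Ideafix code (which is an R package): the single feature, the toy centroid scorer, and the data values are hypothetical stand-ins for the paper's variant features and its XGBoost and random forest models.

```python
def auc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative.

    Ties count as half a win (Mann-Whitney formulation).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def leave_one_sample_out(records):
    """Yield (held_out_id, train, test); records are (sample_id, feature, label)."""
    for held in sorted({r[0] for r in records}):
        train = [r for r in records if r[0] != held]
        test = [r for r in records if r[0] == held]
        yield held, train, test


def fit_centroid_scorer(train):
    """Toy model: higher score means closer to the artefact mean than to the real-variant mean."""
    artefacts = [x for _, x, y in train if y == 1]
    real = [x for _, x, y in train if y == 0]
    m1 = sum(artefacts) / len(artefacts)
    m0 = sum(real) / len(real)
    return lambda x: abs(x - m0) - abs(x - m1)


def per_sample_auc(records):
    """Train on all other samples, then score each held-out sample's variants."""
    results = {}
    for held, train, test in leave_one_sample_out(records):
        score = fit_centroid_scorer(train)
        results[held] = auc([y for _, _, y in test], [score(x) for _, x, _ in test])
    return results


# Hypothetical data: (sample id, variant allele frequency, 1 = artefact).
records = [
    ("S1", 0.03, 1), ("S1", 0.05, 1), ("S1", 0.30, 0), ("S1", 0.45, 0),
    ("S2", 0.04, 1), ("S2", 0.08, 1), ("S2", 0.25, 0), ("S2", 0.50, 0),
    ("S3", 0.02, 1), ("S3", 0.06, 1), ("S3", 0.35, 0), ("S3", 0.40, 0),
]
fold_auc = per_sample_auc(records)
```

Holding out all variants from one sample at a time keeps variants from the same biopsy out of both training and test sets, which is the point of splitting by sample rather than by variant.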