639 research outputs found

    Machine Learning Classification of Primary Tissue Origin of Cancer from DNA Methylation Markers

    Cancer is one of the leading causes of death globally and was responsible for approximately 9.6 million deaths in 2018. One of the main reasons for deaths from cancer is late-stage presentation and inaccessible diagnosis and treatment. Cancer often spreads from the part of the body where it started (primary site) to a different part of the body (metastatic site). Identifying the primary site of a cancer plays a key role because it directs the appropriate treatment: cancer that has spread needs the same treatment as its origin, so this knowledge helps doctors decide on the type of treatment. All cancers begin when one or more genes in a cell mutate and create abnormal proteins which cause cells to multiply uncontrollably. Genes are present in the DNA of each cell in the human body, and research shows that distinct and abnormal patterns of DNA methylation are observed in cancers. DNA methylation is also considered an early and fundamental step in the transformation of normal tissue. Since DNA methylation is tissue-specific and changes with cell differentiation, methylation sites are good markers for identifying tissues of origin. In this thesis, we propose the use of machine learning techniques to identify the primary sites of cancers in order to increase the accuracy of diagnosis and treatment. For this purpose, we implemented several machine learning classification algorithms, including support vector machines, random forest classifiers, decision trees, and k-nearest neighbor classifiers, to classify tumor samples by tissue of origin, and compared these models using traditional machine learning metrics. The models were trained and tested on features extracted from the DNA methylation datasets maintained by The Cancer Genome Atlas (TCGA). The experimental results showed that support vector machines could predict the primary sites with 95% training accuracy. The model gave 86% accuracy when tested on a completely independent dataset collected from the Gene Expression Omnibus (GEO).
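    As an illustration of one of the classifier families named in this abstract, the following is a minimal sketch of k-nearest-neighbor classification on methylation beta values. It is not the thesis implementation: the function name `knn_predict`, the three CpG sites, the beta values, and the tissue labels are all made up for the example, and a real analysis would use TCGA features and a library classifier.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Return the majority label among the k training samples nearest to query."""
    # Euclidean distance between the query and every training sample.
    dists = [(math.dist(x, query), label) for x, label in zip(train_x, train_y)]
    dists.sort(key=lambda pair: pair[0])
    # Majority vote over the k closest samples.
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Made-up beta values (fraction methylated, 0..1) at three hypothetical CpG sites.
train_x = [
    [0.90, 0.10, 0.20], [0.85, 0.15, 0.25], [0.88, 0.12, 0.18],  # labeled "breast"
    [0.20, 0.80, 0.75], [0.25, 0.78, 0.70], [0.18, 0.82, 0.72],  # labeled "lung"
]
train_y = ["breast"] * 3 + ["lung"] * 3

# Classify a new (also made-up) sample by its three nearest neighbors.
prediction = knn_predict(train_x, train_y, [0.22, 0.79, 0.74], k=3)
print(prediction)
```

    With an odd k and two classes the vote cannot tie; for more tissue types a tie-breaking rule would be needed.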

    Knowledge Extraction from Machine Learning Models and Its Application to Bioinformatics

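    Kyoto University, new-system course doctorate, Doctor of Informatics. Degree No. Kō 23397; Informatics Doctorate No. 766; call number 新制||情||131 (Main Library). Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University. Examiners: Professor Tatsuya Akutsu (chief examiner), Professor Akihiro Yamamoto, Professor Hisashi Kashima. Conferred under Article 4, Paragraph 1 of the Degree Regulations. Doctor of Informatics, Kyoto University. DFA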

    Differential evolution of non-coding DNA across eukaryotes and its close relationship with complex multicellularity on Earth

    Here, I elaborate on the hypothesis that complex multicellularity (CM, sensu Knoll) is a major evolutionary transition (sensu Szathmary), which has convergently evolved a few times in Eukarya only: within red and brown algae, plants, animals, and fungi. Paradoxically, CM seems to correlate with the expansion of non-coding DNA (ncDNA) in the genome rather than with genome size or the total number of genes. Thus, I investigated the correlation between genome and organismal complexities across 461 eukaryotes under a phylogenetically controlled framework. To that end, I introduce the first formal definitions and criteria to distinguish ‘unicellularity’, ‘simple’ (SM) and ‘complex’ multicellularity. Rather than using the limited available estimations of unique cell types, the 461 species were classified according to our criteria by reviewing their life cycle and body plan development from the literature. Then, I investigated the evolutionary association between genome size and 35 genome-wide features (introns and exons from protein-coding genes, repeats and intergenic regions) describing the coding and ncDNA complexities of the 461 genomes. To that end, I developed ‘GenomeContent’, a program that systematically retrieves massive multidimensional datasets from gene annotations and calculates over 100 genome-wide statistics. R scripts coupled to parallel computing were created to calculate >260,000 phylogenetically controlled pairwise correlations. As previously reported, both repetitive and non-repetitive DNA are found to scale strongly and positively with genome size across most eukaryotic lineages. In contrast to previous studies, I demonstrate that changes in the length and repeat composition of introns are only weakly or moderately associated with changes in genome size at the global phylogenetic scale, while changes in intron abundance (within and across genes) are either not or only very weakly associated with changes in genome size.
Our evolutionary correlations are robust to: different phylogenetic regression methods, uncertainties in the tree of eukaryotes, variations in genome size estimates, and randomly reduced datasets. Then, I investigated the correlation between the 35 genome-wide features and the cellular complexity of the 461 eukaryotes with phylogenetic Principal Component Analyses. Our results endorse a genetic distinction between SM and CM in Archaeplastida and Metazoa, but not so clearly in Fungi. Remarkably, complex multicellular organisms and their closest ancestral relatives are characterized by high intron-richness, regardless of genome size. Finally, I argue why and how a vast expansion of non-coding RNA (ncRNA) regulators rather than of novel protein regulators can promote the emergence of CM in Eukarya. As a proof of concept, I co-developed a novel ‘ceRNA-motif pipeline’ for the prediction of “competing endogenous” ncRNAs (ceRNAs) that regulate microRNAs in plants. We identified three candidate ceRNA motifs, MIM166, MIM171 and MIM159/319, which were found to be conserved across land plants and to be potentially involved in diverse developmental processes and stress responses. Collectively, the findings of this dissertation support our hypothesis that CM on Earth is a major evolutionary transition promoted by the expansion of two major ncDNA classes, introns and regulatory ncRNAs, which might have boosted the irreversible commitment of cell types in certain lineages by canalizing the timing and kinetics of the eukaryotic transcriptome.

    Contents:
    Cover page. Abstract. Acknowledgements. Index.
    1. The structure of this thesis: 1.1. Structure of this PhD dissertation 1.2. Publications of this PhD dissertation 1.3. Computational infrastructure and resources 1.4. Disclosure of financial support and information use 1.5. Acknowledgements 1.6. Author contributions and use of impersonal and personal pronouns
    2. Biological background: 2.1. The complexity of the eukaryotic genome 2.2. The problem of counting and defining “genes” in eukaryotes 2.3. The “function” concept for genes and “dark matter” 2.4. Increases of organismal complexity on Earth through multicellularity 2.5. Multicellularity is a “fitness transition” in individuality 2.6. The complexity of cell differentiation in multicellularity
    3. Technical background: 3.1. The Phylogenetic Comparative Method (PCM) 3.2. RNA secondary structure prediction 3.3. Some standards for genome and gene annotation
    4. What is in a eukaryotic genome? GenomeContent provides a good answer: 4.1. Background 4.2. Motivation: an interoperable tool for data retrieval of gene annotations 4.3. Methods 4.4. Results 4.5. Discussion
    5. The evolutionary correlation between genome size and ncDNA: 5.1. Background 5.2. Motivation: estimating the relationship between genome size and ncDNA 5.3. Methods 5.4. Results 5.5. Discussion
    6. The relationship between non-coding DNA and Complex Multicellularity: 6.1. Background 6.2. Motivation: How to define and measure complex multicellularity across eukaryotes? 6.3. Methods 6.4. Results 6.5. Discussion
    7. The ceRNA motif pipeline: regulation of microRNAs by target mimics: 7.1. Background 7.2. A revisited protocol for the computational analysis of Target Mimics 7.3. Motivation: a novel pipeline for ceRNA motif discovery 7.4. Methods 7.5. Results 7.6. Discussion
    8. Conclusions and outlook: 8.1. Contributions and lessons for the bioinformatics of large-scale comparative analyses 8.2. Intron features are evolutionarily decoupled among themselves and from genome size throughout Eukarya 8.3. “Complex multicellularity” is a major evolutionary transition 8.4. Role of RNA throughout the evolution of life and complex multicellularity on Earth
    9. Supplementary Data. Bibliography. Curriculum Scientiae. Selbständigkeitserklärung (declaration of authorship).

    A machine learning-based investigation of cloud service attacks

    In this thesis, the security challenges of cloud computing are investigated in the Infrastructure as a Service (IaaS) layer, as security is one of the major concerns related to cloud services. As IaaS consists of different security terms, the research has been further narrowed down to focus on Network Layer Security. A review of existing research revealed that several types of attacks and threats can affect cloud security. Therefore, there is a need for intrusion defence implementations to protect cloud services. Intrusion Detection (ID) is one of the most effective solutions for reacting to cloud network attacks. [Continues.]

    A cross-species transcriptomics approach to identify genes involved in leaf development

    Background: We have made use of publicly available gene expression data to identify transcription factors and transcriptional modules (regulons) associated with leaf development in Populus. Different tissue types were compared to identify genes informative in the discrimination of leaf and non-leaf tissues. Transcriptional modules within this set of genes were identified in a much wider set of microarray data collected from leaves in a number of developmental, biotic, abiotic and transgenic experiments. Results: Transcription factors that were over-represented in leaf EST libraries and that were useful for discriminating leaves from other tissues were identified, revealing that the C2C2-YABBY, CCAAT-HAP3 and 5, MYB, and ZF-HD families are particularly important in leaves. The expression of transcriptional modules and transcription factors was examined across a number of experiments to select those that were particularly active during the early stages of leaf development. Two transcription factors were found to co-locate with previously published Quantitative Trait Loci (QTL) for leaf length. We also found that miRNA family 396 may be important in the control of leaf development, with three members of the family co-locating with clusters of leaf development QTL. Conclusion: This work provides a set of candidate genes involved in the control and processes of leaf development. This resource can be used for a wide variety of purposes, such as informing the selection of candidate genes for association mapping or the selection of targets for reverse genetics studies to further understanding of the genetic control of leaf size and shape.

    Genetic Analysis of Axillary Meristem Development in Arabidopsis: Roles of MIR164, CUC1, CUC2, CUC3 and LAS, and identification of novel regulators.

    Aerial architecture and reproductive success in higher plants are determined by the formation of secondary axes of growth, which are formed by axillary meristems initiated post-embryonically in the axils of leaves. Among the genetic modulators of axillary meristem fate in Arabidopsis is LATERAL SUPPRESSOR (LAS), a putative transcription factor belonging to the GRAS family, which specifically regulates the initiation of axillary meristems during the vegetative phase of development. The aim of this work was to study the mechanism of LAS function in the meristem and to identify new regulators of axillary meristem initiation in Arabidopsis. To study the spatio-temporal specification of its function, LAS was misexpressed from promoters of meristematic genes possessing adjoining or overlapping expression domains in the shoot apical meristem (SAM). Analysis of STM::LAS, KNAT1::LAS and UFO::LAS transgenic plants in the las-4 background revealed partial to complete complementation of the las-4 branching phenotype, but did not lead to the formation of ectopic meristems. These results imply a function for LAS in maintaining the meristematic potential in axillary cells, which can later initiate axillary meristems upon activation by other developmental cues. A potential mechanism of LAS function in axillary meristems was investigated by gibberellin (GA) spraying experiments and complementation analysis of LAS::GAI and LAS::GAI ΔDELLA transgenic plants in the las-4 mutant background. Preliminary results indicate a role for LAS as a regulator of GA signaling in axillary meristems. To identify new regulators of axillary meristem development, two approaches were employed. Firstly, an EMS mutagenesis screen was carried out to identify suppressors of the las-4 max1-1 phenotype. Characterisation of three suppressor of las-4 (sol) candidates, sol2, sol6 and sol7, revealed three novel loci that regulate axillary meristem development.
sol2, sol6 and sol7 complemented the branching defect in las-4 max1-1 to different degrees and were found to be non-allelic to each other. Their phenotypes were dependent on the las-4 mutation. Molecular mapping of two of these loci is underway. Secondly, the NAC domain transcription factors CUP-SHAPED COTYLEDON1 (CUC1), CUC2 and CUC3, which exhibit a characteristic expression pattern in the axils of leaf primordia, were investigated for potential roles in the development of axillary meristems. Investigation of loss-of-function mutants of these genes revealed that cuc3-2 is impaired in axillary bud formation, and that the severity of this phenotype depends on day length. Transcripts of the other two CUC genes, CUC1 and CUC2, are targeted for degradation by miR164. Overexpression of MIR164A or MIR164B in the cuc3-2 mutant caused an almost complete block in axillary bud development. Conversely, plants harbouring miR164-resistant alleles of CUC1 and CUC2 developed accessory buds in rosette and cauline leaf axils, revealing redundant functions of CUC1 and CUC2 in axillary meristem development. Development of accessory buds was also observed in mir164 mutants. Thus, this study unveiled the role of the CUC genes and miR164 in the regulation of axillary meristem development.

    INTEGRATION OF BIOMEDICAL IMAGING AND TRANSLATIONAL APPROACHES FOR MANAGEMENT OF HEAD AND NECK CANCER

    The aim of the clinical component of this work was to determine whether the currently available clinical imaging tools can be integrated with radiotherapy (RT) platforms for monitoring and adaptation of radiation dose, prediction of tumor response and disease outcomes, and characterization of patterns of failure and normal tissue toxicity in head and neck cancer (HNC) patients with potentially curable tumors. In Aim 1, we showed that the currently available clinical imaging modalities can be successfully used to adapt RT dose based on dynamic tumor response, predict oncologic disease outcomes, characterize RT-induced toxicity, and identify the patterns of disease failure. We used anatomical MRIs for RT dose adaptation. Our findings showed that, after proper standardization of the immobilization and image acquisition techniques, we can achieve high geometric accuracy. These images can then be used to monitor the shrinkage of tumors during RT and optimize the clinical target volumes accordingly. Our results also showed that this MR-guided dose adaptation technique has a dosimetric advantage over the standard of care and was associated with a reduction in normal tissue doses that translated into a reduction of the odds of long-term RT-induced toxicity. In the second aim, we used quantitative MRIs to determine their benefit for prediction of oncologic outcomes and characterization of RT-induced normal tissue toxicity. Our findings showed that delta changes of apparent diffusion coefficient parameters derived from diffusion-weighted images at mid-RT can be used to predict local recurrence and recurrence-free survival. We also showed that Ktrans and Ve vascular parameters derived from dynamic contrast-enhanced MRIs can characterize the mandibular areas of osteoradionecrosis.
In the final clinical aim, we used CT images of recurrence and baseline CT planning images to develop a methodology and workflow that applies deformable image registration software to standardize image co-registration, together with granular combined geometric- and dosimetric-based failure characterization, to correctly attribute sites and causes of locoregional failure. We then successfully applied this methodology to identify the patterns of failure following postoperative and definitive IMRT in HNC patients. Using this methodology, we showed that most recurrences occurred in the central high dose regions for patients treated with definitive IMRT, compared with mainly non-central high dose recurrences after postoperative IMRT. We also correlated recurrences with pretreatment FDG-PET and identified that most of the central high dose recurrences originated in an area that would be covered by a 10-mm margin on the volume of 50% of the maximum FDG uptake. In the translational component of this work, we integrated radiomic features derived from pre-RT CT images with whole-genome measurements using TCGA and TCIA data. Our results demonstrated statistically significant associations between radiomic features characterizing different tumor phenotypes and different genomic features. These findings show promise for non-invasively tracking genomic changes in the tumor during treatment and using this information to adapt treatment accordingly. In the final project of this dissertation, we developed a high-throughput approach to identify effective systemic agents against aggressive head and neck tumors with poor prognosis, such as anaplastic thyroid cancer. We successfully identified three candidate drugs and performed extensive in vitro and in vivo validation using orthotopic and PDX models. Among these drugs, HDAC inhibitor and LBH-589 showed the most effective tumor growth inhibition and could be evaluated in future clinical trials.

    Identification of RNA Oligonucleotide and Protein Interactions Using Term Frequency Inverse Document Frequency and Random Forest

    The interaction between protein and ribonucleic acid (RNA) plays crucial roles in many biological processes such as gene expression, posttranscriptional regulation, and protein synthesis. Because experimental screening of protein-RNA binding affinity is laborious and time-consuming, there is a pressing need for accurate and reliable computational approaches. In this study, we propose a novel method to predict this interaction from the sequences of both the protein and the RNA. A Random Forest was trained and tested on a combination of benchmark datasets, and the term frequency-inverse document frequency (TF-IDF) method combined with the XGBoost algorithm was used to extract useful information from the sequences. Our method achieved an accuracy of 94%, an area under the curve (AUC) of 0.98, and a Matthews correlation coefficient (MCC) of 0.90. These high metrics, especially the MCC, suggest that the method is robust enough to keep its performance on unseen datasets.
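    To make the feature-extraction idea in this abstract concrete, the sketch below treats overlapping k-mers of a sequence as "words" and weights them with plain textbook TF-IDF (term frequency times log(N / document frequency)), then defines the MCC metric the abstract reports. This is an assumption-laden illustration, not the authors' pipeline: the k-mer tokenization, the value k=3, the unsmoothed idf, the helper names `kmers`, `tfidf`, and `mcc`, and the three toy RNA strings are all hypothetical, and the XGBoost and Random Forest steps are not shown.

```python
import math
from collections import Counter


def kmers(seq, k=3):
    """Split a sequence into overlapping k-mers, used here as 'words'."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def tfidf(sequences, k=3):
    """Return one {k-mer: weight} dict per sequence, weight = tf * log(N / df)."""
    docs = [Counter(kmers(s, k)) for s in sequences]
    n_docs = len(docs)
    # Document frequency: number of sequences in which each k-mer occurs.
    df = Counter(term for doc in docs for term in doc)
    weights = []
    for doc in docs:
        total = sum(doc.values())
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in doc.items()}
        )
    return weights


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Conventionally reported as 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy RNA fragments, invented for the example.
rna = ["AUGGCUA", "AUGCCGA", "GGCUAUG"]
features = tfidf(rna, k=3)
print(features[0])
```

    A k-mer that occurs in every sequence (here "AUG") gets weight zero, which is the point of the idf term: ubiquitous motifs carry no discriminating information. Library implementations often smooth the idf, so their values will differ from this sketch.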

    Silencing parasitism effectors of the root lesion nematode, Pratylenchus thornei

    The root lesion nematode (RLN), Pratylenchus thornei, is a biotrophic migratory pest of plant roots, and its infestation causes losses in many economically important crops. RNA interference (RNAi) is a naturally occurring eukaryotic phenomenon and can be used to silence parasitism effector genes of P. thornei using host-mediated RNAi. This may be developed as an environmentally friendly and cost-effective control strategy. The overall aims of this research were to investigate the effects of in vitro and in planta RNAi silencing of putative P. thornei parasitism effector genes, and their nematicidal effects in two host plants. Five putative target parasitism genes vital for nematode entry into roots (Pt-Eng-1, Pt-PL), feeding (Pt-CLP) and suppressing host defence responses (Pt-UEP, Pt-GST) were identified, validated in silico using comparative bioinformatics, cloned into suitable in vitro transcription and binary vectors, and advanced to RNAi studies. Partial sequences for four of these target effector genes (Pt-Eng-1, Pt-PL, Pt-CLP, Pt-GST) were identified using Rapid Amplification of cDNA Ends (RACE) PCRs and annotated in silico. Protein families, conserved domains, and taxonomic and phylogenetic relationships for all four effectors were studied. This sequence information will help inform future investigations involving gene expression and proteomics of the selected putative effectors. In vitro RNAi was used for functional characterisation of the five effector sequences. Effects on nematode phenotype, behaviour and gene expression, and longer-term effects on reproduction, were assessed after soaking nematodes in dsRNA through infection of healthy wild-type soybean and alfalfa roots. Soaking of mixed-stage P. thornei in 1 mg/mL dsRNA of target genes for 16 h did not cause phenotypic changes except for Pt-PL, for which nematodes exhibited straight or slightly curved phenotypes after soaking, compared with the normal sigmoid body movement that was evident for the green fluorescent protein (gfp) and no-dsRNA treated controls. Semi-quantitative PCRs and densitometry analysis revealed a significant reduction of transcript accumulation for all five putative parasitism effector genes. Longer-term effects assessed at 21 dpi showed that nematode reproduction was reduced by 40 to 70% for all target genes compared to the respective control treatments, suggesting that the effectors studied were required for nematode infectivity, survival or reproduction. In planta RNAi involved Agrobacterium-mediated plant transformations to develop axenic transgenic hairy root events of soybean (Glycine max var. Williams 82) and alfalfa (Medicago sativa), and non-axenic hairy roots (composite plants) of soybean. Both hosts were amenable to Agrobacterium-mediated transformation, but hairy root induction was faster in alfalfa than in soybean. However, more events were generated for soybean than for alfalfa. Transgenic hairy roots confirmed by molecular analyses were challenged with P. thornei, and the presence of the nematodes was confirmed after 14 dpi. After 21 dpi, nematode numbers and transcript abundance were assessed using semi-quantitative PCRs and densitometry analysis. Host-mediated silencing of the five putative parasitism effector genes using transgenic soybean and alfalfa hairy roots showed a significant reduction in target transcript accumulation and an approximately 38 to 75% reduction in P. thornei numbers compared to untransformed wild-type controls. For some events, there was a positive correlation between reduced transcripts and nematode numbers.
Based on percent reduction in transcript accumulation of the target genes relative to 18S rRNA as assessed by densitometry, the extent of gene knockdown measured (from most to least) was: Pt-Eng-1, Pt-PL, Pt-CLP, Pt-UEP, and Pt-GST. Similarly, Pt-Eng-1, Pt-PL and Pt-CLP were ranked in the same order, from the lowest to highest reproduction on soybean and alfalfa, indicating a positive correlation between the level of knockdown and reduced reproduction. In soybean, these genes were followed by Pt-GST and Pt-UEP for the percentage of reproduction recorded, whereas in alfalfa the reduction in reproduction for these two target genes did not differ significantly. Composite soybean plants with wild-type shoots and transgenic hairy roots expressing Pt-Eng-1 and Pt-PL genes were developed and provided an opportunity to test the effectiveness of silencing target genes in planta and on nematode numbers in conditions that mimicked natural host infections. For both Pt-Eng-1 and Pt-PL genes, there was a significant reduction in the percentage of transcript accumulation relative to 18S rRNA, which correlated with a reduction in nematode numbers of 53.4% and 48.5% for Pt-Eng-1 and Pt-PL, respectively. The amenability of P. thornei to host-mediated RNAi using effector gene sequences, and the overall results of this study, point towards the potential use of this technology to control P. thornei and related RLN species effectively in different host crops.

    A systems biology approach to non-coding RNAs: the networks of cancer

    A non-coding RNA is a functional RNA molecule that is not translated into a protein. This class of molecules is involved in many cellular processes and includes highly abundant and functionally important RNAs such as transfer RNA (tRNA), ribosomal RNA (rRNA), as well as small interfering RNAs (siRNAs), microRNAs (miRNAs), transcribed ultraconserved regions (T-UCRs) and others. First, we investigate the specificity for normal tissues of two selected classes of non-coding RNAs: T-UCRs and miRNAs. Second, we ask whether these non-coding RNAs can be candidates as features for the selection of specific cancers, using statistical algorithms and bioinformatics tools. Third, we generate miRNA gene networks in normal tissues and in different cancers and leukemias. The term “ultraconserved” refers to genomic regions longer than 200 base pairs that are absolutely conserved (100% homology with no insertions or deletions) in human, mouse, and rat genomes. There are 481 T-UCRs. The reason for this extreme conservation remains a mystery; T-UCRs may play a functional role in the ontogeny and phylogeny of mammals and other vertebrates. Genome-wide profiling revealed that UCRs are frequently located on overlapping exons in genes involved in RNA processing and can be found in introns or at fragile sites and in cancer-associated genomic regions. We investigate the expression of T-UCRs in 374 normal samples from 46 different tissues, grouped by 16 systems. Moreover, we analyze the specificity of T-UCRs in cancers. Tissue-specific T-UCRs can differentiate cell types. We then examine the expression of T-UCRs in human embryonic stem cells, induced pluripotent stem cells, and a series of differentiated cell types (trophoblast, embryoid bodies at 7 and 14 days of differentiation, definitive endoderm, and spontaneously differentiated monolayers).
One T-UCR in particular, uc.283 plus, is highly specific for embryonic and induced pluripotent stem cells, as confirmed by real time PCR (RT-PCR). MiRNAs are global regulators of protein output. Each miRNA has been studied for its single contribution to differential expression or to a compact predictive signature. Thus, we propose a study of miRNAs in cancer by applying a systems biology approach. We study miRNA profiles in 4419 human samples (3312 neoplastic, 1107 non-malignant), corresponding to 50 normal tissues (grouped by 17 systems) and 51 cancer types. We calculate tissue specificity and cancer type specificity; a small set of miRNAs were tissue-specific while many others were broadly expressed. Then we assess whether non-coding RNAs can be candidates as features for the selection of specific cancers, using statistical algorithms and bioinformatics tools such as decision trees. Afterwards, we build miRNA gene networks by using our very large miRNA expression database. The complexity of our expression database enables us to perform a detailed analysis of coordinated miRNA activities. We also build specialized miRNA networks for different solid tumors and leukemias. Combining differential expression, genetic networks, DNA copy number alterations and other systems biology approaches, we confirm or discover miRNAs with comprehensive roles in cancer. We find that normal tissues are represented by single complete miRNA networks. Cancers instead show separate and unlinked miRNA sub-networks. miRNAs independent from the general transcriptional program were often known as cancer-related. We validate our results by in silico, in vitro and in vivo analysis. We demonstrate that the target genes of these uncoordinated miRNAs are involved in specific cancer-related pathways.