
    Spectral Analysis of Guanine and Cytosine Fluctuations of Mouse Genomic DNA

    We study global fluctuations of the guanine and cytosine base content (GC%) in mouse genomic DNA using spectral analyses. Power spectra S(f) of GC% fluctuations in all nineteen autosomal and two sex chromosomes are observed to have the universal functional form S(f) ~ 1/f^α (α ≈ 1) over several orders of magnitude in the frequency range 10^-7 < f < 10^-5 cycles/base, corresponding to long-range GC% correlations at distances between 100 kb and 10 Mb. S(f) for higher frequencies (f > 10^-5 cycles/base) shows a flattened power-law function with α < 1 across all twenty-one chromosomes. The substitution of about 38% interspersed repeats does not affect the functional form of S(f), indicating that these are not predominantly responsible for the long-range multi-scale GC% fluctuations in mammalian genomes. Several biological implications of the large-scale GC% fluctuation are discussed, including neutral evolutionary history by DNA duplication, chromosomal bands, spatial distribution of transcription units (genes), replication timing, and recombination hot spots. Comment: 15 pages (figures included), 2 figure
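The spectral analysis described in this abstract can be illustrated with a minimal sketch. It assumes GC% is measured in consecutive non-overlapping windows and that the exponent α is estimated by a straight-line fit in log-log space; the function names, the window size, and the fitting procedure are illustrative assumptions, not the authors' pipeline.

```python
import numpy as np

def gc_percent_windows(seq, window):
    """GC% in consecutive non-overlapping windows of `window` bases."""
    n = len(seq) // window
    codes = np.frombuffer(seq[: n * window].upper().encode("ascii"), dtype=np.uint8)
    is_gc = (codes == ord("G")) | (codes == ord("C"))
    return 100.0 * is_gc.reshape(n, window).mean(axis=1)

def power_spectrum(gc, window):
    """Periodogram of the mean-removed GC% series; frequency in cycles/base."""
    x = gc - gc.mean()
    spec = np.abs(np.fft.rfft(x)) ** 2 / len(x)
    freq = np.fft.rfftfreq(len(x), d=window)  # sample spacing is `window` bases
    return freq[1:], spec[1:]                 # drop the zero-frequency term

def fit_alpha(freq, spec):
    """Estimate alpha in S(f) ~ 1/f^alpha as minus the log-log slope."""
    keep = spec > 0                           # avoid log10(0)
    slope, _ = np.polyfit(np.log10(freq[keep]), np.log10(spec[keep]), 1)
    return -slope

if __name__ == "__main__":
    # Synthetic uncorrelated sequence (assumption: stands in for real data).
    rng = np.random.default_rng(0)
    seq = "".join(rng.choice(list("ACGT"), size=200_000))
    freq, spec = power_spectrum(gc_percent_windows(seq, 100), 100)
    print("estimated alpha:", fit_alpha(freq, spec))
```

An uncorrelated random sequence should give an exponent near zero (a flat spectrum), whereas the abstract reports α ≈ 1 for mouse chromosomes at low frequencies.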

    Human Promoter Prediction Using DNA Numerical Representation

    With the emergence of genomic signal processing, numerical representation techniques for the DNA alphabet {A, G, C, T} play a key role in applying digital signal processing and machine learning techniques to the processing and analysis of DNA sequences. The choice of numerical representation affects how well biological properties are reflected in the numerical domain for detecting and identifying the characteristics of regions of interest within a DNA sequence. This dissertation presents a comprehensive study of various DNA numerical and graphical representation methods and their applications in processing and analyzing long DNA sequences. Discussions of the relative merits and demerits of the various methods, experimental results, and possible future developments are also included. Another focus of the research is promoter prediction in human (Homo sapiens) DNA sequences with a neural-network-based multi-classifier system using DNA numerical representation methods. In spite of the recent development of several computational methods for human promoter prediction, there is a need for performance improvement. In particular, the high false positive rate of feature-based approaches decreases prediction reliability and leads to erroneous results in gene annotation. To improve prediction accuracy and reliability, DigiPromPred, a numerical-representation-based promoter prediction system, is proposed to characterize DNA alphabets in different regions of a DNA sequence. The DigiPromPred system is found to predict promoters with a sensitivity of 90.8% while reducing the false prediction rate for non-promoter sequences with a specificity of 90.4%. A comparative study with state-of-the-art promoter prediction systems on human chromosome 22 shows that the proposed system maintains a good balance between prediction accuracy and reliability.
To reduce the system architecture and computational complexity compared to the existing system, a simple feed-forward neural network classifier known as SDigiPromPred is proposed. The SDigiPromPred system is found to predict promoters with sensitivities of 87%, 87%, and 99% while reducing the false prediction rate for non-promoter sequences with specificities of 92%, 94%, and 99% for human, Drosophila, and Arabidopsis sequences, respectively, and it offers reconfigurable capability compared to the existing system.
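To make the idea of a DNA numerical representation concrete, here is a minimal sketch of two common encodings (binary indicator sequences and a real-valued mapping), together with the sensitivity and specificity definitions used to report the results. The specific real values in `REAL_MAP` and all function names are illustrative assumptions; the abstract does not say which representations DigiPromPred uses.

```python
import numpy as np

BASES = "ACGT"

def indicator_encoding(seq):
    """Four binary indicator sequences, one row per base in A, C, G, T order."""
    s = np.array(list(seq.upper()))
    return np.stack([(s == b).astype(np.int8) for b in BASES])

# Illustrative real-valued mapping (assumed values, chosen only for the example).
REAL_MAP = {"A": -1.5, "C": 0.5, "G": -0.5, "T": 1.5}

def real_encoding(seq):
    """Map each base to a single real number."""
    return np.array([REAL_MAP[b] for b in seq.upper()], dtype=float)

def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

if __name__ == "__main__":
    print(indicator_encoding("ACGT"))
    print(real_encoding("GATTACA"))
    print(sensitivity_specificity(tp=9, fn=1, tn=8, fp=2))
```

Either encoding turns a sequence into an array that a signal-processing routine or a neural network classifier can consume; which one reflects the biology best is the question the dissertation studies.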

    Deep Domain Adaptation Learning Framework for Associating Image Features to Tumour Gene Profile

    While medical imaging and general pathology are routine in cancer diagnosis, genetic sequencing is not always assessable due to the strong phenotypic and genetic heterogeneity of human cancers. Image-genomics integrates medical imaging and genetics to provide a complementary approach to optimising cancer diagnosis by associating tumour imaging traits with clinical data, and it has demonstrated its potential in identifying imaging surrogates for tumour biomarkers. However, existing image-genomics research has focused on quantifying tumour visual traits according to human understanding, which may not be optimal across different cancer types. The challenge hence lies in extracting optimised imaging representations in an objective, data-driven manner. Such an approach requires large volumes of annotated image data that are difficult to acquire. We propose a deep domain adaptation learning framework for associating image features with tumour genetic information, exploiting the ability of domain adaptation techniques to learn relevant image features from closely related knowledge domains. Our proposed framework leverages the current state of the art in image object recognition, together with domain adaptation techniques, to provide image features that encode subtle variations of tumour phenotypic characteristics. The proposed framework was evaluated against the current state of the art in (i) tumour histopathology image classification and (ii) image-genomics associations. It demonstrated improved accuracy of tumour classification and provided additional data-derived representations of tumour phenotypic characteristics that exhibit strong image-genomics association. This thesis advances image-genomics research and indicates its potential to reveal additional imaging surrogates for genetic biomarkers, which could facilitate cancer diagnosis.

    Modern Computing Techniques for Solving Genomic Problems

    With the advent of high-throughput genomics, biological big data brings challenges to scientists in handling, analyzing, processing and mining this massive data. In this new interdisciplinary field, diverse theories, methods, tools and knowledge are utilized to solve a wide variety of problems. As an exploration, this dissertation project is designed to combine concepts and principles from multiple areas, including signal processing, information-coding theory, artificial intelligence and cloud computing, in order to solve the following problems in computational biology: (1) comparative gene structure detection, (2) DNA sequence annotation, and (3) investigation of CpG islands (CGIs) for epigenetic studies. Briefly, in problem #1, sequences are transformed into signal series or binary codes. As in speech/voice recognition, similarity is calculated between two signal series, and the signals are subsequently stitched/matched into a temporal sequence. Because the operations are binary, all calculations/steps can be performed efficiently and accurately. Improving performance in terms of accuracy and specificity is the key for a comparative method. In problem #2, DNA sequences are encoded and transformed into numeric representations for deep learning methods. Encoding schemes greatly influence the performance of deep learning algorithms, so finding the best encoding scheme for a particular application of deep learning is significant. Three applications (detection of protein-coding splicing sites, detection of lincRNA splicing sites and improvement of comparative gene structure identification) are used to show the computing power of deep neural networks. In problem #3, CpG sites are assigned a certain energy and a Gaussian filter is applied to detect CpG islands. Using the CpG box and a Markov model, we investigate the properties of CGIs and redefine the CGIs using emerging epigenetic data.
In summary, these three problems and their solutions are not isolated; they are linked to modern techniques in such diverse areas as signal processing, information-coding theory, artificial intelligence and cloud computing. These novel methods are expected to improve the efficiency and accuracy of computational tools and bridge the gap between biology and scientific computing.
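The Gaussian-filter idea in problem #3 can be sketched as follows. This is only an illustration under stated assumptions: each CpG site is given unit "energy", the kernel is truncated at three standard deviations, and candidate islands are regions where the smoothed density exceeds a threshold. The function names, `sigma`, and the threshold are hypothetical and not taken from the dissertation.

```python
import numpy as np

def cpg_signal(seq):
    """1.0 at every position where a C is immediately followed by a G."""
    s = seq.upper()
    sig = np.zeros(len(s))
    for i in range(len(s) - 1):
        if s[i] == "C" and s[i + 1] == "G":
            sig[i] = 1.0
    return sig

def gaussian_kernel(sigma):
    """Normalized Gaussian kernel truncated at 3 * sigma on each side."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def smoothed_cpg_density(seq, sigma=50.0):
    """Gaussian-smoothed CpG density; seq must be longer than the kernel."""
    return np.convolve(cpg_signal(seq), gaussian_kernel(sigma), mode="same")

def candidate_regions(density, threshold):
    """Half-open (start, end) intervals where density exceeds threshold."""
    above = density > threshold
    padded = np.concatenate(([False], above, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    return list(zip(edges[::2].tolist(), edges[1::2].tolist()))

if __name__ == "__main__":
    # Synthetic sequence with a CpG-rich stretch in the middle (assumption).
    seq = "AT" * 500 + "CG" * 200 + "AT" * 500
    print(candidate_regions(smoothed_cpg_density(seq), threshold=0.25))
```

Smoothing before thresholding means isolated CpG dinucleotides do not trigger a call, while dense clusters do; the width `sigma` controls the scale at which clusters are merged.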

    Computational Methods for Delineating Multiple Nuclear Phenotypes from Different Imaging Modalities

    Characterizing histopathology or organoid models of breast cancer can provide fundamental knowledge that will lead to a better understanding of tumors, response to therapeutic agents, and discovery of new targeted therapies. To this end, the delineation of nuclei is of particular interest since it provides rich information about aberrant microanatomy or colony formation. For example, (i) cancer cells tend to be larger and, if coupled with high chromatin content, may indicate aneuploidy; (ii) cellular density can be the result of rapid proliferation; (iii) nuclear micro-texture can be a surrogate for fluctuation of heterochromatin patterns, where epigenetic aberrations in cancers are sometimes correlated with alterations in heterochromatin distribution; and (iv) normalized colony formation of cancer cells, in 3D culture, can serve as a surrogate metric for tumor suppression. This evidence suggests that nuclear segmentation and profiling is a major step for subsequent bioinformatics analysis. However, there are two barriers: technical variations during the sample preparation step, and biological heterogeneity, since no two patients/samples are alike. As a result of these complexities, extension of deep learning methodologies will have a significant impact on the robust characterization and profiling of pathology sections or organoid models. In this presentation, we demonstrate that integration of regional and contextual representations, within the framework of a deep encoder-decoder architecture, contributes to robust delineation of various nuclear phenotypes from both bright-field and confocal microscopy. The deep encoder-decoder architecture can infer perceptual boundaries that are necessary to decompose clumps of nuclei. The method has been validated on pathology sections and organoid models of human mammary epithelial cells.

    Computational mapping of regulatory domains of human genes

    The human genome contains millions of regulatory elements (enhancers) that quantitatively regulate gene expression. Multiple experimental and computational approaches have been developed to associate enhancers with their gene targets. Despite the tremendous progress in understanding how enhancers tune gene expression, the field still lacks an approach that is systematic, integrative and accessible for discovering and documenting cis-regulatory relationships across the genome. We developed a novel computational approach, reg2gene, that models and integrates gene expression ~ enhancer activity. reg2gene consists of three main steps: 1) data quantification, 2) data modelling and significance assessment, and 3) data integration, gathered in the reg2gene R package. As a result, we identified two sets of enhancer-gene associations (EGAs): the flexible set of ~230K EGAs (flexibleC) and the stringent set of ~60K EGAs (stringentC). We identified major differences across previously published computational models of enhancer-gene associations, mostly in the location, number and properties of defined enhancer regions and EGAs. We performed detailed benchmarking of seven sets of computationally modelled EGAs, but showed that none of the currently available benchmark datasets could be used as a “gold-standard” benchmark dataset. To account for that observation, we defined an additional benchmark set of positive and negative EGAs, with which we showed that the stringentC model had the highest positive predictive value (PPV) across all analyzed computational models. We reviewed the influence of EGA sets on the functional analysis of risk SNPs and demonstrated the potential of EGAs to identify gene targets of non-coding SNP-gene associations. Lastly, we performed a functional analysis to detect novel gene targets, enhancer pleiotropy, and mechanisms of enhancer activity. Altogether, this work advances our understanding of enhancer-mediated gene expression regulation in health and disease.
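As a rough illustration of modelling gene expression against enhancer activity with a significance assessment, the sketch below scores one enhancer-gene pair by Pearson correlation across samples and estimates a p-value by permutation. This is a generic stand-in written in Python, not the reg2gene R package or its API; the function names, the choice of correlation, and the permutation scheme are all assumptions. The positive predictive value used in the benchmark is included as a plain formula.

```python
import numpy as np

def association_pvalue(enhancer, expression, n_perm=1000, seed=0):
    """Pearson r between two per-sample vectors and a permutation p-value."""
    rng = np.random.default_rng(seed)
    enhancer = np.asarray(enhancer, dtype=float)
    expression = np.asarray(expression, dtype=float)
    r_obs = np.corrcoef(enhancer, expression)[0, 1]
    hits = 0
    for _ in range(n_perm):
        # Shuffle sample labels of the gene to break any real association.
        r = np.corrcoef(enhancer, rng.permutation(expression))[0, 1]
        if abs(r) >= abs(r_obs):
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return r_obs, (hits + 1) / (n_perm + 1)

def positive_predictive_value(tp, fp):
    """PPV = TP / (TP + FP)."""
    return tp / (tp + fp)

if __name__ == "__main__":
    # Synthetic data (assumption): expression tracks enhancer activity plus noise.
    rng = np.random.default_rng(1)
    activity = rng.normal(size=30)
    expr = 2.0 * activity + rng.normal(scale=0.5, size=30)
    print(association_pvalue(activity, expr))
    print(positive_predictive_value(tp=8, fp=2))
```

In a genome-wide setting this test would be repeated for every candidate enhancer-gene pair, so some multiple-testing correction would be needed before calling a set of associations.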