760 research outputs found

    Demarcation of coding and non-coding regions of DNA using linear transforms

    Deoxyribonucleic acid (DNA) carries genetic information in the cell. A strand of DNA consists of nitrogenous molecules called nucleotides. Nucleotide triplets, or codons, code for amino acids. There are two distinct regions in DNA: the genes and the intergenic DNA, sometimes called junk DNA. Within a gene, two regions can be distinguished: the exons, which code for amino acids, and the introns, which do not. The main aim of the thesis is to study signal processing techniques that help distinguish exons from introns. Previous research has shown that exons can be considered as a sequence of signal plus noise, whereas introns are noise-like sequences. The Fourier transform of an exonic sequence exhibits a peak at frequency sample k = N/3, where N is the length of the FFT. This property is referred to as the period-3 property. Unlike exons, introns have a noise-like spectrum. The performance of a transform is measured by its figure of merit, defined as the ratio of the peak value to the arithmetic mean of all the values. A comparative study was conducted of the Discrete Fourier Transform (DFT) and the Karhunen-Loève Transform (KLT). Though both the DFT and the KLT of an exon sequence produce a higher figure of merit than that of an intron sequence, the difference between the figures of merit of exons and introns was larger when the KLT was applied than when the DFT was applied. The two transforms were also applied to entire sequences in a sliding-window fashion. Finally, the two transforms were applied to a large number of sequences from a variety of organisms. A Neyman-Pearson detector was used to obtain receiver operating characteristic curves, i.e., probability of detection versus probability of false alarm.
When a transform is applied as a sliding window, the values for exons and introns are taken separately. The exons and the introns served as the two hypotheses of the detector. The Neyman-Pearson detector indicated that the KLT worked better than the DFT on a variety of organisms.
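
The period-3 property and the figure of merit described in this abstract can be illustrated with a minimal sketch. It is not the thesis's implementation. It assumes a common encoding (four 0/1 indicator sequences, one per base), leaves the DC term k = 0 out of the mean, and takes the "peak value" to be the spectral value at k = N/3. The names `base_spectrum` and `period3_figure_of_merit` are hypothetical.

```python
import cmath
import random

BASES = "ACGT"

def base_spectrum(dna):
    """Return S[k] for k = 1..N-1: the squared DFT magnitudes of the four
    0/1 base-indicator sequences, summed over the bases."""
    n = len(dna)
    spectrum = []
    for k in range(1, n):  # k = 0 (DC term) is skipped: an assumption of this sketch
        total = 0.0
        for base in BASES:
            # DFT coefficient at index k of the indicator sequence for this base
            coeff = sum(cmath.exp(-2j * cmath.pi * k * i / n)
                        for i, c in enumerate(dna) if c == base)
            total += abs(coeff) ** 2
        spectrum.append(total)
    return spectrum

def period3_figure_of_merit(dna):
    """Ratio of the spectral value at k = N/3 to the arithmetic mean of the spectrum."""
    n = len(dna)
    if n == 0 or n % 3 != 0:
        raise ValueError("sequence length must be a positive multiple of 3")
    spectrum = base_spectrum(dna)
    mean = sum(spectrum) / len(spectrum)
    return spectrum[n // 3 - 1] / mean  # list index k - 1 because the list starts at k = 1

# Toy comparison: a perfectly codon-periodic string versus a random string.
codon_like = "ATG" * 30
random.seed(0)
noise_like = "".join(random.choice(BASES) for _ in range(90))
print(period3_figure_of_merit(codon_like))
print(period3_figure_of_merit(noise_like))
```

The direct O(N^2) DFT keeps the sketch dependency-free. For real sequences an FFT routine would be the usual choice, and the same ratio can be computed per window to imitate the sliding-window analysis mentioned above.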

    Revisiting detrended fluctuation analysis

    Half a century ago Hurst introduced Rescaled Range (R/S) Analysis to study fluctuations in time series. Thousands of works have investigated or applied the original methodology and similar techniques, with Detrended Fluctuation Analysis becoming preferred due to its purported ability to mitigate nonstationarities. We show that Detrended Fluctuation Analysis introduces artifacts for nonlinear trends, contrary to common expectation, and demonstrate that the empirically observed curvature it induces is a serious finite-size effect which will always be present. Explicit detrending followed by measurement of the diffusional spread of a signal's associated random walk is preferable, a surprising conclusion given that Detrended Fluctuation Analysis was crafted specifically to replace this approach. The implications are simple yet sweeping: there is no compelling reason to apply Detrended Fluctuation Analysis, as it 1) introduces uncontrolled bias; 2) is computationally more expensive than the unbiased estimator; and 3) cannot provide generic or useful protection against nonstationarities.
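
For readers unfamiliar with the method under discussion, the following is a minimal sketch of the textbook first-order Detrended Fluctuation Analysis fluctuation function. It is not the authors' code and not the estimator they recommend. The function name `dfa_fluctuation`, the use of non-overlapping windows, and the dropping of trailing samples are assumptions made for this illustration.

```python
import math

def dfa_fluctuation(signal, window):
    """RMS residual of the integrated signal after a straight-line fit
    in each non-overlapping window of the given length."""
    n = len(signal)
    if window < 2 or window > n:
        raise ValueError("window must be between 2 and len(signal)")
    mean = sum(signal) / n
    # Integrate the mean-subtracted signal: this is its associated random walk.
    profile, running = [], 0.0
    for x in signal:
        running += x - mean
        profile.append(running)
    t = list(range(window))
    t_mean = sum(t) / window
    t_var = sum((ti - t_mean) ** 2 for ti in t)
    n_windows = n // window  # trailing samples that do not fill a window are dropped
    squared_residuals = 0.0
    for w in range(n_windows):
        seg = profile[w * window:(w + 1) * window]
        seg_mean = sum(seg) / window
        # Ordinary least-squares slope of the local linear trend
        slope = sum((ti - t_mean) * (yi - seg_mean) for ti, yi in zip(t, seg)) / t_var
        for ti, yi in zip(t, seg):
            trend = seg_mean + slope * (ti - t_mean)
            squared_residuals += (yi - trend) ** 2
    return math.sqrt(squared_residuals / (n_windows * window))

# Alternating toy signal, window length 4.
print(dfa_fluctuation([1.0, -1.0] * 8, 4))
```

In practice the function is evaluated over a range of window lengths, and the scaling exponent is read off as the slope of log F(n) against log n. The curvature of that log-log plot is the finite-size effect the abstract refers to.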

    The Mathematics of Phylogenomics

    The grand challenges in biology today are being shaped by powerful high-throughput technologies that have revealed the genomes of many organisms, global expression patterns of genes and detailed information about variation within populations. We are therefore able to ask, for the first time, fundamental questions about the evolution of genomes, the structure of genes and their regulation, and the connections between genotypes and phenotypes of individuals. The answers to these questions are all predicated on progress in a variety of computational, statistical, and mathematical fields. The rapid growth in the characterization of genomes has led to the advancement of a new discipline called Phylogenomics. This discipline results from the combination of two major fields in the life sciences: Genomics, i.e., the study of the function and structure of genes and genomes; and Molecular Phylogenetics, i.e., the study of the hierarchical evolutionary relationships among organisms and their genomes. The objective of this article is to offer mathematicians a first introduction to this emerging field, and to discuss specific mathematical problems and developments arising from phylogenomics. Comment: 41 pages, 4 figures

    A study of the molecular pathology of ductal carcinoma in situ and invasive ductal carcinoma of the breast.

    The biological validity of the histopathological classification of ductal carcinoma in situ (DCIS) of the breast was evaluated in this study by correlating the three histopathological grades of DCIS with immunohistochemical expression of Ki67, p53 and c-erbB-2, markers of poor prognosis in invasive ductal carcinoma (IDC), and of bcl2 and ER, markers of good prognosis in invasive breast cancer. DCIS grades correlated positively with Ki67, p53 and c-erbB-2 and negatively with bcl2 and ER, suggesting that the classification is valid. The incidence of bax protein expression was determined immunohistochemically in DCIS and IDC. It did not correlate with the histopathological grades of DCIS or IDC. The relationships of bax protein to the above-mentioned biological markers were also determined in DCIS and IDC. Furthermore, the expression of bax, bcl2, Ki67, ER, p53 and c-erbB-2 within DCIS grades was compared with the expression of these markers within IDC grades. The DCIS grades were determined subjectively as well as objectively by means of computer-assisted image analysis, with significant correlation found between subjective and objective measures. Image analysis was also used to determine the percentage of positive cells per case for the nuclear stains (Ki67, ER, p53). Immunohistochemically positive p53 cases were analysed for p53 mutation by polymerase chain reaction (PCR) and subsequent DNA sequencing to compare the incidence of p53 mutation in DCIS with that in IDC. Biochemical changes within tissue may either initiate disease or occur as a result of the disease process, and these changes can be studied by both Fourier transform infrared (FTIR) and FT-Raman spectroscopic techniques. FTIR and FT-Raman were employed to distinguish the DCIS and IDC grades. These techniques have the potential to distinguish between DCIS grades, between IDC grades and also between DCIS and IDC as whole groups.
The implications of the obtained data for the understanding of the molecular biology of DCIS of the breast and IDC are discussed, and future investigations to further elucidate the molecular and cellular mechanisms involved are proposed.

    Rank-statistics based enrichment-site prediction algorithm developed for chromatin immunoprecipitation on chip experiments

    Background: High density oligonucleotide tiling arrays are an effective and powerful platform for conducting unbiased genome-wide studies. The ab initio probe selection method employed in tiling arrays is unbiased, and thus ensures consistent sampling across coding and non-coding regions of the genome. Tiling arrays are increasingly used in chromatin immunoprecipitation (IP) experiments (ChIP on chip). ChIP on chip facilitates the generation of genome-wide maps of in-vivo interactions between DNA-associated proteins including transcription factors and DNA. Analysis of the hybridization of an immunoprecipitated sample to a tiling array facilitates the identification of ChIP-enriched segments of the genome. These enriched segments are putative targets of antibody assayable regulatory elements. The enrichment response is not ubiquitous across the genome. Typically 5 to 10% of tiled probes manifest some significant enrichment. Depending upon the factor being studied, this response can drop to less than 1%. The detection and assessment of significance for interactions that emanate from non-canonical and/or un-annotated regions of the genome is especially challenging. This is the motivation behind the proposed algorithm. Results: We have proposed a novel rank and replicate statistics-based methodology for identifying and ascribing statistical confidence to regions of ChIP-enrichment. The algorithm is optimized for identification of sites that manifest low levels of enrichment but are true positives, as validated by alternative biochemical experiments. Although the method is described here in the context of ChIP on chip experiments, it can be generalized to any treatment-control experimental design. The results of the algorithm show a high degree of concordance with independent biochemical validation methods. The sensitivity and specificity of the algorithm have been characterized via quantitative PCR and independent computational approaches. 
Conclusion: The algorithm ranks all enrichment sites based on their intra-replicate ranks and inter-replicate rank consistency. Following the ranking, the method allows segmentation of sites based on a meta p-value, a composite array signal enrichment criterion, or a composite of these two measures. The sensitivities obtained after segmentation of the data using a meta p-value of 10^-5, an array signal enrichment of 0.2, and a composite of these two values are 88%, 87% and 95%, respectively.

    Computational Methods for Sequencing and Analysis of Heterogeneous RNA Populations

    Next-generation sequencing (NGS) and mass spectrometry technologies bring unprecedented throughput, scalability and speed, facilitating the study of biological systems. These technologies make it possible to sequence and analyze heterogeneous RNA populations rather than single sequences. In particular, they provide the opportunity to implement massive viral surveillance and transcriptome quantification. However, in order to fully exploit the capabilities of NGS technology, we need to develop computational methods able to analyze billions of reads for assembly and characterization of sampled RNA populations. In this work we present novel computational methods for cost- and time-effective analysis of sequencing data from viral and RNA samples. In particular, we describe: i) computational methods for transcriptome reconstruction and quantification; ii) a method for mass spectrometry data analysis; iii) a combinatorial pooling method; iv) computational methods for analysis of intra-host viral populations.

    Selected Works in Bioinformatics

    This book consists of nine chapters covering a variety of bioinformatics subjects: database resources for protein allergens, unravelling genetic determinants of complex disorders, characterization and prediction of regulatory motifs, computational methods for identifying the best classifiers and key disease genes in large-scale transcriptomic and proteomic experiments, functional characterization of inherently unfolded proteins/regions, protein interaction networks, and flexible protein-protein docking. The computational algorithms are in general presented in a way that is accessible to advanced undergraduate students, graduate students and researchers in molecular biology and genetics. The book should also serve as a stepping stone for mathematicians, biostatisticians, and computational scientists to cross their academic boundaries into the dynamic and ever-expanding field of bioinformatics.

    Analysis of Genomic and Proteomic Signals Using Signal Processing and Soft Computing Techniques

    Bioinformatics is a data-rich field that provides unique opportunities to use computational techniques to understand and organize information associated with biomolecules such as DNA, RNA, and proteins. It involves in-depth study in the areas of genomics and proteomics and requires techniques from computer science, statistics and engineering to identify, model, and extract features, and to process data for analysis and interpretation of results in a biologically meaningful manner. Among engineering methods, signal processing techniques such as transformation, filtering and pattern analysis, and soft computing techniques such as the multilayer perceptron (MLP) and the radial basis function neural network (RBFNN), play a vital role in resolving many challenging issues in genomics and proteomics. This dissertation investigates some challenging problems of bioinformatics using efficient signal processing and soft computing methods. The specific problems addressed are protein coding region identification in DNA sequences, hot spot identification in proteins, prediction of protein structural class, and classification of microarray gene expression data. The dissertation presents novel methods to measure and extract features from genomic sequences using time-frequency analysis and machine intelligence techniques. The problems investigated and the contributions made in the thesis are summarized here. The S-transform, a powerful time-frequency representation technique, has an advantage over the wavelet transform and the short-time Fourier transform: the exponential function is fixed with respect to the time axis while the localizing scalable Gaussian window dilates and translates. The S-transform uses an analysis window whose width decreases with frequency, providing a frequency-dependent resolution.
The invertibility of the S-transform makes it suitable for time-band filtering applications. Gene prediction and protein coding region identification have always been challenging tasks in computational biology, especially in eukaryotic genomes because of their complex structure. This problem is addressed using an S-transform based time-band filtering approach that localizes the period-3 property present in the DNA sequence, which forms the basis for the identification. Similarly, hot spot identification in proteins is a pressing issue in protein science because of its importance in binding and interaction between proteins. A novel S-transform based time-frequency filtering approach is proposed for efficient identification of the hot spots. Prediction of the structural class of a protein has been a challenging problem in bioinformatics. A novel feature representation scheme is proposed to represent the protein efficiently, thereby improving prediction accuracy. The high dimension and low sample size of microarray data lead to the curse of dimensionality, which degrades classification performance. In this dissertation an efficient hybrid feature extraction method is proposed to overcome the dimensionality issue, and an RBFNN is introduced to classify the microarray samples efficiently.
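
For reference, the S-transform mentioned in this abstract is usually written in the standard continuous form below. This is the textbook definition, not a formula taken from the dissertation. The symbols x(t) for the signal, tau for the time shift and f for frequency are notation chosen here.

```latex
S(\tau, f) = \int_{-\infty}^{\infty} x(t)\,
             \frac{|f|}{\sqrt{2\pi}}\,
             e^{-\frac{(\tau - t)^{2} f^{2}}{2}}\,
             e^{-i 2\pi f t}\, dt
```

The Gaussian window has standard deviation 1/|f|, so it narrows as frequency increases, which is the frequency-dependent resolution described above. Integrating S(tau, f) over tau recovers the Fourier transform of x(t), which is what makes the transform invertible.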

    The anthropometric, environmental and genetic determinants of right ventricular structure and function

    BACKGROUND Measures of right ventricular (RV) structure and function have significant prognostic value. The right ventricle is currently assessed by global measures, or point surrogates, which are insensitive to regional and directional changes. We aim to create a high-resolution three-dimensional RV model to improve understanding of its structural and functional determinants. These may be particularly of interest in pulmonary hypertension (PH), a condition in which RV function and outcome are strongly linked. PURPOSE To investigate the feasibility and additional benefit of applying three-dimensional phenotyping and contemporary statistical and genetic approaches to large patient populations. METHODS Healthy subjects and incident PH patients were prospectively recruited. Using a semi-automated atlas-based segmentation algorithm, 3D models characterising RV wall position and displacement were developed, validated and compared with anthropometric, physiological and genetic influences. Statistical techniques were adapted from other high-dimensional approaches to deal with the problems of multiple testing, contiguity, sparsity and computational burden. RESULTS 1527 healthy subjects successfully completed high-resolution 3D CMR and automated segmentation. Of these, 927 subjects underwent next-generation sequencing of the sarcomeric gene titin and 947 subjects completed genotyping of common variants for genome-wide association study. 405 incident PH patients were recruited, of whom 256 completed phenotyping. 3D modelling demonstrated significant reductions in sample size compared to two-dimensional approaches. 3D analysis demonstrated that RV basal-freewall function reflects global functional changes most accurately and that a similar region in PH patients provides stronger survival prediction than all anthropometric, haemodynamic and functional markers. 
Vascular stiffness, titin truncating variants and common variants may also contribute to changes in RV structure and function. CONCLUSIONS High-resolution phenotyping coupled with computational analysis methods can improve insights into the determinants of RV structure and function in both healthy subjects and PH patients. Large, population-based approaches offer physiological insights relevant to clinical care in selected patient groups.

    Nonparametric Bayesian analysis of some clustering problems

    Nonparametric Bayesian models have been researched extensively in the past 10 years following the work of Escobar and West (1995) on sampling schemes for Dirichlet processes. The infinite mixture representation of the Dirichlet process makes it useful for clustering problems where the number of clusters is unknown. We develop nonparametric Bayesian models for two different clustering problems, namely functional and graphical clustering. We propose a nonparametric Bayes wavelet model for clustering of functional or longitudinal data. The wavelet modelling is aimed at the resolution of global and local features during clustering. The model also allows the elicitation of prior belief about the regularity of the functions and has the ability to adapt to a wide range of functional regularity. Posterior inference is carried out by Gibbs sampling with conjugate priors for fast computation. We use simulated as well as real datasets to illustrate the suitability of the approach over other alternatives. The functional clustering model is extended to analyze splice microarray data. New microarray technologies probe consecutive segments along genes to observe alternative splicing (AS) mechanisms that produce multiple proteins from a single gene. Clues regarding the number of splice forms can be obtained by clustering the functional expression profiles from different tissues. The analysis was carried out on the Rosetta dataset (Johnson et al., 2003) to obtain a splice variant by tissue distribution for all the 10,000 genes. We were able to identify a number of splice forms that appear to be unique to cancer. We propose a Bayesian model for partitioning graphs depicting dependencies in a collection of objects. After suitable transformations and modelling techniques, the problem of graph cutting can be approached by nonparametric Bayes clustering. 
We draw motivation from a recent work (Dhillon, 2001) showing the equivalence of kernel k-means clustering and certain graph cutting algorithms. It is shown that loss functions similar to that of kernel k-means arise naturally in this model, and that minimizing the associated posterior risk constitutes an effective graph cutting strategy. We present results from the analysis of two microarray datasets, namely the melanoma dataset (Bittner et al., 2000) and the sarcoma dataset (Nykter et al., 2006).
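
The abstract's point that a Dirichlet process suits clustering with an unknown number of clusters can be illustrated with the Chinese restaurant process, a standard description of the random partition a Dirichlet process induces. The sketch below only samples such a partition. It is not the wavelet or graph model of the thesis, and the function name and the concentration parameter name `alpha` are assumptions of this example.

```python
import random

def chinese_restaurant_process(n_items, alpha, rng=None):
    """Sample a random partition of n_items items.

    Returns (assignments, sizes): the cluster index of each item and the
    number of items in each cluster."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    rng = rng or random.Random()
    assignments, sizes = [], []
    for i in range(n_items):
        # Item i joins existing cluster j with probability sizes[j] / (i + alpha)
        # and opens a new cluster with probability alpha / (i + alpha).
        weights = sizes + [alpha]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(sizes):
            sizes.append(1)  # new cluster
        else:
            sizes[choice] += 1
        assignments.append(choice)
    return assignments, sizes

assignments, sizes = chinese_restaurant_process(50, alpha=1.0, rng=random.Random(1))
print(len(sizes), sizes)
```

The number of clusters is not fixed in advance: it grows with the data at a rate governed by `alpha`, which is why no cluster count needs to be specified up front.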