An entropic approach to the analysis of time series.
Statistical analysis of time series. With compelling arguments we show that the Diffusion Entropy Analysis (DEA) is the only method in the literature of the Science of Complexity that correctly determines the scaling hidden within a time series reflecting a Complex Process. The time series is thought of as a source of fluctuations, and the DEA is based on the Shannon entropy of the diffusion process generated by these fluctuations. All traditional methods of scaling analysis, instead, are based on the variance of this diffusion process. The variance methods detect the real scaling only if the Gaussian assumption holds true. We call H the scaling exponent detected by the variance methods and δ the real scaling exponent. If the time series is characterized by Fractional Brownian Motion, we have H = δ and the scaling can be safely determined, in this case, by using the variance methods. If, on the contrary, the time series is characterized, for example, by Lévy statistics, H ≠ δ and the variance methods cannot be used to detect the true scaling. The Lévy walk yields the relation δ = 1/(3 - 2H). In the case of Lévy flights, the variance diverges and the exponent H cannot be determined, whereas the scaling δ exists and can be established by using the DEA. Therefore, only the joint use of two different scaling analysis methods, the variance scaling analysis and the DEA, can assess the real nature, Gauss or Lévy or something else, of a time series. Moreover, the DEA determines the information content, in the form of Shannon entropy or of any other convenient entropic indicator, at each time step of the process that, given a sufficiently large number of data, is expected to become diffusion with scaling. This makes it possible to study the regime of transition from dynamics to thermodynamics, non-stationary regimes, and the saturation regime as well. First of all, the efficiency of the DEA is proved with theoretical arguments and with numerical work on artificial sequences.
Then we apply the DEA to three different sets of real data: genome sequences, hard X-ray solar flare waiting times, and sequences of sociological interest. In all these cases the DEA makes new properties, overlooked by the standard methods of analysis, emerge
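The entropy-based scaling estimate described in this abstract can be illustrated with a minimal sketch. It is not the author's implementation: the function names, the overlapping-window construction, the fixed bin width, and the least-squares fit of S(t) = A + δ ln t are all assumptions made for illustration.

```python
import numpy as np

def diffusion_entropy(xi, window_lengths, bin_width):
    """Shannon entropy S(t) of the diffusion process generated by xi.

    Hypothetical helper: one entropy value is returned per window length t.
    """
    xi = np.asarray(xi, dtype=float)
    # Prefix sums let us compute every window sum with one subtraction.
    csum = np.concatenate(([0.0], np.cumsum(xi)))
    entropies = []
    for t in window_lengths:
        # Displacements x(t) over all overlapping windows of length t.
        x = csum[t:] - csum[:-t]
        # Fixed-width bins covering the observed range (assumed binning rule).
        edges = np.arange(x.min(), x.max() + 2 * bin_width, bin_width)
        counts, _ = np.histogram(x, bins=edges)
        p = counts[counts > 0] / counts.sum()
        # Discrete estimate of -integral p(x) ln p(x) dx for bins of equal width.
        entropies.append(-np.sum(p * np.log(p)) + np.log(bin_width))
    return np.array(entropies)

def dea_scaling(xi, window_lengths, bin_width):
    """Estimate delta as the slope of S(t) against ln t."""
    s = diffusion_entropy(xi, window_lengths, bin_width)
    slope, _intercept = np.polyfit(np.log(window_lengths), s, 1)
    return slope

# Usage: for Gaussian white noise the fitted exponent is expected to be near 0.5.
rng = np.random.default_rng(0)
noise = rng.standard_normal(50000)
delta = dea_scaling(noise, [4, 8, 16, 32, 64], bin_width=0.5)
```

The bin width here is set to half the standard deviation of the fluctuations, which is a common but arbitrary choice; the fitted slope should only be trusted over the range of window lengths where S(t) is visibly linear in ln t.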
Use of wavelet-packet transforms to develop an engineering model for multifractal characterization of mutation dynamics in pathological and nonpathological gene sequences
This study uses dynamical analysis to examine in a quantitative fashion the information coding mechanism in DNA sequences. This goes beyond the simple approach of modeling the mechanism by comparing DNA sequence walks with Fractional Brownian Motion (fBm) processes. The 2-D mappings of the DNA sequences for this research are from Iterated Function System (IFS) mappings of the DNA sequences, also known as the Chaos Game Representation (CGR). This technique converts a 1-D sequence into a 2-D representation that preserves subsequence structure and provides a visual representation. The second step of this analysis involves the application of Wavelet Packet Transforms, a recently developed technique from the field of signal processing. A multi-fractal model is built by using wavelet transforms to estimate the Hurst exponent, H. The Hurst exponent is a non-parametric measurement of the dynamism of a system. This procedure is used to evaluate gene-coding events in the DNA sequence of cystic fibrosis mutations. The H exponent is calculated for various mutation sites in this gene. The results of this study indicate the presence of anti-persistent, random-walk, and persistent sub-periods in the sequence. This indicates that the hypothesis of a multi-fractal model of DNA information encoding warrants further consideration. This work examines the model's behavior in both pathological (mutations) and non-pathological (healthy) base pair sequences of the cystic fibrosis gene. These mutations, both natural and synthetic, were introduced by computer manipulation of the original base pair text files. The results show that disease severity and system information dynamics correlate. These results have implications for genetic engineering as well as for mathematical biology. They suggest that there is scope for more multi-fractal models to be developed
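The 1-D to 2-D mapping mentioned in this abstract can be sketched as follows. This is an illustrative version of the Chaos Game Representation, not the study's code: the corner assignment for A, C, G, and T, the starting point at the center of the unit square, and the function name are assumptions.

```python
# Assumed corner assignment on the unit square; other layouts are possible.
CORNERS = {
    "A": (0.0, 0.0),
    "C": (0.0, 1.0),
    "G": (1.0, 1.0),
    "T": (1.0, 0.0),
}

def chaos_game_representation(sequence):
    """Map a DNA string to a list of (x, y) points in the unit square.

    Each point is the midpoint between the previous point and the corner
    of the current base, so nearby points share a common recent suffix.
    """
    x, y = 0.5, 0.5  # assumed starting point: center of the square
    points = []
    for base in sequence.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points

# Usage: two bases give two points, each halfway toward its corner.
points = chaos_game_representation("AG")
```

Because every step halves the distance to a corner, the last k bases determine which sub-square of side 2^-k a point falls in, which is the sense in which the mapping preserves subsequence structure.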
Data Discovery and Anomaly Detection using Atypicality.
Ph.D. Thesis. University of Hawaiʻi at Mānoa 2017
Information Theory in Molecular Evolution: From Models to Structures and Dynamics
This Special Issue collects novel contributions from scientists in the interdisciplinary field of biomolecular evolution. Works listed here use information theoretical concepts as a core but are tightly integrated with the study of molecular processes. Applications include the analysis of phylogenetic signals to elucidate biomolecular structure and function, the study and quantification of structural dynamics and allostery, as well as models of molecular interaction specificity inspired by evolutionary cues
Hypothesis testing and causal inference with heterogeneous medical data
Learning from data which associations hold and are likely to hold in the future is a fundamental part of scientific discovery. With increasingly heterogeneous data collection practices, exemplified by passively collected electronic health records or high-dimensional genetic data with only few observed samples, biases and spurious correlations are prevalent. These are called spurious because they do not contribute to the effect being studied. In this context, the modelling assumptions of existing statistical tests and causal inference methods are often found inadequate and their practical utility diminished even though these models are increasingly used as decision-support tools in practice. This thesis investigates how modern computational techniques may broaden the fields of hypothesis testing and causal inference to handle the subtleties of large heterogeneous data sets, as well as simultaneously improve the robustness and theoretical understanding of machine learning algorithms using insights from causality and statistics.
The first part of this thesis is concerned with hypothesis testing. We develop a framework for hypothesis testing on set-valued data, a representation that faithfully describes many real-world phenomena including patient biomarker trajectories in the hospital. Using similar techniques, we next develop a two-sample test for making inference on selection-biased data, in the sense that not all individuals are equally likely to be included in the study, a fact that biases tests if not accounted for and if the desideratum is to obtain conclusions that are generally applicable. We conclude this section with an investigation of conditional independence in high-dimensional data, such as found in gene expression data, and propose a test using generative adversarial networks. The second part of this thesis is concerned with causal inference and discovery, with a special focus on the influence of unobserved confounders that distort the observed associations between variables and yet may not be ruled out or adjusted for using data alone. We start by demonstrating that unobserved confounders may substantially bias the generalization performance of machine learning algorithms trained with conventional learning paradigms such as empirical risk minimization. Acknowledging this spurious effect, we develop a new learning principle inspired by causal insights that provably generalizes to test data sampled from a larger set of distributions different from the training distribution. In the last chapter we consider the influence of unobserved confounders for causal discovery. We show that, under some assumptions on the type and influence of unobserved confounding, one may develop provably consistent causal discovery algorithms, formulated as a solution to a continuous optimization program
An Inferential Framework for Network Hypothesis Tests: With Applications to Biological Networks
The analysis of weighted co-expression gene sets is gaining momentum in systems biology. In addition to substantial research directed toward inferring co-expression networks on the basis of microarray/high-throughput sequencing data, inferential methods are being developed to compare gene networks across one or more phenotypes. Common gene set hypothesis testing procedures are mostly confined to comparing average gene/node transcription levels between one or more groups and make limited use of additional network features, e.g., edges induced by significant partial correlations. Ignoring the gene set architecture disregards relevant network topological comparisons and can result in familiar
Network models of stochastic processes in cancer
Complex systems which can be modelled as networks are ubiquitous. Well-known examples include social and economic networks, as well as many examples in cell biology such as gene regulatory and protein signalling networks. Many cell biological processes are inherently stochastic and non-stationary, and this is the perspective from which I have developed novel mathematical and computational statistical models, focusing particularly on network models. These models are primarily motivated by cell biological processes relating to DNA methylation and stem cell and cancer biology, but can be generalised to other systems and domains. I have used these and other models to identify and analyse novel DNA-based cancer biomarkers
Bioinformatics Applications Based On Machine Learning
The great advances in information technology (IT) have implications for many sectors, such as bioinformatics, and have considerably increased their possibilities. This book presents a collection of 11 original research papers, all of them related to the application of IT-related techniques within the bioinformatics sector: from new applications created from the adaptation and application of existing techniques to the creation of new methodologies to solve existing problems