337 research outputs found

    Determining the Number of Samples Required to Estimate Entropy in Natural Sequences

    Calculating the Shannon entropy of symbolic sequences has been widely considered in many fields. For descriptive statistical problems such as estimating the N-gram entropy of English-language text, a common approach is to use as much data as possible to obtain progressively more accurate estimates. However, in some instances only short sequences may be available, which raises the question of how many samples are needed to compute entropy. In this paper, we examine this problem and propose a method for estimating the number of samples required to compute the Shannon entropy of a set of ranked symbolic natural events. The result is developed using a modified Zipf-Mandelbrot law and the Dvoretzky-Kiefer-Wolfowitz inequality, and we propose an algorithm that yields an estimate of the minimum number of samples required to estimate entropy with a given confidence level and degree of accuracy.
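The Dvoretzky-Kiefer-Wolfowitz inequality alone already yields a closed-form minimum sample size for estimating a distribution to a given accuracy and confidence. A minimal Python sketch of that single ingredient (the paper's full algorithm also incorporates a modified Zipf-Mandelbrot fit, which is not reproduced here; the function name is illustrative):

```python
import math

def dkw_sample_size(epsilon, alpha):
    """Minimum n so the empirical CDF lies within epsilon of the true CDF
    with probability at least 1 - alpha, via the DKW bound
    P(sup |F_n - F| > epsilon) <= 2 * exp(-2 * n * epsilon**2)."""
    return math.ceil(math.log(2 / alpha) / (2 * epsilon ** 2))

# e.g. 95% confidence, accuracy 0.05 in sup-norm on the CDF
print(dkw_sample_size(0.05, 0.05))
```

Inverting the bound 2·exp(−2nε²) ≤ α for n gives n ≥ ln(2/α)/(2ε²), which is the expression the function evaluates.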

    Informational laws of genome structures

    In recent years, the analysis of genomes by means of strings of length k occurring in the genomes, called k-mers, has provided important insights into the basic mechanisms and design principles of genome structures. In the present study, we focus on the proper choice of the value of k for applying information-theoretic concepts that express intrinsic aspects of genomes. The value k = log2(n), where n is the genome length, is determined to be the best choice in the definition of some genomic informational indexes that are studied and computed for seventy genomes. These indexes, which are based on information entropies and on suitable comparisons with random genomes, suggest five informational laws with which all of the considered genomes comply. Moreover, an informational genome complexity measure is proposed, which is a generalized logistic map that balances entropic and anti-entropic components of genomes and is related to their evolutionary dynamics. Finally, applications to computational synthetic biology are briefly outlined.
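As a rough illustration of the underlying quantity, the Shannon entropy of a genome's k-mer distribution with k near log2(n) can be computed directly. This sketch assumes the genome is a plain string over {A, C, G, T}; the function name and toy sequence are illustrative, not taken from the study:

```python
import math
from collections import Counter

def kmer_entropy(genome, k=None):
    """Shannon entropy (in bits) of the k-mer distribution of a genome string.
    k defaults to round(log2(len(genome))), the choice examined in the study."""
    if k is None:
        k = round(math.log2(len(genome)))
    # count all overlapping substrings of length k
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# toy example; real genomes are thousands to billions of bases long
print(kmer_entropy("ACGTACGTACGGTTAC", k=2))
```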

    Memetic Science: I-General Introduction

    Memetic Science is the name of a new field that deals with the quantitative analysis of cultural transfer. The units of cultural transfer are entities called "memes". In a nutshell, memes are to cultural and mental constructs as genes are to biological organisms. Examples of memes are ideas, tunes, fashions, and virtually any cultural and behavioral unit that gets copied with a certain degree of fidelity. It is argued that the understanding of memes is of similar importance and consequence as the understanding of processes involving DNA and RNA in molecular biology. This paper presents a rigorous foundation for discussion of memes and approaches to quantifying relevant aspects of meme genesis, interaction, mutation, growth, death, and spreading processes. It is also argued in this paper that recombinant memetics is possible in complete analogy to recombinant DNA/genetic engineering. Special attention is paid to memes in written modern English.

    Entropy of printed Bengali language texts

    One of the most important sources of information is written and spoken human language. The language that is spoken, written, or signed by humans for general-purpose communication is referred to as natural language. Determining the entropy of natural language text is a fundamentally important problem in natural language processing. The study and analysis of the entropy of a language can be a meaningful resource for researchers in linguistics and communication theory. For the purpose of this research we took printed Bengali language text as our source of natural language. We collected a sufficient number of printed Bengali text samples and divided them into two classes, newspaper and literature, studying each class to obtain a specific entropy for each category and analyze its characteristics. As a separate study, we collected printed religious Bengali texts, divided them into two classes, Islamic and Hindu, found their entropy, and studied and analyzed their characteristics. From our research, we found the zero-order and first-order entropy of the Bengali language to be 5.52 and 4.55 respectively. The language uncertainty and redundancy are 0.8242 and 17.58% respectively. These entropy and redundancy results will be useful to researchers seeking a better text compression method for the Bengali language. The original print copy of this thesis may be available here: http://wizard.unbc.ca/record=b146606
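The quoted figures are internally consistent under the standard definitions: uncertainty 4.55/5.52 ≈ 0.8242 and redundancy 1 − 0.8242 ≈ 17.58%. A hedged Python sketch of those definitions over an arbitrary symbol string (function names are illustrative, and the thesis operates on a large Bengali corpus rather than this toy input):

```python
import math
from collections import Counter

def zero_order_entropy(text):
    # log2 of the alphabet size: entropy if every symbol were equally likely
    return math.log2(len(set(text)))

def first_order_entropy(text):
    # Shannon entropy of the single-symbol (unigram) distribution
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in Counter(text).values())

def redundancy(text):
    # fraction of zero-order entropy not accounted for by the unigram model
    return 1 - first_order_entropy(text) / zero_order_entropy(text)

sample = "this is a tiny stand-in for a large Bengali corpus"
print(zero_order_entropy(sample), first_order_entropy(sample), redundancy(sample))
```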

    Models, Techniques, and Metrics for Managing Risk in Software Engineering

    The field of Software Engineering (SE) is the study of systematic and quantifiable approaches to software development, operation, and maintenance. This thesis presents a set of scalable and easily implemented techniques for quantifying and mitigating risks associated with the SE process. The thesis comprises six papers corresponding to SE knowledge areas such as software requirements, testing, and management. The techniques for risk management are drawn from stochastic modeling and operational research. The first two papers relate to software testing and maintenance. The first paper describes and validates a novel iterative-unfolding technique for filtering a set of execution traces relevant to a specific task. The second paper analyzes and validates the applicability of some entropy measures to the trace classification described in the first paper. The techniques in these two papers can speed up problem determination for defects encountered by customers, leading to improved organizational response, increased customer satisfaction, and an easing of resource constraints. The third and fourth papers are applicable to maintenance, overall software quality, and SE management. The third paper uses Extreme Value Theory and Queuing Theory tools to derive and validate metrics based on defect-rediscovery data. The metrics can aid the allocation of resources to service and maintenance teams, highlight gaps in quality assurance processes, and help assess the risk of using a given software product. The fourth paper characterizes and validates a technique for automatic selection and prioritization of a minimal set of customers for profiling. The minimal set is obtained using Binary Integer Programming and prioritized using a greedy heuristic.
Profiling the resulting customer set leads to enhanced comprehension of user behaviour, which in turn yields improved test specifications and clearer quality assurance policies, reducing the risks associated with unsatisfactory product quality. The fifth and sixth papers pertain to software requirements. The fifth paper models the relation between requirements and their underlying assumptions and measures the risk associated with failure of those assumptions using Boolean networks and stochastic modeling. The sixth paper models the risk associated with the injection of requirements late in the development cycle with the help of stochastic processes.
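The customer-selection step can be pictured with a classic greedy set-cover heuristic: repeatedly pick the customer whose usage profile covers the most still-uncovered features. This is only an illustrative stand-in for the thesis's combination of Binary Integer Programming and greedy prioritization; the data model (customer name mapped to the set of features that customer exercises) and all names here are assumptions, not taken from the thesis:

```python
def greedy_customer_cover(customer_profiles):
    """Order customers so that each pick covers the most uncovered features.
    customer_profiles: dict mapping customer name -> set of exercised features.
    Returns the prioritized list of customers forming a covering set."""
    uncovered = set().union(*customer_profiles.values())
    order = []
    while uncovered:
        # pick the customer with the largest marginal coverage gain
        best = max(customer_profiles,
                   key=lambda c: len(customer_profiles[c] & uncovered))
        gain = customer_profiles[best] & uncovered
        if not gain:
            break  # remaining features are exercised by no customer
        order.append(best)
        uncovered -= gain
    return order

# hypothetical usage profiles
profiles = {
    "acme": {"login", "export", "search"},
    "globex": {"login", "billing"},
    "initech": {"export"},
}
print(greedy_customer_cover(profiles))
```

The greedy choice is a standard approximation for set cover; the thesis's exact (BIP) formulation would instead minimize the number of selected customers subject to full coverage.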