337 research outputs found
Determining the Number of Samples Required to Estimate Entropy in Natural Sequences
Calculating the Shannon entropy for symbolic sequences has been widely
considered in many fields. For descriptive statistical problems such as
estimating the N-gram entropy of English language text, a common approach is to
use as much data as possible to obtain progressively more accurate estimates.
However in some instances, only short sequences may be available. This gives
rise to the question of how many samples are needed to compute entropy. In this
paper, we examine this problem and propose a method for estimating the number
of samples required to compute Shannon entropy for a set of ranked symbolic
natural events. The result is developed using a modified Zipf-Mandelbrot law
and the Dvoretzky-Kiefer-Wolfowitz inequality, and we propose an algorithm
which yields an estimate for the minimum number of samples required to obtain
an estimate of entropy with a given confidence level and degree of accuracy.
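The Dvoretzky-Kiefer-Wolfowitz inequality mentioned above bounds the deviation of an empirical distribution from the true one by P(sup |F_n - F| > ε) ≤ 2·exp(-2nε²), which can be inverted to give a minimum sample size for a chosen confidence level and accuracy. The following is a minimal sketch of that inversion only; the paper's full algorithm additionally relies on a modified Zipf-Mandelbrot law and is not reproduced here.

```python
import math

def dkw_min_samples(epsilon: float, confidence: float) -> int:
    """Smallest n such that the empirical CDF is within epsilon of the
    true CDF (in sup norm) with the given confidence, per the
    Dvoretzky-Kiefer-Wolfowitz inequality:
        P(sup |F_n - F| > eps) <= 2 * exp(-2 * n * eps**2).
    Setting the right-hand side to (1 - confidence) and solving for n:
        n >= ln(2 / (1 - confidence)) / (2 * eps**2).
    """
    alpha = 1.0 - confidence
    return math.ceil(math.log(2.0 / alpha) / (2.0 * epsilon ** 2))

# Accuracy 0.01 at 95% confidence requires on the order of 18,000 samples.
n = dkw_min_samples(0.01, 0.95)
```

As the sketch shows, the bound grows quadratically as the accuracy requirement ε shrinks, which is why short sequences constrain how finely the distribution, and hence the entropy, can be estimated.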
Informational laws of genome structures
In recent years, the analysis of genomes by means of strings of length k occurring in the genomes, called k-mers, has provided important insights into the basic mechanisms and design principles of genome structures. In the present study, we focus on the proper choice of the value of k for applying information theoretic concepts that express intrinsic aspects of genomes. The value k = log2(n), where n is the genome length, is determined to be the best choice in the definition of some genomic informational indexes that are studied and computed for seventy genomes. These indexes, which are based on information entropies and on suitable comparisons with random genomes, suggest five informational laws, which all of the considered genomes obey. Moreover, an informational genome complexity measure is proposed, which is a generalized logistic map that balances entropic and anti-entropic components of genomes and is related to their evolutionary dynamics. Finally, applications to computational synthetic biology are briefly outlined.
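The k-mer entropy underlying such informational indexes can be sketched as follows. This is an illustrative implementation, not the paper's index definitions; in particular, rounding k = log2(n) to the nearest integer is an assumption made here, since k must be integral.

```python
import math
from collections import Counter

def kmer_entropy(genome, k=None):
    """Shannon entropy (in bits) of the k-mer distribution of a sequence.
    If k is not given, use k = round(log2(n)), where n is the sequence
    length, following the choice of k discussed above (rounding is an
    assumption of this sketch)."""
    n = len(genome)
    if k is None:
        k = max(1, round(math.log2(n)))
    # Count all overlapping substrings of length k.
    counts = Counter(genome[i:i + k] for i in range(n - k + 1))
    total = n - k + 1
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A homogeneous sequence such as "AAAA" has zero k-mer entropy, while a maximally mixed one approaches log2 of the number of distinct k-mers; comparisons against randomized genomes, as in the study, separate the entropic and anti-entropic components.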
Memetic Science: I-General Introduction
Memetic Science is the name of a new field that deals with the quantitative analysis of cultural transfer. The units of cultural transfer are entities called "memes". In a nutshell, memes are to cultural and mental constructs as genes are to biological organisms. Examples of memes are ideas, tunes, fashions, and virtually any cultural and behavioral unit that gets copied with a certain degree of fidelity. It is argued that the understanding of memes is of similar importance and consequence as the understanding of processes involving DNA and RNA in molecular biology. This paper presents a rigorous foundation for discussion of memes and approaches to quantifying relevant aspects of meme genesis, interaction, mutation, growth, death and spreading processes. It is also argued in this paper that recombinant memetics is possible in complete analogy to recombinant DNA/genetic engineering. Special attention is paid to memes in written modern English.
Entropy of printed Bengali language texts.
One of the most important sources of information is written and spoken human language. The language that is spoken, written, or signed by humans for general-purpose communication is referred to as natural language. Determining the entropy of natural language text is a fundamentally important problem in natural language processing. The study and analysis of the entropy of a language can be a meaningful resource for researchers in linguistics and communication theory. For the purpose of this research we have taken printed Bengali language text as our source of natural language. We have collected a sufficient number of printed Bengali language text samples and divided them into two classes, newspaper and literature. We have studied each class in order to come up with specific entropy for each category and analyzed their characteristics. As a separate study, we collected printed religious Bengali language texts, divided them into two classes, Islamic and Hindu, found their entropy and studied and analyzed their characteristics. From our research, we have found the zero and first-order entropy of Bengali language to be 5.52 and 4.55 respectively. The language uncertainty and redundancy are 0.8242 and 17.58% respectively. These entropy and redundancy results of the language will be useful to researchers to help find a better text compression method for Bengali language. The original print copy of this thesis may be available here: http://wizard.unbc.ca/record=b146606
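Under one common convention, zero-order entropy is log2 of the alphabet size and first-order entropy is the unigram (single-character) entropy, with redundancy defined as 1 - H1/H0; the figures quoted above are consistent with this reading, since 1 - 4.55/5.52 ≈ 17.6%. A minimal sketch under that assumed convention:

```python
import math
from collections import Counter

def entropy_stats(text):
    """Return (H0, H1, redundancy) for a text, where H0 = log2(alphabet
    size) is the zero-order entropy, H1 = -sum(p * log2(p)) over
    single-character frequencies is the first-order entropy, and
    redundancy = 1 - H1/H0. This reading of 'zero/first order' is an
    assumption of this sketch, not taken from the thesis itself."""
    counts = Counter(text)
    n = len(text)
    h0 = math.log2(len(counts))
    h1 = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return h0, h1, 1.0 - h1 / h0
```

For a text whose characters are uniformly distributed, H1 equals H0 and the redundancy is zero; skewed character frequencies, as in any natural language, push H1 below H0, and it is this gap that a text compressor for Bengali can exploit.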
Models, Techniques, and Metrics for Managing Risk in Software Engineering
The field of Software Engineering (SE) is the study of systematic and quantifiable approaches to software development, operation, and maintenance. This thesis presents a set of scalable and easily implemented techniques for quantifying and mitigating risks associated with the SE process. The thesis comprises six papers corresponding to SE knowledge areas such as software requirements, testing, and management. The techniques for risk management are drawn from stochastic modeling and operational research.
The first two papers relate to software testing and maintenance. The first paper describes and validates novel iterative-unfolding technique for filtering a set of execution traces relevant to a specific task. The second paper analyzes and validates the applicability of some entropy measures to the trace classification described in the previous paper. The techniques in these two papers can speed up problem determination of defects encountered by customers, leading to improved organizational response and thus increased customer satisfaction and to easing of resource constraints.
The third and fourth papers are applicable to maintenance, overall software quality and SE management. The third paper uses Extreme Value Theory and Queuing Theory tools to derive and validate metrics based on defect rediscovery data. The metrics can aid the allocation of resources to service and maintenance teams, highlight gaps in quality assurance processes, and help assess the risk of using a given software product. The fourth paper characterizes and validates a technique for automatic selection and prioritization of a minimal set of customers for profiling. The minimal set is obtained using Binary Integer Programming and prioritized using a greedy heuristic. Profiling the resulting customer set leads to enhanced comprehension of user behaviour, leading to improved test specifications and clearer quality assurance policies, hence reducing risks associated with unsatisfactory product quality.
The fifth and sixth papers pertain to software requirements. The fifth paper both models the relation between requirements and their underlying assumptions and measures the risk associated with failure of the assumptions using Boolean networks and stochastic modeling. The sixth paper models the risk associated with injection of requirements late in the development cycle with the help of stochastic processes.
- …