9 research outputs found

    An Automata Based Text Analysis System

    This report describes and implements an automata-based text analysis system. We collected a set of writing samples. For each sample, the system builds a tree and uses the ALERGIA algorithm to merge all compatible nodes, yielding a merged stochastic finite automaton. These automata, which capture the writing style of the sample texts, are stored on the hard drive. A new test piece can then be checked for similarity in writing style against the sample texts.
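The tree-building and compatibility steps mentioned in this abstract can be sketched roughly as follows. This is a minimal illustration, not the report's implementation: it assumes the tree is a frequency prefix tree over characters, and it uses the Hoeffding-bound test from Carrasco and Oncina's ALERGIA paper for a purely local comparison of two states. The function names and the default `alpha` are hypothetical, and the recursive check of successor states and the actual merge are omitted.

```python
import math
from collections import defaultdict

def build_prefix_tree(samples):
    """Count how often each prefix is reached, ended on, and extended."""
    arrivals = defaultdict(int)     # prefix -> samples passing through it
    endings = defaultdict(int)      # prefix -> samples ending exactly there
    transitions = defaultdict(int)  # (prefix, symbol) -> times symbol followed prefix
    for sample in samples:
        prefix = ""
        arrivals[prefix] += 1
        for symbol in sample:
            transitions[(prefix, symbol)] += 1
            prefix += symbol
            arrivals[prefix] += 1
        endings[prefix] += 1
    return arrivals, endings, transitions

def hoeffding_different(f1, n1, f2, n2, alpha=0.05):
    """True if the frequencies f1/n1 and f2/n2 differ beyond the Hoeffding bound."""
    if n1 == 0 or n2 == 0:
        return False  # no evidence, so the states are not judged different
    bound = math.sqrt(0.5 * math.log(2.0 / alpha)) * (
        1.0 / math.sqrt(n1) + 1.0 / math.sqrt(n2)
    )
    return abs(f1 / n1 - f2 / n2) > bound

def locally_compatible(q1, q2, tree, alphabet, alpha=0.05):
    """Compare termination and per-symbol frequencies of two states (local check only)."""
    arrivals, endings, transitions = tree
    n1, n2 = arrivals[q1], arrivals[q2]
    if hoeffding_different(endings[q1], n1, endings[q2], n2, alpha):
        return False
    return all(
        not hoeffding_different(transitions[(q1, a)], n1, transitions[(q2, a)], n2, alpha)
        for a in alphabet
    )
```

With small samples the bound is loose, so most states look compatible; a smaller `alpha` widens the bound and makes merging more permissive still.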

    Pattern Recognition of DNA Sequences using Automata with application to Species Distinction

    "Darwin wasn't just provocative in saying that we descend from the apes; he didn't go far enough. We are apes in every way, from our long arms and tailless bodies to our habits and temperament," said Frans de Waal, a primate scientist at Emory University in Atlanta, Georgia. Scientists have named and analyzed 1.3 million species. This project focuses on capturing nucleotide sequences of various species and determining the similarities and differences between them, using finite state automata. The automaton for a DNA genome is created using the ALERGIA algorithm and serves as the basis for comparison with other species' DNA sequences.

    DNA Sequence Representation by Use of Statistical Finite Automata

    This project defines and aims to solve the problem of representing the information carried by DNA sequences in terms of amino acids, by applying the theory of finite automata. Sequences can be compared against each other to find existing patterns, if any, which may include important genetic information. The comparison can indicate whether the DNA sequences belong to the same, related, or entirely different species in the ‘Tree of Life’ (phylogeny). This is achieved using extended and statistical finite automata, specifically the concepts of automata and their extension, the ALERGIA algorithm. In this case, a chemical property of amino acids, polarity, is used to analyze the DNA sequences.
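One way to picture the representation step in this abstract is to translate DNA codons into amino acids and then relabel each amino acid by a polarity class, giving a string over a small alphabet that an automaton learner could consume. The sketch below is an assumption, not the project's method: the four-class grouping (nonpolar, polar uncharged, acidic, basic) is one common textbook convention, and the names `translate` and `polarity_string` and the class labels are hypothetical. The codon table is the standard genetic code.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Assumed grouping: N = nonpolar, P = polar uncharged, A = acidic, B = basic.
POLARITY = {}
for letters, label in [("GAVLIMFWP", "N"), ("STCYNQ", "P"), ("DE", "A"), ("KRH", "B")]:
    for amino_acid in letters:
        POLARITY[amino_acid] = label

def translate(dna):
    """Translate DNA in reading frame 0, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # 'X' for unknown codons
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

def polarity_string(dna):
    """Map a DNA sequence to a string of polarity class labels."""
    return "".join(POLARITY.get(amino_acid, "?") for amino_acid in translate(dna))
```

For example, `polarity_string("ATGGCTAAATAA")` translates to the amino acids `MAK` and then to the class string `NNB`.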

    An automata based authorship identification system

    Analysis on ALERGIA Algorithm: Pattern Recognition by Automata Theory

    Based on Kolmogorov complexity, a finite set x of strings has a pattern if x can be output by a Turing machine whose length is less than the minimum of all |x|; such a Turing machine, which may not be unique, is called a pattern of the finite set of strings. To find a pattern of a given finite set of strings (assuming such a pattern exists), the ALERGIA algorithm is used to approximate the pattern (Turing machine) in terms of finite automata. Because each finite automaton defines a partition of the formal language Σ*, the ALERGIA algorithm can be viewed as an approximation based on Granular Rough Computing: any subset of Σ*, such as DNA, can be approximated by equivalence classes. Based on this view, this thesis analyzes and improves the ALERGIA algorithm via minimization of deterministic finite automata. Hypothesis testing indicates that the minimization does improve ALERGIA, so the new method should be highly usable in pattern recognition and data mining.

    Data Discovery and Anomaly Detection Using Atypicality: Theory

    A central question in the era of 'big data' is what to do with the enormous amount of information. One possibility is to characterize it through statistics, e.g., averages, or classify it using machine learning, in order to understand the general structure of the overall data. The perspective in this paper is the opposite, namely that in some applications most of the value in the information lies in the parts that deviate from the average, that are unusual, atypical. We define what we mean by 'atypical' in an axiomatic way as data that can be encoded with fewer bits in itself rather than using the code for the typical data. We show that this definition has good theoretical properties. We then develop an implementation based on universal source coding, and apply this to a number of real-world data sets.
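The definition of atypicality in this abstract compares two code lengths for the same data. A toy sketch for binary sequences follows, under loud assumptions: the typical code is taken to be an i.i.d. Bernoulli model with a known parameter `p_one` strictly between 0 and 1, and the "in itself" code is a simple two-part code (empirical entropy plus a parameter cost of half a log and a fixed `overhead` in bits). The paper's implementation is based on universal source coding; this simplification, along with all function names and default values, is hypothetical.

```python
import math

def typical_bits(bits, p_one):
    """Code length in bits under the assumed typical Bernoulli(p_one) model."""
    return sum(-math.log2(p_one if b else 1.0 - p_one) for b in bits)

def self_bits(bits, overhead=2.0):
    """Two-part code length: data coded with its own empirical frequency."""
    n = len(bits)
    if n == 0:
        return overhead
    p = sum(bits) / n
    if p in (0.0, 1.0):
        entropy = 0.0
    else:
        entropy = -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))
    # n * entropy codes the data; 0.5 * log2(n) approximates the parameter cost.
    return n * entropy + 0.5 * math.log2(n) + overhead

def is_atypical(bits, p_one=0.5, overhead=2.0):
    """Atypical if the self-contained code is shorter than the typical code."""
    return self_bits(bits, overhead) < typical_bits(bits, p_one)
```

Under a fair-coin typical model, a run of 32 ones costs 32 bits with the typical code but only 4.5 bits with the self code, so it is flagged as atypical, while an alternating sequence of the same length is not.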

    Learning Author’s Writing Pattern System By Automata

    The purpose of this report is to document the theory, implementation, and test results of our project. The project develops an automata-based learning system that models authors' writing characteristics with automata. Building on previous work by Dr. T.Y. Lin and Ms. S.X. Zhang, we continue the analysis of the ALERGIA algorithm and begin a study of common patterns. Although every author has his or her own writing style, reflected in features such as sentence length and word frequency, there are always some similarities across writing styles. We hypothesize that common strings obscured the expected test results, much like noise in a radio signal. This report gives the design and implementation of the common-pattern search, as well as the test results. It also describes an implementation of the ALERGIA algorithm based on the paper "Learning Stochastic Regular Grammars by Means of a State Merging Method" by Rafael C. Carrasco and Jose Oncina [2]. The code is written in Java 6 using Eclipse Helios.

    Information Theory and Machine Learning

    The recent successes of machine learning, especially regarding systems based on deep neural networks, have encouraged further research activities and raised a new set of challenges in understanding and designing complex machine learning algorithms. New applications require learning algorithms to be distributed, have transferable learning results, use computation resources efficiently, converge quickly in online settings, have performance guarantees, satisfy fairness or privacy constraints, incorporate domain knowledge on model structures, etc. A new wave of developments in statistical learning theory and information theory has set out to address these challenges. This Special Issue, "Machine Learning and Information Theory", aims to collect recent results in this direction, reflecting a diverse spectrum of visions and efforts to extend conventional theories and develop analysis tools for these complex machine learning systems.