19,975 research outputs found

    Using Genetic Algorithms for Texts Classification Problems

    Full text link
    The avalanche quantity of the information developed by mankind has led to concept of automation of knowledge extraction - Data Mining ([1]). This direction is connected with a wide spectrum of problems - from recognition of the fuzzy set to creation of search machines. Important component of Data Mining is processing of the text information. Such problems lean on concept of classification and clustering ([2]). Classification consists in definition of an accessory of some element (text) to one of in advance created classes. Clustering means splitting a set of elements (texts) on clusters which quantity are defined by localization of elements of the given set in vicinities of these some natural centers of these clusters. Realization of a problem of classification initially should lean on the given postulates, basic of which - the aprioristic information on primary set of texts and a measure of affinity of elements and classes.Comment: 16 pages, exposed on 5th International Conference "Actualities and Perspectives on Hardware and Software" - APHS2009, Timisoara, Romani

    Artificial Sequences and Complexity Measures

    Get PDF
    In this paper we exploit concepts of information theory to address the fundamental problem of identifying and defining the most suitable tools to extract, in a automatic and agnostic way, information from a generic string of characters. We introduce in particular a class of methods which use in a crucial way data compression techniques in order to define a measure of remoteness and distance between pairs of sequences of characters (e.g. texts) based on their relative information content. We also discuss in detail how specific features of data compression techniques could be used to introduce the notion of dictionary of a given sequence and of Artificial Text and we show how these new tools can be used for information extraction purposes. We point out the versatility and generality of our method that applies to any kind of corpora of character strings independently of the type of coding behind them. We consider as a case study linguistic motivated problems and we present results for automatic language recognition, authorship attribution and self consistent-classification.Comment: Revised version, with major changes, of previous "Data Compression approach to Information Extraction and Classification" by A. Baronchelli and V. Loreto. 15 pages; 5 figure

    All mixed up? Finding the optimal feature set for general readability prediction and its application to English and Dutch

    Get PDF
    Readability research has a long and rich tradition, but there has been too little focus on general readability prediction without targeting a specific audience or text genre. Moreover, though NLP-inspired research has focused on adding more complex readability features there is still no consensus on which features contribute most to the prediction. In this article, we investigate in close detail the feasibility of constructing a readability prediction system for English and Dutch generic text using supervised machine learning. Based on readability assessments by both experts and a crowd, we implement different types of text characteristics ranging from easy-to-compute superficial text characteristics to features requiring a deep linguistic processing, resulting in ten different feature groups. Both a regression and classification setup are investigated reflecting the two possible readability prediction tasks: scoring individual texts or comparing two texts. We show that going beyond correlation calculations for readability optimization using a wrapper-based genetic algorithm optimization approach is a promising task which provides considerable insights in which feature combinations contribute to the overall readability prediction. Since we also have gold standard information available for those features requiring deep processing we are able to investigate the true upper bound of our Dutch system. Interestingly, we will observe that the performance of our fully-automatic readability prediction pipeline is on par with the pipeline using golden deep syntactic and semantic information

    Language Trees and Zipping

    Get PDF
    In this letter we present a very general method to extract information from a generic string of characters, e.g. a text, a DNA sequence or a time series. Based on data-compression techniques, its key point is the computation of a suitable measure of the remoteness of two bodies of knowledge. We present the implementation of the method to linguistic motivated problems, featuring highly accurate results for language recognition, authorship attribution and language classification.Comment: 5 pages, RevTeX4, 1 eps figure. In press in Phys. Rev. Lett. (January 2002
    • …
    corecore