
    Entropy and perpetual computers

    A definition of entropy via the Kolmogorov algorithmic complexity is discussed. As examples, we show how the mean-field theory for the Ising model and the entropy of a perfect gas can be recovered. The connection with computation is pointed out by paraphrasing the laws of thermodynamics for computers. Also discussed is an approach that may be adopted to develop statistical mechanics using the algorithmic point of view. Comment: Based on Chanchal Majumdar memorial lectures given in Kolkata. 9 pages, 3 eps figures. For publication in "Physics Teacher"; v2. Sec 3 fragmented into smaller subsection

    Text segmentation with character-level text embeddings

    Learning word representations has recently seen much success in computational linguistics. However, assuming sequences of word tokens as input to linguistic analysis is often unjustified. For many languages word segmentation is a non-trivial task, and naturally occurring text is sometimes a mixture of natural language strings and other character data. We propose to learn text representations directly from raw character sequences by training a simple recurrent network to predict the next character in text. The network uses its hidden layer to evolve abstract representations of the character sequences it sees. To demonstrate the usefulness of the learned text embeddings, we use them as features in a supervised character-level text segmentation and labeling task: recognizing spans of text containing programming language code. By using the embeddings as features we are able to substantially improve over a baseline which uses only surface character n-grams. Comment: Workshop on Deep Learning for Audio, Speech and Language Processing, ICML 201
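The surface character n-gram baseline mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not give the feature template, so the window size, the maximum n-gram length, the feature string format, and the function name `char_ngram_features` are all assumptions made for illustration only.

```python
from typing import List


def char_ngram_features(text: str, i: int, max_n: int = 3, window: int = 2) -> List[str]:
    """Return surface n-gram features for the character at position i.

    Hypothetical template (not taken from the paper): every n-gram of
    length 1..max_n that lies entirely inside the window
    [i - window, i + window], tagged with its length and its offset
    relative to position i.
    """
    feats = []
    for n in range(1, max_n + 1):
        # Last valid start keeps the n-gram inside the right window edge.
        for start in range(i - window, i + window - n + 2):
            if start < 0 or start + n > len(text):
                continue  # n-gram would run past the text boundary
            feats.append(f"{n}|{start - i}|{text[start:start + n]}")
    return feats


if __name__ == "__main__":
    # Features for the character "c" in "abcde", unigrams and bigrams only.
    print(char_ngram_features("abcde", 2, max_n=2, window=1))
```

Each character position gets its own feature list, which a sequence labeller could consume one position at a time; learned embeddings would be appended to, or used in place of, these strings.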

    Dimensional Mutation and Spacelike Singularities

    I argue that string theory compactified on a Riemann surface crosses over at small volume to a higher dimensional background of supercritical string theory. Several concrete measures of the count of degrees of freedom of the theory yield the consistent result that at finite volume, the effective dimensionality is increased by an amount of order $2h/V$ for a surface of genus $h$ and volume $V$ in string units. This arises in part from an exponentially growing density of states of winding modes supported by the fundamental group, and passes an interesting test of modular invariance. Further evidence for a plethora of examples with the spacelike singularity replaced by a higher dimensional phase arises from the fact that the sigma model on a Riemann surface can be naturally completed by many gauged linear sigma models, whose RG flows approximate time evolution in the full string backgrounds arising from this in the limit of large dimensionality. In recent examples of spacelike singularity resolution by tachyon condensation, the singularity is ultimately replaced by a phase with all modes becoming heavy and decoupling. In the present case, the opposite behavior ensues: more light degrees of freedom arise in the small radius regime. I comment on the emerging zoology of cosmological singularities that results. Comment: 15 pages, harvmac big. v2: 18 pages, harvmac big; added computation of density of states and modular invariance check, enhanced discussion of multiplicity of solutions all sharing the feature of increased density of states, added reference

    Artificial Sequences and Complexity Measures

    In this paper we exploit concepts of information theory to address the fundamental problem of identifying and defining the most suitable tools to extract, in an automatic and agnostic way, information from a generic string of characters. We introduce in particular a class of methods which rely crucially on data compression techniques to define a measure of remoteness and distance between pairs of sequences of characters (e.g. texts) based on their relative information content. We also discuss in detail how specific features of data compression techniques could be used to introduce the notion of the dictionary of a given sequence and of Artificial Text, and we show how these new tools can be used for information extraction purposes. We point out the versatility and generality of our method, which applies to any kind of corpora of character strings independently of the type of coding behind them. We consider as a case study linguistically motivated problems and we present results for automatic language recognition, authorship attribution and self-consistent classification. Comment: Revised version, with major changes, of previous "Data Compression approach to Information Extraction and Classification" by A. Baronchelli and V. Loreto. 15 pages; 5 figure
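The general idea of a compression-based distance between two character sequences can be sketched as follows. This is not the measure defined in the paper, whose details the abstract does not give; it is the closely related normalized compression distance, computed here with `zlib` as an assumed stand-in compressor. The function names are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (maximum compression level)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Values near 0 suggest the sequences share most of their structure;
    values near 1 suggest they share little. The result is only an
    approximation that depends on the compressor used.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    english = b"the quick brown fox jumps over the lazy dog " * 20
    other = bytes(range(256)) * 4
    print("same text:     ", ncd(english, english))
    print("different data:", ncd(english, other))
```

Because zlib only looks back over a 32 KB window, this sketch is meaningful for short sequences; longer corpora would need a compressor with a larger context.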

    Wrapper Maintenance: A Machine Learning Approach

    The proliferation of online information sources has led to an increased use of wrappers for extracting data from Web sources. While most of the previous research has focused on quick and efficient generation of wrappers, the development of tools for wrapper maintenance has received less attention. This is an important research problem because Web sources often change in ways that prevent the wrappers from extracting data correctly. We present an efficient algorithm that learns structural information about data from positive examples alone. We describe how this information can be used for two wrapper maintenance applications: wrapper verification and reinduction. The wrapper verification system detects when a wrapper is not extracting correct data, usually because the Web source has changed its format. The reinduction algorithm automatically recovers from changes in the Web source by identifying data on Web pages so that a new wrapper may be generated for this source. To validate our approach, we monitored 27 wrappers over a period of a year. The verification algorithm correctly discovered 35 of the 37 wrapper changes, and made 16 mistakes, resulting in precision of 0.73 and recall of 0.95. We validated the reinduction algorithm on ten Web sources. We were able to successfully reinduce the wrappers, obtaining precision and recall values of 0.90 and 0.80 on the data extraction task