Entropy and perpetual computers
A definition of entropy via the Kolmogorov algorithmic complexity is
discussed. As examples, we show how the mean-field theory for the Ising model
and the entropy of a perfect gas can be recovered. The connection with
computation is pointed out by paraphrasing the laws of thermodynamics for
computers. Also discussed is an approach that may be adopted to develop
statistical mechanics using the algorithmic point of view.
Comment: Based on Chanchal Majumdar memorial lectures given in Kolkata. 9
pages, 3 eps figures. For publication in "Physics Teacher"; v2. Sec 3
fragmented into smaller subsections
Text segmentation with character-level text embeddings
Learning word representations has recently seen much success in computational
linguistics. However, assuming sequences of word tokens as input to linguistic
analysis is often unjustified. For many languages word segmentation is a
non-trivial task and naturally occurring text is sometimes a mixture of natural
language strings and other character data. We propose to learn text
representations directly from raw character sequences by training a Simple
Recurrent Network to predict the next character in text. The network uses its
hidden layer to evolve abstract representations of the character sequences it
sees. To demonstrate the usefulness of the learned text embeddings, we use them
as features in a supervised character-level text segmentation and labeling
task: recognizing spans of text containing programming language code. By using
the embeddings as features we are able to substantially improve over a baseline
which uses only surface character n-grams.
Comment: Workshop on Deep Learning for Audio, Speech and Language Processing,
ICML 201
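
The next-character prediction setup described in this abstract can be illustrated with a minimal sketch of one forward pass through an Elman-style simple recurrent network. This is not the authors' implementation: the weights are random and untrained, and the names `srn_step` and `embed`, the hidden size, and the initialization range are all assumptions made for illustration. In a trained network, the per-character hidden states returned by `embed` would play the role of the text embeddings.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random weight matrix stored as a list of rows (untrained)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    """Matrix-vector product for list-of-rows matrices."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def srn_step(char_index, h_prev, W_xh, W_hh, W_hy):
    """One recurrent step: returns (new hidden state, next-char distribution)."""
    # With a one-hot input, W_xh @ x is simply one column of W_xh.
    x_part = [row[char_index] for row in W_xh]
    h_part = matvec(W_hh, h_prev)
    h = [math.tanh(a + b) for a, b in zip(x_part, h_part)]
    # Softmax over the vocabulary, shifted by the max logit for stability.
    logits = matvec(W_hy, h)
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return h, [e / total for e in exps]


def embed(text, vocab, hidden_size=8, seed=0):
    """Run the (untrained) network over text; collect one hidden state per char."""
    rng = random.Random(seed)
    W_xh = make_matrix(hidden_size, len(vocab), rng)
    W_hh = make_matrix(hidden_size, hidden_size, rng)
    W_hy = make_matrix(len(vocab), hidden_size, rng)
    index = {c: i for i, c in enumerate(vocab)}
    h = [0.0] * hidden_size
    states, probs = [], None
    for ch in text:
        h, probs = srn_step(index[ch], h, W_xh, W_hh, W_hy)
        states.append(h)
    return states, probs


states, probs = embed("abca", "abc")
print(len(states), len(states[0]))
```

Training would adjust the three weight matrices to minimize the cross-entropy between `probs` and the character that actually follows; that step is omitted here.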
Dimensional Mutation and Spacelike Singularities
I argue that string theory compactified on a Riemann surface crosses over at
small volume to a higher dimensional background of supercritical string theory.
Several concrete measures of the count of degrees of freedom of the theory
yield the consistent result that at finite volume, the effective dimensionality
is increased by an amount of order $2h/V$ for a surface of genus $h$ and volume
$V$ in string units. This arises in part from an exponentially growing density
of states of winding modes supported by the fundamental group, and passes an
interesting test of modular invariance. Further evidence for a plethora of
examples with the spacelike singularity replaced by a higher dimensional phase
arises from the fact that the sigma model on a Riemann surface can be naturally
completed by many gauged linear sigma models, whose RG flows approximate time
evolution in the full string backgrounds arising from this in the limit of
large dimensionality. In recent examples of spacelike singularity resolution by
tachyon condensation, the singularity is ultimately replaced by a phase with
all modes becoming heavy and decoupling. In the present case, the opposite
behavior ensues: more light degrees of freedom arise in the small radius
regime. I comment on the emerging zoology of cosmological singularities that
results.
Comment: 15 pages, harvmac big. v2: 18 pages, harvmac big; added computation
of density of states and modular invariance check, enhanced discussion of
multiplicity of solutions all sharing the feature of increased density of
states, added reference
Artificial Sequences and Complexity Measures
In this paper we exploit concepts of information theory to address the
fundamental problem of identifying and defining the most suitable tools to
extract, in an automatic and agnostic way, information from a generic string of
characters. We introduce in particular a class of methods which use in a
crucial way data compression techniques in order to define a measure of
remoteness and distance between pairs of sequences of characters (e.g. texts)
based on their relative information content. We also discuss in detail how
specific features of data compression techniques could be used to introduce the
notion of dictionary of a given sequence and of Artificial Text and we show how
these new tools can be used for information extraction purposes. We point out
the versatility and generality of our method that applies to any kind of
corpora of character strings independently of the type of coding behind them.
We consider as a case study linguistically motivated problems and we present
results for automatic language recognition, authorship attribution and
self-consistent classification.
Comment: Revised version, with major changes, of previous "Data Compression
approach to Information Extraction and Classification" by A. Baronchelli and
V. Loreto. 15 pages; 5 figures
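
The idea of measuring the distance between two character sequences with a compressor can be sketched as follows. Note that this uses the normalized compression distance (NCD), a related and widely used formulation, and not necessarily the measure defined in the paper. The choice of `zlib` as the compressor and the function names are assumptions made for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (maximum compression level)."""
    return len(zlib.compress(data, 9))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two strings.

    Values near 0 suggest that the strings share most of their information;
    values near 1 suggest that they share little.
    """
    xa, xb = a.encode("utf-8"), b.encode("utf-8")
    ca, cb = compressed_size(xa), compressed_size(xb)
    cab = compressed_size(xa + xb)
    return (cab - min(ca, cb)) / max(ca, cb)


english = "the quick brown fox jumps over the lazy dog " * 20
digits = "".join(str(i * i) for i in range(300))

print(ncd(english, english))  # small: the second copy adds little new information
print(ncd(english, digits))   # larger: the two sequences have little in common
```

A real compressor only approximates information content, so the distance of a string to itself is small but not exactly zero, and results depend on the compressor and on sequence length.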
Wrapper Maintenance: A Machine Learning Approach
The proliferation of online information sources has led to an increased use
of wrappers for extracting data from Web sources. While most of the previous
research has focused on quick and efficient generation of wrappers, the
development of tools for wrapper maintenance has received less attention. This
is an important research problem because Web sources often change in ways that
prevent the wrappers from extracting data correctly. We present an efficient
algorithm that learns structural information about data from positive examples
alone. We describe how this information can be used for two wrapper maintenance
applications: wrapper verification and reinduction. The wrapper verification
system detects when a wrapper is not extracting correct data, usually because
the Web source has changed its format. The reinduction algorithm automatically
recovers from changes in the Web source by identifying data on Web pages so
that a new wrapper may be generated for this source. To validate our approach,
we monitored 27 wrappers over a period of a year. The verification algorithm
correctly discovered 35 of the 37 wrapper changes, and made 16 mistakes,
resulting in precision of 0.73 and recall of 0.95. We validated the reinduction
algorithm on ten Web sources. We were able to successfully reinduce the
wrappers, obtaining precision and recall values of 0.90 and 0.80 on the data
extraction task.
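
As a quick check of the verification figures, the reported recall follows directly from the 35 detected changes out of 37. The sketch below uses the standard definitions of precision and recall; the function names are illustrative. The abstract does not say how the 16 mistakes divide into false alarms and misses, so the precision of 0.73 is not recomputed here.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Fraction of reported changes that were real changes."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos: int, false_neg: int) -> float:
    """Fraction of real changes that were detected."""
    return true_pos / (true_pos + false_neg)


detected = 35           # wrapper changes correctly discovered (from the abstract)
total_changes = 37      # wrapper changes that occurred (from the abstract)
missed = total_changes - detected

print(round(recall(detected, missed), 2))
```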