Large-Scale Pattern-Based Information Extraction from the World Wide Web
Extracting information from text is the task of obtaining structured, machine-processable facts from information that is mentioned in an unstructured manner. It thus allows systems to automatically aggregate information for further analysis, efficient retrieval, automatic validation, or appropriate visualization. This work explores the potential of using textual patterns for Information Extraction from the World Wide Web
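As a toy illustration of what pattern-based extraction means here (a sketch of the general technique, not the thesis's own system), a single textual pattern such as "X such as Y" already turns raw text into structured is-a facts:

```python
import re

# A Hearst-style textual pattern: "<class> such as <instance>(, <instance>)*"
PATTERN = re.compile(r"(\w+) such as (\w+(?:, \w+)*)")

def extract_isa(text):
    """Return (instance, class) pairs matched by the 'such as' pattern."""
    facts = []
    for cls, instances in PATTERN.findall(text):
        for inst in instances.split(", "):
            facts.append((inst, cls))
    return facts

print(extract_isa("We visited cities such as Paris, Berlin and capitals such as Rome."))
# → [('Paris', 'cities'), ('Berlin', 'cities'), ('Rome', 'capitals')]
```

Applied at Web scale, each pattern match contributes one candidate fact, which is why pattern precision and redundancy across pages matter so much.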
Random Sequence Perception Amongst Finance and Accounting Personnel: Can We Measure Illusion Of Control, A Type I Error, or Illusion Of Chaos, A Type II Error?
The purpose of this dissertation was to determine whether finance and accounting personnel could distinguish between random and non-random time-series strings, and what types of errors they would make. These individuals, averaging 13 years of experience, were unable to distinguish non-random patterns from random strings in an assessment composed of statistical process control (SPC) charts. Respondents scored no better than guessing, which was also assessed with a series of true-false questions. Neither over-alternation (oscillation) nor under-alternation (trend) strategies were able to predict type I or type II error rates, i.e. illusion of control or illusion of chaos. Latent class analysis methods within partial least squares structural equation modeling (PLS-SEM) succeeded in uncovering segments, or groups of respondents, with large explained variance and significant path models. Relationships between desirability of control, personal fear of invalidity, and error rates were more varied than expected. Still, some segments tended toward illusion of control while others tended toward illusion of chaos. Similar effects were also observed when substituting a true-false guessing assessment for the SPC assessment, with some loss of explained variance and weaker path coefficients. Respondents also provided their perceptions and thoughts about randomness for both the SPC and true-false assessments
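The over-alternation and under-alternation tendencies mentioned above can be quantified with a standard runs test on a binary sequence; the following sketch (illustrative data, not the dissertation's instrument) compares the observed number of runs with its expectation under randomness:

```python
import math

def runs_test(seq):
    """Compare the observed number of runs in a two-symbol sequence with its
    expectation under randomness. More runs than expected suggests
    over-alternation (oscillation); fewer suggests under-alternation (trend)."""
    n1 = seq.count(seq[0])
    n2 = len(seq) - n1
    # Observed runs: maximal blocks of identical symbols.
    runs = 1 + sum(1 for a, b in zip(seq, seq[1:]) if a != b)
    # Expected runs and variance under the null hypothesis of randomness.
    expected = 1 + 2 * n1 * n2 / (n1 + n2)
    variance = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)) / ((n1 + n2) ** 2 * (n1 + n2 - 1))
    z = (runs - expected) / math.sqrt(variance)
    return runs, expected, z

# A strictly oscillating sequence yields a large positive z: over-alternation.
runs, expected, z = runs_test("ABABABABABAB")
print(runs, round(expected, 1), round(z, 2))  # → 12 7.0 3.03
```

A large positive z corresponds to judging oscillation "random" too readily; a large negative z corresponds to seeing trends as patterns.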
Eliminating Useless Parts in Semi-structured Documents Using Alternation Counts
Discovery Science: 4th International Conference, DS 2001, Washington, DC, USA, November 25-28, 2001. Proceedings. We propose a preprocessing method for Web mining which, given semi-structured documents with the same structure and style, distinguishes useless parts from non-useless parts in each document without any prior knowledge of the documents. It is based on the simple idea that any n-gram is useless if it appears frequently. To decide an appropriate pair of length and frequency, we introduce a new statistical measure, the alternation count: the number of alternations between useless parts and non-useless parts. Given news articles written in English or Japanese together with some non-articles, the algorithm eliminates frequent n-grams used for the structure and style of articles and extracts the news contents and headlines with more than 97% accuracy if articles are collected from the same site. Even if input articles are collected from different sites, the algorithm extracts the contents of articles from these sites with at least 95% accuracy. Thus, the algorithm does not depend on the language, is robust against noise, and is applicable to multiple formats
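A minimal sketch of the frequency-based idea (an illustration under simplified assumptions, not the authors' implementation): flag every line containing an n-gram that is frequent across the collection as "useless", then score the segmentation by counting alternations between useless and non-useless lines.

```python
from collections import Counter

def ngrams(text, n):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def useless_flags(docs, n, freq):
    """Flag each line of each document as useless (True) when it contains a
    character n-gram occurring at least `freq` times in the whole collection."""
    counts = Counter(g for doc in docs for line in doc for g in ngrams(line, n))
    return [[any(counts[g] >= freq for g in ngrams(line, n)) for line in doc]
            for doc in docs]

def alternation_count(flags):
    """Number of switches between useless and non-useless parts."""
    return sum(1 for a, b in zip(flags, flags[1:]) if a != b)

# Two toy 'pages' sharing boilerplate; only the middle line is unique content.
docs = [["MENU HOME ABOUT", "today's unique story text", "COPYRIGHT FOOTER"],
        ["MENU HOME ABOUT", "another unrelated article!!", "COPYRIGHT FOOTER"]]
flags = useless_flags(docs, n=8, freq=2)
print(flags[0], alternation_count(flags[0]))  # → [True, False, True] 2
```

In the paper's setting, the alternation count is used to select a good (length, frequency) pair automatically: a segmentation that flips between useless and non-useless parts too often is a sign of a poorly chosen pair.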
Autoencoders for natural language semantics
Autoencoders are artificial neural networks that learn representations. In an autoencoder, the
encoder transforms an input into a representation, and the decoder tries to recover the input
from the representation. This thesis compiles three different applications of these models to
natural language processing: for learning word and sentence representations, as well as to
better understand compositionality.
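The encoder/decoder setup described above can be sketched in a few lines. This toy linear autoencoder (an illustration of the general mechanism, not any of the thesis's models) compresses 4-dimensional inputs into a 2-dimensional representation and is trained by gradient descent on the reconstruction error:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 4-D points that actually live on a 2-D subspace,
# so a 2-D bottleneck can in principle reconstruct them perfectly.
basis = rng.normal(size=(2, 4))
x = rng.normal(size=(200, 2)) @ basis

# Linear encoder (4 -> 2) and decoder (2 -> 4).
W_enc = rng.normal(scale=0.1, size=(4, 2))
W_dec = rng.normal(scale=0.1, size=(2, 4))
lr = 0.01
for _ in range(2000):
    z = x @ W_enc        # representation produced by the encoder
    x_hat = z @ W_dec    # reconstruction attempted by the decoder
    err = x_hat - x
    # Gradient steps on the mean squared reconstruction error.
    W_dec -= lr * z.T @ err / len(x)
    W_enc -= lr * x.T @ (err @ W_dec.T) / len(x)

loss = float(np.mean((x @ W_enc @ W_dec - x) ** 2))
print(round(loss, 4))
```

The nonlinear, sequence-to-sequence models analyzed in the papers are far richer, but the training signal is the same: the decoder's reconstruction error shapes the encoder's representation.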
In the first paper, we show that we can autoencode dictionary definitions to learn word
vectors, called definition embeddings. We propose a new penalty that allows us to use these
definition embeddings as inputs to the encoder itself, but also to blend them with pretrained
distributional vectors. The definition embeddings capture semantic similarity better than
distributional methods such as word2vec. Moreover, the encoder somewhat generalizes to
definitions unseen during training.
In the second paper, we analyze the representations learned by sequence-to-sequence
variational autoencoders. We find that the encoders tend to memorize the first few words
and the length of the input sentence. This drastically limits their usefulness as controllable
generative models. We also analyze simpler architectural variants that are agnostic to word
order, as well as pretraining-based methods. The representations that they learn tend to
encode global features such as topic and sentiment more markedly, and this shows in the
reconstructions they produce.
In the third paper, we use language emergence simulations to study compositionality. A
speaker – the encoder – observes an input and produces a message about it. A listener – the
decoder – tries to reconstruct what the speaker talked about in its message. We hypothesize
that producing sentences involving several entities, such as “John loves Mary”, fundamentally
requires perceiving each entity, John and Mary, as a distinct whole. We endow some agents
with this ability via an attention mechanism, and deprive others of it. We propose various
metrics to measure whether the languages are natural in terms of their argument structure,
and whether the languages are more analytic or synthetic. Agents perceiving entities as
distinct wholes exchange more natural messages than other agents
Structured Parallel Programming Using Trees
High-level abstractions for parallel programming are still immature. Computations on complicated data structures such as pointer structures are regarded as irregular algorithms. General graph structures, which irregular algorithms typically deal with, are difficult to divide and conquer. Because the divide-and-conquer paradigm is essential for load balancing in parallel algorithms and a key to parallel programming, general graphs are genuinely hard to handle. Trees, however, lead to divide-and-conquer computations by definition and are sufficiently general and powerful as a programming tool. We therefore deal with abstractions of tree-based computations. Our study started from Matsuzaki’s work on tree skeletons. We have improved the usability of tree skeletons by enriching their implementation. Specifically, we have dealt with two issues. First, we implemented a loose coupling between skeletons and data structures and developed a flexible tree skeleton library. Second, we implemented a parallelizer that transforms sequential recursive functions in C into parallel programs that use tree skeletons implicitly. This parallelizer hides the complicated API of tree skeletons and enables programmers to use them without burden. The practicality of tree skeletons, however, has still not been established. On the basis of observations from the practice of tree skeletons, we deal with two application domains: program analysis and neighborhood computation. In the domain of program analysis, compilers treat input programs as control-flow graphs (CFGs) and perform analysis on CFGs. Program analysis is therefore difficult to divide and conquer. To resolve this problem, we have developed divide-and-conquer methods for program analysis in a syntax-directed manner, on the basis of Rosen’s high-level approach.
Specifically, we have dealt with data-flow analysis based on Tarjan’s formalization and value-graph construction based on a functional formalization. In the domain of neighborhood computations, a primary issue is locality. A naive parallel neighborhood computation without locality enhancement causes many cache misses. The divide-and-conquer paradigm is known to be useful for locality enhancement as well. We have therefore applied algebraic formalizations and a tree-segmenting technique derived from tree skeletons to the locality enhancement of neighborhood computations.
University of Electro-Communications, 201
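The divide-and-conquer view of trees described above can be illustrated with a reduce-style skeleton (a sketch of the general idea, not the thesis's skeleton library): the computation splits at each node, and the subtrees can be processed independently, which is exactly what makes work division natural.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    value: int
    children: list = field(default_factory=list)

def tree_reduce(node, leaf, combine):
    """A reduce-style tree skeleton: apply `leaf` at each node's value and
    `combine` to fold in the results of the subtrees. Each subtree's result
    is computed independently, so subtrees could be assigned to workers."""
    acc = leaf(node.value)
    for child in node.children:
        acc = combine(acc, tree_reduce(child, leaf, combine))
    return acc

t = Node(1, [Node(2, [Node(4)]), Node(3)])
print(tree_reduce(t, leaf=lambda v: v, combine=lambda a, b: a + b))  # → 10
```

A real skeleton library additionally balances the division for skewed trees (e.g. long spines), which is where much of the implementation complexity hidden by the parallelizer lives.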
Old Northumbrian Verbal Morphology in the Glosses to the Lindisfarne Gospels
In considering a text such as the Lindisfarne Gospels, one is very much aware of the
vast philological attention the manuscript has received since the first contribution made
to its study by George Hickes in 1705. Since then, scholars of the stature of Bouterwek
(1857), Skeat (1871-87), Lindelöf (1901), Holmqvist (1922), Berndt (1956) and Ross,
Stanley & Brown (1960) have advanced the subject (see Ross 1937:17-25 for a detailed
summary of early studies on Lindisfarne). This Latin Gospelbook written in the North
of England in the early eighth century constitutes a major landmark of human cultural,
intellectual, spiritual and artistic achievement. While the Latin text of the Lindisfarne
Gospels is a valuable early witness to St Jerome’s ‘Vulgate’, it is the carefully inserted
interlinear gloss to the Latin, written in Old Northumbrian and added around the 950s-960s,
and the linguistic importance this gloss holds as one of the most substantial of the
earliest surviving renderings of early northern dialect that will concern us in this study,
and more concretely the distribution of verbal morphology found therein.
Old and Middle English verbal morphology in the northern dialects diverged
most remarkably from that of the southern dialects in two main areas. Crucially, the
tenth-century Northumbrian texts bear witness to the replacement of the inherited
present-indicative -ð suffixes with -s forms, and by the Middle English period, present-indicative
plural verbal morphology in northern dialects was governed by a
grammatical constraint commonly referred to as the Northern Subject Rule (NSR) that
conditioned verbal morphology according to the type and position of the subject. The
plural marker was -s unless the verb had an immediately adjacent personal pronoun
subject in which case the marker was the reduced -e or the zero morpheme, giving a
system whereby They play occurred in juxtaposition to The children plays, They who
plays, They eat and plays.
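The Northern Subject Rule as summarized above is mechanical enough to state as a small function. This sketch (an illustration with deliberately simplified inputs, not a claim about the dissertation's coding scheme) selects the present-plural marker from subject type and adjacency:

```python
def nsr_plural_marker(subject_is_personal_pronoun, subject_adjacent):
    """Northern Subject Rule (simplified): the present-plural marker is -s
    unless the verb has an immediately adjacent personal-pronoun subject,
    in which case the marker is reduced (-e or the zero morpheme)."""
    if subject_is_personal_pronoun and subject_adjacent:
        return "-e/Ø"
    return "-s"

# 'They play'          -> adjacent pronoun subject -> reduced marker
# 'The children plays' -> full NP subject          -> -s
# 'They eat and plays' -> second verb not adjacent to the pronoun -> -s
print(nsr_plural_marker(True, True),
      nsr_plural_marker(False, True),
      nsr_plural_marker(True, False))  # → -e/Ø -s -s
```

The empirical question the dissertation pursues is whether traces of exactly this conditioning are already detectable in the tenth-century gloss data.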
It has tacitly been assumed in the literature that the reduced forms at the crux of
the NSR, and the constraint that triggers them, must have emerged in the northern
dialects during the early Middle English period, as there is little indication of the
pattern existing in extant Northumbrian texts from the tenth century, and by the time
northern textual evidence is once again available from c.1300, the NSR is clearly
prevalent (Pietsch 2005; de Haas 2008; de Haas & van Kemenade 2009). Nevertheless, the assumption that the NSR was entirely lacking in Old Northumbrian stands on shaky
grounds without further detailed analysis of the tenth-century northern writings, as has
been pointed out in the literature (Benskin 2011:170). As might well be imagined, such
an endeavour is hindered by the fact that extant textual evidence from the period is far
from abundant, and that which remains is limited in nature: the only substantial
Northumbrian texts passed down to us are the interlinear glosses to the Latin
manuscripts of the Lindisfarne Gospels and the Durham Ritual supposedly written by
the same scribe, Aldred, in the second half of the tenth century, as well as the
Northumbrian part of the Rushworth Gospels gloss (Rushworth 2), written by a scribe
called Owun in the late tenth century and heavily reliant on the Lindisfarne gloss. Yet
despite their limitations, the glosses constitute a substantial record of late ONrth verbal
morphology that provides important insights into the mechanisms of linguistic change.
Although the study of the Northern Subject Rule in the early northern writings
has barely been touched upon in the literature (as far as I am aware the matter has only
been cursorily considered by de Haas 2008), morphological variation between -s as
opposed to -ð in the late Northumbrian texts has been the object of numerous
quantitative analyses (most famously Holmqvist 1922; Ross 1934; Blakeley 1949/50
and Berndt 1956). It is striking, however, that the vast majority of these studies were
written well over fifty years ago and the matter has not been thoroughly considered
since. A reconsideration of present-tense marking patterns in Old Northumbrian that
draws from the insights of recent research into variation and benefits from the
application of modern statistical methodology is clearly long overdue. Furthermore,
certain potentially relevant factors remain unexplored. For instance, while grammatical
person and number have been identified as important factors in conditioning variation
between the interdental and alveolar variants, the effect of subject type and adjacency
on morphological variation in Old Northumbrian has hitherto been disregarded. This is
despite the fact that research indicates that subject effects are a crucial factor in
determining the selection of verbal morphology, not just in non-standard varieties of
present-day English (cf. Chambers 2004; Tagliamonte 2009) and in varieties of
EModE, as discussed above, but also most notably in Middle English northern dialect
itself (McIntosh 1989; Montgomery 1994; de Haas & van Kemenade 2009; de Haas
2011).
Using data drawn from the standard edition of the Lindisfarne gloss (Skeat 1871-87) collated with the facsimile copy of the manuscript (Kendrick, T. D. et al.,
1960), this dissertation carries out a detailed study of the replacement of the interdental
fricative by the alveolar fricative which differs both methodologically and in
perspective from previous studies in several crucial ways. It constitutes the first study
to simultaneously examine the effects of all relevant phonetic, lexical and syntactic
variables on the process of change using statistical quantitative methodology. The
study approaches the issue from an innovative, hitherto disregarded perspective,
considering factors such as lexical conditioning and morphosyntactic priming, and paying
particular attention to the subject and adjacency effects of the so-called Northern
Subject Rule. By analysing the full breadth of possible language-internal explanatory
variables on the development of the alveolar fricative ending in late Old Northumbrian
and by applying statistical methodology, the study aims to elaborate and refine the
overall view presented in early studies and set the Northumbrian developments within a
broader framework of diachronic variation that will aid the verification of cross-linguistic
generalisations and further our understanding of regularisation processes. It
will be shown that the distribution of ONrth verbal morphology constitutes the first
attested manifestation of a tendency in English for subject type to compete with person
and number features for the function of grammatical material.
In addition to a variationist study of -ð and -s forms, this dissertation also carries
out a contextual and quantitative analysis of reduced morphology in the Old
Northumbrian interlinear gloss to the Lindisfarne Gospels. It looks in detail at reduced
forms in the Lindisfarne gloss and considers to what extent the nature and distribution
of these forms are indicative of the incipient development of the ME -s versus -e/Ø
NSR pattern in late Old Northumbrian. I also assess to what extent inflectional
morphology already present in the northern dialects constitutes the historical source for
the occurrence of -e/Ø/n in the present indicative. To this end, I posit that not only
present-subjunctive morphology but also preterite-present and preterite-indicative
verbal morphology played an important role in perpetuating the levelling of reduced
forms and -n into the present indicative. I show that the subject and adjacency effects at
the heart of the NSR appear not only to govern the occurrence of reduced morphology
in the present indicative as a low-frequency variant but also to condition the distribution
of reduced verbal morphology in the preterite.
A further question that will be examined in this dissertation involves the contentious issue of the authorship of the glosses to Lindisfarne and whether or not the
interlinear gloss of the Lindisfarne Gospels was the work of a single hand, Aldred
(Ross, Stanley & Brown 1960; Brunner 1947/48; van Bergen 2008). To this end, I will
consider the utility of language variation as a diagnostic for determining the authorship
and more specifically, what light is shed upon this unresolved problem of Old English
philology by the distribution of variant verbal forms in Li.
Another aspect under consideration relates to methodology and the unreliability
of the text editions of medieval sources for linguistic research. In general, editions are
unsuitable as sources unless they are collated with the raw data of the original
manuscript because, as van der Hoek (2010) points out, they tend to involve “a
reconstruction of a non-extant version of the text in question by selecting and altering
from among the different surviving versions, in the attempt to arrive at a text that is
purer from either a literary or philological point of view.” The edition in question, in the
case of the Lindisfarne Gospels, is that of Skeat (1871-87) which relies on the sole
version of Li. but whose language and grammar have nevertheless been subjected to
editorial interpretation and alteration.
- …