    Large-Scale Pattern-Based Information Extraction from the World Wide Web

    Extracting information from text is the task of obtaining structured, machine-processable facts from information that is mentioned in an unstructured manner. It thus allows systems to automatically aggregate information for further analysis, efficient retrieval, automatic validation, or appropriate visualization. This work explores the potential of using textual patterns for Information Extraction from the World Wide Web.
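    To give a flavour of what extraction with textual patterns means here, below is a minimal sketch using a single hand-written Hearst-style pattern. The pattern, example text, and regex are invented for illustration and are not taken from the thesis; large-scale systems induce and score many such patterns automatically.

```python
import re

# Toy Hearst-style pattern: "<class> such as <instances>" yields
# (instance, is-a, class) facts from plain text.
PATTERN = re.compile(r"(\w+) such as (\w+(?:(?:, | and )\w+)*)")

text = "The market sells fruits such as apples, pears and cherries."
for cls, instances in PATTERN.findall(text):
    for inst in re.split(r", | and ", instances):
        print((inst, "is-a", cls))
# -> ('apples', 'is-a', 'fruits'), ('pears', 'is-a', 'fruits'), ...
```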

    Random Sequence Perception Amongst Finance and Accounting Personnel: Can We Measure Illusion Of Control, A Type I Error, or Illusion Of Chaos, A Type II Error?

    The purpose of this dissertation was to determine whether finance and accounting personnel could distinguish between random and non-random time-series strings, and what types of errors they would make. These individuals, averaging 13 years of experience, were unable to distinguish non-random patterns from random strings in an assessment composed of statistical process control (SPC) charts. Respondents scored no better than chance, which was also assessed with a series of true-false questions. Neither over-alternation (oscillation) nor under-alternation (trend) strategies predicted type I or type II error rates, i.e., illusion of control or illusion of chaos. Latent class analysis methods within partial least squares structural equation modeling (PLS-SEM) were successful in uncovering segments of respondents with large explained variance and significant path models. Relationships between desirability of control, personal fear of invalidity, and error rates were more varied than expected; still, some segments tended toward illusion of control while others tended toward illusion of chaos. Similar effects were observed when substituting a true-false guessing assessment for the SPC assessment, with some loss of explained variance and weaker path coefficients. Respondents also provided their perceptions of randomness for both the SPC and true-false assessments.
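    The over- versus under-alternation contrast has a classical statistical counterpart. As an illustrative sketch (the Wald-Wolfowitz runs test, not necessarily the dissertation's own instrument), one can score how strongly a two-valued string's alternation behaviour deviates from random expectation:

```python
import math

def runs_z_score(seq):
    """Wald-Wolfowitz runs test for a two-valued sequence.
    Large positive z: more alternations than chance (oscillation);
    large negative z: fewer alternations than chance (trend)."""
    assert len(set(seq)) == 2, "sequence must contain exactly two symbols"
    n1 = sum(1 for x in seq if x == seq[0])
    n2 = len(seq) - n1
    runs = 1 + sum(1 for a, b in zip(seq, seq[1:]) if a != b)
    mu = 2 * n1 * n2 / (n1 + n2) + 1           # expected number of runs
    var = (mu - 1) * (mu - 2) / (n1 + n2 - 1)  # variance of the run count
    return (runs - mu) / math.sqrt(var)

# Calling a genuinely random string "non-random" is a type I error
# (illusion of control); missing a real pattern is a type II error
# (illusion of chaos).
print(runs_z_score("UDUDUDUDUDUD"))  # strongly oscillating -> large positive z
```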

    Eliminating Useless Parts in Semi-structured Documents Using Alternation Counts

    Discovery Science: 4th International Conference, DS 2001, Washington, DC, USA, November 25-28, 2001, Proceedings. We propose a preprocessing method for Web mining which, given semi-structured documents with the same structure and style, distinguishes useless parts from non-useless parts in each document without any prior knowledge of the documents. It is based on the simple idea that an n-gram is useless if it appears frequently. To decide an appropriate pair of length n and frequency f, we introduce a new statistical measure, the alternation count: the number of alternations between useless parts and non-useless parts. Given news articles written in English or Japanese mixed with some non-articles, the algorithm eliminates the frequent n-grams used for the structure and style of articles and extracts the news contents and headlines with more than 97% accuracy if the articles are collected from the same site. Even if input articles are collected from different sites, the algorithm extracts the contents of articles from these sites with at least 95% accuracy. Thus, the algorithm does not depend on the language, is robust to noise, and is applicable to multiple formats.
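    A toy re-implementation of the idea might look as follows. It works at the token level rather than on the paper's character-level n-grams, and the suggestion of picking the (n, f) pair that minimizes the alternation count is my reading of the approach, not the paper's exact selection rule:

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def useless_mask(doc, corpus, n, f):
    """Mark each token of `doc` as useless if some n-gram covering it
    occurs in at least f documents of the corpus."""
    df = Counter()
    for d in corpus:
        df.update(set(ngrams(d, n)))
    mask = [False] * len(doc)
    for i, g in enumerate(ngrams(doc, n)):
        if df[g] >= f:
            mask[i:i + n] = [True] * n
    return mask

def alternation_count(mask):
    """Number of switches between useless and non-useless runs: a good
    (n, f) pair should split a page into few contiguous blocks."""
    return sum(1 for a, b in zip(mask, mask[1:]) if a != b)

corpus = [("MENU " + body + " FOOTER COPYRIGHT").split()
          for body in ("real news text here", "another story body")]
mask = useless_mask(corpus[0], corpus, n=2, f=2)
print(mask, alternation_count(mask))  # boilerplate tail marked, 1 alternation
```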

    Autoencoders for natural language semantics

    Autoencoders are artificial neural networks that learn representations. In an autoencoder, the encoder transforms an input into a representation, and the decoder tries to recover the input from the representation. This thesis compiles three different applications of these models to natural language processing: for learning word and sentence representations, as well as to better understand compositionality. In the first paper, we show that we can autoencode dictionary definitions to learn word vectors, called definition embeddings. We propose a new penalty that allows us to use these definition embeddings as inputs to the encoder itself, but also to blend them with pretrained distributional vectors. The definition embeddings capture semantic similarity better than distributional methods such as word2vec. Moreover, the encoder somewhat generalizes to definitions unseen during training.
In the second paper, we analyze the representations learned by sequence-to-sequence variational autoencoders. We find that the encoders tend to memorize the first few words and the length of the input sentence, which drastically limits their usefulness as controllable generative models. We also analyze simpler architectural variants that are agnostic to word order, as well as pretraining-based methods. The representations they learn tend to encode global features such as topic and sentiment more markedly, and this shows in the reconstructions they produce. In the third paper, we use language emergence simulations to study compositionality. A speaker (the encoder) observes an input and produces a message about it. A listener (the decoder) tries to reconstruct what the speaker talked about from its message. We hypothesize that producing sentences involving several entities, such as "John loves Mary", fundamentally requires perceiving each entity, John and Mary, as a distinct whole. We endow some agents with this ability via an attention mechanism, and deprive others of it. We propose various metrics to measure whether the agents' languages are natural in terms of their argument structure, and whether they are more analytic or synthetic. Agents perceiving entities as distinct wholes exchange more natural messages than other agents.
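    For readers new to the setup, a minimal encoder/decoder pair in PyTorch looks like the sketch below. This is a bag-of-words toy for intuition only; the thesis's models are sequence-based and, in the second paper, variational.

```python
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    """The encoder maps an input (here a word-count vector standing in
    for a dictionary definition) to a dense code; the decoder tries to
    reconstruct the input from that code."""
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.encoder = nn.Linear(vocab_size, dim)
        self.decoder = nn.Linear(dim, vocab_size)

    def forward(self, x):
        z = torch.tanh(self.encoder(x))  # the learned representation
        return self.decoder(z), z

model = TinyAutoencoder(vocab_size=1000)
x = torch.rand(8, 1000)                  # a fake batch of 8 "definitions"
recon, z = model(x)
loss = nn.functional.mse_loss(recon, x)  # reconstruction objective
loss.backward()
```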

    Structured Parallel Programming Using Trees (木を用いた構造化並列プログラミング)

    High-level abstractions for parallel programming are still immature. Computations on complicated data structures such as pointer structures are considered irregular algorithms. The general graph structures that irregular algorithms typically deal with are difficult to divide and conquer. Because the divide-and-conquer paradigm is essential for load balancing in parallel algorithms and a key to parallel programming, general graphs are genuinely difficult to handle. Trees, however, lead to divide-and-conquer computations by definition and are sufficiently general and powerful as a programming tool. We therefore deal with abstractions of tree-based computations. Our study started from Matsuzaki's work on tree skeletons. We have improved the usability of tree skeletons by enriching their implementations. Specifically, we have dealt with two issues. First, we implemented a loose coupling between skeletons and data structures and developed a flexible tree skeleton library. Second, we implemented a parallelizer that transforms sequential recursive functions in C into parallel programs that use tree skeletons implicitly. This parallelizer hides the complicated API of tree skeletons and lets programmers use them with no burden. Unfortunately, the practicality of tree skeletons has nevertheless not improved. On the basis of observations from this practice with tree skeletons, we deal with two application domains: program analysis and neighborhood computation. In the domain of program analysis, compilers treat input programs as control-flow graphs (CFGs) and perform analysis on CFGs; program analysis is therefore difficult to divide and conquer. To resolve this problem, we developed divide-and-conquer methods for program analysis in a syntax-directed manner on the basis of Rosen's high-level approach. Specifically, we dealt with data-flow analysis based on Tarjan's formalization and value-graph construction based on a functional formalization. In the domain of neighborhood computations, a primary issue is locality: a naive parallel neighborhood computation without locality enhancement causes many cache misses. The divide-and-conquer paradigm is known to be useful for locality enhancement as well. We therefore applied algebraic formalizations and a tree-segmenting technique derived from tree skeletons to the locality enhancement of neighborhood computations.
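    As a flavour of why trees "lead to divide-and-conquer computations by definition", here is a sequential sketch of a reduce-style tree skeleton. It is an illustration of the idea only, not Matsuzaki's library, whose skeletons also rebalance unbalanced trees for load balancing:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    value: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def tree_reduce(node, leaf, combine):
    """Reduce skeleton: the two recursive calls work on disjoint
    subtrees, so a skeleton library may evaluate them in parallel."""
    if node is None:
        return leaf
    l = tree_reduce(node.left, leaf, combine)   # independent subtask
    r = tree_reduce(node.right, leaf, combine)  # independent subtask
    return combine(l, node.value, r)

t = Node(1, Node(2), Node(3, Node(4)))
print(tree_reduce(t, 0, lambda l, v, r: l + v + r))  # 1+2+3+4 = 10
```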

    Old Northumbrian verbal morphology in the glosses to the Lindisfarne Gospels

    In considering a text such as the Lindisfarne Gospels, one is very much aware of the vast philological attention the manuscript has received since the first contribution to its study by George Hickes in 1705. Since then, scholars of the stature of Bouterwek (1857), Skeat (1871-87), Lindelöf (1901), Holmqvist (1922), Berndt (1956) and Ross, Stanley & Brown (1960) have advanced the subject (see Ross 1937:17-25 for a detailed summary of early studies on Lindisfarne). This Latin Gospelbook, written in the North of England in the early eighth century, constitutes a major landmark of human cultural, intellectual, spiritual and artistic achievement. While the Latin text of the Lindisfarne Gospels is a valuable early witness to St Jerome's 'Vulgate', what will concern us in this study is the carefully inserted interlinear gloss to the Latin, written in Old Northumbrian and added around the 950s-960s, and the linguistic importance this gloss holds as one of the earliest substantial surviving renderings of the early northern dialect; more concretely, the distribution of verbal morphology found therein. Old and Middle English verbal morphology in the northern dialects diverged most remarkably from that of the southern dialects in two main areas. Crucially, the tenth-century Northumbrian texts bear witness to the replacement of the inherited present-indicative -ð suffixes with -s forms, and by the Middle English period, present-indicative plural verbal morphology in the northern dialects was governed by a grammatical constraint commonly referred to as the Northern Subject Rule (NSR), which conditioned verbal morphology according to the type and position of the subject. The plural marker was -s unless the verb had an immediately adjacent personal pronoun subject, in which case the marker was the reduced -e or the zero morpheme, giving a system whereby They play occurred in juxtaposition to The children plays, They who plays, and They eat and plays. It has tacitly been assumed in the literature that the reduced forms at the crux of the NSR, and the constraint that triggers them, must have emerged in the northern dialects during the early Middle English period, as there is little indication of the pattern existing in extant Northumbrian texts from the tenth century, and by the time northern textual evidence is once again available from c.1300, the NSR is clearly prevalent (Pietsch 2005; de Haas 2008; de Haas & van Kemenade 2009). Nevertheless, the assumption that the NSR was entirely lacking in Old Northumbrian stands on shaky ground without further detailed analysis of the tenth-century northern writings, as has been pointed out in the literature (Benskin 2011:170). Such an endeavour is hindered by the fact that extant textual evidence from the period is far from abundant, and that which remains is limited in nature: the only substantial Northumbrian texts passed down to us are the interlinear glosses to the Latin manuscripts of the Lindisfarne Gospels and the Durham Ritual, supposedly written by the same scribe, Aldred, in the second half of the tenth century, as well as the Northumbrian part of the Rushworth Gospels gloss (Rushworth 2), written by a scribe called Owun in the late tenth century and heavily reliant on the Lindisfarne gloss. Yet despite their limitations, the glosses constitute a substantial record of late Old Northumbrian (ONrth) verbal morphology that provides important insights into the mechanisms of linguistic change.
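    Stated operationally, the rule amounts to the small conditional below (a purely illustrative restatement of the constraint as just described; the parameter names are mine, and the reduced -e variant is collapsed into the zero ending):

```python
def nsr_present_plural(stem, subject_is_personal_pronoun, subject_adjacent):
    """Toy encoding of the Northern Subject Rule for present-indicative
    plurals: -s everywhere, except the reduced -e / zero ending with an
    immediately adjacent personal pronoun subject."""
    if subject_is_personal_pronoun and subject_adjacent:
        return stem            # zero morpheme (or reduced -e)
    return stem + "s"

print("They " + nsr_present_plural("play", True, True))           # They play
print("The children " + nsr_present_plural("play", False, True))  # The children plays
print("They who " + nsr_present_plural("play", True, False))      # They who plays
```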
Although the study of the Northern Subject Rule in the early northern writings has barely been touched upon in the literature (as far as I am aware, the matter has only been cursorily considered by de Haas 2008), morphological variation between -s and -ð in the late Northumbrian texts has been the object of numerous quantitative analyses (most famously Holmqvist 1922; Ross 1934; Blakeley 1949/50; and Berndt 1956). It is striking, however, that the vast majority of these studies were written well over fifty years ago, and the matter has not been thoroughly considered since. A reconsideration of present-tense marking patterns in Old Northumbrian that draws on the insights of recent research into variation and benefits from the application of modern statistical methodology is clearly long overdue. Furthermore, certain potentially relevant factors remain unexplored. For instance, while grammatical person and number have been identified as important factors conditioning variation between the interdental and alveolar variants, the effect of subject type and adjacency on morphological variation in Old Northumbrian has hitherto been disregarded. This is despite the fact that research indicates that subject effects are a crucial factor in determining the selection of verbal morphology, not just in non-standard varieties of present-day English (cf. Chambers 2004; Tagliamonte 2009) and in varieties of Early Modern English, but also, most notably, in the Middle English northern dialect itself (McIntosh 1989; Montgomery 1994; de Haas & van Kemenade 2009; de Haas 2011). Using data drawn from the standard edition of the Lindisfarne gloss (Skeat 1871-87) collated with the facsimile copy of the manuscript (Kendrick et al. 1960), this dissertation carries out a detailed study of the replacement of the interdental fricative by the alveolar fricative that differs both methodologically and in perspective from previous studies in several crucial ways. It constitutes the first study to simultaneously examine the effects of all relevant phonetic, lexical and syntactic variables on the process of change using statistical quantitative methodology. The study approaches the issue from an innovative, hitherto disregarded perspective: it considers factors such as lexical conditioning and morphosyntactic priming, and pays particular attention to the subject and adjacency effects of the so-called Northern Subject Rule. By analysing the full breadth of possible language-internal explanatory variables bearing on the development of the alveolar fricative ending in late Old Northumbrian, and by applying statistical methodology, the study aims to elaborate and refine the overall view presented in early studies and to set the Northumbrian developments within a broader framework of diachronic variation that will aid the verification of cross-linguistic generalisations and further our understanding of regularisation processes. It will be shown that the distribution of ONrth verbal morphology constitutes the first attested manifestation of a tendency in English for subject type to compete with person and number features in conditioning verbal morphology. In addition to a variationist study of -ð and -s forms, this dissertation also carries out a contextual and quantitative analysis of reduced morphology in the Old Northumbrian interlinear gloss to the Lindisfarne Gospels.
It looks in detail at reduced forms in the Lindisfarne gloss and considers to what extent the nature and distribution of these forms are indicative of the incipient development of the ME -s versus -e/Ø NSR pattern in late Old Northumbrian. I also assess to what extent inflectional morphology already present in the northern dialects constitutes the historical source for the occurrence of -e/Ø/n in the present indicative. To this end, I posit that not only present-subjunctive morphology but also preterite-present and preterite-indicative verbal morphology played an important role in perpetuating the levelling of reduced forms and -n into the present indicative. I show that the subject and adjacency effects at the heart of the NSR appear not only to govern the occurrence of reduced morphology in the present indicative as a low-frequency variant but also to condition the distribution of reduced verbal morphology in the preterite. A further question examined in this dissertation involves the contentious issue of the authorship of the glosses to Lindisfarne: whether or not the interlinear gloss of the Lindisfarne Gospels was the work of a single hand, Aldred (Ross, Stanley & Brown 1960; Brunner 1947/48; van Bergen 2008). To this end, I consider the utility of language variation as a diagnostic for determining authorship and, more specifically, what light is shed upon this unresolved problem of Old English philology by the distribution of variant verbal forms in Li. Another aspect under consideration relates to methodology and the unreliability of text editions of medieval sources for linguistic research. In general, editions are unsuitable as sources unless they are collated with the raw data of the original manuscript because, as van der Hoek (2010) points out, they tend to involve "a reconstruction of a non-extant version of the text in question by selecting and altering from among the different surviving versions, in the attempt to arrive at a text that is purer from either a literary or philological point of view." The edition in question, in the case of the Lindisfarne Gospels, is that of Skeat (1871-87), which relies on the sole surviving version of Li. but whose language and grammar have nevertheless been subjected to editorial interpretation and alteration.