
    Adaptive Sentence Boundary Disambiguation

    Labeling of sentence boundaries is a necessary prerequisite for many natural language processing tasks, including part-of-speech tagging and sentence alignment. End-of-sentence punctuation marks are ambiguous; to disambiguate them, most systems use brittle, special-purpose regular-expression grammars and exception rules. As an alternative, we have developed an efficient, trainable algorithm that uses a lexicon with part-of-speech probabilities and a feed-forward neural network. After training for less than one minute, the method correctly labels over 98.5% of sentence boundaries in a corpus of over 27,000 sentence-boundary marks. We show the method to be efficient and easily adaptable to different text genres, including single-case texts.
    Comment: This is a LaTeX version of the previously submitted PostScript file (formatted as a uuencoded gzip-compressed tar file created by a csh script). The software from the work described in this paper is available by contacting [email protected]
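The approach described above can be sketched minimally: represent the context of each period by the part-of-speech probability vectors of the surrounding words, then feed that vector to a trainable classifier. The tiny lexicon, tagset, and hand-set weights below are illustrative stand-ins, not the authors' trained network:

```python
# Toy lexicon: word -> P(tag) over a tiny, hypothetical tagset.
LEXICON = {
    "Dr.":   {"abbrev": 0.9, "noun": 0.1},
    "Smith": {"noun": 1.0},
    "said":  {"verb": 1.0},
    "The":   {"det": 1.0},
}
TAGS = ["abbrev", "det", "noun", "verb"]

def context_vector(prev_word, next_word):
    """Concatenate POS-probability vectors of the words flanking a period."""
    def probs(word):
        dist = LEXICON.get(word, {})
        return [dist.get(t, 0.0) for t in TAGS]
    return probs(prev_word) + probs(next_word)

def is_boundary(prev_word, next_word, weights, bias=0.0):
    """Single-layer stand-in for the paper's feed-forward network."""
    x = context_vector(prev_word, next_word)
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return score > 0.0

# Hand-set weights (illustrative only): penalise an abbreviation before
# the period, reward an ordinary word sequence across it.
weights = [-2.0, 0.0, 0.5, 0.5,   # prev-word tag probabilities
            0.0, 1.0, 1.0, 1.0]   # next-word tag probabilities

print(is_boundary("Dr.", "Smith", weights))   # -> False (abbreviation)
print(is_boundary("said", "The", weights))    # -> True (real boundary)
```

In the paper this classifier is a trained feed-forward network rather than hand-set weights; the point of the sketch is the feature representation, which needs only a lexicon of tag probabilities rather than hand-written exception rules.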

    Pair-Linking for Collective Entity Disambiguation: Two Could Be Better Than All

    Collective entity disambiguation aims to jointly resolve multiple mentions by linking them to their associated entities in a knowledge base. Previous works are primarily based on the underlying assumption that entities within the same document are highly related. However, the extent to which these mentioned entities are actually connected in reality is rarely studied and therefore raises interesting research questions. For the first time, we show that the semantic relationships between the mentioned entities are in fact less dense than expected. This could be attributed to several reasons, such as noise, data sparsity, and knowledge base incompleteness. As a remedy, we introduce MINTREE, a new tree-based objective for the entity disambiguation problem. The key intuition behind MINTREE is the concept of coherence relaxation, which uses the weight of a minimum spanning tree to measure the coherence between entities. Based on this new objective, we design a novel entity disambiguation algorithm which we call Pair-Linking. Instead of considering all the given mentions, Pair-Linking iteratively selects the pair with the highest confidence at each step for decision making. Via extensive experiments, we show that our approach is not only more accurate but also surprisingly faster than many state-of-the-art collective linking algorithms.
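The pair-at-a-time strategy above can be illustrated with a greedy sketch: at each step, commit the (mention, entity, mention, entity) pair with the highest combined score instead of optimising over all mentions jointly. The candidate lists, scores, and coherence table are hypothetical toy data, and the scoring is a simplification of the paper's confidence measure:

```python
from itertools import combinations

# Hypothetical inputs: per-mention candidate entities with local scores,
# plus a pairwise entity-coherence score (0.0 when absent).
candidates = {
    "m1": {"Paris_city": 0.6, "Paris_Hilton": 0.4},
    "m2": {"France": 0.9},
    "m3": {"Hilton_hotels": 0.5, "Hilton_surname": 0.5},
}
coherence = {
    frozenset({"Paris_city", "France"}): 0.9,
    frozenset({"Paris_Hilton", "Hilton_hotels"}): 0.7,
}

def pair_score(e1, e2, s1, s2):
    return s1 + s2 + coherence.get(frozenset({e1, e2}), 0.0)

def pair_link(candidates):
    """Greedy Pair-Linking sketch: repeatedly commit the most confident
    pair of assignments, never reasoning about all mentions at once."""
    unresolved = dict(candidates)
    assignment = {}
    while len(unresolved) >= 2:
        best = None
        for (m1, c1), (m2, c2) in combinations(unresolved.items(), 2):
            for e1, s1 in c1.items():
                for e2, s2 in c2.items():
                    score = pair_score(e1, e2, s1, s2)
                    if best is None or score > best[0]:
                        best = (score, m1, e1, m2, e2)
        _, m1, e1, m2, e2 = best
        assignment[m1], assignment[m2] = e1, e2
        del unresolved[m1], unresolved[m2]
    # A leftover odd mention falls back to its best local candidate.
    for m, cands in unresolved.items():
        assignment[m] = max(cands, key=cands.get)
    return assignment
```

Because each step only compares pairs, the work per decision stays small, which is one intuition for why the authors report speedups over fully collective linkers.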

    Russian word sense induction by clustering averaged word embeddings

    The paper reports our participation in the shared task on word sense induction and disambiguation for the Russian language (RUSSE-2018). Our team was ranked 2nd for the wiki-wiki dataset (containing mostly homonyms) and 5th for the bts-rnc and active-dict datasets (containing mostly polysemous words) among all 19 participants. The method we employed was extremely naive: it represented contexts of ambiguous words as averaged word-embedding vectors, using off-the-shelf pre-trained distributional models. These vector representations were then clustered with mainstream clustering techniques, producing groups corresponding to the ambiguous word's senses. As a side result, we show that word embedding models trained on small but balanced corpora can be superior to those trained on large but noisy data, not only in intrinsic evaluation but also in downstream tasks like word sense induction.
    Comment: Proceedings of the 24th International Conference on Computational Linguistics and Intellectual Technologies (Dialogue-2018)
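The pipeline described above, averaging context-word embeddings per occurrence and then clustering the occurrence vectors into senses, fits in a few lines. The 2-d "embeddings" and the plain k-means below are toy stand-ins for a pre-trained distributional model and a mainstream clustering library:

```python
# Hypothetical 2-d word vectors standing in for a pre-trained model.
EMBEDDINGS = {
    "river": [1.0, 0.0], "water": [0.9, 0.1],
    "money": [0.0, 1.0], "loan":  [0.1, 0.9],
}

def context_vector(context):
    """Average the embeddings of the context words of one occurrence."""
    vecs = [EMBEDDINGS[w] for w in context if w in EMBEDDINGS]
    return [sum(c) / len(vecs) for c in zip(*vecs)]

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k=2, iters=10):
    """Plain k-means with naive initialisation (first k points)."""
    centroids = points[:k]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(p, centroids[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels

# Four contexts of the ambiguous word "bank": two riverbank, two financial.
contexts = [["river", "water"], ["water"], ["money", "loan"], ["loan"]]
labels = kmeans([context_vector(c) for c in contexts])
```

Each resulting cluster label plays the role of an induced sense; at test time a new occurrence would be assigned to the nearest centroid.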

    Prosodic Boundary Effects on Syntactic Disambiguation in Children with Cochlear Implants, and in Normal Hearing Adults and Children

    Theoretical Framework: Manipulations of prosodic structure influence how listeners interpret syntactically ambiguous sentences. However, the interface between prosody and syntax has received very little attention in languages other than English. Furthermore, many children with cochlear implants (CIs) have deficits in sentence comprehension. Until now, these deficits have been attributed only to syntax, leaving prosody a neglected area despite its clear deficit in this population and the role it plays in sentence comprehension. Purposes: Experiment 1 investigates prosodic boundary effects on the comprehension of attachment ambiguities in Brazilian Portuguese, while Experiment 2 investigates these effects in Brazilian Portuguese-speaking children with CIs. Both experiments tested two hypotheses relying on the notion of boundary strength: the absolute boundary hypothesis (ABH) and the relative boundary hypothesis (RBH). The ABH states that only the high boundary before the ambiguous constituent influences attachment, whereas the RBH holds that the high boundary before the ambiguous constituent can only be interpreted relative to the size of an earlier low boundary. Specific predictions of the two hypotheses were tested. Relationships between attachment results and performance on psychoacoustic tests of gap detection threshold and frequency limen were also investigated. Materials: The experiments were designed in E-Prime 2.0 software (Psychology Software Tools, Pittsburgh, PA). The sentences were recorded in Praat (Boersma & Weenink, 2013), controlling for F0, duration of components, and pauses between components. The prosodic boundaries were measured with the ToBI coding system, distinguishing acoustic measures of intermediate phrase (ip) and intonational phrase (IPh) boundaries.
Methods: Twenty-three normal-hearing (NH) adults, 15 NH children, and 13 children with CIs, all monolingual speakers of Brazilian Portuguese, participated in a computerized sentence comprehension task. The target stimuli consisted of eight base sentences containing a prepositional-phrase attachment ambiguity. Prosodic boundaries were manipulated by varying IPh, ip, and null boundaries. Participants also engaged in psychoacoustic tests of gap detection threshold and frequency discrimination ability on nonlinguistic stimuli. An adaptive 3-interval forced-choice procedure was used for gap detection. For the frequency discrimination task, participants completed a same-different two-alternative forced-choice task. Results and Discussion: Unlike NH adults and children, children with CIs did not exhibit an overall effect of prosody on syntactic disambiguation. Nonetheless, adults and children with NH and children with CIs had the same two predictions of the RBH confirmed, suggesting that they perceived and used the relative size of the boundaries similarly. Two predictions of the ABH were confirmed for adults with NH, whereas only one was confirmed for children with NH. The ABH does not govern the syntactic disambiguation of children with CIs. Children with NH were significantly slower than adults with NH to indicate a high-attachment response in all prosodic types. However, hearing status did not influence processing speed. Gap detection thresholds and frequency limens on nonlinguistic stimuli did not influence the attachment of syntactically ambiguous sentences with different prosodic boundaries in adults and children with NH. Although children with CIs exhibited a decreased ability to perceive the acoustic changes on a nonlinguistic level, no correlation was found between frequency limens and the proportion of high attachment.
In children with CIs, gap detection thresholds were correlated with the proportion of high attachment only on sentences with strong prosodic contrasts, suggesting that gap detection thresholds possibly influenced the attachment of syntactically ambiguous sentences with strong prosodic dissimilarity between boundaries.

    Period disambiguation with MaxEnt model

    This paper presents our recent work on period disambiguation, the kernel problem in sentence boundary identification, with the maximum entropy (MaxEnt) model. A number of experiments are conducted on the PTB-II WSJ corpus to investigate how the context window, the feature space, and lexical information such as abbreviated and sentence-initial words affect learning performance. Such lexical information can be automatically acquired from a training corpus by a learner. Our experimental results show that extending the feature space to integrate these two kinds of lexical information eliminates 93.52% of the remaining errors of the baseline MaxEnt model, achieving an F-score of 99.8227%.
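In the MaxEnt framing above, each period becomes a binary classification instance over contextual features. The sketch below trains a binary logistic model (the two-class case of MaxEnt) on toy data; the feature set, abbreviation list, training examples, and learning rate are all illustrative, not the paper's:

```python
import math

# Hypothetical abbreviation lexicon of the kind the paper acquires
# automatically from a training corpus.
ABBREVS = {"Dr.", "Mr.", "etc."}

def features(prev_tok, next_tok):
    """Binary context features for the token before and after a period."""
    return [
        1.0,                                     # bias
        1.0 if prev_tok in ABBREVS else 0.0,     # abbreviation before period
        1.0 if next_tok[:1].isupper() else 0.0,  # capitalised continuation
    ]

def train(data, lr=0.3, epochs=1000):
    """Fit binary logistic regression (two-class MaxEnt) by gradient steps."""
    w = [0.0] * 3
    for _ in range(epochs):
        for (prev_tok, next_tok), y in data:
            x = features(prev_tok, next_tok)
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            w = [wi + lr * (y - p) * xi for wi, xi in zip(w, x)]
    return w

def is_boundary(prev_tok, next_tok, w):
    x = features(prev_tok, next_tok)
    return sum(wi * xi for wi, xi in zip(w, x)) > 0.0

data = [                       # (context, 1 = real sentence boundary)
    (("Dr.", "Smith"), 0),
    (("Mr.", "Jones"), 0),
    (("went", "Then"), 1),
    (("home", "The"), 1),
    (("etc.", "the"), 0),
]
w = train(data)
```

The paper's gains come from enlarging this feature space with corpus-derived abbreviation and sentence-initial-word lists, which is exactly the role `ABBREVS` and the capitalisation feature play here in miniature.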

    An ontology-based text mining method to develop a D-matrix

    In this paper, we demonstrate an ontology-based text mining method for developing and updating a D-matrix automatically from the large volume of repair verbatim (written in unstructured text) collected during diagnosis. The dependency (D) matrix is a structured diagnostic model used to capture system-level diagnostic data, including the dependencies between observable symptoms and the failure modes associated with a system. Building a D-matrix is a time-consuming process: developing it from first principles and updating it with domain information is labor-intensive, and augmenting an existing D-matrix with newly observed symptoms and failure modes is likewise a difficult task. In our methodology, we first develop a fault-diagnosis ontology comprising the concepts and relationships commonly seen in the fault-diagnosis field. We then apply a text mining algorithm that uses this ontology to extract basic items, such as parts, symptoms, failure modes, and conditions, from the unstructured repair verbatim. The proposed technique is implemented as a prototype tool and validated using real-life data collected from the automotive domain.

    Breaking Sticks and Ambiguities with Adaptive Skip-gram

    The recently proposed Skip-gram model is a powerful method for learning high-dimensional word representations that capture rich semantic relationships between words. However, Skip-gram, like most prior work on learning word representations, does not take word ambiguity into account and maintains only a single representation per word. Although a number of Skip-gram modifications have been proposed to overcome this limitation and learn multi-prototype word representations, they either require a known number of word meanings or learn them using greedy heuristic approaches. In this paper we propose the Adaptive Skip-gram model, a nonparametric Bayesian extension of Skip-gram capable of automatically learning the required number of representations for all words at the desired semantic resolution. We derive an efficient online variational learning algorithm for the model and empirically demonstrate its efficiency on the word sense induction task.
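The "sticks" in the title refer to the stick-breaking construction behind the nonparametric prior: sense probabilities for a word are produced by repeatedly breaking off Beta-distributed fractions of a unit stick, so the number of senses with non-negligible mass adapts to the data. The following is a generic truncated stick-breaking sketch, not the AdaGram inference code:

```python
import random

def stick_breaking(alpha, n_sticks, rng):
    """Draw prior sense weights from a truncated stick-breaking process.

    Each break takes a Beta(1, alpha) fraction of the remaining stick;
    smaller alpha concentrates mass on the first few senses.
    """
    weights, remaining = [], 1.0
    for _ in range(n_sticks - 1):
        frac = rng.betavariate(1.0, alpha)
        weights.append(remaining * frac)
        remaining *= 1.0 - frac
    weights.append(remaining)   # last sense absorbs the leftover mass
    return weights

rng = random.Random(0)
w = stick_breaking(alpha=0.5, n_sticks=5, rng=rng)
```

In AdaGram each word gets such a prior over sense indices, and the variational posterior then decides how many of those senses the data actually supports.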

    A machine learning approach to POS tagging

    We have applied inductive learning of statistical decision trees and relaxation labelling to the Natural Language Processing (NLP) task of morphosyntactic disambiguation (Part Of Speech Tagging). The learning process is supervised and obtains a language model oriented to resolving POS ambiguities. This model consists of a set of statistical decision trees expressing the distribution of tags and words in some relevant contexts. The acquired language models are complete enough to be directly used as sets of POS disambiguation rules, and include more complex contextual information than the simple collections of n-grams usually used in statistical taggers. We have implemented a quite simple and fast tagger that has been tested and evaluated on the Wall Street Journal (WSJ) corpus with remarkable accuracy. However, better results can be obtained by translating the trees into rules to feed a flexible relaxation-labelling-based tagger. In this direction, we describe a tagger which is able to use information of any kind (n-grams, automatically acquired constraints, linguistically motivated manually written constraints, etc.), and in particular to incorporate the machine-learned decision trees. Simultaneously, we address the problem of tagging when only a small amount of training material is available, which is crucial in any process of constructing an annotated corpus from scratch. We show that quite high accuracy can be achieved with our system in this situation.
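A statistical decision tree of the kind described above asks questions about the context of an ambiguous word and stores a tag distribution at each leaf. The tree below is hand-written for one classic noun/verb ambiguity and is purely illustrative; in the paper such trees are induced from annotated data, and their branches can be read off directly as disambiguation rules:

```python
# Hand-built stand-in for an induced statistical decision tree: internal
# nodes test a context feature, leaves hold a tag probability distribution.
TREE = {
    "question": "prev_tag == determiner",
    "yes": {"leaf": {"noun": 0.95, "verb": 0.05}},
    "no": {
        "question": "prev_tag == pronoun",
        "yes": {"leaf": {"verb": 0.90, "noun": 0.10}},
        "no":  {"leaf": {"noun": 0.60, "verb": 0.40}},
    },
}

def classify(node, context):
    """Walk the tree, answering each question from the context features,
    and return the most probable tag at the reached leaf."""
    while "leaf" not in node:
        feature, _, value = node["question"].partition(" == ")
        node = node["yes"] if context.get(feature) == value else node["no"]
    return max(node["leaf"], key=node["leaf"].get)

# "they run" -> verb reading; "the run" -> noun reading.
print(classify(TREE, {"prev_tag": "pronoun"}))      # -> verb
print(classify(TREE, {"prev_tag": "determiner"}))   # -> noun
```

Because each root-to-leaf path is an explicit conjunction of context tests, the same structure can be exported as weighted constraints for the relaxation-labelling tagger the abstract mentions.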