Adaptive Sentence Boundary Disambiguation
Labeling of sentence boundaries is a necessary prerequisite for many natural
language processing tasks, including part-of-speech tagging and sentence
alignment. End-of-sentence punctuation marks are ambiguous; to disambiguate
them most systems use brittle, special-purpose regular expression grammars and
exception rules. As an alternative, we have developed an efficient, trainable
algorithm that uses a lexicon with part-of-speech probabilities and a
feed-forward neural network. After training for less than one minute, the
method correctly labels over 98.5% of sentence boundaries in a corpus of over
27,000 sentence-boundary marks. We show the method to be efficient and easily
adaptable to different text genres, including single-case texts.
Comment: The software from the work described in this paper is available by contacting [email protected]
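As a rough illustration of this setup, each word around a candidate period can be encoded by its prior part-of-speech probabilities, with the concatenated vectors serving as network input. The tag set, lexicon entries, and window size below are invented for illustration, not taken from the paper:

```python
# Sketch of a POS-probability descriptor for a candidate sentence boundary.
# All lexicon entries and the tag inventory are toy assumptions.

POS_TAGS = ["noun", "verb", "adjective", "abbreviation", "number"]

# Toy lexicon: word -> prior probability of each part of speech.
LEXICON = {
    "dr": {"abbreviation": 1.0},
    "smith": {"noun": 1.0},
    "arrived": {"verb": 1.0},
}

def pos_vector(word):
    """Part-of-speech probability vector for a word (zeros if unknown)."""
    probs = LEXICON.get(word.lower(), {})
    return [probs.get(tag, 0.0) for tag in POS_TAGS]

def descriptor(tokens, i, window=1):
    """Concatenate POS vectors of the tokens around the candidate period tokens[i]."""
    vec = []
    for j in range(i - window, i + window + 1):
        if j == i:
            continue  # the period itself contributes no POS vector
        if 0 <= j < len(tokens):
            vec.extend(pos_vector(tokens[j]))
        else:
            vec.extend([0.0] * len(POS_TAGS))  # pad at text edges
    return vec
```

A descriptor like `descriptor(["Dr", ".", "Smith"], 1)` exposes the "abbreviation followed by capitalized noun" pattern that a small feed-forward classifier could learn to label as a non-boundary.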
Pair-Linking for Collective Entity Disambiguation: Two Could Be Better Than All
Collective entity disambiguation aims to jointly resolve multiple mentions by
linking them to their associated entities in a knowledge base. Previous works
are primarily based on the underlying assumption that entities within the same
document are highly related. However, the extent to which these mentioned
entities are actually connected in reality is rarely studied and therefore
raises interesting research questions. For the first time, we show that the
semantic relationships between the mentioned entities are in fact less dense
than expected. This could be attributed to several reasons such as noise, data
sparsity and knowledge base incompleteness. As a remedy, we introduce MINTREE,
a new tree-based objective for the entity disambiguation problem. The key
intuition behind MINTREE is the concept of coherence relaxation which utilizes
the weight of a minimum spanning tree to measure the coherence between
entities. Based on this new objective, we design a novel entity disambiguation
algorithm, which we call Pair-Linking. Instead of considering all the given
mentions, Pair-Linking iteratively selects a pair with the highest confidence
at each step for decision making. Via extensive experiments, we show that our
approach is not only more accurate but also surprisingly faster than many
state-of-the-art collective linking algorithms.
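The pair-selection loop can be sketched as a greedy procedure; `score` stands in for the paper's pairwise confidence function, and the candidate lists are toy assumptions rather than knowledge-base data (the sketch assumes at least two mentions):

```python
import itertools

def pair_linking(mentions, candidates, score):
    """Greedy Pair-Linking sketch: repeatedly commit the most confident
    pair of linking decisions instead of resolving all mentions jointly.
    candidates[m] lists candidate entities for mention m; score(e1, e2)
    is an assumed pairwise coherence/confidence function."""
    assignment = {}
    unresolved = set(mentions)
    while unresolved:
        best = None
        for m1, m2 in itertools.combinations(mentions, 2):
            if m1 not in unresolved and m2 not in unresolved:
                continue  # both mentions already linked
            # A resolved mention keeps its committed entity.
            options1 = [assignment[m1]] if m1 in assignment else candidates[m1]
            options2 = [assignment[m2]] if m2 in assignment else candidates[m2]
            for e1 in options1:
                for e2 in options2:
                    s = score(e1, e2)
                    if best is None or s > best[0]:
                        best = (s, m1, e1, m2, e2)
        _, m1, e1, m2, e2 = best
        assignment[m1], assignment[m2] = e1, e2
        unresolved.discard(m1)
        unresolved.discard(m2)
    return assignment
```

Because each iteration fixes only the single most confident pair, early high-confidence decisions can guide later, more ambiguous ones, which is the intuition behind "two could be better than all".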
Russian word sense induction by clustering averaged word embeddings
The paper reports our participation in the shared task on word sense
induction and disambiguation for the Russian language (RUSSE-2018). Our team
was ranked 2nd for the wiki-wiki dataset (containing mostly homonyms) and 5th
for the bts-rnc and active-dict datasets (containing mostly polysemous words)
among all 19 participants.
The method we employed was extremely naive. It represented contexts
of ambiguous words as averaged word embedding vectors, using off-the-shelf
pre-trained distributional models. Then, these vector representations were
clustered with mainstream clustering techniques, thus producing the groups
corresponding to the ambiguous word senses. As a side result, we show that word
embedding models trained on small but balanced corpora can be superior to those
trained on large but noisy data - not only in intrinsic evaluation, but also in
downstream tasks like word sense induction.
Comment: Proceedings of the 24th International Conference on Computational Linguistics and Intellectual Technologies (Dialogue-2018).
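A minimal sketch of this pipeline, with tiny hand-made vectors standing in for the off-the-shelf pre-trained embeddings and a plain k-means standing in for the mainstream clustering libraries:

```python
import random

# Toy "pre-trained" embeddings (assumed; the actual work used
# off-the-shelf distributional models).
EMB = {
    "river": [0.9, 0.1], "water": [0.8, 0.2],
    "money": [0.1, 0.9], "loan": [0.2, 0.8],
}

def context_vector(context, target):
    """Average the embeddings of the context words around the ambiguous target."""
    vecs = [EMB[w] for w in context if w != target and w in EMB]
    dim = len(next(iter(EMB.values())))
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def squared_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means, standing in for 'mainstream clustering techniques'."""
    centers = random.Random(seed).sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: squared_dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

Each cluster of averaged context vectors is then read as one induced sense of the ambiguous word.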
Prosodic Boundary Effects on Syntactic Disambiguation in Children with Cochlear Implants, and in Normal Hearing Adults and Children
Theoretical Framework: Manipulations of prosodic structure influence how listeners interpret syntactically ambiguous sentences. However, the interface between prosody and syntax has received very little attention in languages other than English. Furthermore, many children with cochlear implants (CIs) have deficits in sentence comprehension. Until now, these deficits have been attributed only to syntax, leaving prosody a neglected area, despite the clear prosodic deficits in this population and the role prosody plays in sentence comprehension.
Purposes: Experiment 1 investigates prosodic boundary effects on the comprehension of attachment ambiguities in Brazilian Portuguese, while Experiment 2 investigates these effects in Brazilian Portuguese-speaking children with CIs. Both experiments tested two hypotheses relying on the notion of boundary strength: the absolute boundary hypothesis (ABH) and the relative boundary hypothesis (RBH). The ABH states that only the high boundary before the ambiguous constituent influences attachment, whereas the RBH holds that the high boundary before the ambiguous constituent can only be interpreted relative to the size of an earlier low boundary. Specific predictions of the two hypotheses were tested. Relationships between attachment results and performance on psychoacoustic tests of gap detection threshold and frequency limen were also investigated.
Materials: The experiments were designed in E-Prime 2.0 software (Psychology Software Tools, Pittsburgh, PA). The sentences were recorded in Praat software (Boersma & Weenink, 2013), controlling for F0, duration of components, and pauses between components. The prosodic boundaries were measured with the ToBI coding system, distinguishing acoustic measures of intermediate phrase (ip) and intonational phrase (IPh) boundaries.
Methods: Twenty-three normal hearing (NH) adults, 15 NH children, and 13 children with CIs, all monolingual speakers of Brazilian Portuguese, participated in a computerized sentence comprehension task. The target stimuli consisted of eight base sentences containing a prepositional phrase attachment ambiguity. Prosodic boundaries were manipulated by varying IPh, ip, and null boundaries. Participants also engaged in psychoacoustic tests that investigated gap detection threshold and frequency discrimination ability on nonlinguistic stimuli. An adaptive three-interval forced-choice procedure was used in gap detection. For the frequency discrimination task, participants completed a same-different two-alternative forced-choice task.
Results and Discussion: Unlike NH adults and children, children with CIs did not exhibit an overall effect of prosody on syntactic disambiguation. Nonetheless, adults and children with NH and children with CIs had the same two predictions of the RBH confirmed, suggesting that they perceived and used the relative size of the boundaries similarly. Two predictions of the ABH were confirmed for adults with NH, whereas only one was confirmed for children with NH. The ABH does not govern the syntactic disambiguation of children with CIs. Children with NH were significantly slower than adults with NH to indicate a high attachment response in all prosodic types. However, hearing status did not influence processing speed. Gap detection thresholds and frequency limens on nonlinguistic stimuli did not influence the attachment of syntactically ambiguous sentences with different prosodic boundaries in adults and children with NH. Although children with CIs exhibited a decreased ability to perceive the acoustic changes on a nonlinguistic level, no correlation was found between frequency limens and proportion of high attachment. In children with CIs, gap detection thresholds were correlated with the proportion of high attachment only on sentences with strong prosodic contrasts, suggesting that gap detection thresholds possibly influenced the attachment of syntactically ambiguous sentences with strong prosodic dissimilarity between boundaries.
Period disambiguation with MaxEnt model
This paper presents our recent work on period disambiguation, the kernel problem in sentence boundary identification, with the maximum entropy (Maxent) model. A number of experiments are conducted on the PTB-II WSJ corpus to investigate how context window, feature space and lexical information such as abbreviated and sentence-initial words affect the learning performance. Such lexical information can be automatically acquired from a training corpus by a learner. Our experimental results show that extending the feature space to integrate these two kinds of lexical information can eliminate 93.52% of the remaining errors from the baseline Maxent model, achieving an F-score of 99.8227%.
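The core of such a Maxent model can be sketched as binary logistic regression over hand-picked context features; the feature templates, abbreviation list, and training examples below are invented for illustration:

```python
import math

ABBREVS = {"dr", "mr", "prof"}  # toy abbreviation list (assumed)

def features(prev_word, next_word, abbrevs=ABBREVS):
    """Binary context features for one period occurrence."""
    return [
        1.0,                                                # bias term
        1.0 if prev_word.lower().rstrip(".") in abbrevs else 0.0,
        1.0 if next_word[:1].isupper() else 0.0,
    ]

def train_maxent(data, epochs=500, lr=0.5):
    """Fit a binary Maxent (logistic) model by gradient ascent on log-likelihood."""
    weights = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            z = sum(w * xi for w, xi in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for i, xi in enumerate(x):
                weights[i] += lr * (y - p) * xi
    return weights

def predict(weights, x):
    """1 = sentence boundary, 0 = not a boundary."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0

# Toy training data: label 1 marks a true sentence boundary.
data = [
    (features("Dr", "Smith"), 0),      # abbreviation, not a boundary
    (features("arrived", "The"), 1),
    (features("Mr", "Jones"), 0),
    (features("home", "She"), 1),
]
```

The paper's contribution lies in far richer feature spaces and automatically acquired lexical lists; this sketch only shows the model family.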
An Ontology-Based Text Mining Method to Develop the D-Matrix
In this work, we demonstrate an ontology-based text mining method for developing and updating a D-matrix from a large collection of repair verbatim records (written in unstructured text) collected during the diagnosis stages. The fault dependency (D) matrix is a diagnostic model that captures system-level dependencies between observable symptoms and the failure modes associated with a system. Developing a D-matrix from first principles and updating it with domain information is a labor-intensive, time-consuming process. In addition, augmenting the D-matrix over time with newly observed symptoms and failure modes is a difficult task. In this methodology, we first develop a fault diagnosis ontology comprising the concepts and relationships commonly seen in the fault diagnosis field. We then use a text mining algorithm that exploits this ontology to identify basic items, such as parts, symptoms, failure modes, and conditions, from the unstructured repair verbatim text. The proposed technique is implemented as a prototype tool and validated using real-life data collected from the automotive domain.
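Once items have been extracted from the verbatim text, assembling them into a D-matrix is mechanical; the sketch below assumes extraction has already produced structured (failure mode, symptoms) records, which is itself the hard step the method addresses:

```python
def build_d_matrix(records):
    """Build a dependency (D) matrix: rows are failure modes, columns are
    symptoms, and a cell is 1 if the symptom co-occurred with that failure
    mode. `records` stands in for items mined from repair verbatim text."""
    failures = sorted({r["failure_mode"] for r in records})
    symptoms = sorted({s for r in records for s in r["symptoms"]})
    matrix = [[0] * len(symptoms) for _ in failures]
    for r in records:
        row = failures.index(r["failure_mode"])
        for s in r["symptoms"]:
            matrix[row][symptoms.index(s)] = 1
    return failures, symptoms, matrix
```

New records simply add rows or columns, which is how text mining supports the incremental update problem described above.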
Breaking Sticks and Ambiguities with Adaptive Skip-gram
The recently proposed Skip-gram model is a powerful method for learning
high-dimensional word representations that capture rich semantic relationships
between words. However, Skip-gram, like most prior work on learning word
representations, does not take word ambiguity into account and maintains only a
single representation per word. Although a number of Skip-gram modifications
were proposed to overcome this limitation and learn multi-prototype word
representations, they either require a known number of word meanings or learn
them using greedy heuristic approaches. In this paper we propose the Adaptive
Skip-gram model, a nonparametric Bayesian extension of Skip-gram capable of
automatically learning the required number of representations for all words at
the desired semantic resolution. We derive an efficient online variational
learning algorithm for the model and empirically demonstrate its efficiency on
a word-sense induction task.
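The "sticks" in the title refer to the stick-breaking construction that such nonparametric Bayesian models use as a prior over a potentially unbounded number of word senses. A minimal sketch of the weight construction, with fixed break proportions rather than inferred posteriors:

```python
def stick_breaking(betas):
    """Stick-breaking weights: pi_k = beta_k * prod_{j<k} (1 - beta_j).
    Each beta_k breaks off a fraction of the remaining stick, so the
    weights decay and sum to at most 1, letting the model allocate as
    many sense slots as the data supports. The betas here are given
    constants, not variational posteriors."""
    weights, remaining = [], 1.0
    for b in betas:
        weights.append(b * remaining)
        remaining *= 1.0 - b
    return weights
```

In the full model, each word's sense proportions follow such a prior, and the variational algorithm effectively prunes sticks whose weight the data cannot support.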
A Machine learning approach to POS tagging
We have applied inductive learning of statistical decision trees
and relaxation labelling to the Natural Language Processing (NLP)
task of morphosyntactic disambiguation (Part Of Speech Tagging).
The learning process is supervised and obtains a language
model oriented to resolve POS ambiguities. This model consists
of a set of statistical decision trees expressing distribution of
tags and words in some relevant contexts.
The acquired language models are complete enough to be directly
used as sets of POS disambiguation rules, and include more complex
contextual information than simple collections of n-grams usually
used in statistical taggers.
We have implemented a quite simple and fast tagger that has been
tested and evaluated on the Wall Street Journal (WSJ) corpus with
remarkable accuracy.
However, better results can be obtained by translating the trees
into rules to feed a flexible relaxation labelling based tagger.
In this direction we describe a tagger which is able to use
information of any kind (n-grams, automatically acquired constraints,
linguistically motivated manually written constraints, etc.), and in
particular to incorporate the machine learned decision trees.
Simultaneously, we address the problem of tagging when only
small training material is available, which is crucial in any process
of constructing, from scratch, an annotated corpus. We show that quite
high accuracy can be achieved with our system in this situation.
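The translation of decision trees into rules can be sketched as flattening each root-to-leaf path into a condition list; the tree shape, feature names, and tags below are invented for illustration:

```python
# A decision tree is either a tag (leaf) or (feature, {value: subtree}).
# This toy tree and its features are illustrative assumptions.
TREE = ("prev_tag", {
    "DT": "NN",                       # after a determiner, choose noun
    "NN": ("word_is_verb_form", {True: "VBZ", False: "NN"}),
})

def tree_to_rules(tree, conditions=()):
    """Flatten a decision tree into (conditions -> tag) rules, i.e. the
    translation step that lets a relaxation-labelling tagger consume the
    learned trees alongside n-grams and hand-written constraints."""
    if isinstance(tree, str):
        return [(conditions, tree)]
    feature, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, conditions + ((feature, value),)))
    return rules
```

Each extracted rule is just a conjunction of context tests with a tag conclusion, so it can sit in the same constraint pool as manually written linguistic rules.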