Methods for Amharic part-of-speech tagging
The paper describes a set of experiments
involving the application of three state-of-the-art
part-of-speech taggers to Ethiopian
Amharic, using three different tagsets.
The taggers showed worse performance
than previously reported results for English,
in particular having problems with
unknown words. The best results were
obtained using a Maximum Entropy approach,
while HMM-based and SVM-based
taggers got comparable results.
Classification of semantic relations in different syntactic structures in medical text using the MeSH hierarchy
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005. Includes bibliographical references (leaf 38). Two different classification algorithms are evaluated in recognizing semantic relationships of different syntactic compounds. The compounds, which include noun-noun, adjective-noun, noun-adjective, noun-verb, and verb-noun, were extracted from a set of doctors' notes using a part-of-speech tagger and a parser. Each compound was labeled with a semantic relationship, and each word in the compound was mapped to its corresponding entry in the MeSH hierarchy. MeSH includes only medical terminology, so it was extended to include everyday, non-medical terms. The two classification algorithms, neural networks and a classification tree, were trained and tested on the data set for each type of syntactic compound. Models representing different levels of MeSH were generated and fed into the neural networks. Both algorithms performed better than random guessing, and the classification tree performed better than the neural networks in predicting the semantic relationship between phrases from their syntactic structure. By Neha Bhooshan. M.Eng.
Skeleton Key: Image Captioning by Skeleton-Attribute Decomposition
Recently, there has been a lot of interest in automatically generating
descriptions for an image. Most existing language-model based approaches for
this task learn to generate an image description word by word in its original
word order. However, for humans, it is more natural to locate the objects and
their relationships first, and then elaborate on each object, describing
notable attributes. We present a coarse-to-fine method that decomposes the
original image description into a skeleton sentence and its attributes, and
generates the skeleton sentence and attribute phrases separately. By this
decomposition, our method can generate more accurate and novel descriptions
than the previous state-of-the-art. Experimental results on the MS-COCO and a
larger scale Stock3M datasets show that our algorithm yields consistent
improvements across different evaluation metrics, especially on the SPICE
metric, which has much higher correlation with human ratings than the
conventional metrics. Furthermore, our algorithm can generate descriptions with
varied length, benefiting from the separate control of the skeleton and
attributes. This enables image description generation that better accommodates
user preferences. Comment: Accepted by CVPR 201
Semantic Relations Predict the Structural Disambiguation of Three-Constituent Multiword Terms
For English multiword terms (MWTs) of three or more constituents (e.g., sea level rise), a semantic analysis, based on linguistic and domain knowledge, is necessary to resolve the dependency between components. This structural disambiguation, often known as bracketing, involves the grouping of the dependent components so that the MWT is reduced to its basic form of modifier+head, as in [sea level] [rise]. Knowledge of these dependencies facilitates the comprehension of an MWT and its accurate translation into other languages. Moreover, the resolution of MWT bracketing provides a higher overall accuracy in machine translation systems and sentence parsers. This paper thus presents a pilot study that explored whether the bracketing of a ternary compound, when used as an argument in a sentence, can be predicted from the semantic information encoded in that sentence. It is shown that, with a random forest model, the semantic relation of the MWT to another argument in the same sentence, the lexical domain of the predicate, and the semantic role of the MWT were able to predict the bracketing of the 190 ternary compounds used as arguments in a sample of 188 semantically annotated sentences from a Coastal Engineering corpus (100% F1-score). Furthermore, the semantic relation of an MWT to another argument in the same sentence was, on its own, a strong predictor of ternary compound bracketing with a binary decision-tree model (94.12% F1-score).
This research was carried out as part of projects PID2020-118369GB-I00, "Transversal Integration of Culture in a Terminological Knowledge Base on Environment" (TRANSCULTURE), funded by the Spanish Ministry of Science and Innovation; and A-HUM-600-UGR20, "Culture as Transversal Module in a Terminological Knowledge Base on the Environment" (CULTURAMA), funded by the Andalusian Ministry of Economy, Knowledge, Business, and University.
Crowdsourcing Multiple Choice Science Questions
We present a novel method for obtaining high-quality, domain-targeted
multiple choice questions from crowd workers. Generating these questions can be
difficult without trading away originality, relevance or diversity in the
answer options. Our method addresses these problems by leveraging a large
corpus of domain-specific text and a small set of existing questions. It
produces model suggestions for document selection and answer distractor choice
which aid the human question generation process. With this method we have
assembled SciQ, a dataset of 13.7K multiple choice science exam questions
(Dataset available at http://allenai.org/data.html). We demonstrate that the
method produces in-domain questions by providing an analysis of this new
dataset and by showing that humans cannot distinguish the crowdsourced
questions from original questions. When using SciQ as additional training data
to existing questions, we observe accuracy improvements on real science exams. Comment: accepted for the Workshop on Noisy User-generated Text (W-NUT) 201