650 research outputs found
A non-projective greedy dependency parser with bidirectional LSTMs
The LyS-FASTPARSE team presents BIST-COVINGTON, a neural implementation of
the Covington (2001) algorithm for non-projective dependency parsing. The
bidirectional LSTM approach by Kipperwasser and Goldberg (2016) is used to
train a greedy parser with a dynamic oracle to mitigate error propagation. The
model participated in the CoNLL 2017 UD Shared Task. In spite of not using any
ensemble methods and using the baseline segmentation and PoS tagging, the
parser obtained good results on both macro-average LAS and UAS in the big
treebanks category (55 languages), ranking 7th out of 33 teams. In the all
treebanks category (LAS and UAS) we ranked 16th and 12th. The gap between the
all and big categories is mainly due to the poor performance on four parallel
PUD treebanks, suggesting that some `suffixed' treebanks (e.g. Spanish-AnCora)
perform poorly on cross-treebank settings, which does not occur with the
corresponding `unsuffixed' treebank (e.g. Spanish). By changing that, we obtain
the 11th best LAS among all runs (official and unofficial). The code is made
available at https://github.com/CoNLL-UD-2017/LyS-FASTPARSEComment: 12 pages, 2 figures, 5 table
Better, Faster, Stronger Sequence Tagging Constituent Parsers
Sequence tagging models for constituent parsing are faster, but less accurate
than other types of parsers. In this work, we address the following weaknesses
of such constituent parsers: (a) high error rates around closing brackets of
long constituents, (b) large label sets, leading to sparsity, and (c) error
propagation arising from greedy decoding. To effectively close brackets, we
train a model that learns to switch between tagging schemes. To reduce
sparsity, we decompose the label set and use multi-task learning to jointly
learn to predict sublabels. Finally, we mitigate issues from greedy decoding
through auxiliary losses and sentence-level fine-tuning with policy gradient.
Combining these techniques, we clearly surpass the performance of sequence
tagging constituent parsers on the English and Chinese Penn Treebanks, and
reduce their parsing time even further. On the SPMRL datasets, we observe even
greater improvements across the board, including a new state of the art on
Basque, Hebrew, Polish and Swedish.Comment: NAACL 2019 (long papers). Contains corrigendu
El modelo probabilÃstico: caracterÃsticas y modelos derivados
A review of the art of the family of probabilistic model of information retrieval is presented. Based on its basic priciples, difefferent specific models are analysed: the BIR model –the most basic one– the classic BM25 and the DFR models, one of the last developed.Presentamos en este trabajo una revisión del estado del arte de la familia de los modelos probabilÃsticas de recuperación de información. Partiendo de los principios básicos que sustentan estos modelos, estudiaremos diferentes modelos concretos: el modelo de independencia binaria –el más básico–, el ya clásico BM25 y, finalmente, los modelos DFR –uno de los últimos desarrollos–
Studying the Effect and Treatment of Misspelled Queries in Cross-Language Information Retrieval
[Abstract] The performance of Information Retrieval systems is limited by the linguistic variation present in natural language texts. Word-level Natural Language Processing techniques have been shown to be useful in reducing this variation. In this article, we summarize our work on the extension of these techniques for dealing with phrase-level variation in European languages, taking Spanish as a case in point. We propose the use of syntactic dependencies as complex index terms in an attempt to solve the problems deriving from both syntactic and morpho-syntactic variation and, in this way, to obtain more precise index terms. Such dependencies are obtained through a shallow parser based on cascades of finite-state transducers in order to reduce as far as possible the overhead due to this parsing process. The use of different sources of syntactic information, queries or documents, has been also studied, as has the restriction of the dependencies applied to those obtained from noun phrases. Our approaches have been tested using the CLEF corpus, obtaining consistent improvements with regard to classical word-level non-linguistic techniques. Results show, on the one hand, that syntactic information extracted from documents is more useful than that from queries. On the other hand, it has been demonstrated that by restricting dependencies to those corresponding to noun phrases, important reductions of storage and management costs can be achieved, albeit at the expense of a slight reduction in performance.Ministerio de EconomÃa y Competitividad; FFI2014-51978-C2-1-RRede Galega de Procesamento da Linguaxe e Recuperación de Información; CN2014/034Ministerio de EconomÃa y Competitividad; BES-2015-073768Ministerio de EconomÃa y Competitividad; FFI2014-51978-C2-2-
Misspelled queries in cross-language IR: analysis and management
Este artÃculo estudia el impacto de los errores ortográficos en las consultas sobre el rendimiento de los sistemas de recuperación de información multilingüe, proponiendo dos estrategias para su tratamiento: el empleo de técnicas de corrección ortográfica automática y la utilización de n-gramas de caracteres como términos Ãndice y unidad de traducción, para asà aprovecharnos de su robustez inherente. Los resultados demuestran la sensibilidad de estos sistemas frente a dichos errores asà como la efectividad de las soluciones propuestas. Hasta donde alcanza nuestro conocimiento no existen trabajos similares en el ámbito multilingüe.This paper studies the impact of misspelled queries on the performance of Cross-Language Information Retrieval systems and proposes two strategies for dealing with them: the use of automatic spelling correction techniques and the use of character n-grams both as index terms and translation units, thus allowing to take advantage of their inherent robustness. Our results demonstrate the sensitivity of these systems to such errors and the effectiveness of the proposed solutions. To the best of our knowledge there are no similar jobs in the cross-language field.Trabajo parcialmente subvencionado por el Ministerio de EconomÃa y Competitividad y FEDER (proyectos TIN2010-18552-C03-01 y TIN2010-18552-C03-02) y por la Xunta de Galicia (ayudas CN 2012/008, CN 2012/317 y CN 2012/319)
- …