2,122 research outputs found
Recommended from our members
MG Parsing as a Model of Gradient Acceptability in Syntactic Islands
It is well-known that the acceptability judgments at the core of current syntactic theories are continuous. However, an open debate is whether the source of such gradience is situated in the grammar itself, or can be derived from extra-grammatical factors. In this paper, we propose the use of a top-down parser for Minimalist grammars (Stabler, 2013; Kobele et al., 2013; Graf et al., 2017), as a formal model of how gradient acceptability can arise from categorical grammars. As a test case, we target the acceptability judgments for island effects collected by Sprouse et al. (2012a)
Transparency at the Source: Evaluating and Interpreting Language Models With Access to the True Distribution
We present a setup for training, evaluating and interpreting neural language
models, that uses artificial, language-like data. The data is generated using a
massive probabilistic grammar (based on state-split PCFGs), that is itself
derived from a large natural language corpus, but also provides us complete
control over the generative process. We describe and release both grammar and
corpus, and test for the naturalness of our generated data. This approach
allows us to define closed-form expressions to efficiently compute exact lower
bounds on obtainable perplexity using both causal and masked language
modelling. Our results show striking differences between neural language
modelling architectures and training objectives in how closely they allow
approximating the lower bound on perplexity. Our approach also allows us to
directly compare learned representations to symbolic rules in the underlying
source. We experiment with various techniques for interpreting model behaviour
and learning dynamics. With access to the underlying true source, our results
show striking differences and outcomes in learning dynamics between different
classes of words.Comment: EMNLP Findings 202
A Hybrid Environment for Syntax-Semantic Tagging
The thesis describes the application of the relaxation labelling algorithm to
NLP disambiguation. Language is modelled through context constraint inspired on
Constraint Grammars. The constraints enable the use of a real value statind
"compatibility". The technique is applied to POS tagging, Shallow Parsing and
Word Sense Disambigation. Experiments and results are reported. The proposed
approach enables the use of multi-feature constraint models, the simultaneous
resolution of several NL disambiguation tasks, and the collaboration of
linguistic and statistical models.Comment: PhD Thesis. 120 page
Discourse Structure in Machine Translation Evaluation
In this article, we explore the potential of using sentence-level discourse
structure for machine translation evaluation. We first design discourse-aware
similarity measures, which use all-subtree kernels to compare discourse parse
trees in accordance with the Rhetorical Structure Theory (RST). Then, we show
that a simple linear combination with these measures can help improve various
existing machine translation evaluation metrics regarding correlation with
human judgments both at the segment- and at the system-level. This suggests
that discourse information is complementary to the information used by many of
the existing evaluation metrics, and thus it could be taken into account when
developing richer evaluation metrics, such as the WMT-14 winning combined
metric DiscoTKparty. We also provide a detailed analysis of the relevance of
various discourse elements and relations from the RST parse trees for machine
translation evaluation. In particular we show that: (i) all aspects of the RST
tree are relevant, (ii) nuclearity is more useful than relation type, and (iii)
the similarity of the translation RST tree to the reference tree is positively
correlated with translation quality.Comment: machine translation, machine translation evaluation, discourse
analysis. Computational Linguistics, 201
Knowledge-based and data-driven approaches for geographical information access
Geographical Information Access (GeoIA) can be defined as a way of retrieving information from textual collections that includes the automatic analysis and interpretation of the geographical constraints and terms present in queries and documents. This PhD thesis presents, describes and evaluates several heterogeneous approaches for the following three GeoIA tasks: Geographical Information Retrieval (GIR), Geographical Question Answering (GeoQA), and Textual Georeferencing (TG). The GIR task deals with user queries that search over documents (e.g. ¿vineyards in California?) and the GeoQA task treats questions that retrieve answers (e.g. ¿What is the capital of France?). On the other hand, TG is the task of associate one or more georeferences (such as polygons or coordinates in a geodetic reference system) to electronic documents.
Current state-of-the-art AI algorithms are not yet fully understanding the semantic meaning and the geographical constraints and terms present in queries and document collections. This thesis attempts to improve the effectiveness results of GeoIA tasks by: 1) improving the detection, understanding, and use of a part of the geographical and the thematic content of queries and documents with Toponym Recognition, Toponym Disambiguation and Natural Language Processing (NLP) techniques, and 2) combining Geographical Knowledge-Based Heuristics based on common sense with Data-Driven IR algorithms.
The main contributions of this thesis to the state-of-the-art of GeoIA tasks are:
1) The presentation of 10 novel approaches for GeoIA tasks: 3 approaches for GIR, 3 for GeoQA, and 4 for Textual Georeferencing (TG).
2) The evaluation of these novel approaches in these contexts: within official evaluation benchmarks, after evaluation benchmarks with the test collections, and with other specific datasets. Most of these algorithms have been evaluated in international evaluations and some of them achieved top-ranked state-of-the-art results, including top-performing results in GIR (GeoCLEF 2007) and TG (MediaEval 2014) benchmarks.
3) The experiments reported in this PhD thesis show that the approaches can combine effectively Geographical Knowledge and NLP with Data-Driven techniques to improve the efectiveness measures of the three Geographical Information Access tasks investigated.
4) TALPGeoIR: a novel GIR approach that combines Geographical Knowledge ReRanking (GeoKR), NLP and Relevance Feedback (RF) that achieved state-of-the-art results in official GeoCLEF benchmarks (Ferrés and RodrÃguez, 2008; Mandl et al., 2008) and posterior experiments (Ferrés and RodrÃguez, 2015a). This approach has been evaluated with the full GeoCLEF corpus (100 topics) and showed that GeoKR, NLP, and RF techniques evaluated separately or in combination improve the results in MAP and R-Precision effectiveness measures of the state-of-the-art IR algorithms TF-IDF, BM25 and InL2 and show statistical significance in most of the experiments.
5) GeoTALP-QA: a scope-based GeoQA approach for Spanish and English and its evaluation with a set of questions of the Spanish geography (Ferrés and RodrÃguez, 2006).
6) Four state-of-the-art Textual Georeferencing approaches for informal and formal documents that achieved state-of-the-art results in evaluation benchmarks (Ferrés and RodrÃguez, 2014) and posterior experiments (Ferrés and RodrÃguez, 2011; Ferrés and RodrÃguez, 2015b).L'Accés a la Informació Geogrà fica (GeoAI) pot ser definit com una forma de recuperar informació de col·lecions textuals que inclou l'anà lisi automà tic i la interpretació dels termes i restriccions geogrà fiques que apareixen en consultes i documents. Aquesta tesi doctoral presenta, descriu i avalua varies aproximacions heterogènies a les seguents tasques de GeoAI: Recuperació de la Informació Geogrà fica (RIG), Cerca de la Resposta Geogrà fica (GeoCR), i Georeferenciament Textual (GT). La tasca de RIG tracta amb consultes d'usuari que cerquen documents (e.g. ¿vinyes a California?) i la tasca GeoCR tracta de recuperar respostes concretes a preguntes (e.g. ¿Quina és la capital de França?). D'altra banda, GT es la tasca de relacionar una o més referències geogrà fiques (com polÃgons o coordenades en un sistema de referència geodètic) a documents electrònics. Els algoritmes de l'estat de l'art actual en Intel·ligència Artificial encara no comprenen completament el significat semà ntic i els termes i les restriccions geogrà fiques presents en consultes i col·leccions de documents. Aquesta tesi intenta millorar els resultats en efectivitat de les tasques de GeoAI de la seguent manera: 1) millorant la detecció, comprensió, i la utilització d'una part del contingut geogrà fic i temà tic de les consultes i documents amb tècniques de reconeixement de topònims, desambiguació de topònims, i Processament del Llenguatge Natural (PLN), i 2) combinant heurÃstics basats en Coneixement Geogrà fic i en el sentit comú humà amb algoritmes de Recuperació de la Informació basats en dades. Les principals contribucions d'aquesta tesi a l'estat de l'art de les tasques de GeoAI són: 1) La presentació de 10 noves aproximacions a les tasques de GeoAI: 3 aproximacions per RIG, 3 per GeoCR, i 4 per Georeferenciament Textual (GT). 2) L'avaluació d'aquestes noves aproximacions en aquests contexts: en el marc d'avaluacions comparatives internacionals, posteriorment a avaluacions comparatives internacionals amb les col·lections de test, i amb altres conjunts de dades especÃfics. La majoria d'aquests algoritmes han estat avaluats en avaluacions comparatives internacionals i alguns d'ells aconseguiren alguns dels millors resultats en l'estat de l'art, com per exemple els resultats en comparatives de RIG (GeoCLEF 2007) i GT (MediaEval 2014). 3) Els experiments descrits en aquesta tesi mostren que les aproximacions poden combinar coneixement geogrà fic i PLN amb tècniques basades en dades per millorar les mesures d'efectivitat en les tres tasques de l'Accés a la Informació Geogrà fica investigades. 4) TALPGeoIR: una nova aproximació a la RIG que combina Re-Ranking amb Coneixement Geogrà fic (GeoKR), PLN i Retroalimentació de Rellevancia (RR) que aconseguà resultats en l'estat de l'art en comparatives oficials GeoCLEF (Ferrés and RodrÃguez, 2008; Mandl et al., 2008) i en experiments posteriors (Ferrés and RodrÃguez, 2015a). Aquesta aproximació ha estat avaluada amb el conjunt complert del corpus GeoCLEF (100 topics) i ha mostrat que les tècniques GeoKR, PLN i RR avaluades separadament o en combinació milloren els resultats en les mesures efectivitat MAP i R-Precision dels algoritmes de l'estat de l'art en Recuperació de la Infomació TF-IDF, BM25 i InL2 i a més mostren significació estadÃstica en la majoria dels experiments. 5) GeoTALP-QA: una aproximació basada en l'à mbit geogrà fic per espanyol i anglès i la seva avaluació amb un conjunt de preguntes de la geografÃa espanyola (Ferrés and RodrÃguez, 2006). 6) Quatre aproximacions per al georeferenciament de documents formals i informals que obtingueren resultats en l'estat de l'art en avaluacions comparatives (Ferrés and RodrÃguez, 2014) i en experiments posteriors (Ferrés and RodrÃguez, 2011; Ferrés and RodrÃguez, 2015b)
- …