2,617 research outputs found
Cross-Lingual Induction and Transfer of Verb Classes Based on Word Vector Space Specialisation
Existing approaches to automatic VerbNet-style verb classification are
heavily dependent on feature engineering and therefore limited to languages
with mature NLP pipelines. In this work, we propose a novel cross-lingual
transfer method for inducing VerbNets for multiple languages. To the best of
our knowledge, this is the first study which demonstrates how the architectures
for learning word embeddings can be applied to this challenging
syntactic-semantic task. Our method uses cross-lingual translation pairs to tie
each of the six target languages into a bilingual vector space with English,
jointly specialising the representations to encode the relational information
from English VerbNet. A standard clustering algorithm is then run on top of the
VerbNet-specialised representations, using vector dimensions as features for
learning verb classes. Our results show that the proposed cross-lingual
transfer approach sets new state-of-the-art verb classification performance
across all six target languages explored in this work. Comment: EMNLP 2017 (long paper
A question-answering machine learning system for FAQs
With the growing use of and dependence on the internet for gathering
information, it is now essential to retrieve information efficiently according
to users’ needs. Question Answering (QA) systems aim to fulfill this need
by trying to provide the most relevant answer to a user’s query expressed
in natural language text or speech. Virtual assistants like Apple Siri and
automated FAQ systems have become very popular, and with them the demand
for efficient, advanced and expedient QA systems keeps growing.
In the field of QA systems, this thesis addresses the problem of finding the
FAQ question that is most similar to a user’s query. Its foremost step is
finding semantic similarities between the question banks in the database and
natural language text. The work explores unsupervised approaches for
measuring semantic similarity in order to develop a closed-domain QA system.
To meet this objective, modern sentence representation techniques, such as
BERT and FLAIR GloVe, are coupled with various similarity measures (cosine,
Euclidean and Manhattan) to identify the best model. The developed
models were tested on three FAQ datasets and the SemEval 2015 dataset for
English; the best results were obtained by coupling BERT embeddings
with the Euclidean distance similarity measure, with a performance of
85.956% on a FAQ dataset. The model is also tested for the Portuguese language
with the Portuguese health support phone line SNS24 dataset.
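The matching step described in this abstract (embed the query and every FAQ question, then pick the closest question under a chosen measure) can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the three-dimensional vectors and the FAQ questions are hypothetical stand-ins for real BERT or FLAIR GloVe sentence embeddings, and `most_similar_faq` is a name introduced here.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line (L2) distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan_distance(u, v):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(a - b) for a, b in zip(u, v))

def most_similar_faq(query_vec, faq_vecs, distance):
    """Return the FAQ question whose vector is closest to the query."""
    return min(faq_vecs, key=lambda q: distance(query_vec, faq_vecs[q]))

# Hypothetical toy embeddings; a real system would obtain these
# from a sentence encoder rather than writing them by hand.
faq_vecs = {
    "How do I reset my password?": [0.9, 0.1, 0.0],
    "What are the opening hours?": [0.1, 0.8, 0.3],
    "How can I cancel my order?": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # stands in for an embedded user query

for measure in (cosine_distance, euclidean_distance, manhattan_distance):
    print(measure.__name__, "->", most_similar_faq(query_vec, faq_vecs, measure))
```

Because the approach is unsupervised, swapping the distance function is the only change needed to compare the three measures on the same embeddings.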
Tracking Dengue Epidemics using Twitter Content Classification and Topic Modelling
Detecting and preventing outbreaks of mosquito-borne diseases such as Dengue
and Zika in Brazil and other tropical regions has long been a priority for
governments in affected areas. Streaming social media content, such as Twitter,
is increasingly being used for health vigilance applications such as flu
detection. However, previous work has not addressed the complexity of drastic
seasonal changes on Twitter content across multiple epidemic outbreaks. In
order to address this gap, this paper contrasts two complementary approaches to
detecting Twitter content that is relevant for Dengue outbreak detection,
namely supervised classification and unsupervised clustering using topic
modelling. Each approach has benefits and shortcomings. Our classifier achieves
a prediction accuracy of about 80% based on a small training set of about
1,000 instances, but the need for manual annotation makes it hard to track
seasonal changes in the nature of the epidemics, such as the emergence of new
types of virus in certain geographical locations. In contrast, LDA-based topic
modelling scales well, generating cohesive and well-separated clusters from
larger samples. Although clusters can be easily re-generated following changes in
epidemics, this approach makes it hard to clearly segregate relevant
tweets into well-defined clusters. Comment: Procs. SoWeMine - co-located with ICWE 2016. 2016, Lugano,
Switzerlan
Multilingual Models for Compositional Distributed Semantics
We present a novel technique for learning semantic representations, which
extends the distributional hypothesis to multilingual data and joint-space
embeddings. Our models leverage parallel data and learn to strongly align the
embeddings of semantically equivalent sentences, while maintaining sufficient
distance between those of dissimilar sentences. The models do not rely on word
alignments or any syntactic information and are successfully applied to a
number of diverse languages. We extend our approach to learn semantic
representations at the document level, too. We evaluate these models on two
cross-lingual document classification tasks, outperforming the prior state of
the art. Through qualitative analysis and the study of pivoting effects we
demonstrate that our representations are semantically plausible and can capture
semantic relationships across languages without parallel data. Comment: Proceedings of ACL 2014 (Long papers
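The idea of pulling the embeddings of equivalent sentences together while keeping dissimilar sentences apart can be illustrated with a margin-based objective. The abstract does not state the exact loss, so the hinge form below, the squared Euclidean distance and the default margin of 1.0 are all assumptions made for illustration; `margin_loss` is a hypothetical name.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two sentence vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def margin_loss(source, translation, noise, margin=1.0):
    """Hinge-style loss (an assumed form, not taken from the abstract).

    It is zero once the unrelated 'noise' sentence is at least `margin`
    farther from the source than the true translation is; otherwise it
    grows with the size of the violation.
    """
    aligned = squared_distance(source, translation)
    unrelated = squared_distance(source, noise)
    return max(0.0, margin + aligned - unrelated)

# Hypothetical 2-d sentence vectors.
source = [1.0, 0.0]
translation = [1.0, 0.0]   # perfectly aligned with the source
far_noise = [0.0, 1.0]     # clearly dissimilar sentence
near_noise = [1.0, 0.5]    # dissimilar sentence that sits too close

print(margin_loss(source, translation, far_noise))   # no penalty
print(margin_loss(source, translation, near_noise))  # positive penalty
```

Minimising such a loss over many sentence pairs needs only parallel sentences and randomly sampled non-matching sentences, which is consistent with the abstract's claim that no word alignments are required.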
ASAPP 2.0: Advancing the state-of-the-art of semantic textual similarity for Portuguese
Semantic Textual Similarity (STS) aims at computing the proximity of meaning transmitted by two sentences. In 2016, the ASSIN shared task targeted STS in Portuguese and released training and test collections. This paper describes the development of ASAPP, a system that participated in ASSIN, has been improved since then, and now achieves the best results in this task. ASAPP learns an STS function from a broad range of lexical, syntactic, semantic and distributional features. This paper describes the features used in the current version of ASAPP, and how they are exploited in a regression algorithm to achieve the best published results for ASSIN to date, in both European and Brazilian Portuguese
Minimally supervised induction of morphology through bitexts
A knowledge of morphology can be useful for many natural language processing systems. Thus, much effort has been expended in developing accurate computational tools for morphology that lemmatize, segment and generate new forms. The most powerful and accurate of these have been manually encoded, such endeavors being without exception expensive and time-consuming. Consequently, there have been many attempts to reduce this cost in the development of morphological systems through the development of unsupervised or minimally supervised algorithms and learning methods for acquisition of morphology. These efforts have yet to produce a tool that approaches the performance of manually encoded systems.
Here, I present a strategy for dealing with morphological clustering and segmentation in a minimally supervised manner but one that will be more linguistically informed than previous unsupervised approaches. That is, this study will attempt to induce clusters of words from an unannotated text that are inflectional variants of each other. Then a set of inflectional suffixes by part-of-speech will be induced from these clusters. This level of detail is made possible by a method known as alignment and transfer (AT), among other names, an approach that uses aligned bitexts to transfer linguistic resources developed for one language (the source) to another language (the target). This approach has a further advantage in that it allows a reduction in the amount of training data without a significant degradation in performance, making it useful in applications targeted at data collected from endangered languages. In the current study, however, I use English as the source and German as the target for ease of evaluation and for certain typological properties of German. The two main tasks, clustering and segmentation, are approached sequentially, with the clustering informing the segmentation to allow for greater accuracy in morphological analysis.
While the performance of these methods does not exceed the current roster of unsupervised or minimally supervised approaches to morphology acquisition, this study attempts to integrate more learning methods than previous studies. Furthermore, it attempts to learn inflectional morphology as opposed to derivational morphology, which is a crucial distinction in linguistics. Linguistic
Reconstructing Native Language Typology from Foreign Language Usage
Linguists and psychologists have long been studying cross-linguistic
transfer, the influence of native language properties on linguistic performance
in a foreign language. In this work we provide empirical evidence for this
process in the form of a strong correlation between language similarities
derived from structural features in English as Second Language (ESL) texts and
equivalent similarities obtained from the typological features of the native
languages. We leverage this finding to recover native language typological
similarity structure directly from ESL text, and perform prediction of
typological features in an unsupervised fashion with respect to the target
languages. Our method achieves 72.2% accuracy on the typology prediction task,
a result that is highly competitive with equivalent methods that rely on
typological resources. Comment: CoNLL 201
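The central finding here is a correlation between two sets of pairwise language similarities: one derived from ESL texts and one from typological features. A minimal sketch of such a comparison is given below, assuming Pearson correlation as the statistic (the abstract does not name one). The similarity values and the language pairs are hypothetical and chosen only to make the example run.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical similarity scores for the same three language pairs,
# listed in the same order in both lists.
language_pairs = [("Spanish", "Italian"), ("Spanish", "Korean"), ("Italian", "Korean")]
esl_similarity = [0.9, 0.4, 0.2]        # derived from ESL text features (made up)
typology_similarity = [0.8, 0.5, 0.1]   # derived from typological features (made up)

r = pearson(esl_similarity, typology_similarity)
print(f"correlation over {len(language_pairs)} language pairs: {r:.3f}")
```

A high coefficient on real data would indicate that similarity structure recovered from ESL writing mirrors the similarity structure of the native languages themselves.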