795 research outputs found
External Lexical Information for Multilingual Part-of-Speech Tagging
Morphosyntactic lexicons and word vector representations have both proven useful for improving the accuracy of statistical part-of-speech taggers. Here we compare the performance of four systems on datasets covering 16 languages: two feature-based systems (MEMMs and CRFs) and two neural systems (bi-LSTMs). We show that, on average, all four approaches perform similarly and reach state-of-the-art results. Yet our feature-based models obtain better performance on lexically richer datasets (e.g. for morphologically rich languages), whereas neural results are higher on datasets with less lexical variability (e.g. for English). These conclusions hold in particular for the MEMM models relying on our system MElt, which benefited from newly designed features. This shows that, under certain conditions, feature-based approaches enriched with morphosyntactic lexicons are competitive with neural methods.
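The abstract above contrasts feature-based taggers enriched with lexicon information against neural ones. As a purely illustrative sketch (not MElt's actual feature set), external lexicon features can be folded into a MEMM/CRF-style feature function like this; the lexicon maps word forms to the set of POS tags they can bear, and all names here are assumptions:

```python
def extract_features(sentence, i, lexicon):
    """Feature dict for the token at position i, including lexicon features."""
    word = sentence[i]
    feats = {
        "word": word.lower(),
        "suffix3": word[-3:],
        "is_capitalized": word[0].isupper(),
    }
    # Lexicon features: which tags the lexicon licenses for this word
    # and for its immediate neighbours (contextual lexicon features).
    for offset, name in [(-1, "prev"), (0, "cur"), (1, "next")]:
        j = i + offset
        if 0 <= j < len(sentence):
            for tag in lexicon.get(sentence[j].lower(), {"UNK"}):
                feats[f"lex_{name}={tag}"] = True
    return feats

lexicon = {"the": {"DET"}, "dog": {"NOUN"}, "barks": {"VERB", "NOUN"}}
feats = extract_features(["The", "dog", "barks"], 1, lexicon)
# feats contains e.g. lex_prev=DET, lex_cur=NOUN, lex_next=VERB
```

Such binary features are what a linear model (MEMM or CRF) would weight during training; an out-of-lexicon word simply triggers an UNK lexicon feature.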
Creating language resources for under-resourced languages: methodologies, and experiments with Arabic
Language resources are important for those working on computational methods to analyse and study languages. These resources are needed to help advance research in fields such as natural language processing, machine learning, information retrieval and text analysis in general. We describe the creation of useful resources for languages that currently lack them, taking resources for Arabic summarisation as a case study. We illustrate three different paradigms for creating language resources, namely: (1) using crowdsourcing to produce a small resource rapidly and relatively cheaply; (2) translating an existing gold-standard dataset, which is relatively easy but potentially of lower quality; and (3) using manual effort with appropriately skilled human participants to create a resource that is more expensive but of high quality. The last of these was used as a test collection for TAC-2011. An evaluation of the resources is also presented.
Overlaps in Maltese conversational and task-oriented dialogues
This paper deals with overlaps in spoken Maltese. Overlaps are studied in two different corpora recorded in different communicative situations. One is a multimodal corpus involving first-acquaintance conversations; the other consists of Map Task dialogues. The results show that the number of overlaps is larger in the free conversations, where it varies depending on specific aspects of the interaction. They also show that overlaps in the Map Task dialogues tend to be longer, serving the function of establishing common understanding to achieve optimal task completion.
Crowdsourcing for Language Resource Development: Criticisms About Amazon Mechanical Turk Overpowering Use
This article is a position paper about Amazon Mechanical Turk, the use of which has been steadily growing in language processing in the past few years. According to the mainstream opinion expressed in articles of the domain, this type of online working platform makes it possible to develop all sorts of quality language resources quickly, at a very low price, by people doing it as a hobby. We shall demonstrate here that the situation is far from being that ideal. Our goal here is manifold: (1) to inform researchers, so that they can make their own choices; (2) to develop alternatives with the help of funding agencies and scientific associations; (3) to propose practical and organizational solutions in order to improve language resource development, while limiting the risks of ethical and legal issues without letting go of price or quality; and (4) to introduce an Ethics and Big Data Charter for the documentation of language resources.
A context-based model for sentiment analysis in Twitter for the Italian language
Recent works on Sentiment Analysis over Twitter leverage the idea that the sentiment depends on a single incoming tweet. However, tweets are plunged into streams of posts, thus making a wider context available. The contribution of this information has recently been investigated for the English language by modeling polarity detection as a sequential classification task over streams of tweets (Vanzo et al., 2014). Here, we verify the applicability of this method to a morphologically richer language, i.e. Italian.
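The sequential setting the abstract describes classifies each tweet together with the posts that precede it in the stream. A minimal sketch of how such context windows might be built (the window size and representation are assumptions, not the authors' implementation):

```python
def context_windows(stream, size=2):
    """For each tweet in a stream, pair it with its preceding context."""
    windows = []
    for i, tweet in enumerate(stream):
        context = stream[max(0, i - size):i]  # up to `size` earlier tweets
        windows.append((context, tweet))
    return windows

stream = ["t1", "t2", "t3", "t4"]
w = context_windows(stream)
# w[3] == (["t2", "t3"], "t4"); w[0] == ([], "t1")
```

Each (context, target) pair would then be fed to a sequence classifier so that the polarity decision for the target can draw on the conversational context.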
UT-DB: an experimental study on sentiment analysis in Twitter
This paper describes our system for participating in SemEval-2013 Task 2-B (Kozareva et al., 2013): Sentiment Analysis in Twitter. Given a message, our system classifies whether the message carries positive, negative or neutral sentiment. It uses a co-occurrence rate model. The training data are constrained to the data provided by the task organizers (no other tweet data are used). We consider 9 types of features and use a subset of them in our submitted system. To see the contribution of each type of feature, we conduct an experimental study by leaving one type of feature out each time. Results suggest that unigrams are the most important features, bigrams and POS tags seem unhelpful, and stopwords should be retained to achieve the best results. The overall results of our system are promising given the constrained features and data we use.
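The leave-one-out ablation described above can be sketched as a simple loop: score the full system once, then re-score with each feature type removed in turn. `evaluate` here is a stand-in for a real train-and-score routine, and the toy scorer is purely illustrative:

```python
FEATURE_TYPES = ["unigrams", "bigrams", "pos_tags", "stopword_removal"]

def ablation(evaluate, feature_types):
    """Score the full system, then each variant with one feature type left out."""
    results = {"all": evaluate(feature_types)}
    for ft in feature_types:
        kept = [f for f in feature_types if f != ft]
        results[f"-{ft}"] = evaluate(kept)
    return results

# Toy evaluator (integer score): pretend only unigrams add real value.
toy = lambda feats: len(feats) + (1 if "unigrams" in feats else 0)
scores = ablation(toy, FEATURE_TYPES)
# The biggest drop relative to scores["all"] flags the most important feature type.
```

A feature type whose removal barely moves the score (here, bigrams or POS tags) is a candidate for dropping from the final system.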
Learning languages from parallel corpora
This work describes a blueprint for an application that generates language learning exercises from parallel corpora. Word alignment and parallel structures allow for the automatic assessment of sentence pairs in the source and target languages, while users of the application continuously improve the quality of the data with their interactions, thus crowdsourcing parallel language learning material. Through triangulation, their assessment can be transferred to language pairs other than the original ones if multiparallel corpora are used as a source.
Several challenges need to be addressed for such an application to work, and we discuss three of them here. First, the question of how adequate learning material can be identified in corpora has received some attention in the last decade, and we detail what the structure of parallel corpora implies for that selection. Secondly, we consider which types of exercises can be generated automatically from parallel corpora such that they foster learning and keep learners motivated. And thirdly, we highlight the potential of employing users, that is, both teachers and learners, as crowdsourcers to help improve the material.
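One concrete way word alignments can drive exercise generation, as the abstract outlines, is a gap-fill (cloze) exercise where the aligned source words serve as the hint. This is a hedged sketch; the alignment format (a list of (source index, target index) pairs) and function names are assumptions, not the application's actual data model:

```python
def make_cloze(src_tokens, tgt_tokens, alignment, tgt_gap_index):
    """Blank one target token; its aligned source tokens become the hint."""
    hint = [src_tokens[s] for s, t in alignment if t == tgt_gap_index]
    answer = tgt_tokens[tgt_gap_index]
    exercise = [w if i != tgt_gap_index else "____"
                for i, w in enumerate(tgt_tokens)]
    return " ".join(exercise), answer, hint

src = ["Jag", "älskar", "språk"]   # Swedish: "I love languages"
tgt = ["I", "love", "languages"]
alignment = [(0, 0), (1, 1), (2, 2)]
ex, ans, hint = make_cloze(src, tgt, alignment, 1)
# ex == "I ____ languages", ans == "love", hint == ["älskar"]
```

With multiparallel corpora, the same alignment triangulation would let the answer key be projected to a third language without re-annotating the pair.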
A multilingual collection of CoNLL-U-compatible morphological lexicons
We introduce UDLexicons, a multilingual collection of morphological lexicons that follow the guidelines and format of the Universal Dependencies initiative. We describe the three approaches we use to create 53 morphological lexicons covering 38 languages, based on existing resources. These lexicons, which are freely available, have already proven useful for improving part-of-speech tagging accuracy in state-of-the-art architectures.
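Lexicons following Universal Dependencies conventions pair each word form with a lemma, a UPOS tag and attribute=value morphological features. A minimal sketch of reading one such tab-separated entry; the exact column layout here is an assumption for illustration, not the published UDLexicons schema:

```python
def parse_entry(line):
    """Parse 'form<TAB>lemma<TAB>UPOS<TAB>feats' into a dict."""
    form, lemma, upos, feats = line.rstrip("\n").split("\t")
    # UD-style features: 'Gender=Masc|Number=Plur', or '_' when empty.
    features = {} if feats == "_" else dict(
        kv.split("=") for kv in feats.split("|"))
    return {"form": form, "lemma": lemma, "upos": upos, "feats": features}

entry = parse_entry("chats\tchat\tNOUN\tGender=Masc|Number=Plur")
# entry["feats"]["Number"] == "Plur"
```

Because the tag and feature inventories match Universal Dependencies, entries parsed this way can feed a tagger's lexicon features without any tagset mapping.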