
    External Lexical Information for Multilingual Part-of-Speech Tagging

    Morphosyntactic lexicons and word vector representations have both proven useful for improving the accuracy of statistical part-of-speech taggers. Here we compare the performance of four systems on datasets covering 16 languages, two of them feature-based (MEMMs and CRFs) and two of them neural-based (bi-LSTMs). We show that, on average, all four approaches perform similarly and reach state-of-the-art results. Yet better performance is obtained with our feature-based models on lexically richer datasets (e.g. for morphologically rich languages), whereas neural-based results are higher on datasets with less lexical variability (e.g. for English). These conclusions hold in particular for the MEMM models relying on our system MElt, which benefited from newly designed features. This shows that, under certain conditions, feature-based approaches enriched with morphosyntactic lexicons are competitive with neural methods.
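    The lexicon-enrichment idea described above can be sketched as follows. This is an illustrative toy, not the actual MElt feature set: the feature names, context window, and lexicon format are all assumptions.

```python
# Sketch of adding external lexical information to a feature-based tagger:
# alongside standard contextual features, every POS tag the lexicon lists
# for a word (and its neighbours) becomes an extra binary feature.

def word_features(sentence, i, lexicon):
    """Baseline features plus lexicon-derived tags for word i."""
    word = sentence[i]
    feats = {
        "word.lower": word.lower(),
        "suffix3": word[-3:],
        "is_capitalized": word[:1].isupper(),
    }
    # Lexicon features over a window of one word on each side.
    for offset in (-1, 0, 1):
        j = i + offset
        if 0 <= j < len(sentence):
            for tag in lexicon.get(sentence[j].lower(), ()):
                feats[f"lex[{offset}]={tag}"] = True
    return feats

lexicon = {"walks": {"VERB", "NOUN"}, "the": {"DET"}}
print(word_features(["the", "walks"], 1, lexicon))
```

    The lexicon features fire even for word forms unseen in training data, which is one plausible reason such models do well on morphologically rich languages.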

    Creating language resources for under-resourced languages: methodologies, and experiments with Arabic

    Language resources are important for those working on computational methods to analyse and study languages. Such resources are needed to advance research in fields such as natural language processing, machine learning, information retrieval and text analysis in general. We describe the creation of useful resources for languages that currently lack them, taking resources for Arabic summarisation as a case study. We illustrate three different paradigms for creating language resources, namely: (1) using crowdsourcing to produce a small resource rapidly and relatively cheaply; (2) translating an existing gold-standard dataset, which is relatively easy but potentially of lower quality; and (3) using manual effort with appropriately skilled human participants to create a resource that is more expensive but of higher quality. The last of these was used as a test collection for TAC-2011. An evaluation of the resources is also presented.

    Overlaps in Maltese conversational and task-oriented dialogues

    This paper deals with overlaps in spoken Maltese. Overlaps are studied in two corpora recorded in different communicative situations: one is a multimodal corpus of first-acquaintance conversations; the other consists of Map Task dialogues. The results show that overlaps are more numerous in the free conversations, where their frequency varies depending on specific aspects of the interaction. They also show that overlaps in the Map Task dialogues tend to be longer, serving the function of establishing common understanding to achieve optimal task completion.

    Crowdsourcing for Language Resource Development: Criticisms About Amazon Mechanical Turk Overpowering Use

    This article is a position paper about Amazon Mechanical Turk, whose use has been steadily growing in language processing over the past few years. According to the mainstream opinion expressed in articles in the field, such online working platforms make it possible to develop all sorts of quality language resources quickly, at a very low price, by people doing it as a hobby. We shall demonstrate here that the situation is far from being that ideal. Our goal is fourfold: 1- to inform researchers, so that they can make their own choices; 2- to develop alternatives with the help of funding agencies and scientific associations; 3- to propose practical and organizational solutions to improve language resource development, while limiting the risks of ethical and legal issues without compromising on price or quality; 4- to introduce an Ethics and Big Data Charter for the documentation of language resources.

    A context based model for sentiment analysis in twitter for the italian language

    Recent work on Sentiment Analysis over Twitter builds on the idea that the sentiment depends on a single incoming tweet. However, tweets are embedded in streams of posts, which make a wider context available. The contribution of this information has recently been investigated for the English language by modeling polarity detection as a sequential classification task over streams of tweets (Vanzo et al., 2014). Here, we verify the applicability of this method to a morphologically richer language, namely Italian.
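    The stream-context idea above can be sketched as follows. The window size and the (context, tweet) instance shape are illustrative assumptions, not the model of Vanzo et al. (2014):

```python
# Sketch of context-based polarity classification: instead of classifying
# each tweet in isolation, pair it with the n preceding tweets from its
# stream, so a sequential classifier can exploit the conversation context.

def with_context(stream, n=2):
    """Yield (context, tweet) instances from a chronological tweet stream."""
    for i, tweet in enumerate(stream):
        context = stream[max(0, i - n):i]
        yield context, tweet

stream = ["great match!", "what a goal", "terrible refereeing"]
pairs = list(with_context(stream, n=2))
print(pairs[2])  # (['great match!', 'what a goal'], 'terrible refereeing')
```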

    UT-DB: an experimental study on sentiment analysis in twitter

    This paper describes our system participating in SemEval-2013 Task 2-B (Kozareva et al., 2013): Sentiment Analysis in Twitter. Given a message, our system classifies it as expressing positive, negative or neutral sentiment. It uses a co-occurrence rate model. The training data are constrained to the data provided by the task organizers (no other tweet data are used). We consider 9 types of features and use a subset of them in our submitted system. To see the contribution of each feature type, we conduct an ablation study, leaving out one feature type at a time. Results suggest that unigrams are the most important features, bigrams and POS tags appear unhelpful, and stopwords should be retained to achieve the best results. The overall results of our system are promising given the constrained features and data we use.
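    The leave-one-type-out ablation described above can be sketched as follows. The feature types and the toy scorer are illustrative stand-ins, not the authors' system or scores:

```python
# Ablation study: score the full feature set once, then re-train and
# re-score with each feature type held out; a large drop when a type is
# removed indicates that type contributes most.

def ablation(feature_types, train_and_score):
    """Return scores for the full set and for each leave-one-out subset."""
    results = {"ALL": train_and_score(feature_types)}
    for held_out in feature_types:
        kept = [f for f in feature_types if f != held_out]
        results[f"-{held_out}"] = train_and_score(kept)
    return results

# Toy scorer: pretend unigrams matter most and bigrams not at all.
weights = {"unigram": 0.30, "bigram": 0.00, "pos": 0.01, "stopwords": 0.05}
score = lambda feats: round(0.40 + sum(weights[f] for f in feats), 2)

print(ablation(["unigram", "bigram", "pos", "stopwords"], score))
```

    In this toy setup, dropping unigrams causes the largest score drop while dropping bigrams changes nothing, mirroring the pattern the abstract reports.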

    Learning languages from parallel corpora

    This work describes a blueprint for an application that generates language-learning exercises from parallel corpora. Word alignment and parallel structures allow for the automatic assessment of sentence pairs in the source and target languages, while users of the application continuously improve the quality of the data through their interactions, thus crowdsourcing parallel language-learning material. Through triangulation, their assessments can be transferred to language pairs other than the original ones if multiparallel corpora are used as a source. Several challenges need to be addressed for such an application to work, and we discuss three of them here. First, the question of how adequate learning material can be identified in corpora has received some attention in the last decade, and we detail what the structure of parallel corpora implies for that selection. Second, we consider which types of exercises can be generated automatically from parallel corpora such that they foster learning and keep learners motivated. Third, we highlight the potential of employing users, that is, both teachers and learners, as crowdsourcers to help improve the material.
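    One way exercise generation from aligned sentence pairs could look is sketched below. The alignment format, the cloze exercise type, and the hint mechanism are assumptions for illustration, not the blueprint's actual design:

```python
# Sketch of generating a gap-fill (cloze) exercise from a word-aligned
# sentence pair: blank one target-language word and use the alignment to
# recover the corresponding source-language word as a hint.

def make_cloze(source, target, alignment, gap_index):
    """alignment: list of (src_idx, tgt_idx) pairs from a word aligner."""
    hint = next((source[s] for s, t in alignment if t == gap_index), None)
    blanked = [w if i != gap_index else "____" for i, w in enumerate(target)]
    return {"exercise": " ".join(blanked),
            "answer": target[gap_index],
            "hint": hint}

src = ["das", "Haus", "ist", "rot"]
tgt = ["the", "house", "is", "red"]
align = [(0, 0), (1, 1), (2, 2), (3, 3)]
print(make_cloze(src, tgt, align, 1))
# {'exercise': 'the ____ is red', 'answer': 'house', 'hint': 'Haus'}
```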

    A multilingual collection of CoNLL-U-compatible morphological lexicons

    We introduce UDLexicons, a multilingual collection of morphological lexicons that follow the guidelines and format of the Universal Dependencies initiative. We describe the three approaches we use to create 53 morphological lexicons covering 38 languages, based on existing resources. These lexicons, which are freely available, have already proven useful for improving part-of-speech tagging accuracy in state-of-the-art architectures.