17 research outputs found

    bot.zen @ EVALITA 2016 - A minimally-deep learning PoS-tagger (trained for Italian Tweets)

    This article describes the system that participated in the POS tagging for Italian Social Media Texts (PoSTWITA) task of EVALITA 2016, the 5th periodic evaluation campaign of Natural Language Processing (NLP) and speech tools for the Italian language. The work is a continuation of Stemle (2016), with minor modifications to the system and different data sets. It combines a small selection of trending techniques that implement mature methods from NLP and machine learning to achieve competitive results on PoS tagging of Italian Twitter texts; in particular, the system uses word embeddings and character-level representations of word beginnings and endings in an LSTM RNN architecture. Labelled data (the Italian UD corpus, DiDi, and PoSTWITA) and unlabelled data (the Italian C4Corpus and PAISÀ) were used for training. The system is available under the APLv2 open-source license.
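    As a rough illustration of the architecture sketched above (word embeddings combined with character-level representations of word beginnings and endings, fed to an LSTM RNN), the following Keras sketch builds a comparable tagger. All vocabulary sizes, embedding dimensions, the affix length, and the layer choices are illustrative assumptions, not the parameters of the bot.zen system.

        # Sketch only: hyperparameters and layer choices are illustrative, not bot.zen's.
        from tensorflow.keras import layers, Model

        MAX_LEN = 50       # tokens per tweet (padded/truncated)
        AFFIX_LEN = 4      # characters kept from each word's beginning and ending
        WORD_VOCAB, CHAR_VOCAB, N_TAGS = 20000, 120, 18

        # Inputs: word ids plus character ids of each word's first and last AFFIX_LEN characters
        words    = layers.Input(shape=(MAX_LEN,), name="word_ids")
        prefixes = layers.Input(shape=(MAX_LEN, AFFIX_LEN), name="prefix_char_ids")
        suffixes = layers.Input(shape=(MAX_LEN, AFFIX_LEN), name="suffix_char_ids")

        w_emb = layers.Embedding(WORD_VOCAB, 100)(words)
        char_emb = layers.Embedding(CHAR_VOCAB, 16)
        # Collapse each word's beginning/ending characters into one fixed-size vector
        p_emb = layers.TimeDistributed(layers.Flatten())(char_emb(prefixes))
        s_emb = layers.TimeDistributed(layers.Flatten())(char_emb(suffixes))

        x = layers.Concatenate()([w_emb, p_emb, s_emb])
        x = layers.Bidirectional(layers.LSTM(100, return_sequences=True))(x)
        tags = layers.TimeDistributed(layers.Dense(N_TAGS, activation="softmax"))(x)

        model = Model(inputs=[words, prefixes, suffixes], outputs=tags)
        model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
        model.summary()

    Keeping the first and last few characters of every token gives the model access to inflectional endings and capitalisation cues that word embeddings trained on unlabelled data may miss for rare or misspelled Twitter tokens.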

    Proceedings of the 5th Conference on CMC and Social Media Corpora for the Humanities (cmccorpora17)

    This volume presents the proceedings of the 5th edition of the annual conference series on CMC and Social Media Corpora for the Humanities (cmc-corpora2017). This conference series is dedicated to the collection, annotation, processing, and exploitation of corpora of computer-mediated communication (CMC) and social media for research in the humanities. The annual event brings together language-centered research on CMC and social media in linguistics, philologies, communication sciences, media and social sciences with research questions from the fields of corpus and computational linguistics, language technology, text technology, and machine learning. The 5th Conference on CMC and Social Media Corpora for the Humanities was held at Eurac Research in Bolzano, Italy, on October 4th and 5th. This volume contains extended abstracts of the invited talks, the papers, and extended abstracts of the posters presented at the event. The conference attracted 26 valid submissions. Each submission was reviewed by at least two members of the scientific committee. The committee decided to accept 16 papers and 8 posters, of which 14 papers and 3 posters were presented at the conference. The programme also included three invited talks: two keynote talks by Aivars Glaznieks (Eurac Research, Italy) and A. Seza Doğruöz (independent researcher), and an invited talk on the Common Language Resources and Technology Infrastructure (CLARIN) given by Darja Fišer, the CLARIN ERIC Director of User Involvement.

    Exploring Reusability and Reproducibility for a Research Infrastructure for L1 and L2 Learner Corpora

    Up until today, research in various educational and linguistic domains, such as learner corpus research, writing research, or second language acquisition, has produced a substantial amount of research data in the form of L1 and L2 learner corpora. However, the multitude of individual solutions, combined with domain-inherent obstacles to data sharing, has so far hampered the comparability, reusability, and reproducibility of data and research results. In this article, we present work on creating a digital infrastructure for L1 and L2 learner corpora and on populating it with data collected in the past. We embed our infrastructure efforts in the broader field of infrastructures for scientific research, drawing on technical solutions and frameworks from research data management, among them the FAIR guiding principles for data stewardship. We share our experiences from integrating some L1 and L2 learner corpora from concluded projects into the infrastructure while trying to ensure compliance with the FAIR principles and the standards we established for reproducibility, and we discuss to what extent research data collected in the past can be made comparable, reusable, and reproducible. Our results show that some basic needs for providing comparable and reusable data are covered by existing general infrastructure solutions and can be exploited for domain-specific infrastructures such as the one presented in this article; other aspects need genuinely domain-driven approaches. The solutions found for the corpora in the presented infrastructure can only be a preliminary attempt, and further community involvement would be needed to provide templates and models acknowledged and promoted by the community. Furthermore, forward-looking data management would be needed from the beginning of new corpus-creation projects to ensure that all requirements for FAIR data can be met.
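    The FAIR-related part of the work lends itself to a small illustration: a minimal, machine-readable metadata record of the kind that makes a learner corpus findable, accessible, interoperable, and reusable. The field names and values below are invented for illustration and do not reflect the schema actually used by the infrastructure.

        # Sketch only: field names and values are invented, not the infrastructure's schema.
        import json

        corpus_record = {
            "identifier": "hdl:20.500.12345/example-corpus",   # hypothetical persistent identifier
            "title": "Example L2 learner corpus",
            "resource_type": "corpus",
            "languages": {"target": ["ita"], "l1": ["deu", "eng"]},
            "license": "CC BY-NC-SA 4.0",
            "access": "restricted: academic use after signing a data agreement",
            "formats": ["TEI/XML", "CoNLL-U"],
            "provenance": {
                "collected": "2014-2016",
                "annotation": ["learner error tags (manual)", "PoS tags (automatic, manually checked)"],
            },
            "related_publications": ["doi:10.1234/placeholder"],   # placeholder reference
        }

        print(json.dumps(corpus_record, indent=2, ensure_ascii=False))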

    The DiDi Corpus of South Tyrolean CMC Data: A multilingual corpus of Facebook texts

    The DiDi corpus of South Tyrolean data of computer-mediated communication (CMC) is a multilingual sociolinguistic language corpus. It consists of around 600,000 tokens collected from 136 profiles of Facebook users residing in South Tyrol, Italy. In keeping with the multilingual situation of the territory, the main languages of the corpus are German and Italian (followed by English). The data have been manually anonymised; the corpus provides manually corrected part-of-speech tags for the Italian-language texts and manually normalised data for the German texts. Moreover, it is annotated with user-provided socio-demographic data from a questionnaire (among others L1, gender, age, education, and internet communication habits) and with linguistic annotations regarding CMC phenomena, languages, and varieties. The anonymised corpus is freely available for research purposes.
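    The annotation layers described in the abstract can be pictured with a small data-structure sketch: per-user sociodemographic metadata, per-message language labels and CMC phenomena, and per-token annotations (PoS for Italian, normalisation for German). The field names below are assumptions for illustration, not the corpus's actual serialisation format.

        # Sketch only: field names are assumptions, not the corpus's actual format.
        from dataclasses import dataclass, field
        from typing import List, Optional

        @dataclass
        class Token:
            form: str
            pos: Optional[str] = None          # manually corrected PoS tag (Italian texts)
            normalised: Optional[str] = None   # manual normalisation (German texts)

        @dataclass
        class Message:
            text: str
            language: str                                            # e.g. "de", "it", "en"
            cmc_phenomena: List[str] = field(default_factory=list)   # e.g. emoticons, iteration
            tokens: List[Token] = field(default_factory=list)

        @dataclass
        class UserProfile:
            l1: str
            gender: str
            age_group: str
            education: str
            messages: List[Message] = field(default_factory=list)

        user = UserProfile(l1="de", gender="f", age_group="20-29", education="university")
        user.messages.append(Message(
            text="ciao a tutti :)",
            language="it",
            cmc_phenomena=["emoticon"],
            tokens=[Token("ciao", pos="INTJ"), Token("a", pos="ADP"), Token("tutti", pos="PRON")],
        ))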

    Collecting language data of non-public social media profiles

    In this paper, we propose an integrated web strategy for mixed sociolinguistic research methodologies in the context of social media corpora. After outlining the particular challenges of building corpora of private, non-public computer-mediated communication, we present our solution to these problems: a Facebook web application for the acquisition of such data and the corresponding metadata. Finally, we discuss the positive and negative implications of this method.
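    A minimal sketch of the acquisition step may help picture the approach: once a participant has authorised the web application via Facebook Login, their posts can be retrieved through the Graph API and stored together with the questionnaire metadata. The endpoint, fields, and API version below are assumptions modelled on the Graph API of that period, not the project's actual implementation.

        # Sketch only: endpoint, fields and API version are assumptions, not the project's code.
        import requests

        GRAPH_URL = "https://graph.facebook.com/v2.5/me/feed"   # hypothetical API version

        def fetch_posts(access_token, limit=100):
            """Download a consenting participant's own posts, following pagination."""
            params = {"access_token": access_token,
                      "fields": "message,created_time",
                      "limit": limit}
            url, posts = GRAPH_URL, []
            while url:
                resp = requests.get(url, params=params)
                resp.raise_for_status()
                data = resp.json()
                posts.extend(p for p in data.get("data", []) if "message" in p)
                url = data.get("paging", {}).get("next")   # next page, or None when done
                params = {}                                # the "next" URL already carries its parameters
            return posts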

    Structure-Preserving Pipelines for Digital Libraries

    Most existing HLT pipelines assume the input is pure text or, at most, HTML, and either ignore (logical) document structure or remove it. We argue that identifying the structure of documents is essential in digital libraries and other types of applications, and we show that it is relatively straightforward to extend existing pipelines into ones in which the structure of a document is preserved.
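    The argument can be illustrated with a short sketch: instead of flattening a document to plain text before running NLP tools, the pipeline keeps the logical structure (here, HTML block elements) and attaches the linguistic output to each structural unit. The whitespace split stands in for a real tokeniser or tagger, and the use of BeautifulSoup is an assumption for the sketch, not the paper's implementation.

        # Sketch only: the whitespace split stands in for a real tokeniser/tagger.
        from bs4 import BeautifulSoup

        html = "<html><body><h1>Title</h1><p>First paragraph.</p><p>Second one.</p></body></html>"
        soup = BeautifulSoup(html, "html.parser")

        annotated = []
        for element in soup.find_all(["h1", "h2", "p", "li"]):   # structural units to preserve
            text = element.get_text(" ", strip=True)
            tokens = text.split()                                 # placeholder for tokenisation/tagging
            annotated.append({"unit": element.name, "text": text, "tokens": tokens})

        for unit in annotated:
            print(unit["unit"], "->", unit["tokens"])

    Because every annotation stays attached to the structural unit it came from, downstream applications can still render or index the document by heading, paragraph, or list item.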

    Comparison of automatic vs. manual language identification in multilingual social media texts

    Multilingual speakers communicate in more than one language in daily life and on social media. In order to process or investigate multilingual communication, language identification is needed. This study compares the performance of human annotators with automatic methods of language identification on a multilingual (mainly German-Italian-English) social media corpus collected in South Tyrol, Italy. Our results indicate that humans and Natural Language Processing (NLP) systems follow their own strategies when deciding on the language of multilingual text messages. This results in low agreement when different annotators or NLP systems perform the same task. In general, annotators agree with each other more than NLP systems do. However, human agreement also varies depending on whether guidelines for the annotation task were established beforehand.
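    Agreement of the kind discussed here is commonly quantified with chance-corrected measures such as Cohen's kappa. The sketch below shows how agreement between one annotator and one language-identification system on message-level labels could be computed; the labels are invented examples, not data from the corpus described above.

        # Sketch only: invented labels, not data from the corpus described above.
        from sklearn.metrics import cohen_kappa_score

        # One language label per message: German (de), Italian (it), English (en), or mixed
        annotator_a = ["de", "it", "it", "mixed", "en", "de", "it"]
        lid_system  = ["de", "it", "en", "it",    "en", "de", "it"]

        kappa = cohen_kappa_score(annotator_a, lid_system)
        print(f"Cohen's kappa: {kappa:.2f}")   # 1.0 = perfect agreement, 0 = chance-level agreement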