Search CORE

40 research outputs found

Résumé automatique de textes d'opinion

Authors: Bossard Aurélien; Généreux Michel; Poibeau Thierry
Publication venue: ATALA (Association pour le Traitement Automatique des Langues)
Publication date: 01/01/2010

In this paper, we present a summarization system specifically designed to process blog posts, where factual information is mixed with opinions on the facts discussed. The system is based on a new approach that combines redundancy analysis with the tracking of new information in the texts. This generic system is further enriched by a module that computes the polarity of textual fragments, so that the subjectivity characteristic of blog posts is handled appropriately. The system is evaluated on English data through participation in TAC (Text Analysis Conference), an international evaluation framework for automatic summarization, in which it obtained satisfactory results.

HAL-Paris 13

A large Portuguese corpus on-line: cleaning and preprocessing

Authors: Généreux Michel; Hendrickx Iris; Mendes Amália
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2012

We present a newly available online resource for Portuguese, a corpus of 310 million words: a new version of the Reference Corpus of Contemporary Portuguese, now searchable via a user-friendly web interface. Here we report on work carried out on the corpus prior to its publication online. We focus on the processes and tools involved in the cleaning, preparation and annotation that make the corpus suitable for linguistic inquiries.

Universidade de Lisboa: Repositório.UL

A corpus of European Portuguese child and child-directed speech

Authors: Santos Ana Lúcia; Généreux Michel; Cardoso Aida; Agostinho Celina; Abalada Silvana
Publication venue: European Language Resources Association
Publication date: 01/01/2014

We present a corpus of child and child-directed speech of European Portuguese. This corpus results from the expansion of an already existing database (Santos, 2006). It includes around 52 hours of child-adult interaction and now contains 27,595 child utterances and 70,736 adult utterances. The corpus was transcribed according to the CHILDES system (Child Language Data Exchange System) using the CLAN software (MacWhinney, 2000). The corpus itself represents a valuable resource for the study of lexical, syntactic and discourse acquisition. In this paper, we also show how we used an existing part-of-speech tagger trained on written material (Généreux, Hendrickx & Mendes, 2012) to automatically lemmatize and tag child and child-directed speech and to generate a line with part-of-speech information compatible with the CLAN interface. We show that a POS tagger trained on written language can be exploited for the treatment of spoken material with minimal effort, with only a small number of written rules assisting the statistical model.

Hal - Université Grenoble Alpes

Universidade de Lisboa: Repositório.UL

Lexical analysis of pre and post revolution discourse in Portugal

Authors: Bacelar do Nascimento Maria Fernanda; Généreux Michel; Mendes Amália; Santos Pereira Luísa Alice
Publication venue: European Language Resources Association
Publication date: 01/01/2010

This paper presents a lexical comparison of pre (1954-74) and post (1974-94) revolution parliamentary discourse in four comparable sub-corpora extracted from the Reference Corpus of Contemporary Portuguese (CRPC). After introducing the CRPC, including annotation and metadata, we focus on a subset of the corpus dealing with parliamentary discourse, more particularly a time frame of forty years divided into four comparable sub-corpora, each covering a ten-year period, two pre-revolution and two post-revolution. We extract lexical density information as well as salient terms pertaining to each period to make a comparative evaluation of the periods. Our results show how a linguistic analysis essentially based on the use of simple n-gram statistics can produce key insights into the use, change and evolution of the Portuguese language around a critical time period in its history.

Universidade de Lisboa: Repositório.UL

The Gulf of Guinea Creole Corpora

Authors: Généreux Michel; Hagemeijer Tjerk; Hendrickx Iris; Mendes Amália; Tiny Abigail; Zamora Armando
Publication venue: European Language Resources Association
Publication date: 01/01/2014

We present the process of building linguistic corpora of the Portuguese-related Gulf of Guinea creoles, a cluster of four historically related languages: Santome, Angolar, Principense and Fa d’Ambô. We faced the typical difficulties of languages lacking an official status, such as lack of standard spelling, language variation, lack of basic language instruments, and small data sets, which comprise data from the late 19th century to the present. To tackle these problems, the compiled written data and the transcribed spoken data collected during fieldwork trips were adapted to a normalized spelling that was applied to the four languages. For the corpus compilation we followed corpus linguistics standards. We recorded metadata for each file and added morphosyntactic information based on a part-of-speech tag set that was designed to deal with the specificities of these languages. The corpora of three of the four creoles are already available and searchable via an online web interface.

Universidade de Lisboa: Repositório.UL

A corpus of European Portuguese child and child-directed speech

Authors: Aida Cardoso; Ana Lúcia Santos; Celina Agostinho; Michel Généreux; Silvana Abalada
Publication venue
Publication date: 11/04/2020

We present a corpus of child and child-directed speech of European Portuguese. This corpus results from the expansion of an already existing databas

CiteSeerX

CQPWeb: Uma nova plataforma de pesquisa para o CRPC

Authors: Antunes Sandra; Bacelar do Nascimento Maria Fernanda; Généreux Michel; Hendrickx Iris; Mendes Amália; Pereira Luísa
Publication venue: Associação Portuguesa de Linguística
Publication date: 01/01/2012

We present a newly available online resource for Portuguese, a new version of the Reference Corpus of Contemporary Portuguese, now searchable via a user-friendly web interface. We report on work carried out on the corpus prior to its publication online, namely how the corpus was built, our choice of metadata, and the processes and tools involved in the cleaning, preparation and annotation that make the corpus suitable for linguistic inquiries. We also describe the web platform and summarize the extensive search options available for linguistic or NLP studies.

Universidade de Lisboa: Repositório.UL

Résumé automatique de textes d'opinion

Authors: Bossard Aurélien; Généreux Michel
Publication venue: HAL CCSD
Publication date: 01/04/2009

There is currently a growing need for the analysis of texts expressing opinions or judgements. In this paper, we present a summarization system that is specifically designed to process blog posts, where factual information is mixed with opinions. We show that a classical approach based on surface cues is effective for summarizing this kind of text. The system is evaluated through participation in TAC (Text Analysis Conference), an international evaluation framework for automatic summarization, in which our system obtained good results.

HAL-Paris 13

Sentiment analysis using automatically labelled financial news

Authors: Généreux Michel; Koppel Moshe; Poibeau Thierry
Publication venue: HAL CCSD
Publication date: 01/06/2008

Given a corpus of financial news labelled according to the market reaction following their publication, we investigate contemporaneous and forward-looking stock price movements. Our approach is to provide a pool of relevant textual features to a machine learning algorithm to detect substantial stock price variations. Our two working hypotheses are that the market reaction to a news item is a good indicator for labelling financial news, and that a machine learning algorithm can be trained on such news to build models that detect price movements effectively.

HAL-Paris 13