40 research outputs found

    Résumé automatique de textes d'opinion

    No full text
    International audienceIn this paper, we present a summarization system that is specifically designed to process blog posts, where factual information is mixed with opinions on the discussed facts. Our approach combines redundancy analysis with new information tracking and is enriched by a module that computes the polarity of textual fragments in order to summarize blog posts more efficiently. The system is evaluated against English data, especially through the participation in TAC (Text Analysis Conference), an international evaluation framework for automatic summarization, in which our system obtained interesting results.Nous présentons dans cet article un système de résumé automatique tourné vers l'analyse de blogs, où sont exprimées à la fois des informations factuelles et des prises de position sur les faits considérés. Notre système de résumé est fondé sur une approche nouvelle qui mêle analyse de la redondance et repérage des informations nouvelles dans les textes ; ce système générique est en outre enrichi d'un module de calcul de la polarité de l'opinion véhiculée afin de traiter de façon appropriée la subjectivité qui est le propre des billets de blogs. Le système est évalué sur l'anglais, à travers la participation à la campagne d'évaluation internationale TAC (Text Analysis Conference) où notre système a obtenu des performances satisfaisantes

    A large Portuguese corpus on-line: cleaning and preprocessing

    Get PDF
    We present a newly available on-line resource for Portuguese,a corpus of 310 million words, a new version of the Reference Corpus of Contemporary Portuguese, now searchable via a user-friendly web interface. Here we report on work carried out on the corpus previous toits publication on-line. We focus on the processes and tools involved for the cleaning, preparation and annotation to make the corpus suitable for linguistic inquiries.info:eu-repo/semantics/publishedVersio

    A corpus of European Portuguese child and child-directed speech

    Get PDF
    We present a corpus of child and child-directed speech of European Portuguese. This corpus results from the expansion of an already existing database (Santos, 2006). It includes around 52 hours of child-adult interaction and now contains 27,595 child utterances and 70,736 adult utterances. The corpus was transcribed according to the CHILDES system (Child Language Data Exchange System) and using the CLAN software (MacWhinney, 2000). The corpus itself represents a valuable resource for the study of lexical, syntax and discourse acquisition. In this paper, we also show how we used an existing part-of-speech tagger trained on written material (Généreux, Hendrickx & Mendes, 2012) to automatically lemmatize and tag child and child-directed speech and generate a line with part-of-speech information compatible with the CLAN interface. We show that a POS-tagger trained on the analysis of written language can be exploited for the treatment of spoken material with minimal effort, with only a small number of written rules assisting the statistical model.info:eu-repo/semantics/publishedVersio

    Lexical analysis of pre and post revolution discourse in Portugal

    Get PDF
    This paper presents a lexical comparison of pre (1954-74) and post (1974-94) revolution parliamentary discourse in four comparable sub-corpora extracted from the Reference Corpus of Contemporary Portuguese (CRPC). After introducing the CRPC, including annotation and meta-data, we focus on a subset of the corpus dealing with parliamentary discourses, more particularly a time frame of forty years divided into four comparable sub-corpora, each covering a ten-year period, two pre revolution and two post revolution. We extract lexical density information as well as salient terms pertaining to each period to make a comparative evaluation of the periods. Our results show how a linguistic analysis essentially based on the use of simple n-gram statistics can produce key insights into the use, change and evolution of the Portuguese language around a critical time period in its history.info:eu-repo/semantics/publishedVersio

    The Gulf of Guinea Creole Corpora

    Get PDF
    We present the process of building linguistic corpora of the Portuguese-related Gulf of Guinea creoles, a cluster of four historically related languages: Santome, Angolar, Principense and Fa d’Ambô. We faced the typical difficulties of languages lacking an official status, such as lack of standard spelling, language variation, lack of basic language instruments, and small data sets, which comprise data from the late 19th century to the present. In order to tackle these problems, the compiled written and transcribed spoken data collected during field work trips were adapted to a normalized spelling that was applied to the four languages. For the corpus compilation we followed corpus linguistics standards. We recorded meta data for each file and added morphosyntactic information based on a part-of-speech tag set that was designed to deal with the specificities of these languages. The corpora of three of the four creoles are already available and searchable via an online web interface.info:eu-repo/semantics/publishedVersio

    A corpus of European Portuguese child and child-directed speech

    Get PDF
    Abstract We present a corpus of child and child-directed speech of European Portuguese. This corpus results from the expansion of an already existing databas

    CQPWeb: Uma nova plataforma de pesquisa para o CRPC

    Get PDF
    We present a newly available online resource for Portuguese, a new version of the Reference Corpus of Contemporary Portuguese, now searchable via a user-friendly web interface. We report on work carried out on the corpus previous to its publication online, namely how the corpus was built, our choice of metadata and the processes and tools involved for the cleaning, preparation and annotation to make the corpus suitable for linguistic inquiries. We also describe the web platform and resume the extensive search options available for linguistic or NLP studies.info:eu-repo/semantics/publishedVersio

    Résumé automatique de textes d'opinion

    No full text
    There is currently a growing need concerning the analysis of texts expressing opinions or judgements. In this paper, we present a summarization system that is specifically designed to process blog posts, where factual information is mixed with opinions. We show that a classical approach based on surface cues is efficient to summarize this kind of texts. The system is evaluated through a participation to TAC (Text Analysis Conference), an international evaluation framework for automatic summarization, in which our system obtained good results

    Sentiment analysis using automatically labelled financial news

    Get PDF
    International audienceGiven a corpus of financial news labelled according to the market reaction following their publication, we investigate cotemporeneous and forward-looking price stock movements. Our approach is to provide a pool of relevant textual features to a machine learning algorithm to detect substantial stock price variations. Our two working hypotheses are that the market reaction to a news is a good indicator for labelling financial news, and that a machine learning algorithm can be trained on those news to build models detecting price movement effectively
    corecore