Towards the Application of Speech Act Theory to Opinion Mining
The paper takes a pragmatics perspective on opinion mining in Polish and English, motivated by the gap between the coverage of sentiment analysis and market demand. An analysis of speech acts expressed in opinion texts reveals that almost half of all opinions contain forms of indirect evaluation that may not be extracted by traditional sentiment-analysis methods based on direct evaluative vocabulary and polarity lexicons. Coding sentiment with respect to speech acts could vastly broaden data-mining results within an NLP system.
On the Application of Speech Act Theory to Data Extraction from Online Opinion Texts
One of the current topics in computational linguistics, the automatic analysis of the sentiment of utterances, has so far not sufficiently taken linguistic pragmatics into account, e.g. the speech acts of Austin (1961) and Searle (1969), and therefore also implicit ways of expressing evaluation. Yet an approach leading from pragmatics to constructions translated into programming rules would not only broaden the view of sentiment analysis but also bring the machine closer to the way a human reader perceives a text. This concerns in particular ways of expressing (dis)satisfaction that go beyond the lexical level (without negatively marked vocabulary), such as "Nigdy więcej tam nie pójdę" ("I will never go there again").
The article presents: 1. current approaches to sentiment analysis in computational linguistics; 2. a proposal to apply a pragmatic approach; 3. the results of a study of a sample of online opinion texts with respect to the speech acts occurring in them; 4. a proposal for creating data-extraction rules on that basis. The approach presented assumes the hypothesis of secondary orality, i.e. that the language of online opinions is written-down spoken language.
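As a loose illustration of what such pragmatics-driven extraction rules might look like, here is a minimal sketch in Python; the pattern list, names, and coverage are assumptions for illustration, not the paper's actual rule set:

```python
import re

# Hypothetical patterns for indirect, speech-act-level evaluation
# (commissive acts of refusal) that a polarity lexicon would miss;
# illustrative only, not the paper's rule set.
COMMISSIVE_NEGATIVE = [
    r"\bnever going back\b",
    r"\bnever (?:eat|go|shop|stay) (?:there|here) again\b",
    r"\bnigdy więcej\b",                 # Polish: "never again"
    r"\bwould not recommend\b",
]

def indirect_sentiment(text):
    """Return 'negative' if a commissive negative speech act is detected,
    otherwise None (defer to lexicon-based polarity analysis)."""
    lowered = text.lower()
    for pattern in COMMISSIVE_NEGATIVE:
        if re.search(pattern, lowered):
            return "negative"
    return None

print(indirect_sentiment("Nigdy więcej tam nie pójdę"))  # -> negative
print(indirect_sentiment("The staff were friendly"))     # -> None
```

A real system would of course need far richer rules per speech-act type; the point of the sketch is only that such rules fire on utterances carrying no negatively marked vocabulary at all.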
The STEM-ECR Dataset: Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources
We introduce the STEM (Science, Technology, Engineering, and Medicine) Dataset for Scientific Entity Extraction, Classification, and Resolution, version 1.0 (STEM-ECR v1.0). The STEM-ECR v1.0 dataset has been developed to provide a benchmark for the evaluation of scientific entity extraction, classification, and resolution tasks in a domain-independent fashion. It comprises abstracts in 10 STEM disciplines that were found to be the most prolific ones on a major publishing platform. We describe the creation of this multidisciplinary corpus and highlight our findings with respect to the following features: 1) a generic conceptual formalism for scientific entities in a multidisciplinary scientific context; 2) the feasibility of domain-independent human annotation of scientific entities under such a generic formalism; 3) a performance benchmark obtainable for the automatic extraction of multidisciplinary scientific entities using BERT-based neural models; 4) a delineated three-step entity resolution procedure for human annotation of the scientific entities via encyclopedic entity linking and lexicographic word sense disambiguation; and 5) human evaluations of the encyclopedic links and lexicographic senses returned by Babelfy for our entities. Our findings cumulatively indicate that human annotation and automatic learning of multidisciplinary scientific concepts, as well as their semantic disambiguation in a wide-ranging setting such as STEM, are feasible.
Comment: Published in LREC 2020. Publication URL: https://www.aclweb.org/anthology/2020.lrec-1.268/; Dataset DOI: https://doi.org/10.25835/001754
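As a hedged sketch of how such BERT-based scientific entity extraction could be framed (not the authors' code; the checkpoint below carries an untrained classification head and the label set is an illustrative assumption), here is a token-classification setup with Hugging Face Transformers:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Illustrative BIO label set for generic scientific concept types;
# the exact inventory is an assumption, not read off the dataset.
LABELS = ["O", "B-Process", "I-Process", "B-Method", "I-Method",
          "B-Material", "I-Material", "B-Data", "I-Data"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(LABELS))  # untrained head: illustration only

sentence = "We measure the conductivity of graphene at room temperature."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits              # (1, seq_len, num_labels)
pred = logits.argmax(-1)[0]

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, label_id in zip(tokens, pred):
    print(f"{tok:15s} {LABELS[int(label_id)]}")  # random until fine-tuned
```

Fine-tuning this head on the STEM-ECR annotations is what would yield the kind of extraction benchmark the abstract describes.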
Exploiting Representation Bias for Data Distillation in Abstractive Text Summarization
The number of training samples used for abstractive text summarization is surging to cater to the needs of deep learning models. These models tend to exploit the training data representations to attain superior performance by improving the quantitative element of the resultant summary. However, increasing the size of the training set may not always be the ideal solution to maximize performance; the quality of training samples and the learning protocols of deep learning models therefore need to be revisited. In this paper, we aim to discretize the vector space of abstractive text summarization models to understand the characteristics learned between the input embedding space and the models' encoder space. We show that deep models fail to capture the diversity of the input space. Further, the distribution of data points in the encoder space indicates that an unchecked increase in the training samples does not add value; rather, a principled pruning of data samples is needed to make the models focus on variability and faithfulness. We employ clustering techniques to learn the diversity of a model's sample space and how data points are mapped from the embedding space to the encoder space and vice versa. Further, we devise a metric to filter out redundant data points to make the model more robust and less data-hungry. We benchmark our proposed method using quantitative metrics such as ROUGE and qualitative metrics such as BERTScore, FEQA, and the Pyramid score. We also quantify the reasons that inhibit the models from learning the diversity of the varied input samples.
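A minimal sketch of the general idea, clustering sample embeddings and pruning near-duplicates, might look as follows; the clustering choice, similarity threshold, and function names are illustrative assumptions rather than the paper's exact metric:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def distill(embeddings, n_clusters=10, redundancy_threshold=0.95):
    """Return indices of a de-duplicated training subset: within each
    cluster, keep a point only if it is not too similar to one kept before."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(embeddings)
    keep = []
    for c in range(n_clusters):
        kept_in_cluster = []
        for i in np.where(labels == c)[0]:
            max_sim = (cosine_similarity(embeddings[i:i + 1],
                                         embeddings[kept_in_cluster]).max()
                       if kept_in_cluster else 0.0)
            if max_sim < redundancy_threshold:   # sufficiently novel point
                kept_in_cluster.append(i)
        keep.extend(kept_in_cluster)
    return keep

# Toy demo: 300 random embeddings plus 300 near-duplicates of them.
rng = np.random.default_rng(0)
base = rng.normal(size=(300, 64))
emb = np.vstack([base, base + 1e-3 * rng.normal(size=base.shape)])
subset = distill(emb)
print(f"kept {len(subset)} of {len(emb)} samples")  # roughly half survive
```

In the demo, each near-duplicate lands in the same cluster as its original and exceeds the similarity threshold, so the pruned set retains the diversity of the data at about half the size.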
Computational Sociolinguistics: A Survey
Language is a social phenomenon and variation is inherent to its social nature. Recently, there has been a surge of interest within the computational linguistics (CL) community in the social dimension of language. In this article we present a survey of the emerging field of "Computational Sociolinguistics" that reflects this increased interest. We aim to provide a comprehensive overview of CL research on sociolinguistic themes, featuring topics such as the relation between language and social identity, language use in social interaction and multilingual communication. Moreover, we demonstrate the potential for synergy between the research communities involved, by showing how the large-scale data-driven methods that are widely used in CL can complement existing sociolinguistic studies, and how sociolinguistics can inform and challenge the methods and assumptions employed in CL studies. We hope to convey the possible benefits of a closer collaboration between the two communities and conclude with a discussion of open challenges.
Comment: To appear in Computational Linguistics. Accepted for publication: 18th February, 201
A Hybrid Environment for Syntax-Semantic Tagging
The thesis describes the application of the relaxation labelling algorithm to NLP disambiguation. Language is modelled through context constraints inspired by Constraint Grammars. The constraints enable the use of a real value stating "compatibility". The technique is applied to POS tagging, Shallow Parsing and Word Sense Disambiguation. Experiments and results are reported. The proposed approach enables the use of multi-feature constraint models, the simultaneous resolution of several NL disambiguation tasks, and the collaboration of linguistic and statistical models.
Comment: PhD thesis, 120 pages.
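For intuition, a minimal sketch of a classic relaxation-labelling update applied to POS tagging follows; the tag set, compatibility values, and update details are toy assumptions, not the thesis's constraint models:

```python
import numpy as np

TAGS = ["DET", "NOUN", "VERB"]
# compat[a][b]: real-valued "compatibility" of tag a followed by tag b
# (toy values; a DET->NOUN bigram is favoured, DET->VERB penalised, etc.)
compat = np.array([[-1.0,  1.0, -1.0],
                   [-0.5, -0.2,  0.8],
                   [ 0.5,  0.3, -0.5]])

def relax(p, iters=20):
    """p: (n_words, n_tags) initial tag probabilities (e.g. from a lexicon).
    Iteratively reweight each word's tag distribution by the support its
    tags receive from neighbouring words' current distributions."""
    for _ in range(iters):
        support = np.zeros_like(p)
        for i in range(len(p)):
            if i > 0:                       # support from the left neighbour
                support[i] += p[i - 1] @ compat
            if i < len(p) - 1:              # support from the right neighbour
                support[i] += compat @ p[i + 1]
        p = p * (1.0 + np.clip(support, -0.99, None))  # classic update rule
        p /= p.sum(axis=1, keepdims=True)              # renormalise per word
    return p

# "the duck swims": the middle word is ambiguous between NOUN and VERB,
# but the DET on its left pushes the distribution towards NOUN.
p0 = np.array([[0.9, 0.05, 0.05],
               [0.0, 0.5,  0.5 ],
               [0.0, 0.3,  0.7 ]])
print(relax(p0).round(2))
```

The thesis's contribution lies in driving such updates with multi-feature linguistic and statistical constraints rather than a single hand-set bigram table; the sketch only shows the iterative mechanism itself.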