
    Strategies for the analysis of large social media corpora: sampling and keyword extraction methods

    In the context of the COVID-19 pandemic, social media platforms such as Twitter have been of great importance for users to exchange news, ideas, and perceptions. Researchers from fields such as discourse analysis and the social sciences have resorted to this content to explore public opinion and stance on this topic, and they have tried to gather information through the compilation of large-scale corpora. However, the size of such corpora is both an advantage and a drawback, as simple text retrieval techniques and tools may prove to be impractical or altogether incapable of handling such masses of data. This study provides methodological and practical cues on how to manage the contents of a large-scale social media corpus such as Chen et al.'s (JMIR Public Health Surveill 6(2):e19273, 2020) COVID-19 corpus. We compare and evaluate, in terms of efficiency and efficacy, available methods to handle such a large corpus. First, we compare different sample sizes to assess whether similar results can be achieved despite the size difference, and we evaluate sampling methods following a specific data management approach to storing the original corpus. Second, we examine two keyword extraction methodologies commonly used to obtain a compact representation of the main subject and topics of a text: the traditional method used in corpus linguistics, which compares word frequencies against a reference corpus, and graph-based techniques as developed in Natural Language Processing tasks. The methods and strategies discussed in this study enable valuable quantitative and qualitative analyses of an otherwise intractable mass of social media data.
    Funding for open access publishing: Universidad de Málaga/CBUA. This work was funded by the Spanish Ministry of Science and Innovation [Grant No. PID2020-115310RB-I00], the Regional Government of Andalusia [Grant No. UMA18-FEDERJA-158] and the Spanish Ministry of Education and Vocational Training [Grant No. FPU 19/04880].
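    As a rough illustration of the first keyword extraction approach compared above, the sketch below ranks the words of a study corpus against a reference corpus by log-likelihood keyness, a standard corpus-linguistics measure; the function and variable names are illustrative and do not reproduce the authors' implementation.

```python
import math
from collections import Counter

def keyness(target_tokens, reference_tokens, top_n=20):
    """Rank words of a target corpus against a reference corpus by log-likelihood keyness."""
    tgt, ref = Counter(target_tokens), Counter(reference_tokens)
    n_tgt, n_ref = sum(tgt.values()), sum(ref.values())
    scores = {}
    for word, a in tgt.items():
        b = ref.get(word, 0)
        # Expected frequencies under the null hypothesis of equal relative frequency.
        e1 = n_tgt * (a + b) / (n_tgt + n_ref)
        e2 = n_ref * (a + b) / (n_tgt + n_ref)
        ll = 2 * (a * math.log(a / e1) + (b * math.log(b / e2) if b else 0))
        scores[word] = ll
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Toy usage: in practice the tokens would come from the sampled COVID-19 tweets
# and from a general-language reference corpus.
print(keyness("vaccine lockdown mask vaccine mask".split(),
              "weather football vaccine music film".split(), top_n=3))
```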

    Tracking diachronic sentiment change of economic terms in times of crisis: Connotative fluctuations of ‘inflation’ in the news discourse

    The present study focuses on the fluctuation of sentiment in economic terminology to observe semantic change in restricted diachrony. We examine the evolution of the target term 'inflation' in the business section of quality news and the impact of the Great Recession. This is carried out through the application of quantitative and qualitative methods: Sentiment Analysis, Usage Fluctuation Analysis, Corpus Linguistics, and Discourse Analysis. From the diachronic Great Recession News Corpus, which covers the 2007–2015 period, we extracted sentences containing the term 'inflation'. Two findings emerge: (i) terms become event words, their frequency of use increasing as relevant crisis events unfold, and (ii) statistically significant, culturally motivated changes appear in the form of emergent collocations with sentiment-laden words of lower domain-specificity.
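    The core operation described above, tracking how the sentiment of sentences containing a target term shifts year by year, can be sketched as follows. VADER is used here purely as a stand-in polarity scorer (the study's own tools and scoring differ), and the (year, text) input format is an assumption.

```python
from collections import defaultdict
from statistics import mean

from nltk.sentiment import SentimentIntensityAnalyzer  # requires nltk.download("vader_lexicon")

def diachronic_sentiment(sentences, target="inflation"):
    """Mean polarity per year for sentences mentioning the target term.
    `sentences` is an iterable of (year, text) pairs -- an assumed input format."""
    sia = SentimentIntensityAnalyzer()
    by_year = defaultdict(list)
    for year, text in sentences:
        if target in text.lower():
            by_year[year].append(sia.polarity_scores(text)["compound"])
    return {year: mean(scores) for year, scores in sorted(by_year.items())}

# Toy usage with two invented sentences; real input would be the GRNC business-section sentences.
print(diachronic_sentiment([(2007, "Inflation remains stable."),
                            (2011, "Soaring inflation threatens fragile growth.")]))
```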

    Corpus annotation and analysis of sarcasm on Twitter: #CatsMovie vs. #TheRiseOfSkywalker

    Sentiment analysis is a natural language processing task that has received increased attention in the last decade due to the vast amount of opinionated data on social media platforms such as Twitter. Although the methodologies employed have grown in number and sophistication, analysing irony and sarcasm still poses a severe problem. From the linguistic perspective, sarcasm has been studied in discourse analysis from several angles, but little attention has been given to specific metrics that measure its relevance. In this paper we describe the creation of a manually annotated dataset that includes detailed text markers. The dataset is a sample from a larger corpus of tweets (n = 76,764) on two highly controversial films: Cats and Star Wars: The Rise of Skywalker. We took two samples for each film, one before and one after its release, to compare reception and the presence of sarcasm. We then used a sentiment analysis tool to measure the impact of sarcasm on polarity detection and manually classified the mechanisms of sarcasm generation. The resulting corpus will be useful for machine learning approaches to sarcasm detection as well as for discourse analysis studies on irony and sarcasm.
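    One simple way to quantify the impact of sarcasm on polarity detection, in the spirit of the comparison described above, is to split the gold-annotated tweets by sarcasm label and compare the detector's accuracy on each subset; the triple-based input format below is an assumption, not the paper's data schema.

```python
from collections import Counter

def polarity_accuracy_by_sarcasm(rows):
    """Polarity-detection accuracy on sarcastic vs. literal tweets.
    `rows` holds (gold_polarity, predicted_polarity, is_sarcastic) triples -- an assumed format."""
    hits, totals = Counter(), Counter()
    for gold, pred, sarcastic in rows:
        key = "sarcastic" if sarcastic else "literal"
        totals[key] += 1
        hits[key] += int(gold == pred)
    return {key: hits[key] / totals[key] for key in totals}

# Toy usage: a sarcastic tweet whose surface polarity fools the detector, plus two literal tweets.
print(polarity_accuracy_by_sarcasm([("neg", "pos", True),
                                    ("pos", "pos", False),
                                    ("neg", "neg", False)]))
```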

    Building the Great Recession News Corpus (GRNC): a contemporary diachronic corpus of economy news in English

    The paper describes the process involved in developing the Great Recession News Corpus (GRNC), a specialized web corpus containing a wide range of written texts obtained from the Business section of The Guardian and The New York Times between 2007 and 2015. The corpus was compiled as the main resource in a sentiment analysis project on the economic/financial domain. We describe its design, compilation criteria, and methodological approach, as well as the overall creation process. Although the corpus can be used for a variety of purposes, we include a sentiment analysis study on the evolution of the sentiment conveyed by the word credit during the years of the Great Recession, which we believe provides validation of the corpus.
    Funding: Ministerio de Economía, Industria y Competitividad, project "Lingmotif2: Plataforma Universal de Análisis de Sentimiento" (FFI2016-78141-P).

    Mapping of political events related to the COVID-19 pandemic on Twitter using topic modelling and keywords over time.

    This research aims to study the relationship between actual, real-world events related to the COVID-19 pandemic and the impact these events produced on social media. To achieve this objective, we employ topic modelling and keyword extraction techniques. Topic modelling is a Natural Language Processing technique that attempts to identify topics automatically from a collection of documents (Vayansky and Kumar, 2020). It is similar to keyword extraction, but topic modelling algorithms return clusters of words that make up each topic. Thus, a second objective is to compare the results of these two methods when it comes to identifying the salient topics in a corpus. We used the publicly available, multilingual COVID-19 Twitter dataset collected from January 21, 2020 (and still ongoing), available via the COVID-19-TweetsIDs GitHub repository (Chen, Lerman & Ferrara, 2020). We focus on tweets written in English from 2020 and 2021, a subset containing 1 billion tweets (31 billion tokens), from which we extracted a random, time-stratified sample of 0.1%, approximately 1 million tweets (31 million tokens). In terms of methods, we employed unsupervised machine learning for both tasks. For topic modelling we used BERT embeddings and the BERTopic library (Grootendorst, 2022); our script generates a full list of topics and assigned terms, a coherence score, and several data visualisations, such as topics-over-time graphs, heatmaps, and topic hierarchies. For keyword extraction, we used TextRank (Mihalcea & Tarau, 2004), a language-independent, graph-based ranking model. We then compare the results returned by both methods in terms of usefulness and, finally, provide an interpretation of the results by relating the extracted topics to the situation of the global pandemic at different stages of the crisis.
    Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech.
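    A minimal sketch of the two pipelines named above (BERTopic for topic modelling, TextRank via the summa package for keyword extraction); the toy documents, timestamps, and parameter values are placeholders rather than the study's actual configuration.

```python
from bertopic import BERTopic   # Grootendorst (2022)
from summa import keywords      # TextRank implementation (Mihalcea & Tarau, 2004)

# Placeholder data: the study uses ~1 million English tweets with their posting dates.
docs = ["lockdown extended again", "vaccine rollout begins today", "masks required on transit"] * 50
timestamps = ["2020-03-15"] * 50 + ["2020-12-20"] * 50 + ["2021-02-01"] * 50

# Topic modelling: cluster BERT embeddings and label each cluster with representative terms.
topic_model = BERTopic(language="english")
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())

# Topics over time, as in the abstract's topics-over-time graphs (recent BERTopic API).
topics_over_time = topic_model.topics_over_time(docs, timestamps, nr_bins=12)

# Keyword extraction: TextRank over the concatenated sample.
print(keywords.keywords(" ".join(docs), words=20))
```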

    Design and validation of annotation schemas for aspect‑based sentiment analysis in the tourism sector.

    The use of linguistic resources beyond the scope of language studies, e.g., for commercial purposes, has become commonplace with the availability of massive amounts of data and the development of software tools to process them. An interesting perspective on these data is provided by Sentiment Analysis, which attempts to identify the polarity of a text but can also pursue further, more challenging aims, such as the automatic identification of the specific entities and aspects being discussed in the evaluative speech act, along with the polarity associated with them. This approach, known as aspect-based sentiment analysis, seeks to offer fine-grained information from raw text, but its success depends largely on the existence of pre-annotated domain-specific corpora, which in turn calls for the design and validation of an annotation schema. This paper examines the methodological aspects involved in the creation of such an annotation schema and is motivated by the scarcity of information found in the literature. We describe the insights we obtained from the annotation schema generation and validation process within our project, whose objectives include the development of advanced sentiment analysis software for user reviews in the tourism sector. We focus on the identification of the relevant entities and attributes in the domain, which we extract from a corpus of user reviews, and go on to describe the schema creation and validation process. We begin by describing the corpus annotation process and its iterative refinement by means of several inter-annotator agreement measurements, which we believe is key to a successful annotation schema.
    Funding: This work is a result of the research project "Lingmotif2: Plataforma Universal de Análisis de Sentimiento" (FFI2016-78141-P), funded with €48,800 by the Ministerio de Economía, Industria y Competitividad.
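    The abstract does not specify which agreement coefficient was used; a common choice for two annotators is Cohen's kappa, sketched below with hypothetical tourism aspect labels (ROOM, STAFF, etc.) purely for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items (e.g. aspect categories per review segment)."""
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy usage with hypothetical aspect labels for four review segments.
print(cohens_kappa(["ROOM", "STAFF", "FOOD", "LOCATION"],
                   ["ROOM", "STAFF", "PRICE", "LOCATION"]))
```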

    The language of happiness in self-reported descriptions of happy moments: words, concepts, and entities

    This article studies the language of happiness from a double perspective. First, the impact and relevance of sentiment words and expressions in self-reported descriptions of happiness are examined. Second, the sources of happiness mentioned in such descriptions are identified. A large sample of "happy moments" from the HappyDB corpus is processed using advanced text analytics techniques. The sentiment analysis results reveal that positive lexical items play a limited role in the description of happy moments. For the second objective, unsupervised machine learning algorithms are used to extract and cluster keywords, and the resulting semantic classes are labelled manually. Results indicate that these classes, linguistically materialized in compact lexical families, accurately describe the sources of happiness, a finding reinforced by our named entity analysis, which also reveals the important role that commercial products and services play as a source of happiness. The study thus provides methodological underpinnings for the automatic processing of self-reported happy moments and contributes to a better understanding of the linguistic expression of happiness, with interdisciplinary implications for fields such as affective content analysis, sentiment analysis, and cultural, social and behavioural studies.
    Funding: Ministerio de Ciencia e Innovación, research project PID2020-115310RB-I00.
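    A sketch of the kind of keyword clustering and named entity analysis described above; the embedding model, cluster count, and example words are illustrative choices, not the study's actual pipeline.

```python
import spacy
from sklearn.cluster import KMeans

# Medium English model with word vectors and a named entity recogniser
# (requires: python -m spacy download en_core_web_md).
nlp = spacy.load("en_core_web_md")

# Hypothetical keywords extracted from happy-moment descriptions.
keywords = ["dinner", "pizza", "restaurant", "daughter", "son", "wife",
            "hike", "beach", "park", "phone", "laptop", "game"]
vectors = [nlp(word).vector for word in keywords]

# Cluster keywords into candidate semantic classes, to be labelled manually afterwards.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(vectors)
for cluster in sorted(set(labels)):
    print(cluster, [w for w, l in zip(keywords, labels) if l == cluster])

# Named entity analysis of a single happy moment.
doc = nlp("I finally bought the new iPhone at the Apple Store with my sister.")
print([(ent.text, ent.label_) for ent in doc.ents])
```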

    The use of Tik Tok in higher education as a motivating source for students

    This article presents a study conducted at the University of Málaga with the participation of second-year students from the Degree in English Studies. It focuses on a Tik Tok project that the participants had to produce for the British History course in the 2020/2021 academic year. The students' reception of the project as an innovative learning tool, applied both to English as a second language and to the content of the course, was analysed and measured using an ad hoc, properly validated questionnaire. Our results indicate that the activity was very well received by the students, who consider that this technology-integrated approach fosters comprehension and active learning of the subject in a stimulating and motivating way. Key words: Tik Tok, learning tool, innovative approach, new technologies, motivation.

    Pixel-Bit

    The article describes the infrastructure and protocols developed to use the Moodle learning management system as a research support platform, in this case for the creation of a corpus of foreign-language learners, whose purpose is the study of university students' knowledge of English. Emphasis is placed on the methodology employed rather than on its specific application to language learner corpora, since the approach can be applied to any other area of educational research that requires a controlled data-collection process.

    Dental treatment with single-implants. A 5-year study

    Introduction: Implant dentistry constitutes a therapeutic modality in the prosthodontic treatment of patients with partial and total tooth loss. This study reports the evaluation of patients treated with single crowns through the loading of single implants. Methods: 146 patients with single-tooth loss were treated with Galimplant® sandblasted and acid-etched surface implants. Implants were loaded after a healing period of 6 weeks (mandible) and 8 weeks (maxilla). Clinical findings (implants and prosthodontics) were followed for 5 years. Results: 216 implants were inserted (168 maxillary and 48 mandibular) for prosthodontic rehabilitation with single-tooth crowns; 81 implants were placed in anterior sites and 135 in posterior sites. After 5 years of follow-up, clinical results indicate an implant survival and success rate of 95.8%. Four implants were lost during the healing period due to mobility, while 5 implants were lost due to peri-implantitis. Technical complications comprised 8 cases of ceramic fracture. Conclusions: The clinical results of this study indicate that single crowns supported by sandblasted and acid-etched surface implants can be a successful dental treatment.