
    Methodologies for the Automatic Location of Academic and Educational Texts on the Internet

    Traditionally, online databases of web resources have been compiled by a human editor or through the submissions of authors or interested parties. Considerable resources are needed to maintain a constant level of input and relevance in the face of increasing material quantity and quality, and much of what is in databases is of an ephemeral nature. These pressures dictate that many databases stagnate after an initial period of enthusiastic data entry. The solution to this problem would seem to be the automatic harvesting of resources; however, this process necessitates the automatic classification of resources as ‘appropriate’ to a given database, a problem only solved by complex text content analysis. This paper outlines the component methodologies necessary to construct such an automated harvesting system, including a number of novel approaches. In particular, it looks at the specific problems of automatically identifying academic research work and Higher Education pedagogic materials. Where appropriate, experimental data are presented from searches in the field of Geography as well as the Earth and Environmental Sciences. In addition, appropriate software is reviewed where it exists, and future directions are outlined.
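    As a concrete illustration of the classification step this abstract describes, the sketch below flags fetched pages as ‘appropriate’ for a database from their text content. It is not the paper's pipeline: the TF-IDF plus logistic-regression baseline and the two tiny training snippets are placeholder assumptions.

    ```python
    # Minimal sketch of the classification step in an automated harvester:
    # decide whether a fetched page looks academic from its text alone.
    # The training examples below are hypothetical placeholders; a real
    # system would need a sizeable labeled sample.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    train_texts = [
        "Abstract. We present a study of sediment transport in rivers",  # academic
        "Buy cheap flights and hotel deals for your next holiday",       # not academic
    ]
    train_labels = [1, 0]  # 1 = appropriate for the database, 0 = reject

    clf = make_pipeline(TfidfVectorizer(sublinear_tf=True), LogisticRegression())
    clf.fit(train_texts, train_labels)

    def is_appropriate(page_text: str, threshold: float = 0.5) -> bool:
        """Return True if the harvested page looks like academic material."""
        return clf.predict_proba([page_text])[0][1] >= threshold
    ```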

    ChatGPT vs State-of-the-Art Models: A Benchmarking Study in Keyphrase Generation Task

    Transformer-based language models, including ChatGPT, have demonstrated exceptional performance in various natural language generation tasks. However, there has been limited research evaluating ChatGPT's keyphrase generation ability, which involves identifying informative phrases that accurately reflect a document's content. This study seeks to address this gap by comparing ChatGPT's keyphrase generation performance with that of state-of-the-art models, while also testing its potential as a solution for two significant challenges in the field: domain adaptation and keyphrase generation from long documents. We conducted experiments on six publicly available datasets from scientific articles and news domains, analyzing performance on both short and long documents. Our results show that ChatGPT outperforms current state-of-the-art models in all tested datasets and environments, generating high-quality keyphrases that adapt well to diverse domains and document lengths.
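    The benchmarking setup can be pictured with a small sketch: build a keyphrase prompt for a chat model, parse its output, and score it against gold keyphrases with exact-match F1@k, a standard metric in this task. The prompt wording is an assumption, not the study's actual protocol, and the model call itself is left to whatever client is in use.

    ```python
    # Hedged sketch of keyphrase-generation evaluation. The prompt text is
    # an assumed example; send it to the chat model of your choice, then
    # feed the raw reply to parse_keyphrases() and score with f1_at_k().

    def keyphrase_prompt(document: str, k: int = 10) -> str:
        return (f"Extract up to {k} keyphrases that best summarize the "
                f"following document. Return them comma-separated.\n\n{document}")

    def parse_keyphrases(model_output: str) -> list[str]:
        return [p.strip().lower() for p in model_output.split(",") if p.strip()]

    def f1_at_k(predicted: list[str], gold: list[str], k: int = 10) -> float:
        pred, gold_set = predicted[:k], {g.lower() for g in gold}
        if not pred or not gold_set:
            return 0.0
        matches = sum(1 for p in pred if p in gold_set)
        if matches == 0:
            return 0.0
        precision, recall = matches / len(pred), matches / len(gold_set)
        return 2 * precision * recall / (precision + recall)
    ```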

    Political advertising on social media: Issues sponsored on Facebook ads during the 2019 General Elections in Spain

    Facebook’s advertising platform provides political parties with an electoral tool that enables them to reach a finely targeted audience. Unlike television, sponsored content on Facebook is seen only by the targeted users. This opacity was an obstacle to political communication research until Facebook began releasing advertiser-sponsored content in 2018. The company’s new transparency policy included sharing metadata related to the cost and number of impressions the ads received. This research studies the content sponsored on Facebook by the five main national political parties in Spain during the two General Elections held in 2019. The research corpus consists of 14,684 Facebook ads. An extraction algorithm detected the key terms in the text-based messages conveyed in the ads. The prominence of these topics was estimated from the aggregate number of impressions accumulated by each term. Different content patterns were assessed in three categories: user mobilization, candidate presence, and ideological issues. PSOE and PP positioned themselves more toward calls to action. Podemos had the greatest number of policy-related issues among the most salient topics in its advertising. Ciudadanos’ strategy focused more on its candidate and on mobilization. Vox sponsored few Facebook ads, and they barely included policy issues. Spain was a highly prominent term in all parties’ campaigns. Ciudadanos occupied the middle ground on the ideological axis: they promoted social issues more aligned with left-wing parties as well as economic topics usually advocated by the right wing. Overall, our results point to a greater emphasis on candidates rather than issues.
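    The prominence estimate described above can be sketched as follows: weight each extracted term by the summed impressions of every ad whose text contains it. The ad rows and field names (`text`, `impressions`) are hypothetical, and the study's own extraction algorithm is replaced here by a crude tokenizer.

    ```python
    # Sketch: aggregate ad impressions per key term to estimate prominence.
    # Rows below are invented stand-ins for Facebook Ad Library metadata.
    from collections import Counter
    import re

    ads = [
        {"text": "Vota por España y el empleo", "impressions": 120_000},
        {"text": "España necesita un gobierno estable", "impressions": 80_000},
    ]

    def extract_terms(text: str) -> list[str]:
        # Placeholder for the study's extraction algorithm.
        return re.findall(r"[a-záéíóúñü]+", text.lower())

    prominence = Counter()
    for ad in ads:
        for term in set(extract_terms(ad["text"])):  # count each ad once per term
            prominence[term] += ad["impressions"]

    print(prominence.most_common(5))  # 'españa' aggregates 200,000 impressions
    ```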

    Strategies for the analysis of large social media corpora: sampling and keyword extraction methods

    In the context of the COVID-19 pandemic, social media platforms such as Twitter have been of great importance for users to exchange news, ideas, and perceptions. Researchers from fields such as discourse analysis and the social sciences have resorted to this content to explore public opinion and stance on this topic, and they have tried to gather information through the compilation of large-scale corpora. However, the size of such corpora is both an advantage and a drawback, as simple text retrieval techniques and tools may prove to be impractical or altogether incapable of handling such masses of data. This study provides methodological and practical cues on how to manage the contents of a large-scale social media corpus such as the COVID-19 corpus of Chen et al. (JMIR Public Health Surveill 6(2):e19273, 2020). We compare and evaluate, in terms of efficiency and efficacy, available methods for handling such a large corpus. First, we compare different sample sizes to assess whether it is possible to achieve similar results despite the size difference, and we evaluate sampling methods following a specific data management approach to storing the original corpus. Second, we examine two keyword extraction methodologies commonly used to obtain a compact representation of the main subject and topics of a text: the traditional method used in corpus linguistics, which compares word frequencies against a reference corpus, and graph-based techniques as developed in Natural Language Processing tasks. The methods and strategies discussed in this study enable valuable quantitative and qualitative analyses of an otherwise intractable mass of social media data. Funding for open access publishing: Universidad de Málaga/CBUA. This work was funded by the Spanish Ministry of Science and Innovation [Grant No. PID2020-115310RB-I00], the Regional Government of Andalusia [Grant No. UMA18-FEDERJA-158], and the Spanish Ministry of Education and Vocational Training [Grant No. FPU 19/04880]. Funding for open access charge: Universidad de Málaga/CBUA.
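    Of the two keyword extraction families mentioned, the corpus-linguistics one is easy to sketch: score each word's keyness against a reference corpus with the log-likelihood (G2) statistic. This is the textbook formulation, not the authors' exact implementation; the token lists are assumed inputs, and the graph-based alternative (e.g. TextRank) is omitted here.

    ```python
    # Keyness by log-likelihood (G2): compare a word's frequency in the
    # study corpus (size c) with its frequency in a reference corpus
    # (size d), using expected counts under a pooled-frequency model.
    import math
    from collections import Counter

    def keyness(study_tokens: list[str], ref_tokens: list[str], top: int = 20):
        study, ref = Counter(study_tokens), Counter(ref_tokens)
        c, d = len(study_tokens), len(ref_tokens)
        scores = {}
        for word, a in study.items():
            b = ref.get(word, 0)
            e1 = c * (a + b) / (c + d)   # expected count in the study corpus
            e2 = d * (a + b) / (c + d)   # expected count in the reference corpus
            ll = 2 * (a * math.log(a / e1) + (b * math.log(b / e2) if b else 0))
            scores[word] = ll
        return sorted(scores.items(), key=lambda kv: -kv[1])[:top]
    ```

    Words with high G2 scores are over-represented in the study corpus relative to the reference, which is exactly the sense of "keyword" used in corpus linguistics.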

    Human-competitive automatic topic indexing

    Topic indexing is the task of identifying the main topics covered by a document. These are useful for many purposes: as subject headings in libraries, as keywords in academic publications and as tags on the web. Knowing a document's topics helps people judge its relevance quickly. However, assigning topics manually is labor-intensive. This thesis shows how to generate them automatically in a way that competes with human performance. Three kinds of indexing are investigated: term assignment, a task commonly performed by librarians, who select topics from a controlled vocabulary; tagging, a popular activity of web users, who choose topics freely; and a new method of keyphrase extraction, where topics are equated to Wikipedia article names. A general two-stage algorithm is introduced that first selects candidate topics and then ranks them by significance based on their properties. These properties draw on statistical, semantic, domain-specific and encyclopedic knowledge. They are combined using a machine learning algorithm that models human indexing behavior from examples. This approach is evaluated by comparing automatically generated topics to those assigned by professional indexers and by amateurs. We claim that the algorithm is human-competitive because it chooses topics that are as consistent with those assigned by humans as the humans' topics are with each other. The approach is generalizable, requires little training data and applies across different domains and languages.
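    The two-stage scheme can be sketched minimally: stage one generates candidate topics (here, plain n-grams), and stage two ranks them by per-candidate properties. The hand-set weights below merely stand in for the trained model; the thesis itself learns the ranking from human-indexed examples and uses far richer statistical, semantic, and encyclopedic features.

    ```python
    # Two-stage topic indexing sketch: (1) candidate generation, (2) ranking
    # by simple features. Weights are illustrative, not a learned model.
    import re
    from collections import Counter

    def candidate_topics(text: str, max_n: int = 3) -> list[str]:
        tokens = re.findall(r"[a-z]+", text.lower())
        return [" ".join(tokens[i:i + n])
                for n in range(1, max_n + 1)
                for i in range(len(tokens) - n + 1)]

    def rank_topics(text: str, top: int = 5) -> list[str]:
        cands = candidate_topics(text)
        freq = Counter(cands)
        lowered = " ".join(re.findall(r"[a-z]+", text.lower()))
        def score(c: str) -> float:
            tf = freq[c] / len(cands)                       # frequency feature
            first = lowered.find(c) / max(len(lowered), 1)  # first-occurrence feature
            return 2.0 * tf - 1.0 * first                   # stand-in for the ML ranker
        return sorted(set(cands), key=score, reverse=True)[:top]

    print(rank_topics("Topic indexing assigns topics to documents; "
                      "topic indexing can be automated."))
    ```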