Search CORE

16 research outputs found

Text Classification in an Under-Resourced Language via Lexical Normalization and Feature Pooling

Author: Elahi Inam
Ijaz Ahsan
Kamiran Faisal
Karim Asim
Sohail Omayya
Publication venue: AIS Electronic Library (AISeL)
Publication date: 26/06/2018

Automatic classification of textual content in an under-resourced language is challenging, since lexical resources and preprocessing tools are not available for such languages. Their bag-of-words (BoW) representation is usually highly sparse and noisy, and text classification built on such a representation yields poor performance. In this paper, we explore the effectiveness of lexical normalization of terms and statistical feature pooling for improving text classification in an under-resourced language. We focus on classifying citizen feedback on government services provided through SMS texts, which are written predominantly in Roman Urdu (an informal, forward-transliterated version of the Urdu language). Our proposed methodology normalizes lexical variations of terms using phonetic and string similarity. It subsequently employs a supervised feature extraction technique to obtain category-specific, highly discriminating features. Our experiments with classifiers reveal that lexical normalization plus feature pooling achieves a significant improvement in classification performance over standard representations.
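
The normalization step in this abstract (grouping spelling variants by phonetic and string similarity) can be illustrated with a minimal sketch. This is not the authors' method: the `phonetic_key` scheme, the `min_similarity` threshold, and the example tokens are all assumptions chosen for illustration, and string similarity is approximated with the standard library's `difflib.SequenceMatcher`.

```python
from collections import Counter
from difflib import SequenceMatcher


def phonetic_key(word: str) -> str:
    """Crude phonetic key (an assumption, not the paper's scheme):
    lowercase, keep the first letter, drop later vowels, collapse repeats."""
    word = word.lower()
    if not word:
        return ""
    key = [word[0]]
    for ch in word[1:]:
        if ch in "aeiou":
            continue
        if ch != key[-1]:
            key.append(ch)
    return "".join(key)


def normalize_vocabulary(tokens, min_similarity=0.7):
    """Map each token to a more frequent token that shares its phonetic key
    and whose string similarity is at least min_similarity."""
    counts = Counter(tokens)
    canonical = {}  # phonetic key -> canonical spellings seen so far
    mapping = {}
    # Visit frequent forms first so they become the canonical spellings.
    for word, _ in counts.most_common():
        key = phonetic_key(word)
        target = word
        for cand in canonical.get(key, []):
            if SequenceMatcher(None, word, cand).ratio() >= min_similarity:
                target = cand
                break
        if target == word:
            canonical.setdefault(key, []).append(word)
        mapping[word] = target
    return mapping


# Hypothetical Roman Urdu spelling variants of one word, plus an unrelated word.
tokens = ["acha", "acha", "acha", "achaa", "achha", "bura"]
mapping = normalize_vocabulary(tokens)
```

Here the less frequent variants `achaa` and `achha` are mapped to the most frequent spelling `acha`, while `bura` is left alone. Collapsing variants in this way reduces the sparsity of the bag-of-words representation before any classifier is trained.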

AIS Electronic Library (AISeL)

UKHTI VS UGHTEA: ARABIC KINSHIP ADDRESS TERM AS SLANG AND IDENTITY IN INDONESIAN TWITTER

Author: Qonitah Salma
Triwinarti Wiwin
Publication venue: 'Universitas Indonesia, Directorate of Research and Public Service'
Publication date: 31/07/2020

Microblogging has taken a quotidian position in the scope of internet usage. This research explores the pragmatics of ughtea, a slang form of ukhti, as a slang term of address and a marker of identity in Twitter’s most prominent behaviour in the virtual sphere: tweeting. Semantically, ukhti means “my sister”, that is, the sister of the first-person speaker, in both biological and ideological contexts. Over the two years 2018 to 2019, the meaning of ukhti was extended through its use among Indonesian Twitter users, who changed its form into ughtea, a slang variant with a pejorative meaning, in order to insinuate the exclusivity of the term ukhti among Indonesian conservative Muslims and the misbehavior of ukhti. As a result, the meaning of the term ukhti has undergone pejoration. According to McCulloch’s (2019) classification of Internet People, these Indonesian Twitter users are classified as Post Internet People. The research problem focuses on the speakers, the terms, and how both terms are used in the context of pejoration. The study aims to analyze the shift in meaning of both terms with respect to speakers, speech, and usage by applying a corpus linguistic approach and Martin and White’s (2005) appraisal system. The data were obtained from Twitter users' tweets during a fixed period (October 2019).

International Review of Humanities Studies (IRHS)

Comparison of the Influence of Different Normalization Methods on Tweet Sentiment Analysis in the Serbian language

Author: Ljajić Adela
Marovac Ulfeta
Stanković Milena
Publication venue: 'University of Nis - Faculty of Philosophy'
Publication date: 25/01/2019

Given the growing need to process texts quickly and extract information from data for various purposes, correct normalization that contributes to better and faster processing is of great importance. The paper compares different methods of normalizing short texts (tweets). The comparison is illustrated using text sentiment analysis. The results of applying the different normalizations are presented, taking into account time complexity and the classification accuracy of the sentiment algorithm. It is shown that normalization by cutting words to n-grams gives better or similar results compared to language-dependent normalizations. When time complexity is also considered, the paper concludes that this language-independent normalization gives optimal results in the classification of short informal texts.
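
As a rough illustration of what cutting words to n-grams might look like, the sketch below truncates every token to its first n characters, so that inflected forms of one word collapse to a shared prefix without any language-specific stemmer. This is one plausible reading of the abstract, not the paper's implementation; the function name `cut_to_ngram`, the tokenization rule, and the default `n=4` are assumptions.

```python
import re


def cut_to_ngram(text: str, n: int = 4) -> list[str]:
    """Lowercase the text, extract runs of letters, and keep only the
    first n characters of each token (n=4 is an illustrative choice)."""
    # [^\W\d_]+ matches runs of letters, including non-ASCII letters.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [tok[:n] for tok in tokens]


# Hypothetical example: three inflected forms share the prefix "odli".
features = cut_to_ngram("Odlicna odlicno ODLICAN film!")
```

Because the only operations are lowercasing, a regular-expression scan, and slicing, the cost is linear in the length of the text, which is consistent with the abstract's emphasis on time complexity.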

University of Niš: Facta Universitatis (E-Journals) / Универзитет у Нишу

Stylistic variation on the Donald Trump Twitter account: a linguistic analysis of tweets posted between 2009 and 2018

Author: Clarke Isobelle
Grieve Jack
Publication date: 01/01/2019

Twitter was an integral part of Donald Trump's communication platform during his 2016 campaign. Although its topical content has been examined by researchers and the media, we know relatively little about the style of the language used on the account or how this style changed over time. In this study, we present the first detailed description of stylistic variation on the Trump Twitter account based on a multivariate analysis of grammatical co-occurrence patterns in tweets posted between 2009 and 2018. We identify four general patterns of stylistic variation, which we interpret as representing the degree of conversational, campaigning, engaged, and advisory discourse. We then track how the use of these four styles changed over time, focusing on the period around the campaign, showing that the style of tweets shifts systematically depending on the communicative goals of Trump and his team. Based on these results, we propose a series of hypotheses about how the Trump campaign used social media during the 2016 elections.

University of Birmingham Research Portal

Directory of Open Access Journals

TweetNorm: a benchmark for lexical normalization of Spanish tweets

Author: Alegria Iñaki
Aranberri Nora
Comas Umbert Pere Ramon
Fresno Víctor
Gamallo Pablo
Padró Lluís
San Vicente Roncal Iñaki
Turmo Borras Jorge
Zubiaga Arkaitz
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2015

The language used in social media is often characterized by an abundance of informal and non-standard writing. Normalizing this non-standard language can be crucial to facilitate subsequent textual processing and, consequently, to help boost the performance of natural language processing tools applied to social media text. In this paper we present a benchmark for lexical normalization of social media posts, specifically for tweets in the Spanish language. We describe the tweet normalization challenge we organized recently, analyze the performance achieved by the different systems submitted to the challenge, and delve into the characteristics of the systems to identify the features that were useful. The organization of this challenge has led to the production of a benchmark for lexical normalization of social media, including an evaluation framework as well as an annotated corpus of Spanish tweets, TweetNorm_es, which we make publicly available. The creation of this benchmark and the evaluation have brought to light the types of words that the submitted systems handled best, and point to the main shortcomings to be addressed in future work.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

UPCommons. Portal del coneixement obert de la UPC

Warwick Research Archives Portal Repository

Text normalization in Argentinian Spanish (Normalización de texto en español de Argentina)

Author: Bracco Alan Gabriel
Publication date: 01/01/2018

Thesis (Licentiate in Computer Science), Universidad Nacional de Córdoba, Facultad de Matemática, Astronomía, Física y Computación, 2018. Nowadays, the amount of data consumed and generated by a single person is enormous. The volume of data keeps growing because anyone can generate it, and this brings an increase in noise. As a result, social network text is characteristically noisy, which is a problem for anyone who needs to work with it. In this work we build a corpus of tweets in Argentinian Spanish. We collected a large set of tweets and selected from them manually to obtain a representative sample of typical normalization errors. We then defined clear and explicit correction criteria and used them to annotate the corpus manually. We also present a text normalization system that works on tweets. Given a set of tweets as input, the system detects and corrects the words that need to be standardized. To do so, it uses a series of components such as lexical resources, rule systems and language models. Finally, we ran experiments with different corpora, including our own, and with different system configurations to understand the advantages and disadvantages of each.
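
The detect-and-correct flow described in this abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the thesis's system: the lexicon, its frequencies, the abbreviation table, and the single repeated-letter rule are all made up, and a unigram frequency lookup stands in for a real language model.

```python
import re

# Toy resources; every entry and frequency here is a made-up illustration.
LEXICON_FREQ = {"hola": 50, "que": 120, "por": 90, "porque": 60, "bueno": 30}
ABBREVIATIONS = {"q": "que", "xq": "porque", "x": "por"}


def candidates(word):
    """Rule-based candidate generation for an out-of-vocabulary word."""
    cands = set()
    if word in ABBREVIATIONS:
        cands.add(ABBREVIATIONS[word])
    # Rule: collapse runs of a repeated character ("holaaa" -> "hola").
    cands.add(re.sub(r"(.)\1+", r"\1", word))
    return cands


def normalize(tokens):
    """Lowercase each token; replace out-of-vocabulary tokens with the
    most frequent in-lexicon candidate, or keep them if none is found."""
    out = []
    for tok in tokens:
        word = tok.lower()
        if word in LEXICON_FREQ:  # detection: in-vocabulary words are kept
            out.append(word)
            continue
        known = [c for c in candidates(word) if c in LEXICON_FREQ]
        # Selection: unigram frequency stands in for a language model.
        out.append(max(known, key=LEXICON_FREQ.get) if known else word)
    return out


result = normalize(["Holaaa", "q", "buenooo", "xyz"])
```

With these toy resources, `Holaaa`, `q`, and `buenooo` are corrected to `hola`, `que`, and `bueno`, while `xyz` has no in-lexicon candidate and is passed through unchanged. A context-aware language model would replace the frequency lookup when several candidates are plausible.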

Repositorio Digital de la Universidad Nacional de Córdoba

Social Media Influencers: An Examination of Influence Throughout the Customer Journey

Author: Leggett Britton
Publication venue: JagWorks@USA
Publication date: 01/01/2022

Social media influencers (SMI) expanded exponentially in both numbers and credibility shortly after the widespread emergence of social media platforms like Facebook and Instagram. Firms have noticed this increase and as a result, diverted billions of dollars in their marketing budgets toward SMI endorsements and campaigns, and away from traditional media. As often happens with quickly occurring phenomena, academic research is subsequently racing to understand the integral roles SMIs now command in social media marketing, and in marketing in general. Much of the latest research designed to understand and measure the effects of SMIs relies on previous research into traditional celebrity endorsers. SMI attributes and approaches have been researched like previous traditional celebrity studies. Another emerging and relevant topic is para-social relationships – in which followers feel as if they know the influencer like a friend though the SMI likely does not feel the same way. While there are similarities, major differences exist between traditional celebrities and SMIs. Examples include the delivery via social media platforms, increased engagement through the platforms, and uploadable user-generated content (UGC). Unlike musicians, athletes, and actresses, SMIs are generating their stardom and followings on social media platforms with their UGC. Though the traditional celebrity concept is still quite relevant regarding endorsements, younger consumers have been opting for less traditional media for entertainment purposes. Businesses have realized reaching Generation Z is effective and efficient through SMIs. This study advances the SMI literature in understanding the differences in para-social relationships formed with SMIs and their role throughout selected components of the customer journey rather than individual parts of it.

University of South Alabama Institutional Repository