16 research outputs found

    Text Classification in an Under-Resourced Language via Lexical Normalization and Feature Pooling

    Automatic classification of textual content in an under-resourced language is challenging, since lexical resources and preprocessing tools are not available for such languages. Their bag-of-words (BoW) representation is usually highly sparse and noisy, and text classification built on such a representation yields poor performance. In this paper, we explore the effectiveness of lexical normalization of terms and statistical feature pooling for improving text classification in an under-resourced language. We focus on classifying citizen feedback on government services, submitted as SMS texts written predominantly in Roman Urdu (an informal, forward-transliterated version of the Urdu language). Our proposed methodology normalizes lexical variations of terms using phonetic and string similarity. It subsequently employs a supervised feature extraction technique to obtain category-specific, highly discriminating features. Our experiments with classifiers reveal that lexical normalization plus feature pooling achieves a significant improvement in classification performance over standard representations.
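
    To make the idea of normalizing lexical variants by phonetic and string similarity concrete, here is a minimal sketch. It is not the authors' method: the crude consonant-skeleton key in `phonetic_key`, the `normalize` helper, and the 0.85 threshold are all assumptions chosen for illustration, and string similarity is measured with the standard library's `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher

VOWELS = set("aeiou")


def phonetic_key(word):
    """Illustrative phonetic key: first letter plus the remaining
    consonants, with repeated letters collapsed (an assumption,
    not a published phonetic algorithm)."""
    word = word.lower()
    if not word:
        return ""
    key = [word[0]]
    for ch in word[1:]:
        if ch in VOWELS or ch == key[-1]:
            continue
        key.append(ch)
    return "".join(key)


def normalize(tokens, threshold=0.85):
    """Map each token to the first previously seen token that shares
    its phonetic key or is close to it in string similarity."""
    canonical = []  # representative spellings seen so far
    normalized = []
    for token in tokens:
        low = token.lower()
        match = None
        for candidate in canonical:
            same_sound = phonetic_key(candidate) == phonetic_key(low)
            close = SequenceMatcher(None, candidate, low).ratio() >= threshold
            if same_sound or close:
                match = candidate
                break
        if match is None:
            canonical.append(low)
            match = low
        normalized.append(match)
    return normalized
```

    With this sketch, spelling variants such as "bohat", "bahut" and "bohot" collapse onto a single form, which reduces the sparsity of a bag-of-words representation.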

    UKHTI VS UGHTEA: ARABIC KINSHIP ADDRESS TERM AS SLANG AND IDENTITY IN INDONESIAN TWITTER

    Microblogging has taken a quotidian position in internet usage. This research explores the pragmatics of ughtea, a slang form of ukhti, as a term of address and an identity marker in tweeting, Twitter’s prominent behaviour in the virtual sphere. Semantically, ukhti means “my sister”, that is, the sister of the first-person speaker, in both biological and ideological contexts. Over the last two years (2018-2019), the meaning of ukhti has been extended through its use among Indonesian Twitter users, who changed its form into the slang ughtea with a degenerative meaning in order to insinuate the exclusivity of the term ukhti among Indonesian conservative Muslims and the misbehavior of ukhti. As a result, the term ukhti has undergone pejoration. According to McCulloch’s classification of Internet People (2019), these Indonesian Twitter users are classified as Post Internet People. The research problem focuses on the speakers, the terms, and how both terms are used in the context of pejoration. The study aims to analyze the meaning shift of both terms with respect to speakers, speech, and usage by applying a corpus linguistic approach and Martin and White’s (2005) appraisal system. The data were obtained from Twitter users' tweets posted during a fixed period (October 2019).

    Comparison of the Influence of Different Normalization Methods on Tweet Sentiment Analysis in the Serbian language

    Given the growing need to quickly process texts and extract information from data for various purposes, correct normalization that contributes to better and faster processing is of great importance. The paper compares different methods of normalizing short texts (tweets). The comparison is illustrated on the task of text sentiment analysis. The results of applying the different normalizations are presented, taking into account both time complexity and the classification accuracy of the sentiment algorithm. It is shown that cutting words to n-grams yields results better than or similar to those of language-dependent normalizations. Once time complexity is taken into account, the paper concludes that this language-independent normalization gives the best results for classifying short informal texts.
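
    As a rough illustration of the language-independent normalization mentioned above, the sketch below assumes that "cutting to n-gram" means keeping only the first n characters of each word. The function name `cut_to_ngram`, the lowercasing step, and the default n = 4 are assumptions for this example, not details taken from the paper.

```python
def cut_to_ngram(text, n=4):
    """Truncate every whitespace-separated token to its first n characters.

    No stemmer or dictionary is needed, so the same function can be
    applied to text in any language.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return " ".join(token[:n] for token in text.lower().split())
```

    Because inflected forms of a word often share a prefix, truncation maps many of them onto the same feature at a very low computational cost, which is the trade-off the abstract points to.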

    Stylistic variation on the Donald Trump Twitter account: a linguistic analysis of tweets posted between 2009 and 2018

    Twitter was an integral part of Donald Trump's communication platform during his 2016 campaign. Although its topical content has been examined by researchers and the media, we know relatively little about the style of the language used on the account or how this style changed over time. In this study, we present the first detailed description of stylistic variation on the Trump Twitter account based on a multivariate analysis of grammatical co-occurrence patterns in tweets posted between 2009 and 2018. We identify four general patterns of stylistic variation, which we interpret as representing the degree of conversational, campaigning, engaged, and advisory discourse. We then track how the use of these four styles changed over time, focusing on the period around the campaign, showing that the style of tweets shifts systematically depending on the communicative goals of Trump and his team. Based on these results, we propose a series of hypotheses about how the Trump campaign used social media during the 2016 elections

    TweetNorm: a benchmark for lexical normalization of Spanish tweets

    The language used in social media is often characterized by an abundance of informal and non-standard writing. Normalizing this non-standard language can be crucial for facilitating subsequent text processing and, consequently, for boosting the performance of natural language processing tools applied to social media text. In this paper we present a benchmark for lexical normalization of social media posts, specifically for tweets in Spanish. We describe the tweet normalization challenge we organized recently, analyze the performance achieved by the different systems submitted to the challenge, and delve into the characteristics of the systems to identify the features that were useful. The organization of this challenge has led to the production of a benchmark for lexical normalization of social media, including an evaluation framework as well as an annotated corpus of Spanish tweets, TweetNorm_es, which we make publicly available. The creation of this benchmark and the evaluation have brought to light the types of words that the submitted systems handled best, and point to the main shortcomings to be addressed in future work.

    Normalización de texto en español de Argentina

    Thesis (Licentiate in Computer Science), Universidad Nacional de Córdoba, Facultad de Matemática, Astronomía, Física y Computación, 2018. Nowadays, the amount of data consumed and generated by a single person is enormous. The volume of data keeps growing because anyone can generate it, and this brings an increase in the noise those data contain. That is why social network text is characteristically noisy, which is a problem for anyone who wants to work with it. In this work, we build a corpus of tweets in Argentinian Spanish. We collected a large set of tweets and then selected them manually to obtain a representative sample of typical normalization errors. We then defined clear and explicit correction criteria and used them to annotate the corpus manually. In addition, we present a text normalization system that works on tweets. Given a set of tweets as input, the system detects and corrects the words that need to be standardized. To do so, it uses a series of components such as lexical resources, rule-based systems and language models. Finally, we ran experiments with different corpora, including our own, and with different system configurations to understand the advantages and disadvantages of each.

    Social Media Influencers: An Examination of Influence Throughout the Customer Journey

    Social media influencers (SMI) expanded exponentially in both numbers and credibility shortly after the widespread emergence of social media platforms like Facebook and Instagram. Firms have noticed this increase and as a result, diverted billions of dollars in their marketing budgets toward SMI endorsements and campaigns, and away from traditional media. As often happens with quickly occurring phenomena, academic research is subsequently racing to understand the integral roles SMIs now command in social media marketing, and in marketing in general. Much of the latest research designed to understand and measure the effects of SMIs relies on previous research into traditional celebrity endorsers. SMI attributes and approaches have been researched like previous traditional celebrity studies. Another emerging and relevant topic is para-social relationships – in which followers feel as if they know the influencer like a friend though the SMI likely does not feel the same way. While there are similarities, major differences exist between traditional celebrities and SMIs. Examples include the delivery via social media platforms, increased engagement through the platforms, and uploadable user-generated content (UGC). Unlike musicians, athletes, and actresses, SMIs are generating their stardom and followings on social media platforms with their UGC. Though the traditional celebrity concept is still quite relevant regarding endorsements, younger consumers have been opting for less traditional media for entertainment purposes. Businesses have realized reaching Generation Z is effective and efficient through SMIs. This study advances the SMI literature in understanding the differences in para-social relationships formed with SMIs and their role throughout selected components of the customer journey rather than individual parts of it.