    Terminology mining in social media

    The highly variable and dynamic word usage in social media presents serious challenges both for research and for commercial applications geared towards blogs or other user-generated, non-editorial texts. This paper discusses and exemplifies a terminology mining approach for dealing with the productive character of the textual environment in social media. We explore the challenges of practically acquiring new terminology, and of modeling similarity and relatedness of terms from observing realistic amounts of data. We also discuss semantic evolution and density, and investigate novel measures for characterizing the preconditions for terminology mining.

    Authorship Identification in Bengali Literature: a Comparative Analysis

    Stylometry is the study of the unique linguistic styles and writing behaviors of individuals. It underlies core text categorization tasks such as authorship identification and plagiarism detection. Though a reasonable number of studies have been conducted on English, no major work has so far been done on Bengali. In this work, we present a demonstration of authorship identification for documents written in Bengali. We adopt a set of fine-grained stylistic features for the analysis of the text and use them to develop two different models: a statistical similarity model consisting of three measures and their combination, and a machine learning model with Decision Tree, Neural Network and SVM. Experimental results show that SVM outperforms other state-of-the-art methods under 10-fold cross-validation. We also validate the relative importance of each stylistic feature to show that some of them remain consistently significant in every model used in this experiment. Comment: 9 pages, 5 tables, 4 pictures.

    A comparative analysis of metal subgenres in terms of lexical richness and keyness

    Metal music is realized in a vast variety of subgenres, all of which have their unique (or shared) characteristics not only in sound but also in their lyrics. Much research has been done to distinguish or classify subgenres, but little has addressed the linguistic differences across them. This study seeks to determine the lexical richness and keyness levels of heavy metal, thrash metal and death metal using a corpus of 200 songs from each subgenre, for a total of 600 songs. The bands and songs were selected by finding references in the metal literature, which in the present study covers academic books and articles on metal as well as noteworthy media productions, websites and metal blogs such as Metal Evolution and Encyclopaedia Metallum. The song lyrics were manually processed, and meta-data, mark-ups and repeats were removed so that differences in repeat lengths would not affect the comparisons. Furthermore, the analyses used in the study are sensitive to repeats, as they measure the frequencies and repeat ratios of the words. After processing, the song lengths were limited to lower and upper thresholds of 100 and 400 words. The songs were analyzed for their lexical richness levels in three aspects: 1) lexical variation, 2) lexical sophistication and 3) lexical density. Lexical variation was operationalized as TTR, Guiraud, Uber and HD-D. Lexical sophistication was measured using the lexical frequency profile with two different frequency lists (the GSL and the BNC/COCA), by looking at the ratios of tokens and types which fell beyond the most frequent two thousand words (Laufer 1995). Another sophistication measure, P_Lex, which also runs on the GSL, was applied. Lexical density analysis was based on the ratio of content words to all tokens in the texts. To complement this quantitative and data-driven approach, a keyness analysis was administered to add a qualitative dimension to the research.
All lexical richness analyses pointed to statistically significant differences between all subgenres, marking heavy metal as the least and death metal as the most lexically rich one. Keyness analysis indicated differences among all three subgenres as well. Heavy metal key words tended to be Dionysian, whereas thrash and death metal key words were more Chaotic, as proposed by Weinstein (2000). Finally, a correlation analysis showed that all lexical richness measures were statistically significantly correlated with each other. Based on the findings, it could be claimed that 1) these three subgenres differ from each other not only in terms of music but also in lexical richness levels and key words, and 2) lexical richness analyses, coupled with keyness, are capable of reflecting the genre differences in song lyrics. However, a discriminant analysis of the present corpus shows that the reverse approach, in which genres are classified on the basis of lexical features, does not yield a pattern that fully corresponds to the existing classifications.
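
The three simpler lexical variation measures named in the abstract can be sketched as follows. This is a minimal illustration using the commonly cited textbook definitions (TTR = V/N, Guiraud = V/√N, Uber = (log N)² / (log N − log V), with N tokens and V types), not the study's own pipeline. The `tokenize` helper and the `lexical_variation` function name are assumptions introduced here, and HD-D is omitted because it needs a hypergeometric sampling model.

```python
import math
import re


def tokenize(text):
    # Hypothetical, deliberately naive tokenizer: lowercase runs of letters/apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def lexical_variation(tokens):
    """Return TTR, Guiraud's index and Uber's index for a list of tokens."""
    n = len(tokens)        # N: number of tokens
    v = len(set(tokens))   # V: number of types
    if n == 0:
        raise ValueError("empty token list")
    ttr = v / n
    guiraud = v / math.sqrt(n)
    # Uber's index is undefined when every token is a distinct type (log N == log V).
    if v == n:
        uber = float("inf")
    else:
        uber = math.log10(n) ** 2 / (math.log10(n) - math.log10(v))
    return {"ttr": ttr, "guiraud": guiraud, "uber": uber}


if __name__ == "__main__":
    sample = tokenize("The cat sat on the mat, the end")
    print(lexical_variation(sample))
```

Because TTR falls as texts get longer, measures of this kind are usually compared only across texts of similar length, which is consistent with the 100 to 400 word thresholds the abstract describes.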

    Measuring Lexical Richness through Type-Token Curve: a Corpus-Based Analysis of Arabic and English Texts

    WordSmith Tools (5.0) is used to analyze samples from texts of different genres written by eight different authors. These texts are grouped into two corpora: Arabic and English. The Arabic corpus includes textual samples from the Qur'an, Al-Sahifa al-Sajjadiyya (a prayer manual), Modern Standard Arabic (Mistaghanmy's novel Chaos of Sensations) and Imam Ali's Nahjul-Balagah (Peak of Eloquence). The English corpus comprises The New Testament, Conrad's Heart of Darkness, Dickens' David Copperfield, and Eliot's Adam Bede. Each textual sample is statistically analyzed to assess its lexical richness or vocabulary size. The number of tokens (total number of words) and the number of types (distinct vocabulary words) are counted for each sample. Then both numbers are plotted against each other using Microsoft Office Excel diagrams. The resulting curves in both corpora give a vivid idea of the lexical richness of each textual sample. They open an avenue for comparing the different authors in terms of their vocabulary size and the range at which they begin to exhaust their linguistic repertoire by repetition. The curves for Imam Ali's Nahjul-Balagah (Arabic corpus) and Conrad's Heart of Darkness (English corpus) rise the highest, reaching the maximum. By contrast, the Qur'anic verses and The New Testament have the lowest curves, owing to the ritualistic quality of their texts. Keywords: corpus stylistics, type-token curve, lexical richness
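
A type-token curve of the kind described above can be computed by counting the distinct types seen after each successive token. The sketch below is an illustration only: the study used WordSmith Tools and Excel, and the function name `type_token_curve` and its `step` parameter are assumptions introduced here.

```python
def type_token_curve(tokens, step=1):
    """Return (tokens_read, types_seen) points along a text.

    A point is recorded every `step` tokens and at the final token.
    """
    seen = set()
    curve = []
    total = len(tokens)
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0 or i == total:
            curve.append((i, len(seen)))
    return curve


if __name__ == "__main__":
    for n_tokens, n_types in type_token_curve("a b a c b d".split()):
        print(n_tokens, n_types)
```

Plotting the second column against the first gives the curve: a text that keeps introducing new words climbs steeply, while a repetitive text flattens early.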

    Lexical Composition of Effective L1 and L2 Students' Academic Presentations

    The present study set out to examine the lexical profiles of presentations by L1 students (n = 30) and proficient L2 students (n = 30), with the aim of establishing the overall lexical composition of successful academic presentations. It was also of interest to see how some of the presentations' lexical features compared to findings about the lexical composition of students' productively used vocabulary in writing. In addition, the analysis focused on the lexical composition of both groups' oral production in an attempt to uncover patterns of lexical use that may need to be discussed in oral communication courses, specifically those targeting the development of L1 and L2 students' presentation skills. Overall, the analysis revealed more similarities than differences in the lexical composition of the L1 and L2 presentations while, at the same time, outlining a few areas that need to be addressed in oral academic instruction.

    Lexical Complexity of Academic Presentations: Similarities Despite Situational Differences

    The present study examined the lexical complexity profiles of academic presentations by three groups of university students: native English-speaking, English as a second language, and English as a lingua franca users. It adopted a notion of lexical complexity whose main dimensions are lexical diversity, lexical density, and lexical sophistication. The study aimed to find out how the three academically similar groups of presenters compared in their lexical complexity choices, what the lexical complexity profiles of high-quality student academic presentations looked like, and whether variables could be identified that contribute in a unique way to the overall lexical complexity of presentations given by each group. The findings revealed overwhelming similarities across the three groups of presenters and also suggested that the three-dimensional framework provides a holistic picture of lexical complexity for various groups of English for academic purposes presenters.

    An affect-based video retrieval system with open vocabulary querying

    Content-based video retrieval (CBVR) systems are creating new search and browse capabilities using metadata describing significant features of the data. An often overlooked aspect of human interpretation of multimedia data is the affective dimension. Incorporating affective information into multimedia metadata can potentially enable search using this alternative interpretation of multimedia content. Recent work has described methods to automatically assign affective labels to multimedia data using various approaches. However, the subjective and imprecise nature of affective labels makes it difficult to bridge the semantic gap between system-detected labels and user expression of information requirements in multimedia retrieval. We present a novel affect-based video retrieval system incorporating an open-vocabulary query stage based on WordNet, enabling search using an unrestricted query vocabulary. The system performs automatic annotation of video data with labels drawn from well-defined affective terms. In retrieval, annotated documents are ranked using the standard Okapi retrieval model based on open-vocabulary text queries. We present experimental results examining the behaviour of the system for retrieval of a collection of automatically annotated feature films of different genres. Our results indicate that affective annotation can potentially provide useful augmentation to more traditional objective content description in multimedia retrieval.
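
The ranking step can be illustrated with a small BM25-style scorer. This sketch assumes that "the standard Okapi retrieval model" refers to Okapi BM25, which the abstract does not state explicitly. The parameter values `k1=1.2` and `b=0.75`, the smoothed non-negative IDF variant, the function name `bm25_scores` and the toy affective labels are all assumptions made for illustration, not details of the described system.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document (a non-empty list of label tokens) against the query."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for q in query_terms:
            if tf[q] == 0:
                continue
            # Smoothed IDF variant that stays non-negative (an assumption here).
            idf = math.log((n_docs - df[q] + 0.5) / (df[q] + 0.5) + 1.0)
            norm = tf[q] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[q] * (k1 + 1) / norm
        scores.append(score)
    return scores


if __name__ == "__main__":
    # Hypothetical affective annotations for three video segments.
    annotated = [
        ["fear", "fear", "tension"],
        ["joy", "calm"],
        ["fear", "joy", "calm", "calm"],
    ]
    print(bm25_scores(["fear"], annotated))
```

In an open-vocabulary setting, the query terms passed to such a scorer would first be mapped onto the system's affective label vocabulary; that mapping step is not shown here.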