
    Computational behavioral analytics: estimating psychological traits in foreign languages.

    The proliferation of technology in the workplace has increased the threat of losing intellectual property, classified material, and proprietary information for companies, governments, and academics. This can cause economic damage to the creators of new IP, to companies, and to whole economies. The same proliferation has also helped terror groups and lone-wolf actors push their message to a larger audience, or find like-minded groups that share common, sometimes flawed, beliefs across various social media platforms. These challenges have motivated numerous studies in psycholinguistics, as well as commercial tools, that aim to identify potential threats before they have an opportunity to commit malicious acts. This has led to an area of study that this dissertation defines as "Computational Behavioral Analytics." A common practice in Natural Language Processing (NLP) studies (both commercial and academic) conducted on foreign-language text is to apply a Machine Translation (MT) system before performing the NLP task. In this dissertation, we explore three psycholinguistic tasks on foreign-language text and examine the effects (and failures) of MT systems on them, in order to push the field in a direction that will greatly improve the efficacy of such systems. Given the experimental results in this dissertation, it is highly recommended to avoid translations whenever the greatest levels of accuracy are necessary, such as for national security and law enforcement purposes. If translations must be used for any reason, scientists should conduct a full analysis of the impact of their chosen translation system on their estimates to determine which traits are most significantly affected. This will ensure that analysts and scientists are better informed of the potential inaccuracies and can adjust any resulting decisions accordingly.

    Chapter I introduces psycholinguistics and the benefits of using machine learning to estimate various psychological traits, and briefly discusses the privacy and legal issues that should be addressed to avoid abuse of such systems. Chapter II outlines the datasets used during experimentation and evaluation of the algorithms. Chapter III discusses the implementations of the algorithms used in the three psycholinguistic tasks: affect analysis, authorship attribution, and personality estimation. Chapter IV discusses the experiments run to understand the effects of MT on these tasks and how the tasks can be accomplished in the face of MT limitations, including the rationale for selecting the MT system used in this study. Chapter V concludes with a discussion of the findings and of future experimentation.
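    A minimal sketch of the recommended impact analysis follows; it is illustrative only, not code from the dissertation. The functions translate and estimate_affect are toy stand-ins for a real MT system and a real trait estimator; the point is only the shape of the comparison, scoring the same documents before and after translation and reporting the shift.

        # Illustrative sketch: measure how much machine translation shifts a
        # psycholinguistic trait estimate. Both helpers below are toy
        # stand-ins, not components from the dissertation.
        from statistics import mean

        POSITIVE = {"good", "happy", "great"}      # toy affect lexicon
        NEGATIVE = {"bad", "sad", "terrible"}

        def estimate_affect(text: str) -> float:
            """Toy lexicon-based affect score in [-1, 1]."""
            words = text.lower().split()
            if not words:
                return 0.0
            pos = sum(w in POSITIVE for w in words)
            neg = sum(w in NEGATIVE for w in words)
            return (pos - neg) / len(words)

        def translate(text: str, src: str, tgt: str = "en") -> str:
            """Stand-in for an MT system; a real study would call one here."""
            return text  # identity translation keeps the sketch runnable

        def mt_impact(docs: list[str], src_lang: str) -> float:
            """Mean absolute shift in the trait estimate caused by translation."""
            return mean(abs(estimate_affect(d) -
                            estimate_affect(translate(d, src_lang)))
                        for d in docs)

        print(mt_impact(["a good day", "a terrible day"], src_lang="en"))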

    Comparison of Style Features for the Authorship Verification of Literary Texts

    The article compares character-level, word-level, and rhythm features for the authorship verification of literary texts of the 19th-21st centuries. The text corpus contains fragments of novels; each fragment is about 50,000 characters long, and there are 40 fragments per author. Twenty authors who wrote in English, Russian, and French, and eight Spanish-language authors are considered. The authors of this paper use existing algorithms to calculate low-level features popular in computational linguistics, together with rhythm features characteristic of literary texts. Low-level features include word n-grams, frequencies of letters and punctuation marks, average word and sentence lengths, etc. Rhythm features are based on lexico-grammatical figures: anaphora, epiphora, symploce, aposiopesis, epanalepsis, anadiplosis, diacope, epizeuxis, chiasmus, polysyndeton, and repeated exclamatory and interrogative sentences. These features include the frequency of particular rhythm figures per 100 sentences, the number of unique words in the aspects of rhythm, and the percentages of nouns, adjectives, adverbs, and verbs in the aspects of rhythm. Authorship verification is treated as a binary classification problem: does the text belong to a particular author or not? AdaBoost and a neural network with an LSTM layer are considered as classification algorithms. The experiments demonstrate the effectiveness of rhythm features for verifying particular authors, and the superiority, on average, of combinations of feature types over single feature types. The best precision, recall, and F-measure for the AdaBoost classifier exceed 90% when all three feature types are combined.
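    As a rough illustration of this setup (not the authors' implementation), the sketch below treats verification as binary classification with scikit-learn's AdaBoost over concatenated character-level, word-level, and rhythm features. The rhythm-figure frequencies are assumed to be precomputed; rhythm_features is a hypothetical stand-in.

        # Illustrative sketch: authorship verification as binary classification
        # over a combination of the three feature types discussed in the article.
        import numpy as np
        from scipy.sparse import csr_matrix, hstack
        from sklearn.ensemble import AdaBoostClassifier
        from sklearn.feature_extraction.text import TfidfVectorizer

        def rhythm_features(texts: list[str]) -> np.ndarray:
            """Hypothetical stand-in: one row of rhythm-figure frequencies
            (e.g., anaphora per 100 sentences) per text."""
            return np.zeros((len(texts), 12))

        texts = ["fragment by the target author ...",
                 "fragment by another author ..."]
        labels = [1, 0]  # 1 = written by the target author

        char_vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 3))
        word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))

        # Combine all three feature types into one sparse matrix.
        X = hstack([char_vec.fit_transform(texts),
                    word_vec.fit_transform(texts),
                    csr_matrix(rhythm_features(texts))])

        clf = AdaBoostClassifier(n_estimators=200).fit(X, labels)
        print(clf.predict(X))  # in practice: evaluate on held-out fragments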

    Lingualyzer: A computational linguistic tool for multilingual and multidimensional text analysis

    Most natural language models and tools are restricted to one language, typically English. For researchers in the behavioral sciences who investigate languages other than English, or who would like to make cross-linguistic comparisons, hardly any computational linguistic tools exist, particularly for those who lack deep computational linguistic knowledge or programming skills. Yet for interdisciplinary researchers in fields ranging from psycholinguistics, social psychology, cognitive psychology, and education to literary studies, there certainly is a need for such a cross-linguistic tool. In the current paper, we present Lingualyzer (https://lingualyzer.com), an easily accessible tool that analyzes text at three levels (sentence, paragraph, document) and provides 351 multidimensional linguistic measures in 41 different languages. This paper gives an overview of Lingualyzer, categorizes its measures, demonstrates how it distinguishes itself from other text quantification tools, explains how it can be used, and provides validations. Lingualyzer is freely accessible for scientific purposes through an intuitive and easy-to-use interface.
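    Lingualyzer itself is a web tool, so the sketch below does not use its API. It only illustrates the kind of language-independent measures such a tool computes at the three text levels; the three measures shown are generic examples, not drawn from Lingualyzer's catalogue.

        # Illustrative sketch: a few language-independent measures computed at
        # the sentence, paragraph, and document levels.
        import re

        def measures(text: str) -> dict:
            words = re.findall(r"\w+", text.lower())
            if not words:
                return {"tokens": 0, "type_token_ratio": 0.0,
                        "mean_word_length": 0.0}
            return {
                "tokens": len(words),
                "type_token_ratio": len(set(words)) / len(words),
                "mean_word_length": sum(map(len, words)) / len(words),
            }

        def analyze(document: str) -> dict:
            paragraphs = [p for p in document.split("\n\n") if p.strip()]
            sentences = re.split(r"(?<=[.!?])\s+", document.strip())
            return {
                "document": measures(document),
                "paragraphs": [measures(p) for p in paragraphs],
                "sentences": [measures(s) for s in sentences],
            }

        print(analyze("Una frase corta. Otra frase.\n\nUn segundo párrafo."))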

    On the Detection of False Information: From Rumors to Fake News

    In recent years, the development of social media and online news agencies has brought new challenges and threats to the Web. These threats have drawn the attention of the Natural Language Processing (NLP) research community because they are polluting online social media platforms. One example of these threats is false information, in which false, inaccurate, or deceptive information is spread and shared by online users. False information is not limited to verifiable information; it also includes information that is used for harmful purposes. A further challenge researchers face is the massive number of users on social media platforms, which makes detecting the spreaders of false information far from easy. Previous work on limiting or studying the detection of false information has focused on understanding the language of false information from a linguistic perspective; in the case of verifiable information, these approaches have been proposed in a monolingual setting. Moreover, detecting the sources or the spreaders of false information in social media has hardly been investigated. In this thesis we study false information from several perspectives. First, since previous work focused on studying false information in a monolingual setting, we study it in a cross-lingual one: we propose different cross-lingual approaches, compare them to a set of monolingual baselines, and provide systematic studies of the evaluation results for better understanding. Second, we noticed that the role of affective information had not been investigated in depth. The second part of our research therefore studies the role of affective information in false information and shows how the authors of false content use it to manipulate the reader; we investigate several types of false information to understand the correlation between affective information and each type (Propaganda, Hoax, Clickbait, Rumor, and Satire). Last but not least, in an attempt to limit its spread, we also address the problem of detecting the spreaders of false information in social media. In this research direction, we focus on exploiting several text-based features extracted from the online profile messages of those spreaders, and we study different feature sets that have the potential to help discriminate false-information spreaders from fact checkers.

    Ghanem, BHH. (2020). On the Detection of False Information: From Rumors to Fake News [Doctoral thesis, thesis by compendium]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/158570
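    The spreader-detection direction can be illustrated with a minimal sketch. The three per-user features below are generic examples, not the thesis's feature sets; the point is the shape of the pipeline: aggregate text-based features over a user's profile messages, then train a classifier to separate spreaders from fact checkers.

        # Illustrative sketch: per-user text features and a simple classifier
        # for separating false-information spreaders from fact checkers.
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def user_features(messages: list[str]) -> list[float]:
            """A few generic aggregates over all of a user's messages."""
            text = " ".join(messages)
            words = text.split() or [""]
            return [
                text.count("!") / len(words),                          # exclamation rate
                sum(w.isupper() for w in words) / len(words),          # all-caps rate
                sum(w.startswith("http") for w in words) / len(words), # link rate
            ]

        users = [["BREAKING!!! share this NOW", "they LIE to you!"],
                 ["Checked against the official report: false.",
                  "source: http://example.org"]]
        labels = [1, 0]  # 1 = spreader, 0 = fact checker

        X = np.array([user_features(m) for m in users])
        clf = LogisticRegression().fit(X, labels)
        print(clf.predict(X))  # in practice: cross-validate on held-out users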

    FinBook: literary content as digital commodity

    This short essay explains the significance of the FinBook intervention and invites the reader to participate. We have associated each chapter within this book with a financial robot (FinBot) and created a market whereby book content is traded with financial securities. As human labour increasingly consists of unstable and uncertain work practices, and as algorithms replace people on the virtual trading floors of the world's markets, we see members of society taking advantage of FinBots to invest and make extra funds. Bots of all kinds are making financial decisions for us, searching online on our behalf to help us invest and to consume products and services. Our contribution to this compilation is to turn the collection of chapters in this book into a dynamic investment portfolio, and thereby play out what might happen to the process of buying and consuming literature in the not-so-distant future. By attaching identities (through QR codes) to each chapter, we create a market in which each chapter can ‘perform’. Our FinBots will trade based on features extracted from the authors’ words in this book: the political, ethical and cultural values embedded in the work, and the extent to which the FinBots share the authors’ concerns; and the performance of chapters amongst the human and non-human actors that make up the market and readership. In short, the FinBook model turns our work and the work of our co-authors into an investment portfolio, mediated by the market and the attention of readers. By creating a digital economy specifically around the content of online texts, our chapter and the FinBook platform aim to challenge readers to consider how their personal values align them with individual articles, and how these values become contested as readers make different judgements about the financial performance of each chapter and of the book as a whole. At the same time, by introducing ‘autonomous’ trading bots, we also explore the differing ‘network’ affordances of paper-based books, whose scarcity derives from their analogue form, and digital books, whose uniqueness is achieved through encryption. We thereby speak to wider questions about the conditions of an aggressive market in which algorithms subject cultural and intellectual items (books) to economic parameters, and about the increasing ubiquity of data bots as actors in our social, political, economic and cultural lives. We understand that our marketization of literature may be an uncomfortable juxtaposition against the conventionally imagined way a book is created, enjoyed and shared: it is intended to be.

    Tradition and Technology: A Design-Based Prototype of an Online Ginan Semantization Tool

    The heritage of ginans of the Nizari Ismaili community comprises over 1,000 individual hymn-like poems of varying lengths and languages. The ginans were originally composed to spread the teachings of the Satpanth Ismaili faith and served as scriptural texts that guided the normative understanding of the community in South Asia. The emotive melodies of the ginans continue to enchant members of the community in the diaspora who do not necessarily understand their language, which is mixed and borrows vocabulary from Indo-Aryan and Perso-Arabic dialects. This study prototypes an online tool that, through deliberate and purposeful use of information technology, blends Western best practices of language learning with the traditional transmission methods and materials of the Ismaili community. The study is based on the premise that for the teachings of the ginans to survive in the Euro-American diaspora, successive generations must learn and understand the vocabulary of the ginans. The process through which humans learn and master vocabulary is called semantization: learning and understanding the various senses and uses of words in a language. To this end, a sample ginan corpus was chosen and semantically analyzed to develop an online ginan lexicon, which was then used to enrich ginan texts with online glosses that facilitate semantization of ginan vocabulary (a minimal sketch of this step follows the abstract). The design-based research methodology for prototyping the tool comprised two design iterations of analysis, design, and review. In the first iteration, the initial design of the prototype was based on a multidisciplinary literature review and an in-depth semantic analysis of ginan materials; this design was then reviewed by community ginan experts and teachers to inform the next iteration. In the second iteration, the initial design was developed into a functional prototype by adding features based on the experts' suggestions and on the needs of community learners, gathered by surveying a convenience sample of 515 community members across the globe. The analysis of the survey data revealed that over 90% of participants preferred English materials for learning and understanding the language of the ginans. In addition, online access to ginan materials was expressed as a dire need for the community to engage with the ginans, and the development and dissemination of curriculum-based educational programs and supporting resources emerged as the most urgent and unmet expectations of the community. The study also confirmed that the wide availability of an online ginan learning tool, such as the one designed in this study, is highly desired by English-speaking community members who want to learn and understand the tradition and teachings of the ginans. However, such a tool is only part of the solution for fostering sustainable community engagement with the preservation of ginans. To ensure that the tradition is carried forward by future generations with compassion and understanding, community institutions must make ginans an educational priority and ensure that educational resources for ginans are widely available to community members.
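    The enrichment step mentioned above can be sketched as a simple lookup; the lexicon entries below are hypothetical placeholders, not content from the study's ginan lexicon.

        # Illustrative sketch: attach English glosses from a lexicon to the
        # words of a ginan line. Entries here are hypothetical placeholders.
        LEXICON = {
            "ginan": "devotional hymn",
            "satpanth": "true path",
        }

        def gloss_line(line: str) -> str:
            """Append an English gloss, where known, after each word."""
            out = []
            for word in line.split():
                gloss = LEXICON.get(word.lower().strip(".,;"))
                out.append(f"{word} [{gloss}]" if gloss else word)
            return " ".join(out)

        print(gloss_line("Listen to the ginan"))
        # -> Listen to the ginan [devotional hymn]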

    Detecting offensive speech in conversational code-mixed dialogue on social media: A contextual dataset and benchmark experiments

    The spread of hate speech on online platforms is a severe issue for societies and requires platforms to identify offensive content. Research has modeled hate speech recognition as a text classification problem that predicts the class of a message based on the text of the message alone. However, context plays a huge role in communication: in particular, for short messages, the text of the preceding tweets can completely change the interpretation of a message within a discourse. This work extends previous efforts to classify hate speech by considering the current and previous tweets jointly, and introduces a clearly defined way of extracting context. We present the first dataset for conversation-based hate speech classification, built with an approach for collecting context from long conversations, for code-mixed Hindi (the ICHCL dataset). Overall, our benchmark experiments show that including context can improve classification performance over a baseline. Furthermore, we develop a novel pipeline for processing this context. The best-performing pipeline uses a fine-tuned SentBERT paired with an LSTM as the classifier and achieves a macro F1 score of 0.892 on the ICHCL test set. Another pipeline, based on KNN, SentBERT, and ABC weighting, yields a macro F1 of 0.807, the best result among the traditional classifiers; even a KNN model thus gives better results with an optimized BERT than a vanilla BERT model.
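    The best-performing pipeline can be sketched roughly as follows. The encoder model name, embedding size, and classifier dimensions below are assumptions standing in for the paper's fine-tuned SentBERT and tuned LSTM, not the actual configuration.

        # Illustrative sketch: encode a tweet and its conversational context
        # with a sentence encoder, then classify the sequence with an LSTM.
        import torch
        import torch.nn as nn
        from sentence_transformers import SentenceTransformer

        # Stand-in encoder; the paper fine-tunes SentBERT on ICHCL instead.
        encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

        class ContextLSTM(nn.Module):
            def __init__(self, emb_dim: int = 384, hidden: int = 128,
                         classes: int = 2):
                super().__init__()
                self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
                self.head = nn.Linear(hidden, classes)

            def forward(self, seq):              # seq: (batch, tweets, emb_dim)
                _, (h, _) = self.lstm(seq)
                return self.head(h[-1])          # classify from the final state

        thread = ["first tweet in the conversation",
                  "a reply that sets the context",
                  "the target message to classify"]
        emb = torch.tensor(encoder.encode(thread)).unsqueeze(0)  # (1, 3, 384)
        logits = ContextLSTM()(emb)
        print(logits.softmax(dim=-1))  # untrained weights; shown for shape only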