12 research outputs found

    Hi Doppelgänger: Towards Detecting Manipulation in News Comments

    Public opinion manipulation is a serious threat to society, potentially influencing elections and the political situation even in established democracies. The prevalence of online media and the opportunity for users to express opinions in comments magnifies the problem. Governments, organizations, and companies can exploit this situation to bias opinions. Typically, they deploy a large number of pseudonyms to create the impression of a crowd that supports specific opinions. Side channel information (such as IP addresses or identities of browsers) often allows reliable detection of pseudonyms managed by a single person. However, spoofing or anonymizing the data that links these accounts is simple, and linking accounts without such data is very challenging. In this paper, we evaluate whether stylometric features allow detection of such doppelgängers within comment sections on news articles. To this end, we adapt a state-of-the-art doppelgänger detector to work on small texts (such as comments) and apply it to three popular news sites in two languages. Our results reveal that detecting potential doppelgängers based on linguistics is a promising approach even when no reliable side channel information is available. Preliminary results from an application in the wild show indications of doppelgängers in real-world data sets.

    Authorship Identification and Writeprint Visualization

    The Internet provides an ideal anonymous channel for concealing computer-mediated malicious activities, as the network-based origins of critical electronic textual evidence (e.g., emails, blogs, forum posts, chat logs, etc.) can be easily repudiated. Authorship attribution is the study of identifying the actual author of given anonymous documents based on the text itself, and, for decades, many linguistic stylometry and computational techniques have been extensively studied for this purpose. However, most previous research emphasizes improving authorship attribution accuracy, and little work has been done on constructing and visualizing the evidential traits; moreover, these sophisticated techniques are difficult for cyber investigators or linguistic experts to interpret. In this thesis, based on the EEDI (End-to-End Digital Investigation) Framework, we propose a visualizable evidence-driven approach, namely VEA, which aims at facilitating the work of cyber investigation. Our comprehensive controlled experiment and stratified experiment on the real-life Enron email data set both demonstrate that our approach can achieve even higher accuracy than traditional methods; meanwhile, its output can be easily visualized and interpreted as evidential traits. In addition to identifying the most plausible author of a given text, our approach also estimates the confidence of the predicted result based on a given identification context and presents visualizable linguistic evidence for each candidate.

    RETOS DE LA ESTILÍSTICA FORENSE EN EL ÁMBITO DEL DISCURSO ELECTRÓNICO DELICTIVO

    Despite its benefits, the Internet provides an accessible, affordable and anonymous way to disseminate offensive content or hate speech. Forensic Linguistics includes among its objects of study the authorship attribution of this type of message. This study looks into the key methodological aspects to be considered in authorship attribution, among them the selection of the most appropriate features, the text size, and how to draw conclusions from the data. There is still a long way to go before some of the related problems in this scientific field are solved.

    Ranking to Learn and Learning to Rank: On the Role of Ranking in Pattern Recognition Applications

    The last decade has seen a revolution in the theory and application of machine learning and pattern recognition. Through these advancements, variable ranking has emerged as an active and growing research area, and it is now beginning to be applied to many new problems. The rationale is that many pattern recognition problems are by nature ranking problems. The main objective of a ranking algorithm is to sort objects according to some criteria, so that the most relevant items appear early in the produced result list. Ranking methods can be analyzed from two different methodological perspectives: ranking to learn and learning to rank. The former studies methods and techniques for sorting objects in order to improve the accuracy of a machine learning model. Enhancing a model's performance can be challenging at times. For example, in pattern classification tasks, different data representations can complicate and hide the different explanatory factors of variation behind the data. In particular, hand-crafted features contain many cues that are either redundant or irrelevant, which reduces the overall accuracy of the classifier. In such cases feature selection is used, which, by producing ranked lists of features, helps to filter out the unwanted information. Moreover, in real-time systems (e.g., visual trackers) ranking approaches are used as optimization procedures that improve the robustness of a system dealing with highly variable image streams that change over time. Conversely, learning to rank is necessary in the construction of ranking models for information retrieval, biometric authentication, re-identification, and recommender systems. In this context, the ranking model's purpose is to sort objects according to their degrees of relevance, importance, or preference as defined in the specific application. Comment: European PhD Thesis. arXiv admin note: text overlap with arXiv:1601.06615, arXiv:1505.06821, arXiv:1704.02665 by other author

    Uticaj klasifikacije teksta na primene u obradi prirodnih jezika

    The main goal of this dissertation is to put different text classification tasks in the same frame by mapping the input data into a common vector space of linguistic attributes. Subsequently, several classification problems of great importance for natural language processing are solved by applying the appropriate classification algorithms. The dissertation deals with the problem of validating bilingual translation pairs, where the final goal is to construct a classifier that substitutes for human evaluation and decides, by applying a variety of linguistic information and methods, whether a pair is a proper translation between the languages concerned. In dictionaries it is useful to have a sentence that demonstrates the use of a particular dictionary entry. This task is called the classification of good dictionary examples. In this thesis, a method is developed that automatically estimates whether an example is good or bad for a specific dictionary entry. Two cases of short message classification are also discussed. In the first case, the classes are the authors of the messages, and the task is to assign each message to its author from that fixed set. This task is called authorship identification. The other classification of short messages considered is called opinion mining, or sentiment analysis. Starting from the assumption that a short message carries a positive or negative attitude about a thing, or is purely informative, the classes can be positive, negative and neutral. These tasks are of great importance in the field of natural language processing, and the proposed solutions are language-independent, based on machine learning methods: support vector machines, decision trees and gradient boosting. For all of these tasks, the effectiveness of the proposed methods is demonstrated for the Serbian language.

    Conversationally-Inspired Stylometric Features for Authorship Attribution in Instant Messaging

    Authorship attribution (AA) aims at automatically recognizing the author of a given text sample. Traditionally applied to literary texts, AA now faces the new challenge of recognizing the identity of people involved in chat conversations. These share many aspects with spoken conversations, but AA approaches have not taken this into account so far. Hence, this paper tries to fill the gap and proposes two novelties that improve the effectiveness of traditional AA approaches for this type of data: the first is to adopt features inspired by Conversation Analysis (in particular for turn-taking), the second is to extract the features from individual turns rather than from entire conversations. The experiments have been performed over a corpus of dyadic chat conversations (77 individuals in total). The performance in identifying the persons involved in each exchange, measured in terms of area under the Cumulative Match Characteristic curve, is 89.5%.

    Browse-to-search

    This demonstration presents a novel interactive online shopping application based on visual search technologies. When users want to buy something on a shopping site, they usually need to look for related information on other web sites. Users therefore have to switch between the web page being browsed and other websites that provide search results. The proposed application enables users to naturally search for products of interest while they browse a web page, and makes even their casual purchase intent easy to satisfy. The interactive shopping experience is characterized by: 1) in session - it allows users to specify the purchase intent in the browsing session, instead of leaving the current page and navigating to other websites; 2) in context - the browsed web page provides implicit context information which helps infer user purchase preferences; 3) in focus - users easily specify their search interest using gestures on touch devices and do not need to formulate queries in a search box; 4) natural - gesture inputs and visual-based search provide users with a natural shopping experience. The system is evaluated against a data set consisting of several million commercial product images. © 2012 Authors