40 research outputs found

    A Profile-Based Method for Authorship Verification

    Abstract. Authorship verification is one of the most challenging tasks in style-based text categorization. Given a set of documents, all by the same author, and another document of unknown authorship, the question is whether the latter is also by that author. Recently, in the framework of the PAN-2013 evaluation lab, a competition in authorship verification was organized, and the vast majority of submitted approaches, including the best-performing models, followed the instance-based paradigm, where each text sample by one author is treated separately. In this paper, we show that the profile-based paradigm (where all samples by one author are treated cumulatively) can be very effective, surpassing the performance of the PAN-2013 winners without using any information from external sources. The proposed approach is fully trainable, and we demonstrate an appropriate tuning of parameter settings for the PAN-2013 corpora, achieving accurate answers especially when the cost of false negatives is high.
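    The profile-based paradigm described in this abstract can be illustrated with a short sketch: all known documents are pooled into a single character n-gram profile, which is then compared to the profile of the questioned document. This is a minimal illustration under assumed settings (character trigrams, cosine similarity, a fixed acceptance threshold), not the authors' exact model.

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Extract overlapping character n-grams from a text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def profile(texts, n=3, size=500):
    """Build one cumulative author profile from all known texts:
    normalized frequencies of the `size` most common n-grams."""
    counts = Counter()
    for t in texts:
        counts.update(char_ngrams(t, n))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.most_common(size)}

def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(p[g] * q[g] for g in set(p) & set(q))
    np_ = math.sqrt(sum(v * v for v in p.values()))
    nq = math.sqrt(sum(v * v for v in q.values()))
    return dot / (np_ * nq) if np_ and nq else 0.0

def verify(known_texts, unknown_text, threshold=0.5):
    """Accept the unknown text if its profile is similar enough to the
    cumulative profile of the known texts (threshold is illustrative)."""
    return cosine(profile(known_texts), profile([unknown_text])) >= threshold
```

    The key contrast with the instance-based paradigm is that `profile` sees all known documents at once, so short individual samples still contribute to one statistically robust representation.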

    Authorship Verification, Neighborhood-based Classification

    The authorship analysis task has become a key tool for the analysis of digital documents in forensic science. We propose a neighborhood-based classification method for authorship verification that analyzes the similarities between a document of unknown authorship and the sample documents of an author, without estimating threshold values from training data. We implement two strategies for representing an author's documents: one instance-based and one profile-based (computing a centroid). We evaluate the methods on different data collections, varying the number of samples, the textual genres, and the topics addressed. We analyze the contribution of each comparison function and of each feature used, and take as the final decision a majority vote over every function-feature pair used in the document similarity. The tests were carried out on the public data sets of the PAN 2014 and 2015 authorship verification competitions. The results obtained are promising; they allow us to evaluate our proposal and to identify future work.
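    The majority vote over function-feature pairs described above can be sketched as follows; the feature extractors, similarity functions, and neighbourhood criterion here are illustrative stand-ins, not the authors' exact configuration.

```python
import math
from collections import Counter

def word_freqs(text):
    """Feature 1 (assumed): word frequency counts."""
    return Counter(text.lower().split())

def char_freqs(text, n=3):
    """Feature 2 (assumed): character trigram counts."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def jaccard(a, b):
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0

def vote(known, unknown, feature, sim):
    """One function-feature vote: accept if the unknown document is at
    least as similar to the author's documents as they are to each other
    (a neighbourhood criterion, so no threshold is trained)."""
    reps = [feature(d) for d in known]
    u = feature(unknown)
    to_unknown = sum(sim(r, u) for r in reps) / len(reps)
    pairs = [(a, b) for i, a in enumerate(reps) for b in reps[i + 1:]]
    within = sum(sim(a, b) for a, b in pairs) / len(pairs)
    return to_unknown >= within

def verify(known, unknown):
    """Majority vote across every similarity-function / feature pair."""
    votes = [vote(known, unknown, f, s)
             for f in (word_freqs, char_freqs)
             for s in (cosine, jaccard)]
    return sum(votes) > len(votes) / 2
```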

    On the Detection of False Information: From Rumors to Fake News

    Doctoral thesis by compendium of publications. In recent years, the development of social media and online news agencies has brought several challenges and threats to the Web. These threats have drawn the attention of the Natural Language Processing (NLP) research community, as they are polluting online social media platforms. One example of these threats is false information, in which false, inaccurate, or deceptive information is spread and shared by online users. False information is not limited to verifiable information; it also includes information used for harmful purposes. In addition, one of the challenges researchers face is the massive number of users on social media platforms, where detecting the spreaders of false information is not an easy job. Previous work on limiting or studying the detection of false information has focused on understanding the language of false information from a linguistic perspective. In the case of verifiable information, these approaches have been proposed in a monolingual setting. Moreover, detecting the sources or the spreaders of false information in social media has received little attention. In this thesis we study false information from several perspectives. First, since previous work focused on studying false information in a monolingual setting, in this thesis we study it in a cross-lingual one: we propose different cross-lingual approaches, compare them to a set of monolingual baselines, and provide systematic studies of the evaluation results for better understanding. Second, we noticed that the role of affective information had not been investigated in depth. The second part of our research therefore studies the role of affective information in false information and shows how the authors of false content use it to manipulate the reader. Here, we investigate several types of false information to understand the correlation between affective information and each type (propaganda, hoax, clickbait, rumor, and satire). Last but not least, in an attempt to limit its spread, we also address the problem of detecting false information spreaders in social media. In this research direction, we focus on exploiting several text-based features extracted from the online profile messages of those spreaders, studying different feature sets that have the potential to help discriminate false information spreaders from fact checkers. Ghanem, BHH. (2020). On the Detection of False Information: From Rumors to Fake News [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/158570

    Drawing Elena Ferrante's Profile. Workshop Proceedings, Padova, 7 September 2017

    Elena Ferrante is an internationally acclaimed Italian novelist whose real identity has been kept secret by the E/O publishing house for more than 25 years. Owing to her popularity, major Italian and foreign newspapers have long tried to discover her real identity. However, only a few attempts have been made to foster a scientific debate on her work. In 2016, Arjuna Tuzzi and Michele Cortelazzo led an Italian research team that conducted a preliminary study and collected a well-founded, large corpus of Italian novels comprising 150 works published in the last 30 years by 40 different authors. Moreover, they shared their data with a select group of international experts on authorship attribution, profiling, and analysis of textual data: Maciej Eder and Jan Rybicki (Poland), Patrick Juola (United States), Vittorio Loreto and his research team, Margherita Lalli and Francesca Tria (Italy), George Mikros (Greece), Pierre Ratinaud (France), and Jacques Savoy (Switzerland). The chapters of this volume report the results of this endeavour, first presented during the international workshop Drawing Elena Ferrante's Profile in Padua on 7 September 2017 as part of the 3rd IQLA-GIAT Summer School in Quantitative Analysis of Textual Data. The fascinating research findings suggest that Elena Ferrante's work definitely deserves "many hands" as well as an extensive effort to understand her distinct writing style and the reasons for her worldwide success.

    The text classification pipeline: Starting shallow, going deeper

    Text Classification (TC) is an increasingly relevant and crucial subfield of Natural Language Processing (NLP), tackled in this PhD thesis from a computer science and engineering perspective. In this field too, the exceptional success of deep learning has sparked a boom over the past ten years. Text retrieval and categorization, information extraction, and summarization all rely heavily on TC. The literature has presented numerous datasets, models, and evaluation criteria. Although languages such as Arabic, Chinese, and Hindi are employed in several works, the most widely used and referenced language in the TC literature, from a computer science perspective, is English; it is also the language mainly referenced in the rest of this PhD thesis. Even though numerous machine learning techniques have shown outstanding results, a classifier's effectiveness depends on its capability to comprehend intricate relations and non-linear correlations in texts. To achieve this level of understanding, it is necessary to pay attention not only to the architecture of a model but also to the other stages of the TC pipeline. Within NLP, a range of text representation techniques and model designs have emerged, including large language models, which can turn massive amounts of text into useful vector representations that effectively capture semantically significant information. A further aspect of crucial interest is that this field has been investigated by numerous communities, including data mining, linguistics, and information retrieval; these communities frequently overlap but are mostly separate and conduct their research on their own. Bringing researchers from these groups together to improve the multidisciplinary comprehension of the field is one of the objectives of this dissertation, which also makes an effort to examine text mining from both a traditional and a modern perspective.
    This thesis covers the whole TC pipeline in detail. Its main contribution is to investigate how every element in the TC pipeline affects the final performance of a TC model. The pipeline, including both traditional and the most recent deep learning-based models, is discussed: the State-Of-The-Art (SOTA) datasets used as benchmarks in the literature, text preprocessing, text representation, machine learning models for TC, evaluation metrics, and current SOTA results. In each chapter of this dissertation, I go over each of these steps, covering both the technical advancements and my most significant and recent findings obtained while performing experiments and introducing novel models. The advantages and disadvantages of the various options are also listed, along with a thorough comparison of the different approaches. Each chapter ends with my contributions: experimental evaluations and discussions of the results that I obtained during my three-year PhD course. The experiments and analysis related to each chapter (i.e., each element of the TC pipeline) are my main contributions, extending the basic knowledge of a regular survey on TC.
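    The pipeline stages enumerated above (preprocessing, representation, model, evaluation) can be sketched end to end; this minimal pure-Python version uses assumed choices (regex tokenization, TF-IDF weighting, a nearest-centroid classifier, plain accuracy) purely for illustration, not any specific model from the thesis.

```python
import math
import re
from collections import Counter, defaultdict

# Stage 1: preprocessing -- lowercase and keep alphabetic word tokens only.
def preprocess(text):
    return re.findall(r"[a-z]+", text.lower())

# Stage 2: representation -- TF-IDF weighted bag of words over a corpus.
def tfidf(corpus_tokens):
    df = Counter()
    for tokens in corpus_tokens:
        df.update(set(tokens))
    n = len(corpus_tokens)
    vectors = []
    for tokens in corpus_tokens:
        tf = Counter(tokens)
        vectors.append({w: (c / len(tokens)) * math.log((1 + n) / (1 + df[w]))
                        for w, c in tf.items()})
    return vectors

# Stage 3: model -- a nearest-centroid classifier over TF-IDF vectors.
def centroids(vectors, labels):
    sums, counts = defaultdict(Counter), Counter()
    for v, y in zip(vectors, labels):
        sums[y].update(v)
        counts[y] += 1
    return {y: {w: s / counts[y] for w, s in acc.items()}
            for y, acc in sums.items()}

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na, nb = (math.sqrt(sum(v * v for v in x.values())) for x in (a, b))
    return dot / (na * nb) if na and nb else 0.0

def predict(cents, vector):
    return max(cents, key=lambda y: cosine(cents[y], vector))

# Stage 4: evaluation -- plain accuracy over labelled examples.
def accuracy(cents, vectors, labels):
    hits = sum(predict(cents, v) == y for v, y in zip(vectors, labels))
    return hits / len(labels)
```

    The point of the sketch is that every stage is swappable: a neural encoder can replace `tfidf`, a deep model can replace the centroid classifier, and the evaluation metric can change, which is exactly the per-element impact the thesis investigates.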

    Detecting deceptive behaviour in the wild: text mining for online child protection in the presence of noisy and adversarial social media communications

    A real-life application of text mining research "in the wild", i.e. in online social media, differs from more general applications in that its defining characteristics are both domain and process dependent. This gives rise to a number of challenges of which contemporary research has only scratched the surface. More specifically, a text mining approach applied in the wild typically has no control over the dataset size. Hence, the system has to be robust to limited data availability, a variable number of samples across users, and a highly skewed dataset. Additionally, the quality of the data cannot be guaranteed, so the approach needs to tolerate a certain degree of linguistic noise. Finally, it has to be robust to deceptive behaviour and adversaries. This thesis examines the viability of a text mining approach for supporting cybercrime investigations pertaining to online child protection. The main contributions of this dissertation are as follows. A systematic study of different aspects of the methodological design of a state-of-the-art text mining approach is presented to assess its scalability to a large, imbalanced, and linguistically noisy social media dataset. In this framework, three key automatic text categorisation tasks are examined, namely the feasibility of (i) identifying a social network user's age group and gender based on the textual information found in a single message; (ii) aggregating predictions at the message level to the user level without neglecting potential clues of deception, and thereby detecting false user profiles on social networks; and (iii) identifying child sexual abuse media among thousands of other, legal media, including adult pornography, based on their filenames. Finally, a novel approach is presented that combines age group predictions with advanced text clustering techniques and unsupervised learning to identify online child sex offenders' grooming behaviour.
    The methodology presented in this thesis was extensively discussed with law enforcement to assess its forensic readiness. Additionally, each component was evaluated on actual child sex offender data. Despite the challenging characteristics of these text types, the results show high degrees of accuracy for false profile detection, identification of grooming behaviour, and identification of child sexual abuse media.
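    The message-to-user aggregation step mentioned in the abstract can be sketched as follows; this simple probability-averaging scheme is a hypothetical illustration, whereas the thesis's actual aggregation is designed not to neglect per-message deception cues.

```python
from collections import defaultdict

def aggregate_user_scores(message_predictions):
    """Aggregate per-message probabilities (e.g. of a profile being false)
    into a per-user score by simple averaging. `message_predictions` is a
    list of (user_id, probability) pairs, one per message."""
    by_user = defaultdict(list)
    for user, prob in message_predictions:
        by_user[user].append(prob)
    return {user: sum(ps) / len(ps) for user, ps in by_user.items()}

def flag_users(message_predictions, threshold=0.5):
    """Flag users whose mean message-level score crosses the threshold
    (the threshold value here is purely illustrative)."""
    scores = aggregate_user_scores(message_predictions)
    return [u for u, s in scores.items() if s >= threshold]
```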

    A Computational Academic Integrity Framework

    The growing scope and changing nature of academic programmes pose a challenge to the integrity of traditional testing and examination protocols. The aim of this thesis is to introduce an alternative to the traditional approaches to academic integrity, bridging the anonymity gap and empowering instructors and academic administrators with new ways of maintaining academic integrity that preserve privacy, minimize disruption to the learning process, and promote accountability, accessibility, and efficiency. This work aims to initiate a paradigm shift in academic integrity practices. Research in the area of learner identity and authorship assurance is important because the award of course credits to unverified entities is detrimental to institutional credibility and public safety. This thesis builds upon the notion of learner identity as consisting of two distinct layers (a physical layer and a behavioural layer), where the criteria of identity and authorship must both be confirmed to maintain a reasonable level of academic integrity. To pursue this goal in an organized fashion, the thesis is divided into three sections: (a) theoretical, (b) empirical, and (c) pragmatic.

    Geographic information extraction from texts

    A large volume of unstructured texts, containing valuable geographic information, is available online. This information, provided implicitly or explicitly, is useful not only for scientific studies (e.g., spatial humanities) but also for many practical applications (e.g., geographic information retrieval). Although great progress has been made in geographic information extraction from texts, there are still unsolved challenges and issues, ranging from methods, systems, and data to applications and privacy. This workshop therefore provides a timely opportunity to discuss recent advances, new ideas, and concepts, and to identify research gaps in geographic information extraction.