22 research outputs found

    A Profile-Based Method for Authorship Verification

    Abstract. Authorship verification is one of the most challenging tasks in style-based text categorization. Given a set of documents, all by the same author, and another document of unknown authorship, the question is whether or not the latter is also by that author. Recently, in the framework of the PAN-2013 evaluation lab, a competition in authorship verification was organized, and the vast majority of submitted approaches, including the best-performing models, followed the instance-based paradigm, where each text sample by one author is treated separately. In this paper, we show that the profile-based paradigm (where all samples by one author are treated cumulatively) can be very effective, surpassing the performance of the PAN-2013 winners without using any information from external sources. The proposed approach is fully trainable, and we demonstrate an appropriate tuning of parameter settings for the PAN-2013 corpora, achieving accurate answers especially when the cost of false negatives is high.
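    The profile-based paradigm described above (all known samples treated cumulatively as one author profile) can be sketched in a few lines. This is a minimal illustration using character trigrams and cosine similarity; the feature choice, the function names, and the 0.5 threshold are illustrative assumptions, not the authors' tuned settings:

    ```python
    from collections import Counter

    def char_ngrams(text, n=3):
        """Character n-gram counts, a common style-based feature."""
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    def cosine(a, b):
        """Cosine similarity between two sparse count vectors."""
        dot = sum(a[g] * b[g] for g in set(a) & set(b))
        na = sum(v * v for v in a.values()) ** 0.5
        nb = sum(v * v for v in b.values()) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    def verify_profile_based(known_docs, unknown_doc, threshold=0.5):
        # Profile-based paradigm: concatenate ALL known samples into a
        # single cumulative profile, then compare the unknown document
        # against that profile rather than against each sample separately.
        profile = char_ngrams(" ".join(known_docs))
        return cosine(profile, char_ngrams(unknown_doc)) >= threshold
    ```

    In the instance-based paradigm, by contrast, `char_ngrams` would be applied to each known document separately and the per-document similarities aggregated afterwards.
    
    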

    Authorship Verification, Neighborhood-based Classification

    Authorship analysis has become a decisive tool for the examination of digital documents in forensic science. We propose a neighborhood-based Authorship Verification method that analyzes the similarity between a document of unknown authorship and an author's sample documents, without estimating thresholds from training data. We implement two strategies for representing an author's documents: one instance-based and one profile-based, computed as a centroid. We evaluate the method on collections that differ in the number of samples, the textual genres, and the topics addressed. We analyze the contribution of each comparison function and each feature, and take the final decision by a majority vote over every function-feature pair used to measure similarity between documents. The experiments were carried out on the public data sets of the PAN 2014 and 2015 Authorship Verification competitions. The results obtained are promising and allow us to assess our proposal and to identify future work.
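    The majority vote over function-feature pairs described above can be sketched as follows; the particular features (word sets, character trigrams), the Jaccard comparison function, and the 0.5 threshold are illustrative assumptions, not the paper's actual configuration:

    ```python
    def jaccard(a, b):
        """Set-overlap similarity between two feature sets."""
        return len(a & b) / len(a | b) if a | b else 0.0

    def word_set(text):
        return set(text.lower().split())

    def char_trigrams(text):
        return {text[i:i + 3] for i in range(len(text) - 2)}

    def majority_vote_verify(known_docs, unknown_doc, scorers, threshold=0.5):
        """Each (feature extractor, comparison function) pair casts one vote;
        the final decision is the majority of those votes."""
        votes = 0
        for extract, compare in scorers:
            u = extract(unknown_doc)
            # Neighborhood idea: best similarity to any known sample.
            best = max(compare(extract(d), u) for d in known_docs)
            if best >= threshold:
                votes += 1
        return votes * 2 > len(scorers)
    ```

    A usage example: `majority_vote_verify(samples, unknown, [(word_set, jaccard), (char_trigrams, jaccard)])` answers the verification question with two independent votes.
    
    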

    Geographic information extraction from texts

    A large volume of unstructured text containing valuable geographic information is available online. This information – provided implicitly or explicitly – is useful not only for scientific studies (e.g., spatial humanities) but also for many practical applications (e.g., geographic information retrieval). Although considerable progress has been made in geographic information extraction from texts, there are still unsolved challenges and issues, ranging from methods, systems, and data to applications and privacy. This workshop therefore provides a timely opportunity to discuss recent advances, new ideas, and concepts, and to identify research gaps in geographic information extraction.

    Leveraging Longitudinal Data for Personalized Prediction and Word Representations

    This thesis focuses on personalization, word representations, and longitudinal dialog. We first look at users' expressions of individual preferences. In this targeted sentiment task, we find that we can improve entity extraction and sentiment classification using domain lexicons and linear term weighting. This task is important to personalization and dialog systems, as targets need to be identified in conversation and personal preferences affect how the system should react. Then we examine individuals with large amounts of personal conversational data in order to better predict what people will say. We consider extra-linguistic features that can be used to predict behavior and the relationship between interlocutors. We show that these features improve over using message content alone, and that training on personal data leads to much better performance than training on a sample from all other users. We look not just at using personal data for these end tasks, but also at constructing personalized word representations. When we have a lot of data for an individual, we create personalized word embeddings that improve performance on language modeling and authorship attribution. When we have limited data but know user demographics, we can instead construct demographic word embeddings. We show that these representations improve language modeling and word association performance. When we do not have demographic information, we show that, using a small amount of data from an individual, we can calculate similarity to existing users and interpolate or leverage data from these users to improve language modeling performance. Using these types of personalized word representations, we are able to provide insight into which words vary more across users and demographics. The kind of personalized representations introduced in this work enables applications such as predictive typing, style transfer, and dialog systems. Importantly, they also have the potential to enable more equitable language models, with improved performance for demographic groups that have little representation in the data.
    PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/167971/1/cfwelch_1.pd
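    The interpolation idea mentioned in the abstract (blending a sparse per-user model with data from other users or a background population) can be sketched with a simple unigram mixture. The unigram form and the mixing weight λ are illustrative assumptions, not the thesis's actual models:

    ```python
    from collections import Counter

    def unigram_lm(texts):
        """Maximum-likelihood unigram language model from a list of texts."""
        counts = Counter(w for t in texts for w in t.lower().split())
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def interpolated_prob(word, user_lm, background_lm, lam=0.3):
        # Blend the per-user model with a background model so that a user
        # with little data still receives sensible probability estimates.
        return lam * user_lm.get(word, 0.0) + (1 - lam) * background_lm.get(word, 0.0)
    ```

    In the thesis's setting, the background model could instead be built from the most similar existing users rather than from all users, which is the "calculate similarity and interpolate" step described above.
    
    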

    Proceedings of the Seventh Italian Conference on Computational Linguistics CLiC-it 2020

    On behalf of the Program Committee, a very warm welcome to the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020). This edition of the conference is held in Bologna and organised by the University of Bologna. The CLiC-it conference series is an initiative of the Italian Association for Computational Linguistics (AILC) which, after six years of activity, has clearly established itself as the premier national forum for research and development in the fields of Computational Linguistics and Natural Language Processing, where leading researchers and practitioners from academia and industry meet to share their research results, experiences, and challenges.

    Identifying Human Trafficking Networks in Louisiana by Using Authorship Attribution and Network Modeling

    Human trafficking, or modern slavery, is a problem that has plagued every U.S. state, in both urban and rural areas. Over the past decades, online advertisements for sex trafficking have increased rapidly in number. The advancement of the Internet and smartphones has made it easier for sex traffickers to contact and recruit their victims and to advertise and sell them online. It has also made it more difficult for law enforcement to trace the victims and identify the traffickers. Sadly, more than fifty percent of the victims of sex trafficking are children, many of whom are exploited through the Internet. The first step in preventing and fighting human trafficking is to identify the traffickers. The primary goal of this study is to identify potential organized sex trafficking networks in Louisiana by analyzing the ads posted online in Louisiana and its five neighboring states. The secondary goal is to examine the possibility of using authorship attribution techniques (in addition to phone numbers and ad IDs) to group together online advertisements that may have been posted by the same entity. The data used in this study was collected from the website Backpage over a period of ten months. After cleaning the data set, we were left with 123,436 ads from 47 cities in the specified area. Through network analysis, we found many entities that are potentially such networks, all of which posted a large number of ads with many phone numbers in different cities. We also identified the time period during which each phone number was used, and the cities and states each entity posted ads for, which shows how these entities moved between different cities and states. The four supervised machine learning methods we used to classify the collected advertisements are Support Vector Machines (SVMs), the Naïve Bayes classifier, Logistic Regression, and Neural Networks. 
    We calculated 40 accuracy rates, 35 of which were over 90%, for classifying any number of ads per entity, as long as each entity (or author) posted more than 10 ads.
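    One of the four classifiers named above, Naïve Bayes, can be sketched in pure Python as a multinomial model over word counts with add-one smoothing. The entity labels and toy "ad" texts below are invented for illustration; the study's real feature set and data are of course far larger:

    ```python
    import math
    from collections import Counter, defaultdict

    def train_nb(docs, labels):
        """Multinomial Naive Bayes: per-class word counts with a shared vocabulary."""
        vocab = set()
        word_counts = defaultdict(Counter)
        class_counts = Counter(labels)
        for doc, label in zip(docs, labels):
            words = doc.lower().split()
            vocab.update(words)
            word_counts[label].update(words)
        return vocab, word_counts, class_counts

    def predict_nb(model, doc):
        """Pick the class maximizing log prior + smoothed log likelihoods."""
        vocab, word_counts, class_counts = model
        n_docs = sum(class_counts.values())
        best, best_lp = None, -math.inf
        for label in class_counts:
            lp = math.log(class_counts[label] / n_docs)
            total = sum(word_counts[label].values())
            for w in doc.lower().split():
                # Add-one (Laplace) smoothing over the shared vocabulary.
                lp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
            if lp > best_lp:
                best, best_lp = label, lp
        return best
    ```

    Here each "author" is a posting entity, and grouping ads by predicted entity is what allows the network analysis described above.
    
    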

    Computer Science & Technology Series : XVIII Argentine Congress of Computer Science. Selected papers

    CACIC’12 was the eighteenth Congress in the CACIC series. It was organized by the School of Computer Science and Engineering at the Universidad Nacional del Sur. The Congress included 13 Workshops with 178 accepted papers, 5 Conferences, 2 invited tutorials, different meetings related to Computer Science Education (Professors, PhD students, Curricula), and an International School with 5 courses. CACIC 2012 was organized following the traditional Congress format, with 13 Workshops covering a diversity of dimensions of Computer Science Research. Each topic was supervised by a committee of 3-5 chairs from different Universities. The call for papers attracted a total of 302 submissions. An average of 2.5 review reports were collected for each paper, for a grand total of 752 review reports that involved about 410 different reviewers. A total of 178 full papers, involving 496 authors and 83 Universities, were accepted, and 27 of them were selected for this book.
    Red de Universidades con Carreras en Informática (RedUNCI)

    AVATAR - Machine Learning Pipeline Evaluation Using Surrogate Model

    © 2020, The Author(s). The evaluation of machine learning (ML) pipelines is essential during automatic ML pipeline composition and optimisation. Previous methods, such as the Bayesian-based and genetic-based optimisation implemented in Auto-Weka, Auto-sklearn, and TPOT, evaluate pipelines by executing them. The pipeline composition and optimisation in these methods therefore requires a tremendous amount of time, which prevents them from exploring complex pipelines to find better predictive models. To further explore this research challenge, we have conducted experiments showing that many of the generated pipelines are invalid, and that it is unnecessary to execute them to find out whether they are good pipelines. To address this issue, we propose a novel method to evaluate the validity of ML pipelines using a surrogate model (AVATAR). AVATAR accelerates automatic ML pipeline composition and optimisation by quickly discarding invalid pipelines. Our experiments show that AVATAR is more efficient at evaluating complex pipelines than traditional evaluation approaches that require executing them.
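    The core idea of checking pipeline validity without execution can be caricatured as following declared input/output capabilities through the pipeline and rejecting it at the first mismatch. This toy sketch only mirrors that idea; the step tuples and type names are invented, and the actual AVATAR surrogate model is considerably richer:

    ```python
    def pipeline_is_valid(steps, input_type):
        """Surrogate-style validity check: propagate the data's declared type
        through each component instead of executing the pipeline."""
        current = input_type
        for name, accepts, produces in steps:
            if current not in accepts:
                # Mismatch found: the pipeline is invalid, no execution needed.
                return False
            current = produces
        return True
    ```

    For example, a pipeline whose scaler only accepts complete numeric data is rejected immediately when the incoming data still contains missing values, saving the full training run that executing it would have cost.
    
    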

    Natural Language Processing: Emerging Neural Approaches and Applications

    This Special Issue highlights the most recent research being carried out in the NLP field and discusses related open issues, with a particular focus on emerging approaches for language learning, understanding, production, and grounding, acquired interactively or autonomously from data in cognitive and neural systems, as well as on their potential or real applications in different domains.

    Proceedings of the Fifth Italian Conference on Computational Linguistics CLiC-it 2018 : 10-12 December 2018, Torino

    On behalf of the Program Committee, a very warm welcome to the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018). This edition of the conference is held in Torino. The conference is locally organised by the University of Torino and hosted in its prestigious main lecture hall, “Cavallerizza Reale”. The CLiC-it conference series is an initiative of the Italian Association for Computational Linguistics (AILC) which, after five years of activity, has clearly established itself as the premier national forum for research and development in the fields of Computational Linguistics and Natural Language Processing, where leading researchers and practitioners from academia and industry meet to share their research results, experiences, and challenges.