2,252 research outputs found

    Can Authorship Representation Learning Capture Stylistic Features?

    Automatically disentangling an author's style from the content of their writing is a longstanding and possibly insurmountable problem in computational linguistics. At the same time, the availability of large text corpora furnished with author labels has recently enabled learning authorship representations in a purely data-driven manner for authorship attribution, a task that ostensibly depends to a greater extent on encoding writing style than encoding content. However, success on this surrogate task does not ensure that such representations capture writing style since authorship could also be correlated with other latent variables, such as topic. In an effort to better understand the nature of the information these representations convey, and specifically to validate the hypothesis that they chiefly encode writing style, we systematically probe these representations through a series of targeted experiments. The results of these experiments suggest that representations learned for the surrogate authorship prediction task are indeed sensitive to writing style. As a consequence, authorship representations may be expected to be robust to certain kinds of data shift, such as topic drift over time. Additionally, our findings may open the door to downstream applications that require stylistic representations, such as style transfer. Comment: appearing at TACL 202

    IDTraffickers: An Authorship Attribution Dataset to link and connect Potential Human-Trafficking Operations on Text Escort Advertisements

    Human trafficking (HT) is a pervasive global issue affecting vulnerable individuals, violating their fundamental human rights. Investigations reveal that a significant number of HT cases are associated with online advertisements (ads), particularly in escort markets. Consequently, identifying and connecting HT vendors has become increasingly challenging for Law Enforcement Agencies (LEAs). To address this issue, we introduce IDTraffickers, an extensive dataset consisting of 87,595 text ads and 5,244 vendor labels to enable the verification and identification of potential HT vendors on online escort markets. To establish a benchmark for authorship identification, we train a DeCLUTR-small model, achieving a macro-F1 score of 0.8656 in a closed-set classification environment. Next, we leverage the style representations extracted from the trained classifier to conduct authorship verification, resulting in a mean r-precision score of 0.8852 in an open-set ranking environment. Finally, to encourage further research and ensure responsible data sharing, we plan to release IDTraffickers for the authorship attribution task to researchers under specific conditions, considering the sensitive nature of the data. We believe that the availability of our dataset and benchmarks will empower future researchers to utilize our findings, thereby facilitating the effective linkage of escort ads and the development of more robust approaches for identifying HT indicators.

    Rethinking the Authorship Verification Experimental Setups

    One of the main drivers of the recent advances in authorship verification is the PAN large-scale authorship dataset. Despite generating significant progress in the field, inconsistent performance differences between the closed and open test sets have been reported. To this end, we improve the experimental setup by proposing five new public splits over the PAN dataset, specifically designed to isolate and identify biases related to the text topic and to the author's writing style. We evaluate several BERT-like baselines on these splits, showing that such models are competitive with authorship verification state-of-the-art methods. Furthermore, using explainable AI, we find that these baselines are biased towards named entities. We show that models trained without the named entities obtain better results and generalize better when tested on DarkReddit, our new dataset for authorship verification. Comment: Accepted as a short paper at the EMNLP 2022 conference. 10 pages, 5 figures, 9 tables

    Can Authorship Attribution Models Distinguish Speakers in Speech Transcripts?

    Authorship verification is the problem of determining if two distinct writing samples share the same author and is typically concerned with the attribution of written text. In this paper, we explore the attribution of transcribed speech, which poses novel challenges. The main challenge is that many stylistic features, such as punctuation and capitalization, are not available or reliable. Therefore, we expect a priori that transcribed speech is a more challenging domain for attribution. On the other hand, other stylistic features, such as speech disfluencies, may enable more successful attribution but, being specific to speech, require special-purpose models. To better understand the challenges of this setting, we contribute the first systematic study of speaker attribution based solely on transcribed speech. Specifically, we propose a new benchmark for speaker attribution focused on conversational speech transcripts. To control for spurious associations of speakers with topic, we employ both conversation prompts and speakers participating in the same conversation to construct challenging verification trials of varying difficulties. We establish the state of the art on this new benchmark by comparing a suite of neural and non-neural baselines, finding that although written text attribution models achieve surprisingly good performance in certain settings, they struggle in the hardest settings we consider.

    The “Hypertension Approaches in the Elderly: a Lifestyle study” multicenter, randomized trial (HAEL Study): rationale and methodological protocol

    Background: Hypertension is highly prevalent in the elderly and carries a high risk of cardiovascular disease and loss of quality of life. Current guidelines emphasize the importance of nonpharmacological strategies as a first-line approach to lower blood pressure. Exercise is an efficient lifestyle tool that can benefit a myriad of health-related outcomes, including blood pressure control, in older adults. We herein report the protocol of the HAEL Study, which aims to evaluate the efficacy of a pragmatic combined exercise training program compared with a health education program on ambulatory blood pressure and other health-related outcomes in older individuals. Methods: Randomized, single-blinded, multicenter, two-arm, parallel, superiority trial. A total of 184 subjects (92/center), ≥60 years of age, with no recent history of cardiovascular events, will be randomized on a 1:1 ratio to 12-week interventions consisting either of combined exercise (aerobic and strength) training, three times per week, or an active-control group receiving a health education intervention, once a week. Ambulatory (primary outcome) and office blood pressures, cardiorespiratory fitness and endothelial function, together with quality of life, functional fitness and autonomic control, will be measured before and after the intervention. Discussion: Our conceptual hypothesis is that the combined training intervention will reduce ambulatory blood pressure in comparison with the health education group. Using a superiority framework, the analysis plan prespecifies an intention-to-treat approach, per-protocol criteria, subgroup analyses, and handling of missing data. The trial has been recruiting since September 2017. Finally, this study was designed to adhere to data sharing practices. Trial registration: NCT03264443. Registered on 29 August 2017.

    Detecting Deception, Partisan, and Social Biases

    Thesis by compendium.
    [ES] Today, the political world has as much impact on society as society has on the political world, or more. Leaders and representatives of political parties use their power in the media to shift ideological positions and reach the public with the aim of gaining popularity in government elections. Through deceptive language, political texts can contain partisan and social biases that undermine the perception of reality. As a result, the followers of an ideology, or the members of a social category, feel threatened by other social or ideological groups, or perceive them as competition, which leads to political polarization with physical and verbal aggression. The Natural Language Processing (NLP) research community contributes every day to detecting hate speech, insults, offensive messages, and false information, among other computational tasks that border on the social sciences. However, tackling such tasks means facing several problems, among them the difficulty of obtaining labeled texts, the limitations of not working with an interdisciplinary team, and the challenges posed by the need for human-interpretable solutions. This thesis focuses on the detection of partisan and social biases, taking hyperpartisanship and stereotypes about immigrants as case studies. To that end, it proposes a model based on a text-masking technique that can detect deceptive language even in controversial topics and can capture patterns of both content and writing style. In addition, we address the problem using BERT-based models, known for their effectiveness at capturing syntactic and semantic patterns over the same text representations. Both approaches, the masking technique and the BERT-based models, are compared in terms of performance and explainability in the detection of hyperpartisanship in political news and of stereotypes about immigrants. To identify the latter, a new taxonomy grounded in social psychology theory is proposed and used to label texts extracted from partisan interventions in the Spanish Parliament. The results show that the proposed approaches contribute to the study of hyperpartisanship and to identifying when citizens and politicians frame immigrants as victims, economic resources, or threats. Finally, this interdisciplinary research demonstrates that stereotypes about immigrants are used as a rhetorical strategy in political contexts.
    [CA] Today, the political world has as much impact on society as society has on the political world, or more. Political leaders, or representatives of political parties, use their power in the media to modify ideological positions and reach the people in order to gain popularity in government elections. Through deceptive language, political texts can contain partisan and social biases that undermine the perception of reality. As a result, harmful political polarization increases because the followers of an ideology, or the members of a social category, see other groups as a threat or as competition, which ends in verbal and physical aggression with unfortunate results. The Natural Language Processing (NLP) community sees new contributions every day, with approaches that help detect hate speech, insults, offensive messages, and false information, among other computational tasks related to the social sciences. Nevertheless, many obstacles prevent eradicating these problems, such as the difficulty of obtaining annotated texts, the limitations of non-interdisciplinary approaches, and the added challenge of needing interpretable solutions. This thesis focuses on the detection of partisan and social biases, taking hyperpartisanship and stereotypes about immigrants as case studies. We propose a model based on a masking technique that can detect deceptive language in controversial and non-controversial topics, capturing patterns related to style and content. In addition, we address the problem by evaluating BERT-based models, known to be effective at capturing semantic and syntactic patterns in the same representation. We compare these two approaches (the masking technique and the BERT-based models) in terms of performance and the explainability of their solutions in the detection of hyperpartisanship in political news and of immigrant stereotypes. To identify immigrant stereotypes, we propose a new taxonomy supported by social psychology theory and annotate a dataset of partisan interventions in the Spanish Parliament. The results show that our models can help study hyperpartisanship and identify different frames in which citizens and politicians perceive immigrants as victims, economic resources, or threats. Finally, this interdisciplinary research demonstrates that immigrant stereotypes are used as a rhetorical strategy in political contexts.
    [EN] Today, the political world has as much impact on society as society has on the political world, or more. Political leaders, or representatives of political parties, use their power in the media to modify ideological positions and reach the people in order to gain popularity in government elections. Through deceptive language, political texts may contain partisan and social biases that undermine the perception of reality. As a result, harmful political polarization increases because the followers of an ideology, or members of a social category, see other groups as a threat or competition, ending in verbal and physical aggression with unfortunate outcomes. The Natural Language Processing (NLP) community sees new contributions every day, with approaches that help detect hate speech, insults, offensive messages, and false information, among other computational tasks related to the social sciences. However, many obstacles prevent eradicating these problems, such as the difficulty of obtaining annotated texts, the limitations of non-interdisciplinary approaches, and the added challenge of needing interpretable solutions. This thesis focuses on the detection of partisan and social biases, taking hyperpartisanship and stereotypes about immigrants as case studies. We propose a model based on a masking technique that can detect deceptive language in controversial and non-controversial topics, capturing patterns related to style and content. Moreover, we address the problem by evaluating BERT-based models, known to be effective at capturing semantic and syntactic patterns in the same representation. We compare these two approaches (the masking technique and the BERT-based models) in terms of their performance and the explainability of their decisions in the detection of hyperpartisanship in political news and immigrant stereotypes. In order to identify immigrant stereotypes, we propose a new taxonomy supported by social psychology theory and annotate a dataset from partisan interventions in the Spanish parliament. Results show that our models can help study hyperpartisanship and identify different frames in which citizens and politicians perceive immigrants as victims, economic resources, or threats. Finally, this interdisciplinary research proves that immigrant stereotypes are used as a rhetorical strategy in political contexts.
    This PhD thesis was funded by the MISMIS-FAKEnHATE research project (PGC2018-096212-B-C31) of the Spanish Ministry of Science and Innovation. Sánchez Junquera, JJ. (2022). Detecting Deception, Partisan, and Social Biases [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/185784

    Drawing Elena Ferrante's Profile. Workshop Proceedings, Padova, 7 September 2017

    Elena Ferrante is an internationally acclaimed Italian novelist whose real identity has been kept secret by E/O publishing house for more than 25 years. Owing to her popularity, major Italian and foreign newspapers have long tried to discover her real identity. However, only a few attempts have been made to foster a scientific debate on her work. In 2016, Arjuna Tuzzi and Michele Cortelazzo led an Italian research team that conducted a preliminary study and collected a well-founded, large corpus of Italian novels comprising 150 works published in the last 30 years by 40 different authors. Moreover, they shared their data with a select group of international experts on authorship attribution, profiling, and analysis of textual data: Maciej Eder and Jan Rybicki (Poland), Patrick Juola (United States), Vittorio Loreto and his research team, Margherita Lalli and Francesca Tria (Italy), George Mikros (Greece), Pierre Ratinaud (France), and Jacques Savoy (Switzerland). The chapters of this volume report the results of this endeavour that were first presented during the international workshop Drawing Elena Ferrante's Profile in Padua on 7 September 2017 as part of the 3rd IQLA-GIAT Summer School in Quantitative Analysis of Textual Data. The fascinating research findings suggest that Elena Ferrante’s work definitely deserves “many hands” as well as an extensive effort to understand her distinct writing style and the reasons for her worldwide success.