521 research outputs found
Towards Harmful Erotic Content Detection through Coreference-Driven Contextual Analysis
Adult content detection still poses a great challenge for automation.
Existing classifiers primarily focus on distinguishing between erotic and
non-erotic texts. However, they often need more nuance in assessing the
potential harm. Unfortunately, the content of this nature falls beyond the
reach of generative models due to its potentially harmful nature. Ethical
restrictions prohibit large language models (LLMs) from analyzing and
classifying harmful erotics, let alone generating them to create synthetic
datasets for other neural models. In such instances where data is scarce and
challenging, a thorough analysis of the structure of such texts rather than a
large model may offer a viable solution. Especially given that harmful erotic
narratives, despite appearing similar to harmless ones, usually reveal their
harmful nature first through contextual information hidden in the non-sexual
parts of the narrative.
This paper introduces a hybrid neural and rule-based context-aware system
that leverages coreference resolution to identify harmful contextual cues in
erotic content. Collaborating with professional moderators, we compiled a
dataset and developed a classifier capable of distinguishing harmful from
non-harmful erotic content. Our hybrid model, tested on Polish text,
demonstrates a promising accuracy of 84% and a recall of 80%. Models based on
RoBERTa and Longformer without explicit usage of coreference chains achieved
significantly weaker results, underscoring the importance of coreference
resolution in detecting such nuanced content as harmful erotics. This approach
also offers the potential for enhanced visual explainability, supporting
moderators in evaluating predictions and taking necessary actions to address
harmful content.Comment: Accepted for 6th Workshop on Computational Models of Reference,
Anaphora and Coreference at EMNLP 2023 Conferenc
Reflexive constructions in the world's languages
Synopsis:
This landmark publication brings together 28 papers on reflexive constructions in languages from all continents, representing very diverse language types. While reflexive constructions have been discussed in the past from a variety of angles, this is the first edited volume of its kind. All the chapters are based on original data, and they are broadly comparable through a common terminological framework. The volume opens with two introductory chapters by the editors that set the stage and lay out the main comparative concepts, and it concludes with a chapter presenting generalizations on the basis of the studies of individual languages
Neural Coreference Resolution for Turkish
Coreference resolution deals with resolving mentions of the same underlying entity in a given text. This challenging task is an indispensable aspect of text understanding and has important applications in various language processing systems such as question answering and machine translation. Although a significant amount of studies is devoted to coreference resolution, the research on Turkish is scarce and mostly limited to pronoun resolution. To our best knowledge, this article presents the first neural Turkish coreference resolution study where two learning-based models are explored. Both models follow the mention-ranking approach while forming clusters of mentions. The first model uses a set of hand-crafted features whereas the second coreference model relies on embeddings learned from large-scale pre-trained language models for capturing similarities between a mention and its candidate antecedents. Several language models trained specifically for Turkish are used to obtain mention representations and their effectiveness is compared in conducted experiments using automatic metrics. We argue that the results of this study shed light on the possible contributions of neural architectures to Turkish coreference resolution.119683
All Purpose Textual Data Information Extraction, Visualization and Querying
abstract: Since the advent of the internet and even more after social media platforms, the explosive growth of textual data and its availability has made analysis a tedious task. Information extraction systems are available but are generally too specific and often only extract certain kinds of information they deem necessary and extraction worthy. Using data visualization theory and fast, interactive querying methods, leaving out information might not really be necessary. This thesis explores textual data visualization techniques, intuitive querying, and a novel approach to all-purpose textual information extraction to encode large text corpus to improve human understanding of the information present in textual data.
This thesis presents a modified traversal algorithm on dependency parse output of text to extract all subject predicate object pairs from text while ensuring that no information is missed out. To support full scale, all-purpose information extraction from large text corpuses, a data preprocessing pipeline is recommended to be used before the extraction is run. The output format is designed specifically to fit on a node-edge-node model and form the building blocks of a network which makes understanding of the text and querying of information from corpus quick and intuitive. It attempts to reduce reading time and enhancing understanding of the text using interactive graph and timeline.Dissertation/ThesisMasters Thesis Software Engineering 201
Cross-lingual Coreference Resolution of Pronouns
This work is, to our knowledge, a first attempt at a machine learning approach to cross-lingual
coreference resolution, i.e. coreference resolution (CR) performed on a bitext. Focusing on CR of English pronouns, we leverage language differences and enrich the feature set of a standard monolingual CR system for English with features extracted from the Czech side of the bitext. Our work also includes a supervised pronoun aligner that outperforms a GIZA++ baseline in terms of both intrinsic evaluation and evaluation on CR. The final cross-lingual CR system has successfully outperformed both a monolingual CR and a cross-lingual projection system
A preliminary study in zero anaphora coreference resolution for Polish
A preliminary study in zero anaphora coreference resolution for PolishZero anaphora is an element of the coreference resolution task that has not yet been directly addressed in Polish and, in most studies, it has been left as the most challenging aspect for further investigation. This article presents an initial study of this problem. The preparation of a machine learning approach, alongside engineering features based on linguistic study of the KPWr corpus, is discussed. This study utilizes existing tools for Polish coreference resolution as sources of partial coreferential clusters containing pronoun, noun and named entity mentions. They are also used as baseline zero coreference resolution systems for comparison with our system. The evaluation process is focused not only on clustering correctness, without taking into account types of mentions, using standard CoNLL-2012 measures, but also on the informativeness of the resulting relations. According to the annotation approach used for coreference to the KPWr corpus, only named entities are treated as mentions that are informative enough to constitute a link to real world objects. Consequently, we provide an evaluation of informativeness based on found links between zero anaphoras and named entities. For the same reason, we restrict coreference resolution in this study to mention clusters built around named entities. Wstępne studium rozwiązywania problemu koreferencji anafory zerowej w języku polskimKoreferencja zerowa, w języku polskim, jest jednym z zagadnień rozpoznawania koreferencji. Dotychczas nie była ona bezpośrednim przedmiotem badań, gdyż ze względu na jej złożoność była pomijana i odsuwana na dalsze etapy badań. Artykuł prezentuje wstępne studium problemu, jakim jest rozpoznawanie koreferencji zerowej. Przedstawiamy podejście wykorzystujące techniki uczenia maszynowego oraz proces tworzenia cech w oparciu o analizę lingwistyczną korpusu KPWr. W przedstawionej pracy wykorzystujemy istniejące narzędzia do rozpoznawania koreferencji dla pozostałych rodzajów wzmianek (tj. nazwy własne, frazy rzeczownikowe oraz zaimki) jako źródło częściowych zbiorów wzmianek odnoszących się do tego samego obiektu, a także jako punkt odniesienia dla uzyskanych przez nas wyników. Ocena skupia się nie tylko na poprawności uzyskanych zbiorów wzmianek, bez względu na ich typ, co odzwierciedlają wyniki podane dla standardowych metryk CoNLL-2012, ale także na wartości informacji, która zostaje uzyskana w wyniku rozpoznania koreferencji. W nawiązaniu do założeń anotacji korpusu KPWr, jedynie nazwy własne traktowane są jako wzmianki, które zawierają w sobie wystarczająco szczegółową informację, aby można było powiązać je z obiektami rzeczywistymi. W konsekwencji dostarczamy także ocenę opartą na wartości informacji dla podmiotów domyślnych połączonych relacją koreferencji z nazwami własnymi. Z tą samą motywacją rozpatrujemy jedynie zbiory wzmianek koreferencyjnych zbudowane wokół nazw własnych
Inter-Annotator Agreement in Coreference Annotation of Polish
Abstract. This paper discusses different methods of estimating the inter-annotator agreement in manual annotation of Polish coreference and proposes a new BLANC-based annotation agreement metric. The commonly used agreement indicators are calculated for mention detection, semantic head annotation, near-identity markup and coreference resolution
Adopting ISO 24617-8 for Discourse Relations Annotation in Polish: Challenges and Future Directions
This paper explores a discourse relations annotation project carried out under the CLARIN-PL initiative, leveraging the ISO 24617-8 standard. The goal is to boost research interoperability and foster multilingual research. Our team of three linguist-annotators tackled the annotation of a corpus spanning several genres, including e.g., literature and press articles in the Polish language. This effort was guided by a project expert and external linguists from the CLARIN-PL language technology research infrastructure. Several significant challenges emerged during the process. Ambiguities within the ISO standard’s relation categories, poorly-defined definitions for certain relation categories, and the difficulty of identifying and annotating implicit discourse relations, which lack explicit discourse connectives or signaling devices, were among the key issues. To overcome these problems, we implemented strategies such as regular team meetings, collaborative annotation forms, and preliminary revisions to the annotation scheme. This paper presents the project, the annotation process, and offers initial annotation data on the discourse relations and connectives identified within the corpus. Looking forward, we discuss potential enhancements to the process, including additional revisions to the guidelines and conclude with an overview of the project’s contributions and a discussion of our future development plans
- …