
    In no uncertain terms : a dataset for monolingual and multilingual automatic term extraction from comparable corpora

    Automatic term extraction is a productive field of research within natural language processing, but it still faces significant obstacles regarding datasets and evaluation, which require manual term annotation. This is an arduous task, made even more difficult by the lack of a clear distinction between terms and general language, which results in low inter-annotator agreement. There is a large need for well-documented, manually validated datasets, especially in the rising field of multilingual term extraction from comparable corpora, which presents a unique new set of challenges. In this paper, a new approach is presented for both monolingual and multilingual term annotation in comparable corpora. The detailed guidelines with different term labels, the domain- and language-independent methodology and the large volumes annotated in three different languages and four different domains make this a rich resource. The resulting datasets are not just suited for evaluation purposes but can also serve as a general source of information about terms and even as training data for supervised methods. Moreover, the gold standard for multilingual term extraction from comparable corpora contains information about term variants and translation equivalents, which allows an in-depth, nuanced evaluation

    TermEval 2020 : shared task on automatic term extraction using the Annotated Corpora for term Extraction Research (ACTER) dataset

    The TermEval 2020 shared task provided a platform for researchers to work on automatic term extraction (ATE) with the same dataset: the Annotated Corpora for Term Extraction Research (ACTER). The dataset covers three languages (English, French, and Dutch) and four domains, of which the domain of heart failure was kept as a held-out test set on which final F1-scores were calculated. The aim was to provide a large, transparent, qualitatively annotated, and diverse dataset to the ATE research community, with the goal of promoting comparative research and thus identifying strengths and weaknesses of various state-of-the-art methodologies. The results show a lot of variation between different systems and illustrate how some methodologies reach higher precision or recall, how different systems extract different types of terms, how some are exceptionally good at finding rare terms, or are less impacted by term length. The current contribution offers an overview of the shared task with a comparative evaluation, which complements the individual papers by all participants

    D-TERMINE : data-driven term extraction methodologies investigated

    Automatic term extraction is a task in the field of natural language processing that aims to automatically identify terminology in collections of specialised, domain-specific texts. Terminology is defined as domain-specific vocabulary and consists of both single-word terms (e.g., corpus in the field of linguistics, referring to a large collection of texts) and multi-word terms (e.g., automatic term extraction). Terminology is a crucial part of specialised communication since terms can concisely express very specific and essential information. Therefore, quickly and automatically identifying terms is useful in a wide range of contexts. Automatic term extraction can be used by language professionals to find which terms are used in a domain and how, based on a relevant corpus. It is also useful for other tasks in natural language processing, including machine translation. One of the main difficulties with term extraction, both manual and automatic, is the vague boundary between general language and terminology. When different people identify terms in the same text, they will invariably produce different results. Consequently, creating manually annotated datasets for term extraction is a costly, time- and effort-consuming task. This can hinder research on automatic term extraction, which requires gold standard data for evaluation, preferably even in multiple languages and domains, since terms are language- and domain-dependent. Moreover, supervised machine learning methodologies rely on annotated training data to automatically deduce the characteristics of terms, so this knowledge can be used to detect terms in other corpora as well. Consequently, the first part of this PhD project was dedicated to the construction and validation of a new dataset for automatic term extraction, called ACTER – Annotated Corpora for Term Extraction Research. Terms and Named Entities were manually identified with four different labels in twelve specialised corpora.
The dataset contains corpora in three languages and four domains, leading to a total of more than 100k annotations, made over almost 600k tokens. It was made publicly available during a shared task we organised, in which five international teams competed to automatically extract terms from the same test data. This illustrated how ACTER can contribute towards advancing the state of the art. It also revealed that there is still a lot of room for improvement, with moderate scores even for the best teams. Therefore, the second part of this dissertation was devoted to researching how supervised machine learning techniques might contribute. The traditional, hybrid approach to automatic term extraction relies on a combination of linguistic and statistical clues to detect terms. An initial list of unique candidate terms is extracted based on linguistic information (e.g., part-of-speech patterns) and this list is filtered based on statistical metrics that use frequencies to measure whether a candidate term might be relevant. The result is a ranked list of candidate terms. HAMLET – Hybrid, Adaptable Machine Learning Approach to Extract Terminology – was developed based on this traditional approach and applies machine learning to efficiently combine more information than could be used with a rule-based approach. This makes HAMLET less susceptible to typical issues like low recall on rare terms. While domain and language have a large impact on results, robust performance was reached even without domain-specific training data, and HAMLET compared favourably to a state-of-the-art rule-based system. Building on these findings, the third and final part of the project was dedicated to investigating methodologies that are even further removed from the traditional approach. Instead of starting from an initial list of unique candidate terms, potential terms were labelled immediately in the running text, in their original context.
Two sequential labelling approaches were developed, evaluated and compared: a feature-based conditional random fields classifier, and a recurrent neural network with word embeddings. The latter outperformed the feature-based approach and was compared to HAMLET as well, obtaining comparable and even better results. In conclusion, this research resulted in an extensive, reusable dataset and three distinct new methodologies for automatic term extraction. The elaborate evaluations went beyond reporting scores and revealed the strengths and weaknesses of the different approaches. This identified challenges for future research, since some terms, especially ambiguous ones, remain problematic for all systems. However, overall, results were promising and the approaches were complementary, revealing great potential for new methodologies that combine multiple strategies

    Validating multilingual hybrid automatic term extraction for search engine optimisation : the use case of EBM-GUIDELINES

    Tools that automatically extract terms and their equivalents in other languages from parallel corpora can contribute to multilingual professional communication in more than one way. By means of a use case with data from a medical web site with point of care evidence summaries (Ebpracticenet), we illustrate how hybrid multilingual automatic term extraction from parallel corpora works and how it can be used in a practical application such as search engine optimisation. The original aim was to use the result of the extraction to improve the recall of a search engine by allowing automated multilingual searches. Two additional possible applications were found while considering the data: searching via related forms and searching via strongly semantically related words. The second stage of this research was to find the most suitable format for the required manual validation of the raw extraction results and to compare the validation process when performed by a domain expert versus a terminologist

    many faces, many places (Term21)

    UIDB/03213/2020 UIDP/03213/2020. Proceedings of the LREC 2022 Workshop Language Resources and Evaluation Conference

    The Measure of a Man: A Critical Methodology for Investigating Essentialist Beliefs about Sexual Orientation Categories in Japan and the United States

    Methods for studying laypeople’s beliefs about sexual orientation categories have evolved in step with larger theoretical and epistemological shifts in the interdisciplinary study of sexuality. The dominant approach to measuring laypeople’s sexual orientation beliefs over the past decade was made possible through an epistemological shift from a nature vs. nurture paradigm to a social constructionist theoretical model of psychological essentialism (Medin, 1989; Medin & Ortony, 1989; Rothbart & Taylor, 1992). Despite this shift, I argue that the forced-response scale-based survey methodologies typically used to operationally define essentialist beliefs about sexual orientation at best only partially realize the social constructionist potential of this underlying theory. By critically reconstructing this theory of psychological essentialism from an epistemological stance rooted in discourse, I developed a methodology reliant not on investigators’ but rather laypeople’s own mobilization of culturally shared discourses of sexuality. In testing this methodology, I focus on one theoretical dimension of psychological essentialism—inductive potential, or the extent to which shared knowledge about category membership allows for inference of a wealth of associated information about specific category members. I explored this critical methodology through a mixed-method empirical investigation of laypeople’s beliefs in the inductive potential of sexual orientation categories in relation to two components of sexuality: sexual desire and romantic love. I sought to answer two research questions: To what extent, and in what ways, do laypeople discursively mobilize inductive potential beliefs about homosexual or heterosexual men’s sexual desire and romantic love? To what extent, and in what ways, is laypeople’s discursive mobilization of those inductive potential beliefs explained by their gendered and/or cultural contexts? 
In Study 1, I primed cultural discourses of sexual orientation categories prior to an impression formation task. Students from four-year public universities in the Tokyo (N = 197; ages 18-23) and New York City (N = 208; ages 18-25) metropolitan areas read a series of fictional diary entries featuring a male college student (the target) describing his attraction to either a female or male classmate. Each participant then manually drew a Euler diagram comprised of circles representing their impressions of the relative importance (circle size) and interrelationships between (circle overlap) six identities associated with the target. To the extent participants engaged in inductive potential beliefs, I predicted that: (H1) participants would perceive sexual desire as more centrally defining of a same-sex attracted male target relative to an other-sex attracted male target; and (H2) participants would perceive romantic love as less centrally defining of a same-sex attracted male target relative to an other-sex attracted male target. Fitting multiple circle size and overlap outcomes to separate generalized linear models, I found a consistent pattern of support for both predictions. Cultural and gendered differences added additional nuance to these experimental patterns: Japanese participants associated men with greater sexual desire and less romantic love relative to their US peers, regardless of perceived sexual orientation. Additionally, US and Japanese men, compared to women, appeared to associate these two components of sexuality more frequently with men’s social roles. As such, while these results strongly suggested the presence of participants’ inductive potential beliefs about sexual orientation categories, they also pointed to important variation across culture and gender. 
In an effort to discursively unpack the inductively rich meanings associated with these additional gendered and cultural patterns, as well as establish the cultural credibility of my interpretations of the results of this experimental manipulation, in a second study I engaged separate peer focus groups in New York City (N = 20; ages 19-25) and Tokyo (N = 21; ages 20-24) in discursively interpreting the Euler diagrams produced in Study 1. Using thematic analysis, I identified three themes concerning the ways several distinct sexual orientation discourses were culturally understood in the US and Japan; the ways those discourses were imbricated with other distinct discourses of cultural identity; and the ways laypeople voiced resistance to these sexual orientation discourses. I concluded that the experimental pattern from Study 1 could be explained in part through US participants’ rejection of an essentialist discourse of binary sexual orientation in favor of a focus on sexual practices; Japanese participants’ responses marked instead a troubling of essentialist discourses of binary gender. Taken together, these findings from Studies 1 and 2 implicate sexual orientation as an inductively potent discourse in laypeople’s construction of beliefs about male sexuality across cultural contexts and genders, albeit in culturally distinct ways. These results thus add to past research on essentialist beliefs while also highlighting a need for critical methodologies sensitive to the ways culturally embedded and multiply imbricated transnational discourses of sexuality inform beliefs about men

    Knowledge in the dark: scientific challenges and ways forward

    A key dimension of our current era is Big Data, the rapid rise in produced data and information; a key frustration is that we are nonetheless living in an age of ignorance, as the real knowledge and understanding of people does not seem to be substantially increasing. This development has critical consequences; for example, it limits the ability to find and apply effective solutions to pressing environmental and socioeconomic challenges. Here, we propose the concept of “knowledge in the dark”—or, in short, dark knowledge—and outline how it can help clarify key reasons for this development: (i) production of biased, erroneous, or fabricated data and information; (ii) inaccessibility and (iii) incomprehensibility of data and information; and (iv) loss of previous knowledge. Even in the academic realm, where financial interests are less pronounced than in the private sector, several factors lead to dark knowledge, that is, they inhibit a more substantial increase in knowledge and understanding. We highlight four of these factors—loss of academic freedom, research biases, lack of reproducibility, and the Scientific tower of Babel—and offer ways to tackle them, for example, establishing an international court of arbitration for research and developing advanced tools for research synthesis

    Individual behavioral phenotypes: An integrative meta-theoretical framework. Why “behavioral syndromes” are not analogs of “personality”

    Animal researchers are increasingly interested in individual differences in behavior. Their interpretation as meaningful differences in behavioral strategies stable over time and across contexts, adaptive, heritable, and acted upon by natural selection has triggered new theoretical developments. However, the analytical approaches used to explore behavioral data still address population-level phenomena, and statistical methods suitable to analyze individual behavior are rarely applied. I discuss fundamental investigative principles and analytical approaches to explore whether, in what ways, and under which conditions individual behavioral differences are actually meaningful. I elaborate the meta-theoretical ideas underlying common theoretical concepts and integrate them into an overarching meta-theoretical and methodological framework. This unravels commonalities and differences, and shows that assumptions of analogy to concepts of human personality are not always warranted and that some theoretical developments may be based on methodological artifacts. Yet, my results also highlight possible directions for new theoretical developments in animal behavior research