
    Prosodic description: An introduction for fieldworkers

    This article provides an introductory tutorial on prosodic features such as tone and accent for researchers working on little-known languages. It specifically addresses the needs of non-specialists and thus does not presuppose knowledge of the phonetics and phonology of prosodic features. Instead, it intends to introduce the uninitiated reader to a field often shied away from because of its (in part real, but in part also just imagined) complexities. It consists of a concise overview of the basic phonetic phenomena (section 2) and the major categories and problems of their functional and phonological analysis (sections 3 and 4). Section 5 gives practical advice for documenting and analyzing prosodic features in the field. (National Foreign Language Resource Center)

    Unifying Amplitude and Phase Analysis: A Compositional Data Approach to Functional Multivariate Mixed-Effects Modeling of Mandarin Chinese

    Mandarin Chinese is characterized by being a tonal language; the pitch (or $F_0$) of its utterances carries considerable linguistic information. However, speech samples from different individuals are subject to changes in amplitude and phase which must be accounted for in any analysis which attempts to provide a linguistically meaningful description of the language. A joint model for amplitude, phase and duration is presented which combines elements from Functional Data Analysis, Compositional Data Analysis and Linear Mixed Effects Models. By decomposing functions via a functional principal component analysis, and connecting registration functions to compositional data analysis, a joint multivariate mixed effects model can be formulated which gives insights into the relationship between the different modes of variation as well as their dependence on linguistic and non-linguistic covariates. The model is applied to the COSPRO-1 data set, a comprehensive database of spoken Taiwanese Mandarin containing approximately 50,000 phonetically diverse sample $F_0$ contours (syllables), and reveals that phonetic information is jointly carried by both amplitude and phase variation. (Comment: 49 pages, 13 figures, small changes to discussion.)
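
    As a rough illustration of the decomposition step only (not the paper's actual pipeline, which also handles phase registration and duration), here is a minimal functional-PCA sketch; the synthetic contours, component count, and function names are assumptions made up for the example.

```python
import numpy as np

def functional_pca(curves, n_components=3):
    """Decompose sampled F0 contours (rows of `curves`) into a mean curve,
    principal component functions, and per-curve scores."""
    mean_curve = curves.mean(axis=0)
    centered = curves - mean_curve
    # SVD of the centered data matrix yields the (sampled) FPCA basis
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt[:n_components]      # eigenfunctions, sampled on the grid
    scores = centered @ components.T    # per-curve FPCA scores
    return mean_curve, components, scores

# Example: 50 synthetic contours sampled at 100 time points
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 100)
curves = 200 + 20 * np.sin(2 * np.pi * t) + rng.normal(0, 5, (50, 100))
mean_curve, components, scores = functional_pca(curves)
# In a mixed-effects analysis, the scores would then become responses in a
# linear mixed model (e.g. statsmodels' MixedLM) with speaker random effects.
```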

    Argumentation Mining in User-Generated Web Discourse

    The goal of argumentation mining, an evolving research field in computational linguistics, is to design methods capable of analyzing people's argumentation. In this article, we go beyond the state of the art in several ways. (i) We deal with actual Web data and take up the challenges posed by the variety of registers, multiple domains, and unrestricted noisy user-generated Web discourse. (ii) We bridge the gap between normative argumentation theories and argumentation phenomena encountered in actual data by adapting an argumentation model tested in an extensive annotation study. (iii) We create a new gold standard corpus (90k tokens in 340 documents) and experiment with several machine learning methods to identify argument components. We offer the data, source codes, and annotation guidelines to the community under free licenses. Our findings show that argumentation mining in user-generated Web discourse is a feasible but challenging task. (Comment: Cite as: Habernal, I. & Gurevych, I. (2017). Argumentation Mining in User-Generated Web Discourse. Computational Linguistics 43(1), pp. 125-179.)
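
    To make the component-identification task concrete, here is a minimal, hypothetical BIO sequence-labeling sketch; the paper's actual experiments use different features and models, and the toy sentence and labels below are invented for illustration.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def token_features(tokens, i):
    """Simple lexical/context features for one token (illustrative only)."""
    return {
        "word": tokens[i].lower(),
        "is_title": tokens[i].istitle(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

# Toy training data: tokens tagged with BIO labels for argument components
sents = [(["Smoking", "should", "be", "banned", "because", "it", "harms", "others"],
          ["B-CLAIM", "I-CLAIM", "I-CLAIM", "I-CLAIM", "O",
           "B-PREMISE", "I-PREMISE", "I-PREMISE"])]

X = [token_features(toks, i) for toks, tags in sents for i in range(len(toks))]
y = [tag for _, tags in sents for tag in tags]

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)
print(clf.predict(vec.transform([token_features(["Taxes", "are", "unfair"], 0)])))
```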

    Subject-tracking and topic continuity in the Church Slavonic translation of the story of Abraham and his niece Mary

    The present article addresses issues of referentiality and text cohesion in a Church Slavonic narrative text. Starting with the specific problem of referential conflict as formulated by Kibrik (1987), issues of tracking personal participants in a narrative text are broadly explored in order to arrive at a rationale for the construction of cohesive text interpretation through topic continuity in subject position. The article takes an interpretative, text-based approach of close reading and argues that participant tracking depends on text genre and general cultural prerequisites of text reading and interpretation rather than on systemic grammatical features of language. It also hints at the possibility that medieval narrative text genres (like the Byzantine-Slavic hagiographic genre explored in this paper through the specimen of the Story of Abraham and Mary) may adhere to a type of narrative construction which places more responsibility on the reader-listener than on the narrator.

    Data-poor categorization and passage retrieval for Gene Ontology Annotation in Swiss-Prot

    Background: In the context of the BioCreative competition, where training data were very sparse, we investigated two complementary tasks: 1) given a Swiss-Prot triplet containing a protein, a GO (Gene Ontology) term and a relevant article, extraction of a short passage that justifies the GO category assignment; 2) given a Swiss-Prot pair containing a protein and a relevant article, automatic assignment of a set of categories.

    Methods: The sentence is the basic retrieval unit. Our classifier computes a distance between each sentence and the GO category provided with the Swiss-Prot entry. The Text Categorizer computes a distance between each GO term and the text of the article. Evaluations are reported both based on annotator judgements as established by the competition and based on mean average precision measures computed using a curated sample of Swiss-Prot.

    Results: Our system achieved the best recall and precision combination both for passage retrieval and text categorization as evaluated by official evaluators. However, text categorization results were far below those in other data-poor text categorization experiments: the top proposed term is relevant in less than 20% of cases, whereas with another biomedical controlled vocabulary, the Medical Subject Headings, we achieved more than 90% precision. We also observe that the scoring methods used in our experiments, based on the retrieval status value of our engines, exhibit effective confidence estimation capabilities.

    Conclusion: From a comparative perspective, the combination of retrieval and natural language processing methods we designed achieved very competitive performance. Largely data-independent, our systems were no less effective than data-intensive approaches. These results suggest that the overall strategy could benefit a large class of information extraction tasks, especially when training data are missing. However, from a user perspective, the results were disappointing. Further investigation is needed to design applicable end-user text mining tools for biologists.
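
    The sentence-as-retrieval-unit idea can be sketched with TF-IDF cosine similarity; this is an illustrative stand-in under assumed inputs, not the distance measures or retrieval engines the paper actually uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def best_passage(sentences, go_term_text):
    """Rank article sentences by similarity to a GO term description and
    return the best-scoring one as the justification passage."""
    vec = TfidfVectorizer(stop_words="english")
    matrix = vec.fit_transform(sentences + [go_term_text])
    # Similarity of every sentence (all rows but the last) to the GO term
    sims = cosine_similarity(matrix[:-1], matrix[-1]).ravel()
    return sentences[sims.argmax()], sims.max()

sentences = [
    "The protein localizes to the mitochondrial membrane.",
    "Samples were incubated overnight at 37 degrees.",
]
passage, score = best_passage(sentences, "mitochondrial membrane component")
print(passage, score)
```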

    Data-efficient methods for information extraction

    Structured knowledge representation systems such as knowledge bases or knowledge graphs provide insights into entities and the relationships among those entities in the real world. Such systems can be employed in various natural language processing applications such as semantic search, question answering and text summarization. It is infeasible and inefficient to populate these knowledge representation systems manually. In this work, we develop methods to automatically extract named entities and relationships among the entities from plain text; our methods can therefore be used either to complete existing incomplete knowledge representation systems or to create a new structured knowledge representation system from scratch. Unlike mainstream supervised methods for information extraction, our methods focus on the low-data scenario and do not require a large amount of annotated data.

    In the first part of the thesis, we focus on the problem of named entity recognition. We participated in the Bacteria Biotope 2019 shared task, which consists of recognizing and normalizing biomedical entity mentions. Our linguistically informed named entity recognition system consists of a deep-learning-based model which can extract both nested and flat entities; the model employs several linguistic features and auxiliary training objectives to enable efficient learning in data-scarce scenarios. Our entity normalization system employs string matching, fuzzy search and semantic search to link the extracted named entities to biomedical databases. Our named entity recognition and entity normalization system achieved the lowest slot error rate of 0.715 and ranked first in the shared task. We also participated in two further shared tasks, Adverse Drug Effect Span Detection (English) and Profession Span Detection (Spanish), both of which collect data from the social media platform Twitter. We developed a named entity recognition model which improves the input representation by stacking heterogeneous embeddings from diverse domains; our empirical results demonstrate complementary learning from these heterogeneous embeddings. Our submission ranked 3rd in both shared tasks.

    In the second part of the thesis, we explored synthetic data augmentation strategies to address low-resource information extraction in specialized domains. Specifically, we adapted backtranslation to the token-level task of named entity recognition and the sentence-level task of relation extraction. We demonstrate that backtranslation can generate linguistically diverse and grammatically coherent synthetic sentences and serves as a competitive augmentation strategy for both tasks. In most real-world relation extraction tasks, annotated data is not available; however, a large unannotated text corpus often is. Bootstrapping methods for relation extraction can operate on such a corpus as they only require a handful of seed instances. However, bootstrapping methods tend to accumulate noise over time (known as semantic drift), and this phenomenon has a drastic negative impact on the final precision of the extractions. We develop two methods to constrain the bootstrapping process to minimize semantic drift for relation extraction; our methods leverage graph theory and pre-trained language models to explicitly identify and remove noisy extraction patterns. We report experimental results on the TACRED dataset for four relations.

    In the last part of the thesis, we demonstrate the application of domain adaptation to the challenging task of multilingual acronym extraction. Our experiments show that domain adaptation can improve acronym extraction in the scientific and legal domains in six languages, including low-resource languages such as Persian and Vietnamese.
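
    As an illustration of how a bootstrapping loop can be constrained against semantic drift, here is a minimal sketch with a simple pattern-precision filter; the thesis's actual methods based on graph theory and pre-trained language models are substantially more sophisticated, and the toy corpus below is invented.

```python
def bootstrap(corpus, seeds, iterations=3, min_conf=0.5):
    """Pattern-based bootstrapping for relation extraction with a
    confidence filter as a crude guard against semantic drift.
    `corpus` is a list of (subject, object, context-pattern) triples."""
    pairs = set(seeds)
    for _ in range(iterations):
        # 1. Score each context pattern by how often it co-occurs with known pairs
        patterns = {}
        for subj, obj, context in corpus:
            hit = (subj, obj) in pairs
            matched, total = patterns.get(context, (0, 0))
            patterns[context] = (matched + hit, total + 1)
        # 2. Keep only patterns whose matches are mostly known pairs
        good = {p for p, (m, t) in patterns.items() if m / t >= min_conf}
        # 3. Extract new pairs with the retained patterns only
        pairs |= {(s, o) for s, o, c in corpus if c in good}
    return pairs

corpus = [
    ("Paris", "France", "X is the capital of Y"),
    ("Berlin", "Germany", "X is the capital of Y"),
    ("Mozart", "Salzburg", "X was born in Y"),
]
print(bootstrap(corpus, {("Paris", "France")}))
# Keeps the capital-of pattern, rejects the drifting born-in pattern
```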

    Constrained decoding for text-level discourse parsing

    This paper presents a novel approach to document-based discourse analysis, performing a global A* search over the space of possible structures while optimizing a global criterion over the set of potential coherence relations. Existing approaches to discourse analysis have so far relied on greedy search strategies or restricted themselves to sentence-level discourse parsing. Another advantage of our approach over other global alternatives (like Maximum Spanning Tree decoding algorithms) is its flexibility in integrating constraints, including linguistically motivated ones like the Right Frontier Constraint. Finally, our paper provides the first discourse parsing system for French; our evaluation is carried out on the Annodis corpus. While using far less training data than previous work on English, our system achieves state-of-the-art results, with F1-scores of 66.2 and 46.8 against unlabeled and labeled reference structures, respectively.
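
    To make the global-search idea concrete, the sketch below runs A* over attachment decisions with an admissible bound; the state space, scoring function, and bound are illustrative assumptions, not the paper's model or constraint set.

```python
import heapq

def a_star_parse(n_edus, score, remaining_bound):
    """A* search over discourse attachments: attach each EDU i (in order)
    to some earlier EDU, maximizing the total attachment score. Costs are
    negated scores so the min-heap finds the maximum; `remaining_bound(i)`
    must be an optimistic (admissible) bound on the best remaining cost."""
    heap = [(remaining_bound(1), 0.0, 1, ())]
    while heap:
        _, cost, i, edges = heapq.heappop(heap)
        if i == n_edus:          # all EDUs attached: optimal structure found
            return edges
        # A real parser would restrict heads here, e.g. to the right frontier
        for head in range(i):
            c = cost - score(head, i)
            heapq.heappush(heap, (c + remaining_bound(i + 1), c, i + 1,
                                  edges + ((head, i),)))
    return ()

# Toy scores: attaching to the immediately preceding EDU scores highest
best = 1.0
edges = a_star_parse(
    4,
    score=lambda h, d: 1.0 if h == d - 1 else 0.2,
    remaining_bound=lambda i: -best * (4 - i),  # optimistic: max score ahead
)
print(edges)  # ((0, 1), (1, 2), (2, 3))
```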
