122 research outputs found
Proceedings of the Seventh International Conference Formal Approaches to South Slavic and Balkan languages
Proceedings of the Seventh International Conference Formal Approaches to South Slavic and Balkan Languages publishes 17 papers that were presented at the conference organised in Dubrovnik, Croatia, 4-6 Octobre 2010
Semantic Representation and Inference for NLP
Semantic representation and inference is essential for Natural Language
Processing (NLP). The state of the art for semantic representation and
inference is deep learning, and particularly Recurrent Neural Networks (RNNs),
Convolutional Neural Networks (CNNs), and transformer Self-Attention models.
This thesis investigates the use of deep learning for novel semantic
representation and inference, and makes contributions in the following three
areas: creating training data, improving semantic representations and extending
inference learning. In terms of creating training data, we contribute the
largest publicly available dataset of real-life factual claims for the purpose
of automatic claim verification (MultiFC), and we present a novel inference
model composed of multi-scale CNNs with different kernel sizes that learn from
external sources to infer fact checking labels. In terms of improving semantic
representations, we contribute a novel model that captures non-compositional
semantic indicators. By definition, the meaning of a non-compositional phrase
cannot be inferred from the individual meanings of its composing words (e.g.,
hot dog). Motivated by this, we operationalize the compositionality of a phrase
contextually by enriching the phrase representation with external word
embeddings and knowledge graphs. Finally, in terms of inference learning, we
propose a series of novel deep learning architectures that improve inference by
using syntactic dependencies, by ensembling role guided attention heads,
incorporating gating layers, and concatenating multiple heads in novel and
effective ways. This thesis consists of seven publications (five published and
two under review).Comment: PhD thesis, the University of Copenhage
Representation and parsing of multiword expressions
This book consists of contributions related to the definition, representation and parsing of MWEs. These reflect current trends in the representation and processing of MWEs. They cover various categories of MWEs such as verbal, adverbial and nominal MWEs, various linguistic frameworks (e.g. tree-based and unification-based grammars), various languages including English, French, Modern Greek, Hebrew, Norwegian), and various applications (namely MWE detection, parsing, automatic translation) using both symbolic and statistical approaches
Current trends
Deep parsing is the fundamental process aiming at the representation of the syntactic
structure of phrases and sentences. In the traditional methodology this process is
based on lexicons and grammars representing roughly properties of words and interactions
of words and structures in sentences. Several linguistic frameworks, such as Headdriven
Phrase Structure Grammar (HPSG), Lexical Functional Grammar (LFG), Tree Adjoining
Grammar (TAG), Combinatory Categorial Grammar (CCG), etc., offer different
structures and combining operations for building grammar rules. These already contain
mechanisms for expressing properties of Multiword Expressions (MWE), which, however,
need improvement in how they account for idiosyncrasies of MWEs on the one
hand and their similarities to regular structures on the other hand. This collaborative
book constitutes a survey on various attempts at representing and parsing MWEs in the
context of linguistic theories and applications
Recommended from our members
Computational Models of Argument Structure and Argument Quality for Understanding Misinformation
With the continuing spread of misinformation and disinformation online, it is of increasing importance to develop combating mechanisms at scale in the form of automated systems that can find checkworthy information, detect fallacious argumentation of online content, retrieve relevant evidence from authoritative sources and analyze the veracity of claims given the retrieved evidence. The robustness and applicability of these systems depend on the availability of annotated resources to train machine learning models in a supervised fashion, as well as machine learning models that capture patterns beyond domain-specific lexical clues or genre-specific stylistic insights. In this thesis, we investigate the role of models for argument structure and argument quality in improving tasks relevant to fact-checking and furthering our understanding of misinformation and disinformation. We contribute to argumentation mining, misinformation detection, and fact-checking by releasing multiple annotated datasets, developing unified models across datasets and task formulations, and analyzing the vulnerabilities of such models in adversarial settings.
We start by studying the argument structure's role in two downstream tasks related to fact-checking. As it is essential to differentiate factual knowledge from opinionated text, we develop a model for detecting the type of news articles (factual or opinionated) using highly transferable argumentation-based features. We also show the potential of argumentation features to predict the checkworthiness of information in news articles and provide the first multi-layer annotated corpus for argumentation and fact-checking.
We then study qualitative aspects of arguments through models for fallacy recognition. To understand the reasoning behind checkworthiness and the relation of argumentative fallacies to fake content, we develop an annotation scheme of fallacies in fact-checked content and investigate avenues for automating the detection of such fallacies considering single- and multi-dataset training. Using instruction-based prompting, we introduce a unified model for recognizing twenty-eight fallacies across five fallacy datasets. We also use this model to explain the checkworthiness of statements in two domains.
Next, we show our models for end-to-end fact-checking of statements that include finding the relevant evidence document and sentence from a collection of documents and then predicting the veracity of the given statements using the retrieved evidence. We also analyze the robustness of end-to-end fact extraction and verification by generating adversarial statements and addressing areas for improvements for models under adversarial attacks. Finally, we show that evidence-based verification is essential for fine-grained claim verification by modeling the human-provided justifications with the gold veracity labels
requirements and use cases
In this report, we introduce our initial vision of the Corporate Semantic Web
as the next step in the broad field of Semantic Web research. We identify
requirements of the corporate environment and gaps between current approaches
to tackle problems facing ontology engineering, semantic collaboration, and
semantic search. Each of these pillars will yield innovative methods and tools
during the project runtime until 2013. Corporate ontology engineering will
improve the facilitation of agile ontology engineering to lessen the costs of
ontology development and, especially, maintenance. Corporate semantic
collaboration focuses the human-centered aspects of knowledge management in
corporate contexts. Corporate semantic search is settled on the highest
application level of the three research areas and at that point it is a
representative for applications working on and with the appropriately
represented and delivered background knowledge. We propose an initial layout
for an integrative architecture of a Corporate Semantic Web provided by these
three core pillars
A Bigger Fish to Fry:Scaling up the Automatic Understanding of Idiomatic Expressions
In this thesis, we are concerned with idiomatic expressions and how to handle them within NLP. Idiomatic expressions are a type of multiword phrase which have a meaning that is not a direct combination of the meaning of its parts, e.g. 'at a crossroads' and 'move the goalposts'.In Part I, we provide a general introduction to idiomatic expressions and an overview of observations regarding idioms based on corpus data. In addition, we discuss existing research on idioms from an NLP perspective, providing an overview of existing tasks, approaches, and datasets. In Part II, we focus on the building of a large idiom corpus, consisting of developing a system for the automatic extraction of potentially idiom expressions and building a large corpus of idiom using crowdsourced annotation. Finally, in Part III, we improve an existing unsupervised classifier and compare it to other existing classifiers. Given the relatively poor performance of this unsupervised classifier, we also develop a supervised deep neural network-based system and find that a model involving two separate modules looking at different information sources yields the best performance, surpassing previous state-of-the-art approaches.In conclusion, this work shows the feasibility of building a large corpus of sense-annotated potentially idiomatic expressions, and the benefits such a corpus provides for further research. It provides the possibility for quick testing of hypotheses about the distribution and usage of idioms, it enables the training of data-hungry machine learning methods for PIE disambiguation systems, and it permits fine-grained, reliable evaluation of such systems
Eesti keele üldvaldkonna tekstide laia kattuvusega automaatne sündmusanalüüs
Seoses tekstide suuremahulise digitaliseerimisega ning digitaalse tekstiloome järjest laiema levikuga on tohutul hulgal loomuliku keele tekste muutunud ja muutumas masinloetavaks. Masinloetavus omab potentsiaali muuta tekstimassiivid inimeste jaoks lihtsamini hallatavaks, nt lubada rakendusi nagu automaatne sisukokkuvõtete tegemine ja tekstide põhjal küsimustele vastamine, ent paraku ei ulatu praegused automaatanalüüsi võimalused tekstide sisu tegeliku mõistmiseni. Oletatakse, tekstide sisu mõistvale automaatanalüüsile viib meid lähemale sündmusanalüüs – kuna paljud tekstid on narratiivse ülesehitusega, tõlgendatavad kui „sündmuste kirjeldused”, peaks tekstidest sündmuste eraldamine ja formaalsel kujul esitamine pakkuma alust mitmete „teksti mõistmist” nõudvate keeletehnoloogia rakenduste loomisel.
Käesolevas väitekirjas uuritakse, kuivõrd saab eestikeelsete tekstide sündmusanalüüsi käsitleda kui avatud sündmuste hulka ja üldvaldkonna tekste hõlmavat automaatse lingvistilise analüüsi ülesannet. Probleemile lähenetakse eesti keele automaatanalüüsi kontekstis uudsest, sündmuste ajasemantikale keskenduvast perspektiivist. Töös kohandatakse eesti keelele TimeML märgendusraamistik ja luuakse raamistikule toetuv automaatne ajaväljendite tuvastaja ning ajasemantilise märgendusega (sündmusviidete, ajaväljendite ning ajaseoste märgendusega) tekstikorpus; analüüsitakse korpuse põhjal inimmärgendajate kooskõla sündmusviidete ja ajaseoste määramisel ning lõpuks uuritakse võimalusi ajasemantika-keskse sündmusanalüüsi laiendamiseks geneeriliseks sündmusanalüüsiks sündmust väljendavate keelendite samaviitelisuse lahendamise näitel.
Töö pakub suuniseid tekstide ajasemantika ja sündmusstruktuuri märgenduse edasiarendamiseks tulevikus ning töös loodud keeleressurssid võimaldavad nii konkreetsete lõpp-rakenduste (nt automaatne ajaküsimustele vastamine) katsetamist kui ka automaatsete märgendustööriistade edasiarendamist.
Due to massive scale digitalisation processes and a switch from traditional means of written communication to digital written communication, vast amounts of human language texts are becoming machine-readable. Machine-readability holds a potential for easing human effort on searching and organising large text collections, allowing applications such as automatic text summarisation and question answering. However, current tools for automatic text analysis do not reach for text understanding required for making these applications generic. It is hypothesised that automatic analysis of events in texts leads us closer to the goal, as many texts can be interpreted as stories/narratives that are decomposable into events.
This thesis explores event analysis as broad-coverage and general domain automatic language analysis problem in Estonian, and provides an investigation starting from time-oriented event analysis and tending towards generic event analysis. We adapt TimeML framework to Estonian, and create an automatic temporal expression tagger and a news corpus manually annotated for temporal semantics (event mentions, temporal expressions, and temporal relations) for the language; we analyse consistency of human annotation of event mentions and temporal relations, and, finally, provide a preliminary study on event coreference resolution in Estonian news.
The current work also makes suggestions on how future research can improve Estonian event and temporal semantic annotation, and the language resources developed in this work will allow future experimentation with end-user applications (such as automatic answering of temporal questions) as well as provide a basis for developing automatic semantic analysis tools
- …