Ontologies and Information Extraction
This report argues that, even in the simplest cases, IE is an ontology-driven
process. It is not a mere text filtering method based on simple pattern
matching and keywords, because the extracted pieces of texts are interpreted
with respect to a predefined partial domain model. This report shows that
depending on the nature and the depth of the interpretation to be done for
extracting the information, more or less knowledge must be involved. This
report is mainly illustrated in biology, a domain in which there are critical
needs for content-based exploration of the scientific literature and which
becomes a major application domain for IE.
Enhancing FunGramKB: Further Verbs of Feeling in English
The present dissertation aims at analyzing some linguistic aspects related to the lexical, semantic and syntactic behaviour of a number of verbs of FEELING in English whose lexical, grammatical and idiosyncratic properties have been entered into the FunGramKB Editor in application of study of the theoretical assumptions propounded by the Lexical-Constructional Model.
Analysis and subsequent input of data have been assessed against the background of some of the 20th-century trends in linguistics which find their expression in the first decade of this century, and the role of semantics in a world in which increasing priority is given to probabilistic, machine-learned output in lexicographic work. From this stance, the generic features contained in the FunGramKB meaning postulates and thematic frames as outlined in the Lexical-Constructional Model bring hope for a more faithful rendering of the semantic relationships established within human expression, while making provisions for a semanticist's contribution to refinement and storage of both thorough and extensive knowledge.
Syntax and semantics of adjectives in Portuguese: analysis and modeling
Doctoral thesis, Linguistics (Computational Linguistics), Universidade de Lisboa, Faculdade de Letras, 2010. Available in the document. Fundação para a Ciência e Tecnologia (SFRH/BD/8524/2002)
Models to represent linguistic linked data
As the interest of the Semantic Web and computational linguistics communities in linguistic linked data (LLD) keeps increasing and the number of contributions that dwell on LLD rapidly grows, scholars (and linguists in particular) interested in the development of LLD resources sometimes find it difficult to determine which mechanism is suitable for their needs and which challenges have already been addressed. This review seeks to present the state of the art on the models, ontologies and their extensions to represent language resources as LLD by focusing on the nature of the linguistic content they aim to encode. Four basic groups of models are distinguished in this work: models to represent the main elements of lexical resources (group 1), vocabularies developed as extensions to models in group 1 and ontologies that provide more granularity on specific levels of linguistic analysis (group 2), catalogues of linguistic data categories (group 3) and other models such as corpora models or service-oriented ones (group 4). Contributions encompassed in these four groups are described, highlighting their reuse by the community and the modelling challenges that are still to be faced.
Inducing the Cross-Disciplinary Usage of Morphological Language Data Through Semantic Modelling
Despite the enormous technological advancements in the area of data creation and management the vast majority of language data still exists as digital single-use artefacts that are inaccessible for further research efforts. At the same time the advent of digitisation in science increased the possibilities for knowledge acquisition through the computational application of linguistic information for various disciplines.
The purpose of this thesis, therefore, is to create the preconditions that enable the cross-disciplinary usage of morphological language data as a sub-area of linguistic data in order to induce a shared reusability for every research area that relies on such data. This involves the provision of morphological data on the Web under an open license and needs to take the prevalent diversity of data compilation into account. Various representation standards emerged across single disciplines, which led to heterogeneous data that differs with regard to complexity, scope and data formats. This situation requires a unifying foundation enabling direct reusability.
As a solution to fill the gap of missing open data and to overcome the presence of isolated datasets, a semantic data modelling approach is applied. Being rooted in the Linked Open Data (LOD) paradigm, it pursues the creation of data as uniquely identifiable resources that are realised as URIs, accessible on the Web, available under an open license, interlinked with other resources, and adhere to Linked Data representation standards such as the RDF format. Each resource then contributes to the LOD cloud in which they are all interconnected. This unification results from ontologically shared bases that formally define the classification of resources and their relation to other resources in a semantically interoperable manner. Subsequently, the possibility of creating semantically structured data has sparked the formation of the Linguistic Linked Open Data (LLOD) research community and LOD sub-cloud containing primarily language resources. Over the last decade, ontologies emerged mainly for the domain of lexical language data, which led to a significant increase in Linked Data-based linguistic datasets. However, an equivalent model for morphological data is still missing, leading to a lack of this type of language data within the LLOD cloud.
This thesis presents six publications that are concerned with the peculiarities of morphological data and the exploration of their semantic representation as an enabler of cross-disciplinary reuse. The Multilingual Morpheme Ontology (MMoOn Core) as well as an architectural framework for morphemic dataset creation as RDF resources are proposed as the first comprehensive domain representation model adhering to the LOD paradigm. It will be shown that MMoOn Core permits the joint representation of heterogeneous data sources such as interlinear glossed texts, inflection tables, the outputs of morphological analysers, lists of morphemic glosses or word-formation rules which are all equally labelled as “morphological data” across different research areas. Evidence for the applicability and adequacy of the semantic modelling entailed by the MMoOn Core ontology is provided by two datasets that were transformed from tabular data into RDF: the Hebrew Morpheme Inventory and Xhosa RDF dataset. Both further demonstrate how their integration into the LLOD cloud - by interlinking them with external language resources - yields insights that could not be obtained from the initial source data.
Altogether, the research conducted in this thesis establishes the foundation for an interoperable data exchange and the enrichment of morphological language data. It strives to achieve the broader goal of advancing language data-driven research by overcoming data barriers and discipline boundaries.
A Computational Lexicon and Representational Model for Arabic Multiword Expressions
The phenomenon of multiword expressions (MWEs) is increasingly recognised as a serious and challenging issue that has attracted the attention of researchers in various language-related disciplines. Research in these many areas has emphasised the primary role of MWEs in the process of analysing and understanding language, particularly in the computational treatment of natural languages. Ignoring MWE knowledge in any NLP system reduces the possibility of achieving high precision outputs. However, despite the enormous wealth of MWE research and language resources available for English and some other languages, research on Arabic MWEs (AMWEs) still faces multiple challenges, particularly in key computational tasks such as extraction, identification, evaluation, language resource building, and lexical representations.
This research aims to remedy this deficiency by extending knowledge of AMWEs and making noteworthy contributions to the existing literature in three related research areas on the way towards building a computational lexicon of AMWEs. First, this study develops a general understanding of AMWEs by establishing a detailed conceptual framework that includes a description of an adopted AMWE concept and its distinctive properties at multiple linguistic levels. Second, in the use of AMWE extraction and discovery tasks, the study employs a hybrid approach that combines knowledge-based and data-driven computational methods for discovering multiple types of AMWEs. Third, this thesis presents a representative system for AMWEs which consists of multilayer encoding of extensive linguistic descriptions.
This project also paves the way for further in-depth AMWE-aware studies in NLP and linguistics to gain new insights into this complicated phenomenon in standard Arabic. The implications of this research are related to the vital role of the AMWE lexicon, as a new lexical resource, in the improvement of various ANLP tasks and the potential opportunities this lexicon provides for linguists to analyse and explore AMWE phenomena.
Historical lexicography of Old French and linked open data: transforming the resources of the Dictionnaire étymologique de l'ancien français with OntoLex-Lemon
The adaptation of novel techniques and standards in computational lexicography is taking place at an accelerating pace, as manifested by
recent extensions beyond the traditional XML-based paradigm of electronic publication. One important area of activity in this regard is the transformation of lexicographic resources into (Linguistic) Linked Open Data ([L]LOD), and the application of the OntoLex-Lemon
vocabulary to electronic editions of dictionaries. At the moment, however, these activities focus on machine-readable dictionaries,
natural language processing and modern languages and found only limited resonance in philology in general and in historical language
stages in particular. This paper presents an endeavor to transform the resources of a comprehensive dictionary of Old French into LOD
using OntoLex-Lemon and it sketches the difficulties of modeling particular aspects that are due to the medieval stage of the language.
Designing Statistical Language Learners: Experiments on Noun Compounds
The goal of this thesis is to advance the exploration of the statistical
language learning design space. In pursuit of that goal, the thesis makes two
main theoretical contributions: (i) it identifies a new class of designs by
specifying an architecture for natural language analysis in which probabilities
are given to semantic forms rather than to more superficial linguistic
elements; and (ii) it explores the development of a mathematical theory to
predict the expected accuracy of statistical language learning systems in terms
of the volume of data used to train them.
The theoretical work is illustrated by applying statistical language learning
designs to the analysis of noun compounds. Both syntactic and semantic analysis
of noun compounds are attempted using the proposed architecture. Empirical
comparisons demonstrate that the proposed syntactic model is significantly
better than those previously suggested, approaching the performance of human
judges on the same task, and that the proposed semantic model, the first
statistical approach to this problem, exhibits significantly better accuracy
than the baseline strategy. These results suggest that the new class of designs
identified is a promising one. The experiments also serve to highlight the need
for a widely applicable theory of data requirements.
Comment: PhD thesis (Macquarie University, Sydney; December 1995), LaTeX
source, xii+214 pages
Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation
This paper surveys the current state of the art in Natural Language
Generation (NLG), defined as the task of generating text or speech from
non-linguistic input. A survey of NLG is timely in view of the changes that the
field has undergone over the past decade or so, especially in relation to new
(usually data-driven) methods, as well as new applications of NLG technology.
This survey therefore aims to (a) give an up-to-date synthesis of research on
the core tasks in NLG and the architectures adopted in which such tasks are
organised; (b) highlight a number of relatively recent research topics that
have arisen partly as a result of growing synergies between NLG and other areas
of artificial intelligence; (c) draw attention to the challenges in NLG
evaluation, relating them to similar challenges faced in other areas of Natural
Language Processing, with an emphasis on different evaluation methods and the
relationships between them.
Comment: Published in Journal of AI Research (JAIR), volume 61, pp 75-170. 118
pages, 8 figures, 1 table