Inducing the Cross-Disciplinary Usage of Morphological Language Data Through Semantic Modelling
Despite enormous technological advances in data creation and management, the vast majority of language data still exists as single-use digital artefacts that are inaccessible to further research. At the same time, the advent of digitisation in science has increased the possibilities for knowledge acquisition through the computational application of linguistic information in various disciplines.
The purpose of this thesis, therefore, is to create the preconditions that enable the cross-disciplinary usage of morphological language data, as a sub-area of linguistic data, in order to induce shared reusability for every research area that relies on such data. This involves providing morphological data on the Web under an open license and needs to take the prevalent diversity of data compilation into account. Various representation standards have emerged within individual disciplines, which has led to heterogeneous data that differ in complexity, scope and data format. This situation requires a unifying foundation that enables direct reusability.
To fill the gap of missing open data and to overcome the presence of isolated datasets, a semantic data modelling approach is applied. Rooted in the Linked Open Data (LOD) paradigm, it pursues the creation of data as uniquely identifiable resources that are realised as URIs, accessible on the Web, available under an open license, interlinked with other resources, and adherent to Linked Data representation standards such as the RDF format. Each resource then contributes to the LOD cloud, in which all resources are interconnected. This unification results from shared ontological bases that formally define the classification of resources and their relations to other resources in a semantically interoperable manner. Subsequently, the possibility of creating semantically structured data sparked the formation of the Linguistic Linked Open Data (LLOD) research community and of an LOD sub-cloud containing primarily language resources. Over the last decade, ontologies have emerged mainly for the domain of lexical language data, which has led to a significant increase in Linked Data-based linguistic datasets. However, an equivalent model for morphological data is still missing, leading to a lack of this type of language data within the LLOD cloud.
This thesis presents six publications that are concerned with the peculiarities of morphological data and the exploration of their semantic representation as an enabler of cross-disciplinary reuse. The Multilingual Morpheme Ontology (MMoOn Core), together with an architectural framework for creating morphemic datasets as RDF resources, is proposed as the first comprehensive domain representation model adhering to the LOD paradigm. It will be shown that MMoOn Core permits the joint representation of heterogeneous data sources such as interlinear glossed texts, inflection tables, the outputs of morphological analysers, lists of morphemic glosses or word-formation rules, which are all equally labelled as “morphological data” across different research areas. Evidence for the applicability and adequacy of the semantic modelling entailed by the MMoOn Core ontology is provided by two datasets that were transformed from tabular data into RDF: the Hebrew Morpheme Inventory and the Xhosa RDF dataset. Both further demonstrate how their integration into the LLOD cloud, by interlinking them with external language resources, yields insights that could not be obtained from the initial source data.
Altogether, the research conducted in this thesis establishes the foundation for interoperable data exchange and the enrichment of morphological language data. It strives to achieve the broader goal of advancing language data-driven research by overcoming data barriers and discipline boundaries.
Translation-Based Dictionary Alignment for Under-Resourced Bantu Languages
Despite a large number of active speakers, most Bantu languages can be considered under- or less-resourced languages. This applies especially to the current state of lexicographical data, which is highly unsatisfactory with respect to size, quality and consistency in format and provided information. Unfortunately, this holds not only for the amount and quality of data for monolingual dictionaries, but also for their lack of interconnection to form a network of dictionaries. Current endeavours to promote the use of Bantu languages in primary and secondary education in countries like South Africa show the urgent need for high-quality digital dictionaries. This contribution describes a prototypical implementation for aligning Xhosa, Zimbabwean Ndebele and Kalanga language dictionaries based on their English translations, using simple string matching techniques and WordNet URIs. The RDF-based representation of the data using the Bantu Language Model (BLM) and (partial) references to the established WordNet dataset supported this process significantly.
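The translation-based alignment idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the normalisation rule (trimming and lowercasing) and the placeholder entries are assumptions made for illustration only, and the WordNet-based linking mentioned in the abstract is not covered.

```python
from collections import defaultdict


def normalize(gloss):
    # Simple string normalisation (an assumption): trim whitespace, ignore case.
    return gloss.strip().lower()


def align_by_translation(dictionaries):
    """Group headwords from several dictionaries that share an English gloss.

    `dictionaries` maps a language name to a list of (headword, english_gloss)
    pairs. The result maps each normalised gloss to {language: [headwords]},
    keeping only glosses that occur in at least two languages.
    """
    index = defaultdict(lambda: defaultdict(list))
    for language, entries in dictionaries.items():
        for headword, gloss in entries:
            index[normalize(gloss)][language].append(headword)
    return {
        gloss: dict(by_language)
        for gloss, by_language in index.items()
        if len(by_language) >= 2
    }


# Placeholder entries (hypothetical, not real dictionary data).
sample = {
    "lang_a": [("a-word-1", "Water"), ("a-word-2", "house")],
    "lang_b": [("b-word-1", "water "), ("b-word-2", "tree")],
    "lang_c": [("c-word-1", "house")],
}
alignment = align_by_translation(sample)
```

Exact matching on normalised glosses is the simplest possible choice; it misses synonyms and multi-word translations, which is where sense-level identifiers such as WordNet URIs would help.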
On the linguistic linked open data infrastructure
In this paper we describe the current state of development of the Linguistic Linked Open Data (LLOD) infrastructure, an LOD (sub-)cloud of linguistic resources, which covers various linguistic databases, lexicons, corpora, terminology and metadata repositories. We give a fairly detailed overview of the contributions made by the European H2020 projects “Prêt-à-LLOD” (‘Ready-to-use Multilingual Linked Language Data for Knowledge Services across Sectors’) and “ELEXIS” (‘European Lexicographic Infrastructure’) to the further development of the LLOD.
Challenges for the representation of morphology in ontology lexicons
Recent years have seen a growing trend in the publication of language resources as Linguistic Linked Data (LLD) to enhance their discovery, reuse and the interoperability of tools that consume language data. To this end, the OntoLex-lemon model has emerged as a de facto standard for representing lexical data on the Web. However, traditional dictionaries contain a considerable amount of morphological information which is not straightforwardly representable as LLD within the current model. In order to fill this gap, a new Morphology Module of OntoLex-lemon is currently being developed. This paper presents the results of this model as ongoing work as well as the underlying challenges that emerged during the module development. Based on the MMoOn Core ontology, it aims to account for a wide range of morphological information, ranging from endings to derive whole paradigms to the decomposition and generation of lexical entries, in a way that complies with other OntoLex-lemon modules and facilitates the encoding of complex morphological data in ontology lexicons.
The Open Linguistics Working Group: developing the Linguistic Linked Open Data cloud
The Open Linguistics Working Group (OWLG) brings together researchers from various fields of linguistics, natural language processing, and information technology to present and discuss principles, case studies, and best practices for representing, publishing and linking linguistic data collections. A major outcome of our work is the Linguistic Linked Open Data (LLOD) cloud, an LOD (sub-)cloud of linguistic resources, which covers various linguistic databases, lexicons, corpora, terminologies, and metadata repositories. We present and summarize five years of progress on the development of the cloud and of advancements in open data in linguistics, and we describe recent community activities. The paper aims to serve as a guideline to orient and involve researchers with the community and/or Linguistic Linked Open Data.
Worldwide trends in underweight and obesity from 1990 to 2022: a pooled analysis of 3663 population-representative studies with 222 million children, adolescents, and adults
Background Underweight and obesity are associated with adverse health outcomes throughout the life course. We
estimated the individual and combined prevalence of underweight or thinness and obesity, and their changes, from
1990 to 2022 for adults and school-aged children and adolescents in 200 countries and territories.
Methods We used data from 3663 population-based studies with 222 million participants that measured height and
weight in representative samples of the general population. We used a Bayesian hierarchical model to estimate
trends in the prevalence of different BMI categories, separately for adults (age ≥20 years) and school-aged children
and adolescents (age 5–19 years), from 1990 to 2022 for 200 countries and territories. For adults, we report the
individual and combined prevalence of underweight (BMI <18·5 kg/m²) and obesity (BMI ≥30 kg/m²). For school-aged children and adolescents, we report thinness (BMI <2 SD below the median of the WHO growth reference)
and obesity (BMI >2 SD above the median).
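The adult cut-offs stated above can be written out as a small Python sketch. It only illustrates the BMI thresholds and the notion of combined prevalence on raw measurements; it does not reproduce the Bayesian hierarchical model used in the study, and the function names are assumptions.

```python
def bmi(weight_kg, height_m):
    # Body-mass index in kg/m².
    return weight_kg / height_m ** 2


def classify_adult(bmi_value):
    # Adult categories from the abstract: underweight <18.5, obesity >=30.
    if bmi_value < 18.5:
        return "underweight"
    if bmi_value >= 30:
        return "obesity"
    return "neither"


def combined_prevalence(bmi_values):
    # Share of individuals who are either underweight or obese.
    flagged = sum(1 for value in bmi_values if classify_adult(value) != "neither")
    return flagged / len(bmi_values)
```

For children and adolescents the categories depend on age- and sex-specific WHO growth reference values, so a fixed-threshold function like this one does not apply to them.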
Findings From 1990 to 2022, the combined prevalence of underweight and obesity in adults decreased in
11 countries (6%) for women and 17 (9%) for men with a posterior probability of at least 0·80 that the observed
changes were true decreases. The combined prevalence increased in 162 countries (81%) for women and
140 countries (70%) for men with a posterior probability of at least 0·80. In 2022, the combined prevalence of
underweight and obesity was highest in island nations in the Caribbean and Polynesia and Micronesia, and
countries in the Middle East and north Africa. Obesity prevalence was higher than underweight with posterior
probability of at least 0·80 in 177 countries (89%) for women and 145 (73%) for men in 2022, whereas the converse
was true in 16 countries (8%) for women, and 39 (20%) for men. From 1990 to 2022, the combined prevalence of
thinness and obesity decreased among girls in five countries (3%) and among boys in 15 countries (8%) with a
posterior probability of at least 0·80, and increased among girls in 140 countries (70%) and boys in 137 countries (69%)
with a posterior probability of at least 0·80. The countries with highest combined prevalence of thinness and
obesity in school-aged children and adolescents in 2022 were in Polynesia and Micronesia and the Caribbean for
both sexes, and Chile and Qatar for boys. Combined prevalence was also high in some countries in south Asia, such
as India and Pakistan, where thinness remained prevalent despite having declined. In 2022, obesity in school-aged
children and adolescents was more prevalent than thinness with a posterior probability of at least 0·80 among girls
in 133 countries (67%) and boys in 125 countries (63%), whereas the converse was true in 35 countries (18%) and
42 countries (21%), respectively. In almost all countries for both adults and school-aged children and adolescents,
the increases in double burden were driven by increases in obesity, and decreases in double burden by declining
underweight or thinness.
Interpretation The combined burden of underweight and obesity has increased in most countries, driven by an
increase in obesity, while underweight and thinness remain prevalent in south Asia and parts of Africa. A healthy
nutrition transition that enhances access to nutritious foods is needed to address the remaining burden of
underweight while curbing and reversing the increase in obesity.
MMoOn Core – the Multilingual Morpheme Ontology
In recent years, lexical resources have emerged rapidly in the Semantic Web. Whereas most linguistic information is already machine-readable, we found that morphological information is mostly absent or only contained in semi-structured strings. An integration of morphemic data has not yet been undertaken due to the lack of domain-specific ontologies and explicit morphemic data. In this paper, we present the Multilingual Morpheme Ontology, called MMoOn Core, which can be regarded as the first comprehensive ontology for the linguistic domain of morphological language data. We describe how crucial concepts like morphs, morphemes, word forms and meanings are represented and interrelated, and how language-specific morpheme inventories can be created as a new kind of morphological dataset. The aim of the MMoOn Core ontology is to serve as a shared semantic model for linguists and NLP researchers alike to enable the creation, conversion, exchange, reuse and enrichment of morphological language data across different data-dependent language sciences. To this end, various use cases are illustrated to draw attention to the cross-disciplinary potential which can be realized with the MMoOn Core ontology in the context of the existing Linguistic Linked Data research landscape.
Creating Linked Data morphological language resources with MMoOn: the Hebrew Morpheme Inventory
The development of standard models for describing general lexical resources has led to the emergence of numerous lexical datasets of various languages in the Semantic Web. However, there are no models that describe the domain of morphology in a similar manner. As a result, there are hardly any language resources of morphemic data available in RDF to date. This paper presents the creation of the Hebrew Morpheme Inventory from a manually compiled tabular dataset comprising around 52,000 entries. It is an ongoing effort to represent the lexemes, word-forms and morphological patterns together with their underlying relations based on the newly created Multilingual Morpheme Ontology (MMoOn). It will be shown how segmented Hebrew language data can be granularly described in a Linked Data format, thus serving as an exemplary case for creating morpheme inventories of any inflectional language with MMoOn. The resulting dataset is described a) according to the structure of the underlying data format, b) with respect to the Hebrew language characteristic of building word-forms directly from roots, c) by exemplifying how inflectional information is realized and d) with regard to its enrichment with external links to sense resources.