35 research outputs found

RDF(S)/XML Linguistic Annotation of Semantic Web Pages

Author: Aguado de Cea G.
Pareja-Lora A.
Plaza Arteche R.
Álvarez de Mon Rego I.
Publication venue: Facultad de Informática (UPM)
Publication date: 01/01/2002

Although with the Semantic Web initiative much research on web page semantic annotation has already been done by AI researchers, linguistic text annotation, including the semantic one, was originally developed in Corpus Linguistics and its results have been somewhat neglected by AI. ...

Archivo Digital UPM

I NUNC-ES: strumenti nuovi per la linguistica dei corpora in spagnolo

Author: Barbera Manuel
Publication venue: Ediciones Complutense
Publication date: 01/11/2007

After a short presentation of NUNC, a freely available multilingual suite of corpora based on newsgroup texts (queryable online at http://www.bmanuel.org/projects/ng-HOME.html), this paper investigates the Spanish subset of the data, collected in NUNC-ES. A brief description of the Spanish hierarchies is given, and some examples of corpus queries are suggested. The third part of the work outlines future developments, especially the release of a new tagset for Spanish.

Directory of Open Access Journals

Portal de Revistas Científicas Complutenses

A Semantic web page linguistic annotation model

Author: Aguado de Cea Guadalupe
Alvarez de Mon Rego Inmaculada
Gómez-Pérez A.
Pareja-Lora A.
Plaza-Arteche Rosario
Publication venue: Facultad de Informática (UPM)
Publication date: 01/01/2002

Although with the Semantic Web initiative much research on web page semantic annotation has already been done by AI researchers, linguistic text annotation, including the semantic one, was originally developed in Corpus Linguistics and its results have been somewhat neglected by AI. The purpose of the research presented in this proposal is to show that integrating the results of both fields is not only possible, but also highly useful for making Semantic Web pages more machine-readable. A multi-level (possibly multi-purpose and multi-language) annotation model based on EAGLES standards and Ontological Semantics, implemented with latest-generation Semantic Web languages, is being developed to fit the needs of both communities.

Archivo Digital UPM

SMM: Detailed, Structured Morphological Analysis for Spanish

Author: Mahlow Cerstin
Piotrowski Michael
Publication date: 13/05/2015

We present a morphological analyzer for Spanish called SMM. SMM is implemented in the grammar development framework Malaga, which is based on the formalism of Left-Associative Grammar. We briefly present the Malaga framework, describe the implementation decisions for some interesting morphological phenomena of Spanish, and report on the evaluation results from the analysis of corpora. SMM was originally designed only for analyzing word forms; in this article we outline two approaches for using SMM and the facilities provided by Malaga to also generate verbal paradigms. SMM can also be embedded into applications by making use of the Malaga programming interface; we briefly discuss some application scenarios.

Publikationsserver des Instituts für Deutsche Sprache

Developing a multilayer semantic annotation scheme based on ISO standards for the visualization of a newswire corpus

Author: Cantante Inês
Jorge Alípio Mário
Leal António
Oliveira Fátima
Silva Maria de Fátima Henriques da
Silvano Maria da Purificação
Publication date: 01/01/2021

In this paper, we describe the process of developing a multilayer semantic annotation scheme designed for extracting information from a European Portuguese corpus of news articles at three levels: temporal, referential, and semantic role labelling. The novelty of this scheme is the harmonization of parts 1, 4 and 9 of ISO 24617, Language resource management - Semantic annotation framework. This annotation framework includes a set of entity structures (participants, events, times) and a set of links (temporal, aspectual, subordination, objectal and semantic roles) with several tags and attribute values that ensure adequate semantic and visual representations of news stories.
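The entity-and-link structure described in this abstract can be pictured with a small data model. The following is only a minimal sketch: the class names, field names, the English example sentence and the `value` label are all hypothetical illustrations, not the tags or attribute values defined by the authors' scheme or by ISO 24617.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class EntityType(Enum):
    # The three entity structures named in the abstract.
    PARTICIPANT = "participant"
    EVENT = "event"
    TIME = "time"


class LinkType(Enum):
    # The five link kinds named in the abstract.
    TEMPORAL = "temporal"
    ASPECTUAL = "aspectual"
    SUBORDINATION = "subordination"
    OBJECTAL = "objectal"
    SEMANTIC_ROLE = "semantic_role"


@dataclass(frozen=True)
class Entity:
    id: str
    type: EntityType
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive


@dataclass(frozen=True)
class Link:
    type: LinkType
    source: str      # id of the source entity
    target: str      # id of the target entity
    value: str = ""  # hypothetical attribute value


@dataclass
class AnnotatedDocument:
    text: str
    entities: Dict[str, Entity] = field(default_factory=dict)
    links: List[Link] = field(default_factory=list)

    def add_entity(self, id: str, type: EntityType, surface: str) -> Entity:
        """Register the first occurrence of `surface` in the text as an entity."""
        start = self.text.find(surface)
        if start < 0:
            raise ValueError(f"{surface!r} not found in text")
        entity = Entity(id, type, start, start + len(surface))
        self.entities[id] = entity
        return entity

    def add_link(self, link: Link) -> None:
        """Add a link, rejecting references to unknown entity ids."""
        for ref in (link.source, link.target):
            if ref not in self.entities:
                raise KeyError(f"unknown entity id: {ref}")
        self.links.append(link)

    def span(self, id: str) -> str:
        entity = self.entities[id]
        return self.text[entity.start:entity.end]


# Illustrative English sentence (the corpus in the paper is European Portuguese).
doc = AnnotatedDocument("The minister resigned on Monday.")
doc.add_entity("p1", EntityType.PARTICIPANT, "The minister")
doc.add_entity("e1", EntityType.EVENT, "resigned")
doc.add_entity("t1", EntityType.TIME, "Monday")
doc.add_link(Link(LinkType.SEMANTIC_ROLE, "e1", "p1", value="agent"))
doc.add_link(Link(LinkType.TEMPORAL, "e1", "t1"))
```

Keeping entities and links in separate layers that refer to each other by id is one common way to let the temporal, referential and semantic-role annotations share the same events without duplicating spans.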

Repositório Aberto da Universidade do Porto

Maximum Entropy Models For Natural Language Ambiguity Resolution

Author: Ratnaparkhi Adwait
Publication venue: ScholarlyCommons
Publication date: 01/01/1998

This thesis demonstrates that several important kinds of natural language ambiguities can be resolved to state-of-the-art accuracies using a single statistical modeling technique based on the principle of maximum entropy. We discuss the problems of sentence boundary detection, part-of-speech tagging, prepositional phrase attachment, natural language parsing, and text categorization under the maximum entropy framework. In practice, we have found that maximum entropy models offer the following advantages. (1) State-of-the-art accuracy: the probability models for all of the tasks discussed perform at or near state-of-the-art accuracies, or outperform competing learning algorithms when trained and tested under similar conditions. Methods which outperform those presented here require much more supervision in the form of additional human involvement or additional supporting resources. (2) Knowledge-poor features: the facts used to model the data, or features, are linguistically very simple, or knowledge-poor, yet succeed in approximating complex linguistic relationships. (3) Reusable software technology: the mathematics of the maximum entropy framework are essentially independent of any particular task, and a single software implementation can be used for all of the probability models in this thesis. The experiments in this thesis suggest that experimenters can obtain state-of-the-art accuracies on a wide range of natural language tasks, with little task-specific effort, by using maximum entropy probability models.
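To make the idea of one task-independent model form concrete, here is a minimal sketch of a conditional maximum entropy (log-linear) distribution, p(a | b) proportional to exp(sum_j w_j * f_j(a, b)). The feature functions, weights and tag set below are hypothetical toy values chosen for illustration; they are not taken from the thesis, and no parameter estimation is shown.

```python
import math
from typing import Callable, Dict, Sequence

# A feature is a function f(outcome, context) -> value (usually 0.0 or 1.0).
Feature = Callable[[str, str], float]


def maxent_distribution(
    context: str,
    outcomes: Sequence[str],
    features: Sequence[Feature],
    weights: Sequence[float],
) -> Dict[str, float]:
    """Return p(outcome | context) for a log-linear model with fixed weights."""
    scores = [
        sum(w * f(outcome, context) for f, w in zip(features, weights))
        for outcome in outcomes
    ]
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    normalizer = sum(exps)
    return {outcome: e / normalizer for outcome, e in zip(outcomes, exps)}


# Hypothetical "knowledge-poor" features for a toy part-of-speech decision.
def ing_suffix_and_verb(tag: str, word: str) -> float:
    return 1.0 if word.endswith("ing") and tag == "VERB" else 0.0


def capitalized_and_noun(tag: str, word: str) -> float:
    return 1.0 if word[:1].isupper() and tag == "NOUN" else 0.0


FEATURES = [ing_suffix_and_verb, capitalized_and_noun]
WEIGHTS = [1.5, 0.8]  # assumed values; in practice these are estimated from data
TAGS = ["NOUN", "VERB", "ADJ"]

dist = maxent_distribution("running", TAGS, FEATURES, WEIGHTS)
best_tag = max(dist, key=dist.get)
```

Only the feature functions and the outcome set change from task to task; `maxent_distribution` itself stays the same, which is the sense in which a single implementation can be reused.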

CiteSeerX

ScholarlyCommons@Penn

Developing Methods and Resources for Automated Processing of the African Language Igbo

Author: Onyenwe Ikechukwu Ekene
Publication venue: University of Sheffield Conference Proceedings
Publication date: 25/04/2017

Natural Language Processing (NLP) research is still in its infancy in Africa. Most languages in Africa have few or no NLP resources available, and Igbo is among those with none. In this study, we develop NLP resources to support NLP-based research in the Igbo language. The springboard is the development of a new part-of-speech (POS) tagset for Igbo (IgbTS), based on a slight adaptation of the EAGLES guidelines to accommodate language-internal features not recognized in EAGLES. The tagset comes in three granularities: fine-grained (85 tags), medium-grained (70 tags) and coarse-grained (15 tags). The medium-grained tagset strikes a balance between the other two for practical purposes. Next comes the preprocessing of Igbo electronic texts through normalization and tokenization. The tokenizer is developed in this study using the tagset's definition of a word token, and the outcome is an Igbo corpus (IgbC) of about one million tokens. The IgbTS was applied to a part of the IgbC to produce the first Igbo tagged corpus (IgbTC). To investigate the effectiveness, validity and reproducibility of the IgbTS, an inter-annotator agreement (IAA) exercise was undertaken, which led to the revision of the IgbTS where necessary. A novel automatic method was developed to bootstrap a manual annotation process by exploiting the by-products of this IAA exercise, to improve the IgbTC. To further improve the quality of the IgbTC, a committee-of-taggers approach was adopted to propose erroneous instances in the IgbTC for correction. A novel automatic method that uses knowledge of affixes to flag and correct all morphologically inflected words in the IgbTC whose tags wrongly mark them as not morphologically inflected was also developed and used.
Experiments towards the development of an automatic POS tagging system for Igbo using the IgbTC show good accuracy scores, comparable to those for other languages that these taggers have been tested on, such as English. Accuracy on words previously unseen during the taggers’ training (also called unknown words) is considerably lower, and lower still on unknown words that are morphologically complex, which indicates difficulty in handling morphologically complex words in Igbo. This was improved by adopting a morphological reconstruction method (a linguistically informed segmentation into stems and affixes) that reformatted these morphologically complex words into patterns learnable by machines. This enables taggers to use knowledge of the stems and associated affixes of these words during the tagging process to predict their appropriate tags. Interestingly, this method outperforms other methods that existing taggers use in handling unknown words, and achieves an impressive increase in accuracy on morphologically inflected unknown words and on unknown words overall. These developments are the first NLP toolkit for the Igbo language and a step towards achieving the objective of Basic Language Resources Kits (BLARK) for the language. This IgboNLP toolkit will be made available to the NLP community and should encourage further research and development for the language.
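The general idea of segmenting an unknown word into a stem and affixes before tagging can be illustrated with a short sketch. This is a generic greedy affix-stripping routine written for illustration, not the thesis's morphological reconstruction method; the function name, the `min_stem` parameter and the English toy affix lists are all assumptions, and no Igbo data is used.

```python
from typing import List, Sequence


def segment(
    word: str,
    prefixes: Sequence[str],
    suffixes: Sequence[str],
    min_stem: int = 2,
) -> List[str]:
    """Greedily strip known affixes, longest first, keeping a stem of
    at least `min_stem` characters. Returns prefixes, stem, suffixes in order."""
    by_length_prefixes = sorted((p for p in prefixes if p), key=len, reverse=True)
    by_length_suffixes = sorted((s for s in suffixes if s), key=len, reverse=True)

    found_prefixes: List[str] = []
    found_suffixes: List[str] = []
    stem = word
    changed = True
    while changed:
        changed = False
        for p in by_length_prefixes:
            if stem.startswith(p) and len(stem) - len(p) >= min_stem:
                found_prefixes.append(p + "-")
                stem = stem[len(p):]
                changed = True
                break
        for s in by_length_suffixes:
            if stem.endswith(s) and len(stem) - len(s) >= min_stem:
                # Insert at the front so suffixes stay in surface order.
                found_suffixes.insert(0, "-" + s)
                stem = stem[: -len(s)]
                changed = True
                break
    return found_prefixes + [stem] + found_suffixes


# Toy English affix lists, purely illustrative.
TOY_PREFIXES = ["un", "re"]
TOY_SUFFIXES = ["ing", "ed", "s"]

pieces = segment("unlocked", TOY_PREFIXES, TOY_SUFFIXES)
```

A tagger could then be given the segmented pieces, rather than the opaque surface form, as features for an unknown word; how the thesis actually encodes stems and affixes for its taggers is not specified in this abstract.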

White Rose E-theses Online