Knowledge Organization Systems (KOS) in the Semantic Web: A Multi-Dimensional Review
Since the Simple Knowledge Organization System (SKOS) specification and its
SKOS eXtension for Labels (SKOS-XL) became formal W3C recommendations in 2009,
a significant number of conventional knowledge organization systems (KOS) -
including thesauri, classification schemes, name authorities, and lists of
codes and terms produced before the arrival of the ontology wave - have made
their way into the Semantic Web mainstream. This paper uses "LOD KOS"
as an umbrella term to refer to all of the value vocabularies and lightweight
ontologies within the Semantic Web framework. The paper provides an overview of
what the LOD KOS movement has brought to various communities and users. These
are not limited to the communities of value vocabulary constructors and
providers, nor to the catalogers and indexers who have a long history of applying
the vocabularies to their products. The LOD dataset producers and LOD service
providers, the information architects and interface designers, and researchers
in the sciences and humanities are also direct beneficiaries of LOD KOS. The paper
examines a set of collected cases (experimental or in real applications)
and surveys the uses of LOD KOS in order to share practices and
ideas among communities and users. Through the viewpoints of a number of
different user groups, the functions of LOD KOS are examined from multiple
dimensions. This paper focuses on the LOD dataset producers, vocabulary
producers, and researchers (as end-users of KOS).
Comment: 31 pages, 12 figures, accepted paper in the International Journal on Digital Libraries
Context-sensitive Spelling Correction Using Google Web 1T 5-Gram Information
In computing, spell checking is the process of detecting and sometimes
providing spelling suggestions for incorrectly spelled words in a text.
Basically, a spell checker is a computer program that uses a dictionary of
words to perform spell checking; the bigger the dictionary, the higher the
error detection rate. Because spell checkers are based on regular
dictionaries, they suffer from a data sparseness problem: they cannot capture
the large vocabulary of proper names, domain-specific terms, technical jargon,
special acronyms, and terminology. As a result, they exhibit a low error
detection rate and often fail to catch major errors in the
text. This paper proposes a new context-sensitive spelling correction method
for detecting and correcting non-word and real-word errors in digital text
documents. The approach hinges on statistics from the Google Web 1T 5-gram
data set, which consists of a large volume of n-gram word sequences extracted
from the World Wide Web. Fundamentally, the proposed method comprises an error
detector that detects misspellings, a candidate spellings generator based on a
character 2-gram model that generates correction suggestions, and an error
corrector that performs contextual error correction. Experiments conducted on a
set of text documents from different domains and containing misspellings
showed an outstanding spelling error correction rate and a drastic reduction of
both non-word and real-word errors. In a future study, the proposed algorithm
is to be parallelized to lower the computational cost of the error
detection and correction processes.
Comment: LACSC - Lebanese Association for Computational Sciences -
http://www.lacsc.or
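
To make the proposed pipeline concrete, here is a minimal Python sketch of the three stages the abstract names: error detection against a lexicon, candidate generation by character 2-gram similarity, and contextual correction by n-gram counts. The tiny LEXICON and NGRAM_COUNTS tables are invented placeholders standing in for a real dictionary and the Google Web 1T 5-gram data set, and the scoring is deliberately simplified.

# Sketch of context-sensitive spelling correction; data tables are toy stand-ins.
LEXICON = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog", "form", "from"}

# (left, word, right) -> frequency, standing in for Web 1T n-gram counts.
NGRAM_COUNTS = {
    ("jumps", "form", "the"): 40,
    ("jumps", "from", "the"): 90000,
}

def char_bigrams(word):
    padded = "#" + word + "#"          # pad so first/last letters contribute
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def candidates(word, k=5):
    """Rank lexicon words by Dice similarity of character 2-grams."""
    source = char_bigrams(word)
    def dice(w):
        target = char_bigrams(w)
        return 2 * len(source & target) / (len(source) + len(target))
    return sorted(LEXICON, key=dice, reverse=True)[:k]

def correct(tokens):
    """Replace a token when a candidate's trigram context is more frequent
    (real-word errors) or when the token is not in the lexicon (non-words)."""
    out = list(tokens)
    for i, tok in enumerate(tokens):
        left = tokens[i - 1] if i > 0 else ""
        right = tokens[i + 1] if i + 1 < len(tokens) else ""
        count = lambda w: NGRAM_COUNTS.get((left, w, right), 0)
        best = max(candidates(tok), key=count)
        if count(best) > count(tok) or tok not in LEXICON:
            out[i] = best
    return out

print(correct(["jumps", "form", "the"]))   # -> ['jumps', 'from', 'the']

Here the real-word error "form" is corrected to "from" because its trigram context is far more frequent; a production system would use the full 5-gram windows and back off across n-gram orders.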
Natural language processing
Beginning with the basic issues of NLP, this chapter aims to chart the major research activities in this area since the last ARIST chapter in 1996 (Haas, 1996), including: (i) natural language text processing systems - text summarization, information extraction, information retrieval, etc., including domain-specific applications; (ii) natural language interfaces; (iii) NLP in the context of the World Wide Web and digital libraries; and (iv) evaluation of NLP systems.
A semantic web services-based infrastructure for context-adaptive process support
Current technologies aimed at supporting processes - whether it is a business or a learning process - primarily follow a metadata- and data-centric paradigm. Whereas process metadata is usually based on a specific standard specification - such as the Business Process Modeling Notation (BPMN) or the IMS Learning Design standard - the allocation of resources is done manually at design time, and the data used is often specific to one process context only. These facts limit the reusability of process models across different standards and contexts. To overcome these issues, we introduce an innovative Semantic Web Service-based framework aimed at changing the current paradigm to a context-adaptive, service-oriented approach. Following the idea of layered semantic abstractions, our approach supports the development of abstract semantic process models - reusable across different contexts and standards - that enable dynamic adaptation to specific actor needs and objectives. To illustrate the application of our framework and establish its feasibility, we describe a prototypical application in the E-Learning domain.
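
As a rough, hedged illustration of this paradigm shift (not the authors' actual framework, whose logic lives in Semantic Web Service descriptions rather than Python), the sketch below binds an abstract process step to a concrete service at run time based on the actor's context; every registry entry, context key, and name here is invented for the example.

# Toy late binding of abstract process steps to services; all names invented.
REGISTRY = [
    {"capability": "assess",
     "context": {"domain": "e-learning", "level": "beginner"},
     "service": lambda actor: "multiple-choice quiz for " + actor},
    {"capability": "assess",
     "context": {"domain": "e-learning", "level": "advanced"},
     "service": lambda actor: "open-ended project review for " + actor},
]

ABSTRACT_PROCESS = ["assess"]      # the model names capabilities, not services

def resolve(capability, context):
    """Pick the registered service whose context annotations best match."""
    matching = [e for e in REGISTRY if e["capability"] == capability]
    overlap = lambda e: sum(context.get(k) == v for k, v in e["context"].items())
    return max(matching, key=overlap)["service"]

ctx = {"domain": "e-learning", "level": "advanced"}
for step in ABSTRACT_PROCESS:
    print(resolve(step, ctx)("Alice"))   # -> open-ended project review for Alice

The design point is only that resource allocation moves from design time to run time: the process model stays abstract and reusable, while the registry and context matching absorb the standard- and context-specific details.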
The National Transport Data Framework
Report by Professor Peter Landshoff (Cambridge University) and
Professor John Polak (Imperial College London) on a project for
the Department for Transport.
The NTDF is designed to be a resource for data owners to deposit descriptions
into a central catalogue, so that people can search for data, find it,
and understand its characteristics. The value of this is to individuals, to
commercial organizations, and to public bodies. For example, services that
provide better information to travellers will help to make their journey
less stressful and persuade them to make more use of public transport.
Transport operators need very diverse information - demographic,
geographical, economic, etc. - to help them plan developments to their
services, and policy makers need a similar range of information to help them
decide how to divide their budget and afterwards to evaluate how valuable it
has been.
This work was supported by the Department for Transport (DfT).
HUDDL for description and archive of hydrographic binary data
Many of the attempts to introduce a universal hydrographic binary data format have failed or have been only partially successful. In essence, this is because such formats either have to simplify the data to such an extent that they only support the lowest common subset of all the formats covered, or they attempt to be a superset of all formats and quickly become cumbersome. Neither choice works well in practice. This paper presents a different approach: a standardized description of (past, present, and future) data formats using the Hydrographic Universal Data Description Language (HUDDL), a descriptive language implemented using the Extensible Markup Language (XML). That is, XML is used to provide a structural and physical description of a data format, rather than the content of a particular file. Done correctly, this opens the possibility of automatically generating both multi-language data parsers and documentation for format specifications based on their HUDDL descriptions, as well as providing easy version control of them. This solution also provides a powerful approach for archiving a structural description of data along with the data, so that binary data will be easy to access in the future. Intending to provide a relatively low-effort solution for indexing the wide range of existing formats, we suggest the creation of a catalogue of format descriptions, each capturing the logical and physical specification of a given data format (with its subsequent upgrades). A C/C++ parser code generator is used as an example prototype of one of the possible advantages of adopting such a hydrographic data format catalogue.
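
The following Python sketch illustrates the general idea under loudly stated assumptions: the XML element and attribute names below are invented for the example and are not the actual HUDDL schema. A structural description of a binary record is read, and a parser for that record is generated from the description rather than hand-written.

# Generate a binary-record parser from an XML format description.
# The DESCRIPTION schema is hypothetical, not the real HUDDL grammar.
import struct
import xml.etree.ElementTree as ET

DESCRIPTION = """
<format name="toy_sounding_record" endian="little">
  <field name="beam_id" type="uint16"/>
  <field name="depth_m" type="float32"/>
  <field name="quality" type="uint8"/>
</format>
"""

STRUCT_CODES = {"uint8": "B", "uint16": "H", "uint32": "I", "float32": "f"}

def build_parser(xml_text):
    """Turn a format description into a function that parses one record."""
    root = ET.fromstring(xml_text)
    prefix = "<" if root.get("endian") == "little" else ">"
    fields = root.findall("field")
    names = [f.get("name") for f in fields]
    fmt = prefix + "".join(STRUCT_CODES[f.get("type")] for f in fields)
    size = struct.calcsize(fmt)
    def parse(buf, offset=0):
        return dict(zip(names, struct.unpack_from(fmt, buf, offset))), offset + size
    return parse

parse = build_parser(DESCRIPTION)
record = struct.pack("<HfB", 7, 123.5, 3)            # a fabricated record
print(parse(record)[0])    # {'beam_id': 7, 'depth_m': 123.5, 'quality': 3}

Because the description, not the code, carries the format knowledge, the same XML file can drive parser generation in several languages and serve as archival documentation alongside the data itself.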
Embedding Web-based Statistical Translation Models in Cross-Language Information Retrieval
Although more and more language pairs are covered by machine translation
services, there are still many pairs that lack translation resources.
Cross-language information retrieval (CLIR) is an application which needs
translation functionality of a relatively low level of sophistication since
current models for information retrieval (IR) are still based on a
bag-of-words representation. The Web provides a vast resource for the automatic construction
of parallel corpora which can be used to train statistical translation models
automatically. The resulting translation models can be embedded in several ways
in a retrieval model. In this paper, we will investigate the problem of
automatically mining parallel texts from the Web and different ways of
integrating the translation models within the retrieval process. Our
experiments on standard test collections for CLIR show that the Web-based
translation models can surpass commercial MT systems in CLIR tasks. These
results open the perspective of constructing a fully automatic query
translation device for CLIR at very low cost.
Comment: 37 pages
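
As a hedged sketch of one way a translation model can be embedded in the retrieval model (the paper investigates several integration strategies), the Python fragment below scores documents with a smoothed query-likelihood model in which each source-language query term is a weighted mixture over its translations. The toy TRANSLATION table stands in for a model trained on Web-mined parallel corpora.

# Query-likelihood CLIR with an embedded word-translation table (toy data).
import math
from collections import Counter

# p(target word | source word), as learned from aligned parallel texts.
TRANSLATION = {
    "chat": {"cat": 0.8, "kitty": 0.2},    # French query words -> English
    "noir": {"black": 0.9, "dark": 0.1},
}

DOCS = {
    "d1": "the black cat sat on the mat".split(),
    "d2": "a dark night in the city".split(),
}
COLLECTION = Counter(w for doc in DOCS.values() for w in doc)
TOTAL = sum(COLLECTION.values())

def score(query, doc, lam=0.5):
    """log p(query | doc), Jelinek-Mercer smoothed against the collection."""
    tf, n = Counter(doc), len(doc)
    logp = 0.0
    for s in query:
        p = sum(pt * ((1 - lam) * tf[t] / n + lam * COLLECTION[t] / TOTAL)
                for t, pt in TRANSLATION.get(s, {}).items())
        logp += math.log(p or 1e-12)       # floor for untranslatable terms
    return logp

query = ["chat", "noir"]                   # "black cat" in French
for doc_id in sorted(DOCS, key=lambda d: -score(query, DOCS[d])):
    print(doc_id, round(score(query, DOCS[doc_id]), 3))   # d1 ranks first

Folding the translation probabilities into the document model, rather than picking a single best translation per term, lets weak or ambiguous translations contribute proportionally instead of being discarded.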
Measuring praise and criticism: Inference of semantic orientation from association
The evaluative character of a word is called its semantic orientation. Positive semantic orientation indicates praise (e.g., "honest", "intrepid") and negative semantic orientation indicates criticism (e.g., "disturbing", "superfluous"). Semantic orientation varies in both direction (positive or negative) and degree (mild to strong). An automated system for measuring semantic orientation would have applications in text classification, text filtering, tracking opinions in online discussions, analysis of survey responses, and automated chat systems (chatbots). This paper introduces a method for inferring the semantic orientation of a word from its statistical association with a set of positive and negative paradigm words. Two instances of this approach are evaluated, based on two different statistical measures of word association: pointwise mutual information (PMI) and latent semantic analysis (LSA). The method is experimentally tested with 3,596 words (including adjectives, adverbs, nouns, and verbs) that have been manually labeled positive (1,614 words) and negative (1,982 words). The method attains an accuracy of 82.8% on the full test set, but the accuracy rises above 95% when the algorithm is allowed to abstain from classifying mild words.
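
A minimal Python sketch of the PMI variant follows: a word's semantic orientation is its total association with a few positive paradigm words minus its association with negative ones. The paradigm lists are shortened and the counts come from a four-sentence toy corpus rather than the large collection behind the reported figures.

# SO-PMI on a toy corpus: orientation = PMI with positive paradigms
# minus PMI with negative paradigms.
import math
from collections import Counter
from itertools import combinations

POSITIVE = ["good", "nice", "excellent"]
NEGATIVE = ["bad", "nasty", "poor"]

CORPUS = [
    "the service was good and the staff were nice".split(),
    "an excellent and honest review".split(),
    "a bad and disturbing experience with poor support".split(),
    "nasty and superfluous paperwork".split(),
]

word_count = Counter(w for sent in CORPUS for w in sent)
pair_count = Counter()
for sent in CORPUS:                       # sentence-level co-occurrence
    for a, b in combinations(sorted(set(sent)), 2):
        pair_count[(a, b)] += 1

N = sum(word_count.values())

def pmi(w1, w2):
    joint = pair_count[tuple(sorted((w1, w2)))]
    if not joint:
        return 0.0
    return math.log2(joint * N / (word_count[w1] * word_count[w2]))

def semantic_orientation(word):
    return (sum(pmi(word, p) for p in POSITIVE)
            - sum(pmi(word, n) for n in NEGATIVE))

for w in ("honest", "disturbing"):
    print(w, round(semantic_orientation(w), 2))   # honest > 0, disturbing < 0

In the paper's setting the associations are estimated from a far larger collection, which is what gives such small paradigm sets their reach; the toy corpus only shows the sign of the score behaving as intended.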