3,934 research outputs found
Multilingual Schema Matching for Wikipedia Infoboxes
Recent research has taken advantage of Wikipedia's multilingualism as a
resource for cross-language information retrieval and machine translation, as
well as proposed techniques for enriching its cross-language structure. The
availability of documents in multiple languages also opens up new opportunities
for querying structured Wikipedia content, and in particular, to enable answers
that straddle different languages. As a step towards supporting such queries,
in this paper, we propose a method for identifying mappings between attributes
from infoboxes that come from pages in different languages. Our approach finds
mappings in a completely automated fashion. Because it does not require
training data, it is scalable: not only can it be used to find mappings between
many language pairs, but it is also effective for languages that are
under-represented and lack sufficient training samples. Another important
benefit of our approach is that it does not depend on syntactic similarity
between attribute names, and thus, it can be applied to language pairs that
have distinct morphologies. We have performed an extensive experimental
evaluation using a corpus consisting of pages in Portuguese, Vietnamese, and
English. The results show that not only does our approach obtain high precision
and recall, but it also outperforms state-of-the-art techniques. We also
present a case study which demonstrates that the multilingual mappings we
derive lead to substantial improvements in answer quality and coverage for
structured queries over Wikipedia content.Comment: VLDB201
Knowledge Organization Systems (KOS) in the Semantic Web: A Multi-Dimensional Review
Since the Simple Knowledge Organization System (SKOS) specification and its
SKOS eXtension for Labels (SKOS-XL) became formal W3C recommendations in 2009 a
significant number of conventional knowledge organization systems (KOS)
(including thesauri, classification schemes, name authorities, and lists of
codes and terms, produced before the arrival of the ontology-wave) have made
their journeys to join the Semantic Web mainstream. This paper uses "LOD KOS"
as an umbrella term to refer to all of the value vocabularies and lightweight
ontologies within the Semantic Web framework. The paper provides an overview of
what the LOD KOS movement has brought to various communities and users. These
are not limited to the colonies of the value vocabulary constructors and
providers, nor the catalogers and indexers who have a long history of applying
the vocabularies to their products. The LOD dataset producers and LOD service
providers, the information architects and interface designers, and researchers
in sciences and humanities, are also direct beneficiaries of LOD KOS. The paper
examines a set of the collected cases (experimental or in real applications)
and aims to find the usages of LOD KOS in order to share the practices and
ideas among communities and users. Through the viewpoints of a number of
different user groups, the functions of LOD KOS are examined from multiple
dimensions. This paper focuses on the LOD dataset producers, vocabulary
producers, and researchers (as end-users of KOS).Comment: 31 pages, 12 figures, accepted paper in International Journal on
Digital Librarie
Language technologies for a multilingual Europe
This volume of the series âTranslation and Multilingual Natural Language Processingâ includes most of the papers presented at the Workshop âLanguage Technology for a Multilingual Europeâ, held at the University of Hamburg on September 27, 2011 in the framework of the conference GSCL 2011 with the topic âMultilingual Resources and Multilingual Applicationsâ, along with several additional contributions. In addition to an overview article on Machine Translation and two contributions on the European initiatives META-NET and Multilingual Web, the volume includes six full research articles. Our intention with this workshop was to bring together various groups concerned with the umbrella topics of multilingualism and language technology, especially multilingual technologies. This encompassed, on the one hand, representatives from research and development in the field of language technologies, and, on the other hand, users from diverse areas such as, among others, industry, administration and funding agencies. The Workshop âLanguage Technology for a Multilingual Europeâ was co-organised by the two GSCL working groups âText Technologyâ and âMachine Translationâ (http://gscl.info) as well as by META-NET (http://www.meta-net.eu)
Doctor of Philosophy
dissertationThe explosion of structured Web data (e.g., online databases, Wikipedia infoboxes) creates many opportunities for integrating and querying these data that go far beyond the simple search capabilities provided by search engines. Although much work has been devoted to data integration in the database community, the Web brings new challenges: the Web-scale (e.g., the large and growing volume of data) and the heterogeneity in Web data. Because there are so much data, scalable techniques that require little or no manual intervention and that are robust to noisy data are needed. In this dissertation, we propose a new and effective approach for matching Web-form interfaces and for matching multilingual Wikipedia infoboxes. As a further step toward these problems, we propose a general prudent schema-matching framework that matches a large number of schemas effectively. Our comprehensive experiments for Web-form interfaces and Wikipedia infoboxes show that it can enable on-the-fly, automatic integration of large collections of structured Web data. Another problem we address in this dissertation is schema discovery. While existing integration approaches assume that the relevant data sources and their schemas have been identified in advance, schemas are not always available for structured Web data. Approaches exist that exploit information in Wikipedia to discover the entity types and their associate schemas. However, due to inconsistencies, sparseness, and noise from the community contribution, these approaches are error prone and require substantial human intervention. Given the schema heterogeneity in Wikipedia infoboxes, we developed a new approach that uses the structured information available in infoboxes to cluster similar infoboxes and infer the schemata for entity types. Our approach is unsupervised and resilient to the unpredictable skew in the entity class distribution. Our experiments, using over one hundred thousand infoboxes extracted from Wikipedia, indicate that our approach is effective and produces accurate schemata for Wikipedia entities
Access to recorded interviews: A research agenda
Recorded interviews form a rich basis for scholarly inquiry. Examples include oral histories, community memory projects, and interviews conducted for broadcast media. Emerging technologies offer the potential to radically transform the way in which recorded interviews are made accessible, but this vision will demand substantial investments from a broad range of research communities. This article reviews the present state of practice for making recorded interviews available and the state-of-the-art for key component technologies. A large number of important research issues are identified, and from that set of issues, a coherent research agenda is proposed
- âŚ