2,832 research outputs found
Discovering missing Wikipedia inter-language links by means of cross-lingual word sense disambiguation
Wikipedia is a very popular online multilingual encyclopedia that contains millions of articles covering most written languages. Wikipedia pages contain monolingual hypertext links to other pages, as well as inter-language links to the corresponding pages in other languages. These inter-language links, however, are not always complete.
We present a prototype for a cross-lingual link discovery tool that discovers missing Wikipedia inter-language links to corresponding pages in other languages for ambiguous nouns. Although the framework of our approach is language-independent, we built a prototype for our application using Dutch as an input language and Spanish, Italian, English, French and German as target languages. The input for our system is a set of Dutch pages for a given ambiguous noun, and the output of the system is a set of links to the corresponding pages in our five target languages.
Our link discovery application contains two submodules. In a first step all pages are retrieved that contain a translation (in our five target languages) of the ambiguous word in the page title (Greedy crawler module), whereas in a second step all corresponding pages are linked between the focus language (being Dutch in our case) and the five target languages (Cross-lingual web page linker module). We consider this second step as a disambiguation task and apply a cross-lingual Word Sense Disambiguation framework to determine whether two pages refer to the same content or not
Research in Linguistic Engineering: Resources and Tools
In this paper we are revisiting some of the resources and tools developed by the members of the Intelligent Systems Research Group (GSI) at UPM as well as from the Information Retrieval and Natural Language Processing Research Group (IR&NLP) at UNED. Details about developed resources (corpus, software) and current interests and projects are given for the two groups. It is also included a brief summary and links into open source resources and tools developed by other groups of the MAVIR consortium
Use Case Oriented Medical Visual Information Retrieval & System Evaluation
Large amounts of medical visual data are produced daily in hospitals, while new imaging techniques continue to emerge. In addition, many images are made available continuously via publications in the scientific literature and can also be valuable for clinical routine, research and education. Information retrieval systems are useful tools to provide access to the biomedical literature and fulfil the information needs of medical professionals. The tools developed in this thesis can potentially help clinicians make decisions about difficult diagnoses via a case-based retrieval system based on a use case associated with a specific evaluation task. This system retrieves articles from the biomedical literature when querying with a case description and attached images. This thesis proposes a multimodal approach for medical case-based retrieval with focus on the integration of visual information connected to text. Furthermore, the ImageCLEFmed evaluation campaign was organised during this thesis promoting medical retrieval system evaluation
Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources
Translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system mostly depends on parallel data and phrases that are not present in the training data are not correctly translated. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data but using external morphological resources. A set of new phrase associations is added to translation and reordering models; each of them corresponds to a morphological variation of the source/target/both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translations and results showed improvements of the performance in terms of automatic scores (BLEU and Meteor) and reduction of out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model.JRC.G.2-Global security and crisis managemen
Challenges to knowledge representation in multilingual contexts
To meet the increasing demands of the complex inter-organizational processes and the demand for
continuous innovation and internationalization, it is evident that new forms of organisation are
being adopted, fostering more intensive collaboration processes and sharing of resources, in what
can be called collaborative networks (Camarinha-Matos, 2006:03). Information and knowledge are
crucial resources in collaborative networks, being their management fundamental processes to
optimize.
Knowledge organisation and collaboration systems are thus important instruments for the success of
collaborative networks of organisations having been researched in the last decade in the areas of
computer science, information science, management sciences, terminology and linguistics.
Nevertheless, research in this area didn’t give much attention to multilingual contexts of
collaboration, which pose specific and challenging problems. It is then clear that access to and
representation of knowledge will happen more and more on a multilingual setting which implies the
overcoming of difficulties inherent to the presence of multiple languages, through the use of
processes like localization of ontologies.
Although localization, like other processes that involve multilingualism, is a rather well-developed
practice and its methodologies and tools fruitfully employed by the language industry in the
development and adaptation of multilingual content, it has not yet been sufficiently explored as an
element of support to the development of knowledge representations - in particular ontologies -
expressed in more than one language. Multilingual knowledge representation is then an open
research area calling for cross-contributions from knowledge engineering, terminology, ontology
engineering, cognitive sciences, computational linguistics, natural language processing, and
management sciences.
This workshop joined researchers interested in multilingual knowledge representation, in a
multidisciplinary environment to debate the possibilities of cross-fertilization between knowledge
engineering, terminology, ontology engineering, cognitive sciences, computational linguistics,
natural language processing, and management sciences applied to contexts where multilingualism
continuously creates new and demanding challenges to current knowledge representation methods
and techniques.
In this workshop six papers dealing with different approaches to multilingual knowledge
representation are presented, most of them describing tools, approaches and results obtained in the
development of ongoing projects.
In the first case, Andrés Domínguez Burgos, Koen Kerremansa and Rita Temmerman present a
software module that is part of a workbench for terminological and ontological mining,
Termontospider, a wiki crawler that aims at optimally traverse Wikipedia in search of domainspecific
texts for extracting terminological and ontological information. The crawler is part of a tool
suite for automatically developing multilingual termontological databases, i.e. ontologicallyunderpinned
multilingual terminological databases. In this paper the authors describe the basic principles
behind the crawler and summarized the research setting in which the tool is currently tested.
In the second paper, Fumiko Kano presents a work comparing four feature-based similarity
measures derived from cognitive sciences. The purpose of the comparative analysis presented by the author is to verify the potentially most effective model that can be applied for mapping independent ontologies in a culturally influenced domain. For that, datasets based on standardized
pre-defined feature dimensions and values, which are obtainable from the UNESCO Institute for
Statistics (UIS) have been used for the comparative analysis of the similarity measures. The purpose
of the comparison is to verify the similarity measures based on the objectively developed datasets.
According to the author the results demonstrate that the Bayesian Model of Generalization provides
for the most effective cognitive model for identifying the most similar corresponding concepts
existing for a targeted socio-cultural community.
In another presentation, Thierry Declerck, Hans-Ulrich Krieger and Dagmar Gromann present an
ongoing work and propose an approach to automatic extraction of information from multilingual
financial Web resources, to provide candidate terms for building ontology elements or instances of
ontology concepts. The authors present a complementary approach to the direct
localization/translation of ontology labels, by acquiring terminologies through the access and
harvesting of multilingual Web presences of structured information providers in the field of finance,
leading to both the detection of candidate terms in various multilingual sources in the financial
domain that can be used not only as labels of ontology classes and properties but also for the
possible generation of (multilingual) domain ontologies themselves.
In the next paper, Manuel Silva, António Lucas Soares and Rute Costa claim that despite the
availability of tools, resources and techniques aimed at the construction of ontological artifacts,
developing a shared conceptualization of a given reality still raises questions about the principles
and methods that support the initial phases of conceptualization. These questions become, according
to the authors, more complex when the conceptualization occurs in a multilingual setting. To tackle
these issues the authors present a collaborative platform – conceptME - where terminological and
knowledge representation processes support domain experts throughout a conceptualization
framework, allowing the inclusion of multilingual data as a way to promote knowledge sharing and
enhance conceptualization and support a multilingual ontology specification.
In another presentation Frieda Steurs and Hendrik J. Kockaert present us TermWise, a large project
dealing with legal terminology and phraseology for the Belgian public services, i.e. the translation
office of the ministry of justice, a project which aims at developing an advanced tool including
expert knowledge in the algorithms that extract specialized language from textual data (legal
documents) and whose outcome is a knowledge database including Dutch/French equivalents for
legal concepts, enriched with the phraseology related to the terms under discussion.
Finally, Deborah Grbac, Luca Losito, Andrea Sada and Paolo Sirito report on the preliminary
results of a pilot project currently ongoing at UCSC Central Library, where they propose to adapt to
subject librarians, employed in large and multilingual Academic Institutions, the model used by
translators working within European Union Institutions. The authors are using User Experience
(UX) Analysis in order to provide subject librarians with a visual support, by means of “ontology
tables” depicting conceptual linking and connections of words with concepts presented according to
their semantic and linguistic meaning.
The organizers hope that the selection of papers presented here will be of interest to a broad audience, and will be a starting point for further discussion and cooperation
Proceedings
Proceedings of the Workshop
CHAT 2011: Creation, Harmonization and Application of Terminology Resources.
Editors: Tatiana Gornostay and Andrejs Vasiļjevs.
NEALT Proceedings Series, Vol. 12 (2011).
© 2011 The editors and contributors.
Published by
Northern European Association for Language
Technology (NEALT)
http://omilia.uio.no/nealt .
Electronically published at
Tartu University Library (Estonia)
http://hdl.handle.net/10062/16956
- …