Searching and organizing images across languages
With the continual growth of users on the Web
from a wide range of countries, supporting
such users in their search of cultural heritage
collections will grow in importance. In the
next few years, most growth in Internet
users will come from the Indian sub-continent
and China. Consequently, if holders of cultural
heritage collections wish their content to be
viewable by the full range of users coming to
the Internet, the range of languages that they
need to support will have to grow. This paper
will present recent work conducted at the
University of Sheffield (and now being
implemented in BRICKS) on how to use
automatic translation to provide search and
organisation facilities for a historical image
search engine. The system allows users to
search for images in seven different languages,
providing means for the user to examine
translated image captions and browse retrieved
images organised by categories written in their
native language
Computerization of African languages-French dictionaries
This paper reports work done during the DiLAF project. The work consists of
converting five bilingual African language-French dictionaries originally in Word
format into XML following the LMF model. The languages processed are Bambara,
Hausa, Kanuri, Tamajaq and Songhai-zarma, still considered under-resourced
languages with respect to Natural Language Processing tools. Once converted, the
dictionaries are available online on the Jibiki platform for lookup and
modification. The DiLAF project is first presented. A description of each
dictionary follows. Then, the conversion methodology from .doc format to XML
files is presented. A specific point on the usage of Unicode follows. Then,
each step of the conversion into XML and LMF is detailed. The last part
presents the Jibiki lexical resources management platform used for the project.
Comment: 8 page
Corpus planning for Irish – dictionaries and terminology
A description of the evolution and current situation of corpus planning for Irish, which includes dictionaries, terminology and corpora
Applying digital content management to support localisation
The retrieval and presentation of digital content such as that on the World Wide Web (WWW) is a substantial area of research. While recent years have seen huge expansion in the size of web-based archives that can be searched efficiently by commercial search engines, the presentation of potentially relevant content is still limited to ranked document lists represented by simple text snippets or image keyframe surrogates. There is expanding interest in techniques to personalise the presentation of content to improve the richness and effectiveness of the user experience. One of the most significant challenges to achieving this is the increasingly multilingual nature of this data, and the need to provide suitably localised responses to users based on this content. The Digital Content Management (DCM) track of the Centre for Next Generation Localisation (CNGL) is seeking to develop technologies to support advanced personalised access and presentation of information by combining elements from the existing research areas of Adaptive Hypermedia and Information Retrieval. The combination of these technologies is intended to produce significant improvements in the way users access information. We review key features of these technologies and introduce early ideas for how these technologies can support localisation and localised content before concluding with some impressions of future directions in DCM
Faclair na Gàidhlig and Corpas na Gàidhlig: New Approaches Make Sense
For minority languages in the twenty-first century increasingly overshadowed by their global counterparts, language maintenance and revitalisation are of paramount importance. Closely linked to these issues is the question of corpus planning. This essay will focus on two projects in Scottish Gaelic which will play a major part in preserving and maintaining the language by providing it with high quality lexicographical and research resources: Faclair na Gàidhlig and Corpas na Gàidhlig respectively; the essay concludes with a brief case study on Gaelic numerals which illustrates how Corpas na Gàidhlig can powerfully enhance our understanding of Gaelic
Domain-specific query translation for multilingual access to digital libraries
Accurate high-coverage translation is a vital component of reliable cross language information access (CLIR) systems. This is particularly true of access to archives such as Digital Libraries which are often specific to certain domains. While general machine translation (MT) has been shown to be effective for CLIR tasks in information retrieval evaluation workshops, it is not well suited to specialized tasks where domain specific translations are required. We demonstrate that effective query translation in the domain of cultural heritage (CH) can be achieved by augmenting a standard MT system with domain-specific phrase dictionaries automatically mined from the online Wikipedia. Experiments using our hybrid translation system with sample query logs from users of CH websites demonstrate a large improvement in the accuracy of domain specific phrase detection and translation
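The hybrid idea in the abstract above (domain phrases translated from a dictionary, everything else passed to a general MT system) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' system: the function name translate_query, the greedy longest-match strategy, the max_len limit and the fallback_mt callable are all hypothetical, and the dictionary entries in the usage note are made-up examples rather than phrases mined from Wikipedia.

```python
from typing import Callable, Dict, List


def translate_query(
    query: str,
    phrase_dict: Dict[str, str],
    fallback_mt: Callable[[str], str],
    max_len: int = 4,
) -> str:
    """Translate known domain phrases via phrase_dict, the rest via fallback_mt.

    Hypothetical sketch: greedy longest-match over lower-cased tokens.
    """
    tokens = query.lower().split()
    out: List[str] = []
    pending: List[str] = []  # tokens without a dictionary match

    def flush() -> None:
        # Send the accumulated non-dictionary tokens to the general MT system.
        if pending:
            out.append(fallback_mt(" ".join(pending)))
            pending.clear()

    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in phrase_dict:
                flush()
                out.append(phrase_dict[phrase])
                i += n
                break
        else:
            # No phrase starts here; defer this token to the MT fallback.
            pending.append(tokens[i])
            i += 1
    flush()
    return " ".join(out)
```

For instance, with a toy dictionary {"mona lisa": "La Joconde"} and a stand-in MT function, translate_query("paintings like Mona Lisa", ...) routes "paintings like" to the MT fallback and takes "La Joconde" from the dictionary. A real system would also need to handle casing, morphology and ambiguous phrases, which this sketch ignores.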
General guidelines for designing bilingual low cost digital library services suitable for special library users in developing countries and the Arabic speaking world
The world is witnessing a considerable transformation from print-based formats to electronic-based formats thanks to advanced computing technology, which has had a profound impact on the dissemination of nearly all earlier publication formats as digital formats on computer networks. Text, still and moving images, sound tracks, music, and almost all known formats can be stored on and retrieved from computer magnetic disk. Over the last two decades, a number of special libraries and information centres in the Arab world have introduced electronic resources into their library services. Very few have implemented automated and integrated systems. Despite the importance of designing digital libraries not merely for access to or retrieval of information but also for the provision of electronic services, hardly any special library has started designing digital library services. Managers of special libraries and information centres in developing countries in general, and in the Arab world in particular, should start building their local digital libraries, as the benefits of such electronic services are considerable and well known, both for expanding research activities and for delivering services that satisfy the needs of targeted end-users. The aim of this paper is to provide general guidelines for the design of a low-cost special digital library providing the services most frequently required by various categories of special library users in developing countries. The paper also aims to illustrate strategies and methods that can be adopted for building such projects. Given that low cost is a basic design principle, using today's ICTs and freely available open-source software is the right path to that goal. The paper describes the phases and stages required for building such projects from scratch. It also aims to highlight the barriers and obstacles facing Arabic content and how such problems could be overcome
Cross-lingual Distillation for Text Classification
Cross-lingual text classification (CLTC) is the task of classifying documents
written in different languages into the same taxonomy of categories. This paper
presents a novel approach to CLTC that builds on model distillation, which
adapts and extends a framework originally proposed for model compression. Using
soft probabilistic predictions for the documents in a label-rich language as
the (induced) supervisory labels in a parallel corpus of documents, we train
classifiers successfully for new languages in which labeled training data are
not available. An adversarial feature adaptation technique is also applied
during the model training to reduce distribution mismatch. We conducted
experiments on two benchmark CLTC datasets, treating English as the source
language and German, French, Japanese and Chinese as the unlabeled target
languages. The proposed approach achieved better or comparable performance
relative to other state-of-the-art methods.
Comment: Accepted at ACL 2017; Code available at
https://github.com/xrc10/cross-distil
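The core of the distillation step described in this abstract, training a target-language classifier against the soft predictions made for the parallel source-language document, can be illustrated with a small loss function. This is a minimal sketch under stated assumptions, not the paper's code: the function names, the temperature value and the use of plain Python lists are hypothetical, and the adversarial feature adaptation mentioned in the abstract is not shown.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax; higher temperature gives softer outputs."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    teacher_logits: List[float],
    student_logits: List[float],
    temperature: float = 2.0,
) -> float:
    """Cross-entropy of the student's soft predictions against the teacher's.

    Assumption: teacher_logits come from a classifier applied to the
    source-language (e.g. English) side of a parallel document pair, and
    student_logits from the classifier being trained on the target side.
    """
    p = softmax(teacher_logits, temperature)  # soft supervisory labels
    q = softmax(student_logits, temperature)  # student predictions
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))
```

The loss is smallest when the student reproduces the teacher's distribution, so minimising it over a parallel corpus transfers the teacher's category judgements to the language that has no labeled data.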