Searching and organizing images across languages
With the continual growth of users on the Web
from a wide range of countries, supporting
such users in their search of cultural heritage
collections will grow in importance. In the
next few years, most of the growth in Internet
users will come from the Indian sub-continent
and China. Consequently, if holders of cultural
heritage collections wish their content to be
viewable by the full range of users coming to
the Internet, the range of languages that they
need to support will have to grow. This paper
will present recent work conducted at the
University of Sheffield (and now being
implemented in BRICKS) on how to use
automatic translation to provide search and
organisation facilities for a historical image
search engine. The system allows users to
search for images in seven different languages,
providing means for the user to examine
translated image captions and browse retrieved
images organised by categories written in their
native language.
Frequency Value Grammar and Information Theory
I previously laid the groundwork for Frequency Value Grammar (FVG) in papers I submitted to the proceedings of the 4th International Conference on Cognitive Science (2003), Sydney, Australia, and the Corpus Linguistics Conference (2003), Lancaster, UK. FVG is a formal syntax theoretically based in large part on Information Theory principles. FVG relies on dynamic physical principles external to the corpus which shape and mould the corpus, whereas generative grammar and other formal syntactic theories are based exclusively on patterns (fractals) found occurring within the well-formed portion of the corpus. However, FVG should not be confused with Probability Syntax (PS), as described by Manning (2003). PS is a corpus-based approach that will yield the probability distribution of possible syntax constructions over a fixed corpus. PS makes no distinction between well-formed and ill-formed sentence constructions and assumes everything found in the corpus is well formed. In contrast, FVG's primary objective is to distinguish between well-formed and ill-formed sentence constructions and, in so doing, relies on corpus-based parameters which determine sentence competency. In PS, a syntax of high probability will not necessarily yield a well-formed sentence. However, in FVG, a syntax or sentence construction of high "frequency value" will yield a well-formed sentence at least 95% of the time, satisfying most empirical standards. Moreover, in FVG, a sentence construction of "high frequency value" could very well be represented by an underlying syntactic construction of low probability as determined by PS. The characteristic "frequency values" calculated in FVG are not measures of probability but rather are fundamentally determined values derived from exogenous principles which impact and determine corpus-based parameters serving as an index of sentence competency. The theoretical framework of FVG has broad applications beyond that of formal syntax and NLP.
In this paper, I will demonstrate how FVG can be used as a model for improving the upper bound calculation of entropy of written English. Generally speaking, when a function word precedes an open class word, the backward n-gram analysis will be homomorphic with the information source and will result in frequency values more representative of co-occurrences in the information source.
User experiments with the Eurovision cross-language image retrieval system
In this paper we present Eurovision, a text-based system for cross-language (CL) image retrieval.
The system is evaluated by multilingual users for two search tasks with the system configured in
English and five other languages. To our knowledge this is the first published set of user
experiments for CL image retrieval. We show that: (1) it is possible to create a usable multilingual
search engine using little knowledge of any language other than English, (2) categorizing images
assists the user's search, and (3) there are differences in the way users search between the proposed
search tasks. Based on the two search tasks and user feedback, we describe important aspects of
any CL image retrieval system.
Presenting complex ideas using simple syntax in fiction for low-literate immigrant adults
A stemming algorithm for Latvian
The thesis covers construction, application and evaluation of a stemming algorithm for
advanced information searching and retrieval in Latvian databases. Its aim is to examine
the following two questions:
Is it possible to apply to Latvian a suffix removal algorithm originally designed
for English?
Can stemming in Latvian produce the same or better information retrieval results
than manual truncation?
In order to achieve these aims, the role and importance of automatic word conflation
for both document indexing and information retrieval are characterised. A review of the
literature, which analyzes and evaluates different types of stemming techniques and
the retrospective development of stemming algorithms, justifies the need to apply this
advanced IR method to Latvian as well. A comparative analysis of the morphological structure
of English and Latvian determined the selection of Porter's suffix
removal algorithm as a basis for the Latvian stemmer.
An extensive list of Latvian stopwords, including conjunctions, particles and adverbs,
was designed and added to the initial stemmer in order to eliminate insignificant words
from further processing. A number of specific modifications related to the
Latvian language were made to the structure and rules of the original stemming
algorithm.
Analysis of word stemming based on a Latvian electronic dictionary and Latvian text
fragments confirmed that the suffix removal technique can also be applied successfully
to the Latvian language. An evaluation study of user search statements revealed that the
stemming algorithm can, to a certain extent, improve the effectiveness of information
retrieval.
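As a rough illustration of the general idea described in this abstract (stopword filtering followed by longest-match suffix removal), a minimal sketch might look like the following. The stopword sample, the suffix list, and the minimum stem length are assumptions made purely for illustration; they are not the rules or the stopword list developed in the thesis, and the sketch omits the conditions and multi-step structure of Porter's algorithm.

```python
# Illustrative sketch only: STOPWORDS, SUFFIXES and MIN_STEM_LENGTH are
# assumed values, not the thesis's actual stopword list or stemming rules.

# A tiny sample of Latvian function words (conjunctions and particles).
STOPWORDS = {"un", "bet", "vai", "ka", "jo"}

# A few common Latvian inflectional endings, ordered longest first so that
# "-iem" is tried before "-s" or "-i".
SUFFIXES = sorted(
    ["iem", "as", "es", "is", "us", "ai", "s", "a", "e", "u", "i"],
    key=len,
    reverse=True,
)

# Refuse to strip a suffix if the remaining stem would be shorter than this.
MIN_STEM_LENGTH = 3


def stem(word):
    """Strip the longest matching suffix, keeping at least MIN_STEM_LENGTH characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Drop stopwords, then stem the remaining whitespace-separated tokens."""
    return [stem(token) for token in text.split() if token.lower() not in STOPWORDS]
```

With these assumed rules, the inflected forms "galds", "galdi" and "galdiem" all conflate to the same stem, which is the effect that manual truncation in a query would otherwise have to achieve by hand.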
Typologies of agreement: some problems from Kayardild
In this paper I describe a number of agreement-type phenomena in the Australian language Kayardild, and assess them against existing definitions, stating both the boundaries of what is to be considered agreement, and characteristics of prototypical agreement phenomena. Though conforming, prima facie, to definitions of agreement that stress semantically based covariance in inflections on different words, the Kayardild phenomena considered here pose a number of challenges to accepted views of agreement: the rich possibilities for stacking case-like agreement inflections emanating from different syntactic levels, the fact that inflections resulting from agreement may change the word class of their host, and the semantic categories involved, in particular tense/aspect/mood, which have been claimed not to be agreement categories on nominals. Two types of inflection, in particular - 'modal case' and 'associating case' - lie somewhere between prototypical agreement and prototypical government. Like agreement, but unlike government, they are triggered by inflectional rather than lexical features of the head, and appear on more than one constituent; like government, but unlike agreement, the semantic categories on head and dependent are not isomorphic. Other types of inflection, though unusual in the categories involved, the possibility of recursion, and their effects on the host's word class, are close to prototypical in terms of how they fare in Corbett's proposed tests for canonical agreement.
Handling unknown words in statistical latent-variable parsing models for Arabic, English and French
This paper presents a study of the impact of using simple and complex morphological clues to improve the classification of rare and unknown words for parsing. We compare this approach to a language-independent technique
often used in parsers which is based solely on word frequencies. This study is applied to three languages that exhibit different levels of morphological expressiveness: Arabic, French and English. We integrate information
about Arabic affixes and morphotactics into a PCFG-LA parser and obtain state-of-the-art accuracy. We also show that these morphological clues can be learnt automatically
from an annotated corpus.
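The contrast drawn in this abstract, between a frequency-only treatment of rare words and one that uses morphological clues, can be sketched roughly as follows. This is a generic illustration, not the paper's method: the threshold, the English suffix list, and the signature format are all assumptions, and the Arabic affix and morphotactic features the authors describe are not modelled here.

```python
from collections import Counter

# Hypothetical cutoff: words seen fewer times than this are treated as rare.
RARE_THRESHOLD = 2

# Assumed sample of English suffixes used as morphological clues.
SUFFIX_CLUES = ("ing", "ed", "ly", "s")


def frequency_signature(word):
    """Frequency-only baseline: every rare word collapses to one class."""
    return "UNK"


def morphological_signature(word):
    """Build a class label from simple surface clues of the word."""
    parts = ["UNK"]
    if word[:1].isupper():
        parts.append("CAP")
    if any(ch.isdigit() for ch in word):
        parts.append("NUM")
    lower = word.lower()
    for suffix in SUFFIX_CLUES:
        # Require at least two characters before the suffix.
        if lower.endswith(suffix) and len(lower) > len(suffix) + 1:
            parts.append("SUF_" + suffix)
            break
    return "-".join(parts)


def replace_rare(tokens, signature=morphological_signature, threshold=RARE_THRESHOLD):
    """Replace each token seen fewer than `threshold` times with its signature."""
    counts = Counter(tokens)
    return [tok if counts[tok] >= threshold else signature(tok) for tok in tokens]
```

Passing `signature=frequency_signature` gives the language-independent baseline, so the two strategies can be compared on the same token stream before training a parser on the rewritten corpus.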