
    Materials Review


    Searching and organizing images across languages

    With the continual growth of users on the Web from a wide range of countries, supporting such users in their search of cultural heritage collections will grow in importance. In the next few years, the growth areas of Internet users will come from the Indian sub-continent and China. Consequently, if holders of cultural heritage collections wish their content to be viewable by the full range of users coming to the Internet, the range of languages that they need to support will have to grow. This paper presents recent work conducted at the University of Sheffield (and now being implemented in BRICKS) on how to use automatic translation to provide search and organisation facilities for a historical image search engine. The system allows users to search for images in seven different languages, providing means for the user to examine translated image captions and browse retrieved images organised by categories written in their native language.

    Frequency Value Grammar and Information Theory

    I previously laid the groundwork for Frequency Value Grammar (FVG) in papers I submitted in the proceedings of the 4th International Conference on Cognitive Science (2003), Sydney, Australia, and the Corpus Linguistics Conference (2003), Lancaster, UK. FVG is a formal syntax theoretically based in large part on Information Theory principles. FVG relies on dynamic physical principles external to the corpus which shape and mould the corpus, whereas generative grammar and other formal syntactic theories are based exclusively on patterns (fractals) found occurring within the well-formed portion of the corpus. However, FVG should not be confused with Probability Syntax (PS), as described by Manning (2003). PS is a corpus-based approach that will yield the probability distribution of possible syntax constructions over a fixed corpus. PS makes no distinction between well-formed and ill-formed sentence constructions and assumes everything found in the corpus is well formed. In contrast, FVG's primary objective is to distinguish between well-formed and ill-formed sentence constructions and, in so doing, relies on corpus-based parameters which determine sentence competency. In PS, a syntax of high probability will not necessarily yield a well-formed sentence. However, in FVG, a syntax or sentence construction of high 'frequency value' will yield a well-formed sentence at least 95% of the time, satisfying most empirical standards. Moreover, in FVG, a sentence construction of high 'frequency value' could very well be represented by an underlying syntactic construction of low probability as determined by PS. The characteristic 'frequency values' calculated in FVG are not measures of probability but rather are fundamentally determined values derived from exogenous principles which impact and determine corpus-based parameters serving as an index of sentence competency. The theoretical framework of FVG has broad applications beyond that of formal syntax and NLP.
    In this paper, I will demonstrate how FVG can be used as a model for improving the upper bound calculation of entropy of written English. Generally speaking, when a function word precedes an open class word, the backward n-gram analysis will be homomorphic with the information source and will result in frequency values more representative of co-occurrences in the information source.
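To make the idea of forward versus backward n-gram analysis concrete, here is a minimal sketch of a per-token bigram cross-entropy computed in both directions. It is not FVG and does not compute 'frequency values'; the function name, the add-one smoothing, and the choice to train and score on the same tokens are all assumptions made for brevity. A real upper-bound estimate would score held-out text.

```python
import math
from collections import Counter

def bigram_cross_entropy(tokens, backward=False):
    """Per-token cross-entropy (in bits) of an add-one-smoothed bigram model.

    Illustration only: the model is trained and scored on the same tokens.
    With backward=True the sequence is reversed, so each word is predicted
    from the word that follows it in the original text.
    """
    if backward:
        tokens = tokens[::-1]
    vocab = set(tokens)
    context_counts = Counter(tokens[:-1])
    pair_counts = Counter(zip(tokens[:-1], tokens[1:]))
    total_bits = 0.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        # Add-one smoothing keeps every probability strictly positive.
        p = (pair_counts[(prev, cur)] + 1) / (context_counts[prev] + len(vocab))
        total_bits -= math.log2(p)
    return total_bits / (len(tokens) - 1)

text = "the cat sat on the mat and the dog sat on the rug".split()
forward_bits = bigram_cross_entropy(text)
backward_bits = bigram_cross_entropy(text, backward=True)
```

Comparing `forward_bits` and `backward_bits` on a real corpus is one simple way to see whether conditioning on the following word gives a tighter estimate than conditioning on the preceding one.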

    User experiments with the Eurovision cross-language image retrieval system

    In this paper we present Eurovision, a text-based system for cross-language (CL) image retrieval. The system is evaluated by multilingual users for two search tasks with the system configured in English and five other languages. To our knowledge this is the first published set of user experiments for CL image retrieval. We show that: (1) it is possible to create a usable multilingual search engine using little knowledge of any language other than English, (2) categorizing images assists the user's search, and (3) there are differences in the way users search between the proposed search tasks. Based on the two search tasks and user feedback, we describe important aspects of any CL image retrieval system.

    A stemming algorithm for Latvian

    The thesis covers the construction, application and evaluation of a stemming algorithm for advanced information searching and retrieval in Latvian databases. Its aim is to examine the following two questions: Is it possible to apply to Latvian a suffix removal algorithm originally designed for English? Can stemming in Latvian produce the same or better information retrieval results than manual truncation? In order to achieve these aims, the role and importance of automatic word conflation both for document indexing and information retrieval are characterised. A review of literature, which analyzes and evaluates different types of stemming techniques and the retrospective development of stemming algorithms, justifies the necessity to apply this advanced IR method also to Latvian. Comparative analysis of the morphological structure of both English and Latvian determined the selection of Porter's suffix removal algorithm as a basis for the Latvian stemmer. An extensive list of Latvian stopwords, including conjunctions, particles and adverbs, was designed and added to the initial stemmer in order to eliminate insignificant words from further processing. A number of specific modifications and changes related to the Latvian language were made to the structure and rules of the original stemming algorithm. Analysis of word stemming based on a Latvian electronic dictionary and Latvian text fragments confirmed that the suffix removal technique can be successfully applied also to the Latvian language. An evaluation study of user search statements revealed that the stemming algorithm can, to a certain extent, improve the effectiveness of information retrieval.
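The combination of a stopword list and suffix removal described above can be sketched as follows. This is only the simplest longest-match suffix stripping, not Porter's multi-step algorithm and not the thesis's rule set; the tiny stopword and suffix lists, the `min_stem` parameter, and the function names are illustrative assumptions.

```python
# Illustrative lists only; the thesis's actual stopwords and rules are not given here.
STOPWORDS = {"un", "bet", "vai"}          # a few common Latvian conjunctions
SUFFIXES = ("ām", "as", "ai", "a", "u")   # a few noun endings, chosen for the demo

def stem(word, min_stem=3):
    """Strip the longest matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

def index_terms(text):
    """Drop stopwords, then conflate the remaining words to their stems."""
    return [stem(w) for w in text.lower().split() if w not in STOPWORDS]

terms = index_terms("grāmatas un grāmatu")
```

Here the inflected forms of "grāmata" (book) are conflated to one index term, which is the effect that manual truncation achieves by hand in a search statement.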

    Typologies of agreement: some problems from Kayardild

    In this paper I describe a number of agreement-type phenomena in the Australian language Kayardild, and assess them against existing definitions, stating both the boundaries of what is to be considered agreement, and characteristics of prototypical agreement phenomena. Though conforming, prima facie, to definitions of agreement that stress semantically based covariance in inflections on different words, the Kayardild phenomena considered here pose a number of challenges to accepted views of agreement: the rich possibilities for stacking case-like agreement inflections emanating from different syntactic levels, the fact that inflections resulting from agreement may change the word class of their host, and the semantic categories involved, in particular tense/aspect/mood, which have been claimed not to be agreement categories on nominals. Two types of inflection in particular, 'modal case' and 'associating case', lie somewhere between prototypical agreement and prototypical government. Like agreement, but unlike government, they are triggered by inflectional rather than lexical features of the head, and appear on more than one constituent; like government, but unlike agreement, the semantic categories on head and dependent are not isomorphic. Other types of inflection, though unusual in the categories involved, the possibility of recursion, and their effects on the host's word class, are close to prototypical in terms of how they fare in Corbett's proposed tests for canonical agreement.

    Handling unknown words in statistical latent-variable parsing models for Arabic, English and French

    This paper presents a study of the impact of using simple and complex morphological clues to improve the classification of rare and unknown words for parsing. We compare this approach to a language-independent technique often used in parsers which is based solely on word frequencies. This study is applied to three languages that exhibit different levels of morphological expressiveness: Arabic, French and English. We integrate information about Arabic affixes and morphotactics into a PCFG-LA parser and obtain state-of-the-art accuracy. We also show that these morphological clues can be learnt automatically from an annotated corpus.
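The contrast between a frequency-only treatment of rare words and one that uses morphological clues can be sketched as below. This is a generic illustration, not the paper's method: the signature format, the English suffix list, the threshold, and the function names are all assumptions.

```python
from collections import Counter

def signature(word, suffixes=("ing", "ed", "ly", "s")):
    """Map a word to a coarse 'UNK' class using surface clues (hypothetical scheme)."""
    sig = "UNK"
    if word[:1].isupper():
        sig += "-CAP"
    if any(ch.isdigit() for ch in word):
        sig += "-NUM"
    for suffix in suffixes:
        # Require some material before the suffix so very short words are skipped.
        if word.lower().endswith(suffix) and len(word) > len(suffix) + 1:
            sig += "-" + suffix
            break
    return sig

def replace_rare(tokens, threshold=2, use_morphology=True):
    """Replace tokens seen fewer than `threshold` times.

    use_morphology=False gives the frequency-only baseline (a single UNK class);
    use_morphology=True keeps suffix and capitalisation clues in the class name.
    """
    counts = Counter(tokens)
    return [
        tok if counts[tok] >= threshold
        else (signature(tok) if use_morphology else "UNK")
        for tok in tokens
    ]

baseline = replace_rare(["the", "the", "walked"], use_morphology=False)
with_clues = replace_rare(["the", "the", "walked"])
```

With the clues retained, a parser can still tell that an unseen word ending in "ed" is more likely a verb than a noun, which a single undifferentiated unknown-word class cannot express.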