73 research outputs found
CLEF 2005: Ad Hoc track overview
We describe the objectives and organization of the CLEF 2005 ad hoc track and discuss the main characteristics of the tasks offered to test monolingual, bilingual and multilingual textual document retrieval. The performance achieved for each task is presented and a preliminary analysis of results is given. The paper focuses in particular on the multilingual tasks which reused the test collection created in CLEF 2003 in an attempt to see if an improvement in system performance over time could be measured, and also to examine the multilingual results merging problem
Assessing relevance using automatically translated documents for cross-language information retrieval
This thesis focuses on the Relevance Feedback (RF) process, and the scenario considered is that of a Portuguese-English Cross-Language Information Retrieval (CUR) system. CUR deals with the retrieval of documents in one natural language in response to a query expressed in another language. RF is an automatic process for query reformulation. The idea behind it is that users are unlikely to produce perfect
queries, especially if given just one attempt.The process aims at improving the queryspecification, which will lead to more relevant documents being retrieved. The method consists of asking the user to analyse an initial sample of documents retrieved in response to a query and judge them for relevance.
In that context, two main questions were posed. The first one relates to the user's ability in assessing the relevance of texts in a foreign language, texts hand translated into their language and texts automatically translated into their language. The second question concerns the relationship between the accuracy of the participant's judgements and the improvement achieved through the RF process.
In order to answer those questions, this work performed an experiment in which Portuguese speakers were asked to judge the relevance of English documents, documents hand-translated to Portuguese, and documents automatically translated to Portuguese. The results show that machine translation is as effective as hand translation in aiding users to assess relevance. In addition, the impact of misjudged
documents on the performance of RF is overall just moderate, and varies greatly for different query topics.
This work advances the existing research on RF by considering a CUR scenario and carrying out user experiments, which analyse aspects of RF and CUR that remained unexplored until now. The contributions of this work also include: the investigation of CUR using a new language pair; the design and implementation of a stemming algorithm for Portuguese; and the carrying out of several experiments using Latent Semantic Indexing which contribute data points to the CUR theory
CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines
Based on the information provided by European projects and national initiatives related to multimedia search as well as domains experts that participated in the CHORUS Think-thanks and workshops, this document reports on the state of the art related to multimedia content search from, a technical, and socio-economic perspective.
The technical perspective includes an up to date view on content based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark inititiatives to measure the performance of multimedia search engines.
From a socio-economic perspective we inventorize the impact and legal consequences of these technical advances and point out future directions of research
Developing a distributed electronic health-record store for India
The DIGHT project is addressing the problem of building a scalable and highly available information store for the Electronic Health Records (EHRs) of the over one billion citizens of India
Improved cross-language information retrieval via disambiguation and vocabulary discovery
Cross-lingual information retrieval (CLIR) allows people to find documents irrespective of the language used in the query or document. This thesis is concerned with the development of techniques to improve the effectiveness of Chinese-English CLIR. In Chinese-English CLIR, the accuracy of dictionary-based query translation is limited by two major factors: translation ambiguity and the presence of out-of-vocabulary (OOV) terms. We explore alternative methods for translation disambiguation, and demonstrate new techniques based on a Markov model and the use of web documents as a corpus to provide context for disambiguation. This simple disambiguation technique has proved to be extremely robust and successful. Queries that seek topical information typically contain OOV terms that may not be found in a translation dictionary, leading to inappropriate translations and consequent poor retrieval performance. Our novel OOV term translation method is based on the Chinese authorial practice of including unfamiliar English terms in both languages. It automatically extracts correct translations from the web and can be applied to both Chinese-English and English-Chinese CLIR. Our OOV translation technique does not rely on prior segmentation and is thus free from seg mentation error. It leads to a significant improvement in CLIR effectiveness and can also be used to improve Chinese segmentation accuracy. Good quality translation resources, especially bilingual dictionaries, are valuable resources for effective CLIR. We developed a system to facilitate construction of a large-scale translation lexicon of Chinese-English OOV terms using the web. Experimental results show that this method is reliable and of practical use in query translation. In addition, parallel corpora provide a rich source of translation information. We have also developed a system that uses multiple features to identify parallel texts via a k-nearest-neighbour classifier, to automatically collect high quality parallel Chinese-English corpora from the web. These two automatic web mining systems are highly reliable and easy to deploy. In this research, we provided new ways to acquire linguistic resources using multilingual content on the web. These linguistic resources not only improve the efficiency and effectiveness of Chinese-English cross-language web retrieval; but also have wider applications than CLIR
Evaluation Methodologies for Visual Information Retrieval and Annotation
Die automatisierte Evaluation von Informations-Retrieval-Systemen erlaubt
Performanz und Qualität der Informationsgewinnung zu bewerten. Bereits in
den 60er Jahren wurden erste Methodologien für die system-basierte
Evaluation aufgestellt und in den Cranfield Experimenten überprüft.
Heutzutage gehören Evaluation, Test und Qualitätsbewertung zu einem aktiven
Forschungsfeld mit erfolgreichen Evaluationskampagnen und etablierten
Methoden. Evaluationsmethoden fanden zunächst in der Bewertung von
Textanalyse-Systemen Anwendung. Mit dem rasanten Voranschreiten der
Digitalisierung wurden diese Methoden sukzessive auf die Evaluation von
Multimediaanalyse-Systeme übertragen. Dies geschah häufig, ohne die
Evaluationsmethoden in Frage zu stellen oder sie an die veränderten
Gegebenheiten der Multimediaanalyse anzupassen. Diese Arbeit beschäftigt
sich mit der system-basierten Evaluation von Indizierungssystemen für
Bildkollektionen. Sie adressiert drei Problemstellungen der Evaluation von
Annotationen: Nutzeranforderungen für das Suchen und Verschlagworten von
Bildern, Evaluationsmaße für die Qualitätsbewertung von
Indizierungssystemen und Anforderungen an die Erstellung visueller
Testkollektionen. Am Beispiel der Evaluation automatisierter
Photo-Annotationsverfahren werden relevante Konzepte mit Bezug zu
Nutzeranforderungen diskutiert, Möglichkeiten zur Erstellung einer
zuverlässigen Ground Truth bei geringem Kosten- und Zeitaufwand vorgestellt
und Evaluationsmaße zur Qualitätsbewertung eingeführt, analysiert und
experimentell verglichen. Traditionelle Maße zur Ermittlung der Performanz
werden in vier Dimensionen klassifiziert. Evaluationsmaße vergeben
üblicherweise binäre Kosten für korrekte und falsche Annotationen. Diese
Annahme steht im Widerspruch zu der Natur von Bildkonzepten. Das gemeinsame
Auftreten von Bildkonzepten bestimmt ihren semantischen Zusammenhang und
von daher sollten diese auch im Zusammenhang auf ihre Richtigkeit hin
überprüft werden. In dieser Arbeit wird aufgezeigt, wie semantische
Ähnlichkeiten visueller Konzepte automatisiert abgeschätzt und in den
Evaluationsprozess eingebracht werden können. Die Ergebnisse der Arbeit
inkludieren ein Nutzermodell für die konzeptbasierte Suche von Bildern,
eine vollständig bewertete Testkollektion und neue Evaluationsmaße für die
anforderungsgerechte Qualitätsbeurteilung von Bildanalysesystemen.Performance assessment plays a major role in the research on Information
Retrieval (IR) systems. Starting with the Cranfield experiments in the
early 60ies, methodologies for the system-based performance assessment
emerged and established themselves, resulting in an active research field
with a number of successful benchmarking activities. With the rise of the
digital age, procedures of text retrieval evaluation were often transferred
to multimedia retrieval evaluation without questioning their direct
applicability. This thesis investigates the problem of system-based
performance assessment of annotation approaches in generic image
collections. It addresses three important parts of annotation evaluation,
namely user requirements for the retrieval of annotated visual media,
performance measures for multi-label evaluation, and visual test
collections. Using the example of multi-label image annotation evaluation,
I discuss which concepts to employ for indexing, how to obtain a reliable
ground truth to moderate costs, and which evaluation measures are
appropriate. This is accompanied by a thorough analysis of related work on
system-based performance assessment in Visual Information Retrieval (VIR).
Traditional performance measures are classified into four dimensions and
investigated according to their appropriateness for visual annotation
evaluation. One of the main ideas in this thesis adheres to the common
assumption on the binary nature of the score prediction dimension in
annotation evaluation. However, the predicted concepts and the set of true
indexed concepts interrelate with each other. This work will show how to
utilise these semantic relationships for a fine-grained evaluation
scenario. Outcomes of this thesis result in a user model for concept-based
image retrieval, a fully assessed image annotation test collection, and a
number of novel performance measures for image annotation evaluation
Mixed-Language Arabic- English Information Retrieval
Includes abstract.Includes bibliographical references.This thesis attempts to address the problem of mixed querying in CLIR. It proposes mixed-language (language-aware) approaches in which mixed queries are used to retrieve most relevant documents, regardless of their languages. To achieve this goal, however, it is essential firstly to suppress the impact of most problems that are caused by the mixed-language feature in both queries and documents and which result in biasing the final ranked list. Therefore, a cross-lingual re-weighting model was developed. In this cross-lingual model, term frequency, document frequency and document length components in mixed queries are estimated and adjusted, regardless of languages, while at the same time the model considers the unique mixed-language features in queries and documents, such as co-occurring terms in two different languages. Furthermore, in mixed queries, non-technical terms (mostly those in non-English language) would likely overweight and skew the impact of those technical terms (mostly those in English) due to high document frequencies (and thus low weights) of the latter terms in their corresponding collection (mostly the English collection). Such phenomenon is caused by the dominance of the English language in scientific domains. Accordingly, this thesis also proposes reasonable re-weighted Inverse Document Frequency (IDF) so as to moderate the effect of overweighted terms in mixed queries
- …