641 research outputs found

    Japanese/English Cross-Language Information Retrieval: Exploration of Query Translation and Transliteration

    Cross-language information retrieval (CLIR), where queries and documents are in different languages, has of late become one of the major topics within the information retrieval community. This paper proposes a Japanese/English CLIR system, in which we combine query translation and retrieval modules. We currently target the retrieval of technical documents, and therefore the performance of our system is highly dependent on the quality of the translation of technical terms. However, technical term translation is still problematic in that technical terms are often compound words, and thus new terms are progressively created by combining existing base words. In addition, Japanese often represents loanwords using its special phonograms. Consequently, existing dictionaries find it difficult to achieve sufficient coverage. To counter the first problem, we produce a Japanese/English dictionary for base words, and translate compound words on a word-by-word basis. We also use a probabilistic method to resolve translation ambiguity. For the second problem, we use a transliteration method, which maps words unlisted in the base word dictionary to their phonetic equivalents in the target language. We evaluate our system using a test collection for CLIR, and show that both the compound word translation and transliteration methods improve the system performance.
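    The word-by-word compound translation with probabilistic disambiguation described in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the romanized dictionary entries, the bigram probabilities, and the function name `translate_compound` are all hypothetical, and scoring candidates by a product of target-language bigram probabilities is only one plausible reading of "a probabilistic method".

```python
from itertools import product

# Hypothetical toy base-word dictionary (romanized source word -> candidate translations).
BASE_DICT = {
    "joho": ["information", "intelligence"],
    "kensaku": ["retrieval", "search"],
}

# Hypothetical target-language bigram probabilities (toy numbers, not from the paper).
BIGRAM_PROB = {
    ("information", "retrieval"): 0.6,
    ("information", "search"): 0.3,
    ("intelligence", "retrieval"): 0.05,
    ("intelligence", "search"): 0.05,
}


def translate_compound(base_words, floor=1e-6):
    """Translate a compound word by word and keep the most probable combination.

    Raises KeyError for a base word missing from BASE_DICT; the abstract
    handles that case with transliteration, which is not sketched here.
    """
    candidates = [BASE_DICT[word] for word in base_words]
    best, best_score = None, -1.0
    # Enumerate every combination of candidate translations.
    for combo in product(*candidates):
        score = 1.0
        # Score the combination by the product of adjacent-pair probabilities.
        for pair in zip(combo, combo[1:]):
            score *= BIGRAM_PROB.get(pair, floor)
        if score > best_score:
            best, best_score = combo, score
    return list(best), best_score


if __name__ == "__main__":
    print(translate_compound(["joho", "kensaku"]))
```

    Exhaustive enumeration is adequate for short compounds; longer compounds would call for dynamic programming over the candidate lattice.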

    Mind the Gap: Another look at the problem of the semantic gap in image retrieval

    This paper attempts to review and characterise the problem of the semantic gap in image retrieval and the attempts being made to bridge it. In particular, we draw from our own experience in user queries, automatic annotation and ontological techniques. The first section of the paper describes a characterisation of the semantic gap as a hierarchy between the raw media and full semantic understanding of the media's content. The second section discusses real users' queries with respect to the semantic gap. The final sections of the paper describe our own experience in attempting to bridge the semantic gap. In particular we discuss our work on auto-annotation and semantic-space models of image retrieval in order to bridge the gap from the bottom up, and the use of ontologies, which capture more semantics than keyword object labels alone, as a technique for bridging the gap from the top down

    On the Feasibility of Automated Detection of Allusive Text Reuse

    The detection of allusive text reuse is particularly challenging due to the sparse evidence on which allusive references rely, commonly based on no or very few shared words. Arguably, lexical semantics can be resorted to since uncovering semantic relations between words has the potential to increase the support underlying the allusion and alleviate the lexical sparsity. A further obstacle is the lack of evaluation benchmark corpora, largely due to the highly interpretative character of the annotation process. In the present paper, we aim to elucidate the feasibility of automated allusion detection. We approach the matter from an Information Retrieval perspective in which referencing texts act as queries and referenced texts as relevant documents to be retrieved, and estimate the difficulty of benchmark corpus compilation by a novel inter-annotator agreement study on query segmentation. Furthermore, we investigate to what extent the integration of lexical semantic information derived from distributional models and ontologies can aid in retrieving cases of allusive reuse. The results show that (i) despite low agreement scores, using manual queries considerably improves retrieval performance with respect to a windowing approach, and that (ii) retrieval performance can be moderately boosted with distributional semantics.

    Bridging the semantic gap in content-based image retrieval.

    To manage large image databases, Content-Based Image Retrieval (CBIR) emerged as a new research subject. CBIR involves the development of automated methods to use visual features in searching and retrieving. Unfortunately, the performance of most CBIR systems is inherently constrained by the low-level visual features because they cannot adequately express the user's high-level concepts. This is known as the semantic gap problem. This dissertation introduces a new approach to CBIR that attempts to bridge the semantic gap. Our approach includes four components. The first one learns a multi-modal thesaurus that associates low-level visual profiles with high-level keywords. This is accomplished through image segmentation, feature extraction, and clustering of image regions. The second component uses the thesaurus to annotate images in an unsupervised way. This is accomplished through fuzzy membership functions to label new regions based on their proximity to the profiles in the thesaurus. The third component consists of an efficient and effective method for fusing the retrieval results from the multi-modal features. Our method is based on learning and adapting fuzzy membership functions to the distribution of the features' distances and assigning a degree of worthiness to each feature. The fourth component provides the user with the option to perform hybrid querying and query expansion. This allows the enrichment of a visual query with textual data extracted from the automatically labeled images in the database. The four components are integrated into a complete CBIR system that can run in three different and complementary modes. The first mode allows the user to query using an example image. The second mode allows the user to specify positive and/or negative sample regions that should or should not be included in the retrieved images. The third mode uses a Graphical Text Interface to allow the user to browse the database interactively using a combination of low-level features and high-level concepts. The proposed system and all of its components and modes are implemented and validated using a large data collection for accuracy, performance, and improvement over traditional CBIR techniques.
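    The second component above, labeling a new region by its proximity to thesaurus profiles through fuzzy membership functions, might look roughly like the sketch below. The abstract does not give the dissertation's actual membership function, so the fuzzy c-means style formula used here is an assumption, and the thesaurus entries, feature vectors, and the name `fuzzy_memberships` are hypothetical.

```python
import math

# Hypothetical thesaurus: keyword -> visual profile (toy 2-D feature centroid).
THESAURUS = {
    "sky": (0.1, 0.9),
    "grass": (0.2, 0.3),
    "sand": (0.8, 0.6),
}


def fuzzy_memberships(feature, profiles, m=2.0, eps=1e-12):
    """Return a membership degree in [0, 1] for each keyword; degrees sum to 1.

    Uses the fuzzy c-means membership form (an assumption, not the
    dissertation's stated function): closer profiles get higher membership,
    and the fuzzifier m > 1 controls how soft the assignment is.
    """
    inverse = {
        label: 1.0 / (math.dist(feature, center) + eps) ** (2.0 / (m - 1.0))
        for label, center in profiles.items()
    }
    total = sum(inverse.values())
    return {label: value / total for label, value in inverse.items()}


if __name__ == "__main__":
    region_feature = (0.15, 0.85)  # hypothetical feature vector of a new region
    memberships = fuzzy_memberships(region_feature, THESAURUS)
    print(max(memberships, key=memberships.get), memberships)
```

    Keeping the full membership dictionary, rather than only the top label, lets a region carry several keywords with different degrees, which is the point of a fuzzy rather than crisp assignment.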

    On Improving Generalization of CNN-Based Image Classification with Delineation Maps Using the CORF Push-Pull Inhibition Operator

    Deployed image classification pipelines are typically dependent on the images captured in real-world environments. This means that images might be affected by different sources of perturbations (e.g. sensor noise in low-light environments). The main challenge arises from the fact that image quality directly impacts the reliability and consistency of classification tasks. This challenge has, hence, attracted wide interest within the computer vision communities. We propose a transformation step that attempts to enhance the generalization ability of CNN models in the presence of unseen noise in the test set. Concretely, the delineation maps of given images are determined using the CORF push-pull inhibition operator. Such an operation transforms an input image into a space that is more robust to noise before being processed by a CNN. We evaluated our approach on the Fashion MNIST data set with an AlexNet model. It turned out that the proposed CORF-augmented pipeline achieved comparable results on noise-free images to those of a conventional AlexNet classification model without CORF delineation maps, but it consistently achieved significantly superior performance on test images perturbed with different levels of Gaussian and uniform noise.

    Saliency for Image Description and Retrieval

    We live in a world where we are surrounded by ever increasing numbers of images. More often than not, these images have very little metadata by which they can be indexed and searched. In order to avoid information overload, techniques need to be developed to enable these image collections to be searched by their content. Much of the previous work on image retrieval has used global features such as colour and texture to describe the content of the image. However, these global features are insufficient to accurately describe the image content when different parts of the image have different characteristics. This thesis initially discusses how this problem can be circumvented by using salient interest regions to select the areas of the image that are most interesting and generate local descriptors to describe the image characteristics in that region. The thesis discusses a number of different saliency detectors that are suitable for robust retrieval purposes and performs a comparison between a number of these region detectors. The thesis then discusses how salient regions can be used for image retrieval using a number of techniques, but most importantly, two techniques inspired from the field of textual information retrieval. Using these robust retrieval techniques, a new paradigm in image retrieval is discussed, whereby the retrieval takes place on a mobile device using a query image captured by a built-in camera. This paradigm is demonstrated in the context of an art gallery, in which the device can be used to find more information about particular images. The final chapter of the thesis discusses some approaches to bridging the semantic gap in image retrieval. The chapter explores ways in which un-annotated image collections can be searched by keyword. Two techniques are discussed; the first explicitly attempts to automatically annotate the un-annotated images so that the automatically applied annotations can be used for searching. The second approach does not try to explicitly annotate images, but rather, through the use of linear algebra, it attempts to create a semantic space in which images and keywords are positioned such that images are close to the keywords that represent them within the space.

    Clustering and its Application in Requirements Engineering

    Large scale software systems challenge almost every activity in the software development life-cycle, including tasks related to eliciting, analyzing, and specifying requirements. Fortunately many of these complexities can be addressed through clustering the requirements in order to create abstractions that are meaningful to human stakeholders. For example, the requirements elicitation process can be supported through dynamically clustering incoming stakeholders’ requests into themes. Cross-cutting concerns, which have a significant impact on the architectural design, can be identified through the use of fuzzy clustering techniques and metrics designed to detect when a theme cross-cuts the dominant decomposition of the system. Finally, traceability techniques, required in critical software projects by many regulatory bodies, can be automated and enhanced by the use of cluster-based information retrieval methods. Unfortunately, despite a significant body of work describing document clustering techniques, there is almost no prior work which directly addresses the challenges, constraints, and nuances of requirements clustering. As a result, the effectiveness of software engineering tools and processes that depend on requirements clustering is severely limited. This report directly addresses the problem of clustering requirements through surveying standard clustering techniques and discussing their application to the requirements clustering process

    Pre-training Transformers on Indian Legal Text

    Natural Language Processing in the legal domain has benefited hugely from the emergence of Transformer-based Pre-trained Language Models (PLMs) pre-trained on legal text. There exist PLMs trained over European and US legal text, most notably LegalBERT. However, with the rapidly increasing volume of NLP applications on Indian legal documents, and the distinguishing characteristics of Indian legal text, it has become necessary to pre-train LMs over Indian legal text as well. In this work, we introduce transformer-based PLMs pre-trained over a large corpus of Indian legal documents. We also apply these PLMs over several benchmark legal NLP tasks over both Indian legal text, as well as over legal text belonging to other domains (countries). The NLP tasks with which we experiment include Legal Statute Identification from facts, Semantic segmentation of court judgements, and Court Judgement Prediction. Our experiments demonstrate the utility of the India-specific PLMs developed in this work.

    Semantic concept detection from visual content with statistical learning

    Master's, MASTER OF SCIENCE