
    06171 Abstracts Collection -- Content-Based Retrieval

    From 23.04.06 to 28.04.06, the Dagstuhl Seminar 06171 "Content-Based Retrieval" was held in the International Conference and Research Center (IBFI), Schloss Dagstuhl. During the seminar, several participants presented their current research, and ongoing work and open problems were discussed. Abstracts of the presentations given during the seminar as well as abstracts of seminar results and ideas are put together in this paper. The first section describes the seminar topics and goals in general. Links to extended abstracts or full papers are provided, if available.

    Web Scale Image Retrieval based on Image Text Query Pair and Click Data

    Traditional text-based image retrieval owes its growing importance to its popularity in web image search engines such as Google, Yahoo, and Bing. Text-based image retrieval assumes that the text surrounding an image describes it. The input to such a system is a text query, and the output is a ranked set of images in which the most relevant results appear first. Its limitation is that the query text often cannot describe the content of an image perfectly, since visual information is highly varied. The Microsoft Research Bing Image Retrieval Challenge aims to achieve cross-modal retrieval by ranking the relevance between query text terms and images. This thesis presents the approaches of our team, MUVIS, to the challenge, which measure the relevance between web images and a query given in text form. The challenge is to develop an image-query pair scoring system that assesses how effectively query terms describe the images. The provided dataset included a training set containing more than 23 million clicked image-query pairs collected from the web over one year, together with a manually labelled development set. For each image-query pair, a floating-point score was produced that reflected how relevant the query was for describing the given image, with a higher number indicating higher relevance. Sorting the images associated with a query by their scores produced the retrieval ranking for that query. The system developed by the MUVIS team consisted of five modules. The two main modules were a text processing module and principal component analysis assisted perceptron regression with random sub-space selection. To enhance evaluation accuracy, three complementary modules were also developed: a face bank, a duplicate image detector, and optical character recognition. Both the main and the complementary modules relied on the results returned by the text processing module. OverFeat features extracted from the text processing module's results served as input to the principal component analysis assisted perceptron regression with random sub-space selection module, which further transformed the feature vectors. The relevance score for each query-image pair was obtained by comparing the features of the query image with those of the relevant training images. For the feature extraction used in the face bank and duplicate image detector modules, we used the CMUVIS framework, a distributed computing framework for big data developed by the MUVIS group. Three runs were submitted for evaluation: “Master”, “Sub2”, and “Sub3”. The cumulative similarity was returned as the requested image's relevance. With the proposed approach we reached a discounted cumulative gain of 0.5099 on the development set and 0.5116 on the test set. Our solution achieved fourth place in the Microsoft Research Bing Grand Challenge 2014 for the master submission and second place for the overall submission.
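
The ranking and evaluation steps described in this abstract (sort a query's images by score, then measure discounted cumulative gain) can be sketched as follows. This is a minimal illustration, not the MUVIS system: the function names, the toy scores, the integer relevance labels, and the cutoff `k` are all assumptions, and the DCG shown is one common textbook form that may differ from the challenge's exact definition and normalisation.

```python
import math


def rank_images(scores):
    """Return image ids ordered from highest to lowest relevance score."""
    return sorted(scores, key=scores.get, reverse=True)


def dcg(ranked_ids, labels, k=25):
    """Discounted cumulative gain over the top-k ranked images.

    Uses the common form sum((2**rel - 1) / log2(position + 1)).
    The cutoff k and the label scale are assumptions for illustration.
    """
    total = 0.0
    for position, image_id in enumerate(ranked_ids[:k], start=1):
        gain = 2 ** labels.get(image_id, 0) - 1
        total += gain / math.log2(position + 1)
    return total


# Hypothetical scores for the images associated with one query.
scores = {"img_a": 0.12, "img_b": 0.87, "img_c": 0.45}
# Hypothetical manual labels (0 = bad, 2 = good, 3 = excellent).
labels = {"img_a": 0, "img_b": 3, "img_c": 2}

ranking = rank_images(scores)
print(ranking)
print(round(dcg(ranking, labels), 4))
```

Because the discount grows with position, a highly relevant image placed lower in the ranking contributes less to the total, which is why the score ordering matters.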

    Feature Extraction and Duplicate Detection for Text Mining: A Survey

    Text mining, also known as Intelligent Text Analysis, is an important research area. The high dimensionality of the data makes it very difficult to focus on the most appropriate information. Feature extraction is one of the important data reduction techniques for discovering the most important features. Processing massive amounts of data stored in unstructured form is a challenging task. Several pre-processing methods and algorithms are needed to extract useful features from huge amounts of data. The survey covers different text summarization, classification, and clustering methods for discovering useful features, as well as the discovery of query facets, which are multiple groups of words or phrases that explain and summarize the content covered by a query, thereby reducing the time taken by the user. When dealing with a collection of text documents, it is also very important to filter out duplicate data. Once duplicates are deleted, it is recommended to replace the removed duplicates. Hence we also review the literature on duplicate detection and data fusion (removing and replacing duplicates). The survey presents existing text mining techniques for extracting relevant features, detecting duplicates, and replacing the duplicate data, so as to deliver fine-grained knowledge to the user.

    Article Segmentation in Digitised Newspapers

    Digitisation projects preserve and make available vast quantities of historical text. Among these, newspapers are an invaluable resource for the study of human culture and history. Article segmentation identifies each region in a digitised newspaper page that contains an article. Digital humanities, information retrieval (IR), and natural language processing (NLP) applications over digitised archives improve access to text and allow automatic information extraction. The lack of article segmentation impedes these applications. We contribute a thorough review of the existing approaches to article segmentation. Our analysis reveals divergent interpretations of the task, and inconsistent and often ambiguously defined evaluation metrics, making comparisons between systems challenging. We solve these issues by contributing a detailed task definition that examines the nuances and intricacies of article segmentation that are not immediately apparent. We provide practical guidelines on handling borderline cases and devise a new evaluation framework that allows insightful comparison of existing and future approaches. Our review also reveals that the lack of large datasets hinders meaningful evaluation and limits machine learning approaches. We solve these problems by contributing a distant supervision method for generating large datasets for article segmentation. We manually annotate a portion of our dataset and show that our method produces article segmentations over characters nearly as well as costly human annotators. We reimplement the seminal textual approach to article segmentation (Aiello and Pegoretti, 2006) and show that it does not generalise well when evaluated on a large dataset. We contribute a framework for textual article segmentation that divides the task into two distinct phases: block representation and clustering. We propose several techniques for block representation and contribute a novel highly-compressed semantic representation called similarity embeddings. We evaluate and compare different clustering techniques, and innovatively apply label propagation (Zhu and Ghahramani, 2002) to spread headline labels to similar blocks. Our similarity embeddings and label propagation approach substantially outperforms Aiello and Pegoretti but still falls short of human performance. Exploring visual approaches to article segmentation, we reimplement and analyse the state-of-the-art Bansal et al. (2014) approach. We contribute an innovative 2D Markov model approach that captures reading order dependencies and reduces the structured labelling problem to a Markov chain that we decode with Viterbi (1967). Our approach substantially outperforms Bansal et al., achieves accuracy as good as human annotators, and establishes a new state of the art in article segmentation. Our task definition, evaluation framework, and distant supervision dataset will encourage progress in the task of article segmentation. Our state-of-the-art textual and visual approaches will allow sophisticated IR and NLP applications over digitised newspaper archives, supporting research in the digital humanities.
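
To make the final decoding step concrete, here is a minimal sketch of Viterbi decoding over a Markov chain of page blocks taken in reading order. It is not the thesis's 2D Markov model: the two states (`start` for a block that opens a new article, `cont` for one that continues the current article), the two observation types, and every probability below are hypothetical values chosen only for illustration.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most probable state sequence for the observations."""
    # best[t][s]: highest log-probability of any path ending in state s at step t.
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log_emit[s][observations[t]]
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


def logs(table):
    """Convert a nested dict of probabilities to log-probabilities."""
    return {k: {j: math.log(p) for j, p in row.items()} for k, row in table.items()}


# Hypothetical model parameters (assumptions, not values from the thesis).
states = ["start", "cont"]
log_start = {"start": math.log(0.9), "cont": math.log(0.1)}
log_trans = logs({"start": {"start": 0.1, "cont": 0.9},
                  "cont": {"start": 0.3, "cont": 0.7}})
log_emit = logs({"start": {"headline": 0.8, "body": 0.2},
                 "cont": {"headline": 0.1, "body": 0.9}})

blocks = ["headline", "body", "body", "headline", "body"]
path = viterbi(blocks, states, log_start, log_trans, log_emit)
print(path)  # ['start', 'cont', 'cont', 'start', 'cont']
```

Working in log space avoids numerical underflow on long block sequences, and the back-pointer table keeps decoding linear in the number of blocks.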