43 research outputs found

    MWAND: A New Early Termination Algorithm for Fast and Efficient Query Evaluation

    Get PDF
    Nowadays, current information systems are so large and maintain huge amount of data. At every time, they process millions of documents and millions of queries. In order to choose the most important responses from this amount of data, it is well to apply what is so called early termination algorithms. These ones attempt to extract the Top-K documents according to a specified increasing monotone function. The principal idea behind is to reach and score the most significant less number of documents. So, they avoid fully processing the whole documents. WAND algorithm is at the state of the art in this area. Despite it is efficient, it is missing effectiveness and precision. In this paper, we propose two contributions, the principal proposal is a new early termination algorithm based on WAND approach, we call it MWAND (Modified WAND). This one is faster and more precise than the first. It has the ability to avoid unnecessary WAND steps. In this work, we integrate a tree structure as an index into WAND and we add new levels in query processing. In the second contribution, we define new fine metrics to ameliorate the evaluation of the retrieved information. The experimental results on real datasets show that MWAND is more efficient than the WAND approach

    The Evolution of Web Search User Interfaces -- An Archaeological Analysis of Google Search Engine Result Pages

    Full text link
    Web search engines have marked everyone's life by transforming how one searches and accesses information. Search engines give special attention to the user interface, especially search engine result pages (SERP). The well-known ''10 blue links'' list has evolved into richer interfaces, often personalized to the search query, the user, and other aspects. More than 20 years later, the literature has not adequately portrayed this development. We present a study on the evolution of SERP interfaces during the last two decades using Google Search as a case study. We used the most searched queries by year to extract a sample of SERP from the Internet Archive. Using this dataset, we analyzed how SERP evolved in content, layout, design (e.g., color scheme, text styling, graphics), navigation, and file size. We have also analyzed the user interface design patterns associated with SERP elements. We found that SERP are becoming more diverse in terms of elements, aggregating content from different verticals and including more features that provide direct answers. This systematic analysis portrays evolution trends in search engine user interfaces and, more generally, web design. We expect this work will trigger other, more specific studies that can take advantage of our dataset.Comment: 10 pages, Full Paper of CHIIR 202

    Scalable Text Mining with Sparse Generative Models

    Get PDF
    The information age has brought a deluge of data. Much of this is in text form, insurmountable in scope for humans and incomprehensible in structure for computers. Text mining is an expanding field of research that seeks to utilize the information contained in vast document collections. General data mining methods based on machine learning face challenges with the scale of text data, posing a need for scalable text mining methods. This thesis proposes a solution to scalable text mining: generative models combined with sparse computation. A unifying formalization for generative text models is defined, bringing together research traditions that have used formally equivalent models, but ignored parallel developments. This framework allows the use of methods developed in different processing tasks such as retrieval and classification, yielding effective solutions across different text mining tasks. Sparse computation using inverted indices is proposed for inference on probabilistic models. This reduces the computational complexity of the common text mining operations according to sparsity, yielding probabilistic models with the scalability of modern search engines. The proposed combination provides sparse generative models: a solution for text mining that is general, effective, and scalable. Extensive experimentation on text classification and ranked retrieval datasets are conducted, showing that the proposed solution matches or outperforms the leading task-specific methods in effectiveness, with a order of magnitude decrease in classification times for Wikipedia article categorization with a million classes. The developed methods were further applied in two 2014 Kaggle data mining prize competitions with over a hundred competing teams, earning first and second places

    Image retrieval using automatic region tagging

    Get PDF
    The task of tagging, annotating or labelling image content automatically with semantic keywords is a challenging problem. To automatically tag images semantically based on the objects that they contain is essential for image retrieval. In addressing these problems, we explore the techniques developed to combine textual description of images with visual features, automatic region tagging and region-based ontology image retrieval. To evaluate the techniques, we use three corpora comprising: Lonely Planet travel guide articles with images, Wikipedia articles with images and Goats comic strips. In searching for similar images or textual information specified in a query, we explore the unification of textual descriptions and visual features (such as colour and texture) of the images. We compare the effectiveness of using different retrieval similarity measures for the textual component. We also analyse the effectiveness of different visual features extracted from the images. We then investigate the best weight combination of using textual and visual features. Using the queries from the Multimedia Track of INEX 2005 and 2006, we found that the best weight combination significantly improves the effectiveness of the retrieval system. Our findings suggest that image regions are better in capturing the semantics, since we can identify specific regions of interest in an image. In this context, we develop a technique to tag image regions with high-level semantics. This is done by combining several shape feature descriptors and colour, using an equal-weight linear combination. We experimentally compare this technique with more complex machine-learning algorithms, and show that the equal-weight linear combination of shape features is simpler and at least as effective as using a machine learning algorithm. We focus on the synergy between ontology and image annotations with the aim of reducing the gap between image features and high-level semantics. Ontologies ease information retrieval. They are used to mine, interpret, and organise knowledge. An ontology may be seen as a knowledge base that can be used to improve the image retrieval process, and conversely keywords obtained from automatic tagging of image regions may be useful for creating an ontology. We engineer an ontology that surrogates concepts derived from image feature descriptors. We test the usability of the constructed ontology by querying the ontology via the Visual Ontology Query Interface, which has a formally specified grammar known as the Visual Ontology Query Language. We show that synergy between ontology and image annotations is possible and this method can reduce the gap between image features and high-level semantics by providing the relationships between objects in the image. In this thesis, we conclude that suitable techniques for image retrieval include fusing text accompanying the images with visual features, automatic region tagging and using an ontology to enrich the semantic meaning of the tagged image regions

    Helping academics manage students with “invisible disabilities”

    Get PDF

    Building query-based relevance sets without human intervention

    Get PDF
    A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophycollections are the standard framework used in the evaluation of an information retrieval system and the comparison between different systems. A text test collection consists of a set of documents, a set of topics, and a set of relevance assessments which is a list indicating the relevance of each document to each topic. Traditionally, forming the relevance assessments is done manually by human judges. But in large scale environments, such as the web, examining each document retrieved to determine its relevance is not possible. In the past there have been several studies that aimed to reduce the human effort required in building these assessments which are referred to as qrels (query-based relevance sets). Some research has also been done to completely automate the process of generating the qrels. In this thesis, we present different methodologies that lead to producing the qrels automatically without any human intervention. A first method is based on keyphrase (KP) extraction from documents presumed relevant; a second method uses Machine Learning classifiers, Naïve Bayes and Support Vector Machines. The experiments were conducted on the TREC-6, TREC-7 and TREC-8 test collections. The use of machine learning classifiers produced qrels resulting in information retrieval system rankings which were better correlated with those produced by TREC human assessments than any of the automatic techniques proposed in the literature. In order to produce a test collection which could discriminate between the best performing systems, an enhancement to the machine learning technique was made that used a small number of real or actual qrels as training sets for the classifiers. These actual relevant documents were selected by Losada et al.’s (2016) pooling technique. This modification led to an improvement in the overall system rankings and enabled discrimination between the best systems with only a little human effort. We also used the bpref-10 and infAP measures for evaluating the systems and comparing between the rankings, since they are more robust in incomplete judgment environments. We applied our new techniques to the French and Finnish test collections from CLEF2003 in order to confirm their reproducibility on non-English languages, and we achieved high correlations as seen for English

    Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

    Get PDF
    Peer reviewe

    Jewish Studies in the Digital Age

    Get PDF
    The digitisation boom of the last two decades, and the rapid advancement of digital tools to analyse data in myriad ways, have opened up new avenues for humanities research. This volume discusses how the so-called digital turn has affected the field of Jewish Studies, explores the current state of the art and probes how digital developments can be harnessed to address the specific questions, challenges and problems in the field

    Intelligent Sensor Networks

    Get PDF
    In the last decade, wireless or wired sensor networks have attracted much attention. However, most designs target general sensor network issues including protocol stack (routing, MAC, etc.) and security issues. This book focuses on the close integration of sensing, networking, and smart signal processing via machine learning. Based on their world-class research, the authors present the fundamentals of intelligent sensor networks. They cover sensing and sampling, distributed signal processing, and intelligent signal learning. In addition, they present cutting-edge research results from leading experts
    corecore