
    Lucene4IR: Developing information retrieval evaluation resources using Lucene

    The workshop and hackathon on developing Information Retrieval Evaluation Resources using Lucene (L4IR) was held on the 8th and 9th of September, 2016, at the University of Strathclyde in Glasgow, UK, and funded by the ESF Elias Network. The event featured three main elements: (i) a series of keynote and invited talks on industry, teaching and evaluation; (ii) planning, coding and hacking, where a number of groups created modules and infrastructure to use Lucene to undertake TREC-based evaluations; and (iii) a number of breakout groups discussing challenges, opportunities and problems in bridging the divide between academia and industry, and how we can use Lucene for teaching and learning Information Retrieval (IR). The event brought together a mix of academics, experts and students wanting to learn, share and create evaluation resources for the community. The hacking was intense and the discussions lively, creating the basis of many useful tools but also raising numerous issues. It was clear that by adopting and contributing to the most widely used and supported open-source IR toolkit, there were many benefits for academics, students, researchers, developers and practitioners - providing a basis for stronger evaluation practices, increased reproducibility, more efficient knowledge transfer, greater collaboration between academia and industry, and shared teaching and training resources.
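
    A concrete sense of what such TREC-style Lucene infrastructure involves can be given with a minimal sketch. This is illustrative only, not the L4IR code produced at the event: the field names, topic number 401, and run tag are assumptions, and a recent Lucene 8.x/9.x release with the analysis and queryparser modules is assumed. The idea is to index documents keyed by their DOCNO, retrieve with BM25, and emit results in the six-column TREC run format.

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.store.FSDirectory;

public class TrecLuceneSketch {

    /** Index one TREC document: a stored DOCNO plus tokenised body text. */
    static void addDoc(IndexWriter writer, String docno, String text) throws Exception {
        Document doc = new Document();
        doc.add(new StringField("docno", docno, Field.Store.YES)); // exact-match id, stored for run output
        doc.add(new TextField("contents", text, Field.Store.NO));  // analysed body, not stored
        writer.addDocument(doc);
    }

    public static void main(String[] args) throws Exception {
        EnglishAnalyzer analyzer = new EnglishAnalyzer();
        FSDirectory dir = FSDirectory.open(Paths.get("trec-index"));

        // Build the index (documents would normally come from a TREC collection parser).
        IndexWriterConfig cfg = new IndexWriterConfig(analyzer).setSimilarity(new BM25Similarity());
        try (IndexWriter writer = new IndexWriter(dir, cfg)) {
            addDoc(writer, "FT911-1", "profits rose on strong recovery of overseas markets");
            addDoc(writer, "FT911-2", "evaluation of information retrieval systems with test collections");
        }

        // Retrieve the top 1000 documents for one topic and print one TREC run line per hit:
        // topic Q0 docno rank score tag
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            searcher.setSimilarity(new BM25Similarity());
            QueryParser parser = new QueryParser("contents", analyzer);
            TopDocs hits = searcher.search(parser.parse("information retrieval evaluation"), 1000);
            int rank = 1;
            for (ScoreDoc sd : hits.scoreDocs) {
                String docno = searcher.doc(sd.doc).get("docno");
                System.out.printf("401 Q0 %s %d %.4f lucene4ir%n", docno, rank++, sd.score);
            }
        }
    }
}
```

    Scoring the emitted run file against TREC qrels with trec_eval then closes the loop from indexing through retrieval to evaluation.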

    Toward Reproducible Baselines: The Open-Source IR Reproducibility Challenge

    The Open-Source IR Reproducibility Challenge brought together developers of open-source search engines to provide reproducible baselines of their systems in a common environment on Amazon EC2. The product is a repository that contains all code necessary to generate competitive ad hoc retrieval baselines, such that with a single script, anyone with a copy of the collection can reproduce the submitted runs. Our vision is that these results would serve as widely accessible points of comparison in future IR research. This project represents an ongoing effort, but we describe the first phase of the challenge, which was organized as part of a workshop at SIGIR 2015. We have succeeded modestly so far, achieving our main goals on the Gov2 collection with seven open-source search engines. In this paper, we describe our methodology, share experimental results, and discuss lessons learned as well as next steps.

    Network Traffic Analysis Framework For Cyber Threat Detection

    The growing sophistication of attacks and newly emerging cyber threats requires advanced cyber threat detection systems. Although there are several cyber threat detection tools in use, cyber threats and data breaches continue to rise. This research is intended to improve the cyber threat detection approach by developing a cyber threat detection framework that uses two complementary technologies, a search engine and machine learning, combining artificial intelligence and classical technologies. In this design science research, several artifacts, such as a custom search engine library, a machine learning-based engine and different algorithms, have been developed to build a new cyber threat detection framework based on self-learning search and machine learning engines. The Apache Lucene.Net search engine library was customized to function as a cyber threat detector, and Microsoft ML.NET was used to work with and train the customized search engine. This research demonstrates that a custom search engine can function as a cyber threat detection system. Using both search and machine learning engines in the newly developed framework provides improved cyber threat detection capabilities, such as self-learning and predicting attack details. When the two engines run together, the search engine is continuously trained by the machine learning engine and grows smarter, predicting yet-unknown threats with greater accuracy. While customizing the search engine to function as a cyber threat detector, this research also identified and validated the best-performing algorithms for the search engine-based cyber threat detection model. For example, the best scoring algorithm was found to be the Manhattan distance. The validation case study also shows that not every network traffic feature makes an equal contribution to determining the status of the traffic, and thus the variable-dimension Vector Space Model (VSM) achieves better detection accuracy than the n-dimensional VSM. Although the use of different technologies and approaches improved detection results, this research is primarily focused on developing techniques rather than building a complete threat detection system. Additional components, such as those that can track and investigate the impact of network traffic on the destination devices, would make the newly developed framework robust enough to serve as a comprehensive cyber threat detection appliance.
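
    To make the scoring idea concrete, here is a minimal sketch, in Java rather than the authors' Lucene.Net/ML.NET stack, of Manhattan-distance matching over a variable-dimension VSM; the feature names, signature values, and threshold are hypothetical, not taken from the paper.

```java
import java.util.HashMap;
import java.util.Map;

public class ManhattanTrafficScorer {

    /**
     * Manhattan (L1) distance between a traffic record and a threat signature,
     * computed only over the dimensions the signature defines: features the
     * signature does not constrain are ignored (variable-dimension VSM).
     */
    static double manhattan(Map<String, Double> record, Map<String, Double> signature) {
        double dist = 0.0;
        for (Map.Entry<String, Double> dim : signature.entrySet()) {
            double observed = record.getOrDefault(dim.getKey(), 0.0);
            dist += Math.abs(observed - dim.getValue());
        }
        return dist;
    }

    public static void main(String[] args) {
        // Hypothetical normalised features of one observed flow.
        Map<String, Double> flow = new HashMap<>();
        flow.put("pkts_per_sec", 0.92);
        flow.put("syn_ratio", 0.88);
        flow.put("dst_port_entropy", 0.15);
        flow.put("payload_len_mean", 0.10);

        // Hypothetical signature for a SYN-flood-like pattern; it only
        // constrains the dimensions that matter for this threat class.
        Map<String, Double> synFlood = new HashMap<>();
        synFlood.put("pkts_per_sec", 1.0);
        synFlood.put("syn_ratio", 1.0);

        double score = manhattan(flow, synFlood);
        double threshold = 0.5; // assumed decision threshold, to be tuned or learned
        System.out.printf("L1 distance = %.2f -> %s%n", score,
                score <= threshold ? "flag as suspicious" : "treat as benign");
    }
}
```

    Dropping the restriction to the signature's own dimensions turns this back into a fixed n-dimensional comparison, which is the variant the validation case study found less accurate.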

    Implementation of an information retrieval system within a central knowledge management system

    Numbered pages: I-XIII, 14-126. Internship carried out at Wipro Portugal SA and supervised by Eng. Hugo Neto. Integrated master's dissertation. Informatics and Computing Engineering. Faculdade de Engenharia, Universidade do Porto. 201

    Locating bugs without looking back

    Bug localisation is a core program comprehension task in software maintenance: given the observation of a bug, e.g., via a bug report, where is it located in the source code? Information retrieval (IR) approaches see the bug report as the query and the source code files as the documents to be retrieved, ranked by relevance. Such approaches have the advantage of not requiring expensive static or dynamic analysis of the code. However, current state-of-the-art IR approaches rely on project history, in particular previously fixed bugs or previous versions of the source code. We present a novel approach that directly scores each current file against the given report, thus not requiring past code and reports. The scoring method is based on heuristics identified through manual inspection of a small sample of bug reports. We compare our approach to eight others, using their own five metrics on their own six open source projects. Out of 30 performance indicators, we improve 27 and equal 2. Over the projects analysed, on average we find one or more affected files in the top 10 ranked files for 76% of the bug reports. These results show the applicability of our approach to software projects without history.
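
    Since the actual heuristics come from the authors' manual inspection of bug reports, the following is only a schematic sketch of the history-free idea: each current source file is scored directly against the report text, here via token overlap plus a boost when the report mentions the file name. The weights and helper names are assumptions, not the paper's scoring formula.

```java
import java.util.*;
import java.util.regex.Pattern;

public class HistoryFreeBugLocator {

    private static final Pattern SPLIT = Pattern.compile("[^A-Za-z0-9]+");

    /** Lower-cased alphanumeric tokens of a piece of text or code. */
    static Set<String> tokens(String text) {
        Set<String> out = new HashSet<>();
        for (String t : SPLIT.split(text.toLowerCase())) {
            if (t.length() > 2) out.add(t);
        }
        return out;
    }

    /**
     * Score one file against the bug report: shared-token overlap, plus a
     * fixed boost if the report mentions the file name itself.
     * Illustrative weights only.
     */
    static double score(String report, String fileName, String fileContents) {
        Set<String> shared = tokens(fileContents);
        shared.retainAll(tokens(report));
        String baseName = fileName.replaceAll(".*/", "").replace(".java", "");
        double nameBoost = report.contains(baseName) ? 5.0 : 0.0;
        return shared.size() + nameBoost;
    }

    public static void main(String[] args) {
        String report = "NullPointerException in QueryParser when the query string is empty";
        Map<String, String> files = Map.of(
                "src/QueryParser.java", "class QueryParser { void parse(String query) { /* ... */ } }",
                "src/IndexWriter.java", "class IndexWriter { void addDocument(Object doc) { /* ... */ } }");

        // Rank all current files by descending score against the report.
        files.entrySet().stream()
             .sorted(Comparator.comparingDouble(
                     (Map.Entry<String, String> e) -> -score(report, e.getKey(), e.getValue())))
             .forEach(e -> System.out.printf("%.1f  %s%n",
                     score(report, e.getKey(), e.getValue()), e.getKey()));
    }
}
```

    In a real setting the file contents would come from the project checkout being debugged, with no need for previously fixed bugs or prior versions of the code.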

    Impliance: A Next Generation Information Management Appliance

    "[…]ably successful in building a large market and adapting to the changes of the last three decades, its impact on the broader market of information management is surprisingly limited. If we were to design an information management system from scratch, based upon today's requirements and hardware capabilities, would it look anything like today's database systems?" In this paper, we introduce Impliance, a next-generation information management system consisting of hardware and software components integrated to form an easy-to-administer appliance that can store, retrieve, and analyze all types of structured, semi-structured, and unstructured information. We first summarize the trends that will shape information management for the foreseeable future. Those trends imply three major requirements for Impliance: (1) to be able to store, manage, and uniformly query all data, not just structured records; (2) to be able to scale out as the volume of this data grows; and (3) to be simple and robust in operation. We then describe four key ideas that are uniquely combined in Impliance to address these requirements, namely: (a) integrating software and off-the-shelf hardware into a generic information appliance; (b) automatically discovering, organizing, and managing all data - unstructured as well as structured - in a uniform way; (c) achieving scale-out by exploiting simple, massive parallel processing; and (d) virtualizing compute and storage resources to unify, simplify, and streamline the management of Impliance. Impliance is an ambitious, long-term effort to define simpler, more robust, and more scalable information systems for tomorrow's enterprises. Comment: This article is published under a Creative Commons License Agreement (http://creativecommons.org/licenses/by/2.5/). You may copy, distribute, display, and perform the work, make derivative works and make commercial use of the work, but you must attribute the work to the author and CIDR 2007, 3rd Biennial Conference on Innovative Data Systems Research (CIDR), January 7-10, 2007, Asilomar, California, US.

    Free Software for research in Information Retrieval and Textual Clustering

    The document provides an overview of the main Free ("Open Source") software of interest for research in Information Retrieval, as well as some background on the context. It provides guidelines for choosing appropriate tools.

    On the Additivity and Weak Baselines for Search Result Diversification Research

    A recent study on the topic of additivity addresses the task of search result diversification and concludes that while weaker baselines are almost always significantly improved by the evaluated diversification methods, for stronger baselines just the opposite happens, i.e., no significant improvement can be observed. Due to the importance of this issue in shaping future research directions and evaluation strategies in search result diversification, in this work we first aim to reproduce the findings reported in the previous study, and then investigate its possible limitations. Our extensive experiments first reveal that under the same experimental setting as that previous study, we reach similar results. Next, we hypothesize that for stronger baselines, tuning the parameters of some methods (i.e., the trade-off parameter between the relevance and diversity of the results in this particular scenario) should be done in a more fine-grained manner. With trade-off parameters that are specifically determined for each baseline run, we show that the percentage of significant improvements even over the strong baselines can be doubled. As a further issue, we discuss the possible impact of using the same strong baseline retrieval function for the diversity computations of the methods. Our takeaway message is that in the case of a strong baseline, it is more crucial to tune the parameters of the diversification methods to be evaluated; but once this is done, additivity is achievable.
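
    The trade-off parameter referred to is the lambda that balances relevance against novelty in diversification methods. The sketch below uses MMR-style greedy re-ranking as a stand-in for the methods the paper evaluates (the document scores and similarity values are made up) and shows how a per-baseline sweep over lambda would be run.

```java
import java.util.*;

public class MmrDiversifier {

    /**
     * Greedy MMR re-ranking: at each step pick the candidate maximising
     *   lambda * rel(d) - (1 - lambda) * max_{s in selected} sim(d, s).
     * lambda is the relevance/diversity trade-off the study tunes per baseline.
     */
    static List<String> rerank(Map<String, Double> relevance,
                               Map<String, Map<String, Double>> similarity,
                               double lambda, int k) {
        List<String> selected = new ArrayList<>();
        Set<String> remaining = new HashSet<>(relevance.keySet());
        while (selected.size() < k && !remaining.isEmpty()) {
            String best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (String d : remaining) {
                double maxSim = 0.0;
                for (String s : selected) {
                    maxSim = Math.max(maxSim, similarity.get(d).getOrDefault(s, 0.0));
                }
                double mmr = lambda * relevance.get(d) - (1 - lambda) * maxSim;
                if (mmr > bestScore) { bestScore = mmr; best = d; }
            }
            selected.add(best);
            remaining.remove(best);
        }
        return selected;
    }

    public static void main(String[] args) {
        // Made-up baseline scores: d1 and d2 are near-duplicates, d3 covers another aspect.
        Map<String, Double> rel = Map.of("d1", 0.9, "d2", 0.85, "d3", 0.6);
        Map<String, Map<String, Double>> sim = Map.of(
                "d1", Map.of("d2", 0.95, "d3", 0.1),
                "d2", Map.of("d1", 0.95, "d3", 0.1),
                "d3", Map.of("d1", 0.1, "d2", 0.1));

        // A sweep over lambda for one baseline run; small lambda promotes d3 over d2.
        for (double lambda : new double[] {0.3, 0.5, 0.7, 0.9}) {
            System.out.printf("lambda=%.1f -> %s%n", lambda, rerank(rel, sim, lambda, 3));
        }
    }
}
```

    Repeating the sweep for each baseline run and selecting lambda on held-out topics is the kind of per-baseline, fine-grained tuning the study argues is needed before additivity over strong baselines can be judged.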