
    Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion Architecture

    We present the architecture behind Twitter's real-time related query suggestion and spelling correction service. Although these tasks have received much attention in the web search literature, the Twitter context introduces a real-time "twist": after significant breaking news events, we aim to provide relevant results within minutes. This paper provides a case study illustrating the challenges of real-time data processing in the era of "big data". We tell the story of how our system was built twice: our first implementation was built on a typical Hadoop-based analytics stack, but was later replaced because it did not meet the latency requirements necessary to generate meaningful real-time results. The second implementation, which is the system deployed in production, is a custom in-memory processing engine specifically designed for the task. This experience taught us that the current typical usage of Hadoop as a "big data" platform, while great for experimentation, is not well suited to low-latency processing, and it points the way to future work on data analytics platforms that can handle "big" as well as "fast" data.
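    The latency gap the abstract describes can be illustrated with a minimal in-memory sketch (not Twitter's actual engine; class and parameter names are hypothetical): suggestions are derived from queries that co-occur within a user's short session, and counts are updated and served immediately rather than through a batch pipeline.

```python
import time
from collections import defaultdict, Counter

class RelatedQuerySuggester:
    """Minimal in-memory sketch: count queries that co-occur within a
    user's session and suggest the most frequent co-occurring queries."""

    def __init__(self, session_gap=600.0):
        self.session_gap = session_gap       # max seconds between queries in one session
        self.last_query = {}                 # user -> (query, timestamp)
        self.cooccur = defaultdict(Counter)  # query -> Counter of related queries

    def observe(self, user, query, ts=None):
        ts = time.time() if ts is None else ts
        prev = self.last_query.get(user)
        if prev and ts - prev[1] <= self.session_gap and prev[0] != query:
            # queries issued close together in time are treated as related
            self.cooccur[prev[0]][query] += 1
            self.cooccur[query][prev[0]] += 1
        self.last_query[user] = (query, ts)

    def suggest(self, query, k=3):
        return [q for q, _ in self.cooccur[query].most_common(k)]
```

    Because every `observe` call updates the counts synchronously, a suggestion can reflect a breaking-news query burst within seconds, which a periodic Hadoop batch job cannot.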

    WISeREP - An Interactive Supernova Data Repository

    We have entered an era of massive data sets in astronomy. In particular, the number of supernova (SN) discoveries and classifications has substantially increased over the years from a few tens to thousands per year. It is no longer the case that observations of a few prototypical events encapsulate most spectroscopic information about SNe, motivating the development of modern tools to collect, archive, organize and distribute spectra in general, and SN spectra in particular. For this reason we have developed the Weizmann Interactive Supernova data REPository - WISeREP - an SQL-based database (DB) with an interactive web-based graphical interface. The system serves as an archive of high quality SN spectra, including both historical (legacy) data as well as data that is accumulated by ongoing modern programs. The archive provides information about objects, their spectra, and related meta-data. Utilizing interactive plots, we provide a graphical interface to visualize data, perform line identification of the major relevant species, determine object redshifts, classify SNe and measure expansion velocities. Guest users may view and download spectra or other data that have been placed in the public domain. Registered users may also view and download data that are proprietary to specific programs with which they are associated. The DB currently holds >8000 spectra, of which >5000 are public; the latter include published spectra from the Palomar Transient Factory, all of the SUSPECT archive, the Caltech-Core-Collapse Program, the CfA SN spectra archive and published spectra from the UC Berkeley SNDB repository. It offers an efficient and convenient way to archive data and share it with colleagues, and we expect that data stored in this way will be easy to access, increasing its visibility, usefulness and scientific impact.
    Comment: To be published in PASP. WISeREP: http://www.weizmann.ac.il/astrophysics/wiserep

    Embedding Web-based Statistical Translation Models in Cross-Language Information Retrieval

    Although more and more language pairs are covered by machine translation services, there are still many pairs that lack translation resources. Cross-language information retrieval (CLIR) is an application which needs translation functionality of a relatively low level of sophistication, since current models for information retrieval (IR) are still based on a bag-of-words representation. The Web provides a vast resource for the automatic construction of parallel corpora which can be used to train statistical translation models automatically. The resulting translation models can be embedded in several ways in a retrieval model. In this paper, we will investigate the problem of automatically mining parallel texts from the Web and different ways of integrating the translation models within the retrieval process. Our experiments on standard test collections for CLIR show that the Web-based translation models can surpass commercial MT systems in CLIR tasks. These results open the perspective of constructing a fully automatic query translation device for CLIR at a very low cost.
    Comment: 37 pages
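    The low level of translation sophistication that bag-of-words CLIR requires can be sketched as follows: each source-language query term simply contributes its most probable target-language translations, weighted by a translation probability p(target | source) learned from a parallel corpus. The function and the toy probability table below are illustrative assumptions, not the paper's exact model.

```python
from collections import defaultdict

def translate_query(query_terms, trans_probs, top_n=2):
    """Bag-of-words query translation sketch for CLIR: expand each source
    term into its top-n target-language translations, weighted by
    translation probability p(target | source)."""
    weights = defaultdict(float)
    for term in query_terms:
        candidates = sorted(trans_probs.get(term, {}).items(),
                            key=lambda kv: kv[1], reverse=True)[:top_n]
        for target, p in candidates:
            weights[target] += p
    return dict(weights)

# Toy French-to-English probability table (illustrative values only)
probs = {
    "maison": {"house": 0.7, "home": 0.25, "mansion": 0.05},
    "blanche": {"white": 0.9, "blank": 0.1},
}
translated = translate_query(["maison", "blanche"], probs)
```

    The weighted target terms can then be fed directly into a standard IR scoring function, which is why word order and syntax never need to be modeled.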

    Deriving query suggestions for site search

    Modern search engines have been moving away from simplistic interfaces that aimed at satisfying a user's need with a single-shot query. Interactive features are now integral parts of web search engines. However, generating good query modification suggestions remains a challenging issue. Query log analysis is one of the major strands of work in this direction. Although much research has been performed on query logs collected on the web as a whole, query log analysis to enhance search on smaller and more focused collections has attracted less attention, despite its increasing practical importance. In this article, we report on a systematic study of different query modification methods applied to a substantial query log collected on a local website that already uses an interactive search engine. We conducted experiments in which we asked users to assess the relevance of potential query modification suggestions that have been constructed using a range of log analysis methods and different baseline approaches. The experimental results demonstrate the usefulness of log analysis to extract query modification suggestions. Furthermore, our experiments demonstrate that a more fine-grained approach than grouping search requests into sessions allows for extraction of better refinement terms from query log files. © 2013 ASIS&T
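    Grouping search requests into sessions, the baseline the article refines, is commonly done with a simple time-gap heuristic over the query log. The sketch below uses a 30-minute cutoff, which is a widespread convention rather than the article's exact method, and all names are illustrative.

```python
def split_sessions(log, gap=1800):
    """Group one user's (timestamp, query) pairs into sessions,
    starting a new session whenever the time gap between consecutive
    queries exceeds `gap` seconds (default: 30 minutes)."""
    sessions, current = [], []
    for ts, query in sorted(log):
        if current and ts - current[-1][0] > gap:
            sessions.append(current)
            current = []
        current.append((ts, query))
    if current:
        sessions.append(current)
    return sessions
```

    A finer-grained approach, as the article argues, would look inside such sessions (e.g. at consecutive query reformulations) rather than treating the whole session as one unit when extracting refinement terms.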

    The archive solution for distributed workflow management agents of the CMS experiment at LHC

    The CMS experiment at the CERN LHC developed the Workflow Management Archive system to persistently store unstructured framework job report documents produced by distributed workflow management agents. In this paper we present its architecture, implementation, deployment, and integration with the CMS and CERN computing infrastructures, such as central HDFS and a Hadoop Spark cluster. The system leverages modern technologies such as a document-oriented database and the Hadoop eco-system to provide the necessary flexibility to reliably process, store, and aggregate O(1M) documents on a daily basis. We describe the data transformation, the short and long term storage layers, the query language, along with the aggregation pipeline developed to visualize various performance metrics to assist CMS data operators in assessing the performance of the CMS computing system.
    Comment: This is a pre-print of an article published in Computing and Software for Big Science. The final authenticated version is available online at: https://doi.org/10.1007/s41781-018-0005-
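    The kind of roll-up such an aggregation pipeline performs can be sketched in a few lines: per-job report documents are grouped into daily per-site metrics. The field names (`ts`, `site`, `cpu_hours`) are illustrative assumptions, not CMS's actual document schema.

```python
from collections import defaultdict
from datetime import datetime, timezone

def aggregate_daily(docs):
    """Sketch of a daily roll-up over job report documents:
    group by (UTC day, site) and sum job counts and CPU hours."""
    totals = defaultdict(lambda: {"jobs": 0, "cpu_hours": 0.0})
    for doc in docs:
        day = datetime.fromtimestamp(doc["ts"], tz=timezone.utc).date().isoformat()
        key = (day, doc["site"])
        totals[key]["jobs"] += 1
        totals[key]["cpu_hours"] += doc["cpu_hours"]
    return dict(totals)
```

    At O(1M) documents per day this grouping is exactly the shape of work a document store plus Spark handles well: the per-document map is trivially parallel and the reduce is a small keyed sum.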

    On User Modelling for Personalised News Video Recommendation

    In this paper, we introduce a novel approach for modelling user interests. Our approach captures users' evolving information needs, identifies aspects of their needs and recommends relevant news items to the users. We introduce our approach within the context of personalised news video retrieval. A news video data set is used for experimentation. We employ a simulated user evaluation.

    Searching the intranet: Corporate users and their queries

    By examining the log files from a corporate intranet search engine, we have analysed the actual web searching behaviour of real users in a real business environment. While building on previous research on public search engines, we apply an alternative session definition that we argue is more appropriate. Our results regarding session length, query construction and result page viewing confirm some of the findings from similar studies carried out on public search engines but further our understanding of web searching by presenting details on corporate users' activities. In particular, we suggest that search sessions are shorter than previously suggested, search queries have fewer terms than observed for public search engines, and the number of examined result pages is smaller than reported in other research. More research on how corporate intranet users search for information is needed.