Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion Architecture
We present the architecture behind Twitter's real-time related query
suggestion and spelling correction service. Although these tasks have received
much attention in the web search literature, the Twitter context introduces a
real-time "twist": after significant breaking news events, we aim to provide
relevant results within minutes. This paper provides a case study illustrating
the challenges of real-time data processing in the era of "big data". We tell
the story of how our system was built twice: our first implementation was built
on a typical Hadoop-based analytics stack, but was later replaced because it
did not meet the latency requirements necessary to generate meaningful
real-time results. The second implementation, which is the system deployed in
production, is a custom in-memory processing engine specifically designed for
the task. This experience taught us that the current typical usage of Hadoop as
a "big data" platform, while great for experimentation, is not well suited to
low-latency processing, and points the way to future work on data analytics
platforms that can handle "big" as well as "fast" data.
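The abstract above describes the production system only at the architecture level. As a rough illustration of what "custom in-memory processing" for related-query suggestion can mean, the following sketch maintains co-occurrence counts between consecutive queries in user sessions over a sliding time window, so that suggestions react within minutes to breaking-news query bursts. This is an invented toy, not Twitter's actual engine; the class name, window size, and session heuristic are all assumptions.

```python
from collections import defaultdict, deque
import time

class RelatedQuerySuggester:
    """Toy in-memory sketch of real-time related-query suggestion:
    count co-occurrences of consecutive queries issued by the same
    user, keep only recent evidence via a sliding time window, and
    suggest the most frequently co-occurring queries."""

    def __init__(self, window_seconds=600):
        self.window = window_seconds
        self.events = deque()                 # (timestamp, query_a, query_b)
        self.cooccur = defaultdict(lambda: defaultdict(int))
        self.last_query = {}                  # user -> last query seen

    def observe(self, user, query, ts=None):
        ts = time.time() if ts is None else ts
        prev = self.last_query.get(user)
        if prev is not None and prev != query:
            self.cooccur[prev][query] += 1    # symmetric co-occurrence
            self.cooccur[query][prev] += 1
            self.events.append((ts, prev, query))
        self.last_query[user] = query
        self._expire(ts)

    def _expire(self, now):
        # Decay: drop co-occurrence evidence older than the window.
        while self.events and now - self.events[0][0] > self.window:
            _, a, b = self.events.popleft()
            self.cooccur[a][b] -= 1
            self.cooccur[b][a] -= 1

    def suggest(self, query, k=3):
        ranked = sorted(self.cooccur[query].items(), key=lambda kv: -kv[1])
        return [q for q, c in ranked[:k] if c > 0]
```

Because all state lives in memory and updates are constant-time per event, a structure like this can serve suggestions with second-level freshness, which is exactly the property the paper found a batch Hadoop pipeline could not deliver.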
WISeREP - An Interactive Supernova Data Repository
We have entered an era of massive data sets in astronomy. In particular, the
number of supernova (SN) discoveries and classifications has substantially
increased over the years from a few tens to thousands per year. It is no longer
the case that observations of a few prototypical events encapsulate most
spectroscopic information about SNe, motivating the development of modern tools
to collect, archive, organize and distribute spectra in general, and SN spectra
in particular. For this reason we have developed the Weizmann Interactive
Supernova data REPository - WISeREP - an SQL-based database (DB) with an
interactive web-based graphical interface. The system serves as an archive of
high quality SN spectra, including both historical (legacy) data as well as
data that is accumulated by ongoing modern programs. The archive provides
information about objects, their spectra, and related meta-data. Utilizing
interactive plots, we provide a graphical interface to visualize data, perform
line identification of the major relevant species, determine object redshifts,
classify SNe and measure expansion velocities. Guest users may view and
download spectra or other data that have been placed in the public domain.
Registered users may also view and download data that are proprietary to
specific programs with which they are associated. The DB currently holds >8000
spectra, of which >5000 are public; the latter include published spectra from
the Palomar Transient Factory, all of the SUSPECT archive, the
Caltech-Core-Collapse Program, the CfA SN spectra archive and published spectra
from the UC Berkeley SNDB repository. It offers an efficient and convenient way
to archive data and share it with colleagues, and we expect that data stored in
this way will be easy to access, increasing its visibility, usefulness and
scientific impact.
Comment: To be published in PASP. WISeREP:
http://www.weizmann.ac.il/astrophysics/wiserep
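The abstract mentions measuring expansion velocities from spectra via WISeREP's interactive plots. The underlying calculation is a simple Doppler shift of a blueshifted absorption minimum after removing the host-galaxy redshift. The sketch below illustrates that arithmetic; the specific line (Si II 6355 Å, a common SN Ia diagnostic) and the example wavelengths are illustrative assumptions, not data from the repository.

```python
# Sketch of the measurement behind an "expansion velocity": the Doppler
# shift of a P-Cygni absorption minimum relative to its rest wavelength,
# after correcting the observed wavelength for the host-galaxy redshift.

C_KM_S = 299_792.458  # speed of light in km/s

def deredshift(lambda_obs, z):
    """Remove the host-galaxy redshift: lambda_sn_frame = lambda_obs / (1 + z)."""
    return lambda_obs / (1.0 + z)

def expansion_velocity(lambda_min_sn_frame, lambda_rest):
    """Velocity implied by the blueshift of an absorption minimum,
    with the observed wavelength already in the supernova rest frame."""
    return C_KM_S * (lambda_rest - lambda_min_sn_frame) / lambda_rest

# Hypothetical example: Si II 6355 A minimum observed at 6204 A, host z = 0.01.
lam = deredshift(6204.0, 0.01)        # ~6142.6 A in the SN frame
v = expansion_velocity(lam, 6355.0)   # ~10,000 km/s, a plausible SN Ia value
```

Line identification and redshift determination in the web interface amount to the same comparison run in reverse: matching observed features against rest wavelengths of candidate species.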
Embedding Web-based Statistical Translation Models in Cross-Language Information Retrieval
Although more and more language pairs are covered by machine translation
services, there are still many pairs that lack translation resources.
Cross-language information retrieval (CLIR) is an application which needs
translation functionality of a relatively low level of sophistication since
current models for information retrieval (IR) are still based on a
bag-of-words representation. The Web provides a vast resource for the automatic construction
of parallel corpora which can be used to train statistical translation models
automatically. The resulting translation models can be embedded in several ways
in a retrieval model. In this paper, we will investigate the problem of
automatically mining parallel texts from the Web and different ways of
integrating the translation models within the retrieval process. Our
experiments on standard test collections for CLIR show that the Web-based
translation models can surpass commercial MT systems in CLIR tasks. These
results open the perspective of constructing a fully automatic query
translation device for CLIR at a very low cost.
Comment: 37 pages
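One simple way to embed a statistical translation model in bag-of-words retrieval, of the general kind the abstract discusses, is to expand each source-language query term into its candidate translations weighted by the translation probability p(target | source), and score documents against the resulting weighted bag. The sketch below illustrates this; the tiny lexicon, probabilities, and documents are invented for the example, and real systems would add smoothing and a proper retrieval model.

```python
from collections import defaultdict

# Hypothetical probabilistic lexicon p(target | source), e.g. learned from
# Web-mined parallel text. Values here are made up for illustration.
translation = {
    "chat": {"cat": 0.8, "chat": 0.2},
    "noir": {"black": 0.9, "dark": 0.1},
}

docs = {
    "d1": "the black cat sat on the mat",
    "d2": "a dark room with a chat window",
}

def translate_query(terms):
    """Expand source terms into a weighted bag of target-language terms.
    Out-of-vocabulary terms pass through with weight 1.0."""
    weights = defaultdict(float)
    for s in terms:
        for t, p in translation.get(s, {s: 1.0}).items():
            weights[t] += p
    return weights

def score(doc_text, weights):
    # Simple weighted term-frequency match against the translated bag.
    tokens = doc_text.split()
    return sum(w * tokens.count(t) for t, w in weights.items())

def rank(query_terms):
    w = translate_query(query_terms)
    return sorted(docs, key=lambda d: -score(docs[d], w))
```

Note how the probability weights resolve ambiguity softly: "chat" contributes mostly to "cat" but still gives some credit to English "chat", rather than forcing a single dictionary choice as a naive word-for-word translation would.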
Deriving query suggestions for site search
Modern search engines have been moving away from simplistic interfaces that aimed at satisfying a user's need with a single-shot query. Interactive features are now integral parts of web search engines. However, generating good query modification suggestions remains a challenging issue. Query log analysis is one of the major strands of work in this direction. Although much research has been performed on query logs collected on the web as a whole, query log analysis to enhance search on smaller and more focused collections has attracted less attention, despite its increasing practical importance. In this article, we report on a systematic study of different query modification methods applied to a substantial query log collected on a local website that already uses an interactive search engine. We conducted experiments in which we asked users to assess the relevance of potential query modification suggestions that have been constructed using a range of log analysis methods and different baseline approaches. The experimental results demonstrate the usefulness of log analysis to extract query modification suggestions. Furthermore, our experiments demonstrate that a more fine-grained approach than grouping search requests into sessions allows for extraction of better refinement terms from query log files. © 2013 ASIS&T
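The log-analysis methods the article compares start from some way of grouping raw search requests. A common baseline, sketched below, is the time-gap heuristic: consecutive requests from the same user belong to one session if they fall within a fixed interval, and terms added between consecutive queries in a session are candidate refinement suggestions. The 15-minute cutoff and the sample log are illustrative assumptions, not the article's actual parameters.

```python
from datetime import datetime, timedelta

GAP = timedelta(minutes=15)  # assumed session timeout, a common baseline

def sessionize(entries, gap=GAP):
    """entries: time-sorted list of (timestamp, query) for one user.
    Start a new session whenever the gap to the previous request
    exceeds the timeout."""
    sessions = []
    for ts, q in entries:
        if sessions and ts - sessions[-1][-1][0] <= gap:
            sessions[-1].append((ts, q))
        else:
            sessions.append([(ts, q)])
    return sessions

def refinement_terms(session):
    """Terms a user added when reformulating a query within a session;
    these are candidate query modification suggestions."""
    added = []
    for (_, prev), (_, cur) in zip(session, session[1:]):
        added.extend(sorted(set(cur.split()) - set(prev.split())))
    return added
```

The article's finding that a finer-grained grouping than whole sessions yields better refinement terms suggests operating on consecutive query pairs, as `refinement_terms` does, rather than pooling all terms across a session.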
The archive solution for distributed workflow management agents of the CMS experiment at LHC
The CMS experiment at the CERN LHC developed the Workflow Management Archive
system to persistently store unstructured framework job report documents
produced by distributed workflow management agents. In this paper we present
its architecture, implementation, deployment, and integration with the CMS and
CERN computing infrastructures, such as central HDFS and Hadoop Spark cluster.
The system leverages modern technologies such as a document oriented database
and the Hadoop ecosystem to provide the necessary flexibility to reliably
process, store, and aggregate on the order of 1M documents on a daily basis. We
describe the data transformation, the short and long term storage layers, the
query language, along with the aggregation pipeline developed to visualize
various performance metrics to assist CMS data operators in assessing the
performance of the CMS computing system.
Comment: This is a pre-print of an article published in Computing and Software
for Big Science. The final authenticated version is available online at:
https://doi.org/10.1007/s41781-018-0005-
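The aggregation pipeline the abstract mentions rolls large numbers of semi-structured job-report documents up into daily performance summaries. The sketch below shows the shape of such a roll-up in plain Python: group documents by day and site, counting jobs and summing CPU time. The field names (`timestamp`, `site`, `cpu_seconds`) are illustrative assumptions, not the actual CMS framework job report schema, and the real system performs this at scale on Hadoop/Spark rather than in a single process.

```python
from collections import defaultdict

def aggregate(docs):
    """Roll job-report documents up into per-(day, site) summaries.
    Each doc is a dict with an ISO-8601 'timestamp', a 'site', and an
    optional 'cpu_seconds' field (hypothetical schema)."""
    summary = defaultdict(lambda: {"jobs": 0, "cpu_seconds": 0.0})
    for d in docs:
        day = d["timestamp"][:10]           # "YYYY-MM-DD" prefix of ISO time
        key = (day, d["site"])
        summary[key]["jobs"] += 1
        summary[key]["cpu_seconds"] += d.get("cpu_seconds", 0.0)
    return dict(summary)
```

A summary table like this, rather than the raw million-document stream, is what a visualization layer for data operators would typically query.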
On User Modelling for Personalised News Video Recommendation
In this paper, we introduce a novel approach to modelling user interests. Our approach captures users' evolving information needs, identifies aspects of those needs, and recommends relevant news items to the users. We present our approach within the context of personalised news video retrieval, using a news video data set for experimentation, and we employ a simulated user evaluation.
Searching the intranet: Corporate users and their queries
By examining the log files from a corporate intranet search engine, we have analysed the actual web searching
behaviour of real users in a real business environment. While building on previous research on public search engines, we apply an alternative session definition that we argue is more appropriate. Our results regarding session length, query construction and result page viewing confirm some of the findings from similar studies carried out on public search engines, but further our understanding of web searching by presenting details on corporate users' activities. In particular, we suggest that search sessions are shorter than previously reported, that search queries have fewer terms than observed for public search engines, and that the number of examined result pages is smaller than reported in other research. More research on how corporate intranet users search for information is needed.