Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion Architecture
We present the architecture behind Twitter's real-time related query
suggestion and spelling correction service. Although these tasks have received
much attention in the web search literature, the Twitter context introduces a
real-time "twist": after significant breaking news events, we aim to provide
relevant results within minutes. This paper provides a case study illustrating
the challenges of real-time data processing in the era of "big data". We tell
the story of how our system was built twice: our first implementation was built
on a typical Hadoop-based analytics stack, but was later replaced because it
did not meet the latency requirements necessary to generate meaningful
real-time results. The second implementation, which is the system deployed in
production, is a custom in-memory processing engine specifically designed for
the task. This experience taught us that the current typical usage of Hadoop as
a "big data" platform, while great for experimentation, is not well suited to
low-latency processing, and points the way to future work on data analytics
platforms that can handle "big" as well as "fast" data.
Contextual Multilingual Spellchecker for User Queries
Spellchecking is one of the most fundamental and widely used search features.
Correcting misspelled user queries not only enhances the user
experience but is also expected by users. However, most widely available
spellchecking solutions are either less accurate than state-of-the-art
solutions or too slow for search use cases where latency is a key
requirement. Furthermore, most recent innovative architectures focus on
English, are not trained in a multilingual fashion, and target spell
correction in longer text, which is a different paradigm from spell correction
for user queries, where context is sparse (most queries are 1-2 words long).
Finally, since most enterprises have unique vocabularies such as product names,
off-the-shelf spelling solutions fall short of users' needs. In this work, we
build a multilingual spellchecker that is extremely fast and scalable and that
adapts its vocabulary and hence speller output based on a specific product's
needs. Furthermore, our speller outperforms general-purpose spellers by a wide
margin on in-domain datasets. Our multilingual speller is used in search in
Adobe products, powering autocomplete in various applications.
From Frequency to Meaning: Vector Space Models of Semantics
Computers understand very little of the meaning of human language. This
profoundly limits our ability to give instructions to computers, the ability of
computers to explain their actions to us, and the ability of computers to
analyse and process text. Vector space models (VSMs) of semantics are beginning
to address these limits. This paper surveys the use of VSMs for semantic
processing of text. We organize the literature on VSMs according to the
structure of the matrix in a VSM. There are currently three broad classes of
VSMs, based on term-document, word-context, and pair-pattern matrices, yielding
three classes of applications. We survey a broad range of applications in these
three categories and we take a detailed look at a specific open source project
in each category. Our goal in this survey is to show the breadth of
applications of VSMs for semantics, to provide a new perspective on VSMs for
those who are already familiar with the area, and to provide pointers into the
literature for those who are less familiar with the field.
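
To make the first of the three matrix classes concrete, here is a minimal
sketch of a term-document matrix with cosine similarity between document
columns. It is an illustration under stated assumptions, not code from the
survey: the toy documents, the whitespace tokenizer, the use of raw counts
without weighting, and the helper names (term_document_matrix, column, cosine)
are all hypothetical.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts term i in document j."""
    # Assumption: lowercase whitespace tokenization is enough for a toy example.
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


def column(matrix, j):
    """Extract document j as a vector of term counts."""
    return [row[j] for row in matrix]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical toy corpus: two documents on one topic, one on another.
docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply",
]
vocab, matrix = term_document_matrix(docs)
sim_01 = cosine(column(matrix, 0), column(matrix, 1))
sim_02 = cosine(column(matrix, 0), column(matrix, 2))
print(sim_01, sim_02)
```

Documents that share terms score higher than documents that share none. A
word-context matrix follows the same pattern with rows as words and columns as
co-occurrence contexts instead of documents.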