15,619 research outputs found
Scalable Semantic Matching of Queries to Ads in Sponsored Search Advertising
Sponsored search represents a major source of revenue for web search engines.
This popular advertising model brings a unique possibility for advertisers to
target users' immediate intent communicated through a search query, usually by
displaying their ads alongside organic search results for queries deemed
relevant to their products or services. However, due to a large number of
unique queries it is challenging for advertisers to identify all such relevant
queries. For this reason search engines often provide a service of advanced
matching, which automatically finds additional relevant queries for advertisers
to bid on. We present a novel advanced matching approach based on the idea of
semantic embeddings of queries and ads. The embeddings were learned using a
large data set of user search sessions, consisting of search queries, clicked
ads and search links, while utilizing contextual information such as dwell time
and skipped ads. To address the large-scale nature of our problem, both in
terms of data and vocabulary size, we propose a novel distributed algorithm for
training of the embeddings. Finally, we present an approach for overcoming a
cold-start problem associated with new ads and queries. We report results of
editorial evaluation and online tests on actual search traffic. The results
show that our approach significantly outperforms baselines in terms of
relevance, coverage, and incremental revenue. Lastly, we open-source learned
query embeddings to be used by researchers in computational advertising and
related fields.Comment: 10 pages, 4 figures, 39th International ACM SIGIR Conference on
Research and Development in Information Retrieval, SIGIR 2016, Pisa, Ital
Learning Visual Features from Snapshots for Web Search
When applying learning to rank algorithms to Web search, a large number of
features are usually designed to capture the relevance signals. Most of these
features are computed based on the extracted textual elements, link analysis,
and user logs. However, Web pages are not solely linked texts, but have
structured layout organizing a large variety of elements in different styles.
Such layout itself can convey useful visual information, indicating the
relevance of a Web page. For example, the query-independent layout (i.e., raw
page layout) can help identify the page quality, while the query-dependent
layout (i.e., page rendered with matched query words) can further tell rich
structural information (e.g., size, position and proximity) of the matching
signals. However, such visual information of layout has been seldom utilized in
Web search in the past. In this work, we propose to learn rich visual features
automatically from the layout of Web pages (i.e., Web page snapshots) for
relevance ranking. Both query-independent and query-dependent snapshots are
considered as the new inputs. We then propose a novel visual perception model
inspired by human's visual search behaviors on page viewing to extract the
visual features. This model can be learned end-to-end together with traditional
human-crafted features. We also show that such visual features can be
efficiently acquired in the online setting with an extended inverted indexing
scheme. Experiments on benchmark collections demonstrate that learning visual
features from Web page snapshots can significantly improve the performance of
relevance ranking in ad-hoc Web retrieval tasks.Comment: CIKM 201
Boilerplate Removal using a Neural Sequence Labeling Model
The extraction of main content from web pages is an important task for
numerous applications, ranging from usability aspects, like reader views for
news articles in web browsers, to information retrieval or natural language
processing. Existing approaches are lacking as they rely on large amounts of
hand-crafted features for classification. This results in models that are
tailored to a specific distribution of web pages, e.g. from a certain time
frame, but lack in generalization power. We propose a neural sequence labeling
model that does not rely on any hand-crafted features but takes only the HTML
tags and words that appear in a web page as input. This allows us to present a
browser extension which highlights the content of arbitrary web pages directly
within the browser using our model. In addition, we create a new, more current
dataset to show that our model is able to adapt to changes in the structure of
web pages and outperform the state-of-the-art model.Comment: WWW20 Demo pape
- …