3,184 research outputs found
Adversarial Sampling and Training for Semi-Supervised Information Retrieval
Ad-hoc retrieval models with implicit feedback often have problems, e.g., the
imbalanced classes in the data set. Too few clicked documents may hurt
generalization ability of the models, whereas too many non-clicked documents
may harm effectiveness of the models and efficiency of training. In addition,
recent neural network-based models are vulnerable to adversarial examples due
to the linear nature in them. To solve the problems at the same time, we
propose an adversarial sampling and training framework to learn ad-hoc
retrieval models with implicit feedback. Our key idea is (i) to augment clicked
examples by adversarial training for better generalization and (ii) to obtain
very informational non-clicked examples by adversarial sampling and training.
Experiments are performed on benchmark data sets for common ad-hoc retrieval
tasks such as Web search, item recommendation, and question answering.
Experimental results indicate that the proposed approaches significantly
outperform strong baselines especially for high-ranked documents, and they
outperform IRGAN in NDCG@5 using only 5% of labeled data for the Web search
task.Comment: Published in WWW 201
Neural Networks for Information Retrieval
Machine learning plays a role in many aspects of modern IR systems, and deep
learning is applied in all of them. The fast pace of modern-day research has
given rise to many different approaches for many different IR problems. The
amount of information available can be overwhelming both for junior students
and for experienced researchers looking for new research topics and directions.
Additionally, it is interesting to see what key insights into IR problems the
new technologies are able to give us. The aim of this full-day tutorial is to
give a clear overview of current tried-and-trusted neural methods in IR and how
they benefit IR research. It covers key architectures, as well as the most
promising future directions.Comment: Overview of full-day tutorial at SIGIR 201
Link-based similarity search to fight web spam
www.ilab.sztaki.hu/websearch We investigate the usability of similarity search in fighting Web spam based on the assumption that an unknown spam page is more similar to certain known spam pages than to honest pages. In order to be successful, search engine spam never appears in isolation: we observe link farms and alliances for the sole purpose of search engine ranking manipulation. The artificial nature and strong inside connectedness however gave rise to successful algorithms to identify search engine spam. One example is trust and distrust propagation, an idea originating in recommender systems and P2P networks, that yields spam classificators by spreading information along hyperlinks from white and blacklists. While most previous results use PageRank variants for propagation, we form classifiers by investigating similarity top lists of an unknown page along various measures such as co-citation, companion, nearest neighbors in low dimensional projections and SimRank. We test our method over two data sets previously used to measure spam filtering algorithms. 1
Convergence of Learning Dynamics in Information Retrieval Games
We consider a game-theoretic model of information retrieval with strategic
authors. We examine two different utility schemes: authors who aim at
maximizing exposure and authors who want to maximize active selection of their
content (i.e. the number of clicks). We introduce the study of author learning
dynamics in such contexts. We prove that under the probability ranking
principle (PRP), which forms the basis of the current state of the art ranking
methods, any better-response learning dynamics converges to a pure Nash
equilibrium. We also show that other ranking methods induce a strategic
environment under which such a convergence may not occur
Characterizing Phishing Threats with Natural Language Processing
Spear phishing is a widespread concern in the modern network security
landscape, but there are few metrics that measure the extent to which
reconnaissance is performed on phishing targets. Spear phishing emails closely
match the expectations of the recipient, based on details of their experiences
and interests, making them a popular propagation vector for harmful malware. In
this work we use Natural Language Processing techniques to investigate a
specific real-world phishing campaign and quantify attributes that indicate a
targeted spear phishing attack. Our phishing campaign data sample comprises 596
emails - all containing a web bug and a Curriculum Vitae (CV) PDF attachment -
sent to our institution by a foreign IP space. The campaign was found to
exclusively target specific demographics within our institution. Performing a
semantic similarity analysis between the senders' CV attachments and the
recipients' LinkedIn profiles, we conclude with high statistical certainty (p
) that the attachments contain targeted rather than randomly
selected material. Latent Semantic Analysis further demonstrates that
individuals who were a primary focus of the campaign received CVs that are
highly topically clustered. These findings differentiate this campaign from one
that leverages random spam.Comment: This paper has been accepted for publication by the IEEE Conference
on Communications and Network Security in September 2015 at Florence, Italy.
Copyright may be transferred without notice, after which this version may no
longer be accessibl
An adversarial imitation click model for information retrieval
Modern information retrieval systems, including web search, ads placement, and recommender systems, typically rely on learning from user feedback. Click models, which study how users interact with a ranked list of items, provide a useful understanding of user feedback for learning ranking models. Constructing "right"dependencies is the key of any successful click model. However, probabilistic graphical models (PGMs) have to rely on manually assigned dependencies, and oversimplify user behaviors. Existing neural network based methods promote PGMs by enhancing the expressive ability and allowing flexible dependencies, but still suffer from exposure bias and inferior estimation. In this paper, we propose a novel framework, Adversarial Imitation Click Model (AICM), based on imitation learning. Firstly, we explicitly learn the reward function that recovers users' intrinsic utility and underlying intentions. Secondly, we model user interactions with a ranked list as a dynamic system instead of one-step click prediction, alleviating the exposure bias problem. Finally, we minimize the JS divergence through adversarial training and learn a stable distribution of click sequences, which makes AICM generalize well across different distributions of ranked lists. A theoretical analysis has indicated that AICM reduces the exposure bias from O(T2) to O(T). Our studies on a public web search dataset show that AICM not only outperforms state-of-the-art models in traditional click metrics but also achieves superior performance in addressing the exposure bias and recovering the underlying patterns of click sequences
- …