774 research outputs found
University of Twente @ TREC 2009: Indexing half a billion web pages
This report presents results for the TREC 2009 adhoc task, the diversity task, and the relevance feedback task. We present ideas for unsupervised tuning of search system, an approach for spam removal, and the use of categories and query log information for diversifying search results
Estimating labels from label proportions
Consider the following problem: given sets of unlabeled observations, each set with known label proportions, predict the labels of another set of observations, also with known label proportions. This problem appears in areas like e-commerce, spam filtering and improper content detection. We present consistent estimators which can reconstruct the correct labels with high probability in a uniform convergence sense. Experiments show that our method works well in practice.
Wasserstein Distance Guided Representation Learning for Domain Adaptation
Domain adaptation aims at generalizing a high-performance learner on a target
domain via utilizing the knowledge distilled from a source domain which has a
different but related data distribution. One solution to domain adaptation is
to learn domain invariant feature representations while the learned
representations should also be discriminative in prediction. To learn such
representations, domain adaptation frameworks usually include a domain
invariant representation learning approach to measure and reduce the domain
discrepancy, as well as a discriminator for classification. Inspired by
Wasserstein GAN, in this paper we propose a novel approach to learn domain
invariant feature representations, namely Wasserstein Distance Guided
Representation Learning (WDGRL). WDGRL utilizes a neural network, denoted by
the domain critic, to estimate empirical Wasserstein distance between the
source and target samples and optimizes the feature extractor network to
minimize the estimated Wasserstein distance in an adversarial manner. The
theoretical advantages of Wasserstein distance for domain adaptation lie in its
gradient property and promising generalization bound. Empirical studies on
common sentiment and image classification adaptation datasets demonstrate that
our proposed WDGRL outperforms the state-of-the-art domain invariant
representation learning approaches.Comment: The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI
2018
Deep Learning for Phishing Detection: Taxonomy, Current Challenges and Future Directions
This work was supported in part by the Ministry of Higher Education under the Fundamental Research Grant Scheme under Grant FRGS/1/2018/ICT04/UTM/01/1; and in part by the Faculty of Informatics and Management, University of Hradec Kralove, through SPEV project under Grant 2102/2022.Phishing has become an increasing concern and captured the attention of end-users as well
as security experts. Existing phishing detection techniques still suffer from the de ciency in performance
accuracy and inability to detect unknown attacks despite decades of development and improvement.
Motivated to solve these problems, many researchers in the cybersecurity domain have shifted their attention
to phishing detection that capitalizes on machine learning techniques. Deep learning has emerged as a branch
of machine learning that becomes a promising solution for phishing detection in recent years. As a result,
this study proposes a taxonomy of deep learning algorithm for phishing detection by examining 81 selected
papers using a systematic literature review approach. The paper rst introduces the concept of phishing and
deep learning in the context of cybersecurity. Then, taxonomies of phishing detection and deep learning
algorithm are provided to classify the existing literature into various categories. Next, taking the proposed
taxonomy as a baseline, this study comprehensively reviews the state-of-the-art deep learning techniques
and analyzes their advantages as well as disadvantages. Subsequently, the paper discusses various issues
that deep learning faces in phishing detection and proposes future research directions to overcome these
challenges. Finally, an empirical analysis is conducted to evaluate the performance of various deep learning
techniques in a practical context, and to highlight the related issues that motivate researchers in their future
works. The results obtained from the empirical experiment showed that the common issues among most of
the state-of-the-art deep learning algorithms are manual parameter-tuning, long training time, and de cient
detection accuracy.Ministry of Higher Education under the Fundamental Research Grant Scheme FRGS/1/2018/ICT04/UTM/01/1Faculty of Informatics and Management, University of Hradec Kralove, through SPEV project 2102/202
Fidelity-Weighted Learning
Training deep neural networks requires many training samples, but in practice
training labels are expensive to obtain and may be of varying quality, as some
may be from trusted expert labelers while others might be from heuristics or
other sources of weak supervision such as crowd-sourcing. This creates a
fundamental quality versus-quantity trade-off in the learning process. Do we
learn from the small amount of high-quality data or the potentially large
amount of weakly-labeled data? We argue that if the learner could somehow know
and take the label-quality into account when learning the data representation,
we could get the best of both worlds. To this end, we propose
"fidelity-weighted learning" (FWL), a semi-supervised student-teacher approach
for training deep neural networks using weakly-labeled data. FWL modulates the
parameter updates to a student network (trained on the task we care about) on a
per-sample basis according to the posterior confidence of its label-quality
estimated by a teacher (who has access to the high-quality labels). Both
student and teacher are learned from the data. We evaluate FWL on two tasks in
information retrieval and natural language processing where we outperform
state-of-the-art alternative semi-supervised methods, indicating that our
approach makes better use of strong and weak labels, and leads to better
task-dependent data representations.Comment: Published as a conference paper at ICLR 201
- …