111 research outputs found
Anytime Discovery of a Diverse Set of Patterns with Monte Carlo Tree Search
The discovery of patterns that accurately discriminate one class label from another remains a challenging data mining task. Subgroup discovery (SD) is one of the frameworks that makes it possible to elicit such interesting patterns from labeled data. A question remains fairly open: how to select an accurate heuristic search technique when exhaustive enumeration of the pattern space is infeasible? Existing approaches make use of beam search, sampling, and genetic algorithms for discovering a pattern set that is non-redundant and of high quality w.r.t. a pattern quality measure. We argue that such approaches produce pattern sets that lack diversity: only a few patterns of high quality, and different enough, are discovered. Our main contribution is then to formally define pattern mining as a game and to solve it with Monte Carlo tree search (MCTS). It can be seen as an exhaustive search guided by random simulations which can be stopped early (limited budget) by virtue of its best-first search property. We show through a comprehensive set of experiments how MCTS enables the anytime discovery of a diverse pattern set of high quality. It outperforms other approaches when dealing with a large pattern search space and for different quality measures. Thanks to its genericity, our MCTS approach can be used for SD but also for many other pattern mining tasks.
A Survey on Cross-domain Recommendation: Taxonomies, Methods, and Future Directions
Traditional recommendation systems are faced with two long-standing
obstacles, namely, data sparsity and cold-start problems, which promote the
emergence and development of Cross-Domain Recommendation (CDR). The core idea
of CDR is to leverage information collected from other domains to alleviate the
two problems in one domain. Over the last decade, much effort has been
devoted to cross-domain recommendation. Recently, with the development of deep
learning and neural networks, a large number of methods have emerged. However,
there is a limited number of systematic surveys on CDR, especially regarding
the latest proposed methods as well as the recommendation scenarios and
recommendation tasks they address. In this survey paper, we first propose a
two-level taxonomy of cross-domain recommendation which classifies different
recommendation scenarios and recommendation tasks. We then introduce and
summarize existing cross-domain recommendation approaches under different
recommendation scenarios in a structured manner. We also catalog commonly
used datasets. We conclude this survey by providing several potential research
directions in this field.
Certifying LLM Safety against Adversarial Prompting
Large language models (LLMs) released for public use incorporate guardrails
to ensure their output is safe, often referred to as "model alignment." An
aligned language model should decline a user's request to produce harmful
content. However, such safety measures are vulnerable to adversarial prompts,
which contain maliciously designed token sequences to circumvent the model's
safety guards and cause it to produce harmful content. In this work, we
introduce erase-and-check, the first framework to defend against adversarial
prompts with verifiable safety guarantees. We erase tokens individually and
inspect the resulting subsequences using a safety filter. Our procedure labels
the input prompt as harmful if the filter detects the input prompt itself, or
any of its subsequences, as harmful. This guarantees that any adversarial
modification of a harmful prompt up to a certain size is also labeled harmful.
We defend against three attack modes: i) adversarial suffix, which appends an
adversarial sequence at the end of the prompt; ii) adversarial insertion, where
the adversarial sequence is inserted anywhere in the middle of the prompt; and
iii) adversarial infusion, where adversarial tokens are inserted at arbitrary
positions in the prompt, not necessarily as a contiguous block. Empirical
results demonstrate that our technique obtains strong certified safety
guarantees on harmful prompts while maintaining good performance on safe
prompts. For example, against adversarial suffixes of length 20, it certifiably
detects 93% of the harmful prompts and labels 94% of the safe prompts as safe
using the open-source language model Llama 2 as the safety filter.
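
The suffix mode described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `erase_and_check_suffix`, the parameter `max_erase`, and the keyword-matching `toy_filter` are assumptions made for illustration, and a real deployment would use a language model as the safety filter, as the abstract states.

```python
from typing import Callable, List, Sequence

Filter = Callable[[Sequence[str]], bool]


def erase_and_check_suffix(tokens: Sequence[str],
                           is_harmful: Filter,
                           max_erase: int) -> bool:
    """Return True if the prompt, or the prompt with up to `max_erase`
    trailing tokens erased, is flagged by the safety filter."""
    if is_harmful(tokens):
        return True
    # Erase 1, 2, ..., max_erase tokens from the end, never the whole prompt.
    for i in range(1, min(max_erase, len(tokens) - 1) + 1):
        if is_harmful(tokens[:-i]):
            return True
    return False


# Hypothetical stand-in for a safety filter: flags one known harmful prompt.
KNOWN_HARMFUL = {("how", "to", "build", "a", "bomb")}


def toy_filter(tokens: Sequence[str]) -> bool:
    return tuple(tokens) in KNOWN_HARMFUL


harmful: List[str] = ["how", "to", "build", "a", "bomb"]
attacked: List[str] = harmful + ["zx", "qq"]      # adversarial suffix, length 2
safe: List[str] = ["tell", "me", "a", "joke"]

print(toy_filter(attacked))                         # filter alone is evaded
print(erase_and_check_suffix(attacked, toy_filter, 2))
print(erase_and_check_suffix(safe, toy_filter, 2))
```

The sketch makes the certificate visible: if the filter flags a harmful prompt, any suffix of at most `max_erase` tokens appended to it is erased in one of the checked subsequences, so the attacked prompt is flagged as well. The cost is one extra filter call per erased length.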
Tackling scalability issues in mining path patterns from knowledge graphs: a preliminary study
Features mined from knowledge graphs are widely used within multiple
knowledge discovery tasks such as classification or fact-checking. Here, we
consider a given set of vertices, called seed vertices, and focus on mining
their associated neighboring vertices, paths, and, more generally, path
patterns that involve classes of ontologies linked with knowledge graphs. Due
to the combinatorial nature and the increasing size of real-world knowledge
graphs, the task of mining these patterns immediately entails scalability
issues. In this paper, we address these issues by proposing a pattern mining
approach that relies on a set of constraints (e.g., support or degree
thresholds) and the monotonicity property. As our motivation comes from the
mining of real-world knowledge graphs, we illustrate our approach with PGxLOD,
a biomedical knowledge graph.
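
To make the role of a support threshold and monotonicity concrete, here is a simplified Python sketch. It is an assumption-laden illustration, not the paper's method: patterns are reduced to sequences of edge labels starting from seed vertices (ontology classes are ignored), support is taken to be the number of seed vertices from which the pattern can be followed, and the graph, labels, and function name `mine_path_patterns` are hypothetical. Because extending a path can never increase its support, a pattern below the threshold is dropped and never extended.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Graph = Dict[str, List[Tuple[str, str]]]   # vertex -> [(edge label, neighbor)]
Pattern = Tuple[str, ...]


def mine_path_patterns(graph: Graph, seeds: List[str],
                       min_support: int, max_length: int) -> Dict[Pattern, int]:
    """Level-wise mining of edge-label paths with support-based pruning."""
    # For each kept pattern: seed -> set of vertices reached by that pattern.
    frontier: Dict[Pattern, Dict[str, Set[str]]] = {(): {s: {s} for s in seeds}}
    frequent: Dict[Pattern, int] = {}
    for _ in range(max_length):
        candidates = defaultdict(lambda: defaultdict(set))
        for pattern, ends_by_seed in frontier.items():
            for seed, ends in ends_by_seed.items():
                for vertex in ends:
                    for label, neighbor in graph.get(vertex, []):
                        candidates[pattern + (label,)][seed].add(neighbor)
        # Keep only patterns meeting the threshold; the rest are never
        # extended, since their extensions cannot have higher support.
        frontier = {}
        for pattern, ends_by_seed in candidates.items():
            support = len(ends_by_seed)
            if support >= min_support:
                frontier[pattern] = ends_by_seed
                frequent[pattern] = support
    return frequent


# Hypothetical toy graph; vertex and label names are illustrative only.
toy_graph: Graph = {
    "d1": [("treats", "p1"), ("targets", "g1")],
    "d2": [("treats", "p2"), ("targets", "g1")],
    "d3": [("targets", "g2")],
    "g1": [("associatedWith", "p1")],
}

result = mine_path_patterns(toy_graph, ["d1", "d2", "d3"],
                            min_support=2, max_length=2)
for pattern, support in sorted(result.items()):
    print(pattern, support)
```

Raising `min_support` to 3 in this toy example keeps only the single-edge pattern `("targets",)`, and its two-edge extension is discarded without further expansion, which is the kind of pruning that keeps the search tractable on larger graphs.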