Automatic Detection of Malware-Generated Domains with Recurrent Neural Models
Modern malware families often rely on domain-generation algorithms (DGAs) to
determine rendezvous points with their command-and-control servers. Traditional
defence strategies (such as blacklisting domains or IP addresses) are
inadequate against such techniques due to the large and continuously changing
list of domains produced by these algorithms. This paper demonstrates that a
machine learning approach based on recurrent neural networks is able to detect
domain names generated by DGAs with high precision. The neural models are
estimated on a large training set of domains generated by various malware families.
Experimental results show that this data-driven approach can detect
malware-generated domain names with a F_1 score of 0.971. To put it
differently, the model can automatically detect 93 % of malware-generated
domain names for a false positive rate of 1:100.Comment: Submitted to NISK 201
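As an illustration of the kind of recurrent classifier the abstract describes, here is a minimal character-level sketch in PyTorch. The architecture details (embedding size, a single LSTM layer, sigmoid output) are assumptions for illustration, not the paper's actual configuration.

import torch
import torch.nn as nn

CHARS = "abcdefghijklmnopqrstuvwxyz0123456789-."
CHAR2IDX = {c: i + 1 for i, c in enumerate(CHARS)}  # 0 = padding/unknown

class DomainClassifier(nn.Module):
    # illustrative hyperparameters, not the paper's configuration
    def __init__(self, vocab_size=len(CHARS) + 1, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        # x: (batch, seq_len) tensor of character indices
        _, (h_n, _) = self.lstm(self.embed(x))
        # the final hidden state summarises the whole domain name
        return torch.sigmoid(self.out(h_n[-1])).squeeze(-1)

def encode(domain, max_len=63):
    idx = [CHAR2IDX.get(c, 0) for c in domain.lower()[:max_len]]
    return torch.tensor(idx + [0] * (max_len - len(idx)))

model = DomainClassifier()
batch = torch.stack([encode("example.com"), encode("xjw3kqpz0d.net")])
print(model(batch))  # untrained scores in (0, 1); train with nn.BCELoss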
Not All Dialogues are Created Equal: Instance Weighting for Neural Conversational Models
Neural conversational models require substantial amounts of dialogue data for
their parameter estimation and are therefore usually learned on large corpora
such as chat forums or movie subtitles. These corpora are, however, often
challenging to work with, notably due to their frequent lack of turn
segmentation and the presence of multiple references external to the dialogue
itself. This paper shows that these challenges can be mitigated by adding a
weighting model into the architecture. The weighting model, which is itself
estimated from dialogue data, associates each training example with a numerical
weight that reflects its intrinsic quality for dialogue modelling. At training
time, these sample weights are incorporated into the empirical loss to be
minimised. Evaluation results on retrieval-based models trained on movie and TV
subtitles demonstrate that the inclusion of such a weighting model improves the
model performance on unsupervised metrics.
Comment: Accepted to SIGDIAL 201
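The weighting mechanism can be illustrated with a short sketch of a weighted empirical loss. Here the weights are supplied directly as a tensor; in the paper they come from a learned weighting model.

import torch
import torch.nn.functional as F

def weighted_cross_entropy(logits, targets, sample_weights):
    # per-example loss with no reduction, so each dialogue example
    # keeps its own contribution before weighting
    per_example = F.cross_entropy(logits, targets, reduction="none")
    # scale each example by its estimated quality weight, then average
    return (sample_weights * per_example).sum() / sample_weights.sum()

logits = torch.randn(4, 10, requires_grad=True)  # 4 examples, 10 classes
targets = torch.tensor([1, 3, 0, 7])
weights = torch.tensor([1.0, 0.2, 0.9, 0.5])     # hypothetical quality scores
loss = weighted_cross_entropy(logits, targets, weights)
loss.backward()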
Redefining Context Windows for Word Embedding Models: An Experimental Study
Distributional semantic models learn vector representations of words through
the contexts they occur in. Although the choice of context (which often takes
the form of a sliding window) has a direct influence on the resulting
embeddings, the exact role of this model component is still not fully
understood. This paper presents a systematic analysis of context windows based
on a set of four distinct hyper-parameters. We train continuous Skip-Gram
models on two English-language corpora for various combinations of these
hyper-parameters, and evaluate them on both lexical similarity and analogy
tasks. Notable experimental results are the positive impact of cross-sentential
contexts and the surprisingly good performance of right-context windows.
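A minimal sketch of what a parameterized context-window extractor along these lines might look like. The hyper-parameter names used here (size, direction, cross_sentential, weighting) are illustrative assumptions, not the paper's exact formulation.

def contexts(sentences, size=5, direction="symmetric",
             cross_sentential=False, weighting="uniform"):
    """Yield (target, context_word, weight) triples."""
    # optionally erase sentence boundaries by flattening the corpus
    corpora = [[t for s in sentences for t in s]] if cross_sentential else sentences
    for sent in corpora:
        for i, target in enumerate(sent):
            lo = i - size if direction in ("symmetric", "left") else i
            hi = i + size if direction in ("symmetric", "right") else i
            for j in range(max(lo, 0), min(hi, len(sent) - 1) + 1):
                if j == i:
                    continue
                d = abs(j - i)
                # uniform weights, or a linear decay with distance
                w = 1.0 if weighting == "uniform" else (size - d + 1) / size
                yield target, sent[j], w

sents = [["the", "cat", "sat"], ["it", "purred"]]
for t, c, w in contexts(sents, size=2, direction="right"):
    print(t, c, round(w, 2))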
Probabilistic Dialogue Models with Prior Domain Knowledge
Probabilistic models such as Bayesian Networks are now in widespread use in spoken dialogue systems, but their scalability to complex interaction domains remains a challenge. One central limitation is that the state space of such models grows exponentially with the problem size, which makes parameter estimation increasingly difficult, especially for domains where only limited training data is available. In this paper, we show how to capture the underlying structure of a dialogue domain in terms of probabilistic rules operating on the dialogue state. The probabilistic rules are associated with a small, compact set of parameters that can be directly estimated from data. We argue that the introduction of this abstraction mechanism yields probabilistic models that are easier to learn and generalise better than their unstructured counterparts. We empirically demonstrate the benefits of such an approach by learning a dialogue policy for a human-robot interaction domain based on a Wizard-of-Oz data set.
Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 179–188, Seoul, South Korea, 5–6 July 2012
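The rule-based abstraction can be illustrated with a toy sketch: a condition on the dialogue state is mapped to a small distribution over effects, and the effect probabilities are the rule's only parameters. The structure and names below are assumptions for illustration, not the paper's actual formalism.

import random

class ProbabilisticRule:
    def __init__(self, condition, effects):
        self.condition = condition  # predicate over the dialogue state
        self.effects = effects      # list of (effect_dict, probability)

    def apply(self, state):
        if not self.condition(state):
            return state
        effects, probs = zip(*self.effects)
        chosen = random.choices(effects, weights=probs)[0]
        return {**state, **chosen}

# toy rule: if the user asked the robot to move, it (probably) complies
rule = ProbabilisticRule(
    condition=lambda s: s.get("last_user_act") == "Request(Move)",
    effects=[({"next_action": "Move"}, 0.9),
             ({"next_action": "AskRepeat"}, 0.1)],
)
print(rule.apply({"last_user_act": "Request(Move)"}))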
Model-based Bayesian Reinforcement Learning for Dialogue Management
Reinforcement learning methods are increasingly used to optimise dialogue
policies from experience. Most current techniques are model-free: they directly
estimate the utility of various actions, without an explicit model of the
interaction dynamics. In this paper, we investigate an alternative strategy
grounded in model-based Bayesian reinforcement learning. Bayesian inference is
used to maintain a posterior distribution over the model parameters, reflecting
the model uncertainty. This parameter distribution is gradually refined as more data is collected, and is simultaneously used to plan the agent's actions. Within
this learning framework, we carried out experiments with two alternative
formalisations of the transition model, one encoded with standard multinomial
distributions, and one structured with probabilistic rules. We demonstrate the
potential of our approach with empirical results on a user simulator
constructed from Wizard-of-Oz data in a human-robot interaction scenario. The
results illustrate in particular the benefits of capturing prior domain
knowledge with high-level rules.
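The Dirichlet-multinomial machinery behind the multinomial variant can be sketched in a few lines: a posterior over transition parameters is kept as Dirichlet counts, updated from observed transitions, and sampled from when planning. The toy state and action spaces below are assumptions for illustration.

import numpy as np

n_states, n_actions = 3, 2
# one Dirichlet per (state, action) pair, uniform prior alpha = 1
alpha = np.ones((n_states, n_actions, n_states))

def observe(s, a, s_next):
    alpha[s, a, s_next] += 1  # the Bayesian update is just a count

def sample_model(rng):
    # draw one plausible transition model from the posterior
    return np.array([[rng.dirichlet(alpha[s, a])
                      for a in range(n_actions)]
                     for s in range(n_states)])

rng = np.random.default_rng(0)
observe(0, 1, 2)
observe(0, 1, 2)
T = sample_model(rng)  # use T to plan, e.g. via value iteration
print(T[0, 1])         # posterior now favours s=0, a=1 -> s'=2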
Should we use movie subtitles to study linguistic patterns of conversational speech? A study based on French, English and Taiwan Mandarin
Linguistic research benefits from the wide range of resources and software tools developed for natural language processing (NLP) tasks. However, NLP has a strong historical bias towards written language, thereby making these resources and tools often inadequate to address research questions related to the linguistic patterns of spontaneous speech. In this preliminary study, we investigate whether corpora of movie and TV subtitles can be employed to estimate data-driven NLP models adapted to conversational speech. In particular, the presented work explores lexical and syntactic distributional aspects across three genres (conversational, written and subtitles) and three languages (French, English and Taiwan Mandarin). Ongoing work focuses on comparing these three genres on the basis of deeper syntactic conversational patterns, using graph-based modelling and visualisation.
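One simple instance of such a distributional comparison is a divergence between per-genre unigram distributions. The sketch below uses Jensen-Shannon divergence over tiny placeholder corpora; the choice of measure and the data are assumptions, not the study's exact methodology.

from collections import Counter
import math

def unigram_dist(tokens, vocab):
    counts = Counter(tokens)
    total = sum(counts.values())
    return [counts[w] / total for w in vocab]

def jsd(p, q):
    # Jensen-Shannon divergence between two discrete distributions
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return (kl(p, m) + kl(q, m)) / 2

# toy stand-ins for the conversational and subtitle genres
conversational = "yeah well I mean you know".split()
subtitles = "well you know what I mean".split()
vocab = sorted(set(conversational) | set(subtitles))
print(jsd(unigram_dist(conversational, vocab),
          unigram_dist(subtitles, vocab)))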
Detecting machine-translated subtitles in large parallel corpora
Parallel corpora extracted from online repositories of movie and TV subtitles are employed in a wide range of NLP applications, from language modelling to machine translation and dialogue systems. However, the subtitles uploaded in such repositories exhibit varying levels of quality. A particularly difficult problem stems from the fact that a substantial number of these subtitles are not written by human subtitlers but are simply generated through the use of online translation engines. This paper investigates whether these machine-generated subtitles can be detected automatically using a combination of linguistic and extra-linguistic features. We show that a feedforward neural network trained on a small dataset of subtitles can detect machine-generated subtitles with an F1-score of 0.64. Furthermore, applying this detection model on an unlabelled sample of subtitles allows us to provide a statistical estimate for the proportion of subtitles that are machine-translated (or are at least of very low quality) in the full corpus.
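A minimal sketch of a feedforward detector over subtitle-level features, in the spirit of the classifier described above. The two features used here (out-of-vocabulary rate, mean sentence length) are invented placeholders for the paper's actual combination of linguistic and extra-linguistic features.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(2, 16), nn.ReLU(),  # 2 input features -> hidden layer
    nn.Linear(16, 1),             # single logit: machine-translated or not
)

# toy feature vectors: [oov_rate, mean_sentence_length]
x = torch.tensor([[0.02, 7.5], [0.21, 13.2]])
y = torch.tensor([[0.0], [1.0]])  # 0 = human, 1 = machine-translated

loss_fn = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
print(torch.sigmoid(model(x)))  # probabilities after fitting the toy data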
A Graph-to-Text Approach to Knowledge-Grounded Response Generation in Human-Robot Interaction
Knowledge graphs are often used to represent structured information in a
flexible and efficient manner, but their use in situated dialogue remains
under-explored. This paper presents a novel conversational model for
human-robot interaction that rests upon a graph-based representation of the
dialogue state. The knowledge graph representing the dialogue state is
continuously updated with new observations from the robot sensors, including
linguistic, situated and multimodal inputs, and is further enriched by other
modules, in particular for spatial understanding. The neural conversational
model employed to respond to user utterances relies on a simple but effective
graph-to-text mechanism that traverses the dialogue state graph and converts
the traversals into a natural language form. This conversion of the state graph
into text is performed using a set of parameterized functions, and the values
for those parameters are optimized based on a small set of Wizard-of-Oz
interactions. After this conversion, the text representation of the dialogue
state graph is included as part of the prompt of a large language model used to
decode the agent response. The proposed approach is empirically evaluated
through a user study with a humanoid robot that acts as a conversation partner to
evaluate the impact of the graph-to-text mechanism on the response generation.
After the robot was moved along a tour of an indoor environment, participants
interacted with the robot using spoken dialogue and evaluated how well the
robot was able to answer questions about what the robot observed during the
tour. User scores show a statistically significant improvement in the perceived
factuality of the robot responses when the graph-to-text approach is employed,
compared to a baseline using inputs structured as semantic triples.
Comment: Submitted to Dialogue & Discourse 202
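The graph-to-text mechanism can be illustrated with a toy traversal that verbalises graph edges into prompt text. The graph content, the templates and the traversal depth below are illustrative assumptions, not the paper's actual parameterized functions.

# toy dialogue-state graph: node -> list of (relation, object) edges
graph = {
    "robot": [("saw", "red_box"), ("is_in", "kitchen")],
    "red_box": [("is_on", "table")],
}

# illustrative verbalisation templates, one per relation
templates = {"saw": "{s} saw {o}", "is_in": "{s} is in {o}",
             "is_on": "{s} is on {o}"}

def verbalise(graph, start, max_depth=2):
    # breadth-first traversal, converting each visited edge to a sentence
    sentences, frontier, seen = [], [(start, 0)], set()
    while frontier:
        node, depth = frontier.pop(0)
        if node in seen or depth >= max_depth:
            continue
        seen.add(node)
        for rel, obj in graph.get(node, []):
            sentences.append(templates[rel].format(
                s=node.replace("_", " "), o=obj.replace("_", " ")))
            frontier.append((obj, depth + 1))
    return ". ".join(sentences) + "."

# the verbalised state becomes part of the LLM prompt
prompt = f"Known facts: {verbalise(graph, 'robot')}\nUser: What did you see?"
print(prompt)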