Search CORE

41,414 research outputs found

Weakly supervised deep learning for the detection of domain generation algorithms

Author: Choudhary Chhaya
De Cock Martine
Gray Daniel
Hu Jiaming
Nascimento Anderson C. A.
Pan Jie
Yu Bin
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2019
Field of study

Domain generation algorithms (DGAs) have become commonplace in malware that seeks to establish command and control communication between an infected machine and the botmaster. DGAs dynamically and consistently generate large volumes of malicious domain names, only a few of which are registered by the botmaster, within a short time window around their generation time, and subsequently resolved when the malware on the infected machine tries to access them. Deep neural networks that can classify domain names as benign or malicious are of great interest in the real-time defense against DGAs. In contrast with traditional machine learning models, deep networks do not rely on human engineered features. Instead, they can learn features automatically from data, provided that they are supplied with sufficiently large amounts of suitable training data. Obtaining cleanly labeled ground truth data is difficult and time consuming. Heuristically labeled data could potentially provide a source of training data for weakly supervised training of DGA detectors. We propose a set of heuristics for automatically labeling domain names monitored in real traffic, and then train and evaluate classifiers with the proposed heuristically labeled dataset. We show through experiments on a dataset with 50 million domain names that such heuristically labeled data is very useful in practice to improve the predictive accuracy of deep learning-based DGA classifiers, and that these deep neural networks significantly outperform a random forest classifier with human engineered features

Mosquito Detection with Neural Networks: The Buzz of Deep Learning

Author: Kiskin Ivan
Orozco Bernardo Pérez
Roberts Stephen
Sinka Marianne
Willis Kathy
Windebank Theo
Zilli Davide
Publication venue
Publication date: 15/05/2017
Field of study

Many real-world time-series analysis problems are characterised by scarce data. Solutions typically rely on hand-crafted features extracted from the time or frequency domain allied with classification or regression engines which condition on this (often low-dimensional) feature vector. The huge advances enjoyed by many application domains in recent years have been fuelled by the use of deep learning architectures trained on large data sets. This paper presents an application of deep learning for acoustic event detection in a challenging, data-scarce, real-world problem. Our candidate challenge is to accurately detect the presence of a mosquito from its acoustic signature. We develop convolutional neural networks (CNNs) operating on wavelet transformations of audio recordings. Furthermore, we interrogate the network's predictive power by visualising statistics of network-excitatory samples. These visualisations offer a deep insight into the relative informativeness of components in the detection problem. We include comparisons with conventional classifiers, conditioned on both hand-tuned and generic features, to stress the strength of automatic deep feature learning. Detection is achieved with performance metrics significantly surpassing those of existing algorithmic methods, as well as marginally exceeding those attained by individual human experts.Comment: For data and software related to this paper, see http://humbug.ac.uk/kiskin2017/. Submitted as a conference paper to ECML 201

arXiv.org e-Print Archive

Using Natural Language Processing with Deep Learning to Explore Clinical Notes

Author: Grinde Anders Benjamin
Johansen Bendik Mathias
Publication venue: The University of Bergen
Publication date: 01/01/2021
Field of study

In recent years, the deep learning community and technology have grown substantially, both in terms of research and applications. However, some application areas have lagged behind. The medical domain is an example of a field with a lot of untapped potential, partly caused by complex issues related to privacy and ethics. Still, deep learning is a very powerful tool to utilize structured and unstructured data, and could help save lives. In this thesis, we use natural language processing to interpret clinical notes and predict the mortality rate of subjects. We explore if language models trained on a specific domain would become more performant, and we compared them to language models trained on an intermediate data set. We found that our language model trained on an intermediate data set that had some resemblance to our target data set performed slightly better than its counterpart language model. We found that text classifiers built on top of the language models were capable of correctly predicting if a subject would die or not. Furthermore, we extracted the free-text features from the text classifiers and combined them, using stacking, with heterogeneous data as an attempt to increase the efficacy of the classifiers and to explore the relative performance boost gained by including free-text features. We found a correlation between the quality of text classifiers that produced the text features and the stacking classifiers' performances. The classifier that was trained on a data set without text features performed the worst, and the classifier trained on a data set with the best text features performed the best. We also discuss the central concerns that come with applying deep learning in a medical domain with regards to privacy and ethics. It is our intention that this thesis serves as a contribution to the advancement of deep learning within the medical domain, and as a testament as to what can be achieved with today's technology.Masteroppgave i Programutvikling samarbeid med HVLPROG399MAMN-PRO

University of Bergen