Multi-View Learning and Link Farm Discovery
The first part of this abstract focuses on estimation of mixture models for problems in which multiple views of the instances are available. Examples of this setting include clustering web pages or research papers that have intrinsic (text) and extrinsic (references) attributes. Mixture model estimation is a key problem for both semi-supervised and unsupervised learning. An appropriate optimization criterion quantifies the likelihood and the consensus among models in the individual views; maximizing this consensus minimizes a bound on the risk of assigning an instance to an incorrect mixture component. An EM algorithm maximizes this criterion. The second part of this abstract focuses on the problem of identifying link spam. Search engine optimizers inflate the PageRank of a target site by spinning an artificial web for the sole purpose of providing inbound links to the target. Discriminating natural from artificial web sites is a difficult multi-view problem.
Supervised clustering of streaming data for email batch detection
We address the problem of detecting batches of emails that have been created according to the same template. This problem is motivated by the desire to filter spam more effectively by exploiting collective information about entire batches of jointly generated messages. The application matches the problem setting of supervised clustering, because examples of correct clusterings can be collected. Known decoding procedures for supervised clustering are cubic in the number of instances. When decisions cannot be reconsidered once they have been made, owing to the streaming nature of the data, the decoding problem can be solved in linear time. We devise a sequential decoding procedure and derive the corresponding optimization problem of supervised clustering. We study the impact of collective attributes of email batches on the effectiveness of recognizing spam emails.
Oculomotoric Biometric Identification under the Influence of Alcohol and Fatigue
Patterns of micro- and macro-movements of the eyes are highly individual and can serve as a biometric characteristic. It is also known that both alcohol inebriation and fatigue can reduce saccadic velocity and accuracy. This prompts the question of whether changes in gaze patterns caused by alcohol consumption and fatigue impact the accuracy of oculomotoric biometric identification. We collect an eye tracking data set from 66 participants in sober, fatigued and alcohol-intoxicated states. We find that after enrollment in a rested and sober state, identity verification based on a deep neural embedding of gaze sequences is significantly less accurate when probe sequences are taken in either an inebriated or a fatigued state. Moreover, we find that fatigue and intoxication appear to randomize gaze patterns: when the model is fine-tuned for invariance with respect to inebriation and fatigue, and even when it is trained exclusively on inebriated training subjects, the model still performs significantly better for sober than for sleep-deprived or intoxicated subjects.
Detection of Drowsiness and Impending Microsleep from Eye Movements
Drowsiness is a contributing factor in an estimated 12% of all road traffic fatalities. It is known that drowsiness directly affects oculomotor control. We therefore investigate whether drowsiness can be detected based on eye movements. To this end, we develop deep neural sequence models that exploit a person's raw eye-gaze and eye-closure signals to detect drowsiness. We explore three measures of drowsiness ground truth: a widely used sleepiness self-assessment, reaction time, and impending microsleep in the near future. We find that our sequence models are able to detect drowsiness and outperform a baseline that processes established engineered features. We also find that the risk of a microsleep event in the near future can be predicted more accurately than the sleepiness self-assessment or the reaction time. Moreover, a model that has been trained on predicting microsleep also excels at predicting self-assessed sleepiness in a cross-task evaluation, which indicates that upcoming microsleep is a less noisy proxy of the drowsiness ground truth. We investigate the relative contribution of eye-closure and gaze information to the model's performance. In order to make the topic of drowsiness detection more accessible to the research community, we collect and share eye-gaze data from participants in baseline and sleep-deprived states.
Joint Detection of Malicious Domains and Infected Clients
Detection of malware-infected computers and detection of malicious web domains based on their encrypted HTTPS traffic are challenging problems, because only addresses, timestamps, and data volumes are observable. The detection problems are coupled, because infected clients tend to interact with malicious domains. Traffic data can be collected at a large scale, and antivirus tools can be used to identify infected clients in retrospect. Domains, by contrast, have to be labeled individually after forensic analysis. We explore transfer learning based on sluice networks; this allows the detection models to bootstrap each other. In a large-scale experimental study, we find that the model outperforms known reference models and detects previously unknown malware, previously unknown malware families, and previously unknown malicious domains. Comment: Mach Learn (2019)
Pre-Trained Language Models Augmented with Synthetic Scanpaths for Natural Language Understanding
Human gaze data offer cognitive information that reflects natural language comprehension. Indeed, augmenting language models with human scanpaths has proven beneficial for a range of NLP tasks, including language understanding. However, the applicability of this approach is hampered because the abundance of text corpora is contrasted by a scarcity of gaze data. Although models for the generation of human-like scanpaths during reading have been developed, the potential of synthetic gaze data across NLP tasks remains largely unexplored. We develop a model that integrates synthetic scanpath generation with a scanpath-augmented language model, eliminating the need for human gaze data. Since the model’s error gradient can be propagated throughout all parts of the model, the scanpath generator can be fine-tuned to downstream tasks. We find that the proposed model not only outperforms the underlying language model, but also achieves performance comparable to that of a language model augmented with real human gaze data. Our code is publicly available.
Fairness in Oculomotoric Biometric Identification
Gaze patterns are known to be highly individual, and therefore eye movements can serve as a biometric characteristic. We explore aspects of the fairness of biometric identification based on gaze patterns. We find that while oculomotoric identification does not favor any particular gender and does not significantly favor any age range, it is unfair with respect to ethnicity. Moreover, fairness concerning ethnicity cannot be achieved by balancing the training data for the best-performing model.
