195 research outputs found

    Multi-View Learning and Link Farm Discovery

    The first part of this abstract focuses on estimation of mixture models for problems in which multiple views of the instances are available. Examples of this setting include clustering web pages or research papers that have intrinsic (text) and extrinsic (references) attributes. Mixture model estimation is a key problem for both semi-supervised and unsupervised learning. An appropriate optimization criterion quantifies the likelihood and the consensus among models in the individual views; maximizing this consensus minimizes a bound on the risk of assigning an instance to an incorrect mixture component. An EM algorithm maximizes this criterion. The second part of this abstract focuses on the problem of identifying link spam. Search engine optimizers inflate the page rank of a target site by spinning an artificial web for the sole purpose of providing inbound links to the target. Discriminating natural from artificial web sites is a difficult multi-view problem.

    Supervised clustering of streaming data for email batch detection

    We address the problem of detecting batches of emails that have been created according to the same template. This problem is motivated by the desire to filter spam more effectively by exploiting collective information about entire batches of jointly generated messages. The application matches the problem setting of supervised clustering, because examples of correct clusterings can be collected. Known decoding procedures for supervised clustering are cubic in the number of instances. When decisions cannot be reconsidered once they have been made – owing to the streaming nature of the data – then the decoding problem can be solved in linear time. We devise a sequential decoding procedure and derive the corresponding optimization problem of supervised clustering. We study the impact of collective attributes of email batches on the effectiveness of recognizing spam emails.

    Oculomotoric Biometric Identification under the Influence of Alcohol and Fatigue

    Patterns of micro- and macro-movements of the eyes are highly individual and can serve as a biometric characteristic. It is also known that both alcohol inebriation and fatigue can reduce saccadic velocity and accuracy. This prompts the question of whether changes of gaze patterns caused by alcohol consumption and fatigue impact the accuracy of oculomotoric biometric identification. We collect an eye tracking data set from 66 participants in sober, fatigued and alcohol-intoxicated states. We find that after enrollment in a rested and sober state, identity verification based on a deep neural embedding of gaze sequences is significantly less accurate when probe sequences are taken in either an inebriated or a fatigued state. Moreover, we find that fatigue and intoxication appear to randomize gaze patterns: when the model is fine-tuned for invariance with respect to inebriation and fatigue, and even when it is trained exclusively on inebriated training persons, the model still performs significantly better for sober than for sleep-deprived or intoxicated subjects.

    Pre-Trained Language Models Augmented with Synthetic Scanpaths for Natural Language Understanding

    Human gaze data offer cognitive information that reflects natural language comprehension. Indeed, augmenting language models with human scanpaths has proven beneficial for a range of NLP tasks, including language understanding. However, the applicability of this approach is hampered because the abundance of text corpora is contrasted by a scarcity of gaze data. Although models for the generation of human-like scanpaths during reading have been developed, the potential of synthetic gaze data across NLP tasks remains largely unexplored. We develop a model that integrates synthetic scanpath generation with a scanpath-augmented language model, eliminating the need for human gaze data. Since the model's error gradient can be propagated throughout all parts of the model, the scanpath generator can be fine-tuned to downstream tasks. We find that the proposed model not only outperforms the underlying language model, but achieves a performance that is comparable to a language model augmented with real human gaze data. Our code is publicly available. Comment: Pre-print for EMNLP 202

    Fairness in Oculomotoric Biometric Identification

    Gaze patterns are known to be highly individual, and therefore eye movements can serve as a biometric characteristic. We explore aspects of the fairness of biometric identification based on gaze patterns. We find that while oculomotoric identification does not favor any particular gender and does not significantly favor any age range, it is unfair with respect to ethnicity. Moreover, fairness concerning ethnicity cannot be achieved by balancing the training data for the best-performing model.
