7 research outputs found

    Removing Manually-Generated Boilerplate from Electronic Texts: Experiments with Project Gutenberg e-Books

    Collaborative work on unstructured or semi-structured documents, such as in literature corpora or source code, often involves agreed-upon templates containing metadata. These templates are not consistent across users and over time. Rule-based parsing of these templates is expensive to maintain and tends to fail as new documents are added. Statistical techniques based on frequent occurrences have the potential to automatically identify a large fraction of the templates, thus reducing the burden on the programmers. We investigate the case of the Project Gutenberg corpus, where most documents are in ASCII format with preambles and epilogues that are often copied and pasted or manually typed. We show that a statistical approach can solve most cases, though some documents require knowledge of English. We also survey various technical solutions that make our approach applicable to large data sets.
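    The frequency-based idea in this abstract can be sketched as follows. This is a minimal illustration, not the authors' exact procedure: the normalization rules and the 50% document-frequency threshold are assumptions chosen for the example.

```python
from collections import Counter

def find_boilerplate_lines(documents, min_fraction=0.5):
    """Flag lines that recur across many documents as likely template text.

    Lines are normalized (lowercased, whitespace-collapsed), counted once
    per document, and any line appearing in at least `min_fraction` of the
    documents is treated as boilerplate.
    """
    line_doc_counts = Counter()
    for doc in documents:
        # A set ensures each distinct line is counted once per document.
        for line in {" ".join(l.lower().split()) for l in doc.splitlines()}:
            if line:
                line_doc_counts[line] += 1
    threshold = min_fraction * len(documents)
    return {line for line, n in line_doc_counts.items() if n >= threshold}

docs = [
    "*** START OF THE PROJECT GUTENBERG EBOOK ***\nChapter 1\nIt was a dark night.",
    "*** START OF THE PROJECT GUTENBERG EBOOK ***\nChapter 1\nCall me Ishmael.",
    "*** START OF THE PROJECT GUTENBERG EBOOK ***\nA different body.",
]
print(find_boilerplate_lines(docs))
```

    Note that the repeated "Chapter 1" line crosses the threshold along with the true preamble marker, which illustrates why the abstract concedes that some cases still require knowledge of English.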

    “Got You!”: Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling

    Discriminating vandalism edits from non-vandalism edits in Wikipedia is a challenging task, as ill-intentioned edits can include a variety of content and be expressed in many different forms and styles. Previous studies are limited to rule-based methods and learning based on lexical features, lacking in linguistic analysis. In this paper, we propose a novel Web-based shallow syntactic-semantic modeling method, which utilizes Web search results as a resource and trains topic-specific n-tag and syntactic n-gram language models to detect vandalism. By combining basic task-specific and lexical features, we have achieved high F-measures using logistic boosting and logistic model tree classifiers, surpassing the results reported by major Wikipedia vandalism detection systems.
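    The core idea of scoring an edit against a topic-specific language model can be sketched with a toy add-one-smoothed word bigram model. This is a stand-in for the paper's Web-trained n-tag and syntactic n-gram models, with an illustrative one-sentence "corpus"; text that fits the topic scores higher than off-topic vandalism-like text.

```python
import math
from collections import Counter

def train_bigram_lm(corpus_tokens):
    """Return a scoring function based on an add-one-smoothed bigram model."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    vocab = len(unigrams)

    def logprob(tokens):
        # Average log-probability per bigram, so scores are length-normalized.
        lp = 0.0
        for a, b in zip(tokens, tokens[1:]):
            lp += math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
        return lp / max(len(tokens) - 1, 1)

    return logprob

topic_text = "the solar system contains eight planets orbiting the sun".split()
score = train_bigram_lm(topic_text)
print(score("the solar system".split()))      # on-topic: higher score
print(score("buy cheap pills now".split()))   # off-topic: lower score
```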

    Can We Fight Social Engineering Attacks By Social Means? Assessing Social Salience as a Means to Improve Phish Detection

    Phishing continues to be a problem for both individuals and organisations, with billions of dollars lost every year. We propose the use of nudges, more specifically social saliency nudges that aim to highlight important information to the user when evaluating emails. We used a signal detection analysis to assess the effects of both sender saliency (highlighting important fields from the sender) and receiver saliency (showing the number of other users in receipt of the same email). Sender saliency improved phish detection but did not introduce any unwanted response bias. Users were asked to rate their confidence in their own judgements, and these confidence scores were poorly calibrated with actual performance, particularly for phishing (as opposed to genuine) emails. We also examined the role of impulsive behaviour on phish detection, concluding that those who score highly on dysfunctional impulsivity are less likely to detect the presence of phishing emails.
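    Signal detection analysis separates how well users discriminate phish from genuine mail (sensitivity, d') from their overall tendency to answer "phish" (response bias, c). A generic sketch of the standard computation, with hypothetical counts rather than the study's data:

```python
from statistics import NormalDist

def sdt_measures(hits, misses, false_alarms, correct_rejections):
    """Compute sensitivity (d') and response bias (c) from raw counts,
    using the log-linear correction so rates of 0 or 1 never produce
    infinite z-scores."""
    z = NormalDist().inv_cdf
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    d_prime = z(hit_rate) - z(fa_rate)          # discriminability
    criterion = -0.5 * (z(hit_rate) + z(fa_rate))  # bias; ~0 means unbiased
    return d_prime, criterion

# Hypothetical condition: 40 phishing and 40 genuine emails judged.
d, c = sdt_measures(hits=30, misses=10, false_alarms=8, correct_rejections=32)
print(d, c)
```

    In this framing, the paper's finding reads as: sender saliency raised d' without shifting c.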

    Tracking Concept Drift at Feature Selection Stage in SpamHunting: An Anti-spam Instance-Based Reasoning System

    In this paper we propose a novel feature selection method able to handle concept drift problems in the spam filtering domain. The proposed technique is applied to a previously successful instance-based reasoning e-mail filtering system called SpamHunting. Our information criterion is based on several ideas extracted from the well-known information measure introduced by Shannon. We show how the results obtained by our previous system, in combination with the improved feature selection method, outperform classical machine learning techniques and other well-known lazy learning approaches. In order to evaluate the performance of all the analysed models, we employ two different corpora and six well-known metrics in various scenarios.
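    One common way to let feature selection track concept drift is to compute term statistics over a sliding window of recent messages, so the selected vocabulary adapts as spam wording changes. The sketch below uses a smoothed pointwise-mutual-information score between term presence and the spam label; it is loosely inspired by the Shannon-style criterion described here, not the actual SpamHunting measure, and the class and method names are hypothetical.

```python
import math
from collections import deque

class DriftAwareSelector:
    """Score terms over a sliding window of recent labelled messages."""

    def __init__(self, window=1000):
        # Old messages fall off the left end, so statistics follow the drift.
        self.window = deque(maxlen=window)

    def add(self, tokens, is_spam):
        self.window.append((set(tokens), is_spam))

    def term_score(self, term):
        # Smoothed pointwise mutual information between term presence and spam.
        n = len(self.window)
        n_spam = sum(1 for _, s in self.window if s)
        n_term = sum(1 for toks, _ in self.window if term in toks)
        n_both = sum(1 for toks, s in self.window if s and term in toks)
        p_t = (n_term + 1) / (n + 2)
        p_s = (n_spam + 1) / (n + 2)
        p_ts = (n_both + 1) / (n + 2)
        return math.log(p_ts / (p_t * p_s))

    def select(self, vocabulary, k):
        return sorted(vocabulary, key=self.term_score, reverse=True)[:k]

sel = DriftAwareSelector(window=500)
for _ in range(3):
    sel.add(["buy", "viagra", "now"], True)
    sel.add(["project", "meeting", "notes"], False)
print(sel.select({"viagra", "meeting", "now", "notes"}, 2))
```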

    A Comparative Performance Study of Feature Selection Methods for the Anti-spam Filtering Domain

    In this paper we analyse the strengths and weaknesses of the most commonly used feature selection methods in text categorization when they are applied to the spam problem domain. Several experiments with different feature selection methods and content-based filtering techniques are carried out and discussed. The Information Gain, χ²-test, Mutual Information and Document Frequency feature selection methods have been analysed in conjunction with Naïve Bayes, boosting trees, Support Vector Machines and ECUE models in different scenarios. From the experiments carried out, the underlying ideas behind feature selection methods are identified and applied to improving the feature selection process of SpamHunting, a novel anti-spam filtering software able to accurately classify suspicious e-mails.
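    Of the methods compared here, Information Gain is the easiest to make concrete: it is the class entropy minus the expected entropy after splitting the documents on term presence. A textbook sketch with hypothetical counts, not code from ECUE or SpamHunting:

```python
import math

def entropy(pos, neg):
    """Binary entropy in bits of a (spam, ham) count pair."""
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h

def information_gain(docs_with, docs_without):
    """IG of a term, given (spam, ham) counts for documents that do and
    do not contain it."""
    sw, hw = docs_with
    so, ho = docs_without
    total = sw + hw + so + ho
    prior = entropy(sw + so, hw + ho)
    cond = ((sw + hw) / total) * entropy(sw, hw) \
         + ((so + ho) / total) * entropy(so, ho)
    return prior - cond

# A perfectly discriminating term (hypothetical counts): IG equals the
# full 1 bit of class entropy.
print(information_gain((50, 0), (0, 50)))  # → 1.0
```

    A term distributed identically across spam and ham scores 0, which is why such terms are the first to be discarded by IG-based selection.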