87,851 research outputs found
Classification of authors for an automatic recommendation process for criminal responsibility
One problem in classification tasks is handling the features that characterize the classes. When the feature list is long, an algorithm resistant to the noise of irrelevant features can be used, or the features can be reduced. Authorship attribution, the task of assigning an anonymous text to one subject from a list of possible authors, has been widely addressed as an automatic text classification task. In it, n-grams can produce long feature lists even in small corpora. Despite this, there is a lack of research exposing the effects of using noise-resistant algorithms, reducing features, or combining both options. This paper addresses that gap using contributions to discussion forums related to organized crime. The results show that the evaluated classifiers generally benefit from feature reduction and that, thanks to such reduction, even classical algorithms outperform state-of-the-art classifiers considered highly noise resistant.
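The pipeline described in this abstract (long n-gram feature lists pruned by feature reduction before a classical classifier) can be sketched as follows; the corpus, author labels, n-gram range, and `k` are illustrative assumptions, not the paper's actual data or settings:

```python
# Sketch of n-gram feature reduction for authorship attribution.
# Corpus, labels, and all parameter values are made-up assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

texts = [
    "the suspect wrote in short, clipped sentences.",
    "long, winding prose full of subordinate clauses appears here.",
    "short. clipped. direct. always the same rhythm.",
    "elaborate phrasing, with many asides, defines this writer.",
]
authors = ["A", "B", "A", "B"]

pipeline = Pipeline([
    # Character n-grams produce long feature lists even on tiny corpora.
    ("ngrams", CountVectorizer(analyzer="char", ngram_range=(2, 3))),
    # Feature reduction: keep only the n-grams most associated with authors.
    ("reduce", SelectKBest(chi2, k=20)),
    # A classical classifier trained on the reduced feature set.
    ("clf", MultinomialNB()),
])
pipeline.fit(texts, authors)
print(pipeline.predict(["clipped sentences, short and direct."])[0])
```

Chi-squared selection is just one plausible reduction method; the abstract does not say which one the authors used.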
PhishDef: URL Names Say It All
Phishing is an increasingly sophisticated method to steal personal user
information using sites that pretend to be legitimate. In this paper, we take
the following steps to identify phishing URLs. First, we carefully select
lexical features of the URLs that are resistant to obfuscation techniques used
by attackers. Second, we evaluate the classification accuracy when using only
lexical features, both automatically and hand-selected, vs. when using
additional features. We show that lexical features are sufficient for all
practical purposes. Third, we thoroughly compare several classification
algorithms, and we propose to use an online method (AROW) that is able to
overcome noisy training data. Based on the insights gained from our analysis,
we propose PhishDef, a phishing detection system that uses only URL names and
combines the above three elements. PhishDef is a highly accurate method (when
compared to state-of-the-art approaches over real datasets), lightweight (thus
appropriate for online and client-side deployment), proactive (based on online
classification rather than blacklists), and resilient to training data
inaccuracies (thus enabling the use of large noisy training data).
Comment: 9 pages, submitted to IEEE INFOCOM 201
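The lexical-features-plus-online-learner idea can be sketched as below. AROW itself is not shipped by scikit-learn, so a `PassiveAggressiveClassifier` with `partial_fit` stands in for the online learner; the feature set, URLs, and labels are illustrative assumptions, not PhishDef's actual ones:

```python
# Sketch: classify URLs from lexical features only, with an online learner.
# PhishDef uses AROW; PassiveAggressiveClassifier is a stand-in here.
import re
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

def lexical_features(url):
    # Hand-selected lexical features of the URL string only
    # (no page content, no blacklist lookup) -- an assumed feature set.
    return {
        "length": len(url),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "num_digits": sum(c.isdigit() for c in url),
        "has_ip": 1 if re.search(r"\d+\.\d+\.\d+\.\d+", url) else 0,
    }

urls = [  # made-up examples: 1 = phishing, 0 = legitimate
    ("http://paypal.com.secure-login.example-update.net/verify", 1),
    ("http://192.168.10.1/bank/confirm-account-34891", 1),
    ("https://en.wikipedia.org/wiki/Phishing", 0),
    ("https://github.com/torvalds/linux", 0),
]
vec = DictVectorizer()
X = vec.fit_transform([lexical_features(u) for u, _ in urls])
y = [label for _, label in urls]

clf = PassiveAggressiveClassifier(random_state=0)
# partial_fit lets the model learn from a stream of labeled URLs,
# which is what makes the approach proactive rather than blacklist-based.
clf.partial_fit(X, y, classes=[0, 1])
pred = clf.predict(vec.transform(
    [lexical_features("http://secure-login.example-update.net/confirm")]))[0]
print(pred)
```

On four training URLs the prediction is not meaningful; the point is the shape of the pipeline: obfuscation-resistant lexical features, incremental updates, no page download.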
Influence of Noise and Data Characteristics on Classification Quality of Dispersed Data Using Neural Networks on the Fusion of Predictions
In this paper, the issues of classification based on dispersed data are considered. For this purpose, an approach is used in which prediction vectors are generated locally using the k-nearest neighbors classifier. On a central server, the final fusion of the prediction vectors is then performed by a neural network. The main aim of the study is to check the influence of various data characteristics (the number of conditional attributes, the number of objects, the number of decision classes) and of the degree of dispersion and noise intensity on the classification quality of the considered approach. For this purpose, 270 data sets differing in the above factors were generated. Experiments were carried out on these data sets and statistical tests were performed. Each of the examined factors was found to have a statistically significant impact on the quality of classification, with the number of conditional attributes, the degree of dispersion, and the noise intensity having the greatest impact. Multidimensionality in dispersed data affects the results positively, but the analyzed method is resistant only up to a certain degree of noise intensity and dispersion.
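The two-stage scheme in this abstract (local k-NN prediction vectors, central neural-network fusion) can be sketched as follows; the synthetic data, the dispersion into three nodes by attribute subsets, and all parameter values are illustrative assumptions:

```python
# Sketch of dispersed classification: each local node runs k-NN over its
# own attribute subset and emits a probability vector; a central neural
# network fuses the concatenated vectors. All settings are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, n_features=12, n_informative=6,
                           n_classes=3, random_state=0)

# Dispersion: each local node sees a disjoint subset of the attributes.
node_attrs = [slice(0, 4), slice(4, 8), slice(8, 12)]
local_models = [KNeighborsClassifier(n_neighbors=5).fit(X[:, s], y)
                for s in node_attrs]

def prediction_vectors(samples):
    # Each node produces a local prediction vector; the central server
    # concatenates them into the fusion network's input.
    return np.hstack([m.predict_proba(samples[:, s])
                      for m, s in zip(local_models, node_attrs)])

fusion = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000,
                       random_state=0)
fusion.fit(prediction_vectors(X), y)
print(fusion.score(prediction_vectors(X), y))  # training accuracy
```

The fusion network never sees raw attributes, only the local prediction vectors, which is what makes the approach usable when data cannot be centralized.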