
    Soft set approach for clustering web user transactions

    Rough set theory provides a methodology for data analysis based on the approximation of information systems. It revolves around the notion of discernibility, i.e., the ability to distinguish between objects based on their attribute values, and it allows inferring data dependencies that are useful for feature selection and decision model construction. Since every rough set is provably a soft set, we present, within the context of soft set theory, a soft set-based framework for partition attribute selection. The paper unifies existing work in this direction and introduces the concept of the maximum attribute relative to determine and rank attributes in a multi-valued information system. Experimental results demonstrate the potential of the proposed technique to discover attribute subsets, leading to partition selection models that achieve better coverage and lower computational time than the baseline techniques.
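    The following toy sketch illustrates the indiscernibility idea this line of work builds on: each attribute of an information system partitions the objects by value, and attributes can then be compared by the partitions they induce. The data and the simple partition-size ordering are illustrative assumptions; the paper's maximum attribute relative measure is defined in the full text.

    ```python
    # Toy illustration of indiscernibility in a multi-valued information
    # system: each attribute partitions the objects by value. The data and
    # the ordering criterion (class size) are assumptions; the paper's
    # "maximum attribute relative" measure is defined in the full text.
    from collections import defaultdict

    table = {  # web user transactions: objects x attributes (toy data)
        "u1": {"browser": "firefox", "pages": "news",  "os": "linux"},
        "u2": {"browser": "chrome",  "pages": "news",  "os": "linux"},
        "u3": {"browser": "firefox", "pages": "sport", "os": "mac"},
        "u4": {"browser": "chrome",  "pages": "sport", "os": "mac"},
    }

    def partition(attr):
        """Group objects that are indiscernible w.r.t. one attribute."""
        classes = defaultdict(set)
        for obj, row in table.items():
            classes[row[attr]].add(obj)
        return sorted(classes.values(), key=len, reverse=True)

    for attr in ("browser", "pages", "os"):
        print(attr, partition(attr))
    ```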

    A Parameter-Free Hybrid Clustering algorithm used for Malware Categorization

    Nowadays, numerous attacks by malware, such as viruses, backdoors, spyware, trojans and worms, present a major security threat to computer users. The most significant line of defense against malware is anti-virus (AV) products, which detect, remove, and characterize these threats. The ability of AV products to characterize threats successfully depends greatly on the method used to categorize malware profiles into groups, so clustering malware into families is a computer security topic of great interest. In this paper, based on the analysis of instructions extracted from malware samples, we propose a novel parameter-free hybrid clustering algorithm (PFHC) that combines the merits of hierarchical clustering and K-means for malware clustering. It not only generates a stable initial division but also selects the best K. PFHC uses agglomerative hierarchical clustering as its frame: it starts with N singleton clusters, each containing exactly one sample, reuses the centroids of the upper level at every level while merging the two nearest clusters, and finally runs K-means iterations to reach an approximately globally optimal division. PFHC evaluates the clustering validity of each iteration and selects the best K by comparing these values. Studies on a real daily data collection show that, compared with popular K-means and hierarchical clustering approaches, PFHC consistently generates much higher-quality clusters and is well suited to malware categorization.
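    A minimal sketch of the hybrid idea (an agglomerative cut seeding K-means for each candidate K, with a validity index choosing the best K) is given below. It uses scikit-learn with a silhouette score as a stand-in validity measure and synthetic data; the paper's exact merging scheme and validity criterion may differ.

    ```python
    # Minimal sketch of a hierarchical + K-means hybrid (not the authors'
    # exact PFHC): an agglomerative cut seeds K-means for each candidate K,
    # and a validity index (here: silhouette, an assumption) picks the best K.
    import numpy as np
    from sklearn.cluster import AgglomerativeClustering, KMeans
    from sklearn.metrics import silhouette_score

    def hybrid_cluster(X, k_max=10):
        best_labels, best_score, best_k = None, -1.0, None
        for k in range(2, min(k_max, len(X) - 1) + 1):
            # Stable initial division: centroids of an agglomerative cut at K.
            init = AgglomerativeClustering(n_clusters=k).fit_predict(X)
            centroids = np.vstack([X[init == c].mean(axis=0) for c in range(k)])
            # K-means then iterates from those centroids to a local optimum.
            km = KMeans(n_clusters=k, init=centroids, n_init=1).fit(X)
            score = silhouette_score(X, km.labels_)
            if score > best_score:
                best_labels, best_score, best_k = km.labels_, score, k
        return best_labels, best_score, best_k

    X = np.random.rand(200, 8)  # stand-in for malware instruction features
    labels, score, k = hybrid_cluster(X)
    print(f"best K = {k}, silhouette = {score:.3f}")
    ```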

    Spanish Corpora of tweets about COVID-19 vaccination for automatic stance detection

    The paper presents new annotated corpora for performing stance detection on Spanish Twitter data, most notably health-related tweets. The objectives of this research are threefold: (1) to develop a manually annotated benchmark corpus for emotion recognition that takes into account different variants of Spanish in social posts; (2) to evaluate the efficiency of semi-supervised models for extending such a corpus with unlabelled posts; and (3) to describe such short-text corpora via specialised topic modelling. A corpus of 2,801 tweets about COVID-19 vaccination was annotated by three native speakers as in favour (904), against (674) or neither (1,223), with a Fleiss’ kappa score of 0.725. Results show that self-training with an SVM base estimator can reduce annotation work while maintaining high model performance: the self-training model outperformed the other approaches and produced a corpus of 11,204 tweets with a macro-averaged F1 score of 0.94. A combination of sentence-level deep learning embeddings and density-based clustering was applied to explore the contents of both corpora, and topic quality was measured in terms of trustworthiness and the validation index.
    Agencia Estatal de Investigación | Ref. PID2020-113673RB-I00
    Xunta de Galicia | Ref. ED431C2018/55
    Fundação para a Ciência e a Tecnologia | Ref. UIDB/04469/2020
    Open access publication funded by: Universidade de Vigo/CISU
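    A minimal sketch of the self-training setup with an SVM base estimator, using scikit-learn's SelfTrainingClassifier: unlabelled tweets are marked with -1 and pseudo-labelled when the model is confident enough. The toy tweets, TF-IDF features, and confidence threshold are assumptions, not the paper's configuration.

    ```python
    # Sketch of self-training with an SVM base estimator (scikit-learn).
    # Toy data and threshold are assumptions, not the paper's settings.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.semi_supervised import SelfTrainingClassifier
    from sklearn.svm import SVC

    # Toy labelled tweets: 0 = in favour, 1 = against, 2 = neither.
    texts = [
        "la vacuna salva vidas", "vacunarse es un acto solidario", "me vacuno mañana",
        "no me vacuno jamás", "las vacunas dan miedo", "desconfío de la vacuna",
        "hoy llueve en Vigo", "qué partido tan aburrido", "me gusta el café",
    ]
    labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

    # Unlabelled tweets are marked -1 and filled in by self-training.
    unlabelled = ["las vacunas funcionan", "no pienso ponerme esa vacuna"]
    X = texts + unlabelled
    y = labels + [-1] * len(unlabelled)

    model = make_pipeline(
        TfidfVectorizer(),
        # SVC needs probability=True so self-training can threshold confidence.
        SelfTrainingClassifier(SVC(probability=True), threshold=0.6),
    )
    model.fit(X, y)
    print(model.predict(["la vacuna es segura"]))
    ```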

    Text categorization methods for automatic estimation of verbal intelligence

    In this paper we investigate whether conventional text categorization methods suffice to infer different verbal intelligence levels. This research goal relies on the hypothesis that the vocabulary speakers use reflects their verbal intelligence. Automatic verbal intelligence estimation of users in a spoken language dialog system may be useful when defining an optimal dialog strategy, since it improves the system's adaptation capabilities. The work is based on a corpus containing descriptions (i.e. monologs) of a short film by test persons with different educational backgrounds, together with the verbal intelligence scores of the speakers. First, a one-way analysis of variance was performed to compare the monologs with the film transcription and to demonstrate that there are differences in the vocabulary used by test persons with different verbal intelligence levels. Then, for the classification task, the monologs were represented as feature vectors using the classical TF–IDF weighting scheme, and the Naive Bayes, k-nearest neighbors and Rocchio classifiers were tested. We describe and compare these classification approaches, define the optimal classification parameters, and discuss the classification results obtained.
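    The setup described above maps naturally onto scikit-learn: TF-IDF vectors fed to Naive Bayes, k-nearest neighbors, and a Rocchio-style nearest-centroid classifier. The sketch below uses placeholder texts and labels rather than the paper's corpus.

    ```python
    # Sketch of the classification setup: TF-IDF features with Naive Bayes,
    # k-NN, and a Rocchio-style (nearest-centroid) classifier. Texts and
    # labels are placeholders, not the paper's monolog corpus.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
    from sklearn.pipeline import make_pipeline

    monologs = [
        "the film shows a man walking through a busy street market",
        "a man walks in the street and buys some fruit at a market",
        "person goes shop buys things then walks home again slowly",
        "he walk street and buy fruit then go home",
    ]
    levels = ["high", "high", "low", "low"]  # placeholder score classes

    for clf in (MultinomialNB(), KNeighborsClassifier(n_neighbors=1), NearestCentroid()):
        model = make_pipeline(TfidfVectorizer(), clf).fit(monologs, levels)
        print(type(clf).__name__, model.predict(["a man buys fruit at the market"]))
    ```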

    A Big Data Architecture for Early Identification and Categorization of Dark Web Sites

    The dark web has become notorious for its association with illicit activities, and there is a growing need for systems that automate the monitoring of this space. This paper proposes an end-to-end scalable architecture for the early identification of new Tor sites and the daily analysis of their content. The solution is built on an open-source Big Data stack for data serving with Kubernetes, Kafka, Kubeflow, and MinIO. It continuously discovers onion addresses in different sources (threat intelligence, code repositories, web-Tor gateways, and Tor repositories), downloads the HTML over Tor, deduplicates content using MinHash LSH, and categorizes it with BERTopic modeling (SBERT embeddings, UMAP dimensionality reduction, HDBSCAN document clustering, and c-TF-IDF topic keywords). In 93 days, the system identified 80,049 onion services and characterized 90% of them, addressing the challenge of Tor volatility. A disproportionate amount of repeated content was found, with only 6.1% of sites being unique. From the HTML files of the dark sites, 31 low-level topics were extracted, manually labelled, and grouped into 11 high-level topics; the most popular included sexual and violent content, repositories, search engines, carding, cryptocurrencies, and marketplaces. During the experiments, we identified 14 sites with 13,946 clones that shared a suspiciously similar mirroring rate per day, suggesting an extensive common phishing network. Among related works, this study is the most representative topic-based characterization of onion services to date.
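    A minimal sketch of the MinHash LSH deduplication step, using the datasketch library; the shingle length, similarity threshold, and toy pages are assumptions rather than the paper's settings.

    ```python
    # Sketch of MinHash LSH deduplication of crawled onion pages using
    # datasketch; shingle size and threshold are assumptions.
    from datasketch import MinHash, MinHashLSH

    def minhash(text, num_perm=128):
        m = MinHash(num_perm=num_perm)
        for shingle in {text[i:i + 5] for i in range(len(text) - 4)}:
            m.update(shingle.encode("utf8"))
        return m

    pages = {
        "a.onion": "welcome to the hidden wiki index of onion links",
        "b.onion": "welcome to the hidden wiki index of onion links!",  # near-clone
        "c.onion": "fresh cards and cvv dumps marketplace",
    }

    lsh = MinHashLSH(threshold=0.8, num_perm=128)  # Jaccard similarity cutoff
    unique = []
    for url, html in pages.items():
        m = minhash(html)
        if lsh.query(m):  # an indexed page is this similar: treat as clone
            continue
        lsh.insert(url, m)
        unique.append(url)

    print(unique)  # expected: ['a.onion', 'c.onion']
    ```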

    Kieliteknologia analytiikan tukena sotilas- ja viranomaistyössä (Language technology in support of analytics in military and public authority work)
