Search CORE

9 research outputs found

Learning from unequally reliable blind ensembles of classifiers

Author: Giannakis Georgios B.
Pagès Zamora Alba Maria
Traganitis Panagiotis
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2017

© 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

The rising interest in pattern recognition and data analytics has spurred the development of a plethora of machine learning algorithms and tools. However, as each algorithm has its strengths and weaknesses, one is motivated to judiciously fuse multiple algorithms in order to find the “best” performing one, for a given dataset. Ensemble learning aims to create a high-performance meta-algorithm, by combining the outputs from multiple algorithms. The present work introduces a simple blind scheme for learning from ensembles of classifiers, using joint matrix factorization. Blind refers to the combiner who has no knowledge of the ground-truth labels that each classifier has been trained on. Performance is evaluated on synthetic and real datasets.

Peer Reviewed. Postprint (author's final draft)

UPCommons. Portal del coneixement obert de la UPC

Blind Multiclass Ensemble Classification

Author: Giannakis Georgios B.
Pagès-Zamora Alba
Traganitis Panagiotis A.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2018

The rising interest in pattern recognition and data analytics has spurred the development of innovative machine learning algorithms and tools. However, as each algorithm has its strengths and limitations, one is motivated to judiciously fuse multiple algorithms in order to find the "best" performing one, for a given dataset. Ensemble learning aims at such a high-performance meta-algorithm, by combining the outputs from multiple algorithms. The present work introduces a blind scheme for learning from ensembles of classifiers, using a moment matching method that leverages joint tensor and matrix factorization. Blind refers to the combiner who has no knowledge of the ground-truth labels that each classifier has been trained on. A rigorous performance analysis is derived and the proposed scheme is evaluated on synthetic and real datasets.

Comment: To appear in IEEE Transactions in Signal Processing

arXiv.org e-Print Archive

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Getting ahead of the arms race: hothousing the coevolution of VirusTotal with a Packer

Author: Clark D.
Menéndez H.
T. Barr E.
Publication venue: MDPI
Publication date: 01/01/2021

Malware detection is in a coevolutionary arms race where the attackers and defenders are constantly seeking advantage. This arms race is asymmetric: detection is harder and more expensive than evasion. White hats must be conservative to avoid false positives when searching for malicious behaviour. We seek to redress this imbalance. Most of the time, black hats need only make incremental changes to evade them. On occasion, white hats make a disruptive move and find a new technique that forces black hats to work harder. Examples include system calls, signatures and machine learning. We present a method, called Hothouse, that combines simulation and search to accelerate the white hat’s ability to counter the black hat’s incremental moves, thereby forcing black hats to perform disruptive moves more often. To realise Hothouse, we evolve EEE, an entropy-based polymorphic packer for Windows executables. Playing the role of a black hat, EEE uses evolutionary computation to disrupt the creation of malware signatures. We enter EEE into the detection arms race with VirusTotal, the most prominent cloud service for running anti-virus tools on software. During our 6-month study, we continually improved EEE in response to VirusTotal, eventually learning a packer that produces packed malware whose evasiveness goes from an initial 51.8% median to 19.6%. We report both how well VirusTotal learns to detect EEE-packed binaries and how well VirusTotal forgets in order to reduce false positives. VirusTotal’s tools learn and forget fast, actually in about 3 days. We also show where VirusTotal focuses its detection efforts, by analysing EEE’s variants.

Middlesex University Research Repository

Detecting malware with information complexity

Author: Alshahwan N.
Barr E.
Clark D.
Danezis G.
Menéndez H.
Publication venue: MDPI
Publication date: 01/01/2020

Malware concealment is the predominant strategy for malware propagation. Black hats create variants of malware based on polymorphism and metamorphism. Malware variants, by definition, share some information. Although the concealment strategy alters this information, there are still patterns in the software. Given a zoo of labelled malware and benign-ware, we ask whether a suspect program is more similar to our malware or to our benign-ware. Normalized Compression Distance (NCD) is a generic metric that measures the shared information content of two strings. This measure opens a new front in the malware arms race, one where the countermeasures promise to be more costly for malware writers, who must now obfuscate patterns as strings qua strings, without reference to execution, in their variants. Our approach classifies disk-resident malware with 97.4% accuracy and a false positive rate of 3%. We demonstrate that its accuracy can be improved by combining NCD with the compressibility rates of executables using decision forests, paving the way for future improvements. We demonstrate that malware reported within a narrow time frame of a few days is more homogeneous than malware reported over two years, but that our method still classifies the latter with 95.2% accuracy and a 5% false positive rate. Due to its use of compression, the time and computation cost of our method is nontrivial. We show that simple approximation techniques can improve its running time by up to 63%. We compare our results to the results of applying the 59 anti-malware programs used on the VirusTotal website to our malware. Our approach outperforms each one used alone and matches that of all of them used collectively.
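
The NCD idea mentioned in this abstract can be illustrated with a minimal sketch. It uses the standard definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is a compressed length. The choice of zlib as the compressor, the helper names, and the nearest-neighbour labelling step are assumptions made for illustration only; the abstract does not specify the authors' compressor or classifier beyond mentioning decision forests.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed input (zlib is an assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def nearest_label(suspect: bytes, zoo: list[tuple[str, bytes]]) -> str:
    """Hypothetical helper: return the label of the zoo sample closest to
    the suspect under NCD (a simple 1-nearest-neighbour rule, not the
    classifier used in the paper)."""
    label, _ = min(zoo, key=lambda item: ncd(suspect, item[1]))
    return label


if __name__ == "__main__":
    # Toy byte strings standing in for labelled programs.
    sample_a = b"malware payload pattern " * 50
    sample_b = bytes(range(256)) * 5
    zoo = [("malware", sample_a), ("benign", sample_b)]
    suspect = b"malware payload pattern " * 40 + b"extra"
    print(nearest_label(suspect, zoo))
```

Smaller NCD values indicate more shared information, so the suspect is assigned the label of the sample it compresses best with.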

Middlesex University Research Repository

Getting ahead of the arms race: hothousing the coevolution of VirusTotal with a Packer

Author: Clark David
Menéndez Héctor D.
T. Barr Earl
Publication venue: 'MDPI AG'
Publication date: 26/03/2021

UCL Discovery

Middlesex University Research Repository

Scalable and Ensemble Learning for Big Data

Author: Traganitis Panagiotis
Publication date: 01/05/2019

University of Minnesota Ph.D. dissertation. May 2019. Major: Electrical/Computer Engineering. Advisor: Georgios Giannakis. 1 computer file (PDF); xi, 126 pages.

The turn of the decade has trademarked society and computing research with a "data deluge." As the number of smart, highly accurate and Internet-capable devices increases, so does the amount of data that is generated and collected. While this sheer amount of data has the potential to enable high quality inference, and mining of information, it introduces numerous challenges in the processing and pattern analysis, since available statistical inference and machine learning approaches do not necessarily scale well with the number of data and their dimensionality. In addition to the challenges related to scalability, data gathered are often noisy, dynamic, contaminated by outliers or corrupted to specifically inhibit the inference task. Moreover, many machine learning approaches have been shown to be susceptible to adversarial attacks. At the same time, the cost of cloud and distributed computing is rapidly declining. Therefore, there is a pressing need for statistical inference and machine learning tools that are robust to attacks and scale with the volume and dimensionality of the data, by harnessing efficiently the available computational resources. This thesis is centered on analytical and algorithmic foundations that aim to enable statistical inference and data analytics from large volumes of high-dimensional data. The vision is to establish a comprehensive framework based on state-of-the-art machine learning, optimization and statistical inference tools to enable truly large-scale inference, which can tap on the available (possibly distributed) computational resources, and be resilient to adversarial attacks. The ultimate goal is to both analytically and numerically demonstrate how valuable insights from signal processing can lead to markedly improved and accelerated learning tools.
To this end, the present thesis investigates two main research thrusts: i) large-scale subspace clustering; and ii) unsupervised ensemble learning. The aforementioned research thrusts introduce novel algorithms that aim to tackle the issues of large-scale learning. The potential of the proposed algorithms is showcased by rigorous theoretical results and extensive numerical tests.

University of Minnesota Digital Conservancy

Ensembles in adversarial classification for spam

Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2009

Crossref