135 research outputs found

    Hashing for large-scale structured data classification

    University of Technology Sydney. Faculty of Engineering and Information Technology. With the rapid development of the information society and the wide application of networks, an almost incredibly large volume of data is generated every day from social networks, business transactions and so on. In such cases, hashing technology, if done successfully, would greatly improve the performance of data management. The goal of this thesis is to develop hashing methods for large-scale structured data classification. First, this work categorizes and reviews the current progress on hashing from a data classification perspective. Second, new hashing schemes are proposed, each addressing different data characteristics and challenges. Due to the popularity and importance of graph and text data, this research mainly focuses on these two kinds of structured data: 1) The first method is a fast graph stream classification method using Discriminative Clique Hashing (DICH). The main idea is to employ a fast algorithm to decompose a compressed graph into a number of cliques and to sequentially extract clique-patterns over the graph stream as features. Two random hashing schemes are employed: one compresses the original edge set of the graph stream, and the other maps the ever-growing set of clique-patterns onto a fixed-size feature space. DICH essentially speeds up the discriminative clique-pattern mining process and solves the unbounded clique-pattern expansion problem in graph stream mining; 2) The second method is adaptive hashing for real-time graph stream classification (ARC-GS), based on DICH. To adapt to concept drifts in the graph stream, we partition the whole stream into consecutive graph chunks. A differential hashing scheme is used to map the unboundedly growing features (cliques) onto a fixed-size feature space. At the final stage, a chunk-level weighting mechanism forms an ensemble classifier for graph stream classification. Experiments demonstrate that our method significantly outperforms existing methods; 3) The last method is Recursive Min-wise Hashing (RMH) for structured text. Here, the study aims to quickly compute similarities between texts while also preserving context information. To take semantic hierarchy into account, the study introduces a notion of "multi-level exchangeability" and employs nested sets to represent multi-level exchangeable objects. To fingerprint nested sets for fast comparison, a Recursive Min-wise Hashing (RMH) algorithm is proposed with the same computational cost as the standard min-wise hashing algorithm. Theoretical study and bound analysis confirm that RMH is a highly concentrated estimator.
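    As a rough illustration of the min-wise hashing idea behind RMH, the sketch below fingerprints a two-level nested set by first min-hashing each inner set and then min-hashing the resulting set of inner fingerprints. The hash family and the single recursion level are assumptions made for illustration; this is not the thesis's exact algorithm:

```python
import random

def minhash(items, seeds, prime=2147483647):
    # One min-wise hash value per seed: h(x) = (a*hash(x) + b) mod prime;
    # the fingerprint keeps the minimum over all items (standard min-wise hashing).
    return tuple(min((a * hash(x) + b) % prime for x in items) for a, b in seeds)

def recursive_minhash(nested, seeds):
    # Illustrative one-level RMH: fingerprint each inner set, then
    # min-hash the set of inner fingerprints, so nested structure
    # (multi-level exchangeability) is reflected in the final signature.
    inner = [minhash(s, seeds) for s in nested]
    return minhash(inner, seeds)

random.seed(0)
seeds = [(random.randrange(1, 1 << 30), random.randrange(1 << 30)) for _ in range(16)]

a = [{"the", "cat", "sat"}, {"on", "the", "mat"}]
b = [{"the", "cat", "sat"}, {"on", "the", "mat"}]
c = [{"a", "completely"}, {"different", "text"}]

print(recursive_minhash(a, seeds) == recursive_minhash(b, seeds))  # True: same nesting
print(recursive_minhash(a, seeds) == recursive_minhash(c, seeds))
```

The cost stays that of plain min-wise hashing because each element is still hashed once per seed, matching the abstract's claim about computational cost.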

    Is It Overkill? Analyzing Feature-Space Concept Drift in Malware Detectors

    Concept drift is a major challenge faced by machine-learning-based malware detectors when deployed in practice. While existing works have investigated methods to detect concept drift, the main causes behind the drift are not yet well understood. In this paper, we design experiments to empirically analyze the impact of feature-space drift (new features introduced by new samples) and compare it with data-space drift (data distribution shift over existing features). Surprisingly, we find that data-space drift is the dominating contributor to model degradation over time, while feature-space drift has little to no impact. This is consistently observed for both Android and PE malware detectors, with different feature types and feature engineering methods, across different settings. We further validate this observation with recent online-learning-based malware detectors that incrementally update the feature space. Our result indicates the possibility of handling concept drift without frequent feature updating, and we further discuss open questions for future research.
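    One way to read this result is that a fixed-size feature space, e.g. one obtained via the hashing trick, can absorb newly appearing features without ever changing the model's input dimensionality. A minimal sketch (the feature names below are invented for illustration, not taken from the paper):

```python
def hashed_vector(features, dim=16):
    # Hashing trick: any feature name, including ones never seen during
    # training, maps into the same fixed-size vector, so new samples that
    # introduce new features never change the model's input dimensionality.
    vec = [0.0] * dim
    for name, value in features.items():
        vec[hash(name) % dim] += value
    return vec

old = hashed_vector({"perm:INTERNET": 1.0, "api:sendSMS": 1.0})
new = hashed_vector({"perm:INTERNET": 1.0, "api:neverSeenBefore": 1.0})
print(len(old), len(new))  # 16 16: the feature space stays fixed
```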

    Incremental Discovery of Process Maps

    Process mining is a body of methods for analyzing event logs produced during the execution of business processes in order to extract insights for their improvement. A family of process mining methods, known as automated process discovery, allows analysts to extract business process models from event logs. Traditional automated process discovery methods are intended to be used in an offline setting, meaning that the process model is extracted from a snapshot of an event log stored in its entirety. In some scenarios, however, events keep arriving at such a high rate that it is impractical to store the entire event log and to continuously re-discover a process model from scratch. Such scenarios require online automated process discovery approaches. Given an event stream produced by the execution of a business process, the goal of an online automated process discovery method is to maintain a continuously updated model of the process within a bounded amount of memory, while achieving accuracy similar to that of offline methods. Existing automated discovery approaches require relatively large amounts of memory to achieve levels of accuracy comparable to those of offline methods. This thesis proposes an online process discovery framework that addresses this limitation by mapping the problem of online process discovery to that of cache memory management and applying well-known cache replacement policies to the problem. The proposed framework has been implemented in .NET, integrated with the Minit process mining tool, and comparatively evaluated against an existing baseline using real-life datasets.
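    The mapping from online discovery to cache management described above can be sketched as a bounded directly-follows store governed by a Least-Recently-Used (LRU) replacement policy. The class name, memory budget, and event data are illustrative assumptions, not the thesis's actual design:

```python
from collections import OrderedDict

class OnlineDFG:
    # Minimal sketch: a bounded directly-follows graph where the memory
    # budget is enforced by an LRU cache replacement policy.
    def __init__(self, budget):
        self.budget = budget
        self.relations = OrderedDict()  # (activity_a, activity_b) -> frequency
        self.last_event = {}            # case id -> last activity seen

    def observe(self, case, activity):
        prev = self.last_event.get(case)
        self.last_event[case] = activity
        if prev is None:
            return
        rel = (prev, activity)
        self.relations[rel] = self.relations.get(rel, 0) + 1
        self.relations.move_to_end(rel)           # mark as recently used
        while len(self.relations) > self.budget:  # evict least recently used
            self.relations.popitem(last=False)

dfg = OnlineDFG(budget=3)
for case, act in [(1, "a"), (1, "b"), (2, "a"), (1, "c"), (2, "c"), (1, "d")]:
    dfg.observe(case, act)
print(sorted(dfg.relations))  # [('a', 'c'), ('b', 'c'), ('c', 'd')]
```

Here the relation ('a', 'b') is evicted because it is the least recently observed once the budget of three relations is exceeded; other replacement policies would slot into the same eviction step.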

    Mining, Modeling, and Analyzing Real-Time Social Trails

    Real-time social systems are the fastest growing phenomena on the web, enabling millions of users to generate, share, and consume content on a massive scale. These systems are manifestations of a larger trend toward the global sharing of the real-time interests, affiliations, and activities of everyday users, and they demand new computational approaches for monitoring, analyzing, and distilling information from the prospective web of real-time content. In this dissertation research, we focus on the real-time social trails that reflect the digital footprints of crowds of real-time web users in response to real-world events or online phenomena. These digital footprints correspond to the artifacts strewn across the real-time web, such as messages posted to Twitter or Facebook; the creation, sharing, and viewing of videos on websites like YouTube; and so on. While access to social trails could benefit many domains, there is a significant research gap in discovering, modeling, and leveraging these social trails. Hence, this dissertation research makes three contributions:
    • The first contribution is a suite of efficient techniques for discovering non-trivial social trails from large-scale real-time social systems. We first develop a communication-based method using temporal graphs for discovering social trails in a stream of conversations from social messaging systems (instant messages, emails, Twitter directed or @ messages, SMS, etc.), and then develop a content-based method using locality-sensitive hashing for discovering content-based social trails in a stream of text messages (the Tweet stream, Facebook messages, YouTube comments, etc.).
    • The second contribution is a framework for modeling and predicting the spatio-temporal dynamics of social trails. In particular, we develop a probabilistic model that synthesizes two conflicting hypotheses about the nature of online information spread: (i) the spatial influence model, which asserts that social trails propagate to locations that are close by; and (ii) the community affinity influence model, which asserts that social trails propagate between locations that are culturally connected, even if they are distant.
    • The third contribution is a set of methods for social trail analytics and for leveraging social trails in prognostic applications such as real-time content recommendation and personalized advertising. We first analyze geo-spatial social trails of hashtags from Twitter and investigate their spatio-temporal dynamics, then use this analysis to develop a framework for recommending hashtags. Finally, we address the challenge of classifying social trails efficiently on real-time social systems.
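    The content-based trail discovery above rests on locality-sensitive hashing; a minimal banding-style sketch over min-wise hash signatures (the hash family, band count, and example messages are all assumptions for illustration) could look like:

```python
import random

random.seed(42)
PRIME = 2147483647
SEEDS = [(random.randrange(1, 1 << 30), random.randrange(1 << 30)) for _ in range(12)]

def signature(tokens):
    # Min-wise hash signature of a token set (12 hash functions).
    return [min((a * hash(t) + b) % PRIME for t in tokens) for a, b in SEEDS]

def lsh_buckets(messages, bands=4):
    # LSH by banding: messages whose signatures agree on any band land in
    # the same bucket, so near-duplicate messages in a stream can be
    # grouped without all-pairs comparison.
    rows = len(SEEDS) // bands
    buckets = {}
    for mid, text in messages:
        sig = signature(set(text.split()))
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(mid)
    return {k: v for k, v in buckets.items() if len(v) > 1}

msgs = [(0, "earthquake hits the city tonight"),
        (1, "earthquake hits the city tonight"),
        (2, "great recipe for lemon cake")]
groups = lsh_buckets(msgs)
print(any(set(v) >= {0, 1} for v in groups.values()))  # True: duplicates collide
```

Messages 0 and 1 share every band and are grouped, while message 2, sharing no tokens, almost surely falls into separate buckets.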

    Error minimising gradients for improving cerebellar model articulation controller performance

    In motion control applications where the desired trajectory velocity exceeds an actuator's maximum velocity limitations, large position errors will occur between the desired and actual trajectory responses. In these situations standard control approaches cannot predict the output saturation of the actuator, and thus the associated error summation cannot be minimised. An adaptive feedforward control solution such as the Cerebellar Model Articulation Controller (CMAC) is able to provide an inherent level of prediction for these situations, moving the system output in the direction of the excessive desired velocity before actuator saturation occurs. However, the pre-empting level of a CMAC is not adaptive, and thus the optimal point in time to start moving the system output in the direction of the excessive desired velocity remains unsolved. While the CMAC can adaptively minimise an actuator's position error, minimising the summation over time of the error created by the divergence of the desired and actual trajectory responses requires an additional adaptive level of control. This thesis presents an improved method of training CMACs to minimise the summation of error over time created when the desired trajectory velocity exceeds the actuator's maximum velocity limitations. This improved method, called the Error Minimising Gradient Controller (EMGC), is able to adaptively modify a CMAC's training signal so that the CMAC starts to move the output of the system in the direction of the excessive desired velocity with an optimised pre-empting level. The EMGC was originally created to minimise the loss of linguistic information conveyed through an actuated series of concatenated hand sign gestures reproducing deafblind sign language. The EMGC concept, however, can be implemented on any system where the error summation associated with excessive desired velocities needs to be minimised, with the EMGC producing a better output approximation than a CMAC alone. In this thesis, the EMGC was tested and benchmarked against a combined feedforward/feedback controller using a CMAC and a PID controller. The EMGC was tested on an air-muscle actuator in a variety of situations comprising a position discontinuity in a continuous desired trajectory. Tested situations included various discontinuity magnitudes together with varying approach and departure gradient profiles. Testing demonstrated that the addition of an EMGC can reduce a situation's error summation magnitude if the base CMAC controller has not already provided an early enough pre-empting output in the direction of the situation. The addition of an EMGC to a CMAC produces an improved approximation of reproduced motion trajectories, minimising position error not only for a single sampling instance but also over time for periodic signals.
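    For readers unfamiliar with the underlying controller, a minimal one-dimensional CMAC sketch follows: several overlapping tilings coarse-code the input, the output is the sum of the activated weights, and training spreads the error equally across them. The tiling counts and learning rate are illustrative, and the EMGC extension itself is not reproduced here:

```python
class CMAC:
    # Minimal 1-D CMAC sketch: overlapping tilings quantize the input,
    # the output sums the activated weights, and training distributes the
    # output error equally across those weights.
    def __init__(self, n_tilings=8, n_tiles=32, lr=0.5):
        self.n_tilings, self.n_tiles, self.lr = n_tilings, n_tiles, lr
        self.w = [[0.0] * n_tiles for _ in range(n_tilings)]

    def _active(self, x):
        # x is assumed normalized to [0, 1); each tiling is offset slightly,
        # which gives the coarse-coded, generalizing response.
        for t in range(self.n_tilings):
            offset = t / (self.n_tilings * self.n_tiles)
            yield t, min(int((x + offset) * self.n_tiles), self.n_tiles - 1)

    def predict(self, x):
        return sum(self.w[t][i] for t, i in self._active(x))

    def train(self, x, target):
        err = target - self.predict(x)
        for t, i in self._active(x):
            self.w[t][i] += self.lr * err / self.n_tilings

cmac = CMAC()
for _ in range(50):
    for x, y in [(0.1, 1.0), (0.5, -1.0), (0.9, 0.5)]:
        cmac.train(x, y)
print(round(cmac.predict(0.1), 2), round(cmac.predict(0.5), 2))  # 1.0 -1.0
```

An EMGC-style extension would sit on top of this, modifying the training signal (the `target` passed to `train`) rather than the CMAC update rule itself.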

    Enhancing Computer Network Security through Improved Outlier Detection for Data Streams

    Over the past couple of years, machine learning methods, especially Outlier Detection (OD) methods, have become anchored in the cyber security field to detect network-based anomalies rooted in novel attack patterns. Due to the steady increase of high-volume, high-speed and high-dimensional Streaming Data (SD), for which ground-truth information is not available, detecting anomalies in real-world computer networks has become more and more challenging. Efficient detection schemes for networked, embedded devices need to be fast and memory-constrained, and must be capable of dealing with concept drifts when they occur. The aim of this thesis is to enhance computer network security through improved OD for data streams, in particular SD, and to achieve cyber resilience, which ranges from the detection and analysis of security-relevant incidents, e.g., novel malicious activity, to the reaction to them. To this end, four major contributions are proposed, which have been published or are under review as journal articles.
    First, a research gap in unsupervised Feature Selection (FS) for improving off-the-shelf OD methods in data streams is filled by proposing Unsupervised Feature Selection for Streaming Outlier Detection, denoted UFSSOD. A generic concept is derived that shows two application scenarios of UFSSOD in conjunction with online OD algorithms. Extensive experiments have shown that UFSSOD, as an online-capable algorithm, achieves results comparable to those of a competitor trimmed for OD.
    Second, a novel unsupervised online OD framework called Performance Counter-Based iForest (PCB-iForest) is introduced which, in its general form, can incorporate any ensemble-based online OD method to function on SD. Two variants based on the classic iForest are integrated. Extensive experiments performed on 23 multi-disciplinary, security-related real-world data sets revealed that PCB-iForest clearly outperforms state-of-the-art competitors in 61 % of cases and achieves even more promising results in terms of the tradeoff between classification performance and computational cost.
    Third, a framework called Streaming Outlier Analysis and Attack Pattern Recognition, denoted SOAAPR, is introduced which, in contrast to the state of the art, can process the output of various online unsupervised OD methods in a streaming fashion to extract information about novel attack patterns. From the clustered set of correlated alerts, SOAAPR computes three different privacy-preserving, fingerprint-like signatures that characterize and represent potential attack scenarios with respect to their communication relations, their manifestation in the data's features, and their temporal behavior. The evaluation on two popular data sets shows that SOAAPR can compete with an offline competitor in terms of alert correlation and significantly outperforms it in terms of processing time. Moreover, in most cases all three types of signatures reliably characterize attack scenarios, grouping similar ones together.
    Fourth, an Uncoupled Message Authentication Code algorithm, Uncoupled MAC, is presented which builds a bridge between cryptographic protection and Intrusion Detection Systems (IDSs) for network security. It secures network communication (authenticity and integrity) through a cryptographic scheme with layer-2 support via uncoupled message authentication codes but, as a side effect, also provides IDS functionality, producing alarms based on the violation of Uncoupled MAC values. Through a novel self-regulation extension, the algorithm adapts its sampling parameters based on the detection of malicious actions on SD. The evaluation in a virtualized environment clearly shows that the detection rate increases over runtime for different attack scenarios, even covering scenarios in which intelligent attackers try to exploit the downsides of sampling.
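    The self-regulating sampling idea can be illustrated with a toy verifier that checks an HMAC on only a fraction of received frames and doubles its sampling rate once a violation is detected. This is purely an illustration of the sampling and self-regulation concept using a standard HMAC; it is not the thesis's layer-2 Uncoupled MAC construction:

```python
import hashlib
import hmac
import random

class SampledMacVerifier:
    # Toy sketch: authenticate frames with an HMAC but verify only a
    # sampled fraction, raising the sampling rate after a violation
    # (possible tampering) is detected. All names are hypothetical.
    def __init__(self, key, rate=0.2, seed=1):
        self.key, self.rate = key, rate
        self.rng = random.Random(seed)
        self.alarms = 0

    def tag(self, frame):
        return hmac.new(self.key, frame, hashlib.sha256).digest()

    def receive(self, frame, tag):
        if self.rng.random() >= self.rate:
            return                               # frame not sampled
        if not hmac.compare_digest(self.tag(frame), tag):
            self.alarms += 1
            self.rate = min(1.0, self.rate * 2)  # self-regulation: sample more

key = b"shared-secret"
v = SampledMacVerifier(key)
for i in range(300):
    frame = b"frame-%d" % i
    # Every third frame carries a forged tag, simulating an attacker.
    tag = v.tag(frame) if i % 3 else b"\x00" * 32
    v.receive(frame, tag)
print(v.alarms > 0)
```

As in the thesis's evaluation narrative, each detected violation makes subsequent forgeries more likely to be caught, since the sampling rate ratchets upward.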