Antidefacement
The Internet connects around three billion users worldwide, a number that increases every day. Thanks to this technology, people, companies, and devices perform many tasks, such as broadcasting information through websites. Because of the large volumes of sensitive information these websites handle and their frequent lack of security, the number of attacks on these applications has been increasing significantly. Attacks on websites have different purposes; one of them is the introduction of unauthorized modifications (defacement). Defacement is an issue that affects both system users and company image, so the research community has been working on solutions to reduce the associated security risks. This paper presents an introduction to the state of the art of the techniques, methodologies, and solutions proposed by both the research community and the computer security industry.
New strategies for efficient and practical genetic programming.
2006/2007
In recent decades, engineers and decision makers have expressed a growing interest in the development of effective modeling and simulation methods to understand or predict the behavior of many phenomena in science and engineering. Many of these phenomena are translated into mathematical models for convenience and ease of interpretation. Methods commonly employed for this purpose include, for example, Neural Networks, Simulated Annealing, Genetic Algorithms, Tabu Search, and so on. These methods all seek the optimal or near-optimal values of a predefined set of parameters of a model built a priori, so a suitable model must be known beforehand. When the form of this model cannot be found, the problem can be approached at another level, where the goal is to find a program or a mathematical representation that can solve the problem. In this view, the modeling step is performed automatically, driven by a quality criterion that guides the building process.
In this thesis, we focus on the Genetic Programming (GP) approach as an automatic method for creating computer programs by means of artificial evolution based upon the original contributions of Darwin and Mendel.
While GP has proven to be a powerful means of coping with problems in which finding a solution and its representation is difficult, its practical applicability is still severely limited by several factors. First, the GP approach is inherently stochastic: there is no guarantee of obtaining a satisfactory solution at the end of the evolutionary loop. Second, the performance on a given problem may depend strongly on a broad range of parameters, including the number of variables involved, the quantity of data for each variable, the size and composition of the initial population, the number of generations, and so on.
By contrast, a user of Genetic Programming has two expectations: on the one hand, to maximize the probability of obtaining an acceptable solution, and on the other hand, to minimize the amount of computational resources needed to get it.
We first present innovative and challenging applications in several fields of science (computer science and mechanical engineering) which contributed greatly to the experience gained in the GP field.
We then propose new strategies for improving the performance of the GP approach in terms of efficiency and accuracy, and we evaluate them on a large set of benchmark problems in three different domains. Furthermore, we introduce a new GP-based approach dedicated to symbolic regression of multivariate datasets where the underlying phenomenon is best characterized by a discontinuous function.
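As a concrete illustration of the fitness-driven evolutionary loop and the parameters mentioned above (population size and composition, number of generations), here is a toy GP-style symbolic-regression sketch; the tree representation, operators, and parameter values are illustrative assumptions, not the algorithms studied in the thesis:

```python
import random, operator

# Toy GP-style loop for symbolic regression of y = x^2 + x (illustrative only).
OPS = {'+': operator.add, '-': operator.sub, '*': operator.mul}

def random_tree(depth=3):
    # A tree is either the variable 'x', a small constant, or (op, left, right).
    if depth == 0 or random.random() < 0.3:
        return 'x' if random.random() < 0.7 else random.randint(-2, 2)
    return (random.choice(list(OPS)), random_tree(depth - 1), random_tree(depth - 1))

def evaluate(tree, x):
    if tree == 'x':
        return x
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left, x), evaluate(right, x))

def fitness(tree, cases):
    # Sum of absolute errors over the training cases (lower is better).
    return sum(abs(evaluate(tree, x) - y) for x, y in cases)

def mutate(tree):
    # Crude mutation: occasionally replace the whole tree with a fresh one.
    return random_tree() if random.random() < 0.2 else tree

def evolve(cases, pop_size=200, generations=30):
    population = [random_tree() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda t: fitness(t, cases))
        survivors = population[:pop_size // 2]          # truncation selection
        population = survivors + [mutate(random.choice(survivors))
                                  for _ in range(pop_size - len(survivors))]
    return min(population, key=lambda t: fitness(t, cases))

cases = [(x, x * x + x) for x in range(-5, 6)]
best = evolve(cases)
print(fitness(best, cases))  # small error, often 0 for this easy target
```

Note that population size and generation count directly trade computational cost against the probability of finding an acceptable solution, which is exactly the tension described in the abstract.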
These contributions aim to provide a better understanding of the key features and the underlying relationships that make enhancements successful in improving the original algorithm.
Identifying and Preventing Large-scale Internet Abuse
The widespread access to the Internet and the ubiquity of web-based services make it easy to communicate and interact globally. Unfortunately, the software and protocols implementing the functionality of these services are often vulnerable to attacks. In turn, an attacker can exploit them to compromise, take over, and abuse the services for her own nefarious purposes. In this dissertation, we aim to better understand such attacks, and we develop methods and algorithms to detect and prevent them, which we evaluate on large-scale datasets.

First, we detail Meerkat, a system to detect a visible way in which websites are being compromised, namely website defacements. Defacements can inflict significant harm on the websites' operators through the loss of sales, the loss in reputation, or because of legal ramifications. Meerkat requires no prior knowledge about the websites' content or their structure, but only the Uniform Resource Identifier (URI) at which they can be reached. By design, Meerkat mimics how a human analyst decides if a website was defaced when viewing it in a browser, by using computer vision techniques. Thus, it tackles the problem of detecting website defacements through their attention-seeking nature, their goal and purpose, rather than code or data artifacts that they might exhibit. In turn, it is much harder for an attacker to evade our system, as she needs to change her modus operandi. When Meerkat detects a website as defaced, the website can automatically be put into maintenance mode or restored to a known good state.

An attacker, however, is not limited to abusing a compromised website in a way that is visible to the website's visitors. Instead, she can misuse the website to infect its visitors with malicious software (malware). Although malware is well studied, identifying malicious websites remains a major challenge in today's Internet.
Second, we introduce Delta, a novel, purely static analysis approach that extracts change-related features between two versions of the same website, uses machine learning to derive a model of website changes, detects whether an introduced change was malicious or benign, identifies the underlying infection vector based on clustering, and generates an identifying signature. Furthermore, due to the way Delta clusters campaigns, it can uncover infection campaigns that leverage specific vulnerable applications as a distribution channel, and it can greatly reduce the human labor necessary to uncover the application responsible for a service's compromise.

Third, we investigate the practicality and impact of domain takeover attacks, which an attacker can similarly abuse to spread misinformation or malware, and we present a defense that renders such takeover attacks toothless. Specifically, the new elasticity of Internet resources, in particular Internet protocol (IP) addresses in the context of Infrastructure-as-a-Service cloud service providers, combined with previously made protocol assumptions, can lead to security issues. In Cloud Strife, we show that this dynamic component, paired with recent developments in trust-based ecosystems (e.g., Transport Layer Security (TLS) certificates), creates previously unknown attack vectors. For example, a substantial number of stale domain name system (DNS) records point to IP addresses that are readily available in clouds, yet clients still actively attempt to access them. Often, these records belong to discontinued services that were previously hosted in the cloud. We demonstrate that it is practical, time-efficient, and cost-efficient for attackers to allocate the IP addresses to which stale DNS records point.
Further considering the ubiquity of domain validation in trust ecosystems, an attacker can impersonate the service by obtaining and using a valid certificate that is trusted by all major operating systems and browsers, which severely increases the attacker's capabilities. The attacker can then also exploit residual trust in the domain name for phishing, receiving and sending emails, or possibly distributing code to clients that load remote code from the domain (e.g., loading of native code by mobile apps, or JavaScript libraries by websites). To prevent such attacks, we introduce a new authentication method for trust-based domain validation that mitigates staleness issues without incurring additional effort for the certificate requester, by incorporating existing trust into the validation process.

Finally, the analyses of Delta, Meerkat, and Cloud Strife have made use of large-scale measurements to assess our approaches' impact and viability. Indeed, security research in general has made extensive use of exhaustive Internet-wide scans in recent years, as they can provide significant insights into the state of security of the Internet (e.g., if classes of devices are behaving maliciously, or if they might be insecure and could turn malicious in an instant). However, the address space of the Internet's core addressing protocol (Internet Protocol version 4; IPv4) is exhausted, and a migration to its successor (Internet Protocol version 6; IPv6), the only accepted long-term solution, is inevitable. In turn, to better understand the security of devices connected to the Internet, in particular Internet of Things devices, it is imperative to include IPv6 addresses in security evaluations and scans. Unfortunately, it is practically infeasible to iterate through the entire IPv6 address space, as it is 2^96 times larger than the IPv4 address space.
Without enumerating hosts prior to scanning, we will be unable to retain visibility into the overall security of Internet-connected devices in the future, and we will be unable to detect and prevent their abuse or compromise. To mitigate this blind spot, we introduce a novel technique to enumerate part of the IPv6 address space by walking DNSSEC-signed IPv6 reverse zones. We show (i) that, contrary to common belief, enumerating active IPv6 hosts is practical without a preferential network position, (ii) that the security of active IPv6 hosts currently still lags behind the security state of IPv4 hosts, and (iii) that unintended default IPv6 connectivity is a major security issue.
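Walking DNSSEC-signed reverse zones relies on the standard nibble encoding that maps an IPv6 address to a name under ip6.arpa (RFC 3596). A small helper showing that mapping (an illustrative sketch of the encoding only, not the authors' enumeration tooling):

```python
import ipaddress

def ipv6_reverse_name(addr: str) -> str:
    """Map an IPv6 address to its ip6.arpa reverse-DNS name (RFC 3596):
    each hex nibble of the expanded address becomes one label, in reversed order."""
    nibbles = ipaddress.IPv6Address(addr).exploded.replace(':', '')
    return '.'.join(reversed(nibbles)) + '.ip6.arpa'

print(ipv6_reverse_name('2001:db8::1'))
# 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.8.b.d.0.1.0.0.2.ip6.arpa
```

Because every address expands to exactly 32 nibble labels, the reverse tree is deeply hierarchical, which is what makes zone walking (rather than brute-force iteration over 2^128 addresses) attractive.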
Techniques for large-scale automatic detection of web site defacements.
2006/2007
Web site defacement, the process of introducing unauthorized modifications to a web site, is a very common form of attack. This thesis describes the design and experimental evaluation of a framework that may constitute the basis for a defacement detection service capable of monitoring thousands of remote web sites systematically and automatically.
With this framework, an organization may join the service simply by providing the URL of the resource to be monitored along with the contact point of an administrator. The monitored organization may thus take advantage of the service with just a few mouse clicks, without installing any software locally or changing its own daily operational processes.
The main proposed approach is based on anomaly detection and allows monitoring the integrity of many remote web resources automatically while remaining fully decoupled from them, in particular without requiring any prior knowledge about those resources. During a preliminary learning phase, a profile of the monitored resource is built automatically. Then, while monitoring, the remote resource is retrieved periodically and an alert is generated whenever something "unusual" shows up.
The thesis discusses the effectiveness of the approach in terms of detection accuracy, i.e., missed detections and false alarms. The thesis also considers the problem of misclassified readings in the learning set. The effectiveness of the anomaly detection approach, and hence of the proposed framework, rests on the assumption that the profile is computed from a learning set that is not corrupted by attacks; this assumption is often taken for granted. The influence of learning set corruption on the framework's effectiveness is assessed, and a procedure aimed at discovering when a given unknown learning set is corrupted by positive readings is proposed and evaluated experimentally.
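The anomaly-detection scheme described above can be sketched generically: learn a per-feature profile from a learning set of readings, then flag readings that deviate from it. The features and threshold below are illustrative assumptions, not the thesis's actual profile:

```python
import statistics

# Generic anomaly-detection sketch: a "reading" is a feature vector extracted
# from a periodic snapshot of a monitored page; the profile stores per-feature
# mean and standard deviation learned from a (hopefully attack-free) learning set.

def learn_profile(readings):
    features = list(zip(*readings))  # transpose: one tuple per feature
    return [(statistics.mean(f), statistics.stdev(f)) for f in features]

def is_anomalous(profile, reading, z_threshold=3.0):
    # Alert if any feature deviates from its learned mean by more than
    # z_threshold standard deviations (threshold is an illustrative choice).
    for (mean, std), value in zip(profile, reading):
        if std == 0:
            if value != mean:
                return True
        elif abs(value - mean) / std > z_threshold:
            return True
    return False

# Example features: (page size in bytes, number of links, number of images).
learning_set = [(10240, 35, 4), (10300, 35, 4), (10180, 36, 4), (10260, 35, 4)]
profile = learn_profile(learning_set)
print(is_anomalous(profile, (10250, 35, 4)))   # False: close to the profile
print(is_anomalous(profile, (512, 2, 0)))      # True: drastic, defacement-like change
```

The sketch also makes the corruption problem concrete: if the learning set already contained defaced readings, the learned means and deviations would shift toward the attack, silently widening what counts as "normal".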
An approach to automatic defacement detection based on Genetic Programming (GP), an automatic method for creating computer programs by means of artificial evolution, is proposed and evaluated experimentally. Moreover, a set of techniques that have been used in the literature for designing host-based and network-based Intrusion Detection Systems is considered and evaluated experimentally in comparison with the proposed approach.
Finally, the thesis presents the findings of a large-scale study on reaction time to web site defacement. Several statistics indicate the number of incidents of this sort, but a crucial piece of information is still lacking: the typical duration of a defacement. A two-month monitoring activity was performed over more than 62,000 defacements in order to determine whether and when a reaction to the defacement is taken. It is shown that this time tends to be unacceptably long, in the order of several days, and to follow a long-tailed distribution.
Wide spectrum attribution: Using deception for attribution intelligence in cyber attacks
Modern cyber attacks have evolved considerably. The skill level required to conduct a cyber attack is low, computing power is cheap, and targets are diverse and plentiful. Point-and-click crimeware kits are widely circulated in the underground economy, while source code for sophisticated malware such as Stuxnet is available for all to download and repurpose. Despite decades of research into defensive techniques such as firewalls, intrusion detection systems, anti-virus, and code auditing, the quantity of successful cyber attacks continues to increase, as does the number of vulnerabilities identified. Measures to identify perpetrators, known as attribution, have existed for as long as there have been cyber attacks. The most actively researched technical attribution techniques involve the marking and logging of network packets. These techniques are performed by network devices along the packet journey, which most often requires modification of existing router hardware and/or software, or the inclusion of additional devices. These modifications require wide-scale infrastructure changes that are not only complex and costly, but invoke legal, ethical, and governance issues. The usefulness of these techniques is also often questioned, as attack actors use multiple stepping stones, often innocent systems that have been compromised, to mask the true source. As such, this thesis identifies that no publicly known previous work has been deployed on a wide-scale basis in the Internet infrastructure.

This research investigates the use of an often overlooked tool for attribution: cyber deception. The main contribution of this work is a significant advancement in the field of deception and honeypots as technical attribution techniques, specifically the design and implementation of two novel honeypot approaches: (i) Deception Inside Credential Engine (DICE), which uses policy and honeytokens to identify adversaries returning from different origins, and (ii) Adaptive Honeynet Framework (AHFW), an introspection-based, adaptive honeynet framework that uses actor-dependent triggers to modify the honeynet environment in order to engage the adversary, increasing the quantity and diversity of interactions. The two approaches are based on a systematic review of the technical attribution literature, which was used to derive a set of requirements for honeypots as technical attribution techniques. Both approaches lead the way for further research in this field.
Genetic Programming Techniques in Engineering Applications
2012/2013
Machine learning is a suite of techniques for developing algorithms that perform tasks by generalizing from examples.
Machine learning systems, thus, may automatically synthesize programs from data.
This approach is often feasible and cost-effective where manual programming or manual algorithm design is not.
In the last decade techniques based on machine learning have spread in a broad range of application domains.
In this thesis, we will present several novel applications of a specific machine learning technique, called Genetic Programming, to a wide set of engineering applications grounded in real-world problems.
The problems treated in this work range from the automatic synthesis of regular expressions, to electricity price forecasting, to the synthesis of a model of the tracheal pressure in mechanical ventilation.
The results demonstrate that Genetic Programming is indeed a suitable tool for solving complex problems of practical interest.
Furthermore, several results constitute a significant improvement over the existing state-of-the-art.
The main contribution of this thesis is the design and implementation of a framework for the automatic inference of regular expressions from examples based on Genetic Programming.
First, we will show the ability of such a framework to cope with the generation of regular expressions for solving text-extraction tasks from examples.
We will experimentally assess our proposal comparing our results with previous proposals on a collection of real-world datasets.
The results demonstrate a clear superiority of our approach.
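Evolving regular expressions from examples requires a fitness function that scores each candidate regex against labeled text-extraction examples. The character-level scoring below is a hedged sketch of one plausible such function, not the thesis's actual fitness:

```python
import re

def extraction_fitness(pattern, examples):
    """Score a candidate regex against (text, desired_substrings) examples by
    character-level overlap: reward extracted characters that belong to the
    desired substrings, penalize spill-over. Illustrative, not the thesis's metric."""
    try:
        rx = re.compile(pattern)
    except re.error:
        return 0.0  # syntactically invalid candidates get the worst score
    score, total = 0, 0
    for text, wanted in examples:
        want = set()
        for w in wanted:
            start = text.find(w)
            if start >= 0:
                want.update(range(start, start + len(w)))
        got = set()
        for m in rx.finditer(text):
            got.update(range(m.start(), m.end()))
        score += len(want & got) - len(got - want)  # hits minus spill
        total += len(want)
    return max(score, 0) / total if total else 0.0

examples = [("order 1234 shipped", ["1234"]), ("ref 98 and 76", ["98", "76"])]
print(extraction_fitness(r"\d+", examples))   # 1.0: extracts exactly the digits
print(extraction_fitness(r"\w+", examples))   # 0.0: over-matching is punished
```

A GP search would then evolve a population of candidate patterns under this score, exactly as any other fitness-driven evolutionary loop.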
We have implemented the approach in a web application that has gained considerable interest and has reached peaks of more than 10,000 daily accesses.
Then, we will apply the framework to a popular "regex golf" challenge, a competition in which human players must generate the shortest regular expression solving a given set of problems.
Our results rank in the top 10 list of human players worldwide and outperform those generated by the only existing algorithm specialized to this purpose.
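A regex-golf objective can be sketched as a scorer that rewards a pattern for matching every target string and no forbidden string while penalizing its length; the weights below are illustrative assumptions, not the scoring used in the competition or the thesis:

```python
import re

def golf_fitness(pattern, winners, losers):
    """Regex-golf objective (illustrative weights): the pattern should match
    every 'winner', no 'loser', and be as short as possible."""
    try:
        rx = re.compile(pattern)
    except re.error:
        return float('-inf')  # invalid patterns are worst
    hits = sum(1 for w in winners if rx.search(w))
    misses = sum(1 for l in losers if rx.search(l))
    # 10 points per correct decision, minus 1 per character of the pattern.
    return 10 * (hits + len(losers) - misses) - len(pattern)

winners = ["afoot", "catfoot", "dogfoot"]
losers = ["Atlas", "Aymoro", "Iphigenia"]
print(golf_fitness(r"foot", winners, losers))   # 56: all correct, short pattern
print(golf_fitness(r".*", winners, losers))     # 28: short, but matches the losers too
```

The length penalty is what distinguishes regex golf from plain classification: two patterns with identical accuracy are ranked by brevity.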
Next, we will perform an extensive experimental evaluation in order to compare our proposal to the state of the art in a closely related and long-established research field: the generation of Deterministic Finite Automata (DFA) from a labelled set of examples.
Our results demonstrate that the existing state-of-the-art in DFA learning is not suitable for text extraction tasks.
We will also show a variant of our framework designed for solving text processing tasks of the search-and-replace form.
A common way to automate search-and-replace is to describe the region to be modified and the desired changes through a regular expression and a replacement expression.
We will propose a solution to automatically produce both expressions based only on examples provided by the user.
We will experimentally assess our proposal on real-world search-and-replace tasks.
The results indicate that our proposal is indeed feasible.
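The search-and-replace form described above pairs a regular expression, whose capture groups delimit the region to modify, with a replacement expression that references those groups. A minimal illustration using Python's re.sub (the date-reordering task itself is an invented example):

```python
import re

# Search-and-replace with a regular expression plus a replacement expression:
# capture groups in the pattern are referenced as \1, \2, \3 in the replacement.
pattern = r"(\d{2})/(\d{2})/(\d{4})"   # dd/mm/yyyy
replacement = r"\3-\2-\1"              # yyyy-mm-dd

text = "invoices due 01/02/2014 and 15/03/2014"
print(re.sub(pattern, replacement, text))
# invoices due 2014-02-01 and 2014-03-15
```

The framework's task is thus to infer both `pattern` and `replacement` automatically, given only before/after example pairs such as the one above.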
Finally, we will study the applicability of our framework to the generation of a schema from a sample of eXtensible Markup Language (XML) documents.
XML documents are widely used in machine-to-machine interactions, and such interactions often require that constraints be applied to the contents of the documents.
These constraints are usually specified in a separate document which is often unavailable or missing.
In order to generate a missing schema, we will apply our framework to this problem and evaluate it experimentally.
In the final part of this thesis we will describe two significant applications from different domains.
We will describe a forecasting system for producing estimates of the next day electricity price.
The system is based on a combination of a predictor based on Genetic Programming and a classifier based on Neural Networks.
A key feature of this system is its ability to handle outliers, i.e., values rarely seen during the learning phase.
We will compare our results with a challenging baseline representative of the state-of-the-art.
We will show that our proposal exhibits smaller prediction error than the baseline.
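Structurally, the hybrid forecaster routes each input through an outlier classifier and then to the appropriate predictor. The sketch below shows only that routing; the stand-in models and the threshold are illustrative assumptions, not the system's actual GP and neural-network components:

```python
# Structural sketch of the hybrid forecaster described above: a classifier
# routes each input either to the main (GP-evolved) predictor or to a
# fallback model for rarely-seen, outlier-like inputs.

def is_outlier(x, low=10.0, high=100.0):
    # Stand-in for the neural-network outlier classifier.
    return not (low <= x <= high)

def gp_predictor(x):
    # Stand-in for the GP-evolved regression model.
    return 1.05 * x + 2.0

def fallback_predictor(x):
    # Conservative model used when the input looks like an outlier.
    return x

def forecast(x):
    return fallback_predictor(x) if is_outlier(x) else gp_predictor(x)

print(forecast(50.0))    # 54.5: routed to the GP model
print(forecast(500.0))   # 500.0: flagged as an outlier, fallback used
```

The design choice here is that the classifier guards the regression model against extrapolating far outside its training region, which is where prediction error would otherwise spike.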
Finally, we will move to a biomedical problem: estimating tracheal pressure in a patient treated with high-frequency percussive ventilation.
High-frequency percussive ventilation is a new and promising non-conventional mechanical ventilatory strategy.
In order to avoid barotrauma and volutrauma in patients, the pressure of the insufflated air must be monitored carefully.
Since measuring the tracheal pressure is difficult, a model for accurately estimating the tracheal pressure is required.
We will propose a synthesis of such a model by means of Genetic Programming, and we will compare our results with the state of the art.