42 research outputs found

    TOWARDS A HOLISTIC EFFICIENT STACKING ENSEMBLE INTRUSION DETECTION SYSTEM USING NEWLY GENERATED HETEROGENEOUS DATASETS

    With the exponential growth of network-based applications globally, there has been a transformation in organizations' business models. Furthermore, the falling cost of both computational devices and internet access has led people to become more technology dependent. Consequently, due to the inordinate use of computer networks, new risks have emerged, making it crucial to improve the speed and accuracy of security mechanisms. Although abundant new security tools have been developed, the rapid growth of malicious activities continues to be a pressing issue, as ever-evolving attacks create severe threats to network security. Classical security techniques, for instance firewalls, are used as a first line of defense against security problems but remain unable to detect internal intrusions or to provide adequate security countermeasures. Thus, network administrators tend to rely predominantly on Intrusion Detection Systems (IDSs) to detect such intrusive network activities. Machine Learning is one of the practical approaches to intrusion detection, learning from data to differentiate between normal and malicious traffic. Although Machine Learning approaches are used frequently, an in-depth analysis of Machine Learning algorithms in the context of intrusion detection has received less attention in the literature. Moreover, adequate datasets are necessary to train and evaluate anomaly-based network intrusion detection systems. A number of such datasets, such as DARPA, KDDCUP, and NSL-KDD, have been widely adopted by researchers to train and evaluate the performance of their proposed intrusion detection approaches, but several studies have found many of them outdated and unreliable. Furthermore, some of these datasets suffer from a lack of traffic diversity and volume, do not cover the variety of attacks, have anonymized packet information and payloads that cannot reflect current trends, or lack feature sets and metadata. This thesis provides a comprehensive analysis of some of the existing Machine Learning approaches for identifying network intrusions. Specifically, it analyzes the algorithms along various dimensions, namely feature selection, sensitivity to hyper-parameter selection, and the class imbalance problem, that are inherent to intrusion detection. It also produces a new, reliable dataset labeled Game Theory and Cyber Security (GTCS) that matches real-world criteria, contains normal traffic and different classes of attacks, and reflects current network traffic trends. The GTCS dataset is used to evaluate the performance of the different approaches, and a detailed experimental evaluation summarizing the effectiveness of each approach is presented. Finally, the thesis proposes an ensemble classifier model composed of multiple classifiers with different learning paradigms to address the issues of detection accuracy and false alarm rate in intrusion detection systems.
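A minimal sketch of the kind of stacking ensemble this abstract describes, assuming scikit-learn; the choice of base learners, the flow features, and the labels are placeholders, not the thesis's actual GTCS configuration.

```python
# Minimal sketch of a stacking ensemble for intrusion detection,
# assuming scikit-learn and a generic labeled traffic dataset.
# Feature/label names are placeholders, not the actual GTCS schema.
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def build_stacking_ids():
    # Base learners with different learning paradigms, as the thesis proposes.
    base_learners = [
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("dt", DecisionTreeClassifier(max_depth=10, random_state=0)),
    ]
    # A meta-learner combines the base predictions (stacked generalization).
    return StackingClassifier(
        estimators=base_learners,
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5,  # out-of-fold predictions avoid leaking training labels
    )

# X: flow features (duration, bytes, packets, ...); y: 0 = normal, 1 = attack
# model = build_stacking_ids().fit(X_train, y_train)
# print(model.score(X_test, y_test))
```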

    Cyberspace and Real-World Behavioral Relationships: Towards the Application of Internet Search Queries to Identify Individuals At-risk for Suicide

    The Internet has become an integral and pervasive aspect of society. Not surprisingly, the growth of e-commerce has led to focused research on identifying relationships between user behavior in cyberspace and the real world - retailers are tracking the items customers view and purchase in order to recommend additional products and to better direct advertising. As the relationship between online search patterns and real-world behavior becomes better understood, the practice is likely to expand to other applications. Indeed, Google Flu Trends has implemented an algorithm that accurately charts the relationship between the number of people searching for flu-related topics on the Internet and the number of people who actually have flu symptoms in that region. Because the results are real-time, studies show Google Flu Trends estimates are typically two weeks ahead of the Centers for Disease Control. The Air Force has devoted considerable resources to suicide awareness and prevention. Despite these efforts, suicide rates have remained largely unaffected. The Air Force Suicide Prevention Program assists family, friends, and co-workers of airmen in recognizing and discussing behavioral changes with at-risk individuals. Based on other successes in correlating behaviors in cyberspace and the real world, is it possible to leverage online activities to help identify individuals who exhibit suicidal or depression-related symptoms? This research explores the notion of using Internet search queries to classify individuals with common search patterns. Text mining was performed on user search histories for a one-month period from nine Air Force installations. The search histories were clustered based on search term probabilities, providing the ability to identify relationships between individuals searching for common terms. Analysis was then performed to identify relationships between individuals searching for key terms associated with suicide, anxiety, and post-traumatic stress.
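For illustration, a minimal sketch of clustering search histories by per-user term probabilities, assuming scikit-learn; the pseudo-users and query terms are invented for the example and do not reflect the study's data.

```python
# Minimal sketch of clustering user search histories by term probabilities,
# in the spirit of the study; histories and terms are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

histories = [
    "weather flights movie times",        # one pseudo-user's month of queries
    "insomnia cannot sleep anxiety help",
    "flights hotels car rental",
]

# Term counts per user, row-normalized into per-user term probabilities.
counts = CountVectorizer().fit_transform(histories)
term_probs = normalize(counts, norm="l1")

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(term_probs)
print(labels)  # users sharing common search patterns fall in the same cluster
```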

    A Framework for Hybrid Intrusion Detection Systems

    Web application security is a definite threat to the world’s information technology infrastructure. The Open Web Application Security Project (OWASP) generally defines web application security violations as unauthorized or unintentional exposure, disclosure, or loss of personal information. These breaches occur without the company’s knowledge, and it often takes a while before the web application attack is revealed to the public, typically only after the security violations have been fixed. Needing to protect their reputations, organizations have begun researching solutions to these problems. The most widely accepted solution is the use of an Intrusion Detection System (IDS). Such systems currently rely on either signatures of the attack used for the data breach or changes in the behavior patterns of the system to identify an intruder. These systems, whether signature-based or anomaly-based, are readily understood by attackers. Issues arise when attacks are not noticed by an existing IDS because the attack does not fit the pre-defined attack signatures the IDS is implemented to discover. Despite current IDSs’ capabilities, little research has identified a method to detect all potential attacks on a system. This thesis intends to address this problem, with particular emphasis on detecting advanced attacks, such as those that take place at the application layer. These types of attacks are able to bypass existing IDSs, increasing the potential for a web application security breach to occur and go undetected. In particular, the attacks under study are all web application layer attacks: SQL injection, cross-site scripting, directory traversal, and remote file inclusion. This work identifies common and existing data breach detection methods as well as the necessary improvements for IDS models. Ultimately, the proposed approach combines an anomaly detection technique measured by cross entropy with a signature-based attack detection framework utilizing a genetic algorithm. The proposed hybrid model for data breach detection benefits organizations by increasing security measures and allowing attacks to be identified in less time and more efficiently.
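A minimal sketch of the cross-entropy measurement that can drive the anomaly-detection side of such a hybrid model: a character distribution is learned from benign parameter values, and requests whose characters are improbable under that baseline score high. The baseline data and the floor probability are illustrative assumptions, not the thesis's exact formulation.

```python
# Minimal sketch of cross-entropy based anomaly scoring for HTTP parameters,
# assuming a character-distribution baseline learned from benign requests.
import math
from collections import Counter

def char_distribution(samples):
    counts = Counter("".join(samples))
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def cross_entropy(value, baseline, floor=1e-6):
    # H(p, q) = -sum p(c) * log q(c); unseen characters get a small floor
    # probability so injected syntax (quotes, angle brackets) raises the score.
    probs = char_distribution([value])
    return -sum(p * math.log(baseline.get(c, floor)) for c, p in probs.items())

benign = ["alice", "bob42", "search=shoes", "page=2"]
baseline = char_distribution(benign)

print(cross_entropy("page=3", baseline))       # low score: looks benign
print(cross_entropy("' OR 1=1 --", baseline))  # high score: SQL injection
```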

    Anomaly-based network intrusion detection enhancement by prediction threshold adaptation of binary classification models

    Network traffic exhibits a high level of variability over short periods of time. This variability impacts negatively on the performance (accuracy) of anomaly-based network Intrusion Detection Systems (IDS) that are built using predictive models in a batch-learning setup. This thesis investigates how adapting the discriminating threshold of model predictions, specifically to the evaluated traffic, improves the detection rates of these Intrusion Detection models. Specifically, this thesis studied the adaptability features of three well-known Machine Learning algorithms: C5.0, Random Forest, and Support Vector Machine. The ability of these algorithms to adapt their prediction thresholds was assessed and analysed under different scenarios that simulated real-world settings using the prospective sampling approach. A new dataset (STA2018) was generated for this thesis and used for the analysis. This thesis has demonstrated empirically the importance of threshold adaptation in improving the accuracy of detection models when training and evaluation (test) traffic have different statistical properties. Further investigation was undertaken to analyse the effects of feature selection and data balancing processes on a model’s accuracy when evaluation traffic with different significant features was used. The effects of threshold adaptation on reducing the accuracy degradation of these models were statistically analysed. The results showed that, of the three compared algorithms, Random Forest was the most adaptable and had the highest detection rates. This thesis then extended the analysis to apply threshold adaptation on sampled traffic subsets, using different sample sizes, sampling strategies, and label error rates. This investigation showed the robustness of the Random Forest algorithm in identifying the best threshold. The Random Forest algorithm only needed a sample that was 0.05% of the original evaluation traffic to identify a discriminating threshold that achieved nearly 90% of the accuracy of the optimal threshold. This research was supported and funded by the Government of the Sultanate of Oman, represented by the Ministry of Higher Education and the Sultan Qaboos University.
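A minimal sketch of the threshold-adaptation idea, assuming scikit-learn: score a small labeled sample of the evaluation traffic once, then sweep candidate thresholds for the best accuracy. The sweep granularity and the metric are illustrative; the thesis's exact procedure may differ.

```python
# Minimal sketch of prediction-threshold adaptation, assuming scikit-learn.
# A small labeled sample of the evaluation traffic picks the discriminating
# threshold; details here are illustrative, not the thesis's exact method.
import numpy as np
from sklearn.metrics import accuracy_score

def best_threshold(model, X_sample, y_sample):
    # Score the sample once, then sweep candidate thresholds.
    scores = model.predict_proba(X_sample)[:, 1]
    candidates = np.linspace(0.05, 0.95, 19)
    accs = [accuracy_score(y_sample, (scores >= t).astype(int))
            for t in candidates]
    return candidates[int(np.argmax(accs))]

# model = RandomForestClassifier().fit(X_train, y_train)
# t = best_threshold(model, X_small_sample, y_small_sample)  # e.g. 0.05% sample
# y_pred = model.predict_proba(X_eval)[:, 1] >= t            # adapted threshold
```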

    Automating the gathering of relevant information from biomedical text

    More and more, database curators rely on literature-mining techniques to help them gather and make use of the knowledge encoded in text documents. This thesis investigates how an assisted annotation process can help, and explores the hypothesis that it is only with respect to full-text publications that a system can tell relevant and irrelevant facts apart by studying their frequency. A semi-automatic annotation process was developed for a particular database - the Nuclear Protein Database (NPD) - based on a set of full-text articles newly annotated with regard to subnuclear protein localisation, along with eight lexicons. The annotation process is carried out online, retrieving relevant documents (abstracts and full-text papers) and highlighting sentences of interest in them. The process also offers a summary table of the facts found, clustered by type of information. Each method involved in each step of the tool is evaluated using cross-validation results on the training data as well as test set results. The performance of the final tool, called the “NPD Curator System Interface”, is estimated empirically in an experiment where the NPD curator updates the database with pieces of information found relevant in 31 publications using the interface. A final experiment complements our main methodology by showing its extensibility to retrieving information on protein function rather than localisation. I argue that the general methods, the results they produced, and the discussions they engendered are useful for any subsequent attempt to generate semi-automatic database annotation processes. The annotated corpora, gazetteers, methods, and tool are fully available on request from the author ([email protected]).
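A minimal sketch of the lexicon-driven sentence highlighting such an interface performs; the lexicon entries, the naive sentence splitter, and the example text are illustrative assumptions, not the tool's actual components.

```python
# Minimal sketch of lexicon-based sentence highlighting, in the spirit of the
# described annotation interface; lexicon and text are illustrative only.
import re

lexicon = {"nucleolus", "nuclear speckle", "pml body"}  # subnuclear locations

def highlight(text):
    # Split naively on sentence boundaries and keep sentences that mention
    # any lexicon term (a real system would use a proper sentence splitter).
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if any(term in s.lower() for term in lexicon)]

doc = ("The protein localises to the nucleolus. It was cloned in 1998. "
       "Colocalisation with PML body markers was also observed.")
for s in highlight(doc):
    print(s)  # prints the first and third sentences
```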

    Democratizing Self-Service Data Preparation through Example Guided Program Synthesis

    The majority of real-world data we can access today have one thing in common: they are not immediately usable in their original state. Trapped in a swamp of data usability issues like non-standard data formats and heterogeneous data sources, most data analysts and machine learning practitioners have to burden themselves with "data janitor" work, writing ad-hoc Python, Perl, or SQL scripts, which is tedious and inefficient. It is estimated that data scientists or analysts typically spend 80% of their time preparing data, a significant amount of human effort that could be redirected to better goals. In this dissertation, we address this problem by harnessing knowledge such as examples and other useful hints from the end user. We develop program synthesis techniques guided by heuristics and machine learning, which effectively make data preparation less painful and more efficient to perform for data users, particularly those with little to no programming experience. Data transformation, also called data wrangling or data munging, is an important task in data preparation, seeking to convert data from one format to a different (often more structured) format. Our system Foofah shows that allowing end users to describe their desired transformation by providing small input-output transformation examples can significantly reduce the overall user effort. The underlying program synthesizer can often succeed in finding meaningful data transformation programs within a reasonably short amount of time. Our second system, CLX, demonstrates that sometimes the user does not even need to provide complete input-output examples, but only to label values that are already desirable where they exist in the original dataset. The system is still capable of suggesting reasonable and explainable transformation operations to fix the non-standard data format issue in a dataset full of heterogeneous data with varied formats. PRISM, our third system, targets the data preparation task of data integration, i.e., combining multiple relations to formulate a desired schema. PRISM allows the user to describe the target schema using not only high-resolution (precise) constraints of complete example data records in the target schema, but also (imprecise) constraints of varied resolutions, such as incomplete data record examples with missing values, value ranges, or multiple possible values in each element (cell), so as to require less familiarity with the database contents from the end user. (PhD dissertation, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/163059/1/markjin_1.pd)
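A minimal sketch of the example-guided synthesis idea: enumerate small pipelines of string operators and keep the first one consistent with every input-output example. The operator set and brute-force search are toy stand-ins for Foofah's much richer, heuristically guided search over table transformations.

```python
# Minimal sketch of example-guided synthesis over a tiny operator space;
# a toy illustration of the idea, not the Foofah/CLX/PRISM algorithms.
from itertools import product

OPS = {
    "strip": str.strip,
    "lower": str.lower,
    "split_first": lambda s: s.split(",")[0],
}

def synthesize(examples, max_len=3):
    # Enumerate operator pipelines up to max_len and return the first one
    # consistent with every input-output example.
    for length in range(1, max_len + 1):
        for names in product(OPS, repeat=length):
            def run(s, names=names):
                for n in names:
                    s = OPS[n](s)
                return s
            if all(run(i) == o for i, o in examples):
                return names
    return None

print(synthesize([("  Ann, 23 ", "ann"), (" Bob, 41", "bob")]))
# ('strip', 'lower', 'split_first')
```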

    A Data Cleaning Solution by Perl Scripts for the KDD Cup 2003 Task 2

    In this paper, we present our solution for the KDD Cup 2003 task 2 competition. Our approach is based on a data cleaning methodology using Perl scripts. These scripts contain regular expressions for automatically extracting relevant information from the 35,472 LaTeX texts. The expressions were optimized through statistical investigation of the texts. Our solution allowed us to obtain 144,087 associations.
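A Python analogue of the kind of regular-expression extraction the paper describes (the authors' scripts were Perl); the LaTeX snippet and the pattern are illustrative, not the competition's actual data or rules.

```python
# Minimal Python analogue of regex-based extraction from LaTeX sources;
# the snippet and pattern are illustrative, not the KDD Cup 2003 data.
import re

latex = r"""
Prior work \cite{hep-th/9711200} extended \cite{hep-th/9802150,hep-th/9802109}.
"""

# Pull every citation key out of \cite{...}, splitting multi-key citations.
keys = [
    key.strip()
    for match in re.findall(r"\\cite\{([^}]*)\}", latex)
    for key in match.split(",")
]
print(keys)  # ['hep-th/9711200', 'hep-th/9802150', 'hep-th/9802109']
```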

    Intrusion Detection from Heterogeneous Sensors

    Nowadays, protecting computer systems and networks against various distributed and multi-step attacks has been a vital challenge for their owners. One of the essential threats to the security of such computer infrastructures is attacks by malicious individuals from inside and outside the system environment that aim to abuse available services or reveal confidential information. Consequently, managing and supervising computer systems is a considerable challenge, as new threats and attacks are discovered on a daily basis. Intrusion Detection Systems (IDSs) play a key role in the surveillance and monitoring of computer network infrastructures. These systems inspect events occurring in computer systems and networks and, in case of any malicious behavior, generate appropriate alerts describing the attacks’ details. However, there are a number of shortcomings that need to be addressed to make them reliable enough for real-world situations. One of the fundamental challenges in real-world IDS is the large number of redundant, non-relevant, and false positive alerts that they generate, making it a difficult task for security administrators to determine and identify real and important alerts. Part of the problem is that most IDSs do not take into account contextual information (type of systems, applications, users, networks, etc.), and therefore a large portion of the alerts are non-relevant, in that even though they correctly recognize an intrusion, the intrusion fails to reach its objectives.
Additionally, to detect newer and complicated attacks, relying on only one detection sensor type is not adequate, and as a result many current IDSs are unable to detect them. This is especially important with respect to targeted attacks that try to avoid detection by conventional IDSs and by other security products. While many system administrators are known to successfully incorporate context information and many different types of sensors and logs into their analysis, an important problem with this approach is the lack of automation in both storage and analysis. In order to address these problems in IDS applicability, various IDS types have been proposed in recent years, and commercial off-the-shelf (COTS) IDS products have found their way into the Security Operations Centers (SOCs) of many large organizations. From a general perspective, these works can be categorized into: machine-learning-based approaches, including Bayesian networks, data mining methods, decision trees, neural networks, etc.; alert correlation and alert fusion based approaches; context-aware intrusion detection systems; distributed intrusion detection systems; and ontology-based intrusion detection systems. Since these works only focus on one or a few of the IDS challenges, to the best of our knowledge the problem as a whole has not been resolved. Hence, there is no comprehensive work addressing all the mentioned challenges of modern intrusion detection systems. For example, works that utilize machine learning approaches only classify events based on some features depending on behavior observed with one type of event, and they do not take into account contextual information and event interrelationships. Most of the proposed alert correlation techniques consider correlation only across multiple sensors of the same type having a common event and alert semantics (homogeneous correlation), leaving it to security administrators to perform correlation across heterogeneous types of sensors. Context-aware approaches only employ limited aspects of the underlying context. The lack of accurate evaluation based on data sets that encompass modern, complex attack scenarios is another major shortcoming of most of the proposed approaches. The goal of this thesis is to design an event correlation system that can correlate across several heterogeneous types of sensors and logs (e.g. IDS/IPS, firewall, database, operating system, anti-virus, web proxy, routers, etc.), in the hope of detecting complex attacks that leave traces in various systems, and to incorporate context information into the analysis in order to reduce false positives. To this end, our contributions can be split into four main parts: 1) We propose Pasargadae, a comprehensive context-aware and ontology-based event correlation framework that automatically performs event correlation by reasoning on the information collected from various information resources. Pasargadae uses ontologies to represent and store information on events, context and vulnerability information, and attack scenarios, and uses simple ontology logic rules written in Semantic Query-Enhanced Web Rule Language (SQWRL) to correlate various information and filter out non-relevant alerts, duplicate alerts, and false positives. 2) We propose a meta-event based, topological-sort based, and semantic-based event correlation approach that employs Pasargadae to perform event correlation across events collected from several sensors distributed in a computer network.
3) We propose a semantic-based, context-aware alert fusion approach that relies on some of the subcomponents of Pasargadae to fuse alerts collected from heterogeneous IDSs. 4) In order to show the level of flexibility of Pasargadae, we use it to implement some other proposed alert and event correlation approaches. The sum of these contributions represents a significant improvement in the applicability and reliability of IDSs in real-world situations. In order to test the performance and flexibility of the proposed event correlation approach, we need to address the lack of experimental infrastructure suitable for network security. A study of the literature shows that current experimental approaches are not appropriate for generating high-fidelity network data. Consequently, in order to accomplish a comprehensive evaluation, we first conduct our experiments on two separate case study scenarios, inspired by the DARPA 2000 and UNB ISCX IDS evaluation data sets. Next, as a complete field study, we employ Pasargadae in a real computer network for a two-week period to inspect its detection capabilities on ground-truth network traffic. The results obtained show that, compared to other existing IDS improvements, the proposed contributions significantly improve IDS performance (detection rate) while reducing false positives and non-relevant and duplicate alerts.
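A minimal sketch of the core idea of correlating events from heterogeneous sensors: group normalized log records by a shared attribute within a time window into meta-events. The field names are invented, and this is far simpler than Pasargadae's ontology- and SQWRL-rule-based reasoning.

```python
# Minimal sketch of correlating events from heterogeneous sensors by a shared
# attribute inside a time window; field names are illustrative assumptions.
from collections import defaultdict

events = [  # normalized records from different log sources
    {"src": "ids",      "ip": "10.0.0.5", "t": 100, "msg": "port scan"},
    {"src": "firewall", "ip": "10.0.0.5", "t": 130, "msg": "blocked 445/tcp"},
    {"src": "webproxy", "ip": "10.0.0.9", "t": 400, "msg": "allowed GET"},
]

def correlate(events, window=60):
    # Group events by source IP, then merge those within `window` seconds
    # into one meta-event spanning multiple sensor types.
    by_ip = defaultdict(list)
    for e in sorted(events, key=lambda e: e["t"]):
        by_ip[e["ip"]].append(e)
    meta = []
    for ip, evs in by_ip.items():
        group = [evs[0]]
        for e in evs[1:]:
            if e["t"] - group[-1]["t"] <= window:
                group.append(e)
            else:
                meta.append((ip, group))
                group = [e]
        meta.append((ip, group))
    return meta

for ip, group in correlate(events):
    print(ip, [e["src"] for e in group])
# 10.0.0.5 ['ids', 'firewall']   <- one meta-event across two sensor types
# 10.0.0.9 ['webproxy']
```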

    Data quality and data cleaning in database applications

    Today, data plays an important role in people's daily activities. With the help of database applications such as decision support systems and customer relationship management (CRM) systems, useful information or knowledge can be derived from large quantities of data. However, investigations show that many such applications fail to work successfully. There are many reasons for such failures, such as poor system infrastructure design or query performance, but nothing is more certain to yield failure than a lack of concern for the issue of data quality. High-quality data is a key to today's business success. The quality of any large real-world data set depends on a number of factors, among which the source of the data is often the crucial one. It has now been recognized that an inordinate proportion of data in most data sources is dirty. Obviously, a database application with a high proportion of dirty data is not reliable for the purpose of data mining or deriving business intelligence, and the quality of decisions made on the basis of such business intelligence is also unreliable. In order to ensure high data quality, enterprises need to have a process, methodologies, and resources to monitor and analyze the quality of data, and methodologies for preventing and/or detecting and repairing dirty data. This thesis focuses on the improvement of data quality in database applications with the help of current data cleaning methods. It provides a systematic and comparative description of the research issues related to the improvement of the quality of data, and addresses a number of research issues related to data cleaning. In the first part of the thesis, the literature on data cleaning and data quality is reviewed and discussed. Building on this research, a rule-based taxonomy of dirty data is proposed in the second part of the thesis. The proposed taxonomy not only covers the most common dirty data types but is also the basis on which the proposed method for solving the Dirty Data Selection (DDS) problem during the data cleaning process was developed. This helps us to design the DDS process in the proposed data cleaning framework described in the third part of the thesis. This framework retains the most appealing characteristics of existing data cleaning approaches, and improves the efficiency and effectiveness of data cleaning as well as the degree of automation during the data cleaning process. Finally, a set of approximate string matching algorithms is studied and experimental work has been undertaken. Approximate string matching is an important part of many data cleaning approaches and has been well studied for many years. The experimental work in the thesis confirmed the statement that there is no clear best technique. It shows that the characteristics of the data, such as the size of a dataset, the error rate in a dataset, the type of strings in a dataset, and even the type of typo in a string, have a significant effect on the performance of the selected techniques. In addition, the characteristics of the data also affect the selection of suitable threshold values for the selected matching algorithms. The achievements based on these experimental results provide the fundamental improvement in the design of the 'algorithm selection mechanism' in the data cleaning framework, which enhances the performance of the data cleaning system in database applications.
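A minimal sketch of threshold-based approximate string matching, one family of techniques the thesis evaluates; the normalization and threshold choice here are illustrative, echoing the finding that suitable thresholds depend on the data.

```python
# Minimal sketch of threshold-based approximate string matching using a
# pure-Python edit distance; threshold and normalization are illustrative.
def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def is_match(a, b, threshold=0.8):
    # Normalize by the longer string so the threshold is length-independent.
    dist = levenshtein(a, b)
    return 1 - dist / max(len(a), len(b), 1) >= threshold

print(is_match("Jon Smith", "John Smith"))  # True: one edit apart
print(is_match("Jon Smith", "Jane Doe"))    # False
```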
