
    Machine Learning And A Workflow Engine For An Agent-Based Structural Health Monitoring System

    This thesis reports on work in machine learning and high-performance computing for structural health monitoring. The data used are acoustic emission signals, and we classify these signals according to source mechanism; those associated with crack growth are particularly significant. The work reported here is part of a larger project to develop an agent-based structural health monitoring system. The agents are proxies for communication- and computation-intensive techniques and respond to the situation at hand by determining an appropriate constellation of techniques. The techniques thus structured are executed by a workflow engine, which is part of the contribution reported here. It is critical that the system have a repertoire of classifiers with different characteristics, so that a combination appropriate for the situation at hand can generally be found. The classifiers are trained using machine-learning techniques, and we report on investigations of three supervised and two unsupervised learning techniques to determine which are best suited to a particular situation.
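
    A minimal sketch of the kind of repertoire-building the abstract describes: several supervised classifiers are trained on acoustic emission (AE) features and compared by cross-validation so that an appropriate one can be chosen per situation. The feature names and synthetic data are illustrative assumptions, not the project's actual pipeline.

```python
# Sketch: a small repertoire of supervised classifiers for AE signals,
# compared by cross-validation. Features and data are placeholders.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Hypothetical AE features: amplitude, rise time, duration, counts, energy
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)  # 1 = crack-growth source, 0 = other

# A repertoire of classifiers with different characteristics
candidates = {
    "svm": SVC(kernel="rbf", C=1.0),
    "mlp": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000),
    "tree": DecisionTreeClassifier(max_depth=5),
}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.2f}")
```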

    New Statistical Algorithms for the Analysis of Mass Spectrometry Time-Of-Flight Mass Data with Applications in Clinical Diagnostics

    Mass spectrometry (MS) based techniques have emerged as a standard for large-scale protein analysis. Ongoing progress in terms of more sensitive machines and improved data analysis algorithms has led to a constant expansion of its fields of application. Recently, MS was introduced into clinical proteomics with the prospect of early disease detection using proteomic pattern matching. Analyzing biological samples (e.g. blood) by mass spectrometry generates mass spectra that represent the components (molecules) contained in a sample as masses and their respective relative concentrations. In this work, we are interested in those components that are constant within a group of individuals but differ greatly between individuals of two distinct groups. These distinguishing components, which depend on a particular medical condition, are generally called biomarkers. Since not all biomarkers found by the algorithms are of equal (discriminating) quality, we are only interested in a small biomarker subset that, as a combination, can be used as a fingerprint for a disease. Once a fingerprint for a particular disease (or medical condition) is identified, it can be used in clinical diagnostics to classify unknown spectra. In this thesis we have developed new algorithms for the automatic extraction of disease-specific fingerprints from mass spectrometry data. Special emphasis has been put on designing highly sensitive methods with respect to signal detection. Thanks to our statistically based approach, our methods are able to detect signals, such as hormones, even below the noise level inherent in data acquired by common MS machines. To provide access to these new classes of algorithms for collaborating groups, we have created a web-based analysis platform that provides all necessary interfaces for data transfer, data analysis and result inspection. To demonstrate the platform's practical relevance, it has been used in several clinical studies, two of which are presented in this thesis. These studies showed that our platform is superior to commercial systems with respect to fingerprint identification. As an outcome of these studies, several fingerprints for different cancer types (bladder, kidney, testicle, pancreas, colon and thyroid) have been detected and validated. The clinical partners indeed emphasize that these results would have been impossible with a less sensitive analysis tool (such as the currently available systems). In addition to the issue of reliably finding and handling signals in noise, we faced the problem of handling very large amounts of data, since an average dataset of an individual is about 2.5 gigabytes in size and we have data from hundreds to thousands of persons. To cope with these large datasets, we developed a new framework for a heterogeneous (quasi) ad-hoc grid - an infrastructure that allows the integration of thousands of computing resources (e.g. desktop computers, computing clusters or specialized hardware, such as IBM's Cell processor in a PlayStation 3).
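
    As a rough illustration of statistically driven fingerprint extraction (not the thesis' actual algorithms), the sketch below compares two groups of synthetic spectra bin by bin and keeps the m/z bins that discriminate between them; the planted signal is weaker than the noise standard deviation.

```python
# Sketch: per-bin two-sample t-tests pick discriminating m/z bins as
# biomarker candidates; the surviving bins would form a disease
# "fingerprint". Synthetic data; the planted signal at bin 420 is
# smaller than the noise sigma.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_bins = 1000                          # discretised m/z axis
healthy = rng.normal(10.0, 2.0, size=(100, n_bins))
disease = rng.normal(10.0, 2.0, size=(100, n_bins))
disease[:, 420] += 1.5                 # shift below the noise sigma of 2

t, p = stats.ttest_ind(healthy, disease, axis=0)
candidates = np.where(p < 0.05 / n_bins)[0]   # Bonferroni correction
print("candidate m/z bins:", candidates)      # expected to include 420
```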

    Developing Cyberspace Data Understanding: Using CRISP-DM for Host-based IDS Feature Mining

    Current intrusion detection systems generate a large number of specific alerts but do not provide actionable information. Often these alerts must be analyzed by a network defender, a time-consuming and tedious task that can occur hours or days after an attack. Improved understanding of the cyberspace domain can lead to great advancements in cyberspace situational awareness research and development. This thesis applies the Cross Industry Standard Process for Data Mining (CRISP-DM) to develop an understanding of a host system under attack. Data are generated by launching scans and exploits at a machine outfitted with a set of host-based data collectors. Through knowledge discovery, features are identified within the collected data which can be used to enhance host-based intrusion detection. By discovering relationships between the collected data and the attack events, human understanding of the activity is improved. This method of searching for hidden relationships between sensors greatly enhances understanding of new attacks and vulnerabilities, bolstering our ability to defend the cyberspace domain.
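
    The feature-mining step can be illustrated with a small sketch: host-sensor measurements are ranked by how much information they carry about attack labels. The sensor names and data below are hypothetical, not the thesis' collectors.

```python
# Sketch: rank host-based sensor features by mutual information with
# attack/no-attack labels. Sensor names and data are assumptions.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(2)
sensors = ["cpu_load", "open_sockets", "failed_logins", "disk_io"]
X = rng.normal(size=(500, len(sensors)))
y = (X[:, 2] > 1.0).astype(int)   # attacks coincide with failed_logins

scores = mutual_info_classif(X, y, random_state=0)
for name, score in sorted(zip(sensors, scores), key=lambda kv: -kv[1]):
    print(f"{name}: MI = {score:.3f}")
```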

    Complex Event Processing (CEP) for Intrusion Detection

    In this thesis we use data-analysis technologies to study the behavior of IoT [3] networks. IoT devices are everywhere, and they are not going away any time soon; examples include wearable health devices, connected vehicles and smart grids. But what about security? These systems gather and share huge quantities of sensitive user data. Consumers are constantly exposed to attacks and physical intrusions through the wide range of available IoT devices, such as central control devices for home automation sensors. As we can imagine, these devices are inherently insecure (and their users are often unaware of impending threats), making them easy prey for attackers. In parallel, IoT devices can be characterized as low cost, i.e. devices with limited processing power, battery and memory. This means that device-centric solutions for incorporating security and privacy components are a challenge as well.
    The proposed approach offers an application that addresses the problem of security intrusions (anomaly-based detection) by using streams generated by IoT devices, relevant to their network properties, to detect abnormal behavior and notify the user via an alert. In our case, each device participating in an IoT network is handled as a sensor device that generates streams of network measurements using the Simple Network Management Protocol (SNMP) [1]. These measurements are provided as input to a Complex Event Processing (CEP) [4] framework, namely Esper [2]. CEP listeners detect and analyze the sensor streams in real time based on thresholds related to normal behavior; abnormal statistical behavior can be a clear indication of an event occurrence (e.g., an intrusion). Typical measurements of the devices can be combined to observe the outbreak of various security incidents more accurately. The estimations of the CEP engine are based on statistical predictors, including machine-learning methods such as ART [5]. We present a number of experiments for the proposed methodologies that show their performance.
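
    The engine named above is Esper, which is driven by Java and EPL queries; the Python sketch below only illustrates the listener logic just described: a sliding window of SNMP-style counter readings per device, with an alert fired when a reading deviates too far from the window's normal behaviour. The field name, window size, threshold and the EPL fragment in the comment are assumptions.

```python
# Python illustration of a CEP listener over SNMP counter streams.
# The real system would express this in Esper's EPL, e.g. something
# like "select avg(ifInOctets) from SnmpEvent#time(60 sec)" (assumed).
from collections import deque

WINDOW = 60          # readings of history kept per device
THRESHOLD_SIGMA = 3  # alert when a reading deviates this far from normal

class CepListener:
    def __init__(self):
        self.history = deque(maxlen=WINDOW)

    def on_event(self, timestamp, if_in_octets):
        # Only judge deviations once a full window of history exists.
        if len(self.history) == self.history.maxlen:
            mean = sum(self.history) / len(self.history)
            var = sum((x - mean) ** 2 for x in self.history) / len(self.history)
            if var > 0 and abs(if_in_octets - mean) > THRESHOLD_SIGMA * var ** 0.5:
                print(f"t={timestamp}: ALERT ifInOctets={if_in_octets}, "
                      f"window mean was {mean:.0f}")
        self.history.append(if_in_octets)

listener = CepListener()
for t in range(200):
    reading = 1000 + (t % 5)   # normal traffic
    if t == 150:
        reading = 50000        # simulated burst, e.g. a DoS symptom
    listener.on_event(t, reading)
```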

    Development and application of distributed computing tools for virtual screening of large compound libraries

    In the current drug discovery process, the identification of new target proteins and potential ligands is tedious, expensive and time-consuming. The use of in silico techniques is therefore of utmost importance and has proved to be a valuable strategy for detecting complex structural and bioactivity relationships. The increasing demand for computational power in scientific fields, together with the timely analysis of the piles of data generated, requires innovative strategies for the efficient use of distributed computing resources in the form of computational grids. Such grids add a new aspect to the emerging information-technology paradigm by providing and coordinating heterogeneous resources: organizations, people, computing, storage and networking facilities, as well as data, knowledge, software and workflows. The aim of this study was to develop a university-wide applicable grid infrastructure, UVieCo (University of Vienna Condor pool), which can be used to implement standard structure- and ligand-based drug discovery applications using freely available academic software. Firewall and security issues were resolved with a virtual private network setup, whereas virtualization of computer hardware was done using the CoLinux concept, which allows Linux-executable jobs to run inside Windows machines. The effectiveness of the grid was assessed by performance-measurement experiments using sequential and parallel tasks. Subsequently, the association of expression/sensitivity profiles of ABC transporters with activity profiles of anticancer compounds was analyzed by mining data from the NCI (National Cancer Institute). The datasets generated in this analysis were used with ligand-based computational methods such as shape similarity and classification algorithms to identify P-gp substrates and separate them from non-substrates. While developing predictive classification models, the problem of imbalanced class distribution was addressed using the cost-sensitive bagging approach. Applicability-domain experiments revealed that our model not only predicts NCI compounds well but can also be applied to drug-like molecules. The developed models were relatively simple yet precise enough to be applicable to virtual screening of large chemical libraries for the early identification of P-gp substrates, which can be useful for removing compounds with poor ADMET properties in an early phase of drug discovery. Additionally, shape-similarity and self-organizing-map techniques were used to screen an in-house as well as a large vendor database to identify novel selective serotonin reuptake inhibitor (SSRI)-like compounds that can induce apoptosis. The retrieved hits possess novel chemical scaffolds and can be considered starting points for lead-optimization studies. The work described in this thesis will be useful for creating a distributed computing environment using available resources within an organization and can be applied to various tasks such as the efficient handling of imbalanced data classification or multistep virtual screening.
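
    As a hedged sketch of the cost-sensitive bagging idea (not the thesis' exact variant or molecular descriptors), the example below bags class-weighted decision trees over a synthetic, heavily imbalanced two-class dataset and scores the ensemble with balanced accuracy.

```python
# Sketch of cost-sensitive bagging for an imbalanced two-class problem
# (few substrates vs. many non-substrates). Requires scikit-learn >= 1.2
# for the `estimator` parameter; data are synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)

# Each bagged tree penalises errors on the rare class more heavily.
base = DecisionTreeClassifier(class_weight="balanced")
model = BaggingClassifier(estimator=base, n_estimators=50, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="balanced_accuracy")
print(f"balanced accuracy: {scores.mean():.2f}")
```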

    Cybersecurity of Digital Service Chains

    This open access book presents the main scientific results of the H2020 GUARD project. The GUARD project aims at filling the current technological gap between software management paradigms and cybersecurity models, the latter still lacking the orchestration and agility needed to address the dynamicity of the former. The book provides a comprehensive review of the main concepts, architectures, algorithms, and non-technical aspects developed during three years of investigation; the description of the Smart Mobility use case developed at the end of the project gives a practical example of how the GUARD platform and related technologies can be deployed in real scenarios. We expect the book to be of interest to the broad group of researchers, engineers, and professionals who daily experience the inadequacy of outdated cybersecurity models for modern computing environments and cyber-physical systems.

    Network anomalies detection via event analysis and correlation by a smart system

    The multidisciplinary nature of contemporary societies compels us to regard Information Technology (IT) systems as one of the most significant gifts in living memory. However, their growth demands a mandatory security effort for users, in the form of effective and robust tools to combat the cybercrime to which users, individual or collective, are exposed almost daily. Monitoring and detection of this kind of problem must be ensured in real time, allowing companies to intervene fruitfully, quickly and in unison. The proposed framework is based on an organic symbiosis between credible, affordable and effective open-source tools for data analysis, relying on Security Information and Event Management (SIEM), Big Data and Machine Learning (ML) techniques commonly applied in the development of real-time monitoring systems. Dissecting this framework: it is built on a SIEM methodology that monitors data in real time and simultaneously stores the information to assist forensic investigation teams; the Big Data concept handles the manipulation and organisation of the data flow; and ML techniques help create mechanisms to detect possible attacks or anomalies on the network. The framework is intended to provide a real-time analysis application at ISCTE – Instituto Universitário de Lisboa (Iscte), offering more complete, efficient and secure monitoring of the data from the different devices comprising the network.
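
    The ML stage of such a framework can be sketched as follows: an unsupervised detector is fitted on per-flow features of the kind a SIEM layer would collect, and flags outlying flows. The feature names and synthetic traffic are illustrative assumptions, not the deployed Iscte system.

```python
# Sketch of the ML stage only: an unsupervised detector flags anomalous
# flows among features a SIEM layer would collect. Features and traffic
# are hypothetical.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)
# Per-flow features: bytes, packets, duration (seconds)
normal = rng.normal([5000, 40, 2.0], [500, 5, 0.5], size=(980, 3))
attacks = rng.normal([90000, 900, 0.1], [5000, 50, 0.05], size=(20, 3))
flows = np.vstack([normal, attacks])

detector = IsolationForest(contamination=0.02, random_state=0).fit(flows)
labels = detector.predict(flows)   # -1 = anomaly, 1 = normal
print("flagged flows:", int((labels == -1).sum()))
```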

    Framing Apache Spark in life sciences

    Advances in high-throughput and digital technologies have required the adoption of big data for handling complex tasks in life sciences. However, the drift to big data has led researchers to face technical and infrastructural challenges in storing, sharing and analysing these data. Such tasks require distributed computing systems and algorithms able to ensure efficient processing. Cutting-edge distributed programming frameworks make it possible to implement flexible algorithms that adapt the computation to the data over on-premise HPC clusters or cloud architectures. In this context, Apache Spark is a very powerful HPC engine for large-scale data processing on clusters. Thanks also to specialised libraries for working with structured and relational data, it supports machine learning, graph-based computation and stream processing. This review article is aimed at helping life sciences researchers ascertain the features of Apache Spark and assess whether it can be successfully used in their research activities.
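
    A minimal PySpark sketch of the pattern the review describes, i.e. loading structured data, assembling features and fitting an MLlib model on a cluster. The CSV path and column names are placeholders, not taken from the article.

```python
# Minimal PySpark sketch: load structured data, derive features, fit an
# MLlib model. "samples.csv" and its columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("life-sciences-demo").getOrCreate()

df = spark.read.csv("samples.csv", header=True, inferSchema=True)
assembler = VectorAssembler(inputCols=["gene_a", "gene_b", "gene_c"],
                            outputCol="features")
train = assembler.transform(df).select("features", "label")

model = LogisticRegression(maxIter=20).fit(train)
print("coefficients:", model.coefficients)
spark.stop()
```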