
    Geant4 Hadronic Cascade Models and CMS Data Analysis: Computational Challenges in the LHC Era

    This work belongs to the field of computational high-energy physics (HEP). The key methods used in this thesis to meet the challenges raised by the Large Hadron Collider (LHC) era experiments are object-oriented software engineering, Monte Carlo simulation, cluster computing, and artificial neural networks. The first aspect discussed is the development of hadronic cascade models, used for the accurate simulation of medium-energy hadron-nucleus reactions up to 10 GeV. These models are typically needed in hadronic calorimeter studies and in the estimation of radiation backgrounds. Applications outside HEP include medicine (such as hadron therapy simulations), space science (satellite shielding), and nuclear physics (spallation studies). Validation results are presented for several significant improvements released in the Geant4 simulation toolkit, and the significance of the new models for computing in the LHC era is estimated. In particular, we assess the ability of the Bertini cascade to simulate the Compact Muon Solenoid (CMS) hadron calorimeter (HCAL). LHC test beam activity has a tightly coupled simulation-to-data-analysis cycle: typically, a Geant4 computer experiment is used to understand test beam measurements. Thus another aspect of this thesis is a description of studies related to developing new CMS H2 test beam data analysis tools and performing data analysis based on CMS Monte Carlo events. These events have been simulated in detail using Geant4 physics models, the full CMS detector description, and event reconstruction. Using the ROOT data analysis framework, we have developed an offline ANN-based approach to tag b-jets associated with heavy neutral Higgs particles, and we show that this kind of NN methodology can be successfully used to separate the Higgs signal from the background in the CMS experiment.

    The main methods in this work, which belongs to the field of computational experimental particle physics, are object-oriented software development, Monte Carlo simulation, and artificial neural networks. These methods have been used to meet the challenges posed by the simulation of experiments at CERN's Large Hadron Collider and by the development of data-analysis methods for them. The first part of the work focuses on the development of intranuclear hadron cascade models. Typical application areas of these models are the simulation of hadron calorimeters and the estimation of background radiation in particle physics, spallation studies in nuclear physics, radiation shielding in space research, and the modelling of hadron therapy in medicine. The work presents the performance and applications of the models developed in the open-source Geant4 software, particularly from the perspective of the hadron calorimeter of CERN's Compact Muon Solenoid experiment, and explains why the CERN experiments have adopted the models developed in this work as a standard tool in their simulations. The second part describes the Geant4 simulation and data-analysis software developed for analysing particle beams with the Helsinki Silicon Beam Telescope. In addition, a new method is presented for separating the Higgs particle signal from the dominant background in the CMS experiment using artificial neural networks. The analysis with self-learning neural networks, previously applied only rarely, demonstrates that such a tool is also suited to supporting traditional methods.
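
    As an illustration of the kind of ANN-based signal/background separation described above, the following is a minimal sketch using scikit-learn rather than the ROOT-based tools of the thesis; the jet features and Gaussian toy distributions are hypothetical stand-ins, not CMS data.
```python
# Illustrative sketch only: a small feed-forward network separating a
# "signal" class from "background", in the spirit of the ANN b-jet
# tagging described above. The feature names and toy distributions are
# hypothetical, not the thesis' actual CMS variables.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 5000
# Hypothetical jet-level features: pT, eta, impact-parameter significance,
# secondary-vertex mass. Signal and background drawn from shifted Gaussians.
signal = rng.normal(loc=[80, 0.0, 3.0, 1.8], scale=[25, 1.2, 1.5, 0.6], size=(n, 4))
background = rng.normal(loc=[60, 0.0, 1.0, 1.0], scale=[25, 1.5, 1.0, 0.5], size=(n, 4))
X = np.vstack([signal, background])
y = np.concatenate([np.ones(n), np.zeros(n)])  # 1 = signal, 0 = background

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)

# One small hidden layer suffices for this toy separation task.
net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
net.fit(scaler.transform(X_train), y_train)
print("test accuracy:", net.score(scaler.transform(X_test), y_test))
```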

    Applying Machine Learning to Advance Cyber Security: Network Based Intrusion Detection Systems

    Many new devices, such as phones and tablets, as well as traditional computer systems, rely on wireless connections to the Internet and are susceptible to attacks. Two important types of attack are the use of malware and the exploitation of Internet protocol vulnerabilities in devices and network systems. These attacks pose a threat on many levels, and countering them therefore requires a combination of methods. In this research, we utilize machine learning to detect and classify malware; to visualize, detect, and classify worms; and to detect deauthentication attacks, a form of denial of service (DoS). This work also includes two prevention mechanisms for DoS attacks: a one-time password (OTP) and the use of machine learning. Furthermore, we focus on an exploit of the widely used IEEE 802.11 protocol for wireless local area networks (WLANs). The work proposed here presents a threefold approach for intrusion detection to remedy the effects of malware and an Internet protocol exploit, employing machine learning as the primary tool. We conclude with a comparison of dimensionality reduction methods against a deep learning classifier, demonstrating the effectiveness of these methods without compromising classification accuracy.
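
    As a rough illustration of the concluding comparison, the sketch below trains one classifier on the full feature set and another on a PCA-reduced set; the synthetic data stands in for real network-flow features, and the specific models are assumptions, not the methods of the thesis.
```python
# Illustrative sketch only: comparing a classifier on PCA-reduced features
# against one on the full feature set. The synthetic "network flow"
# features are hypothetical stand-ins for real intrusion-detection data.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=40, n_informative=10,
                           random_state=0)  # y: 0 = benign, 1 = attack
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("full 40-feature accuracy:", full.score(X_te, y_te))

pca = PCA(n_components=10).fit(X_tr)           # keep 10 components
reduced = RandomForestClassifier(random_state=0).fit(pca.transform(X_tr), y_tr)
print("10-component accuracy:  ", reduced.score(pca.transform(X_te), y_te))
```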

    Context-dependent fusion with application to landmine detection.

    Traditional machine learning and pattern recognition systems use a feature descriptor to describe the sensor data and a particular classifier (also called an expert or learner) to determine the true class of a given pattern. However, for complex detection and classification problems involving data with large intra-class variations and noisy inputs, no single source of information can provide a satisfactory solution. As a result, the combination of multiple classifiers is playing an increasing role in solving these complex pattern recognition problems, and has proven to be a viable alternative to using a single classifier. In this thesis we introduce a new Context-Dependent Fusion (CDF) approach. We use this method to fuse multiple algorithms which use different types of features and different classification methods on multiple sensor data. The proposed approach is motivated by the observation that there is no single algorithm that can consistently outperform all other algorithms. In fact, the relative performance of different algorithms can vary significantly depending on several factors, such as the extracted features and the characteristics of the target class. The CDF method is a local approach that adapts the fusion method to different regions of the feature space. The goal is to take advantage of the strengths of a few algorithms in different regions of the feature space without being affected by the weaknesses of the other algorithms, while also avoiding the loss of potentially valuable information provided by a few weak classifiers by considering their output as well. The proposed fusion has three main interacting components. The first component, called Context Extraction, partitions the composite feature space into groups of similar signatures, or contexts. The second component assigns an aggregation weight to each detector's decision in each context based on its relative performance within the context. The third component combines the multiple decisions, using the learned weights, to make a final decision. For the Context Extraction component, a novel algorithm that performs clustering and feature discrimination is used to cluster the composite feature space and identify the relevant features for each cluster. For the fusion component, six different methods were proposed and investigated. The proposed approach was applied to the problem of landmine detection. Detection and removal of landmines is a serious problem affecting civilians and soldiers worldwide. Several landmine detection algorithms have been proposed. Extensive testing of these methods has shown that the relative performance of different detectors can vary significantly depending on the mine type, geographical site, soil and weather conditions, burial depth, etc. Therefore, multi-algorithm and multi-sensor fusion is a critical component in landmine detection. Results on large and diverse real data collections show that the proposed method can identify meaningful and coherent clusters and that different expert algorithms can be identified for the different contexts. Our experiments have also indicated that the context-dependent fusion outperforms all individual detectors and several global fusion methods.
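
    The three CDF components can be sketched roughly as follows; here K-means stands in for the clustering-and-feature-discrimination algorithm, and the weights are plain normalized per-cluster accuracies, a simplification of the six fusion methods investigated in the thesis.
```python
# Illustrative sketch only: the three CDF stages on synthetic data.
# K-means approximates the context-extraction step, and per-context
# weights are simple normalized accuracies; the thesis' actual
# algorithms are richer than this.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

detectors = [LogisticRegression(max_iter=1000).fit(X_tr, y_tr),
             GaussianNB().fit(X_tr, y_tr)]

# 1) Context extraction: partition the composite feature space.
contexts = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_tr)

# 2) Per-context weights: each detector's accuracy inside each cluster,
#    normalized so the weights within a context sum to one.
labels_tr = contexts.labels_
W = np.zeros((4, len(detectors)))
for c in range(4):
    idx = labels_tr == c
    for d, det in enumerate(detectors):
        W[c, d] = det.score(X_tr[idx], y_tr[idx])
W /= W.sum(axis=1, keepdims=True)

# 3) Fusion: weight each detector's probability by its context weight.
c_te = contexts.predict(X_te)
probs = np.stack([det.predict_proba(X_te)[:, 1] for det in detectors], axis=1)
fused = (probs * W[c_te]).sum(axis=1)
print("fused accuracy:", ((fused > 0.5) == y_te).mean())
```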

    Data Mining

    The availability of big data due to computerization and automation has generated an urgent need for new techniques to analyze and convert big data into useful information and knowledge. Data mining is a promising and leading-edge technology for mining large volumes of data, looking for hidden information, and aiding knowledge discovery. It can be used for characterization, classification, discrimination, anomaly detection, association, clustering, trend or evolution prediction, and much more in fields such as science, medicine, economics, engineering, computing, and even business analytics. This book presents basic concepts, ideas, and research in data mining.
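
    As a small illustration of two of the tasks listed above, clustering and anomaly detection, here is a minimal sketch assuming scikit-learn and toy data:
```python
# Illustrative sketch only: clustering and anomaly detection, two of the
# data-mining tasks mentioned above, applied to a synthetic dataset.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
outliers = IsolationForest(random_state=0).fit_predict(X)  # -1 marks anomalies

print("cluster sizes:", [int((clusters == c).sum()) for c in range(3)])
print("anomalies flagged:", int((outliers == -1).sum()))
```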

    The IPAC Image Subtraction and Discovery Pipeline for the intermediate Palomar Transient Factory

    We describe the near real-time transient-source discovery engine for the intermediate Palomar Transient Factory (iPTF), currently in operations at the Infrared Processing and Analysis Center (IPAC), Caltech. We coin this system the IPAC/iPTF Discovery Engine (or IDE). We review the algorithms used for PSF-matching, image subtraction, detection, photometry, and machine-learned (ML) vetting of extracted transient candidates. We also review the performance of our ML classifier. For a limiting signal-to-noise ratio of 4 in relatively unconfused regions, "bogus" candidates from processing artifacts and imperfect image subtractions outnumber real transients by ~ 10:1. This ratio can be considerably higher for image data with inaccurate astrometric and/or PSF-matching solutions. Despite this occasionally high contamination rate, the ML classifier is able to identify real transients with an efficiency (or completeness) of ~ 97% for a maximum tolerable false-positive rate of 1% when classifying raw candidates. All subtraction-image metrics, source features, ML probability-based real-bogus scores, contextual metadata from other surveys, and possible associations with known Solar System objects are stored in a relational database for retrieval by the various science working groups. We review our efforts in mitigating false positives and our experience in optimizing the overall system in response to the multitude of science projects underway with iPTF. Comment: 66 pages, 21 figures, 7 tables; accepted by PASP.
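
    The quoted operating point (completeness at a fixed false-positive rate) can be illustrated with a toy threshold scan; the score distributions below are synthetic, not the pipeline's actual real-bogus scores.
```python
# Illustrative sketch only: choosing a real-bogus score threshold that
# caps the false-positive rate at 1% and reading off the completeness,
# mirroring the operating point quoted above. Scores are synthetic.
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical ML scores: real transients cluster near 1, bogus near 0.
real = np.clip(rng.normal(0.85, 0.10, 1000), 0, 1)
bogus = np.clip(rng.normal(0.15, 0.12, 10000), 0, 1)  # ~10:1 contamination

# Lowest threshold whose bogus pass-rate stays under the 1% budget.
thresholds = np.linspace(0, 1, 1001)
fpr = np.array([(bogus >= t).mean() for t in thresholds])
t_star = thresholds[np.argmax(fpr <= 0.01)]

completeness = (real >= t_star).mean()
print(f"threshold={t_star:.3f}  FPR={(bogus >= t_star).mean():.3%}  "
      f"completeness={completeness:.1%}")
```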

    Multi-label classification models for heterogeneous data: an ensemble-based approach

    In recent years, the multi-label classification task has gained the attention of the scientific community given its ability to solve real-world problems where each instance of the dataset may be associated with several class labels simultaneously. For example, in medical problems each patient may be affected by several diseases at the same time, and in multimedia categorization problems each item might be related to different tags or topics. Thus, given the nature of these problems, dealing with them as traditional classification problems, where just one class label is assigned to each instance, would lead to a loss of information. However, having more than one label associated with each instance raises new classification challenges that must be addressed, such as modeling the compound dependencies among labels, the imbalance of the label space, and the high dimensionality of the output space. A large number of methods for multi-label classification have been proposed in the literature, including several ensemble-based methods. Ensemble learning is a technique based on combining the outputs of many diverse base models in order to outperform each of the separate members. In multi-label classification, ensemble methods are those that combine the predictions of several multi-label classifiers, and these methods have been shown to outperform simpler multi-label classifiers. Therefore, given their strong performance, we focused our research on the study of ensemble-based methods for multi-label classification. The first objective of this dissertation is to perform a thorough review of the state-of-the-art ensembles of multi-label classifiers. Its aim is twofold: I) to study the different ensembles of multi-label classifiers proposed in the literature and categorize them according to their characteristics, proposing a novel taxonomy; and II) to perform an experimental study to find the method or family of methods that performs best depending on the characteristics of the data, and then provide some guidelines for selecting the best method according to the characteristics of a given problem. Since most ensemble methods for multi-label classification are based on creating diverse members by randomly selecting instances, input features, or labels, our second and main objective is to propose novel ensemble methods for multi-label classification in which the characteristics of the data are taken into account. For this purpose, we first propose an evolutionary algorithm able to build an ensemble of multi-label classifiers, where each individual of the population is an entire ensemble. This approach is able to model the relationships among the labels with relatively low complexity and imbalance of the output space, also considering these characteristics to guide the learning process. Furthermore, it looks for an optimal structure of the ensemble, considering not only its predictive performance but also the number of times that each label appears in it. In this way, all labels are expected to appear a similar number of times in the ensemble, so that none of them is neglected regardless of its frequency. Then, we develop a second evolutionary algorithm able to build ensembles of multi-label classifiers, but in this case each individual of the population is a hypothetical member of the ensemble, not the entire ensemble. Evolving the members of the ensemble separately makes the algorithm less computationally complex and able to determine the quality of each member separately. However, a method to select the ensemble members needs to be defined. This process selects classifiers that are both accurate and diverse to form the ensemble, while also ensuring that all labels appear a similar number of times in the final ensemble. In all experimental studies, the methods are compared using rigorous experimental setups and statistical tests over many evaluation metrics and reference datasets in multi-label classification. The experiments confirm that the proposed methods obtain significantly better and more consistent performance than the state-of-the-art methods in multi-label classification. Furthermore, the second proposal proves to be more efficient than the first, given its use of separate classifiers as individuals.
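
    A toy sketch of the first evolutionary approach might look as follows; the encoding (k members, each trained on a fixed-size label subset), the fitness with its label-balance penalty, and the operators are simplified assumptions for illustration, not the dissertation's actual algorithm.
```python
# Illustrative sketch only: a toy evolutionary search where an individual
# is a whole ensemble of k label-subset classifiers, and fitness combines
# validation micro-F1 with a label-balance penalty so no label is
# neglected. All design choices here are simplified assumptions.
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

X, Y = make_multilabel_classification(n_samples=600, n_labels=3, n_classes=8,
                                      random_state=0)
X_tr, X_va, Y_tr, Y_va = train_test_split(X, Y, random_state=0)
n_labels, k, subset = Y.shape[1], 6, 3   # ensemble size k, labels per member
rng = np.random.default_rng(0)

def random_individual():
    # An individual = k members, each a sorted subset of 3 label indices.
    return [tuple(sorted(rng.choice(n_labels, subset, replace=False)))
            for _ in range(k)]

def fitness(ind):
    votes = np.zeros_like(Y_va, dtype=float)
    counts = np.zeros(n_labels)
    for labels in ind:
        idx = list(labels)
        clf = DecisionTreeClassifier(random_state=0).fit(X_tr, Y_tr[:, idx])
        votes[:, idx] += clf.predict(X_va)
        counts[idx] += 1
    pred = (votes / np.maximum(counts, 1)) >= 0.5   # majority vote per label
    balance_penalty = counts.std() / max(counts.mean(), 1)  # uneven coverage
    return f1_score(Y_va, pred, average="micro") - 0.1 * balance_penalty

def mutate(ind):
    child = list(ind)
    child[rng.integers(k)] = random_individual()[0]  # replace one member
    return child

pop = [random_individual() for _ in range(10)]
for gen in range(10):
    pop.sort(key=fitness, reverse=True)
    pop = pop[:5] + [mutate(p) for p in pop[:5]]     # elitism + mutation
print("best fitness:", fitness(pop[0]))
```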

    Multi-label classification models for heterogeneous data: an ensemble-based approach.

    In recent years, the multi-label classification task has gained the attention of the scientific community given its ability to solve real-world problems where each instance of the dataset may be associated with several class labels simultaneously, such as multimedia categorization or medical problems. The first objective of this dissertation is to perform a thorough review of the state-of-the-art ensembles of multi-label classifiers (EMLCs). Its aim is twofold: 1) study state-of-the-art ensembles of multi-label classifiers and categorize them, proposing a novel taxonomy; and 2) perform an experimental study to provide tips and guidelines for selecting the method that performs best according to the characteristics of a given problem. Since most EMLCs are based on creating diverse members by randomly selecting instances, input features, or labels, our main objective is to propose novel ensemble methods that consider the characteristics of the data. In this thesis, we propose two evolutionary algorithms to build EMLCs. The first proposal encodes an entire EMLC in each individual, where each member is focused on a small subset of the labels. The second algorithm encodes a separate member in each individual, then combines individuals from the population to build the ensemble. Finally, both methods are demonstrated to be more consistent and to perform significantly better than state-of-the-art methods in multi-label classification.

    Dutkat: A Privacy-Preserving System for Automatic Catch Documentation and Illegal Activity Detection in the Fishing Industry

    United Nations' Sustainable Development Goal 14 aims to conserve and sustainably use the oceans and their resources for the benefit of people and the planet. This includes protecting marine ecosystems, preventing pollution and overfishing, and increasing scientific understanding of the oceans. Achieving this goal will help ensure the health and well-being of marine life and of the millions of people who rely on the oceans for their livelihoods. In order to ensure sustainable fishing practices, it is important to have a system in place for automatic catch documentation. This thesis presents our research on the design and development of Dutkat, a privacy-preserving, edge-based system for catch documentation and detection of illegal activities in the fishing industry. Utilising machine learning techniques, Dutkat can analyse large amounts of data and identify patterns that may indicate illegal activities such as overfishing or illegal discarding of catch. Additionally, the system can assist in catch documentation by automating the process of identifying and counting fish species, thus reducing potential human error and increasing efficiency. Specifically, our research has consisted of the development of various components of the Dutkat system, evaluation through experimentation, exploration of existing data, and organisation of machine learning competitions. We have also designed the system from a compliance-by-design perspective to ensure that it complies with data protection laws and regulations such as the GDPR. Our goal with Dutkat is to promote sustainable fishing practices, in line with Sustainable Development Goal 14, while simultaneously protecting the privacy and rights of fishing crews.

    Essentials of Business Analytics
