
    Anomaly detection for machine learning redshifts applied to SDSS galaxies

    We present an analysis of anomaly detection for machine learning redshift estimation. Anomaly detection allows the removal of poor training examples, which can adversely influence redshift estimates. Anomalous training examples may be photometric galaxies with incorrect spectroscopic redshifts, or galaxies with one or more poorly measured photometric quantities. We select 2.5 million 'clean' SDSS DR12 galaxies with reliable spectroscopic redshifts, and 6730 'anomalous' galaxies whose spectroscopic redshift measurements are flagged as unreliable. We contaminate the clean base galaxy sample with galaxies with unreliable redshifts and attempt to recover the contaminating galaxies using the Elliptical Envelope technique. We then train four machine learning architectures for redshift analysis on both the contaminated sample and on the preprocessed 'anomaly-removed' sample, and measure redshift statistics on a clean validation sample generated without any preprocessing. We find an improvement of up to 80% on all measured statistics when training on the anomaly-removed sample as compared with training on the contaminated sample, for each of the machine learning routines explored. We further describe a method to estimate the contamination fraction of a base data sample. Comment: 13 pages, 8 figures, 1 table; minor text updates to match the MNRAS accepted version.
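    As a rough illustration of the preprocessing step this abstract describes, the sketch below flags and removes anomalous training examples with scikit-learn's EllipticEnvelope. The synthetic feature matrix and the contamination fraction are placeholder assumptions, not the paper's photometric data.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

# Synthetic stand-in for photometric quantities (e.g. magnitudes/colours):
# a clean bulk sample plus a small cluster of badly measured examples.
rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(1000, 5))      # well-measured galaxies
outliers = rng.normal(6.0, 1.0, size=(30, 5))     # poorly measured galaxies
X = np.vstack([clean, outliers])

# Fit a robust elliptical envelope; the contamination fraction (3%) is an
# assumed value one would estimate from the data in practice.
detector = EllipticEnvelope(contamination=0.03, random_state=0)
labels = detector.fit_predict(X)                  # +1 = inlier, -1 = anomaly

X_cleaned = X[labels == 1]                        # 'anomaly-removed' sample
```

A downstream redshift regressor would then be trained on `X_cleaned` rather than on the contaminated `X`.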

    Methods for fast and reliable clustering


    A case study: Failure prediction in a real LTE network

    Mobile traffic and the number of connected devices have been growing exponentially, while customer expectations of mobile operators in terms of quality and reliability keep rising. This puts pressure on operators to invest in and operate their growing infrastructures, making telecom network management an essential problem. To reduce cost and maintain network performance, operators need to bring more automation and intelligence into their management systems. The Self-Organizing Network (SON) function is an automation technology that aims to maximize performance in mobility networks by providing autonomous adaptability and reducing human intervention in network management and operations. The three main areas of SON are self-configuration (automatic configuration when new elements enter the network), self-optimization (tuning of network parameters during operation) and self-healing (maintenance). The main purpose of this thesis is to illustrate how anomaly detection methods can be applied to SON functions, in particular self-healing functions such as fault detection and cell outage management. The thesis is illustrated by a case study in which the anomalies, in this case failure alarms, are predicted in advance using performance measurement (PM) data collected from a real LTE network within a certain timeframe. Failure prediction, or anomaly detection, can help reduce cost and maintenance time at mobile network base stations. The author aims to answer the research questions: which anomaly detection models could detect the anomalies in advance, and which types of anomalies can be well detected using those models. Using cross-validation, the thesis shows that the random forest method is the best-performing model among those chosen, with F1-scores of 0.58, 0.96 and 0.52 for the anomalies Failure in Optical Interface, Temperature alarm, and VSWR minor alarm, respectively. These are also the anomalies that the model detects well.
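    The evaluation pipeline the abstract describes (a random forest scored by cross-validated F1) can be sketched as follows. The synthetic PM counters and the alarm label are assumptions standing in for the thesis's real per-cell KPI data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for PM counters in a time window before an alarm.
rng = np.random.default_rng(1)
X = rng.normal(size=(600, 10))
# Assumed label rule: the alarm depends on the first two counters plus noise.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=600) > 1).astype(int)

# Cross-validated F1, mirroring how the thesis compares candidate models.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=5, scoring="f1")
mean_f1 = scores.mean()
```

F1 is a sensible choice here because alarm classes are imbalanced, so plain accuracy would overstate performance.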

    Ant colony optimization approach for stacking configurations

    In data mining, classifiers are generated to predict the class labels of instances. An ensemble is a decision-making system which applies certain strategies to combine the predictions of different classifiers and generate a collective decision. Previous research has empirically and theoretically demonstrated that an ensemble classifier can be more accurate and stable than its component classifiers in most cases. Stacking is a well-known ensemble which adopts a two-level structure: the base-level classifiers generate predictions and the meta-level classifier makes collective decisions. A consequential problem is: what learning algorithms should be used to generate the base-level and meta-level classifiers in the Stacking configuration? It is not easy to find a suitable configuration for a specific dataset. In some early works, the selection of a meta classifier and its training data was the main concern. Recently, researchers have tried to apply metaheuristic methods to optimize the configuration of the base classifiers and the meta classifier. Ant Colony Optimization (ACO), which is inspired by the foraging behaviors of real ant colonies, is one of the most popular metaheuristic approaches. In this work, we propose a novel ACO-Stacking approach that uses ACO to tackle the Stacking configuration problem; this work is the first to apply ACO to this problem. Different implementations of the ACO-Stacking approach are developed. The first version identifies the appropriate learning algorithms for generating the base-level classifiers while using a specific algorithm to create the meta-level classifier. The second version simultaneously finds suitable learning algorithms for creating both the base-level classifiers and the meta-level classifier. Moreover, we study how different kinds of local information about the classifiers affect the classification results. Several pieces of local information collected from the initial phase of ACO-Stacking are considered, such as the precision and F-measure of each classifier and the correlative differences of paired classifiers. A series of experiments is performed to compare the ACO-Stacking approach with other ensembles on a number of datasets of different domains and sizes. The experiments show that the new approach can achieve promising results and gain advantages over other ensembles. The correlative differences of the classifiers could be the best local information in this approach. Under the agile ACO-Stacking framework, an application to a direct marketing problem is explored. A real-world database from a US-based catalog company, containing more than 100,000 customer marketing records, is used in the experiments. The results indicate that our approach can gain larger cumulative response lifts and cumulative profit lifts in the top deciles. In conclusion, it is competitive with some well-known conventional and ensemble data mining methods.
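    The two-level Stacking structure described above (base-level classifiers feeding a meta-level classifier) can be sketched with scikit-learn's StackingClassifier. The particular learners and dataset below are placeholders, not the configurations found by ACO-Stacking.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Base-level classifiers produce predictions; the meta-level classifier
# combines them into a collective decision. ACO-Stacking searches over
# which algorithms fill these two roles; here they are fixed by hand.
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(random_state=0)),
                ("nb", GaussianNB())],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X_tr, y_tr)
accuracy = stack.score(X_te, y_te)
```

The configuration search problem the abstract tackles is exactly the choice of the `estimators` list and `final_estimator` for a given dataset.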

    Development of a modular Knowledge-Discovery Framework based on Machine Learning for the interdisciplinary analysis of complex phenomena in the context of GDI combustion processes

    The physical and chemical phenomena before, during, and after combustion in gasoline direct-injection (GDI) engines are complex and involve diverse interactions between liquids, gases, and the surrounding combustion-chamber wall. In recent years, various simulation tools and measurement techniques have been developed to evaluate and optimize the components involved in the combustion processes. The ability to explore the entire design space, however, is limited by the high effort required to generate and analyze the nonlinear, multidimensional results. The goal of this work is the development and validation of a data-analysis tool for knowledge discovery. Within this work, both the overall process and the tool itself are referred to as the "Knowledge-Discovery Framework". This tool is intended to analyze the data generated in the GDI context using machine learning methods. From a limited number of observations, it makes it possible to explore the design spaces under investigation and to discover relationships in observations of the complex phenomena more quickly. Expensive and time-consuming evaluations can thus be replaced by fast and accurate predictions. After introducing the key data characteristics in the field of GDI applications, the framework is presented and its modular and interdisciplinary properties are described. The core of the framework is a parameter-free, fast, and dynamic data-driven model selection for the heterogeneous datasets typical of GDI. The potential of this approach is demonstrated in the analysis of numerical and experimental investigations of nozzles and engines. In particular, the nonlinear influences of the design parameters on inflow and spray behavior as well as on emissions are extracted from the data. In addition, new designs that can satisfy predefined targets and performance requirements are identified based on machine-learning predictions. The extracted knowledge is finally validated against domain expertise, revealing both the potential and the limitations of this novel approach.
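    The data-driven model selection at the core of the framework (score several candidate learners on the data, keep the best) can be sketched roughly as below. The candidate list, synthetic design-space target, and cross-validation scheme are illustrative assumptions, not the framework's actual implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in: a nonlinear response over a few design parameters,
# as might come from nozzle or engine simulations.
rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(300, 4))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.05 * rng.normal(size=300)

# Candidate models of increasing flexibility; selection is purely data-based.
candidates = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "knn": KNeighborsRegressor(n_neighbors=5),
}
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in candidates.items()}
best = max(scores, key=scores.get)
```

The selected model can then replace expensive evaluations with fast predictions over the explored design space, as the abstract describes.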

    Stratiform and convective rain classification using machine learning models and micro rain radar

    Rain type classification into convective and stratiform is an essential step required to improve quantitative precipitation estimation by remote sensing instruments. Previous studies with Micro Rain Radar (MRR) measurements and subjective rules have been performed to classify rain events. However, automating this process with machine learning (ML) models provides the advantage of fast and reliable classification, with the possibility to classify rain minute by minute. A total of 20,979 min of rain data measured by an MRR at Das in northeast Spain were used to build seven types of ML models for stratiform and convective rain type classification. The proposed classification models use a set of 22 parameters that summarize the reflectivity, the Doppler velocity, and the spectral width (SW) above and below the so-called separation level (SL). This level is defined as the level with the highest increase in Doppler velocity and corresponds with the bright band in stratiform rain. A pre-classification of the rain type for each minute, based on the rain microstructure provided by the collocated disdrometer, was performed. Our results indicate that complex ML models, particularly tree-based ensembles such as xgboost and random forest which capture the interactions of different features, perform better than simpler models. Applying methods from the field of interpretable ML, we identified the reflectivity at the lowest layer and the average spectral width in the layers below the SL as the most important features. High reflectivity and low SW values indicate a higher probability of convective rain.
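    The combination of a tree-based ensemble with feature-importance inspection that the abstract describes can be sketched as follows. The two named features mirror the quantities the study highlights (lowest-layer reflectivity, mean SW below the SL), but all values and the labeling rule are synthetic assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
n = 500
refl_lowest = rng.normal(25, 10, n)    # reflectivity at lowest layer (dBZ, synthetic)
sw_below_sl = rng.normal(0.5, 0.2, n)  # mean spectral width below SL (m/s, synthetic)
noise = rng.normal(size=(n, 3))        # uninformative filler features

# Assumed rule echoing the abstract's finding: high reflectivity combined
# with low spectral width marks a minute as convective (label 1).
y = ((refl_lowest > 30) & (sw_below_sl < 0.5)).astype(int)
X = np.column_stack([refl_lowest, sw_below_sl, noise])

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
# Rank features by impurity-based importance, most important first.
ranking = np.argsort(clf.feature_importances_)[::-1]
```

In a study like this one, a model-agnostic method such as permutation importance on held-out data would be a more robust way to identify the key features.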

    A Gas Giant Circumbinary Planet Transiting the F Star Primary of the Eclipsing Binary Star KIC 4862625 and the Independent Discovery and Characterization of the two transiting planets in the Kepler-47 System

    We report the discovery of a transiting, gas giant circumbinary planet orbiting the eclipsing binary KIC 4862625 and describe our independent discovery of the two transiting planets orbiting Kepler-47 (Orosz et al. 2012). We describe a simple and semi-automated procedure for identifying individual transits in light curves and present our follow-up measurements of the two circumbinary systems. For the KIC 4862625 system, the 0.52+/-0.018 RJup radius planet revolves every ~138 days and occults the 1.47+/-0.08 MSun, 1.7+/-0.06 RSun F8 IV primary star, producing aperiodic transits of variable durations commensurate with the configuration of the eclipsing binary star. Our best-fit model indicates the orbit has a semi-major axis of 0.64 AU and is slightly eccentric, e=0.1. For the Kepler-47 system, we confirm the results of Orosz et al. (2012). Modulations in the radial velocity of KIC 4862625A are measured both spectroscopically and photometrically, i.e. via Doppler boosting, and produce similar results. Comment: 40 pages, 17 figures.