95 research outputs found

    Comprehensible and Robust Knowledge Discovery from Small Datasets

    Get PDF
    Knowledge Discovery in Databases (KDD) aims to extract useful knowledge from data. The data may represent a set of measurements from a real-world process or a set of input-output values of a simulation model. Two frequently conflicting requirements on the acquired knowledge are that it (1) summarizes the data as accurately as possible and (2) is available in a well-comprehensible form. Decision trees and subgroup discovery methods provide knowledge summaries in the form of hyper-rectangles, which are considered easy to understand. To demonstrate the importance of a comprehensible data summary, we study Decentralized Smart Grid Control, a recent system that implements demand response in power grids without major changes to the infrastructure. The conventional analysis of this system carried out so far was limited to identical participants and therefore did not reflect reality sufficiently well. We run many simulations with different input values and apply decision trees to the resulting data. With the resulting comprehensible data summaries, we gain new insights into the behavior of Decentralized Smart Grid Control. Decision trees allow describing the system behavior for all input combinations. Sometimes, however, one is not interested in partitioning the entire input space, but rather in finding regions that lead to a certain output (so-called subgroups). Existing subgroup discovery algorithms usually require large amounts of data to produce stable and accurate output, yet the data acquisition process is often costly. Our main contribution is improving subgroup discovery from datasets with few observations. Subgroup discovery in simulated data is known as scenario discovery. A commonly used algorithm for scenario discovery is PRIM (Patient Rule Induction Method). We propose REDS (Rule Extraction for Discovering Scenarios), a new procedure for scenario discovery. For REDS, we first train an intermediate statistical model and use it to generate a large amount of new data for PRIM. We also describe the underlying statistical intuition. Experiments show that REDS performs much better than PRIM alone: it reduces the number of required simulation runs by 75% on average. With simulated data, one has perfect knowledge of the input distribution, which REDS requires. To make REDS applicable to real-world measurement data, we combine it with sampling from an estimated multivariate distribution of the data. We evaluate the resulting method experimentally in combination with different data-generation methods. We do this for PRIM and for BestInterval, another representative subgroup discovery method. In most cases, our methodology increases the quality of the discovered subgroups.
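
    As a rough illustration of the first step described above, the sketch below fits a shallow decision tree to a handful of hypothetical simulation runs so that each leaf corresponds to a human-readable hyper-rectangle over the inputs. The column names, values, and binary stability label are made up for illustration and are not taken from the thesis.

```python
# Hedged sketch: a shallow decision tree as a comprehensible summary of
# simulation input-output data. Each leaf is an axis-parallel hyper-rectangle.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

runs = pd.DataFrame({                      # one row per (hypothetical) simulation run
    "coupling": [0.1, 0.4, 0.7, 0.9, 0.2, 0.8],
    "delay":    [1.0, 2.5, 0.5, 3.0, 1.5, 2.0],
    "stable":   [1,   1,   0,   0,   1,   0],   # simulated outcome (binary)
})

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(runs[["coupling", "delay"]], runs["stable"])

# Print the axis-parallel rules that partition the input space.
print(export_text(tree, feature_names=["coupling", "delay"]))
```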

    Scenario Discovery via Rule Extraction

    Full text link
    Scenario discovery is the process of finding areas of interest, commonly referred to as scenarios, in data spaces resulting from simulations. For instance, one might search for conditions, i.e., inputs of the simulation model, where the system under investigation is unstable. A commonly used algorithm for scenario discovery is PRIM. It yields scenarios in the form of hyper-rectangles, which are human-comprehensible. When the simulation model has many inputs, and the simulations are computationally expensive, PRIM may not produce good results given the affordable volume of data. So we propose a new procedure for scenario discovery: we train an intermediate statistical model which generalizes fast, and use it to label (a lot of) data for PRIM. We provide the statistical intuition behind our idea. Our experimental study shows that this method is much better than PRIM itself. Specifically, our method reduces the number of simulation runs necessary by 75% on average.
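
    The sketch below illustrates the surrogate-labeling idea from the abstract under simple assumptions; it is not the authors' REDS code. A random forest stands in for the intermediate statistical model, the inputs are assumed numeric, the target binary, and `sample_inputs` is a hypothetical function drawing from the known input distribution of the simulation model.

```python
# Minimal sketch of the surrogate-labeling idea (not the REDS implementation).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def label_data_for_prim(X_sim, y_sim, sample_inputs, n_new=100_000, seed=0):
    """Train an intermediate model on few expensive simulation runs and use it
    to label a large, cheaply generated dataset for a scenario-discovery
    algorithm such as PRIM."""
    rng = np.random.default_rng(seed)
    surrogate = RandomForestClassifier(n_estimators=200, random_state=seed)
    surrogate.fit(X_sim, y_sim)              # fit on the available simulation runs
    X_large = sample_inputs(n_new, rng)      # draw many inputs from the known distribution
    y_large = surrogate.predict(X_large)     # cheap labels from the surrogate
    return X_large, y_large                  # pass these to PRIM instead of (X_sim, y_sim)
```

    The labeled pairs returned here would then be handed to any PRIM implementation in place of the raw simulation data.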

    A benchmark of categorical encoders for binary classification

    Full text link
    Categorical encoders transform categorical features into numerical representations that are indispensable for a wide range of machine learning models. Existing encoder benchmark studies lack generalizability because of their limited choice of (1) encoders, (2) experimental factors, and (3) datasets. Additionally, inconsistencies arise from the adoption of varying aggregation strategies. This paper is the most comprehensive benchmark of categorical encoders to date, including an extensive evaluation of 32 configurations of encoders from diverse families, with 36 combinations of experimental factors, and on 50 datasets. The study shows the profound influence of dataset selection, experimental factors, and aggregation strategies on the benchmark's conclusions -- aspects disregarded in previous encoder benchmarks. Comment: To be published in the 37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks.
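
    For context, the toy example below shows two common encoder families (one-hot and target/mean encoding) applied to a made-up column; it is illustrative only and not part of the benchmark itself.

```python
# Illustration of categorical encoding (not the benchmark code).
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red", "green"],
                   "y":     [1,     0,      1,     0]})

# One-hot encoding: one indicator column per category.
onehot = pd.get_dummies(df["color"], prefix="color")

# Target (mean) encoding: replace each category by the mean target value.
df["color_target_enc"] = df["color"].map(df.groupby("color")["y"].mean())

print(pd.concat([df, onehot], axis=1))
```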

    Budgeted Multi-Armed Bandits with Asymmetric Confidence Intervals

    Full text link
    We study the stochastic Budgeted Multi-Armed Bandit (MAB) problem, where a player chooses from K arms with unknown expected rewards and costs. The goal is to maximize the total reward under a budget constraint. A player thus seeks to choose the arm with the highest reward-cost ratio as often as possible. Current state-of-the-art policies for this problem have several issues, which we illustrate. To overcome them, we propose a new upper confidence bound (UCB) sampling policy, ω-UCB, that uses asymmetric confidence intervals. These intervals scale with the distance between the sample mean and the bounds of a random variable, yielding a tighter and more accurate estimation of the reward-cost ratio compared to our competitors. We show that our approach has logarithmic regret and consistently outperforms existing policies in synthetic and real settings.
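
    The sketch below shows a generic UCB-of-ratio policy for this setting. The symmetric Hoeffding-style bonus is only a placeholder assumption; the paper's ω-UCB uses asymmetric confidence intervals whose exact form is not reproduced here.

```python
# Generic sketch of a UCB-style policy for the budgeted MAB setting
# (placeholder bonus, not the paper's omega-UCB intervals).
import math

def choose_arm(counts, reward_sums, cost_sums, t, eps=1e-9):
    """Pick the arm with the highest optimistic estimate of the reward-cost ratio."""
    best, best_score = None, -math.inf
    for k in range(len(counts)):
        if counts[k] == 0:
            return k                                     # play each arm once first
        bonus = math.sqrt(2 * math.log(t) / counts[k])   # symmetric exploration bonus
        r_hat = reward_sums[k] / counts[k]               # mean reward estimate
        c_hat = cost_sums[k] / counts[k]                 # mean cost estimate
        score = (r_hat + bonus) / max(c_hat - bonus, eps)  # optimistic ratio
        if score > best_score:
            best, best_score = k, score
    return best
```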

    Adaptive Bernstein Change Detector for High-Dimensional Data Streams

    Full text link
    Change detection is of fundamental importance when analyzing data streams. Detecting changes both quickly and accurately enables monitoring and prediction systems to react, e.g., by issuing an alarm or by updating a learning algorithm. However, detecting changes is challenging when observations are high-dimensional. In high-dimensional data, change detectors should not only be able to identify when changes happen, but also in which subspace they occur. Ideally, one should also quantify how severe they are. Our approach, ABCD, has these properties. ABCD learns an encoder-decoder model and monitors its accuracy over a window of adaptive size. ABCD derives a change score based on Bernstein's inequality to detect deviations in terms of accuracy, which indicate changes. Our experiments demonstrate that ABCD outperforms its best competitor by at least 8% and up to 23% in F1-score on average. It can also accurately estimate changes' subspace, together with a severity measure that correlates with the ground truth
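
    The simplified sketch below illustrates the monitoring idea under stated assumptions: reconstruction errors bounded by `b`, a fixed candidate split point, and a Bernstein-style deviation bound. It is not the ABCD implementation, which uses an adaptive window and additionally localizes changes in subspaces and quantifies their severity.

```python
# Simplified sketch of window-based change detection on reconstruction errors.
import math
import numpy as np

def bernstein_bound(var, n, b, delta=0.05):
    """Deviation bound for the mean of n samples with variance `var`, bounded by `b`."""
    return math.sqrt(2 * var * math.log(2 / delta) / n) + 2 * b * math.log(2 / delta) / (3 * n)

def change_detected(errors, split, b=1.0, delta=0.05):
    """Flag a change if the mean error after `split` deviates from the mean before it
    by more than the combined Bernstein-style bounds."""
    old, new = np.asarray(errors[:split]), np.asarray(errors[split:])
    eps = (bernstein_bound(old.var(), len(old), b, delta)
           + bernstein_bound(new.var(), len(new), b, delta))
    return abs(new.mean() - old.mean()) > eps
```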

    Comparative genomics reveals the regulatory complexity of Bifidobacterial arabinose and Arabino-oligosaccharide utilization

    Get PDF
    Members of the genus Bifidobacterium are common inhabitants of the human gastrointestinal tract. Previously it was shown that arabino-oligosaccharides (AOS) might act as prebiotics and stimulate the bifidobacterial growth in the gut. However, despite the rapid accumulation of genomic data, the precise mechanisms by which these sugars are utilized and associated transcription control still remain unclear. In the current study, we used a comparative genomic approach to reconstruct arabinose and AOS utilization pathways in over 40 bacterial species belonging to the Bifidobacteriaceae family. The results indicate that the gene repertoire involved in the catabolism of these sugars is highly diverse, and even phylogenetically close species may differ in their utilization capabilities. Using bioinformatics analysis we identified potential DNA-binding motifs and reconstructed putative regulons for the arabinose and AOS utilization genes in the Bifidobacteriaceae genomes. Six LacI-family transcriptional factors (named AbfR, AauR, AauU1, AauU2, BauR1 and BauR2) and a TetR-family regulator (XsaR) presumably act as local repressors for AOS utilization genes encoding various alpha- or beta-L-arabinofuranosidases and predicted AOS transporters. The ROK-family regulator AraU and the LacI-family regulator AraQ control adjacent operons encoding putative arabinose transporters and catabolic enzymes, respectively. However, the AraQ regulator is universally present in all Bifidobacterium species including those lacking the arabinose catabolic genes araBDA, suggesting its control of other genes. Comparative genomic analyses of prospective AraQ-binding sites allowed the reconstruction of AraQ regulons and a proposed binary repression/activation mechanism. The conserved core of reconstructed AraQ regulons in bifidobacteria includes araBDA, as well as genes from the central glycolytic and fermentation pathways (pyk, eno, gap, tkt, tal, galM, ldh). The current study expands the range of genes involved in bifidobacterial arabinose/AOS utilization and demonstrates considerable variations in associated metabolic pathways and regulons. Detailed comparative and phylogenetic analyses allowed us to hypothesize how the identified reconstructed regulons evolved in bifidobacteria. Our findings may help to improve carbohydrate catabolic phenotype prediction and metabolic modeling, while it may also facilitate rational development of novel prebiotics

    Minimizing Bias in Estimation of Mutual Information from Data Streams

    Get PDF
    Mutual information is a measure for both linear and non-linear associations between variables. There exist several estimators of mutual information for static data. In the dynamic case, one needs to apply these estimators to samples of points from data streams. The sampling should be such that more detailed information on the recent past is available. We formulate a list of natural requirements an estimator of mutual information on data streams should fulfill, and we propose two approaches which do meet all of them. Finally, we compare our algorithms to an existing method both theoretically and experimentally. Our findings include that our approaches are faster and have lower bias and better memory complexity
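
    As a rough illustration of the setting, the sketch below combines one simple recency-biased sampling scheme with a plug-in histogram estimate of mutual information; the paper's proposed estimators and the requirements they satisfy are not reproduced here, and the parameter names are illustrative.

```python
# Illustrative sketch: recency-biased sampling plus a plug-in MI estimate.
import numpy as np

def recency_biased_sample(stream, size, p_replace=0.05, seed=0):
    """Reservoir-like sample that over-represents recent points of the stream."""
    rng = np.random.default_rng(seed)
    sample = []
    for point in stream:
        if len(sample) < size:
            sample.append(point)
        elif rng.random() < p_replace:          # newer points may replace older ones
            sample[rng.integers(size)] = point
    return np.array(sample)

def mutual_information(xy, bins=10):
    """Plug-in mutual information estimate (in nats) from a 2D histogram."""
    pxy, _, _ = np.histogram2d(xy[:, 0], xy[:, 1], bins=bins)
    pxy = pxy / pxy.sum()
    px, py = pxy.sum(axis=1, keepdims=True), pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))
```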

    The influence of nickel layer thickness on microhardness and hydrogen sorption rate of commercially pure titanium alloy

    Get PDF
    The influence of nickel coating thickness on the microhardness and hydrogen sorption rate of commercially pure titanium alloy was established in this work. Coating deposition was carried out by magnetron sputtering with prior ion cleaning of the surface. Increasing the sputtering time from 10 to 50 minutes increases the coating thickness from 0.56 to 3.78 μm. Increasing the nickel coating thickness increases the microhardness at loads below 0.5 kg, while the microhardness values of all samples do not differ significantly at a load of 1 kg. The hydrogen content in titanium alloy with a nickel layer deposited for 10 and 20 minutes exceeds the concentration in the initial samples by one order of magnitude. Further increasing the deposition time of the nickel coating decreases the hydrogen concentration in the samples due to coating delamination during hydrogenation.
