Comprehensible and Robust Knowledge Discovery from Small Datasets
Knowledge Discovery in Databases (KDD) aims to extract useful knowledge from data. The data may represent a set of measurements from a real-world process, or a set of input-output values of a simulation model. Two frequently conflicting requirements on the acquired knowledge are that it (1) summarizes the data as accurately as possible and (2) comes in a well-comprehensible form. Decision trees and subgroup discovery methods provide knowledge summaries in the form of hyper-rectangles, which are considered easy to understand.
To demonstrate the importance of a comprehensible data summary, we study Decentral Smart Grid Control, a new system that implements demand response in power grids without substantial changes to the infrastructure. The conventional analysis of this system conducted so far was limited to considering identical participants and therefore did not reflect reality sufficiently well. We run many simulations with varying input values and apply decision trees to the resulting data. The resulting comprehensible data summaries allowed us to gain new insights into the behavior of Decentral Smart Grid Control.
Decision trees describe the system behavior for all input combinations. Sometimes, however, one is not interested in partitioning the entire input space, but in finding regions that lead to certain outputs (so-called subgroups). Existing subgroup discovery algorithms usually require large amounts of data to produce stable and accurate output. However, the data collection process is often costly. Our main contribution is improving subgroup discovery from datasets with few observations.
Subgroup discovery in simulated data is called scenario discovery. A commonly used algorithm for scenario discovery is PRIM (Patient Rule Induction Method). We propose REDS (Rule Extraction for Discovering Scenarios), a new procedure for scenario discovery. For REDS, we first train an intermediate statistical model and use it to generate a large amount of new data for PRIM. We also describe the underlying statistical intuition. Experiments show that REDS performs much better than PRIM on its own: it reduces the number of required simulation runs by 75% on average.
With simulated data, one has perfect knowledge of the input distribution, which is a prerequisite of REDS. To make REDS applicable to real measurement data, we combined it with sampling from an estimated multivariate distribution of the data. We experimentally evaluated the resulting method in combination with different data generation methods, both for PRIM and for BestInterval, another representative subgroup discovery method. In most cases, our methodology increased the quality of the discovered subgroups.
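The sampling step for real measurement data can be sketched as follows. This is an illustrative sketch, not the thesis implementation: it assumes a multivariate Gaussian fits the observed inputs, whereas other density estimators could equally be plugged in, and the toy data is made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "real measurement" inputs (illustrative stand-in for observed data).
X_obs = rng.normal(loc=[1.0, -2.0], scale=[0.5, 1.5], size=(200, 2))

# Estimate a multivariate distribution from the observations ...
mean = X_obs.mean(axis=0)
cov = np.cov(X_obs, rowvar=False)

# ... and sample many new input points from it.
X_new = rng.multivariate_normal(mean, cov, size=5000)

# X_new can now be labeled by an intermediate model and passed to a
# subgroup discovery method such as PRIM or BestInterval.
print(X_new.shape)
```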
Scenario Discovery via Rule Extraction
Scenario discovery is the process of finding areas of interest, commonly
referred to as scenarios, in data spaces resulting from simulations. For
instance, one might search for conditions - which are inputs of the simulation
model - where the system under investigation is unstable. A commonly used
algorithm for scenario discovery is PRIM. It yields scenarios in the form of
hyper-rectangles which are human-comprehensible. When the simulation model has
many inputs, and the simulations are computationally expensive, PRIM may not
produce good results, given the affordable volume of data. So we propose a new
procedure for scenario discovery - we train an intermediate statistical model
which generalizes fast, and use it to label (a lot of) data for PRIM. We
provide the statistical intuition behind our idea. Our experimental study shows
that this method is much better than PRIM itself. Specifically, our method
reduces the number of simulation runs necessary by 75% on average.
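The core idea, train a cheap intermediate model on the few available simulation runs and let it label a large synthetic sample for PRIM, can be sketched as below. Everything here is illustrative: the toy simulator, the uniform input distribution, and the choice of a k-nearest-neighbor classifier as the intermediate model are assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(x):
    # Stand-in for an expensive simulation model: the output is
    # "interesting" when both inputs are large (toy ground truth).
    return (x[:, 0] > 0.6) & (x[:, 1] > 0.5)

# 1) Small budget of real simulation runs.
X_small = rng.uniform(size=(200, 2))
y_small = simulate(X_small)

# 2) Intermediate statistical model; here a simple k-nearest-neighbor
#    classifier (the paper may use a different model class).
def knn_predict(X_train, y_train, X_query, k=5):
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1) > 0.5

# 3) Sample many points from the known input distribution and label them
#    with the model instead of the simulator; this large labeled set is
#    what would then be handed to PRIM to find hyper-rectangular scenarios.
X_large = rng.uniform(size=(5000, 2))
y_large = knn_predict(X_small, y_small, X_large)
print(y_large.mean())
```

The point of the detour through the intermediate model is that PRIM's peeling steps become much more stable on 5000 labeled points than on the 200 affordable simulation runs.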
A benchmark of categorical encoders for binary classification
Categorical encoders transform categorical features into numerical
representations that are indispensable for a wide range of machine learning
models. Existing encoder benchmark studies lack generalizability because of
their limited choice of (1) encoders, (2) experimental factors, and (3)
datasets. Additionally, inconsistencies arise from the adoption of varying
aggregation strategies. This paper is the most comprehensive benchmark of
categorical encoders to date, including an extensive evaluation of 32
configurations of encoders from diverse families, with 36 combinations of
experimental factors, and on 50 datasets. The study shows the profound
influence of dataset selection, experimental factors, and aggregation
strategies on the benchmark's conclusions -- aspects disregarded in previous
encoder benchmarks.
Comment: To be published in the 37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks.
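To make the object of study concrete, here is a toy illustration of two encoder families such benchmarks compare: one-hot encoding and target (mean) encoding. The data and variable names are invented for the example and are not from the paper.

```python
import numpy as np

# A categorical feature and a binary target (toy data).
cats = np.array(["a", "b", "a", "c", "b", "a"])
y = np.array([1, 0, 1, 0, 1, 1])

# One-hot encoding: one binary column per category level.
levels = sorted(set(cats))
one_hot = np.array([[c == l for l in levels] for c in cats], dtype=float)

# Target encoding: replace each category with the mean target of that level.
means = {l: y[cats == l].mean() for l in levels}
target_enc = np.array([means[c] for c in cats])

print(one_hot.shape)  # (6, 3)
print(target_enc)
```

One-hot grows the feature space with the number of levels, while target encoding stays one-dimensional but leaks target information, which is exactly the kind of trade-off that makes the choice of encoder, experimental factors, and aggregation strategy matter in a benchmark.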
Budgeted Multi-Armed Bandits with Asymmetric Confidence Intervals
We study the stochastic Budgeted Multi-Armed Bandit (MAB) problem, where a
player chooses from arms with unknown expected rewards and costs. The goal
is to maximize the total reward under a budget constraint. A player thus seeks
to choose the arm with the highest reward-cost ratio as often as possible.
Current state-of-the-art policies for this problem have several issues, which
we illustrate. To overcome them, we propose a new upper confidence bound (UCB)
sampling policy, ω-UCB, that uses asymmetric confidence intervals. These
intervals scale with the distance between the sample mean and the bounds of a
random variable, yielding a more accurate and tight estimation of the
reward-cost ratio compared to our competitors. We show that our approach has
logarithmic regret and consistently outperforms existing policies in synthetic
and real settings.
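The general shape of a budgeted UCB policy can be sketched as follows. This is a generic sketch, not the paper's algorithm: it uses the classic symmetric confidence radius, which is precisely the term the paper's asymmetric intervals would replace, and the two toy arms are invented for the example.

```python
import math
import random

random.seed(0)

# Toy arms: (mean reward, mean cost), both in [0, 1].
arms = [(0.9, 0.8), (0.5, 0.3)]  # arm 1 has the better reward-cost ratio

def pull(i):
    mu_r, mu_c = arms[i]
    r = min(1.0, max(0.0, random.gauss(mu_r, 0.1)))
    c = min(1.0, max(0.05, random.gauss(mu_c, 0.1)))  # keep costs positive
    return r, c

budget = 200.0
n = [0] * len(arms)            # pull counts
reward_sum = [0.0] * len(arms)
cost_sum = [0.0] * len(arms)
t = 0

# Initialization: pull each arm once.
for i in range(len(arms)):
    r, c = pull(i)
    n[i] += 1; reward_sum[i] += r; cost_sum[i] += c
    budget -= c; t += 1

total_reward = sum(reward_sum)
while True:
    t += 1

    def index(i):
        # Optimistic reward over pessimistic cost; the symmetric radius
        # below is where asymmetric intervals would plug in instead.
        radius = math.sqrt(2 * math.log(t) / n[i])
        r_hat = reward_sum[i] / n[i]
        c_hat = cost_sum[i] / n[i]
        return min(1.0, r_hat + radius) / max(1e-9, c_hat - radius)

    i = max(range(len(arms)), key=index)
    r, c = pull(i)
    if c > budget:
        break
    budget -= c
    n[i] += 1; reward_sum[i] += r; cost_sum[i] += c
    total_reward += r
```

With this index, the policy quickly concentrates its pulls on the arm with the higher reward-cost ratio; tighter intervals make that concentration happen with less wasted budget.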
Adaptive Bernstein Change Detector for High-Dimensional Data Streams
Change detection is of fundamental importance when analyzing data streams.
Detecting changes both quickly and accurately enables monitoring and prediction
systems to react, e.g., by issuing an alarm or by updating a learning
algorithm. However, detecting changes is challenging when observations are
high-dimensional. In high-dimensional data, change detectors should not only be
able to identify when changes happen, but also in which subspace they occur.
Ideally, one should also quantify how severe they are. Our approach, ABCD, has
these properties. ABCD learns an encoder-decoder model and monitors its
accuracy over a window of adaptive size. ABCD derives a change score based on
Bernstein's inequality to detect deviations in terms of accuracy, which
indicate changes. Our experiments demonstrate that ABCD outperforms its best
competitor by at least 8% and up to 23% in F1-score on average. It can also
accurately estimate changes' subspace, together with a severity measure that
correlates with the ground truth.
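The Bernstein-based detection step can be illustrated in isolation. This is a simplified sketch, not the authors' implementation: it monitors a one-dimensional error stream with two fixed-size adjacent windows instead of ABCD's adaptive window over an encoder-decoder's accuracy, and the synthetic stream is an assumption.

```python
import math
import random
import statistics

random.seed(1)

def bernstein_radius(var, n, b=1.0, delta=0.05):
    # High-probability deviation of a mean of n samples bounded by b,
    # with variance `var`, from Bernstein's inequality.
    log_term = math.log(1.0 / delta)
    return math.sqrt(2.0 * var * log_term / n) + 2.0 * b * log_term / (3.0 * n)

def detect_change(errors, window=100):
    # Compare the means of two adjacent windows of the error stream and
    # report the first position where their gap is statistically large.
    for end in range(2 * window, len(errors) + 1):
        left = errors[end - 2 * window:end - window]
        right = errors[end - window:end]
        m1, m2 = statistics.fmean(left), statistics.fmean(right)
        var = statistics.pvariance(left + right)
        if abs(m1 - m2) > 2 * bernstein_radius(var, window):
            return end
    return None

# Synthetic error stream: the model's error jumps at position 300.
errors = [min(1.0, max(0.0, random.gauss(0.1, 0.02))) for _ in range(300)]
errors += [min(1.0, max(0.0, random.gauss(0.5, 0.02))) for _ in range(300)]
print(detect_change(errors))
```

Because the radius depends on the observed variance rather than only on the range, low-variance error streams get a tight threshold and changes are flagged soon after they enter the window.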
Comparative genomics reveals the regulatory complexity of Bifidobacterial arabinose and Arabino-oligosaccharide utilization
Members of the genus Bifidobacterium are common inhabitants of the human gastrointestinal tract. Previously it was shown that arabino-oligosaccharides (AOS) might act as prebiotics and stimulate the bifidobacterial growth in the gut. However, despite the rapid accumulation of genomic data, the precise mechanisms by which these sugars are utilized and associated transcription control still remain unclear. In the current study, we used a comparative genomic approach to reconstruct arabinose and AOS utilization pathways in over 40 bacterial species belonging to the Bifidobacteriaceae family. The results indicate that the gene repertoire involved in the catabolism of these sugars is highly diverse, and even phylogenetically close species may differ in their utilization capabilities. Using bioinformatics analysis we identified potential DNA-binding motifs and reconstructed putative regulons for the arabinose and AOS utilization genes in the Bifidobacteriaceae genomes. Six LacI-family transcriptional factors (named AbfR, AauR, AauU1, AauU2, BauR1 and BauR2) and a TetR-family regulator (XsaR) presumably act as local repressors for AOS utilization genes encoding various alpha- or beta-L-arabinofuranosidases and predicted AOS transporters. The ROK-family regulator AraU and the LacI-family regulator AraQ control adjacent operons encoding putative arabinose transporters and catabolic enzymes, respectively. However, the AraQ regulator is universally present in all Bifidobacterium species including those lacking the arabinose catabolic genes araBDA, suggesting its control of other genes. Comparative genomic analyses of prospective AraQ-binding sites allowed the reconstruction of AraQ regulons and a proposed binary repression/activation mechanism. The conserved core of reconstructed AraQ regulons in bifidobacteria includes araBDA, as well as genes from the central glycolytic and fermentation pathways (pyk, eno, gap, tkt, tal, galM, ldh). 
The current study expands the range of genes involved in bifidobacterial arabinose/AOS utilization and demonstrates considerable variations in associated metabolic pathways and regulons. Detailed comparative and phylogenetic analyses allowed us to hypothesize how the reconstructed regulons evolved in bifidobacteria. Our findings may help to improve carbohydrate catabolic phenotype prediction and metabolic modeling, and may also facilitate the rational development of novel prebiotics.
Minimizing Bias in Estimation of Mutual Information from Data Streams
Mutual information is a measure for both linear and non-linear associations between variables. There exist several estimators of mutual information for static data. In the dynamic case, one needs to apply these estimators to samples of points from data streams. The sampling should be such that more detailed information on the recent past is available. We formulate a list of natural requirements an estimator of mutual information on data streams should fulfill, and we propose two approaches that meet all of them. Finally, we compare our algorithms to an existing method both theoretically and experimentally. Our findings include that our approaches are faster and have lower bias and better memory complexity.
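The setting can be sketched as follows. This is a generic illustration, not the paper's algorithms: the recency-biased sample is simply a sliding window, the estimator is a plain binned plug-in estimator, and the correlated toy stream is an assumption.

```python
import math
import random
from collections import Counter, deque

random.seed(2)

def mi_plugin(pairs, bins=8):
    # Plug-in mutual information on a bins x bins grid (in nats):
    # I = sum p(x,y) * log( p(x,y) / (p(x) p(y)) ).
    n = len(pairs)
    joint, px, py = Counter(), Counter(), Counter()
    for x, y in pairs:
        i = min(bins - 1, int(x * bins))
        j = min(bins - 1, int(y * bins))
        joint[(i, j)] += 1; px[i] += 1; py[j] += 1
    return sum((c / n) * math.log((c / n) / ((px[i] / n) * (py[j] / n)))
               for (i, j), c in joint.items())

# Stream of dependent pairs; keep only a recency-biased sample of them
# (here: the most recent 500 points).
window = deque(maxlen=500)
for _ in range(2000):
    x = random.random()
    y = min(0.999, max(0.0, x + random.gauss(0.0, 0.05)))
    window.append((x, y))

print(round(mi_plugin(list(window)), 2))
```

A strongly dependent pair like this yields a clearly positive estimate, while independent pairs stay near zero (up to the plug-in estimator's small positive bias); the paper's contribution is in how the sample is maintained and the bias controlled as the stream evolves.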
The influence of nickel layer thickness on microhardness and hydrogen sorption rate of commercially pure titanium alloy
The influence of nickel coating thickness on the microhardness and hydrogen sorption rate of commercially pure titanium alloy was established in this work. Coating deposition was carried out by magnetron sputtering with prior ion cleaning of the surface. Increasing the sputtering time from 10 to 50 minutes was shown to increase the coating thickness from 0.56 to 3.78 µm. An increase in nickel coating thickness was found to increase microhardness at loads below 0.5 kg, while microhardness values for all samples do not differ significantly at a load of 1 kg. The hydrogen content in titanium alloy with a nickel layer deposited for 10 and 20 minutes exceeds the concentration in the initial samples by one order of magnitude. Further increasing the deposition time leads to a decrease in hydrogen concentration in the samples due to coating delamination during hydrogenation.