172 research outputs found
Numerical Implementation of lepton-nucleus interactions and its effect on neutrino oscillation analysis
We discuss the implementation of the nuclear model based on realistic nuclear
spectral functions in the GENIE neutrino interaction generator. Besides
improving on the Fermi gas description of the nuclear ground state, our scheme
involves a new prescription for selection, meant to efficiently enforce
energy-momentum conservation. The results of our simulations, validated through
comparison to electron scattering data, have been obtained for a variety of
target nuclei, ranging from carbon to argon, and cover the kinematical region
in which quasi-elastic scattering is the dominant reaction mechanism. We also
analyse the influence of the adopted nuclear model on the determination of
neutrino oscillation parameters.

Comment: 19 pages, 35 figures, version accepted by Phys. Rev.
Automated data pre-processing via meta-learning
The final publication is available at link.springer.com

A data mining algorithm may perform differently on datasets with different characteristics; e.g., it might perform better on a dataset with continuous attributes than on one with categorical attributes, or the other way around.
As a matter of fact, a dataset usually needs to be pre-processed. Taking into account all the possible pre-processing operators, there exists a staggeringly large number of alternatives, and inexperienced users become overwhelmed.
We show that this problem can be addressed by an automated approach, leveraging ideas from meta-learning.
Specifically, we consider a wide range of data pre-processing techniques and a set of data mining algorithms. For each data mining algorithm and selected dataset, we are able to predict the transformations that improve the result
of the algorithm on the respective dataset. Our approach will help non-expert users to more effectively identify the transformations appropriate to their applications, and hence to achieve improved results.

Peer Reviewed. Postprint (published version).
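The core idea above — predicting useful pre-processing transformations from dataset characteristics — can be sketched as a nearest-neighbour lookup over dataset meta-features. This is an illustrative sketch only: the meta-features, the toy knowledge base, and all names below are hypothetical, not the authors' implementation.

```python
import math

# Toy meta-feature extractor: summarise a dataset by a few simple
# characteristics (a "meta-feature" vector).
def meta_features(n_rows, n_numeric, n_categorical, missing_rate):
    total = n_numeric + n_categorical
    return (
        math.log10(max(n_rows, 1)),   # dataset size (log scale)
        n_categorical / total,        # fraction of categorical attributes
        missing_rate,                 # fraction of missing values
    )

# Hypothetical knowledge base: meta-features of past datasets, paired with
# the pre-processing that most improved a given mining algorithm there.
KNOWLEDGE_BASE = [
    (meta_features(100_000, 20, 0, 0.00), "standardise"),
    (meta_features(500, 2, 18, 0.00),     "one-hot-encode"),
    (meta_features(2_000, 10, 5, 0.30),   "impute-missing"),
]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recommend(mf):
    """Recommend the transformation used on the most similar past dataset."""
    return min(KNOWLEDGE_BASE, key=lambda kb: euclidean(kb[0], mf))[1]

# A new, small, categorical-heavy dataset resembles the second entry:
print(recommend(meta_features(800, 3, 15, 0.01)))  # → one-hot-encode
```

A real meta-learning system would use many more meta-features and a trained model instead of a single nearest neighbour, but the lookup structure is the same.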
Conditional Neural Relational Inference for Interacting Systems
In this work, we want to learn to model the dynamics of similar yet distinct
groups of interacting objects. These groups follow common physical laws, yet
exhibit specificities that are captured through a vectorial description. We
develop a model that allows us to do conditional generation for any such group
given its vectorial description. Unlike previous work on learning dynamical
systems, which can only do trajectory completion and requires part of the
trajectory dynamics to be provided as input at generation time, we generate
using only the conditioning vector, with no access to trajectories at
generation time. We evaluate our model in the setting of modelling human gait
and, in particular, pathological human gait.
Determining appropriate approaches for using data in feature selection
Feature selection is increasingly important in data analysis and machine learning in the big data era. However, how to use the data in feature selection, i.e. using either ALL or PART of a dataset, has become a serious and tricky issue. Whilst the conventional practice of using all the data in feature selection may lead to selection bias, using part of the data may, on the other hand, lead to underestimating the relevant features under some conditions. This paper investigates these two strategies systematically in terms of reliability and effectiveness, and then determines their suitability for datasets with different characteristics. The reliability is measured by the Average Tanimoto Index and the Inter-method Average Tanimoto Index, and the effectiveness is measured by the mean generalisation accuracy of classification. The computational experiments are carried out on ten real-world benchmark datasets and fourteen synthetic datasets. The synthetic datasets are generated with a pre-set number of relevant features and varied numbers of irrelevant features and instances, and with different levels of added noise. The results indicate that the PART approach is more effective in reducing the bias when the size of a dataset is small, but starts to lose its advantage as the dataset size increases.
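The Average Tanimoto Index used above measures how consistently a feature selector picks the same subset across repeated runs (e.g. resamples). A minimal sketch follows; the paper's exact definition may differ in details such as normalisation:

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two feature subsets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def average_tanimoto_index(subsets):
    """Mean pairwise Tanimoto similarity over the feature subsets selected
    on different runs: 1.0 means perfectly stable selection."""
    pairs = list(combinations(subsets, 2))
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)

# Feature subsets selected on three resamples (hypothetical):
runs = [{"f1", "f2", "f3"}, {"f1", "f2", "f4"}, {"f1", "f2", "f3"}]
print(average_tanimoto_index(runs))  # ≈ 0.667
```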
Adjusted Measures for Feature Selection Stability for Data Sets with Similar Features
For data sets with similar features, for example highly correlated features,
most existing stability measures behave in an undesired way: They consider
features that are almost identical but have different identifiers as different
features. Existing adjusted stability measures, that is, stability measures
that take into account the similarities between features, have major
theoretical drawbacks. We introduce new adjusted stability measures that
overcome these drawbacks. We compare them to each other and to existing
stability measures based on both artificial and real sets of selected features.
Based on the results, we suggest using one new stability measure that considers
highly similar features as exchangeable.
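One way to make a stability measure treat highly similar features as exchangeable, as the abstract suggests, is to collapse each feature to a representative of its similarity group before comparing selections. The sketch below is an illustrative adjustment of plain Jaccard similarity, not the specific measure introduced in the paper:

```python
def adjusted_similarity(selected_a, selected_b, groups):
    """Jaccard similarity after mapping each feature to its
    similarity-group key, so that picking either of two nearly
    identical features counts as the same choice."""
    rep = {f: g for g, members in groups.items() for f in members}
    a = {rep.get(f, f) for f in selected_a}
    b = {rep.get(f, f) for f in selected_b}
    return len(a & b) / len(a | b)

# Hypothetical groups of highly correlated features:
groups = {"g1": {"height_cm", "height_in"},
          "g2": {"weight_kg", "weight_lb"}}

# Plain set overlap would call these two selections disjoint;
# the adjusted measure sees them as identical.
print(adjusted_similarity({"height_cm", "weight_kg"},
                          {"height_in", "weight_lb"}, groups))  # → 1.0
```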
Addressing the Challenge of Defining Valid Proteomic Biomarkers and Classifiers
Background: The purpose of this manuscript is to provide, based on an extensive analysis of a proteomic data set, suggestions for proper statistical analysis for the discovery of sets of clinically relevant biomarkers. As a tractable example, we define the measurable proteomic differences between apparently healthy adult males and females. We chose urine as the body fluid of interest and CE-MS, a thoroughly validated platform technology allowing for routine analysis of a large number of samples. The second urine of the morning was collected from apparently healthy male and female volunteers (aged 21-40) in the course of the routine medical check-up before recruitment at the Hannover Medical School.

Results: We found that the Wilcoxon test is best suited for the definition of potential biomarkers. Adjustment for multiple testing is necessary. Sample size estimation can be performed based on a small number of observations via resampling from pilot data. Machine learning algorithms appear ideally suited to generate classifiers. Assessment of any results in an independent test set is essential.

Conclusions: Valid proteomic biomarkers for diagnosis and prognosis can only be defined by applying proper statistical data mining procedures. In particular, a justification of the sample size should be part of the study design.
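The point that adjustment for multiple testing is necessary is commonly implemented with the Benjamini-Hochberg procedure for false-discovery-rate control; the abstract does not name the specific correction used, so the sketch below is one standard choice, not necessarily the authors':

```python
def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values: each raw p-value is scaled
    by n/rank (rank in ascending order), then made monotone by taking a
    running minimum from the largest rank downwards."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank_from_end, i in enumerate(reversed(order)):
        rank = n - rank_from_end          # 1-based ascending rank of pvals[i]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# One raw p-value per candidate biomarker (hypothetical values):
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041]))
# → [0.004, 0.016, 0.041, 0.041]
```

Candidate biomarkers would then be kept only if their adjusted p-value stays below the chosen FDR threshold.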
Algebraic Comparison of Partial Lists in Bioinformatics
The outcome of a functional genomics pipeline is usually a partial list of
genomic features, ranked by their relevance in modelling biological phenotype
in terms of a classification or regression model. Due to resampling protocols
or just within a meta-analysis comparison, instead of one list it is often the
case that sets of alternative feature lists (possibly of different lengths) are
obtained. Here we introduce a method, based on the algebraic theory of
symmetric groups, for studying the variability between lists ("list stability")
in the case of lists of unequal length. We provide algorithms evaluating
stability for lists embedded in the full feature set or just limited to the
features occurring in the partial lists. The method is demonstrated first on
synthetic data in a gene filtering task and then for finding gene profiles on a
recent prostate cancer dataset.
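The practical problem above — comparing ranked feature lists of unequal length — can be sketched by completing each partial list to a full ranking (unlisted features share a tied bottom rank) and then measuring a distance between rank vectors. The completion rule and the Canberra distance below are illustrative assumptions in the spirit of the method; the paper develops the theory rigorously via symmetric groups:

```python
def complete_ranking(partial, universe):
    """Turn a partial ranked list into a full rank vector over the
    (sorted) universe: listed features keep their rank; all unlisted
    features share the average of the remaining ranks k+1..n."""
    k, n = len(partial), len(universe)
    tail = (k + 1 + n) / 2
    rank = {f: i + 1 for i, f in enumerate(partial)}
    return [rank.get(f, tail) for f in sorted(universe)]

def canberra(x, y):
    """Canberra distance between two rank vectors."""
    return sum(abs(a - b) / (a + b) for a, b in zip(x, y))

universe = {"g1", "g2", "g3", "g4", "g5"}
list_a = ["g1", "g2", "g3"]      # two partial lists of unequal length
list_b = ["g2", "g1"]
d = canberra(complete_ranking(list_a, universe),
             complete_ranking(list_b, universe))
print(round(d, 3))  # → 0.927
```

Averaging such pairwise distances over a set of lists gives a single list-stability score, lower meaning more stable.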
Measurement of cosmic-ray reconstruction efficiencies in the MicroBooNE LArTPC using a small external cosmic-ray counter
The MicroBooNE detector is a liquid argon time projection chamber at Fermilab
designed to study short-baseline neutrino oscillations and neutrino-argon
interaction cross sections. Due to its location near the surface, a good
understanding of cosmic muons as a source of backgrounds is of fundamental
importance for the experiment. We present a method of using an external 0.5 m
(L) x 0.5 m (W) muon counter stack, installed above the main detector, to
determine the cosmic-ray reconstruction efficiency in MicroBooNE. Data are
acquired with this external muon counter stack placed in three different
positions, corresponding to cosmic rays intersecting different parts of the
detector. The data reconstruction efficiency of tracks in the detector is found
to be , in good agreement with the Monte Carlo reconstruction
efficiency . This analysis represents
a small-scale demonstration of the method that can be used with future data
coming from a recently installed cosmic-ray tagger system, which will be able
to tag of the cosmic rays passing through the MicroBooNE
detector.

Comment: 19 pages, 12 figures