9,672,968 research outputs found
MissForest - nonparametric missing value imputation for mixed-type data
Modern data acquisition based on high-throughput technology is often facing
the problem of missing data. Algorithms commonly used in the analysis of such
large-scale data often depend on a complete set. Missing value imputation
offers a solution to this problem. However, the majority of available
imputation methods are restricted to one type of variable only: continuous or
categorical. For mixed-type data the different types are usually handled
separately. Therefore, these methods ignore possible relations between variable
types. We propose a nonparametric method which can cope with different types of
variables simultaneously. We compare several state of the art methods for the
imputation of missing values. We propose and evaluate an iterative imputation
method (missForest) based on a random forest. By averaging over many unpruned
classification or regression trees random forest intrinsically constitutes a
multiple imputation scheme. Using the built-in out-of-bag error estimates of
random forest we are able to estimate the imputation error without the need of
a test set. Evaluation is performed on multiple data sets coming from a diverse
selection of biological fields with artificially introduced missing values
ranging from 10% to 30%. We show that missForest can successfully handle
missing values, particularly in data sets including different types of
variables. In our comparative study missForest outperforms other methods of
imputation especially in data settings where complex interactions and nonlinear
relations are suspected. The out-of-bag imputation error estimates of
missForest prove to be adequate in all settings. Additionally, missForest
exhibits attractive computational efficiency and can cope with high-dimensional
data.Comment: Submitted to Oxford Journal's Bioinformatics on 3rd of May 201
Improving Automatic Content Type Identification from a Data Set
Data file layout inference refers to building the structure and determining the metadata of a text file. The text files dealt within this research are personal information records that have a consistent structure. Traditionally, if the layout structure of a text file is unknown, the human user must undergo manual labor of identifying the metadata. This is inefficient and prone to error. Content-based oracles are the current state-of-the-art automation technology that attempts to solve the layout inference problem by using databases of known metadata. This paper builds upon the information and documentation of the content-based oracles, and improves the databases of the oracles through experimentation
SMS design review plan, type 1 data
The SMS PDR plan is submitted incrementally 15 days prior to the associated PDR's. The initial release contains a complete schedule for all PDR's plus the agendas for the first PDR's scheduled
An inverse problem of Calderon type with partial data
A generalized variant of the Calder\'on problem from electrical impedance
tomography with partial data for anisotropic Lipschitz conductivities is
considered in an arbitrary space dimension . The following two
results are shown: (i) The selfadjoint Dirichlet operator associated with an
elliptic differential expression on a bounded Lipschitz domain is determined
uniquely up to unitary equivalence by the knowledge of the Dirichlet-to-Neumann
map on an open subset of the boundary, and (ii) the Dirichlet operator can be
reconstructed from the residuals of the Dirichlet-to-Neumann map on this
subset.Comment: to appear in Comm. Partial Differential Equation
On the Use of Underspecified Data-Type Semantics for Type Safety in Low-Level Code
In recent projects on operating-system verification, C and C++ data types are
often formalized using a semantics that does not fully specify the precise byte
encoding of objects. It is well-known that such an underspecified data-type
semantics can be used to detect certain kinds of type errors. In general,
however, underspecified data-type semantics are unsound: they assign
well-defined meaning to programs that have undefined behavior according to the
C and C++ language standards.
A precise characterization of the type-correctness properties that can be
enforced with underspecified data-type semantics is still missing. In this
paper, we identify strengths and weaknesses of underspecified data-type
semantics for ensuring type safety of low-level systems code. We prove
sufficient conditions to detect certain classes of type errors and, finally,
identify a trade-off between the complexity of underspecified data-type
semantics and their type-checking capabilities.Comment: In Proceedings SSV 2012, arXiv:1211.587
- …
