Classifying LEP Data with Support Vector Algorithms
We have studied the application of different classification algorithms in the
analysis of simulated high energy physics data. Whereas Neural Network
algorithms have become a standard tool for data analysis, the performance of
other classifiers such as Support Vector Machines has not yet been tested in
this environment. We chose two different problems to compare the performance of
a Support Vector Machine and a Neural Net trained with back-propagation:
tagging events of the type e+e- -> ccbar and the identification of muons
produced in multihadronic e+e- annihilation events. Comment: 7 pages, 4 figures, submitted to proceedings of AIHENP99, Crete,
April 199
A Kernel Method for the Two-sample Problem
We propose a framework for analyzing and comparing distributions, allowing us to design statistical tests to determine if two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS). We present two tests based on large deviation bounds for the test statistic, while a third is based on the asymptotic distribution of this statistic. The test statistic can be computed in quadratic time, although efficient linear time approximations are available. Several classical metrics on distributions are recovered when the function space used to compute the difference in expectations is allowed to be more general (e.g., a Banach space). We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests.
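As a rough illustration of the quadratic-time statistic mentioned in this abstract, the squared RKHS distance between the two sample means can be written as E k(x,x') + E k(y,y') - 2 E k(x,y) and estimated by averaging kernel values over all pairs. The sketch below is a minimal, assumption-laden version: the Gaussian kernel, the bandwidth `sigma`, and the function names are illustrative choices, not taken from the paper, and no test threshold is computed.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd2_biased(xs, ys, kernel=gaussian_kernel):
    """Biased estimate of the squared kernel mean discrepancy.

    Averages the kernel over all pairs, so the cost is quadratic
    in the sample sizes.
    """
    m, n = len(xs), len(ys)
    k_xx = sum(kernel(a, b) for a in xs for b in xs) / (m * m)
    k_yy = sum(kernel(a, b) for a in ys for b in ys) / (n * n)
    k_xy = sum(kernel(a, b) for a in xs for b in ys) / (m * n)
    return k_xx + k_yy - 2.0 * k_xy

# Hypothetical toy samples: two well-separated clusters on the line.
near = [(0.0,), (0.1,)]
far = [(5.0,), (5.1,)]
separated = mmd2_biased(near, far)
identical = mmd2_biased(near, near)
```

A value near zero suggests the samples are hard to tell apart under the chosen kernel; turning the statistic into a test still requires a threshold, for example from the bounds or asymptotic distribution the abstract refers to.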
A framework for space-efficient string kernels
String kernels are typically used to compare genome-scale sequences whose
length makes alignment impractical, yet their computation is based on data
structures that are either space-inefficient, or incur large slowdowns. We show
that a number of exact string kernels, like the k-mer kernel, the substrings
kernels, a number of length-weighted kernels, the minimal absent words kernel,
and kernels with Markovian corrections, can all be computed in time and
in bits of space in addition to the input, using just a
data structure on the Burrows-Wheeler transform of the
input strings, which takes time per element in its output. The same
bounds hold for a number of measures of compositional complexity based on
multiple values of k, like the k-mer profile and the k-th order empirical
entropy, and for calibrating the value of k using the data.
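To make the k-mer kernel concrete, the sketch below computes it the naive way: count every length-k substring of each string in a hash table and take the inner product of the two count vectors. This is not the space-efficient Burrows-Wheeler approach the abstract describes; it only shows what quantity is being computed. The function names are hypothetical.

```python
from collections import Counter

def kmer_profile(s, k):
    """Count every length-k substring of s (empty if k exceeds len(s))."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def kmer_kernel(s, t, k):
    """Inner product of the k-mer count vectors of s and t."""
    ps, pt = kmer_profile(s, k), kmer_profile(t, k)
    if len(ps) > len(pt):
        ps, pt = pt, ps  # iterate over the smaller profile
    # Counter returns 0 for k-mers that are absent from pt.
    return sum(count * pt[kmer] for kmer, count in ps.items())

# "ABAB" has AB x2, BA x1; "BABA" has BA x2, AB x1.
value = kmer_kernel("ABAB", "BABA", 2)  # 2*1 + 1*2 = 4
```

The hash table stores every distinct k-mer explicitly, which is the kind of space overhead that becomes a problem at genome scale.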
Hilbert Space Representations of Probability Distributions
Many problems in unsupervised learning require the analysis of features of probability distributions. At the most fundamental level, we might wish to determine whether two distributions are the same, based on samples from each - this is known as the two-sample or homogeneity problem. We use kernel methods to address this problem, by mapping probability distributions to elements in a reproducing kernel Hilbert space (RKHS). Given a sufficiently rich RKHS, these representations are unique: thus comparing feature space representations allows us to compare distributions without ambiguity. Applications include testing whether cancer subtypes are distinguishable on the basis of DNA microarray data, and whether low frequency oscillations measured at an electrode in the cortex have a different distribution during a neural spike. A more difficult problem is to discover whether two random variables drawn from a joint distribution are independent. It turns out that any dependence between pairs of random variables can be encoded in a cross-covariance operator between appropriate RKHS representations of the variables, and we may test independence by looking at a norm of the operator. We demonstrate this independence test by establishing dependence between an English text and its French translation, as opposed to French text on the same topic but otherwise unrelated. Finally, we show that this operator norm is itself a difference in feature means.
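The independence measure described here, a norm of the cross-covariance operator, has a simple empirical form for paired samples: centre one Gram matrix and take its trace against the other. The sketch below assumes scalar observations, Gaussian kernels with a shared bandwidth `sigma`, and a 1/n^2 normalization (conventions vary); none of these choices come from the abstract, and the function names are hypothetical.

```python
import math

def gram(xs, sigma=1.0):
    """Gaussian Gram matrix for a list of scalar observations."""
    return [[math.exp(-(a - b) ** 2 / (2.0 * sigma ** 2)) for b in xs] for a in xs]

def center(K):
    """Double-centre a Gram matrix, i.e. compute H K H with H = I - (1/n) 11^T."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + total for j in range(n)] for i in range(n)]

def hsic_biased(xs, ys, sigma=1.0):
    """Biased empirical estimate trace(H K H L) / n^2 for paired samples."""
    n = len(xs)
    Kc = center(gram(xs, sigma))
    L = gram(ys, sigma)
    return sum(Kc[i][j] * L[j][i] for i in range(n) for j in range(n)) / (n * n)

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
dependent = hsic_biased(xs, xs)           # y is a copy of x
constant = hsic_biased(xs, [0.0] * 5)     # y carries no information about x
```

A constant second variable gives a value of zero up to rounding, while a copy of the first variable gives a clearly positive one; deciding significance in practice needs a threshold, which this sketch does not provide.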
The devices, experimental scaffolds, and biomaterials ontology (DEB): a tool for mapping, annotation, and analysis of biomaterials' data
The size and complexity of the biomaterials literature make systematic data analysis an excruciating manual task. A practical solution is creating databases and information resources. Implant design and biomaterials research can greatly benefit from an open database for systematic data retrieval. Ontologies are pivotal to knowledge base creation, serving to represent and organize domain knowledge. To name but two examples, GO, the Gene Ontology, and ChEBI, the Chemical Entities of Biological Interest ontology, and their associated databases are central resources to their respective research communities. The creation of the devices, experimental scaffolds, and biomaterials ontology (DEB), an open resource for organizing information about biomaterials, their design, manufacture, and biological testing, is described. It is developed using text analysis for identifying ontology terms from a biomaterials gold standard corpus, systematically curated to represent the domain's lexicon. Topics covered are validated by members of the biomaterials research community. The ontology may be used for searching terms, performing annotations for machine learning applications, standardized meta-data indexing, and other cross-disciplinary data exploitation. The input of the biomaterials community to this effort to create data-driven open-access research tools is encouraged and welcomed.
Uncertainty in context-aware systems: A case study for intelligent environments
Data used by context-aware systems are naturally incomplete and do not always reflect real situations. The dynamic nature of intelligent environments creates a need to analyse and handle uncertain information. Users can change their behaviour patterns within a short space of time. This paper presents a case study for a better understanding of concepts related to context awareness and the problem of dealing with inaccurate data. By analysing and identifying the elements that lead to the construction of unreliable contexts, it aims to identify patterns that minimize incompleteness. It will thus be possible to deal with flaws caused by undesired execution of applications. Programa Operacional Temático Factores de Competitividade (POCI-01-0145-
Reproducing Kernels of Generalized Sobolev Spaces via a Green Function Approach with Distributional Operators
In this paper we introduce a generalized Sobolev space by defining a
semi-inner product formulated in terms of a vector distributional operator
consisting of finitely or countably many distributional operators,
which are defined on the dual space of the Schwartz space. The types of
operators we consider include not only differential operators, but also more
general distributional operators such as pseudo-differential operators. We
deduce that a certain appropriate full-space Green function with respect to
now becomes a conditionally positive
definite function. In order to support this claim we ensure that the
distributional adjoint operator of is
well-defined in the distributional sense. Under sufficient conditions, the
native space (reproducing-kernel Hilbert space) associated with the Green
function can be isometrically embedded into or even be isometrically
equivalent to a generalized Sobolev space. As an application, we take linear
combinations of translates of the Green function with possibly added polynomial
terms and construct a multivariate minimum-norm interpolant to data
values sampled from an unknown generalized Sobolev function at data sites
located in some set. We provide several examples, such
as Matérn kernels or Gaussian kernels, that illustrate how many
reproducing-kernel Hilbert spaces of well-known reproducing kernels are
isometrically equivalent to a generalized Sobolev space. These examples further
illustrate how we can rescale the Sobolev spaces by the vector distributional
operator. Introducing the notion of scale as part of the
definition of a generalized Sobolev space may help us to choose the "best"
kernel function for kernel-based approximation methods. Comment: Updated version of the publication in Num. Math., close to Qi Ye's Ph.D.
thesis (\url{http://mypages.iit.edu/~qye3/PhdThesis-2012-AMS-QiYe-IIT.pdf})
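The interpolation step described in this abstract, a linear combination of kernel translates fitted to data values at the data sites, can be sketched in one dimension with a Gaussian kernel. Because the Gaussian is positive definite, no polynomial terms are added here. The shape parameter `eps`, the small Gaussian-elimination solver, and all function names are assumptions made for illustration; the paper's setting is multivariate and far more general.

```python
import math

def gaussian(x, y, eps=1.0):
    """Gaussian kernel exp(-(eps * (x - y))^2) for scalars (assumed scaling)."""
    return math.exp(-(eps * (x - y)) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def fit_interpolant(sites, values, eps=1.0):
    """Return s(x) = sum_j c_j K(x, x_j) with s(x_i) = values[i] at every site."""
    A = [[gaussian(xi, xj, eps) for xj in sites] for xi in sites]
    coeffs = solve(A, values)
    return lambda x: sum(c * gaussian(x, xj, eps) for c, xj in zip(coeffs, sites))

# Hypothetical data sites and values.
sites = [0.0, 1.0, 2.0, 3.0]
values = [0.0, 1.0, 0.0, -1.0]
s = fit_interpolant(sites, values)
```

Changing `eps` rescales the kernel and therefore changes which function among all interpolants has minimum norm, which is one way to read the abstract's remark about scale.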
An incremental dual nu-support vector regression algorithm
© 2018, Springer International Publishing AG, part of Springer Nature. Support vector regression (SVR) has been a hot research topic for several years as it is an effective regression learning algorithm. Early studies on SVR mostly focused on solving large-scale problems. Nowadays, an increasing number of researchers are focusing on incremental SVR algorithms. However, these incremental SVR algorithms require the data in each training example to be precise, so they cannot handle uncertain data, which are very common in real life. Therefore, to handle the incremental regression problem with uncertain data, an incremental dual nu-support vector regression algorithm (dual-v-SVR) is proposed. In the algorithm, a dual-v-SVR formulation is first designed to handle the uncertain data; two special adjustments, an incremental adjustment and a decremental adjustment, then enable the dual-v-SVR model to learn incrementally. Finally, the experimental results demonstrate that the incremental dual-v-SVR algorithm is an efficient incremental algorithm that is not only capable of solving the incremental regression problem with uncertain data but is also faster than batch or other incremental SVR algorithms.