Graph-Embedding Empowered Entity Retrieval
In this research, we improve upon the current state of the art in entity
retrieval by re-ranking the result list using graph embeddings. The paper shows
that graph embeddings are useful for entity-oriented search tasks. We
demonstrate empirically that encoding information from the knowledge graph into
(graph) embeddings yields a larger improvement in the effectiveness of entity
retrieval results than using plain word embeddings. We analyze the impact of
the accuracy of the entity linker on the overall retrieval effectiveness. Our
analysis further deploys the cluster hypothesis to explain the observed
advantages of graph embeddings over the more widely used word embeddings, for
user tasks involving ranking entities.
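The re-ranking idea can be sketched as interpolating each candidate's retrieval score with its embedding similarity to the entities linked in the query. The function name, the max-similarity aggregation, and the linear interpolation below are illustrative assumptions, not the paper's exact method:

```python
import numpy as np

def rerank_by_graph_embedding(candidates, scores, query_entities,
                              embeddings, alpha=0.5):
    """Re-rank retrieval scores by similarity to query-linked entities.

    candidates:     list of candidate entity ids
    scores:         initial retrieval scores (higher = better)
    query_entities: entity ids linked in the query
    embeddings:     dict id -> vector (e.g. from a knowledge-graph embedding)
    alpha:          weight between the original and the embedding-based score
    """
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    q = [embeddings[e] for e in query_entities]
    new_scores = []
    for c, s in zip(candidates, scores):
        sim = max(cos(embeddings[c], qe) for qe in q) if c in embeddings else 0.0
        new_scores.append(alpha * s + (1 - alpha) * sim)
    # sort candidates by the interpolated score, best first
    order = np.argsort(new_scores)[::-1]
    return [candidates[i] for i in order]
```

In the paper's setting the vectors would come from graph embeddings rather than plain word embeddings; the sketch accepts any dictionary of vectors.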
Yet another breakdown point notion: EFSBP - illustrated at scale-shape models
The breakdown point in its different variants is one of the central notions
to quantify the global robustness of a procedure. We propose a simple
supplementary variant which is useful in situations where we have no obvious or
only partial equivariance: extending the Donoho and Huber (1983) Finite Sample
Breakdown Point, we propose the Expected Finite Sample Breakdown Point to
produce less configuration-dependent values while still preserving the finite
sample aspect of the former definition. We apply this notion for joint
estimation of scale and shape (with only scale-equivariance available),
exemplified for generalized Pareto, generalized extreme value, Weibull, and
Gamma distributions. In these settings, we are interested in highly-robust,
easy-to-compute initial estimators; to this end we study Pickands-type and
Location-Dispersion-type estimators and compute their respective breakdown
points.
Comment: 21 pages, 4 figures
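The relationship between the two notions can be sketched as follows; this is a plausible reading of the definitions named in the abstract, not necessarily the paper's exact formulation. Writing $\mathcal{X}_m(x)$ for the set of samples obtained from $x=(x_1,\dots,x_n)$ by replacing at most $m$ points, the Donoho-Huber finite sample breakdown point of an estimator $T$ at $x$, and an expected variant averaging out the sample configuration, read

```latex
\varepsilon^*_n(T; x) \;=\; \frac{1}{n}\,
  \min\Bigl\{\, m \in \{1,\dots,n\} \;:\;
  \sup_{\tilde{x}\in\mathcal{X}_m(x)} \lVert T(\tilde{x}) \rVert = \infty \,\Bigr\},
\qquad
\bar{\varepsilon}_n(T; F) \;=\; \mathbb{E}_{X\sim F^{\otimes n}}
  \bigl[\varepsilon^*_n(T; X)\bigr].
```

Taking the expectation over sample configurations drawn from $F$ is what makes the resulting value less dependent on the particular data set $x$ while keeping the finite sample size $n$ explicit.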
Robustness and Generalization
We derive generalization bounds for learning algorithms based on their
robustness: the property that if a testing sample is "similar" to a training
sample, then the testing error is close to the training error. This provides a
novel approach, different from the complexity or stability arguments, to study
generalization of learning algorithms. We further show that a weak notion of
robustness is both sufficient and necessary for generalizability, which implies
that robustness is a fundamental property for learning algorithms to work.
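For context, bounds in this line of work typically take the following shape (the exact constants are indicative, not quoted from the paper): if the algorithm is $(K,\epsilon(s))$-robust, meaning the sample space can be partitioned into $K$ sets such that a training and a test point falling in the same set have losses within $\epsilon(s)$, and the loss is bounded by $M$, then with probability at least $1-\delta$ over an i.i.d. training sample $s$ of size $n$,

```latex
\bigl|\, L(\mathcal{A}_s) - L_{\mathrm{emp}}(\mathcal{A}_s) \,\bigr|
\;\le\; \epsilon(s) \;+\; M\,\sqrt{\frac{2K\ln 2 + 2\ln(1/\delta)}{n}},
```

where $L$ and $L_{\mathrm{emp}}$ denote the expected and empirical loss of the learned hypothesis $\mathcal{A}_s$.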
Comparison of Network Intrusion Detection Performance Using Feature Representation
P. 463-475. Intrusion detection is essential for the security of the components
of any network. For that reason, several strategies can be used in
Intrusion Detection Systems (IDS) to identify the increasing attempts to
gain unauthorized access with malicious purposes, including those based
on machine learning. Anomaly detection has been applied successfully to
numerous domains and might help to identify unknown attacks. However,
existing issues such as high error rates or the large dimensionality
of data make its deployment difficult in real-life scenarios. Representation
learning allows estimating new latent features of data in a
low-dimensional space. In this work, anomaly detection is performed
using a prior feature learning stage in order to compare these methods
for the detection of intrusions in network traffic. For that purpose,
four different anomaly detection algorithms are applied to recent network
datasets using two different feature learning methods, namely principal
component analysis and autoencoders. Several evaluation metrics, such
as accuracy, F1 score, and ROC curves, are used to compare their performance.
The experimental results show an improvement for two of the
anomaly detection methods using autoencoders and no significant variations
for the linear feature transformations.
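The PCA branch of such a pipeline can be sketched as reconstruction-error anomaly scoring: learn a low-dimensional linear subspace from (assumed clean) traffic, and flag points that reconstruct poorly. The synthetic data and the 99th-percentile threshold are illustrative assumptions, not the paper's datasets or settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Normal "traffic": points near a 2-D plane embedded in 10-D space.
basis = rng.normal(size=(2, 10))
normal = rng.normal(size=(500, 2)) @ basis + 0.05 * rng.normal(size=(500, 10))
# Anomalies: isotropic noise well off that plane.
anomalies = rng.normal(size=(20, 10)) * 2.0

# Fit PCA on the training data via SVD.
mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
components = vt[:2]                      # latent space of dimension 2

def reconstruction_error(x):
    z = (x - mean) @ components.T        # project to latent features
    x_hat = z @ components + mean        # map back to the input space
    return np.linalg.norm(x - x_hat, axis=1)

# Points lying off the learned subspace reconstruct poorly.
threshold = np.quantile(reconstruction_error(normal), 0.99)
flagged = reconstruction_error(anomalies) > threshold
```

An autoencoder variant would replace the linear projection with an encoder/decoder network while keeping the same reconstruction-error scoring.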
A robust measure of correlation between two genes on a microarray
Background: The underlying goal of microarray experiments is to identify gene expression patterns across different experimental conditions. Genes that are contained in a particular pathway, or that respond similarly to experimental conditions, may be co-expressed and show similar patterns of expression on a microarray. Using any of a variety of clustering methods or gene network analyses, we can partition genes of interest into groups, clusters, or modules based on measures of similarity. Typically, Pearson correlation is used to measure distance (or similarity) before implementing a clustering algorithm. Pearson correlation is, however, quite susceptible to outliers, an unfortunate characteristic when dealing with microarray data, which are well known to be quite noisy.
Results: We propose a resistant similarity metric based on Tukey's biweight estimate of multivariate scale and location. The resistant metric is simply the correlation obtained from a resistant covariance matrix of scale. We give results demonstrating that our correlation metric is much more resistant than the Pearson correlation, while being more efficient than other nonparametric measures of correlation (e.g., Spearman correlation). Additionally, our method gives a systematic gene-flagging procedure which is useful when dealing with large amounts of noisy data.
Conclusion: When dealing with microarray data, which are known to be quite noisy, robust methods should be used. Specifically, robust distances, including the biweight correlation, should be used in clustering and gene network analysis.
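The authors derive their metric from Tukey's multivariate biweight covariance matrix. A related and much simpler resistant measure, the univariate biweight midcorrelation, conveys the same idea of downweighting points far from the median; note this is not the paper's exact estimator:

```python
import numpy as np

def biweight_midcorrelation(x, y):
    """Resistant correlation: observations far from the median get weight 0."""
    def weighted_dev(v):
        med = np.median(v)
        mad = np.median(np.abs(v - med))          # robust scale
        u = (v - med) / (9.0 * mad)
        w = (1.0 - u**2) ** 2 * (np.abs(u) < 1.0)  # biweight: zero beyond 9*MAD
        return (v - med) * w
    a = weighted_dev(np.asarray(x, dtype=float))
    b = weighted_dev(np.asarray(y, dtype=float))
    return float(np.sum(a * b) /
                 (np.sqrt(np.sum(a**2)) * np.sqrt(np.sum(b**2))))
```

On two perfectly correlated expression profiles with a single wild outlier, this measure stays close to 1 while Pearson correlation collapses, which is the behaviour the abstract argues for.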
Defining eye-fixation sequences across individuals and tasks: the Binocular-Individual Threshold (BIT) algorithm
We propose a new fully automated velocity-based algorithm to identify fixations from eye-movement records of both eyes, with individual-specific thresholds. The algorithm is based on robust minimum determinant covariance estimators (MDC) and control chart procedures, and is conceptually simple and computationally attractive. To determine fixations, it uses velocity thresholds based on the natural within-fixation variability of both eyes. It improves over existing approaches by automatically identifying fixation thresholds that are specific to (a) both eyes, (b) x- and y-directions, (c) tasks, and (d) individuals. We applied the proposed Binocular-Individual Threshold (BIT) algorithm to two large datasets collected on eye-trackers with different sampling frequencies, and computed descriptive statistics of fixations for larger samples of individuals across a variety of tasks, including reading, scene viewing, and search on supermarket shelves. Our analysis shows that there are considerable differences in the characteristics of fixations not only between these tasks, but also between individuals.
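The core idea of a velocity-based detector with data-driven thresholds can be sketched as follows. This simplified stand-in uses a per-axis median + MAD threshold on one eye's trace instead of the paper's robust MDC estimators and control charts, and all parameter values are illustrative assumptions:

```python
import numpy as np

def detect_fixations(gaze, hz, k=5.0, min_dur=0.06):
    """Velocity-threshold fixation detection with a data-driven threshold.

    gaze: (n, 2) array of x/y positions for one eye; hz: sampling rate in Hz.
    The threshold is median + k * MAD of the per-axis sample-to-sample
    velocities, a rough stand-in for the paper's robust MDC estimators.
    Returns (start, end) index pairs into the velocity trace.
    """
    vel = np.abs(np.diff(gaze, axis=0)) * hz        # per-axis velocity
    med = np.median(vel, axis=0)
    mad = np.median(np.abs(vel - med), axis=0)
    thresh = med + k * mad                          # one threshold per axis
    slow = np.all(vel < thresh, axis=1)             # below threshold on x AND y

    # Group consecutive slow samples into fixations of minimal duration.
    fixations, start = [], None
    for i, s in enumerate(slow):
        if s and start is None:
            start = i
        elif not s and start is not None:
            if (i - start) / hz >= min_dur:
                fixations.append((start, i))
            start = None
    if start is not None and (len(slow) - start) / hz >= min_dur:
        fixations.append((start, len(slow)))
    return fixations
```

The BIT algorithm additionally estimates thresholds jointly over both eyes and both directions, which is what makes them individual- and task-specific.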
Kernel Spectral Clustering and applications
In this chapter we review the main literature related to kernel spectral
clustering (KSC), an approach to clustering cast within a kernel-based
optimization setting. KSC represents a least-squares support vector machine
based formulation of spectral clustering described by a weighted kernel PCA
objective. Just as in the classifier case, the binary clustering model is
expressed by a hyperplane in a high dimensional space induced by a kernel. In
addition, the multi-way clustering can be obtained by combining a set of binary
decision functions via an Error Correcting Output Codes (ECOC) encoding scheme.
Because of its model-based nature, the KSC method encompasses three main steps:
training, validation, and testing. In the validation stage, model selection is
performed to obtain tuning parameters, like the number of clusters present in
the data. This is a major advantage compared to classical spectral clustering
where the determination of the clustering parameters is unclear and relies on
heuristics. Once a KSC model is trained on a small subset of the entire data,
it is able to generalize well to unseen test points. Beyond the basic
formulation, sparse KSC algorithms based on the Incomplete Cholesky
Decomposition (ICD) and L0, L1, L0 + L1, and Group Lasso regularization are
reviewed. In that respect, we show how it is possible to handle large scale
data. Also, two possible ways to perform hierarchical clustering and a soft
clustering method are presented. Finally, real-world applications such as image
segmentation, power load time-series clustering, document clustering and big
data learning are considered.
Comment: chapter contribution to the book "Unsupervised Learning Algorithms".
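The binary clustering model described above assigns clusters by the sign of a score variable obtained from a weighted kernel PCA eigenproblem. A drastically simplified, classical spectral-clustering stand-in (no LS-SVM training/validation split, no out-of-sample extension) might look like this; the RBF kernel and the sign rule are the only ingredients taken from the text:

```python
import numpy as np

def binary_spectral_cluster(X, sigma=1.0):
    """Binary clustering from the second eigenvector of the normalized kernel."""
    # RBF (Gaussian) kernel matrix
    sq = np.sum(X**2, axis=1)
    K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / (2 * sigma**2))
    d = K.sum(axis=1)                          # node degrees
    M = K / np.sqrt(np.outer(d, d))            # symmetric normalization
    _, vecs = np.linalg.eigh(M)                # eigenvalues in ascending order
    alpha = vecs[:, -2]                        # skip the trivial top eigenvector
    return (alpha > 0).astype(int)             # sign pattern = cluster indicator
```

Real KSC trains on a small subsample and extends to unseen points through the learned decision function, whereas this sketch decomposes the full kernel matrix; multi-way clustering would combine several such binary sign patterns via an ECOC scheme.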
Analytical Quality Assessment of Iteratively Reweighted Least-Squares (IRLS) Method
The iteratively reweighted least-squares (IRLS) technique has been widely employed in the geodetic and geophysical literature. Reliability measures are important diagnostic tools for inferring the strength of the model validation. An exact analytical method is adopted to obtain insights into how much iterative reweighting can affect the quality indicators. Theoretical analyses and numerical results show that, when the downweighting procedure is performed, (1) the precision, all kinds of dilution of precision (DOP) metrics, and the minimal detectable bias (MDB) become larger; (2) variations of the bias-to-noise ratio (BNR) are involved; and (3) all these results coincide with those obtained by the first-order approximation method.
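The downweighting procedure whose effect the paper quantifies can be illustrated with a generic IRLS robust-regression loop. The Huber weight function and the MAD-based scale estimate below are common choices, not necessarily those analyzed in the paper:

```python
import numpy as np

def irls(A, y, c=1.345, iters=20):
    """Iteratively reweighted least squares with Huber weights.

    Observations with large standardized residuals are downweighted on each
    pass; this is the reweighting step that inflates precision/DOP/MDB
    measures relative to ordinary least squares.
    """
    x = np.linalg.lstsq(A, y, rcond=None)[0]          # ordinary LS start
    for _ in range(iters):
        r = y - A @ x
        scale = np.median(np.abs(r)) / 0.6745 + 1e-12  # robust sigma estimate
        u = np.abs(r) / scale
        w = np.where(u <= c, 1.0, c / u)               # Huber weight function
        sw = np.sqrt(w)
        x = np.linalg.lstsq(sw[:, None] * A, sw * y, rcond=None)[0]
    return x
```

On a straight-line fit contaminated by one gross outlier, the loop recovers the clean parameters that ordinary least squares misses.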
Multi-Class Clustering of Cancer Subtypes through SVM Based Ensemble of Pareto-Optimal Solutions for Gene Marker Identification
With the advancement of microarray technology, it is now possible to study the expression profiles of thousands of genes across different experimental conditions or tissue samples simultaneously. Microarray cancer datasets, organized in a samples-versus-genes fashion, are being used for classification of tissue samples into benign and malignant or their subtypes. They are also useful for identifying potential gene markers for each cancer subtype, which helps in the successful diagnosis of particular cancer types. In this article, we have presented an unsupervised cancer classification technique based on multiobjective genetic clustering of the tissue samples. In this regard, a real-coded encoding of the cluster centers is used, and cluster compactness and separation are simultaneously optimized. The resultant set of near-Pareto-optimal solutions contains a number of non-dominated solutions. A novel approach to combine the clustering information possessed by the non-dominated solutions through a Support Vector Machine (SVM) classifier has been proposed. Final clustering is obtained by consensus among the clusterings yielded by different kernel functions. The performance of the proposed multiobjective clustering method has been compared with that of several other microarray clustering algorithms on three publicly available benchmark cancer datasets. Moreover, statistical significance tests have been conducted to establish the statistical superiority of the proposed clustering method. Furthermore, relevant gene markers have been identified using the clustering result produced by the proposed clustering method and demonstrated visually. Biological relationships among the gene markers are also studied based on gene ontology. The results obtained are found to be promising and can have an important impact in the area of unsupervised cancer classification as well as gene marker identification for multiple cancer subtypes.
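The final consensus step can be illustrated with a much simpler stand-in than the paper's SVM-based combination: align the label names of each clustering to a reference partition, then take a per-sample majority vote. The greedy alignment below is an illustrative assumption, not the authors' procedure:

```python
import numpy as np

def align_labels(ref, lab, k):
    """Greedily relabel `lab` (values in 0..k-1) to best match `ref`."""
    conf = np.zeros((k, k), dtype=int)
    for r, l in zip(ref, lab):
        conf[l, r] += 1                       # co-occurrence counts
    mapping = {}
    for l in np.argsort(-conf.max(axis=1)):   # most confident clusters first
        r = int(np.argmax(conf[l]))
        mapping[int(l)] = r
        conf[:, r] = -1                       # target label is now taken
    return np.array([mapping[int(l)] for l in lab])

def consensus(clusterings, k):
    """Majority vote over label-aligned clusterings (a simplified stand-in
    for the SVM-based combination of non-dominated solutions)."""
    ref = clusterings[0]
    aligned = [ref] + [align_labels(ref, c, k) for c in clusterings[1:]]
    votes = np.stack(aligned)
    return np.array([np.bincount(votes[:, i], minlength=k).argmax()
                     for i in range(votes.shape[1])])
```

Because cluster labels are arbitrary, the alignment step is essential: two identical partitions with swapped label names would otherwise cancel each other out in the vote.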