445 research outputs found

    Graph-Embedding Empowered Entity Retrieval

    In this research, we improve upon the current state of the art in entity retrieval by re-ranking the result list using graph embeddings. The paper shows that graph embeddings are useful for entity-oriented search tasks. We demonstrate empirically that encoding information from the knowledge graph into (graph) embeddings yields a larger gain in entity retrieval effectiveness than using plain word embeddings. We analyze the impact of the accuracy of the entity linker on the overall retrieval effectiveness. Our analysis further uses the cluster hypothesis to explain the observed advantages of graph embeddings over the more widely used word embeddings for user tasks that involve ranking entities.

    Yet another breakdown point notion: EFSBP - illustrated at scale-shape models

    The breakdown point in its different variants is one of the central notions to quantify the global robustness of a procedure. We propose a simple supplementary variant which is useful in situations where we have no obvious or only partial equivariance: Extending the Donoho and Huber (1983) Finite Sample Breakdown Point, we propose the Expected Finite Sample Breakdown Point to produce less configuration-dependent values while still preserving the finite sample aspect of the former definition. We apply this notion to joint estimation of scale and shape (with only scale-equivariance available), exemplified for generalized Pareto, generalized extreme value, Weibull, and Gamma distributions. In these settings, we are interested in highly robust, easy-to-compute initial estimators; to this end we study Pickands-type and Location-Dispersion-type estimators and compute their respective breakdown points. Comment: 21 pages, 4 figures

    Robustness and Generalization

    We derive generalization bounds for learning algorithms based on their robustness: the property that if a testing sample is "similar" to a training sample, then the testing error is close to the training error. This provides a novel approach, different from complexity or stability arguments, to studying the generalization of learning algorithms. We further show that a weak notion of robustness is both sufficient and necessary for generalizability, which implies that robustness is a fundamental property for learning algorithms to work.

    Comparison of Network Intrusion Detection Performance Using Feature Representation

    P. 463-475. Intrusion detection is essential for the security of the components of any network. For that reason, several strategies, including those based on machine learning, can be used in Intrusion Detection Systems (IDS) to identify the increasing attempts to gain unauthorized access with malicious purposes. Anomaly detection has been applied successfully to numerous domains and might help to identify unknown attacks. However, issues such as high error rates or the large dimensionality of the data make its deployment difficult in real-life scenarios. Representation learning makes it possible to estimate new latent features of data in a low-dimensional space. In this work, anomaly detection is performed after a feature learning stage in order to compare these methods for the detection of intrusions in network traffic. For that purpose, four different anomaly detection algorithms are applied to recent network datasets using two different feature learning methods, principal component analysis and autoencoders. Several evaluation metrics, such as accuracy, F1 score, and ROC curves, are used to compare their performance. The experimental results show an improvement for two of the anomaly detection methods when using autoencoders and no significant variations for the linear feature transformation.

    A robust measure of correlation between two genes on a microarray

    Background: The underlying goal of microarray experiments is to identify gene expression patterns across different experimental conditions. Genes that are contained in a particular pathway or that respond similarly to experimental conditions could be co-expressed and show similar patterns of expression on a microarray. Using any of a variety of clustering methods or gene network analyses we can partition genes of interest into groups, clusters, or modules based on measures of similarity. Typically, Pearson correlation is used to measure distance (or similarity) before implementing a clustering algorithm. Pearson correlation is quite susceptible to outliers, however, an unfortunate characteristic when dealing with microarray data (well known to be typically quite noisy). Results: We propose a resistant similarity metric based on Tukey's biweight estimate of multivariate scale and location. The resistant metric is simply the correlation obtained from a resistant covariance matrix of scale. We give results which demonstrate that our correlation metric is much more resistant than the Pearson correlation while being more efficient than other nonparametric measures of correlation (e.g., Spearman correlation). Additionally, our method gives a systematic gene flagging procedure which is useful when dealing with large amounts of noisy data. Conclusion: When dealing with microarray data, which are known to be quite noisy, robust methods should be used. Specifically, robust distances, including the biweight correlation, should be used in clustering and gene network analysis.

    Defining eye-fixation sequences across individuals and tasks: the Binocular-Individual Threshold (BIT) algorithm

    We propose a new fully automated velocity-based algorithm to identify fixations from eye-movement records of both eyes, with individual-specific thresholds. The algorithm is based on robust minimum determinant covariance estimators (MDC) and control chart procedures, and is conceptually simple and computationally attractive. To determine fixations, it uses velocity thresholds based on the natural within-fixation variability of both eyes. It improves over existing approaches by automatically identifying fixation thresholds that are specific to (a) both eyes, (b) x- and y-directions, (c) tasks, and (d) individuals. We applied the proposed Binocular-Individual Threshold (BIT) algorithm to two large datasets collected on eye-trackers with different sampling frequencies, and computed descriptive statistics of fixations for larger samples of individuals across a variety of tasks, including reading, scene viewing, and search on supermarket shelves. Our analysis shows that there are considerable differences in the characteristics of fixations not only between these tasks, but also between individuals.

    Kernel Spectral Clustering and applications

    In this chapter we review the main literature related to kernel spectral clustering (KSC), an approach to clustering cast within a kernel-based optimization setting. KSC represents a least-squares support vector machine based formulation of spectral clustering described by a weighted kernel PCA objective. Just as in the classifier case, the binary clustering model is expressed by a hyperplane in a high dimensional space induced by a kernel. In addition, the multi-way clustering can be obtained by combining a set of binary decision functions via an Error Correcting Output Codes (ECOC) encoding scheme. Because of its model-based nature, the KSC method encompasses three main steps: training, validation, testing. In the validation stage model selection is performed to obtain tuning parameters, like the number of clusters present in the data. This is a major advantage compared to classical spectral clustering where the determination of the clustering parameters is unclear and relies on heuristics. Once a KSC model is trained on a small subset of the entire data, it is able to generalize well to unseen test points. Beyond the basic formulation, sparse KSC algorithms based on the Incomplete Cholesky Decomposition (ICD) and $L_0$, $L_1$, $L_0 + L_1$, Group Lasso regularization are reviewed. In that respect, we show how it is possible to handle large scale data. Also, two possible ways to perform hierarchical clustering and a soft clustering method are presented. Finally, real-world applications such as image segmentation, power load time-series clustering, document clustering and big data learning are considered. Comment: chapter contribution to the book "Unsupervised Learning Algorithms"

    ANALYTICAL QUALITY ASSESSMENT OF ITERATIVELY REWEIGHTED LEAST-SQUARES (IRLS) METHOD

    The iteratively reweighted least-squares (IRLS) technique has been widely employed in geodetic and geophysical literature. The reliability measures are important diagnostic tools for inferring the strength of the model validation. An exact analytical method is adopted to obtain insights on how much iterative reweighting can affect the quality indicators. Theoretical analyses and numerical results show that, when the downweighting procedure is performed, (1) the precision, all kinds of dilution of precision (DOP) metrics and the minimal detectable bias (MDB) will become larger; (2) the variations of the bias-to-noise ratio (BNR) are involved, and (3) all these results coincide with those obtained by the first-order approximation method.

    Multi-Class Clustering of Cancer Subtypes through SVM Based Ensemble of Pareto-Optimal Solutions for Gene Marker Identification

    With the advancement of microarray technology, it is now possible to study the expression profiles of thousands of genes across different experimental conditions or tissue samples simultaneously. Microarray cancer datasets, organized in a samples-versus-genes fashion, are being used for classification of tissue samples into benign and malignant or their subtypes. They are also useful for identifying potential gene markers for each cancer subtype, which helps in successful diagnosis of particular cancer types. In this article, we present an unsupervised cancer classification technique based on multiobjective genetic clustering of the tissue samples. In this regard, a real-coded encoding of the cluster centers is used, and cluster compactness and separation are simultaneously optimized. The resultant set of near-Pareto-optimal solutions contains a number of non-dominated solutions. A novel approach is proposed to combine the clustering information possessed by the non-dominated solutions through a Support Vector Machine (SVM) classifier. Final clustering is obtained by consensus among the clusterings yielded by different kernel functions. The performance of the proposed multiobjective clustering method is compared with that of several other microarray clustering algorithms on three publicly available benchmark cancer datasets. Moreover, statistical significance tests have been conducted to establish the statistical superiority of the proposed clustering method. Furthermore, relevant gene markers have been identified using the clustering result produced by the proposed clustering method and demonstrated visually. Biological relationships among the gene markers are also studied based on gene ontology. The results obtained are found to be promising and can possibly have an important impact in the area of unsupervised cancer classification as well as gene marker identification for multiple cancer subtypes.