Graph-Embedding Empowered Entity Retrieval
In this research, we improve upon the current state of the art in entity
retrieval by re-ranking the result list using graph embeddings. The paper shows
that graph embeddings are useful for entity-oriented search tasks. We
demonstrate empirically that encoding information from the knowledge graph into
(graph) embeddings yields a larger improvement in the effectiveness of entity
retrieval results than using plain word embeddings. We analyze the impact of
the accuracy of the entity linker on the overall retrieval effectiveness. Our
analysis further deploys the cluster hypothesis to explain the observed
advantages of graph embeddings over the more widely used word embeddings, for
user tasks involving ranking entities.
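The re-ranking idea can be sketched as interpolating each candidate's retrieval score with its embedding similarity to the entities linked in the query. The function name, the max-similarity aggregation, and the linear interpolation below are illustrative assumptions, not the paper's exact method:

```python
import numpy as np

def rerank_by_graph_embedding(candidates, scores, query_entities,
                              embeddings, alpha=0.5):
    """Re-rank retrieval scores by similarity to query-linked entities.

    candidates:     list of candidate entity ids
    scores:         initial retrieval scores (higher = better)
    query_entities: entity ids linked in the query
    embeddings:     dict id -> vector (e.g. from a knowledge-graph embedding)
    alpha:          weight between the original and the embedding-based score
    """
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    q = [embeddings[e] for e in query_entities]
    new_scores = []
    for c, s in zip(candidates, scores):
        sim = max(cos(embeddings[c], qe) for qe in q) if c in embeddings else 0.0
        new_scores.append(alpha * s + (1 - alpha) * sim)
    # sort candidates by the interpolated score, best first
    order = np.argsort(new_scores)[::-1]
    return [candidates[i] for i in order]
```

In the paper's setting the vectors would come from graph embeddings rather than plain word embeddings; the sketch accepts any dictionary of vectors.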
Yet another breakdown point notion: EFSBP - illustrated at scale-shape models
The breakdown point in its different variants is one of the central notions
to quantify the global robustness of a procedure. We propose a simple
supplementary variant which is useful in situations where we have no obvious or
only partial equivariance: extending the Donoho and Huber (1983) Finite Sample
Breakdown Point, we propose the Expected Finite Sample Breakdown Point to
produce less configuration-dependent values while still preserving the finite
sample aspect of the former definition. We apply this notion for joint
estimation of scale and shape (with only scale-equivariance available),
exemplified for generalized Pareto, generalized extreme value, Weibull, and
Gamma distributions. In these settings, we are interested in highly-robust,
easy-to-compute initial estimators; to this end we study Pickands-type and
Location-Dispersion-type estimators and compute their respective breakdown
points.
Comment: 21 pages, 4 figures
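The relationship between the two notions can be sketched as follows; this is a plausible reading of the definitions named in the abstract, not necessarily the paper's exact formulation. Writing $\mathcal{X}_m(x)$ for the set of samples obtained from $x=(x_1,\dots,x_n)$ by replacing at most $m$ points, the Donoho-Huber finite sample breakdown point of an estimator $T$ at $x$, and an expected variant averaging out the sample configuration, read

```latex
\varepsilon^*_n(T; x) \;=\; \frac{1}{n}\,
  \min\Bigl\{\, m \in \{1,\dots,n\} \;:\;
  \sup_{\tilde{x}\in\mathcal{X}_m(x)} \lVert T(\tilde{x}) \rVert = \infty \,\Bigr\},
\qquad
\bar{\varepsilon}_n(T; F) \;=\; \mathbb{E}_{X\sim F^{\otimes n}}
  \bigl[\varepsilon^*_n(T; X)\bigr].
```

Taking the expectation over sample configurations drawn from $F$ is what makes the resulting value less dependent on the particular data set $x$ while keeping the finite sample size $n$ explicit.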
Robustness and Generalization
We derive generalization bounds for learning algorithms based on their
robustness: the property that if a testing sample is "similar" to a training
sample, then the testing error is close to the training error. This provides a
novel approach, different from the complexity or stability arguments, to study
generalization of learning algorithms. We further show that a weak notion of
robustness is both sufficient and necessary for generalizability, which implies
that robustness is a fundamental property for learning algorithms to work.
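For context, bounds in this line of work typically take the following shape (the exact constants are indicative, not quoted from the paper): if the algorithm is $(K,\epsilon(s))$-robust, meaning the sample space can be partitioned into $K$ sets such that a training and a test point falling in the same set have losses within $\epsilon(s)$, and the loss is bounded by $M$, then with probability at least $1-\delta$ over an i.i.d. training sample $s$ of size $n$,

```latex
\bigl|\, L(\mathcal{A}_s) - L_{\mathrm{emp}}(\mathcal{A}_s) \,\bigr|
\;\le\; \epsilon(s) \;+\; M\,\sqrt{\frac{2K\ln 2 + 2\ln(1/\delta)}{n}},
```

where $L$ and $L_{\mathrm{emp}}$ denote the expected and empirical loss of the learned hypothesis $\mathcal{A}_s$.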
Comparison of Network Intrusion Detection Performance Using Feature Representation
P. 463-475. Intrusion detection is essential for the security of the components
of any network. For that reason, several strategies can be used in
Intrusion Detection Systems (IDS) to identify the increasing attempts to
gain unauthorized access with malicious purposes, including those based
on machine learning. Anomaly detection has been applied successfully to
numerous domains and might help to identify unknown attacks. However,
existing issues such as high error rates or the large dimensionality
of data make its deployment difficult in real-life scenarios. Representation
learning allows estimating new latent features of data in a
low-dimensional space. In this work, anomaly detection is performed
using a prior feature learning stage in order to compare these methods
for the detection of intrusions in network traffic. For that purpose,
four different anomaly detection algorithms are applied to recent network
datasets using two different feature learning methods, namely principal
component analysis and autoencoders. Several evaluation metrics, such
as accuracy, F1 score, and ROC curves, are used to compare their performance.
The experimental results show an improvement for two of the
anomaly detection methods using autoencoders and no significant variations
for the linear feature transformations.
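The PCA branch of such a pipeline can be sketched as reconstruction-error anomaly scoring: learn a low-dimensional linear subspace from (assumed clean) traffic, and flag points that reconstruct poorly. The synthetic data and the 99th-percentile threshold are illustrative assumptions, not the paper's datasets or settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Normal "traffic": points near a 2-D plane embedded in 10-D space.
basis = rng.normal(size=(2, 10))
normal = rng.normal(size=(500, 2)) @ basis + 0.05 * rng.normal(size=(500, 10))
# Anomalies: isotropic noise well off that plane.
anomalies = rng.normal(size=(20, 10)) * 2.0

# Fit PCA on the training data via SVD.
mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
components = vt[:2]                      # latent space of dimension 2

def reconstruction_error(x):
    z = (x - mean) @ components.T        # project to latent features
    x_hat = z @ components + mean        # map back to the input space
    return np.linalg.norm(x - x_hat, axis=1)

# Points lying off the learned subspace reconstruct poorly.
threshold = np.quantile(reconstruction_error(normal), 0.99)
flagged = reconstruction_error(anomalies) > threshold
```

An autoencoder variant would replace the linear projection with an encoder/decoder network while keeping the same reconstruction-error scoring.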
A robust measure of correlation between two genes on a microarray
Background: The underlying goal of microarray experiments is to identify gene expression patterns across different experimental conditions. Genes that are contained in a particular pathway, or that respond similarly to experimental conditions, may be co-expressed and show similar patterns of expression on a microarray. Using any of a variety of clustering methods or gene network analyses, we can partition genes of interest into groups, clusters, or modules based on measures of similarity. Typically, Pearson correlation is used to measure distance (or similarity) before implementing a clustering algorithm. Pearson correlation is, however, quite susceptible to outliers, an unfortunate characteristic when dealing with microarray data, which are well known to be quite noisy.
Results: We propose a resistant similarity metric based on Tukey's biweight estimate of multivariate scale and location. The resistant metric is simply the correlation obtained from a resistant covariance matrix of scale. We give results demonstrating that our correlation metric is much more resistant than the Pearson correlation, while being more efficient than other nonparametric measures of correlation (e.g., Spearman correlation). Additionally, our method gives a systematic gene-flagging procedure which is useful when dealing with large amounts of noisy data.
Conclusion: When dealing with microarray data, which are known to be quite noisy, robust methods should be used. Specifically, robust distances, including the biweight correlation, should be used in clustering and gene network analysis.
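The authors derive their metric from Tukey's multivariate biweight covariance matrix. A related and much simpler resistant measure, the univariate biweight midcorrelation, conveys the same idea of downweighting points far from the median; note this is not the paper's exact estimator:

```python
import numpy as np

def biweight_midcorrelation(x, y):
    """Resistant correlation: observations far from the median get weight 0."""
    def weighted_dev(v):
        med = np.median(v)
        mad = np.median(np.abs(v - med))          # robust scale
        u = (v - med) / (9.0 * mad)
        w = (1.0 - u**2) ** 2 * (np.abs(u) < 1.0)  # biweight: zero beyond 9*MAD
        return (v - med) * w
    a = weighted_dev(np.asarray(x, dtype=float))
    b = weighted_dev(np.asarray(y, dtype=float))
    return float(np.sum(a * b) /
                 (np.sqrt(np.sum(a**2)) * np.sqrt(np.sum(b**2))))
```

On two perfectly correlated expression profiles with a single wild outlier, this measure stays close to 1 while Pearson correlation collapses, which is the behaviour the abstract argues for.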
Defining eye-fixation sequences across individuals and tasks: the Binocular-Individual Threshold (BIT) algorithm
We propose a new fully automated velocity-based algorithm to identify fixations from eye-movement records of both eyes, with individual-specific thresholds. The algorithm is based on robust minimum determinant covariance estimators (MDC) and control chart procedures, and is conceptually simple and computationally attractive. To determine fixations, it uses velocity thresholds based on the natural within-fixation variability of both eyes. It improves over existing approaches by automatically identifying fixation thresholds that are specific to (a) both eyes, (b) x- and y-directions, (c) tasks, and (d) individuals. We applied the proposed Binocular-Individual Threshold (BIT) algorithm to two large datasets collected on eye-trackers with different sampling frequencies, and computed descriptive statistics of fixations for larger samples of individuals across a variety of tasks, including reading, scene viewing, and search on supermarket shelves. Our analysis shows that there are considerable differences in the characteristics of fixations not only between these tasks, but also between individuals.
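The core idea of a velocity-based detector with data-driven thresholds can be sketched as follows. This simplified stand-in uses a per-axis median + MAD threshold on one eye's trace instead of the paper's robust MDC estimators and control charts, and all parameter values are illustrative assumptions:

```python
import numpy as np

def detect_fixations(gaze, hz, k=5.0, min_dur=0.06):
    """Velocity-threshold fixation detection with a data-driven threshold.

    gaze: (n, 2) array of x/y positions for one eye; hz: sampling rate in Hz.
    The threshold is median + k * MAD of the per-axis sample-to-sample
    velocities, a rough stand-in for the paper's robust MDC estimators.
    Returns (start, end) index pairs into the velocity trace.
    """
    vel = np.abs(np.diff(gaze, axis=0)) * hz        # per-axis velocity
    med = np.median(vel, axis=0)
    mad = np.median(np.abs(vel - med), axis=0)
    thresh = med + k * mad                          # one threshold per axis
    slow = np.all(vel < thresh, axis=1)             # below threshold on x AND y

    # Group consecutive slow samples into fixations of minimal duration.
    fixations, start = [], None
    for i, s in enumerate(slow):
        if s and start is None:
            start = i
        elif not s and start is not None:
            if (i - start) / hz >= min_dur:
                fixations.append((start, i))
            start = None
    if start is not None and (len(slow) - start) / hz >= min_dur:
        fixations.append((start, len(slow)))
    return fixations
```

The BIT algorithm additionally estimates thresholds jointly over both eyes and both directions, which is what makes them individual- and task-specific.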
Kernel Spectral Clustering and applications
In this chapter we review the main literature related to kernel spectral
clustering (KSC), an approach to clustering cast within a kernel-based
optimization setting. KSC represents a least-squares support vector machine
based formulation of spectral clustering described by a weighted kernel PCA
objective. Just as in the classifier case, the binary clustering model is
expressed by a hyperplane in a high dimensional space induced by a kernel. In
addition, the multi-way clustering can be obtained by combining a set of binary
decision functions via an Error Correcting Output Codes (ECOC) encoding scheme.
Because of its model-based nature, the KSC method encompasses three main steps:
training, validation, and testing. In the validation stage, model selection is
performed to obtain tuning parameters, like the number of clusters present in
the data. This is a major advantage compared to classical spectral clustering
where the determination of the clustering parameters is unclear and relies on
heuristics. Once a KSC model is trained on a small subset of the entire data,
it is able to generalize well to unseen test points. Beyond the basic
formulation, sparse KSC algorithms based on the Incomplete Cholesky
Decomposition (ICD) and L0, L1, L0 + L1, and Group Lasso regularization are
reviewed. In that respect, we show how it is possible to handle large scale
data. Also, two possible ways to perform hierarchical clustering and a soft
clustering method are presented. Finally, real-world applications such as image
segmentation, power load time-series clustering, document clustering and big
data learning are considered.
Comment: chapter contribution to the book "Unsupervised Learning Algorithms".
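The binary clustering model described above assigns clusters by the sign of a score variable obtained from a weighted kernel PCA eigenproblem. A drastically simplified, classical spectral-clustering stand-in (no LS-SVM training/validation split, no out-of-sample extension) might look like this; the RBF kernel and the sign rule are the only ingredients taken from the text:

```python
import numpy as np

def binary_spectral_cluster(X, sigma=1.0):
    """Binary clustering from the second eigenvector of the normalized kernel."""
    # RBF (Gaussian) kernel matrix
    sq = np.sum(X**2, axis=1)
    K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / (2 * sigma**2))
    d = K.sum(axis=1)                          # node degrees
    M = K / np.sqrt(np.outer(d, d))            # symmetric normalization
    _, vecs = np.linalg.eigh(M)                # eigenvalues in ascending order
    alpha = vecs[:, -2]                        # skip the trivial top eigenvector
    return (alpha > 0).astype(int)             # sign pattern = cluster indicator
```

Real KSC trains on a small subsample and extends to unseen points through the learned decision function, whereas this sketch decomposes the full kernel matrix; multi-way clustering would combine several such binary sign patterns via an ECOC scheme.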
Analytical Quality Assessment of Iteratively Reweighted Least-Squares (IRLS) Method
The iteratively reweighted least-squares (IRLS) technique has been widely employed in the geodetic and geophysical literature. Reliability measures are important diagnostic tools for inferring the strength of the model validation. An exact analytical method is adopted to obtain insights into how much iterative reweighting can affect the quality indicators. Theoretical analyses and numerical results show that, when the downweighting procedure is performed, (1) the precision, all kinds of dilution of precision (DOP) metrics, and the minimal detectable bias (MDB) become larger; (2) variations of the bias-to-noise ratio (BNR) are involved; and (3) all these results coincide with those obtained by the first-order approximation method.
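The downweighting procedure whose effect the paper quantifies can be illustrated with a generic IRLS robust-regression loop. The Huber weight function and the MAD-based scale estimate below are common choices, not necessarily those analyzed in the paper:

```python
import numpy as np

def irls(A, y, c=1.345, iters=20):
    """Iteratively reweighted least squares with Huber weights.

    Observations with large standardized residuals are downweighted on each
    pass; this is the reweighting step that inflates precision/DOP/MDB
    measures relative to ordinary least squares.
    """
    x = np.linalg.lstsq(A, y, rcond=None)[0]          # ordinary LS start
    for _ in range(iters):
        r = y - A @ x
        scale = np.median(np.abs(r)) / 0.6745 + 1e-12  # robust sigma estimate
        u = np.abs(r) / scale
        w = np.where(u <= c, 1.0, c / u)               # Huber weight function
        sw = np.sqrt(w)
        x = np.linalg.lstsq(sw[:, None] * A, sw * y, rcond=None)[0]
    return x
```

On a straight-line fit contaminated by one gross outlier, the loop recovers the clean parameters that ordinary least squares misses.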
Multi-Class Clustering of Cancer Subtypes through SVM Based Ensemble of Pareto-Optimal Solutions for Gene Marker Identification
With the advancement of microarray technology, it is now possible to study the expression profiles of thousands of genes across different experimental conditions or tissue samples simultaneously. Microarray cancer datasets, organized in a samples-versus-genes fashion, are being used for classification of tissue samples into benign and malignant or their subtypes. They are also useful for identifying potential gene markers for each cancer subtype, which helps in the successful diagnosis of particular cancer types. In this article, we have presented an unsupervised cancer classification technique based on multiobjective genetic clustering of the tissue samples. In this regard, a real-coded encoding of the cluster centers is used, and cluster compactness and separation are simultaneously optimized. The resultant set of near-Pareto-optimal solutions contains a number of non-dominated solutions. A novel approach to combine the clustering information possessed by the non-dominated solutions through a Support Vector Machine (SVM) classifier has been proposed. Final clustering is obtained by consensus among the clusterings yielded by different kernel functions. The performance of the proposed multiobjective clustering method has been compared with that of several other microarray clustering algorithms on three publicly available benchmark cancer datasets. Moreover, statistical significance tests have been conducted to establish the statistical superiority of the proposed clustering method. Furthermore, relevant gene markers have been identified using the clustering result produced by the proposed clustering method and demonstrated visually. Biological relationships among the gene markers are also studied based on gene ontology. The results obtained are found to be promising and can have an important impact in the area of unsupervised cancer classification as well as gene marker identification for multiple cancer subtypes.
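The final consensus step can be illustrated with a much simpler stand-in than the paper's SVM-based combination: align the label names of each clustering to a reference partition, then take a per-sample majority vote. The greedy alignment below is an illustrative assumption, not the authors' procedure:

```python
import numpy as np

def align_labels(ref, lab, k):
    """Greedily relabel `lab` (values in 0..k-1) to best match `ref`."""
    conf = np.zeros((k, k), dtype=int)
    for r, l in zip(ref, lab):
        conf[l, r] += 1                       # co-occurrence counts
    mapping = {}
    for l in np.argsort(-conf.max(axis=1)):   # most confident clusters first
        r = int(np.argmax(conf[l]))
        mapping[int(l)] = r
        conf[:, r] = -1                       # target label is now taken
    return np.array([mapping[int(l)] for l in lab])

def consensus(clusterings, k):
    """Majority vote over label-aligned clusterings (a simplified stand-in
    for the SVM-based combination of non-dominated solutions)."""
    ref = clusterings[0]
    aligned = [ref] + [align_labels(ref, c, k) for c in clusterings[1:]]
    votes = np.stack(aligned)
    return np.array([np.bincount(votes[:, i], minlength=k).argmax()
                     for i in range(votes.shape[1])])
```

Because cluster labels are arbitrary, the alignment step is essential: two identical partitions with swapped label names would otherwise cancel each other out in the vote.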