Respect My Authority! HITS Without Hyperlinks, Utilizing Cluster-Based Language Models
We present an approach to improving the precision of an initial document
ranking wherein we utilize cluster information within a graph-based framework.
The main idea is to perform re-ranking based on centrality within bipartite
graphs of documents (on one side) and clusters (on the other side), on the
premise that these are mutually reinforcing entities. Links between entities
are created via consideration of language models induced from them.
We find that our cluster-document graphs give rise to much better retrieval
performance than previously proposed document-only graphs do. For example,
authority-based re-ranking of documents via a HITS-style cluster-based approach
outperforms a previously-proposed PageRank-inspired algorithm applied to
solely-document graphs. Moreover, we also show that computing authority scores
for clusters constitutes an effective method for identifying clusters
containing a large percentage of relevant documents.
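The mutual reinforcement between clusters and documents described above can be illustrated with a minimal HITS-style iteration on a bipartite graph. This is only a sketch under assumptions: the function name `bipartite_hits` and the weight matrix are hypothetical, clusters are treated as hubs and documents as authorities (one possible configuration), and the language-model-based construction of edge weights is not shown.

```python
def bipartite_hits(weights, iterations=50):
    """HITS-style scores on a bipartite cluster-document graph.

    weights[c][d] is a non-negative edge weight between cluster c and
    document d (hypothetical input; the source derives links from
    language models, which is not reproduced here).
    Returns (hub scores for clusters, authority scores for documents).
    """
    n_clusters, n_docs = len(weights), len(weights[0])
    hub = [1.0] * n_clusters
    auth = [1.0] * n_docs
    for _ in range(iterations):
        # A document is authoritative if strong clusters point to it.
        auth = [sum(weights[c][d] * hub[c] for c in range(n_clusters))
                for d in range(n_docs)]
        norm = sum(a * a for a in auth) ** 0.5 or 1.0
        auth = [a / norm for a in auth]
        # A cluster is a good hub if it points to authoritative documents.
        hub = [sum(weights[c][d] * auth[d] for d in range(n_docs))
               for c in range(n_clusters)]
        norm = sum(h * h for h in hub) ** 0.5 or 1.0
        hub = [h / norm for h in hub]
    return hub, auth


# Two clusters, three documents; document 1 belongs to both clusters.
hub, auth = bipartite_hits([[1.0, 1.0, 0.0],
                            [0.0, 1.0, 1.0]])
ranking = sorted(range(len(auth)), key=lambda d: auth[d], reverse=True)
```

Re-ranking then amounts to sorting the initially retrieved documents by `auth`; in this toy graph the document shared by both clusters comes first.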
How Many Workers to Ask? Adaptive Exploration for Collecting High Quality Labels
Crowdsourcing has been part of the IR toolbox as a cheap and fast mechanism
to obtain labels for system development and evaluation. Successful deployment
of crowdsourcing at scale involves adjusting many variables, a very important
one being the number of workers needed per human intelligence task (HIT). We
consider the crowdsourcing task of learning the answer to simple
multiple-choice HITs, which are representative of many relevance experiments.
In order to provide statistically significant results, one often needs to ask
multiple workers to answer the same HIT. A stopping rule is an algorithm that,
given a HIT, decides for any given set of worker answers if the system should
stop and output an answer or iterate and ask one more worker. Knowing the
historic performance of a worker in the form of a quality score can be
beneficial in such a scenario. In this paper we investigate how to devise
better stopping rules given such quality scores. We also suggest adaptive
exploration as a promising approach for scalable and automatic creation of
ground truth. We conduct a data analysis on an industrial crowdsourcing
platform, and use the observations from this analysis to design new stopping
rules that use the workers' quality scores in a non-trivial manner. We then
perform a simulation based on a real-world workload, showing that our algorithm
performs better than the more naive approaches. Comment: SIGIR 201
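To make the idea of a stopping rule concrete, here is a minimal sketch that uses worker quality scores. It is not the rule proposed in the paper. It assumes (hypothetically) that a quality score is the probability that the worker answers correctly, that errors are spread uniformly over the other options, and that the prior over options is uniform. The names `posterior`, `should_stop`, `confidence`, and `max_workers` are illustrative.

```python
import math


def posterior(answers, options):
    """Posterior over options given (chosen_option, worker_quality) pairs.

    Assumes each quality lies strictly between 0 and 1 and that a wrong
    answer is equally likely to be any of the other options.
    """
    k = len(options)
    log_like = {o: 0.0 for o in options}
    for chosen, quality in answers:
        for o in options:
            p = quality if o == chosen else (1.0 - quality) / (k - 1)
            log_like[o] += math.log(p)
    top = max(log_like.values())
    unnorm = {o: math.exp(v - top) for o, v in log_like.items()}
    total = sum(unnorm.values())
    return {o: v / total for o, v in unnorm.items()}


def should_stop(answers, options, confidence=0.95, max_workers=10):
    """Return (stop, answer): stop once one option is confident enough
    or the worker budget for this HIT is exhausted."""
    if not answers:
        return False, None
    post = posterior(answers, options)
    best = max(post, key=post.get)
    if post[best] >= confidence or len(answers) >= max_workers:
        return True, best
    return False, None


# One good worker is not enough at 95% confidence; two agreeing ones are.
print(should_stop([("a", 0.9)], ["a", "b"]))
print(should_stop([("a", 0.9), ("a", 0.9)], ["a", "b"]))
```

A system would call `should_stop` after each new answer and request another worker while it returns `False`.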
Benchmarking Relief-Based Feature Selection Methods for Bioinformatics Data Mining
Modern biomedical data mining requires feature selection methods that can (1)
be applied to large scale feature spaces (e.g. `omics' data), (2) function in
noisy problems, (3) detect complex patterns of association (e.g. gene-gene
interactions), (4) be flexibly adapted to various problem domains and data
types (e.g. genetic variants, gene expression, and clinical data) and (5) be
computationally tractable. To that end, this work examines a set of
filter-style feature selection algorithms inspired by the `Relief' algorithm,
i.e. Relief-Based algorithms (RBAs). We implement and expand these RBAs in an
open source framework called ReBATE (Relief-Based Algorithm Training
Environment). We apply a comprehensive genetic simulation study comparing
existing RBAs, a proposed RBA called MultiSURF, and other established feature
selection methods, over a variety of problems. The results of this study (1)
support the assertion that RBAs are particularly flexible, efficient, and
powerful feature selection methods that differentiate relevant features having
univariate, multivariate, epistatic, or heterogeneous associations, (2) confirm
the efficacy of expansions for classification vs. regression, discrete vs.
continuous features, missing data, multiple classes, or class imbalance, (3)
identify previously unknown limitations of specific RBAs, and (4) suggest that
while MultiSURF* performs best for explicitly identifying pure 2-way
interactions, MultiSURF yields the most reliable feature selection performance
across a wide range of problem types. Comment: Revised submission to JB
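For readers unfamiliar with the family, the core idea behind Relief can be sketched in a few lines. The code below is the original nearest-hit / nearest-miss Relief for two classes and numeric features. It is not MultiSURF and it does not use the ReBATE API. The function name `relief` and the toy data are assumptions made for illustration.

```python
def relief(X, y):
    """Basic Relief feature weights for a two-class problem.

    X is a list of numeric feature vectors, y the class labels.
    A feature gains weight when it differs on the nearest miss
    (other class) and loses weight when it differs on the nearest
    hit (same class).
    """
    n, p = len(X), len(X[0])
    # Scale each feature difference by the feature's observed range.
    ranges = [(max(r[f] for r in X) - min(r[f] for r in X)) or 1.0
              for f in range(p)]

    def diff(f, a, b):
        return abs(a[f] - b[f]) / ranges[f]

    def dist(a, b):
        return sum(diff(f, a, b) for f in range(p))

    weights = [0.0] * p
    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        if not hits or not misses:
            continue
        hit = min(hits, key=lambda j: dist(X[i], X[j]))
        miss = min(misses, key=lambda j: dist(X[i], X[j]))
        for f in range(p):
            weights[f] += (diff(f, X[i], X[miss]) - diff(f, X[i], X[hit])) / n
    return weights


# Feature 0 separates the classes; feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.9, 0.2], [1.0, 0.8]]
y = [0, 0, 1, 1]
weights = relief(X, y)
```

Features are then ranked by weight; on this toy data the informative feature receives a positive weight and the noisy one a negative weight.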
Evaluation of E-Learners Behaviour using Different Fuzzy Clustering Models: A Comparative Study
This paper introduces an evaluation methodology for e-learners'
behaviour that provides feedback to the decision makers in an e-learning system.
The learner's profile plays a crucial role in the evaluation process to improve the
e-learning process performance. The work focuses on the clustering of the
e-learners based on their behaviour into specific categories that represent the
learners' profiles. The learners' classes are named regular, workers, casual,
bad, and absent. The work may answer the question of how to return bad students
to being regular ones. The work presents the use of different fuzzy clustering
techniques, such as fuzzy c-means (FCM) and kernelized fuzzy c-means (KFCM), to
find the learners' categories and predict their profiles. The paper presents the
main phases: data description, preparation, feature selection, and experiment design
using different fuzzy clustering models. Analysis of the obtained results and
comparison with the real-world behaviour of those learners showed a
match of 78%. Fuzzy clustering reflects the learners'
behaviour better than crisp clustering. A comparison between FCM and KFCM showed
that KFCM is much better than FCM at predicting the learners' behaviour. Comment: Pages IEEE format, International Journal of Computer Science and
Information Security, IJCSIS, Vol. 7 No. 2, February 2010, USA. ISSN 1947
5500, http://sites.google.com/site/ijcsis
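As a rough illustration of what fuzzy clustering of learners involves, the following is a minimal sketch of plain fuzzy c-means (not the kernelized variant). The function name, the fuzzifier `m=2.0`, the fixed iteration count, and the one-dimensional toy data are all assumptions. The paper's actual features and settings are not given in the abstract.

```python
def fuzzy_c_means(points, centers, m=2.0, iterations=100, eps=1e-9):
    """Plain fuzzy c-means.

    points:  list of feature vectors
    centers: initial cluster centres (same dimensionality as points)
    m:       fuzzifier (> 1); larger values give softer memberships
    Returns (centers, U) where U[j][i] is the membership of point j in
    cluster i, as computed in the final iteration.
    """
    c, dims = len(centers), len(points[0])
    U = []
    for _ in range(iterations):
        # Membership update: closer centres get higher membership.
        U = []
        for x in points:
            d = [max(sum((a - b) ** 2 for a, b in zip(x, ctr)) ** 0.5, eps)
                 for ctr in centers]
            U.append([1.0 / sum((d[i] / d[k]) ** (2.0 / (m - 1.0))
                                for k in range(c))
                      for i in range(c)])
        # Centre update: membership-weighted mean of the points.
        new_centers = []
        for i in range(c):
            w = [U[j][i] ** m for j in range(len(points))]
            total = sum(w)
            new_centers.append([sum(w[j] * points[j][k]
                                    for j in range(len(points))) / total
                                for k in range(dims)])
        centers = new_centers
    return centers, U


# Two well-separated groups on a single (hypothetical) activity feature.
points = [[0.0], [0.2], [0.1], [5.0], [5.2], [5.1]]
centers, U = fuzzy_c_means(points, centers=[[0.0], [5.0]])
```

Unlike crisp clustering, each learner keeps a degree of membership in every category, so borderline cases remain visible in `U`.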
Satyam: Democratizing Groundtruth for Machine Vision
The democratization of machine learning (ML) has led to ML-based machine
vision systems for autonomous driving, traffic monitoring, and video
surveillance. However, true democratization cannot be achieved without greatly
simplifying the process of collecting groundtruth for training and testing
these systems. This groundtruth collection is necessary to ensure good
performance under varying conditions. In this paper, we present the design and
evaluation of Satyam, a first-of-its-kind system that enables a layperson to
launch groundtruth collection tasks for machine vision with minimal effort.
Satyam leverages a crowdtasking platform, Amazon Mechanical Turk, and automates
several challenging aspects of groundtruth collection: creating and launching
of custom web-UI tasks for obtaining the desired groundtruth, controlling
result quality in the face of spammers and untrained workers, adapting prices
to match task complexity, filtering spammers and workers with poor performance,
and processing worker payments. We validate Satyam using several popular
benchmark vision datasets, and demonstrate that groundtruth obtained by Satyam
is comparable to that obtained from trained experts and provides matching ML
performance when used for training.
An effective algorithm for hyperparameter optimization of neural networks
A major challenge in designing neural network (NN) systems is to determine
the best structure and parameters for the network given the data for the
machine learning problem at hand. Examples of parameters are the number of
layers and nodes, the learning rates, and the dropout rates. Typically, these
parameters are chosen based on heuristic rules and manually fine-tuned, which
may be very time-consuming, because evaluating the performance of a single
parametrization of the NN may require several hours. This paper addresses the
problem of choosing appropriate parameters for the NN by formulating it as a
box-constrained mathematical optimization problem, and applying a
derivative-free optimization tool that automatically and effectively searches
the parameter space. The optimization tool employs a radial basis function
model of the objective function (the prediction accuracy of the NN) to
accelerate the discovery of configurations yielding high accuracy. Candidate
configurations explored by the algorithm are trained to a small number of
epochs, and only the most promising candidates receive full training. The
performance of the proposed methodology is assessed on benchmark sets and in
the context of predicting drug-drug interactions, showing promising results.
The optimization tool used in this paper is open-source.
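The two-stage evaluation mentioned above (short training for every candidate, full training only for the most promising) can be sketched as follows. This illustrates only that screening step. It does not implement the radial basis function surrogate or the derivative-free search. `screen_then_train`, `toy_train`, and all numeric values are hypothetical stand-ins.

```python
def screen_then_train(candidates, train, short_epochs=2, full_epochs=20, keep=1):
    """Train every candidate briefly, then fully train only the best `keep`.

    train(config, epochs) must return a validation accuracy.
    Returns (accuracy, config) for the best fully trained candidate.
    """
    # Cheap screening pass over all candidate configurations.
    screened = sorted(candidates,
                      key=lambda cfg: train(cfg, short_epochs),
                      reverse=True)
    finalists = screened[:keep]
    # Expensive pass only for the finalists.
    results = [(train(cfg, full_epochs), cfg) for cfg in finalists]
    return max(results, key=lambda r: r[0])


def toy_train(cfg, epochs):
    """Stand-in for real network training: accuracy peaks at lr = 0.1
    and improves with more epochs. Purely illustrative."""
    return (1.0 - abs(cfg["lr"] - 0.1)) * (1.0 - 1.0 / (epochs + 1))


candidates = [{"lr": 0.01}, {"lr": 0.1}, {"lr": 0.5}]
best_accuracy, best_config = screen_then_train(candidates, toy_train)
```

In a real setting `train` would dominate the cost, which is why limiting full training to a few finalists matters.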
The GRIFFIN Data Acquisition System
Gamma-Ray Infrastructure For Fundamental Investigations of Nuclei, GRIFFIN,
is a new experimental facility for radioactive decay studies at the TRIUMF-ISAC
laboratory. This article describes the details of the custom designed GRIFFIN
digital data acquisition system. The features of the system that will enable
high-precision half-life and branching ratio measurements with levels of
uncertainty better than 0.05% are described. The system has demonstrated the
ability to effectively collect signals from high-purity germanium crystals at
counting rates up to 50 kHz while maintaining good energy resolution, detection
efficiency and spectral quality.
BiRank: Towards Ranking on Bipartite Graphs
The bipartite graph is a ubiquitous data structure that can model the
relationship between two entity types: for instance, users and items, queries
and webpages. In this paper, we study the problem of ranking vertices of a
bipartite graph, based on the graph's link structure as well as prior
information about vertices (which we term a query vector). We present a new
solution, BiRank, which iteratively assigns scores to vertices and finally
converges to a unique stationary ranking. In contrast to the traditional random
walk-based methods, BiRank iterates towards optimizing a regularization
function, which smooths the graph under the guidance of the query vector.
Importantly, we establish how BiRank relates to the Bayesian methodology,
enabling the future extension in a probabilistic way. To show the rationale and
extendability of the ranking methodology, we further extend it to rank for the
more generic n-partite graphs. BiRank's generic modeling of both the graph
structure and vertex features enables it to model various ranking hypotheses
flexibly. To illustrate its functionality, we apply the BiRank and TriRank
(ranking for tripartite graphs) algorithms to two real-world applications: a
general ranking scenario that predicts the future popularity of items, and a
personalized ranking scenario that recommends items of interest to users.
Extensive experiments on both synthetic and real-world datasets demonstrate
BiRank's soundness (fast convergence), efficiency (linear in the number of
graph edges) and effectiveness (achieving state-of-the-art in the two
real-world tasks). Comment: 15 pages, 8 figures
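A minimal sketch of an iterative bipartite ranking in this spirit is given below. The abstract does not state the update rule, so the code assumes a symmetrically normalised propagation step blended with the query vectors, which is how BiRank is commonly described. The function name, the damping values `alpha` and `beta`, and the toy graph are assumptions.

```python
def birank(W, u0, p0, alpha=0.85, beta=0.85, iterations=100):
    """Iterative ranking on a bipartite graph (sketch).

    W[i][j]: edge weight between left vertex i (e.g. a user) and
             right vertex j (e.g. an item).
    u0, p0:  prior scores (query vectors) for left and right vertices.
    Each side's score mixes propagated scores from the other side with
    its own prior.
    """
    n_u, n_p = len(W), len(W[0])
    du = [sum(row) or 1.0 for row in W]
    dp = [sum(W[i][j] for i in range(n_u)) or 1.0 for j in range(n_p)]
    # Symmetric normalisation: S = Du^(-1/2) W Dp^(-1/2).
    S = [[W[i][j] / (du[i] ** 0.5 * dp[j] ** 0.5) for j in range(n_p)]
         for i in range(n_u)]
    u, p = list(u0), list(p0)
    for _ in range(iterations):
        p = [alpha * sum(S[i][j] * u[i] for i in range(n_u))
             + (1 - alpha) * p0[j] for j in range(n_p)]
        u = [beta * sum(S[i][j] * p[j] for j in range(n_p))
             + (1 - beta) * u0[i] for i in range(n_u)]
    return u, p


# Three users, two items; item 0 is linked to all users, item 1 to one.
W = [[1.0, 1.0],
     [1.0, 0.0],
     [1.0, 0.0]]
user_scores, item_scores = birank(W, u0=[1 / 3] * 3, p0=[0.5, 0.5])
```

With uniform priors the better-connected item ranks higher; a non-uniform `u0` or `p0` would personalise the ranking.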
Analysis of the flooding search algorithm with OPNET
In this work, we consider the popular OPNET simulator as a tool for
performance evaluation of algorithms operating in peer-to-peer (P2P) networks.
We created a simple framework and used it to analyse the flooding search
algorithm, which is a popular technique for searching files in an unstructured
P2P network. We investigated the influence of the number of replicas and the time
to live (TTL) of search queries on the algorithm's performance. While preparing the
simulation, we did not encounter the problems that commonly arise in
P2P-dedicated simulators, although the size of the simulated network was limited.
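The flooding search itself is easy to sketch outside any simulator. The code below is a plain breadth-first flood with a TTL and has nothing to do with OPNET's API. The function name, the adjacency-list representation, and the message-counting convention (a peer forwards to all neighbours, including the one it received the query from, and duplicates are dropped on arrival) are simplifying assumptions.

```python
from collections import deque


def flood_search(neighbors, holders, source, ttl):
    """Flood a query from `source` through an unstructured overlay.

    neighbors: dict mapping each peer to a list of adjacent peers
    holders:   set of peers that store a replica of the wanted file
    ttl:       maximum number of hops the query may travel
    Returns (set of holders reached, number of query messages sent).
    """
    visited = {source}
    found = {source} if source in holders else set()
    frontier = deque([(source, ttl)])
    messages = 0
    while frontier:
        node, remaining = frontier.popleft()
        if remaining == 0:
            continue  # TTL expired: do not forward further
        for peer in neighbors[node]:
            messages += 1
            if peer in visited:
                continue  # duplicate query is dropped by the receiver
            visited.add(peer)
            if peer in holders:
                found.add(peer)
            frontier.append((peer, remaining - 1))
    return found, messages


# A chain of four peers with the only replica at the far end.
overlay = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(flood_search(overlay, holders={3}, source=0, ttl=2))
print(flood_search(overlay, holders={3}, source=0, ttl=3))
```

Raising the TTL or the number of replicas increases the chance of a hit, at the cost of more messages, which is the trade-off the study examines.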
PageRank without hyperlinks: Structural re-ranking using links induced by language models
Inspired by the PageRank and HITS (hubs and authorities) algorithms for Web
search, we propose a structural re-ranking approach to ad hoc information
retrieval: we reorder the documents in an initially retrieved set by exploiting
asymmetric relationships between them. Specifically, we consider generation
links, which indicate that the language model induced from one document assigns
high probability to the text of another; in doing so, we take care to prevent
bias against long documents. We study a number of re-ranking criteria based on
measures of centrality in the graphs formed by generation links, and show that
integrating centrality into standard language-model-based retrieval is quite
effective at improving precision at top ranks.
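One of the centrality measures that such an approach can draw on is PageRank computed over weighted links. The sketch below is a standard damped power iteration on a weighted directed graph. How generation links are derived from language models, and how centrality is combined with the initial retrieval score, is not shown. The function name, the damping factor, and the toy link matrix are assumptions.

```python
def pagerank(weights, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration.

    weights[i][j] is the weight of a link from document i to document j
    (here standing in for a generation link). Documents without outgoing
    links spread their score uniformly over all documents.
    """
    n = len(weights)
    out = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new = [(1.0 - damping) / n] * n
        for i in range(n):
            for j in range(n):
                share = weights[i][j] / out[i] if out[i] else 1.0 / n
                new[j] += damping * scores[i] * share
        scores = new
    return scores


# Documents 0 and 2 both link to document 1; document 1 links to 2.
links = [[0.0, 1.0, 0.0],
         [0.0, 0.0, 1.0],
         [0.0, 1.0, 0.0]]
centrality = pagerank(links)
reranked = sorted(range(len(centrality)), key=lambda d: centrality[d], reverse=True)
```

Documents that many others link to end up most central, and sorting by `centrality` gives a structural re-ranking of the initial list.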