    Respect My Authority! HITS Without Hyperlinks, Utilizing Cluster-Based Language Models

    We present an approach to improving the precision of an initial document ranking wherein we utilize cluster information within a graph-based framework. The main idea is to perform re-ranking based on centrality within bipartite graphs of documents (on one side) and clusters (on the other side), on the premise that these are mutually reinforcing entities. Links between entities are created via consideration of language models induced from them. We find that our cluster-document graphs give rise to much better retrieval performance than previously proposed document-only graphs do. For example, authority-based re-ranking of documents via a HITS-style cluster-based approach outperforms a previously-proposed PageRank-inspired algorithm applied to solely-document graphs. Moreover, we also show that computing authority scores for clusters constitutes an effective method for identifying clusters containing a large percentage of relevant documents
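
The mutual reinforcement between clusters and documents described above can be illustrated with a HITS-style power iteration on a weighted bipartite graph. This is only a minimal sketch, not the authors' implementation: the function name `bipartite_hits`, the choice of clusters as hubs and documents as authorities, and the toy edge weights are all assumptions made for illustration (in the paper, edge weights come from language models).

```python
def bipartite_hits(edges, n_iter=50):
    """HITS-style scores on a bipartite cluster-document graph.

    edges: dict mapping (cluster, doc) -> non-negative weight (hypothetical input).
    Returns (hub, auth): hub scores for clusters, authority scores for documents.
    """
    clusters = sorted({c for c, _ in edges})
    docs = sorted({d for _, d in edges})
    hub = {c: 1.0 for c in clusters}
    auth = {d: 1.0 for d in docs}
    for _ in range(n_iter):
        # A document's authority is the weighted sum of the hub scores
        # of the clusters linked to it.
        auth = {d: 0.0 for d in docs}
        for (c, d), w in edges.items():
            auth[d] += w * hub[c]
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {d: v / norm for d, v in auth.items()}
        # A cluster's hub score is the weighted sum of the authorities
        # of the documents it links to.
        hub = {c: 0.0 for c in clusters}
        for (c, d), w in edges.items():
            hub[c] += w * auth[d]
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {c: v / norm for c, v in hub.items()}
    return hub, auth


# Toy graph: document d2 is linked from both clusters, d1 from only one.
toy_edges = {("c1", "d1"): 1.0, ("c1", "d2"): 1.0, ("c2", "d2"): 1.0}
hub_scores, auth_scores = bipartite_hits(toy_edges)
reranked = sorted(auth_scores, key=auth_scores.get, reverse=True)
```

Re-ranking then amounts to sorting the initially retrieved documents by authority score; swapping the roles of the two sides would yield authority scores for clusters instead.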

    How Many Workers to Ask? Adaptive Exploration for Collecting High Quality Labels

    Crowdsourcing has been part of the IR toolbox as a cheap and fast mechanism to obtain labels for system development and evaluation. Successful deployment of crowdsourcing at scale involves adjusting many variables, a very important one being the number of workers needed per human intelligence task (HIT). We consider the crowdsourcing task of learning the answer to simple multiple-choice HITs, which are representative of many relevance experiments. In order to provide statistically significant results, one often needs to ask multiple workers to answer the same HIT. A stopping rule is an algorithm that, given a HIT, decides for any given set of worker answers if the system should stop and output an answer or iterate and ask one more worker. Knowing the historic performance of a worker in the form of a quality score can be beneficial in such a scenario. In this paper we investigate how to devise better stopping rules given such quality scores. We also suggest adaptive exploration as a promising approach for scalable and automatic creation of ground truth. We conduct a data analysis on an industrial crowdsourcing platform, and use the observations from this analysis to design new stopping rules that use the workers' quality scores in a non-trivial manner. We then perform a simulation based on a real-world workload, showing that our algorithm performs better than the more naive approaches. Comment: SIGIR 201
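
To make the notion of a stopping rule concrete, here is a deliberately simple sketch that weights each answer by the worker's quality score and stops once the leading answer is far enough ahead. It is not the rule proposed in the paper; the function name `should_stop`, the `margin` threshold, and the `max_workers` cap are hypothetical choices for illustration.

```python
from collections import defaultdict


def should_stop(answers, margin=1.5, max_workers=7):
    """Decide whether to stop asking workers about one HIT.

    answers: list of (label, quality) pairs, quality assumed to lie in [0, 1].
    Returns (stop, best_label); best_label is None when there are no answers.
    """
    totals = defaultdict(float)
    for label, quality in answers:
        totals[label] += quality  # quality-weighted vote
    if not totals:
        return False, None
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    lead = ranked[0][1] - runner_up
    # Stop when the leader is ahead by at least `margin`,
    # or when the worker budget for this HIT is exhausted.
    stop = lead >= margin or len(answers) >= max_workers
    return stop, ranked[0][0]


one_vote = should_stop([("A", 0.9)])
two_agree = should_stop([("A", 0.9), ("A", 0.8)])
disagree = should_stop([("A", 0.9), ("B", 0.8), ("A", 0.9)])
```

In this sketch, two agreeing high-quality workers are enough to stop, while a dissenting answer keeps the HIT open for another worker.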

    Benchmarking Relief-Based Feature Selection Methods for Bioinformatics Data Mining

    Modern biomedical data mining requires feature selection methods that can (1) be applied to large scale feature spaces (e.g. 'omics' data), (2) function in noisy problems, (3) detect complex patterns of association (e.g. gene-gene interactions), (4) be flexibly adapted to various problem domains and data types (e.g. genetic variants, gene expression, and clinical data) and (5) be computationally tractable. To that end, this work examines a set of filter-style feature selection algorithms inspired by the 'Relief' algorithm, i.e. Relief-Based algorithms (RBAs). We implement and expand these RBAs in an open source framework called ReBATE (Relief-Based Algorithm Training Environment). We apply a comprehensive genetic simulation study comparing existing RBAs, a proposed RBA called MultiSURF, and other established feature selection methods, over a variety of problems. The results of this study (1) support the assertion that RBAs are particularly flexible, efficient, and powerful feature selection methods that differentiate relevant features having univariate, multivariate, epistatic, or heterogeneous associations, (2) confirm the efficacy of expansions for classification vs. regression, discrete vs. continuous features, missing data, multiple classes, or class imbalance, (3) identify previously unknown limitations of specific RBAs, and (4) suggest that while MultiSURF* performs best for explicitly identifying pure 2-way interactions, MultiSURF yields the most reliable feature selection performance across a wide range of problem types. Comment: Revised submission to JB

    Evaluation of E-Learners Behaviour using Different Fuzzy Clustering Models: A Comparative Study

    This paper introduces an evaluation methodology for e-learners' behaviour that provides feedback to the decision makers in an e-learning system. The learner's profile plays a crucial role in the evaluation process for improving the performance of the e-learning process. The work focuses on clustering the e-learners, based on their behaviour, into specific categories that represent the learners' profiles. The learners' classes are named regular, workers, casual, bad, and absent. The work may answer the question of how to turn bad students back into regular ones. The work presents the use of different fuzzy clustering techniques, such as fuzzy c-means (FCM) and kernelized fuzzy c-means (KFCM), to find the learners' categories and predict their profiles. The paper presents the main phases: data description, preparation, feature selection, and the design of experiments using different fuzzy clustering models. Analysis of the obtained results and comparison with the real-world behaviour of those learners showed a match of 78%. Fuzzy clustering reflects the learners' behaviour better than crisp clustering. Comparison between FCM and KFCM showed that KFCM is much better than FCM at predicting the learners' behaviour. Comment: Pages IEEE format, International Journal of Computer Science and Information Security, IJCSIS, Vol. 7 No. 2, February 2010, USA. ISSN 1947 5500, http://sites.google.com/site/ijcsis

    Satyam: Democratizing Groundtruth for Machine Vision

    The democratization of machine learning (ML) has led to ML-based machine vision systems for autonomous driving, traffic monitoring, and video surveillance. However, true democratization cannot be achieved without greatly simplifying the process of collecting groundtruth for training and testing these systems. This groundtruth collection is necessary to ensure good performance under varying conditions. In this paper, we present the design and evaluation of Satyam, a first-of-its-kind system that enables a layperson to launch groundtruth collection tasks for machine vision with minimal effort. Satyam leverages a crowdtasking platform, Amazon Mechanical Turk, and automates several challenging aspects of groundtruth collection: creating and launching of custom web-UI tasks for obtaining the desired groundtruth, controlling result quality in the face of spammers and untrained workers, adapting prices to match task complexity, filtering spammers and workers with poor performance, and processing worker payments. We validate Satyam using several popular benchmark vision datasets, and demonstrate that groundtruth obtained by Satyam is comparable to that obtained from trained experts and provides matching ML performance when used for training

    An effective algorithm for hyperparameter optimization of neural networks

    A major challenge in designing neural network (NN) systems is to determine the best structure and parameters for the network given the data for the machine learning problem at hand. Examples of parameters are the number of layers and nodes, the learning rates, and the dropout rates. Typically, these parameters are chosen based on heuristic rules and manually fine-tuned, which may be very time-consuming, because evaluating the performance of a single parametrization of the NN may require several hours. This paper addresses the problem of choosing appropriate parameters for the NN by formulating it as a box-constrained mathematical optimization problem, and applying a derivative-free optimization tool that automatically and effectively searches the parameter space. The optimization tool employs a radial basis function model of the objective function (the prediction accuracy of the NN) to accelerate the discovery of configurations yielding high accuracy. Candidate configurations explored by the algorithm are trained to a small number of epochs, and only the most promising candidates receive full training. The performance of the proposed methodology is assessed on benchmark sets and in the context of predicting drug-drug interactions, showing promising results. The optimization tool used in this paper is open-source

    The GRIFFIN Data Acquisition System

    Gamma-Ray Infrastructure For Fundamental Investigations of Nuclei, GRIFFIN, is a new experimental facility for radioactive decay studies at the TRIUMF-ISAC laboratory. This article describes the details of the custom designed GRIFFIN digital data acquisition system. The features of the system that will enable high-precision half-life and branching ratio measurements with levels of uncertainty better than 0.05% are described. The system has demonstrated the ability to effectively collect signals from high-purity germanium crystals at counting rates up to 50 kHz while maintaining good energy resolution, detection efficiency and spectral quality.

    BiRank: Towards Ranking on Bipartite Graphs

    The bipartite graph is a ubiquitous data structure that can model the relationship between two entity types: for instance, users and items, queries and webpages. In this paper, we study the problem of ranking vertices of a bipartite graph, based on the graph's link structure as well as prior information about vertices (which we term a query vector). We present a new solution, BiRank, which iteratively assigns scores to vertices and finally converges to a unique stationary ranking. In contrast to the traditional random walk-based methods, BiRank iterates towards optimizing a regularization function, which smooths the graph under the guidance of the query vector. Importantly, we establish how BiRank relates to the Bayesian methodology, enabling the future extension in a probabilistic way. To show the rationale and extendability of the ranking methodology, we further extend it to rank for the more generic n-partite graphs. BiRank's generic modeling of both the graph structure and vertex features enables it to model various ranking hypotheses flexibly. To illustrate its functionality, we apply the BiRank and TriRank (ranking for tripartite graphs) algorithms to two real-world applications: a general ranking scenario that predicts the future popularity of items, and a personalized ranking scenario that recommends items of interest to users. Extensive experiments on both synthetic and real-world datasets demonstrate BiRank's soundness (fast convergence), efficiency (linear in the number of graph edges) and effectiveness (achieving state-of-the-art in the two real-world tasks). Comment: 15 pages, 8 figure

    Analysis of the flooding search algorithm with OPNET

    In this work, we consider the popular OPNET simulator as a tool for performance evaluation of algorithms operating in peer-to-peer (P2P) networks. We created a simple framework and used it to analyse the flooding search algorithm, which is a popular technique for searching for files in an unstructured P2P network. We investigated the influence of the number of replicas and the time to live (TTL) of search queries on the algorithm's performance. In preparing the simulation, we did not encounter the problems that are commonly seen in P2P-dedicated simulators, although the size of the simulated network was limited.
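
The roles of TTL and replica count in flooding search can be seen in a small stand-alone sketch. This is not the OPNET framework from the paper: the function `flood_search`, the adjacency-dict network representation, and the message-counting convention are assumptions for illustration. As a simplification, each peer forwards the query to every neighbour, including the one it came from.

```python
from collections import deque


def flood_search(neighbors, holders, source, ttl):
    """Flood a query from `source` through an unstructured overlay.

    neighbors: dict mapping peer -> list of adjacent peers (hypothetical topology).
    holders: set of peers storing a replica of the requested file.
    Returns (found, messages), where messages counts every forwarded query.
    """
    visited = {source}
    frontier = deque([(source, ttl)])
    messages = 0
    found = source in holders
    while frontier:
        node, remaining = frontier.popleft()
        if remaining == 0:
            continue  # TTL exhausted: this peer does not forward the query
        for peer in neighbors.get(node, ()):
            messages += 1  # one message per forwarded query
            if peer in visited:
                continue  # duplicate query is dropped by the receiver
            visited.add(peer)
            if peer in holders:
                found = True
            frontier.append((peer, remaining - 1))
    return found, messages


# Line topology a - b - c - d with a single replica at d.
line = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
short_ttl = flood_search(line, {"d"}, "a", ttl=2)
long_ttl = flood_search(line, {"d"}, "a", ttl=3)
more_replicas = flood_search(line, {"b", "d"}, "a", ttl=1)
```

A larger TTL reaches more distant replicas at the cost of more messages, while additional replicas let a query with a small TTL succeed.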

    PageRank without hyperlinks: Structural re-ranking using links induced by language models

    Inspired by the PageRank and HITS (hubs and authorities) algorithms for Web search, we propose a structural re-ranking approach to ad hoc information retrieval: we reorder the documents in an initially retrieved set by exploiting asymmetric relationships between them. Specifically, we consider generation links, which indicate that the language model induced from one document assigns high probability to the text of another; in doing so, we take care to prevent bias against long documents. We study a number of re-ranking criteria based on measures of centrality in the graphs formed by generation links, and show that integrating centrality into standard language-model-based retrieval is quite effective at improving precision at top ranks