
    A framework for list representation, enabling list stabilization through incorporation of gene exchangeabilities

    Analysis of multivariate data sets from, e.g., microarray studies frequently results in lists of genes which are associated with some response of interest. The biological interpretation is often complicated by the statistical instability of the obtained gene lists with respect to sampling variations, which may partly be due to the functional redundancy among genes, implying that multiple genes can play exchangeable roles in the cell. In this paper we use the concept of exchangeability of random variables to model this functional redundancy and thereby account for the instability attributable to sampling variations. We present a flexible framework to incorporate the exchangeability into the representation of lists. The proposed framework supports straightforward, robust comparison between any two lists. It can also be used to generate new, more stable gene rankings incorporating more information from the experimental data. Using a microarray data set from lung cancer patients, we show that the proposed method provides more robust gene rankings than existing methods with respect to sampling variations, without compromising the biological significance.

    How Many Topics? Stability Analysis for Topic Models

    Topic modeling refers to the task of discovering the underlying thematic structure in a text corpus, where the output is commonly presented as a report of the top terms appearing in each topic. Despite the diversity of topic modeling algorithms that have been proposed, a common challenge in successfully applying these techniques is the selection of an appropriate number of topics for a given corpus. Choosing too few topics will produce results that are overly broad, while choosing too many will result in the "over-clustering" of a corpus into many small, highly similar topics. In this paper, we propose a term-centric stability analysis strategy to address this issue, the idea being that a model with an appropriate number of topics will be more robust to perturbations in the data. Using a topic modeling approach based on matrix factorization, evaluations performed on a range of corpora show that this strategy can successfully guide the model selection process.
    Comment: Improve readability of plots. Add minor clarification.
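    The stability idea in this abstract can be illustrated with a small sketch. It assumes each topic is represented by its list of top terms, compares a reference model against models fitted on perturbed samples using Jaccard overlap, and matches each reference topic to its best counterpart. The function names and the best-match rule are assumptions made for illustration; the paper's own agreement measure may differ.

```python
def jaccard(terms_a, terms_b):
    """Jaccard overlap between two top-term lists, treated as sets."""
    a, b = set(terms_a), set(terms_b)
    return len(a & b) / len(a | b) if a | b else 1.0


def agreement(ref_topics, other_topics):
    """Mean best-match Jaccard score of each reference topic (assumed matching rule)."""
    best = [max(jaccard(r, o) for o in other_topics) for r in ref_topics]
    return sum(best) / len(best)


def stability(ref_topics, sampled_runs):
    """Average agreement between the reference model and each perturbed run."""
    return sum(agreement(ref_topics, run) for run in sampled_runs) / len(sampled_runs)


if __name__ == "__main__":
    reference = [["gene", "cell"], ["vote", "party"]]
    runs = [
        [["vote", "party"], ["gene", "cell"]],   # same topics, different order
        [["gene", "protein"], ["vote", "party"]],  # one topic drifted
    ]
    print(stability(reference, runs))
```

    Under these assumptions, one would compute this score for each candidate number of topics and prefer the values where it is highest.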

    Stability and aggregation of ranked gene lists

    Ranked gene lists are highly unstable in the sense that similar measures of differential gene expression may yield very different rankings, and that a small change in the data set usually affects the obtained gene list considerably. Stability issues have long been under-considered in the literature, but they have become a hot topic in the last few years, perhaps as a consequence of the increasing skepticism about the reproducibility and clinical applicability of molecular research findings. In this article, we review existing approaches for the assessment of stability of ranked gene lists and the related problem of aggregation, give some practical recommendations, and warn against potential misuse of these methods. This overview is illustrated through an application to a recent leukemia data set using the freely available Bioconductor package GeneSelector.

    Comparison of group recommendation algorithms

    In recent years, recommender systems have become a common tool to handle the information overload problem of educational and informative web sites, content delivery systems, and online shops. Although most recommender systems make suggestions for individual users, in many circumstances the selected items (e.g., movies) are not intended for personal usage but rather for consumption in groups. This paper investigates how effective group recommendations for movies can be generated by combining the group members' preferences (as expressed by ratings) or by combining the group members' recommendations. These two grouping strategies, which convert traditional recommendation algorithms into group recommendation algorithms, are combined with five commonly used recommendation algorithms to calculate group recommendations for different group compositions. The group recommendations are assessed not only in terms of accuracy, but also in terms of other qualitative aspects that are important for users, such as diversity, coverage, and serendipity. In addition, the paper discusses the influence of the size and composition of the group on the quality of the recommendations. The results show that the grouping strategy which produces the most accurate results depends on the algorithm that is used for generating individual recommendations. Therefore, the paper proposes a combination of grouping strategies which outperforms each individual strategy in terms of accuracy. The results also show that the accuracy of the group recommendations increases as the similarity between members of the group increases. The diversity, coverage, and serendipity of the group recommendations likewise depend to a large extent on the grouping strategy and recommendation algorithm used. Consequently, for (commercial) group recommender systems, the grouping strategy and algorithm have to be chosen carefully in order to optimize the desired quality metrics of the group recommendations. The conclusions of this paper can be used as guidelines for this selection process.
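    The two grouping strategies named in this abstract can be contrasted with a minimal sketch. It assumes that combining preferences means averaging the members' ratings per item, and that combining recommendations means merging each member's ranked list with a Borda count. Both aggregation rules and all function names are hypothetical, and the individual recommendation algorithm is left out, so this is an illustration of the distinction rather than the paper's method.

```python
from collections import defaultdict


def combine_preferences(ratings, top_n=2):
    """Average the members' ratings per item, then rank (assumed aggregation).

    ratings: {user: {item: rating}}
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for user_ratings in ratings.values():
        for item, rating in user_ratings.items():
            totals[item] += rating
            counts[item] += 1
    avg = {item: totals[item] / counts[item] for item in totals}
    # Sort by descending average, breaking ties by item name.
    return sorted(avg, key=lambda it: (-avg[it], it))[:top_n]


def combine_recommendations(rec_lists, top_n=2):
    """Merge per-member ranked lists with a Borda count (assumed aggregation).

    rec_lists: {user: [items, best first]}
    """
    scores = defaultdict(int)
    for recs in rec_lists.values():
        for pos, item in enumerate(recs):
            scores[item] += len(recs) - pos
    # Sort by descending Borda score, breaking ties by item name.
    return sorted(scores, key=lambda it: (-scores[it], it))[:top_n]


if __name__ == "__main__":
    ratings = {"u1": {"A": 5, "B": 3, "C": 1}, "u2": {"A": 2, "B": 4, "C": 5}}
    rec_lists = {"u1": ["A", "B", "C"], "u2": ["B", "C", "A"]}
    print(combine_preferences(ratings))
    print(combine_recommendations(rec_lists))
```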

    An LSH Index for Computing Kendall's Tau over Top-k Lists

    We consider the problem of similarity search within a set of top-k lists under the Kendall's Tau distance function. This distance describes how related two rankings are in terms of concordantly and discordantly ordered items. As top-k lists are usually very short compared to the global domain of possible items to be ranked, creating an inverted index to look up overlapping lists is possible but does not capture the similarity measure tightly enough. In this work, we investigate locality sensitive hashing schemes for the Kendall's Tau distance and evaluate the proposed methods using two real-world datasets.
    Comment: 6 pages, 8 subfigures, presented at the Seventeenth International Workshop on the Web and Databases (WebDB 2014) co-located with ACM SIGMOD201
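    To make the distance concrete, here is a minimal sketch of a Kendall's Tau distance between two top-k lists that need not contain the same items. It assumes the penalty-parameter generalization usually attributed to Fagin et al., in which a pair whose relative order cannot be inferred from one of the lists costs `p`. Whether the paper uses exactly this variant is an assumption, and the function name is hypothetical.

```python
from itertools import combinations


def kendall_tau_topk(list_a, list_b, p=0.5):
    """Count discordant item pairs between two top-k lists (best item first)."""
    rank_a = {item: pos for pos, item in enumerate(list_a)}
    rank_b = {item: pos for pos, item in enumerate(list_b)}
    dist = 0.0
    for i, j in combinations(set(list_a) | set(list_b), 2):
        in_a = (i in rank_a, j in rank_a)
        in_b = (i in rank_b, j in rank_b)
        if all(in_a) and all(in_b):
            # Both items ranked in both lists: penalize opposite order.
            if (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j]) < 0:
                dist += 1
        elif all(in_a) and any(in_b):
            # Both in A, one in B: the one present in B is implicitly ahead there.
            present, absent = (i, j) if in_b[0] else (j, i)
            if rank_a[present] > rank_a[absent]:
                dist += 1
        elif all(in_b) and any(in_a):
            # Symmetric case: both in B, one in A.
            present, absent = (i, j) if in_a[0] else (j, i)
            if rank_b[present] > rank_b[absent]:
                dist += 1
        elif all(in_a) or all(in_b):
            # Both in one list, neither in the other: order unknown, cost p.
            dist += p
        else:
            # Each item appears in a different list only: always discordant.
            dist += 1
    return dist


if __name__ == "__main__":
    print(kendall_tau_topk(["x", "y", "z"], ["z", "y", "x"]))
    print(kendall_tau_topk(["a", "b"], ["b", "c"]))
```

    This brute-force version compares every pair and is only meant to show what is being measured; the indexing schemes the abstract describes exist precisely to avoid such exhaustive comparison across many lists.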