6,342 research outputs found

    Named entities as privileged information for hierarchical text clustering

    Get PDF
    Text clustering is a text mining task which is often used to aid the organization, knowledge extraction, and exploratory search of text collections. Nowadays, the automatic text clustering becomes essential as the volume and variety of digital text documents increase, either in social networks and the Web or inside organizations. This paper explores the use of named entities as privileged information in a hierarchical clustering process, so as to improve clusters quality and interpretation. We carried out an experimental evaluation on three text collections (one written in Portuguese and two written in English) and the results show that named entities can be applied as privileged information to power clustering solution in dynamic text collection scenarios.FAPESP (grant #2010/20564-8, #2012/13830-9, #2013/14757-6 and #2013/16039-3

    Privileged information for hierarchical document clustering: a metric learning approach

    Get PDF
    Traditional hierarchical text clustering methods assume that the documents are represented only by “technical information”, i.e., keywords, phrases, expressions and named entities that can be directly extracted from the texts. However, in many scenarios there is an additional and valuable information about the documents which is usually disregarded during the clustering task, such as user-validated tags, annotations and comments from experts, dictionaries and domain ontologies. Recently, Vapnik introduced a new learning paradigm, called LUPI - Learning Using Privileged Information, which allows the incorporation of this additional (privileged) information in a supervised learning setting. We investigated the incorporation of privileged information in unsupervised setting. The key idea in our proposed approach is to extract important relationships among documents represented in the privileged information dimensional space to learn a more accurate metric for text clustering in the technical information space. A thorough experimental evaluation indicates that the incorporation of privileged information through metric learning significantly improves the hierarchical clustering accuracy.São Paulo Research Foundation (FAPESP) (grants 2010/20564-8, 2011/17366-2, 2011/19850-9, 2012/13830-9, 2013/16039-3, 2013/22547-1)PROPP/UFMSCAPESCNP

    Combining privileged information to improve context-aware recommender systems

    Get PDF
    A recommender system is an information filtering technology which can be used to predict preference ratings of items (products, services, movies, etc) and/or to output a ranking of items that are likely to be of interest to the user. Context-aware recommender systems (CARS) learn and predict the tastes and preferences of users by incorporating available contextual information in the recommendation process. One of the major challenges in context-aware recommender systems research is the lack of automatic methods to obtain contextual information for these systems. Considering this scenario, in this paper, we propose to use contextual information from topic hierarchies of the items (web pages) to improve the performance of context-aware recommender systems. The topic hierarchies are constructed by an extension of the LUPI-based Incremental Hierarchical Clustering method that considers three types of information: traditional bag-of-words (technical information), and the combination of named entities (privileged information I) with domain terms (privileged information II). We evaluated the contextual information in four context-aware recommender systems. Different weights were assigned to each type of information. The empirical results demonstrated that topic hierarchies with the combination of the two kinds of privileged information can provide better recommendations.FAPESP (grant #2010/20564-8, #2012/13830-9, and #2013/16039-3, São Paulo Research Foundation (FAPESP))CAPE

    Applying multi-view based metadata in personalized ranking for recommender systems

    Get PDF
    In this paper, we propose a multi-view based metadata extraction technique from unstructured textual content in order to be applied in recommendation algorithms based on latent factors. The solution aims at reducing the problem of intense and time-consuming human effort to identify, collect and label descriptions about the items. Our proposal uses a unsupervised learning method to construct topic hierarchies with named entity recognition as privileged information. We evaluate the technique using different recommendation algorithms, and show that better accuracy is obtained when additional information about items is considered.São Paulo Research Foundation (FAPESP) (Grants 2012/13830-9, 2013/16039-3, 2013/22547-1)CAPE

    Cross-lingual Word Clusters for Direct Transfer of Linguistic Structure

    Get PDF
    It has been established that incorporating word cluster features derived from large unlabeled corpora can significantly improve prediction of linguistic structure. While previous work has focused primarily on English, we extend these results to other languages along two dimensions. First, we show that these results hold true for a number of languages across families. Second, and more interestingly, we provide an algorithm for inducing cross-lingual clusters and we show that features derived from these clusters significantly improve the accuracy of cross-lingual structure prediction. Specifically, we show that by augmenting direct-transfer systems with cross-lingual cluster features, the relative error of delexicalized dependency parsers, trained on English treebanks and transferred to foreign languages, can be reduced by up to 13%. When applying the same method to direct transfer of named-entity recognizers, we observe relative improvements of up to 26%

    mARC: Memory by Association and Reinforcement of Contexts

    Full text link
    This paper introduces the memory by Association and Reinforcement of Contexts (mARC). mARC is a novel data modeling technology rooted in the second quantization formulation of quantum mechanics. It is an all-purpose incremental and unsupervised data storage and retrieval system which can be applied to all types of signal or data, structured or unstructured, textual or not. mARC can be applied to a wide range of information clas-sification and retrieval problems like e-Discovery or contextual navigation. It can also for-mulated in the artificial life framework a.k.a Conway "Game Of Life" Theory. In contrast to Conway approach, the objects evolve in a massively multidimensional space. In order to start evaluating the potential of mARC we have built a mARC-based Internet search en-gine demonstrator with contextual functionality. We compare the behavior of the mARC demonstrator with Google search both in terms of performance and relevance. In the study we find that the mARC search engine demonstrator outperforms Google search by an order of magnitude in response time while providing more relevant results for some classes of queries

    Knowledge-enhanced document embeddings for text classification

    Get PDF
    Accurate semantic representation models are essential in text mining applications. For a successful application of the text mining process, the text representation adopted must keep the interesting patterns to be discovered. Although competitive results for automatic text classification may be achieved with traditional bag of words, such representation model cannot provide satisfactory classification performances on hard settings where richer text representations are required. In this paper, we present an approach to represent document collections based on embedded representations of words and word senses. We bring together the power of word sense disambiguation and the semantic richness of word- and word-sense embedded vectors to construct embedded representations of document collections. Our approach results in semantically enhanced and low-dimensional representations. We overcome the lack of interpretability of embedded vectors, which is a drawback of this kind of representation, with the use of word sense embedded vectors. Moreover, the experimental evaluation indicates that the use of the proposed representations provides stable classifiers with strong quantitative results, especially in semantically-complex classification scenarios

    Learning Object Repositories with Dynamically Reconfigurable Metadata Schemata

    Get PDF
    [ES] In this paper we describe a model of learning object repository in which users have full control on the metadata schemata. Thus, they can define new schemata and they can reconfigure existing ones in a collaborative fashion. As consequence, the repository must react to changes in schemata in a dynamic and responsive way. Since schemata enable operations like navigation and search, dynamic reconfigurability requires clever indexing strategies, resilient to changes in these schemata. For this purpose, we have used conventional inverted indexing approaches and we have also devised a hierarchical clusteringbased indexing model. By using Clavy, a system for managing learning object repositories in the field of the Humanities, we provide some experimental results that show how the hierarchical clustering-based model can outperform the more conventional inverted indexes-based solutions
    corecore