67 research outputs found

    Identifying the subject of documents in text collections using compound terms (original title: Identificando o assunto dos documentos em coleções textuais utilizando termos compostos)

    Unlike information retrieval problems, in which the user knows what they are looking for, sometimes the user needs a broader understanding of the subjects covered in a collection in order to explore the documents of interest. For each group or topic obtained, a set of descriptors is selected from the terms of the collection, and it is up to the user to identify the subject of each group from the list of descriptors presented. Normally, the set of descriptors consists of single terms. However, many terms take on a meaning of their own when combined with each other. Producing a list of terms whose construction already takes compound terms into account can reduce the effort needed to understand the identified subjects. This article proposes an approach for identifying subjects in document collections that combines association rule and data clustering techniques. Association rules are applied to extract compound terms, forming the local context of the relationship between terms. These rules are represented in a bag-of-words structure whose dimensions are the same as those of the bag-of-words produced from the document collection, and they are clustered, forming the general context of the relationships. The idea is that information about the neighborhood of the extracted compound terms helps to identify (a) different terms used in the same context or with the same meaning and (b) identical terms that are used in different contexts or with different meanings. The evaluation results indicate that using compound terms with the proposed approach improves the identification of subjects in the document collections evaluated. Funding: CAPES (process DS-6345378/D); FAPESP (process number 2014/08996-0

    Combining privileged information to improve context-aware recommender systems

    A recommender system is an information filtering technology which can be used to predict preference ratings of items (products, services, movies, etc.) and/or to output a ranking of items that are likely to be of interest to the user. Context-aware recommender systems (CARS) learn and predict the tastes and preferences of users by incorporating available contextual information in the recommendation process. One of the major challenges in context-aware recommender systems research is the lack of automatic methods to obtain contextual information for these systems. Considering this scenario, in this paper, we propose to use contextual information from topic hierarchies of the items (web pages) to improve the performance of context-aware recommender systems. The topic hierarchies are constructed by an extension of the LUPI-based Incremental Hierarchical Clustering method that considers three types of information: traditional bag-of-words (technical information), and the combination of named entities (privileged information I) with domain terms (privileged information II). We evaluated the contextual information in four context-aware recommender systems. Different weights were assigned to each type of information. The empirical results demonstrated that topic hierarchies with the combination of the two kinds of privileged information can provide better recommendations. Funding: FAPESP (grants #2010/20564-8, #2012/13830-9, and #2013/16039-3, São Paulo Research Foundation (FAPESP)); CAPE

    Unsupervised instance selection from text streams

    Instance selection techniques have received great attention in the literature, since they are very useful to identify a subset of instances (textual documents) that adequately represents the knowledge embedded in the entire text database. Most of the instance selection techniques are supervised, i.e., they require a labeled data set to define, with the help of classifiers, the separation boundaries of the data. However, manual labeling of the instances requires an intense human effort that is impractical when dealing with text streams. In this article, we present an approach for unsupervised instance selection from text streams. In our approach, text clustering methods are used to define the separation boundaries, thereby separating regions of high data density. The most representative instances of each cluster, which are the centers of high-density regions, are selected to represent a portion of the data. A well-known algorithm for data sampling from streams, known as Reservoir Sampling, has been adapted to incorporate the unsupervised instance selection. We carried out an experimental evaluation using three benchmarking text collections, and the reported experimental results show that the proposed approach significantly increases the quality of a knowledge extraction task by using more representative instances. Funding: FAPESP - São Paulo Research Foundation (grant 2010/20564-8); CAPES; CNPq. Venue: 1st Symposium on Knowledge Discovery, Mining and Learning (KDMiLe), São Carlos, Brazil, 17-19 July 2013.
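
    The abstract above mentions Reservoir Sampling as the stream-sampling algorithm that the authors adapted. As a point of reference only, the following is a minimal sketch of the classic, unadapted algorithm (often called Algorithm R), which keeps a uniform random sample of k items from a stream of unknown length. It does not include the clustering-based instance selection described in the abstract, and the function and parameter names (`reservoir_sample`, `stream`, `k`, `rng`) are hypothetical, chosen for illustration.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Classic reservoir sampling (Algorithm R), NOT the adapted variant
    from the abstract: returns up to k items drawn uniformly from stream."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item replaces a stored one
            # with probability k / (i + 1).
            j = rng.randint(0, i)  # both endpoints inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Hypothetical usage: sample 5 "documents" from a stream of 1000.
    docs = (f"doc-{n}" for n in range(1000))
    print(reservoir_sample(docs, 5, random.Random(42)))
```

    Under the approach described in the abstract, the uniform replacement step would presumably be informed by cluster representatives instead; that adaptation is not specified here and is left out of the sketch.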

    Flexible document organization: comparing fuzzy and possibilistic approaches

    System flexibility means the ability of a system to manage imprecise and/or uncertain information. Many commercially available Information Retrieval Systems (IRS) address this issue at the level of query formulation. Another way to make an IRS flexible is the flexible organization of documents. Such organization can be carried out using clustering algorithms by which documents can be automatically organized in multiple clusters simultaneously. Fuzzy and possibilistic clustering algorithms are examples of methods by which documents can belong to more than one cluster simultaneously with different membership degrees. The interpretation of these membership degrees can be used to quantify the compatibility of a document with a particular topic. The topics are represented by clusters and the clusters are identified by one or more descriptors extracted by a proposed method. We aim to investigate whether the performance of each clustering algorithm can affect the extraction of meaningful overlapping cluster descriptors. Experiments were carried out using well-known collections of documents, and the predictive power of the descriptors extracted from both fuzzy and possibilistic document clustering was evaluated. The results prove that descriptors extracted after both fuzzy and possibilistic clustering are effective and can improve the flexible organization of documents. Funding: CAPES (Coordination for the Improvement of Higher Level Personnel) (PDSE grant 5983-11-8); FAPESP (Sao Paulo Research Foundation) (grant 2011/19850-9

    Music classification by transductive learning using bipartite heterogeneous networks

    The popularization of music distribution in electronic format has increased the amount of music with incomplete metadata. The incompleteness of data can hamper some important tasks, such as music and artist recommendation. In this scenario, transductive classification can be used to classify the whole dataset considering just a few labeled instances. Usually transductive classification is performed through label propagation, in which data are represented as networks and the examples propagate their labels through their connections. Similarity-based networks are usually applied to model data as a network. However, this kind of representation requires the definition of parameters, which significantly affect the classification accuracy, and presents a high cost due to the computation of similarities among all dataset instances. In contrast, bipartite heterogeneous networks have appeared as an alternative to similarity-based networks in text mining applications. In these networks, the words are connected to the documents in which they occur. Thus, there is no parameter or additional cost to generate such networks. In this paper, we propose the use of the bipartite network representation to perform transductive classification of music, using a bag-of-frames approach to describe music signals. We demonstrate that the proposed approach outperforms other music classification approaches when few labeled instances are available. Funding: Sao Paulo Research Foundation (FAPESP) (grants 2011/12823-6, 2012/50714-7, 2013/26151-5, and 2014/08996-0

    Optimizing personalized ranking in recommender systems with metadata awareness

    In this paper, we propose an item recommendation algorithm based on latent factors which uses implicit feedback from users to optimize the ranking of items according to individual preferences. The novelty of the algorithm is the integration of content metadata to improve the quality of recommendations. Such descriptions are an important source to construct a personalized set of items which are meaningfully related to the user’s main interests. The method is evaluated on two different datasets and compared against another approach reported in the literature. The results demonstrate the effectiveness of supporting personalized ranking with metadata awareness. Funding: CAPES; CNPq; FAPESP (grant #2013/22547-1 and #2012/13830-9

    A data warehouse to support web site automation

    Background: Due to the constant demand for new information and timely updates of services and content in order to satisfy the user’s needs, web site automation has emerged as a solution to automate several personalization and management activities of a web site. One goal of automation is the reduction of the editor’s effort and consequently of the costs for the owner. The other goal is that the site can adapt in a more timely manner to the behavior of the user, improving the browsing experience and helping the user in achieving his/her own goals. Methods: A database to store rich web data is an essential component for web site automation. In this paper, we propose a data warehouse that is developed to be a repository of information to support different web site automation and monitoring activities. We implemented our data warehouse and used it as a repository of information in three different case studies related to the areas of e-commerce, e-learning, and e-news. Result: The case studies showed that our data warehouse is appropriate for web site automation in different contexts. Conclusion: In all cases, the use of the data warehouse was quite simple and with a good response time, mainly because of the simplicity of its structure. Funding: FCT - Science and Technology Foundation (SFRH/BD/22516/2005); project Site-O-Matic (POSC/EIA/58367/2004); São Paulo Research Foundation (FAPESP) (grants 2011/19850-9, 2012/13830-9