4 research outputs found

    A novel multidimensional model for the OLAP on documents : modeling, generation and implementation

    As the amount of textual information grows explosively in various kinds of business systems, it becomes increasingly essential to analyze structured data and unstructured textual data together. However, the information contained in unstructured data (documents and so on) is only partially used in business intelligence (BI). Indeed, On-Line Analytical Processing (OLAP) cubes, which are the main support of BI analysis in decision support systems, have focused on structured data. This is why OLAP is being extended to unstructured textual data. In this paper we introduce the innovative “Diamond” multidimensional model, which will serve as a basis for semantic OLAP on XML documents, and then describe the metamodeling, generation and implementation of the Diamond multidimensional model.

    Dynamic topic hierarchies and segmented rankings in textual OLAP technology.

    Programa de Pós-Graduação em Ciência da Computação. Departamento de Ciência da Computação, Instituto de Ciências Exatas e Biológicas, Universidade Federal de Ouro Preto. OLAP technology has been consolidated over the past 20 years and was recently redesigned so that its dimensions, hierarchies and measures can support the particularities of textual data. The task of organizing textual data hierarchically can be solved by building topic hierarchies. Currently, the topic hierarchy is defined only once in the data cube, i.e., for the entire lattice of cuboids. However, such a hierarchy is sensitive to the content of the document collection. Cells in the same data cube can therefore have completely different contents, aggregating distinct document collections, which can change the topic hierarchy. Furthermore, the text segment used in the OLAP analysis also directly influences the topics listed by the hierarchy. In this work, we present a textual data cube with multiple, dynamic topic hierarchies: multiple because they are built from different text segments, and dynamic because they are built for each cube cell. Another contribution of this work concerns the response to multidimensional queries. The state of the art normally returns the top-k documents most relevant to a given topic. We go beyond this by returning other text segments, such as the most significant titles, abstracts and paragraphs. The approach is designed in four complementary steps, each of which further attenuates the impact of building multiple topic hierarchies and segment rankings per cube cell. Experiments using part of the DBLP papers as a document collection reinforce our hypotheses.

    An Online Analytical System for Multi-Tagged Document Collections

    The New York Times Annotated Corpus and the ACM Digital Library are two prototypical examples of document collections in which each document is tagged with keywords and significant phrases. Such collections can be viewed as high-dimensional document cubes against which browsers and search systems can be applied in a manner similar to online analytical processing against data cubes. The tagging patterns in these collections are examined and a generative tagging model is developed that can mimic the tag assignments observed in those collections. When a user browses the collection by means of a Boolean query over tags, the result is a subset of documents that can be summarized by a centroid derived from their document term vectors. A partial materialization strategy is developed to provide efficient storage of and access to centroids for such document subsets. A customized local term vocabulary storage approach is incorporated into the partial materialization to ensure that a rich and relevant term vocabulary is available for representing centroids while maintaining a low storage footprint. By adopting this strategy, summary measures that depend on centroids (including bursty terms, or larger sets of indicative documents) can be efficiently and accurately computed for important subsets of documents. The proposed design is evaluated on the two collections along with PubMed (a held-back document collection) and several synthetic collections to validate that it outperforms alternative storage strategies. Finally, an enhanced faceted browsing system is developed to support users' exploration of large multi-tagged document collections. At each step of navigation, it provides summary measures of document result sets through a set of indicative terms and a diverse set of documents, as well as information scent that helps to guide users' exploration. These summaries are derived from pre-materialized views that allow for quick calculation of centroids for various result sets. The utility and efficiency of the system are demonstrated on the New York Times Annotated Corpus.
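
As a rough illustration of the centroid summary described in this abstract, the following minimal Python sketch selects documents with a Boolean query over tags and averages their term vectors. The toy collection `DOCS` and the helpers `select` and `centroid` are hypothetical names chosen for this example; the sketch does not reproduce the partial materialization or vocabulary storage strategy of the work itself.

```python
from collections import Counter

# Hypothetical toy collection: each document has a tag set and a term-count vector.
DOCS = [
    {"tags": {"politics", "economy"}, "terms": Counter({"tax": 2, "vote": 1})},
    {"tags": {"politics"}, "terms": Counter({"vote": 3, "senate": 1})},
    {"tags": {"sports"}, "terms": Counter({"goal": 4})},
]

def select(docs, all_of=(), none_of=()):
    """Return documents that carry every tag in all_of and no tag in none_of."""
    required, excluded = set(all_of), set(none_of)
    return [d for d in docs
            if required <= d["tags"] and not (excluded & d["tags"])]

def centroid(docs):
    """Average the term-count vectors of a document subset."""
    if not docs:
        return {}
    total = Counter()
    for d in docs:
        total.update(d["terms"])
    return {term: count / len(docs) for term, count in total.items()}

# Boolean query: politics AND NOT sports
result = select(DOCS, all_of=["politics"], none_of=["sports"])
summary = centroid(result)
```

Here the centroid is recomputed from scratch for every query; the abstract's point is that pre-materialized views make this calculation quick for important document subsets.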

    Integrative text mining and management in multidimensional text databases

    As text information grows explosively in today's multidimensional text databases, managing and mining such databases plays an extremely important role in every domain. Unlike traditional text mining tasks that target single data sets, a text management system for a multidimensional database requires its text mining functions to be performed in different contexts specified by the structured dimensions, and the system should support OLAP (online analytical processing) of the text information well. This is a big challenge for most existing text mining techniques because of efficiency and scalability issues. On the other hand, the huge amount of text information in such databases also provides an opportunity to acquire new knowledge from it, which could be highly beneficial. In this thesis, I identified three major types of functions that a text management system should support in order to analyze multidimensional text databases: (1) effective and efficient digestion: the system should help users digest the text information in an OLAP environment based on domain knowledge; (2) flexible exploration: the system should allow users to flexibly explore the text information based on ad hoc information needs; (3) discovery analysis: the system should effectively analyze the text data with consideration of the associated non-textual data and mine the knowledge underlying the text information. All of these functions are integrative analyses of the structured data and the unstructured text data within a multidimensional text database. I proposed and studied different novel models and infrastructures to support all the above functions. First, I proposed a novel model called Topic Cube, which combines the OLAP technology for traditional data warehouses with probabilistic topic modeling approaches for text mining. Given a topic hierarchy based on domain knowledge, a topic cube mines semantic topics accordingly and organizes the text information along the topic hierarchy so that domain experts can quickly digest the text information at different granularities of topics and within different contexts. Second, a novel infrastructure, MiTexCube, is proposed to flexibly support various kinds of online exploration, such as summarizing the content of text cells or comparing the content of documents across multiple text cells. The text content in a MiTexCube is stored in a compact representation called micro-clusters, which makes the online processing very efficient. Third, aiming at a special type of discovery analysis, comparative analysis of different text fields, I proposed a probabilistic topic mapping (PTM) model for mining two parallel text fields to discover latent topics and their associations. The model can be directly applied to multidimensional text databases with two parallel text fields. For multidimensional text databases with only one text field, the structured data can align two subsets of the data and form a parallel document collection so that meaningful knowledge can be mined by the proposed model. Extensive experiments on multiple real-world multidimensional text databases show that the proposed Topic Cube, MiTexCube, and PTM are all effective and efficient for digesting, exploring and analyzing multidimensional text databases. Since these techniques are all general, they can be applied to any multidimensional text database in different application domains.
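
To make the idea of text analysis "in different contexts specified by the structured dimensions" more concrete, here is a minimal Python sketch of a text cube in which each cell aggregates term counts for one combination of dimension values, with a simple roll-up. The sample `RECORDS`, the dimension names, and the functions `build_cells` and `roll_up` are assumptions made for illustration only; this is not the Topic Cube, MiTexCube, or PTM design from the thesis.

```python
from collections import Counter, defaultdict

# Hypothetical records: structured dimensions plus a free-text field.
RECORDS = [
    {"year": 2008, "region": "EU", "text": "engine failure on takeoff"},
    {"year": 2008, "region": "US", "text": "engine noise reported"},
    {"year": 2009, "region": "EU", "text": "landing gear failure"},
]

def build_cells(records, dims):
    """Group term counts by the values of the chosen structured dimensions."""
    cells = defaultdict(Counter)
    for r in records:
        key = tuple(r[d] for d in dims)
        cells[key].update(r["text"].split())
    return dict(cells)

def roll_up(cells, keep):
    """Aggregate cells, keeping only the dimension positions listed in keep."""
    rolled = defaultdict(Counter)
    for key, counts in cells.items():
        rolled[tuple(key[i] for i in keep)].update(counts)
    return dict(rolled)

base = build_cells(RECORDS, ["year", "region"])  # finest-grained cells
by_year = roll_up(base, keep=[0])                # roll up over region
```

Term counts are used here only because they aggregate trivially; the thesis stores text content in cells differently (for example as micro-clusters in MiTexCube).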