Integrative text mining and management in multidimensional text databases


As text information grows explosively in today's multidimensional text databases, managing and mining such databases plays an extremely important role in many domains. Unlike traditional text mining tasks, which target a single data set, a text management system for a multidimensional database must perform its text mining functions in different contexts specified by the structured dimensions, and it should support OLAP (online analytical processing) of the text information. This is a major challenge for most existing text mining techniques because of efficiency and scalability issues. On the other hand, the huge amount of text information in such databases also provides an opportunity to acquire new knowledge from it, which could be highly beneficial. In this thesis, I identified three major types of functions that a text management system should support in order to analyze multidimensional text databases: (1) effective and efficient digestion: the system should help users digest the text information in an OLAP environment based on domain knowledge; (2) flexible exploration: the system should allow users to flexibly explore the text information based on ad hoc information needs; (3) discovery analysis: the system should effectively analyze the text data together with the associated non-textual data and mine the knowledge underlying the text information. All of these functions involve integrative analysis of the structured data and the unstructured text data within a multidimensional text database.

I proposed and studied novel models and infrastructures to support all of the above functions. First, I proposed a novel model called Topic Cube, which combines OLAP technology for traditional data warehouses with probabilistic topic modeling approaches for text mining. Given a topic hierarchy based on domain knowledge, a topic cube mines semantic topics accordingly and organizes the text information along the topic hierarchy so that domain experts can quickly digest the text information at different granularities of topics and within different contexts. Second, I proposed a novel infrastructure, MiTexCube, to flexibly support various kinds of online exploration, such as summarizing the content of text cells or comparing the content of documents across multiple text cells. The text content in a MiTexCube is stored in a compact representation called micro-clusters, which makes online processing very efficient. Third, aiming at a special type of discovery analysis, namely comparative analysis of different text fields, I proposed a probabilistic topic mapping (PTM) model for mining two parallel text fields to discover latent topics and their associations. The model can be applied directly to multidimensional text databases with two parallel text fields. For multidimensional text databases with only one text field, the structured data can be used to align two subsets of the data and form a parallel document collection, so that meaningful knowledge can be mined by the proposed model. Extensive experiments on multiple real-world multidimensional text databases show that the proposed Topic Cube, MiTexCube, and PTM are all effective and efficient for digesting, exploring, and analyzing multidimensional text databases. Because these techniques are general, they can be applied to multidimensional text databases in different application domains.
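To make the notion of a text cell concrete, the following minimal Python sketch groups documents by their structured dimensions and rolls the resulting cells up to a coarser view. It is an illustration under stated assumptions, not the Topic Cube or MiTexCube implementation: the records, the dimension names (`year`, `region`), and the helper functions are hypothetical, and plain term counts stand in for the topic models and micro-clusters described above.

```python
from collections import Counter

# Hypothetical toy records: two structured dimensions plus one text field.
RECORDS = [
    {"year": 2008, "region": "east", "text": "engine failure during climb"},
    {"year": 2008, "region": "west", "text": "engine overheating on taxi"},
    {"year": 2009, "region": "east", "text": "landing gear warning light"},
]


def build_cells(records, dims):
    """Group documents into text cells keyed by their dimension values.

    Each cell keeps one term-count summary (a simple stand-in for a
    richer per-cell text summary).
    """
    cells = {}
    for rec in records:
        key = tuple(rec[d] for d in dims)
        cells.setdefault(key, Counter()).update(rec["text"].lower().split())
    return cells


def roll_up(cells, dims, keep):
    """Aggregate cells to a coarser view by merging the summaries of all
    cells that agree on the dimensions listed in `keep`."""
    idx = [dims.index(d) for d in keep]
    coarse = {}
    for key, counts in cells.items():
        coarse.setdefault(tuple(key[i] for i in idx), Counter()).update(counts)
    return coarse


dims = ["year", "region"]
base = build_cells(RECORDS, dims)        # finest cells: (year, region)
by_year = roll_up(base, dims, ["year"])  # coarser cells: (year,)
print(by_year[(2008,)].most_common(1))
```

Because the per-cell summaries are additive, a coarser cell can be answered by merging precomputed finer cells without rereading the documents, which is the general reason compact cell summaries help online exploration.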
