Text mining with exploitation of user's background knowledge: discovering novel association rules from text
The goal of text mining is to find interesting and non-trivial patterns or knowledge from unstructured documents. Both objective and subjective measures have been proposed in the literature to evaluate the interestingness of discovered patterns. However, objective measures alone are insufficient because such measures do not consider knowledge and interests of the users. Subjective measures require explicit input of user expectations which is difficult or even impossible to obtain in text mining environments.
This study proposes a user-oriented text-mining framework and applies it to the problem of discovering novel association rules from documents. The developed system, uMining, consists of two major components: a background knowledge developer and a novel association rules miner. The background knowledge developer learns a user's background knowledge by extracting keywords from documents already known to the user (background documents) and developing a concept hierarchy to organize popular keywords. The novel association rule miner discovers association rules among noun phrases extracted from relevant documents (target documents) and compares the rules with the background knowledge to predict the rule novelty to the particular user (user-oriented novelty).
The user-oriented novelty measure is defined as the semantic distance between the antecedent and the consequent of a rule in the background knowledge. It consists of two components: occurrence distance and connection distance. The former considers the co-occurrences of two keywords in the background documents: the more co-occurrences, the shorter the distance. The latter considers the common connections of the two keywords with others in the concept hierarchy. It is defined as the length of the path connecting the two keywords in the concept hierarchy: the longer the path, the larger the distance.
The user-oriented novelty measure is evaluated from two perspectives: novelty prediction accuracy and usefulness indication power. The results show that the user-oriented novelty measure outperforms the WordNet novelty measure and the compared objective measures in terms of predicting novel rules and identifying useful rules.
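The two distance components described above can be illustrated with a minimal sketch. The abstract gives no formulas, so everything here is an assumption made for illustration: the `1 / (1 + count)` form of the occurrence distance, the representation of the concept hierarchy as an undirected graph of parent-child edges, and all function names are hypothetical, not taken from uMining.

```python
from collections import deque


def occurrence_distance(a, b, background_docs):
    """Hypothetical occurrence distance: shrinks as co-occurrences grow.

    background_docs is an iterable of keyword sets, one per background document.
    The 1 / (1 + count) form is an assumption, not the thesis's formula.
    """
    co_occurrences = sum(1 for doc in background_docs if a in doc and b in doc)
    return 1.0 / (1.0 + co_occurrences)


def build_hierarchy(edges):
    """Turn (parent, child) pairs into an undirected adjacency mapping."""
    graph = {}
    for parent, child in edges:
        graph.setdefault(parent, set()).add(child)
        graph.setdefault(child, set()).add(parent)
    return graph


def connection_distance(a, b, hierarchy):
    """Number of edges on the shortest path between a and b (breadth-first search).

    Returns None when the two keywords are not connected in the hierarchy.
    """
    if a == b:
        return 0
    seen = {a}
    queue = deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in hierarchy.get(node, ()):
            if neighbour == b:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


if __name__ == "__main__":
    docs = [{"gene", "protein"}, {"gene"}, {"gene", "protein", "cell"}]
    tree = build_hierarchy(
        [("science", "biology"), ("science", "physics"), ("biology", "gene")]
    )
    print(occurrence_distance("gene", "protein", docs))
    print(connection_distance("gene", "physics", tree))
```

How the two components are combined into a single novelty score is not stated in the abstract, so the sketch leaves them separate.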
Interactive Search of Rules in Medical Data Using Multiobjective Evolutionary Algorithms
In this work, we propose an approach for evolving rules from medical data based on an interactive multi-criteria evolutionary search: besides selecting the set of criteria and the sets of potential antecedent and consequent attributes, the user can also intervene in the search process by marking the uninteresting rules. The marked rules are further used in estimating a supplementary optimization criterion which expresses the user's opinion on the rule quality and is taken into account in the evolutionary process.
Data mining using neural networks
Data mining is about the search for relationships and global patterns in large databases that are increasing in size. Data mining is beneficial for anyone who has a huge amount of data, for example, customer and business data, transaction, marketing, financial, manufacturing and web data. The results of data mining are also referred to as knowledge in the form of rules, regularities and constraints. Rule mining is one of the popular data mining methods because rules provide concise, actionable statements of potentially important information that end users can easily understand. Rule mining has received a good deal of attention from data mining researchers because it can address many data mining problems, such as classification, association, customer profiling, summarization and segmentation. This thesis makes several contributions by proposing rule mining methods using genetic algorithms and neural networks. The thesis first proposes rule mining methods using a genetic algorithm. These methods are based on an integrated framework but are capable of mining three major classes of rules. Moreover, the rule mining processes in these methods are controlled by tuning two data mining measures, support and confidence. The thesis shows how to build predictive data mining models using the rules produced by the proposed methods. Another key contribution of the thesis is the proposal of rule mining methods using supervised neural networks. The thesis mathematically analyses the Widrow-Hoff learning algorithm of a single-layered neural network, which results in a foundation for rule mining algorithms using single-layered neural networks. Three rule mining algorithms using single-layered neural networks are proposed for the three major classes of rules on the basis of the proposed theorems. The thesis also looks at the problem of rule mining where user guidance is absent.
The thesis proposes a guided rule mining system to overcome this problem. The thesis extends this work further by comparing the performance of the algorithm used in the proposed guided rule mining system with the Apriori data mining algorithm. Finally, the thesis studies the Kohonen self-organizing map as an unsupervised neural network for rule mining algorithms. Two approaches are adopted, based on how self-organizing maps are applied in rule mining models. In the first approach, the self-organizing map is used for clustering, which provides class information to the rule mining process. In the second approach, automated rule mining takes the place of trained neurons as it grows in a hierarchical structure.
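The support and confidence measures mentioned in this abstract have standard definitions in association rule mining: the support of an itemset is the fraction of transactions containing it, and the confidence of a rule A -> B is support(A and B together) divided by support(A). The sketch below computes both on a toy transaction list. The function names and the sample data are illustrative assumptions, not the thesis's implementation.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    items = set(itemset)
    matches = sum(1 for t in transactions if items <= set(t))
    return matches / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent.

    Returns 0.0 when the antecedent never occurs, to avoid dividing by zero.
    """
    antecedent_support = support(antecedent, transactions)
    if antecedent_support == 0:
        return 0.0
    joint = set(antecedent) | set(consequent)
    return support(joint, transactions) / antecedent_support


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    baskets = [
        {"bread", "milk"},
        {"bread", "butter"},
        {"bread", "milk", "butter"},
        {"milk"},
    ]
    print(support({"bread", "milk"}, baskets))
    print(confidence({"bread"}, {"milk"}, baskets))
```

Raising the minimum support or confidence threshold prunes more candidate rules, which is the sense in which a mining process can be controlled by tuning these two measures.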
GENERIC FRAMEWORKS FOR INTERACTIVE PERSONALIZED INTERESTING PATTERN DISCOVERY
The traditional frequent pattern mining algorithms generate an exponentially large number of patterns, of which a substantial portion are not significant for many data analysis endeavours. Due to this, the discovery of a small number of interesting patterns from the exponentially large number of frequent patterns according to a particular user's interest is an important task. Existing works on patter
New Fundamental Technologies in Data Mining
The progress of data mining technology and its large public popularity establish a need for a comprehensive text on the subject. The series of books entitled "Data Mining" addresses this need by presenting in-depth descriptions of novel mining algorithms and many useful applications. In addition to helping readers understand each section deeply, the two books present useful hints and strategies for solving problems in the following chapters. The contributing authors have highlighted many future research directions that will foster multi-disciplinary collaborations and hence will lead to significant development in the field of data mining.
Multivariate discretization of continuous valued attributes.
The area of knowledge discovery and data mining is growing rapidly. Feature discretization is a crucial issue in Knowledge Discovery in Databases (KDD), or data mining, because most data sets used in real-world applications have features with continuous values. Discretization is performed as a preprocessing step of data mining to make data mining techniques useful for these data sets. This thesis addresses the discretization issue by proposing a multivariate discretization (MVD) algorithm. It begins with a number of common discretization algorithms, such as equal-width discretization, equal-frequency discretization, Naïve, entropy-based discretization, chi-square discretization, and orthogonal hyperplanes. It then compares the results achieved by the MVD algorithm with the accuracy results of the other algorithms. The thesis is divided into six chapters. It covers a few common discretization algorithms, tests them on real-world data sets that vary in size and complexity, and shows how data visualization techniques are effective in determining the degree of complexity of a given data set. We examined the MVD algorithm with the same data sets. We then classified the discrete data using artificial neural networks: a single-layer perceptron and a multilayer perceptron with the back-propagation algorithm. We trained the classifier using the training data set and tested its accuracy using the testing data set. Our experiments lead to better accuracy with some data sets and lower accuracy with others, depending on the degree of data complexity. We then compared the accuracy of the MVD algorithm with the results achieved by the other discretization algorithms. We found that the MVD algorithm produces good accuracy in comparison with the other discretization algorithms.
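Two of the baseline methods named above, equal-width and equal-frequency discretization, are simple enough to sketch directly. The code below is an illustrative assumption about how such baselines are typically written. It is not the thesis's MVD algorithm, and the function names and sample values are invented for this example.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k bins of equal width over [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything falls into the first bin.
        return [0] * len(values)
    width = (hi - lo) / k
    # The maximum value would land in bin k, so clamp it into bin k - 1.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding roughly equal counts.

    Bins are assigned by rank, so tied values may end up in different bins.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, index in enumerate(order):
        bins[index] = min(rank * k // n, k - 1)
    return bins


if __name__ == "__main__":
    data = [1, 2, 3, 4, 100, 200]
    print(equal_width_bins(data, 3))
    print(equal_frequency_bins(data, 3))
```

On skewed data such as the sample above, equal-width binning puts most values into the first bin, while equal-frequency binning spreads them evenly. Both methods treat each attribute in isolation, unlike a multivariate approach.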
NEW ARTIFACTS FOR THE KNOWLEDGE DISCOVERY VIA DATA ANALYTICS (KDDA) PROCESS
Recently, the interest in the business application of analytics and data science has increased significantly. The popularity of data analytics and data science comes from the clear articulation of business problem solving as an end goal. To address limitations in existing literature, this dissertation provides four novel design artifacts for Knowledge Discovery via Data Analytics (KDDA). The first artifact is a Snail Shell KDDA process model that extends existing knowledge discovery process models and addresses many of their limitations. At the top level, the KDDA process model highlights the iterative nature of KDDA projects and adds two new phases, namely Problem Formulation and Maintenance. At the second level, generic tasks of the KDDA process model are presented in a comparative manner, highlighting the differences between the new KDDA process model and the traditional knowledge discovery process models. Two case studies are used to demonstrate how to use the KDDA process model to guide real-world KDDA projects. The second artifact, a methodology for theory building based on quantitative data, is a novel application of the KDDA process model. The methodology is evaluated using a theory building case from the public health domain. It is not only an instantiation of the Snail Shell KDDA process model, but also makes theoretical contributions to theory building. It demonstrates how analytical techniques can be used as quantitative gauges to assess important construct relationships during the formative phase of theory building. The third artifact is a data mining ontology, the DM3 ontology, to bridge the semantic gap between business users and KDDA experts and to facilitate analytical model maintenance and reuse. The DM3 ontology is evaluated using both a criteria-based approach and a task-based approach. The fourth artifact is a decision support framework for MCDA software selection.
The framework enables users to choose relevant MCDA software based on a specific decision making situation (DMS). A DMS modeling framework is developed to structure the DMS based on the decision problem and the users' decision preferences. The framework is implemented in a decision support system and evaluated using application examples from the real-estate domain.
Exploratory search in time-oriented primary data
In a variety of research fields, primary data that describes scientific phenomena in an original condition is obtained.
Time-oriented primary data, in particular, is an indispensable data type, derived from complex measurements depending
on time. Today, time-oriented primary data is collected at rates that exceed the domain experts’ abilities to seek
valuable information undiscovered in the data. It is widely accepted that the magnitudes of uninvestigated data will
disclose tremendous knowledge in data-driven research, provided that domain experts are able to gain insight into the
data. Domain experts involved in data-driven research urgently require analytical capabilities. In scientific practice,
predominant activities are the generation and validation of hypotheses. In analytical terms, these activities are often
expressed in confirmatory and exploratory data analysis. Ideally, analytical support would combine the strengths of
both types of activities.
Exploratory search (ES) is a concept that seamlessly includes information-seeking behaviors ranging from search
to exploration. ES supports domain experts in both gaining an understanding of huge and potentially unknown data
collections and the drill-down to relevant subsets, e.g., to validate hypotheses. As such, ES combines predominant tasks
of domain experts applied to data-driven research. For the design of useful and usable ES systems (ESS), data scientists
have to incorporate different sources of knowledge and technology. Of particular importance is the state-of-the-art
in interactive data visualization and data analysis. Research in these factors is at the heart of Information Visualization
(IV) and Visual Analytics (VA). Approaches in IV and VA provide meaningful visualization and interaction designs,
allowing domain experts to perform the information-seeking process in an effective and efficient way. Today, best-practice
ESS almost exclusively exist for textual data content, e.g., put into practice in digital libraries to facilitate the
reuse of digital documents. For time-oriented primary data, ES mainly remains at a theoretical state.
Motivation and Problem Statement. This thesis is motivated by two main assumptions. First, we expect that
ES will have a tremendous impact on data-driven research for many research fields. In this thesis, we focus on
time-oriented primary data, as a complex and important data type for data-driven research. Second, we assume that
research conducted in IV and VA will particularly facilitate ES. For time-oriented primary data, however, novel
concepts and techniques are required that enhance the design and the application of ESS. In particular, we observe a
lack of methodological research in ESS for time-oriented primary data. In addition, the size, the complexity, and the
quality of time-oriented primary data hamper content-based access, as well as the design of visual interfaces
for gaining an overview of the data content. Furthermore, the question arises how ESS can incorporate techniques
for seeking relations between data content and metadata to foster data-driven research. Overarching challenges for
data scientists are to create usable and useful designs, urgently requiring the involvement of the targeted user group
and support techniques for choosing meaningful algorithmic models and model parameters. Throughout this thesis,
we will resolve these challenges from conceptual, technical, and systemic perspectives. In turn, domain experts can
benefit from novel ESS as a powerful analytical support to conduct data-driven research.
Concepts for Exploratory Search Systems (Chapter 3). We postulate concepts for the ES in time-oriented primary
data. Based on a survey of analysis tasks supported in IV and VA research, we present a comprehensive selection of
tasks and techniques relevant for search and exploration activities. The assembly guides data scientists in the choice of
meaningful techniques presented in IV and VA. Furthermore, we present a reference workflow for the design and
the application of ESS for time-oriented primary data. The workflow divides the data processing and transformation
process into four steps, and thus divides the complexity of the design space into manageable parts. In addition, the
reference workflow describes how users can be involved in the design. The reference workflow is the framework for
the technical contributions of this thesis.
Visual-Interactive Preprocessing of Time-Oriented Primary Data (Chapter 4). We present a visual-interactive
system that enables users to construct workflows for preprocessing time-oriented primary data. In this way, we
introduce a means of providing content-based access. Based on a rich set of preprocessing routines, users can create
individual solutions for data cleansing, normalization, segmentation, and other preprocessing tasks. In addition, the
system supports the definition of time series descriptors and time series distance measures. Guidance concepts support
users in assessing the workflow generalizability, which is important for large data sets. The execution of the workflows
transforms time-oriented primary data into feature vectors, which can subsequently be used for downstream search
and exploration techniques. We demonstrate the applicability of the system in usage scenarios and case studies.
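The idea of turning a time series into a feature vector through a preprocessing workflow, and then comparing vectors with a distance measure, can be sketched as follows. The abstract does not say which routines, descriptors, or distance measures the system offers, so the choices here (z-normalization, piecewise aggregate approximation as the descriptor, Euclidean distance) and all function names are assumptions made purely for illustration.

```python
import math


def z_normalize(series):
    """Rescale a series to zero mean and unit standard deviation."""
    mean = sum(series) / len(series)
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / len(series))
    if std == 0:
        # A constant series carries no shape information.
        return [0.0] * len(series)
    return [(x - mean) / std for x in series]


def paa(series, n_segments):
    """Piecewise aggregate approximation: the mean of each segment.

    Assumes n_segments <= len(series) so that no segment is empty.
    """
    n = len(series)
    features = []
    for s in range(n_segments):
        segment = series[s * n // n_segments:(s + 1) * n // n_segments]
        features.append(sum(segment) / len(segment))
    return features


def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def to_feature_vector(series, n_segments):
    """A two-step workflow: normalize, then describe."""
    return paa(z_normalize(series), n_segments)


if __name__ == "__main__":
    first = to_feature_vector([1, 2, 3, 4, 5, 6], 3)
    second = to_feature_vector([6, 5, 4, 3, 2, 1], 3)
    print(first, second, euclidean(first, second))
```

Because every series is reduced to a vector of the same length, downstream search and exploration techniques can work on the vectors without touching the raw measurements.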
Content-Based Overviews (Chapter 5). We introduce novel guidelines and techniques for the design of content-based
overviews. The three key factors are the creation of meaningful data aggregates, the visual mapping of these
aggregates into the visual space, and the view transformation providing layouts of these aggregates in the display
space. For each of these steps, we characterize important visualization and interaction design parameters allowing the
involvement of users. We introduce guidelines supporting data scientists in choosing meaningful solutions. In addition,
we present novel visual-interactive quality assessment techniques enhancing the choice of algorithmic model and
model parameters. Finally, we present visual interfaces enabling users to formulate visual queries of the time-oriented
data content. In this way, we provide means of combining content-based exploration with content-based search.
Relation Seeking Between Data Content and Metadata (Chapter 6). We present novel visual interfaces enabling
domain experts to seek relations between data content and metadata. These interfaces can be integrated into ESS
to bridge analytical gaps between the data content and attached metadata. In three different approaches, we focus
on different types of relations and define algorithmic support to guide users towards most interesting relations.
Furthermore, each of the three approaches comprises individual visualization and interaction designs, enabling users
to explore both the data and the relations in an efficient and effective way. We demonstrate the applicability of our
interfaces with usage scenarios, each conducted together with domain experts. The results confirm that our techniques
are beneficial for seeking relations between data content and metadata, particularly for data-centered research.
Case Studies - Exploratory Search Systems (Chapter 7). In two case studies, we put our concepts and techniques
into practice. We present two ESS constructed in design studies with real users, real ES tasks, and real time-oriented
primary data collections. The web-based VisInfo ESS is a digital library system facilitating the visual access to
time-oriented primary data content. A content-based overview enables users to explore large collections of time series
measurements and serves as a baseline for content-based queries by example. In addition, VisInfo provides a visual
interface for querying time-oriented data content by sketch. A result visualization combines different views of the data
content and metadata with faceted search functionality. The MotionExplorer ESS supports domain experts in human
motion analysis. Two content-based overviews enhance the exploration of large collections of human motion capture
data from two perspectives. MotionExplorer provides a search interface, allowing domain experts to query human
motion sequences by example. Retrieval results are depicted in a visual-interactive view enabling the exploration of
variations of human motions. Field study evaluations performed for both ESS confirm the applicability of the systems
in the environment of the involved user groups. The systems yield a significant improvement of both the effectiveness
and the efficiency in the day-to-day work of the domain experts. As such, both ESS demonstrate how large collections
of time-oriented primary data can be reused to enhance data-centered research.
In essence, our contributions cover the entire time series analysis process starting from accessing raw time-oriented
primary data, processing and transforming time series data, to visual-interactive analysis of time series. We present
visual search interfaces providing content-based access to time-oriented primary data. In a series of novel exploration-support
techniques, we facilitate both gaining an overview of large and complex time-oriented primary data collections
and seeking relations between data content and metadata. Throughout this thesis, we introduce VA as a means of
designing effective and efficient visual-interactive systems. Our VA techniques empower data scientists to choose
appropriate models and model parameters, as well as to involve users in the design. With both principles, we support
the design of usable and useful interfaces which can be included into ESS. In this way, our contributions bridge the gap
between search systems requiring exploration support and exploratory data analysis systems requiring visual querying
capability. In the ESS presented in two case studies, we prove that our techniques and systems support data-driven
research in an efficient and effective way.
Rough Set Based Rule Evaluations and Their Applications
Knowledge discovery is an important process in data analysis, data
mining and machine learning. Typically knowledge is presented in the
form of rules. However, knowledge discovery systems often generate a
huge amount of rules. One of the challenges we face is how to
automatically discover interesting and meaningful knowledge from
such discovered rules. It is infeasible for human beings to select
important and interesting rules manually. How to provide a measure
to evaluate the qualities of rules in order to facilitate the
understanding of data mining results becomes our focus. In this
thesis, we present a series of rule evaluation techniques for the
purpose of facilitating the knowledge understanding process. These
evaluation techniques help not only to reduce the number of rules,
but also to extract higher quality rules. Empirical studies on both
artificial data sets and real world data sets demonstrate how such
techniques can contribute to practical systems such as ones for
medical diagnosis and web personalization.
In the first part of this thesis, we discuss several rule evaluation
techniques that are proposed towards rule postprocessing. We show
how properly defined rule templates can be used as a rule evaluation
approach. We propose two rough set based measures, a Rule Importance
Measure, and a Rules-As-Attributes Measure,
to rank the important and interesting rules. In the second part of
this thesis, we show how data preprocessing can help with rule
evaluation. Because well preprocessed data is essential for
important rule generation, we propose a new approach for processing
missing attribute values for enhancing the generated rules. In the
third part of this thesis, a rough set based rule evaluation system
is demonstrated to show the effectiveness of the measures proposed
in this thesis. Furthermore, a new user-centric web personalization
system is used as a case study to demonstrate how the proposed
evaluation measures can be used in an actual application.
- …