Digging deep into weighted patient data through multiple-level patterns
Large data volumes have been collected by healthcare organizations at an unprecedented rate. Today both physicians and healthcare system managers are very interested in extracting value from such data. Nevertheless, increasing data complexity and heterogeneity prompt the need for new efficient and effective data mining approaches for analyzing large patient datasets. Generalized association rule mining algorithms can be exploited to automatically extract hidden multiple-level associations among patient data items (e.g., examinations, drugs) from large datasets equipped with taxonomies. However, current approaches assume that all data items are equally relevant within each transaction, even though this assumption is rarely true.
This paper presents a new data mining environment targeted at patient data analysis. It tackles the issue of extracting generalized rules from weighted patient data, where items may be weighted differently according to their importance within each transaction. To this end, it proposes a novel type of association rule, namely the Weighted Generalized Association Rule (W-GAR). The usefulness of the proposed pattern has been evaluated on real patient datasets equipped with a taxonomy built over examinations and drugs. The achieved results demonstrate the effectiveness of the proposed approach in mining interesting and actionable knowledge in a real medical care scenario.
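The weighted-support idea behind W-GARs can be illustrated with a minimal sketch. All item names, weights, and the aggregation rule below are hypothetical illustrations, not the paper's actual formulation:

```python
# Hypothetical taxonomy: leaf item -> higher-level category.
taxonomy = {
    "insulin": "drug",
    "metformin": "drug",
    "glycemia test": "examination",
    "HbA1c test": "examination",
}

# Each transaction assigns a weight to every item it contains,
# reflecting the item's importance within that patient record.
transactions = [
    {"insulin": 0.9, "glycemia test": 0.4},
    {"metformin": 0.7, "HbA1c test": 0.8},
    {"insulin": 0.5, "HbA1c test": 0.6},
]

def weighted_support(item, transactions, taxonomy):
    """Average per-transaction weight of `item`, where a generalized
    item (a category) takes the max weight of its descendant leaves."""
    total = 0.0
    for t in transactions:
        if item in t:
            total += t[item]
        else:
            # Generalized item: take the max weight of matching leaves.
            ws = [w for leaf, w in t.items() if taxonomy.get(leaf) == item]
            if ws:
                total += max(ws)
    return total / len(transactions)

print(round(weighted_support("drug", transactions, taxonomy), 2))     # 0.7
print(round(weighted_support("insulin", transactions, taxonomy), 2))
```

Here the generalized item "drug" earns a higher weighted support than either of its leaves, which is what lets multiple-level patterns surface that leaf-level mining would miss.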
MeTA: Characterization of medical treatments at different abstraction levels
Physicians and healthcare organizations always collect large amounts of data during patient care. These large and high-dimensional datasets are usually characterized by an inherent sparseness. Hence, the analysis of these datasets to figure out interesting and hidden knowledge is a challenging task.
This paper proposes a new data mining framework based on generalized association rules to discover multiple-level correlations among patient data. Specifically, correlations among prescribed examinations, drugs, and patient profiles are discovered and analyzed at different abstraction levels. The rule extraction process is driven by a taxonomy that generalizes examinations and drugs into their corresponding categories. To ease the manual inspection of the result, a worthwhile subset of rules, i.e., the non-redundant generalized rules, is considered. Furthermore, rules are classified according to the involved data features (medical treatments or patient profiles) and then explored in a top-down fashion, i.e., from the small subset of high-level rules a drill-down is performed to target more specific rules.
The experiments, performed on a real diabetic patient dataset, demonstrate the effectiveness of the proposed approach in discovering interesting rule groups at different abstraction levels.
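Taxonomy-driven generalization and the top-down drill-down described above can be sketched as follows; the two-level taxonomy and the item names are invented for illustration:

```python
# Hypothetical two-level taxonomy mapping leaf items to categories.
taxonomy = {
    "metformin": "antidiabetic drug",
    "insulin": "antidiabetic drug",
    "HbA1c test": "blood examination",
    "glycemia test": "blood examination",
}

def generalize(transaction, taxonomy):
    """Lift each leaf item to its category, deduplicating the result."""
    return sorted({taxonomy.get(item, item) for item in transaction})

def drill_down(category, taxonomy):
    """Return the leaf items that generalize to `category`,
    i.e., the more specific items a high-level rule can be refined to."""
    return sorted(leaf for leaf, cat in taxonomy.items() if cat == category)

print(generalize(["metformin", "HbA1c test"], taxonomy))
print(drill_down("antidiabetic drug", taxonomy))
```

Mining first over the generalized transactions keeps the high-level rule set small; `drill_down` then recovers the candidate leaf items when a high-level rule looks interesting enough to refine.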
Towards semantic web mining
Semantic Web Mining aims at combining the two fast-developing research areas of the Semantic Web and Web Mining. The idea is to improve, on the one hand, the results of Web Mining by exploiting the new semantic structures in the Web, and, on the other hand, to make use of Web Mining for building up the Semantic Web. This paper gives an overview of where the two areas meet today, and sketches ways in which a closer integration could be profitable.
A Framework for Auditing Multilevel Models using Explainability Methods
Applications of multilevel models (MLMs) usually result in binary classification within groups or hierarchies based on a set of input features. For transparent and ethical applications of such models, sound audit frameworks need to be developed. In this paper, an audit framework for the technical assessment of regression MLMs is proposed. The focus is on three aspects: the model, discrimination, and transparency and explainability. These aspects are subsequently divided into sub-aspects. Contributors, such as inter-MLM group fairness, feature contribution order, and aggregated feature contribution, are identified for each of these sub-aspects. To measure the performance of the contributors, the framework proposes a shortlist of KPIs, and a traffic-light risk assessment method is coupled to these KPIs. For assessing transparency and explainability, different explainability methods (SHAP and LIME) are used and compared with a model-intrinsic method using quantitative methods and machine learning modelling. Using an open-source dataset, a model is trained and tested and the KPIs are computed. It is demonstrated that popular explainability methods such as SHAP and LIME underperform in accuracy when interpreting these models: they fail to predict the order of feature importance, the magnitudes, and occasionally even the nature of the feature contribution. For other contributors, such as group fairness and their associated KPIs, similar analyses and calculations have been performed to add depth to the proposed audit framework. The framework is expected to assist regulatory bodies in performing conformity assessments of AI systems that use multilevel binomial classification models at businesses. It will also benefit businesses deploying MLMs to be future-proof and aligned with the European Commission's proposed Regulation on Artificial Intelligence.
Comment: Submitted at ECIAIR 202
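A KPI of the kind described, scoring how well a post-hoc explainer reproduces a model-intrinsic feature-importance ordering and mapping the score to a traffic-light colour, might look like the sketch below. The agreement measure, the thresholds, and the feature names are illustrative assumptions, not the framework's actual definitions:

```python
from itertools import combinations

def rank_agreement(order_a, order_b):
    """Fraction of feature pairs ranked in the same relative order by
    two explanation methods (a simple Kendall-style agreement KPI)."""
    pos_a = {f: i for i, f in enumerate(order_a)}
    pos_b = {f: i for i, f in enumerate(order_b)}
    pairs = list(combinations(order_a, 2))
    agree = sum(
        1 for f, g in pairs
        if (pos_a[f] < pos_a[g]) == (pos_b[f] < pos_b[g])
    )
    return agree / len(pairs)

def traffic_light(kpi, green=0.8, amber=0.5):
    """Map a KPI score onto a traffic-light risk colour."""
    return "green" if kpi >= green else "amber" if kpi >= amber else "red"

# Hypothetical orderings: model-intrinsic vs. a post-hoc explainer.
intrinsic = ["age", "income", "group", "tenure"]
posthoc = ["age", "group", "income", "tenure"]

score = rank_agreement(intrinsic, posthoc)
print(score, traffic_light(score))
```

One swapped pair out of six leaves an agreement of 5/6, which the illustrative thresholds still classify as green; a fully reversed ordering would land in red, flagging the explainer as unreliable for that model.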
Finding usage patterns from generalized weblog data
Buried in the enormous, heterogeneous, and distributed information contained in web server access logs is knowledge with great potential value. As websites continue to grow in number and complexity, web usage mining systems face two significant challenges: scalability and accuracy. This thesis develops a web data generalization technique and incorporates it into the web usage mining framework in an attempt to exploit this information-rich source of data for effective and efficient pattern discovery. Given a concept hierarchy on the web pages, generalization replaces actual page-clicks with their general concepts. Existing methods do this by taking a level-based cut through the concept hierarchy. This adversely affects the quality of mined patterns since, depending on the depth of the chosen level, either significant pages of user interest get coalesced, or many insignificant concepts are retained. We present a usage-driven concept ascension algorithm, which preserves only significant items, possibly at different levels in the hierarchy. Concept usage is estimated using a small stratified sample of the large weblog data. A usage threshold is then used to define the nodes to be pruned in the hierarchy for generalization. Our experiments on large real weblog data demonstrate improved performance in terms of quality and computation time of the pattern discovery process. Our algorithm yields an effective and scalable tool for web usage mining.
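The usage-driven concept ascension can be sketched as follows; the page hierarchy, usage counts, and threshold are hypothetical, and the climb stops at the first ancestor whose estimated usage is significant:

```python
# Hypothetical concept hierarchy: page -> parent concept (roots map to None).
parent = {
    "/sports/football": "/sports",
    "/sports/tennis": "/sports",
    "/news/politics": "/news",
    "/sports": None,
    "/news": None,
}

# Usage counts estimated from a stratified sample of the weblog.
usage = {"/sports/football": 120, "/sports/tennis": 8, "/news/politics": 40}

def ascend(page, usage, parent, threshold):
    """Climb the hierarchy until a node's usage meets the threshold,
    so only significant concepts (possibly at different levels) survive."""
    node = page
    while parent.get(node) is not None and usage.get(node, 0) < threshold:
        node = parent[node]
    return node

clicks = ["/sports/football", "/sports/tennis", "/news/politics"]
print([ascend(p, usage, parent, threshold=50) for p in clicks])
```

Note that the significant leaf `/sports/football` is kept as-is while the insignificant pages are lifted to their ancestors, i.e., different clicks end up at different levels of the hierarchy, unlike a level-based cut.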
Frequent Itemset Mining for Big Data
Traditional data mining tools, developed to extract actionable knowledge from data, have proven inadequate for processing the huge amounts of data produced nowadays.
Even the most popular algorithms for Frequent Itemset Mining, an exploratory data analysis technique used to discover frequent co-occurrences of items in a transactional dataset, are inefficient on larger and more complex data.
As a consequence, many parallel algorithms have been developed, based on modern frameworks able to leverage distributed computation in commodity clusters of machines (e.g., Apache Hadoop, Apache Spark). However, frequent itemset mining parallelization is far from trivial. The search-space exploration, on which all the techniques are based, is not easily partitionable. Hence, distributed frequent itemset mining is a challenging problem and an interesting research topic.
In this context, our main contributions consist of (i) an exhaustive theoretical and experimental analysis of the best-in-class approaches, whose outcomes and open issues motivated (ii) the development of a distributed high-dimensional frequent itemset miner. The dissertation also introduces (iii) a data mining framework that relies heavily on distributed frequent itemset mining for the extraction of a specific type of itemsets.
The theoretical analysis highlights the challenges related to the distribution and the preliminary partitioning of the frequent itemset mining problem (i.e., the search-space exploration), describing the most widely adopted distribution strategies.
The extensive experimental campaign, instead, compares the expectations related to the algorithmic choices against the actual performance of the algorithms. We ran more than 300 experiments in order to evaluate and discuss the performance of the algorithms with respect to different real-life use cases and data distributions. The outcome of the review is that no algorithm is universally superior, and performance is heavily skewed by the data distribution.
Moreover, we identified a concrete gap in frequent pattern extraction for high-dimensional use cases. For this reason, we developed our own distributed high-dimensional frequent itemset miner based on Apache Hadoop. The algorithm splits the search-space exploration into independent sub-tasks. However, since the exploration benefits strongly from full knowledge of the problem, we introduced an interleaving synchronization phase. The result is a trade-off between the benefits of a centralized state and those of the additional computational power offered by parallelism. The experimental benchmarks, performed on real-life high-dimensional use cases, show the efficiency of the proposed approach in terms of execution time, load balancing, and resilience to memory issues.
Finally, the dissertation introduces a data mining framework in which distributed itemset mining is a fundamental component of the processing pipeline. The aim of the framework is the extraction of a new type of itemsets, called misleading generalized itemsets.
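The level-wise search-space exploration at the core of frequent itemset mining can be sketched with a naive, single-machine miner; this is illustrative only, since the dissertation's algorithms are distributed over Apache Hadoop:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Naive level-wise miner: enumerate candidate itemsets of growing
    size and keep those whose support meets the threshold."""
    items = sorted({i for t in transactions for i in t})
    result, k = {}, 1
    while True:
        level = {}
        for cand in combinations(items, k):
            support = sum(1 for t in transactions if set(cand) <= t)
            if support >= min_support:
                level[cand] = support
        if not level:
            break
        result.update(level)
        # Anti-monotonicity: only items appearing in frequent itemsets
        # of size k can appear in frequent itemsets of size k + 1.
        items = sorted({i for cand in level for i in cand})
        k += 1
    return result

data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(frequent_itemsets(data, min_support=2))
```

The exponential candidate space this loop walks is exactly what is hard to partition across machines: pruning at level k + 1 depends on the frequent itemsets of level k, which is why a fully independent split of sub-tasks loses information and a synchronization phase becomes attractive.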