3,654 research outputs found

Data Mining

Author: Parker Julian
Sloan Terence
Yau Hon
Publication venue
Publication date: 01/01/1998
Field of study

The Minimum Description Length Principle for Pattern Mining: A Survey

Author: Galbrun Esther
Publication venue
Publication date: 28/07/2021
Field of study

This is about the Minimum Description Length (MDL) principle applied to pattern mining. The length of this description is kept to the minimum. Mining patterns is a core task in data analysis and, beyond issues of efficient enumeration, the selection of patterns constitutes a major challenge. The MDL principle, a model selection method grounded in information theory, has been applied to pattern mining with the aim of obtaining compact, high-quality sets of patterns. After giving an outline of relevant concepts from information theory and coding, as well as of work on the theory behind the MDL and similar principles, we review MDL-based methods for mining various types of data and patterns. Finally, we open a discussion on some issues regarding these methods, and highlight currently active related data analysis problems.
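As a rough illustration of the model selection idea behind MDL (not a method taken from the survey), the sketch below compares two descriptions of a binary sequence using a crude two-part code: bits for the model plus bits for the data given the model. The function names and the parameter cost of 0.5 * log2(n) bits are assumptions chosen for illustration.

```python
import math


def bernoulli_data_bits(ones: int, n: int, p: float) -> float:
    """Bits needed to encode a binary sequence of length n with `ones` ones
    under a Bernoulli(p) model (ideal code lengths, -log2 of the likelihood)."""
    zeros = n - ones
    bits = 0.0
    if ones:
        bits -= ones * math.log2(p)
    if zeros:
        bits -= zeros * math.log2(1.0 - p)
    return bits


def description_lengths(sequence):
    """Total two-part description length, in bits, for two candidate models."""
    n = len(sequence)
    if n == 0:
        raise ValueError("sequence must be non-empty")
    ones = sum(sequence)
    # Model "fair": no parameter to transmit, every symbol costs 1 bit.
    fair = bernoulli_data_bits(ones, n, 0.5)
    # Model "biased": transmit the fitted parameter (assumed cost of
    # 0.5 * log2(n) bits), then the data under that parameter.
    p_hat = ones / n
    biased = 0.5 * math.log2(n) + bernoulli_data_bits(ones, n, p_hat)
    return {"fair": fair, "biased": biased}


def select_model(sequence) -> str:
    """Pick the model with the shortest total description length."""
    lengths = description_lengths(sequence)
    return min(lengths, key=lengths.get)


if __name__ == "__main__":
    skewed = [1] * 90 + [0] * 10
    balanced = [0, 1] * 50
    print(select_model(skewed), description_lengths(skewed))
    print(select_model(balanced), description_lengths(balanced))
```

On the skewed sequence the extra parameter pays for itself through a shorter data encoding; on the balanced sequence it does not, so the simpler model is kept. MDL-based pattern mining applies the same trade-off to sets of patterns rather than to a single parameter.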

arXiv.org e-Print Archive

Data mining in manufacturing: a review based on the kind of knowledge

Author: Alok Choudhary
Jennifer Harding
Manoj K. Tiwari
Publication venue
Publication date: 01/01/2009
Field of study

In modern manufacturing environments, vast amounts of data are collected in database management systems and data warehouses from all involved areas, including product and process design, assembly, materials planning, quality control, scheduling, maintenance, fault detection, etc. Data mining has emerged as an important tool for knowledge acquisition from manufacturing databases. This paper reviews the literature dealing with knowledge discovery and data mining applications in the broad domain of manufacturing, with a special emphasis on the type of functions to be performed on the data. The major data mining functions to be performed include characterization and description, association, classification, prediction, clustering and evolution analysis. The papers reviewed have therefore been categorized in these five categories. It has been shown that there is a rapid growth in the application of data mining in the context of manufacturing processes and enterprises in the last 3 years. This review reveals the progressive applications and existing gaps identified in the context of data mining in manufacturing. A novel text mining approach has also been used on the abstracts and keywords of 150 papers to identify the research gaps and find the linkages between knowledge area, knowledge type and the applied data mining tools and techniques.

Loughborough University Institutional Repository

Data complexity in machine learning

Author: Abu-Mostafa Yaser S.
Li Ling
Publication venue: 'California Institute of Technology Library'
Publication date: 26/05/2006
Field of study

We investigate the role of data complexity in the context of binary classification problems. The universal data complexity is defined for a data set as the Kolmogorov complexity of the mapping enforced by the data set. It is closely related to several existing principles used in machine learning such as Occam's razor, the minimum description length, and the Bayesian approach. The data complexity can also be defined based on a learning model, which is more realistic for applications. We demonstrate the application of the data complexity in two learning problems, data decomposition and data pruning. In data decomposition, we illustrate that a data set is best approximated by its principal subsets which are Pareto optimal with respect to the complexity and the set size. In data pruning, we show that outliers usually have high complexity contributions, and propose methods for estimating the complexity contribution. Since in practice we have to approximate the ideal data complexity measures, we also discuss the impact of such approximations.

Caltech Authors

A rule induction approach to forecasting critical alarms in a telecommunication network

Author: Di Fatta Giuseppe
Karthikeyan Vidhyalakshmi
Nauck Detlef
Stahl Frederic
Wrench C.
Publication venue
Publication date: 01/01/2020
Field of study

This paper proposes a white-box method of predicting critical alarms so they can be mitigated and understood by engineers. Forecasting these alarms will avoid outages and maintain the agreed service level, which is beneficial to both the provider of telecommunication services and the consumers. The paper evaluates several item set mining approaches on a set of alarms of the British Telecom (BT) national telecommunication network and proposes a novel transformation of the data to enable the discovery of patterns undetectable by current item set mining approaches. The result is a method for rule induction that predicts alarms with high precision using a wide range of features.
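To make the notion of a rule's precision concrete, here is a minimal sketch of scoring a single item set rule over windows of alarms. It is a generic support and precision computation, not the paper's transformation or induction method; the alarm names and the `rule_stats` helper are hypothetical.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, precision) of the rule antecedent -> consequent.

    support   = fraction of all transactions containing both sides
    precision = fraction of transactions matching the antecedent that
                also contain the consequent (also called confidence)
    """
    antecedent = frozenset(antecedent)
    consequent = frozenset(consequent)
    n = len(transactions)
    # Transactions in which the rule fires.
    covered = [t for t in transactions if antecedent <= set(t)]
    # Of those, the ones where the predicted items are actually present.
    hits = [t for t in covered if consequent <= set(t)]
    support = len(hits) / n if n else 0.0
    precision = len(hits) / len(covered) if covered else 0.0
    return support, precision


if __name__ == "__main__":
    # Hypothetical alarm windows; each set holds the alarms seen in one window.
    windows = [
        {"link_down", "high_temp", "critical"},
        {"link_down", "critical"},
        {"link_down"},
        {"high_temp"},
    ]
    print(rule_stats(windows, {"link_down"}, {"critical"}))
```

Because each rule is an explicit "if these alarms, then that alarm" statement with a measurable precision, an engineer can inspect it directly, which is the appeal of a white-box approach.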

Central Archive at the University of Reading

Crossref

A Review of Subsequence Time Series Clustering

Author: Saeed Aghabozorgi
Seyedjamal Zolhavarieh
Ying Wah Teh
Publication venue: 'Hindawi Limited'
Publication date: 01/01/2014
Field of study

Clustering of subsequence time series remains an open issue in time series clustering. Subsequence time series clustering is used in different fields, such as e-commerce, outlier detection, speech recognition, biological systems, DNA recognition, and text mining. One of the useful fields in the domain of subsequence time series clustering is pattern recognition. To improve this field, a sequence of time series data is used. This paper reviews some definitions and backgrounds related to subsequence time series clustering. The literature reviewed is categorized into three groups: preproof, interproof, and postproof period. Moreover, various state-of-the-art approaches in performing subsequence time series clustering are discussed under each of these categories. The strengths and weaknesses of the employed methods are evaluated as potential issues for future studies.

Crossref

Directory of Open Access Journals

PubMed Central

XML Schema Clustering with Semantic and Hierarchical Similarity Measures

Author: Iryadi Wina
Nayak Richi
Publication venue: 'Elsevier BV'
Publication date: 01/01/2007
Field of study

With the growing popularity of XML as a data representation language, collections of XML data have exploded in number. Methods are required to manage these collections and to discover useful information from them for improved document handling. We present a schema clustering process that organises heterogeneous XML schemas into groups. The methodology considers not only the linguistic aspects and the context of the elements but also their hierarchical structural similarity. We support our findings with experiments and analysis.

Crossref

Queensland University of Technology ePrints Archive

Descriptive document clustering via discriminant learning in a co-embedded space of multilevel similarities

Author: Ananiadou
Arai
Baraldi
Baraldi
Beil
Belkin
Bengio
Bengio
Bharambe
Carmel
Carpineto
Chang
Chen
Cheng
Cover
Cribbin
Cristianini
Cutting
Deerwester
Domingos
Drineas
Dubin
Duda
Eckart
Frantzi
Geraci
Globerson
Hatzivassiloglou
Haykin
Hearst
Hussain
Jain
Jayabharathy
Jones
Kohonen
Korkontzelos
Koshman
Kovács
Lagus
Lam
Lan
Li
Li
Luxburg
Mu
Mu
Mu
Noel
Osiński
Osiński
Ouyang
Rooneya
Salton
Stefanowski
Syed
Theodosiou
Thomas
Torgerson
Tseng
Wang
Xu
Xu
Zeng
Zhang
Publication venue: 'Wiley'
Publication date: 03/12/2014
Field of study

Descriptive document clustering aims at discovering clusters of semantically interrelated documents together with meaningful labels to summarize the content of each document cluster. In this work, we propose a novel descriptive clustering framework, referred to as CEDL. It relies on the formulation and generation of 2 types of heterogeneous objects, which correspond to documents and candidate phrases, using multilevel similarity information. CEDL is composed of 5 main processing stages. First, it simultaneously maps the documents and candidate phrases into a common co‐embedded space that preserves higher‐order, neighbor‐based proximities between the combined sets of documents and phrases. Then, it discovers an approximate cluster structure of documents in the common space. The third stage extracts promising topic phrases by constructing a discriminant model where documents along with their cluster memberships are used as training instances. Subsequently, the final cluster labels are selected from the topic phrases by a ranking scheme that uses multiple scores based on the extracted co‐embedding information and the discriminant output. The final stage polishes the initial clusters to reduce noise and accommodate the multitopic nature of documents. The effectiveness and competitiveness of CEDL are demonstrated qualitatively and quantitatively with experiments using document databases from different application fields.

University of Liverpool Repository

Crossref

Edge Hill University Research Information Repository

The University of Manchester - Institutional Repository