3,667 research outputs found

    Corporate Smart Content Evaluation

    Get PDF
    Nowadays, a wide range of information sources are available due to the evolution of web and collection of data. Plenty of these information are consumable and usable by humans but not understandable and processable by machines. Some data may be directly accessible in web pages or via data feeds, but most of the meaningful existing data is hidden within deep web databases and enterprise information systems. Besides the inability to access a wide range of data, manual processing by humans is effortful, error-prone and not contemporary any more. Semantic web technologies deliver capabilities for machine-readable, exchangeable content and metadata for automatic processing of content. The enrichment of heterogeneous data with background knowledge described in ontologies induces re-usability and supports automatic processing of data. The establishment of “Corporate Smart Content” (CSC) - semantically enriched data with high information content with sufficient benefits in economic areas - is the main focus of this study. We describe three actual research areas in the field of CSC concerning scenarios and datasets applicable for corporate applications, algorithms and research. Aspect- oriented Ontology Development advances modular ontology development and partial reuse of existing ontological knowledge. Complex Entity Recognition enhances traditional entity recognition techniques to recognize clusters of related textual information about entities. Semantic Pattern Mining combines semantic web technologies with pattern learning to mine for complex models by attaching background knowledge. This study introduces the afore-mentioned topics by analyzing applicable scenarios with economic and industrial focus, as well as research emphasis. Furthermore, a collection of existing datasets for the given areas of interest is presented and evaluated. The target audience includes researchers and developers of CSC technologies - people interested in semantic web features, ontology development, automation, extracting and mining valuable information in corporate environments. The aim of this study is to provide a comprehensive and broad overview over the three topics, give assistance for decision making in interesting scenarios and choosing practical datasets for evaluating custom problem statements. Detailed descriptions about attributes and metadata of the datasets should serve as starting point for individual ideas and approaches

    Pattern mining under different conditions

    Get PDF
    New requirements and demands on pattern mining arise in modern applications, which cannot be fulfilled using conventional methods. For example, in scientific research, scientists are more interested in unknown knowledge, which usually hides in significant but not frequent patterns. However, existing itemset mining algorithms are designed for very frequent patterns. Furthermore, scientists need to repeat an experiment many times to ensure reproducibility. A series of datasets are generated at once, waiting for clustering, which can contain an unknown number of clusters with various densities and shapes. Using existing clustering algorithms is time-consuming because parameter tuning is necessary for each dataset. Many scientific datasets are extremely noisy. They contain considerably more noises than in-cluster data points. Most existing clustering algorithms can only handle noises up to a moderate level. Temporal pattern mining is also important in scientific research. Existing temporal pattern mining algorithms only consider pointbased events. However, most activities in the real-world are interval-based with a starting and an ending timestamp. This thesis developed novel pattern mining algorithms for various data mining tasks under different conditions. The first part of this thesis investigates the problem of mining less frequent itemsets in transactional datasets. In contrast to existing frequent itemset mining algorithms, this part focus on itemsets that occurred not that frequent. Algorithms NIIMiner, RaCloMiner, and LSCMiner are proposed to identify such kind of itemsets efficiently. NIIMiner utilizes the negative itemset tree to extract all patterns that occurred less than a given support threshold in a top-down depth-first manner. RaCloMiner combines existing bottom-up frequent itemset mining algorithms with a top-down itemset mining algorithm to achieve a better performance in mining less frequent patterns. LSCMiner investigates the problem of mining less frequent closed patterns. The second part of this thesis studied the problem of interval-based temporal pattern mining in the stream environment. Interval-based temporal patterns are sequential patterns in which each event is aligned with a starting and ending temporal information. The ability to handle interval-based events and stream data is lacking in existing approaches. A novel intervalbased temporal pattern mining algorithm for stream data is described in this part. The last part of this thesis studies new problems in clustering on numeric datasets. The first problem tackled in this part is shape alternation adaptivity in clustering. In applications such as scientific data analysis, scientists need to deal with a series of datasets generated from one experiment. Cluster sizes and shapes are different in those datasets. A kNN density-based clustering algorithm, kadaClus, is proposed to provide the shape alternation adaptability so that users do not need to tune parameters for each dataset. The second problem studied in this part is clustering in an extremely noisy dataset. Many real-world datasets contain considerably more noises than in-cluster data points. A novel clustering algorithm, kenClus, is proposed to identify clusters in arbitrary shapes from extremely noisy datasets. Both clustering algorithms are kNN-based, which only require one parameter k. In each part, the efficiency and effectiveness of the presented techniques are thoroughly analyzed. Intensive experiments on synthetic and real-world datasets are conducted to show the benefits of the proposed algorithms over conventional approaches

    A genetic algorithm coupled with tree-based pruning for mining closed association rules

    Get PDF
    Due to the voluminous amount of itemsets that are generated, the association rules extracted from these itemsets contain redundancy, and designing an effective approach to address this issue is of paramount importance. Although multiple algorithms were proposed in recent years for mining closed association rules most of them underperform in terms of run time or memory. Another issue that remains challenging is the nature of the dataset. While some of the existing algorithms perform well on dense datasets others perform well on sparse datasets. This paper aims to handle these drawbacks by using a genetic algorithm for mining closed association rules. Recent studies have shown that genetic algorithms perform better than conventional algorithms due to their bitwise operations of crossover and mutation. Bitwise operations are predominantly faster than conventional approaches and bits consume lesser memory thereby improving the overall performance of the algorithm. To address the redundancy in the mined association rules a tree-based pruning algorithm has been designed here. This works on the principle of minimal antecedent and maximal consequent. Experiments have shown that the proposed approach works well on both dense and sparse datasets while surpassing existing techniques with regard to run time and memory

    LEARNFCA: A FUZZY FCA AND PROBABILITY BASED APPROACH FOR LEARNING AND CLASSIFICATION

    Get PDF
    Formal concept analysis(FCA) is a mathematical theory based on lattice and order theory used for data analysis and knowledge representation. Over the past several years, many of its extensions have been proposed and applied in several domains including data mining, machine learning, knowledge management, semantic web, software development, chemistry ,biology, medicine, data analytics, biology and ontology engineering. This thesis reviews the state-of-the-art of theory of Formal Concept Analysis(FCA) and its various extensions that have been developed and well-studied in the past several years. We discuss their historical roots, reproduce the original definitions and derivations with illustrative examples. Further, we provide a literature review of it’s applications and various approaches adopted by researchers in the areas of dataanalysis, knowledge management with emphasis to data-learning and classification problems. We propose LearnFCA, a novel approach based on FuzzyFCA and probability theory for learning and classification problems. LearnFCA uses an enhanced version of FuzzyLattice which has been developed to store class labels and probability vectors and has the capability to be used for classifying instances with encoded and unlabelled features. We evaluate LearnFCA on encodings from three datasets - mnist, omniglot and cancer images with interesting results and varying degrees of success. Adviser: Jitender Deogu

    A Data Centric Privacy Preserved Mining Model for Business Intelligence Applications

    Get PDF
    In present day competitive scenario, the techniques such as data warehouse and on-line analytical process (OLAP) have become a very significant approach for decision support in data centric applications and industries. In fact the decision support mechanism puts certain moderately varied needs on database technology as compared to OLAP based applications. Data centric decision support schemes (DSS) like data warehouse might play a significant role in extracting details from various areas and for standardizing data throughout the organization to achieve a singular way of detail presentation. Such efficient data presentation facilitates information for decision making in business intelligence (BI) applications in various industrial services. In order to enhance the effectiveness and robust computation in BI applications, the optimization in data mining and its processing is must. On the other hand, being in a multiuser scenario, the security of data on warehouse is also a critical issue, which is not explored till date. In this paper a data centric and service oriented privacy preserved model for BI applications has been proposed. The optimization in data mining has been accomplished by means of C5.0 classification algorithm and comparative study has been done with C4.5 algorithm. The implementation of enhanced C5.0 algorithm with BI model would provide much better performance with privacy preservation facility for Business Intelligence applications

    Mining Predictive Patterns and Extension to Multivariate Temporal Data

    Get PDF
    An important goal of knowledge discovery is the search for patterns in the data that can help explaining its underlying structure. To be practically useful, the discovered patterns should be novel (unexpected) and easy to understand by humans. In this thesis, we study the problem of mining patterns (defining subpopulations of data instances) that are important for predicting and explaining a specific outcome variable. An example is the task of identifying groups of patients that respond better to a certain treatment than the rest of the patients. We propose and present efficient methods for mining predictive patterns for both atemporal and temporal (time series) data. Our first method relies on frequent pattern mining to explore the search space. It applies a novel evaluation technique for extracting a small set of frequent patterns that are highly predictive and have low redundancy. We show the benefits of this method on several synthetic and public datasets. Our temporal pattern mining method works on complex multivariate temporal data, such as electronic health records, for the event detection task. It first converts time series into time-interval sequences of temporal abstractions and then mines temporal patterns backwards in time, starting from patterns related to the most recent observations. We show the benefits of our temporal pattern mining method on two real-world clinical tasks

    New Fundamental Technologies in Data Mining

    Get PDF
    The progress of data mining technology and large public popularity establish a need for a comprehensive text on the subject. The series of books entitled by "Data Mining" address the need by presenting in-depth description of novel mining algorithms and many useful applications. In addition to understanding each section deeply, the two books present useful hints and strategies to solving problems in the following chapters. The contributing authors have highlighted many future research directions that will foster multi-disciplinary collaborations and hence will lead to significant development in the field of data mining

    ARM-AMO: An Efficient Association Rule Mining Algorithm Based on Animal Migration Optimization

    Get PDF
    The file attached to this record is the author's final peer reviewed version. The Publisher's final version can be found by following the DOI linkAssociation rule mining (ARM) aims to find out association rules that satisfy predefined minimum support and confidence from a given database. However, in many cases ARM generates extremely large number of association rules, which are impossible for end users to comprehend or validate, thereby limiting the usefulness of data mining results. In this paper, we propose a new mining algorithm based on Animal Migration Optimization (AMO), called ARM-AMO, to reduce the number of association rules. It is based on the idea that rules which are not of high support and unnecessary are deleted from the data. Firstly, Apriori algorithm is applied to generate frequent itemsets and association rules. Then, AMO is used to reduce the number of association rules with a new fitness function that incorporates frequent rules. It is observed from the experiments that, in comparison with the other relevant techniques, ARM-AMO greatly reduces the computational time for frequent item set generation, memory for association rule generation, and the number of rules generated
    • …
    corecore