637 research outputs found

    Automatic pattern-taxonomy extraction for web mining

    Full text link
    In this paper, we propose a model for discovering frequent sequential patterns, phrases, which can be used as profile descriptors of documents. It is indubitable that we can obtain numerous phrases using data mining algorithms. However, it is difficult to use these phrases effectively for answering what users want. Therefore, we present a pattern taxonomy extraction model which performs the task of extracting descriptive frequent sequential patterns by pruning the meaningless ones. The model then is extended and tested by applying it to the information filtering system. The results of the experiment show that pattern-based methods outperform the keyword-based methods. The results also indicate that removal of meaningless patterns not only reduces the cost of computation but also improves the effectiveness of the system. <br /

    CRIS-IR 2006

    Get PDF
    The recognition of entities and their relationships in document collections is an important step towards the discovery of latent knowledge as well as to support knowledge management applications. The challenge lies on how to extract and correlate entities, aiming to answer key knowledge management questions, such as; who works with whom, on which projects, with which customers and on what research areas. The present work proposes a knowledge mining approach supported by information retrieval and text mining tasks in which its core is based on the correlation of textual elements through the LRD (Latent Relation Discovery) method. Our experiments show that LRD outperform better than other correlation methods. Also, we present an application in order to demonstrate the approach over knowledge management scenarios.Fundação para a Ciência e a Tecnologia (FCT) Denmark's Electronic Research Librar

    Fast attribute selection based on the rough set boundary region

    Get PDF
    The problem of clustering exists in numerous fields such as bioinformatics, data mining, and the recognition of patterns. The function of techniques is to suitably select the best attribute from numerous contending attribute(s). RST-based approaches for definite data has gained significant attention, but cannot select clustering attributes for optimum performance. In this paper, the focus is on the processes that exhibit a similar degree of results to an identical attribute value. First, the MIA algorithm was identified as the supplement to the MSA algorithm, which experiences set approximation. Second, the proposition that MIA accomplishes lesser computational complexity through the indiscernibility relation measurement was highlighted. This observation is ascribed to the relationship between various attributes, which is markedly similar to those induced by others. Based on the fact that the size of the attribute domain is relatively small, the selection of such an attribute under such circumstances is problematic. Failure to choose the most suitable clustering attribute is challenging and the set is defined rather than computing the relative mean where it can only be implemented with a distinctive category of the information system, as illustrated with an example. Lastly, a substitute method for selecting a clustering attribute-based RST using Mean Dependency degree attribute(s) (MMD) was proposed. This involved selecting the maximum value of a mean attribute(s) as a clustering attribute through a considerable targeting procedure for the rapid selection of an attribute to settle the instability in selecting clustering attributes. Thus, the comparative performance of the selected clustering attributes-based RST techniques MSA and MIA was conducted

    Learning Ontology Relations by Combining Corpus-Based Techniques and Reasoning on Data from Semantic Web Sources

    Get PDF
    The manual construction of formal domain conceptualizations (ontologies) is labor-intensive. Ontology learning, by contrast, provides (semi-)automatic ontology generation from input data such as domain text. This thesis proposes a novel approach for learning labels of non-taxonomic ontology relations. It combines corpus-based techniques with reasoning on Semantic Web data. Corpus-based methods apply vector space similarity of verbs co-occurring with labeled and unlabeled relations to calculate relation label suggestions from a set of candidates. A meta ontology in combination with Semantic Web sources such as DBpedia and OpenCyc allows reasoning to improve the suggested labels. An extensive formal evaluation demonstrates the superior accuracy of the presented hybrid approach

    New rough set based maximum partitioning attribute algorithm for categorical data clustering

    Get PDF
    Clustering a set of data into homogeneous groups is a fundamental operation in data mining. Recently, consideration has been put on categorical data clustering, where the data set consists of non-numerical attributes. However, implementing several existing categorical clustering algorithms is challenging as some cannot handle uncertainty while others have stability issues. The Rough Set theory (RST) is a mathematical tool for dealing with categorical data and handling uncertainty. It is also used to identify cause-effect relationships in databases as a form of learning and data mining. Therefore, this study aims to address the issues of uncertainty and stability for categorical clustering, and it proposes an improved algorithm centred on RST. The proposed method employed the partitioning measure to calculate the information system's positive and boundary regions of attributes. Firstly, an attributes partitioning method called Positive Region-based Indiscernibility (PRI) was developed to address the uncertainty issue in attribute partitioning for categorical data. The PRI method requires the positive and boundary regions-based partitioning calculation method. Next, to address the computational complexity issue in the clustering process, a clustering attribute selection method called Maximum Mean Partitioning (MMP) is introduced by computing the mean. The MMP method selects the maximum degree of the mean attribute, and the attribute with the maximum mean partitioning value is chosen as the best clustering attribute. The integration of proposed PRI and MMP methods generated a new rough set hybrid clustering algorithm for categorical data clustering algorithm named Maximum Partitioning Attribute (MPA) algorithm. This hybrid algorithm is an all-inclusive solution for uncertainty, computational complexity, cluster purity, and higher accuracy in attribute partitioning and selecting a clustering attribute. The proposed MPA algorithm is compared against the baseline algorithms, namely Maximum Significance Attribute (MSA), Information-Theoretic Dependency Roughness (ITDR), Maximum Indiscernibility Attribute (MIA), and simple classical K-Mean. In addition, seven small data sets from previously utilized research cases and 21 UCI repository and benchmark datasets are used for validation. Finally, the results were presented in tabular and graphical form, showing the proposed MPA algorithm outperforms the baseline algorithms for all data sets. Furthermore, the results showed that the proposed MPA algorithm improves the rough accuracy against MSA, ITDR, and MIA by 54.42%. Hence, the MPA algorithm has reduced the computational complexity compared to MSA, ITDR, and MIA with 77.11% less time and 58.66% minimum iterations. Similarly, a significant percentage improvement, up to 97.35%, was observed for overall purity by the MPA algorithm against MSA, ITDR, and MIA. In addition, the increment up to 34.41% of the overall accuracy of simple K-means by MPA has been obtained. Hence, it is proven that the proposed MPA has given promising solutions to address the categorical data clustering problem

    Rough set approach for categorical data clustering

    Get PDF
    A few techniques of rough categorical data clustering exist to group objects having similar characteristics. However, the performance of the techniques is an issue due to low accuracy, high computational complexity and clusters purity. This work proposes a new technique called Maximum Dependency Attributes (MDA) to improve the previous techniques due to these issues. The proposed technique is based on rough set theory by taking into account the dependency of attributes of an information system. The main contribution of this technique is to introduce a new technique to classify objects from categorical datasets which has better performance as compared to the baseline techniques. The algorithm of the proposed technique is implemented in MATLAB® version 7.6.0.324 (R2008a). They are executed sequentially on a processor Intel Core 2 Duo CPUs. The total main memory is 1 Gigabyte and the operating system is Windows XP Professional SP3. Results collected during the experiments on four small datasets and thirteen UCI benchmark datasets for selecting a clustering attribute show that the proposed MDA technique is an efficient approach in terms of accuracy and computational complexity as compared to BC, TR and MMR techniques. For the clusters purity, the results on Soybean and Zoo datasets show that MDA technique provided better purity up to 17% and 9%, respectively. The experimental result on supplier chain management clustering also demonstrates how MDA technique can contribute to practical system and establish the better performance for computation complexity and clusters purity up to 90% and 23%, respectively

    Pharmacovigilance Decision Support : The value of Disproportionality Analysis Signal Detection Methods, the development and testing of Covariability Techniques, and the importance of Ontology

    Get PDF
    The cost of adverse drug reactions to society in the form of deaths, chronic illness, foetal malformation, and many other effects is quite significant. For example, in the United States of America, adverse reactions to prescribed drugs is around the fourth leading cause of death. The reporting of adverse drug reactions is spontaneous and voluntary in Australia. Many methods that have been used for the analysis of adverse drug reaction data, mostly using a statistical approach as a basis for clinical analysis in drug safety surveillance decision support. This thesis examines new approaches that may be used in the analysis of drug safety data. These methods differ significantly from the statistical methods in that they utilize co variability methods of association to define drug-reaction relationships. Co variability algorithms were developed in collaboration with Musa Mammadov to discover drugs associated with adverse reactions and possible drug-drug interactions. This method uses the system organ class (SOC) classification in the Australian Adverse Drug Reaction Advisory Committee (ADRAC) data to stratify reactions. The text categorization algorithm BoosTexter was found to work with the same drug safety data and its performance and modus operandi was compared to our algorithms. These alternative methods were compared to a standard disproportionality analysis methods for signal detection in drug safety data including the Bayesean mulit-item gamma Poisson shrinker (MGPS), which was found to have a problem with similar reaction terms in a report and innocent by-stander drugs. A classification of drug terms was made using the anatomical-therapeutic-chemical classification (ATC) codes. This reduced the number of drug variables from 5081 drug terms to 14 main drug classes. The ATC classification is structured into a hierarchy of five levels. Exploitation of the ATC hierarchy allows the drug safety data to be stratified in such a way as to make them accessible to powerful existing tools. A data mining method that uses association rules, which groups them on the basis of content, was used as a basis for applying the ATC and SOC ontologies to ADRAC data. This allows different views of these associations (even very rare ones). A signal detection method was developed using these association rules, which also incorporates critical reaction terms.Doctor of Philosoph

    Proceedings of the 9th International Workshop on Information Retrieval on Current Research Information Systems

    Get PDF
    The recognition of entities and their relationships in document collections is an important step towards the discovery of latent knowledge as well as to support knowledge management applications. The challenge lies on how to extract and correlate entities, aiming to answer key knowledge management questions, such as; who works with whom, on which projects, with which customers and on what research areas. The present work proposes a knowledge mining approach supported by information retrieval and text mining tasks in which its core is based on the correlation of textual elements through the LRD (Latent Relation Discovery) method. Our experiments show that LRD outperform better than other correlation methods. Also, we present an application in order to demonstrate the approach over knowledge management scenarios
    corecore