
    Dividing the Ontology Alignment Task with Semantic Embeddings and Logic-based Modules

    Large ontologies still pose serious challenges to state-of-the-art ontology alignment systems. In this paper we present an approach that combines a neural embedding model and logic-based modules to accurately divide an input ontology matching task into smaller and more tractable matching (sub)tasks. We have conducted a comprehensive evaluation using the datasets of the Ontology Alignment Evaluation Initiative. The results are encouraging and suggest that the proposed method is adequate in practice and can be integrated within the workflow of systems that are unable to cope with very large ontologies.
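    The division step can be pictured with a minimal sketch. This is not the authors' system: a generic TF-IDF vectoriser and k-means clustering stand in for the paper's neural embedding model and logic-based module extraction, and every function and variable name below is a hypothetical assumption.

```python
# Minimal sketch only: TF-IDF + k-means stand in for the paper's neural
# embeddings and logic-based module extraction; all names are hypothetical.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def divide_matching_task(entities1, entities2, n_subtasks=5):
    """Split one large alignment task over two ontologies into subtasks.

    entities1/entities2: entity label strings from each input ontology.
    Returns a list of (subset1, subset2) matching subtasks.
    """
    labels = entities1 + entities2
    # Cheap lexical vectors as a stand-in for learned semantic embeddings.
    vectors = TfidfVectorizer().fit_transform(labels)
    clusters = KMeans(n_clusters=n_subtasks, n_init=10).fit_predict(vectors)

    subtasks = []
    for c in range(n_subtasks):
        sub1 = [e for e, k in zip(entities1, clusters) if k == c]
        sub2 = [e for e, k in zip(entities2, clusters[len(entities1):]) if k == c]
        if sub1 and sub2:  # keep only clusters that span both ontologies
            subtasks.append((sub1, sub2))
    return subtasks
```

    Each subtask can then be passed independently to a matcher that could not handle the full ontologies in one pass.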

    Pan-genome Analysis, Visualization and Exploration

    The dynamics of prokaryotic genomes are driven by the intricate interplay of different evolutionary forces such as gene duplication, gene loss and horizontal transfer. Even closely related strains can exhibit remarkable genetic diversity and substantial gene presence/absence variation. The pan-genome, namely the complete inventory of genes in a collection of strains, can be several times larger than the genome of any single strain. Although several tools for pan-genome analysis have been published, there is still much room for algorithmic improvement, as well as a need for applications that better interactively visualize and explore pan-genomes. We have therefore developed panX, an automated computational pipeline for efficient identification of orthologous gene clusters in the pan-genome. panX identifies homologous relationships among genes using DIAMOND and MCL and then harnesses phylogeny-based post-processing to separate orthologs from paralogs. Furthermore, we take advantage of a divide-and-conquer strategy to achieve an approximately linear runtime on large datasets. The analysis results can be visualized with the accompanying software, an easy-to-use and powerful web-based application for interactive exploration of the pan-genome. The visualization dashboard comprises a variety of connected components that allow rapid searching, filtering and sorting of genes, and flexible investigation of evolutionary relationships among strains and their genes. panX seamlessly interlinks gene clusters with their alignments and gene phylogenies, maps mutations onto the branches of the gene tree, and highlights gene gain and loss events on the core-genome phylogeny, which can also be colored by metadata associated with strains. Using 120 simulated pan-genome datasets for benchmarking and comparing clustering results from different tools on a real dataset, panX exhibits good overall performance across a large range of diversities. panX is available at pangenome.de, where a wide range of microbial pan-genomes has already been established. In addition, user-provided pan-genomes can be visualized either via the web server or by running panX locally as a web-based application.
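    A minimal sketch of the clustering step described above may help. In the real pipeline, DIAMOND produces all-against-all similarity hits and MCL clusters them; here, connected components over a thresholded similarity graph stand in for MCL, the phylogeny-based ortholog/paralog separation is omitted, and the score cutoff and all names are illustrative assumptions.

```python
# Hedged sketch: connected components over a similarity graph stand in for
# MCL; input tuples mimic parsed DIAMOND tabular hits (query, subject, score).
import networkx as nx

def cluster_homologs(hits, score_cutoff=50.0):
    """hits: iterable of (gene_a, gene_b, bit_score) tuples.
    Returns putative homologous gene clusters as sorted lists."""
    graph = nx.Graph()
    for a, b, score in hits:
        if a != b and score >= score_cutoff:
            graph.add_edge(a, b, weight=score)
    # Each connected component approximates one homologous gene cluster.
    return [sorted(component) for component in nx.connected_components(graph)]

example_hits = [
    ("strainA|gene1", "strainB|gene7", 210.0),
    ("strainB|gene7", "strainC|gene3", 180.0),
    ("strainA|gene2", "strainC|gene9", 95.0),
]
for cluster in cluster_homologs(example_hits):
    print(cluster)
```

    In the full pipeline, each resulting cluster would then be aligned and given a gene tree so that paralogs can be split off by phylogenetic post-processing.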

    Dynamic Rule Covering Classification in Data Mining with Cyber Security Phishing Application

    Data mining is the process of discovering useful patterns from datasets using intelligent techniques to help users make certain decisions. A typical data mining task is classification, which involves predicting a target variable, known as the class, in previously unseen data based on models learnt from an input dataset. Covering is a well-known classification approach that derives models with If-Then rules. Covering methods, such as PRISM, have predictive performance competitive with other classical classification techniques such as greedy, decision tree and associative classification. Covering models are therefore appropriate decision-making tools, and users favour them when carrying out decisions. Despite the use of the Covering approach in different classification applications, it is acknowledged that this approach suffers from a noticeable drawback: it induces massive numbers of rules, making the resulting model large and unmanageable. This issue is attributed to the way Covering techniques induce rules: they keep adding items to a rule's body, despite limited data coverage (the number of training instances the rule classifies), until the rule reaches zero error. This excessive learning overfits the training dataset and also limits the applicability of Covering models in decision making, because managers normally prefer a summarised set of knowledge that they can control and comprehend rather than a high-maintenance model. In practice, there should be a trade-off between the number of rules a classification model offers and its predictive performance. Another issue associated with Covering models is the overlap of training data among rules, which arises because a rule's classified data are discarded during the rule discovery phase. Unfortunately, the impact of a rule's removed data on other potential rules is not considered by this approach. When the training data linked with a rule are removed, both the frequency and the rank of other rules' items that appeared in the removed data change. The impacted rules should maintain their true rank and frequency dynamically during the rule discovery phase, rather than keeping the frequency initially computed from the original input dataset. In response to these issues, a new dynamic learning technique based on Covering and rule induction, which we call Enhanced Dynamic Rule Induction (eDRI), is developed. eDRI has been implemented in Java and embedded in the WEKA machine learning tool. The algorithm incrementally discovers rules using primarily frequency and rule-strength thresholds. In practice, these thresholds limit the search space for both items and potential rules by discarding, as early as possible, any with insufficient data representation, resulting in an efficient training phase. More importantly, eDRI substantially cuts down the number of training-example scans by continuously updating potential rules' frequency and strength parameters whenever a rule is inserted into the classifier. In particular, for each derived rule, eDRI adjusts on the fly the frequencies and ranks of the remaining potential rules' items, specifically those that appeared within the training instances deleted for the derived rule. This gives a more realistic model with minimal rule redundancy and makes the rule induction process dynamic rather than static.
    Moreover, the proposed technique minimises the classifier's number of rules at a preliminary stage by stopping learning whenever a rule fails to meet the rule-strength threshold, thereby minimising overfitting and ensuring a manageable classifier. Lastly, the eDRI prediction procedure not only prioritises the best-ranked rule when forecasting the class of test data but also restricts the use of the default class rule, thus reducing the number of misclassifications. These improvements yield smaller classification models that do not overfit the training dataset while maintaining predictive performance. Models derived by eDRI particularly benefit users taking key business decisions, since they provide a rich knowledge base to support decision making: their predictive accuracy is high, and they are easy to understand, controllable and robust, i.e. flexible enough to be amended without drastic change. eDRI's applicability has been evaluated on the hard problem of phishing detection. Phishing normally involves creating a well-designed fake website that closely mimics an existing, trusted business website, aiming to trick users and illegally obtain credentials such as login information in order to access their financial assets. Experimental results on large phishing datasets revealed that eDRI is highly useful as an anti-phishing tool, since it derived models of manageable size compared with other traditional techniques, without hindering classification performance. Further evaluation on several classification datasets from different domains, obtained from the University of California (UCI) data repository, corroborated eDRI's competitive performance with respect to accuracy, size of the knowledge representation, training time and item-space reduction. This makes the proposed technique not only efficient in inducing rules but also effective.
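    The dynamic covering idea can be illustrated with a small sketch. This is not the authors' Java/WEKA implementation: it grows single-condition rules only, and the thresholds, data layout and names are assumptions, but it shows the core mechanism the abstract describes, namely recomputing item frequencies on the remaining training data after each accepted rule removes its covered instances.

```python
# Hedged sketch of dynamic rule covering: after each accepted rule, covered
# instances are removed and item counts are recomputed on what remains, so
# later rules are ranked against up-to-date frequencies and strengths.
from collections import Counter

def induce_rules(instances, min_freq=2, min_strength=0.8):
    """instances: list of (dict_of_attribute_values, class_label) pairs.
    Returns rules as ((attribute, value), class_label, strength) tuples."""
    rules, remaining = [], list(instances)
    while remaining:
        # Recount items on the *remaining* data: this recomputation is the
        # dynamic update that static covering methods skip.
        item_class, item_total = Counter(), Counter()
        for attrs, label in remaining:
            for item in attrs.items():
                item_total[item] += 1
                item_class[(item, label)] += 1
        best = None
        for (item, label), hits in item_class.items():
            if item_total[item] < min_freq:
                continue                      # frequency threshold
            strength = hits / item_total[item]
            if strength >= min_strength and (best is None or strength > best[2]):
                best = (item, label, strength)
        if best is None:
            break                             # early stop keeps the model small
        rules.append(best)
        (attr, val), _, _ = best
        # Covering step: drop all instances matched by the new rule's condition.
        remaining = [(a, c) for a, c in remaining if a.get(attr) != val]
    return rules
```

    At prediction time, the sketch's rules would be tried in rank order, falling back to a default class only when no rule fires, mirroring the restricted default-class use described above.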

    Reconstruction Methods for Providing Privacy in Data Mining

    Data mining is the process of finding correlations or patterns among the dozens of fields in large databases. A fruitful direction for data mining research is the development of techniques that incorporate privacy concerns, since the primary task in our paper is to ensure that the accurate data we hold are modified before being provided to users. For this reason, much research effort has recently been devoted to the problem of providing privacy in data mining. We consider the concrete case of building a decision tree classifier from data in which the values of individual records have been perturbed. The resulting data records look very different from the original records, and the distribution of data values is also very different from the original distribution. We therefore reconstruct an estimate of the original distribution from the perturbed data; using these reconstructed distributions, we are able to build classifiers whose accuracy is comparable to the accuracy of classifiers built with the original data.
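    A brief sketch of the perturbation-and-reconstruction scheme may clarify. It follows the additive-noise style of Agrawal and Srikant's distribution reconstruction procedure, which this abstract appears to build on; the Gaussian noise model, bin choices and all names are assumptions made for illustration.

```python
# Hedged sketch: values are released only after additive random noise, and
# the miner recovers an estimate of the value *distribution* (not the
# records) via an iterative Bayesian update over discretised bins.
import numpy as np

rng = np.random.default_rng(0)

# Original sensitive values (never released) and their perturbed versions.
original = rng.normal(loc=40.0, scale=8.0, size=5000)        # e.g. ages
noise_sigma = 15.0
released = original + rng.normal(0.0, noise_sigma, size=original.size)

def reconstruct_distribution(w, sigma, bins, iters=50):
    """Estimate P(X in bin) from perturbed values w = x + N(0, sigma^2)."""
    centers = (bins[:-1] + bins[1:]) / 2.0
    fx = np.full(centers.size, 1.0 / centers.size)           # uniform start
    # Noise density evaluated at every (sample, bin-center) pair.
    fy = np.exp(-((w[:, None] - centers[None, :]) ** 2) / (2 * sigma**2))
    for _ in range(iters):
        post = fy * fx                                        # Bayes numerator
        post /= post.sum(axis=1, keepdims=True)               # per-sample posterior
        fx = post.mean(axis=0)                                # updated estimate
    return centers, fx

bins = np.linspace(0.0, 80.0, 41)
centers, fx = reconstruct_distribution(released, noise_sigma, bins)
# fx now approximates the original distribution closely enough to drive
# per-bin split statistics for a decision tree, while each individual
# released record remains noisy.
```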