
    Dominant Attribute and Multiple Scanning Approaches for Discretization of Numerical Attributes

    The rapid development of high-throughput technologies and database management systems has made it possible to produce and store large amounts of data. However, making sense of big data and discovering knowledge from it remains a compounding challenge. Generally, data mining techniques search for information in datasets and express the gained knowledge in the form of trends, regularities, patterns or rules. Rules are frequently identified automatically by a technique called rule induction, one of the most important techniques in data mining and machine learning, which was developed primarily to handle symbolic data. However, real-life data often contain numerical attributes; therefore, in order to fully utilize the power of rule induction techniques, an essential preprocessing step called discretization, which converts numeric data into symbolic data, is employed in data mining. Here we present two entropy-based discretization techniques, known as the dominant attribute approach and the multiple scanning approach. These approaches were implemented as two explicit algorithms in the Java programming language, and experiments were conducted by applying each algorithm separately to seventeen well-known numerical data sets. The resulting discretized data sets were used for rule induction with the LEM2 (Learning from Examples Module 2) algorithm. For each dataset, experiments with the multiple scanning approach were repeated with an increasing number of scans until the interval counts stabilized. Preliminary results from this study indicate that the multiple scanning approach performed better than the dominant attribute approach in terms of producing comparatively smaller and simpler rule sets.
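
    The core step shared by both entropy-based approaches is choosing a cut point that minimizes the class-conditional entropy of the two resulting intervals. The Java sketch below illustrates that step only, on assumed toy data with illustrative names; it is not the authors' implementation and omits the dominant-attribute and multiple-scanning control loops.

        import java.util.*;

        // Minimal sketch: entropy-based selection of a single cut point for one
        // numerical attribute. Values and class labels are hypothetical.
        public class EntropyCutPoint {

            // Shannon entropy (in bits) of a list of class labels.
            static double entropy(List<String> labels) {
                Map<String, Integer> counts = new HashMap<>();
                for (String l : labels) counts.merge(l, 1, Integer::sum);
                double h = 0.0, n = labels.size();
                for (int c : counts.values()) {
                    double p = c / n;
                    h -= p * (Math.log(p) / Math.log(2));
                }
                return h;
            }

            // Returns the cut point (midpoint between consecutive distinct values)
            // that minimizes the weighted entropy of the two resulting intervals.
            static double bestCutPoint(double[] values, String[] classes) {
                Integer[] order = new Integer[values.length];
                for (int i = 0; i < order.length; i++) order[i] = i;
                Arrays.sort(order, Comparator.comparingDouble(i -> values[i]));

                double bestCut = Double.NaN, bestH = Double.MAX_VALUE;
                for (int k = 1; k < order.length; k++) {
                    double left = values[order[k - 1]], right = values[order[k]];
                    if (left == right) continue;   // no cut between equal values
                    List<String> lo = new ArrayList<>(), hi = new ArrayList<>();
                    for (int i = 0; i < order.length; i++)
                        (i < k ? lo : hi).add(classes[order[i]]);
                    double h = (lo.size() * entropy(lo) + hi.size() * entropy(hi)) / values.length;
                    if (h < bestH) { bestH = h; bestCut = (left + right) / 2.0; }
                }
                return bestCut;
            }

            public static void main(String[] args) {
                double[] temp = {35.8, 36.6, 37.1, 38.2, 39.0};
                String[] flu  = {"no", "no", "yes", "yes", "yes"};
                System.out.println("Best cut point: " + bestCutPoint(temp, flu));
            }
        }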

    A Multi-Tiered Genetic Algorithm for Data Mining and Hypothesis Refinement

    While there are many approaches to data mining, there is a notable gap in the ability to combine the advantages of multiple techniques. Many methods use rigid heuristics and guidelines in constructing rules for data, and are thus limited in their ability to describe patterns. Genetic algorithms provide a more flexible approach, yet the genetic algorithms that have been employed do not capitalize on the fact that data models have two levels: individual rules and the overall data model. This dissertation introduces a multi-tiered genetic algorithm capable of evolving individual rules and the data model at the same time. The multi-tiered genetic algorithm also provides a means of taking advantage of the strengths of the more rigid methods by using their output as input to the genetic algorithm. Most genetic algorithms use a single "roulette wheel" approach. As such, they are only able to select either good data models or good rules, but are incapable of selecting for both simultaneously. With the additional roulette wheel of the multi-tiered genetic algorithm, the fitness of both rules and data models can be evaluated, enabling the algorithm to select good rules from good data models. This also more closely emulates how genes are passed from parents to children in actual biology. Consequently, this technique strengthens the "genetics" of genetic algorithms. For ease of discussion, the multi-tiered genetic algorithm has been named "Arcanum." This technique was tested on thirteen data sets obtained from the University of California Irvine Knowledge Discovery in Databases Archive. Results for the same data sets were gathered for GAssist, another genetic algorithm designed for data mining, and J4.8, the WEKA implementation of C4.5. While both of the other techniques outperformed Arcanum overall, it was able to provide comparable or better results for five of the thirteen data sets, indicating that the algorithm can be used for data mining, although it needs improvement. The second stage of testing assessed the ability to take results from a previous algorithm and refine the data model. Initially, Arcanum was used to refine its own data models; of the six data models used for hypothesis refinement, it was able to improve upon three. Next, results from the LEM2 algorithm were used as input to Arcanum. Of the three data models from LEM2, Arcanum was able to improve upon all three by sacrificing accuracy in order to improve coverage, resulting in better data models overall. The last phase of hypothesis refinement was performed on C4.5. It required several attempts, each using different parameters, but Arcanum was finally able to make a slight improvement to the C4.5 data model. From the experimental results, Arcanum was shown to yield results comparable to GAssist and C4.5 on some of the data sets. It was also able to take data models from three different techniques and improve upon them. While there is certainly room for improvement of the multi-tiered genetic algorithm described in this dissertation, the experimental evidence supports the claims that it can perform both data mining and hypothesis refinement of data models from other data mining techniques.
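
    To make the two-level selection concrete, the Java sketch below applies fitness-proportionate ("roulette wheel") selection at two tiers: first over data models, then over rules within the chosen model. Class names, fitness values, and the random seed are illustrative assumptions; this is not the Arcanum implementation.

        import java.util.*;

        // Minimal sketch of two-tier roulette-wheel selection: one wheel picks a
        // data model by model fitness, a second wheel picks a rule inside it.
        public class TwoTierRoulette {
            static final Random RNG = new Random(42);

            static class Rule  { double fitness; Rule(double f) { fitness = f; } }
            static class Model { double fitness; List<Rule> rules;
                                 Model(double f, List<Rule> r) { fitness = f; rules = r; } }

            // Classic fitness-proportionate selection over an array of weights.
            static int spin(double[] weights) {
                double total = Arrays.stream(weights).sum();
                double r = RNG.nextDouble() * total, acc = 0.0;
                for (int i = 0; i < weights.length; i++) {
                    acc += weights[i];
                    if (r <= acc) return i;
                }
                return weights.length - 1;
            }

            // Tier 1: select a model; Tier 2: select a rule from that model.
            static Rule selectRule(List<Model> population) {
                double[] modelW = population.stream().mapToDouble(m -> m.fitness).toArray();
                Model chosen = population.get(spin(modelW));
                double[] ruleW = chosen.rules.stream().mapToDouble(rl -> rl.fitness).toArray();
                return chosen.rules.get(spin(ruleW));
            }

            public static void main(String[] args) {
                List<Model> pop = List.of(
                    new Model(0.9, List.of(new Rule(0.7), new Rule(0.4))),
                    new Model(0.3, List.of(new Rule(0.8), new Rule(0.2))));
                System.out.println("Selected rule fitness: " + selectRule(pop).fitness);
            }
        }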

    Epigenetic mechanisms of eu- and heterochromatin positioning in interphase nuclei


    Rough Set Models and Knowledge Acquisition for Incomplete Information Systems

    Nagaoka University of Technology

    Kernel Methods for Melanoma Recognition

    Malignant melanoma is the most deadly form of skin lesion. Early diagnosis is of critical importance to patient survival. Visual recognition algorithms could potentially be of great help to physicians in a computer-assisted diagnosis system. Previous work on this topic has focused mostly on developing ad-hoc segmentation and feature extraction methods. In this paper we take a completely different approach. We put the emphasis on the learning part of the algorithm by using two kernel-based classifiers, one discriminative and one probabilistic. As a discriminative approach we chose support vector machines, a state-of-the-art large-margin classifier which has proved very successful in visual applications. As a probabilistic approach we chose spin glass-Markov random fields, a kernel Gibbs distribution inspired by results from statistical physics. We benchmarked these algorithms against the (to our knowledge) state-of-the-art method for melanoma recognition, exploring how performance changes when using color or textural features, and how it is affected by the quality of the segmentation mask. We show with extensive experiments that the support vector machine approach outperforms the existing method and, on two classes out of three, achieves performance comparable to that obtained by expert clinicians.

    Melanoma Recognition Using Representative and Discriminative Kernel Classifiers

    Malignant melanoma is the most deadly form of skin lesion. Early diagnosis is of critical importance to patient survival. Visual recognition algorithms could potentially be of great help to physicians in a computer-assisted diagnosis system. Previous work on this topic has focused mostly on developing ad-hoc segmentation and feature extraction methods. In this paper we take a completely different approach. We put the emphasis on the learning part of the algorithm by using two kernel-based classifiers, one discriminative and one probabilistic. As a discriminative approach we chose support vector machines, a state-of-the-art large-margin classifier which has proved very successful in visual applications. As a probabilistic approach we chose spin glass-Markov random fields, a kernel Gibbs distribution inspired by results from statistical physics. We benchmarked these algorithms against the (to our knowledge) state-of-the-art method for melanoma recognition, exploring how performance changes when using color or textural features, and how it is affected by the quality of the segmentation mask. We show with extensive experiments that the support vector machine approach outperforms the existing method and, on two classes out of three, achieves performance comparable to that obtained by expert clinicians.
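
    Both melanoma papers rest on kernel-based learning, whose basic ingredient is a kernel function measuring similarity between feature vectors. The Java sketch below computes a Gaussian (RBF) kernel and the corresponding Gram matrix over a few hypothetical color/texture descriptors; it only illustrates the kernel machinery, not the papers' classifiers or data.

        // Minimal sketch of the Gaussian (RBF) kernel used by kernel classifiers
        // such as support vector machines: k(x, z) = exp(-gamma * ||x - z||^2).
        public class RbfKernel {

            static double rbf(double[] x, double[] z, double gamma) {
                double sq = 0.0;
                for (int i = 0; i < x.length; i++) {
                    double d = x[i] - z[i];
                    sq += d * d;
                }
                return Math.exp(-gamma * sq);
            }

            // Gram matrix K[i][j] = k(x_i, x_j); a kernel classifier needs only this.
            static double[][] gramMatrix(double[][] xs, double gamma) {
                int n = xs.length;
                double[][] k = new double[n][n];
                for (int i = 0; i < n; i++)
                    for (int j = 0; j < n; j++)
                        k[i][j] = rbf(xs[i], xs[j], gamma);
                return k;
            }

            public static void main(String[] args) {
                // Hypothetical 3-dimensional color/texture descriptors.
                double[][] features = { {0.2, 0.7, 0.1}, {0.3, 0.6, 0.2}, {0.9, 0.1, 0.5} };
                double[][] k = gramMatrix(features, 0.5);
                System.out.println("k(x0, x1) = " + k[0][1]);
            }
        }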

    Finding the online cry for help : automatic text classification for suicide prevention

    Successful prevention of suicide, a serious public health concern worldwide, hinges on the adequate detection of suicide risk. While online platforms are increasingly used for expressing suicidal thoughts, manually monitoring for such signals of distress is practically infeasible, given the information overload suicide prevention workers are confronted with. In this thesis, the automatic detection of suicide-related messages is studied. It presents the first classification-based approach to online suicidality detection, and focuses on Dutch user-generated content. In order to evaluate the viability of such a machine learning approach, we developed a gold standard corpus consisting of message board and blog posts. These were manually labeled according to a newly developed annotation scheme, grounded in suicide prevention practice. The scheme provides for the annotation of a post's relevance to suicide, and of the subject and severity of a suicide threat, if any. This allowed us to derive two tasks: the detection of suicide-related posts, and of severe, high-risk content. In a series of experiments, we sought to determine how well these tasks can be carried out automatically, and which information sources and techniques contribute to classification performance. The experimental results show that both types of messages can be detected with high precision. Therefore, the amount of noise generated by the system is minimal, even on very large datasets, making it usable in a real-world prevention setting. Recall is high for the relevance task, but at around 60% it is considerably lower for severity. This is mainly attributable to implicit references to suicide, which often go undetected. We found a variety of information sources to be informative for both tasks, including token and character n-gram bags-of-words, features based on LSA topic models, polarity lexicons and named entity recognition, and suicide-related terms extracted from a background corpus. To improve classification performance, the models were optimized using feature selection, hyperparameter optimization, or a combination of both. A distributed genetic algorithm approach proved successful in finding good solutions to this complex search problem, and resulted in more robust models. Experiments with cascaded classification for the severity task did not reveal performance benefits over direct classification (in terms of F1-score), but its structure allows the use of slower, memory-based learning algorithms that considerably improved recall. At the end of this thesis, we address a problem typical of user-generated content: noise in the form of misspellings, phonetic transcriptions and other deviations from the linguistic norm. We developed an automatic text normalization system using a cascaded statistical machine translation approach, and applied it to normalize the data for the suicidality detection tasks. Subsequent experiments revealed that, compared to the original data, normalized data resulted in fewer and more informative features, and improved classification performance. This extrinsic evaluation demonstrates the utility of automatic normalization for suicidality detection and, more generally, for text classification of user-generated content.
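
    One of the feature families reported as informative is character n-gram bags-of-words. The Java sketch below shows a minimal version of that extraction step on an invented English example post; the real system worked on Dutch user-generated content and combined many more feature types.

        import java.util.*;

        // Minimal sketch of character n-gram bag-of-words extraction.
        public class CharNgrams {

            // Counts all character n-grams of the given length in the text.
            static Map<String, Integer> charNgrams(String text, int n) {
                Map<String, Integer> counts = new HashMap<>();
                String normalized = text.toLowerCase();
                for (int i = 0; i + n <= normalized.length(); i++) {
                    counts.merge(normalized.substring(i, i + n), 1, Integer::sum);
                }
                return counts;
            }

            public static void main(String[] args) {
                Map<String, Integer> bag = charNgrams("I can't go on like this", 3);
                // Print the five most frequent trigrams of the invented example post.
                bag.entrySet().stream()
                   .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                   .limit(5)
                   .forEach(e -> System.out.println("'" + e.getKey() + "' -> " + e.getValue()));
            }
        }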

    Dynamic Rule Covering Classification in Data Mining with Cyber Security Phishing Application

    Data mining is the process of discovering useful patterns from datasets using intelligent techniques to help users make decisions. A typical data mining task is classification, which involves predicting a target variable, known as the class, in previously unseen data based on models learnt from an input dataset. Covering is a well-known classification approach that derives models consisting of If-Then rules. Covering methods such as PRISM have predictive performance competitive with other classical classification techniques such as greedy, decision tree and associative classification methods. Covering models are therefore appropriate decision-making tools, and users favour them when making decisions. Despite the use of the Covering approach in different classification applications, it is acknowledged that this approach suffers from the noticeable drawback of inducing massive numbers of rules, making the resulting model large and unmanageable for users. This issue is attributed to the way Covering techniques induce rules: they keep adding items to a rule's body, despite its limited data coverage (the number of training instances the rule classifies), until the rule reaches zero error. This excessive learning overfits the training dataset and limits the applicability of Covering models in decision making, because managers normally prefer a summarised set of knowledge that they can control and comprehend rather than a high-maintenance model. In practice, there should be a trade-off between the number of rules a classification model offers and its predictive performance. Another issue associated with Covering models is the overlap of training data among rules, which arises when the data a rule classifies are discarded during the rule discovery phase. Unfortunately, the impact of a rule's removed data on other potential rules is not considered by this approach. When the training data linked with a rule are removed, the frequency and rank of other rules' items that appeared in the removed data should be updated. The affected rules should maintain their true rank and frequency dynamically during the rule discovery phase rather than keeping the frequencies initially computed from the original input dataset. In response to these issues, a new dynamic learning technique based on Covering and rule induction, called Enhanced Dynamic Rule Induction (eDRI), is developed. eDRI has been implemented in Java and embedded in the WEKA machine learning tool. The algorithm incrementally discovers rules using primarily frequency and rule strength thresholds. In practice, these thresholds limit the search space for both items and potential rules by discarding any with insufficient data representation as early as possible, resulting in an efficient training phase. More importantly, eDRI substantially cuts down the number of training example scans by continuously updating potential rules' frequency and strength parameters whenever a rule is inserted into the classifier. In particular, for each derived rule, eDRI adjusts on the fly the remaining potential rules' item frequencies and ranks, specifically for those items that appeared within the training instances deleted for the derived rule. This yields a more realistic model with minimal rule redundancy and makes the rule induction process dynamic rather than static.
    Moreover, the proposed technique minimises the classifier's number of rules at an early stage by stopping learning when a rule does not meet the strength threshold, thereby reducing overfitting and ensuring a manageable classifier. Lastly, the eDRI prediction procedure not only prioritises the best-ranked rule for class forecasting on test data but also restricts the use of the default class rule, thus reducing the number of misclassifications. These improvements yield classification models of smaller size that do not overfit the training dataset while maintaining predictive performance. eDRI-derived models particularly benefit users taking key business decisions, since they provide a rich knowledge base to support decision making: their predictive accuracy is high, and they are easy to understand, controllable and robust, i.e. flexible enough to be amended without drastic change. eDRI's applicability has been evaluated on the hard problem of phishing detection. Phishing normally involves creating a fake, well-designed website that closely resembles an existing trusted business website, aiming to trick users into revealing credentials such as login information in order to access their financial assets. Experimental results on large phishing datasets revealed that eDRI is highly useful as an anti-phishing tool, since it derived models of manageable size compared with other traditional techniques without hindering classification performance. Further evaluation on several other classification datasets from different domains, obtained from the University of California data repository, corroborated eDRI's competitive performance with respect to accuracy, size of the knowledge representation, training time and item space reduction. This makes the proposed technique not only efficient in inducing rules but also effective.
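
    As a rough illustration of covering with the two thresholds described above, the Java sketch below grows a single rule PRISM-style, greedily adding the attribute-value condition with the highest rule strength while pruning candidates that fall below a minimum frequency. The toy phishing-style rows, attribute names, and threshold values are assumptions for illustration; this is not the eDRI implementation and it omits eDRI's dynamic frequency/rank updates.

        import java.util.*;

        // Minimal sketch of one PRISM-style covering step with minimum frequency
        // (coverage) and minimum strength (precision) thresholds.
        public class CoveringSketch {

            record Row(Map<String, String> attrs, String label) {}

            static long coverage(List<Row> rows, Map<String, String> body) {
                return rows.stream()
                           .filter(r -> r.attrs().entrySet().containsAll(body.entrySet()))
                           .count();
            }

            // Fraction of rows matching the body that carry the target class.
            static double strength(List<Row> rows, Map<String, String> body, String target) {
                long covered = coverage(rows, body);
                if (covered == 0) return 0.0;
                long correct = rows.stream()
                    .filter(r -> r.attrs().entrySet().containsAll(body.entrySet())
                                 && r.label().equals(target))
                    .count();
                return (double) correct / covered;
            }

            // Grow one rule for the target class, greedily adding the best condition.
            static Map<String, String> growRule(List<Row> rows, String target,
                                                long minFreq, double minStrength) {
                Map<String, String> body = new LinkedHashMap<>();
                Set<String> usedAttrs = new HashSet<>();
                while (strength(rows, body, target) < minStrength) {
                    Map.Entry<String, String> best = null;
                    double bestStrength = -1.0;
                    for (Row r : rows) {
                        if (!r.label().equals(target)) continue;
                        for (Map.Entry<String, String> cond : r.attrs().entrySet()) {
                            if (usedAttrs.contains(cond.getKey())) continue;
                            Map<String, String> candidate = new LinkedHashMap<>(body);
                            candidate.put(cond.getKey(), cond.getValue());
                            if (coverage(rows, candidate) < minFreq) continue; // prune weak items early
                            double s = strength(rows, candidate, target);
                            if (s > bestStrength) { bestStrength = s; best = cond; }
                        }
                    }
                    if (best == null) break;   // no condition meets the frequency threshold
                    body.put(best.getKey(), best.getValue());
                    usedAttrs.add(best.getKey());
                }
                return body;
            }

            public static void main(String[] args) {
                // Toy phishing-style training rows (illustrative attributes only).
                List<Row> data = List.of(
                    new Row(Map.of("hasIpUrl", "yes", "httpsToken", "yes"), "phishing"),
                    new Row(Map.of("hasIpUrl", "yes", "httpsToken", "no"),  "phishing"),
                    new Row(Map.of("hasIpUrl", "no",  "httpsToken", "no"),  "legitimate"),
                    new Row(Map.of("hasIpUrl", "no",  "httpsToken", "yes"), "legitimate"));
                System.out.println("Rule body for 'phishing': "
                    + growRule(data, "phishing", 1, 0.9));
            }
        }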

    Data Science and Knowledge Discovery

    Data Science (DS) is gaining significant importance in the decision process owing to its mix of various areas, including Computer Science, Machine Learning, Mathematics and Statistics, domain/business knowledge, software development, and traditional research. In the business field, applying DS allows the use of scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data to support the decision process. After collecting the data, it is crucial to discover the knowledge within it. In this step, Knowledge Discovery (KD) tasks are used to create knowledge from structured and unstructured sources (e.g., text, data, and images). The output needs to be in a readable and interpretable format, and it must represent knowledge in a manner that facilitates inferencing. KD is applied in several areas, such as education, health, accounting, energy, and public administration. This book includes fourteen excellent articles which discuss this trending topic and present innovative solutions, showing the importance of Data Science and Knowledge Discovery to researchers, managers, industry, society, and other communities. The chapters address several topics, including data mining, deep learning, data visualization and analytics, semantic data, geospatial and spatio-temporal data, data augmentation, and text mining.