92 research outputs found

    An Effective Ensemble Approach for Spam Classification

    The annoyance of spam increasingly plagues both individuals and organizations. Spam classification addresses the important problem of distinguishing spam from legitimate email or addresses. This paper presents a neural network ensemble approach based on a specially designed cooperative coevolution paradigm. Each component network corresponds to a separate subpopulation, and all subpopulations are evolved simultaneously. The ensemble performance and the Q-statistic diversity measure are adopted as the objectives, and the component networks are evaluated using the multi-objective Pareto optimality measure. Experimental results show that the proposed algorithm outperforms traditional ensemble methods on spam classification problems.
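The abstract names the Q-statistic as its diversity objective but gives no formula. The following is a minimal sketch, assuming the common pairwise definition (Yule's Q computed over the correct/incorrect outcomes of two classifiers). The function name `q_statistic` and the zero-denominator convention are illustrative assumptions, and how the paper aggregates pairwise values over the whole ensemble is not stated in the source.

```python
from typing import Sequence


def q_statistic(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """Pairwise Q-statistic from per-sample correctness flags of two classifiers.

    Q = (N11*N00 - N01*N10) / (N11*N00 + N01*N10), where N11 counts samples both
    classifiers get right, N00 samples both get wrong, and N10 / N01 samples
    only the first / only the second gets right.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both classifiers must be scored on the same samples")

    n11 = n00 = n10 = n01 = 0
    for a, b in zip(correct_a, correct_b):
        if a and b:
            n11 += 1
        elif not a and not b:
            n00 += 1
        elif a:
            n10 += 1
        else:
            n01 += 1

    denominator = n11 * n00 + n01 * n10
    if denominator == 0:
        # Assumed convention: treat the undefined case as "no association".
        return 0.0
    return (n11 * n00 - n01 * n10) / denominator


# Classifiers that err on the same samples score 1; complementary errors score -1.
same_errors = q_statistic([True, True, False, False], [True, True, False, False])
opposite_errors = q_statistic([True, False], [False, True])
```

Values near 1 indicate classifiers that tend to fail together, so lower values are what a diversity objective would favor.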

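    Pengukuran Kemiripan Term Berbasis Co-Occurrence dan Inverse Class Frequency Pada Pengembangan Thesaurus Bahasa Arab [Term Similarity Measurement Based on Co-Occurrence and Inverse Class Frequency in the Development of an Arabic Thesaurus]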

    A thesaurus is a useful tool for query expansion in document search. It is a dictionary built from the similarity between terms. One way to measure term similarity when building a thesaurus automatically is a statistical approach over the terms in the corpus documents, and several Arabic thesauri have been built this way. One such statistical approach is the co-occurrence technique, which considers how frequently terms appear together. Term similarity in thesaurus construction depends not only on how informative a term is with respect to a document, but also on how informative it is with respect to a cluster. The corpus documents are collected and preprocessed to obtain a list of terms. The TF-IDF value of each term is computed and used as a feature for clustering the documents. The clustered documents then serve as the basis for computing the Inverse Class Frequency (ICF). The TF-ICF value is used to compute the cluster weight in the co-occurrence technique, a computation that takes the joint occurrence of the two terms into account. The resulting cluster weight, which incorporates TF-ICF, serves as the term similarity value for building the thesaurus. Evaluation of the thesaurus produced by the proposed method yielded a highest precision of 76.7%, a highest recall of 81.8%, and an f-measure of 54.1%.
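The TF-ICF weighting mentioned in this abstract can be sketched as follows. This is a minimal illustration, assuming the plain logarithmic form ICF(t) = log(C / cf(t)), where C is the number of clusters and cf(t) is the number of clusters containing term t. The abstract does not give the exact variant, nor the cluster-weight formula built on top of it, so the function name `tf_icf` and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Sequence


def tf_icf(
    docs: Sequence[List[str]], labels: Sequence[Hashable]
) -> Dict[Hashable, Dict[str, float]]:
    """Return, for every cluster, a mapping term -> TF * ICF.

    TF is the raw count of the term inside the cluster; ICF is
    log(number of clusters / number of clusters containing the term).
    """
    # Term counts aggregated per cluster.
    cluster_tf: Dict[Hashable, Counter] = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        cluster_tf[label].update(tokens)

    n_clusters = len(cluster_tf)

    # In how many clusters does each term occur at least once?
    cluster_freq: Counter = Counter()
    for counts in cluster_tf.values():
        cluster_freq.update(counts.keys())

    return {
        label: {
            term: tf * math.log(n_clusters / cluster_freq[term])
            for term, tf in counts.items()
        }
        for label, counts in cluster_tf.items()
    }


# Toy example with transliterated tokens and two clusters (made-up data).
weights = tf_icf([["kitab", "qalam"], ["kitab", "bayt"]], [0, 1])
```

A term that occurs in every cluster ("kitab" here) gets weight 0, while cluster-specific terms keep a positive weight, which is the intuition behind using ICF alongside the document-level TF-IDF.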

    Introspective knowledge acquisition for case retrieval networks in textual case base reasoning.

    Textual Case Based Reasoning (TCBR) aims at effective reuse of information contained in unstructured documents. The key advantage of TCBR over traditional Information Retrieval systems is its ability to incorporate domain-specific knowledge to facilitate case comparison beyond simple keyword matching. However, substantial human intervention is needed to acquire and transform this knowledge into a form suitable for a TCBR system. In this research, we present automated approaches that exploit statistical properties of document collections to alleviate this knowledge acquisition bottleneck. We focus on two important knowledge containers: relevance knowledge, which shows relatedness of features to cases, and similarity knowledge, which captures the relatedness of features to each other. The terminology is derived from the Case Retrieval Network (CRN) retrieval architecture in TCBR, which is used as the underlying formalism in this thesis applied to text classification. Latent Semantic Indexing (LSI) generated concepts are a useful resource for relevance knowledge acquisition for CRNs. This thesis introduces a supervised LSI technique called sprinkling that exploits class knowledge to bias LSI's concept generation. An extension of this idea, called Adaptive Sprinkling (AS), has been proposed to handle inter-class relationships in complex domains like hierarchical (e.g. Yahoo directory) and ordinal (e.g. product ranking) classification tasks. Experimental evaluation results show the superiority of CRNs created with sprinkling and AS, not only over LSI on its own, but also over state-of-the-art classifiers like Support Vector Machines (SVM). Current statistical approaches based on feature co-occurrences can be utilized to mine similarity knowledge for CRNs. However, related words often do not co-occur in the same document, though they co-occur with similar words. We introduce an algorithm to efficiently mine such indirect associations, called higher order associations.
Empirical results show that CRNs created with the acquired similarity knowledge outperform both LSI and SVM. Incorporating acquired knowledge into the CRN transforms it into a densely connected network. While improving retrieval effectiveness, this has the unintended effect of slowing down retrieval. We propose a novel retrieval formalism called the Fast Case Retrieval Network (FCRN), which eliminates redundant run-time computations to improve retrieval speed. Experimental results show FCRN's ability to scale up over high-dimensional textual casebases. Finally, we investigate novel ways of visualizing and estimating the complexity of textual casebases that can help explain performance differences across casebases. Visualization provides a qualitative insight into the casebase, while complexity is a quantitative measure that characterizes classification or retrieval hardness intrinsic to a dataset. We study correlations of experimental results from the proposed approaches against complexity measures over diverse casebases.
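The sprinkling idea in this abstract (using class knowledge to bias LSI's concept generation) can be illustrated by its matrix-augmentation step. This is only a sketch under assumptions: artificial class-indicator terms are appended as extra rows of the term-document matrix before LSI is applied. The function name `sprinkle`, the `terms_per_class` parameter, and the 0/1 encoding are hypothetical, and the subsequent SVD and the removal of the sprinkled rows afterwards are omitted.

```python
from typing import Hashable, List, Sequence


def sprinkle(
    term_doc: Sequence[Sequence[float]],
    labels: Sequence[Hashable],
    terms_per_class: int = 1,
) -> List[List[float]]:
    """Append artificial class-indicator rows to a term-document matrix.

    `term_doc` has one row per term and one column per document. For every
    class, `terms_per_class` rows are added that are 1.0 for documents of
    that class and 0.0 otherwise, so documents sharing a class also share
    extra "terms" when a low-rank factorization is computed later.
    """
    if any(len(row) != len(labels) for row in term_doc):
        raise ValueError("every row needs one column per labelled document")

    augmented = [list(row) for row in term_doc]
    for cls in sorted(set(labels), key=str):
        indicator = [1.0 if label == cls else 0.0 for label in labels]
        for _ in range(terms_per_class):
            augmented.append(list(indicator))
    return augmented


# Two real terms, three documents, two classes, two sprinkled terms per class.
matrix = sprinkle([[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]], ["a", "b", "a"], terms_per_class=2)
```

Raising `terms_per_class` strengthens the class signal relative to the original terms; how that count is chosen, and how Adaptive Sprinkling varies it between classes, is not described in the abstract.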

    Condition monitoring of helical gears using automated selection of features and sensors

    Selecting the most sensitive sensors and signal processing methods is an essential step in the design of condition monitoring and intelligent fault diagnosis and prognostic systems. Sensory data normally includes a high level of noise and irrelevant or redundant information, which makes this selection a difficult task. This paper introduces a new application of the Automated Sensor and Signal Processing Approach (ASPS) to the design of an effective condition monitoring system for gearbox fault diagnosis. The approach is based on Taguchi's orthogonal arrays, combined with automated selection of sensory characteristic features, to provide an economically effective and optimal selection of sensors and signal processing methods with reduced experimental work. Multi-sensory signals such as acoustic emission, vibration, speed and torque are collected from the gearbox test rig under different health and operating conditions. Time and frequency domain signal processing methods are utilised to assess the suggested approach. The experiments investigate a single-stage gearbox system with three levels of damage in a helical gear to evaluate the proposed approach. Two different classification models based on neural networks are employed to evaluate the methodology. The results show that the suggested approach can be applied to the design of gearbox condition monitoring systems without the need to implement pattern recognition tools during the design phase, where pattern recognition can be implemented as part of decision making for diagnostics. The suggested system has a wide range of applications, including industrial machinery as well as wind turbines for renewable energy applications.

    Machine Learning

    Machine learning can be defined in various ways, all relating to a scientific domain concerned with the design and development of theoretical and implementation tools for building systems with some human-like intelligent behavior. More specifically, machine learning addresses the ability to improve automatically through experience.

    Optimisation Method for Training Deep Neural Networks in Classification of Non-functional Requirements

    Non-functional requirements (NFRs) are regarded as critical to a software system's success. The majority of NFR detection and classification solutions have relied on supervised machine learning models. These are hindered by the lack of labelled data for training and necessitate a significant amount of time spent on feature engineering. In this work we explore emerging deep learning techniques to reduce the burden of feature engineering. The goal of this study is to develop an autonomous system that can classify NFRs into multiple classes based on a labelled corpus. In the first section of the thesis, we standardise the NFR ontology and annotations to produce a corpus based on five attributes: usability, reliability, efficiency, maintainability, and portability. In the second section, the design and implementation of four neural networks (an artificial neural network, a convolutional neural network, a long short-term memory network, and a gated recurrent unit) are examined for classifying NFRs. These models necessitate a large corpus. To overcome this limitation, we propose a new paradigm for data augmentation. This method uses a sort-and-concatenate strategy to combine two phrases from the same class, resulting in a two-fold increase in data size while keeping the domain vocabulary intact. We compared our method to a baseline (no augmentation) and to an existing approach, Easy Data Augmentation (EDA), with pre-trained word embeddings. All training was performed under two data settings: augmentation of the entire data before the train/validation split versus augmentation of the train set only. Our findings show that, compared with EDA and the baseline, the NFR classification models improved greatly, and the CNN performed best, when trained using our suggested technique in the first setting. However, we saw only a slight boost in the second experimental setup, with train-set-only augmentation.
As a result, we conclude that augmentation of the validation set is required in order to achieve acceptable results with our proposed approach. We hope that our ideas will inspire new data augmentation techniques, whether generic or task-specific. Furthermore, it would also be useful to implement this strategy in other languages.
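The sort-and-concatenate augmentation described in this abstract might look roughly like the sketch below. The abstract does not say how phrases are sorted or paired, so the lexicographic ordering, the pairing of each phrase with its next neighbour (wrapping around), and the name `sort_and_concatenate` are all assumptions made for illustration. What the sketch does preserve from the description is that each new sample combines two phrases of the same class, the data roughly doubles, and no new vocabulary is introduced.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[str, str]  # (requirement text, class label)


def sort_and_concatenate(samples: List[Sample]) -> List[Sample]:
    """Return the original samples plus one concatenated sample per original.

    Within each class the texts are sorted (assumed: lexicographically) and
    each text is joined with the next one in that order, wrapping around.
    Classes with fewer than two samples are left without new samples.
    """
    by_class: Dict[str, List[str]] = defaultdict(list)
    for text, label in samples:
        by_class[label].append(text)

    augmented = list(samples)
    for label, texts in by_class.items():
        if len(texts) < 2:
            continue
        ordered = sorted(texts)
        for i, text in enumerate(ordered):
            partner = ordered[(i + 1) % len(ordered)]
            augmented.append((f"{text} {partner}", label))
    return augmented


# Made-up requirement phrases for two of the five classes named in the abstract.
data = [
    ("b x", "usability"),
    ("a y", "usability"),
    ("c z", "reliability"),
    ("d w", "reliability"),
]
result = sort_and_concatenate(data)
```

The two experimental settings in the abstract correspond to calling such a function on the full dataset before splitting, or on the training portion only. Augmenting before the split lets concatenated samples share text with validation samples, which is worth keeping in mind when reading the reported gains.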

    BNAIC 2008: Proceedings of BNAIC 2008, the twentieth Belgian-Dutch Artificial Intelligence Conference


    Analyzing and Modeling Real-World Phenomena with Complex Networks: A Survey of Applications

    The success of new scientific areas can be assessed by their potential for contributing to new theoretical approaches and to applications in real-world problems. Complex networks have fared extremely well in both of these aspects, with their sound theoretical basis developed over the years and with a variety of applications. In this survey, we analyze the applications of complex networks to real-world problems and data, with emphasis on representation, analysis and modeling, after an introduction to the main concepts and models. A diverse range of phenomena is surveyed, which may be classified into no fewer than 22 areas, providing a clear indication of the impact of the field of complex networks. Comment: 103 pages, 3 figures and 7 tables. A working manuscript, suggestions are welcome.

    Nodalida 2005 - proceedings of the 15th NODALIDA conference


    Can human association norm evaluate latent semantic analysis?

    This paper presents a comparison of a word association norm created by a psycholinguistic experiment with association lists generated by algorithms operating on text corpora. We compare lists generated by the Church and Hanks algorithm with lists generated by the LSA algorithm. An argument is presented on how those automatically generated lists reflect real semantic relations.
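For readers unfamiliar with the Church and Hanks algorithm mentioned above, it scores word pairs with an association ratio based on pointwise mutual information. The following is a simplified sketch, assuming probabilities are estimated as raw counts divided by corpus length and that a pair (x, y) is counted whenever y follows x within a fixed window. The function name `association_ratio`, the default window size, and the toy corpus are illustrative assumptions and are not taken from the paper.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def association_ratio(tokens: List[str], window: int = 5) -> Dict[Tuple[str, str], float]:
    """Score ordered word pairs by log2( P(x, y) / (P(x) * P(y)) ).

    P(x) is count(x) / N and P(x, y) is the number of times y follows x
    within `window` words (x itself included in the window), divided by N,
    where N is the number of tokens.
    """
    n = len(tokens)
    unigram = Counter(tokens)

    pair: Counter = Counter()
    for i, x in enumerate(tokens):
        # The window covers x plus the next (window - 1) tokens.
        for y in tokens[i + 1 : i + window]:
            pair[(x, y)] += 1

    return {
        (x, y): math.log2((count / n) / ((unigram[x] / n) * (unigram[y] / n)))
        for (x, y), count in pair.items()
    }


# Toy corpus: "nurse" follows "doctor" more often than chance would predict.
scores = association_ratio(["doctor", "nurse", "doctor", "nurse"], window=2)
```

Ranking the partners of a cue word by this score yields an association list that can be set beside the responses human participants give for the same cue.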