173 research outputs found

    LC: an effective classification based association rule mining algorithm

    Get PDF
    Classification using association rules is a research field in data mining that applies association rule discovery techniques to classification benchmarks. Many studies in the literature have confirmed that classification using association tends to produce more predictive classifiers than traditional classification techniques such as probabilistic, statistical and decision tree methods. In this thesis, we introduce a novel data mining algorithm based on classification using association called “Looking at the Class” (LC), which can be used for mining a range of classification data sets. Unlike known algorithms in the classification using association approach, such as the Classification Based on Associations (CBA) system and the Classification based on Predictive Association Rules (CPAR) system, which merge disjoint items in the rule learning step without considering class label similarity, the proposed algorithm merges only items with identical class labels. This avoids many unnecessary item combinations during the rule learning step and consequently yields large savings in computational time and memory. Furthermore, the LC algorithm uses a novel prediction procedure that employs multiple rules, rather than a single rule, to make the prediction decision. The proposed algorithm has been evaluated thoroughly on real-world security data sets collected using an automated tool developed at Huddersfield University. The security application considered in this thesis is the categorisation of websites as legitimate or fake based on their features, a typical binary classification problem. Experiments on a number of UCI data sets have also been conducted, with classification accuracy, memory usage, and other measures used for evaluation. The results show that the LC algorithm outperformed traditional classification algorithms such as C4.5, PART and Naïve Bayes, as well as known classification based association algorithms like CBA, with respect to classification accuracy, memory usage, and execution time on most data sets considered.
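
To make the class-aware merging concrete, here is a minimal Python sketch (not the thesis's implementation; the item names and the Apriori-style join are illustrative assumptions) showing how candidate itemsets are joined only when their class labels match:

```python
# Hypothetical sketch of class-aware item merging as described above:
# candidates are combined only when they point to the same class,
# avoiding the disjoint-class joins performed by CBA-style learners.
from itertools import combinations

def merge_candidates(items):
    """items: list of (itemset, class_label) pairs surviving support pruning.
    Returns longer candidates built only from same-class parents."""
    merged = []
    for (a, ca), (b, cb) in combinations(items, 2):
        if ca != cb:                       # "Looking at the Class": skip cross-class joins
            continue
        union = a | b
        if len(union) == len(a) + 1:       # classic Apriori-style join of k-itemsets
            merged.append((union, ca))
    return merged

candidates = [
    (frozenset({"url_has_ip"}), "fake"),
    (frozenset({"long_url"}), "fake"),
    (frozenset({"https"}), "legitimate"),
]
print(merge_candidates(candidates))
# only the two "fake" items are joined; the cross-class pair is skipped
```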

    Online Video Promotion with User Specific Information

    Get PDF
    ABSTRACT: There are various methods used in video recommendation that are purely statistical. These give recommendations to users based on their previous searches or other criteria, and they set up a large number of context collectors at the terminals. However, context collecting and exchanging result in heavy network overhead, and the context processing consumes huge computation. As a result, users end up receiving unnecessary content, which slows the browser. In this paper we propose a user-specific, category-based promotion: we provide a characterization of individual content as well as social attributes that help distinguish each user class. A user-defined video recommendation thus ensures faster access to only the important information within the user's domain of interest, which uses little buffer space and increases the speed of the system for user satisfaction.

KEYWORDS: Spammer, User created content, Video-Tag, private storage, recommender.

I. INTRODUCTION
Online video sharing systems, of which YouTube[1] is the most popular, provide features that allow users to post a video as a response to a discussion topic. These features open opportunities for users to introduce polluted content, or simply pollution, into the system. For instance, spammers[2] may post an unrelated video as a response to a popular one, their objective being to increase the viewership of their own content. According to a Cisco forecast[3], by 2015 two-thirds of the world's mobile data traffic and 62% of consumer Internet traffic will be video. Video sharing has continuously gained ground due to advances in network bandwidth: Internet users post a large number of video clips on video-sharing websites and social network applications[5] every day. The video content may be duplicate, similar, related, or quite different. Facing billions of multimedia web pages, online users usually have a hard time finding their favourites. Some video-sharing websites recommend video lists to end users according to video classification, video description tags, or watching history. However, these recommendations are not accurate and are often inconsistent with end users' interests. To improve on this, some websites also provide users with a search engine[6] to find their desired videos quickly. This led to the development of personalization methods that collect and analyse viewing patterns, such as: the target user's viewing pattern for contents, statistical information on the overall users' viewing patterns, and a user's private profile or preference information obtained through analysis of the user's computing environment, communication service, and preferred device types such as a mobile phone or personal computer. A content-based recommendation system recommends the most likely matching items by comparing the recommendation list to a user's previous input data or preferred items. It builds on information retrieval and generally uses a rating method: the rating method scores the items in a recommendation list against the user's preference information and recommends the program that most closely matches the user's profile. This method has the advantage of adapting easily to recommendation results and enabling quicker recommendations, but its results vary and its effectiveness depends on an appropriate rating configuration.

Several video recommendation algorithms are in use; these include content-based filtering (CB) by Google[7], which has been adopted for the recommender system in its AdWords service. It returns search results with keyword-related advertisements; like spam, these advertisements annoy most users and are ignored by most of them. Also included is social network filtering (SNF). On the Internet, User Created Content (UCC) and Online Digital Video (ODV) enabled a rapid increase in online videos and programs that can be selected by consumers, which was not expected under conventional video technologies and policies. Due to these paradigm changes, thousands of videos and programs are now available to consumers. Previously, only limited content providers existed, such as licensed broadcasting companies and a small number of video and satellite broadcasting operators, so the number of movies and programs was limited. It has become difficult and time-consuming to find an interesting movie, video, or program via the remote control or channel guide map. In this paper we propose a user-defined recommendation system (UDC) under a cloud computing environment. The proposed system analyses and uses the viewing patterns of consumers to personalize program recommendations and to use computing resources efficiently. The proposed framework for recommending online videos operates by constructing user profiles as aggregates of tag clouds and generating recommendations according to similar viewing patterns. The proposed personalization method collects and analyses viewing patterns, such as the target user's viewing pattern for contents, statistical information on the overall users' viewing patterns, and a user's private profile or preference information obtained through analysis of the user's computing environment and communication service. It is implemented on a personal computer, with mobile devices preferred for future work.

II. RELATED WORK
One related line of work considers a network with N mobile unlicensed nodes that move in an environment according to some stochastic mobility model. It assumes that the entire spectrum is divided into M non-overlapping orthogonal channels of different bandwidths. Access to each licensed channel is regulated by fixed-duration time slots, with slot timing assumed to be broadcast by the primary system. Before transmitting its message, each transmitter node (a node with the message) first selects a path node and a frequency channel on which to copy the message. After the path and channel selection, the transmitter node negotiates and handshakes with its path node and declares the selected channel frequency to the path. The communication needed for this coordination is assumed to be accomplished by a fixed-length frequency hopping sequence (FHS) composed of K distinct licensed channels. In each time slot, each node consecutively hops on the FHS in a given order to transmit and receive a coordination packet. The aim of the coordination packet generated by a node with a message is to inform its path about the frequency channel chosen for the message copying. Another work presents an overview of the field of recommender systems and describes the current generation of recommendation methods, classified as: 1. content-based, 2. collaborative, and 3. hybrid recommendation. It further describes some shortcomings of present recommendation systems and proposes possible extensions that can improve recommendation capabilities and make recommender systems applicable to an even broader range of applications. These extensions include an improved understanding of users and items, incorporation of contextual information into the recommendation process, support for multi-criteria ratings, and provision of more flexible and less intrusive types of recommendations.
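
As a rough illustration of the tag-cloud profile idea described above, the following Python sketch (purely hypothetical data and scoring; the paper's actual similarity measure is not specified here) ranks videos by the cosine similarity between their tag clouds and an aggregated user profile:

```python
# Minimal sketch (not the paper's implementation) of profile-based video
# recommendation: each user profile is an aggregate tag cloud, and videos
# are ranked by cosine similarity between their tags and the profile.
from collections import Counter
import math

def cosine(p, q):
    dot = sum(p[t] * q.get(t, 0) for t in p)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

# profile built by summing the tag clouds of previously watched videos
profile = Counter({"football": 5, "highlights": 3, "goals": 2})

videos = {
    "v1": Counter({"football": 1, "goals": 1}),
    "v2": Counter({"cooking": 2, "recipe": 1}),
}
ranked = sorted(videos, key=lambda v: cosine(profile, videos[v]), reverse=True)
print(ranked)   # videos closest to the user's interest cloud come first
```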

    Detecting Spammers in the Source Network

    Get PDF
    The volume of unsolicited messages (spam) sent on the Internet accounts for more than 85% of all e-mail. Even with the evolution of filtering techniques such as message content analysis and IP blocking, network resources are wasted, since this filtering is normally performed at the e-mail destination server. This work proposes a method for detecting spammers in the source network using a supervised classification technique based on metrics that do not require inspection of the content of the messages sent. The results show that the method is effective, being able to identify most spammers while they are still in their source network, thus preserving network resources.
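
A minimal sketch of the kind of content-agnostic supervised classification the abstract describes follows; the feature names and the decision tree learner are illustrative assumptions, not the paper's actual metrics or model:

```python
# Illustrative sketch only: a supervised classifier over network-level
# sending metrics (no message-content inspection), in the spirit of the
# method above. Feature names here are hypothetical examples.
from sklearn.tree import DecisionTreeClassifier

# per-sender features: [msgs_per_minute, distinct_destinations, avg_msg_size_kb]
X_train = [
    [120.0, 800, 4.2],   # bulk sender
    [0.3,   6,   18.5],  # ordinary user
    [95.0,  640, 3.9],
    [0.8,   12,  25.1],
]
y_train = ["spammer", "legitimate", "spammer", "legitimate"]

clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
print(clf.predict([[110.0, 700, 4.0]]))   # -> ['spammer']
```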

    Validating Multimedia Content Moderation Software via Semantic Fusion

    Full text link
    The exponential growth of social media platforms, such as Facebook and TikTok, has revolutionized communication and content publication in human society. Users on these platforms can publish multimedia content that delivers information via the combination of text, audio, images, and video. Meanwhile, the multimedia content release facility has been increasingly exploited to propagate toxic content, such as hate speech, malicious advertisements, and pornography. To this end, content moderation software has been widely deployed on these platforms to detect and block toxic content. However, due to the complexity of content moderation models and the difficulty of understanding information across multiple modalities, existing content moderation software can fail to detect toxic content, which often leads to extremely negative impacts. We introduce Semantic Fusion, a general, effective methodology for validating multimedia content moderation software. Our key idea is to fuse two or more existing single-modal inputs (e.g., a textual sentence and an image) into a new input that combines the semantics of its ancestors in a novel manner and is toxic by construction. This fused input is then used for validating multimedia content moderation software. We realized Semantic Fusion as DUO, a practical content moderation software testing tool. In our evaluation, we employ DUO to test five commercial content moderation software products and two state-of-the-art models against three kinds of toxic content. The results show that DUO achieves up to 100% error finding rate (EFR) when testing moderation software. In addition, we leverage the test cases generated by DUO to retrain the two models we explored, which largely improves model robustness while maintaining the accuracy on the original test set. Comment: Accepted by ISSTA 202
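
The paper's fusion operators are not reproduced here, but a minimal sketch of one plausible text-image fusion (overlaying a textual input onto an image with Pillow; the function name and inputs are hypothetical) conveys the idea of combining the semantics of two single-modal ancestors:

```python
# A minimal sketch of one plausible fusion operator: embedding a textual
# input into an image input to produce a combined test case. This is an
# illustrative assumption, not DUO's actual fusion method.
from PIL import Image, ImageDraw

def fuse_text_into_image(sentence: str, image_path: str, out_path: str) -> None:
    """Overlay a sentence onto an image so the fused input carries the
    semantics of both single-modal ancestors."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    draw.text((10, 10), sentence, fill=(255, 0, 0))  # default bitmap font
    img.save(out_path)

# hypothetical usage:
# fuse_text_into_image("some textual test input", "input.png", "fused.png")
```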

    Dynamic Rule Covering Classification in Data Mining with Cyber Security Phishing Application

    Get PDF
    Data mining is the process of discovering useful patterns from datasets using intelligent techniques to help users make certain decisions. A typical data mining task is classification, which involves predicting a target variable, known as the class, in previously unseen data based on models learnt from an input dataset. Covering is a well-known classification approach that derives models with If-Then rules. Covering methods, such as PRISM, have predictive performance competitive with other classical classification techniques such as greedy, decision tree and associative classification methods. Covering models are therefore appropriate decision-making tools, and users favour them when carrying out decisions. Despite the use of the Covering approach in data processing for different classification applications, it is acknowledged that this approach suffers from the noticeable drawback of inducing massive numbers of rules, making the resulting model large and unmanageable by users. This issue is attributed to the way Covering techniques induce rules: they keep adding items to a rule's body, despite the limited data coverage (the number of training instances that the rule classifies), until the rule reaches zero error. This excessive learning overfits the training dataset and also limits the applicability of Covering models in decision making, because managers normally prefer a summarised set of knowledge that they can control and comprehend rather than a high-maintenance model. In practice, there should be a trade-off between the number of rules offered by a classification model and its predictive performance. Another issue associated with Covering models is the overlapping of training data among the rules, which happens when a rule's classified data are discarded during the rule discovery phase. Unfortunately, the impact of a rule's removed data on other potential rules is not considered by this approach. When the training data linked with a rule are removed, the frequency and rank of other rules' items that appeared in the removed data change. The impacted rules should maintain their true rank and frequency in a dynamic manner during the rule discovery phase, rather than keeping the frequency initially computed from the original input dataset. In response to the aforementioned issues, a new dynamic learning technique based on Covering and rule induction, which we call Enhanced Dynamic Rule Induction (eDRI), is developed. eDRI has been implemented in Java and embedded in the WEKA machine learning tool. The developed algorithm incrementally discovers the rules using primarily frequency and rule strength thresholds. In practice, these thresholds limit the search space for both items and potential rules by discarding any with insufficient data representation as early as possible, resulting in an efficient training phase. More importantly, eDRI substantially cuts down the number of training example scans by continuously updating potential rules' frequency and strength parameters in a dynamic manner whenever a rule is inserted into the classifier. In particular, for each derived rule, eDRI adjusts on the fly the frequencies and ranks of the remaining potential rules' items, specifically those that appeared within the deleted training instances of the derived rule. This gives a more realistic model with minimal rule redundancy, and makes the process of rule induction efficient and dynamic rather than static.
Moreover, the proposed technique minimises the classifier's number of rules at preliminary stages by stopping learning when a rule does not meet the rule strength threshold, thereby minimising overfitting and ensuring a manageable classifier. Lastly, the eDRI prediction procedure not only prioritises using the best-ranked rule for class forecasting of test data but also restricts the use of the default class rule, thus reducing the number of misclassifications. The aforementioned improvements guarantee classification models of smaller size that do not overfit the training dataset, while maintaining their predictive performance. The models derived by eDRI particularly benefit users taking key business decisions, since they provide a rich knowledge base to support decision making: these models' predictive accuracies are high, and the models are easy to understand, controllable, and robust, i.e. flexible enough to be amended without drastic change. eDRI's applicability has been evaluated on the hard problem of phishing detection. Phishing normally involves creating a fake, well-designed website that is nearly identical to an existing, trusted business website, aiming to trick users and illegally obtain their credentials, such as login information, in order to access their financial assets. Experimental results against large phishing datasets revealed that eDRI is highly useful as an anti-phishing tool, since it derived models of manageable size compared with other traditional techniques without hindering classification performance. Further evaluation results using several other classification datasets from different domains, obtained from the University of California Data Repository, have corroborated eDRI's competitive performance with respect to accuracy, number of rules, training time and item space reduction. This makes the proposed technique not only efficient in inducing rules but also effective.
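
As a hedged illustration of the dynamic covering loop described above (simplified to single-attribute rule bodies; the thresholds and data are hypothetical, not eDRI's exact procedure), the following Python sketch recomputes item frequencies after each rule is inserted, so later rules rank on live counts:

```python
# Sketch of a dynamic covering loop: after each rule is appended, the
# instances it covers are removed and remaining items' frequencies are
# recomputed, so subsequent rules use up-to-date counts and ranks.
from collections import Counter

def dynamic_covering(dataset, min_freq=2, min_strength=0.7):
    """dataset: list of (set_of_items, class_label). Returns If-Then rules."""
    rules, remaining = [], list(dataset)
    while remaining:
        counts = Counter()                          # frequencies over *live* data
        for items, label in remaining:
            for item in items:
                counts[(item, label)] += 1
        best = None
        for (item, label), freq in counts.most_common():
            if freq < min_freq:                     # search-space cut-off
                break
            covered = [r for r in remaining if item in r[0]]
            strength = sum(1 for _, l in covered if l == label) / len(covered)
            if strength >= min_strength:            # rule strength threshold
                best = (item, label)
                break
        if best is None:
            break                                   # stop learning early
        rules.append(best)
        remaining = [r for r in remaining if best[0] not in r[0]]
    return rules

data = [({"ip_in_url", "long_url"}, "phishing"),
        ({"ip_in_url"}, "phishing"),
        ({"https", "old_domain"}, "legitimate"),
        ({"https"}, "legitimate")]
print(dynamic_covering(data))   # [('ip_in_url', 'phishing'), ('https', 'legitimate')]
```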

    BlogForever D2.6: Data Extraction Methodology

    Get PDF
    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform
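
As a small illustration of one technique the report discusses, exploiting embedded microdata, the following sketch (not the BlogForever pipeline itself; the HTML snippet is invented) extracts itemprop/value pairs from a blog post's markup with BeautifulSoup:

```python
# Minimal sketch of microdata extraction from a blog's HTML: pull
# schema.org itemprop/value pairs using BeautifulSoup. Illustrative only.
from bs4 import BeautifulSoup

html = """
<article itemscope itemtype="http://schema.org/BlogPosting">
  <h1 itemprop="headline">Post title</h1>
  <span itemprop="author">Jane Doe</span>
  <time itemprop="datePublished" datetime="2012-05-01">1 May 2012</time>
</article>
"""
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all(attrs={"itemprop": True}):   # any tag carrying itemprop
    print(tag["itemprop"], "->", tag.get_text(strip=True))
```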

    Deriving Classifiers with Single and Multi-Label Rules using New Associative Classification Methods

    Get PDF
    Associative Classification (AC) in data mining is a rule-based approach that uses association rule techniques to construct accurate classification systems (classifiers). The majority of existing AC algorithms extract one class per rule and ignore other class labels even when those labels have large data representation. Thus, extending current AC algorithms to find and extract multi-label rules is a promising research direction, since new hidden knowledge is revealed for decision makers. Furthermore, the exponential growth of rules in AC is investigated in this thesis with the aim of minimising the number of candidate rules, thereby reducing the classifier size so that end-users can easily exploit and maintain it. Moreover, both the rule ranking and the test data classification steps are investigated in order to improve the performance of AC algorithms with regard to predictive accuracy. Overall, this thesis investigates different problems related to AC, not limited to the ones listed above, and the results are new AC algorithms that derive single and multi-label rules from different application data sets, together with comprehensive experimental results. Specifically, the first proposed algorithm, named Multi-class Associative Classifier (MAC), derives classifiers where each rule is connected with a single class from a training data set; MAC enhances the rule discovery, rule ranking, rule filtering and test data classification steps in AC. The second proposed algorithm, called Multi-label Classifier based on Associative Classification (MCAC), adds to MAC a novel rule discovery method which discovers multi-label rules from single-label data without learning from parts of the training data set. These rules capture vital information ignored by most current AC algorithms and benefit both the end-user and the classifier's predictive accuracy. Lastly, the vital web threat problem of website phishing detection is investigated in depth, and a technical solution based on AC is introduced in Chapter 6. In particular, we were able to detect new types of knowledge and enhance the detection rate with respect to error rate using our proposed algorithms against a large collected phishing data set. Thorough experimental tests utilising large numbers of University of California Irvine (UCI) data sets and a variety of real application data collections related to website classification and trainer timetabling problems reveal that MAC and MCAC generate better quality classifiers than other AC and rule-based algorithms with respect to various evaluation measures, i.e. error rate, Label-Weight, Any-Label, number of rules, etc. This is mainly due to the different improvements related to rule discovery, rule filtering, rule sorting, the classification step, and, more importantly, the new type of knowledge associated with the proposed algorithms. Most chapters of this thesis have been disseminated in, or are under review at, journals and refereed conference proceedings.
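
To illustrate the multi-label idea behind MCAC in miniature (a simplification, not the published algorithm; data and thresholds are hypothetical), the sketch below keeps every class that passes the support threshold for an item rather than discarding all but the strongest:

```python
# Illustrative sketch: when an item is frequent with more than one class,
# emit one multi-label rule instead of ignoring the weaker class labels.
from collections import defaultdict

def multi_label_rules(dataset, min_support=2):
    """dataset: list of (frozenset_of_items, class_label) pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for items, label in dataset:
        for item in items:
            counts[item][label] += 1
    rules = []
    for item, by_class in counts.items():
        labels = sorted(l for l, c in by_class.items() if c >= min_support)
        if labels:
            rules.append((item, labels))   # multi-label when len(labels) > 1
    return rules

data = [
    (frozenset({"age<30"}), "low_risk"),
    (frozenset({"age<30"}), "low_risk"),
    (frozenset({"age<30"}), "medium_risk"),
    (frozenset({"age<30"}), "medium_risk"),
]
print(multi_label_rules(data))   # [('age<30', ['low_risk', 'medium_risk'])]
```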

    A modified multi-class association rule for text mining

    Get PDF
    Classification and association rule mining are significant tasks in data mining. Integrating association rule discovery and classification in data mining yields an approach known as associative classification. One common shortcoming of existing Association Classifiers is the huge number of rules produced in order to obtain high classification accuracy. This study proposes a Modified Multi-class Association Rule Mining (mMCAR) algorithm that consists of three procedures: rule discovery, rule pruning and group-based class assignment. The rule discovery and rule pruning procedures are designed to reduce the number of classification rules, while the group-based class assignment procedure contributes to improving classification accuracy. Experiments on structured and unstructured text datasets obtained from the UCI and Reuters repositories were performed to evaluate the proposed Association Classifier. The proposed mMCAR classifier is benchmarked against traditional classifiers and existing Association Classifiers. Experimental results indicate that the proposed Association Classifier, mMCAR, produces high accuracy with a smaller number of classification rules. For the structured dataset, mMCAR produces an average accuracy of 84.24%, compared to 84.23% for MCAR. Even though the difference in classification accuracy is small, the proposed mMCAR uses only 50 rules for the classification while its benchmark method involves 60 rules. On the other hand, mMCAR is on par with MCAR when the unstructured dataset is utilized: both classifiers produce 89% accuracy, but mMCAR uses fewer rules for the classification. This study contributes to the text mining domain, as automatic classification of huge and widely distributed textual data could facilitate text representation and retrieval processes.
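
A hedged sketch of a group-based class assignment step of the kind described above follows (the scoring by summed confidence and the example rules are assumptions, not mMCAR's exact procedure):

```python
# Sketch of group-based class assignment: all rules firing on a test
# instance are grouped by class, and the class whose group scores
# highest (here, by summed confidence) is predicted.
from collections import defaultdict

def predict(rules, instance):
    """rules: list of (body_itemset, class_label, confidence) triples."""
    group_score = defaultdict(float)
    for body, label, conf in rules:
        if body <= instance:                  # rule fires on this instance
            group_score[label] += conf
    if not group_score:
        return None                           # caller falls back to a default class
    return max(group_score, key=group_score.get)

rules = [
    (frozenset({"free", "winner"}), "spam", 0.9),
    (frozenset({"meeting"}), "ham", 0.8),
    (frozenset({"free"}), "spam", 0.6),
]
print(predict(rules, frozenset({"free", "winner", "today"})))   # -> spam
```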

    Multi-Instance Multi-Label Learning

    Get PDF
    In this paper, we propose the MIML (Multi-Instance Multi-Label learning) framework, where an example is described by multiple instances and associated with multiple class labels. Compared to traditional learning frameworks, the MIML framework is more convenient and natural for representing complicated objects which have multiple semantic meanings. To learn from MIML examples, we propose the MimlBoost and MimlSvm algorithms based on a simple degeneration strategy, and experiments show that solving problems involving complicated objects with multiple semantic meanings in the MIML framework can lead to good performance. Considering that the degeneration process may lose information, we propose the D-MimlSvm algorithm, which tackles MIML problems directly in a regularization framework. Moreover, we show that even when we do not have access to the real objects, and thus cannot capture more information from real objects by using the MIML representation, MIML is still useful. We propose the InsDif and SubCod algorithms: InsDif works by transforming single instances into the MIML representation for learning, while SubCod works by transforming single-label examples into the MIML representation for learning. Experiments show that in some tasks they are able to achieve better performance than learning from the single-instance or single-label examples directly. Comment: 64 pages, 10 figures; Artificial Intelligence, 201
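
As a compact illustration of the MIML representation and the degeneration strategy mentioned above (types and data are illustrative; the actual MimlBoost/MimlSvm degenerations are more involved), consider:

```python
# Sketch: a MIML example is a bag of instance vectors plus a set of
# labels; a simple degeneration turns one MIML task into several
# multi-instance single-label binary tasks (one per label).
from typing import Dict, List, Set, Tuple

Bag = List[List[float]]                 # multiple instances per example
MimlExample = Tuple[Bag, Set[str]]      # bag + multiple class labels

def degenerate(examples: List[MimlExample],
               label_space: Set[str]) -> Dict[str, list]:
    """Return {label: [(bag, +1/-1), ...]} — one MI binary task per label."""
    return {
        label: [(bag, 1 if label in labels else -1) for bag, labels in examples]
        for label in label_space
    }

scene = ([[0.1, 0.9], [0.4, 0.2]], {"mountain", "sky"})   # image -> patches
tasks = degenerate([scene], {"mountain", "sky", "sea"})
print(tasks["sea"])    # [([[0.1, 0.9], [0.4, 0.2]], -1)]
```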

    Transfer Learning using Computational Intelligence: A Survey

    Get PDF
    Abstract: Transfer learning aims to provide a framework for utilizing previously acquired knowledge to solve new but similar problems much more quickly and effectively. In contrast to classical machine learning methods, transfer learning methods exploit the knowledge accumulated from data in auxiliary domains to facilitate predictive modeling in the current domain, which consists of different data patterns. To improve the performance of existing transfer learning methods and handle the knowledge transfer process in real-world systems, ..
