
    Index ordering by query-independent measures

    Conventional approaches to information retrieval search through all applicable entries in an inverted file for a particular collection in order to find the documents with the highest scores. For particularly large collections this can be extremely time consuming. A solution to this problem is to search only a limited portion of the collection at query time, speeding up the retrieval process while limiting the loss in retrieval effectiveness (in terms of accuracy of results). We achieve this by first identifying the most "important" documents within the collection and sorting the documents within inverted file lists in order of this "importance". In this way we limit the amount of information to be searched at query time by eliminating documents of lesser importance, which not only makes the search more efficient but also limits the loss in retrieval accuracy. Our experiments, carried out on the TREC Terabyte collection, report significant savings, in terms of the number of postings examined, without significant loss of effectiveness when based on several measures of importance used in isolation and in combination. Our results point to several ways in which the computational cost of searching large collections of documents can be significantly reduced.
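    To illustrate the idea of importance-ordered inverted lists, here is a minimal sketch, assuming simple in-memory dictionaries, an externally supplied query-independent importance score, and a placeholder term-frequency score; the data structures and scoring are illustrative assumptions, not the paper's implementation. Postings are sorted by importance, so query evaluation can examine only the head of each list.

```python
from collections import defaultdict

def build_impact_ordered_index(docs, importance):
    """docs: {doc_id: list of terms}; importance: {doc_id: float}, some
    query-independent measure (e.g. link-based or quality-based)."""
    index = defaultdict(list)
    for doc_id, terms in docs.items():
        for term in set(terms):
            index[term].append((doc_id, terms.count(term)))
    # Sort each posting list so the most "important" documents come first.
    for term in index:
        index[term].sort(key=lambda posting: importance[posting[0]], reverse=True)
    return index

def pruned_search(index, query_terms, max_postings=1000):
    """Score documents while examining at most max_postings entries per list,
    i.e. only the most important documents for each query term."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, tf in index.get(term, [])[:max_postings]:
            scores[doc_id] += tf  # placeholder score; a real system would use BM25 etc.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

    Capping max_postings is the knob that trades a small loss of effectiveness for a large reduction in the number of postings examined.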

    Effective pattern discovery for text mining

    Many data mining techniques have been proposed for mining useful patterns in text documents. However, how to effectively use and update discovered patterns is still an open research issue, especially in the domain of text mining. Since most existing text mining methods adopt term-based approaches, they all suffer from the problems of polysemy and synonymy. Over the years, people have often held the hypothesis that pattern (or phrase) based approaches should perform better than term-based ones, but many experiments have not supported this hypothesis. This paper presents an innovative technique, effective pattern discovery, which includes the processes of pattern deploying and pattern evolving, to improve the effectiveness of using and updating discovered patterns for finding relevant and interesting information. Substantial experiments on the RCV1 data collection and TREC topics demonstrate that the proposed solution achieves encouraging performance.
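    As a hypothetical illustration of the general idea of deploying discovered patterns onto term weights for scoring documents, the sketch below spreads each pattern's support over its terms; the weighting scheme and function names are assumptions, not the paper's exact pattern deploying or pattern evolving procedures.

```python
from collections import defaultdict

def deploy_patterns(patterns):
    """patterns: list of (set of terms, support). Returns a term -> weight map
    by spreading each pattern's support evenly over its terms."""
    weights = defaultdict(float)
    for terms, support in patterns:
        for term in terms:
            weights[term] += support / len(terms)
    return weights

def score_document(doc_terms, weights):
    """Score a document by the deployed weights of the terms it contains."""
    return sum(weights.get(term, 0.0) for term in set(doc_terms))

weights = deploy_patterns([({"data", "mining"}, 5), ({"text", "mining"}, 3)])
print(score_document(["text", "mining", "survey"], weights))  # 5.5
```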

    A Faster Algorithm to Build New Users Similarity List in Neighbourhood-based Collaborative Filtering

    Neighbourhood-based Collaborative Filtering (CF) has been applied in industry for several decades because of its easy implementation and high recommendation accuracy. At the core of neighbourhood-based CF, the task of dynamically maintaining users' similarity lists is challenged by the cold-start problem and the scalability problem. Recently, several methods have been presented to solve these two problems. However, these methods apply an $O(n^2)$ algorithm to compute the similarity list in a special case, where the new users, with enough recommendation data, have the same rating list. To address the large computational cost caused by this special case, we design a faster ($O(\frac{1}{125}n^2)$) algorithm, the TwinSearch Algorithm, which avoids repeatedly computing and sorting the similarity list for the new users, saving computational resources. Both theoretical and experimental results show that the TwinSearch Algorithm achieves better running time than the traditional method.
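    A minimal sketch of the reuse idea the abstract describes: if a new user's rating list is identical to one already indexed, that "twin" user's similarity list can be copied instead of recomputed. The data structures, the rating_key helper and the fallback computation are illustrative assumptions, not the paper's algorithm.

```python
def rating_key(ratings):
    """Canonical, hashable form of a user's rating list ({item: rating})."""
    return tuple(sorted(ratings.items()))

def similarity_list(new_user, ratings_db, sim_lists, sim_fn):
    """ratings_db: {user: {item: rating}}; sim_lists: {user: [(user, score), ...]};
    sim_fn: similarity between two rating dicts (e.g. cosine or Pearson)."""
    key = rating_key(ratings_db[new_user])
    for other, other_ratings in ratings_db.items():
        if other != new_user and rating_key(other_ratings) == key:
            # Twin found: reuse its similarity list instead of recomputing it.
            return [(u, s) for u, s in sim_lists[other] if u != new_user]
    # No twin: fall back to comparing the new user against every other user.
    sims = [(u, sim_fn(ratings_db[new_user], r))
            for u, r in ratings_db.items() if u != new_user]
    return sorted(sims, key=lambda pair: pair[1], reverse=True)
```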

    Dynamic Rule Covering Classification in Data Mining with Cyber Security Phishing Application

    Data mining is the process of discovering useful patterns from datasets using intelligent techniques to help users make certain decisions. A typical data mining task is classification, which involves predicting a target variable, known as the class, in previously unseen data based on models learnt from an input dataset. Covering is a well-known classification approach that derives models with If-Then rules. Covering methods, such as PRISM, have predictive performance competitive with other classical classification techniques such as greedy methods, decision trees and associative classification. Covering models are therefore appropriate decision-making tools, and users favour them when carrying out decisions. Despite the use of the Covering approach in data processing for different classification applications, it is also acknowledged that this approach suffers from the noticeable drawback of inducing massive numbers of rules, making the resulting model large and unmanageable by users. This issue is attributed to the way Covering techniques induce rules: they keep adding items to a rule's body, despite the limited data coverage (the number of training instances that the rule classifies), until the rule reaches zero error. This excessive learning overfits the training dataset and also limits the applicability of Covering models in decision making, because managers normally prefer a summarised set of knowledge that they are able to control and comprehend rather than a high-maintenance model. In practice, there should be a trade-off between the number of rules offered by a classification model and its predictive performance. Another issue associated with Covering models is the overlapping of training data among rules, which happens when a rule's classified data are discarded during the rule discovery phase. Unfortunately, the impact of a rule's removed data on other potential rules is not considered by this approach. When training data linked with a rule are removed, both the frequency and the rank of other rules' items that appeared in the removed data should be updated. The impacted rules should maintain their true rank and frequency in a dynamic manner during the rule discovery phase, rather than just keeping the frequency initially computed from the original input dataset. In response to these issues, a new dynamic learning technique based on Covering and rule induction, which we call Enhanced Dynamic Rule Induction (eDRI), is developed. eDRI has been implemented in Java and embedded in the WEKA machine learning tool. The developed algorithm incrementally discovers rules using primarily frequency and rule strength thresholds. In practice these thresholds limit the search space for both items and potential rules by discarding any with insufficient data representation as early as possible, resulting in an efficient training phase. More importantly, eDRI substantially cuts down the number of training example scans by continuously updating potential rules' frequency and strength parameters in a dynamic manner whenever a rule is inserted into the classifier. In particular, for each derived rule, eDRI adjusts on the fly the frequencies and ranks of the remaining potential rules' items, specifically those that appeared within the deleted training instances of the derived rule. This gives a more realistic model with minimal rule redundancy, and makes the process of rule induction efficient and dynamic rather than static.
    Moreover, the proposed technique minimises the classifier's number of rules at preliminary stages by stopping learning when a rule does not meet the rule strength threshold, thereby minimising overfitting and ensuring a manageable classifier. Lastly, the eDRI prediction procedure not only prioritises using the best-ranked rule for class forecasting of test data but also restricts the use of the default class rule, thus reducing the number of misclassifications. These improvements guarantee classification models of smaller size that do not overfit the training dataset, while maintaining their predictive performance. The eDRI-derived models particularly benefit users taking key business decisions, since they can provide a rich knowledge base to support decision making: the models' predictive accuracies are high, and the models are easy to understand, controllable and robust, i.e. flexible enough to be amended without drastic change. eDRI's applicability has been evaluated on the hard problem of phishing detection. Phishing normally involves creating a fake, well-designed website that closely resembles an existing trusted business website, aiming to trick users and illegally obtain their credentials, such as login information, in order to access their financial assets. Experimental results against large phishing datasets revealed that eDRI is highly useful as an anti-phishing tool, since it derived models of manageable size when compared with other traditional techniques without hindering classification performance. Further evaluation results using several other classification datasets from different domains, obtained from the University of California data repository, have corroborated eDRI's competitive performance with respect to accuracy, size of knowledge representation, training time and item space reduction. This makes the proposed technique not only efficient in inducing rules but also effective.
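    As a much-simplified sketch of covering-style rule induction with frequency and rule strength thresholds, in the spirit of what the abstract describes: covered instances are removed after each rule and item frequencies are recounted on the remaining data. The real eDRI algorithm (item ranking, on-the-fly updates of existing candidates, WEKA integration) is richer; the one-item-per-rule simplification and all names here are assumptions.

```python
from collections import Counter

def induce_rules(instances, min_freq=2, min_strength=0.8):
    """instances: list of (set of items, class label).
    Returns a list of one-item rules (item, class, strength)."""
    rules, remaining = [], list(instances)
    while remaining:
        # Recount item frequencies on the *remaining* data (the dynamic update).
        freq = Counter(item for items, _ in remaining for item in items)
        candidates = [item for item, count in freq.items() if count >= min_freq]
        best = None
        for item in candidates:
            covered = [(items, label) for items, label in remaining if item in items]
            top_class, top_count = Counter(label for _, label in covered).most_common(1)[0]
            strength = top_count / len(covered)
            if strength >= min_strength and (best is None or strength > best[2]):
                best = (item, top_class, strength)
        if best is None:
            break  # stop learning: no candidate meets the strength threshold
        rules.append(best)
        # Remove the instances this rule covers before inducing the next rule.
        remaining = [(items, label) for items, label in remaining if best[0] not in items]
    return rules
```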

    Application of User Profiling on Ontology Module Extraction for Medical portals

    A one-size-fits-all approach to searching and ranking discovered knowledge on the Internet does not cater for the diverse variety of users and user groups with different preferences, information needs and priorities. This is particularly the case in the National electronic Library of Infection in the UK (NeLI, www.neli.org.uk), which is accessed by a number of medical professionals with different preferences and medical information needs. We define personal and group profiles, based on user-specified interests, and develop an ontology module extraction service that identifies the key areas of the infection ontology of particular relevance to each user group. In this paper we discuss how ontology modularisation can improve the NeLI portal by providing customised alerts, a recommender service and a speciality-customised browsing tree structure.
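    A hedged sketch of profile-driven module extraction: starting from the concepts named in a user or group profile, collect the part of the ontology reachable through subclass links up to a depth limit. The graph representation, traversal policy and depth parameter are assumptions for illustration, not the NeLI service's implementation.

```python
from collections import deque

def extract_module(subclasses, seed_concepts, max_depth=2):
    """subclasses: {concept: [child concepts]}; seed_concepts: the profile's interests."""
    module = set(seed_concepts)
    queue = deque((concept, 0) for concept in seed_concepts)
    while queue:
        concept, depth = queue.popleft()
        if depth == max_depth:
            continue
        for child in subclasses.get(concept, []):
            if child not in module:
                module.add(child)
                queue.append((child, depth + 1))
    return module

# Toy ontology fragment; a profile interested in respiratory infections
# receives only that part of the hierarchy.
ontology = {"Infection": ["Respiratory infection", "Gastrointestinal infection"],
            "Respiratory infection": ["Influenza", "Tuberculosis"]}
print(extract_module(ontology, {"Respiratory infection"}, max_depth=1))
```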

    Evaluation of Health Information on the Web

    There is a growing number of websites containing patient or consumer health information. The assumption is that patients should be able to become knowledgeable about their health conditions and become more involved in the management of their own health, but there are concerns about the validity of the information found on the Web. The goal of this research is to develop a system that evaluates websites containing educational information about Inflammatory Bowel Disease (IBD) to determine their quality and appropriateness for consumers. A small number of websites on IBD were evaluated manually by experts in the field. A number of different approaches based on machine learning algorithms then evaluate websites as Excellent, Informative or Not Useful. Results of this approach are promising but inconclusive, as the dataset of websites for training and testing was not sufficiently large for any real conclusions to be drawn.
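    A minimal sketch of the kind of supervised setup the abstract describes: expert-labelled website text is used to train a classifier that predicts Excellent, Informative or Not Useful. The feature representation and model choice (TF-IDF plus logistic regression via scikit-learn) are assumptions; the abstract does not name the specific algorithms used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set; the study used expert-rated IBD websites.
pages = ["evidence-based overview of Crohn's disease treatment options",
         "general description of inflammatory bowel disease symptoms",
         "miracle cure for IBD, buy now"]
labels = ["Excellent", "Informative", "Not Useful"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(pages, labels)
print(model.predict(["symptoms and diagnosis of ulcerative colitis"]))
```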