228 research outputs found

    Twitter data analysis by means of Strong Flipping Generalized Itemsets

    Get PDF
    Twitter data has recently been considered to perform a large variety of advanced analysis. Analysis ofTwitter data imposes new challenges because the data distribution is intrinsically sparse, due to a large number of messages post every day by using a wide vocabulary. Aimed at addressing this issue, generalized itemsets - sets of items at different abstraction levels - can be effectively mined and used todiscover interesting multiple-level correlations among data supplied with taxonomies. Each generalizeditemset is characterized by a correlation type (positive, negative, or null) according to the strength of thecorrelation among its items.This paper presents a novel data mining approach to supporting different and interesting targetedanalysis - topic trend analysis, context-aware service profiling - by analyzing Twitter posts. We aim atdiscovering contrasting situations by means of generalized itemsets. Specifically, we focus on comparingitemsets discovered at different abstraction levels and we select large subsets of specific (descendant)itemsets that show correlation type changes with respect to their common ancestor. To this aim, a novelkind of pattern, namely the Strong Flipping Generalized Itemset (SFGI), is extracted from Twitter mes-sages and contextual information supplied with taxonomy hierarchies. Each SFGI consists of a frequentgeneralized itemset X and the set of its descendants showing a correlation type change with respect to X. Experiments performed on both real and synthetic datasets demonstrate the effectiveness of the pro-posed approach in discovering interesting and hidden knowledge from Twitter dat

    Data mining by means of generalized patterns

    Get PDF
    The thesis is mainly focused on the study and the application of pattern discovery algorithms that aggregate database knowledge to discover and exploit valuable correlations, hidden in the analyzed data, at different abstraction levels. The aim of the research effort described in this work is two-fold: the discovery of associations, in the form of generalized patterns, from large data collections and the inference of semantic models, i.e., taxonomies and ontologies, suitable for driving the mining proces

    Event detection in high throughput social media

    Get PDF

    Feature Extraction and Duplicate Detection for Text Mining: A Survey

    Get PDF
    Text mining, also known as Intelligent Text Analysis is an important research area. It is very difficult to focus on the most appropriate information due to the high dimensionality of data. Feature Extraction is one of the important techniques in data reduction to discover the most important features. Proce- ssing massive amount of data stored in a unstructured form is a challenging task. Several pre-processing methods and algo- rithms are needed to extract useful features from huge amount of data. The survey covers different text summarization, classi- fication, clustering methods to discover useful features and also discovering query facets which are multiple groups of words or phrases that explain and summarize the content covered by a query thereby reducing time taken by the user. Dealing with collection of text documents, it is also very important to filter out duplicate data. Once duplicates are deleted, it is recommended to replace the removed duplicates. Hence we also review the literature on duplicate detection and data fusion (remove and replace duplicates).The survey provides existing text mining techniques to extract relevant features, detect duplicates and to replace the duplicate data to get fine grained knowledge to the user

    Exploring Data Hierarchies to Discover Knowledge in Different Domains

    Get PDF
    L'abstract è presente nell'allegato / the abstract is in the attachmen

    Mining XML documents with association rule algorithms

    Get PDF
    Thesis (Master)--Izmir Institute of Technology, Computer Engineering, Izmir, 2008Includes bibliographical references (leaves: 59-63)Text in English; Abstract: Turkish and Englishx, 63 leavesFollowing the increasing use of XML technology for data storage and data exchange between applications, the subject of mining XML documents has become more researchable and important topic. In this study, we considered the problem of Mining Association Rules between items in XML document. The principal purpose of this study is applying association rule algorithms directly to the XML documents with using XQuery which is a functional expression language that can be used to query or process XML data. We used three different algorithms; Apriori, AprioriTid and High Efficient AprioriTid. We give comparisons of mining times of these three apriori-like algorithms on XML documents using different support levels, different datasets and different dataset sizes

    Research on Personalized Recommender System for Tourism Information Service

    Get PDF
    Since the development in the 1990s, Recommender system has been widely applied in various fields. The conflict between the expansion of tourism information and difficulty of tourists obtaining tourism information allows Tourism Information Recommender System to have a practical significance. Based on the existing online tourism information service and the mature recommendation algorithms, Personal Recommender System can be used to solve present problems of the key recommendation algorithms. In the first place, this research presents an overview of researches on this issue both at home and abroad, and analyzes the applications of main stream recommendation algorithms. Secondly, a comparative study of domestic and international tourism information service websites is conducted. Drawbacks in their applications are defined and advantages are adopted in the settings of Recommender System. Finally, this research provides the framework of Recommender System, which combines the design and test of algorithms and the existing tourism information recommendation websites. This system allows customers to broaden experience of tourism information service and make tourism decisions more accurately and rapidly. Keywords: Tourism information service, Personalized recommendation, Intelligence recommendation module, Apriori algorith

    Closing the gap: Sequence mining at scale

    Full text link
    Frequent sequence mining is one of the fundamental building blocks in data mining. While the problem has been extensively studied, few of the available techniques are sufficiently scalable to handle datasets with billions of sequences; such large-scale datasets arise, for instance, in text mining and session analysis. In this article, we propose MG-FSM, a scalable algorithm for frequent sequence mining on MapReduce. MG-FSM can handle so-called “gap constraints”, which can be used to limit the output to a controlled set of frequent sequences. Both positional and temporal gap constraints, as well as appropriate maximality and closedness constraints, are supported. At its heart, MG-FSM partitions the input database in a way that allows us to mine each partition independently using any existing frequent sequence mining algorithm. We introduce the notion of ω-equivalency, which is a generalization of the notion of a “projected database” used by many frequent pattern mining algorithms. We also present a number of optimization techniques that minimize partition size, and therefore computational and communication costs, while still maintaining correctness. Our experimental study in the contexts of text mining and session analysis suggests that MG-FSM is significantly more efficient and scalable than alternative approaches

    Digging deep into weighted patient data through multiple-level patterns

    Get PDF
    Large data volumes have been collected by healthcare organizations at an unprecedented rate. Today both physicians and healthcare system managers are very interested in extracting value from such data. Nevertheless, the increasing data complexity and heterogeneity prompts the need for new efficient and effective data mining approaches to analyzing large patient datasets. Generalized association rule mining algorithms can be exploited to automatically extract hidden multiple-level associations among patient data items (e.g., examinations, drugs) from large datasets equipped with taxonomies. However, in current approaches all data items are assumed to be equally relevant within each transaction, even if this assumption is rarely true. This paper presents a new data mining environment targeted to patient data analysis. It tackles the issue of extracting generalized rules from weighted patient data, where items may weight differently according to their importance within each transaction. To this aim, it proposes a novel type of association rule, namely the Weighted Generalized Association Rule (W-GAR). The usefulness of the proposed pattern has been evaluated on real patient datasets equipped with a taxonomy built over examinations and drugs. The achieved results demonstrate the effectiveness of the proposed approach in mining interesting and actionable knowledge in a real medical care scenario
    • …
    corecore