2,912 research outputs found

    A Model-Based Frequency Constraint for Mining Associations from Transaction Data

    Mining frequent itemsets is a popular method for finding associated items in databases. For this method, support, the co-occurrence frequency of the items which form an association, is used as the primary indicator of the association's significance. A single user-specified support threshold is used to decide whether associations should be investigated further. Support has known problems with rare items, favors shorter itemsets, and sometimes produces misleading associations. In this paper we develop a novel model-based frequency constraint as an alternative to a single, user-specified minimum support. The constraint utilizes knowledge of the process generating transaction data by applying a simple stochastic mixture model (the NB model), which accommodates the typically highly skewed item frequency distribution of transaction data. A user-specified precision threshold is used together with the model to find local frequency thresholds for groups of itemsets. Based on this constraint we develop the notion of NB-frequent itemsets and adapt a mining algorithm to find all NB-frequent itemsets in a database. In experiments with publicly available transaction databases we show that the new constraint provides improvements over a single minimum support threshold and that the precision threshold is more robust and easier for the user to set and interpret.
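
    To make the idea concrete, here is a minimal sketch of a model-based frequency constraint, assuming a negative-binomial baseline fitted by method of moments and a simple expected-spurious-count precision rule; both are illustrative simplifications, not the paper's exact NB algorithm.

```python
# Hypothetical sketch of a model-based frequency constraint: fit a
# negative-binomial (NB) baseline to co-occurrence counts and pick the
# smallest count threshold whose estimated precision meets a user target.
# The moment-based fit and the precision rule are simplified stand-ins for
# the paper's NB-frequency constraint, not its exact algorithm.
import numpy as np
from scipy import stats

def fit_nb(counts):
    """Fit NB parameters (r, p) to occurrence counts by method of moments."""
    mean, var = np.mean(counts), np.var(counts)
    r = mean ** 2 / max(var - mean, 1e-9)  # valid when over-dispersed (var > mean)
    p = r / (r + mean)
    return r, p

def local_min_count(observed, r, p, precision=0.99):
    """Smallest count c where the estimated precision among accepted itemsets
    (1 - expected spurious / observed accepted) reaches the target."""
    observed = np.asarray(observed)
    n = len(observed)
    for c in range(1, int(observed.max()) + 1):
        accepted = np.sum(observed >= c)
        if accepted == 0:
            break
        spurious = n * stats.nbinom.sf(c - 1, r, p)  # E[#counts >= c by chance]
        if 1.0 - spurious / accepted >= precision:
            return c
    return None

# Demo on synthetic, highly skewed counts: chance co-occurrences plus a block
# of genuinely associated itemsets. For simplicity the baseline is fitted to
# the noise component only.
rng = np.random.default_rng(1)
noise = rng.negative_binomial(0.5, 0.1, size=10_000)
signal = 50 + rng.negative_binomial(0.5, 0.1, size=2_000)
r, p = fit_nb(noise)
# Typically picks a threshold near the level where genuine associations
# start to dominate the NB tail.
print(local_min_count(np.concatenate([noise, signal]), r, p, precision=0.99))
```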

    Process mining tools: a comparative analysis


    Data Mining Algorithms for Internet Data: from Transport to Application Layer

    Nowadays we live in a data-driven world. Advances in data generation, collection, and storage technology have enabled organizations to gather data sets of massive size. Data mining is a discipline that blends traditional data analysis methods with sophisticated algorithms to handle the challenges posed by these new types of data sets. The Internet is a complex and dynamic system in which new protocols and applications arise at a constant pace. All these characteristics make the Internet a valuable and challenging data source and application domain for research, both at the Transport layer, analyzing network traffic flows, and up at the Application layer, focusing on the ever-growing next generation of web services: blogs, micro-blogs, on-line social networks, photo-sharing services, and many other applications (e.g., Twitter, Facebook, Flickr). In this thesis we focus on the study, design, and development of novel algorithms and frameworks to support large-scale data mining activities over huge and heterogeneous data volumes, with a particular focus on Internet data as a data source, targeting network traffic classification, on-line social network analysis, recommendation systems, cloud services, and Big Data.

    The Random Walk of High Frequency Trading

    This paper builds a model of high-frequency equity returns by separately modeling the dynamics of trade-time returns and trade arrivals. Our main contributions are threefold. First, we characterize the distributional behavior of high-frequency asset returns both in ordinary clock time and in trade time. We show that when controlling for pre-scheduled market news events, trade-time returns of the highly liquid near-month E-mini S&P 500 futures contract are well characterized by a Gaussian distribution at very fine time scales. Second, we develop a structured and parsimonious model of clock-time returns by subordinating a trade-time Gaussian distribution with a trade arrival process that is associated with a modified Markov-Switching Multifractal Duration (MSMD) model. This model provides an excellent characterization of high-frequency inter-trade durations. Over-dispersion in this distribution of inter-trade durations leads to leptokurtosis and volatility clustering in clock-time returns, even when trade-time returns are Gaussian. Finally, we use our model to extrapolate the empirical relationship between trade rate and volatility in an effort to understand conditions of market failure. Our model suggests that the 1,200 km physical separation of financial markets in Chicago and New York/New Jersey provides a natural ceiling on systemic volatility and may contribute to market stability during periods of extremely heavy trading.
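
    The subordination mechanism can be illustrated with a toy simulation; in the sketch below a plain negative-binomial trade count per interval stands in for the MSMD duration model, so the count distribution and all parameter values are assumptions made for illustration.

```python
# Toy subordination experiment: Gaussian trade-time returns summed over an
# over-dispersed trade count per clock interval give leptokurtic clock-time
# returns. A negative-binomial count stands in for the paper's MSMD duration
# model; all parameter values are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_intervals = 20_000
mean_trades = 20.0
r = 2.0                    # small r => strong over-dispersion (var >> mean)
p = r / (r + mean_trades)  # NB(r, p) has mean r(1 - p)/p = mean_trades

trades = rng.negative_binomial(r, p, size=n_intervals)  # trades per interval

sigma = 1e-4  # per-trade (trade-time) return volatility
pooled = rng.normal(0.0, sigma, size=int(trades.sum()))  # Gaussian trade-time returns
per_interval = np.split(pooled, np.cumsum(trades)[:-1])  # group by clock interval
clock_returns = np.array([seg.sum() for seg in per_interval])

# Excess kurtosis of this mixture is ~3 * Var(N) / E[N]^2 (about 1.6 here);
# a pure Gaussian would give ~0.
print("excess kurtosis:", stats.kurtosis(clock_returns))
```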

    Advanced pattern mining for complex data analysis

    This thesis investigates a set of critical problems in data mining and proposes four advanced pattern mining algorithms to discover, from data represented in complex structures, the most interesting and useful patterns relevant to the user’s application targets.

    Multidimensional process discovery


    Feature Extraction and Duplicate Detection for Text Mining: A Survey

    Text mining, also known as Intelligent Text Analysis, is an important research area. It is very difficult to focus on the most appropriate information due to the high dimensionality of the data. Feature extraction is one of the important data reduction techniques for discovering the most important features. Processing the massive amounts of data stored in unstructured form is a challenging task, and several pre-processing methods and algorithms are needed to extract useful features from it. The survey covers text summarization, classification, and clustering methods for discovering useful features, and also covers discovering query facets: multiple groups of words or phrases that explain and summarize the content covered by a query, thereby reducing the time taken by the user. When dealing with collections of text documents, it is also very important to filter out duplicate data; once duplicates are deleted, it is recommended to replace the removed data. Hence we also review the literature on duplicate detection and data fusion (removing and replacing duplicates). The survey presents existing text mining techniques to extract relevant features, detect duplicates, and replace duplicate data, giving the user fine-grained knowledge.
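
    As a concrete example of the duplicate-detection step the survey discusses, here is a minimal sketch based on character shingles and Jaccard similarity; the shingle length and similarity threshold are illustrative choices, not ones prescribed by the survey.

```python
# Minimal near-duplicate detector using character shingles and Jaccard
# similarity; shingle length k and the 0.8 threshold are illustrative choices.
def shingles(text, k=5):
    text = " ".join(text.lower().split())  # normalize case and whitespace
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

def find_duplicates(docs, threshold=0.8):
    """Return index pairs of documents whose shingle sets overlap heavily."""
    sets = [shingles(d) for d in docs]
    return [(i, j)
            for i in range(len(sets))
            for j in range(i + 1, len(sets))
            if jaccard(sets[i], sets[j]) >= threshold]

docs = [
    "Data mining extracts useful patterns from text.",
    "data  Mining extracts useful patterns from text.",  # same up to case/spacing
    "Process mining analyzes event logs.",
]
print(find_duplicates(docs))  # -> [(0, 1)]
```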