A Model-Based Frequency Constraint for Mining Associations from Transaction Data
Mining frequent itemsets is a popular method for finding associated items in
databases. For this method, support, the co-occurrence frequency of the items
which form an association, is used as the primary indicator of the
association's significance. A single user-specified support threshold is used
to decide if associations should be further investigated. Support has some
known problems with rare items, favors shorter itemsets and sometimes produces
misleading associations.
In this paper we develop a novel model-based frequency constraint as an
alternative to a single, user-specified minimum support. The constraint
utilizes knowledge of the process generating transaction data by applying a
simple stochastic mixture model (the NB model) which allows for transaction
data's typically highly skewed item frequency distribution. A user-specified
precision threshold is used together with the model to find local frequency
thresholds for groups of itemsets. Based on the constraint we develop the
notion of NB-frequent itemsets and adapt a mining algorithm to find all
NB-frequent itemsets in a database. In experiments with publicly available
transaction databases we show that the new constraint provides improvements
over a single minimum support threshold and that the precision threshold is
more robust and easier for the user to set and interpret.
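As a concrete illustration of the support measure the abstract criticizes, a minimal sketch of mining with a single global minimum-support threshold (the toy transactions and threshold value are invented for illustration):

```python
from itertools import combinations

# Toy transaction database; items and threshold are illustrative only.
transactions = [
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "bread"},
    {"milk", "butter"},
    {"bread"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# A single user-specified minimum support, applied uniformly to all itemsets.
min_support = 0.4
items = sorted(set().union(*transactions))
frequent_pairs = [
    pair for pair in combinations(items, 2)
    if support(pair, transactions) >= min_support
]
```

The paper's NB-frequent approach replaces the one global `min_support` with local frequency thresholds derived from a stochastic model of the data, addressing the rare-item and itemset-length biases noted above.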
Data Mining Algorithms for Internet Data: from Transport to Application Layer
Nowadays we live in a data-driven world. Advances in data generation, collection and storage technology have enabled organizations to gather data sets of massive size. Data mining is a discipline that blends traditional data analysis methods with sophisticated algorithms to handle the challenges posed by these new types of data sets.
The Internet is a complex and dynamic system with new protocols and applications that arise at a constant pace. All these characteristics make the Internet a valuable and challenging data source and application domain for research activity, both at the Transport layer, analyzing network traffic flows, and up at the Application layer, focusing on the ever-growing next generation of web services: blogs, micro-blogs, on-line social networks, photo-sharing services and many other applications (e.g., Twitter, Facebook, Flickr, etc.).
In this thesis work we focus on the study, design and development of novel algorithms and frameworks to support large-scale data mining activities over huge and heterogeneous data volumes, with a particular focus on Internet data as a data source, targeting network traffic classification, on-line social network analysis, recommendation systems, cloud services and Big Data.
The Random Walk of High Frequency Trading
This paper builds a model of high-frequency equity returns by separately
modeling the dynamics of trade-time returns and trade arrivals. Our main
contributions are threefold. First, we characterize the distributional behavior
of high-frequency asset returns both in ordinary clock time and in trade time.
We show that when controlling for pre-scheduled market news events, trade-time
returns of the highly liquid near-month E-mini S&P 500 futures contract are
well characterized by a Gaussian distribution at very fine time scales. Second,
we develop a structured and parsimonious model of clock-time returns by
subordinating a trade-time Gaussian distribution with a trade arrival process
that is associated with a modified Markov-Switching Multifractal Duration
(MSMD) model. This model provides an excellent characterization of
high-frequency inter-trade durations. Over-dispersion in this distribution of
inter-trade durations leads to leptokurtosis and volatility clustering in
clock-time returns, even when trade-time returns are Gaussian. Finally, we use
our model to extrapolate the empirical relationship between trade rate and
volatility in an effort to understand conditions of market failure. Our model
suggests that the 1,200 km physical separation of financial markets in Chicago
and New York/New Jersey provides a natural ceiling on systemic volatility and
may contribute to market stability during periods of extremely heavy trading.
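The subordination mechanism described above can be sketched in a few lines: each clock-time return is a sum of i.i.d. Gaussian trade-time returns, with the number of trades drawn from a count distribution. Here a negative-binomial arrival process stands in for the paper's MSMD duration model, and all parameter values are illustrative assumptions; the point is only that over-dispersed trade counts yield fat-tailed clock-time returns even though trade-time returns are Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)
n_periods = 200_000
mean_trades = 20.0   # average trades per clock interval (illustrative)
sigma = 0.01         # per-trade return standard deviation (illustrative)

def clock_returns(trade_counts):
    # Conditional on N trades, the sum of N i.i.d. N(0, sigma^2) trade-time
    # returns is Gaussian with variance N * sigma^2, so we draw it directly.
    return rng.normal(0.0, sigma, size=trade_counts.shape) * np.sqrt(trade_counts)

# Poisson arrivals: variance equals mean (no over-dispersion).
n_poisson = rng.poisson(mean_trades, n_periods)
# Negative-binomial arrivals: over-dispersed (variance >> mean), a crude
# stand-in for the MSMD-driven trade arrival process.
n_nb = rng.negative_binomial(2, 2 / (2 + mean_trades), n_periods)

def excess_kurtosis(x):
    x = x - x.mean()
    return (x**4).mean() / (x**2).mean() ** 2 - 3.0

k_poisson = excess_kurtosis(clock_returns(n_poisson))
k_nb = excess_kurtosis(clock_returns(n_nb))
# k_nb exceeds k_poisson: over-dispersion in arrivals produces leptokurtosis.
```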
Advanced pattern mining for complex data analysis
The thesis has researched a set of critical problems in data mining and has proposed four advanced pattern mining algorithms to discover the most interesting and useful data patterns, highly relevant to the user's application targets, from data represented in complex structures.
Feature Extraction and Duplicate Detection for Text Mining: A Survey
Text mining, also known as Intelligent Text Analysis, is an important research area. It is very difficult to focus on the most appropriate information due to the high dimensionality of data. Feature Extraction is one of the important techniques in data reduction to discover the most important features. Processing massive amounts of data stored in an unstructured form is a challenging task. Several pre-processing methods and algorithms are needed to extract useful features from huge amounts of data. The survey covers different text summarization, classification and clustering methods to discover useful features, and also discovering query facets, which are multiple groups of words or phrases that explain and summarize the content covered by a query, thereby reducing the time taken by the user. When dealing with collections of text documents, it is also very important to filter out duplicate data. Once duplicates are deleted, it is recommended to replace the removed duplicates. Hence we also review the literature on duplicate detection and data fusion (removing and replacing duplicates). The survey provides existing text mining techniques to extract relevant features, detect duplicates and replace duplicate data to deliver fine-grained knowledge to the user.
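One standard duplicate-detection technique covered in surveys like this is comparing word-shingle sets with Jaccard similarity; a minimal sketch (the documents and shingle size below are invented for illustration):

```python
def shingles(text, k=3):
    """k-word shingles of a document, a common duplicate-detection feature."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

doc1 = "text mining extracts useful features from unstructured data"
doc2 = "text mining extracts useful features from raw unstructured data"
doc3 = "high frequency trading models trade arrival processes"

sim_near = jaccard(shingles(doc1), shingles(doc2))  # near-duplicates: high
sim_far = jaccard(shingles(doc1), shingles(doc3))   # unrelated: zero overlap
```

At collection scale, exact pairwise Jaccard comparison is replaced by approximations such as MinHash signatures, so that near-duplicates can be found without comparing every document pair.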