24 research outputs found

    Parameter Free Bursty Events Detection in Text Streams

    No full text
    Text classification is a major data mining task. An advanced text classification technique…

    Dominant and K nearest probabilistic skylines

    No full text
    By definition, objects that are skyline points cannot be compared with each other. Yet, thanks to the probabilistic skyline model, skyline points with repeated observations can now be compared. In this model, each object is assigned a value denoting its probability of being a skyline point. Under this model, some questions naturally arise: (1) Which of the objects have skyline probabilities larger than a given object? (2) Which of the objects are the K nearest neighbors to a given object according to their skyline probabilities? (3) What is the ranking of these objects based on their skyline probabilities? To the best of our knowledge, no existing work answers any of these questions. Yet, answering them is not trivial. For even a medium-size dataset, it may take more than an hour to obtain the skyline probabilities of all the objects in it. In this paper, we propose a tree called SPTree that answers all these queries efficiently. SPTree is based on the idea of space partition. We partition the data space into several subspaces so that we do not need to compute the skyline probabilities of all objects. Extensive experiments are conducted. The encouraging results show that our work is highly feasible.
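    The abstract's notion of a skyline probability for objects with repeated observations can be sketched as follows. This is an illustrative reading of the model, not the paper's implementation: each object is a list of equally likely observed instances, and an object's skyline probability is the average, over its instances, of the probability that no other object dominates that instance (smaller values preferred in every dimension).

    ```python
    def dominates(p, q):
        """p dominates q if p <= q in every dimension and p < q in at
        least one (assuming smaller values are preferred)."""
        return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

    def skyline_probability(obj, others):
        """Skyline probability of `obj`: average, over its instances, of
        the probability that the instance is not dominated by any other
        object. Each object is a list of equally likely instances."""
        total = 0.0
        for inst in obj:
            prob = 1.0
            for other in others:
                # fraction of the other object's instances dominating `inst`
                frac = sum(1 for o in other if dominates(o, inst)) / len(other)
                prob *= 1.0 - frac
            total += prob
        return total / len(obj)
    ```

    With this definition, an object none of whose instances is ever dominated gets probability 1, and one dominated by every instance of another object gets probability 0; answering the three queries above then amounts to comparing and ranking these values.
    
    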

    Feature Article: The Predicting Power of Textual Information on Financial Markets

    No full text
    Mining textual documents and time series concurrently, such as predicting the movements of stock prices based on the contents of the news stories, is an emerging topic in the data mining community. Previous research has shown that there is a strong relationship between the time when news stories are released and the time when stock prices fluctuate. In this paper, we propose a systematic framework for predicting the tertiary movements of stock prices by analyzing the impacts of the news stories on the stocks. To be more specific, we investigate the immediate impacts of news stories on the stocks based on the Efficient Markets Hypothesis. Several data mining and text mining techniques are used in a novel way. Extensive experiments using real-life data are conducted, and encouraging results are obtained.

    A framework for ranking and KNN queries in a probabilistic skyline model

    No full text
    Skyline computation has gained a lot of attention in recent years. According to the definition of the skyline, objects that belong to the skyline cannot be ranked among themselves because they are incomparable. This constraint limits the application of the skyline. Fortunately, due to the recently proposed probabilistic skyline model, skyline objects which contain multiple elements can now be compared with each other. Different from the traditional skyline model, where each object either is a skyline object or is not, in the probabilistic skyline model each object is assigned a skyline probability to denote its likelihood of being a skyline object. Under this model, two simple but important questions naturally arise: (1) Given an object, which of the objects are the K nearest neighbors to it based on their skyline probabilities? (2) Given an object, what is the ranking of the objects which have skyline probabilities greater than the given object? To the best of our knowledge, no existing work can effectively answer these two questions. Yet, answering them is not trivial. For a medium-size dataset (e.g. 10,000 objects), it may take more than an hour to compute the skyline probabilities of all objects. In this paper, we propose a novel framework for answering the above two questions efficiently on the fly. Our proposed work is based on a bounding-pruning-refining strategy. We first compute the skyline probabilities of the target object and all its elements. For the rest of the objects, instead of computing their accurate skyline probabilities, we compute upper and lower bounds on their skyline probabilities using the elements of the target object. Based on these lower and upper bounds, objects that cannot be in the result are pruned. For objects whose membership in the result remains unknown, we refine their bounds.
    Specifically, we first partition the whole data space into several subspaces based on the distribution of elements in the target object. As we iteratively refine the bounds, we apply the partitioning strategy within each subspace. To implement this framework, a novel tree, called the Space Partition Tree (SPTree), is proposed to index the objects and their elements. We evaluate our proposed work using three synthetic datasets and one real-life dataset. We report all our findings in this paper.
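    The pruning step of a bounding-pruning-refining strategy like the one described can be sketched as a simple interval test. The function and names below are illustrative, not from the paper: given the target's exact skyline probability and per-object [lower, upper] bounds, candidates whose upper bound does not exceed the target can never qualify, those whose lower bound already exceeds it certainly do, and the rest need further refinement.

    ```python
    def prune_by_bounds(target_prob, bounds):
        """bounds: {object_id: (lower, upper)} on skyline probability.
        Returns (certain, pruned, unresolved) object-id lists."""
        certain, pruned, unresolved = [], [], []
        for oid, (lo, hi) in bounds.items():
            if lo > target_prob:
                certain.append(oid)      # definitely above the target
            elif hi <= target_prob:
                pruned.append(oid)       # can never exceed the target
            else:
                unresolved.append(oid)   # bounds must be refined further
        return certain, pruned, unresolved
    ```

    Refinement (e.g. by partitioning the data space, as the abstract describes) only ever needs to touch the `unresolved` set, which is what makes the approach cheaper than computing every object's exact probability.
    
    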

    Active duplicate detection

    No full text
    The aim of duplicate detection is to group records in a relation which refer to the same real-world entity, such as a person or business. Most existing works require user-specified parameters, such as a similarity threshold, in order to conduct duplicate detection. These methods are called user-first in this paper. However, in many scenarios, pre-specification by the user is very hard and often unreliable, which limits the applicability of user-first methods. In this paper, we propose a user-last method, called Active Duplicate Detection (ADD), where an initial solution is returned without forcing the user to specify such parameters, and the user is then involved to refine the initial solution. Different from user-first methods, where the user makes decisions before any processing, ADD allows the user to make decisions based on an initial solution. The initial solution identified by ADD enjoys comparatively high quality and is easy to refine in a systematic way (at almost zero cost).
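    The abstract does not say how ADD constructs its initial solution; as one illustrative (assumed) reading of the user-last idea, an initial grouping can be built without any user-supplied threshold by cutting the sorted pairwise similarities at their largest gap and single-linking the pairs above the cut. Everything here, including the gap heuristic, is a stand-in sketch, not the paper's method:

    ```python
    from itertools import combinations

    def initial_duplicate_groups(records, similarity):
        """Group records by single-link clustering, cutting at the largest
        gap in the sorted pairwise similarities (a data-driven stand-in
        for a user-specified threshold)."""
        pairs = [(similarity(a, b), i, j)
                 for (i, a), (j, b) in combinations(enumerate(records), 2)]
        pairs.sort(reverse=True)
        sims = [s for s, _, _ in pairs]
        # the largest drop between consecutive similarities decides the cut
        if len(sims) > 1:
            cut = max(range(len(sims) - 1), key=lambda k: sims[k] - sims[k + 1])
            threshold = sims[cut]
        else:
            threshold = sims[0] if sims else 0.0
        # union-find over pairs at or above the threshold
        parent = list(range(len(records)))
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        for s, i, j in pairs:
            if s >= threshold:
                parent[find(i)] = find(j)
        groups = {}
        for i in range(len(records)):
            groups.setdefault(find(i), []).append(records[i])
        return list(groups.values())
    ```

    The user-last refinement step would then consist of the user merging or splitting these initial groups, rather than guessing a threshold up front.
    
    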