27 research outputs found

    Accelerating Sequence Searching: Dimensionality Reduction Method

    Similarity search over long-sequence datasets is becoming increasingly popular in many emerging applications, such as text retrieval and genetic sequence exploration. In this paper, a novel index structure, the Sequence Embedding Multiset tree (SEM-tree), is proposed to speed up searching over long sequences. The SEM-tree is a multi-level structure in which each level represents the sequence data at a different multiset compression level; the multiset length increases towards the leaf level, which contains the original sequences. The multisets, obtained using sequence embedding algorithms, have the desirable property that they need not preserve the character order of the sequence, i.e. they are shorter representations, yet they retain most of the distance information of the sequences. Each level of the tree serves to prune the search space more efficiently, as the predictability of the multisets allows the search to terminate early and greatly reduces the computational cost. A comprehensive set of experiments is conducted to evaluate the performance of the SEM-tree, and the results show that the proposed method is much more efficient than existing representative methods.
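The core pruning idea can be sketched as follows. This is a minimal, illustrative reconstruction, not the paper's SEM-tree: a character multiset (count vector) discards order, and half the L1 distance between two count vectors never exceeds the true edit distance, so most candidates can be rejected without a full alignment. All function names are hypothetical.

```python
from collections import Counter

def multiset_lower_bound(s: str, t: str) -> float:
    """Half the L1 distance between character multisets.

    Each edit operation (insert, delete, substitute) changes the
    multiset difference by at most 2, so this value never exceeds
    the true edit distance and can prune candidates cheaply.
    """
    cs, ct = Counter(s), Counter(t)
    l1 = sum(abs(cs[c] - ct[c]) for c in cs.keys() | ct.keys())
    return l1 / 2

def edit_distance(s: str, t: str) -> int:
    """Plain dynamic-programming Levenshtein distance."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (a != b)))  # substitution
        prev = cur
    return prev[-1]

def range_search(query: str, dataset: list[str], radius: int) -> list[str]:
    """Return sequences within `radius` edits, pruning by the bound."""
    hits = []
    for seq in dataset:
        if multiset_lower_bound(query, seq) > radius:
            continue  # rejected without computing the edit distance
        if edit_distance(query, seq) <= radius:
            hits.append(seq)
    return hits
```

The SEM-tree stacks such order-free representations at multiple compression levels; this flat scan only demonstrates why the bound is safe to prune with.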

    Measuring Data Quality of Geoscience Datasets Using Data Mining Techniques

    Currently there are many methods of collecting geoscience data, such as station observations, satellite images, and sensor networks. Data from these different sources, regions, and time intervals are combined in geoscience research today. Mixing several data sources may be beneficial but can also lead to severe data quality problems, such as inconsistent data and missing values. There have been efforts to produce more consistent datasets from multiple data sources. However, because of the huge gaps in data quality among the sources, quality inequality among regions and time intervals still occurs in the resulting datasets. As the construction of these datasets is quite complicated, it is difficult for users to know the data quality of a dataset, let alone the quality for a specified location or a given time interval. In this paper, the authors address the problem by generating a data quality measure for all regions and time intervals of a dataset. The measure is computed by comparing the constructed datasets with their sources or other relevant data, using data mining techniques. The paper also demonstrates how to handle major quality problems, such as outliers and missing values, using data mining techniques on geoscience data, especially global climate data.
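One simple way to realize a per-region, per-interval quality measure of this kind is a score that combines completeness (is the value present?) with agreement against the source observations. This is a hedged sketch of the general idea only; the paper's actual measure and data structures are not specified here, and all names and the tolerance parameter are illustrative.

```python
def quality_score(constructed, sources, tol=1.0):
    """Per-cell quality: completeness combined with source agreement.

    constructed: dict mapping (region, interval) -> value or None
    sources:     dict mapping (region, interval) -> list of source values
    Returns a dict mapping (region, interval) -> score in [0, 1].
    A missing value scores 0; otherwise the score is the fraction of
    source observations within `tol` of the constructed value.
    """
    scores = {}
    for key, value in constructed.items():
        if value is None:
            scores[key] = 0.0  # missing value: lowest quality
            continue
        obs = sources.get(key, [])
        if not obs:
            scores[key] = 0.0  # no supporting evidence for this cell
            continue
        agree = sum(abs(value - o) <= tol for o in obs)
        scores[key] = agree / len(obs)
    return scores
```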

    An Efficient High Dimensional Cluster Method and its Application in Global Climate Sets

    Because of the development of modern satellites and other data acquisition systems, global climate research often involves high-dimensional datasets of overwhelming volume and complexity. As a data preprocessing and analysis method, clustering plays an increasingly important role in this research. In this paper, we propose a spatial clustering algorithm that, to some extent, alleviates the curse of dimensionality in high-dimensional clustering. The similarity measure of our algorithm is based on the number of top-k nearest neighbors that two grids share. The neighbors of each grid are computed from the time series associated with the grid, and computing the nearest neighbors of an object is the most time-consuming step. Following Tobler's "First Law of Geography," we add a spatial window constraint on each grid to restrict the number of grids considered, which greatly improves the efficiency of our algorithm. We apply this algorithm to a 100-year global climate dataset and partition the global surface into subareas under various spatial granularities. Experiments indicate that our spatial clustering algorithm works well.
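The two key ingredients described above, shared-nearest-neighbor similarity and a spatial window on the candidate set, can be sketched as below. This is an illustrative reconstruction under assumed inputs (Euclidean distance on time series, a Chebyshev-distance window); the paper's exact distance measure and window shape are not specified here.

```python
def shared_nn_similarity(series, coords, k=3, window=2.0):
    """Similarity between grid cells = number of shared top-k neighbors.

    series: dict grid_id -> time-series tuple (one value per time step)
    coords: dict grid_id -> (x, y) grid coordinates
    Neighbor candidates for a grid are limited to grids whose
    Chebyshev distance is within `window` (the spatial constraint),
    which avoids comparing every grid with every other grid.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(series[a], series[b])) ** 0.5

    def in_window(a, b):
        (x1, y1), (x2, y2) = coords[a], coords[b]
        return max(abs(x1 - x2), abs(y1 - y2)) <= window

    topk = {}
    for g in series:
        cands = [h for h in series if h != g and in_window(g, h)]
        cands.sort(key=lambda h: dist(g, h))
        topk[g] = set(cands[:k])

    # similarity for each unordered pair of grids
    return {(a, b): len(topk[a] & topk[b])
            for a in series for b in series if a < b}
```

A clustering pass would then merge grids whose shared-neighbor count exceeds a threshold; only the similarity computation is shown here.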

    A peer-to-peer approach to Geospatial Web Services discovery

    Geospatial Web Services are data-oriented services that involve a variety of complex data models and metadata. Discovering the appropriate services with related geospatial datasets among a large number of available ones is a key task in the Geospatial Web Services domain. This paper proposes a peer-to-peer (P2P) approach for discovering geospatial Web Services. We characterize the geospatial Web Service profile as a set of keywords including the metadata attributes, the minimum bounding rectangle (MBR), and the QoS parameters. Unlike keyword-based P2P Web Service discovery approaches, we use the MBR information to cluster and index the services into a Peer R+ tree. With this tree, a P2P system can support complex queries containing partial keywords as well as spatial queries. The approach has been used in the Beijing Spatial Data Infrastructure project. We implement a prototype peer-to-peer geospatial Web Services discovery system, which shows that our approach facilitates complex service queries containing spatial keywords. © 2006 ACM.
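The combined keyword-plus-spatial matching the profile enables can be sketched with a flat scan. This is only an illustration of the query semantics, not the Peer R+ tree itself (which distributes these MBRs across peers); the service records and field names are hypothetical.

```python
def mbr_intersects(a, b):
    """True if two minimum bounding rectangles overlap.
    Each MBR is (min_x, min_y, max_x, max_y)."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def discover(services, query_mbr, keywords):
    """Return services whose MBR overlaps the query region and whose
    profile contains every requested keyword (partial matches allowed).

    services: list of dicts with 'name', 'mbr', and 'keywords' fields.
    """
    hits = []
    for svc in services:
        if not mbr_intersects(svc["mbr"], query_mbr):
            continue  # spatial filter first: cheap rectangle test
        profile = " ".join(svc["keywords"]).lower()
        if all(kw.lower() in profile for kw in keywords):
            hits.append(svc["name"])
    return hits
```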

    Community-based greedy algorithm for mining top-K influential nodes in mobile social networks

    With the proliferation of mobile devices and wireless technologies, mobile social network systems are increasingly available. A mobile social network plays an essential role in the spread of information and influence in the form of 'word-of-mouth'. A fundamental problem is to find a subset of influential individuals in a mobile social network such that targeting them initially (e.g. to adopt a new product) will maximize the spread of the influence (further adoptions of the new product). Finding the most influential nodes is unfortunately NP-hard. It has been shown that a greedy algorithm with provable approximation guarantees gives good approximations; however, it is computationally expensive, if not prohibitive, to run the greedy algorithm on a large mobile network. In this paper we propose a new algorithm, the Community-based Greedy algorithm, for mining top-K influential nodes. The proposed algorithm encompasses two components: 1) an algorithm for detecting communities in a social network that takes information diffusion into account; and 2) a dynamic programming algorithm for selecting communities from which to find influential nodes. We also provide provable approximation guarantees for our algorithm. Empirical studies on a large real-world mobile social network show that our algorithm is more than an order of magnitude faster than the state-of-the-art greedy algorithm for finding top-K influential nodes, and the error of our approximate algorithm is small. © 2010 ACM.
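For context, the baseline the paper accelerates can be sketched as the classic greedy loop with Monte Carlo spread estimation under the independent cascade model. This is the expensive baseline, not the community-based algorithm itself; the graph representation, propagation probability `p`, and run count are assumptions for illustration.

```python
import random

def simulate_ic(graph, seeds, p=0.1, rng=None):
    """One independent-cascade run; returns the number of activated nodes.
    graph: dict node -> list of out-neighbors."""
    rng = rng or random
    active, frontier = set(seeds), list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in graph.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return len(active)

def greedy_top_k(graph, k, p=0.1, runs=200, seed=0):
    """Baseline greedy: repeatedly add the node with the largest
    estimated marginal gain in expected spread.  Cost is the problem:
    every candidate needs `runs` fresh cascade simulations."""
    rng = random.Random(seed)
    chosen = []
    for _ in range(k):
        base = (sum(simulate_ic(graph, chosen, p, rng)
                    for _ in range(runs)) / runs) if chosen else 0.0
        best, best_gain = None, -1.0
        for v in graph:
            if v in chosen:
                continue
            spread = sum(simulate_ic(graph, chosen + [v], p, rng)
                         for _ in range(runs)) / runs
            if spread - base > best_gain:
                best, best_gain = v, spread - base
        chosen.append(best)
    return chosen
```

The community-based variant avoids running this loop over the whole network by first detecting communities and then using dynamic programming to decide which communities to search.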

    Encoding Tree Sparsity in Multi-Task Learning: A Probabilistic Framework

    Multi-task learning (MTL) seeks to improve generalization performance by sharing common information among multiple related tasks. A key assumption in most MTL algorithms is that all tasks are related, which, however, may not hold in many real-world applications. Existing techniques that attempt to address this issue identify groups of related tasks using group sparsity. In this paper, we propose a probabilistic tree sparsity (PTS) model that uses a tree structure, instead of a group structure, to obtain the sparse solution. Specifically, each coefficient in the learning model is decomposed into a product of component coefficients, each of which corresponds to a node in the tree. Based on this decomposition, Gaussian and Cauchy distributions are placed on the component coefficients as priors to restrict the model complexity. We devise an efficient expectation-maximization algorithm to learn the model parameters. Experiments conducted on both synthetic and real-world problems show the effectiveness of our model compared with state-of-the-art baselines.
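The decomposition step can be made concrete with a small sketch: each task's coefficient is the product of component coefficients along its root-to-leaf path, so zeroing one internal node zeroes every task in that subtree at once, which is exactly the tree-structured sparsity. The priors and the EM learning procedure are not shown; the data layout is an assumption for illustration.

```python
def expand_coefficients(tree_paths, components):
    """Model coefficient for each task = product of the component
    coefficients along its root-to-leaf path in the task tree.

    tree_paths: dict task -> list of node ids from root to leaf
    components: dict node id -> learned component coefficient
    """
    coeffs = {}
    for task, path in tree_paths.items():
        w = 1.0
        for node in path:
            w *= components[node]  # one factor per tree node on the path
        coeffs[task] = w
    return coeffs
```

Setting an internal node's component to zero (which the Gaussian/Cauchy priors encourage for unrelated subtrees) removes that whole group of tasks from the shared solution.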

    Data Mining for Teleconnections in Global Climate Datasets

    A teleconnection is a linkage between two climate events that occur in widely separated regions of the globe on a monthly or longer timescale. In the past, statistical methods have been used to discover teleconnections. However, because of the overwhelming volume and high resolution of datasets acquired by modern data acquisition systems, these methods are no longer sufficient. In this paper, we propose a novel approach to finding teleconnections in global climate datasets using data mining technologies. We present experiments on real datasets and find some interesting teleconnections, including well-known ones such as ENSO. The experiments indicate that our method is usable and efficient.
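A minimal version of the underlying search can be sketched as follows: flag pairs of grid cells that are geographically far apart yet strongly correlated (or anti-correlated) over time. This is only the naive pairwise baseline under assumed thresholds, not the paper's mining method.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

def find_teleconnections(series, coords, min_dist=50.0, min_corr=0.8):
    """Candidate teleconnections: grid pairs that are at least
    `min_dist` apart yet have |correlation| >= `min_corr`."""
    def dist(a, b):
        (x1, y1), (x2, y2) = coords[a], coords[b]
        return ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5

    ids = sorted(series)
    return [(a, b, round(pearson(series[a], series[b]), 3))
            for i, a in enumerate(ids) for b in ids[i + 1:]
            if dist(a, b) >= min_dist
            and abs(pearson(series[a], series[b])) >= min_corr]
```

The quadratic pairwise scan is exactly what makes brute-force approaches infeasible at high resolution, which motivates the data mining techniques proposed in the paper.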

    Integrating Map Services and Location-based Services for Geo-Referenced Individual Data Collection

    With the rapid advance of location-based services (LBS) and online map services, it is now more feasible than before to collect geo-referenced individual-level data. However, privacy is always an issue whenever personal location is traced and recorded. This paper proposes a reactive location-based service to collect and process individual location data under different privacy policies. The reactive LBS provides the user with an active pull mode for collecting location information: depending on the privacy policy, the user can enter the current address manually or have it generated automatically by the LBS location provider. To accurately reconstruct individuals' activity-travel patterns with considerable space-time detail, the LBS server invokes an online map service to georeference the location data and to derive the routes between locations. A household daily activity survey scenario in Beijing is presented. © 2008 IEEE.