15 research outputs found

K-Reach: Who is in Your Small World

Author: Cheng Hong
Cheng James
Shang Zechao
Wang Haixun
Yu Jeffrey Xu
Publication date: 01/01/2012

We study the problem of answering k-hop reachability queries in a directed graph, i.e., whether there exists a directed path of length at most k from a source query vertex to a target query vertex in the input graph. The k-hop reachability problem generalizes the classic reachability problem (where k=infinity). Existing indexes for processing classic reachability queries, as well as for processing shortest path queries, are not applicable or not efficient for processing k-hop reachability queries. We propose an index for processing k-hop reachability queries, which is simple in design and efficient to construct. Our experimental results on a wide range of real datasets show that our index is more efficient than the state-of-the-art indexes even for processing classic reachability queries, for which these indexes are primarily designed. We also show that our index is efficient in answering k-hop reachability queries.
Comment: VLDB201
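To make the query in this abstract concrete, here is a minimal index-free baseline, not the index the paper proposes. It assumes that k-hop reachability means "a directed path of at most k edges" and answers a query with a depth-bounded breadth-first search. The function name `k_hop_reachable` and the adjacency-dict representation are illustrative choices.

```python
from collections import deque


def k_hop_reachable(adj, source, target, k):
    """Return True if target can be reached from source using at most k edges.

    adj maps each vertex to an iterable of its out-neighbours.
    This is a plain bounded BFS, used only as a baseline illustration.
    """
    if source == target:
        return True
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == k:
            # Expanding this vertex would exceed the hop budget.
            continue
        for nxt in adj.get(node, ()):
            if nxt == target:
                return True
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return False


# Hypothetical toy graph: a -> b -> c -> d
graph = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
within_two = k_hop_reachable(graph, "a", "d", 2)
within_three = k_hop_reachable(graph, "a", "d", 3)
classic = k_hop_reachable(graph, "a", "d", float("inf"))
```

Passing `float("inf")` as `k` removes the hop limit, which corresponds to the classic reachability case mentioned in the abstract. Each query costs a graph traversal, which is the cost an index aims to avoid.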

arXiv.org e-Print Archive

CiteSeerX

Finding global icebergs over distributed data sets

Author: Ogihara Mitsunori
Wang Haixun
Xu Jun
Zhao Qi
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2006

Finding icebergs (items whose frequency of occurrence is above a certain threshold) is an important problem with a wide range of applications. Most of the existing work focuses on iceberg queries at a single node. However, in many real-life applications, data sets are distributed across a large number of nodes. Two naïve approaches might be considered. In the first, each node ships its entire data set to a central server, and the central server uses single-node algorithms to find icebergs. However, this may incur prohibitive communication overhead. In the second, each node submits its local icebergs, and the central server combines them to find global icebergs. However, this may fail because in many important applications, globally frequent items may not be frequent at any single node. In this work, we propose two novel schemes that provide accurate and efficient solutions to this problem: a sampling-based scheme and a counting-sketch-based scheme. In particular, the latter scheme incurs a communication cost at least an order of magnitude smaller than the naïve scheme of shipping all data, yet is able to achieve very high accuracy. Through rigorous theoretical and experimental analysis we establish the statistical properties of our proposed algorithms, including their accuracy bounds.
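The failure of the second naïve approach can be seen with a small worked example. The sketch below is illustrative only and does not implement the paper's sampling-based or sketch-based schemes. The node contents, the threshold, the function names, and the choice to apply the same threshold locally and globally are all assumptions.

```python
from collections import Counter


def icebergs_by_shipping_all(node_counts, threshold):
    """Naive approach 1: ship every count to the server and sum them exactly."""
    total = Counter()
    for counts in node_counts:
        total.update(counts)
    return {item for item, c in total.items() if c > threshold}


def icebergs_from_local_only(node_counts, threshold):
    """Naive approach 2: each node reports only its own local icebergs."""
    total = Counter()
    for counts in node_counts:
        for item, c in counts.items():
            if c > threshold:  # item is frequent at this node
                total[item] += c
    return {item for item, c in total.items() if c > threshold}


# Hypothetical data: "x" is spread thinly over three nodes, "y" sits on one node.
nodes = [Counter({"x": 4, "y": 10}), Counter({"x": 4, "z": 1}), Counter({"x": 4})]
THRESHOLD = 9

exact = icebergs_by_shipping_all(nodes, THRESHOLD)
from_local = icebergs_from_local_only(nodes, THRESHOLD)
missed = exact - from_local
```

Here `x` occurs 12 times overall but only 4 times at each node, so it is a global iceberg that no node reports, and `missed` contains exactly that item.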

Crossref

University of Miami: Scholarship Miami

A Novel Environmental Route to Ambient Pressure Dried Thermal Insulating Silica Aerogel via Recycled Coal Gangue

Author: Haixun Xu
Junyong Wu
Meng Zheng
Pinghua Zhu
Shanyu Zhao
Publication venue: 'Hindawi Limited'
Publication date: 01/01/2016

Coal gangue, one of the main hazardous wastes from coal purification in the coal-mining industry, is rich in silica and alumina. However, recycling of this waste is normally restricted by inefficient techniques and unattractive output; its utilization rate remains below 15%. In this work, silica aerogel materials were synthesized using a precursor extracted from recycled silicon-rich coal gangue, followed by single-step surface silylation and ambient pressure drying. A low-density (~0.19 g/cm³) nanostructured aerogel with a 3D open porous microstructure and high surface area (~690 m²/g) was synthesized, which shows superior thermal insulation performance (~26.5 mW·m⁻¹·K⁻¹ for a plane packed with 4-5 mm granules, as confirmed by the transient hot-wire method). This study offers a new facile route to the synthesis of insulating aerogel material by recycling solid waste coal gangue and presents a potential cost reduction for the industrial production of silica aerogels.

Crossref

Directory of Open Access Journals

Dual labeling: Answering graph reachability queries in constant time

Author: Haixun Wang
Hao He
Jeffrey Xu Yu
Jun Yang
Philip S. Yu
Publication venue: IEEE Computer Society
Publication date: 01/01/2006

Graph reachability is fundamental to a wide range of applications, including XML indexing, geographic navigation, Internet routing, ontology queries based on RDF/OWL, etc. Many applications involve huge graphs and require fast answering of reachability queries. Several reachability labeling methods have been proposed for this purpose. They assign labels to the vertices, such that the reachability between any two vertices may be decided using their labels only. For sparse graphs, 2-hop based reachability labeling schemes answer reachability queries efficiently using relatively small label space. However, the labeling process itself is often too time-consuming to be practical for large graphs. In this paper, we propose a novel labeling scheme for sparse graphs. Our scheme ensures that graph reachability queries can be answered in constant time. Furthermore, for sparse graphs, the complexity of the labeling process is almost linear, which makes our algorithm applicable to massive datasets. Analytical and experimental results show that our approach is much more efficient than state-of-the-art approaches. Furthermore, our labeling method also provides an alternative scheme to trade off query time for label space, which further benefits applications that use tree-like graphs.

CiteSeerX

Crossref

Suppressing Model Overfitting in Mining Concept-Drifting Data Streams (Research Track Poster)

Author: Haixun Wang
Jeffrey Xu Yu
Jian Yin
Philip S. Yu

Mining data streams of changing class distributions is important for real-time business decision support. The stream classifier must evolve to reflect the current class distribution. This poses a serious challenge. On the one hand, relying on historical data may increase the chances of learning obsolete models. On the other hand, learning only from the latest data may lead to biased classifiers, as the latest data is often an unrepresentative sample of the current class distribution. The problem is particularly acute in classifying rare events, when, for example, instances of the rare class do not even show up in the most recent training data. In this paper, we use a stochastic model to describe the concept shifting patterns and formulate this problem as an optimization one: from the historical and the current training data that we have observed, find the most-likely current distribution, and learn a classifier based on the most-likely distribution. We derive an analytic solution and approximate this solution with an efficient algorithm, which calibrates the influence of historical data carefully to create an accurate classifier. We evaluate our algorithm with both synthetic and real-world datasets. Our results show that our algorithm produces accurate and efficient classification

CiteSeerX

Fast computation of reachability labeling for large graphs

Author: Haixun Wang
Jeffrey Xu Yu
Jiefeng Cheng
Philip S. Yu
Xuemin Lin
Publication date: 01/01/2006

There are numerous applications that need to deal with a large graph and need to query reachability between nodes in the graph. A 2-hop cover can compactly represent the whole edge transitive closure of a graph in $O(|V| \cdot |E|^{1/2})$ space, and can be used to answer reachability queries efficiently. However, it is challenging to compute a 2-hop cover. The existing approaches suffer from either large resource consumption or a low compression rate. In this paper, we propose a hierarchical partitioning approach that partitions a large graph G into two subgraphs repeatedly in a top-down fashion. The unique feature of our approach is that we compute the 2-hop cover while partitioning. In brief, in every iteration of top-down partitioning, we provide techniques to compute the 2-hop cover for connections between the two subgraphs first. A cover is computed to cut the graph into two subgraphs, which results in an overall cover with high compression for the entire graph G. Two approaches are proposed, namely a node-oriented approach and an edge-oriented approach. Our approach can efficiently compute a 2-hop cover for a large graph with a high compression rate. Our extensive experimental studies show that the 2-hop cover for a graph with 1,700,000 nodes and 169 billion connections can be obtained in less than 30 minutes, with a compression rate of about 40,000, using a PC.
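As a rough illustration of how a 2-hop cover answers a query, the sketch below stores an out-label set and an in-label set per vertex and reports that u reaches v when the two sets share a vertex. The labels are hand-picked for a three-vertex path and are not produced by the partitioning approach described in the abstract. The names `L_OUT`, `L_IN`, and `reaches`, and the convention that every vertex appears in its own labels, are assumptions.

```python
def reaches(l_out, l_in, u, v):
    """u reaches v if some vertex lies in both the out-label of u and the in-label of v."""
    return not l_out[u].isdisjoint(l_in[v])


# Hypothetical graph: a -> b -> c, with "b" used as the shared intermediate vertex.
# Convention assumed here: each vertex is included in its own in- and out-labels.
L_OUT = {"a": {"a", "b"}, "b": {"b"}, "c": {"c"}}
L_IN = {"a": {"a"}, "b": {"b"}, "c": {"b", "c"}}

a_to_c = reaches(L_OUT, L_IN, "a", "c")
c_to_a = reaches(L_OUT, L_IN, "c", "a")
```

The pair (a, c) is never stored explicitly; it is recovered through the shared vertex `b`, which is the sense in which the labels compress the transitive closure.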

CiteSeerX

Crossref

Fast graph pattern matching

Author: Bolin Ding
Haixun Wang
Jeffrey Xu Yu
Jiefeng Cheng
Philip S. Yu
Publication date: 01/01/2008

CiteSeerX

Fast computing reachability labelings for large graphs with high compression rate

Author: Haixun Wang
Jeffrey Xu Yu
Jiefeng Cheng
Philip S. Yu
Xuemin Lin
Publication date: 01/01/2008

The need to process graph reachability queries stems from many applications that manage complex data as graphs. These applications include transportation networks, Internet traffic analysis, Web navigation, the semantic web, chemical informatics and bio-informatics systems, and computer vision. A graph reachability query, one of the primary tasks, asks whether two given data objects, u and v, are related in any way in a large and complex dataset. Formally, the query asks whether v is reachable from u in a large directed graph. In this paper, we focus on building a reachability labeling for a large directed graph in order to process reachability queries efficiently. Such a labeling needs to be small so that queries can be answered efficiently, and needs to be fast to compute so that it can be constructed efficiently. The 2-hop cover was proposed as such a labeling for arbitrary graphs, with theoretical bounds on both the construction cost and the size of the resulting labeling. However, in practice, as reported, the construction cost of a 2-hop cover is very high even on very powerful machines. In this paper, we propose a novel geometry-based algorithm which computes a high-quality 2-hop cover fast. Our experimental results verify the effectiveness of our techniques over large real and synthetic graph datasets.

CiteSeerX

Crossref

A Balanced Ensemble Approach to Weighting Classifiers for Text Classification

Author: David W. Cheung
Gabriel Pui Cheong Fung
Haixun Wang
Huan Liu
Jeffrey Xu Yu

This paper studies the problem of constructing an effective heterogeneous ensemble classifier for text classification. One major challenge of this problem is to formulate a good combination function, which combines the decisions of the individual classifiers in the ensemble. We show that the classification performance is affected by three weight components and they should be included in deriving an effective combination function. They are: (1) Global effectiveness, which measures the effectiveness of a member classifier in classifying a set of unseen documents; (2) Local effectiveness, which measures the effectiveness of a member classifier in classifying the particular domain of an unseen document; and (3) Decision confidence, which describes how confident a classifier is when making a decision when classifying a specific unseen document. We propose a new balanced combination function, called Dynamic Classifier Weighting (DCW), that incorporates the aforementioned three components. The empirical study demonstrates that the new combination function is highly effective for text classification.

CiteSeerX