    Detecting Stable Communities In Large Scale Networks

    A network is said to exhibit community structure if its nodes can be easily divided into groups such that each group is densely connected internally but sparsely connected to other groups. Most real-world networks exhibit community structure. A popular technique for detecting communities is based on computing the modularity of the network. Modularity reflects how well the vertices in a group are connected as opposed to being randomly connected. We propose a parallel algorithm for detecting modularity in large networks. However, all modularity-based algorithms for detecting community structure are affected by the order in which the vertices in the network are processed. This order dependence makes detecting communities in real-world graphs increasingly difficult. We introduce the concept of a stable community, that is, a group of vertices that are always assigned to the same community regardless of vertex perturbations to the input. We develop a preprocessing step that identifies stable communities and empirically show that the number of stable communities in a network affects the range of modularity values obtained. In particular, stable communities can also help determine strong communities in the network. Modularity is a widely accepted metric for measuring the quality of a partition identified by various community detection algorithms. However, a growing number of researchers have started to explore the limitations of modularity maximization for detecting communities, such as the resolution limit, degeneracy of solutions and asymptotic growth of the modularity value. To address these issues, we propose a novel vertex-level metric called permanence. We show that permanence, compared to other standard metrics such as modularity, conductance and cut-ratio, performs better as a community scoring function for evaluating the community structures detected in both synthetic and real-world networks. 
We demonstrate that maximizing permanence results in communities that match the ground-truth structure of networks more accurately than modularity-based and other approaches. Finally, we demonstrate how maximizing permanence overcomes limitations associated with modularity maximization.
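
    As a point of reference for the modularity score discussed above, the following is a minimal sketch of the standard Newman-Girvan modularity for an undirected, unweighted graph, Q = sum over communities c of (L_c / m) - (d_c / 2m)^2, where m is the number of edges, L_c the number of edges inside c, and d_c the total degree of c. The function name `modularity` and its argument layout are illustrative assumptions, not taken from the abstract, and the sketch does not cover the parallel algorithm, stable communities, or permanence.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman-Girvan modularity of a partition (illustrative helper).

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every vertex to its community label.
    """
    m = 0
    degree = defaultdict(int)
    internal = defaultdict(int)  # edges with both endpoints in the same community
    for u, v in edges:
        m += 1
        degree[u] += 1
        degree[v] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    if m == 0:
        return 0.0

    # Total degree of each community.
    total_degree = defaultdict(int)
    for v, d in degree.items():
        total_degree[community[v]] += d

    # Observed internal edge fraction minus the fraction expected at random.
    return sum(
        internal[c] / m - (total_degree[c] / (2 * m)) ** 2
        for c in total_degree
    )


# Two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {v: "a" for v in range(6)}
print(modularity(edges, two_groups))  # 5/14, about 0.357
print(modularity(edges, one_group))   # 0.0
```

    Placing every vertex in a single community always scores 0, while splitting the example graph at the bridge scores higher, which is the behaviour modularity-based methods exploit.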

    CoDiS: Community Detection via Distributed Seed-Set Expansion on Graph Streams

    Community detection has been and remains a very important topic in several fields. From marketing and social networking to biological studies, community detection plays a key role in advancing research. Work on this topic originally looked at classifying nodes into discrete communities, but eventually moved on to placing nodes in multiple communities. Unfortunately, community detection has always been a time-inefficient process, and recent data sets have been simply too large to process realistically using traditional methods. Because of this, recent methods have turned to parallelism, but all these methods, while offering a significant decrease in processing time, still have several issues. The innovation of this paper is that it distributes the seed nodes instead of the stream edges, and therefore assigns to each working node a subset of the currently formed communities. Experimental results show that we are able to gain a significant improvement in running time with no loss of accuracy.

    An end-to-end convolutional selective autoencoder approach to Soybean Cyst Nematode eggs detection

    This paper proposes a novel selective autoencoder approach within the framework of deep convolutional networks. The crux of the idea is to train a deep convolutional autoencoder to suppress undesired parts of an image frame while preserving the desired parts, resulting in efficient object detection. The efficacy of the framework is demonstrated on a critical plant science problem. In the United States, approximately $1 billion is lost per annum due to a nematode infection on soybean plants. Currently, plant pathologists rely on labor-intensive and time-consuming identification of Soybean Cyst Nematode (SCN) eggs in soil samples via manual microscopy. The proposed framework attempts to significantly expedite the process by using a series of manually labeled microscopic images for training, followed by automated high-throughput egg detection. The problem is particularly difficult due to the presence of a large population of non-egg particles (disturbances) in the image frames that are very similar to SCN eggs in shape, pose and illumination. Therefore, the selective autoencoder is trained to learn unique features related to the invariant shapes and sizes of the SCN eggs without handcrafting. After that, a composite non-maximum suppression and differencing step is applied at the post-processing stage.

    Distributed Kernel Regression: An Algorithm for Training Collaboratively

    This paper addresses the problem of distributed learning under communication constraints, motivated by distributed signal processing in wireless sensor networks and data mining with distributed databases. After formalizing a general model for distributed learning, an algorithm for collaboratively training regularized kernel least-squares regression estimators is derived. Noting that the algorithm can be viewed as an application of successive orthogonal projection algorithms, its convergence properties are investigated and the statistical behavior of the estimator is discussed in a simplified theoretical setting.
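
    For orientation, the estimator named above can be sketched in its ordinary, centralized form. The snippet below fits regularized kernel least-squares regression on a single data set by solving (K + lam * I) alpha = y, which minimizes the sum of squared residuals plus lam times the squared RKHS norm. It is not the collaborative algorithm of the paper: the Gaussian kernel, the function names, and the parameters `lam` and `bandwidth` are all assumptions made for illustration.

```python
import numpy as np


def gaussian_kernel(A, B, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dist = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clamp tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth**2))


def fit_kernel_ridge(X, y, lam=1e-3, bandwidth=1.0):
    """Return coefficients alpha solving (K + lam * I) alpha = y."""
    K = gaussian_kernel(X, X, bandwidth)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)


def predict(X_train, alpha, X_new, bandwidth=1.0):
    """Evaluate f(x) = sum_i alpha_i * k(x, x_i) at the rows of X_new."""
    return gaussian_kernel(X_new, X_train, bandwidth) @ alpha


# Usage on a toy one-dimensional problem.
X = np.linspace(0.0, 1.0, 20)[:, None]
y = np.sin(2.0 * np.pi * X[:, 0])
alpha = fit_kernel_ridge(X, y, lam=1e-3, bandwidth=0.2)
y_hat = predict(X, alpha, X, bandwidth=0.2)
print(np.linalg.norm(y_hat - y))  # training residual; grows as lam increases
```

    In a distributed setting, each sensor or database would hold only part of (X, y), which is why the paper replaces this single linear solve with a collaborative training procedure.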

    Parallel and Flow-Based High Quality Hypergraph Partitioning

    Balanced hypergraph partitioning is a classic NP-hard optimization problem that is a fundamental tool in such diverse disciplines as VLSI circuit design, route planning, sharding distributed databases, optimizing communication volume in parallel computing, and accelerating the simulation of quantum circuits. Given a hypergraph and an integer k, the task is to divide the vertices into k disjoint blocks with bounded size, while minimizing an objective function on the hyperedges that span multiple blocks. In this dissertation we consider the most commonly used objective, the connectivity metric, where we aim to minimize the number of different blocks connected by each hyperedge. The most successful heuristic for balanced partitioning is the multilevel approach, which consists of three phases. In the coarsening phase, vertex clusters are contracted to obtain a sequence of structurally similar but successively smaller hypergraphs. Once sufficiently small, an initial partition is computed. Lastly, the contractions are successively undone in reverse order, and an iterative improvement algorithm is employed to refine the projected partition on each level. An important aspect in designing practical heuristics for optimization problems is the trade-off between solution quality and running time. The appropriate trade-off depends on the specific application, the size of the data sets, and the computational resources available to solve the problem. Existing algorithms are either slow, sequential and offer high solution quality, or are simple, fast, easy to parallelize, and offer low quality. While this trade-off cannot be avoided entirely, our goal is to close the gaps as much as possible. We achieve this by improving the state of the art in all non-trivial areas of the trade-off landscape with only a few techniques, but employed in two different ways. 
Furthermore, most research on parallelization has focused on distributed memory, which neglects the greater flexibility of shared-memory algorithms and the wide availability of commodity multi-core machines. In this thesis, we therefore design and revisit fundamental techniques for each phase of the multilevel approach, and develop highly efficient shared-memory parallel implementations thereof. We consider two iterative improvement algorithms, one based on the Fiduccia-Mattheyses (FM) heuristic, and one based on label propagation. For these, we propose a variety of techniques to improve the accuracy of gains when moving vertices in parallel, as well as low-level algorithmic improvements. For coarsening, we present a parallel variant of greedy agglomerative clustering with a novel method to resolve cluster join conflicts on-the-fly. Combined with a preprocessing phase for coarsening based on community detection, a portfolio of from-scratch partitioning algorithms, as well as recursive partitioning with work-stealing, we obtain our first parallel multilevel framework. It is the fastest partitioner known, and achieves medium-high quality, beating all parallel partitioners, and is close to the highest quality sequential partitioner. Our second contribution is a parallelization of an n-level approach, where only one vertex is contracted and uncontracted on each level. This extreme approach aims at high solution quality via very fine-grained, localized refinement, but seems inherently sequential. We devise an asynchronous n-level coarsening scheme based on a hierarchical decomposition of the contractions, as well as a batch-synchronous uncoarsening, and later fully asynchronous uncoarsening. In addition, we adapt our refinement algorithms, and also use the preprocessing and portfolio. 
This scheme is highly scalable, and achieves the same quality as the highest quality sequential partitioner (which is based on the same components), but is of course slower than our first framework due to fine-grained uncoarsening. The last ingredient for high quality is an iterative improvement algorithm based on maximum flows. In the sequential setting, we first improve an existing idea by solving incremental maximum flow problems, which leads to smaller cuts and is faster due to engineering efforts. Subsequently, we parallelize the maximum flow algorithm and schedule refinements in parallel. Beyond striving for the highest quality, we present a deterministically parallel partitioning framework. We develop deterministic versions of the preprocessing, coarsening, and label propagation refinement. Experimentally, we demonstrate that the penalties for determinism in terms of partition quality and running time are very small. All of our claims are validated through extensive experiments, comparing our algorithms with state-of-the-art solvers on large and diverse benchmark sets. To foster further research, we make our contributions available in our open-source framework Mt-KaHyPar. While it seems inevitable that, with ever increasing problem sizes, we must transition to distributed memory algorithms, the study of shared-memory techniques is not in vain. With the multilevel approach, even the inherently slow techniques have a role to play in fast systems, as they can be employed to boost quality on coarse levels at little expense. Similarly, techniques for shared-memory parallelism are important, both as soon as a coarse graph fits into memory, and as local building blocks in the distributed algorithm.
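
    The connectivity objective described in this abstract can be made concrete with a small sketch. For each hyperedge e, let lambda(e) be the number of distinct blocks containing at least one of its vertices; the objective is the weighted sum of lambda(e) - 1 over all hyperedges. The helper names, the dictionary-based partition representation, and the balance rule (block size at most (1 + epsilon) * ceil(n / k), a common convention assumed here for the "bounded size" requirement) are illustrative assumptions and are not taken from Mt-KaHyPar.

```python
import math


def connectivity_objective(hyperedges, block_of, weights=None):
    """Sum over hyperedges of weight * (number of blocks touched - 1)."""
    total = 0
    for i, edge in enumerate(hyperedges):
        if not edge:
            continue  # an empty hyperedge connects nothing
        blocks_touched = {block_of[v] for v in edge}
        weight = 1 if weights is None else weights[i]
        total += weight * (len(blocks_touched) - 1)
    return total


def is_balanced(block_of, k, epsilon=0.03):
    """Check the assumed size bound (1 + epsilon) * ceil(n / k) per block."""
    sizes = [0] * k
    for block in block_of.values():
        sizes[block] += 1
    limit = (1 + epsilon) * math.ceil(len(block_of) / k)
    return all(size <= limit for size in sizes)


# Six vertices, three hyperedges, two candidate partitions.
hyperedges = [(0, 1, 2), (2, 3), (0, 3, 4, 5)]
bipartition = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
tripartition = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
print(connectivity_objective(hyperedges, bipartition))   # 2
print(connectivity_objective(hyperedges, tripartition))  # 3
print(is_balanced(bipartition, k=2))                     # True
```

    A hyperedge that lies entirely inside one block contributes nothing, so refinement moves are judged by how much they reduce the number of blocks each hyperedge touches.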

    SamBaS: Sampling-Based Stochastic Block Partitioning

    Community detection is a well-studied problem with applications in domains ranging from networking to bioinformatics. Due to the rapid growth in the volume of real-world data, there is growing interest in accelerating contemporary community detection algorithms. However, the more accurate and statistically robust methods tend to be hard to parallelize. One such method is stochastic block partitioning (SBP), a community detection algorithm that works well on graphs with complex and heterogeneous community structure. In this paper, we present a sampling-based SBP (SamBaS) for accelerating SBP on sparse graphs. We characterize how various graph parameters affect the speedup and result quality of community detection with SamBaS and quantify the trade-offs therein. To evaluate SamBaS on real-world web graphs without known ground-truth communities, we introduce partition quality score (PQS), an evaluation metric that outperforms modularity in terms of correlation with F1 score. Overall, SamBaS achieves speedups of up to 10X while maintaining result quality (and even improving result quality by over 150% on certain graphs, relative to F1 score).