5 research outputs found

    Clustering and its Application in Requirements Engineering

    Get PDF
    Large scale software systems challenge almost every activity in the software development life-cycle, including tasks related to eliciting, analyzing, and specifying requirements. Fortunately many of these complexities can be addressed through clustering the requirements in order to create abstractions that are meaningful to human stakeholders. For example, the requirements elicitation process can be supported through dynamically clustering incoming stakeholders’ requests into themes. Cross-cutting concerns, which have a significant impact on the architectural design, can be identified through the use of fuzzy clustering techniques and metrics designed to detect when a theme cross-cuts the dominant decomposition of the system. Finally, traceability techniques, required in critical software projects by many regulatory bodies, can be automated and enhanced by the use of cluster-based information retrieval methods. Unfortunately, despite a significant body of work describing document clustering techniques, there is almost no prior work which directly addresses the challenges, constraints, and nuances of requirements clustering. As a result, the effectiveness of software engineering tools and processes that depend on requirements clustering is severely limited. This report directly addresses the problem of clustering requirements through surveying standard clustering techniques and discussing their application to the requirements clustering process

    Automatic region-of-interest extraction in low depth-of-field images

    Get PDF
    PhD ThesisAutomatic extraction of focused regions from images with low depth-of-field (DOF) is a problem without an efficient solution yet. The capability of extracting focused regions can help to bridge the semantic gap by integrating image regions which are meaningfully relevant and generally do not exhibit uniform visual characteristics. There exist two main difficulties for extracting focused regions from low DOF images using high-frequency based techniques: computational complexity and performance. A novel unsupervised segmentation approach based on ensemble clustering is proposed to extract the focused regions from low DOF images in two stages. The first stage is to cluster image blocks in a joint contrast-energy feature space into three constituent groups. To achieve this, we make use of a normal mixture-based model along with standard expectation-maximization (EM) algorithm at two consecutive levels of block size. To avoid the common problem of local optima experienced in many models, an ensemble EM clustering algorithm is proposed. As a result, relevant blocks, i.e., block-based region-of-interest (ROI), closely conforming to image objects are extracted. In stage two, two different approaches have been developed to extract pixel-based ROI. In the first approach, a binary saliency map is constructed from the relevant blocks at the pixel level, which is based on difference of Gaussian (DOG) and binarization methods. Then, a set of morphological operations is employed to create the pixel-based ROI from the map. Experimental results demonstrate that the proposed approach achieves an average segmentation performance of 91.3% and is computationally 3 times faster than the best existing approach. In the second approach, a minimal graph cut is constructed by using the max-flow method and also by using object/background seeds provided by the ensemble clustering algorithm. Experimental results demonstrate an average segmentation performance of 91.7% and approximately 50% reduction of the average computational time by the proposed colour based approach compared with existing unsupervised approaches

    Voting-Based Consensus of Data Partitions

    Get PDF
    Over the past few years, there has been a renewed interest in the consensus problem for ensembles of partitions. Recent work is primarily motivated by the developments in the area of combining multiple supervised learners. Unlike the consensus of supervised classifications, the consensus of data partitions is a challenging problem due to the lack of globally defined cluster labels and to the inherent difficulty of data clustering as an unsupervised learning problem. Moreover, the true number of clusters may be unknown. A fundamental goal of consensus methods for partitions is to obtain an optimal summary of an ensemble and to discover a cluster structure with accuracy and robustness exceeding those of the individual ensemble partitions. The quality of the consensus partitions highly depends on the ensemble generation mechanism and on the suitability of the consensus method for combining the generated ensemble. Typically, consensus methods derive an ensemble representation that is used as the basis for extracting the consensus partition. Most ensemble representations circumvent the labeling problem. On the other hand, voting-based methods establish direct parallels with consensus methods for supervised classifications, by seeking an optimal relabeling of the ensemble partitions and deriving an ensemble representation consisting of a central aggregated partition. An important element of the voting-based aggregation problem is the pairwise relabeling of an ensemble partition with respect to a representative partition of the ensemble, which is refered to here as the voting problem. The voting problem is commonly formulated as a weighted bipartite matching problem. In this dissertation, a general theoretical framework for the voting problem as a multi-response regression problem is proposed. The problem is formulated as seeking to estimate the uncertainties associated with the assignments of the objects to the representative clusters, given their assignments to the clusters of an ensemble partition. A new voting scheme, referred to as cumulative voting, is derived as a special instance of the proposed regression formulation corresponding to fitting a linear model by least squares estimation. The proposed formulation reveals the close relationships between the underlying loss functions of the cumulative voting and bipartite matching schemes. A useful feature of the proposed framework is that it can be applied to model substantial variability between partitions, such as a variable number of clusters. A general aggregation algorithm with variants corresponding to cumulative voting and bipartite matching is applied and a simulation-based analysis is presented to compare the suitability of each scheme to different ensemble generation mechanisms. The bipartite matching is found to be more suitable than cumulative voting for a particular generation model, whereby each ensemble partition is generated as a noisy permutation of an underlying labeling, according to a probability of error. For ensembles with a variable number of clusters, it is proposed that the aggregated partition be viewed as an estimated distributional representation of the ensemble, on the basis of which, a criterion may be defined to seek an optimally compressed consensus partition. The properties and features of the proposed cumulative voting scheme are studied. In particular, the relationship between cumulative voting and the well-known co-association matrix is highlighted. Furthermore, an adaptive aggregation algorithm that is suited for the cumulative voting scheme is proposed. The algorithm aims at selecting the initial reference partition and the aggregation sequence of the ensemble partitions the loss of mutual information associated with the aggregated partition is minimized. In order to subsequently extract the final consensus partition, an efficient agglomerative algorithm is developed. The algorithm merges the aggregated clusters such that the maximum amount of information is preserved. Furthermore, it allows the optimal number of consensus clusters to be estimated. An empirical study using several artificial and real-world datasets demonstrates that the proposed cumulative voting scheme leads to discovering substantially more accurate consensus partitions compared to bipartite matching, in the case of ensembles with a relatively large or a variable number of clusters. Compared to other recent consensus methods, the proposed method is found to be comparable with or better than the best performing methods. Moreover, accurate estimates of the true number of clusters are often achieved using cumulative voting, whereas consistently poor estimates are achieved based on bipartite matching. The empirical evidence demonstrates that the bipartite matching scheme is not suitable for these types of ensembles

    Combining partitions by probabilistic label aggregation

    No full text
    Data clustering represents an important tool in exploratory data analysis. The lack of objective criteria render model selection as well as the identification of robust solutions particularly difficult. The use of a stability assessment and the combination of multiple clustering solutions represents an important ingredient to achieve the goal of finding useful partitions. In this work, we propose a novel way of combining multiple clustering solutions for both, hard and soft partitions: the approach is based on modeling the probability that two objects are grouped together. An efficient EM optimization strategy is employed in order to estimate the model parameters. Our proposal can also be extended in order to emphasize the signal more strongly by weighting individual base clustering solutions according to their consistency with the prediction for previously unseen objects. In addition to that, the probabilistic model supports an outof-sample extension that (i) makes it possible to assign previously unseen objects to classes of the combined solution and (ii) renders the efficient aggregation of solutions possible. In this work, we also shed some light on the usefulness of such combination approaches. In the experimental result section, we demonstrate the competitive performance of our proposal in comparison with other recently proposed methods for combining multiple classifications of a finite data set
    corecore