4 research outputs found

    An Investigation of Clustering Algorithms in the Identification of Similar Web Pages

    In this paper we investigate the effect of using clustering algorithms in the reverse engineering field to identify pages that are similar at either the structural level or the content level. To this end, we have used two instances of a general process that differ only in the measure used to compare web pages. In particular, web pages are compared at the structural level using the Levenshtein edit distance and at the content level using Latent Semantic Indexing. The static pages of two web applications and one static web site have been used to compare the results achieved by the considered clustering algorithms at both the structural and the content level. On these applications we generally achieved comparable results. However, the investigation has also suggested some heuristics to quickly identify the best partition of web pages into clusters among the possible partitions, at both the structural and the content level.
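The structural comparison described above can be illustrated with a minimal sketch. It assumes, as a simplification not stated in the abstract, that a page's structure is represented by its sequence of HTML tag names, and that the edit distance is normalized by the longer sequence. The names `TagCollector`, `tag_sequence`, `levenshtein` and `structural_distance` are hypothetical and are not taken from the paper.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records the sequence of opening and closing tag names in a page."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


def tag_sequence(html):
    """Return the page structure as a list of tag names (assumed representation)."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def structural_distance(html_a, html_b):
    """Edit distance between tag sequences, scaled to the range [0, 1]."""
    a, b = tag_sequence(html_a), tag_sequence(html_b)
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```

A pairwise matrix of such distances could then be passed to any clustering algorithm that accepts precomputed distances; which algorithms the paper actually compares is not stated in the abstract.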

    Semantic feature reduction and hybrid feature selection for clustering of Arabic Web pages

    As reported in the literature, high-dimensional data reduces the efficiency of clustering algorithms. Clustering Arabic text is challenging because capturing the semantics of the text requires deep semantic processing. To overcome these problems, feature selection and reduction methods have become essential for identifying appropriate features and reducing the high-dimensional space. There is a need for a suitable design of feature selection and reduction methods that yields a more relevant, meaningful and reduced representation of Arabic texts and so eases the clustering process. The research developed three methods for analyzing the features of Arabic Web text. The first is a hybrid feature selection method that selects informative term representations within the Arabic Web pages. It combines three feature selection methods, Chi-square, Mutual Information and Term Frequency–Inverse Document Frequency, to build a hybrid model. The second is a latent document vectorization method that represents documents as probability distributions in the vector space. It overcomes the problem of high dimensionality by reducing the dimensional space. To extract the best features, two document vectorizer methods were implemented, the Bayesian vectorizer and the semantic vectorizer. The third is an Arabic semantic feature analysis used to improve the capability of Arabic Web analysis. It supports a good design for the clustering method and optimizes clustering ability when analysing these Web pages by overcoming the problems of term representation, semantic modeling and dimensional reduction. Experiments were carried out with k-means clustering on two data sets. The methods reduced the high-dimensional data and identified the semantic features shared between similar Arabic Web pages grouped together in one cluster. These pages were clustered according to the semantic similarities between them, yielding a small Davies–Bouldin index and high accuracy. This study contributes to research on clustering algorithms by developing three methods to identify the most relevant features of Arabic Web pages.
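To make the "small Davies–Bouldin index" criterion concrete, the following is a minimal sketch of the standard index computed with Euclidean distance. It is an illustration only and is not the study's code: the function names are hypothetical, and it assumes at least two non-empty clusters whose centroids are distinct.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]


def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of vectors).

    Assumes at least two clusters with distinct centroids.
    Lower values indicate compact, well-separated clusters.
    """
    centroids = [centroid(c) for c in clusters]
    # Scatter: mean distance from each member to its cluster centroid.
    scatters = [
        sum(math.dist(p, m) for p in c) / len(c)
        for c, m in zip(clusters, centroids)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case similarity of cluster i to any other cluster.
        total += max(
            (scatters[i] + scatters[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k)
            if j != i
        )
    return total / k
```

For example, two tight clusters placed far apart score lower than two spread-out clusters placed close together, which is why a small index is read as evidence of good clustering.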

    Improving Web site understanding with keyword based clustering

    Web applications are becoming more and more complex and difficult to maintain. To satisfy customers' demands, they need to be updated often and quickly. In the maintenance phase, Web site understanding is a central activity, and programmers spend a lot of time and effort comprehending the internal structure of the Web site. This activity is often required because the available documentation is not aligned with the implementation, or is missing altogether. Reverse engineering techniques have the potential to support Web site understanding by providing views that show the organization of a site and its navigational structure. However, representing each Web page as a node in a diagram recovered from the source code of the Web site often leads to huge and unreadable graphs. Moreover, since the level of connectivity is typically high, the edges in such graphs make the overall result even less usable. In this paper, we propose an approach to Web site understanding based on clustering client-side HTML pages with similar content. This approach works better with content-oriented sites than with application-oriented ones, and it uses a crawler to download the Web pages of the target Web site. The presence of common keywords is exploited to decide when it is appropriate to group pages together. An experiment on 17 Web sites validates our approach and shows that the clusters produced automatically are close to those that a human would produce for a given Web site.
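The idea of grouping pages that share keywords can be sketched as follows. This is not the authors' algorithm, which the abstract does not specify. As assumptions, the sketch measures keyword overlap with Jaccard similarity, merges any two pages whose similarity reaches a threshold (single-link grouping via union-find), and takes the keyword sets as already extracted. The names `jaccard` and `cluster_by_keywords` and the default threshold of 0.5 are hypothetical.

```python
def jaccard(a, b):
    """Share of keywords common to both sets, relative to all their keywords."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_keywords(page_keywords, threshold=0.5):
    """Group pages whose keyword sets overlap by at least `threshold`.

    page_keywords maps a page name to its set of keywords.
    Returns a sorted list of sorted page-name groups.
    """
    pages = list(page_keywords)
    parent = {p: p for p in pages}

    def find(p):
        # Follow parent links to the group representative, compressing the path.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for i, p in enumerate(pages):
        for q in pages[i + 1:]:
            if jaccard(page_keywords[p], page_keywords[q]) >= threshold:
                parent[find(q)] = find(p)  # merge the two groups

    groups = {}
    for p in pages:
        groups.setdefault(find(p), []).append(p)
    return sorted(sorted(g) for g in groups.values())
```

Single-link merging is chosen here only for brevity. It can chain loosely related pages into one group, so a stricter linkage or a higher threshold may suit sites with many shared boilerplate keywords.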