509 research outputs found

    Closed sequential pattern mining for sitemap generation

    A sitemap is an explicit specification of the design concept and knowledge organization of a website and is therefore considered the website's basic ontology. It not only presents the main usage flows for users, but also hierarchically organizes the website's concepts. Typically, sitemaps are defined by webmasters in the very early stages of website design. However, over their lifetime websites significantly change their structure, their content and their possible navigation paths. Even when this is not the case, webmasters may define sitemaps that fail to reflect the actual website content or, conversely, organize pages and links in ways that do not reflect the intended organization encoded in the sitemap. In this paper we propose an approach that automatically generates sitemaps. Contrary to other approaches proposed in the literature, which mainly generate sitemaps from the textual content of the pages, in this work sitemaps are generated by analyzing the Web graph of a website. This allows us to: i) automatically generate a sitemap on the basis of possible navigation paths, ii) compare the generated sitemap with either the sitemap provided by the Web designer or the intended sitemap of the website and, consequently, iii) plan possible website reorganization. The solution we propose is based on closed frequent sequence extraction and concentrates only on hyperlinks organized in "Web lists", i.e., logical lists embedded in the pages. Web lists are typically used to support users in website navigation and include menus, navbars and tables of contents. Experiments performed on three real datasets show that the extracted sitemaps are much more similar to those defined by website curators than those obtained by competing algorithms.
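    The paper's own miner is not reproduced in this abstract; purely as a rough illustration of the underlying technique, the following brute-force sketch extracts closed frequent sequences from a handful of hypothetical navigation paths (the `paths` data and the exhaustive enumeration are assumptions for illustration; production miners use CloSpan- or BIDE-style pruning instead).

```python
from itertools import combinations

def is_subsequence(pat, seq):
    """Check whether pat occurs in seq as an (ordered) subsequence."""
    it = iter(seq)
    return all(item in it for item in pat)

def closed_frequent_sequences(sequences, min_support):
    """Brute-force closed sequential pattern mining.

    Returns frequent patterns that have no super-pattern with the
    same support.  Exponential; for illustration only.
    """
    # 1. Enumerate candidate patterns as subsequences of the input.
    candidates = set()
    for seq in sequences:
        for r in range(1, len(seq) + 1):
            for idx in combinations(range(len(seq)), r):
                candidates.add(tuple(seq[i] for i in idx))

    # 2. Keep candidates meeting the minimum support threshold.
    support = {
        pat: sum(is_subsequence(pat, s) for s in sequences)
        for pat in candidates
    }
    frequent = {p: c for p, c in support.items() if c >= min_support}

    # 3. Closure check: drop any pattern subsumed by an equally
    #    frequent super-pattern.
    return {
        p: frequent[p] for p in frequent
        if not any(
            q != p and frequent[q] == frequent[p] and is_subsequence(p, q)
            for q in frequent
        )
    }

# Hypothetical navigation paths harvested from "Web lists" (menus, navbars).
paths = [
    ("home", "products", "laptops"),
    ("home", "products", "laptops", "accessories"),
    ("home", "about"),
]
print(closed_frequent_sequences(paths, min_support=2))
# {('home',): 3, ('home', 'products', 'laptops'): 2}
```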

    Graph Sketches: Sparsification, Spanners, and Subgraphs

    When processing massive data sets, a core task is to construct synopses of the data. To be useful, a synopsis data structure should be easy to construct while also yielding good approximations of the relevant properties of the data set. A particularly useful class of synopses are sketches, i.e., those based on linear projections of the data. These are applicable in many models, including various parallel, stream, and compressed-sensing settings. A rich body of analytic and empirical work exists for sketching numerical data such as the frequencies of a set of entities. Our work investigates graph sketching, where the graphs of interest encode the relationships between these entities. The main challenge is to capture this richer structure and build the necessary synopses with only linear measurements. In this paper we consider properties of graphs including the size of cuts, the distances between nodes, and the prevalence of dense subgraphs. Our main result is a sketch-based sparsifier construction: we show that Õ(nε^-2) random linear projections of a graph on n nodes suffice to (1 + ε)-approximate all cut values. Similarly, we show that O(ε^-2) linear projections suffice for (additively) approximating the fraction of induced subgraphs that match a given pattern, such as a small clique. Finally, for distance estimation we present sketch-based spanner constructions. In this last result the sketches are adaptive, i.e., the linear projections are performed in a small number of batches, where each projection may be chosen dependent on the outcome of earlier sketches. All of the above results immediately give rise to data-stream algorithms that also apply to dynamic graph streams where edges are both inserted and deleted. The non-adaptive sketches, such as those for sparsification and subgraphs, give us single-pass algorithms for distributed data streams with insertions and deletions. The adaptive sketches can be used to analyze MapReduce algorithms that use a small number of rounds.
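    As a toy illustration of why linearity matters for dynamic graph streams (and explicitly not the paper's sparsifier construction), the following stores k random ±1 projections of a graph's edge-indicator vector, so that a deletion is simply the negation of the corresponding insertion; the class name and lazy sign table are assumptions for the sketch.

```python
import random

class LinearGraphSketch:
    """Toy linear sketch of a graph's edge-indicator vector.

    Each of the k sketch coordinates is a random +/-1 projection over
    the n*(n-1)/2 possible edge slots.  Because the sketch is linear,
    an edge deletion is just the negation of its insertion, which is
    what lets such sketches handle dynamic graph streams.
    """

    def __init__(self, n, k, seed=0):
        self.n, self.k = n, k
        self.rng = random.Random(seed)
        self.signs = {}            # (edge, row) -> +1 or -1, drawn lazily
        self.sketch = [0.0] * k    # the k linear measurements

    def _sign(self, edge, row):
        key = (edge, row)
        if key not in self.signs:
            self.signs[key] = self.rng.choice((-1, 1))
        return self.signs[key]

    def update(self, u, v, delta):
        """Apply delta (+1 insert, -1 delete) to edge {u, v}."""
        edge = (min(u, v), max(u, v))
        for row in range(self.k):
            self.sketch[row] += self._sign(edge, row) * delta

sk = LinearGraphSketch(n=100, k=32)
sk.update(1, 2, +1)   # insert edge {1,2}
sk.update(1, 2, -1)   # delete it again: the sketch returns to all zeros
print(all(x == 0 for x in sk.sketch))  # True
```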

    BlogForever D2.6: Data Extraction Methodology

    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction, and extends the inquiry through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting the semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report concludes by proposing a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.
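    As a small, self-contained illustration of the kind of embedded semantics the report discusses, the following parser pulls microdata itemprop values out of blog HTML using only the Python standard library; the sample markup and property names are assumptions, not BlogForever code.

```python
from html.parser import HTMLParser

class MicrodataParser(HTMLParser):
    """Collect (itemprop, text) pairs from HTML microdata.

    A deliberately minimal illustration; a real extractor would also
    track itemscope/itemtype nesting to group properties per item.
    """
    def __init__(self):
        super().__init__()
        self._prop = None
        self.items = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemprop" in attrs:
            self._prop = attrs["itemprop"]

    def handle_data(self, data):
        if self._prop and data.strip():
            self.items.append((self._prop, data.strip()))
            self._prop = None

# Hypothetical blog-post markup using schema.org microdata.
html = """
<article itemscope itemtype="https://schema.org/BlogPosting">
  <h1 itemprop="headline">Preserving blogs at scale</h1>
  <span itemprop="author">A. Blogger</span>
  <time itemprop="datePublished">2012-05-01</time>
</article>
"""

p = MicrodataParser()
p.feed(html)
print(p.items)
# [('headline', 'Preserving blogs at scale'), ('author', 'A. Blogger'),
#  ('datePublished', '2012-05-01')]
```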

    Applying patterns to hypermedia instructional design (APHID)

    This research addresses the problem of automatically generating instructional hypermedia documents (in the form of websites). Our hypothesis is that, for certain types of hypermedia, an automated approach can produce satisfactory hypermedia applications more efficiently than humans can create them. We propose a method, APHID, that guides a hypermedia creator through the design process and partially automates the creation of hypermedia applications. Our method uses concept maps and instructional design patterns, as well as the more common domain and presentation models, to support the partial automation of instructional hypermedia creation. Most hypermedia application developers follow basic graphical design principles, but few commonly accepted principles exist for structuring hypermedia applications. The design of instructional hypermedia imposes the additional requirement that the designer be an expert in both hypermedia design and instructional design. APHID supports designers through the use of patterns to describe and clarify design concepts for both instructional design and interface design. This thesis describes the design and development of the APHID approach and a prototype software tool that supports the development of instructional hypermedia using it. The thesis also presents a study in which websites created with APHID are compared, by an independent evaluator, to websites created by instructional technologists. The study shows that good instructional websites can be generated semi-automatically with less expenditure of time on the part of the instructional designer.
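    APHID's actual models are not reproduced in this abstract; purely as a hypothetical sketch of pattern-driven structuring in its spirit, the following applies a "guided tour" instructional pattern to a toy concept map, ordering pages so prerequisite concepts come first and linking each page to the next (the concept map, pattern, and names are all illustrative assumptions).

```python
concept_map = {                      # concept -> prerequisite concepts
    "variables": [],
    "loops": ["variables"],
    "functions": ["variables"],
    "recursion": ["functions"],
}

def guided_tour(cmap):
    """Order concepts so prerequisites come first (a depth-first
    topological sort), then chain each page to the next: the
    'guided tour' pattern."""
    order, seen = [], set()

    def visit(c):
        if c in seen:
            return
        seen.add(c)
        for pre in cmap[c]:
            visit(pre)
        order.append(c)

    for c in cmap:
        visit(c)
    return [
        {"page": c, "next": order[i + 1] if i + 1 < len(order) else None}
        for i, c in enumerate(order)
    ]

for page in guided_tour(concept_map):
    print(page)
# {'page': 'variables', 'next': 'loops'} ... {'page': 'recursion', 'next': None}
```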

    Evolving Networks and Social Network Analysis Methods and Techniques

    Evolving networks are, by definition, networks that change as a function of time. They are a natural extension of network science, since almost all real-world networks evolve over time by adding or removing nodes or links: elementary actor-level measures such as network centrality change over time, the popularity and influence of individuals grow or fade depending on ongoing processes, and events occur in networks during particular time intervals. Other problems, such as computing network-level statistics, link prediction, community detection, and visualization, gain additional research importance when applied to dynamic online social networks (OSNs). Due to their temporal dimension, the rapid growth of users, the velocity of change, and the amount of data these OSNs generate, methods and techniques that are effective and efficient on small static networks must now scale and handle the temporal dimension of streaming settings. This chapter reviews the state of the art in selected aspects of evolving social networks and presents open research challenges related to OSNs. These challenges suggest that significant further research is required: existing methods, techniques, and algorithms must be rethought and redesigned as incremental and dynamic versions that allow the efficient analysis of evolving networks.
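    As a minimal illustration of the incremental reformulation the chapter calls for, the following maintains degree centrality under a stream of edge insertions and deletions in O(1) per update instead of recomputing from scratch; degree is the simplest such measure, and the class and variable names are assumptions for the sketch.

```python
from collections import defaultdict

class StreamingDegreeCentrality:
    """Maintain degree centrality over a stream of edge events.

    Each insert/delete costs O(1), versus recomputing centrality over
    the whole network after every change.  Degree is the easy case;
    measures like betweenness need far more sophisticated dynamic
    algorithms.  Nodes are counted once seen, even if their degree
    later drops to zero (a simplification).
    """
    def __init__(self):
        self.degree = defaultdict(int)

    def add_edge(self, u, v):
        self.degree[u] += 1
        self.degree[v] += 1

    def remove_edge(self, u, v):
        self.degree[u] -= 1
        self.degree[v] -= 1

    def centrality(self, node):
        """Degree normalized by the maximum possible (n - 1 nodes)."""
        n = len(self.degree)
        return self.degree[node] / (n - 1) if n > 1 else 0.0

g = StreamingDegreeCentrality()
g.add_edge("alice", "bob")
g.add_edge("alice", "carol")
print(g.centrality("alice"))   # 1.0: connected to both other nodes
g.remove_edge("alice", "bob")
print(g.centrality("alice"))   # 0.5 after the deletion
```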

    Seventh Biennial Report: June 2003 - March 2005

