17,426 research outputs found

    BlogForever D2.6: Data Extraction Methodology

    Get PDF
    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform

    Strategies for online inference of model-based clustering in large and growing networks

    Full text link
    In this paper we adapt online estimation strategies to perform model-based clustering on large networks. Our work focuses on two algorithms, the first based on the SAEM algorithm, and the second on variational methods. These two strategies are compared with existing approaches on simulated and real data. We use the method to decipher the connexion structure of the political websphere during the US political campaign in 2008. We show that our online EM-based algorithms offer a good trade-off between precision and speed, when estimating parameters for mixture distributions in the context of random graphs.Comment: Published in at http://dx.doi.org/10.1214/10-AOAS359 the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org

    Automatic Genre Classification in Web Pages Applied to Web Comments

    Get PDF
    Automatic Web comment detection could significantly facilitate information retrieval systems, e.g., a focused Web crawler. In this paper, we propose a text genre classifier for Web text segments as intermediate step for Web comment detection in Web pages. Different feature types and classifiers are analyzed for this purpose. We compare the two-level approach to state-of-the-art techniques operating on the whole Web page text and show that accuracy can be improved significantly. Finally, we illustrate the applicability for information retrieval systems by evaluating our approach on Web pages achieved by a Web crawler

    Node Classification in Uncertain Graphs

    Full text link
    In many real applications that use and analyze networked data, the links in the network graph may be erroneous, or derived from probabilistic techniques. In such cases, the node classification problem can be challenging, since the unreliability of the links may affect the final results of the classification process. If the information about link reliability is not used explicitly, the classification accuracy in the underlying network may be affected adversely. In this paper, we focus on situations that require the analysis of the uncertainty that is present in the graph structure. We study the novel problem of node classification in uncertain graphs, by treating uncertainty as a first-class citizen. We propose two techniques based on a Bayes model and automatic parameter selection, and show that the incorporation of uncertainty in the classification process as a first-class citizen is beneficial. We experimentally evaluate the proposed approach using different real data sets, and study the behavior of the algorithms under different conditions. The results demonstrate the effectiveness and efficiency of our approach

    Building an IT Taxonomy with Co-occurrence Analysis, Hierarchical Clustering, and Multidimensional Scaling

    Get PDF
    Different information technologies (ITs) are related in complex ways. How can the relationships among a large number of ITs be described and analyzed in a representative, dynamic, and scalable way? In this study, we employed co-occurrence analysis to explore the relationships among 50 information technologies discussed in six magazines over ten years (1998-2007). Using hierarchical clustering and multidimensional scaling, we have found that the similarities of the technologies can be depicted in hierarchies and two-dimensional plots, and that similar technologies can be classified into meaningful categories. The results imply reasonable validity of our approach for understanding technology relationships and building an IT taxonomy. The methodology that we offer not only helps IT practitioners and researchers make sense of numerous technologies in the iField but also bridges two related but thus far largely separate research streams in iSchools - information management and IT management
    • 

    corecore