13 research outputs found

    Enhancing clustering blog documents by utilizing author/reader comments

    Blogs are a new form of internet phenomenon and a vast, ever-increasing information resource. Mining blog files for information is a very new research direction in data mining. Blog files differ from standard web files and may need specialized mining strategies. We propose to include the title, body, and comments of blog pages when clustering datasets built from blog documents. In particular, we argue that the author/reader comments of blog pages may have a stronger discriminating effect in clustering blog documents. We constructed a word-page matrix by downloading blog pages from a well-known website and experimented with a k-means clustering algorithm, assigning different weights to the title, body, and comment parts. Our experimental results show that assigning a larger weight to the blog comments helps the k-means algorithm produce better clustering solutions. These results confirm our hypothesis that author/reader comments are very useful in discriminating blog files.
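
    The field-weighting idea in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the function names, the use of cosine similarity (a spherical k-means variant), the deterministic choice of the first k posts as initial centroids, and the four made-up toy posts, which were deliberately constructed so that the comments carry the topic while the title and body are shared templates.

```python
import math
from collections import Counter

def normalize(vec):
    """Scale a sparse vector (word -> value) to unit length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {w: v / norm for w, v in vec.items()} if norm else {}

def weighted_vector(post, weights):
    """Sum term counts over fields, multiplying each count by its field weight."""
    counts = Counter()
    for field, weight in weights.items():
        for word in post.get(field, "").lower().split():
            counts[word] += weight
    return normalize(counts)

def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(v * b.get(w, 0.0) for w, v in a.items())

def kmeans(vectors, k, iterations=10):
    """Spherical k-means; the first k vectors seed the centroids (an assumption)."""
    centroids = [dict(v) for v in vectors[:k]]
    labels = []
    for _ in range(iterations):
        # Assignment step: each vector joins its most similar centroid.
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        # Update step: each centroid becomes the normalized sum of its members.
        for c in range(k):
            total = Counter()
            for v, label in zip(vectors, labels):
                if label == c:
                    for word, value in v.items():
                        total[word] += value
            if total:
                centroids[c] = normalize(total)
    return labels

def cluster_posts(posts, weights, k):
    return kmeans([weighted_vector(p, weights) for p in posts], k)

# Hypothetical toy posts: titles and bodies are templates, comments hold the topic.
posts = [
    {"title": "weekend notes", "body": "a long day", "comments": "great pasta recipe"},
    {"title": "daily thoughts", "body": "busy week", "comments": "football match replay"},
    {"title": "weekend notes", "body": "a long day", "comments": "great football match"},
    {"title": "daily thoughts", "body": "busy week", "comments": "pasta recipe tips"},
]

comment_heavy = cluster_posts(posts, {"title": 1, "body": 1, "comments": 5}, k=2)
template_heavy = cluster_posts(posts, {"title": 5, "body": 5, "comments": 1}, k=2)
print(comment_heavy)   # [0, 1, 1, 0]: posts grouped by comment topic
print(template_heavy)  # [0, 1, 0, 1]: posts grouped by title/body template
```

    On this toy data, a large comment weight groups the posts by the topic discussed in the comments, while a large title/body weight groups them by template. The data was built to show that contrast, so it illustrates the mechanism only and is not evidence for the paper's experimental result.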

    Clustering Weblogs on the Basis of a Topic Detection Method

    In recent years we have seen a vast increase in the volume of information published on weblog sites, along with the creation of new web technologies where people discuss current events. The need for automatic tools to organize this massive amount of information is clear, but particular characteristics of weblogs, such as shortness and overlapping vocabulary, make this task difficult. In this work, we present a novel methodology to cluster weblog posts according to the topics discussed therein. This methodology is based on a generative probabilistic model in conjunction with a Self-Term Expansion methodology. We present results which demonstrate a considerable improvement over the baseline.

    FINDING HER MASTER’S VOICE: THE POWER OF COLLECTIVE ACTION AMONG FEMALE MUSLIM BLOGGERS

    Emerging cyber-collective movements have frequently made headlines in the news. Despite the exponential growth in the number of bloggers in Muslim countries, there is a lack of empirical study of cyber-collective actions in these countries. We analyzed the female Muslim blogosphere because very little research attempts to understand the socio-political roles of female bloggers in systems where women are frequently denied freedom of expression. We collected 150 blogs from 17 countries, with posts ranging from April 2003 to July 2010, with a special focus on Al-Huwaider’s campaigns. Basing the analysis on three central tenets of individual, community, and transnational perspectives, we develop novel algorithms that model cyber-collective movements by utilizing existing social theories on collective action and computational social network analysis. This paper contributes a methodology to study the diffusion of issues in social networks and examines the roles of influential community members. We also observe the transcending nature of cyber-collective movements, with future possibilities for modeling transnational outreach. Using the global female Muslim blogosphere, we provide an understanding of the complexity and dynamics of cyber-collective action. To the best of our knowledge, our research is the first to address the lack of fundamental research on re-framing collective action theory in online environments.

    Prototype/topic based Clustering Method for Weblogs

    In the last 10 years, the information generated on weblog sites has increased exponentially, resulting in a clear need for intelligent approaches to analyse and organise this massive amount of information. In this work, we present a methodology to cluster weblog posts according to the topics discussed therein, which we derive by text analysis. We have called the methodology Prototype/Topic Based Clustering, an approach based on a generative probabilistic model in conjunction with a Self-Term Expansion methodology. The Self-Term Expansion methodology is used to improve the representation of the data, and the generative probabilistic model is employed to identify relevant topics discussed in the weblogs. We have modified the generative probabilistic model in order to exploit predefined initialisations of the model and have performed our experiments on narrow and wide domain subsets. The results of our approach have demonstrated a considerable improvement over the pre-defined baseline and alternative state of the art approaches, achieving an improvement of up to 20% in many cases. The experiments were performed on both narrow and wide domain datasets, with the latter showing better improvement. However, in both cases, our results outperformed the baseline and state of the art algorithms.
    The work of the third author was carried out in the framework of the WIQ-EI IRSES project (Grant No. 269180) within the FP7 Marie Curie, the DIANA APPLICATIONS Finding Hidden Knowledge in Texts: Applications (TIN2012-38603-C02-01) project and the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems.
    Perez-Tellez, F.; Cardiff, J.; Rosso, P.; Pinto Avendaño, DE. (2016). Prototype/topic based Clustering Method for Weblogs. Intelligent Data Analysis. 20(1):47-65. https://doi.org/10.3233/IDA-150793

    Interest identification from browser tab titles: A systematic literature review

    Modeling and understanding users' interests has become an essential part of our daily lives. A variety of business processes and a growing number of companies employ various tools to this end. The outcomes of these identification strategies benefit both companies and users: the former are more likely to offer services to those customers who really need them, while the latter are more likely to get the service they desire. Many studies have been carried out in the area of user interest identification. As a result, it might not be easy for researchers, developers, and users to orient themselves in the field; that is, to find the tools and methods that they most need, to identify ripe areas for further investigation, and to propose the development and adoption of new research plans. In this study, to overcome these potential shortcomings, we performed a systematic literature review on user interest identification, using browser tab titles as input data. Our goal is to offer the readership a resource capable of systematically guiding and reliably orienting researchers, developers, and users in this very vast domain. Our findings demonstrate that the majority of the research carried out in the field gathers data either from social networks (such as Twitter, Instagram and Facebook) or from search engines, leaving open the question of what to do when such data is not available.


    Blog content mining: topic identification and evolution extraction.

    Ng, Kuan Kit. Thesis (M.Phil.)--Chinese University of Hong Kong, 2009. Includes bibliographical references (leaves 92-100). Abstract also in Chinese.
    Contents:
    Abstract
    Acknowledgement
    Chapter 1: Introduction
        1.1 Blog Overview
        1.2 Motivation
            1.2.1 Blog Mining
            1.2.2 Topic Detection and Tracking
        1.3 Objectives and Contributions
        1.4 Proposed Methodology
    Chapter 2: Related Work
        2.1 Web Document Clustering
        2.2 Document Clustering with Temporal Information
        2.3 Blog Mining
    Chapter 3: Feature Extraction and Selection
        3.1 Blog Extraction and Content Cleaning
            3.1.1 Blog Parsing and Structure Identification
            3.1.2 Stop-word Removal
            3.1.3 Word Stemming
            3.1.4 Heuristic Content Cleaning and Multiword Grouping
        3.2 Feature Selection
            3.2.1 Term Frequency Inverse Document Frequency
            3.2.2 Term Contribution
    Chapter 4: Blog Topic Extraction
        4.1 Requirements of Document Clustering
            4.1.1 Vector Space Modeling
            4.1.2 Similarity Measurement
        4.2 Document Clustering
            4.2.1 Partitional Clustering
            4.2.2 Hierarchical Clustering
            4.2.3 Density-Based Clustering
        4.3 Proposed Concept Clustering
            4.3.1 Semantic Distance between Concepts
            4.3.2 Bounded Density-Based Clustering
            4.3.3 Document Assignment with Topic Clusters
        4.4 Discussion
    Chapter 5: Blog Topic Evolution
        5.1 Topic Evolution Graph
        5.2 Topic Evolution
    Chapter 6: Experimental Result
        6.1 Evaluation of Topic Cluster
            6.1.1 Evaluation Criteria
            6.1.2 Evaluation Result
        6.2 Evaluation of Topic Evolution
            6.2.1 Results of Topic Evolution Graph
            6.2.2 Evaluation Criteria
            6.2.3 Evaluation of Topic Evolution
            6.2.4 Case Study
    Chapter 7: Conclusions and Future Work
        7.1 Conclusions
        7.2 Future Work
    Bibliography
    Chapter A: Stop Word List
    Chapter B: Feature Selection Comparison
    Chapter C: Topic Evolution
    Chapter D: Topic Cluster

    A Survey on Web 2.0

    Today’s Internet is a far cry from the academic sharing network it began as. From the ruins of the dot-com bubble has risen a brave new Internet that O’Reilly has named Web 2.0, while others prefer names such as the social net. We were interested in what characterizes today’s Internet services and set out to study eleven Web 2.0 sites that encapsulate the new breed of Internet services. We found that O’Reilly’s definition of Web 2.0 describes well what is happening on the Internet today. Today’s Internet is indeed about harnessing collective intelligence and about user-contributed content. Huge numbers of items require us to use social navigation, with its recommender systems, to find items of interest, and users have advanced from being simple consumers of content to being a major source of Web 2.0 content as well. Users contribute content directly by uploading text (in blogs, forums, and reviews), photos, and video clips, and in addition to such intentionally contributed content, the systems generate content by tracking user activities. Moreover, today’s Internet services are characterized by sociability. While some services merely provide means for communal discourse, many others, such as MySpace, LinkedIn, and Facebook, are based on building and maintaining social networks. Regrettably, the social aspects and user-contributed content of the services have also led to multi-faceted privacy concerns and even to criminal activities such as identity theft and child molestation. Furthermore, copyright violations have become an everyday phenomenon. This survey offers examples of modern, state-of-the-art interface features in today’s net and descriptions of the services from the user’s viewpoint. The main goal of the presentation is to outline the current state of Internet services together with recent research findings about them. However, we have not shied away from using many blog posts and other writings on the Internet as source material, because it is on the Internet where the web of the future is currently being woven.

    Ranking, Labeling, and Summarizing Short Text in Social Media

    One of the key features driving the growth and success of the Social Web is large-scale participation through user-contributed content, often in the form of short text in social media. Unlike traditional long-form documents (e.g., Web pages, blog posts), these short text resources are typically quite brief (on the order of hundreds of characters), often of a personal nature (reflecting opinions and reactions of users), and generated at an explosive rate. Coupled with this explosion of short text in social media is the need for new methods to organize, monitor, and distill relevant information from these large-scale social systems, even in the face of the inherent “messiness” of short text, given the wide variability in quality, style, and substance of short text generated by a legion of Social Web participants. Hence, this dissertation seeks to develop new algorithms and methods to ensure the continued growth of the Social Web by enhancing how users engage with short text in social media. Concretely, this dissertation takes a three-fold approach. First, it develops a learning-based algorithm to automatically rank short text comments associated with a Social Web object (e.g., Web document, image, video) based on the expressed preferences of the community itself, so that low-quality short text may be filtered and user attention may be focused on highly-ranked short text. Second, it organizes short text through labeling, via a graph-based framework for automatically assigning relevant labels to short text. In this way, meaningful semantic descriptors may be assigned to short text for improved classification, browsing, and visualization. Third, it presents a cluster-based summarization approach for extracting high-quality viewpoints expressed in a collection of short text, while maintaining diverse viewpoints. By summarizing short text, users may quickly assess the aggregate viewpoints expressed in a collection of short text, without the need to scan each of possibly thousands of short text items.