15 research outputs found

    Analysing features of Japanese splogs and characteristics of keywords

    This paper analyzes Japanese splogs based on various characteristics of the keywords they contain. We estimate the behavior of spammers when creating splogs from other sources by analyzing the characteristics of keywords contained in splogs. Since splogs often introduce noise into word occurrence statistics in the blogosphere, we assume that we can efficiently (manually) collect splogs by sampling blog homepages that contain keywords of a certain type on the date when each keyword occurs most frequently. We manually examine various features of the collected blog homepages: whether their text content is excerpted from other sources, and whether they display affiliate advertisements or outgoing links to affiliated sites. Among various informative results, it is important to note that more than half of the collected splogs are created by a very small number of spammers.

    BlogForever: D2.5 Weblog Spam Filtering Report and Associated Methodology

    This report is a first attempt to define the BlogForever spam detection strategy. It comprises a survey of weblog spam technology and approaches to its detection. While the report was written to help identify possible approaches to spam detection as a component within the BlogForever software, the discussion has been extended to include observations on the historical, social and practical value of spam, and proposals for other ways of dealing with spam within the repository without necessarily removing it. It contains a general overview of spam types, ready-made anti-spam APIs available for weblogs, methods that have been suggested for preventing the introduction of spam into a blog, and research on spam, focusing on spam that appears in the weblog context. It concludes with a proposal for a spam detection workflow that might form the basis for the spam detection component of the BlogForever software.

    Addressing the new generation of spam (Spam 2.0) through Web usage models

    New Internet collaborative media introduce new ways of communicating that are not immune to abuse. A fake eye-catching profile on a social networking website, a promotional review, a response to a thread in an online forum with unsolicited content, or a manipulated Wiki page are examples of the new generation of spam on the web, referred to as Web 2.0 Spam or Spam 2.0. Spam 2.0 is defined as the propagation of unsolicited, anonymous, mass content to infiltrate legitimate Web 2.0 applications. The current literature does not address Spam 2.0 in depth, and the outcomes of efforts to date are inadequate. The aim of this research is to formalise a definition for Spam 2.0 and provide Spam 2.0 filtering solutions. Early detection, extendibility, robustness and adaptability are key factors in the design of the proposed method. This dissertation provides a comprehensive survey of state-of-the-art web spam and Spam 2.0 filtering methods to highlight the unresolved issues and open problems, while at the same time capturing the knowledge in the domain of spam filtering. It proposes three solutions in the area of Spam 2.0 filtering: (1) characterising and profiling Spam 2.0, (2) an Early-Detection based Spam 2.0 Filtering (EDSF) approach, and (3) an On-the-Fly Spam 2.0 Filtering (OFSF) approach. All the proposed solutions are tested against real-world datasets, and their performance is compared with that of existing Spam 2.0 filtering methods. This work has coined the term ‘Spam 2.0’, provided insight into the nature of Spam 2.0, and proposed filtering mechanisms to address this new and rapidly evolving problem.

    Social Intelligence Design 2007. Proceedings Sixth Workshop on Social Intelligence Design


    BlogForever D2.6: Data Extraction Methodology

    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform

    The Impact of the Internet on the Public Sphere and on the Culture Industry. A study of blogs, social news sites and discussion forums

    This thesis analyses how certain services of online communication (blogs, discussion forums, social news and bookmarking sites) contribute to the public sphere and to the culture industry. The concept of public sphere is derived from Jürgen Habermas' idea that political power can only be legitimate if it is applied in accordance with the best, common interests of the society – but these interests can only be crystallized in discursive debates between members of the society. However, contemporary national public spheres are said to be distorted and detached from real interests of citizens. The internet, through offering the possibility of democratic and reflexive communication, holds the potential of improving the state of public spheres. The concept of culture industry holds that the capitalization of the production of cultural products (i.e. works of art) rids societies of authoritative art, the one channel through which real individual freedom can be established. “Culture industry” is instrumental, through the promotion of consumption, to the capitalist domination of a few over masses. This, in turn, affects the general state of the public spheres. Once again, the internet has the potential to democratize this over-encompassing culture industry, through increasing cultural diversity via its several new channels of information and distribution. The analysis of blogs, discussion forums and social bookmarking and news sites confirms the democratic potential inherent in these services, but it also points out certain problems that hinder the actualization of this potential. 
It is established that the use of the generalizing category of “blogs” is misleading, because of the false underlying dichotomy of “blogs vs traditional media.” The large, fragmented and asymmetrically interlinked totality of blogs (a small, influential core and a large, extremely fragmented periphery) is found to contribute to the public sphere mostly as an alternative and very fast channel of information dissemination. The role of discussion forums is found to be ambiguous: certain forums are entirely irrelevant, while others establish powerful advocacy media and global issue publics. Social news sites are found to be potentially the most constructive from the point of view of the public sphere, because they tend to promote reasoned argumentation effectively.

    Combining granularity-based topic-dependent and topic-independent evidences for opinion detection

    Opinion mining is a sub-discipline within Information Retrieval (IR) and Computational Linguistics. It refers to the computational techniques for extracting, classifying, understanding, and assessing the opinions expressed in various online sources such as news articles, social media comments, and other user-generated content. It is also known by many other terms, such as opinion finding, opinion detection, sentiment analysis, sentiment classification, and polarity detection. Defined in a more specific and simpler context, opinion mining is the task of retrieving opinions on an issue as expressed by the user in the form of a query. There are many problems and challenges associated with the field of opinion mining. In this thesis, we focus on some major problems of opinion mining. One of the major challenges is to find opinions that specifically concern the given topic (query). A document may contain information on many topics at once, and it may contain opinionated text on each of those topics or on only a few. It therefore becomes very important to select the document segments relevant to the topic together with their corresponding opinions. We address this problem at two levels of granularity: sentences and passages. In our first approach, at the sentence level, we use WordNet semantic relations to find this association between topic and opinion. In our second approach, at the passage level, we use a more robust IR model, namely the language model, to address this problem.
The basic idea behind both contributions on opinion-topic association is that a document containing more textual segments (sentences or passages) that are both opinionated and relevant to the topic is more opinionated than a document with fewer such segments. Most machine-learning-based approaches to opinion mining are domain-dependent, i.e. their performance varies from one domain to another. A domain-independent or topic-independent approach, on the other hand, is more general and can maintain its effectiveness across different domains. However, domain-independent approaches generally suffer from poor performance. Developing an approach that is both more effective and more general is a major challenge in the field of opinion mining. Our contributions in this thesis include the development of an approach that uses simple heuristic functions to find opinionated documents. Entity-based opinion mining is becoming very popular among researchers in the IR community. It aims to identify the entities relevant to a given topic and to extract the opinions associated with them from a set of text documents. However, identifying the entities and determining their relevance is already a difficult task. We propose a system that takes into account both the information in the current news article and relevant earlier articles in order to detect the most important entities in the current news. In addition, we present our framework for opinion analysis and related tasks. This framework is based on content evidence and social evidence from the blogosphere for the tasks of opinion finding, opinion prediction, and multidimensional ranking. This early contribution lays the groundwork for our future work.
The evaluation of our methods uses the TREC 2006 Blog collection and the TREC 2004 Novelty track collection. Most of the evaluations were carried out within the framework of the TREC Blog track.

    Voicing the Web: The Trajectories of Blogging in the United States and France

    The World Wide Web has turned into an important means to share voice, that is, the narratives through which individuals give a public account of their lives. This dissertation analyzes how this key cultural process came into being and discusses some of its main implications. To this end, it studies one specific technology of subjectivity that embodies this process in fundamental ways: the blog. This dissertation examines the processes that have shaped practices of subjectivity on the Web in two countries (the United States and France) from the mid-1990s to the early years of the 2010s. The focus is on three processes: the emergence of the blog; its constitution into a means for intervening in the public sphere and a commodity; and the identity crises triggered by the rise of novel media technologies (such as “microblogging”) designed to replace or extend it. A theoretical framework is developed that makes four analytic contributions: (a) it considers media technologies as assemblages of both textual meaning and material artifacts; (b) it analyzes both the production and use of media technologies; (c) it adopts a process-orientation to make sense of the temporal development of the Web; and (d) it implements a comparative approach to identify the similarities and differences between the cases under study. Drawing on interviews with key actors, content and artifact analyses of websites, traditional archival research, and online archival research, this dissertation examines how users and software developers have enacted particular notions of the self, conceived the publicness of their Web appropriation and development practices, and built and utilized media technologies such as websites and software programs to these ends. The analysis reveals that the cultural identity of blogging as a practice of subjectivity in these two countries is neither inevitable nor neutral. 
In the United States, particular liberal notions and neoliberal assumptions have informed the imaginary surrounding blogs in crucial ways. The study also shows how and why actors in France have gradually abandoned traditional markers of exceptionalism that were key in the development of the country’s national identity and favored notions that characterize the United States instead.