4,301 research outputs found

    BlogForever: D2.5 Weblog Spam Filtering Report and Associated Methodology

    This report is a first attempt to define the BlogForever spam detection strategy. It comprises a survey of weblog spam technology and approaches to its detection. While the report was written to help identify possible approaches to spam detection as a component within the BlogForever software, the discussion has been extended to include observations on the historical, social and practical value of spam, and proposals for other ways of dealing with spam within the repository without necessarily removing it. It contains a general overview of spam types, ready-made anti-spam APIs available for weblogs, methods that have been suggested for preventing the introduction of spam into a blog, and research related to spam, focusing on work in the weblog context. It concludes with a proposal for a spam detection workflow that might form the basis for the spam detection component of the BlogForever software.

    BlogForever D2.6: Data Extraction Methodology

    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.

    Methods for ranking user-generated text streams: a case study in blog feed retrieval

    User-generated content is one of the main sources of information on the Web nowadays. With the huge amount of this type of data being generated every day, having an efficient and effective retrieval system is essential. The goal of such a retrieval system is to enable users to search through this data and retrieve documents relevant to their information needs. Among the different retrieval tasks for user-generated content, retrieving and ranking streams is an important one with various applications. The goal of this task is to rank streams, as collections of documents in chronological order, in response to a user query. This differs from traditional retrieval tasks, where the goal is to rank single documents and temporal properties are less important in the ranking. In this thesis we investigate the problem of ranking user-generated streams with a case study in blog feed retrieval. Blogs, like all other user-generated streams, have specific properties and require new considerations in the retrieval methods. Blog feed retrieval can be defined as retrieving blogs with a recurrent interest in the topic of the given query. We define three different properties of blog feed retrieval, each of which introduces new challenges in the ranking task: 1) term mismatch in blog retrieval, 2) evolution of topics in blogs and 3) diversity of blog posts. For each of these properties, we investigate the corresponding challenges and propose solutions to overcome them. We further analyze the effect of our solutions on the performance of a retrieval system. We show that taking the new properties into account when developing the retrieval system can help improve state-of-the-art retrieval methods. In all the proposed methods, we pay specific attention to temporal properties, which we believe carry important information in any type of stream. 
We show that, when combined with content-based information, temporal information can be useful in different situations. Although we apply our methods to blog feed retrieval, they are mostly general methods that are applicable to similar stream ranking problems such as ranking experts or ranking Twitter users.

    BlogForever D2.4: Weblog spider prototype and associated methodology

    The purpose of this document is to present the evaluation of different solutions for capturing blogs and the established methodology, and to describe the developed blog spider prototype.

    Combining granularity-based topic-dependent and topic-independent evidences for opinion detection

    Opinion mining is a sub-discipline within Information Retrieval (IR) and Computational Linguistics. It refers to the computational techniques for extracting, classifying, understanding, and assessing the opinions expressed in various online sources like news articles, social media comments, and other user-generated content. It is also known by many other terms like opinion finding, opinion detection, sentiment analysis, sentiment classification, polarity detection, etc. Defined in a more specific and simpler context, opinion mining is the task of retrieving opinions on an issue as expressed by the user in the form of a query. There are many problems and challenges associated with the field of opinion mining. In this thesis, we focus on some major problems of opinion mining. One of the major challenges of opinion mining is finding opinions that specifically concern the given topic (query). A document can contain information about many topics at once, and it may contain opinionated text about each of the topics or about only a few. It therefore becomes very important to select the segments of the document that are relevant to the topic, together with their corresponding opinions. We address this problem at two levels of granularity, sentences and passages. In our first, sentence-level approach, we use WordNet semantic relations to find this association between topic and opinion. In our second, passage-level approach, we use a more robust IR model, i.e. the language model, to address this problem. 
The basic idea behind both contributions for opinion-topic association is that if a document contains more opinionated, topic-relevant textual segments (sentences or passages), it is more opinionated than a document with fewer opinionated, topic-relevant textual segments. Most machine-learning-based approaches to opinion mining are domain-dependent, i.e. their performance varies from one domain to another. On the other hand, a domain- or topic-independent approach is more general and can maintain its effectiveness across different domains. However, domain-independent approaches generally suffer from poor performance. Developing an approach that is both effective and general is a major challenge in the field of opinion mining. Our contributions in this thesis include the development of an approach that uses simple heuristic functions to find opinionated documents. Entity-based opinion mining is becoming very popular among researchers in the IR community. It aims to identify the entities relevant to a given topic and to extract the opinions associated with them from a set of textual documents. However, identifying entities and determining their relevance is already a difficult task. We propose a system that takes into account information from both the current news article and relevant earlier articles in order to detect the most important entities in the current news. In addition, we present our framework for opinion analysis and related tasks. This framework is based on content evidence and social evidence from the blogosphere for the tasks of opinion finding, prediction, and multidimensional review ranking. This early-stage contribution lays the groundwork for our future work. 
The evaluation of our methods includes the use of the TREC 2006 Blog collection and the TREC 2004 Novelty track collection. Most of the evaluations were carried out within the framework of the TREC Blog track.

    The voting model for people search

    The thesis investigates how persons in an enterprise organisation can be ranked in response to a query, so that those persons with relevant expertise to the query topic are ranked first. The expertise areas of the persons are represented by documentary evidence of expertise, known as candidate profiles. The statement of this research work is that the expert search task in an enterprise setting can be successfully and effectively modelled using a voting paradigm. In the so-called Voting Model, when a document is retrieved for a query, this document represents a vote for every expert associated with the document to have relevant expertise to the query topic. This voting paradigm is manifested by the proposition of various voting techniques that aggregate the votes from documents to candidate experts. Moreover, the research work demonstrates that these voting techniques can be modelled in terms of a Bayesian belief network, providing probabilistic semantics for the proposed voting paradigm. The proposed voting techniques are thoroughly evaluated on three standard expert search test collections, deriving conclusions concerning each component of the Voting Model, namely the method used to identify the documents that represent each candidate's expertise areas, the weighting models that are used to rank the documents, and the voting techniques which are used to convert the ranking of documents into the ranking of experts. Effective settings are identified and insights about the behaviour of each voting technique are derived. Moreover, the practical aspects of deploying an expert search engine such as its efficiency and how it should be trained are also discussed. This thesis includes an investigation of the relationship between the quality of the underlying ranking of documents and the resulting effectiveness of the voting techniques. 
The thesis shows that various effective document retrieval approaches have a positive impact on the performance of the voting techniques. Interestingly, it also shows that a 'perfect' ranking of documents does not necessarily translate into an equally perfect ranking of candidates. Insights are provided into the reasons for this, which relate to the complexity of evaluating tasks based on ranking aggregates of documents. Furthermore, it is shown how query expansion can be adapted and integrated into the expert search process, such that the query expansion successfully acts on a pseudo-relevant set containing only a list of names of persons. Five ways of performing query expansion in the expert search task are proposed, which vary in the extent to which they tackle expert search-specific problems, in particular, the occurrence of topic drift within the expertise evidence for each candidate. Not all documentary evidence of expertise for a given person is equally useful, nor may there be sufficient expertise evidence for a relevant person within an enterprise. This thesis investigates various approaches to identify the high-quality evidence for each person, and shows how the World Wide Web can be mined as a resource to find additional expertise evidence. This thesis also demonstrates how the proposed model can be applied to other people search tasks such as ranking blog(ger)s in the blogosphere setting, and suggesting reviewers for papers submitted to an academic conference. The central contributions of this thesis are the introduction of the Voting Model, and the definition of a number of voting techniques within the model. The thesis draws insights from an extremely large and exhaustive set of experiments, involving many experimental parameters, and using different test collections for several people search tasks. 
This illustrates the effectiveness and the generality of the Voting Model at tackling various people search tasks and, indeed, the retrieval of aggregates of documents in general.

    What and how do companies benefit from social media?: a review of seven company case studies

    Abstract. Social media (SM) has become part of daily life in the information society. People show an increasing tendency to build and nurture their online social relationships on SM platforms. Organizational SM has become an important research area for both scholars and practitioners interested in online technologies. It is worth studying what research has been done on SM usage in organizations, and how organizations have used SM for their own specific purposes. SM use in organizations has grown continuously. Business enterprises have quickly recognized the value of shared content. SM has been increasingly adopted in the workplace for decision-making, supporting corporate communication, knowledge management, facilitating communication both inside the organization and with stakeholders outside it, increasing social capital, enhancing brand value and promoting marketing practice, in both business-to-business and business-to-customer settings. Many corporations are using blogs, wikis, and social networking sites (SNS) as routine parts of their business operations. SM usage can be broadly classified as internal or external. Each has different purposes and tactics depending on the targeted users. For the thesis work, a literature review was conducted by studying existing empirical research on the usage of SM in organizations from selected scientific articles. The current status of organizational SM usage has been investigated. Two theories have been chosen: the affordances of SM (Visibility, Persistence, Editability, and Association) and the honeycomb functional building blocks of SM (Identity, Conversations, Sharing, Presence, Relationships, Reputation and Groups). 
Based on them, case studies of seven companies have been explored, aiming to identify some of the four affordances and some of the seven functional blocks in organizational SM activities, and to explain how they influence organizational behaviour. Applying these two theories shows that organizations may participate in SM platforms efficiently and effectively in various ways. Some good practices of SM usage are portrayed. The risks related to using SM platforms are mentioned as well. Potential future research on organizational SM usage is discussed.

    Technical Services Transparency: Using a LibGuide to Expose the Mysteries of Technical Services

    Technical services departments in academic libraries have long struggled to communicate effectively with other library departments, particularly public services departments. As academic libraries acquire large numbers of digital resources, technical services departments are increasingly responsible for providing current information about those resources to public services staff. The authors of this paper describe the process of creating, testing, and implementing LibGuides (proprietary software for building library portals and facilitating information sharing in libraries) as a new way of communicating much-needed information between technical services and public services staff at Miami University Libraries.