    BlogForever D2.6: Data Extraction Methodology

    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform

    Living Knowledge

    Diversity, especially manifested in language and knowledge, is a function of local goals, needs, competences, beliefs, culture, opinions and personal experience. The Living Knowledge project considers diversity as an asset rather than a problem. Within the project, foundational ideas that emerged from the synergistic contribution of different disciplines, methodologies (with which many partners were previously unfamiliar) and technologies flowed into concrete diversity-aware applications such as the Future Predictor and the Media Content Analyser, providing users with better structured information while coping with Web-scale complexities. The key notions of diversity, fact, opinion and bias have been defined in relation to three methodologies: Media Content Analysis (MCA), which operates from a social sciences perspective; Multimodal Genre Analysis (MGA), which operates from a semiotic perspective; and Facet Analysis (FA), which operates from a knowledge representation and organization perspective. A conceptual architecture that pulls all of them together has become the core of the tools for automatic extraction and the way they interact. In particular, the conceptual architecture has been implemented with the Media Content Analyser application. The scientific and technological results obtained are described in the following

    Combining granularity-based topic-dependent and topic-independent evidences for opinion detection

    Opinion mining, a sub-discipline within Information Retrieval (IR) and Computational Linguistics, refers to the computational techniques for extracting, classifying, understanding, and assessing the opinions expressed in various online news sources, social media comments, and other user-generated content. It is also known by many other terms such as opinion finding, opinion detection, sentiment analysis, sentiment classification, and polarity detection. Defined in a more specific and simpler context, opinion mining is the task of retrieving opinions on an issue as expressed by the user in the form of a query. There are many problems and challenges associated with opinion mining. In this thesis, we focus on some of its major problems. One of the major challenges of opinion mining is finding opinions that specifically concern the given topic (query). A document can contain information about many topics at once, and it may contain opinionated text about each of those topics or about only a few of them. It therefore becomes very important to select the topic-relevant segments of the document together with their corresponding opinions. We address this problem at two levels of granularity: sentences and passages. In our first, sentence-level approach, we use the semantic relations of WordNet to find this association between topic and opinion. In our second, passage-level approach, we use a more robust IR model, namely the language model, to address this problem.
The basic idea behind both contributions for opinion-topic association is that if a document contains more opinionated and topic-relevant textual segments (sentences or passages), it is more opinionated than a document with fewer such segments. Most machine-learning-based approaches to opinion mining are domain-dependent, i.e., their performance varies from one domain to another. By contrast, a domain- or topic-independent approach is more general and can maintain its effectiveness across different domains. However, domain-independent approaches generally suffer from poor performance. Developing an approach that is both effective and general is a major challenge in the field of opinion mining. Our contributions in this thesis include the development of an approach that uses simple heuristic functions to find opinionated documents. Entity-based opinion mining is becoming very popular among researchers in the IR community. It aims to identify the entities relevant to a given topic and to extract the opinions associated with them from a set of textual documents. However, identifying the entities and determining their relevance is already a difficult task. We propose a system that takes into account both the information in the current news article and relevant earlier articles in order to detect the most important entities in the current news. In addition, we present our framework for opinion analysis and related tasks. This framework is based on content evidence and social evidence from the blogosphere for the tasks of opinion finding, opinion prediction, and multidimensional ranking. This early contribution lays the groundwork for our future work.
Our methods are evaluated using the TREC 2006 Blog collection and the TREC Novelty track 2004 collection. Most of the evaluations were carried out within the framework of the TREC Blog track.

    The Strategic Adaptation of Party Organizations to New Information and Communication Technologies: A Study of Catalan and Spanish Parties

    This paper focuses on the cases of the two major Spanish (PP and PSOE) and Catalan (PSC and CDC) parties in the period just after the Spanish general elections of May 2008, when these parties held their party conferences. In general, three kinds of actors can be distinguished: first, cyber-activists who try to gain formal recognition of their activity within their parties; second, party leaders, who may try to promote the party's presence in cyberspace but may also remain undecided because the net electoral impact of cyber-activism is unclear; and finally, some traditional off-line members, who are typically reluctant to recognize cyber-activism because it threatens their expected payoffs within the party. This paper shows how our parties responded to the challenge of cyber-activism and concludes that their electoral situation, mediated by their ideology, organizational structure and type of membership, can help us understand the differing degrees of institutionalization within the party organization

    From people to entities: typed search in the enterprise and the web

    no abstract

    Semi-Supervised Learning For Identifying Opinions In Web Content

    Thesis (Ph.D.) - Indiana University, Information Science, 2011. Opinions published on the World Wide Web (Web) offer opportunities for detecting personal attitudes regarding topics, products, and services. The opinion detection literature indicates that both a large body of opinions and a wide variety of opinion features are essential for capturing subtle opinion information. Although a large amount of opinion-labeled data is preferable for opinion detection systems, opinion-labeled data is often limited, especially at sub-document levels, and manual annotation is tedious, expensive and error-prone. This shortage of opinion-labeled data is less challenging in some domains (e.g., movie reviews) than in others (e.g., blog posts). While a simple method for improving accuracy in challenging domains is to borrow opinion-labeled data from a non-target data domain, this approach often fails because of the domain transfer problem: opinion detection strategies designed for one data domain generally do not perform well in another domain. However, while it is difficult to obtain opinion-labeled data, unlabeled user-generated opinion data are readily available. Semi-supervised learning (SSL) requires only limited labeled data to automatically label unlabeled data and has achieved promising results in various natural language processing (NLP) tasks, including traditional topic classification; but SSL has been applied in only a few opinion detection studies. This study investigates application of four different SSL algorithms in three types of Web content: edited news articles, semi-structured movie reviews, and the informal and unstructured content of the blogosphere. SSL algorithms are also evaluated for their effectiveness in sparse data situations and domain adaptation. Research findings suggest that, when there is limited labeled data, SSL is a promising approach for opinion detection in Web content. 
Although the contributions of SSL varied across data domains, significant improvement was demonstrated for the most challenging data domain--the blogosphere--when a domain transfer-based SSL strategy was implemented

    Une exploration désagrégée de corpus d'archives Web pour étudier des collectifs migrants éteints.

    The Web is an unsteady environment. As Web sites emerge and expand every day, whole communities may fade away over time, leaving too few or incomplete traces on the living Web. Worldwide volumes of Web archives preserve the history of the Web and reduce the loss of this digital heritage. Web archives remain essential to the comprehension of the lifecycles of extinct online collectives. In this paper, we propose a framework to follow the internal dynamics of vanished Web communities, based on the exploration of corpora of Web archives. To achieve this goal, we define a new unit of analysis called the Web fragment: a semantic and syntactic subset of a given Web page, designed to increase historical accuracy. This contribution has practical value for those who conduct large-scale archive exploration (in terms of time range and volume) or are interested in computational approaches to Web history and social science. By applying our framework to the Moroccan archives of the e-Diasporas Atlas, we first witness the collapse of an established community of Moroccan migrant blogs. We show its progressive mutation towards rising social platforms between 2008 and 2018. Then, we study the sudden creation of an ephemeral collective of forum members gathered by the wave of the Arab Spring in early 2011. We finally yield new insights into historical Web studies by suggesting the concept of the pivot moment of the Web

    Uncontrollable Color: Street Art Meets Street Style

    Uncontrollable Color: Street Art Meets Street Style is the title of my senior Capstone thesis project for the Bachelor of Fine Arts degree in Fashion Design at Syracuse University. For this project, I designed a collection of six complete outfits that explore the theme of London street art. The thesis project will be presented in a public fashion show at Syracuse University at the end of April. In this critical statement, I will explore a brief history of fashion design, fabric dyes, and street style. Chapter two explains the origins of my inspiration for the collection. The last portion of this essay is dedicated to an explanation of my design process and a discussion of how I created my collection