7 research outputs found

    Live monitoring 4chan discussion threads

    The 4chan portal has been known for several years as a "fringe" internet service for sharing and commenting on pictures. Thanks to the possibility of posting anonymously, guaranteed by the total lack of a registration/identification mechanism, the portal has evolved into a global, if mostly US-centred, locus for the posting of extreme views, including racism and all sorts of hate speech. A pivotal role in the emergence of the website as a bastion of "free speech" has been played by the /pol/ board (https://boards.4chan.org/pol/), which declares its commitment to hosting "politically incorrect" discussions. Several research groups have studied 4chan's structure, dynamics and content intensively. Thanks to works such as [4, 12], we now have a fairly clear description of how 4chan works and what type of discussion dynamics the site supports. In particular, the latter work shed light on the extremely ephemeral nature of discussions, with threads lasting on the website for a few hours at most, and often just for minutes, depending on the traffic they generate, before being removed to make room for new discussions. Given the fast-paced evolution of the boards' content, and especially given how such ephemerality shapes the tone and the content of the discussion itself [4, 14], it is extremely important for researchers to be able to capture the content of threads at various points over the course of their short lives. To the best of our knowledge, the existing 4chan literature has relied either on first-hand exploration by the scholars [14] or on large-scale data collection campaigns that drew their content from the archived versions of the threads [12], i.e. on copies of the threads as they appeared at the time of their closure, and at that time only. In order to observe the content of the website at a more fine-grained level, we devised a "scraping" architecture, summarised in Figure 2, which is based on the OXPath platform [9]. It enables the retrieval of the threads posted on a board at various points while they are still live.
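    The abstract describes an OXPath-based architecture for capturing threads repeatedly while they are live, rather than only at archive time; the OXPath wrappers themselves are not reproduced here. Purely as an illustration of the monitoring idea, the sketch below polls a board through 4chan's public read-only JSON endpoints (a.4cdn.org). The board name, polling interval and file naming are arbitrary choices for this example, not the paper's setup.

```python
# Illustrative sketch (not the paper's OXPath pipeline): periodically snapshot
# the threads currently live on a board via 4chan's read-only JSON API.
import json
import time
import urllib.request

BOARD = "pol"      # board to monitor (example choice)
INTERVAL = 300     # seconds between snapshot rounds (assumed polling rate)

def fetch_json(url):
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def snapshot_board(board):
    """Return {thread_no: thread_json} for every thread currently live."""
    pages = fetch_json(f"https://a.4cdn.org/{board}/threads.json")
    snapshot = {}
    for page in pages:
        for t in page["threads"]:
            no = t["no"]
            snapshot[no] = fetch_json(f"https://a.4cdn.org/{board}/thread/{no}.json")
            time.sleep(1)  # stay well within the API's rate limit
    return snapshot

if __name__ == "__main__":
    while True:
        snap = snapshot_board(BOARD)
        with open(f"{BOARD}-{int(time.time())}.json", "w") as fh:
            json.dump(snap, fh)  # one file per capture round
        time.sleep(INTERVAL)
```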

    Web Data Extraction For Content Aggregation From E-Commerce Websites

    The World Wide Web has become an unlimited source of data, and search engines have made this information available to the everyday Internet user. There is still information that is not easily accessible through existing search engines, so there remains a need to build new search engines that present information in new and better ways than before. In order to present data in a form that creates added value, the data must first be collected, then processed and analysed. This master's thesis focuses on the data collection part of that process. It presents ZedBot, a modern information extraction system that converts the semi-structured data found on web pages into structured form with high accuracy. The system meets most of the requirements set for a modern data extraction system: it is platform independent, it offers a powerful rule description language, a semi-automatic rule (wrapper) generation system and an easy-to-use interface for annotating structured data. A specially designed web crawler allows extraction to be performed across an entire website without human intervention. We show that the presented tool is suitable for extracting highly accurate data from a large number of websites, and that the resulting dataset can be used as a data source for a product information aggregation system to create new added value.
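    ZedBot's own rule language, wrapper generation and crawler are not detailed in the abstract; as a generic illustration of what a wrapper for semi-structured product pages does, the sketch below applies hand-written XPath rules with lxml. The WRAPPER rules, field names and extract() helper are invented for this example and are not ZedBot's API.

```python
# Hypothetical wrapper sketch: turn a semi-structured product listing page into
# structured records using lxml XPath rules.
import lxml.html

# A made-up "wrapper": one XPath locating each record, plus field-level rules
# evaluated relative to the record node.
WRAPPER = {
    "record": "//div[@class='product']",
    "fields": {
        "name":  ".//h2/a/text()",
        "price": ".//span[@class='price']/text()",
        "url":   ".//h2/a/@href",
    },
}

def extract(html_text, wrapper=WRAPPER):
    """Apply the wrapper rules to one page and return a list of record dicts."""
    doc = lxml.html.fromstring(html_text)
    records = []
    for node in doc.xpath(wrapper["record"]):
        rec = {}
        for field, rule in wrapper["fields"].items():
            values = node.xpath(rule)
            rec[field] = values[0].strip() if values else None
        records.append(rec)
    return records
```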

    Web Data Extraction, Applications and Techniques: A Survey

    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool for performing data analysis in Business and Competitive Intelligence systems, as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather the large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, which offers unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential for cross-fertilization, i.e., the possibility of re-using Web Data Extraction techniques originally designed to work in a given domain in other domains.

    Acquisition des contenus intelligents dans l’archivage du Web

    Web sites are dynamic by nature, with content and structure changing over time; many pages on the Web are produced by content management systems (CMSs). Tools currently used by Web archivists to preserve the content of the Web blindly crawl and store Web pages, disregarding the CMS the site is based on and whatever structured content its pages contain. We first present an application-aware helper (AAH) that fits into an archiving crawl processing chain to perform intelligent and adaptive crawling of Web applications, given a knowledge base of common CMSs. The AAH has been integrated into two Web crawlers in the framework of the ARCOMEM project: the proprietary crawler of the Internet Memory Foundation and a customized version of Heritrix. We then propose ACEBot (Adaptive Crawler Bot for data Extraction), an efficient unsupervised, structure-driven crawler that exploits the inner structure of pages and guides the crawling process based on the importance of their content. ACEBot works in two phases: in the offline phase, it constructs a dynamic site map (limiting the number of URLs retrieved) and learns a traversal strategy based on the importance of navigation patterns (selecting those leading to valuable content); in the online phase, ACEBot performs massive downloading following the chosen navigation patterns. The AAH and ACEBot make 7 and 5 times fewer HTTP requests, respectively, than a generic crawler, without compromising effectiveness. We finally propose OWET (Open Web Extraction Toolkit), a free platform for semi-supervised data extraction. OWET allows a user to extract the data hidden behind Web forms.
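    The thesis itself specifies how ACEBot learns its traversal strategy; the sketch below is only a loose, hypothetical illustration of the structure-driven idea described above: group site-map URLs by path shape, score each navigation pattern by the value of the content it leads to, and keep the best few patterns for the bulk-download phase. The pattern_of() and select_patterns() helpers and the content_score callback are assumptions, not the thesis's algorithm.

```python
# Illustrative sketch of structure-driven crawl planning (not ACEBot's code).
from collections import defaultdict
from urllib.parse import urlparse

def pattern_of(url):
    """Abstract a URL into a navigation pattern, e.g. /forum/<n>/thread/<n>."""
    parts = urlparse(url).path.strip("/").split("/")
    return "/" + "/".join("<n>" if p.isdigit() else p for p in parts)

def select_patterns(urls, content_score, budget=3):
    """Rank navigation patterns by the average content score of sampled pages.

    content_score(url) -> float is assumed to measure how much valuable
    content the page at `url` carries (e.g. amount of main text).
    """
    groups = defaultdict(list)
    for url in urls:
        groups[pattern_of(url)].append(url)
    ranked = sorted(
        groups.items(),
        key=lambda kv: sum(content_score(u) for u in kv[1][:10]) / min(len(kv[1]), 10),
        reverse=True,
    )
    # Only the highest-scoring patterns are followed in the online phase.
    return [pattern for pattern, _ in ranked[:budget]]
```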

    XPath-based information extraction


    Geocomputation methods for spatial economic analysis

    Unpublished doctoral thesis, defended at the Universidad Autónoma de Madrid, Facultad de Ciencias Económicas y Empresariales, Departamento de Economía Aplicada, on 18-02-2019. Geocomputation is a new scientific paradigm that uses computational techniques to analyze spatial phenomena. Spatial economics and regional science quickly adopted geocomputation techniques to study the complex structures of urban and regional systems. This thesis contributes to the use of geocomputation in spatial economic analysis through the construction and application of a new set of algorithms and functions in the R programming language for dealing with spatial economic data. First, we created the 'DataSpa' package, which collects data at low geographical levels to generate socio-economic information for Spanish municipalities using URL parsing, PDF extraction and web scraping. Second, based on a search-and-replace algorithm, we built the 'msp' package to harmonize data with accuracy problems such as spelling errors, abbreviated acronyms and names listed differently. This methodology enables the study of patenting activity and research collaboration in Chile between 1989 and 2013. We also adapted classical spatial autocorrelation methods to visualize and explore the existence of productivity spillovers among the network's members. Finally, we created 'estdaR' to improve knowledge of Chile's urban system by evaluating the influence of spatial proximity among human settlements on the evolution of cities. The package contains new tools for exploratory spatio-temporal data analysis that are very useful for detecting spatial differences in time trends. All R code used in the computations, as well as the packages themselves, is considered a research result and is freely available to other researchers in a GitHub repository.
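    The 'msp' package is an R package whose internals are not given in the abstract; the snippet below merely illustrates the kind of search-and-replace harmonization it describes, using Python's standard difflib to map noisy name variants onto a canonical list. The CANONICAL names, the normalise() and harmonise() helpers and the 0.8 cutoff are all invented for this example.

```python
# Hypothetical illustration of name harmonization (not the 'msp' package API):
# normalise strings, then fuzzy-match them against a canonical list.
import difflib
import re

CANONICAL = ["Universidad Autonoma de Madrid", "Pontificia Universidad Catolica de Chile"]

def normalise(name):
    """Lower-case, strip punctuation and collapse whitespace."""
    name = re.sub(r"[^\w\s]", " ", name.lower())
    return re.sub(r"\s+", " ", name).strip()

def harmonise(name, canonical=CANONICAL, cutoff=0.8):
    """Return the closest canonical spelling, or the input if nothing matches."""
    table = {normalise(c): c for c in canonical}
    match = difflib.get_close_matches(normalise(name), list(table), n=1, cutoff=cutoff)
    return table[match[0]] if match else name

print(harmonise("univ. autonoma de madrid"))  # expected: "Universidad Autonoma de Madrid"
```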

    Library Automation Domain Ontology

    The dissertation focuses on library automation. It examines the relationships between the field of library automation and library management systems, taking into account the needs of libraries in relation to the process of library software selection. The main goal of the dissertation was to create a library automation domain ontology and to develop tools that support the decision-making process of library software selection. The resulting domain ontology describes the field of library automation in relation to library systems, which are set in the broader context of library automation and are viewed in terms of the needs of libraries in connection with a change of library software. To obtain terms for the names of the ontology's entities, an analysis of scholarly texts on library automation and library systems was carried out using the Voyant Tools software. The object properties and relationships among the entities in the ontology are based on the results of a qualitative survey of the needs of libraries in relation to library management software. Based on the domain ontology, flowcharts were created that illustrate a library's decision-making process when selecting a library management system. A secondary output of the dissertation is a methodological handbook for libraries, Připravujeme změnu knihovního softwaru (Preparing for a Change of Library Software).
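    The dissertation's actual ontology is not reproduced in the abstract; the snippet below only sketches, with rdflib, how a few domain classes and object properties of this kind could be declared. The LIB namespace and the class and property names are placeholders invented here, not entities taken from the dissertation's ontology.

```python
# Hypothetical sketch of a small slice of a library-automation domain ontology.
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

LIB = Namespace("http://example.org/library-automation#")  # placeholder IRI

g = Graph()
g.bind("lib", LIB)

# Example classes.
for cls in (LIB.Library, LIB.LibraryManagementSystem, LIB.Requirement):
    g.add((cls, RDF.type, OWL.Class))

# Example object properties linking libraries to requirements and systems.
g.add((LIB.hasRequirement, RDF.type, OWL.ObjectProperty))
g.add((LIB.hasRequirement, RDFS.domain, LIB.Library))
g.add((LIB.hasRequirement, RDFS.range, LIB.Requirement))

g.add((LIB.uses, RDF.type, OWL.ObjectProperty))
g.add((LIB.uses, RDFS.domain, LIB.Library))
g.add((LIB.uses, RDFS.range, LIB.LibraryManagementSystem))

print(g.serialize(format="turtle"))
```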