20 research outputs found

    Enriching Existing Test Collections with OXPath

    Extending TREC-style test collections by incorporating external resources is a time-consuming and challenging task. Making use of freely available web data requires technical skills to work with APIs or to create a web scraping program specifically tailored to the task at hand. We present a lightweight alternative that employs the web data extraction language OXPath to harvest data from web resources and add it to an existing test collection. We demonstrate this by creating an extended version of GIRT4, called GIRT4-XT, with additional metadata fields harvested via OXPath from the social sciences portal Sowiport. This allows the collection to be re-used for other evaluation purposes such as bibliometrics-enhanced retrieval. The demonstrated method can be applied to a variety of similar scenarios and is not limited to extending existing collections; it can also be used to create completely new ones with little effort.
    Comment: Experimental IR Meets Multilinguality, Multimodality, and Interaction - 8th International Conference of the CLEF Association, CLEF 2017, Dublin, Ireland, September 11-14, 2017
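
    A minimal sketch of the harvest-and-merge idea described above, written with plain Python/XPath rather than OXPath itself; the portal URL, the XPath selectors, and the added field names are hypothetical stand-ins.

```python
# Illustrative only: the paper uses OXPath wrappers against Sowiport; this shows
# the same harvest-and-merge workflow with requests + lxml. Selectors, URLs and
# field names are hypothetical.
import requests
from lxml import html

def harvest_metadata(record_url: str) -> dict:
    """Fetch one portal record page and pull additional metadata fields."""
    page = html.fromstring(requests.get(record_url, timeout=30).text)
    return {
        "CLASSIFICATION": page.xpath("string(//dd[@class='classification'])").strip(),
        "REFERENCES": page.xpath("string(//dd[@class='references'])").strip(),
    }

def enrich_document(trec_doc: str, extra: dict) -> str:
    """Append harvested fields to a TREC/GIRT-style SGML document string."""
    fields = "\n".join(f"<{tag}>{value}</{tag}>" for tag, value in extra.items() if value)
    return trec_doc.replace("</DOC>", fields + "\n</DOC>")
```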

    The emergence of interpersonal and social trust in online interactions

    My PhD work is in the area of extracting and modelling user-created data on the web. In particular, I focussed on locating and extracting user data that 'signals' the evolution of human, one-on-one interactions between participants of large social networks who remain strangers to each other. The rise of online social networks created an opportunity for social scientists to study social phenomena at a scale unseen before. The vast amount of information, combined with computer science techniques, led to significant developments in a relatively new field: Computational Social Science. Furthermore, in recent years the gig economy and the mass adoption of "business sharing" sites such as Airbnb, Uber, or JustEat drove a new wave of computational social science research into reviews, feedback, and recommendations. All these ingredients of the larger notion of social trust have been widely discussed in the literature, both in their social aspects and in computational models of trust. However, some fundamental gaps remain, and there is often confusion about when trust is being expressed and how reviews (or recommendations) relate to social trust. Additionally, the computational trust models found in the literature tend to be either entirely theoretical or focused on a specific data set, and thus lack universal applicability. The latter problem, I believe, was due to the lack of data available to researchers in the early stages of the web. Today, the broader online social networks have matured and consolidated mechanisms for allowing access to data. Access to information is rarely trivial for more specialised and smaller online communities, yet smaller, focussed platforms are precisely where social trust and interactions can be observed (or not observed) and perhaps acquire a meaning that approaches the social trust social scientists see in in-person interactions.
    To address this gap, we initially propose and discuss the following research question: "Is there a meeting point between online interactions and social trust so that the core components of trust are retained?" We addressed this general open question by working on a computational architecture for data retrieval in social media platforms that can be suitably generalised and re-applied to different platforms. Lastly, as we enjoy the luxury of vast amounts of data that closely represent interpersonal and social trust, we addressed the questions of "what models of trust emerge from the data" and "how do existing models of trust perform on the data available". I have defined a category of online social networks that retains the core components of social trust, which we call "Online Social Networks of Needs". Hence, I provide a classification and categorisation mechanism for grouping online social networks of needs by the level of trust (the cooperation threshold) necessary for cooperation and interactions to be triggered among participating cognitive agents. My focus has always been on data acquisition, and I have designed and implemented a system for data retrieval that is easily deployed to social media/social web platforms. A case study of this system operating in a challenging scenario is further detailed to show its wider applicability to data retrieval in a setting of complete distrust, anonymity, and ephemerality of data (such as 4chan.org). Studying the granularity of 4chan data, we discovered that:
    1. ephemerality is not sustained, and web archiving sites have a complete view of the ephemeral data [1];
    2. we can track sentiment and topic modelling of moderation in 4chan [2]; and
    3. it is possible to have a live view of the topics and sentiment being discussed on the live board and see how these change over time.
    We studied the dynamics of high-trust interactions [3] and found gender biases [4,5] in care interactions. Another topic related to trust, but concerning institutions and media, is the 'spillover' effect between 4chan and the traditional media. As a premise, 4chan anonymous threads have anticipated important global trends, notably the "Anonymous" movement. Apart from the US, how do national topics interact with the essentially global discussion that takes place there? Again, thanks to our extensive data collection and analysis, we sought to determine the level of participation from a selected non-US country, Norway, and the degree to which Norwegian 4chan /pol/ users and domestic news influence each other [6]. We continued by collecting data from eight social networks of needs in the two most trust-demanding categories. While these datasets are made available to researchers [7], we further study the emerging networks and their properties and project the online social networks of needs into multiplex graphs by transforming the root links. Finally, we look into the applicability and predictive power of the non-reductionist model of trust proposed by Castelfranchi. We look at social trust holistically and consider signals to evaluate fluctuations of social capital influenced by economic and political dynamics and by the domination of public discourse by conspiracy theories.
    Summary of contributions:
    1. the first comprehensive real-time scrape of 4chan (in the literature, only post hoc solutions were available);
    2. the application of Castelfranchi's theoretical model of trust to actual data from online social networks;
    3. one of the first studies of the relationship between the institutional (nationwide) press and extremism on 4chan;
    4. the study of the application of predictive models to heterogeneous multi-source data (not user-created, but not very trustworthy either); and
    5. contributing live data scraping expertise to several other publications [8], [9].
    Publications
    [1] Ylli Prifti, Iacopo Pozzana, and Alessandro Provetti. Live monitoring 4chan discussion threads. In 7th Int'l Conference on Computational Social Science, 2021.
    [2] Y. Prifti, I. Pozzana, and A. Provetti. On-line page scraping reveals evidence of moderation in 4chan/pol/ anonymous discussion threads. In Proc. of 3rd European Symposium on Societal Challenges in Computational Social Science. ETH Press, 2019.
    [3] Y. Prifti, P. De Meo, I. Pozzana, and A. Provetti. The dynamics of recommendation in high-trust personal care services. In 5th Int'l Conference on Computational Social Science (IC2S2), 2019.
    [4] Y. Prifti, P. De Meo, I. Pozzana, and A. Provetti. Finding gender bias in web-based, high-trust interactions. In Proc. of 2nd European Symposium on Societal Challenges in Computational Social Science, GeWISS reports, 2018.
    [5] Y. Prifti, P. De Meo, I. Pozzana, and A. Provetti. Gender bias in web-based, high-trust interactions. In 5th Int'l Conference on Computational Social Science (IC2S2), 2019.
    [6] Alessandro Provetti, Iacopo Pozzana, Ylli Prifti, and Anders Seyersted Sandbu. Mapping the Norwegian 4chan: How conspiracy theories travel the language barriers. In 7th Int'l Conference on Computational Social Science (IC2S2), 2021.
    [7] Ylli Prifti. 4chan /pol board as a temporary evolution of live threads and posts. July 2021.
    [8] Paschalis Lagias, George D. Magoulas, Ylli Prifti, and Alessandro Provetti. Predicting seriousness of injury in a traffic accident: A new imbalanced dataset and benchmark. In Lazaros Iliadis, Chrisina Jayne, Anastasios Tefas, and Elias Pimenidis, editors, Engineering Applications of Neural Networks - 23rd International Conference, EAAAI/EANN 2022, Chersonissos, Crete, Greece, June 17-20, 2022, Proceedings, volume 1600 of Communications in Computer and Information Science. Springer, 2022.
    [9] Andrea Ballatore, A. Pang, Iacopo Pozzana, Ylli Prifti, and Alessandro Provetti. Geo-referencing as a connector between user reviews and urban environment quality. In 5th Int'l Conference on Computational Social Science, 2019.
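
    As an illustration of the real-time 4chan scraping described above, here is a minimal polling sketch against 4chan's public read-only JSON API (a.4cdn.org); it is not the thesis system, and the polling interval and the fields kept are illustrative choices.

```python
# Poll the board's thread list and fetch new posts before threads expire.
import time
import requests

BOARD = "pol"

def list_thread_ids():
    """Ids of all threads currently alive on the board."""
    pages = requests.get(f"https://a.4cdn.org/{BOARD}/threads.json", timeout=30).json()
    return [t["no"] for page in pages for t in page["threads"]]

def fetch_posts(thread_id):
    """Posts of one thread (empty list if it has already expired)."""
    r = requests.get(f"https://a.4cdn.org/{BOARD}/thread/{thread_id}.json", timeout=30)
    return r.json().get("posts", []) if r.ok else []

seen = set()
while True:                                   # poll faster than threads expire
    for tid in list_thread_ids():
        for post in fetch_posts(tid):
            if post["no"] not in seen:
                seen.add(post["no"])
                print(tid, post["no"], post.get("com", ""))   # store rather than print in practice
    time.sleep(60)
```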

    Web Data Extraction For Content Aggregation From E-Commerce Websites

    The World Wide Web has become an unlimited source of data. Search engines have made this information available to the everyday Internet user. Nevertheless, there is still information that is not easily accessible through existing search engines, so there remains a need to build new search engines that present information in new ways, better than has been done before. In order to present data in a form that creates added value, the data must first be collected and then processed and analysed. This master's thesis focuses on the data collection phase of that process. We present ZedBot, a modern data extraction system that converts the semi-structured data found on web pages into structured form with high accuracy. The system fulfils most of the requirements set for a modern data extraction system: it is platform independent, it has a powerful rule description system with semi-automatic wrapper generation, and it offers an easy-to-use user interface for annotating data. A specially designed crawler allows extraction to be performed across an entire web site without human intervention. We show that the presented tool is suitable for extracting highly accurate data from a large number of websites, and that the resulting dataset can be used for product information aggregation and for creating new added value.
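
    As a companion to the abstract above, the following is a minimal sketch of the wrapper idea such a system rests on: a declarative set of extraction rules applied uniformly to a site's product pages. ZedBot's internals are not reproduced here; the rule names, selectors, and URL are hypothetical.

```python
# Illustrative rule-based product extraction: one XPath rule per structured field,
# applied to every product page discovered by a crawler. Not ZedBot itself.
import requests
from lxml import html

PRODUCT_RULES = {
    "name":  "string(//h1[@class='product-title'])",
    "price": "string(//span[@class='price'])",
    "sku":   "string(//span[@itemprop='sku'])",   # microdata, when the site provides it
}

def extract_product(url: str, rules: dict = PRODUCT_RULES) -> dict:
    """Apply the declarative rules to a single product page."""
    page = html.fromstring(requests.get(url, timeout=30).text)
    return {field: page.xpath(xpath).strip() for field, xpath in rules.items()}

# A site-level crawler would enumerate product URLs and feed them to extract_product,
# aggregating the records into one structured dataset for content aggregation.
print(extract_product("https://shop.example.com/product/123"))
```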

    BlogForever D2.6: Data Extraction Methodology

    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.
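
    A rough sketch of how RSS feeds can serve as free supervision for locating post content in blog HTML, which is the core of the unsupervised approach outlined above; the matching heuristic and its use are illustrative assumptions, not the deliverable's actual algorithm.

```python
# For each feed entry, find the HTML element whose text best matches the feed
# summary; its location can then be reused as an extraction rule for the blog.
import re
from difflib import SequenceMatcher

import feedparser
import requests
from lxml import html

def locate_content_elements(feed_url: str):
    feed = feedparser.parse(feed_url)
    for entry in feed.entries:
        summary = re.sub(r"<[^>]+>", " ", entry.get("summary", "")).strip()
        page = html.fromstring(requests.get(entry.link, timeout=30).text)
        best, best_score = None, 0.0
        for el in page.xpath("//*"):                      # candidate content regions
            text = " ".join(el.text_content().split())
            score = SequenceMatcher(None, summary, text).ratio()
            if score > best_score:
                best, best_score = el, score
        if best is not None:
            yield entry.link, page.getroottree().getpath(best), best_score
```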

    Recommendation Techniques for smart cities

    The bottleneck of event recommender systems is the availability of up-to-date information on events. Usually, there is no single data feed, so information on events must be crawled from numerous sources. Ranking these sources helps the system decide which sources to crawl and how often. In this thesis, a model for event source evaluation and ranking is proposed, based on well-known centrality measures from social network analysis. Experiments on real data crawled from Budapest event sources show interesting results for further research.
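
    A small sketch of the source-ranking idea, assuming a toy graph in which an edge means one source references or shares events from another; the measures shown (PageRank, in-degree, betweenness) are standard examples rather than the exact set evaluated in the thesis.

```python
# Rank event sources by centrality and crawl the highest-ranked ones first / most often.
import networkx as nx

# Toy source graph: an edge A -> B means source A links to or re-posts events from B.
G = nx.DiGraph([
    ("venue_blog", "city_portal"),
    ("facebook_page", "city_portal"),
    ("city_portal", "ticket_site"),
])

scores = {
    "pagerank":    nx.pagerank(G),
    "in_degree":   nx.in_degree_centrality(G),
    "betweenness": nx.betweenness_centrality(G),
}

crawl_priority = sorted(G.nodes, key=lambda s: scores["pagerank"][s], reverse=True)
print(crawl_priority)
```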

    Acquisition des contenus intelligents dans l’archivage du Web (Intelligent Content Acquisition in Web Archiving)

    Web sites are dynamic by nature, with content and structure changing over time; many pages on the Web are produced by content management systems (CMSs). Tools currently used by Web archivists to preserve the content of the Web blindly crawl and store Web pages, disregarding the CMS the site is based on and whatever structured content is contained in the pages. We first present an application-aware helper (AAH) that fits into an archiving crawl processing chain to perform intelligent and adaptive crawling of Web applications, given a knowledge base of common CMSs. The AAH has been integrated into two Web crawlers in the framework of the ARCOMEM project: the proprietary crawler of the Internet Memory Foundation and a customized version of Heritrix. We then propose ACEBot (Adaptive Crawler Bot for data Extraction), an efficient unsupervised, structure-driven crawler that exploits the inner structure of pages and guides the crawling process based on the importance of their content. ACEBot works in two phases: in the offline phase, it constructs a dynamic site map (limiting the number of URLs retrieved) and learns a traversal strategy based on the importance of navigation patterns (selecting those leading to valuable content); in the online phase, ACEBot performs massive downloading following the chosen navigation patterns. The AAH and ACEBot make 7 and 5 times fewer HTTP requests, respectively, than a generic crawler, without compromising effectiveness. We finally propose OWET (Open Web Extraction Toolkit), a free platform for semi-supervised data extraction. OWET allows a user to extract the data hidden behind Web forms.
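
    A highly simplified sketch of the two-phase, structure-driven idea behind ACEBot, under the assumption that navigation patterns can be approximated by URL path shapes and that content importance can be proxied by text volume; the real system learns patterns from the pages' internal structure.

```python
# Offline: sample pages, group URLs into path patterns, score patterns by text volume.
# Online: download only URLs whose pattern was learned to lead to valuable content.
import re
from urllib.parse import urlparse

import requests
from lxml import html

def url_pattern(url: str) -> str:
    """Abstract a URL into a navigation pattern, e.g. /forum/topic/123 -> /forum/topic/*."""
    return re.sub(r"/\d+", "/*", urlparse(url).path)

def score_patterns(sample_urls):
    """Offline phase: average amount of page text per navigation pattern."""
    totals, counts = {}, {}
    for url in sample_urls:
        page = html.fromstring(requests.get(url, timeout=30).text)
        p = url_pattern(url)
        totals[p] = totals.get(p, 0) + len(page.text_content())
        counts[p] = counts.get(p, 0) + 1
    return {p: totals[p] / counts[p] for p in totals}

def online_crawl(frontier, scores, threshold=5000):
    """Online phase: keep only URLs whose pattern looks valuable enough to fetch."""
    return [u for u in frontier if scores.get(url_pattern(u), 0) >= threshold]
```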

    Web Data Extraction, Applications and Techniques: A Survey

    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool for performing data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques allow the gathering of large amounts of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, which offers unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential of cross-fertilization, i.e., the possibility of re-using Web Data Extraction techniques originally designed to work in a given domain in other domains.
    Comment: Knowledge-Based Systems

    Data context informed data wrangling

    The process of preparing potentially large and complex data sets for further analysis or manual examination is often called data wrangling. In classical warehousing environments, the steps in such a process have been carried out using Extract-Transform-Load platforms, with significant manual involvement in specifying, configuring or tuning many of them. Cost-effective data wrangling processes need to ensure that data wrangling steps benefit from automation wherever possible. In this paper, we define a methodology to fully automate an end-to-end data wrangling process incorporating data context, which associates portions of a target schema with potentially spurious extensional data of types that are commonly available. Instance-based evidence, together with data profiling, paves the way to inform automation in several steps within the wrangling process, specifically matching, mapping validation, value format transformation, and data repair. The approach is evaluated with real estate data, showing substantial improvements in the results of automated wrangling.
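
    A minimal sketch of one of the automated steps above, instance-based matching: overlap between a source column's values and a data context column's values provides evidence for a schema match. The column names, toy data, and Jaccard measure are illustrative assumptions, not the paper's exact method.

```python
# Score every (source column, context column) pair by value overlap and pick the best.
import pandas as pd

def value_overlap(source_col: pd.Series, context_col: pd.Series) -> float:
    """Jaccard overlap of the two columns' distinct, normalised values."""
    a = set(source_col.dropna().astype(str).str.strip().str.lower())
    b = set(context_col.dropna().astype(str).str.strip().str.lower())
    return len(a & b) / len(a | b) if (a or b) else 0.0

source = pd.DataFrame({"loc": ["Manchester", "Leeds", "Oxford"]})          # data to wrangle
context = pd.DataFrame({"city": ["manchester", "leeds", "york"],           # data context
                        "postcode": ["M1", "LS1", "YO1"]})

matches = {(s, c): value_overlap(source[s], context[c])
           for s in source.columns for c in context.columns}
best_pair = max(matches, key=matches.get)     # evidence that "loc" maps to "city"
print(best_pair, matches[best_pair])
```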