20 research outputs found

    Enriching Existing Test Collections with OXPath

    Extending TREC-style test collections by incorporating external resources is a time-consuming and challenging task. Making use of freely available web data requires technical skills to work with APIs or to create a web scraping program specifically tailored to the task at hand. We present a lightweight alternative that employs the web data extraction language OXPath to harvest data from web resources and add it to an existing test collection. We demonstrate this by creating an extended version of GIRT4, called GIRT4-XT, with additional metadata fields harvested via OXPath from the social sciences portal Sowiport. This allows the collection to be re-used for other evaluation purposes such as bibliometrics-enhanced retrieval. The demonstrated method can be applied to a variety of similar scenarios and is not limited to extending existing collections; it can also be used to create completely new ones with little effort.
    Comment: Experimental IR Meets Multilinguality, Multimodality, and Interaction - 8th International Conference of the CLEF Association, CLEF 2017, Dublin, Ireland, September 11-14, 2017
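
    A minimal sketch of the harvest-and-merge idea described above, written with plain Python/XPath rather than OXPath itself; the portal URL, the XPath selectors, and the added field names are hypothetical stand-ins.

```python
# Illustrative only: the paper uses OXPath wrappers against Sowiport; this shows
# the same harvest-and-merge workflow with requests + lxml. Selectors, URLs and
# field names are hypothetical.
import requests
from lxml import html

def harvest_metadata(record_url: str) -> dict:
    """Fetch one portal record page and pull additional metadata fields."""
    page = html.fromstring(requests.get(record_url, timeout=30).text)
    return {
        "CLASSIFICATION": page.xpath("string(//dd[@class='classification'])").strip(),
        "REFERENCES": page.xpath("string(//dd[@class='references'])").strip(),
    }

def enrich_document(trec_doc: str, extra: dict) -> str:
    """Append harvested fields to a TREC/GIRT-style SGML document string."""
    fields = "\n".join(f"<{tag}>{value}</{tag}>" for tag, value in extra.items() if value)
    return trec_doc.replace("</DOC>", fields + "\n</DOC>")
```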

    The emergence of interpersonal and social trust in online interactions

    My PhD work is in the area of extracting and modelling user-created data on the web. In particular, I focussed on locating and extracting user data that 'signals' the evolution of human, one-on-one interactions between participants of large social networks who remain strangers to each other. The rise of online social networks created an opportunity for social scientists to study social phenomena at a scale unseen before. The vast amount of information, combined with computer science techniques, led to significant developments in a relatively new field: Computational Social Science. Furthermore, in recent years the gig economy and the mass adoption of "business sharing" sites such as Airbnb, Uber, or JustEat drove a new wave of computational social science research into reviews, feedback, and recommendations. All these ingredients of the larger notion of social trust have been widely discussed in the literature, both in their social aspects and in computational models of trust. However, some fundamental gaps remain, and there is often confusion about when trust is being expressed and how reviews (or recommendations) relate to social trust. Additionally, the computational trust models found in the literature tend to be either entirely theoretical or focused on a specific data set, and thus lack universal applicability. The latter problem, I believe, was due to the lack of data available to researchers in the early stages of the web. Today, the broader online social networks have matured and consolidated mechanisms for allowing access to data. Access to information is rarely trivial for more specialised and smaller online communities, yet smaller, focussed platforms are precisely where social trust and interactions can be observed (or not observed) and perhaps acquire a meaning that approaches the social trust social scientists see in in-person interactions.
    To address this gap, we initially propose and discuss the following research question: "Is there a meeting point between online interactions and social trust so that the core components of trust are retained?" We addressed this general open question by working on a computational architecture for data retrieval in social media platforms that can be suitably generalised and re-applied to different platforms. Lastly, as we enjoy the luxury of vast amounts of data that closely represent interpersonal and social trust, we addressed the questions of "what models of trust emerge from the data" and "how do existing models of trust perform on the data available". I have defined a category of online social networks that retains the core components of social trust, which we call "Online Social Networks of Needs". Hence, I provide a classification and categorisation mechanism for grouping online social networks of needs by the level of trust (the cooperation threshold) necessary for cooperation and interactions to be triggered among participating cognitive agents. My focus has always been on data acquisition, and I have designed and implemented a system for data retrieval that is easily deployed to social media/social web platforms. A case study of this system operating in a challenging scenario is further detailed to show its wider applicability to data retrieval in a setting of complete distrust, anonymity, and ephemerality of data (such as 4chan.org). Studying the granularity of 4chan data, we discovered that:
    1. ephemerality is not sustained, and web archiving sites have a complete view of the ephemeral data [1];
    2. we can track sentiment and topic modelling of moderation in 4chan [2]; and
    3. it is possible to have a live view of the topics and sentiment being discussed on the live board and see how these change over time.
    We studied the dynamics of high-trust interactions [3] and found gender biases [4,5] in care interactions. Another topic related to trust, but concerning institutions and media, is the 'spillover' effect between 4chan and the traditional media. As a premise, 4chan anonymous threads have anticipated important global trends, notably the "Anonymous" movement. Apart from the US, how do national topics interact with the essentially global discussion that takes place there? Again, thanks to our extensive data collection and analysis, we sought to determine the level of participation from a selected non-US country, Norway, and the degree to which Norwegian 4chan /pol/ users and domestic news influence each other [6]. We continued by collecting data from eight social networks of needs in the two most trust-demanding categories. While these datasets are made available to researchers [7], we further study the emerging networks and their properties and project the online social networks of needs into multiplex graphs by transforming the root links. Finally, we look into the applicability and predictive power of the non-reductionist model of trust proposed by Castelfranchi. We look at social trust holistically and consider signals to evaluate fluctuations of social capital influenced by economic and political dynamics and by the domination of public discourse by conspiracy theories.
    Summary of contributions:
    1. the first comprehensive real-time scrape of 4chan (in the literature, only post hoc solutions were available);
    2. the application of Castelfranchi's theoretical model of trust to actual data from online social networks;
    3. one of the first studies of the relationship between the institutional (nationwide) press and extremism on 4chan;
    4. the study of the application of predictive models to heterogeneous multi-source data (not user-created, but not very trustworthy either); and
    5. contributing live data scraping expertise to several other publications [8], [9].
    Publications
    [1] Ylli Prifti, Iacopo Pozzana, and Alessandro Provetti. Live monitoring 4chan discussion threads. In 7th Int'l Conference on Computational Social Science, 2021.
    [2] Y. Prifti, I. Pozzana, and A. Provetti. On-line page scraping reveals evidence of moderation in 4chan/pol/ anonymous discussion threads. In Proc. of 3rd European Symposium on Societal Challenges in Computational Social Science. ETH Press, 2019.
    [3] Y. Prifti, P. De Meo, I. Pozzana, and A. Provetti. The dynamics of recommendation in high-trust personal care services. In 5th Int'l Conference on Computational Social Science (IC2S2), 2019.
    [4] Y. Prifti, P. De Meo, I. Pozzana, and A. Provetti. Finding gender bias in web-based, high-trust interactions. In Proc. of 2nd European Symposium on Societal Challenges in Computational Social Science, GeWISS reports, 2018.
    [5] Y. Prifti, P. De Meo, I. Pozzana, and A. Provetti. Gender bias in web-based, high-trust interactions. In 5th Int'l Conference on Computational Social Science (IC2S2), 2019.
    [6] Alessandro Provetti, Iacopo Pozzana, Ylli Prifti, and Anders Seyersted Sandbu. Mapping the Norwegian 4chan: How conspiracy theories travel the language barriers. In 7th Int'l Conference on Computational Social Science (IC2S2), 2021.
    [7] Ylli Prifti. 4chan /pol board as a temporary evolution of live threads and posts. July 2021.
    [8] Paschalis Lagias, George D. Magoulas, Ylli Prifti, and Alessandro Provetti. Predicting seriousness of injury in a traffic accident: A new imbalanced dataset and benchmark. In Lazaros Iliadis, Chrisina Jayne, Anastasios Tefas, and Elias Pimenidis, editors, Engineering Applications of Neural Networks - 23rd International Conference, EAAAI/EANN 2022, Chersonissos, Crete, Greece, June 17-20, 2022, Proceedings, volume 1600 of Communications in Computer and Information Science. Springer, 2022.
    [9] Andrea Ballatore, A. Pang, Iacopo Pozzana, Ylli Prifti, and Alessandro Provetti. Geo-referencing as a connector between user reviews and urban environment quality. In 5th Int'l Conference on Computational Social Science, 2019.
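
    As an illustration of the real-time 4chan scraping described above, here is a minimal polling sketch against 4chan's public read-only JSON API (a.4cdn.org); it is not the thesis system, and the polling interval and the fields kept are illustrative choices.

```python
# Poll the board's thread list and fetch new posts before threads expire.
import time
import requests

BOARD = "pol"

def list_thread_ids():
    """Ids of all threads currently alive on the board."""
    pages = requests.get(f"https://a.4cdn.org/{BOARD}/threads.json", timeout=30).json()
    return [t["no"] for page in pages for t in page["threads"]]

def fetch_posts(thread_id):
    """Posts of one thread (empty list if it has already expired)."""
    r = requests.get(f"https://a.4cdn.org/{BOARD}/thread/{thread_id}.json", timeout=30)
    return r.json().get("posts", []) if r.ok else []

seen = set()
while True:                                   # poll faster than threads expire
    for tid in list_thread_ids():
        for post in fetch_posts(tid):
            if post["no"] not in seen:
                seen.add(post["no"])
                print(tid, post["no"], post.get("com", ""))   # store rather than print in practice
    time.sleep(60)
```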

    Web Data Extraction For Content Aggregation From E-Commerce Websites

    The World Wide Web has become an unlimited source of data. Search engines have made this information available to the everyday Internet user. Nevertheless, there is still information that is not easily accessible through existing search engines, so there remains a need to build new search engines that present information in new ways, better than has been done before. In order to present data in a form that creates added value, the data must first be collected and then processed and analysed. This master's thesis focuses on the data collection phase of that process. We present ZedBot, a modern data extraction system that converts the semi-structured data found on web pages into structured form with high accuracy. The system fulfils most of the requirements set for a modern data extraction system: it is platform independent, it has a powerful rule description system with semi-automatic wrapper generation, and it offers an easy-to-use user interface for annotating data. A specially designed crawler allows extraction to be performed across an entire web site without human intervention. We show that the presented tool is suitable for extracting highly accurate data from a large number of websites, and that the resulting dataset can be used for product information aggregation and for creating new added value.
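
    As a companion to the abstract above, the following is a minimal sketch of the wrapper idea such a system rests on: a declarative set of extraction rules applied uniformly to a site's product pages. ZedBot's internals are not reproduced here; the rule names, selectors, and URL are hypothetical.

```python
# Illustrative rule-based product extraction: one XPath rule per structured field,
# applied to every product page discovered by a crawler. Not ZedBot itself.
import requests
from lxml import html

PRODUCT_RULES = {
    "name":  "string(//h1[@class='product-title'])",
    "price": "string(//span[@class='price'])",
    "sku":   "string(//span[@itemprop='sku'])",   # microdata, when the site provides it
}

def extract_product(url: str, rules: dict = PRODUCT_RULES) -> dict:
    """Apply the declarative rules to a single product page."""
    page = html.fromstring(requests.get(url, timeout=30).text)
    return {field: page.xpath(xpath).strip() for field, xpath in rules.items()}

# A site-level crawler would enumerate product URLs and feed them to extract_product,
# aggregating the records into one structured dataset for content aggregation.
print(extract_product("https://shop.example.com/product/123"))
```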

    BlogForever D2.6: Data Extraction Methodology

    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.
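
    A rough sketch of how RSS feeds can serve as free supervision for locating post content in blog HTML, which is the core of the unsupervised approach outlined above; the matching heuristic and its use are illustrative assumptions, not the deliverable's actual algorithm.

```python
# For each feed entry, find the HTML element whose text best matches the feed
# summary; its location can then be reused as an extraction rule for the blog.
import re
from difflib import SequenceMatcher

import feedparser
import requests
from lxml import html

def locate_content_elements(feed_url: str):
    feed = feedparser.parse(feed_url)
    for entry in feed.entries:
        summary = re.sub(r"<[^>]+>", " ", entry.get("summary", "")).strip()
        page = html.fromstring(requests.get(entry.link, timeout=30).text)
        best, best_score = None, 0.0
        for el in page.xpath("//*"):                      # candidate content regions
            text = " ".join(el.text_content().split())
            score = SequenceMatcher(None, summary, text).ratio()
            if score > best_score:
                best, best_score = el, score
        if best is not None:
            yield entry.link, page.getroottree().getpath(best), best_score
```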

    Recommendation Techniques for smart cities

    The bottleneck of event recommender systems is the availability of up-to-date information on events. Usually, there is no single data feed, so information on events must be crawled from numerous sources. Ranking these sources helps the system decide which sources to crawl and how often. In this thesis, a model for event source evaluation and ranking is proposed, based on well-known centrality measures from social network analysis. Experiments on real data crawled from Budapest event sources show interesting results for further research.
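
    A small sketch of the source-ranking idea, assuming a toy graph in which an edge means one source references or shares events from another; the measures shown (PageRank, in-degree, betweenness) are standard examples rather than the exact set evaluated in the thesis.

```python
# Rank event sources by centrality and crawl the highest-ranked ones first / most often.
import networkx as nx

# Toy source graph: an edge A -> B means source A links to or re-posts events from B.
G = nx.DiGraph([
    ("venue_blog", "city_portal"),
    ("facebook_page", "city_portal"),
    ("city_portal", "ticket_site"),
])

scores = {
    "pagerank":    nx.pagerank(G),
    "in_degree":   nx.in_degree_centrality(G),
    "betweenness": nx.betweenness_centrality(G),
}

crawl_priority = sorted(G.nodes, key=lambda s: scores["pagerank"][s], reverse=True)
print(crawl_priority)
```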

    Acquisition des contenus intelligents dans l’archivage du Web (Intelligent Content Acquisition in Web Archiving)

    Web sites are dynamic by nature, with content and structure changing over time; many pages on the Web are produced by content management systems (CMSs). Tools currently used by Web archivists to preserve the content of the Web blindly crawl and store Web pages, disregarding the CMS the site is based on and whatever structured content is contained in the pages. We first present an application-aware helper (AAH) that fits into an archiving crawl processing chain to perform intelligent and adaptive crawling of Web applications, given a knowledge base of common CMSs. The AAH has been integrated into two Web crawlers in the framework of the ARCOMEM project: the proprietary crawler of the Internet Memory Foundation and a customized version of Heritrix. We then propose ACEBot (Adaptive Crawler Bot for data Extraction), an efficient unsupervised, structure-driven crawler that exploits the inner structure of pages and guides the crawling process based on the importance of their content. ACEBot works in two phases: in the offline phase, it constructs a dynamic site map (limiting the number of URLs retrieved) and learns a traversal strategy based on the importance of navigation patterns (selecting those leading to valuable content); in the online phase, ACEBot performs massive downloading following the chosen navigation patterns. The AAH and ACEBot make 7 and 5 times fewer HTTP requests, respectively, than a generic crawler, without compromising effectiveness. We finally propose OWET (Open Web Extraction Toolkit), a free platform for semi-supervised data extraction. OWET allows a user to extract the data hidden behind Web forms.
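
    A highly simplified sketch of the two-phase, structure-driven idea behind ACEBot, under the assumption that navigation patterns can be approximated by URL path shapes and that content importance can be proxied by text volume; the real system learns patterns from the pages' internal structure.

```python
# Offline: sample pages, group URLs into path patterns, score patterns by text volume.
# Online: download only URLs whose pattern was learned to lead to valuable content.
import re
from urllib.parse import urlparse

import requests
from lxml import html

def url_pattern(url: str) -> str:
    """Abstract a URL into a navigation pattern, e.g. /forum/topic/123 -> /forum/topic/*."""
    return re.sub(r"/\d+", "/*", urlparse(url).path)

def score_patterns(sample_urls):
    """Offline phase: average amount of page text per navigation pattern."""
    totals, counts = {}, {}
    for url in sample_urls:
        page = html.fromstring(requests.get(url, timeout=30).text)
        p = url_pattern(url)
        totals[p] = totals.get(p, 0) + len(page.text_content())
        counts[p] = counts.get(p, 0) + 1
    return {p: totals[p] / counts[p] for p in totals}

def online_crawl(frontier, scores, threshold=5000):
    """Online phase: keep only URLs whose pattern looks valuable enough to fetch."""
    return [u for u in frontier if scores.get(url_pattern(u), 0) >= threshold]
```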

    Web Data Extraction, Applications and Techniques: A Survey

    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool for performing data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques allow the gathering of large amounts of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, which offers unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential of cross-fertilization, i.e., the possibility of re-using Web Data Extraction techniques originally designed to work in a given domain in other domains.
    Comment: Knowledge-Based Systems

    Data context informed data wrangling

    The process of preparing potentially large and complex data sets for further analysis or manual examination is often called data wrangling. In classical warehousing environments, the steps in such a process have been carried out using Extract-Transform-Load platforms, with significant manual involvement in specifying, configuring or tuning many of them. Cost-effective data wrangling processes need to ensure that data wrangling steps benefit from automation wherever possible. In this paper, we define a methodology to fully automate an end-to-end data wrangling process incorporating data context, which associates portions of a target schema with potentially spurious extensional data of types that are commonly available. Instance-based evidence, together with data profiling, paves the way to inform automation in several steps within the wrangling process, specifically matching, mapping validation, value format transformation, and data repair. The approach is evaluated with real estate data, showing substantial improvements in the results of automated wrangling.
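
    A minimal sketch of one of the automated steps above, instance-based matching: overlap between a source column's values and a data context column's values provides evidence for a schema match. The column names, toy data, and Jaccard measure are illustrative assumptions, not the paper's exact method.

```python
# Score every (source column, context column) pair by value overlap and pick the best.
import pandas as pd

def value_overlap(source_col: pd.Series, context_col: pd.Series) -> float:
    """Jaccard overlap of the two columns' distinct, normalised values."""
    a = set(source_col.dropna().astype(str).str.strip().str.lower())
    b = set(context_col.dropna().astype(str).str.strip().str.lower())
    return len(a & b) / len(a | b) if (a or b) else 0.0

source = pd.DataFrame({"loc": ["Manchester", "Leeds", "Oxford"]})          # data to wrangle
context = pd.DataFrame({"city": ["manchester", "leeds", "york"],           # data context
                        "postcode": ["M1", "LS1", "YO1"]})

matches = {(s, c): value_overlap(source[s], context[c])
           for s in source.columns for c in context.columns}
best_pair = max(matches, key=matches.get)     # evidence that "loc" maps to "city"
print(best_pair, matches[best_pair])
```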