
19 research outputs found

The Archive Query Log: Mining Millions of Search Result Pages of Hundreds of Search Engines from 25 Years of Web Archives

Author: Fröbe Maik
Gienapp Lukas
Hagen Matthias
Potthast Martin
Reimer Jan Heinrich
Scells Harrisen
Schmidt Sebastian
Stein Benno
Publication date: 31/07/2023

The Archive Query Log (AQL) is a previously unused, comprehensive query log collected at the Internet Archive over the last 25 years. Its first version includes 356 million queries, 166 million search result pages, and 1.7 billion search results across 550 search providers. Although many query logs have been studied in the literature, the search providers that own them generally do not publish their logs to protect user privacy and vital business data. Of the few query logs publicly available, none combines size, scope, and diversity. The AQL is the first to do so, enabling research on new retrieval models and (diachronic) search engine analyses. Provided in a privacy-preserving manner, it promotes open research as well as more transparency and accountability in the search industry. Comment: SIGIR 2023 resource paper, 13 pages

arXiv.org e-Print Archive

Privacy-preserving efficient searchable encryption

Author: Ferreira Bernardo Luís da Silva
Publication date: 01/12/2016

Data storage and computation outsourcing to third-party managed data centers, in environments such as Cloud Computing, is increasingly being adopted by individuals, organizations, and governments. However, as cloud-based outsourcing models expand to society-critical data and services, the lack of effective and independent control over security and privacy conditions in such settings presents significant challenges. An interesting solution to these issues is to perform computations on encrypted data, directly in the outsourcing servers. Such an approach benefits from not requiring major data transfers and decryptions, increasing performance and scalability of operations. Searching operations, an important application case when cloud-backed repositories increase in number and size, are good examples where security, efficiency, and precision are relevant requisites. Yet existing proposals for searching encrypted data are still limited from multiple perspectives, including usability, query expressiveness, and client-side performance and scalability. This thesis focuses on the design and evaluation of mechanisms for searching encrypted data with improved efficiency, scalability, and usability. There are two particular concerns addressed in the thesis: on one hand, the thesis aims at supporting multiple media formats, especially text, images, and multimodal data (i.e. data with multiple media formats simultaneously); on the other hand the thesis addresses client-side overhead, and how it can be minimized in order to support client applications executing in both high-performance desktop devices and resource-constrained mobile devices. 
From the research performed to address these issues, three core contributions were developed and are presented in the thesis: (i) CloudCryptoSearch, a middleware system for storing and searching text documents with privacy guarantees, while supporting multiple modes of deployment (user device, local proxy, or computational cloud) and exploring different tradeoffs between security, usability, and performance; (ii) a novel framework for efficiently searching encrypted images based on IES-CBIR, an Image Encryption Scheme with Content-Based Image Retrieval properties that we also propose and evaluate; (iii) MIE, a Multimodal Indexable Encryption distributed middleware that allows storing, sharing, and searching encrypted multimodal data while minimizing client-side overhead and supporting both desktop and mobile devices

Repositório da Universidade Nova de Lisboa

A review of the role of sensors in mobile context-aware recommendation systems

Author: Hermoso Ramon
Ilarri Sergio
Rodriguez Hernandez Maria del Carmen
Trillo-Lado Raquel
Publication venue: 'Hindawi Limited'
Publication date: 01/01/2015

Recommendation systems are specialized in offering suggestions about specific items of different types (e.g., books, movies, restaurants, and hotels) that could be interesting for the user. They have attracted considerable research attention due to their benefits and also their commercial interest. Particularly, in recent years, the concept of context-aware recommendation system has appeared to emphasize the importance of considering the context of the situations in which the user is involved in order to provide more accurate recommendations. The detection of the context requires the use of sensors of different types, which measure different context variables. Despite the relevant role played by sensors in the development of context-aware recommendation systems, sensors and recommendation approaches are two fields usually studied independently. In this paper, we provide a survey on the use of sensors for recommendation systems. Our contribution can be seen from a double perspective. On the one hand, we overview existing techniques used to detect context factors that could be relevant for recommendation. On the other hand, we illustrate the interest of sensors by considering different recommendation use cases and scenarios

Crossref

Repositorio Universidad de Zaragoza

Directory of Open Access Journals

Beyond the Book: Linking Books to Wikipedia

Author: Buschenhenke F.
Koolen M.
Martinez-Ortiz C.
van Dalen-Oskam K.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2015

The book translation market is a topic of interest in literary studies, but the reasons why a book is selected for translation are not well understood. The "Beyond the Book" project investigates whether web resources like Wikipedia can be used to establish the level of cultural bias. This work describes the eScience tools used to estimate the cultural appeal of a book: semantic linking is used to identify key words in the text of the book, and afterwards the revision information from corresponding Wikipedia articles is examined to identify countries that generated a more than average amount of contributions to those articles. Comparison between the number of contributions from two countries on the same set of articles may show with which knowledge the contributors are familiar. We assume a lack of contributions from a country may indicate a gap in the knowledge of readers from that country. We assume that a book dealing with that concept could be more exotic and therefore more appealing for certain readers, while others are therefore less interested in the book. An indication of the 'level of exoticness' thus could help a reader/publisher to decide to read/translate the book or not. Experimental results are presented for four selected books from a set of 564 books written in Dutch or translated into Dutch, assessing their potential appeal for a Canadian audience. A qualitative assessment of quantitative results provides insight into named entities that may indicate a high/low cultural bias towards a book

Crossref

International Migration, Integration and Social Cohesion online publications

TSKY: a dependable middleware solution for data privacy using public storage clouds

Author: Rodrigues João Miguel Cardia Melro
Publication venue: Faculdade de Ciências e Tecnologia
Publication date: 01/01/2013

Dissertation submitted for the degree of Master in Computer Engineering. This dissertation aims to take advantage of the virtues offered by cloud-based data storage systems on the Internet, proposing a solution that avoids security issues by combining different providers’ solutions in a vision of cloud-of-clouds storage and computing. The solution, the TSKY System (or Trusted Sky), is implemented as a middleware system featuring a set of components designed to establish and enhance conditions for security, privacy, reliability, and availability of data, with these conditions being secured and verifiable by the end user, independently of each provider. These components implement cryptographic tools, including threshold and homomorphic cryptographic schemes, combined with encryption, replication, and dynamic indexing mechanisms. The solution allows data management and distribution functions over data kept in different storage clouds, not necessarily trusted, improving and ensuring resilience and security guarantees against Byzantine faults and attacks. The generic approach of the TSKY system model and its implemented services are evaluated in the context of a Trusted Email Repository System (TSKY-TMS System). The TSKY-TMS system is a prototype that uses the base TSKY middleware services to store mailboxes and email messages in a cloud-of-clouds.

Repositório da Universidade Nova de Lisboa

Practical Isolated Searchable Encryption in a Trusted Computing Environment

Author: Borges Guilherme Rosas
Publication date: 01/01/2018

Cloud computing has become a standard computational paradigm due to its numerous advantages, including high availability, elasticity, and ubiquity. Both individual users and companies are adopting more of its services, but not without loss of privacy and control. Outsourcing data and computations to a remote server implies trusting its owners, a problem of which many end-users are aware. Recent news has shown that data stored on Cloud servers is susceptible to leaks from the provider, third-party attackers, or even government surveillance programs, exposing users’ private data. Different approaches to tackle these problems have surfaced throughout the years. Naïve solutions involve storing data encrypted on the server, decrypting it only on the client-side. Yet, this imposes a high overhead on the client, rendering such schemes impractical. Searchable Symmetric Encryption (SSE) has emerged as a novel research topic in recent years, allowing efficient querying and updating over encrypted datastores in Cloud servers, while retaining privacy guarantees. Despite relevant recent advances, existing SSE schemes still make a critical trade-off between efficiency, security, and query expressiveness, thus limiting their adoption as a viable technology, particularly in large-scale scenarios. New technologies providing Isolated Execution Environments (IEEs) may help advance the SSE literature. These technologies allow applications to be run remotely with privacy guarantees, in isolation from other, possibly privileged, processes inside the CPU, such as the operating system kernel. Prominent example technologies are Intel SGX and ARM TrustZone, which are being made available in today’s commodity CPUs. In this thesis we study these new trusted hardware technologies in depth, while exploring their application to the problem of searching over encrypted data, primarily focusing on SGX.
In more detail, we study the application of IEEs in SSE schemes, improving their efficiency, security, and query expressiveness. We design, implement, and evaluate three new SSE schemes for different query types, namely Boolean queries over text, similarity queries over image datastores, and multimodal queries over text and images. These schemes can support queries combining different media formats simultaneously, envisaging applications such as privacy-enhanced medical diagnosis and management of electronic-healthcare records, or confidential photograph catalogues, running without the danger of privacy breaches in Cloud-based provisioned services.

Repositório da Universidade Nova de Lisboa

Advances in next-track music recommendation

Author: Kamehkhosh Iman
Publication date: 01/01/2017

Technological advances in the music industry have dramatically changed how people access and listen to music. Today, online music stores and streaming services offer easy and immediate means to buy or listen to a huge number of songs. One traditional way to find interesting items when a vast number of choices is available is to ask others for recommendations. Music providers correspondingly use music recommender systems as a software solution to the problem of music overload, providing a better user experience for their customers. At the same time, an enhanced user experience can lead to higher customer retention and higher business value for music providers. Different types of music recommendations can be found on today's music platforms, such as Spotify or Deezer. Providing a list of currently trending music, finding tracks similar to the user's favorite ones, helping users discover new artists, or recommending curated playlists for a certain mood (e.g., romantic) or activity (e.g., driving) are examples of common music recommendation scenarios. "Next-track music recommendation" is a specific form of music recommendation that relies mainly on the user's recently played tracks to create a list of tracks to be played next. Next-track music recommendations are used, for instance, to support users during playlist creation or to provide personalized radio stations. A particular challenge in this context is that the recommended tracks should not only match the general taste of the listener but should also match the characteristics of the most recently played tracks. This thesis by publication focuses on the next-track music recommendation problem and explores some challenges and questions that have not been addressed in previous research. In the first part of this thesis, various next-track music recommendation algorithms as well as approaches to evaluate them from the research literature are reviewed.
The recommendation techniques are categorized into four groups: content-based filtering, collaborative filtering, co-occurrence-based, and sequence-aware algorithms. Moreover, a number of challenges, such as personalizing next-track music recommendations and generating recommendations that are coherent with the user's listening history, are discussed. Furthermore, some common approaches in the literature to determine relevant quality criteria for next-track music recommendations and to evaluate the quality of such recommendations are presented. The second part of the thesis contains a selection of the author's publications on next-track music recommendation, as follows: 1. the results of comprehensive analyses of the musical characteristics of manually created playlists for music recommendation; 2. the results of a multi-dimensional comparison of different academic and commercial next-track recommending techniques; 3. the results of a multi-faceted comparison of different session-based recommenders, among others, for the next-track music recommendation problem with respect to their accuracy, popularity bias, catalog coverage, and computational complexity; 4. a two-phase approach to generate accurate next-track recommendations that also match the characteristics of the most recent listening history; 5. a personalization approach based on multi-dimensional user models that are extracted from the users' long-term preferences; 6. a user study with the aim of determining the quality perception of next-track music recommendations generated by different algorithms.

Eldorado - Ressourcen aus und für Lehre, Studium und Forschung

Approaches for enriching and improving textual knowledge bases

Author: Fetahu Besnik
Publication venue: Hannover : Gottfried Wilhelm Leibniz Universität Hannover
Publication date: 01/01/2017

[no abstract]

arXiv.org e-Print Archive

Institutionelles Repositorium der Leibniz Universität Hannover


Learning Topical Social Media Sensors for Twitter

Author: Iman Zahra
Publication venue: 'Oregon State University'

Social media sources such as Twitter represent a massively distributed social sensor over diverse topics ranging from social and political events to entertainment and sports news. However, due to the overwhelming volume of content, it can be difficult to identify novel and significant content within a broad topic in a timely fashion. To this end, this thesis proposes a scalable and practical method to automatically construct social sensors for generic topics. The concept of using social media as a sensor for detection of events and news has been proposed in the literature. However, we argue that most of these works do not focus on targeted content detection or they use very basic methods for collecting the topical data for further analysis. This demonstrates a gap in the use of social media as a sensor for high-quality topical content detection that we aim to address via machine learning. In this thesis, given minimal supervised training content from a user, we learn to identify topical tweets from millions of features capturing content, user and social interactions on Twitter. On a corpus of over 800 million English Tweets collected from the Twitter streaming API during 2013 and 2014 and learning for 10 diverse topics, we empirically show that our learned social sensor automatically generalizes to unseen future content with high ranking and precision scores. Furthermore, we provide an extensive analysis of features and feature types across different topics that reveals, for example, that (1) largely independent of topic, simple terms are the most informative feature followed by location features and that (2) the number of unique hashtags and tweets by a user correlates more with their informativeness than their follower or friend count. In summary, this work provides a novel, effective, and efficient way to learn topical social sensors requiring minimal user curation effort and offering strong generalization performance for identifying future topical content.

ScholarsArchive@OSU