19 research outputs found

    The Archive Query Log: Mining Millions of Search Result Pages of Hundreds of Search Engines from 25 Years of Web Archives

    The Archive Query Log (AQL) is a previously unused, comprehensive query log collected at the Internet Archive over the last 25 years. Its first version includes 356 million queries, 166 million search result pages, and 1.7 billion search results across 550 search providers. Although many query logs have been studied in the literature, the search providers that own them generally do not publish their logs to protect user privacy and vital business data. Of the few query logs publicly available, none combines size, scope, and diversity. The AQL is the first to do so, enabling research on new retrieval models and (diachronic) search engine analyses. Provided in a privacy-preserving manner, it promotes open research as well as more transparency and accountability in the search industry. Comment: SIGIR 2023 resource paper, 13 pages

    Privacy-preserving efficient searchable encryption

    Data storage and computation outsourcing to third-party managed data centers, in environments such as Cloud Computing, is increasingly being adopted by individuals, organizations, and governments. However, as cloud-based outsourcing models expand to society-critical data and services, the lack of effective and independent control over security and privacy conditions in such settings presents significant challenges. An interesting solution to these issues is to perform computations on encrypted data, directly in the outsourcing servers. Such an approach benefits from not requiring major data transfers and decryptions, increasing performance and scalability of operations. Searching operations, an important application case when cloud-backed repositories increase in number and size, are good examples where security, efficiency, and precision are relevant requisites. Yet existing proposals for searching encrypted data are still limited from multiple perspectives, including usability, query expressiveness, and client-side performance and scalability. This thesis focuses on the design and evaluation of mechanisms for searching encrypted data with improved efficiency, scalability, and usability. The thesis addresses two particular concerns: on the one hand, it aims at supporting multiple media formats, especially text, images, and multimodal data (i.e., data with multiple media formats simultaneously); on the other hand, it addresses client-side overhead and how it can be minimized in order to support client applications executing on both high-performance desktop devices and resource-constrained mobile devices.
From the research performed to address these issues, three core contributions were developed and are presented in the thesis: (i) CloudCryptoSearch, a middleware system for storing and searching text documents with privacy guarantees, while supporting multiple modes of deployment (user device, local proxy, or computational cloud) and exploring different tradeoffs between security, usability, and performance; (ii) a novel framework for efficiently searching encrypted images based on IES-CBIR, an Image Encryption Scheme with Content-Based Image Retrieval properties that we also propose and evaluate; (iii) MIE, a Multimodal Indexable Encryption distributed middleware that allows storing, sharing, and searching encrypted multimodal data while minimizing client-side overhead and supporting both desktop and mobile devices.

    A review of the role of sensors in mobile context-aware recommendation systems

    Recommendation systems specialize in offering suggestions about specific items of different types (e.g., books, movies, restaurants, and hotels) that could be interesting for the user. They have attracted considerable research attention due to their benefits and also their commercial interest. In recent years in particular, the concept of the context-aware recommendation system has emerged to emphasize the importance of considering the context of the situations in which the user is involved in order to provide more accurate recommendations. Detecting the context requires sensors of different types, which measure different context variables. Despite the relevant role played by sensors in the development of context-aware recommendation systems, sensors and recommendation approaches are two fields usually studied independently. In this paper, we provide a survey on the use of sensors for recommendation systems. Our contribution can be seen from a double perspective. On the one hand, we overview existing techniques used to detect context factors that could be relevant for recommendation. On the other hand, we illustrate the interest of sensors by considering different recommendation use cases and scenarios.

    Beyond the Book: Linking Books to Wikipedia

    The book translation market is a topic of interest in literary studies, but the reasons why a book is selected for translation are not well understood. The "Beyond the Book" project investigates whether web resources like Wikipedia can be used to establish the level of cultural bias. This work describes the eScience tools used to estimate the cultural appeal of a book: semantic linking is used to identify key words in the text of the book, and afterwards the revision information from corresponding Wikipedia articles is examined to identify countries that generated an above-average number of contributions to those articles. Comparing the number of contributions from two countries on the same set of articles may show with which knowledge the contributors are familiar. We assume a lack of contributions from a country may indicate a gap in the knowledge of readers from that country. We also assume that a book dealing with that concept could be more exotic and therefore more appealing for certain readers, while others would therefore be less interested in the book. An indication of the 'level of exoticness' thus could help a reader/publisher to decide to read/translate the book or not. Experimental results are presented for four selected books from a set of 564 books written in Dutch or translated into Dutch, assessing their potential appeal for a Canadian audience. A qualitative assessment of quantitative results provides insight into named entities that may indicate a high/low cultural bias towards a book.

    TSKY: a dependable middleware solution for data privacy using public storage clouds

    Dissertation submitted to obtain the Master's degree in Informatics Engineering. This dissertation aims to take advantage of the virtues offered by data storage cloud based systems on the Internet, proposing a solution that avoids security issues by combining different providers’ solutions in a vision of a cloud-of-clouds storage and computing. The solution, TSKY System (or Trusted Sky), is implemented as a middleware system, featuring a set of components designed to establish and to enhance conditions for security, privacy, reliability and availability of data, with these conditions being secured and verifiable by the end-user, independently of each provider. These components implement cryptographic tools, including threshold and homomorphic cryptographic schemes, combined with encryption, replication, and dynamic indexing mechanisms. The solution allows data management and distribution functions over data kept in different storage clouds, not necessarily trusted, improving and ensuring resilience and security guarantees against Byzantine faults and attacks. The generic approach of the TSKY system model and its implemented services are evaluated in the context of a Trusted Email Repository System (TSKY-TMS System). The TSKY-TMS system is a prototype that uses the base TSKY middleware services to store mailboxes and email messages in a cloud-of-clouds.

    Practical Isolated Searchable Encryption in a Trusted Computing Environment

    Cloud computing has become a standard computational paradigm due to its numerous advantages, including high availability, elasticity, and ubiquity. Both individual users and companies are adopting more of its services, but not without loss of privacy and control. Outsourcing data and computations to a remote server implies trusting its owners, a problem many end-users are aware of. Recent news has shown that data stored on Cloud servers is susceptible to leaks from the provider, third-party attackers, or even from government surveillance programs, exposing users’ private data. Different approaches to tackle these problems have surfaced throughout the years. Naïve solutions involve storing data encrypted on the server, decrypting it only on the client-side. Yet, this imposes a high overhead on the client, rendering such schemes impractical. Searchable Symmetric Encryption (SSE) has emerged as a novel research topic in recent years, allowing efficient querying and updating over encrypted datastores in Cloud servers, while retaining privacy guarantees. Despite relevant recent advances, existing SSE schemes still make a critical trade-off between efficiency, security, and query expressiveness, thus limiting their adoption as a viable technology, particularly in large-scale scenarios. New technologies providing Isolated Execution Environments (IEEs) may help improve SSE literature. These technologies allow applications to be run remotely with privacy guarantees, in isolation from other, possibly privileged, processes inside the CPU, such as the operating system kernel. Prominent example technologies are Intel SGX and ARM TrustZone, which are being made available in today’s commodity CPUs. In this thesis we study these new trusted hardware technologies in depth, while exploring their application to the problem of searching over encrypted data, primarily focusing on SGX.
In more detail, we study the application of IEEs in SSE schemes, improving their efficiency, security, and query expressiveness. We design, implement, and evaluate three new SSE schemes for different query types, namely Boolean queries over text, similarity queries over image datastores, and multimodal queries over text and images. These schemes can support queries combining different media formats simultaneously, envisaging applications such as privacy-enhanced medical diagnosis and management of electronic-healthcare records, or confidential photograph catalogues, running without the danger of privacy breaches in Cloud-based provisioned services.

    Advances in next-track music recommendation

    Technological advances in the music industry have dramatically changed how people access and listen to music. Today, online music stores and streaming services offer easy and immediate means to buy or listen to a huge number of songs. One traditional way to find interesting items when a vast number of choices is available is to ask others for recommendations. Music providers correspondingly use music recommender systems as a software solution to the problem of music overload to provide a better user experience for their customers. At the same time, an enhanced user experience can lead to higher customer retention and higher business value for music providers. Different types of music recommendations can be found on today's music platforms, such as Spotify or Deezer. Providing a list of currently trending music, finding tracks similar to the user's favorite ones, helping users discover new artists, or recommending curated playlists for a certain mood (e.g., romantic) or activity (e.g., driving) are examples of common music recommendation scenarios. "Next-track music recommendation" is a specific form of music recommendation that relies mainly on the user's recently played tracks to create a list of tracks to be played next. Next-track music recommendations are used, for instance, to support users during playlist creation or to provide personalized radio stations. A particular challenge in this context is that the recommended tracks should not only match the general taste of the listener but should also match the characteristics of the most recently played tracks. This thesis by publication focuses on the next-track music recommendation problem and explores some challenges and questions that have not been addressed in previous research. In the first part of this thesis, various next-track music recommendation algorithms as well as approaches to evaluate them from the research literature are reviewed.
The recommendation techniques are categorized into the four groups of content-based filtering, collaborative filtering, co-occurrence-based, and sequence-aware algorithms. Moreover, a number of challenges, such as personalizing next-track music recommendations and generating recommendations that are coherent with the user's listening history, are discussed. Furthermore, some common approaches in the literature to determine relevant quality criteria for next-track music recommendations and to evaluate the quality of such recommendations are presented. The second part of the thesis contains a selection of the author's publications on next-track music recommendation, as follows: 1. the results of comprehensive analyses of the musical characteristics of manually created playlists for music recommendation; 2. the results of a multi-dimensional comparison of different academic and commercial next-track recommending techniques; 3. the results of a multi-faceted comparison of different session-based recommenders, among others, for the next-track music recommendation problem with respect to their accuracy, popularity bias, catalog coverage as well as computational complexity; 4. a two-phase approach to generate accurate next-track recommendations that also match the characteristics of the most recent listening history; 5. a personalization approach based on multi-dimensional user models that are extracted from the users' long-term preferences; 6. a user study with the aim of determining the quality perception of next-track music recommendations generated by different algorithms.

    Approaches for enriching and improving textual knowledge bases

    [no abstract]