249 research outputs found
Dynamic Integration of Evolving Distributed Databases using Services
This thesis investigates the integration of many separate, heterogeneous, and distributed existing databases which, due to organizational changes, must be merged to appear as a single database. A solution to some database evolution problems is presented: an Evolution Adaptive Service-Oriented Data Integration Architecture (EA-SODIA) that dynamically integrates heterogeneous and distributed source databases, aiming to minimize the maintenance cost caused by database evolution.
An algorithm, named Relational Schema Mapping by Views (RSMV), is designed to integrate source databases that are exposed as services into a pre-designed global schema held by a data integrator service. Instead of producing hard-coded programs, views are built using relational algebra operations to eliminate the heterogeneities among the source databases. More importantly, the definitions of those views are represented and stored in a meta-database, together with constraints to test their validity. Consequently, a method called Evolution Detection can identify in the meta-database the views affected by evolutions and then modify them automatically.
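The abstract does not give RSMV's concrete representation, but the idea of mapping sources onto a global schema through stored view definitions can be sketched as follows. All names here (relations, columns, the meta-database layout) are hypothetical illustrations, not the thesis's actual format:

```python
# Hypothetical sketch of view-based schema mapping in the spirit of RSMV.
# Each source relation exposes different column names; a view maps them onto
# a shared global schema via rename + project, and the view definition is
# kept in a meta-store so affected views can be found when a source evolves.

def project(rows, cols):
    """Relational projection: keep only the named columns."""
    return [{c: r[c] for c in cols} for r in rows]

def rename(rows, mapping):
    """Relational rename: translate source column names to global ones."""
    return [{mapping.get(k, k): v for k, v in r.items()} for r in rows]

# Meta-database of view definitions (illustrative layout).
meta_db = {
    "global_customer": {
        "sources": {
            "db1.clients": {"cl_name": "name", "cl_city": "city"},
            "db2.customers": {"cust_name": "name", "town": "city"},
        },
        "columns": ["name", "city"],
    }
}

def evaluate_view(view_name, source_data):
    """Materialise a global view as the union of renamed, projected sources."""
    spec = meta_db[view_name]
    out = []
    for src, mapping in spec["sources"].items():
        out.extend(project(rename(source_data[src], mapping), spec["columns"]))
    return out

def affected_views(evolved_source):
    """Evolution detection: find views that reference an evolved source."""
    return [v for v, spec in meta_db.items() if evolved_source in spec["sources"]]

data = {
    "db1.clients": [{"cl_name": "Ada", "cl_city": "London"}],
    "db2.customers": [{"cust_name": "Alan", "town": "Manchester"}],
}
rows = evaluate_view("global_customer", data)
```

Because the mapping lives in the meta-store rather than in code, a schema change in `db2.customers` only requires locating and updating the affected view definitions, which is the maintenance saving the thesis targets.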
An evaluation is presented using a case study. Firstly, it is shown that most types of heterogeneity defined in this thesis can be eliminated by RSMV, except semantic conflicts. Secondly, it is shown that little manual modification of the system is required as long as the evolutions follow the rules; for only three types of database evolution is human intervention required, and some existing views are discarded. Thirdly, the computational cost of the automatic modification shows slow linear growth in the number of source databases. Other characteristics addressed include EA-SODIA's scalability, domain independence, autonomy of source databases, and potential for involving other data sources (e.g. XML). Finally, a descriptive comparison with other data integration approaches is presented. It shows that although other approaches may provide better query-processing performance in some circumstances, the service-oriented architecture provides better autonomy, flexibility, and capability of evolution.
CONCEPT GENERATION SUPPORT BY CONCEPTUAL BLENDING: MULTI-AREA INSPIRATION SEARCH
Master's thesis (Master of Engineering)
iAggregator: Multidimensional Relevance Aggregation Based on a Fuzzy Operator
Recently, an increasing number of information retrieval studies have triggered a resurgence of interest in redefining the algorithmic estimation of relevance, which implies a shift from topical to multidimensional relevance assessment. A key underlying aspect that emerged when addressing this concept is the aggregation of the relevance assessments related to each of the considered dimensions. The most commonly adopted forms of aggregation are based on classical weighted means and linear combination schemes. Although some initiatives were recently proposed, none was concerned with considering the inherent dependencies and interactions existing among the relevance criteria, as is the case in many real-life applications. In this article, we present a new fuzzy-based operator, called iAggregator, for multidimensional relevance aggregation. Its main originality, beyond its ability to model interactions between different relevance criteria, lies in its generalization of many classical aggregation functions. To validate our proposal, we apply our operator within a tweet search task. Experiments using a standard benchmark, namely the Text REtrieval Conference Microblog track, emphasize the relevance of our contribution when compared with traditional aggregation schemes. In addition, it outperforms state-of-the-art aggregation operators such as the Scoring and the And prioritized operators, as well as some representative learning-to-rank algorithms.
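One standard fuzzy operator that can model interactions between criteria, and that generalizes weighted means, is the discrete Choquet integral; a minimal sketch is shown below. The capacity values and criterion names are made up for illustration and are not taken from the article:

```python
# Illustrative discrete Choquet integral: a fuzzy aggregation operator that
# can model interactions between relevance criteria. The capacity below is
# hypothetical; iAggregator's exact parameters are defined in the article.

def choquet(scores, capacity):
    """Aggregate criterion scores with a fuzzy measure (capacity).

    scores:   {criterion: value in [0, 1]}
    capacity: {frozenset of criteria: weight in [0, 1]}, monotone,
              with capacity[frozenset(all criteria)] == 1.
    """
    # Sort criteria by ascending score, then sum score increments
    # weighted by the capacity of the remaining criterion set.
    items = sorted(scores.items(), key=lambda kv: kv[1])
    total, prev = 0.0, 0.0
    remaining = set(scores)
    for crit, val in items:
        total += (val - prev) * capacity[frozenset(remaining)]
        prev = val
        remaining.discard(crit)
    return total

# Hypothetical capacity on {topicality, recency}: the pair interacts
# super-additively (worth more together than the sum of the singletons).
cap = {
    frozenset({"topicality", "recency"}): 1.0,
    frozenset({"topicality"}): 0.6,
    frozenset({"recency"}): 0.2,
    frozenset(): 0.0,
}
score = choquet({"topicality": 0.9, "recency": 0.4}, cap)
```

When the capacity is additive, this reduces to a plain weighted mean, which is the sense in which such operators generalize classical aggregation schemes.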
CRIS-IR 2006
The recognition of entities and their relationships in document collections is an important step towards the discovery of latent knowledge, as well as towards supporting knowledge management applications. The challenge lies in how to extract and correlate entities, aiming to answer key knowledge management questions such as: who works with whom, on which projects, with which customers, and on what research areas. The present work proposes a knowledge mining approach supported by information retrieval and text mining tasks, whose core is based on the correlation of textual elements through the LRD (Latent Relation Discovery) method. Our experiments show that LRD outperforms other correlation methods. We also present an application to demonstrate the approach in knowledge management scenarios.
Fundação para a Ciência e a Tecnologia (FCT)
SEARCHING BASED ON QUERY DOCUMENTS
Searches can start with query documents, where search queries are formulated based on document-level descriptions. This type of search is more common in domain-specific search environments. For example, in patent retrieval, one major search task is finding relevant information for new (query) patents, and search queries are generated from the query patents. One unique characteristic of this search is that the search process can take longer and be more comprehensive compared to general web search. As an example, to complete a single patent retrieval task, a typical user may generate 15 queries and examine more than 100 retrieved documents. In these search environments, searchers need to formulate multiple queries based on query documents that are typically complex and difficult to understand. In this work, we describe methods for automatically generating queries and diversifying search results based on query documents, which can be used for query suggestion and for improving the quality of retrieval results. In particular, we focus on resolving three main issues related to query document-based searches: (1) query generation, (2) query suggestion and formulation, and (3) search result diversification. Automatic query generation helps users by reducing the burden of formulating queries from query documents. Using generated queries as suggestions is investigated as a method of presenting alternative queries. Search result diversification is important in domain-specific search because of the nature of the query documents. Since query documents generally contain long, complex descriptions, diverse query topics can be identified, and a range of relevant documents can be found that are related to these diverse topics. The proposed methods we study in this thesis explicitly address these three issues.
To solve the query generation issue, we use binary decision trees to generate effective Boolean queries and label propagation to formulate more effective phrasal-concept queries. To diversify search results, we propose two different approaches: query-side and result-level diversification. To generate diverse queries, we identify important topics from query documents and generate queries based on the identified topics. For result-level diversification, we extract query topics from query documents and apply state-of-the-art diversification algorithms based on the extracted topics. In addition, we devise query suggestion techniques for each query generation method. To demonstrate the effectiveness of our approach, we conduct experiments on various domain-specific search tasks and devise appropriate evaluation measures for domain-specific search environments.
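The decision-tree-to-Boolean-query step can be illustrated with a small sketch: each root-to-leaf path of a binary tree over term occurrence that ends in a "relevant" leaf becomes a conjunction, and the final query is the disjunction of those conjunctions. The tree and terms below are invented, and the thesis's tree-construction procedure is not reproduced:

```python
# Sketch: turn a binary decision tree over term presence into a Boolean query.
# A node is ("term", absent_subtree, present_subtree); leaves are True/False,
# where True means the path predicts "relevant".
tree = ("retrieval",
        ("patent", False, ("claim", False, True)),   # branch without "retrieval"
        ("query", False, True))                      # branch with "retrieval"

def paths_to_query(node, path=()):
    """Collect positive root-to-leaf paths as AND-clauses."""
    if node is True:
        return [" AND ".join(path)] if path else ["TRUE"]
    if node is False:
        return []
    term, absent, present = node
    clauses = paths_to_query(present, path + (term,))
    clauses += paths_to_query(absent, path + ("NOT " + term,))
    return clauses

def boolean_query(node):
    """Join the positive-path conjunctions into one disjunctive query."""
    return " OR ".join("(" + c + ")" for c in paths_to_query(node))

q = boolean_query(tree)
```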
Machine Learning in Automated Text Categorization
The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert manpower, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation.
Comment: Accepted for publication in ACM Computing Surveys.
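The inductive process the survey describes (learn category characteristics from preclassified documents, then classify new ones) can be sketched with a toy multinomial naive Bayes learner; this is one common instance of the paradigm, not the survey's own system, and the training data is invented:

```python
# Toy multinomial naive Bayes text categorizer: an inductive learner that
# builds a classifier from preclassified documents (add-one smoothing,
# bag-of-words representation; no feature selection).
import math
from collections import Counter

def train(docs):
    """docs: list of (text, label). Returns per-class term counts,
    per-class document counts, and the shared vocabulary."""
    counts, n_docs = {}, {}
    for text, label in docs:
        counts.setdefault(label, Counter()).update(text.lower().split())
        n_docs[label] = n_docs.get(label, 0) + 1
    vocab = set()
    for c in counts.values():
        vocab |= set(c)
    return counts, n_docs, vocab

def classify(model, text):
    """Pick the label maximizing log prior + summed log term likelihoods."""
    counts, n_docs, vocab = model
    total = sum(n_docs.values())
    best, best_lp = None, float("-inf")
    for label, c in counts.items():
        lp = math.log(n_docs[label] / total)
        denom = sum(c.values()) + len(vocab)
        for w in text.lower().split():
            lp += math.log((c[w] + 1) / denom)   # add-one smoothing
        if lp > best_lp:
            best, best_lp = label, lp
    return best

docs = [
    ("the striker scored a goal", "sports"),
    ("the keeper saved the match", "sports"),
    ("shares fell on the market", "finance"),
    ("the bank raised interest rates", "finance"),
]
model = train(docs)
pred = classify(model, "a goal in the match")
```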
Practical Isolated Searchable Encryption in a Trusted Computing Environment
Cloud computing has become a standard computational paradigm due to its numerous advantages, including high availability, elasticity, and ubiquity. Both individual users and companies are adopting more of its services, but not without loss of privacy and control. Outsourcing data and computations to a remote server implies trusting its owners, a problem many end-users are aware of. Recent news has shown that data stored on Cloud servers is susceptible to leaks from the provider, third-party attackers, or even government surveillance programs, exposing users' private data.
Different approaches to tackle these problems have surfaced throughout the years. Naïve solutions involve storing data encrypted on the server and decrypting it only on the client side. Yet this imposes a high overhead on the client, rendering such schemes impractical. Searchable Symmetric Encryption (SSE) has emerged as a novel research topic in recent years, allowing efficient querying and updating over encrypted datastores in Cloud servers while retaining privacy guarantees. Still, despite relevant recent advances, existing SSE schemes make a critical trade-off between efficiency, security, and query expressiveness, thus limiting their adoption as a viable technology, particularly in large-scale scenarios.
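The core SSE idea (the server matches opaque tokens instead of plaintext keywords) can be sketched with a toy keyed inverted index. This is only an illustration of the concept; real SSE schemes, including those in this thesis, involve far more careful constructions and leakage analyses, and the key and documents below are made up:

```python
# Toy searchable-index sketch: an inverted index whose keywords are replaced
# by keyed pseudorandom tokens (HMAC), so the server can store the index and
# answer queries without ever seeing plaintext keywords.
import hmac
import hashlib

KEY = b"client-secret-key"  # held by the client only (illustrative value)

def token(word):
    """Deterministic keyed token for a keyword."""
    return hmac.new(KEY, word.encode(), hashlib.sha256).hexdigest()

def build_index(docs):
    """Client side: map keyword tokens to document ids, then upload."""
    index = {}
    for doc_id, text in docs.items():
        for w in set(text.lower().split()):
            index.setdefault(token(w), set()).add(doc_id)
    return index  # contains no plaintext keywords

def search(index, word):
    """Server side: match the token without learning the queried word."""
    return index.get(token(word), set())

docs = {1: "encrypted cloud storage", 2: "cloud search over indexes"}
idx = build_index(docs)
hits = search(idx, "cloud")
```

Even this toy version exposes the efficiency/security trade-off the text mentions: queries are fast, but the deterministic tokens leak access and search patterns, which is exactly what stronger schemes (and trusted hardware such as SGX) try to mitigate.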
New technologies providing Isolated Execution Environments (IEEs) may help improve the SSE literature. These technologies allow applications to be run remotely with privacy guarantees, in isolation from other, possibly privileged, processes inside the CPU, such as the operating system kernel. Prominent example technologies are Intel SGX and ARM TrustZone, which are available in today's commodity CPUs.
In this thesis we study these new trusted hardware technologies in depth, while exploring their application to the problem of searching over encrypted data, focusing primarily on SGX. In more detail, we study the application of IEEs in SSE schemes, improving their efficiency, security, and query expressiveness.
We design, implement, and evaluate three new SSE schemes for different query types, namely Boolean queries over text, similarity queries over image datastores, and multimodal queries over text and images. These schemes can support queries combining different media formats simultaneously, envisaging applications such as privacy-enhanced medical diagnosis, management of electronic healthcare records, or confidential photograph catalogues, running without the danger of privacy breaches in Cloud-based provisioned services.
Formulating Complex Queries Using Templates
While many users have relatively general information needs, users who are familiar with a certain topic may have more specific or complex information needs. Such users already have some knowledge of a subject and its concepts, and they need to find information on a specific aspect of a certain entity, such as its cause, its effect, or its relationships with other entities. To successfully resolve this kind of complex information need, our study investigated the effectiveness of topic-independent query templates as a tool for assisting users in articulating their information needs. A set of query templates, written in fill-in-the-blank form, was designed to represent general semantic relationships between concepts, such as cause-effect and problem-solution. To conduct the research, we designed a control interface with a single query textbox and an experimental interface with the query templates. A user study was performed with 30 users, and the Okapi information retrieval system was used to retrieve documents in response to the users' queries.
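The fill-in-the-blank mechanism described above can be sketched in a few lines. The template wordings and relation names here are hypothetical stand-ins, not the study's actual templates:

```python
# Sketch of fill-in-the-blank query templates for general semantic
# relations between concepts (template text is invented for illustration).
TEMPLATES = {
    "cause-effect": "what causes {effect}",
    "problem-solution": "how can {problem} be solved using {approach}",
}

def fill(template_name, **blanks):
    """Instantiate a template; str.format raises KeyError if a blank
    the template needs was not supplied by the user."""
    return TEMPLATES[template_name].format(**blanks)

q = fill("problem-solution", problem="query ambiguity", approach="templates")
```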
The analysis in this paper indicates that while users found the template-based query formulation less easy to use, the queries written using templates performed better than those written using the control interface with a single query textbox. Our analysis of a group of users and some specific topics demonstrates that the experimental interface tended to help users create more detailed search queries; users were able to think about different aspects of their complex information needs and fill in many templates.
In the future, an interesting research direction would be to tune the templates, adapting them to users' specific query requests and avoiding showing non-relevant templates by automatically selecting related templates from a larger set.
Graph Processing in Main-Memory Column Stores
Increasingly, both novel and traditional business applications leverage the advantages of a graph data model, such as schema flexibility and an explicit representation of relationships between entities. As a consequence, companies are confronted with the challenge of storing, manipulating, and querying terabytes of graph data for enterprise-critical applications. Although these business applications operate on graph-structured data, they still require direct access to the relational data and typically rely on an RDBMS as the single source of truth and access.
Existing solutions that perform graph operations on business-critical data either use a combination of SQL and application logic or employ a graph data management system. For the first approach, relying solely on SQL results in poor execution performance, caused by the functional mismatch between typical graph operations and the relational algebra. To make matters worse, graph algorithms exhibit tremendous variety in structure and functionality, caused by their often domain-specific implementations, and can therefore hardly be integrated into a database management system other than through custom coding. Since the majority of these enterprise-critical applications run exclusively on relational DBMSs, employing a specialized system for storing and processing graph data is typically not sensible. Besides the maintenance overhead of keeping the systems in sync, combining graph and relational operations is hard to realize, as it requires data transfer across system boundaries.
Traversal operations are a basic ingredient of graph queries and algorithms, and a fundamental component of any database management system that aims at storing, manipulating, and querying graph data. Well-established graph traversal algorithms are standalone implementations relying on optimized data structures. Integrating graph traversals as an operator into a database management system requires a tight integration into the existing database environment and the development of new components, such as a graph-topology-aware optimizer with accompanying graph statistics, graph-specific secondary index structures to speed up traversals, and an accompanying graph query language.
In this thesis, we introduce and describe GRAPHITE, a hybrid graph-relational data management system. GRAPHITE is a performance-oriented graph data management system built into an RDBMS, allowing processing of graph data to be seamlessly combined with relational data in the same system. We propose a columnar storage representation for graph data to leverage the already existing and mature data management and query processing infrastructure of relational database management systems. At the core of GRAPHITE we propose an execution engine based solely on set operations and graph traversals.
Our design is driven by the observation that different graph topologies expose different algorithmic requirements to the design of a graph traversal operator. We derive two graph traversal implementations targeting the most common graph topologies and demonstrate how graph-specific statistics can be leveraged to select the optimal physical traversal operator. To accelerate graph traversals, we devise a set of graph-specific, updateable secondary index structures to improve the performance of vertex neighborhood expansion. Finally, we introduce a domain-specific language with an intuitive programming model to extend graph traversals with custom application logic at runtime. We use the LLVM compiler framework to generate efficient code that tightly integrates the user-specified application logic with our highly optimized built-in graph traversal operators.
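The combination of a columnar edge representation with set-based traversal can be sketched as follows. This is a generic CSR-style illustration under invented data; GRAPHITE's actual operators, statistics, and code generation are not reproduced here:

```python
# Columnar (CSR-style) adjacency and a level-wise, set-based traversal over
# it: the edge table is stored as two columns sorted by source vertex, and
# offsets[v]..offsets[v+1] delimits v's neighbours in the targets column.
offsets = [0, 2, 3, 4, 4]   # 4 vertices: 0, 1, 2, 3
targets = [1, 2, 3, 3]      # edges 0->1, 0->2, 1->3, 2->3

def traverse(start, hops):
    """Expand the frontier `hops` times using set operations only."""
    visited = {start}
    frontier = {start}
    for _ in range(hops):
        nxt = set()
        for v in frontier:
            # Neighbour expansion is a contiguous slice of the target column.
            nxt.update(targets[offsets[v]:offsets[v + 1]])
        frontier = nxt - visited   # set difference keeps the frontier new
        visited |= frontier
        if not frontier:
            break
    return visited

reach = traverse(0, 2)
```

The contiguous slice per vertex is what makes the columnar layout attractive for traversals, and it is precisely the neighbourhood-expansion step that the graph-specific secondary indexes mentioned above aim to accelerate.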
Our experimental evaluation shows that GRAPHITE can outperform native graph management systems by several orders of magnitude while providing all the features of an RDBMS, such as transaction support, backup and recovery, and security and user management, effectively providing a promising alternative to specialized graph management systems that lack many of these features and require expensive data replication and maintenance processes.