1,365 research outputs found

    EFFICIENT COMPUTATION OF DIVERSE QUERY RESULTS

    Get PDF
    Ph.DDOCTOR OF PHILOSOPH

    Query-Time Data Integration

    Get PDF
    Today, data is collected in ever increasing scale and variety, opening up enormous potential for new insights and data-centric products. However, in many cases the volume and heterogeneity of new data sources precludes up-front integration using traditional ETL processes and data warehouses. In some cases, it is even unclear if and in what context the collected data will be utilized. Therefore, there is a need for agile methods that defer the effort of integration until the usage context is established. This thesis introduces Query-Time Data Integration as an alternative concept to traditional up-front integration. It aims at enabling users to issue ad-hoc queries on their own data as if all potential other data sources were already integrated, without declaring specific sources and mappings to use. Automated data search and integration methods are then coupled directly with query processing on the available data. The ambiguity and uncertainty introduced through fully automated retrieval and mapping methods is compensated by answering those queries with ranked lists of alternative results. Each result is then based on different data sources or query interpretations, allowing users to pick the result most suitable to their information need. To this end, this thesis makes three main contributions. Firstly, we introduce a novel method for Top-k Entity Augmentation, which is able to construct a top-k list of consistent integration results from a large corpus of heterogeneous data sources. It improves on the state-of-the-art by producing a set of individually consistent, but mutually diverse, set of alternative solutions, while minimizing the number of data sources used. Secondly, based on this novel augmentation method, we introduce the DrillBeyond system, which is able to process Open World SQL queries, i.e., queries referencing arbitrary attributes not defined in the queried database. The original database is then augmented at query time with Web data sources providing those attributes. Its hybrid augmentation/relational query processing enables the use of ad-hoc data search and integration in data analysis queries, and improves both performance and quality when compared to using separate systems for the two tasks. Finally, we studied the management of large-scale dataset corpora such as data lakes or Open Data platforms, which are used as data sources for our augmentation methods. We introduce Publish-time Data Integration as a new technique for data curation systems managing such corpora, which aims at improving the individual reusability of datasets without requiring up-front global integration. This is achieved by automatically generating metadata and format recommendations, allowing publishers to enhance their datasets with minimal effort. Collectively, these three contributions are the foundation of a Query-time Data Integration architecture, that enables ad-hoc data search and integration queries over large heterogeneous dataset collections

    Interactive data analysis and its applications on multi-structured datasets

    Get PDF
    Ph.DDOCTOR OF PHILOSOPH

    Location- and keyword-based querying of geo-textual data: a survey

    Get PDF
    With the broad adoption of mobile devices, notably smartphones, keyword-based search for content has seen increasing use by mobile users, who are often interested in content related to their geographical location. We have also witnessed a proliferation of geo-textual content that encompasses both textual and geographical information. Examples include geo-tagged microblog posts, yellow pages, and web pages related to entities with physical locations. Over the past decade, substantial research has been conducted on integrating location into keyword-based querying of geo-textual content in settings where the underlying data is assumed to be either relatively static or is assumed to stream into a system that maintains a set of continuous queries. This paper offers a survey of both the research problems studied and the solutions proposed in these two settings. As such, it aims to offer the reader a first understanding of key concepts and techniques, and it serves as an “index” for researchers who are interested in exploring the concepts and techniques underlying proposed solutions to the querying of geo-textual data.Agency for Science, Technology and Research (A*STAR)Ministry of Education (MOE)Nanyang Technological UniversityThis research was supported in part by MOE Tier-2 Grant MOE2019-T2-2-181, MOE Tier-1 Grant RG114/19, an NTU ACE Grant, and the Singtel Cognitive and Artificial Intelligence Lab for Enterprises (SCALE@NTU), which is a collaboration between Singapore Telecommunications Limited (Singtel) and Nanyang Technological University (NTU) that is funded by the Singapore Government through the Industry Alignment Fund Industry Collaboration Projects Grant, and by the Innovation Fund Denmark centre, DIREC

    A Survey on Intent-based Diversification for Fuzzy Keyword Search

    Get PDF
    Keyword search is an interesting phenomenon, it is the process of finding important and relevant information from various data repositories. Structured and semistructured data can precisely be stored. Fully unstructured documents can annotate and be stored in the form of metadata. For the total web search, half of the web search is for information exploration process. In this paper, the earlier works for semantic meaning of keywords based on their context in the specified documents are thoroughly analyzed. In a tree data representation, the nodes are objects and could hold some intention. These nodes act as anchors for a Smallest Lowest Common Ancestor (SLCA) based pruning process. Based on their features, nodes are clustered. The feature is a distinctive attribute, it is the quality, property or traits of something. Automatic text classification algorithms are the modern way for feature extraction. Summarization and segmentation produce n consecutive grams from various forms of documents. The set of items which describe and summarize one important aspect of a query is known as the facet. Instead of exact string matching a fuzzy mapping based on semantic correlation is the new trend, whereas the correlation is quantified by cosine similarity. Once the outlier is detected, nearest neighbors of the selected points are mapped to the same hash code of the intend nodes with high probability. These methods collectively retrieve the relevant data and prune out the unnecessary data, and at the same time create a hash signature for the nearest neighbor search. This survey emphasizes the need for a framework for fuzzy oriented keyword search

    Understanding User Intentions in Vertical Image Search

    Get PDF
    With the development of Internet and Web 2.0, large volume of multimedia contents have been made online. It is highly desired to provide easy accessibility to such contents, i.e. efficient and precise retrieval of images that satisfies users' needs. Towards this goal, content-based image retrieval (CBIR) has been intensively studied in the research community, while text-based search is better adopted in the industry. Both approaches have inherent disadvantages and limitations. Therefore, unlike the great success of text search, Web image search engines are still premature. In this thesis, we present iLike, a vertical image search engine which integrates both textual and visual features to improve retrieval performance. We bridge the semantic gap by capturing the meaning of each text term in the visual feature space, and re-weight visual features according to their significance to the query terms. We also bridge the user intention gap since we are able to infer the "visual meanings" behind the textual queries. Last but not least, we provide a visual thesaurus, which is generated from the statistical similarity between the visual space representation of textual terms. Experimental results show that our approach improves both precision and recall, compared with content-based or text-based image retrieval techniques. More importantly, search results from iLike are more consistent with users' perception of the query terms

    Spatial Search Strategies for Open Government Data: A Systematic Comparison

    Full text link
    The increasing availability of open government datasets on the Web calls for ways to enable their efficient access and searching. There is however an overall lack of understanding regarding spatial search strategies which would perform best in this context. To address this gap, this work has assessed the impact of different spatial search strategies on performance and user relevance judgment. We harvested machine-readable spatial datasets and their metadata from three English-based open government data portals, performed metadata enhancement, developed a prototype and performed both a theoretical and user-based evaluation. The results highlight that (i) switching between area of overlap and Hausdorff distance for spatial similarity computation does not have any substantial impact on performance; and (ii) the use of Hausdorff distance induces slightly better user relevance ratings than the use of area of overlap. The data collected and the insights gleaned may serve as a baseline against which future work can compare.Comment: Paper accepted to GIR'19: 13th Workshop on Geographic Information Retrieval (Lyon, France

    The ecosystem of peatland research : a bibliometric analysis

    Get PDF
    Peatlands provide a range of services to societies, such as sequestration of organic carbon, biodiversity protection, attenuation of water flow, and the provision of fuel, wood and fruit, among others. Despite their global importance, no study has yet characterised peatland research on a global scale. This study aims at providing a better understanding of the geographical distribution of peatland research, of its variations through time, and of the specific topics studied. Results show that peatland research has, between 1991 and 2017, become increasingly international and diversified, with more countries and study sites active, and an increase in foreign sites studied. We do observe, however, that the general vast peatland regions of the world showed relatively distinct profiles in terms of topics which, in most cases, are related to their geographical features. Peatland research has a spatial imbalance in favour of central Europe, with studies in Africa and Brazil highly under-represented in relation to their area, and those in western and eastern Siberia moderately under-represented. We also observe some topics have become increasingly studied during recent decades, e.g. climate change, fire, restoration and carbon, while others have been decreasingly studied, such as botany, nitrogen and coal
    corecore