5 research outputs found

    Towards intelligent geospatial data discovery: a machine learning framework for search ranking

    Current search engines in most geospatial data portals tend to induce users to focus on a single data characteristic (e.g. popularity or release date). This approach largely fails to account for users’ multidimensional preferences for geospatial data and may therefore result in a less than optimal experience in discovering the most applicable dataset. This study reports a machine learning framework to address the ranking challenge, the fundamental obstacle in geospatial data discovery, by (1) identifying a number of ranking features of geospatial data that represent users’ multidimensional preferences, considering semantics, user behavior, spatial similarity, and static dataset metadata attributes; (2) applying a machine learning method to automatically learn a ranking function; and (3) proposing a system architecture that combines existing search-oriented open source software, a semantic knowledge base, ranking feature extraction, and a machine learning algorithm. Results show that the machine learning approach outperforms other methods in terms of both precision at K and normalized discounted cumulative gain. As an early attempt to use machine learning to improve search ranking in the geospatial domain, we expect this work to set an example for further research and open the door towards intelligent geospatial data discovery.
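
    The two evaluation metrics named in this abstract, precision at K and normalized discounted cumulative gain (NDCG), can be illustrated with a minimal sketch. This is not the paper's evaluation code. It assumes each ranked result carries a graded relevance label (0 meaning not relevant) and uses the linear-gain form of DCG with a log2 position discount; other variants (for example an exponential gain) exist. The function names are hypothetical.

```python
import math


def precision_at_k(relevances, k):
    """Fraction of the top-k ranked results that are relevant (label > 0)."""
    top = relevances[:k]
    return sum(1 for r in top if r > 0) / k


def dcg_at_k(relevances, k):
    """Discounted cumulative gain: linear gain, log2 position discount (assumed variant)."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels of the results in the order a ranker returned them (made-up data).
ranked = [3, 2, 0, 1]
p3 = precision_at_k(ranked, 3)
n4 = ndcg_at_k(ranked, 4)
```

    A ranking that is already in ideal order scores an NDCG of 1.0, so the metric rewards rankers that place the most relevant datasets first.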

    A Smart Web-Based Geospatial Data Discovery System with Oceanographic Data as an Example

    Discovering and accessing geospatial data presents a significant challenge for the Earth sciences community, as massive amounts of data are produced daily. In this article, we report a smart web-based geospatial data discovery system that mines and utilizes data relevancy from metadata and user behavior. Specifically, (1) the system enables semantic query expansion and suggestion to assist users in finding more relevant data; (2) machine-learned ranking is used to provide the optimal search ranking based on a number of identified ranking features that reflect users’ search preferences; (3) a hybrid recommendation module allows users to discover related data by considering metadata attributes and user behavior; and (4) an integrated graphical user interface guides data consumers quickly and intuitively to the appropriate data resources. As a proof of concept, we focus on a well-defined domain, oceanography, and use oceanographic data discovery as an example. Experiments and a search example show that the proposed system can improve the scientific community’s data search experience by providing query expansion, suggestion, better search ranking, and data recommendation via a user-friendly interface.

    A Cloud-Based Framework for Large-Scale Log Mining through Apache Spark and Elasticsearch

    The volume, variety, and velocity of data, e.g., simulation data, observation data, and social media data, are growing ever faster, posing grand challenges for data discovery. An increasing trend in data discovery is to mine hidden relationships among users and metadata from web usage logs to support the data discovery process. Web usage log mining is the process of reconstructing sessions from raw logs and finding interesting patterns or implicit linkages. The mining results play an important role in improving the quality of search-related components, e.g., ranking, query suggestion, and recommendation. Although research has been done in the data discovery domain, collecting and analyzing logs efficiently remains a challenge because (1) the volume of web usage logs continues to grow as long as users access the data; (2) the dynamic volume of logs requires on-demand computing resources for mining tasks; and (3) the mining process is compute-intensive and time-intensive. To speed up the mining process, we propose a cloud-based log-mining framework using Apache Spark and Elasticsearch. In addition, a data partition paradigm, logPartitioner, is designed to solve the data imbalance problem in data parallelism. As a proof of concept, oceanographic data search and access logs are chosen to validate the performance of the proposed parallel log-mining framework.
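
    The session reconstruction step described in this abstract can be sketched in plain Python. This is an illustration only, not the paper's Spark-based implementation or its logPartitioner. It assumes each log record is a hypothetical (user, timestamp in seconds, request) tuple and that a session ends after 30 minutes of inactivity, a common heuristic that the abstract does not specify.

```python
from collections import defaultdict

# Assumed inactivity timeout; the abstract does not state the value used.
SESSION_GAP_SECONDS = 30 * 60


def reconstruct_sessions(records, gap=SESSION_GAP_SECONDS):
    """Group (user, timestamp, request) records into per-user sessions.

    A new session starts whenever two consecutive requests from the same
    user are more than `gap` seconds apart.
    """
    by_user = defaultdict(list)
    for user, ts, request in records:
        by_user[user].append((ts, request))

    sessions = []
    for user, events in by_user.items():
        events.sort(key=lambda e: e[0])  # order each user's requests by time
        current = [events[0]]
        for prev, event in zip(events, events[1:]):
            if event[0] - prev[0] > gap:
                sessions.append((user, current))  # close the finished session
                current = []
            current.append(event)
        sessions.append((user, current))
    return sessions


# Made-up log records for illustration.
logs = [
    ("a", 0, "search: sea surface temperature"),
    ("a", 100, "view: dataset-1"),
    ("a", 5000, "search: ocean wind"),
    ("b", 50, "view: dataset-2"),
]
sessions = reconstruct_sessions(logs)
```

    Grouping by user before splitting on time gaps is also what makes the workload uneven in a parallel setting: a few very active users produce far more records than the rest, which is the kind of imbalance the abstract says logPartitioner is designed to address.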