Generative Benchmark Creation for Table Union Search
Data management has traditionally relied on synthetic data generators to
generate structured benchmarks, like the TPC suite, where we can control
important parameters like data size and its distribution precisely. These
benchmarks were central to the success and adoption of database management
systems. But more and more, data management problems are of a semantic nature.
An important example is finding tables that can be unioned. While any two
tables with the same cardinality can be unioned, table union search is the
problem of finding tables whose union is semantically coherent. Semantic
problems cannot be benchmarked using synthetic data. Our current methods for
creating benchmarks involve the manual curation and labeling of real data.
These methods are not robust or scalable and, perhaps more importantly, it is
not clear how robust the created benchmarks are. We propose to use generative
AI models to create structured data benchmarks for table union search. We
present a novel method for using generative models to create tables with
specified properties. Using this method, we create a new benchmark containing
both pairs of tables that are unionable and pairs that are non-unionable but
related. We
thoroughly evaluate recent table union search methods over existing
benchmarks and our new benchmark. We also present and evaluate a new table
search method based on recent large language models over all benchmarks. We
show that the new benchmark is more challenging for all methods than
hand-curated benchmarks. Specifically, the top-performing method achieves a
Mean Average Precision of around 60%, over 30% less than its performance on
existing manually created benchmarks. We examine why this is the case and show
that the new benchmark permits more detailed analysis of methods, including a
study of both false positives and false negatives that was not possible with
existing benchmarks.
LakeBench: Benchmarks for Data Discovery over Data Lakes
Within enterprises, there is a growing need to intelligently navigate data
lakes, specifically focusing on data discovery. Of particular importance to
enterprises is the ability to find related tables in data repositories. These
tables can be unionable, joinable, or subsets of each other. There is a dearth
of benchmarks for these tasks in the public domain, with related work targeting
private datasets. In LakeBench, we develop multiple benchmarks for these tasks
using tables drawn from a diverse set of data sources such as
government data from CKAN, Socrata, and the European Central Bank. We compare
the performance of 4 publicly available tabular foundational models on these
tasks. None of the existing models had been trained on the data discovery tasks
that we developed for this benchmark; not surprisingly, their performance shows
significant room for improvement. The results suggest that the establishment of
such benchmarks may be useful to the community to build tabular models usable
for data discovery in data lakes.
Status and Response Till Third Stage of 2019 novel coronavirus disease (COVID-19) in Nepal
An outbreak of severe acute respiratory syndrome coronavirus infection occurred in Wuhan, China at the end of December 2019, and the virus has already spread to almost 210 countries around the world. WHO declared COVID-19 a "global pandemic" on 11 March 2020 and regarded South Asia as a high-risk region. Nepal, a landlocked country bordering the two most populous countries, India and China, was expected to have a high number of COVID-19 cases due to its proximity to China, a highly infected country, and India, where the virus has spread more recently. Also, many Nepali people are engaged in businesses related to China and India. However, there have been very few reported cases in Nepal. The first case was reported on 24 January 2020, one and a half months after the first case was confirmed in China. It took almost three months for the number of cases to reach 45 and for the community spread stage of the pandemic to begin. This research presents the detailed situation of cases, testing facilities, quarantine and isolation, hospitals, nursing care, etc. before the start of the community transmission stage in Nepal. The scenario is represented graphically, and the condition of other South Asian nations is also compared and visualized. The steps taken by the government, individuals, and other organizations are also highlighted. This paper also provides concrete data and analysis about the pandemic, which can be helpful not only for the current pandemic but also for the control of future pandemics.
Results of SemTab 2023
SemTab 2023 was the fifth edition of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching, collocated with the 22nd International Semantic Web Conference (ISWC) and the 18th Ontology Matching (OM) Workshop. SemTab provides a framework to conduct a systematic evaluation of state-of-the-art semantic table interpretation systems. In this paper, we give an overview of the 2023 edition of the challenge and summarize the results.