150,187 research outputs found
T-Crowd: Effective Crowdsourcing for Tabular Data
Crowdsourcing employs human workers to solve computer-hard problems, such as
data cleaning, entity resolution, and sentiment analysis. When crowdsourcing
tabular data, e.g., the attribute values of an entity set, a worker's answers
on the different attributes (e.g., the nationality and age of a celebrity star)
are often treated independently. This assumption is not always true and can
lead to suboptimal crowdsourcing performance. In this paper, we present the
T-Crowd system, which takes into consideration the intricate relationships
among tasks, in order to converge faster to their true values. Particularly,
T-Crowd integrates each worker's answers on different attributes to effectively
learn his/her trustworthiness and the true data values. The attribute
relationship information is also used to guide task allocation to workers.
Finally, T-Crowd seamlessly supports categorical and continuous attributes,
which are the two main datatypes found in typical databases. Our extensive
experiments on real and synthetic datasets show that T-Crowd outperforms
state-of-the-art methods in terms of truth inference and reducing the cost of
crowdsourcing
A linear optimization based method for data privacy in statistical tabular data
National Statistical Agencies routinely disseminate large amounts of data. Prior to dissemination these data have to be protected to avoid releasing confidential information. Controlled tabular adjustment (CTA) is one of the available methods for this purpose. CTA formulates an optimization problem that looks for the safe table which is closest to the original one. The standard CTA approach results in a mixed integer linear optimization (MILO) problem, which is very challenging for current
technology. In this work we present a much less costly variant of CTA that formulates a multiobjective linear optimization (LO) problem, where binary variables are pre-fixed, and the resulting continuous problem is solved by lexicographic optimization. Extensive computational results are reported using both commercial (CPLEX and XPRESS) and open source (Clp) solvers, with either simplex or interior-point methods, on a set of real instances. Most instances were successfully solved with
the LO-CTA variant in less than one hour, while many of them are computationally very expensive with the MILO-CTA formulation. The interior-point method outperformed simplex in this particular application.Peer ReviewedPreprin
Recommended from our members
SemTab 2019: Resources to Benchmark Tabular Data to Knowledge Graph Matching Systems
Tabular data to Knowledge Graph matching is the process of assigning semantic tags from knowledge graphs (e.g., Wikidata or DBpedia) to the elements of a table. This task is a challenging problem for various reasons, including the lack of metadata (e.g., table and column names), the noisiness, heterogeneity, incompleteness and ambiguity in the data. The results of this task provide significant insights about potentially highly valuable tabular data, as recent works have shown, enabling a new family of data analytics and data science applications. Despite significant amount of work on various flavors of this problem, there is a lack of a common framework to conduct a systematic evaluation of state-of-the-art systems. The creation of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab) aims at filling this gap. In this paper, we report about the datasets, infrastructure and lessons learned from the first edition of the SemTab challenge
A second order cone formulation of continuous CTA model
The final publication is available at link.springer.comIn this paper we consider a minimum distance Controlled Tabular Adjustment (CTA) model for statistical disclosure limitation (control) of tabular data. The goal of the CTA model is to find the closest safe table to some original tabular data set that contains sensitive information. The measure of closeness is usually measured using l1 or l2 norm; with each measure having its advantages and disadvantages. Recently, in [4] a regularization of the l1 -CTA using Pseudo-Huber func- tion was introduced in an attempt to combine positive characteristics of both l1 -CTA and l2 -CTA. All three models can be solved using appro- priate versions of Interior-Point Methods (IPM). It is known that IPM in general works better on well structured problems such as conic op- timization problems, thus, reformulation of these CTA models as conic optimization problem may be advantageous. We present reformulation of Pseudo-Huber-CTA, and l1 -CTA as Second-Order Cone (SOC) op- timization problems and test the validity of the approach on the small example of two-dimensional tabular data set.Peer ReviewedPostprint (author's final draft
UniTabE: Pretraining a Unified Tabular Encoder for Heterogeneous Tabular Data
Recent advancements in Natural Language Processing (NLP) have witnessed the
groundbreaking impact of pretrained models, yielding impressive outcomes across
various tasks. This study seeks to extend the power of pretraining
methodologies to tabular data, a domain traditionally overlooked, yet
inherently challenging due to the plethora of table schemas intrinsic to
different tasks. The primary research questions underpinning this work revolve
around the adaptation to heterogeneous table structures, the establishment of a
universal pretraining protocol for tabular data, the generalizability and
transferability of learned knowledge across tasks, the adaptation to diverse
downstream applications, and the incorporation of incremental columns over
time. In response to these challenges, we introduce UniTabE, a pioneering
method designed to process tables in a uniform manner, devoid of constraints
imposed by specific table structures. UniTabE's core concept relies on
representing each basic table element with a module, termed TabUnit. This is
subsequently followed by a Transformer encoder to refine the representation.
Moreover, our model is designed to facilitate pretraining and finetuning
through the utilization of free-form prompts. In order to implement the
pretraining phase, we curated an expansive tabular dataset comprising
approximately 13 billion samples, meticulously gathered from the Kaggle
platform. Rigorous experimental testing and analyses were performed under a
myriad of scenarios to validate the effectiveness of our methodology. The
experimental results demonstrate UniTabE's superior performance against several
baseline models across a multitude of benchmark datasets. This, therefore,
underscores UniTabE's potential to significantly enhance the semantic
representation of tabular data, thereby marking a significant stride in the
field of tabular data analysis.Comment: 9 page
- …