Learning Interesting Categorical Attributes for Refined Data Exploration
This work proposes and evaluates a novel approach to determine interesting
categorical attributes for lists of entities. Once identified, such categories
are of immense value to allow constraining (filtering) a current view of a user
to subsets of entities. We show how a classifier is trained that is able to
tell whether or not a categorical attribute can act as a constraint, in the
sense of human-perceived interestingness. The training data is harnessed from
Web tables, treating the presence or absence of a table as an indication that
the attribute used as a filter constraint is reasonable or not. For learning
the classification model, we review four well-known statistical measures
(features) for categorical attributes---entropy, unalikeability, peculiarity,
and coverage. We additionally propose three new statistical measures to capture
the distribution of data, tailored to our main objective. The learned model is
evaluated through relevance assessments obtained in a user study. The results
reflect the applicability of the approach as a whole and, further, demonstrate
the superiority of the proposed diversity measures over existing statistical
measures such as information entropy.
Comment: 13 pages, 9 figures, 6 tables
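To make two of the reviewed measures concrete, the following is a minimal sketch of entropy and unalikeability for a single categorical attribute. It uses the common textbook definitions (Shannon entropy in bits, and unalikeability as one minus the sum of squared category proportions), which are assumed here and may differ in detail from the paper's formulation. The function names are hypothetical. Peculiarity, coverage, and the three newly proposed measures are not shown because the abstract does not define them.

```python
from collections import Counter
from math import log2


def shannon_entropy(values):
    """Shannon entropy (in bits) of the category distribution in `values`."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def unalikeability(values):
    """Chance that two draws (with replacement) fall into different categories.

    Assumed definition: 1 - sum(p_i ** 2) over the category proportions p_i.
    """
    counts = Counter(values)
    n = len(values)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


# Hypothetical attribute values for a list of entities.
genres = ["drama", "drama", "comedy", "comedy"]
print(shannon_entropy(genres))
print(unalikeability(genres))
```

Both measures are zero when every entity shares the same value and grow as the values spread more evenly over the categories, which is why they are natural candidate features for judging whether an attribute makes a useful filter.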
Columnar Database Techniques for Creating AI Features
Recent advances with in-memory columnar database techniques have increased
the performance of analytical queries on very large databases and data
warehouses. At the same time, advances in artificial intelligence (AI)
algorithms have increased the ability to analyze data. We use the term AI to
encompass both Deep Learning (DL, or neural networks) and Machine Learning (ML,
also known as Big Data analytics). Our exploration of the AI full stack has led us to a
cross-stack columnar database innovation that efficiently creates features for
AI analytics. The innovation is to create Augmented Dictionary Values (ADVs) to
add to existing columnar database dictionaries in order to increase the
efficiency of featurization by minimizing data movement and data duplication.
We show how various forms of featurization (feature selection, feature
extraction, and feature creation) can be efficiently calculated in a columnar
database. The full stack AI investigation has also led us to propose an
integrated columnar database and AI architecture. This architecture has
information flows and feedback loops to improve the whole analytics cycle
during multiple iterations of extracting data from the data sources,
featurization, and analysis.
Comment: 7 pages, 2 figures, 5 tables
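The general idea of attaching derived values to a column dictionary can be sketched as follows. This is an illustrative assumption, not the paper's ADV design, which the abstract does not specify: a column is dictionary-encoded, a feature is computed once per distinct dictionary entry, and rows obtain the feature through their integer codes, so the raw values are neither moved nor duplicated. All function names are hypothetical.

```python
def dictionary_encode(column):
    """Return (dictionary, codes): distinct values in first-seen order,
    plus one integer code per row pointing into the dictionary."""
    dictionary = []
    index = {}
    codes = []
    for value in column:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def augment(dictionary, feature_fn):
    """Compute a feature once per distinct value (not once per row)."""
    return [feature_fn(value) for value in dictionary]


def featurize(codes, augmented):
    """Materialize the per-row feature by looking up each row's code."""
    return [augmented[code] for code in codes]


# Hypothetical column and feature (string length) for illustration only.
column = ["red", "blue", "red", "green", "blue"]
dictionary, codes = dictionary_encode(column)
lengths = augment(dictionary, len)
print(featurize(codes, lengths))
```

In this sketch the feature function runs once per distinct value rather than once per row, so the saving grows with the amount of repetition in the column.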