2 research outputs found

    Learning Interesting Categorical Attributes for Refined Data Exploration

    This work proposes and evaluates a novel approach for determining interesting categorical attributes for lists of entities. Once identified, such categories are valuable for constraining (filtering) a user's current view to subsets of entities. We show how to train a classifier that can tell whether a categorical attribute can act as a constraint, in the sense of human-perceived interestingness. The training data is drawn from Web tables, treating the presence or absence of a table as an indication of whether the attribute used as a filter constraint is reasonable. For learning the classification model, we review four well-known statistical measures (features) for categorical attributes: entropy, unalikeability, peculiarity, and coverage. We additionally propose three new statistical measures that capture the distribution of data, tailored to our main objective. The learned model is evaluated with relevance assessments obtained through a user study, which reflect the applicability of the approach as a whole and demonstrate the superiority of the proposed diversity measures over existing statistical measures such as information entropy.
    Comment: 13 pages, 9 figures, 6 tables
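
    Three of the four reviewed measures can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names are hypothetical, entropy and unalikeability follow their standard textbook definitions, and coverage is assumed here to mean the fraction of entities with a non-missing value, which the abstract does not confirm. Peculiarity and the three proposed measures are left out because the abstract does not define them.

```python
import math
from collections import Counter


def proportions(values):
    """Relative frequency of each category, ignoring missing (None) values."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    total = len(present)
    return {category: n / total for category, n in counts.items()}


def entropy(values):
    """Shannon entropy in bits: sum over categories of -p * log2(p)."""
    return sum(-p * math.log2(p) for p in proportions(values).values())


def unalikeability(values):
    """Coefficient of unalikeability: 1 - sum of squared proportions.

    Equals the probability that two values drawn with replacement differ.
    """
    return 1.0 - sum(p * p for p in proportions(values).values())


def coverage(values):
    """ASSUMPTION: share of entities that have any value for the attribute."""
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)


# Hypothetical attribute "genre" over five entities, one value missing.
genres = ["rock", "rock", "jazz", "jazz", None]
scores = {
    "entropy": entropy(genres),
    "unalikeability": unalikeability(genres),
    "coverage": coverage(genres),
}
```

    An attribute whose values are evenly spread over a few categories scores high on entropy and unalikeability, while one dominated by a single value scores near zero on both. A classifier like the one described would take such scores as input features.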

    Columnar Database Techniques for Creating AI Features

    Recent advances in in-memory columnar database techniques have increased the performance of analytical queries on very large databases and data warehouses. At the same time, advances in artificial intelligence (AI) algorithms have increased the ability to analyze data. We use the term AI to encompass both Deep Learning (DL, or neural networks) and Machine Learning (ML, also known as Big Data analytics). Our exploration of the AI full stack has led us to a cross-stack columnar database innovation that efficiently creates features for AI analytics. The innovation is to create Augmented Dictionary Values (ADVs) and add them to existing columnar database dictionaries, which increases the efficiency of featurization by minimizing data movement and data duplication. We show how various forms of featurization (feature selection, feature extraction, and feature creation) can be calculated efficiently in a columnar database. The full-stack AI investigation has also led us to propose an integrated columnar database and AI architecture. This architecture has information flows and feedback loops that improve the whole analytics cycle over multiple iterations of extracting data from the data sources, featurization, and analysis.
    Comment: 7 pages, 2 figures, 5 tables
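
    The general idea of attaching derived values to a column dictionary can be sketched in Python as follows. This is an illustration under stated assumptions, not the ADV design from the paper, whose layout the abstract does not describe. The class `DictionaryColumn` and its methods `encode`, `augment`, and `feature` are hypothetical names. The sketch only shows why computing a feature once per distinct dictionary entry, rather than once per row, avoids duplicated work and data.

```python
from dataclasses import dataclass, field


@dataclass
class DictionaryColumn:
    """A dictionary-encoded column: rows store small integer codes."""

    dictionary: list  # distinct values; list index is the code
    codes: list       # one code per row
    # ASSUMPTION: derived values stored alongside the dictionary,
    # one entry per distinct value (feature name -> aligned list).
    augmented: dict = field(default_factory=dict)

    @classmethod
    def encode(cls, values):
        """Build the sorted dictionary and per-row codes from raw values."""
        dictionary = sorted(set(values))
        index = {value: code for code, value in enumerate(dictionary)}
        return cls(dictionary, [index[value] for value in values])

    def augment(self, name, fn):
        """Compute a derived value once per distinct dictionary entry."""
        self.augmented[name] = [fn(value) for value in self.dictionary]

    def feature(self, name):
        """Materialize the per-row feature by looking up each row's code."""
        table = self.augmented[name]
        return [table[code] for code in self.codes]


# Hypothetical column with four rows and three distinct values.
col = DictionaryColumn.encode(["red", "blue", "red", "green"])
col.augment("length", len)  # len() runs 3 times, not 4
row_features = col.feature("length")
```

    With many rows and few distinct values, the derived function runs only as many times as there are dictionary entries, and the per-row feature is recovered by an integer lookup instead of recomputation.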