Cost-effective data structural preparation

Author: Chodpathumwan Yodsawalai
Publication date: 01/12/2018

People structure and represent their data in many different ways. One factor to consider in choosing between different representations is how the structure will affect the effectiveness of algorithms that run over the data. In fact, before sophisticated analytics can be performed, one must usually go through a data preparation phase, where the structural representation of the data is changed to be more suitable for the particular analytics procedure that will be performed. This is necessary because individual analytics algorithms are effective only for certain kinds of structural representations of their input data. Unfortunately, analytics algorithms do not come with a clear description of their desired representation. Hence, time and expertise are required to identify and materialize a suitable representation for each analytics task. In this dissertation, we address this issue in data preparation. Our first contribution focuses on the concept of design independence, in which the intent is to create an analytics algorithm that is effective regardless of the choice of data representation. The benefit of becoming more design independent is that it will reduce or, in the most favorable outcome, remove the cost of manually finding and preparing the most effective structure or schema for the data. In this part of our work, we consider common variations of data source structure that preserve its content. For the analytics task of similarity search, we propose an algorithm that satisfies the design independence property against the studied variations. We then generalize our findings to other structural variations and prove that the algorithm is design independent with respect to these structural variants. We show that humans find its answers at least as desirable as those provided by existing similarity search algorithms.
In the case where design independence is not achievable, we address the data preparation issue by proposing an algorithm that finds a cost-effective structure to be imposed on an unstructured dataset. Under this approach, structural information is added to the data source to improve the effectiveness of an algorithm running over the data. We leverage the information from an existing domain of concepts or an ontology to add structure to the data collection in the form of annotations. Because each concept may require different amounts of resources and time in annotating and/or maintaining the data source, we would like to find a set of affordable concepts that improves the effectiveness of an algorithm the most. This is called the cost-effective conceptual design problem. Previous work on this topic assumed that a domain of concepts is simply an unorganized set of concepts. However, real-world domains are often organized, in the form of taxonomies for example. Hence, in this dissertation, we explore a new version of the cost-effective conceptual design problem, using taxonomies of concepts and considering multi-concept queries.

Illinois Digital Environment for Access to Learning and Scholarship Repository

Representation Independent Analytics Over Structured Data

Author: Chodpathumwan Yodsawalai
Fern Alan
Picado Jose
Sun Yizhou
Termehchy Arash
Publication date: 08/09/2014

Database analytics algorithms leverage quantifiable structural properties of the data to predict interesting concepts and relationships. The same information, however, can be represented using many different structures, and the structural properties observed over particular representations do not necessarily hold for alternative structures. Thus, there is no guarantee that current database analytics algorithms will still provide the correct insights, no matter what structures are chosen to organize the database. Because these algorithms tend to be highly effective over some choices of structure, such as that of the databases used to validate them, but not so effective with others, database analytics has largely remained the province of experts who can find the desired forms for these algorithms. We argue that in order to make database analytics usable, we should use or develop algorithms that are effective over a wide range of choices of structural organizations. We introduce the notion of representation independence, study its fundamental properties for a wide range of data analytics algorithms, and empirically analyze the amount of representation independence of some popular database analytics algorithms. Our results indicate that most algorithms are not generally representation independent, and they identify the characteristics of heuristics that are more representation independent under certain representational shifts.

arXiv.org e-Print Archive

CiteSeerX

How schema independent are schema free query interfaces?

Author: Arash Termehchy
Marianne Winslett
Yodsawalai Chodpathumwan
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2011

Real-world databases often have extremely complex schemas. With thousands of entity types and relationships, each with a hundred or so attributes, it is extremely difficult for new users to explore the data and formulate queries. Schema free query interfaces (SFQIs) address this problem by allowing users with no knowledge of the schema to submit queries. We postulate that SFQIs should deliver the same answers when given alternative but equivalent schemas for the same underlying information. In this paper, we introduce and formally define design independence, which captures this property for SFQIs. We establish a theoretical framework to measure the amount of design independence provided by an SFQI. We show that most current SFQIs provide a very limited degree of design independence. We also show that SFQIs based on the statistical properties of data can provide design independence when the changes in the schema do not introduce or remove redundancy in the data. We propose a novel XML SFQI called Duplication Aware Coherency Ranking (DA-CR) based on information-theoretic relationships among the data items in the database, and prove that DA-CR is design independent. Our extensive empirical study using three real-world data sets shows that the average case design independence of current SFQIs is considerably lower than that of DA-CR. We also show that the ranking quality of DA-CR is better than or equal to that of current SFQI methods.
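
The property described in this abstract, that an interface should return the same answers over equivalent representations of the same information, can be illustrated with a small check. The sketch below is not the paper's framework or its measure of design independence: `agreement_rate`, the two toy rankers, and the dictionary-based "databases" are all hypothetical, and reordering entries merely stands in for an equivalent schema variation.

```python
from typing import Callable, Dict, List, Sequence, Set

# A toy "database": answer id -> set of terms. Hypothetical, for illustration only.
ToyDB = Dict[str, Set[str]]
Ranker = Callable[[str, ToyDB], List[str]]


def agreement_rate(rank: Ranker, queries: Sequence[str],
                   db_a: ToyDB, db_b: ToyDB) -> float:
    """Fraction of queries whose ranked answers are identical over both representations."""
    if not queries:
        raise ValueError("at least one query is required")
    same = sum(1 for q in queries if rank(q, db_a) == rank(q, db_b))
    return same / len(queries)


def rank_by_id(query: str, db: ToyDB) -> List[str]:
    """Orders matching answers by id, so storage order does not matter."""
    return sorted(k for k, terms in db.items() if query in terms)


def rank_by_storage_order(query: str, db: ToyDB) -> List[str]:
    """Orders matching answers as stored, so the result depends on the layout."""
    return [k for k, terms in db.items() if query in terms]


# Two layouts holding the same content (assumed equivalent for this toy example).
db_a: ToyDB = {"p1": {"xml"}, "p2": {"xml", "ranking"}}
db_b: ToyDB = {"p2": {"xml", "ranking"}, "p1": {"xml"}}
queries = ["xml", "ranking"]

stable = agreement_rate(rank_by_id, queries, db_a, db_b)
fragile = agreement_rate(rank_by_storage_order, queries, db_a, db_b)
```

Exact-match agreement is only one possible choice here; a rank correlation or top-k overlap could be substituted where partial agreement should count.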

CiteSeerX

Crossref

Cost-effective conceptual design using taxonomies

Author: Chodpathumwan Yodsawalai
Nayyeri Amir
Termehchy Arash
Vakilian Ali
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 24/09/2020

It is known that annotating entities in unstructured and semi-structured datasets by their concepts improves the effectiveness of answering queries over these datasets. Ideally, one would like to annotate entities of all relevant concepts in a dataset. However, it takes substantial time and computational resources to annotate concepts in large datasets, and an organization may have sufficient resources to annotate only a subset of relevant concepts. Clearly, it would like to annotate a subset of concepts that provides the most effective answers to queries over the dataset. We propose a formal framework that quantifies the amount by which annotating entities of concepts from a taxonomy in a dataset improves the effectiveness of answering queries over the dataset. Because the problem is NP-hard, we propose efficient approximation and pseudo-polynomial time algorithms for several cases of the problem. Our extensive empirical studies validate our framework and show the accuracy and efficiency of our algorithms. National Science Foundation (Grants IIS-1421247, CCF-0938071, CCF-0938064 and CNS-0716532)

DSpace@MIT

Cost Effective Conceptual Design for Semantic Annotation (Extended Version)

Author: Termehchy Arash
Vakilian Ali
Winslett Marianne
Yodsawalai Chodpathumwan
Publication venue: Corvallis, OR : Oregon State University, Dept. of Computer Science

It is well established that annotating occurrences of entities in a collection of unstructured or semi-structured text documents with their concepts (i.e., entity sets), called semantic annotation, improves the effectiveness of answers to users’ queries. However, an enterprise must spend a large amount of financial, computational, and human resources to develop and deploy an annotation program to semantically annotate a concept in a large collection. Moreover, since the structure and content of the documents may evolve over time, the annotation programs must often be rewritten and repaired. These efforts are even more costly for concepts that are defined in specific domains, such as medicine and the law, as they require extensive collaboration between developers and domain experts. Since the available resources in an enterprise are limited, it has to select only a subset of relevant concepts for annotation. We call this subset a conceptual design for the annotated collection. To the best of our knowledge, finding a conceptual design for semantic annotation is generally left to intuition. In this paper, we introduce and formally define the problem of cost effective conceptual design, where given a set of relevant concepts and a fixed budget, one would like to find a conceptual design that improves the effectiveness of answers to users’ queries the most. We prove that the problem is generally NP-hard in the number of relevant concepts and propose two efficient approximation algorithms: Approximate Popularity Maximization (APM for short) and Approximate Annotation-Benefit Maximization (AAM for short). We prove that APM has a constant approximation ratio and AAM is a fully polynomial-time approximation scheme. We validate our analysis using extensive experiments over Wikipedia articles and queries from a real-world search engine query log.
Our results indicate that AAM generally returns optimal or near-optimal conceptual designs that are more effective than the solutions provided by APM in most cases. Since the precise values of the input parameters for APM and AAM may not be available at design time, we analyze the sensitivity of AAM to estimation errors in its input parameters. Our results show that, using input parameters computed over small samples of the collection, AAM generally returns the same answers as when it has full information about the values of its input parameters.
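
The budgeted selection stated in this abstract (choose a subset of concepts within a fixed budget that helps queries the most) has the shape of a knapsack problem. The sketch below is neither APM nor AAM: it is a plain 0/1 knapsack dynamic program, and it assumes integer annotation costs and a benefit that simply adds up across concepts, which the abstract does not claim. The function name `select_concepts` and the example concepts, costs, and benefits are hypothetical.

```python
from typing import Dict, List, Tuple


def select_concepts(costs: Dict[str, int],
                    benefits: Dict[str, float],
                    budget: int) -> Tuple[float, List[str]]:
    """Pick concepts maximizing total benefit within the budget.

    Assumes integer costs and additive benefits (illustrative assumptions only).
    Runs in O(number of concepts * budget) time, i.e. pseudo-polynomial.
    """
    # best[b] holds (total benefit, chosen concepts) for a budget of b.
    best: List[Tuple[float, List[str]]] = [(0.0, [])] * (budget + 1)
    for name in sorted(costs):
        cost, gain = costs[name], benefits[name]
        # Walk budgets downward so each concept is chosen at most once.
        for b in range(budget, cost - 1, -1):
            candidate = best[b - cost][0] + gain
            if candidate > best[b][0]:
                best[b] = (candidate, best[b - cost][1] + [name])
    return best[budget]


# Hypothetical concepts with made-up annotation costs and query benefits.
example_costs = {"person": 3, "city": 2, "disease": 4}
example_benefits = {"person": 5.0, "city": 3.0, "disease": 6.0}
total, chosen = select_concepts(example_costs, example_benefits, budget=5)
```

With a budget of 5, annotating "city" and "person" together fits exactly, while "disease" alone leaves budget unused; a non-additive benefit function would need a different formulation.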

ScholarsArchive@OSU

Cost-Effective Conceptual Design for Information Extraction

Author: Ali Vakilian
Anderson Michael
Arash Termehchy
Cafarella Michael
Chakrabarti Soumen
Doan Anhai
Finin Tim
Garciamolina Hector
Graupmann Jens
Huang Jian
Kowalkiewicz Marek
Marianne Winslett
Riloff Ellen
Schenkel Ralf
Shen Warren
Yodsawalai Chodpathumwan
Zwol Roelof Van
Publication venue: 'Association for Computing Machinery (ACM)'

Crossref