Assessing and Remedying Coverage for a Given Dataset
Data analysis impacts virtually every aspect of our society today. Often,
this analysis is performed on an existing dataset, possibly collected through a
process that the data scientists had limited control over. The existing data
analyzed may not include the complete universe, but it is expected to cover the
diversity of items in the universe. Lack of adequate coverage in the dataset
can result in undesirable outcomes such as biased decisions and algorithmic
racism, as well as creating vulnerabilities such as opening up room for
adversarial attacks.
In this paper, we assess the coverage of a given dataset over multiple
categorical attributes. We first provide efficient techniques for traversing
the combinatorial explosion of value combinations to identify any regions of
attribute space not adequately covered by the data. Then, we determine the
least amount of additional data that must be obtained to resolve this lack of
adequate coverage. We confirm the value of our proposal through both
theoretical analyses and comprehensive experiments on real data.
Comment: in ICDE 201
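The core task described above, finding regions of the categorical attribute space not adequately covered, can be illustrated with a brute-force sketch. This is not the paper's algorithm (whose contribution is pruning the combinatorial space efficiently); it simply enumerates every value combination over every attribute subset and flags those with fewer than a threshold number of records. The function name, data layout, and threshold are illustrative assumptions.

```python
from itertools import product, combinations
from collections import Counter

def uncovered_patterns(records, attributes, domains, threshold=1):
    """Flag attribute-value combinations with fewer than `threshold` records.

    Brute-force sketch: enumerates every subset of attributes and every
    value combination over their domains, counting matching records.
    """
    uncovered = []
    for k in range(1, len(attributes) + 1):
        for attrs in combinations(attributes, k):
            # Count how many records realize each value combination.
            counts = Counter(tuple(r[a] for a in attrs) for r in records)
            for values in product(*(domains[a] for a in attrs)):
                if counts[values] < threshold:
                    uncovered.append(dict(zip(attrs, values)))
    return uncovered

records = [
    {"gender": "F", "race": "A"},
    {"gender": "F", "race": "B"},
    {"gender": "M", "race": "A"},
]
domains = {"gender": ["F", "M"], "race": ["A", "B"]}
print(uncovered_patterns(records, ["gender", "race"], domains))
# The (M, B) region has no records, so it is reported as uncovered.
```

Even this toy shows why naive traversal explodes: with d attributes the loop visits every one of the 2^d - 1 attribute subsets, which is exactly what the paper's pruning techniques avoid.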
Democratizing Self-Service Data Preparation through Example Guided Program Synthesis
The majority of real-world data we can access today have one thing in common: they are not immediately usable in their original state. Trapped in a swamp of data usability issues such as non-standard data formats and heterogeneous data sources, most data analysts and machine learning practitioners have to burden themselves with "data janitor" work, writing ad-hoc Python, Perl, or SQL scripts, which is tedious and inefficient. It is estimated that data scientists and analysts typically spend 80% of their time preparing data, a significant amount of human effort that could be redirected to better goals. In this dissertation, we address this problem by harnessing knowledge such as examples and other useful hints from the end user. We develop program synthesis techniques guided by heuristics and machine learning, which make data preparation less painful and more efficient for data users, particularly those with little to no programming experience.
Data transformation, also called data wrangling or data munging, is an important task in data preparation, seeking to convert data from one format to a different (often more structured) format. Our system Foofah shows that allowing end users to describe their desired transformation, through providing small input-output transformation examples, can significantly reduce the overall user effort. The underlying program synthesizer can often succeed in finding meaningful data transformation programs within a reasonably short amount of time. Our second system, CLX, demonstrates that sometimes the user does not even need to provide complete input-output examples, but only label ones that are desirable if they exist in the original dataset. The system is still capable of suggesting reasonable and explainable transformation operations to fix the non-standard data format issue in a dataset full of heterogeneous data with varied formats.
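The example-guided synthesis idea behind systems like Foofah can be sketched in miniature: search over compositions of a small operator library for a program consistent with every user-supplied input-output pair. The operator set, search depth, and function names below are illustrative assumptions, not the actual Foofah operator space.

```python
from itertools import product

# Toy operator library; the real systems search a much richer space
# of table-transformation operators.
OPS = {
    "strip": str.strip,
    "lower": str.lower,
    "title": str.title,
    "first_token": lambda s: s.split()[0],
}

def synthesize(examples, max_len=3):
    """Search operator sequences of increasing length for one that is
    consistent with all input-output examples (toy example-guided
    program synthesis)."""
    for length in range(1, max_len + 1):
        for prog in product(OPS, repeat=length):
            def run(s, prog=prog):
                for op in prog:
                    s = OPS[op](s)
                return s
            # Accept the first program consistent with every example.
            if all(run(i) == o for i, o in examples):
                return prog
    return None

examples = [("  ALICE SMITH ", "Alice"), ("bob jones", "Bob")]
print(synthesize(examples))  # ('title', 'first_token')
```

The enumeration is exponential in program length, which is why practical synthesizers rely on heuristic pruning and learned guidance, as the dissertation describes.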
PRISM, our third system, targets the data preparation task of data integration, i.e., combining multiple relations to formulate a desired schema. PRISM allows the user to describe the target schema using not only high-resolution (precise) constraints in the form of complete example data records in the target schema, but also (imprecise) constraints of varied resolutions, such as incomplete data record examples with missing values, value ranges, or multiple possible values in each element (cell), so as to require less familiarity with the database contents on the part of the end user.
PhD dissertation, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/163059/1/markjin_1.pd
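The varied-resolution constraints described above can be made concrete with a small sketch: each cell of an example record is either an exact value, unknown, a value range, or a set of allowed values, and a candidate record matches when every cell is satisfied. The encoding (None for unknown, tuples for ranges, sets for alternatives) is an illustrative assumption, not PRISM's actual constraint language.

```python
def matches(row, constraint):
    """Check a candidate record against one imprecise example record.

    Each constraint cell is None (unknown), an exact value, a set of
    allowed values, or a (lo, hi) range.
    """
    for col, spec in constraint.items():
        val = row.get(col)
        if spec is None:
            continue                      # missing value: anything matches
        if isinstance(spec, set):
            if val not in spec:           # multiple possible values
                return False
        elif isinstance(spec, tuple):
            lo, hi = spec
            if not (lo <= val <= hi):     # value range
                return False
        elif val != spec:                 # exact value
            return False
    return True

row = {"name": "Ada", "age": 36, "dept": "CS"}
constraint = {"name": None, "age": (30, 40), "dept": {"CS", "EE"}}
print(matches(row, constraint))  # True: age in range, dept allowed
```

Lower-resolution cells match more candidate records, which is exactly why such constraints demand less precise knowledge of the database contents from the user.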
Composition and protein precipitation capacity of condensed tannins in purple prairie clover (Dalea purpurea Vent.)
The objectives of this study were to determine the concentration and composition of condensed tannins (CT) in different tissues of purple prairie clover (PPC; Dalea purpurea Vent.) at different maturities and to determine their protein-precipitating capacity. The compositions of CT were elucidated after thiolysis with benzyl mercaptan followed by high-performance liquid chromatography and ¹H–¹³C HSQC NMR spectroscopy. Results indicated that PPC flowering heads contained the highest CT concentration. Purple prairie clover CT consisted mainly of epicatechin (EC) and epigallocatechin (EGC) subunits. Condensed tannins in leaves were composed of more EC and less EGC than CT in stems and flowering heads at both early flowering and late flowering head stages. The mean degree of polymerization was the highest for CT in stems and increased with maturity. Condensed tannins isolated from PPC leaves at the early flowering head stage exhibited the greatest biological activity in terms of protein precipitation. Overall, condensed tannins in PPC were predominantly procyanidins, and their concentration and composition varied among plant tissues and with maturity.
Components of the Hematopoietic Compartments in Tumor Stroma and Tumor-Bearing Mice
Solid tumors are composed of cancerous cells and non-cancerous stroma. A better understanding of the tumor stroma could lead to new therapeutic applications. However, the exact compositions and functions of the tumor stroma are still largely unknown. Here, using a Lewis lung carcinoma implantation mouse model, we examined the hematopoietic compartments in tumor stroma and tumor-bearing mice. Different lineages of differentiated hematopoietic cells existed in tumor stroma, with the percentage of myeloid cells increasing and the percentage of lymphoid and erythroid cells decreasing over time. Using bone marrow reconstitution analysis, we showed that the tumor stroma also contained functional hematopoietic stem cells. All hematopoietic cells in the tumor stroma originated from bone marrow. In the bone marrow and peripheral blood of tumor-bearing mice, myeloid populations increased, lymphoid and erythroid populations decreased, and the number of hematopoietic stem cells markedly increased with time. To investigate the function of hematopoietic cells in tumor stroma, we co-implanted various types of hematopoietic cells with cancer cells. We found that total hematopoietic cells in the tumor stroma promoted tumor development. Furthermore, the growth of the primary implanted Lewis lung carcinomas and their metastasis were significantly decreased in mice reconstituted with IGF type I receptor-deficient hematopoietic stem cells, indicating that IGF signaling in the hematopoietic tumor stroma supports tumor outgrowth. These results reveal that hematopoietic cells in the tumor stroma regulate tumor development and that tumor progression significantly alters the host hematopoietic compartment.
A Self-Cloning Agents Based Model for High-Performance Mobile-Cloud Computing
The rise of the mobile-cloud computing paradigm in recent years has enabled mobile devices with processing power and battery life limitations to achieve complex tasks in real-time. While mobile-cloud computing is promising to overcome the limitations of mobile devices for real-time computing, the lack of frameworks compatible with standard technologies and techniques for dynamic performance estimation and program component relocation makes it harder to adopt mobile-cloud computing at large. Most of the available frameworks rely on strong assumptions such as the availability of a full clone of the application code and negligible execution time in the cloud. In this paper, we present a dynamic computation offloading model for mobile-cloud computing, based on autonomous agents. Our approach does not impose any requirements on the cloud platform other than providing isolated execution containers, and it alleviates the management burden of offloaded code by the mobile platform using stateful, autonomous application partitions. We also investigate the effects of different cloud runtime environment conditions on the performance of mobile-cloud computing, and present a simple and low-overhead dynamic makespan estimation model integrated into autonomous agents to enhance them with self-performance evaluation in addition to self-cloning capabilities. The proposed performance profiling model is used in conjunction with a cloud resource optimization scheme to ensure optimal performance. Experiments with two mobile applications demonstrate the effectiveness of the proposed approach for high-performance mobile-cloud computing.
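The kind of decision a makespan-estimation model feeds can be sketched simply: offload a task only when the estimated remote execution time plus data-transfer time beats local execution. This is a minimal illustration of the trade-off, not the paper's profiling model; the function name and parameters are assumptions.

```python
def should_offload(local_time_s, cloud_time_s, input_bytes, output_bytes,
                   bandwidth_bps):
    """Offload only if the estimated remote makespan (cloud execution
    plus transferring inputs and outputs) beats local execution."""
    transfer_s = (input_bytes + output_bytes) * 8 / bandwidth_bps
    return cloud_time_s + transfer_s < local_time_s

# Fast network (50 Mbps): offloading pays off.
print(should_offload(2.0, 0.5, 1_000_000, 10_000, 50_000_000))  # True
# Slow network (2 Mbps): transfer cost dominates.
print(should_offload(2.0, 0.5, 1_000_000, 10_000, 2_000_000))   # False
```

Because bandwidth and cloud load vary at runtime, a static decision like this quickly goes stale, which is the motivation for the dynamic, agent-integrated estimation the paper proposes.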
