Synergy of Database Techniques and Machine Learning Models for String Similarity Search and Join
Valentine: Evaluating Matching Techniques for Dataset Discovery
Data scientists today search large data lakes to discover and integrate
datasets. In order to bring together disparate data sources, dataset discovery
methods rely on some form of schema matching: the process of establishing
correspondences between datasets. Traditionally, schema matching has been used
to find matching pairs of columns between a source and a target schema.
However, the use of schema matching in dataset discovery methods differs from
its original use. Nowadays schema matching serves as a building block for
indicating and ranking inter-dataset relationships. Surprisingly, although a
discovery method's success relies highly on the quality of the underlying
matching algorithms, the latest discovery methods employ existing schema
matching algorithms in an ad-hoc fashion due to the lack of openly-available
datasets with ground truth, reference method implementations, and evaluation
metrics. In this paper, we aim to rectify the problem of evaluating the
effectiveness and efficiency of schema matching methods for the specific needs
of dataset discovery. To this end, we propose Valentine, an extensible
open-source experiment suite to execute and organize large-scale automated
matching experiments on tabular data. Valentine includes implementations of
seminal schema matching methods that we either implemented from scratch (due to
the absence of open-source code) or imported from open repositories. The
contributions of Valentine are: i) the definition of four schema matching
scenarios as encountered in dataset discovery methods, ii) a principled dataset
fabrication process tailored to the scope of dataset discovery methods, and
iii) the most comprehensive evaluation of schema matching techniques to date,
offering insight into the strengths and weaknesses of existing techniques,
which can serve as a guide for employing schema matching in future dataset
discovery methods.
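To make the schema matching building block concrete, the sketch below ranks column pairs between a source and a target table by the Jaccard similarity of their value sets, one simple instance-based matcher of the kind such suites evaluate. The tables, column names, and the choice of Jaccard similarity are illustrative assumptions, not taken from the Valentine suite itself.

```python
# Hypothetical instance-based schema matching sketch: score every
# (source column, target column) pair by Jaccard similarity of cell values,
# then rank pairs so the best candidate correspondences come first.

def jaccard(a, b):
    """Jaccard similarity between two collections of cell values."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def match_columns(source, target):
    """Return (source_col, target_col, score) triples, best matches first."""
    scores = [
        (s_col, t_col, jaccard(s_vals, t_vals))
        for s_col, s_vals in source.items()
        for t_col, t_vals in target.items()
    ]
    return sorted(scores, key=lambda x: x[2], reverse=True)

# Toy tables with overlapping but differently named columns.
source = {"city": ["Delft", "Leiden", "Utrecht"],
          "pop": [103000, 125000, 362000]}
target = {"municipality": ["Delft", "Utrecht", "Gouda"],
          "inhabitants": [103000, 362000, 75000]}

for s, t, score in match_columns(source, target):
    print(f"{s} -> {t}: {score:.2f}")
```

Real matchers combine such instance evidence with name, type, and distribution signals; the ranking-of-pairs output shape is what dataset discovery methods consume.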
Automating data preparation with statistical analysis
Data preparation is the process of transforming raw data into a clean and consumable format. It is widely known as the bottleneck to extracting value and insights from data, due to the number of possible tasks in the pipeline and the factors that can largely affect the results, such as human expertise, application scenarios, and solution methodology. Researchers and practitioners have devised a great variety of techniques and tools over the decades, yet many of them still place a significant burden on users to configure suitable input rules and parameters. In this thesis, with the goal of reducing manual human effort, we explore using the power of statistical analysis techniques to automate three subtasks in the data preparation pipeline: data enrichment, error detection, and entity matching. Statistical analysis is the process of discovering underlying patterns and trends in data and deducing properties of an underlying probability distribution from a sample, for example, by testing hypotheses and deriving estimates. We first discuss CrawlEnrich, which automatically figures out the queries for data enrichment via web API data by estimating the potential benefit of issuing a certain query. Then we study how to derive reusable error detection configuration rules from a web table corpus, so that end users get results with no effort. Finally, we introduce AutoML-EM, which aims to automate the entity matching model development process. Entity matching is the task of finding records that refer to the same real-world entity. Our work provides powerful angles for automating various data preparation steps, and we conclude this thesis by discussing future directions.
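The entity matching task the abstract automates can be illustrated with a toy baseline: normalize record strings and pair those whose similarity clears a threshold. AutoML-EM itself trains matching models automatically; the matcher, records, and threshold below are made-up illustrations of the task, not the thesis's method.

```python
# Toy entity matching sketch (illustrative only): two records are declared
# to describe the same real-world entity when their normalized strings are
# sufficiently similar under difflib's sequence similarity ratio.
from difflib import SequenceMatcher

def normalize(s):
    """Lowercase and collapse whitespace before comparison."""
    return " ".join(s.lower().split())

def similarity(a, b):
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def match_entities(left, right, threshold=0.85):
    """Return cross-table record pairs judged to be the same entity."""
    return [
        (l, r)
        for l in left
        for r in right
        if similarity(l, r) >= threshold
    ]

left = ["Apple Inc.", "Microsoft Corporation"]
right = ["apple inc.", "Alphabet Inc."]
print(match_entities(left, right))
```

The threshold is exactly the kind of hand-tuned parameter that automated approaches such as AutoML-EM aim to remove from the user's hands.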
ChatGPT Chemistry Assistant for Text Mining and Prediction of MOF Synthesis
We use prompt engineering to guide ChatGPT in the automation of text mining
of metal-organic frameworks (MOFs) synthesis conditions from diverse formats
and styles of the scientific literature. This effectively mitigates ChatGPT's
tendency to hallucinate information -- an issue that previously made the use of
Large Language Models (LLMs) in scientific fields challenging. Our approach
involves the development of a workflow implementing three different processes
for text mining, programmed by ChatGPT itself. All of them enable parsing,
searching, filtering, classification, summarization, and data unification with
different tradeoffs between labor, speed, and accuracy. We deploy this system
to extract 26,257 distinct synthesis parameters pertaining to approximately 800
MOFs sourced from peer-reviewed research articles. This process incorporates
our ChemPrompt Engineering strategy to instruct ChatGPT in text mining,
resulting in impressive precision, recall, and F1 scores of 90-99%.
Furthermore, with the dataset built by text mining, we constructed a
machine-learning model with over 86% accuracy in predicting MOF experimental
crystallization outcomes and preliminarily identifying important factors in MOF
crystallization. We also developed a reliable data-grounded MOF chatbot to
answer questions on chemical reactions and synthesis procedures. Given that the
process of using ChatGPT reliably mines and tabulates diverse MOF synthesis
information in a unified format, while using only narrative language requiring
no coding expertise, we anticipate that our ChatGPT Chemistry Assistant will be
very useful across various other chemistry sub-disciplines.

Comment: Published in the Journal of the American Chemical Society (2023); 102
pages (18-page manuscript, 84 pages of supporting information).
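The precision, recall, and F1 scores of 90-99% that the abstract reports are related by a fixed formula: F1 is the harmonic mean of precision and recall. A small worked sketch, with extraction counts invented purely for illustration:

```python
# Precision/recall/F1 from true-positive, false-positive, and false-negative
# counts. The counts below are made up; the abstract reports only the
# resulting scores (90-99%), not the underlying counts.

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)          # fraction of extractions that are correct
    recall = tp / (tp + fn)             # fraction of true parameters extracted
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=5)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because it is a harmonic mean, F1 stays in the 90s only when both precision and recall do, which is why reporting all three together is informative.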
End-to-End Entity Resolution for Big Data: A Survey
One of the most important tasks for improving data quality and the
reliability of data analytics results is Entity Resolution (ER). ER aims to
identify different descriptions that refer to the same real-world entity, and
remains a challenging problem. While previous works have studied specific
aspects of ER (and mostly in traditional settings), in this survey, we provide
for the first time an end-to-end view of modern ER workflows, and of the novel
aspects of entity indexing and matching methods in order to cope with more than
one of the Big Data characteristics simultaneously. We present the basic
concepts, processing steps and execution strategies that have been proposed by
different communities, i.e., database, semantic Web and machine learning, in
order to cope with the loose structuredness, extreme diversity, high speed and
large scale of entity descriptions used by real-world applications. Finally, we
provide a synthetic discussion of the existing approaches, and conclude with a
detailed presentation of open research directions.
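A core entity indexing idea covered by such surveys is blocking: rather than comparing every pair of entity descriptions, only descriptions that share some signature are compared. The sketch below shows token blocking, a common baseline; the records and the choice of whitespace tokens as blocking keys are illustrative assumptions.

```python
# Token blocking sketch for entity resolution: records sharing any token
# land in the same block, and only co-blocked pairs become match candidates,
# cutting down the quadratic number of comparisons.
from collections import defaultdict
from itertools import combinations

def token_blocking(records):
    """Map each token to the set of record ids whose text contains it."""
    blocks = defaultdict(set)
    for rid, text in records.items():
        for token in text.lower().split():
            blocks[token].add(rid)
    return blocks

def candidate_pairs(blocks):
    """Collect distinct record pairs that co-occur in at least one block."""
    pairs = set()
    for rids in blocks.values():
        pairs.update(combinations(sorted(rids), 2))
    return pairs

records = {
    1: "John Smith London",
    2: "J. Smith London UK",
    3: "Maria Garcia Madrid",
}
print(candidate_pairs(token_blocking(records)))
```

Here only records 1 and 2 share tokens, so one candidate pair survives instead of all three possible pairs; the expensive matching step then runs only on candidates.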
Development of Integrated Machine Learning and Data Science Approaches for the Prediction of Cancer Mutation and Autonomous Drug Discovery of Anti-Cancer Therapeutic Agents
Few technological ideas have captivated the minds of biochemical researchers to the degree that machine learning (ML) and artificial intelligence (AI) have. Over the last few years, advances in the ML field have driven the design of new computational systems that improve with experience and are able to model increasingly complex chemical and biological phenomena. In this dissertation, we capitalize on these achievements and use machine learning to study drug receptor sites and design drugs to target these sites. First, we analyze the significance of various single nucleotide variations and assess their rate of contribution to cancer. Following that, we use a portfolio of machine learning and data science approaches to design new drugs to target protein kinase inhibitors. We show that these techniques exhibit strong promise in aiding cancer research and drug discovery.