The growth of materials data and data-driven informatics has greatly accelerated
the discovery and design of materials. While data-driven models have advanced
significantly, the quality of data resources is less studied despite its
large impact on model performance. In this work, we focus on data bias arising
from uneven coverage of materials families in existing knowledge. Observing
that diversity differs among crystal systems in common materials databases, we
propose an information entropy-based metric for measuring this bias. To
mitigate the bias, we develop an entropy-targeted active learning (ET-AL)
framework, which guides the acquisition of new data to improve the diversity of
underrepresented crystal systems. We demonstrate the capability of ET-AL for
bias mitigation and the resulting improvement in downstream machine learning
models. This approach is broadly applicable to data-driven materials discovery,
including autonomous data acquisition and dataset trimming to reduce bias, as
well as data-driven informatics in other scientific domains.

Comment: 35 pages, 13 figures, under revie
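
To make the idea of an entropy-based diversity measure concrete, the following is a minimal sketch and not the metric defined in this work. It assumes that diversity within a crystal system can be approximated by the Shannon entropy of a single scalar property after fixed-width binning. The function names (`shannon_entropy`, `entropy_by_group`, `least_diverse`), the bin width, and the toy records are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def shannon_entropy(values, bin_width):
    """Shannon entropy (in nats) of values after fixed-width binning.

    Assumption: a simple histogram estimate stands in for whatever
    entropy estimator the actual metric uses.
    """
    counts = Counter(math.floor(v / bin_width) for v in values)
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def entropy_by_group(records, bin_width=0.5):
    """Group (crystal_system, property_value) pairs and score each group."""
    groups = defaultdict(list)
    for system, value in records:
        groups[system].append(value)
    return {s: shannon_entropy(v, bin_width) for s, v in groups.items()}


def least_diverse(entropies):
    """Return the group with the lowest entropy (lowest diversity)."""
    return min(entropies, key=entropies.get)


# Toy, made-up records: (crystal system, scalar property value).
records = [
    ("cubic", 0.1), ("cubic", 1.2), ("cubic", 2.3), ("cubic", 3.4),
    ("triclinic", 0.1), ("triclinic", 0.2),
    ("triclinic", 0.3), ("triclinic", 0.4),
]

entropies = entropy_by_group(records)
target = least_diverse(entropies)
print(entropies, target)
```

In this toy data the hypothetical "triclinic" group has all of its values in one bin, so it scores lowest. An entropy-targeted acquisition loop could, under these assumptions, prioritize new data points for such a group; how ET-AL actually selects candidates is not specified in this abstract.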