Data Mining Chemistry and Crystal Structure
The availability of large amounts of data generated by high-throughput computing and experimentation has spurred interest in the application of machine learning techniques to materials science. Machine learning of materials behavior requires feature vectors that capture the compositional or structural information influencing a target property. We present methods for assessing the similarity of compositions, substructures, and crystal structures. Similarity measures are important for the classification and clustering of data points, allowing for the organization of data and the prediction of materials properties.
Engineering and Applied Science
Proposed definition of crystal substructure and substructural similarity
There is a clear need for a practical and mathematically rigorous description of local structure in inorganic compounds so that structures and chemistries can be easily compared across large data sets. Here a method for decomposing crystal structures into substructures is given, and a similarity function between those substructures is defined. The similarity function is based on both geometric and chemical similarity. This construction allows for large-scale data mining of substructural properties, and the analysis of substructures and void spaces within crystal structures. The method is validated via the prediction of Li-ion intercalation sites for the oxides. Tested on databases of known Li-ion-containing oxides, the method reproduces all Li-ion sites in an oxide with a maximum of 4 incorrect guesses 80% of the time.
National Science Foundation (U.S.) (SI2-SSI Collaborative Research program Award OCI-1147503)
United States. Dept. of Energy. Office of Basic Energy Sciences (Grant EDCBEE
Crystal Structure Search with Random Relaxations Using Graph Networks
Materials design enables technologies critical to humanity, including
combating climate change with solar cells and batteries. Many properties of a
material are determined by its atomic crystal structure. However, prediction of
the atomic crystal structure for a given material's chemical formula is a
long-standing grand challenge that remains a barrier in materials design. We
investigate a data-driven approach to accelerating ab initio random structure
search (AIRSS), a state-of-the-art method for crystal structure search. We
build a novel dataset of random structure relaxations of Li-Si battery anode
materials using high-throughput density functional theory calculations. We
train graph neural networks to simulate relaxations of random structures. Our
model is able to find an experimentally verified structure of Li15Si4 it was
not trained on, and has potential for orders of magnitude speedup over AIRSS
when searching large unit cells and searching over multiple chemical
stoichiometries. Surprisingly, we find that data augmentation by adding
Gaussian noise improves both the accuracy and out-of-domain generalization of
our models.
Comment: Removed citations from the abstract, paper content is unchanged
Data-mined similarity function between material compositions
A new method for assessing the similarity of material compositions is described. A similarity measure is important for the classification and clustering of compositions. The similarity of the material compositions is calculated utilizing a data-mined ionic substitutional similarity based upon the probability with which two ions will substitute for each other within the same structure prototype. The method is validated via the prediction of crystal structure prototypes for oxides from the Inorganic Crystal Structure Database, selecting the correct prototype from a list of known prototypes within five guesses 75% of the time. It performs particularly well on the quaternary oxides, selecting the correct prototype from a list of known prototypes on the first guess 65% of the time.
United States. Dept. of Energy (Contract DE-FG02-96ER45571)
United States. Office of Naval Research (Contract N00014-11-1-0212)
National Science Foundation (U.S.) (Cyber-enabled Discovery and Innovation Contract ECCS-0941043
Discovery of complex oxides via automated experiments and data science
This dataset is licensed under the Creative Commons Attribution 4.0 license (CC-BY-4.0). See https://creativecommons.org/licenses/by/4.0/ for more information.
If using this dataset, please cite https://doi.org/10.1073/pnas.2106042118
We've released data from 6 print sessions, comprising 173 plates, 131 quaternary oxide systems, 6,918,024 individual composition samples, and 376,752 distinct compositions. While the tenfold reproductions within each plate are well controlled, uncontrolled variables (printhead age, etc.) may lead to poorer consistency between print sessions.
The data exists in four directories and one metadata file. Each directory contains one type of data, with one *.csv file per printed plate.
i. The data in ten_replicas/ consists of optical transmission data, with one row per printed patch on a plate. The column headers are:
ExpID: an integer experiment ID for the printed patch on the plate.
row, col: The row and the column coordinates of the printed patch in the microscope image
signal_#: The measurement of ɑ, the optical transmission spectrum of the printed patch, at a given wavelength. # ranges from 0 to 8, inclusive, indicating transmission spectra at the following wavelengths: 375, 395, 455, 530, 590, 617, 660, 735, & 850 nm.
plate: The integer plate identifier.
line: An integer identifier of the composition gradient that was printed.
line_experiment_id: An integer identifier of the composition sample along the composition gradient.
replica: An integer identifier of the replica # of the printed line.
metal: Each plate will have up to six metal column headers, where the possible metals include: ['Ce', 'Co', 'Cu', 'Fe', 'In', 'Mg', 'Ni', 'Sn', 'Ta', 'Y']. The metal columns sum to 1, indicating the ratios of metals printed.
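The column layout above can be exercised with a short Python sketch. It assumes the plate file is an ordinary comma-separated file with a header row; the sample row is made up for illustration and is not taken from the dataset, and the helper names parse_patch and is_normalized are hypothetical.

```python
import csv
import io

# Wavelengths (nm) for signal_0 .. signal_8, as listed in this README.
WAVELENGTHS_NM = [375, 395, 455, 530, 590, 617, 660, 735, 850]
# Metals that may appear as column headers, as listed in this README.
METALS = ["Ce", "Co", "Cu", "Fe", "In", "Mg", "Ni", "Sn", "Ta", "Y"]


def parse_patch(row):
    """Split one CSV row (a dict of strings) into a spectrum and a composition."""
    spectrum = {nm: float(row[f"signal_{i}"]) for i, nm in enumerate(WAVELENGTHS_NM)}
    # Only the metals present on this plate appear as columns.
    composition = {m: float(row[m]) for m in METALS if m in row}
    return spectrum, composition


def is_normalized(composition, tol=1e-6):
    """Check that the metal ratios sum to 1 within a tolerance."""
    return abs(sum(composition.values()) - 1.0) <= tol


# Made-up sample row (illustration only, not real data).
header = (["ExpID", "row", "col"]
          + [f"signal_{i}" for i in range(9)]
          + ["plate", "line", "line_experiment_id", "replica", "Co", "Fe", "Ni"])
values = (["0", "1", "2"]
          + ["0.1", "0.2", "0.3", "0.4", "0.5", "0.6", "0.7", "0.8", "0.9"]
          + ["7", "3", "12", "0", "0.25", "0.25", "0.5"])
SAMPLE_CSV = ",".join(header) + "\n" + ",".join(values) + "\n"

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
spectrum, composition = parse_patch(rows[0])
```

Keying the spectrum by wavelength rather than by channel index keeps downstream code independent of the signal_# numbering.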
ii. The data in aggregated_replicas/ consists of optical transmission data, with one row per tenfold aggregated patch on a plate. The column headers are:
signal_#: The measurement of ɑ, the optical transmission spectrum of the printed patch, at a given wavelength. # ranges from 0 to 8, inclusive, indicating transmission spectra at the following wavelengths: 375, 395, 455, 530, 590, 617, 660, 735, & 850 nm.
plate: The integer plate identifier.
line: An integer identifier of the composition gradient that was printed.
line_experiment_id: An integer identifier of the composition sample along the composition gradient.
metal: Each plate will have up to six metal column headers, where the possible metals include: ['Ce', 'Co', 'Cu', 'Fe', 'In', 'Mg', 'Ni', 'Sn', 'Ta', 'Y']. The metal columns sum to 1, indicating the ratios of metals printed.
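To relate the two directories, the sketch below groups replica rows by (plate, line, line_experiment_id) and averages each signal channel. This README does not state how the tenfold aggregation was actually computed, so the per-channel mean is purely an assumption for illustration; the function name aggregate_replicas and the sample rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean

N_SIGNALS = 9  # signal_0 .. signal_8


def aggregate_replicas(rows):
    """Average each signal channel over rows sharing (plate, line, line_experiment_id).

    ASSUMPTION: a plain mean; the dataset's real aggregation may differ.
    """
    groups = defaultdict(list)
    for row in rows:
        key = (row["plate"], row["line"], row["line_experiment_id"])
        groups[key].append([float(row[f"signal_{i}"]) for i in range(N_SIGNALS)])
    # zip(*spectra) regroups the values channel by channel.
    return {key: [mean(channel) for channel in zip(*spectra)]
            for key, spectra in groups.items()}


def make_row(plate, line, sample, replica, value):
    """Build a made-up replica row whose nine channels all share one value."""
    row = {"plate": plate, "line": line, "line_experiment_id": sample,
           "replica": replica}
    row.update({f"signal_{i}": str(value) for i in range(N_SIGNALS)})
    return row


# Two replicas of one composition sample and one replica of another.
sample_rows = [
    make_row("7", "3", "12", "0", 0.25),
    make_row("7", "3", "12", "1", 0.75),
    make_row("7", "3", "13", "0", 1.0),
]
aggregated = aggregate_replicas(sample_rows)
```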
iii. The data in mixture/ represents the outcome of a probabilistic model that a given composition can be explained by a mixture of at most 3 binary signals. There is one row per composition. The column headers are:
log_prob: The log of the probability that this composition is explainable by at most 3 binary signals.
metal: Each plate will have up to six metal column headers, where the possible metals include: ['Ce', 'Co', 'Cu', 'Fe', 'In', 'Mg', 'Ni', 'Sn', 'Ta', 'Y']. The metal columns sum to 1, indicating the ratios of metals in the composition.
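A small sketch of how log_prob might be used: convert it back to a probability and select compositions that the binary-mixture model explains poorly. It assumes log_prob is a natural logarithm (this README does not state the base), and the 0.05 cutoff, the function names, and the sample rows are hypothetical.

```python
import math


def explainable_probability(log_prob):
    """Convert log_prob to a probability. ASSUMPTION: natural logarithm."""
    return math.exp(log_prob)


def poorly_explained(rows, max_prob=0.05):
    """Return rows whose probability of being a binary mixture is below max_prob.

    The default cutoff is an arbitrary illustrative value.
    """
    return [row for row in rows
            if explainable_probability(float(row["log_prob"])) < max_prob]


# Made-up rows (illustration only, not real data).
sample_rows = [
    {"log_prob": "-0.1", "Co": "0.5", "Fe": "0.5"},
    {"log_prob": "-5.0", "Co": "0.2", "Fe": "0.8"},
]
selected = poorly_explained(sample_rows)
```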
iv. The data in phase_fits/ represents the outcome of a phase fitting model. There is one row per phase diagram. This data is meant to be read using the example colab. The column headers are:
residual: Float, the residual of the phase fit.
signal_type: This is either 'signal' or 'sigma', indicating the type of the phase fit (see paper).
discretization: The integer number of intervals we discretized the phase space into.
n_points: The number of internal points in the phase diagram. This is an integer between 1 and 5, inclusive.
metal_0, metal_1, metal_2: Three strings identifying the constituent metals of the phase diagram.
point_#_pos_0, point_#_pos_1: The coordinates of a phase point. # ranges between 0 and 7, inclusive. point_#_pos_0 gives the float amount of metal_0, and point_#_pos_1 gives the float amount of metal_1. The float amount of metal_2 can be inferred via 1 - (point_#_pos_0 + point_#_pos_1).
point_#_fitted_channel_X: The fitted optical absorption spectra of point_#. # is an integer between 0 and 7, inclusive. X is an integer between 0 and 8, inclusive, indicating the wavelength of the light absorbed.
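The example colab remains the intended way to read this data. As a rough illustration of the column semantics only, the sketch below recovers the full composition of one phase point, inferring metal_2 as 1 - (point_#_pos_0 + point_#_pos_1). It assumes the coordinate columns are named point_#_pos_0 and point_#_pos_1 as in the description; the function names and the sample row are hypothetical.

```python
N_CHANNELS = 9  # fitted_channel_0 .. fitted_channel_8


def phase_point_composition(row, index):
    """Return {metal: amount} for phase point `index` of one phase diagram row."""
    amount_0 = float(row[f"point_{index}_pos_0"])
    amount_1 = float(row[f"point_{index}_pos_1"])
    return {
        row["metal_0"]: amount_0,
        row["metal_1"]: amount_1,
        # The third amount is not stored; it follows from the other two.
        row["metal_2"]: 1.0 - (amount_0 + amount_1),
    }


def phase_point_spectrum(row, index):
    """Return the fitted spectrum of phase point `index`, ordered by channel."""
    return [float(row[f"point_{index}_fitted_channel_{x}"]) for x in range(N_CHANNELS)]


# Made-up row (illustration only, not real data).
sample_row = {"metal_0": "Co", "metal_1": "Fe", "metal_2": "Ni",
              "point_0_pos_0": "0.25", "point_0_pos_1": "0.5"}
sample_row.update({f"point_0_fitted_channel_{x}": "0.5" for x in range(N_CHANNELS)})

composition = phase_point_composition(sample_row, 0)
fitted = phase_point_spectrum(sample_row, 0)
```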
The files are publicly available for access via:
- the gsutil CLI tool at https://cloud.google.com/storage/docs/gsutil
- the tf.io.gfile APIs at https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile
- HTTP API: http://storage.googleapis.com/gresearch/metal-oxide-spectroscopy/path/to/file
This file, the README, is available at:
http://storage.googleapis.com/gresearch/metal-oxide-spectroscopy/README.txt
The metadata file is available at:
http://storage.googleapis.com/gresearch/metal-oxide-spectroscopy/metadata.csv, which lists all the plates available for download.
The plate data for each of the four data types listed above can be found at:
http://storage.googleapis.com/gresearch/metal-oxide-spectroscopy/data_type_subdir/plate.cs
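The URL layout above can be assembled with a small helper. This is a sketch only: the helper name plate_url is hypothetical, and the plate file name must be supplied by the caller (for example, from whatever metadata.csv lists), since the exact naming is not spelled out here.

```python
BASE_URL = "http://storage.googleapis.com/gresearch/metal-oxide-spectroscopy"
METADATA_URL = f"{BASE_URL}/metadata.csv"
# The four data directories described in this README.
DATA_TYPES = ("ten_replicas", "aggregated_replicas", "mixture", "phase_fits")


def plate_url(data_type, plate_file):
    """Build the HTTP URL for one plate file in one of the four data directories."""
    if data_type not in DATA_TYPES:
        raise ValueError(f"unknown data type: {data_type!r}")
    return f"{BASE_URL}/{data_type}/{plate_file}"


# "example_plate.csv" is a placeholder, not a real file name from the dataset.
example_url = plate_url("mixture", "example_plate.csv")
```

A URL built this way can then be passed to any HTTP client, such as urllib.request.urlopen from the Python standard library.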