Non-Parametric Approximations for Anisotropy Estimation in Two-dimensional Differentiable Gaussian Random Fields
Spatially referenced data often have autocovariance functions with elliptical
isolevel contours, a property known as geometric anisotropy. The anisotropy
parameters include the tilt of the ellipse (orientation angle) with respect to
a reference axis and the aspect ratio of the principal correlation lengths.
Since these parameters are unknown a priori, sample estimates are needed to
define suitable spatial models for the interpolation of incomplete data. The
distribution of the anisotropy statistics is determined by a non-Gaussian
sampling joint probability density. By means of analytical calculations, we
derive an explicit expression for the joint probability density function of the
anisotropy statistics for Gaussian, stationary and differentiable random
fields. Based on this expression, we obtain an approximate joint density which
we use to formulate a statistical test for isotropy. The approximate joint
density is independent of the autocovariance function and provides conservative
probability and confidence regions for the anisotropy parameters. We validate
the theoretical analysis by means of simulations using synthetic data, and we
illustrate the detection of anisotropy changes with a case study involving
background radiation exposure data. The approximate joint density provides (i)
a stand-alone approximate estimate of the anisotropy statistics distribution,
(ii) informed initial values for maximum likelihood estimation, and (iii) a
useful prior for Bayesian anisotropy inference.
Comment: 39 pages; 8 figures
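To make the two anisotropy parameters named in the abstract concrete, the following is a minimal sketch of a covariance function with elliptical isolevel contours. It is not the paper's estimator. The Gaussian covariance form, the function names, and the convention that the aspect ratio equals the second correlation length divided by the first are all assumptions chosen for illustration.

```python
import math


def anisotropic_lag(dx, dy, theta, xi1, ratio):
    """Dimensionless lag for a geometrically anisotropic covariance.

    theta : orientation angle of the first principal axis (radians)
    xi1   : correlation length along the first principal axis
    ratio : aspect ratio, assumed here to mean xi2 / xi1
    """
    # Rotate the lag vector into the principal-axes frame.
    u = dx * math.cos(theta) + dy * math.sin(theta)
    v = -dx * math.sin(theta) + dy * math.cos(theta)
    xi2 = ratio * xi1
    # Rescale each component by its own correlation length.
    return math.hypot(u / xi1, v / xi2)


def gaussian_covariance(dx, dy, theta, xi1, ratio, variance=1.0):
    """Gaussian covariance (a differentiable model) with elliptical contours."""
    h = anisotropic_lag(dx, dy, theta, xi1, ratio)
    return variance * math.exp(-h * h)
```

With `ratio = 1` the contours are circles and the orientation angle has no effect, which corresponds to the isotropic case that the statistical test in the abstract takes as its null hypothesis.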
Automatic identification of relevant chemical compounds from patents
In commercial research and development projects, public disclosure of new chemical
compounds often takes place in patents. Only a small proportion of these compounds
are published in journals, usually a few years after the patent. Patent authorities make
available the patents but do not provide systematic continuous chemical annotations.
Content databases such as Elsevier’s Reaxys provide such services, mostly based on
manual excerption, which is time-consuming and costly. Automatic text-mining
approaches help overcome some of the limitations of the manual process. Different
text-mining approaches exist to extract chemical entities from patents. The majority
of them have been developed using sub-sections of patent documents and focus on
mentions of compounds. Less attention has been given to relevancy of a compound in a
patent. Relevancy of a compound to a patent is based on the patent’s context. A relevant
compound plays a major role within a patent. Identification of relevant compounds
reduces the size of the extracted data and improves the usefulness of patent resources
(e.g. supports identifying the main compounds). Annotators of databases like Reaxys
only annotate relevant compounds. In this study, we design an automated system
that extracts chemical entities from patents and classifies their relevance. The
gold-standard set contained 18 789 chemical entity annotations. Of these, 10% were
relevant compounds, 88% were irrelevant and 2% were equivocal. Our compound recognition
system was based on proprietary tools. The performance (F-score) of the system on
compound recognition was 84% on the development set and 86% on the test set. The
relevancy classification system had an F-score of 86% on the development set and 82%
on the test set. Our system can extract chemical compounds from patents and
classify their relevance with high performance. This enables the extension of the Reaxys
database by means of automation.
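The F-scores quoted above are presumably the usual harmonic mean of precision and recall, although the abstract does not state the exact variant. The sketch below shows that standard calculation from counts of true positives, false positives and false negatives. The function name and the counts in the usage note are hypothetical and are not taken from the study.

```python
def f_score(true_pos, false_pos, false_neg, beta=1.0):
    """F-beta score from confusion counts (beta=1 gives the balanced F-score)."""
    if true_pos == 0:
        # No correct predictions: precision and recall are both zero.
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For example, `f_score(86, 14, 14)` describes a hypothetical classifier whose precision and recall are both 0.86, so its F-score is also 0.86.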
PubChem chemical structure standardization
Background PubChem is a chemical information repository, consisting of three primary databases: Substance, Compound, and BioAssay. When individual data contributors submit chemical substance descriptions to Substance, the unique chemical structures are extracted and stored into Compound through an automated process called structure standardization. The present study describes the PubChem standardization approaches and analyzes them for their success rates, reasons that cause structures to be rejected, and modifications applied to structures during the standardization process. Furthermore, the PubChem standardization is compared to the structure normalization of the IUPAC International Chemical Identifier (InChI) software, as manifested by conversion of the InChI back into a chemical structure.
Results The observed rejection rate for substances processed by PubChem standardization was 0.36%, which is predominantly attributed to structures with invalid atom valences that cannot be readily corrected without additional information from contributors. Of all structures that pass standardization, 44% are modified in the process, reducing the count of unique structures from 53,574,724 in Substance to 45,808,881 in Compound as identified by de-aromatized canonical isomeric SMILES. Even though the processing time is very low on average (only 0.4% of structures have individual standardization time above 0.1 s), total standardization time is completely dominated by edge cases: 90% of the time to standardize all structures in PubChem Substance is spent on the 2.05% of structures with the highest individual standardization time. It is worth noting that 60% of the structures obtained from PubChem structure standardization are not identical to the chemical structure resulting from the InChI (primarily due to preferences for a different tautomeric form).
Conclusions Standardization of chemical structures is complicated by the diversity of chemical information and its representation approaches. The PubChem standardization is an effective and efficient tool to account for molecular diversity and to eliminate invalid/incomplete structures. Further development will concentrate on improved tautomer consideration and an expanded stereocenter definition. Modifications are difficult to thoroughly validate, with slight changes often affecting many thousands of structures and various edge cases. The PubChem structure standardization service is accessible as a public resource (https://pubchem.ncbi.nlm.nih.gov/standardize), and via programmatic interfaces.
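The Results observation that a small share of structures accounts for most of the total standardization time can be checked on any list of per-item timings. The sketch below is a generic illustration and is not PubChem code. The function name and the timings in the usage note are hypothetical.

```python
def slowest_share_for_time_fraction(times, fraction=0.9):
    """Smallest share of items (slowest first) that covers `fraction` of total time."""
    total = sum(times)
    if not times or total <= 0:
        raise ValueError("need at least one positive timing")
    running = 0.0
    # Walk through the items from slowest to fastest, accumulating time.
    for count, t in enumerate(sorted(times, reverse=True), start=1):
        running += t
        if running >= fraction * total:
            return count / len(times)
    return 1.0
```

For instance, with 98 hypothetical items taking 0.01 s each and 2 items taking 5 s each, `slowest_share_for_time_fraction(times, 0.9)` returns 0.02, meaning that 2% of the items account for at least 90% of the total time.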