Modeling reactivity to biological macromolecules with a deep multitask network
Most
small-molecule drug candidates fail before entering the market,
frequently because of unexpected toxicity. Often, toxicity is detected
only late in drug development, because many types of toxicities, especially
idiosyncratic adverse drug reactions (IADRs), are particularly hard
to predict and detect. Moreover, drug-induced liver injury (DILI)
is the most frequent reason drugs are withdrawn from the market and
causes 50% of acute liver failure cases in the United States. A common
mechanism often underlies many types of drug toxicities, including
both DILI and IADRs. Drugs are bioactivated by drug-metabolizing enzymes
into reactive metabolites, which then conjugate to sites in proteins
or DNA to form adducts. DNA adducts are often mutagenic and may alter
the reading and copying of genes and their regulatory elements, causing
gene dysregulation and even triggering cancer. Similarly, protein
adducts can disrupt their normal biological functions and induce harmful
immune responses. Unfortunately, reactive metabolites are not reliably
detected by experiments, and it is also expensive to test drug candidates
for potential to form DNA or protein adducts during the early stages
of drug development. In contrast, computational methods have the potential
to quickly screen for covalent binding potential, thereby flagging
problematic molecules and reducing the total number of necessary experiments.
Here, we train a deep convolutional neural network (the XenoSite
reactivity model) using literature data to accurately predict
both the sites and the probability of reactivity for molecules with glutathione,
cyanide, protein, and DNA. On the site level, cross-validated predictions
had area under the curve (AUC) performances of 89.8% for DNA and 94.4%
for protein. Furthermore, the model separated molecules electrophilically
reactive with DNA and protein from nonreactive molecules with cross-validated
AUC performances of 78.7% and 79.8%, respectively. At both the site
and molecule levels, the model significantly outperformed reactivity
indices derived from quantum simulations reported in the literature.
Moreover, we developed and applied
a selectivity score to assess preferential reactions with the macromolecules
as opposed to the common screening traps. For the entire data set
of 2803 molecules, this approach yielded totals of 257 (9.2%) and
227 (8.1%) molecules predicted to be reactive only with DNA and protein,
respectively, molecules that would therefore be missed by standard reactivity
screening experiments. Site-of-reactivity data are an underutilized
resource that can be used not only to predict whether molecules are reactive,
but also to show where they might be modified to reduce toxicity while
retaining efficacy. The XenoSite reactivity model is available at http://swami.wustl.edu/xenosite/p/reactivity
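As an aside on the metric used in this abstract: the reported AUC values can be computed via the rank (Mann-Whitney) formulation, the probability that a randomly chosen positive example outscores a randomly chosen negative one. A minimal sketch with hypothetical scores and labels, not the authors' evaluation code:

```python
def auc(scores, labels):
    """AUC via the rank (Mann-Whitney) formulation: the probability that
    a randomly chosen positive outscores a randomly chosen negative,
    counting ties as half a win."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranking gives 1.0, a random one about 0.5, matching how the 89.8% and 94.4% site-level figures should be read.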
Simple data-driven context-sensitive lemmatization
Lemmatization for languages with rich inflectional morphology is one of the basic, indispensable steps in a language processing pipeline. In this paper we present a simple data-driven context-sensitive approach to lemmatizing word forms in running text. We treat lemmatization as a classification task for Machine Learning, and automatically induce class labels. We achieve this by computing a Shortest Edit Script (SES) between reversed input and output strings. An SES describes the transformations that have to be applied to the
input string (word form) in order to convert it to the output string (lemma). Our approach shows competitive performance on a range of typologically different languages
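The core idea can be sketched in a few lines; this is a minimal illustration using Python's difflib as a stand-in for a true shortest-edit-script computation (difflib's matcher is heuristic, not guaranteed shortest), with the induced class label being the tuple of edit operations:

```python
from difflib import SequenceMatcher

def edit_script(form: str, lemma: str) -> tuple:
    """Derive an edit script between the REVERSED form and lemma.
    Reversal anchors the script at the word ending, where most
    inflectional change happens, so one script generalizes across
    word forms sharing the same inflectional pattern."""
    src, dst = form[::-1], lemma[::-1]
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, src, dst).get_opcodes():
        if tag != "equal":
            ops.append((tag, i1, i2, dst[j1:j2]))  # positions in reversed form
    return tuple(ops)

def apply_script(form: str, script: tuple) -> str:
    """Apply an edit script (a class label) to a possibly unseen form."""
    src = form[::-1]
    out, prev = [], 0
    for tag, i1, i2, repl in script:
        out.append(src[prev:i1])
        out.append(repl)  # '' for deletions, new text for insert/replace
        prev = i2
    out.append(src[prev:])
    return "".join(out)[::-1]
```

For example, the script induced from "walking" → "walk" (delete the reversed prefix "gni") applies unchanged to "talking" → "talk", which is why such scripts work as class labels for a classifier.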
A novel knowledge discovery based approach for supplier risk scoring with application in the HVAC industry
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University London. This research has led to a novel methodology for the assessment and quantification of supply risks in the supply chain. The research has built on advanced Knowledge Discovery techniques and has resulted in a software implementation able to do so. The methodology developed and presented here resembles the well-known consumer credit scoring methods, as it leads to a similar metric, or score, for assessing a supplier’s reliability and the risk of conducting business with that supplier. However, the focus is on a wide range of operational metrics rather than just the financial ones on which credit scoring techniques typically focus.
The core of the methodology comprises the application of Knowledge Discovery techniques to extract the likelihood of possible risks from within a range of available datasets. In combination with cross-impact analysis, those datasets are examined to establish the inter-relationships and mutual connections among several factors that are likely to contribute to risks associated with particular suppliers. This approach is called conjugation analysis. The resulting parameters become the inputs into a logistic regression, which leads to a risk scoring model. The outcome of the process is a standardized risk score, analogous to the well-known consumer risk scoring model better known as the FICO score.
The proposed methodology has been applied to an air conditioning manufacturing company. Two models have been developed. The first identifies supply risks based on data about purchase orders and selected risk factors. With this model, the likelihoods of delivery failures, quality failures and cost failures are obtained. The second model built on the first one but also used actual data about the performance of suppliers to identify the risks of conducting business with particular suppliers. Its target was to provide quantitative measures of an individual supplier’s risk level.
The supplier risk scoring model was tested on data acquired from the company for performance analysis. It achieved 86.2% accuracy, while the area under the curve (AUC) was 0.863, well above the 0.5 threshold required for model validity, indicating the developed model’s validity and reliability on future data. The numerical studies conducted with real-life datasets have demonstrated the effectiveness of the proposed methodology and system, as well as its potential for future industrial adoption
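The logistic-regression-to-score step described above can be illustrated with a toy sketch. The features, data, and scale mapping here are hypothetical, and plain-Python gradient descent stands in for the statistical tooling the thesis would actually use:

```python
import math

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Plain stochastic-gradient-descent logistic regression
    (illustrative only, not the thesis implementation)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = 1 / (1 + math.exp(-(sum(wj * xj for wj, xj in zip(w, xi)) + b)))
            err = p - yi  # gradient of log-loss w.r.t. the linear output
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def risk_score(x, w, b, base=300, span=550):
    """Map failure probability to a FICO-like 300-850 scale:
    lower predicted probability of supply failure -> higher score."""
    p = 1 / (1 + math.exp(-(sum(wj * xj for wj, xj in zip(w, x)) + b)))
    return round(base + span * (1 - p))

# Hypothetical supplier features: [late-delivery rate, defect rate]
X = [[0.05, 0.01], [0.40, 0.10], [0.10, 0.02], [0.60, 0.20]]
y = [0, 1, 0, 1]  # 1 = a supply failure was observed
w, b = fit_logistic(X, y)
```

A reliable supplier (low late-delivery and defect rates) then receives a higher score than a frequently failing one, mirroring how the standardized risk score ranks suppliers.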
On new maximal supergravity and its BPS domain-walls
We revisit the SU(3)-invariant sector of maximal supergravity with
dyonic SO(8) gaugings. By using the embedding tensor formalism, analytic
expressions for the scalar potential, superpotential(s) and fermion mass terms
are obtained as a function of the electromagnetic phase and the
scalars in the theory. Equipped with these results, we explore
non-supersymmetric AdS critical points for which
perturbative stability could not be analysed before. The phase-dependent
superpotential is then used to derive first-order flow equations and obtain new
BPS domain-wall solutions. We numerically look at
steepest-descent paths motivated by the (conjectured) RG flows.
Comment: 40 pages (30 pages + appendices), 3 tables, 6 figures. v2: References
added and discussion in section 4.2 clarified. v3: References added,
published version. v4: Fixed typo
Open-source resources and standards for Arabic word structure analysis: Fine grained morphological analysis of Arabic text corpora
Morphological analyzers are preprocessors for text analysis. Many Text Analytics applications need them to perform their tasks. The aim of this thesis is to develop
standards, tools and resources that widen the scope of Arabic word structure analysis - particularly morphological analysis, to process Arabic text corpora of different domains, formats and genres, of both vowelized and non-vowelized text.
We want to morphologically tag our Arabic Corpus, but evaluation of existing morphological analyzers has highlighted shortcomings and shown that more research is
required. Tag-assignment is significantly more complex for Arabic than for many languages. The morphological analyzer should add the appropriate linguistic information
to each part or morpheme of the word (proclitic, prefix, stem, suffix and enclitic); in effect, instead of a tag for a word, we need a subtag for each part.
Very fine-grained distinctions may cause problems for automatic morphosyntactic analysis, particularly for probabilistic taggers, which require training data, if some words can change grammatical tag depending on function and context; on the other hand, fine-grained distinctions may actually help to disambiguate other words in the local context. The SALMA – Tagger is a fine-grained morphological analyzer which mainly depends on linguistic information extracted from traditional Arabic grammar books and on a prior-knowledge broad-coverage lexical resource, the SALMA – ABCLexicon.
More fine-grained tag sets may be more appropriate for some tasks. The SALMA – Tag Set is a theory standard for encoding, which captures long-established traditional fine-grained morphological features of Arabic in a notation format intended to be compact yet transparent.
The SALMA – Tagger has been used to lemmatize the 176-million-word Arabic Internet Corpus. It has been proposed as a language-engineering toolkit for Arabic lexicography and for phonetically annotating the Qur’an with syllable and primary stress information, as well as for fine-grained morphological tagging
Advances in structure elucidation of small molecules using mass spectrometry
The structural elucidation of small molecules using mass spectrometry plays an important role in modern life sciences and bioanalytical approaches. This review covers different soft and hard ionization techniques and figures of merit for modern mass spectrometers, such as mass resolving power, mass accuracy, isotopic abundance accuracy, accurate mass multiple-stage MS(n) capability, as well as hybrid mass spectrometric and orthogonal chromatographic approaches. The latter part discusses mass spectral data handling strategies, which include background and noise subtraction, adduct formation and detection, charge state determination, accurate mass measurements, elemental composition determinations, and complex data-dependent setups with ion maps and ion trees. The importance of mass spectral library search algorithms for tandem mass spectra and multiple-stage MS(n) mass spectra, as well as mass spectral tree libraries that combine multiple-stage mass spectra, is outlined. The subsequent chapter discusses mass spectral fragmentation pathways, biotransformation reactions and drug metabolism studies, the mass spectral simulation and generation of in silico mass spectra, expert systems for mass spectral interpretation, and the use of computational chemistry to explain gas-phase phenomena. A single chapter discusses data handling for hyphenated approaches including mass spectral deconvolution for clean mass spectra, cheminformatics approaches and structure retention relationships, and retention index predictions for gas and liquid chromatography. The last section reviews the current state of electronic data sharing of mass spectra and discusses the importance of software development for the advancement of structure elucidation of small molecules
Acoustic seafloor classification using the Weyl transform of multibeam echosounder backscatter mosaic
The use of multibeam echosounder systems (MBES) for detailed seafloor mapping is increasing at a fast pace. Due to their design, enabling continuous high-density measurements and the coregistration of the seafloor’s depth and reflectivity, MBES have become a fundamental instrument in the advancing field of acoustic seafloor classification (ASC). With these data becoming available, recent seafloor mapping research focuses on the interpretation of the hydroacoustic data and automated predictive modeling of seafloor composition. While a methodological consensus on seafloor sediment classification algorithms and routines does not exist in the scientific community, it is expected that progress will occur through the refinement of each stage of the ASC pipeline, ranging from data acquisition to the modeling phase. This research focuses on the feature extraction stage, wherein the spatial variables used for the classification are, in this case, derived from the MBES backscatter data. This contribution explored the sediment classification potential of a textural feature based on the recently introduced Weyl transform of 300 kHz MBES backscatter imagery acquired over a nearshore study site in Belgian Waters. The goodness of the Weyl transform textural feature for seafloor sediment classification was assessed in terms of cluster separation of Folk’s sedimentological categories (4-class scheme). Class separation potential was quantified at multiple spatial scales by cluster silhouette coefficients. Weyl features derived from MBES backscatter data were found to exhibit superior thematic class separation compared to other well-established textural features, namely: (1) First-order Statistics, (2) Gray Level Co-occurrence Matrices (GLCM), (3) Wavelet Transform and (4) Local Binary Pattern (LBP).
Finally, by employing a Random Forest (RF) categorical classifier, the value of the proposed textural feature for seafloor sediment mapping was confirmed in terms of global and by-class classification accuracies, highest for models based on the backscatter Weyl features. Further tests on different backscatter datasets and sediment classification schemes are required to further elucidate the use of the Weyl transform of MBES backscatter imagery in the context of seafloor mapping
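The silhouette-coefficient evaluation mentioned above has a compact definition; this is a minimal stdlib sketch with toy feature vectors, not the study's actual pipeline:

```python
def silhouette(features, labels):
    """Mean silhouette coefficient over labeled feature vectors.
    For each point: a = mean distance to its own class, b = mean
    distance to the nearest other class; s = (b - a) / max(a, b).
    Values near +1 indicate compact, well-separated classes."""
    def dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q)) ** 0.5

    scores = []
    for i, (xi, li) in enumerate(zip(features, labels)):
        same = [dist(xi, xj)
                for j, (xj, lj) in enumerate(zip(features, labels))
                if j != i and lj == li]
        if not same:
            continue  # singleton classes have no defined silhouette
        a = sum(same) / len(same)
        b = min(
            sum(dist(xi, xj) for xj, lj in zip(features, labels) if lj == lab)
            / sum(1 for lj in labels if lj == lab)
            for lab in set(labels) if lab != li
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Applied to texture features labeled with sediment classes, a higher mean silhouette for the Weyl features than for GLCM or LBP features is exactly the kind of evidence the separation claim rests on.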