113 research outputs found
Learning Domain-Specific Word Embeddings from Sparse Cybersecurity Texts
Word embedding is a Natural Language Processing (NLP) technique that
automatically maps words from a vocabulary to vectors of real numbers in an
embedding space. It has been widely used in recent years to boost the
performance of a variety of NLP tasks such as Named Entity Recognition,
Syntactic Parsing and Sentiment Analysis. Classic word embedding methods such
as Word2Vec and GloVe work well when they are given a large text corpus. When
the input texts are sparse as in many specialized domains (e.g.,
cybersecurity), these methods often fail to produce high-quality vectors. In
this paper, we describe a novel method to train domain-specific word embeddings
from sparse texts. In addition to domain texts, our method also leverages
diverse types of domain knowledge such as domain vocabulary and semantic
relations. Specifically, we first propose a general framework to encode
diverse types of domain knowledge as text annotations. Then we develop a novel
Word Annotation Embedding (WAE) algorithm to incorporate diverse types of text
annotations in word embedding. We have evaluated our method on two
cybersecurity text corpora: a malware description corpus and a Common
Vulnerabilities and Exposures (CVE) corpus. Our evaluation results have
demonstrated the effectiveness of our method in learning domain-specific word
embeddings.
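As an illustration of what "encoding domain knowledge as text annotations" could look like, the following minimal Python sketch attaches annotations from a domain vocabulary and a list of semantic relations to each token. The function name `annotate`, the annotation string format, and the example vocabulary and relation are all hypothetical; the abstract does not specify the annotation scheme or how the WAE algorithm consumes the annotations.

```python
from typing import Dict, List, Tuple

Relation = Tuple[str, str, str]  # (head word, relation name, tail word)


def annotate(
    tokens: List[str],
    vocabulary: Dict[str, str],
    relations: List[Relation],
) -> List[Tuple[str, List[str]]]:
    """Attach hypothetical domain annotations to each token.

    vocabulary maps a lowercased domain term to a category label;
    relations lists (head, relation, tail) triples over lowercased terms.
    """
    annotated = []
    for tok in tokens:
        key = tok.lower()
        notes = []
        # Annotation from the domain vocabulary, if the token is a known term.
        if key in vocabulary:
            notes.append(f"type={vocabulary[key]}")
        # Annotations from semantic relations in which the token is the head.
        for head, rel, tail in relations:
            if head == key:
                notes.append(f"{rel}={tail}")
        annotated.append((tok, notes))
    return annotated


# Illustrative (made-up) domain knowledge and sentence.
vocabulary = {"zeus": "malware", "trojan": "malware_class"}
relations = [("zeus", "is_a", "trojan")]
tokens = "Zeus steals banking credentials".split()
annotated = annotate(tokens, vocabulary, relations)
```

In this sketch, tokens outside the domain vocabulary simply carry an empty annotation list, so the annotated text remains usable by an ordinary embedding trainer.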
Split-NER: Named Entity Recognition via Two Question-Answering-based Classifications
In this work, we address the NER problem by splitting it into two logical
sub-tasks: (1) Span Detection which simply extracts entity mention spans
irrespective of entity type; (2) Span Classification which classifies the spans
into their entity types. Further, we formulate both sub-tasks as
question-answering (QA) problems and produce two leaner models which can be
optimized separately for each sub-task. Experiments with four cross-domain
datasets demonstrate that this two-step approach is both effective and time
efficient. Our system, SplitNER, outperforms baselines on OntoNotes5.0, WNUT17
and a cybersecurity dataset and gives on-par performance on BioNLP13CG. In all
cases, it achieves a significant reduction in training time compared to its QA
baseline counterpart. The effectiveness of our system stems from fine-tuning
the BERT model twice, separately for span detection and classification. The
source code can be found at https://github.com/c3sr/split-ner
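The two-step decomposition described in this abstract can be sketched as follows. This is a minimal illustration, not the code from the linked repository: the question wording, the function names, and the toy detector and classifier (which stand in for the two fine-tuned BERT models) are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int]  # token indices, end exclusive
QAInput = Dict[str, str]

# Hypothetical question wording; the abstract does not give the actual prompts.
DETECT_QUESTION = "Which spans in the text are entity mentions?"


def build_detection_input(tokens: List[str]) -> QAInput:
    """Stage 1 input: one type-agnostic question over the whole sentence."""
    return {"question": DETECT_QUESTION, "context": " ".join(tokens)}


def build_classification_input(tokens: List[str], span: Span) -> QAInput:
    """Stage 2 input: a question about one already-detected mention."""
    mention = " ".join(tokens[span[0]:span[1]])
    return {
        "question": f"What is the entity type of '{mention}'?",
        "context": " ".join(tokens),
    }


def split_ner(
    tokens: List[str],
    detect: Callable[[QAInput, List[str]], List[Span]],
    classify: Callable[[QAInput], str],
) -> List[Tuple[Span, str]]:
    """Run span detection first, then classify each detected span separately."""
    spans = detect(build_detection_input(tokens), tokens)
    return [(s, classify(build_classification_input(tokens, s))) for s in spans]


def toy_detector(qa_input: QAInput, tokens: List[str]) -> List[Span]:
    """Stand-in for stage 1: marks maximal runs of capitalized tokens."""
    spans, start = [], None
    for i, tok in enumerate(tokens + [""]):  # empty sentinel closes a trailing run
        if tok[:1].isupper():
            if start is None:
                start = i
        elif start is not None:
            spans.append((start, i))
            start = None
    return spans


TOY_TYPES = {"Emotet": "MALWARE", "Microsoft Word": "SOFTWARE"}  # made-up labels


def toy_classifier(qa_input: QAInput) -> str:
    """Stand-in for stage 2: looks the quoted mention up in a tiny table."""
    for mention, label in TOY_TYPES.items():
        if f"'{mention}'" in qa_input["question"]:
            return label
    return "UNKNOWN"


tokens = "Emotet spreads through malicious Microsoft Word documents".split()
entities = split_ner(tokens, toy_detector, toy_classifier)
```

Because the two stages only communicate through the list of spans, each stand-in could be replaced or retrained independently, which is the property the abstract attributes to the split design.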
Backdoor Smoothing: Demystifying Backdoor Attacks on Deep Neural Networks
Backdoor attacks mislead machine-learning models to output an
attacker-specified class when presented with a specific trigger at test time. These
attacks require poisoning the training data to compromise the learning
algorithm, e.g., by injecting poisoning samples containing the trigger into the
training set, along with the desired class label. Despite the increasing number
of studies on backdoor attacks and defenses, the underlying factors affecting
the success of backdoor attacks, along with their impact on the learning
algorithm, are not yet well understood. In this work, we aim to shed light on
this issue by unveiling that backdoor attacks induce a smoother decision
function around the triggered samples -- a phenomenon which we refer to as
backdoor smoothing. To quantify backdoor smoothing, we define a
measure that evaluates the uncertainty associated with the predictions of a
classifier around the input samples.
Our experiments show that smoothness increases when the trigger is added to
the input samples, and that this phenomenon is more pronounced for more
successful attacks.
We also provide preliminary evidence that backdoor triggers are not the only
smoothing-inducing patterns, but that other artificial patterns can also be
detected by our approach, paving the way towards understanding the limitations
of current defenses and designing novel ones.
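One way to picture a measure of prediction uncertainty around an input sample is sketched below. It is an assumption for illustration, not the measure defined in the paper: the sketch perturbs the input with Gaussian noise, averages the predicted class probabilities, and reports the entropy of that average. The classifier `toy_classifier`, the noise level `sigma`, and the sample count are all hypothetical.

```python
import math
import random
from typing import Callable, List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def toy_classifier(x: List[float]) -> List[float]:
    """Hypothetical two-class model: class 1 probability grows with x[0]."""
    return softmax([0.0, 4.0 * x[0]])


def local_uncertainty(
    predict: Callable[[List[float]], List[float]],
    x: List[float],
    sigma: float = 0.1,
    n_samples: int = 200,
    seed: int = 0,
) -> float:
    """Entropy of the average prediction over Gaussian perturbations of x."""
    rng = random.Random(seed)  # seeded so the estimate is reproducible
    mean = None
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        probs = predict(noisy)
        mean = probs if mean is None else [m + p for m, p in zip(mean, probs)]
    mean = [m / n_samples for m in mean]
    # Shannon entropy in nats; zero-probability classes contribute nothing.
    return -sum(p * math.log(p) for p in mean if p > 0.0)


near_boundary = local_uncertainty(toy_classifier, [0.0, 0.0])
far_from_boundary = local_uncertainty(toy_classifier, [3.0, 0.0])
```

Under this reading, a point whose neighbourhood is classified consistently and confidently yields a low value, while a point near the decision boundary yields a value close to the maximum entropy for the number of classes.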
Scalable nonparametric multiway data analysis
Multiway data analysis deals with multiway arrays, i.e., tensors, and the goal is twofold: predicting missing entries by modeling the interactions between array elements and discovering hidden patterns, such as clusters or communities in each mode. Despite the success of existing tensor factorization approaches, they are either unable to capture nonlinear interactions or too computationally expensive to handle massive data. In addition, most of the existing methods lack a principled way to discover latent clusters, which is important for better understanding of the data. To address these issues, we propose a scalable nonparametric tensor decomposition model. It employs a Dirichlet process mixture (DPM) prior to model the latent clusters; it uses local Gaussian processes (GPs) to capture nonlinear relationships and to improve scalability. An efficient online variational Bayes Expectation-Maximization algorithm is proposed to learn the model. Experiments on both synthetic and real-world data show that the proposed model is able to discover latent clusters with higher prediction accuracy than competitive methods. Furthermore, the proposed model obtains significantly better predictive performance than the state-of-the-art large scale tensor decomposition algorithm, GigaTensor, on two large datasets with billions of entries.
Self-similarity in NMR spectra: an application in assessing the level of cysteine
High-resolution NMR spectroscopic data of biosamples are a rich source of information on the metabolic response to physiological variation or pathological events. NMR techniques have many advantages; for example, sample preparation is fast, simple and non-invasive. Statistical analysis of NMR spectra usually focuses on differential expression of large resonance intensity corresponding to abundant metabolites and involves several data preprocessing steps. In this paper we estimate functional components of spectra and test their significance using multiscale techniques. We also explore scaling in NMR spectra and use the systematic variability of scaling descriptors to predict the level of cysteine, an important precursor of glutathione, a control antioxidant in the human body. This is motivated by the high cost (in time and resources) of traditional methods for assessing cysteine level by high-performance liquid chromatography (HPLC).
Hepatic Oxidative Stress in Fructose-Induced Fatty Liver Is Not Caused by Sulfur Amino Acid Insufficiency
Fructose-sweetened liquid consumption is associated with fatty liver and oxidative stress. In rodent models of fructose-mediated fatty liver, protein consumption is decreased. Additionally, decreased sulfur amino acid intake is known to cause oxidative stress. Studies were designed to test whether oxidative stress in fructose-sweetened liquid-induced fatty liver is caused by decreased ad libitum solid food intake with associated inadequate sulfur amino acid intake. C57BL6 mice were grouped as: control (ad libitum water), fructose (ad libitum 30% fructose-sweetened liquid), glucose (ad libitum 30% glucose-sweetened water) and pair-fed (ad libitum water and sulfur amino acid intake same as the fructose group). Hepatic and plasma thiol-disulfide antioxidant status were analyzed after five weeks. Fructose- and glucose-fed mice developed fatty liver. The mitochondrial antioxidant protein, thioredoxin-2, displayed decreased abundance in the liver of fructose- and glucose-fed mice compared to controls. Glutathione/glutathione disulfide redox potential (EhGSSG) and abundance of the cytoplasmic antioxidant protein, peroxiredoxin-2, were similar among groups. We conclude that both fructose- and glucose-sweetened liquid consumption result in fatty liver and upregulated thioredoxin-2 expression, consistent with mitochondrial oxidative stress; however, inadequate sulfur amino acid intake was not the cause of this oxidative stress.
Detailed Mitochondrial Phenotyping by High Resolution Metabolomics
Mitochondrial phenotype is complex and difficult to define at the level of individual cell types. Newer metabolic profiling methods provide information on dozens of metabolic pathways from a relatively small sample. This pilot study used “top-down” metabolic profiling to determine the spectrum of metabolites present in liver mitochondria. High resolution mass spectral analyses and multivariate statistical tests provided global metabolic information about mitochondria and showed that liver mitochondria possess a significant phenotype based on gender and genotype. The data also show that mitochondria contain a large number of unidentified chemicals.