113 research outputs found

    Learning Domain-Specific Word Embeddings from Sparse Cybersecurity Texts

    Word embedding is a Natural Language Processing (NLP) technique that automatically maps words from a vocabulary to vectors of real numbers in an embedding space. It has been widely used in recent years to boost the performance of a variety of NLP tasks such as Named Entity Recognition, Syntactic Parsing and Sentiment Analysis. Classic word embedding methods such as Word2Vec and GloVe work well when they are given a large text corpus. When the input texts are sparse, as in many specialized domains (e.g., cybersecurity), these methods often fail to produce high-quality vectors. In this paper, we describe a novel method to train domain-specific word embeddings from sparse texts. In addition to domain texts, our method also leverages diverse types of domain knowledge such as domain vocabulary and semantic relations. Specifically, we first propose a general framework to encode diverse types of domain knowledge as text annotations. Then we develop a novel Word Annotation Embedding (WAE) algorithm to incorporate diverse types of text annotations in word embedding. We have evaluated our method on two cybersecurity text corpora: a malware description corpus and a Common Vulnerability and Exposure (CVE) corpus. Our evaluation results have demonstrated the effectiveness of our method in learning domain-specific word embeddings.

    Split-NER: Named Entity Recognition via Two Question-Answering-based Classifications

    In this work, we address the NER problem by splitting it into two logical sub-tasks: (1) Span Detection, which extracts entity mention spans irrespective of entity type; and (2) Span Classification, which classifies the spans into their entity types. Further, we formulate both sub-tasks as question-answering (QA) problems and produce two leaner models which can be optimized separately for each sub-task. Experiments with four cross-domain datasets demonstrate that this two-step approach is both effective and time efficient. Our system, SplitNER, outperforms baselines on OntoNotes5.0, WNUT17 and a cybersecurity dataset and gives on-par performance on BioNLP13CG. In all cases, it achieves a significant reduction in training time compared to its QA baseline counterpart. The effectiveness of our system stems from fine-tuning the BERT model twice, separately for span detection and classification. The source code can be found at https://github.com/c3sr/split-ner
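
    The two-step structure described in this abstract can be illustrated with a minimal sketch. This is not the SplitNER implementation: the paper uses two fine-tuned BERT question-answering models, whereas the stand-ins below are toy rules. The names `detect_spans`, `classify_span`, `split_ner` and `TOY_TYPES` are hypothetical and exist only for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # token indices, end exclusive

# Hypothetical lookup standing in for a learned span classifier.
TOY_TYPES = {"Emotet": "MALWARE", "Microsoft Word": "SOFTWARE"}


def detect_spans(tokens: List[str]) -> List[Span]:
    """Stage 1 stand-in: find mention spans without assigning a type.

    Assumption for illustration: a mention is a maximal run of
    capitalized tokens. A real system would use a trained model.
    """
    spans: List[Span] = []
    start = None
    for i, tok in enumerate(tokens):
        if tok[:1].isupper():
            if start is None:
                start = i
        elif start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tokens)))
    return spans


def classify_span(tokens: List[str], span: Span) -> str:
    """Stage 2 stand-in: assign an entity type to an already detected span."""
    text = " ".join(tokens[span[0]:span[1]])
    return TOY_TYPES.get(text, "MISC")


def split_ner(tokens: List[str]) -> List[Tuple[str, str]]:
    """Run detection first, then classify each detected span separately."""
    return [
        (" ".join(tokens[s:e]), classify_span(tokens, (s, e)))
        for s, e in detect_spans(tokens)
    ]


tokens = "the Emotet trojan spreads through Microsoft Word macros".split()
entities = split_ner(tokens)
print(entities)
```

    Because the two stages only communicate through the list of spans, either stand-in could be swapped for a trained model without touching the other, which is the separation the abstract credits for the reduced training time.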

    Backdoor Smoothing: Demystifying Backdoor Attacks on Deep Neural Networks

    Backdoor attacks mislead machine-learning models to output an attacker-specified class when presented with a specific trigger at test time. These attacks require poisoning the training data to compromise the learning algorithm, e.g., by injecting poisoning samples containing the trigger into the training set, along with the desired class label. Despite the increasing number of studies on backdoor attacks and defenses, the underlying factors affecting the success of backdoor attacks, along with their impact on the learning algorithm, are not yet well understood. In this work, we aim to shed light on this issue by unveiling that backdoor attacks induce a smoother decision function around the triggered samples, a phenomenon which we refer to as backdoor smoothing. To quantify backdoor smoothing, we define a measure that evaluates the uncertainty associated with the predictions of a classifier around the input samples. Our experiments show that smoothness increases when the trigger is added to the input samples, and that this phenomenon is more pronounced for more successful attacks. We also provide preliminary evidence that backdoor triggers are not the only smoothing-inducing patterns, and that other artificial patterns can also be detected by our approach, paving the way towards understanding the limitations of current defenses and designing novel ones. Comment: 9 pages, 7 figures, under submission
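
    The abstract does not give the exact measure, so the following is only a sketch of one plausible way to quantify prediction uncertainty around an input: sample Gaussian perturbations of the input, collect the predicted labels, and take the Shannon entropy of the resulting label distribution. The function `local_prediction_entropy`, its parameters `sigma` and `n_samples`, and the toy classifier are all assumptions made for illustration, not the authors' definition.

```python
import math
import random
from collections import Counter
from typing import Callable, Sequence


def local_prediction_entropy(
    predict: Callable[[Sequence[float]], int],
    x: Sequence[float],
    sigma: float = 0.1,
    n_samples: int = 200,
    seed: int = 0,
) -> float:
    """Entropy (in bits) of predicted labels over noisy copies of x.

    Lower values mean the classifier's decision is more stable, i.e. the
    decision function is smoother, in the neighborhood of x.
    """
    rng = random.Random(seed)
    counts: Counter = Counter()
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        counts[predict(noisy)] += 1
    probs = [c / n_samples for c in counts.values()]
    return sum(p * math.log2(1.0 / p) for p in probs)


def toy_classifier(x: Sequence[float]) -> int:
    """Hypothetical linear classifier with its boundary at x[0] = 0."""
    return int(x[0] > 0.0)


near_boundary = local_prediction_entropy(toy_classifier, [0.0, 0.0])
far_from_boundary = local_prediction_entropy(toy_classifier, [5.0, 0.0])
print(near_boundary, far_from_boundary)
```

    Under this stand-in measure, a point far from the decision boundary yields zero entropy, while a point on the boundary yields entropy close to one bit. The finding reported in the abstract would correspond to triggered inputs showing lower local uncertainty than their clean counterparts.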

    Scalable nonparametric multiway data analysis

    Multiway data analysis deals with multiway arrays, i.e., tensors, and the goal is twofold: predicting missing entries by modeling the interactions between array elements, and discovering hidden patterns, such as clusters or communities in each mode. Despite the success of existing tensor factorization approaches, they are either unable to capture nonlinear interactions or too computationally expensive to handle massive data. In addition, most of the existing methods lack a principled way to discover latent clusters, which is important for a better understanding of the data. To address these issues, we propose a scalable nonparametric tensor decomposition model. It employs a Dirichlet process mixture (DPM) prior to model the latent clusters; it uses local Gaussian processes (GPs) to capture nonlinear relationships and to improve scalability. An efficient online variational Bayes Expectation-Maximization algorithm is proposed to learn the model. Experiments on both synthetic and real-world data show that the proposed model is able to discover latent clusters with higher prediction accuracy than competitive methods. Furthermore, the proposed model obtains significantly better predictive performance than the state-of-the-art large scale tensor decomposition algorithm, GigaTensor, on two large datasets with billions of entries.
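
    For context, the kind of multilinear factorization that this abstract contrasts itself with can be sketched in a few lines. The example below shows how a classical CP (CANDECOMP/PARAFAC) model predicts one tensor entry from per-mode latent factors. It is not the proposed DPM/GP model, and the function name `cp_predict` and the factor values are made up for illustration.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def cp_predict(factors: Sequence[Matrix], index: Sequence[int]) -> float:
    """Predict one tensor entry under a rank-R CP model.

    factors[m][i] is the length-R latent vector of element i in mode m.
    The prediction is the sum over r of the product of the r-th latent
    components across all modes, which is purely multilinear.
    """
    rank = len(factors[0][0])
    total = 0.0
    for r in range(rank):
        term = 1.0
        for mode, i in enumerate(index):
            term *= factors[mode][i][r]
        total += term
    return total


# Made-up rank-2 factors for a 2 x 3 x 2 tensor.
A = [[1.0, 0.0], [0.5, 2.0]]
B = [[2.0, 1.0], [1.0, 1.0], [0.0, 3.0]]
C = [[1.0, 1.0], [3.0, 0.5]]

pred = cp_predict([A, B, C], (1, 2, 1))
print(pred)  # 0.5*0.0*3.0 + 2.0*3.0*0.5 = 3.0
```

    Each entry depends on the latent vectors only through products of their components, which is why such models cannot represent nonlinear interactions; the abstract's proposal replaces this fixed form with local Gaussian processes.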

    Self-similarity in NMR spectra: an application in assessing the level of cysteine

    High-resolution NMR spectroscopic data of biosamples are a rich source of information on the metabolic response to physiological variation or pathological events. NMR techniques have many advantages; for example, sample preparation is fast, simple and non-invasive. Statistical analysis of NMR spectra usually focuses on differential expression of large resonance intensities corresponding to abundant metabolites and involves several data preprocessing steps. In this paper we estimate functional components of spectra and test their significance using multiscale techniques. We also explore scaling in NMR spectra and use the systematic variability of scaling descriptors to predict the level of cysteine, an important precursor of glutathione, a control antioxidant in the human body. This is motivated by the high cost (in time and resources) of traditional methods for assessing cysteine level by high-performance liquid chromatography (HPLC).

    Hepatic Oxidative Stress in Fructose-Induced Fatty Liver Is Not Caused by Sulfur Amino Acid Insufficiency

    Fructose-sweetened liquid consumption is associated with fatty liver and oxidative stress. In rodent models of fructose-mediated fatty liver, protein consumption is decreased. Additionally, decreased sulfur amino acid intake is known to cause oxidative stress. Studies were designed to test whether oxidative stress in fructose-sweetened liquid-induced fatty liver is caused by decreased ad libitum solid food intake with associated inadequate sulfur amino acid intake. C57BL6 mice were grouped as: control (ad libitum water), fructose (ad libitum 30% fructose-sweetened liquid), glucose (ad libitum 30% glucose-sweetened water) and pair-fed (ad libitum water and sulfur amino acid intake same as the fructose group). Hepatic and plasma thiol-disulfide antioxidant status was analyzed after five weeks. Fructose- and glucose-fed mice developed fatty liver. The mitochondrial antioxidant protein, thioredoxin-2, displayed decreased abundance in the liver of fructose- and glucose-fed mice compared to controls. Glutathione/glutathione disulfide redox potential (EhGSSG) and abundance of the cytoplasmic antioxidant protein, peroxiredoxin-2, were similar among groups. We conclude that both fructose- and glucose-sweetened liquid consumption result in fatty liver and upregulated thioredoxin-2 expression, consistent with mitochondrial oxidative stress; however, inadequate sulfur amino acid intake was not the cause of this oxidative stress.

    Detailed Mitochondrial Phenotyping by High Resolution Metabolomics

    Mitochondrial phenotype is complex and difficult to define at the level of individual cell types. Newer metabolic profiling methods provide information on dozens of metabolic pathways from a relatively small sample. This pilot study used “top-down” metabolic profiling to determine the spectrum of metabolites present in liver mitochondria. High resolution mass spectral analyses and multivariate statistical tests provided global metabolic information about mitochondria and showed that liver mitochondria possess a significant phenotype based on gender and genotype. The data also show that mitochondria contain a large number of unidentified chemicals.