
    Transformation of Dense and Sparse Text Representations

    Sparsity is regarded as a desirable property of representations, especially in terms of explanation. However, its usage has been limited due to the gap with dense representations. Most recent progress in NLP research is based on dense representations, so the desirable property of sparsity cannot be leveraged. Inspired by the Fourier transform, in this paper we propose a novel Semantic Transformation method to bridge the dense and sparse spaces, which can facilitate a shift in NLP research from the dense space to the sparse space, or the joint use of both spaces. The key idea of the proposed approach is to use a Forward Transformation to transform dense representations into sparse representations. Useful operations can then be performed in the sparse space, and the sparse representations can be used directly for downstream tasks such as text classification and natural language inference. A Backward Transformation can also be carried out to transform the processed sparse representations back into dense representations. Experiments on classification tasks and a natural language inference task show that the proposed Semantic Transformation is effective.
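    As an illustration of the forward/backward idea, here is a minimal NumPy sketch: a dense vector is projected into a higher-dimensional space and sparsified with a top-k selection, and the (possibly edited) sparse code is mapped back. The matrices W_fwd and W_bwd, the top-k sparsifier, and all dimensions are illustrative assumptions, not the paper's learned architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

DENSE_DIM, SPARSE_DIM, TOP_K = 300, 2000, 20

# Illustrative (randomly initialized) transformation matrices; in the
# paper these would be learned end-to-end on a downstream task.
W_fwd = rng.standard_normal((SPARSE_DIM, DENSE_DIM)) / np.sqrt(DENSE_DIM)
W_bwd = rng.standard_normal((DENSE_DIM, SPARSE_DIM)) / np.sqrt(SPARSE_DIM)

def forward_transform(dense, k=TOP_K):
    """Map a dense vector to a higher-dimensional sparse code.

    Top-k selection is one simple way to enforce sparsity; the paper's
    actual sparsification may differ.
    """
    z = W_fwd @ dense
    s = np.zeros_like(z)
    idx = np.argsort(np.abs(z))[-k:]   # keep the k largest activations
    s[idx] = z[idx]
    return s

def backward_transform(sparse):
    """Map a (possibly processed) sparse code back to the dense space."""
    return W_bwd @ sparse

dense = rng.standard_normal(DENSE_DIM)    # e.g., a sentence embedding
sparse = forward_transform(dense)
print("nonzeros in sparse code:", np.count_nonzero(sparse))
print("reconstructed dense shape:", backward_transform(sparse).shape)
```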

    An alternative text representation to TF-IDF and Bag-of-Words

    In text mining, information retrieval, and machine learning, text documents are commonly represented through variants of sparse Bag-of-Words (sBoW) vectors (e.g., TF-IDF). Although simple and intuitive, sBoW-style representations suffer from inherent over-sparsity and fail to capture word-level synonymy and polysemy. Especially when labeled data is limited (e.g., in document classification), or the text documents are short (e.g., emails or abstracts), many features are rarely observed within the training corpus. This leads to overfitting and reduced generalization accuracy. In this paper we propose Dense Cohort of Terms (dCoT), an unsupervised algorithm to learn improved sBoW document features. dCoT explicitly models absent words by removing and reconstructing random subsets of words in the unlabeled corpus. With this approach, dCoT learns to reconstruct frequent words from co-occurring infrequent words and maps the high-dimensional sparse sBoW vectors into a low-dimensional dense representation. We show that the feature removal can be marginalized out and that the reconstruction can be solved for in closed form. We demonstrate empirically, on several benchmark datasets, that dCoT features significantly improve classification accuracy across several document classification tasks.
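    dCoT's marginalized word removal is closely related to marginalized denoising autoencoders, for which the expected reconstruction weights have a published closed form. The sketch below follows that closed form under a uniform word-dropout probability p; the dimensionality-reduction step that yields a low-dimensional dense output, and any dCoT-specific details, are omitted here.

```python
import numpy as np

def marginalized_denoiser(X, p):
    """Closed-form reconstruction weights under feature dropout.

    X : (d, n) matrix of sBoW column vectors.
    p : probability that a word (feature) is removed.

    Returns W minimizing the expected squared reconstruction error
    E||X - W X_corrupted||^2, marginalized over all corruptions (the
    closed form used by marginalized denoising autoencoders, to which
    dCoT is closely related).
    """
    d = X.shape[0]
    q = np.full(d, 1.0 - p)            # survival probability per feature
    S = X @ X.T                        # scatter matrix
    Q = S * np.outer(q, q)             # E[X_corr X_corr^T], off-diagonal
    np.fill_diagonal(Q, np.diag(S) * q)
    P = S * q                          # E[X X_corr^T]: scales column j by q_j
    # Solve W = P Q^{-1}; small ridge term for numerical stability.
    W = np.linalg.solve(Q + 1e-5 * np.eye(d), P.T).T
    return W

rng = np.random.default_rng(0)
X = rng.poisson(0.3, size=(50, 200)).astype(float)  # toy sBoW counts
W = marginalized_denoiser(X, p=0.5)
dense_features = np.tanh(W @ X)    # nonlinear "dense cohort" features
print(dense_features.shape)
```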

    Effective Feature Representation for Clinical Text Concept Extraction

    Crucial information about the practice of healthcare is recorded only in free-form text, which creates an enormous opportunity for high-impact NLP. However, annotated healthcare datasets tend to be small and expensive to obtain, which raises the question of how to make maximally efficient use of the available data. To this end, we develop an LSTM-CRF model that combines unsupervised word representations with hand-built feature representations derived from publicly available healthcare ontologies. We show that this combined model yields superior performance on five datasets of diverse kinds of healthcare text (clinical, social, scientific, commercial), each involving the labeling of complex, multi-word spans that pick out different healthcare concepts. We also introduce a new labeled dataset for identifying treatment relations between drugs and diseases.
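    A minimal PyTorch sketch of the combination step: pretrained word embeddings are concatenated with hand-built ontology features per token before a BiLSTM. All sizes are invented, and a per-token linear output stands in for the paper's CRF layer.

```python
import torch
import torch.nn as nn

class HybridTagger(nn.Module):
    """Token tagger concatenating word embeddings with hand-built
    ontology features before a BiLSTM.

    A CRF output layer (as in the paper's LSTM-CRF) would replace the
    per-token linear scores below; it is omitted to keep the sketch short.
    """
    def __init__(self, vocab_size, embed_dim, onto_dim, hidden_dim, n_tags):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # init from pretrained vectors
        self.lstm = nn.LSTM(embed_dim + onto_dim, hidden_dim,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, n_tags)

    def forward(self, token_ids, onto_feats):
        # token_ids: (batch, seq); onto_feats: (batch, seq, onto_dim)
        x = torch.cat([self.embed(token_ids), onto_feats], dim=-1)
        h, _ = self.lstm(x)
        return self.out(h)             # per-token tag scores

model = HybridTagger(vocab_size=5000, embed_dim=100, onto_dim=12,
                     hidden_dim=64, n_tags=9)
scores = model(torch.randint(0, 5000, (2, 20)), torch.rand(2, 20, 12))
print(scores.shape)  # (2, 20, 9)
```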

    Low Latency Privacy Preserving Inference

    When applying machine learning to sensitive data, one has to find a balance between accuracy, information security, and computational complexity. Recent studies have combined homomorphic encryption with neural networks to make inferences while protecting against information leakage. However, these methods are limited by the width and depth of neural networks that can be used (and hence the accuracy), and exhibit high latency even for relatively simple networks. In this study we provide two solutions that address these limitations. In the first solution, we achieve more than a 10× improvement in latency and enable inference on wider networks compared to prior attempts with the same level of security. The improved performance is achieved by novel methods of representing the data during the computation. In the second solution, we apply transfer learning to provide private inference services using deep networks with a latency of ~0.16 seconds. We demonstrate the efficacy of our methods on several computer vision tasks.
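    The transfer-learning solution can be illustrated with a toy NumPy split of the computation: a public, frozen feature extractor runs on the client's data in the clear, and only a small affine head, built from additions and multiplications that HE schemes support natively, would need to run under encryption. Both networks here are random stand-ins, and the encrypted evaluation itself is simulated in plaintext.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a public pretrained backbone: in the paper's second
# solution, deep features come from an off-the-shelf network, and only
# a small head must be evaluated under homomorphic encryption.
FEATURE_W = rng.standard_normal((128, 32 * 32))   # frozen "backbone"
HEAD_W = rng.standard_normal((10, 128))           # small classifier head
HEAD_B = rng.standard_normal(10)

def pretrained_features(image):
    # hypothetical feature extractor; a real system would use a CNN
    return np.maximum(0, FEATURE_W @ image.ravel())

image = rng.random((32, 32))           # the client's sensitive input
feats = pretrained_features(image)     # computed client-side, in the clear

# The only server-side computation on encrypted data is an affine map:
# additions and multiplications only, which HE schemes support natively.
# Here it runs on plaintext for illustration.
logits = HEAD_W @ feats + HEAD_B
print("predicted class:", int(np.argmax(logits)))
```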

    Fighting Redundancy and Model Decay with Embeddings

    Every day, hundreds of millions of new Tweets in over 40 languages of ever-shifting vernacular flow through Twitter. Models that attempt to extract insight from this firehose of information must face the torrential covariate shift that is endemic to the Twitter platform. While regularly retrained algorithms can maintain performance in the face of this shift, fixed model features that fail to represent new trends and tokens can quickly become stale, resulting in performance degradation. To mitigate this problem we employ learned features, or embedding models, that can efficiently represent the most relevant aspects of a data distribution. Sharing these embedding models across teams can also reduce redundancy and multiplicatively increase cross-team modeling productivity. In this paper, we detail the commoditized tools, algorithms, and pipelines that we have developed and are developing at Twitter to regularly generate high-quality, up-to-date embeddings and share them broadly across the company.
    Comment: Presented at the Common Model Infrastructure Workshop at KDD 2018 (link: https://cmi2018.sdsc.edu/)

    An Efficient Shared-memory Parallel Sinkhorn-Knopp Algorithm to Compute the Word Mover's Distance

    The Word Mover's Distance (WMD) is a metric that measures the semantic dissimilarity between two text documents by computing the cost of optimally moving all words of a source/query document to the most similar words of a target document. Computing the WMD between two documents is costly because it requires solving an optimization problem that costs O(V^3 log V), where V is the number of unique words in the documents. Fortunately, the WMD can be framed as the Earth Mover's Distance (EMD), also known as the Optimal Transportation Distance, for which it has been shown that the algorithmic complexity can be reduced to O(V^2) by adding an entropy penalty to the optimization problem, and a similar idea can be adapted to compute the WMD efficiently. Additionally, the computation can be made highly parallel by computing the WMD of a single query document against multiple target documents at once (e.g., finding whether a given tweet is similar to any other tweets posted in a day). In this paper, we present a shared-memory parallel Sinkhorn-Knopp algorithm to compute the WMD of one document against many other documents by adopting the O(V^2) EMD algorithm. We used algorithmic transformations to change the original dense compute-heavy kernel to a sparse compute kernel, obtaining a 67× speedup using 96 cores on a state-of-the-art Intel® 4-socket Cascade Lake machine with respect to its sequential run. Our parallel algorithm is over 700× faster than naive parallel Python code that internally uses optimized matrix library calls.
    Comment: 10 pages, 1 page of references, 11 pages total
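    The entropy-penalized EMD at the heart of the method is computed with Sinkhorn-Knopp scaling iterations. Below is a small sequential NumPy version for two toy documents; the paper's sparse-kernel and shared-memory parallel optimizations are not shown, and the toy embeddings and regularization value are arbitrary.

```python
import numpy as np

def sinkhorn_wmd(a, b, C, reg=0.1, n_iter=200):
    """Entropy-regularized EMD between word distributions a and b.

    a, b : nonnegative histograms over the two documents' words (sum to 1).
    C    : (len(a), len(b)) cost matrix, e.g. Euclidean distances
           between word embeddings.
    reg  : entropy penalty; smaller values approach the exact distance
           but converge more slowly.
    """
    K = np.exp(-C / reg)                 # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):              # alternating scaling updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    T = u[:, None] * K * v[None, :]      # optimal transport plan
    return float((T * C).sum())          # transport cost ~ WMD

rng = np.random.default_rng(0)
emb1, emb2 = rng.random((5, 50)), rng.random((7, 50))  # toy word vectors
C = np.linalg.norm(emb1[:, None] - emb2[None, :], axis=-1)
a = np.full(5, 1 / 5); b = np.full(7, 1 / 7)           # uniform word weights
print(sinkhorn_wmd(a, b, C))
```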

    Flexible Operator Embeddings via Deep Learning

    Integrating machine learning into the internals of database management systems requires significant feature engineering, a human-effort-intensive process to determine the best way to represent the pieces of information that are relevant to a task. In addition to being labor intensive, the process of hand-engineering features must generally be repeated for each data management task, and may make assumptions about the underlying database that are not universally true. We introduce flexible operator embeddings, a deep learning technique for automatically transforming query operators into feature vectors that are useful for multiple data management tasks and custom-tailored to the underlying database. Our approach works by taking advantage of an operator's context, resulting in a neural network that quickly transforms sparse representations of query operators into dense, information-rich feature vectors. Experimentally, we show that our flexible operator embeddings perform well across a number of data management tasks, using both synthetic and real-world datasets.
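    A toy PyTorch sketch of the sparse-to-dense mapping: a small network turns a sparse feature vector describing a query operator into a dense embedding. The featurization and layer sizes are invented, and the context-driven training the paper uses is omitted.

```python
import torch
import torch.nn as nn

class OperatorEmbedder(nn.Module):
    """Map a sparse feature vector describing a query operator (operator
    type, touched columns, cardinality buckets, ...) to a dense embedding.

    Feature choices and sizes here are illustrative assumptions, not the
    paper's actual featurization; the paper trains such a network from an
    operator's context, which this sketch does not show.
    """
    def __init__(self, sparse_dim=500, embed_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(sparse_dim, 128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, sparse_op):
        return self.net(sparse_op)

embedder = OperatorEmbedder()
sparse_op = torch.zeros(500)
sparse_op[[3, 42, 317]] = 1.0        # a few active one-hot features
print(embedder(sparse_op).shape)     # dense 32-d operator embedding
```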

    Sparseness helps: Sparsity Augmented Collaborative Representation for Classification

    Many classification approaches first represent a test sample using the training samples of all the classes. This collaborative representation is then used to label the test sample. It was a common belief that sparseness of the representation is the key to the success of this classification scheme. More recently, however, it has been claimed that it is the collaboration, and not the sparseness, that makes the scheme effective. This claim is attractive, as it allows relinquishing the computationally expensive sparsity constraint over the representation. In this paper, we first extend the analysis supporting this claim and then show that sparseness explicitly contributes to improved classification; hence, it should not be completely ignored for computational gains. Inspired by this result, we augment a dense collaborative representation with a sparse representation and propose an efficient classification method that capitalizes on the resulting representation. The augmented representation and the classification method work together to achieve higher accuracy and lower computational time than state-of-the-art collaborative-representation-based classification approaches. Experiments on benchmark face, object, and action databases show the efficacy of our approach.
    Comment: 10 pages
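    For reference, plain collaborative representation classification (the dense baseline the paper augments) has a simple closed form: a ridge-regularized code over all training samples, followed by class-wise reconstruction residuals. A NumPy sketch, with the paper's sparse augmentation step deliberately left out:

```python
import numpy as np

def crc_classify(D, labels, y, lam=0.01):
    """Collaborative representation classification.

    D      : (d, n) dictionary whose columns are all training samples.
    labels : length-n array of class labels, one per column of D.
    y      : test sample of dimension d.

    Solves min ||y - D a||^2 + lam ||a||^2 in closed form, then assigns
    y to the class whose portion of the code best reconstructs it.
    """
    n = D.shape[1]
    a = np.linalg.solve(D.T @ D + lam * np.eye(n), D.T @ y)
    residuals = {}
    for c in np.unique(labels):
        mask = labels == c
        residuals[c] = np.linalg.norm(y - D[:, mask] @ a[mask])
    return min(residuals, key=residuals.get)

rng = np.random.default_rng(0)
D = rng.standard_normal((30, 40))
labels = np.repeat([0, 1], 20)
y = D[:, 5] + 0.1 * rng.standard_normal(30)  # noisy copy of a class-0 sample
print(crc_classify(D, labels, y))            # expect 0
```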

    Efficient molecular quantum dynamics in coordinate and phase space using pruned bases

    We present an efficient implementation of dynamically pruned quantum dynamics, both in coordinate space and in phase space. We combine the ideas behind the biorthogonal von Neumann basis (PvB) with the orthogonalized momentum-symmetrized Gaussians (Weylets) to create a new basis, projected Weylets, that takes the best from both methods. We benchmark pruned dynamics using phase-space-localized PvB, projected Weylets, and coordinate-space-localized DVR bases, with real-world examples in up to six dimensions. We show that coordinate-space localization is most important for efficient pruning and that pruned dynamics is much faster than unpruned, exact dynamics. Phase-space localization is useful for more demanding dynamics where many basis functions are required; there, projected Weylets offer a more compact representation than pruned DVR bases.
    Comment: in press
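    Whatever the underlying basis, the core of dynamical pruning is dropping basis functions whose expansion coefficients are negligible at each propagation step. A deliberately simplified NumPy sketch of that bookkeeping; a real code would also re-activate neighboring functions so the wavepacket can spread back into pruned regions.

```python
import numpy as np

def prune(coeffs, active, threshold=1e-6):
    """Keep only basis functions with significant coefficient magnitude.

    coeffs : expansion coefficients of the wavefunction.
    active : indices of the currently active basis functions.
    """
    keep = np.abs(coeffs) > threshold
    return coeffs[keep], active[keep]

rng = np.random.default_rng(0)
# Toy coefficients with a decaying envelope, mimicking a localized wavepacket.
coeffs = rng.standard_normal(10_000) * np.exp(-np.arange(10_000) / 50.0)
active = np.arange(10_000)
coeffs, active = prune(coeffs, active)
print("active basis size after pruning:", active.size)
```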

    Large-Scale 3D Shape Reconstruction and Segmentation from ShapeNet Core55

    We introduce a large-scale 3D shape understanding benchmark using data and annotations from the ShapeNet 3D object database. The benchmark consists of two tasks: part-level segmentation of 3D shapes and 3D reconstruction from single-view images. Ten teams participated in the challenge, and the best performing teams outperformed state-of-the-art approaches on both tasks. Several novel deep learning architectures were proposed on various 3D representations for both tasks. We report the techniques used by each team and the corresponding performances. In addition, we summarize the major discoveries from the reported results and possible trends for future work in the field.