Labeling topics with images using a neural network
Topics generated by topic models are usually represented by lists of terms, or alternatively by short phrases or images. The current state-of-the-art work on labeling topics with images selects images by re-ranking a small set of candidates for a given topic. In this paper, we present a more generic method that can estimate the degree of association between an arbitrary pair of a previously unseen topic and image using a deep neural network. Our method achieves better runtime performance, O(n) compared to O(n²) for the current state-of-the-art method, and is also significantly more accurate.
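The efficiency claim can be made concrete with a toy sketch: if each topic and each candidate image is represented by a precomputed embedding, labeling a topic takes one scoring pass over the n candidates. The cosine-similarity scorer below is a hypothetical stand-in for the paper's trained deep network; all names and the vector format are illustrative assumptions.

```python
import math

def score(topic_vec, image_vec):
    # Hypothetical association score: cosine similarity between a topic
    # embedding and an image embedding. The paper learns this association
    # with a deep network; a dot product stands in for it here.
    dot = sum(t * i for t, i in zip(topic_vec, image_vec))
    nt = math.sqrt(sum(t * t for t in topic_vec))
    ni = math.sqrt(sum(i * i for i in image_vec))
    return dot / (nt * ni)

def label_topic(topic_vec, candidate_images):
    # One forward pass per candidate: O(n) in the number of images,
    # versus O(n^2) for pairwise re-ranking of the candidate set.
    return max(candidate_images, key=lambda im: score(topic_vec, im[1]))
```

Because the scorer accepts any (topic, image) pair, it also generalizes to topics and images unseen during training, which is the property the abstract emphasizes.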
Analyzing political parody in social media
Parody is a figurative device used to imitate an entity for comedic or critical purposes, and it is a widespread phenomenon in social media, with many popular parody accounts. In this paper, we present the first computational study of parody. We introduce a new publicly available data set of tweets from real politicians and their corresponding parody accounts. We run a battery of supervised machine learning models for automatically detecting parody tweets, with an emphasis on robustness: we test on tweets from accounts unseen in training, across different genders, and across countries. Our results show that political parody tweets can be predicted with an accuracy of up to 90%. Finally, we identify the markers of parody through a linguistic analysis. Beyond research in linguistics and political communication, accurately and automatically detecting parody is important for improving fact checking for journalists and for analytics such as sentiment analysis, by filtering out parodical utterances.
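The robustness protocol described above hinges on splitting the data at the account level rather than the tweet level, so that no test tweet comes from an account seen in training. A minimal sketch of such a split, with field names assumed for illustration:

```python
def split_by_account(tweets, test_accounts):
    # Account-level split: every tweet from a held-out account goes to the
    # test set, so evaluation measures generalization to unseen accounts
    # rather than memorization of per-account style. The "account" field
    # name is an assumption about the data format.
    train, test = [], []
    for tw in tweets:
        (test if tw["account"] in test_accounts else train).append(tw)
    return train, test
```

The same mechanism supports the cross-gender and cross-country evaluations: group accounts by the attribute of interest and hold out one group at a time.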
SlideImages: A Dataset for Educational Image Classification
In the past few years, convolutional neural networks (CNNs) have achieved impressive results in computer vision tasks, which, however, mainly focus on photos with natural scene content. Non-sensor-derived images such as illustrations, data visualizations, and figures are typically used to convey complex information or to explore large datasets, yet this kind of image has received little attention in computer vision. CNNs and similar techniques require large volumes of training data, and currently many document analysis systems are trained in part on scene images due to the lack of large datasets of educational image data. In this paper, we address this issue and present SlideImages, a dataset for the task of classifying educational illustrations. SlideImages contains training data collected from various sources, e.g., Wikimedia Commons and the AI2D dataset, and test data collected from educational slides. We have reserved all the actual educational images as a test dataset in order to ensure that approaches using this dataset generalize well to new educational images, and potentially other domains. Furthermore, we present a baseline system using a standard deep neural architecture and discuss dealing with the challenge of limited training data.
Comment: 8 pages, 2 figures, to be presented at ECIR 202
Re-ranking words to improve interpretability of automatically generated topics
Topic models, such as LDA, are widely used in Natural Language Processing. Making their output interpretable is an important area of research, with applications such as the enhancement of exploratory search interfaces and the development of interpretable machine learning models. Conventionally, topics are represented by their n most probable words; however, these representations are often difficult for humans to interpret. This paper explores the re-ranking of topic words to generate more interpretable topic representations. A range of approaches are compared and evaluated in two experiments. The first uses crowdworkers to associate topics represented by different word rankings with related documents. The second experiment is an automatic approach based on a document retrieval task applied to multiple domains. Results in both experiments demonstrate that re-ranking words improves topic interpretability, and that the most effective re-ranking schemes were those which combine information about both the importance of words within topics and their relative frequency in the entire corpus. In addition, the close correlation between the results of the two evaluation approaches suggests that the automatic method proposed here could be used to evaluate re-ranking methods without the need for human judgements.
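The finding that the best schemes combine within-topic importance with corpus-wide frequency can be illustrated with a small sketch. The IDF-style weighting below is a hypothetical instance of such a combined scheme, not the paper's exact formula; inputs are a topic's word-probability map and raw corpus counts.

```python
import math

def rerank(topic_word_probs, corpus_freq, total_tokens):
    # Hypothetical combined re-ranking: weight each word's within-topic
    # probability by an IDF-style penalty on its corpus-wide frequency,
    # so that common function words sink below topic-specific terms.
    def weight(w):
        p = topic_word_probs[w]
        idf = math.log(total_tokens / (1 + corpus_freq.get(w, 0)))
        return p * idf
    return sorted(topic_word_probs, key=weight, reverse=True)
```

For example, "the" may be more probable than "neuron" inside a neuroscience topic, yet its enormous corpus frequency pushes it down the re-ranked list, which is exactly the interpretability gain the experiments measure.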
Frustratingly simple pretraining alternatives to masked language modeling
Masked language modeling (MLM), a self-supervised pretraining objective, is widely used in natural language processing for learning text representations. MLM trains a model to predict a random sample of input tokens that have been replaced by a [MASK] placeholder, in a multi-class setting over the entire vocabulary. When pretraining, it is common to use other auxiliary objectives at the token or sequence level alongside MLM to improve downstream performance (e.g. next sentence prediction). However, no previous work has examined whether simpler objectives, whether linguistically intuitive or not, can be used standalone as the main pretraining objective. In this paper, we explore five simple pretraining objectives based on token-level classification tasks as replacements for MLM. Empirical results on GLUE and SQuAD show that our proposed methods achieve performance comparable to or better than MLM using a BERT-BASE architecture. We further validate our methods with smaller models, showing that pretraining BERT-MEDIUM, a model with 41% of BERT-BASE's parameters, results in only a 1% drop in GLUE scores with our best objective.
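To make "token-level classification as a pretraining objective" concrete, here is a sketch of one plausible objective of this kind: shuffle a few token positions and label each token by whether it was displaced. This turns pretraining into a two-class per-token decision instead of MLM's full-vocabulary prediction. The exact corruption scheme, names, and ratio are illustrative assumptions, not necessarily one of the paper's five objectives.

```python
import random

def shuffled_token_objective(tokens, ratio=0.15, seed=0):
    # Sketch of a token-level binary objective: cyclically shift a random
    # subset of positions, then label each token 1 if it now differs from
    # the original at that position, 0 otherwise. A model pretrained on
    # these labels learns to spot out-of-place tokens.
    rng = random.Random(seed)
    n = len(tokens)
    k = max(2, int(n * ratio))
    idx = rng.sample(range(n), k)
    perm = idx[1:] + idx[:1]  # cyclic shift of the chosen positions
    out = list(tokens)
    for src, dst in zip(idx, perm):
        out[dst] = tokens[src]
    labels = [1 if out[i] != tokens[i] else 0 for i in range(n)]
    return out, labels
```

The appeal, as the abstract argues, is simplicity: no [MASK] token, no softmax over tens of thousands of vocabulary entries, just a small classification head on every position.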
Dynamics of proteins: Light scattering study of dilute and dense colloidal suspensions of eye lens homogenates
We report a dynamic light scattering study on protein suspensions of bovine lens homogenates at conditions (pH and ionic strength) similar to the physiological ones. Light scattering data were collected at two temperatures, 20 °C and 37 °C, over a wide range of concentrations, from the very dilute limit up to the dense regime approaching the physiological lens concentration. A comparison with experimental data from intact bovine lenses was advanced, revealing differences between dispersions and lenses at similar concentrations. In the dilute regime, two scattering entities were detected and identified with the long-time self-diffusion modes of alpha-crystallins and their aggregates, which naturally exist in the lens nucleus. Self-diffusion coefficients are temperature insensitive, whereas the collective diffusion coefficient depends strongly on temperature, revealing a reduction of the net repulsive interparticle forces with decreasing temperature. While there are no rigorous theoretical approaches to the particle diffusion properties of multi-component, non-ideal hard-sphere, polydisperse systems such as the suspensions studied here, a discussion of the volume-fraction dependence of the long-time self-diffusion coefficient in the context of existing theoretical approaches was undertaken. This study aims to provide some insight into the complex light scattering pattern of intact lenses and the interactions between the constituent proteins that are responsible for lens transparency, which should lead to an understanding of the basic mechanisms of the specific protein interactions that cause lens opacification (cataract) under pathological conditions.
Comment: To appear in J. Chem. Phy
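As background for the dilute-limit discussion, the standard free-particle reference against which measured diffusion coefficients are commonly normalized is the Stokes-Einstein result for a sphere of hydrodynamic radius R in a solvent of temperature-dependent viscosity η(T); its explicit temperature dependence (both in T and in η) is consistent with the temperature sensitivity of collective diffusion noted above. This is textbook material, not a formula quoted from the paper itself:

```latex
% Dilute-limit (Stokes-Einstein) diffusion coefficient for a sphere of
% hydrodynamic radius R in a solvent of viscosity \eta(T):
D_0 = \frac{k_B T}{6 \pi \eta(T) R}
```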
Quantifying right ventricular motion and strain using 3D cine DENSE MRI
Background: The RV is difficult to image because of its thin wall, asymmetric geometry, and complex motion. DENSE is a quantitative MRI technique for measuring myocardial displacement and strain at high spatial and temporal resolutions [1,2]. DENSE encodes tissue displacement directly into the image phase, allowing the direct extraction of motion data at pixel resolution. A free-breathing, navigator-gated spiral 3D cine DENSE sequence was recently developed [3], providing an MRI technique well suited to quantifying RV mechanics. Methods: Whole-heart 3D cine DENSE data were acquired from two normal volunteers, after informed consent was obtained and in accordance with protocols approved by the University of Virginia institutional review board. The endocardial and epicardial contours were manually delineated to identify the myocardium from surrounding anatomical structures. A 3D spatiotemporal phase unwrapping algorithm was used to remove phase aliasing [4], and 3D Lagrangian displacement fields were derived for all cardiac phases. Midline contours were calculated from the epicardial and endocardial contours, and tissue tracking seed points were defined at pixel-spaced intervals. A 3D tracking algorithm was implemented as a direct extension of the 2D tracking algorithm presented in [4], producing midline motion trajectories from which strain was calculated. Tangential 1D strain was calculated in the longitudinal and circumferential cardiac directions, and strain-time curves were computed for the free wall of the RV. Results: Figure 1 illustrates the RV free wall mean tangential 1D strain-time curves for approximately 3/4 of the cardiac cycle over the apical-mid section of the heart for one volunteer. Results show measurements ranging between -0.1 and -0.25, and further illustrate a greater displacement in the longitudinal direction. Results compare favorably with studies using myocardial tagging and DENSE [5,6].
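The final step, computing tangential 1D strain from tracked midline trajectories, reduces to comparing segment lengths along the contour between a reference frame and a later cardiac phase. The sketch below assumes engineering strain (L - L0)/L0 between neighbouring tracked points and a simple list-of-(x, y, z) input format; both are assumptions about the abstract's pipeline, not a reproduction of it.

```python
import math

def tangential_strain(points_ref, points_t):
    # Tangential 1D strain along a tracked contour: for each segment
    # between neighbouring points, engineering strain (L - L0) / L0,
    # where L0 is the segment length in the reference (e.g. end-diastolic)
    # frame and L its length at the current cardiac phase.
    def seg_lengths(pts):
        return [math.dist(a, b) for a, b in zip(pts, pts[1:])]
    return [(l - l0) / l0
            for l0, l in zip(seg_lengths(points_ref), seg_lengths(points_t))]
```

Negative values indicate shortening, which matches the reported range of -0.1 to -0.25 for the contracting RV free wall.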
J Invest Dermatol
Acne vulgaris is a skin disorder of the sebaceous follicles, involving hyperkeratinization and perifollicular inflammation. Matrix metalloproteinases (MMP) have a predominant role in inflammatory matrix remodeling and hyperproliferative skin disorders. We investigated the expression of MMP and tissue inhibitors of MMP (TIMP) in facial sebum specimens from acne patients, before and after treatment with isotretinoin. Gelatin zymography and Western blot analysis revealed that sebum contains proMMP-9, which decreased following per os or topical treatment with isotretinoin, in parallel with the clinical improvement of acne. Sebum also contains MMP-1, MMP-13, TIMP-1, and TIMP-2, as assessed by ELISA and Western blot, but only MMP-13 decreased following treatment with isotretinoin. The MMP and TIMP in sebum are attributed to keratinocytes and sebocytes, since we found that HaCaT keratinocytes in culture secrete proMMP-2, proMMP-9, MMP-1, MMP-13, TIMP-1, and TIMP-2. SZ95 sebocytes in culture secreted proMMP-2 and proMMP-9, which was also confirmed by microarray analysis. Isotretinoin inhibited the arachidonic acid-induced secretion and mRNA expression of proMMP-2 and -9 in both cell types, and of MMP-13 in HaCaT keratinocytes. These data indicate that MMP and TIMP of epithelial origin may be involved in acne pathogenesis, and that the isotretinoin-induced reduction in MMP-9 and -13 may contribute to the therapeutic effects of the agent in acne.
Knowledge distillation for quality estimation
Quality Estimation (QE) is the task of automatically predicting Machine Translation quality in the absence of reference translations, making it applicable in real-time settings such as translating online social media conversations. Recent success in QE stems from the use of multilingual pre-trained representations, where very large models lead to impressive results. However, the inference time, disk, and memory requirements of such models do not allow for wide usage in the real world. Even models trained on distilled pre-trained representations remain prohibitively large for many usage scenarios. We instead propose to directly transfer knowledge from a strong QE teacher model to a much smaller model with a different, shallower architecture. We show that this approach, in combination with data augmentation, leads to lightweight QE models that perform competitively with distilled pre-trained representations, with 8x fewer parameters.
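Since sentence-level QE is a regression task (predicting a quality score), the teacher-to-student transfer described above can be sketched as training the small student to match the teacher's predicted scores. The mean-squared-error objective below is a minimal illustration of this distillation setup; the actual architectures, augmented data, and any loss weighting are omitted.

```python
def distillation_loss(student_scores, teacher_scores):
    # Knowledge distillation for a regression task: the student is trained
    # to reproduce the teacher's quality predictions, so the "soft" targets
    # are simply the teacher's scores and the loss is their mean squared
    # error against the student's outputs.
    n = len(student_scores)
    return sum((s - t) ** 2
               for s, t in zip(student_scores, teacher_scores)) / n
```

Because the student only needs to match scalar outputs, its architecture is free to be much shallower than the teacher's, which is what enables the 8x parameter reduction.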
LexGLUE: a benchmark dataset for legal language understanding in English
Law, interpretations of law, legal arguments, agreements, etc. are typically expressed in writing, leading to the production of vast corpora of legal text. Their analysis, which is at the center of legal practice, becomes increasingly elaborate as these collections grow in size. Natural language understanding (NLU) technologies can be a valuable tool to support legal practitioners in these endeavors. Their usefulness, however, largely depends on whether current state-of-the-art models can generalize across various tasks in the legal domain. To answer this currently open question, we introduce the Legal General Language Understanding Evaluation (LexGLUE) benchmark, a collection of datasets for evaluating model performance across a diverse set of legal NLU tasks in a standardized way. We also provide an evaluation and analysis of several generic and legal-oriented models, demonstrating that the latter consistently offer performance improvements across multiple tasks.