36 research outputs found
Confluence of Vision and Natural Language Processing for Cross-media Semantic Relations Extraction
In this dissertation, we focus on extracting and understanding semantically meaningful relationships between data items of various modalities, especially relations between images and natural language. We explore ideas and techniques to integrate such cross-media semantic relations for machine understanding of large heterogeneous datasets made available through the expansion of the World Wide Web. The datasets collected from social media websites, news media outlets and blogging platforms usually contain multiple modalities of data. Intelligent systems are needed to automatically make sense of these datasets and present them in such a way that humans can find the relevant pieces of information or get a summary of the available material. Such systems have to process multiple modalities of data such as images, text, linguistic features, and structured data in reference to each other. For example, image and video search and retrieval engines are required to understand the relations between visual and textual data so that they can provide relevant answers, in the form of images and videos, to users' queries presented in the form of text. We emphasize the automatic extraction of semantic topics or concepts from the data available in any form, such as images, free-flowing text or metadata. These semantic concepts/topics become the basis of semantic relations across heterogeneous data types, e.g., visual and textual data. A classic problem involving image-text relations is the automatic generation of textual descriptions of images. This problem is the main focus of our work. In many cases, a large amount of text is associated with images. Deep exploration of the linguistic features of such text is required to fully utilize the semantic information encoded in it. A news dataset involving images and news articles is an example of this scenario. We devise frameworks for automatic news image description generation based on the semantic relations of images, as well as semantic understanding of the linguistic features of the news articles.
PathologyBERT -- Pre-trained Vs. A New Transformer Language Model for Pathology Domain
Pathology text mining is a challenging task given the reporting variability and constant new findings in cancer sub-type definitions. However, successful text mining of a large pathology database can play a critical role in advancing 'big data' cancer research such as similarity-based treatment selection, case identification, prognostication, surveillance, clinical trial screening, risk stratification, and many others. While there is a growing interest in developing language models for more specific clinical domains, no pathology-specific language space exists to support rapid data-mining development in the pathology space. In the literature, a few approaches have fine-tuned general transformer models on specialized corpora while maintaining the original tokenizer, but in fields requiring specialized terminology, these models often fail to perform adequately. We propose PathologyBERT, a pre-trained masked language model trained on 347,173 histopathology specimen reports and publicly released in the Huggingface repository. Our comprehensive experiments demonstrate that pre-training a transformer model on pathology corpora yields performance improvements on Natural Language Understanding (NLU) and Breast Cancer Diagnosis Classification when compared to nonspecific language models.
Comment: Submitted to the American Medical Informatics Association (AMIA) 2022 Annual Symposium.
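As a minimal sketch of how a publicly released masked language model of this kind might be queried for masked-token prediction, assuming the Hugging Face transformers library; the repository identifier below is illustrative and should be replaced with the authors' actual model ID:

# Minimal sketch: masked-token prediction with a pathology-domain BERT model.
# The repository ID is an assumed placeholder, not confirmed by the abstract.
from transformers import pipeline

MODEL_ID = "tsantos/PathologyBERT"  # illustrative; use the authors' released ID

fill_mask = pipeline("fill-mask", model=MODEL_ID)

# Ask the model to complete a pathology-style sentence.
for prediction in fill_mask("Invasive ductal [MASK] of the breast."):
    print(f"{prediction['token_str']:>15}  score={prediction['score']:.3f}")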
Was there COVID-19 back in 2012? Challenge for AI in Diagnosis with Similar Indications
Purpose: Since the recent COVID-19 outbreak, there has been an avalanche of research papers applying deep learning-based image processing to chest radiographs for detection of the disease. We test the performance of the two top models for CXR COVID-19 diagnosis on external datasets to assess model generalizability.
Methods: In this paper, we present our argument regarding the efficiency and applicability of existing deep learning models for COVID-19 diagnosis. We provide results from two popular models, COVID-Net and CoroNet, evaluated on three publicly available datasets and an additional institutional dataset collected from EMORY Hospital between January and May 2020, containing patients tested for COVID-19 infection using RT-PCR.
Results: There is a large false positive rate (FPR) for COVID-Net on both the ChexPert (55.3%) and MIMIC-CXR (23.4%) datasets. On the EMORY dataset, COVID-Net has 61.4% sensitivity, 0.54 F1-score and 0.49 precision. The FPR of the CoroNet model is significantly lower across all the datasets compared to COVID-Net: EMORY (9.1%), ChexPert (1.3%), ChestX-ray14 (0.02%), MIMIC-CXR (0.06%).
Conclusion: The models reported good to excellent performance on their internal datasets; however, we observed from our testing that their performance dramatically worsened on external data. This likely has several causes, including overfitting due to a lack of appropriate control patients and ground-truth labels. The fourth, institutional dataset was labeled using RT-PCR, which can be positive without radiographic findings and vice versa. Therefore, a fusion model of both clinical and radiographic data may have better performance and generalization.
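As a small, hedged illustration of how the reported figures (FPR, sensitivity, precision, F1) are computed from per-image predictions, using scikit-learn; the arrays are toy placeholders, not data from the study:

# Sketch: computing the evaluation metrics reported above from binary predictions.
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])  # reference labels (e.g., RT-PCR)
y_pred = np.array([1, 1, 0, 0, 1, 1, 0, 0])  # model predictions (placeholder)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

fpr = fp / (fp + tn)                         # false positive rate
sensitivity = recall_score(y_true, y_pred)   # true positive rate / recall
precision = precision_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"FPR={fpr:.1%}  sensitivity={sensitivity:.1%}  precision={precision:.2f}  F1={f1:.2f}")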
Designing A Symmetric Classifier For Image Annotation Using Multi-Layer Sparse Coding
Automatic annotation of images with descriptive words is a challenging problem with vast applications in the areas of image search and retrieval. This problem can be viewed as a label-assignment problem by a classifier dealing with a very large set of labels, i.e., the vocabulary set. We propose a novel annotation method that employs two layers of sparse coding and performs coarse-to-fine labeling. Themes extracted from the training data are treated as coarse labels. Each theme is a set of training images that share a common subject in their visual and textual contents. Our system extracts coarse labels for training and test images without requiring any prior knowledge. Vocabulary words are the fine labels to be associated with images. Most annotation methods achieve low recall due to the large number of available fine labels, i.e., vocabulary words. These systems also tend to achieve high precision only for highly frequent words. On the other hand, the text mining literature notes a general trend in which relatively rare and moderately frequent words are more important for the search and retrieval process than extremely frequent words. Our system not only outperforms various previously proposed annotation systems, but also achieves a symmetric response in terms of precision and recall. Our system achieves and maintains high precision for words across a wide range of frequencies. Such behavior is achieved by intelligently reducing the number of available fine labels, or words, for each image based on the coarse labels assigned to it.
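A highly simplified sketch of the coarse-to-fine idea, assuming scikit-learn: a first sparse-coding layer assigns each image to a dominant theme (coarse label), and the candidate vocabulary (fine labels) is then restricted to that theme. The dictionary size, features, and theme-to-word mapping are illustrative placeholders, not the paper's actual pipeline:

# Sketch of coarse-to-fine labeling with a sparse-coding theme layer.
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
X_train = rng.random((200, 512))   # placeholder image features
X_test = rng.random((5, 512))

# Layer 1: sparse coding over themes (coarse labels).
theme_layer = DictionaryLearning(n_components=10, transform_algorithm="lasso_lars",
                                 transform_alpha=0.5, random_state=0)
theme_layer.fit(X_train)
theme_codes_test = theme_layer.transform(X_test)

# Assign each test image to its dominant theme.
coarse_labels = np.argmax(np.abs(theme_codes_test), axis=1)

# Layer 2 (sketched): only the assigned theme's words remain candidate fine
# labels, shrinking the label space per image and balancing precision/recall.
theme_vocabulary = {t: [f"word_{t}_{k}" for k in range(5)] for t in range(10)}
for i, theme in enumerate(coarse_labels):
    print(f"image {i}: theme {theme} -> candidates {theme_vocabulary[theme]}")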
A Context-Driven Extractive Framework For Generating Realistic Image Descriptions
Automatic image annotation methods are extremely beneficial for image search, retrieval, and organization systems. The lack of strict correlation between semantic concepts and visual features, referred to as the semantic gap, is a huge challenge for annotation systems. In this paper, we propose an image annotation model that incorporates contextual cues, collected from sources both intrinsic and extrinsic to images, to bridge the semantic gap. The main focus of this paper is a large real-world data set of news images that we collected. Unlike standard image annotation benchmark data sets, our data set does not require human annotators to generate artificial ground truth descriptions after data collection, since our images already include contextually meaningful and real-world captions written by journalists. We thoroughly study the nature of image descriptions in this real-world data set. News image captions describe both the visual contents and the contexts of images. Auxiliary information sources are also available with such images in the form of news articles and metadata (e.g., keywords and categories). The proposed framework extracts contextual cues from available sources of different data modalities and transforms them into a common representation space, i.e., the probability space. Predicted annotations are later transformed into sentence-like captions through an extractive framework applied over news articles. Our context-driven framework outperforms the state of the art on the collected data set of approximately 20,000 items, as well as on a previously available, smaller news images data set.
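A toy sketch of the two core steps described above: fusing per-modality word probabilities in a common probability space, then extracting the article sentence that best covers the predicted annotations. All distributions, words, and sentences are placeholders:

# Toy sketch: fuse per-modality word probabilities, then pick the article
# sentence that best covers the top predicted annotations.
import numpy as np

vocab = ["election", "protest", "stadium", "flood"]
p_visual = np.array([0.1, 0.2, 0.6, 0.1])    # cues from image content
p_article = np.array([0.5, 0.3, 0.1, 0.1])   # cues from the news article
p_metadata = np.array([0.4, 0.4, 0.1, 0.1])  # cues from keywords/categories

# Fuse in the common probability space (product of experts, renormalized).
p_fused = p_visual * p_article * p_metadata
p_fused /= p_fused.sum()
annotations = [vocab[i] for i in np.argsort(p_fused)[::-1][:2]]

# Extractive step: score each article sentence by annotation coverage.
sentences = [
    "Voters queued outside polling stations before the election.",
    "A protest over the results drew a small crowd downtown.",
]
best = max(sentences, key=lambda s: sum(w in s.lower() for w in annotations))
print(annotations, "->", best)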
T-Clustering: Image Clustering By Tensor Decomposition
Image clustering is an important tool for organizing ever-growing image repositories for efficient search and retrieval. A variety of clustering algorithms have been employed to cluster images. In this paper, we present a clustering algorithm, named T-Clustering, especially tailored to suit image collections. T-Clustering is based on tensor decomposition and takes into account the spatial configuration of images. The algorithm is non-parametric and works very well with raw images, thus alleviating the need to transform images into any feature domain. Our experiments show that the algorithm outperforms well-known non-parametric clustering algorithms on a variety of image collections.
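A compact sketch of the general idea, decomposing a stack of raw images as a 3-way (image x row x column) tensor and grouping images by their loadings along the image mode, assuming the tensorly library. The rank, data, and the final k-means step are illustrative choices for the sketch; the paper itself describes a non-parametric procedure:

# Sketch: cluster raw images via a CP decomposition of the image tensor.
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
images = rng.random((60, 32, 32))           # 60 tiny grayscale "images" (placeholder)

tensor = tl.tensor(images)
weights, factors = parafac(tensor, rank=8)  # factors for the image, row, column modes

# Each image is summarized by its coefficients on the 8 components, which
# retain the spatial (row/column) structure of the raw pixels.
image_factor = tl.to_numpy(factors[0])      # shape (60, 8)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(image_factor)
print(labels)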
Feature-Independent Context Estimation For Automatic Image Annotation
Automatic image annotation is a highly valuable tool for image search, retrieval and archival systems. In the absence of an annotation tool, such systems have to rely on either users' input or the large amount of text on the webpage of the image to acquire its textual description. Users may provide insufficient or noisy tags, and not all the text on a webpage is a description or explanation of the accompanying image. Therefore, it is extremely important to develop efficient tools for automatic annotation of images with correct and sufficient tags. The context of the image plays a significant role in this process, along with the content of the image. A suitable quantification of the context of the image may reduce the semantic gap between visual features and an appropriate textual description of the image. In this paper, we present an unsupervised, feature-independent quantification of the context of the image through tensor decomposition. We incorporate the estimated context as prior knowledge in the process of automatic image annotation. Evaluation of the predicted annotations provides evidence of the effectiveness of our feature-independent context estimation method.
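As a minimal illustration of incorporating an estimated context as prior knowledge in annotation, multiplying a context-derived word distribution with content-based likelihoods; the values are placeholders, and in the paper the context estimate itself comes from tensor decomposition rather than being given directly:

# Toy sketch: combine a context prior with content-based likelihoods to
# rank candidate annotation words. All values are placeholders.
import numpy as np

vocab = np.array(["beach", "sunset", "meeting", "laptop"])
p_word_given_context = np.array([0.40, 0.35, 0.15, 0.10])  # estimated context prior
p_word_given_content = np.array([0.30, 0.20, 0.25, 0.25])  # visual-content likelihood

posterior = p_word_given_context * p_word_given_content
posterior /= posterior.sum()

for word, p in sorted(zip(vocab, posterior), key=lambda t: -t[1]):
    print(f"{word:>8}: {p:.2f}")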