Scientific document summarization via citation contextualization and scientific discourse
The rapid growth of scientific literature has made it difficult for researchers to quickly learn about developments in their respective fields.
Scientific document summarization addresses this challenge by providing
summaries of the important contributions of scientific papers. We present a
framework for scientific summarization which takes advantage of the citations
and the scientific discourse structure. Citation texts often lack the evidence
and context to support the content of the cited paper and are sometimes even inaccurate. We first address the problem of inaccuracy of the citation texts by
finding the relevant context from the cited paper. We propose three approaches
for contextualizing citations which are based on query reformulation, word
embeddings, and supervised learning. We then train a model to identify the
discourse facets for each citation. We finally propose a method for summarizing
scientific papers by leveraging the faceted citations and their corresponding
contexts. We evaluate our proposed method on two scientific summarization
datasets in the biomedical and computational linguistics domains. Extensive
evaluation results show that our methods can improve over the state of the art
by large margins.
Comment: Preprint. The final publication is available at Springer via http://dx.doi.org/10.1007/s00799-017-0216-8, International Journal on Digital Libraries (IJDL) 201
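A minimal sketch of the similarity-based contextualization step described above, with bag-of-words vectors standing in for the paper's word embeddings (the function names and example data are illustrative, not from the paper):

```python
from collections import Counter
import math

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def rank_contexts(citation_text, cited_sentences, top_k=2):
    # Score every sentence of the cited paper against the citation text and
    # return the top-k candidates as the citation's supporting context.
    query = Counter(citation_text.lower().split())
    scored = [(cosine(query, Counter(s.lower().split())), s) for s in cited_sentences]
    return [s for _, s in sorted(scored, reverse=True)[:top_k]]
```

In the full framework, a trained embedding model would replace the raw word counts, but the retrieval loop has the same shape.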
Latent Dirichlet Allocation (LDA) and Topic modeling: models, applications, a survey
Topic modeling is one of the most powerful techniques in text mining for data mining, latent data discovery, and finding relationships among data and text documents. Researchers have published many articles in the field of topic modeling and applied it in various fields such as software engineering, political science, medicine, and linguistics. There are various methods for topic modeling, of which Latent Dirichlet Allocation (LDA) is one of the most popular. Researchers have proposed various models based on LDA for topic modeling. Based on this previous work, this paper can be very useful and valuable for introducing LDA-based approaches to topic modeling. In this paper, we investigated scholarly articles (from 2003 to 2016) highly related to topic modeling based on LDA to discover the research development, current trends, and intellectual structure of topic modeling. We also summarize challenges and introduce well-known tools and datasets for topic modeling based on LDA.
Comment: arXiv admin note: text overlap with arXiv:1505.07302 by other author
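As an illustration of the kind of LDA pipeline the survey covers, a minimal example using scikit-learn's implementation (the corpus and topic count are made up for demonstration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "topic models discover latent topics in text",
    "latent dirichlet allocation is a topic model",
    "software engineering uses mining of source code",
    "mining code repositories helps software engineering",
]
# Build a document-term count matrix, then fit LDA with two topics.
X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # each row is a per-document topic mixture
```

Each row of `doc_topics` is a probability distribution over the two topics, which is the representation most LDA-based applications build on.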
Semi-Automatic Terminology Ontology Learning Based on Topic Modeling
Ontologies provide features like a common vocabulary, reusability, and machine-readable content, and also allow for semantic search, facilitate agent interaction, and support the ordering & structuring of knowledge for Semantic Web (Web 3.0) applications. However, a challenge in ontology engineering is automatic learning: there is still no fully automatic approach for forming an ontology from a text corpus or dataset of various topics using machine learning techniques. In this paper, two topic modeling algorithms are explored, namely LSI & SVD and Mr.LDA, for learning a topic ontology. The objective is to determine the statistical relationship between documents and terms to build a topic ontology and ontology graph with minimum human intervention. Experimental analysis on building a topic ontology and semantically retrieving the corresponding topic ontology for a user's query demonstrates the effectiveness of the proposed approach.
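The document-term factorization at the heart of the LSI & SVD step can be sketched with a plain SVD over a toy term-document count matrix (all values are illustrative, not from the paper):

```python
import numpy as np

# rows = terms, columns = documents (raw counts, illustrative only)
A = np.array([
    [3, 1, 0, 0],   # "ontology"
    [1, 3, 0, 0],   # "semantic"
    [0, 0, 2, 1],   # "topic"
    [0, 0, 1, 2],   # "model"
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                              # number of latent topics to keep
term_vecs = U[:, :k] * s[:k]       # terms placed in the latent topic space
doc_vecs = Vt[:k, :].T * s[:k]     # documents placed in the same space
```

Documents 1 and 2 (the "ontology/semantic" pair) end up close together in the latent space while staying orthogonal to documents 3 and 4, which is the statistical relationship the topic-ontology construction relies on.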
Putting Question-Answering Systems into Practice: Transfer Learning for Efficient Domain Customization
Traditional information retrieval (such as that offered by web search engines) burdens users with information overload through extensive result pages and the need to manually locate the desired information therein. Conversely,
question-answering systems change how humans interact with information systems:
users can now ask specific questions and obtain a tailored answer - both
conveniently in natural language. Despite obvious benefits, their use is often
limited to an academic context, largely because of expensive domain
customizations, which means that the performance in domain-specific
applications often fails to meet expectations. This paper proposes
cost-efficient remedies: (i) we leverage metadata through a filtering
mechanism, which increases the precision of document retrieval, and (ii) we
develop a novel fuse-and-oversample approach for transfer learning in order to
improve the performance of answer extraction. Here, knowledge is inductively transferred from a related, yet different, task to the domain-specific application, while accounting for potential differences in the sample sizes across both tasks.
across both tasks. The resulting performance is demonstrated with actual use
cases from a finance company and the film industry, where fewer than 400
question-answer pairs had to be annotated in order to yield significant
performance gains. As a direct implication to management, this presents a
promising path to better leveraging of knowledge stored in information systems.Comment: Accepted by ACM TMI
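One plausible reading of the fuse-and-oversample idea can be sketched as replicating the small domain-specific training set until it reaches a chosen share of the fused data, so it is not drowned out by the larger source-task data (the ratio and function name are my own, not the paper's):

```python
import random

def fuse_and_oversample(source, target, target_share=0.5, seed=0):
    # Oversample `target` (with replacement) until it makes up
    # `target_share` of the fused training set, then shuffle.
    rng = random.Random(seed)
    n_needed = int(len(source) * target_share / (1 - target_share))
    oversampled = [rng.choice(target) for _ in range(max(n_needed, len(target)))]
    fused = source + oversampled
    rng.shuffle(fused)
    return fused
```

With the default 50/50 share, a few hundred annotated domain pairs can balance tens of thousands of source-task examples, matching the low annotation budget the paper reports.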
Enriching Ontologies with Encyclopedic Background Knowledge for Document Indexing
The rapidly increasing number of scientific documents available publicly on
the Internet creates the challenge of efficiently organizing and indexing these
documents. Due to the time consuming and tedious nature of manual
classification and indexing, there is a need for better methods to automate
this process. This thesis proposes an approach which leverages encyclopedic
background knowledge for enriching domain-specific ontologies with textual and
structural information about the semantic vicinity of the ontologies' concepts.
The proposed approach aims to exploit this information for improving both
ontology-based methods for classifying and indexing documents and methods based on supervised machine learning.
Medical Image Analysis using Convolutional Neural Networks: A Review
The science of solving clinical problems by analyzing images generated in
clinical practice is known as medical image analysis. The aim is to extract
information in an effective and efficient manner for improved clinical
diagnosis. The recent advances in the field of biomedical engineering have made medical image analysis one of the top research and development areas. One of the reasons for this advancement is the application of machine learning techniques to the analysis of medical images. Deep learning is successfully used as a tool for machine learning, where a neural network is capable of automatically learning features. This is in contrast to methods that use traditionally hand-crafted features, whose selection and calculation is a challenging task. Among deep learning techniques, deep convolutional networks are actively used for the purpose of medical image analysis. Application areas include segmentation, abnormality detection, disease classification, computer-aided diagnosis, and retrieval. In this study, a comprehensive review of the current state of the art in medical image analysis using deep convolutional networks is presented. The challenges and potential of these techniques are also highlighted.
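A toy sketch of the basic building block such convolutional networks stack, a 2-D convolution followed by ReLU and max pooling (the kernel here is hand-picked to respond to horizontal gradients, not learned as in the networks the review covers):

```python
import numpy as np

def conv2d(img, kernel):
    # Valid (no-padding) 2-D cross-correlation of a single-channel image.
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0)

def max_pool(x, size=2):
    # Non-overlapping max pooling; trims edges that don't fit a full window.
    h, w = x.shape
    return x[:h - h % size, :w - w % size].reshape(h // size, size, w // size, size).max(axis=(1, 3))

img = np.arange(36, dtype=float).reshape(6, 6)   # toy "image"
edge = np.array([[-1.0, 1.0]])                   # horizontal-gradient kernel
feat = max_pool(relu(conv2d(img, edge)))         # one conv -> ReLU -> pool stage
```

Deep convolutional networks chain many such stages with learned kernels, which is what lets them discover features automatically instead of relying on hand-crafted ones.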
AppTechMiner: Mining Applications and Techniques from Scientific Articles
This paper presents AppTechMiner, a rule-based information extraction
framework that automatically constructs a knowledge base of all application
areas and problem solving techniques. Techniques include tools, methods,
datasets or evaluation metrics. We also categorize individual research articles
based on their application areas and the techniques proposed/improved in the
article. Our system achieves high average precision (~82%) and recall (~84%) in
knowledge base creation. It also performs well in application and technique
assignment to individual articles (average accuracy ~66%). Finally, we present two use cases: a trivial information retrieval system and an extensive temporal analysis of the usage of techniques and application areas. At present, we demonstrate the framework for the domain of computational linguistics, but it can be easily generalized to any other field of research.
Comment: JCDL 2017, 6th International Workshop on Mining Scientific Publications. arXiv admin note: substantial text overlap with arXiv:1608.0638
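A minimal sketch of rule-based technique extraction in the spirit of AppTechMiner, assuming two simple surface patterns of my own (the real system's rule set is richer):

```python
import re

# Surface patterns that capture a 1-3 word phrase after a trigger cue.
PATTERNS = [
    re.compile(r"using ((?:[A-Za-z][\w-]*)(?: [A-Za-z][\w-]*){0,2})"),
    re.compile(r"based on ((?:[A-Za-z][\w-]*)(?: [A-Za-z][\w-]*){0,2})"),
]

def extract_techniques(sentence):
    # Apply every pattern and collect the candidate technique phrases.
    found = []
    for pat in PATTERNS:
        found.extend(pat.findall(sentence))
    return found
```

Aggregating such candidates over a corpus, then normalizing and filtering them, is the usual route from per-sentence matches to a knowledge base of techniques and application areas.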
SECTOR: A Neural Model for Coherent Topic Segmentation and Classification
When searching for information, a human reader first glances over a document,
spots relevant sections and then focuses on a few sentences for resolving her
intention. However, the high variance of document structure makes it difficult to identify the salient topic of a given section at a glance. To tackle this
challenge, we present SECTOR, a model to support machine reading systems by
segmenting documents into coherent sections and assigning topic labels to each
section. Our deep neural network architecture learns a latent topic embedding
over the course of a document. This can be leveraged to classify local topics
from plain text and segment a document at topic shifts. In addition, we
contribute WikiSection, a publicly available dataset with 242k labeled sections
in English and German from two distinct domains: diseases and cities. From our
extensive evaluation of 20 architectures, we report the highest score of 71.6% F1 for the segmentation and classification of 30 topics from the English city domain, scored by our SECTOR LSTM model with bloom filter embeddings and bidirectional segmentation. This is a significant improvement of 29.5 points F1 compared to state-of-the-art CNN classifiers with baseline segmentation.
Comment: Author's final version, accepted for publication at TACL, 201
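The segmentation idea can be sketched with toy bag-of-words vectors standing in for SECTOR's learned topic embeddings: cut the document wherever consecutive sentence vectors diverge (the threshold and helper names are illustrative):

```python
from collections import Counter
import math

def bow_cosine(s1, s2):
    # Cosine similarity of two sentences under a bag-of-words model.
    a, b = Counter(s1.lower().split()), Counter(s2.lower().split())
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def segment(sentences, threshold=0.1):
    # Return the index of the first sentence of each detected section.
    boundaries = [0]
    for i in range(1, len(sentences)):
        if bow_cosine(sentences[i - 1], sentences[i]) < threshold:
            boundaries.append(i)
    return boundaries
```

SECTOR replaces the word-count vectors with a latent topic embedding learned over the whole document, which makes the shift detection far more robust than this surface comparison.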
Joint-ViVo: Selecting and Weighting Visual Words Jointly for Bag-of-Features based Tissue Classification in Medical Images
Automatically classifying the tissue types of a Region of Interest (ROI) in medical imaging has important applications in Computer-Aided Diagnosis (CAD), such as classification of breast parenchymal tissue in mammograms and classification of lung disease patterns in High-Resolution Computed Tomography (HRCT). Recently, the bag-of-features method has shown its power in this field, treating each ROI as a set of local features. In this paper, we investigate using the bag-of-features strategy to classify tissue types in medical imaging applications. Two important issues are considered here: visual vocabulary learning and weighting. Although there are already plenty of algorithms to deal with them, all treat the two independently, namely, the vocabulary is learned first and then the histogram is weighted. Inspired by Auto-Context, which learns features and a classifier jointly, we develop a novel algorithm that learns the vocabulary and weights jointly. The new algorithm, called Joint-ViVo, works in an iterative way. In each iteration, we first learn the weights for each visual word by maximizing the margin of ROI triplets, and then select the most discriminative visual words based on the learned weights for the next iteration. We test our algorithm on three tissue classification tasks: identifying brain tissue type in magnetic resonance imaging (MRI), classifying lung tissue in HRCT images, and classifying breast tissue density in mammograms. The results show that Joint-ViVo can classify tissues effectively.
Comment: This paper has been withdrawn by the author due to the terrible writing
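A toy sketch of the alternating scheme in Joint-ViVo, with a simple class-mean gap standing in for the paper's triplet-margin objective (the data and the scoring rule are illustrative only):

```python
import numpy as np

def joint_select(hists, labels, keep=2, rounds=2):
    # hists: (n_rois, n_words) bag-of-features histograms; labels: 0/1 classes.
    active = np.arange(hists.shape[1])
    for _ in range(rounds):
        H = hists[:, active]
        # Weight each visual word by the gap between its class-mean frequencies
        # (a stand-in for the triplet-margin weight learning).
        w = np.abs(H[labels == 0].mean(axis=0) - H[labels == 1].mean(axis=0))
        order = np.argsort(w)[::-1][:keep]        # most discriminative words
        active = active[np.sort(order)]           # keep them for the next round
    return active

hists = np.array([[5, 1, 1, 0], [4, 1, 1, 0], [0, 1, 1, 5], [0, 1, 1, 4]], dtype=float)
labels = np.array([0, 0, 1, 1])
kept = joint_select(hists, labels)  # indices of the retained visual words
```

The alternation is the essential point: weighting informs selection, and the pruned vocabulary changes the weights computed in the next iteration.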
Towards Deep Modeling of Music Semantics using EEG Regularizers
Modeling of music audio semantics has been previously tackled through
learning of mappings from audio data to high-level tags or latent unsupervised
spaces. The resulting semantic spaces are theoretically limited, either because
the chosen high-level tags do not cover all of music semantics or because audio
data itself is not enough to determine music semantics. In this paper, we
propose a generic framework for semantics modeling that focuses on the
perception of the listener, through EEG data, in addition to audio data. We
implement this framework using a novel end-to-end 2-view Neural Network (NN)
architecture and a Deep Canonical Correlation Analysis (DCCA) loss function
that forces the semantic embedding spaces of both views to be maximally
correlated. We also detail how the EEG dataset was collected and use it to
train our proposed model. We evaluate the learned semantic space in a transfer
learning context, by using it as an audio feature extractor in an independent
dataset and proxy task: music audio-lyrics cross-modal retrieval. We show that
our embedding model outperforms Spotify features and performs comparably to a
state-of-the-art embedding model that was trained on 700 times more data. We
further discuss improvements to the model that are likely to improve its performance.
Comment: 5 pages, 2 figures
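The objective behind the DCCA loss can be sketched as a negative correlation between two views of the same items, here with toy numbers standing in for the audio and EEG embeddings (the eps term and function name are my own):

```python
import numpy as np

def correlation_loss(view_a, view_b):
    # Negative mean Pearson correlation across embedding dimensions:
    # minimizing it pushes the two views to be maximally correlated.
    a = view_a - view_a.mean(axis=0)
    b = view_b - view_b.mean(axis=0)
    num = (a * b).sum(axis=0)
    den = np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0)) + 1e-8
    return -np.mean(num / den)

audio = np.array([[0.1], [0.4], [0.9]])  # toy audio-view embeddings
eeg = 2.0 * audio + 0.3                  # perfectly correlated toy EEG view
loss = correlation_loss(audio, eeg)      # close to -1
```

The full DCCA objective works on whitened multi-dimensional projections rather than per-dimension correlations, but the training signal it sends to both network branches has this character.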