Conceptual Text Summarizer: A new model in continuous vector space
Traditional, manual methods of summarization are no longer cost-effective or
practical today. Extractive summarization automatically selects the most
important sentences from a text and generates a short, informative
summary. In this work, we propose an unsupervised method to summarize Persian
texts. This method is a novel hybrid approach that clusters the concepts of the
text using deep learning and traditional statistical methods. First, we produce
a word embedding based on Hamshahri2 corpus and a dictionary of word
frequencies. Then the proposed algorithm extracts the keywords of the document,
clusters its concepts, and finally ranks the sentences to produce the summary.
We evaluated the proposed method on Pasokh single-document corpus using the
ROUGE evaluation measure. Without using any hand-crafted features, our proposed
method achieves state-of-the-art results. We compared our unsupervised method
with the best supervised Persian methods and achieved an overall improvement of
7.5% in ROUGE-2 recall score.
Comment: The experimental results are complete
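The pipeline lends itself to a compact illustration. Below is a minimal, hypothetical sketch of the embed / extract-keywords / cluster / rank loop: it trains a small Word2Vec model on the input itself in place of the Hamshahri2 embedding, and the keyword, clustering, and scoring choices are simple stand-ins, not the authors' exact algorithm.

```python
# Hypothetical sketch of an embed -> keywords -> concept clusters -> sentence
# ranking summarizer. All thresholds and the scoring rule are illustrative.
from collections import Counter

import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

def summarize(sentences, tokenized, n_clusters=5, n_keywords=20, n_out=3):
    # 1) Train (or load) a word embedding over the corpus.
    w2v = Word2Vec(tokenized, vector_size=100, min_count=1, seed=0)
    # 2) Keywords: most frequent words, standing in for the paper's
    #    frequency-dictionary-based keyword extraction.
    freq = Counter(w for sent in tokenized for w in sent)
    keywords = [w for w, _ in freq.most_common(n_keywords)]
    # 3) Cluster keyword vectors into "concepts".
    vecs = np.array([w2v.wv[w] for w in keywords])
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(vecs)
    concept = dict(zip(keywords, labels))
    # 4) Rank sentences by how many distinct concepts they cover.
    def score(sent):
        return len({concept[w] for w in sent if w in concept})
    order = sorted(range(len(sentences)), key=lambda i: score(tokenized[i]), reverse=True)
    return [sentences[i] for i in sorted(order[:n_out])]  # keep original order
```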
A Survey on Content-Aware Video Analysis for Sports
Sports data analysis is becoming increasingly large-scale, diversified, and
shared, but difficulty persists in rapidly accessing the most crucial
information. Previous surveys have focused on the methodologies of sports video
analysis from the spatiotemporal viewpoint instead of a content-based
viewpoint, and few of these studies have considered semantics. This study
develops a deeper interpretation of content-aware sports video analysis by
examining the insight offered by research into the structure of content under
different scenarios. On the basis of this insight, we provide an overview of
the themes particularly relevant to the research on content-aware systems for
broadcast sports. Specifically, we focus on the video content analysis
techniques applied in sportscasts over the past decade from the perspectives of
fundamentals and general review, a content hierarchical model, and trends and
challenges. Content-aware analysis methods are discussed with respect to
object-, event-, and context-oriented groups. In each group, the gap between
sensation and content excitement must be bridged using proper strategies. In
this regard, a content-aware approach is required to determine user demands.
Finally, the paper summarizes the future trends and challenges for sports video
analysis. We believe that our findings can advance the field of research on
content-aware video analysis for broadcast sports.
Comment: Accepted for publication in IEEE Transactions on Circuits and Systems
for Video Technology (TCSVT)
Text Synopsis Generation for Egocentric Videos
Mass utilization of body-worn cameras has led to a huge corpus of available
egocentric video. Existing video summarization algorithms can accelerate
browsing such videos by selecting (visually) interesting shots from them.
Nonetheless, since the system user still has to watch the summary videos,
browsing large video databases remains a challenge. Hence, in this work, we
propose to generate a textual synopsis, consisting of a few sentences
describing the most important events in a long egocentric video. Users can
read the short text to gain insight about the video, and more importantly,
efficiently search through the content of a large video database using text
queries. Since egocentric videos are long and contain many activities and
events, using video-to-text algorithms results in thousands of descriptions,
many of which are incorrect. Therefore, we propose a multi-task learning scheme
to simultaneously generate descriptions for video segments and summarize the
resulting descriptions in an end-to-end fashion. We input a set of video shots,
and the network generates a text description for each shot. Next, a
visual-language content matching unit, trained with a weakly supervised
objective, identifies the correct descriptions. Finally, the last component of
our network, called the purport network, evaluates all the descriptions together to
select the ones containing crucial information. Out of thousands of
descriptions generated for the video, a few informative sentences are returned
to the user. We validate our framework on the challenging UT Egocentric video
dataset, where each video is 3 to 5 hours long and is associated with over
3,000 textual descriptions on average. The generated textual summaries,
which include only 5 percent (or less) of the generated descriptions, are compared
to ground-truth summaries in the text domain using well-established metrics in
natural language processing.
Comment: ICPR 2020
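As an illustration of the final stage only, here is a hedged sketch of how a few informative sentences might be kept out of thousands. It assumes the matching and purport scores were already produced by upstream models, and the product fusion rule is an assumption, not the paper's formulation.

```python
# Illustrative selection stage: combine per-description matching scores and
# informativeness scores, then keep the top 5% in temporal order.
import math

def select_synopsis(descriptions, match_scores, info_scores, keep_frac=0.05):
    # A simple product is one plausible way to fuse the two scores.
    combined = [m * i for m, i in zip(match_scores, info_scores)]
    k = max(1, math.ceil(keep_frac * len(descriptions)))
    top = sorted(range(len(descriptions)), key=lambda j: combined[j], reverse=True)[:k]
    # Return the kept sentences in their original (temporal) order.
    return [descriptions[j] for j in sorted(top)]

# Toy usage with placeholder scores:
synopsis = select_synopsis(
    ["opens fridge", "walks down hallway", "pays at register"],
    match_scores=[0.9, 0.4, 0.8],
    info_scores=[0.7, 0.2, 0.9],
)
```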
Learning to Extract Coherent Summary via Deep Reinforcement Learning
Coherence plays a critical role in producing a high-quality summary from a
document. In recent years, neural extractive summarization has become
increasingly attractive. However, most existing models ignore the coherence of
summaries when extracting sentences. As an effort towards extracting coherent
summaries, we propose a neural coherence model to capture the cross-sentence
semantic and syntactic coherence patterns. The proposed neural coherence model
obviates the need for feature engineering and can be trained in an end-to-end
fashion using unlabeled data. Empirical results show that the proposed neural
coherence model can efficiently capture the cross-sentence coherence patterns.
Using the combined output of the neural coherence model and the ROUGE package as
the reward, we design a reinforcement learning method to train the proposed
neural extractive summarizer, named the Reinforced Neural Extractive
Summarization (RNES) model. The RNES model learns to optimize the coherence and
informative importance of the summary simultaneously. Experimental results show
that the proposed RNES outperforms existing baselines and achieves
state-of-the-art performance in terms of ROUGE on the CNN/Daily Mail dataset. The
qualitative evaluation indicates that summaries produced by RNES are more
coherent and readable.
Comment: 8 pages, 1 figure, presented at AAAI-2018
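A minimal sketch of the reward design, assuming PyTorch and stubbed scorers: per-sentence selection probabilities are trained with REINFORCE against a reward mixing a coherence score with a ROUGE score. The weighting `alpha`, the Bernoulli policy, and both stub functions are illustrative assumptions; the real model uses the neural coherence network and the ROUGE package.

```python
# REINFORCE sketch of a coherence-plus-ROUGE reward for extractive selection.
import torch

def coherence_score(summary):   # stub for the neural coherence model
    return torch.tensor(0.5)

def rouge_score(summary, ref):  # stub for the ROUGE package
    return torch.tensor(0.3)

def reinforce_step(probs, sentences, reference, alpha=0.1):
    # Sample a binary extraction decision per sentence from the policy.
    dist = torch.distributions.Bernoulli(probs)
    picks = dist.sample()
    summary = [s for s, p in zip(sentences, picks) if p > 0]
    reward = rouge_score(summary, reference) + alpha * coherence_score(summary)
    # REINFORCE: scale log-probability of the sampled actions by the reward.
    return -(reward * dist.log_prob(picks).sum())

probs = torch.sigmoid(torch.randn(4, requires_grad=True))
loss = reinforce_step(probs, ["s1", "s2", "s3", "s4"], "reference summary")
loss.backward()  # gradients flow back to the policy parameters
```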
An Efficient Approach to Learning Chinese Judgment Document Similarity Based on Knowledge Summarization
In common law systems, a previous similar case can be used as a reference for
the current case, so that identical situations are treated consistently.
However, current approaches to judgment document similarity computation fail to
capture the core semantics of judgment documents and therefore suffer from low
accuracy and high computational complexity. In this paper, a
knowledge-block-summarization-based machine
learning approach is proposed to compute the semantic similarity of Chinese
judgment documents. By utilizing domain ontologies for judgment documents, the
core semantics of Chinese judgment documents is summarized based on knowledge
blocks. Then the Word Mover's Distance (WMD) algorithm is used to calculate the similarity between
knowledge blocks. Finally, experiments were conducted to illustrate that
our approach is effective and efficient, achieving higher accuracy and
faster computation than traditional approaches.
Comment: 23 pages
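The WMD step can be sketched directly with gensim, assuming pretrained word vectors and two already-extracted knowledge blocks; the vector file name and the distance-to-similarity mapping are placeholders, not the paper's choices.

```python
# Word Mover's Distance between two (already extracted) knowledge blocks.
from gensim.models import KeyedVectors

# Placeholder path; any word2vec-format vector file works here.
wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

block_a = ["defendant", "stole", "vehicle"]   # knowledge block, document A
block_b = ["accused", "theft", "car"]         # knowledge block, document B

# Lower WMD means the two blocks are semantically closer.
distance = wv.wmdistance(block_a, block_b)
similarity = 1.0 / (1.0 + distance)  # one common distance-to-similarity mapping
```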
State of the Art, Evaluation and Recommendations regarding "Document Processing and Visualization Techniques"
Several Networks of Excellence have been set up in the framework of the
European FP5 research program. Among these Networks of Excellence, the NEMIS
project focuses on the field of Text Mining.
Within this field, document processing and visualization was identified as
one of the key topics and the WG1 working group was created in the NEMIS
project to carry out a detailed survey of techniques associated with the text
mining process and to identify the relevant research topics in related research
areas.
In this document we present the results of this comprehensive survey. The
report includes a description of the current state-of-the-art and practice, a
roadmap for follow-up research in the identified areas, and recommendations for
anticipated technological development in the domain of text mining.
Comment: 54 pages, Report of Working Group 1 for the European Network of
Excellence (NoE) in Text Mining and its Applications in Statistics (NEMIS)
The automatic creation of concept maps from documents written using morphologically rich languages
A concept map is a graphical tool for representing knowledge. Concept maps have
been used in many different areas, including education, knowledge management,
business, and intelligence. Constructing concept maps manually can be a
complex task; an unskilled person may encounter difficulties in determining and
positioning concepts relevant to the problem area. An application that
recommends concept candidates and their position in a concept map can
significantly help the user in that situation. This paper gives an overview of
different approaches to automatic and semi-automatic creation of concept maps
from textual and non-textual sources. The concept map mining process is
defined, and one method suitable for the creation of concept maps from
unstructured textual sources in highly inflected languages such as the Croatian
language is described in detail. The proposed method uses statistical and data
mining techniques enriched with linguistic tools. With minor adjustments, that
method can also be used for concept map mining from textual sources in other
morphologically rich languages.
Comment: ISSN 0957-417
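To make the statistical core concrete, here is a toy sketch in which frequent terms become concept candidates and within-sentence co-occurrence suggests edges. A real system for an inflected language like Croatian would first lemmatize with linguistic tools, which this sketch deliberately omits.

```python
# Toy concept map mining: frequent terms as concepts, co-occurrence as edges.
from collections import Counter
from itertools import combinations

def concept_map(sentences, top_n=10):
    # Lowercasing only; a real pipeline would lemmatize inflected forms.
    tokens = [[w.lower() for w in s.split()] for s in sentences]
    freq = Counter(w for sent in tokens for w in sent)
    concepts = {w for w, _ in freq.most_common(top_n)}
    edges = Counter()
    for sent in tokens:
        present = sorted(concepts & set(sent))
        edges.update(combinations(present, 2))
    # edges maps (concept_a, concept_b) -> co-occurrence count.
    return concepts, edges
```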
Automatic Keyword Extraction for Text Summarization: A Survey
In recent times, data has been growing rapidly in every domain, such as news,
social media, banking, and education. Due to this excess of data, there is a
need for automatic summarizers capable of summarizing data, especially the
textual data of an original document, without losing any critical information.
Text summarization has emerged as an important research area in the recent
past. In this regard, a review of existing work on the text summarization
process is useful for carrying out further research. In this paper, recent
literature on automatic keyword extraction and text summarization is presented,
since the text summarization process depends heavily on keyword extraction.
This literature includes a discussion of the different methodologies used for
keyword extraction and text summarization. It also discusses the different
datasets used for text summarization in several domains, along with evaluation
metrics. Finally, it briefly discusses the issues and research challenges faced
by researchers, along with future directions.
Comment: 12 pages, 4 figures
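As one concrete instance of the keyword-extraction methods such surveys cover, the following sketch scores keyword candidates with TF-IDF using scikit-learn; it is a generic baseline for illustration, not a method proposed in the survey.

```python
# TF-IDF keyword extraction: top-weighted terms are keyword candidates.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "automatic text summarization extracts key sentences",
    "keyword extraction supports text summarization systems",
]
vec = TfidfVectorizer(stop_words="english")
tfidf = vec.fit_transform(docs)
terms = vec.get_feature_names_out()

# Top-scoring terms in the first document are its keyword candidates.
row = tfidf[0].toarray().ravel()
keywords = [terms[i] for i in row.argsort()[::-1][:3]]
```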
An Enhanced Latent Semantic Analysis Approach for Arabic Document Summarization
The fast-growing amount of information on the Internet makes research in
automatic document summarization urgent, as it is an effective solution to
information overload. Many approaches have been proposed based on different
strategies, such as latent semantic analysis (LSA). However, LSA, when applied
to document summarization, has some limitations which diminish its performance.
In this work, we try to overcome these limitations by applying statistical and
linear-algebraic approaches combined with syntactic and semantic processing of
text. First, a part-of-speech tagger is used to reduce the dimensionality of
LSA. Then, the weight of each term in the four adjacent sentences is added to the
weighting schemes while calculating the input matrix to take into account the
word order and the syntactic relations. In addition, a new LSA-based sentence
selection algorithm is proposed, in which the term description is combined with
the sentence description for each topic, which in turn makes the generated summary
more informative and diverse. To ensure the effectiveness of the proposed
LSA-based sentence selection algorithm, extensive experiments on Arabic and
English are conducted. Four datasets are used to evaluate the new model: the Linguistic
Data Consortium (LDC) Arabic Newswire-a corpus, Essex Arabic Summaries Corpus
(EASC), DUC2002, and Multilingual MSS 2015 dataset. Experimental results on the
four datasets show the effectiveness of the proposed model on Arabic and
English datasets. It comprehensively outperforms the state-of-the-art methods.
Comment: This is a pre-print of an article published in Arabian Journal for
Science and Engineering. The final authenticated version is available online
at: https://doi.org/10.1007/s13369-018-3286-
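The generic LSA selection scheme the paper builds on can be sketched briefly: build a term-sentence matrix, take an SVD, and pick one sentence per latent topic. The paper's POS filtering, adjacent-sentence weighting, and combined term/sentence description are omitted from this sketch.

```python
# Generic LSA sentence selection: SVD of the term-sentence matrix, then one
# top-weighted sentence per latent topic.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def lsa_summary(sentences, n_topics=2):
    # Term-sentence matrix (terms as rows, sentences as columns).
    A = CountVectorizer().fit_transform(sentences).T.toarray().astype(float)
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    # For each topic (row of Vt), pick the sentence with the largest weight.
    chosen = {int(np.argmax(np.abs(Vt[t]))) for t in range(min(n_topics, len(Vt)))}
    return [sentences[i] for i in sorted(chosen)]
```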
Toward Selectivity Based Keyword Extraction for Croatian News
This preliminary report presents a network-based, unsupervised method for
keyword extraction for Croatian, operating on a complex network built from
text. We build our approach on a new network measure, node selectivity,
motivated by research on graph-based centrality approaches. Node selectivity is
defined as the average weight distribution on the links of a single node. We
extract nodes (keyword candidates) based on the selectivity value. Furthermore,
we expand the extracted nodes into word-tuples ranked by the highest in/out
selectivity values. Selectivity-based extraction does not require linguistic
knowledge, as it is derived purely from the statistical and structural
information encompassed in the source text, which is reflected in the
structure of the network. The obtained sets are evaluated against manually
annotated keywords: for the set of extracted keyword candidates, the average F1
score is 24.63% and the average F2 score is 21.19%; for the extracted
word-tuple candidates, the average F1 score is 25.9% and the average F2 score
is 24.47%.
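Since selectivity is defined as the average weight on a node's links, it can be computed directly as weighted degree divided by plain degree. The toy graph below stands in for a word co-occurrence network built from text; the words and weights are placeholders.

```python
# Node selectivity = node strength (sum of incident edge weights) / degree,
# i.e. the average weight on a node's links.
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([("mreža", "čvor", 3.0), ("mreža", "tekst", 1.0),
                           ("čvor", "tekst", 2.0)])

def selectivity(g):
    return {n: g.degree(n, weight="weight") / g.degree(n)
            for n in g if g.degree(n) > 0}

# Rank keyword candidates by selectivity, highest first.
ranking = sorted(selectivity(G).items(), key=lambda kv: kv[1], reverse=True)
```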