State of the art document clustering algorithms based on semantic similarity
The continued growth of the Internet has hugely increased the number of text documents available in electronic form, and techniques for grouping these documents into meaningful clusters have become critical. Traditional clustering methods were based on statistical features, so clustering was done on a syntactic rather than a semantic basis. As a result, dissimilar documents were gathered into the same group because of polysemy and synonymy. An important solution to this problem is document clustering based on semantic similarity, in which documents are grouped according to their meaning rather than their keywords. In this research, eighty papers that use semantic similarity in different fields were reviewed; forty of them, which apply semantic similarity to document clustering and were published in the seven years from 2014 to 2020, were selected for in-depth study. A comprehensive literature review of all the selected papers is given, together with a detailed comparison of their clustering algorithms, the tools they use, and their evaluation methods. This helps in the implementation and evaluation of document clustering. The reviewed research is used to guide the preparation of the proposed research. Finally, an intensive discussion comparing the works is presented, and the results of our research are shown in figures.
A Case Study and Qualitative Analysis of Simple Cross-Lingual Opinion Mining
User-generated content from social media is produced in many languages,
making it technically challenging to compare the discussed themes from one
domain across different cultures and regions. This is relevant for domains in a
globalized world, such as market research, where people from two nations and
markets might have different requirements for a product. We propose a simple,
modern, and effective method for building a single topic model with sentiment
analysis capable of covering multiple languages simultaneously, based on a
pre-trained state-of-the-art deep neural network for natural language
understanding. To demonstrate its feasibility, we apply the model to newspaper
articles and user comments of a specific domain, i.e., organic food products
and related consumption behavior. The themes match across languages.
Additionally, we obtain a high proportion of stable and domain-relevant
topics, a meaningful relation between topics and their respective textual
contents, and an interpretable representation for social media documents.
Marketing can potentially benefit from our method, since it provides an
easy-to-use means of addressing specific customer interests from different
market regions around the globe. For reproducibility, we provide the code,
data, and results of our study.
Comment: 10 pages, 2 tables, 5 figures, full paper, peer-reviewed, published
at KDIR/IC3k 2021 conference
Linking Textual Resources to Support Information Discovery
A vast amount of information is today stored in the form of textual documents, many of which are available online. These documents come from different sources and are of different types. They include newspaper articles, books, corporate reports, encyclopedia entries and research papers. At a semantic level, these documents contain knowledge, which was created by explicitly connecting information and expressing it in the form of a natural language. However, a significant amount of knowledge is not explicitly stated in a single document, yet can be derived or discovered by researching, i.e. accessing, comparing, contrasting and analysing, information from multiple documents. Carrying out this work using traditional search interfaces is tedious due to information overload and the difficulty of formulating queries that would help us to discover information we are not aware of.
In order to support this exploratory process, we need to be able to effectively navigate between related pieces of information across documents. While information can be connected using manually curated cross-document links, this approach not only does not scale, but cannot systematically assist us in the discovery of sometimes non-obvious (hidden) relationships. Consequently, there is a need for automatic approaches to link discovery.
This work studies how people link content, investigates the properties of different link types, presents new methods for automatic link discovery and designs a system in which link discovery is applied to a collection of millions of documents to improve access to public knowledge.