2 research outputs found
No Pattern, No Recognition: a Survey about Reproducibility and Distortion Issues of Text Clustering and Topic Modeling
Extracting knowledge from unlabeled texts using machine learning algorithms
can be complex. Document categorization and information retrieval are two
applications that may benefit from unsupervised learning (e.g., text clustering
and topic modeling), including exploratory data analysis. However, the
unsupervised learning paradigm poses reproducibility issues. The initialization
can lead to variability depending on the machine learning algorithm.
Furthermore, distortions can be misleading when interpreting cluster geometry.
Amongst the causes, the presence of outliers and anomalies can be a determining
factor. Despite the relevance of initialization and outlier issues for text
clustering and topic modeling, the authors did not find an in-depth analysis of
them. This survey provides a systematic literature review (2011-2022) of these
subareas and proposes a common terminology since similar procedures have
different terms. The authors describe research opportunities, trends, and open
issues. The appendices summarize the theoretical background of the text
vectorization, the factorization, and the clustering algorithms that are
directly or indirectly related to the reviewed works.
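The initialization issue mentioned in this abstract can be illustrated with a minimal sketch. It is not taken from the survey: it assumes plain Lloyd's k-means on a hypothetical 2-D toy dataset with one outlier, and the function name `kmeans`, the `seed` parameter, and the points are all made up for illustration. Running it with several seeds shows whether the final partition and its inertia depend on the random starting centroids.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points]


def kmeans(points, k, seed, iters=50):
    """Plain Lloyd's k-means; `seed` fixes the random choice of initial centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # the initialization step under study
    for _ in range(iters):
        labels = assign(points, centroids)
        new_centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centroids.append((sum(m[0] for m in members) / len(members),
                                      sum(m[1] for m in members) / len(members)))
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    labels = assign(points, centroids)
    inertia = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))
    return labels, inertia


# Hypothetical toy data: two tight groups plus one far-away outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (5, 5), (5, 6), (6, 5), (6, 6),
          (20, 20)]

for seed in range(5):
    labels, inertia = kmeans(points, k=2, seed=seed)
    print(seed, labels, round(inertia, 2))
```

Fixing the seed makes a single run repeatable, but it does not remove the sensitivity to initialization. Reporting results over several seeds is one common way to expose that variability.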
Unsupervised Separation of Transliterable and Native Words for Malayalam
Differentiating intrinsic language words from transliterable words is a key
step aiding text processing tasks involving different natural languages. We
consider the problem of unsupervised separation of transliterable words from
native words for text in the Malayalam language. Outlining a key observation on the
diversity of characters beyond the word stem, we develop an optimization method
to score words based on their nativeness. Our method relies on the usage of
probability distributions over character n-grams that are refined in step with
the nativeness scorings in an iterative optimization formulation. Using an
empirical evaluation, we illustrate that our method, DTIM, provides significant
improvements in nativeness scoring for Malayalam, establishing DTIM as the
preferred method for the task.
Comment: 10 pages, Proceedings of 14th International Conference on Natural
Language Processing, Kolkata, India. 18-21 December, 201
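The general idea of refining character n-gram distributions in step with nativeness scores can be sketched as follows. This is not DTIM and does not reproduce the paper's objective or its initialization: it assumes a simplified alternating scheme over character bigrams with additive smoothing. The function names, the romanized toy words, and the initial scores are all hypothetical.

```python
import math
from collections import defaultdict


def bigrams(word):
    """Character bigrams of a word, with boundary markers."""
    padded = "^" + word + "$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def weighted_dist(words, weights, vocab, alpha=0.1):
    """Smoothed bigram distribution where each word contributes by its weight."""
    counts = defaultdict(float)
    for word, weight in zip(words, weights):
        for g in bigrams(word):
            counts[g] += weight
    total = sum(counts.values()) + alpha * len(vocab)
    return {g: (counts[g] + alpha) / total for g in vocab}


def score_nativeness(words, init_scores, iters=10):
    """Alternate between re-estimating two bigram distributions and re-scoring words."""
    vocab = {g for w in words for g in bigrams(w)}
    scores = list(init_scores)
    for _ in range(iters):
        # Distributions weighted by the current nativeness scores.
        native = weighted_dist(words, scores, vocab)
        translit = weighted_dist(words, [1 - s for s in scores], vocab)
        new_scores = []
        for w in words:
            grams = bigrams(w)
            # Mean per-bigram log-likelihood ratio, squashed into (0, 1).
            llr = sum(math.log(native[g]) - math.log(translit[g])
                      for g in grams) / len(grams)
            new_scores.append(1 / (1 + math.exp(-llr)))
        scores = new_scores
    return scores


# Hypothetical romanized tokens and hand-picked starting scores (assumptions).
words = ["veedu", "veedukal", "veettil", "computer", "internet", "bus"]
init = [0.7, 0.7, 0.7, 0.3, 0.3, 0.3]

for word, score in zip(words, score_nativeness(words, init)):
    print(word, round(score, 3))
```

A uniform start (every score at 0.5) would make the two distributions identical and leave the scores unchanged, so a scheme like this needs an informative initialization. The abstract points to character diversity beyond the word stem as the signal used for that purpose.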