Classifying reports of the same event from different countries is important for public opinion control and intelligence gathering. Because news is highly diverse, relying solely on human translators would be costly and inefficient, while relying solely on machine translation systems would incur considerable overhead in invoking translation interfaces and storing translated texts. To address this issue, we focus on the
clustering problem of cross-lingual news. Specifically, we represent each news article as a combination of a sentence-vector representation of its headline in a mixed semantic space and a topic probability distribution over its content. To train the cross-lingual model, we employ knowledge distillation to fit the two monolingual semantic spaces into a single mixed semantic space.
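The distillation objective can be illustrated with a toy sketch: a frozen "teacher" map stands in for the English teacher model, a trainable linear map stands in for the multilingual student, and two hypothetical Chinese–English parallel pairs stand in for the training corpus. The student is regressed onto the teacher's embedding for both the English sentence and its translation, so both languages land in one mixed semantic space. All vectors, dimensions, and hyperparameters below are illustrative, not the paper's actual setup.

```python
import random

# Toy parallel corpus: (English features, Chinese features) pairs.
# In the real system these would be sentences encoded by BERT / XLM-RoBERTa;
# here each sentence is a fixed one-hot feature vector (purely illustrative).
PAIRS = [([1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]),
         ([0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0])]

def teacher(en):
    """Stand-in for the frozen English teacher: a fixed linear map."""
    T = [[1.0, -1.0, 0.5, 0.0],
         [0.5, 2.0, -1.0, 1.0]]
    return [sum(t * x for t, x in zip(row, en)) for row in T]

def student(W, x):
    """Stand-in for the trainable multilingual student: a linear map W."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def distill(pairs, dim_out=2, dim_in=4, lr=0.1, epochs=500):
    """SGD on the distillation loss: the student must match the teacher's
    embedding of the English sentence on BOTH the English sentence and its
    Chinese translation, pulling the two languages into one shared space."""
    W = [[random.uniform(-0.1, 0.1) for _ in range(dim_in)]
         for _ in range(dim_out)]
    for _ in range(epochs):
        for en, zh in pairs:
            target = teacher(en)          # teacher sees only English
            for x in (en, zh):            # same target for both languages
                out = student(W, x)
                for i in range(dim_out):  # gradient step on squared error
                    err = out[i] - target[i]
                    for j in range(dim_in):
                        W[i][j] -= lr * err * x[j]
    return W
```

After training, the student maps a Chinese sentence and its English counterpart to (approximately) the same point, which is the property the clustering stage relies on.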
We abandon traditional static clustering methods such as K-Means and AGNES in favor of the incremental Single-Pass clustering algorithm, which we further modify to better suit cross-lingual news clustering. Our main
contributions are as follows: (1) We adopt the standard English BERT as the teacher model and XLM-RoBERTa as the student model, training through knowledge distillation a cross-lingual model that can represent sentence-level bilingual text in both Chinese and English. (2) We use the LDA topic model to represent news content, so that each article is a combination of a cross-lingual headline vector and a topic probability distribution over its content, and we introduce measures such as topic similarity to address the cross-lingual issue in content representation. (3) We adapt the Single-Pass clustering algorithm to the news setting to make it more applicable. Our optimizations of Single-Pass include adjusting the distance computation between samples and clusters, adding a cluster-merging operation, and incorporating a news time parameter.
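The modified Single-Pass procedure can be sketched as follows. Cosine similarity, centroid-based sample-to-cluster distance, a sliding time window in days, and all threshold values here are our own illustrative assumptions, not the paper's exact settings; the sketch only shows how the three optimizations (distance computation, cluster merging, time parameter) slot into the incremental loop.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def single_pass(docs, sim_threshold=0.6, time_window=3.0, merge_threshold=0.8):
    """Incremental Single-Pass clustering over time-ordered documents.

    docs: list of (vector, day) pairs, already sorted by day.
    A document joins the most similar *active* cluster (one whose latest
    news is within `time_window` days); otherwise it seeds a new cluster.
    After each assignment, active clusters whose centroids are closer than
    `merge_threshold` are merged.
    """
    clusters = []  # each: {"members": [vec, ...], "last_day": float}
    for vec, day in docs:
        best, best_sim = None, sim_threshold
        for c in clusters:
            if day - c["last_day"] > time_window:
                continue  # stale cluster: that event is no longer active
            sim = cosine(vec, centroid(c["members"]))
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"members": [vec], "last_day": day})
        else:
            best["members"].append(vec)
            best["last_day"] = day
        # merge near-duplicate active clusters created by early assignments
        merged = True
        while merged:
            merged = False
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    ci, cj = clusters[i], clusters[j]
                    if (day - ci["last_day"] > time_window
                            or day - cj["last_day"] > time_window):
                        continue  # never merge into stale clusters
                    if cosine(centroid(ci["members"]),
                              centroid(cj["members"])) >= merge_threshold:
                        ci["members"].extend(cj["members"])
                        ci["last_day"] = max(ci["last_day"], cj["last_day"])
                        del clusters[j]
                        merged = True
                        break
                if merged:
                    break
    return clusters
```

The merge step corrects a known weakness of Single-Pass (sensitivity to arrival order), while the time window keeps old event clusters from absorbing fresh news about a different story.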