5 research outputs found
The Role of the Researcher in Creating a Corpus of Conversational Narratives
In this contribution, an anthropolinguistic analysis of examples extracted from the conversations that make up the corpus compiled for the study The Stereotype of Time in the Discourse of Displaced Persons from Kosovo and Metohija points to the role of the researcher in conducting field conversations with displaced persons. The analysis focuses on the researcher's interventions in the conversations, which, on the one hand, played an important role in forming the complete corpus and, on the other, reveal some of the differences in how the researcher and the interlocutors conceptualize the world; these differences could not have been anticipated in advance and became apparent only after the transcripts were analyzed.
Extracting Multilingual Topics from Unaligned Comparable Corpora
Topic models have been studied extensively in the context of monolingual corpora. Though there have been attempts to mine topical structure from cross-lingual corpora, they require clues about document alignments. In this paper we present a generative model called JointLDA, which uses a bilingual dictionary to mine multilingual topics from an unaligned corpus. Experiments conducted on different data sets confirm our conjecture that jointly modeling the cross-lingual corpora offers several advantages over individual monolingual models. Since the JointLDA model merges related topics in different languages into a single multilingual topic: a) it can fit the data with relatively fewer topics; b) it can predict related words from a language different from that of the given document. In fact, it has better predictive power than the bag-of-words translation model, leaving open the possibility that JointLDA is preferable to the bag-of-words model for cross-lingual IR applications. We also found that the monolingual models learnt while optimizing the cross-lingual corpora are more effective than the corresponding LDA models.
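The abstract does not give JointLDA's internals, but its key mechanism is that a bilingual dictionary bridges the two vocabularies so one topic model can span both languages. The toy sketch below illustrates only that preprocessing idea under assumed conventions: the dictionary, the documents, and the `concept_id` naming scheme are all hypothetical, not the paper's actual implementation.

```python
# Hypothetical sketch: a bilingual dictionary collapses translation
# pairs to a shared language-neutral token, so documents in either
# language can feed a single topic model over one merged vocabulary.
# Dictionary and documents are toy examples, not real data.

BILINGUAL_DICT = {  # toy English -> Spanish entries
    "war": "guerra",
    "peace": "paz",
    "economy": "economia",
}
REVERSE_DICT = {v: k for k, v in BILINGUAL_DICT.items()}

def concept_id(word, lang):
    """Map a word to a shared concept id if the dictionary covers it;
    otherwise keep a language-tagged id of its own."""
    if lang == "en" and word in BILINGUAL_DICT:
        return "en+es:" + word
    if lang == "es" and word in REVERSE_DICT:
        return "en+es:" + REVERSE_DICT[word]
    return lang + ":" + word

docs = [
    (["war", "economy", "treaty"], "en"),
    (["guerra", "paz", "tratado"], "es"),
]
merged = [[concept_id(w, lang) for w in words] for words, lang in docs]
print(merged)
# "war" and "guerra" now share the id "en+es:war", so a topic that
# loads on it is multilingual by construction; out-of-dictionary
# words ("treaty", "tratado") remain language-specific.
```

Fitting a standard LDA model over `merged` would then yield topics whose top words mix both languages, which is consistent with the merging behavior the abstract describes.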
Classifying Bias in Large Multilingual Corpora via Crowdsourcing and Topic Modeling
Our project extends previous algorithmic approaches to finding bias in large text corpora. We used multilingual topic modeling to examine language-specific bias in the English, Spanish, and Russian versions of Wikipedia. In particular, we placed Spanish articles discussing the Cold War on a Russian-English viewpoint spectrum based on similarity in topic distribution. We then crowdsourced human annotations of Spanish Wikipedia articles for comparison to the topic model. Our hypothesis was that human annotators and topic modeling algorithms would provide correlated results for bias. However, that was not the case. Our annotations indicated that humans were more perceptive of sentiment in the article text than of topic distribution, which suggests that our classifier provides a different perspective on a text's bias.
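The abstract does not specify how articles were placed on the Russian-English viewpoint spectrum beyond "similarity in topic distribution." One plausible reading, sketched below with made-up numbers, compares a document's topic distribution to a Russian-leaning and an English-leaning reference distribution using Jensen-Shannon divergence; the reference distributions, the three-topic space, and the normalized score are all assumptions for illustration.

```python
import math

# Hypothetical sketch of a viewpoint spectrum: score a document by how
# close its topic distribution is to each of two reference
# distributions. All distributions below are toy values.

def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

russian_ref = [0.6, 0.3, 0.1]   # toy topic mix of Russian-leaning articles
english_ref = [0.1, 0.3, 0.6]   # toy topic mix of English-leaning articles

def spectrum_position(doc):
    """0.0 = identical to the Russian reference, 1.0 = English."""
    d_ru = jsd(doc, russian_ref)
    d_en = jsd(doc, english_ref)
    return d_ru / (d_ru + d_en)

spanish_doc = [0.5, 0.3, 0.2]   # toy distribution for a Spanish article
pos = spectrum_position(spanish_doc)
print(round(pos, 3))  # well below 0.5: closer to the Russian reference
```

Any distribution distance (cosine, Hellinger, KL) could stand in for Jensen-Shannon here; JSD is used only because it is symmetric and bounded, which keeps the 0-to-1 spectrum score well behaved.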