Statements on social media can be analysed to identify individuals who are experiencing red flag medical symptoms, allowing early detection of the spread of disease such as influenza. Since disease does not respect cultural borders and may spread between populations speaking different languages, we would like to build multilingual models. However, the data required to train models for every language may be difficult, expensive and time-consuming to obtain, particularly for low-resource languages. Taking Japanese as our target language, we explore methods by which data in one language might be used to build models for a different language. We evaluate strategies of training on machine translated data and of zero-shot transfer through the use of multilingual models. We find that the choice of source language impacts the performance, with Chinese-Japanese being a better language pair than English-Japanese. Training on machine translated data shows promise, especially when used in conjunction with a small amount of target language data.

Postprint. Peer reviewed.

Appelgren, Mattias

Falis, Matúš

Ikeda, Satoshi

O'Neil, Alison Q

Schrempf, Patrick

arXiv

St Andrews Research Repository

Language Transfer for Early Warning of Epidemics from Social Media

Mattias Appelgren¹, Patrick Schrempf¹,², Matúš Falis¹, Satoshi Ikeda¹, Alison Q. O’Neil¹,³
¹ Canon Medical Research Europe, ² University of St Andrews, ³ University of Edinburgh
{mattias.appelgren, patrick.schrempf, matus.falis}@eu.medical.canon
{satoshi.ikeda, alison.oneil}@eu.medical.canon

Abstract

Statements on social media can be analysed to identify individuals who are experiencing red flag medical symptoms, allowing early detection of the spread of disease such as influenza. Since disease does not respect cultural borders and may spread between populations speaking different languages, we would like to build multilingual models. However, the data required to train models for every language may be difficult, expensive and time-consuming to obtain, particularly for low-resource languages. Taking Japanese as our target language, we explore methods by which data in one language might be used to build models for a different language. We evaluate strategies of training on machine translated data and of zero-shot transfer through the use of multilingual models. We find that the choice of source language impacts the performance, with Chinese-Japanese being a better language pair than English-Japanese. Training on machine translated data shows promise, especially when used in conjunction with a small amount of target language data.

1 Introduction

The spread of influenza is a major health concern. Without appropriate preventative measures, this can escalate to an epidemic, causing high levels of mortality. A potential route to early detection is to analyse statements on social media platforms to identify individuals who have reported experiencing symptoms of the illness.
These numbers can be used as a proxy to monitor the spread of the virus.

Since disease does not respect cultural borders and may spread between populations speaking different languages, we would like to build models for several languages without going through the difficult, expensive and time-consuming process of generating task-specific labelled data for each language. In this paper we explore ways of taking data and models generated in one language and transferring to other languages for which there is little or no data.

2 Related Work

Previously, authors have created multilingual models which should allow transfer between languages by aligning models [van der Plas and Tiedemann, 2006] or embedding spaces [Johnson et al., 2019, Alaux et al., 2019]. An alternative is translation of a high-resource language into the target low-resource language; for instance, [Chaudhary et al., 2019] combined translation with subsequent selective correction by active learning of uncertain words and phrases believed to describe entities, to create a labelled dataset for named entity recognition.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
arXiv:1910.04519v1 [cs.CL] 10 Oct 2019

3 MedWeb Dataset

We use the MedWeb (“Medical Natural Language Processing for Web Document”) dataset [Wakamiya et al., 2017] that was provided as part of a subtask at the NTCIR-13 Conference [Kato and Liu, 2017]. The data is summarised in Table 1. There are a total of 2,560 pseudo-tweets in three different languages: Japanese (ja), English (en) and Chinese (zh). These were created in Japanese and then manually translated into English and Chinese (see Figure 1). Each pseudo-tweet is labelled with a subset of the following 8 labels: influenza, diarrhoea/stomach ache, hay fever, cough/sore throat, headache, fever, runny nose, and cold. A positive label is assigned if the author (or someone they live with) has the symptom in question.
As such it is more than a named entity recognition task, as can be seen in pseudo-tweet #3 in Figure 1 where the term “flu” is mentioned but the label is negative.

Table 1: MedWeb dataset overview statistics.

| Dataset | #Pseudo-Tweets | Mean #labels per example | Influenza | Diarrhoea | Hay fever | Cough | Headache | Fever | Runny nose | Cold | #Examples with no labels |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Training | 1,920 | 0.997 | 106 | 182 | 163 | 227 | 251 | 345 | 375 | 265 | 530 |
| Test | 640 | 0.933 | 24 | 64 | 46 | 80 | 77 | 93 | 123 | 90 | 195 |

Pseudo-tweet #1. Labels: Cold
(ja) 風邪を引くと全身がだるくなる。
(en) The cold makes my whole body weak.
(zh) 一感冒就身酸无力。

Pseudo-tweet #2. Labels: Hay fever, Runny nose
(ja) アトピーと花粉症が重なってつらい
(en) It’s really bad. My eczema and allergies are acting up at the same time.
(zh) 敏症加花粉症，受死了。

Pseudo-tweet #3. Labels: No labels
(ja) 今日インフルの手術じゃないただの注射なのにビビる
(en) I’m so scared of today’s flu shot, and it’s not even surgery or anything.
(zh) 今天只打不做流感手，但是害怕。

Figure 1: Example pseudo-tweet triplets.

4 Methods

Bidirectional Encoder Representations from Transformers (BERT): The BERT model [Devlin et al., 2018] base version is a 12-layer Transformer model trained on two self-supervised tasks using a large corpus of text. In the first (denoising autoencoding) task, the model must map input sentences with some words replaced with a special “MASK” token back to the original unmasked sentences. In the second (binary classification) task, the model is given two sentences and must predict whether or not the second sentence immediately follows the first in the corpus. The output of the final Transformer layer is passed through a logistic output layer for classification. We have used the original (English) BERT-base¹, trained on Wikipedia and books corpus [Zhu et al., 2015], and a Japanese BERT (jBERT) [Kikuta, 2019] trained on Japanese Wikipedia. The original BERT model and jBERT use a standard sentence piece tokeniser with roughly 30,000 tokens.

¹ PyTorch code and pre-trained models for BERT: https://github.com/huggingface/transformers

Multilingual BERT: Multilingual BERT (mBERT)² is a BERT model simultaneously trained on Wikipedia in 100 different languages.
It makes use of a shared sentence piece tokeniser with roughly 100,000 tokens trained on the same data. This model provides state-of-the-art zero-shot transfer results on natural language inference and part-of-speech tagging tasks [Pires et al., 2019].

Translation: We use two publicly available machine translation systems to provide two possible translations for each original sentence: Google’s neural translation system [Wu et al., 2016] via Google Cloud³, and Amazon Translate⁴. We experiment using the translations singly and together.

Training procedure: Models are trained for 20 epochs, using the Adam optimiser [Kingma and Ba, 2014] and a cyclical learning rate [Smith, 2017] varied linearly between 5 × 10⁻⁶ and 3 × 10⁻⁵.

² Multilingual BERT Models: https://github.com/google-research/bert/blob/master/multilingual.md
³ Cloud Translation | Google Cloud: https://cloud.google.com/translate/
⁴ Amazon Translate: Neural Machine Translation: https://aws.amazon.com/translate/

5 Experiments

Using the multilingual BERT model, we run three experiments as described below. The “exact match” metric from the original MedWeb challenge is reported, which means that all labels must be predicted correctly for a given pseudo-tweet to be considered correct; macro-averaged F1 is also reported. Each experiment is run 5 times (with different random seeds) and the mean performance is shown in Table 2. Our experiments are focused around using Japanese as the low-resource target language, with English and Chinese as the more readily available source languages.

Table 2: Overall results, given as mean (standard deviation) of 5 runs, for different training/test data pairs. The leading results on the original challenge are shown as baselines for benchmarking purposes. EN - English, JA - Japanese, ZH - Chinese, TJA - Translated Japanese.

| Group | Model | Source | Train | Test | Exact Match Accuracy | F1 macro |
|---|---|---|---|---|---|---|
| Baselines | Majority class classifier | - | - | - | 0.305 | - |
| Baselines | Random classifier | - | - | - | 0.130 (0.012) | 0.118 (0.007) |
| Baselines | Iso et al. [2017] | - | EN | EN | 0.795 | - |
| Baselines | Iso et al. [2017] | - | JA | JA | 0.825 | - |
| Baselines | Iso et al. [2017] | - | ZH | ZH | 0.809 | - |
| Baselines | BERT | - | EN | EN | 0.847 (0.003) | 0.884 (0.004) |
| Baselines | jBERT | - | JA | JA | 0.843 (0.012) | 0.880 (0.006) |
| Baselines | mBERT | - | ZH | ZH | 0.835 (0.004) | 0.876 (0.006) |
| Zero-shot transfer | mBERT | - | EN | JA | 0.305 (0.001) | - |
| Zero-shot transfer | mBERT | - | ZH | JA | 0.507 (0.007) | 0.484 (0.032) |
| Machine translation | mBERT | EN | TJA | JA | 0.740 (0.011) | 0.740 (0.012) |
| Machine translation | mBERT | ZH | TJA | JA | 0.774 (0.008) | 0.821 (0.010) |
| Machine translation | mBERT | EN | TJA (x2) | JA | 0.754 (0.009) | 0.758 (0.034) |
| Machine translation | mBERT | ZH | TJA (x2) | JA | 0.804 (0.004) | 0.849 (0.098) |

5.1 Baselines

To establish a target for our transfer techniques we train and test models on a single language, i.e. English to English, Japanese to Japanese, and Chinese to Chinese. For English we use the uncased base-BERT, for Japanese we use jBERT, and for Chinese we use mBERT (since there is no Chinese-specific model available in the public domain). This last choice seems reasonable since mBERT performed similarly to the single-language models when trained and tested on the same language.

For comparison, we show the results of Iso et al. [2017] who created the most successful model for the MedWeb challenge. Their final system was an ensemble of 120 trained models, using two architectures: a hierarchical attention network and a convolutional neural network. They exploited the fact that parallel data is available in three languages by ensuring consistency between outputs of the models in each language, giving a final exact match score of 0.880. However, for the purpose of demonstrating language transfer we report their highest single-model scores to show that our single-language models are competitive with the released results.
We also show results for a majority class classifier (predicting all negative labels, see Table 1) and a random classifier that uses the label frequencies from the training set to randomly predict labels.

5.2 Zero-shot transfer with multilingual pre-training

Our first experiment investigates the zero-shot transfer ability of multilingual BERT. If mBERT has learned a shared embedding space for all languages, we would expect that a model fine-tuned on the English training dataset should also be applicable to the Japanese dataset. To test this, we fine-tune on both the English and the Chinese training data; results are shown in Table 2. We ran additional experiments where we froze layers within BERT, but observed no improvement.

The results indicate poor transfer, especially between English and Japanese. To investigate why the model does not perform well, we visualise the output vectors of mBERT using t-SNE [Maaten and Hinton, 2008] in Figure 2. We can see that the language representations occupy separate parts of the representation space, with only small amounts of overlap. Further, no clear correlation can be observed between sentence pairs.

[Figure 2 legend: Zh-Ja pairs, En-Ja pairs, English, Japanese, Chinese]

Figure 2: Max-pooled output of mBERT final layer (before fine tuning), reduced using principal component analysis (to reduce from 768 to 50 dimensions) followed by t-SNE (to project into 2 dimensions). 20 sentence triplets are linked to give an idea of the mapping between languages.

The better transfer between Chinese and Japanese likely reflects the fact that these languages share tokens; one of the Japanese alphabets (the Kanji logographic alphabet) consists of Chinese characters. There is 21% vocabulary overlap for the training data and 19% for the test data, whereas there is no token overlap between English and Japanese.
Our finding is consistent with previous claims that token overlap impacts mBERT’s transfer capability [Pires et al., 2019].

5.3 Training on machine translated data

Our second experiment investigates the use of machine translated data for training a model. We train on the machine translated source data and test on the target test set. Results are shown in Table 2. Augmenting the data by using two sets of translations rather than one proves beneficial. In the end, the gap in exact match accuracy between training on real Japanese and training on translations from English is around 9 percentage points, while the gap for translations from Chinese is around 4 percentage points.

5.4 Mixing translated data with original data

Whilst the results for translated data are promising, we would like to bridge the gap to the performance of the original target data. Our premise is that we start with a fixed-size dataset in the source language, and we have a limited annotation budget to manually translate a proportion of this data into the target language. For this experiment we mix all the translated data with different portions of original Japanese data, varying the amount between 1% and 100%. The results of these experiments are shown in Figure 3. Using the translated data with just 10% of the original Japanese data, we close the gap by half; with 50% we match the single-language model; and with 100% we appear to achieve even a small improvement (for English), likely through the data augmentation provided by the translations.

[Figure 3 plot. x-axis: Original Japanese (%), 0 to 100. y-axis: Exact Match Score, 0.3 to 0.8. Legend: Only 100% Original Japanese; Only 100% Chinese Translations; Only 100% English Translations; x% Original Japanese; x% Original Japanese + 100% Original English; x% Original Japanese + 100% Chinese Translations; x% Original Japanese + 100% English Translations]

Figure 3: Exact match accuracy when training on different proportions of the original Japanese training set, with or without either the original English data or the translated data.
The pink and orange dashed lines show the accuracy of the full set of translated Japanese data (from English and Chinese respectively) and the blue dashed line shows the accuracy of the full original Japanese data.

6 Discussion and Conclusions

Zero-shot transfer using multilingual BERT performs poorly when transferring to Japanese on the MedWeb data. However, training on machine translations gives promising performance, and this performance can be increased by adding small amounts of original target data. On inspection, the drop in performance between translated and original Japanese was often a result of translations that were reasonable but not consistent with the labels. For example, when translating the first example in Figure 1, both machine translations map “風邪”, which means cold (the illness), into “寒さ”, which means cold (low temperature). Another example is where the Japanese pseudo-tweet “花粉症の時期はすごい疲れる。” was provided alongside an English pseudo-tweet “Allergy season is so exhausting.”. Here, the Japanese word for hay fever “花粉症” has been manually mapped to the less specific word “allergies” in English; the machine translation maps back to Japanese using the word for “allergies” i.e. “アレルギー” in the katakana alphabet (katakana is used to express words derived from foreign languages), since there is no kanji character for the concept of allergies. In future work, it would be interesting to understand how to detect such ambiguities in order to best deploy our annotation budget.

References

Jean Alaux, Edouard Grave, Marco Cuturi, and Armand Joulin. Unsupervised hyper-alignment for multilingual word embeddings. In International Conference on Learning Representations, 2019.

Aditi Chaudhary, Jiateng Xie, Zaid Sheikh, Graham Neubig, and Jaime G Carbonell. A little annotation does a lot of good: A study in bootstrapping low-resource named entity recognizers. arXiv preprint arXiv:1908.08983, 2019.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Hayate Iso, Camille Ruiz, Taichi Murayama, Katsuya Taguchi, Ryo Takeuchi, Hideya Yamamoto, Shoko Wakamiya, and Eiji Aramaki. NTCIR13 MedWeb task: Multi-label classification of tweets using an ensemble of neural networks. In Proceedings of the NTCIR-13 Conference, 2017.

Andrew Johnson, Penny Karanasou, Judith Gaspers, and Dietrich Klakow. Cross-lingual transfer learning for Japanese named entity recognition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Industry Papers), 2019.

Makoto P Kato and Yiqun Liu. Overview of NTCIR-13. In Proceedings of the NTCIR-13 Conference, 2017.

Yohei Kikuta. BERT pretrained model trained on Japanese Wikipedia articles. https://github.com/yoheikikuta/bert-japanese, 2019.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

Telmo Pires, Eva Schlinger, and Dan Garrette. How multilingual is multilingual BERT? CoRR, abs/1906.01502, 2019. URL http://arxiv.org/abs/1906.01502.

Leslie N Smith. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 464–472. IEEE, 2017.

Lonneke van der Plas and Jörg Tiedemann. Finding synonyms using automatic word alignment and measures of distributional similarity. In Proceedings of the COLING/ACL on Main Conference Poster Sessions, COLING-ACL ’06, Stroudsburg, PA, USA, 2006. Association for Computational Linguistics.

Shoko Wakamiya, Mizuki Morita, Yoshinobu Kano, Tomoko Ohkuma, and Eiji Aramaki. Overview of the NTCIR-13: MedWeb task. In Proceedings of the NTCIR-13 Conference, 2017.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27, 2015.
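
Illustrative note on the evaluation metrics: the paper reports an "exact match" score (every one of the 8 labels must be correct for a pseudo-tweet to count) alongside macro-averaged F1. The sketch below shows one way these two metrics could be computed for multi-label predictions encoded as 0/1 vectors. It is not the authors' evaluation code; the function names, the toy data, and the convention of scoring a label's F1 as 0.0 when it has no true or predicted positives are all assumptions made for illustration.

```python
from typing import Sequence

Labels = Sequence[Sequence[int]]  # one 0/1 vector per example


def exact_match_accuracy(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of examples whose whole label vector is predicted correctly."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return matches / len(y_true)


def macro_f1(y_true: Labels, y_pred: Labels) -> float:
    """Unweighted mean of the per-label F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        # Assumed convention: an undefined F1 (no positives at all) counts as 0.0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_labels


# Toy data with 3 labels (hypothetical, not taken from MedWeb).
y_true = [[1, 0, 0], [0, 1, 1], [0, 0, 0], [1, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [0, 0, 0], [0, 0, 1]]

# A majority-class baseline predicts "no labels" for every example.
all_negative = [[0, 0, 0] for _ in y_true]

print(exact_match_accuracy(y_true, y_pred))
print(macro_f1(y_true, y_pred))
print(exact_match_accuracy(y_true, all_negative))
```

Under this definition, an all-negative classifier scores exactly the fraction of examples that carry no labels. For the MedWeb test set in Table 1 that is 195/640 ≈ 0.305, which agrees with the majority class classifier row of Table 2.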

Language transfer for early warning of epidemics from social media

University of St. Andrews - Pure

https://research-repository.st-andrews.ac.uk/bitstream/handle/10023/19177/1910.04519v1.pdf?sequence=1&isAllowed=y
