This paper presents our approach and results for the 2017 PAN Author Profiling Shared Task. Language-specific corpora were provided for four languages: Spanish, English, Portuguese, and Arabic. Each corpus consisted of tweets authored by a number of Twitter users labeled with their gender and the specific variant of their language which was used in the documents (e.g. Brazilian or European Portuguese). The task was to develop a system to infer the same attributes for unseen Twitter users. Our system employs an ensemble of two probabilistic classifiers: a Logistic Regression classifier trained on TF-IDF transformed n-grams and a Gaussian Process classifier trained on word embedding clusters derived from an additional, external corpus of tweets.

Poulston, A.

Stevenson, M.

Waseem, Z.

White Rose Research Online

Using TF-IDF n-gram and word embedding cluster ensembles for author profiling: Notebook for PAN at CLEF 2017

This is a repository copy of Using TF-IDF n-gram and word embedding cluster ensembles for author profiling: Notebook for PAN at CLEF 2017.

White Rose Research Online URL for this paper: http://eprints.whiterose.ac.uk/128573/

Version: Published Version

Proceedings Paper: Poulston, A., Waseem, Z. and Stevenson, M. orcid.org/0000-0002-9483-6006 (2017) Using TF-IDF n-gram and word embedding cluster ensembles for author profiling: Notebook for PAN at CLEF 2017. In: Cappellato, L., Ferro, N., Goeuriot, L. and Mandl, T., (eds.) CEUR Workshop Proceedings. Conference and Labs of the Evaluation Forum (CLEF 2017), 11-14 Sep 2017, Dublin, Ireland. CEUR.

A. Poulston et al: Using TF-IDF n-gram and word embedding cluster ensembles for author profiling: Notebook for PAN at CLEF 2017. Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum. Dublin, Ireland, September 11-14, 2017. CEUR-WS.org, online http://ceur-ws.org/Vol-1866/paper_72.pdf

eprints@whiterose.ac.uk
https://eprints.whiterose.ac.uk/

Reuse: Unless indicated otherwise, fulltext items are protected by copyright with all rights reserved. The copyright exception in section 29 of the Copyright, Designs and Patents Act 1988 allows the making of a single copy solely for the purpose of non-commercial research or private study within the limits of fair dealing. The publisher or other rights-holder may allow further reproduction and re-use of this version - refer to the White Rose Research Online record for this item. Where records identify the publisher as the copyright holder, users can verify any specific terms of use on the publisher's website.

Takedown: If you consider content in White Rose Research Online to be in breach of UK law, please notify us by emailing eprints@whiterose.ac.uk including the URL of the record and the reason for the withdrawal request.
Using TF-IDF n-gram and Word Embedding Cluster Ensembles for Author Profiling
Notebook for PAN at CLEF 2017

Adam Poulston, Zeerak Waseem, and Mark Stevenson

Department of Computer Science
University of Sheffield, UK
{arspoulston1, z.w.butt, mark.stevenson}@sheffield.ac.uk

Abstract

This paper presents our approach and results for the 2017 PAN Author Profiling Shared Task. Language-specific corpora were provided for four languages: Spanish, English, Portuguese, and Arabic. Each corpus consisted of tweets authored by a number of Twitter users labeled with their gender and the specific variant of their language which was used in the documents (e.g. Brazilian or European Portuguese). The task was to develop a system to infer the same attributes for unseen Twitter users. Our system employs an ensemble of two probabilistic classifiers: a Logistic Regression classifier trained on TF-IDF transformed n-grams and a Gaussian Process classifier trained on word embedding clusters derived from an additional, external corpus of tweets.

1 Introduction

Author profiling is the task of determining the characteristics of the individual who wrote a document. Many different characteristics can be determined (e.g. personal characteristics such as gender, age, personality [19] and socioeconomic indicators [5,13,14,15]) across a variety of media (e.g. written essays, books, blogs and other social media). Despite the potential ethical concerns they raise, author profiling techniques can be a valuable component in various applications, such as bias reduction in predictive models [2] and language-variant adaptation in part-of-speech taggers [1].

In this paper, we present our approach to the 2017 edition of the PAN Author Profiling shared task [10,11,16]. A dataset was provided consisting of Twitter users across four languages and their variants. Each user was labeled with a binary gender label (male/female) and the particular variant of their language (e.g. Brazilian vs European Portuguese).
The dataset was balanced by both gender and language variant. Given an unseen user (and their native language), the task is to determine their gender and the language variant they use.

To predict gender and language variant, we applied an ensemble of probabilistic machine learning classifiers (described in detail in Section 2). First, an external Twitter corpus was acquired, and tweets geo-located within the countries covered by the task's languages were extracted (except for the Arabic language variants). This corpus was divided into individual languages (Portuguese, English and Spanish) and used to derive Word2Vec word embeddings [7,8] for each language. Then, each set of language-specific word embeddings was clustered using K-Means to derive a set of word-to-cluster mappings, which can be thought of as roughly analogous to topics in a topic model. The normalised frequency of each word cluster across a user's tweets was used to train a Gaussian Process classifier. Second, a Logistic Regression classifier was trained using TF-IDF transformed unigram and bigram frequencies. Both classifiers were employed in an ensemble approach by averaging the predicted probabilities for each sample to determine the label.

2 Approach

Our approach combines two probabilistic classifiers trained on distinct feature sets in an ensemble to predict gender and language variant. Two classifiers were applied: a Logistic Regression classifier trained on TF-IDF n-grams (Section 2.1) and a Gaussian Process classifier trained on word cluster frequencies (Section 2.2). For each unseen document, the class probabilities predicted by the two classifiers are averaged, and the class with the highest average probability is taken as the prediction.
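
The averaging step described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the submitted system: the function name `ensemble_predict`, the class labels, and the probability values are all hypothetical, and the two probability tables merely stand in for the outputs of the Logistic Regression and Gaussian Process classifiers.

```python
from typing import List, Sequence


def ensemble_predict(
    prob_lr: Sequence[Sequence[float]],
    prob_gp: Sequence[Sequence[float]],
    classes: Sequence[str],
) -> List[str]:
    """Average two classifiers' class probabilities and pick the top class.

    prob_lr and prob_gp hold one row per user and one column per class,
    with columns in the same order as `classes` (hypothetical layout).
    """
    if len(prob_lr) != len(prob_gp):
        raise ValueError("both classifiers must score the same users")
    predictions = []
    for row_lr, row_gp in zip(prob_lr, prob_gp):
        if not (len(row_lr) == len(row_gp) == len(classes)):
            raise ValueError("probability rows must match the class list")
        # Unweighted mean of the two probability estimates for each class.
        averaged = [(a + b) / 2.0 for a, b in zip(row_lr, row_gp)]
        # The class with the highest averaged probability is the prediction.
        best = max(range(len(classes)), key=lambda i: averaged[i])
        predictions.append(classes[best])
    return predictions


# Made-up probabilities for three users over two labels (illustration only).
classes = ["female", "male"]
prob_lr = [[0.70, 0.30], [0.40, 0.60], [0.55, 0.45]]
prob_gp = [[0.60, 0.40], [0.45, 0.55], [0.20, 0.80]]
predictions = ensemble_predict(prob_lr, prob_gp, classes)
print(predictions)
```

In scikit-learn, both `LogisticRegression` and `GaussianProcessClassifier` expose `predict_proba`, which would supply rows of this shape with columns ordered as in each model's `classes_` attribute. The third toy user shows why averaging matters: the two classifiers disagree, and the more confident one decides the label.
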
Models were trained using the implementations found in scikit-learn [9] unless stated otherwise.

For Arabic data, only the Logistic Regression classifier is applied, as the volume of geo-located Arabic tweets collected was too low to allow for training of robust Word2Vec models for use with the Gaussian Process classifier.

2.1 Logistic regression classifier with TF-IDF n-grams

Word unigram and bigram features were extracted for each training document. The text was tokenised using a Twitter-aware tokeniser [4]; no additional steps were taken to deal with the extra complexities of Arabic text. A list of stop words was not used while deriving n-gram features; instead, tokens that appeared in more than 90% of the documents were removed, as this allows for the removal of n-grams common across a language's variants while also removing stop words.

TF-IDF weighting was applied to down-weight n-grams common across the documents and assign a higher weight to n-grams which are rare.

A Logistic Regression classifier was trained for each language using the n-gram features. Logistic Regression was chosen for use with the n-gram features because it has been shown to perform well on similar high-dimensional classification tasks, and produces probabilistic predictions [3].

2.2 Gaussian process classifier with word embedding clusters

We obtained the data for our word embedding clusters from a Twitter Firehose¹ sample collected throughout 2015. We only used tweets that were geo-located in the specific language regions determined by the shared task (see Table 1).

¹ Twitter Firehose has since been discontinued and can no longer be accessed.

Some language variants were less frequent in the resulting datasets than others; for instance, we collected very few tweets from Ireland compared to the U.S.A. Down-sampling was used to avoid over-representation of the more prevalent language variants. Data for the language variant with the largest volume of documents was reduced so that it contained no more than 10 times the number of tweets of the smallest language variant.

Table 1. Countries scraped for each language.

English (F_en)    Spanish (F_es)    Portuguese (F_pt)
Australia         Argentina         Brazil
Canada            Chile             Portugal
Great Britain     Colombia
Ireland           Mexico
New Zealand       Peru
United States     Spain
                  Venezuela

Word embeddings for each language dataset (F_en, F_es, and F_pt) were trained using the Word2Vec [7,8] implementation in gensim [18] with Continuous Bag of Words (CBOW), negative sampling, 200 dimensions, and a window size of 10.

We applied K-Means clustering [6] to the word embeddings to derive a set of 100 clusters for each language, in which each word is assigned to the cluster whose centroid is nearest in the embedding space. We then computed the frequency distribution of the clusters for every training document, and used these distributions as features to train a Gaussian Process classifier with an RBF kernel [17].

Similar word embedding clusters have been applied with Gaussian Processes to perform other author profiling tasks such as socio-economic status detection [5]; furthermore, the derived clusters are similar to topics derived in a topic model, in that they identify semantically similar groups of words in documents, which we found to perform well in a similar task [12].

3 Results

Table 2 shows the accuracy scores achieved by a Support Vector Machine (SVM) classifier with a linear kernel, trained on the same TF-IDF n-grams described in Section 2.1. We chose this approach as our baseline, as it has been shown to perform well on similar tasks and represents a strong baseline.

Table 2. Baseline accuracy scores for gender and language variant prediction for each language, derived from an SVM classifier trained on TF-IDF n-grams.

Target              Spanish    English    Portuguese    Arabic
Gender              0.7361     0.7896     0.8263        0.7450
Language variant    0.9532     0.8617     0.9800        0.8150
Joint               0.7007     0.6838     0.8113        0.6275

Table 3. Accuracy scores for gender and language variant prediction for each language as submitted for the PAN: Author Profiling task 2017.

Target              Spanish    English    Portuguese    Arabic
Gender              0.7939     0.7829     0.8388        0.7738
Language variant    0.9368     0.8038     0.9763        0.7975
Joint               0.7471     0.6254     0.8188        0.6356

Table 3 shows the results of our final submitted run for the PAN: Author Profiling task 2017. For Spanish, English and Portuguese the results were attained by applying the ensemble of Logistic Regression and Gaussian Process classifiers described in Section 2; for Arabic only the Logistic Regression classifier was applied (Section 2.1). In the rankings for the PAN Author Profiling shared task [16], our approach achieved 7th place out of 22 entries for joint prediction and 6th for gender, exceeding reported baselines. We achieved poorer results for language variant prediction at 9th place, and did not exceed the baseline approach.

3.1 Discussion

In Table 3, we see that our ensemble performs quite well for identifying language variant or gender individually. For joint prediction our ensemble performs less well, likely due to errors in either gender or language variant prediction propagating through to incorrect joint predictions. Of the three languages the ensemble was applied to, the best performance was observed for Portuguese and the worst for English. Broad topics of interest appear to be effective for the gender prediction problem, while individual terms that are unique to specific language variants are more discriminating for language variant prediction.

Similar to our results in a previous PAN: Author Profiling shared task entry [12], in which LDA topic models were able to improve predictive performance over word n-grams, word embedding clusters improved predictive accuracy for gender classification.
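
The word embedding cluster features discussed here are, per Section 2.2, the frequency distribution of word clusters across a user's tweets. A minimal sketch of that feature computation follows. It assumes a word-to-cluster mapping is already available (in the paper it comes from K-Means over Word2Vec vectors; here it is a hand-written toy), and it assumes normalisation by the number of tokens that belong to some cluster, a detail the paper does not specify. The names `cluster_frequencies` and `word_to_cluster` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def cluster_frequencies(
    tokens: Sequence[str],
    word_to_cluster: Dict[str, int],
    n_clusters: int,
) -> List[float]:
    """Return the normalised frequency of each word cluster in `tokens`.

    Tokens missing from `word_to_cluster` are ignored (an assumption).
    """
    counts = Counter(
        word_to_cluster[token] for token in tokens if token in word_to_cluster
    )
    total = sum(counts.values())
    if total == 0:
        # No known words at all: fall back to an all-zero feature vector.
        return [0.0] * n_clusters
    return [counts[cluster] / total for cluster in range(n_clusters)]


# Toy mapping with three clusters; the paper derives 100 clusters per language.
word_to_cluster = {
    "rain": 0, "sunny": 0, "storm": 0,
    "pizza": 1, "coffee": 1,
    "train": 2, "bus": 2,
}

# All tokens from one (made-up) user's tweets, already tokenised.
user_tokens = ["rain", "coffee", "bus", "storm", "lol", "pizza"]
features = cluster_frequencies(user_tokens, word_to_cluster, n_clusters=3)
print(features)
```

Each user is thereby reduced to one fixed-length vector regardless of how many tweets they wrote, which keeps the input to the Gaussian Process classifier low-dimensional compared with the n-gram features.
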
For the language variant differentiation task, introducing the word embedding clusters in fact reduced accuracy scores over earlier runs.

Under our current clustering scheme, each term was assumed to be as representative of its cluster as every other term; in practice, though, certain terms were closer to the centroid in embedding space than others. Prior to submission we had begun experimenting with weighting terms based on their proximity to their closest centroid, and our initial findings were promising. In future work we would like to investigate the effect of weighting terms in more detail.

4 Conclusion

In this notebook, we have shown that reasonable results can be achieved by employing an ensemble of classifiers and utilising clusters of word embeddings. We propose that our approach could be improved by weighting each term by its distance to its cluster centroid.

References

1. Blodgett, S.L., Green, L., O'Connor, B.: Demographic dialectal variation in social media: A case study of African-American English. pp. 1119–1130 (November 2016)
2. Culotta, A.: Reducing sampling bias in social media data for county health inference. In: Joint Statistical Meetings Proceedings (2014)
3. Freedman, D.A.: Statistical models: theory and practice. Cambridge University Press (2009)
4. Gimpel, K., Schneider, N., O'Connor, B., Das, D., Mills, D., Eisenstein, J., Heilman, M., Yogatama, D., Flanigan, J., Smith, N.A.: Part-of-speech tagging for Twitter: annotation, features, and experiments. Human Language Technologies 2(2), 42–47 (2011)
5. Lampos, V., Aletras, N., Geyti, J.K., Zou, B., Cox, I.J.: Inferring the socioeconomic status of social media users based on behaviour and language (2016)
6. MacQueen, J., et al.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability. vol. 1, pp. 281–297. Oakland, CA, USA (1967)
7. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
8. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems. pp. 3111–3119 (2013)
9. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
10. Potthast, M., Gollub, T., Rangel, F., Rosso, P., Stamatatos, E., Stein, B.: Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling. In: Kanoulas, E., Lupu, M., Clough, P., Sanderson, M., Hall, M., Hanbury, A., Toms, E. (eds.) Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14). pp. 268–299. Springer, Berlin Heidelberg New York (Sep 2014)
11. Potthast, M., Rangel, F., Tschuggnall, M., Stamatatos, E., Rosso, P., Stein, B.: Overview of PAN'17: Author Identification, Author Profiling, and Author Obfuscation. In: Jones, G., Lawless, S., Gonzalo, J., Kelly, L., Goeuriot, L., Mandl, T., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. 8th International Conference of the CLEF Initiative (CLEF 17). Springer, Berlin Heidelberg New York (Sep 2017)
12. Poulston, A., Stevenson, M., Bontcheva, K.: Topic models and n-gram language models for author profiling: Notebook for PAN at CLEF 2015. (2015)
13. Poulston, A., Stevenson, M., Bontcheva, K.: User profiling with geo-located posts and demographic data. pp. 43–48 (November 2016)
14. Preoţiuc-Pietro, D., Lampos, V., Aletras, N.: An analysis of the user occupational class through Twitter content. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) pp. 1754–1764 (2015)
15. Preoţiuc-Pietro, D., Volkova, S., Lampos, V., Bachrach, Y., Aletras, N.: Studying user income through language, behaviour and affect in social media. PloS one 10(9), e0138717 (2015)
16. Rangel, F., Rosso, P., Potthast, M., Stein, B.: Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter. In: Cappellato, L., Ferro, N., Goeuriot, L., Mandl, T. (eds.) Working Notes Papers of the CLEF 2017 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep 2017)
17. Rasmussen, C.E., Williams, C.K.: Gaussian processes for machine learning, vol. 1. MIT Press, Cambridge (2006)
18. Řehůřek, R., Sojka, P.: Software Framework for Topic Modelling with Large Corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. pp. 45–50. ELRA, Valletta, Malta (May 2010)
19. Schwartz, H.A., Eichstaedt, J.C., Kern, M.L., Dziurzynski, L., Ramones, S.M., Agrawal, M., Shah, A., Kosinski, M., Stillwell, D., Seligman, M.E.P., Ungar, L.H.: Personality, gender, and age in the language of social media: The open-vocabulary approach. PLoS ONE 8(9), e73791 (2013)

https://eprints.whiterose.ac.uk/128573/1/paper_72.pdf
