355,490 research outputs found
Privacy Guarantees for De-identifying Text Transformations
Machine Learning approaches to Natural Language Processing tasks benefit from
a comprehensive collection of real-life user data. At the same time, there is a
clear need for protecting the privacy of the users whose data is collected and
processed. For text collections, such as, e.g., transcripts of voice
interactions or patient records, replacing sensitive parts with benign
alternatives can provide de-identification. However, how much privacy is
actually guaranteed by such text transformations, and are the resulting texts
still useful for machine learning? In this paper, we derive formal privacy
guarantees for general text transformation-based de-identification methods on
the basis of Differential Privacy. We also measure the effect that different
ways of masking private information in dialog transcripts have on a subsequent
machine learning task. To this end, we formulate different masking strategies
and compare their privacy-utility trade-offs. In particular, we compare a
simple redact approach with more sophisticated word-by-word replacement using
deep learning models on multiple natural language understanding tasks like
named entity recognition, intent detection, and dialog act classification. We
find that only word-by-word replacement is robust against performance drops in
various tasks.Comment: Proceedings of INTERSPEECH 202
Automatic detection of protected health information from clinic narratives
This paper presents a natural language processing (NLP) system that was designed to participate in the 2014 i2b2 de-identification challenge. The challenge task aims to identify and classify seven main Protected Health Information (PHI) categories and 25 associated sub categories. A hybrid model was proposed which combines machine learning techniques with keyword-based and rule based approaches to deal with the complexity inherent in PHI categories. Our proposed approaches exploit a rich set of linguistic features, both syntactic and word surface-oriented, which are further enriched by task specific features and regular expression template patterns to characterize the semantics of various PHI categories. Our system achieved promising accuracy on the challenge test data with an overall micro-averaged F measure of 93.6%, which was the winner of this de-identification challenge
Fine-Grained Analysis of Language Varieties and Demographics
[EN] The rise of social media empowers people to interact and communicate with anyone anywhere in the world. The possibility of being anonymous avoids censorship and enables freedom of expression. Nevertheless, this anonymity might lead to cybersecurity issues, such as opinion spam, sexual harassment, incitement to hatred or even terrorism propaganda. In such cases, there is a need to know more about the anonymous users and this could be useful in several domains beyond security and forensics such as marketing, for example. In this paper, we focus on a fine-grained analysis of language varieties while considering also the authorsÂż demographics. We present a Low-Dimensionality Statistical Embedding method to represent text documents. We compared the performance of this method with the best performing teams in the Author Profiling task at PAN 2017. We obtained an average accuracy of 92.08% versus 91.84% for the best performing team at PAN 2017. We also analyse the relationship of the language variety identification with the authorsÂż gender. Furthermore, we applied our proposed method to a more fine-grained annotated corpus of Arabic varieties covering 22 Arab countries and obtained an overall accuracy of 88.89%. We have also investigated the effect of the authorsÂż age and gender on the identification of the different Arabic varieties, as well as the effect of the corpus size on the performance of our method.This publication was made possible by NPRP grant 9-175-1-033 from the Qatar National Research Fund (a member of Qatar Foundation). The statements made herein are solely the responsibility of the authors.Rangel, F.; Rosso, P.; Zaghouani, W.; Charfi, A. (2020). Fine-Grained Analysis of Language Varieties and Demographics. Natural Language Engineering. 26(6):641-661. https://doi.org/10.1017/S1351324920000108S641661266Kestemont, M. , Tschuggnall, M. , Stamatatos, E. , Daelemans, W. , Specht, G. , Stein, B. and Potthast, M. (2018). Overview of the Author Identification Task at PAN-2018: Cross-domain Authorship Attribution and Style Change Detection. CLEF 2018 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org.McNemar, Q. (1947). Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2), 153-157. doi:10.1007/bf02295996Lui, M. and Cook, P. (2013). Classifying english documents by national dialect. In Proceedings of the Australasian Language Technology Association Workshop, Citeseer pp. 5–15.Basile, A. , Dwyer, G. , Medvedeva, M. , Rawee, J. , Haagsma, H. and Nissim, M. (2017). Is there life beyond n-grams? A simple SVM-based author profiling system. In Cappellato L., Ferro N., Goeuriot L. and Mandl T. (eds), CLEF 2017 Working Notes. CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org/Vol-/. CLEF and CEUR-WS.org.Elfardy, H. and Diab, M.T. (2013). Sentence level dialect identification in arabic. In Association for Computational Linguistics (ACL), pp. 456–461.Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5), 513-523. doi:10.1016/0306-4573(88)90021-0Zaghouani, W. and Charfi, A. (2018a). ArapTweet: A large MultiDialect Twitter corpus for gender, age and language variety identification. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC), Miyazaki, Japan.Zampieri, M. , Tan, L. , Ljubešić, N. , Tiedemann, J. and Nakov, P. (2015). Overview of the DSL shared task 2015. In Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects, pp. 1–9.Huang, C.-R. and Lee, L.-H. (2008). Contrastive approach towards text source classification based on top-bag-of-word similarity. In PACLIC, pp. 404–410.Zaidan, O. F., & Callison-Burch, C. (2014). Arabic Dialect Identification. Computational Linguistics, 40(1), 171-202. doi:10.1162/coli_a_00169Grouin, C. , Forest, D. , Paroubek, P. and Zweigenbaum, P. (2011). PrĂ©sentation et rĂ©sultats du dĂ©fi fouille de texte DEFT2011 Quand un article de presse a t-il Ă©tĂ© Ă©crit? Ă€ quel article scientifique correspond ce rĂ©sumĂ©? Actes du septième DĂ©fi Fouille de Textes, p. 3.Martinc, M. , Skrjanec, I. , Zupan, K. and Pollak, S. Pan (2017). Author profiling – gender and language variety prediction. In Cappellato L., Ferro N., Goeuriot L. and Mandl T. (eds), CLEF 2017 Working Notes. CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org/Vol-/. CLEF and CEUR-WS.org.Rangel, F. , Rosso, P. and Franco-Salvador, M. (2016b). A low dimensionality representation for language variety identification. In 17th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing, LNCS. Springer-Verlag, arxiv:1705.10754.Hagen, M. , Potthast, M. and Stein, B. (2018). Overview of the Author Obfuscation Task at PAN 2018. CLEF 2018 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org.Zampieri, M. and Gebre, B.G. (2012). Automatic identification of language varieties: The case of portuguese. In The 11th Conference on Natural Language Processing (KONVENS), pp. 233–237 (2012)Rangel, F. , Rosso, P. , Montes-y-GĂłmez, M. , Potthast, M. and Stein, B. (2018). Overview of the 6th Author Profiling Task at PAN 2018: Multimodal Gender Identification in Twitter. In CLEF 2018 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings. CEUR-WS.org.Heitele, D. (1975). An epistemological view on fundamental stochastic ideas. Educational Studies in Mathematics, 6(2), 187-205. doi:10.1007/bf00302543Inches, G. and Crestani, F. (2012). Overview of the International Sexual Predator Identification Competition at PAN-2012. CLEF Online working notes/labs/workshop, vol. 30.Rosso, P. , Rangel Pardo, F.M. , Ghanem, B. and Charfi, A. (2018b). ARAP: Arabic Author Profiling Project for Cyber-Security. Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN).Agić, Ĺ˝. , Tiedemann, J. , Dobrovoljc, K. , Krek, S. , Merkler, D. , MoĹľe, S. , Nakov, P. , Osenova, P. and Vertan, C. (2014). Proceedings of the EMNLP 2014 Workshop on Language Technology for Closely Related Languages and Language Variants. Association for Computational Linguistics.Sadat, F., Kazemi, F., & Farzindar, A. (2014). Automatic Identification of Arabic Language Varieties and Dialects in Social Media. Proceedings of the Second Workshop on Natural Language Processing for Social Media (SocialNLP). doi:10.3115/v1/w14-5904Franco-Salvador, M., Rangel, F., Rosso, P., TaulĂ©, M., & Antònia MartĂt, M. (2015). Language Variety Identification Using Distributed Representations of Words and Documents. Experimental IR Meets Multilinguality, Multimodality, and Interaction, 28-40. doi:10.1007/978-3-319-24027-5_3Rosso, P., Rangel, F., FarĂas, I. H., Cagnina, L., Zaghouani, W., & Charfi, A. (2018). A survey on author profiling, deception, and irony detection for the Arabic language. Language and Linguistics Compass, 12(4), e12275. doi:10.1111/lnc3.12275Malmasi, S. , Zampieri, M. , Ljubešić, N. , Nakov, P. , Ali, A. and Tiedemann, J. (2016). Discriminating between similar languages and arabic dialect identification: A report on the third DSL shared task. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pp. 1–14.Rangel, F. , Rosso, P. , Potthast, M. and Stein, B. (2017). Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter. In Cappellato L., Ferro N., Goeuriot, L. and Mandl T. (eds), Working Notes Papers of the CLEF 2017 Evaluation Labs, p. 1613–0073, CLEF and CEUR-WS.org.Zampieri, M. , Malmasi, S. , Ljubešić, N. , Nakov, P. , Ali, A. , Tiedemann, J. , Scherrer, Y. , Aepli, N. (2017). Findings of the vardial evaluation campaign 2017. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects, pp. 1–15.Bogdanova, D., Rosso, P., & Solorio, T. (2014). Exploring high-level features for detecting cyberpedophilia. Computer Speech & Language, 28(1), 108-120. doi:10.1016/j.csl.2013.04.007Maier, W. and GĂłmez-RodrĂguez, C. (2014). Language Variety Identification in Spanish Tweets. LT4CloseLang.Castro, D. , Souza, E. , de Oliveira, A.L.I. (2016). Discriminating between Brazilian and European Portuguese national varieties on Twitter texts. In 5th Brazilian Conference on Intelligent Systems (BRACIS), pp. 265–270.Zaghouani, W. and Charfi, A. (2018b). Guidelines and annotation framework for Arabic author profiling. In Proceedings of the 3rd Workshop on Open-Source Arabic Corpora and Processing Tools, 11th International Conference on Language Resources and Evaluation (LREC), Miyazaki, Japan.Hernández Fusilier, D., Montes-y-GĂłmez, M., Rosso, P., & Guzmán Cabrera, R. (2015). Detecting positive and negative deceptive opinions using PU-learning. Information Processing & Management, 51(4), 433-443. doi:10.1016/j.ipm.2014.11.001Tellez, E.S. , Miranda-JimĂ©nez, S. , Graff, M. and Moctezuma, D. (2017). Gender and language variety identification with microtc. In Cappellato L., Ferro N., Goeuriot L. and Mandl T. (eds). CLEF 2017 Working Notes. CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org/Vol-/. CLEF and CEUR-WS.org.Kandias, M., Stavrou, V., Bozovic, N., & Gritzalis, D. (2013). Proactive insider threat detection through social media. Proceedings of the 12th ACM workshop on Workshop on privacy in the electronic society. doi:10.1145/2517840.251786
ViTs are Everywhere: A Comprehensive Study Showcasing Vision Transformers in Different Domain
Transformer design is the de facto standard for natural language processing
tasks. The success of the transformer design in natural language processing has
lately piqued the interest of researchers in the domain of computer vision.
When compared to Convolutional Neural Networks (CNNs), Vision Transformers
(ViTs) are becoming more popular and dominant solutions for many vision
problems. Transformer-based models outperform other types of networks, such as
convolutional and recurrent neural networks, in a range of visual benchmarks.
We evaluate various vision transformer models in this work by dividing them
into distinct jobs and examining their benefits and drawbacks. ViTs can
overcome several possible difficulties with convolutional neural networks
(CNNs). The goal of this survey is to show the first use of ViTs in CV. In the
first phase, we categorize various CV applications where ViTs are appropriate.
Image classification, object identification, image segmentation, video
transformer, image denoising, and NAS are all CV applications. Our next step
will be to analyze the state-of-the-art in each area and identify the models
that are currently available. In addition, we outline numerous open research
difficulties as well as prospective research possibilities.Comment: ICCD-2023. arXiv admin note: substantial text overlap with
arXiv:2208.04309 by other author
Social Media Operationalized for GIS: The Prequel
With social media a de facto global communication channel used to disseminate news, entertainment, and one’s self-revelations, the latter contains double-talk, peculiar insight, and contextual observation about real-world events. The primary objective is to propose a novel pipeline to classify a tweet as either “useful” or “not useful” by using widely-accepted Natural Language Processing (NLP) techniques, and measure the effect of such method based on the change in performance of a Geographical Information System (GIS) artifact. A 1,000 tweet sample is manually tagged and compared to an innovative social media grammar applied by a rule-based social media NLP pipeline. Evaluation underpins answering, prior to content analysis of a tweet, does a method exist to support identifying a tweet as “useful” for subsequent processing? Indeed, “useful” tweet identification via NLP returned precision of 0.9256, recall of 0.6590, and F-measure of 0.7699; consequently GIS social media processing increased 0.2194 over baseline
- …