
800 research outputs found

A Study of Metrics of Distance and Correlation Between Ranked Lists for Compositionality Detection

Author: Hansen Niels Dalum
Lioma Christina
Publication date: 10/03/2017

Compositionality in language refers to how much the meaning of some phrase can be decomposed into the meaning of its constituents and the way these constituents are combined. Based on the premise that substitution by synonyms is meaning-preserving, compositionality can be approximated as the semantic similarity between a phrase and a version of that phrase where words have been replaced by their synonyms. Different ways of representing such phrases exist (e.g., vectors [1] or language models [2]), and the choice of representation affects the measurement of semantic similarity. We propose a new compositionality detection method that represents phrases as ranked lists of term weights. Our method approximates the semantic similarity between two ranked list representations using a range of well-known distance and correlation metrics. In contrast to most state-of-the-art approaches in compositionality detection, our method is completely unsupervised. Experiments with a publicly available dataset of 1048 human-annotated phrases show that, compared to strong supervised baselines, our approach provides superior measurement of compositionality using any of the distance and correlation metrics considered.
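The abstract does not say which distance and correlation metrics are used, so the following is only a minimal sketch of two common candidates, Spearman's rho and Kendall's tau, applied to two rankings of the same terms. The function names and the example terms are hypothetical, and the sketch assumes both lists contain exactly the same items with no ties.

```python
from itertools import combinations


def _check_same_items(rank_a, rank_b):
    # Both rankings must order the same set of at least two distinct terms.
    if len(rank_a) < 2 or len(set(rank_a)) != len(rank_a):
        raise ValueError("need at least two distinct items")
    if set(rank_a) != set(rank_b) or len(rank_a) != len(rank_b):
        raise ValueError("rankings must contain the same items")


def spearman_rho(rank_a, rank_b):
    """Spearman's rank correlation for two tie-free rankings of the same items."""
    _check_same_items(rank_a, rank_b)
    n = len(rank_a)
    pos_b = {term: i for i, term in enumerate(rank_b)}
    # Sum of squared differences between each term's positions in the two lists.
    d2 = sum((i - pos_b[term]) ** 2 for i, term in enumerate(rank_a))
    return 1 - 6 * d2 / (n * (n * n - 1))


def kendall_tau(rank_a, rank_b):
    """Kendall's tau: (concordant - discordant) / total number of pairs."""
    _check_same_items(rank_a, rank_b)
    pos_b = {term: i for i, term in enumerate(rank_b)}
    concordant = discordant = 0
    # combinations() yields pairs (x, y) with x ranked above y in rank_a.
    for x, y in combinations(rank_a, 2):
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical ranked term lists for a phrase and its synonym-substituted variant.
original = ["red", "tape", "bureaucracy", "paper"]
substituted = ["red", "bureaucracy", "tape", "paper"]
print(spearman_rho(original, substituted))
print(kendall_tau(original, substituted))
```

Under this reading, a high correlation between the two rankings would suggest that synonym substitution preserved the meaning, i.e. that the phrase is more compositional.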

arXiv.org e-Print Archive

Copenhagen University Research Information System

Using distributional similarity to organise biomedical terminology

Author: Dowdall James
Keller Bill
Schneider Gerold
Weeds Julie
Weir David
Publication venue: 'John Benjamins Publishing Company'
Publication date: 01/01/2005

We investigate an application of distributional similarity techniques to the problem of structural organisation of biomedical terminology. Our application domain is the relatively small GENIA corpus. Using terms that have been accurately marked-up by hand within the corpus, we consider the problem of automatically determining semantic proximity. Terminological units are defined for our purposes as normalised classes of individual terms. Syntactic analysis of the corpus data is carried out using the Pro3Gres parser and provides the data required to calculate distributional similarity using a variety of different measures. Evaluation is performed against a hand-crafted gold standard for this domain in the form of the GENIA ontology. We show that distributional similarity can be used to predict semantic type with a good degree of accuracy.
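The abstract mentions "a variety of different measures" without naming them. As an illustration only, the sketch below computes cosine similarity between sparse co-occurrence vectors, which is one widely used distributional similarity measure and not necessarily one the authors evaluated. The feature names (grammatical relation plus word) and all counts are invented for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse feature-count vectors (dicts)."""
    dot = sum(weight * v.get(feature, 0.0) for feature, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector shares no context with anything
    return dot / (norm_u * norm_v)


# Hypothetical dependency-based context counts for two biomedical terms.
term_a = {"subj_of:activate": 4, "obj_of:express": 2, "mod:human": 1}
term_b = {"subj_of:activate": 3, "obj_of:express": 1, "mod:murine": 2}
print(cosine_similarity(term_a, term_b))
```

Terms whose vectors score close to 1.0 occur in similar syntactic contexts, which is the kind of signal that could then be compared against an ontology's semantic types.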

ZORA

Sussex Research Online

Indirectly Named Entity Recognition

Author: Atanassova Iana
Cardey Sylviane
Gaudinat Arnaud
Greenfield Peter
Kauffmann Alexis
Madinier Hélène
Rey François-Claude
Publication venue: 'Universitat Politecnica de Valencia'
Publication date: 13/12/2021

We define here indirectly named entities as a term to denote multiword expressions referring to known named entities by means of periphrasis. While named entity recognition is a classical task in natural language processing, little attention has been paid to indirectly named entities and their treatment. In this paper, we try to address this gap, describing issues related to the detection and understanding of indirectly named entities in texts. We introduce a proof of concept for retrieving both lexicalised and non-lexicalised indirectly named entities in French texts. We also show example cases where this proof of concept is applied, and discuss future perspectives. We have initiated the creation of a first lexicon of 712 indirectly named entity entries that is available for future research. This research has been funded by the FEDER (Fonds européen de développement régional) and selected by the French-Swiss programme Interreg V. We would like to thank Claire Wuillemin for her preliminary work in the DecRIPT project about the State-of-the-Art in NER and SER in 2020. We would also like to thank for their advice Gilles Falquet, Luka Nerima, Eric Wehrli and Jean-Philippe Goldman at the University of Geneva. Kauffmann, A.; Rey, F.; Atanassova, I.; Gaudinat, A.; Greenfield, P.; Madinier, H.; Cardey, S. (2021). Indirectly Named Entity Recognition. Journal of Computer-Assisted Linguistic Research. 5(1):27-46. https://doi.org/10.4995/jclr.2021.15922

HAL - Université de Franche-Comté

Hes-so: ArODES Open Archive (University of Applied Sciences and Arts Western Switzerland / Haute école spécialisée de Suisse occidentale / FH Westschweiz)

RiuNet

Information retrieval system using Multiwords Expressions (MWE) as descriptors

Author: Silva Edson Marchetti da
Souza Renato Rocha
Publication venue: Universidade de São Paulo. Faculdade de Economia, Administração e Contabilidade
Publication date: 01/08/2012

This paper proposes an alternative method for retrieving documents, in which Multiword Expressions (MWE) extracted from a document base are used as search descriptors in an Information Retrieval System (IRS). Unlike methods that treat the text as a bag of words, the proposed method takes the physical structure of the document into account when extracting MWE. The terms obtained with an exhaustive algorithmic technique proposed by the authors are compared with the results of thirteen statistical association measures generated by the Ngram Statistics Package (NSP). The experiment was set up with a corpus of documents in digital format.

Cadernos Espinosanos (E-Journal)

Automatic Extraction Of Malay Compound Nouns Using A Hybrid Of Statistical And Machine Learning Methods

Author: A. S. Hazaa Muneer
Albared Mohammed
Ba-Alwi Fadl Mutaher
Omar Nazlia
Publication venue: 'Institute of Advanced Engineering and Science'
Publication date: 01/06/2016

Identifying compound nouns is important for a wide spectrum of applications in the field of natural language processing, such as machine translation and information retrieval. Extraction of compound nouns requires deep or shallow syntactic preprocessing tools and large corpora. This paper investigates several methods for extracting noun compounds from Malay text corpora. First, we present the empirical results of sixteen statistical association measures for Malay <N+N> compound noun extraction. Second, we introduce the possibility of integrating multiple association measures. Third, this work also provides a standard dataset intended to provide a common platform for evaluating research on the identification of compound nouns in the Malay language. The standard data set contains 7,235 unique N-N candidates, of which 2,970 are N-N compound noun collocations. The extraction algorithms are evaluated against this reference data set. The experimental results demonstrate that a group of association measures (T-test, Piatetsky-Shapiro (PS), C_value, FGM and the rank combination method) are the best association measures and outperform the other association measures for <N+N> collocations in the Malay corpus. Finally, we describe several classification methods for combining the scores of the basic association measures, followed by their evaluation. Evaluation results show that classification algorithms significantly outperform individual association measures. The experimental results obtained are quite satisfactory in terms of precision, recall and F-score.
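Two of the association measures named above have simple textbook forms, sketched here for a single <N+N> candidate. This is a sketch under stated assumptions, not the paper's implementation: the t-score uses the common approximation t = (O - E) / sqrt(O), where O is the observed bigram count and E the count expected under independence, and Piatetsky-Shapiro is taken as P(xy) - P(x)P(y). The function names and all counts are hypothetical.

```python
import math


def t_score(bigram_count, count_w1, count_w2, total_bigrams):
    """Approximate t-test association score for a two-word candidate."""
    if bigram_count <= 0:
        return 0.0  # unseen candidates get no evidence of association
    expected = count_w1 * count_w2 / total_bigrams
    return (bigram_count - expected) / math.sqrt(bigram_count)


def piatetsky_shapiro(bigram_count, count_w1, count_w2, total_bigrams):
    """Piatetsky-Shapiro score: joint probability minus product of marginals."""
    p_xy = bigram_count / total_bigrams
    p_x = count_w1 / total_bigrams
    p_y = count_w2 / total_bigrams
    return p_xy - p_x * p_y


# Hypothetical corpus counts for one N-N candidate.
print(t_score(20, 100, 200, 10_000))
print(piatetsky_shapiro(20, 100, 200, 10_000))
```

Candidates would then be ranked by score, and the ranking compared against the reference list of compound nouns.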

IAES journal

Crossref

Institute of Advanced Engineering and Science

Using parallel text for the extraction of German multiword expressions

Author: Fritzinger Fabienne
Publication venue: 'OpenEdition'
Publication date: 05/04/2016

A procedure for the identification of semantically opaque (i.e. idiomatic) German multiwords is presented. We focus on verb + PP combinations that are lexicographically relevant (extracted via dependency parsing [Schiehlen 2003]) of the kind ins Leben rufen – “to initiate”, lit.: “to call into life”. Starting from [Villada Moirón and Tiedemann 2006], the method exploits the fact that opaque combinations are translated as a whole, whereas compositional uses would show regular, individual translations of the words involved. The translations into other languages are obtained by applying GIZA++ [Och and Ney 2003] word alignment to the EUROPARL corpus [Koehn 2005]. Numerous experiments are performed to further optimise the original method: several parameters are analysed individually as well as in combination with each other. This leads to the following results: depending on the actual parameter settings, values between 0.800 and 0.936 (in terms of uninterpolated average precision) are reached amongst the highest scoring 200 multiword candidates, as opposed to a baseline of 0.584, using the 200 most frequent multiwords in decreasing order of their occurrence frequency.
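The scores reported above are uninterpolated average precision values over a ranked candidate list. As a minimal sketch of the usual definition (the mean of the precision values taken at each rank where a true positive occurs), assuming the ranked list is given as booleans; the function name and the example labels are hypothetical:

```python
def uninterpolated_average_precision(ranked_labels):
    """Mean of precision@k over the ranks k that hold a true positive."""
    hits = 0
    precisions = []
    for rank, is_positive in enumerate(ranked_labels, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)  # precision at this rank
    if not precisions:
        return 0.0  # no true positives anywhere in the list
    return sum(precisions) / len(precisions)


# Hypothetical top-5 candidate list: True marks an idiomatic multiword.
print(uninterpolated_average_precision([True, True, False, True, False]))
```

Because early ranks weigh more heavily, a method that places opaque multiwords near the top of the list scores higher than one that finds the same items further down.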

OpenEdition

Exploiting multilingual lexical resources to predict MWE compositionality

Author: Bahar Salehi
Paul Cook
Timothy Baldwin
Publication venue: Language Science Press

Semantic idiomaticity is the extent to which the meaning of a multiword expression (MWE) cannot be predicted from the meanings of its component words. Much work in natural language processing on semantic idiomaticity has focused on compositionality prediction, wherein a binary or continuous-valued compositionality score is predicted for an MWE as a whole, or its individual component words. One source of information for making compositionality predictions is the translation of an MWE into other languages. This chapter extends two previously presented studies – Salehi & Cook (2013) and Salehi et al. (2014) – that propose methods for predicting compositionality that exploit translation information provided by multilingual lexical resources, and that are applicable to many kinds of MWEs in a wide range of languages. These methods make use of distributional similarity of an MWE and its component words under translation into many languages, as well as string similarity measures applied to definitions of translations of an MWE and its component words. We evaluate these methods over English noun compounds, English verb-particle constructions, and German noun compounds. We show that the estimation of compositionality is improved when using translations into multiple languages, as compared to simply using distributional similarity in the source language. We further find that string similarity complements distributional similarity.
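The abstract does not specify which string similarity measures are applied, so the sketch below uses Python's standard-library `difflib.SequenceMatcher` purely as a stand-in. It averages the similarity between the translation of an MWE and the translations of its component words across several languages. The function names, the averaging scheme, and the example translations are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Similarity ratio in [0, 1] from difflib (a stand-in measure)."""
    return SequenceMatcher(None, a, b).ratio()


def translation_overlap(mwe_translations, component_translations):
    """Average, over languages, of the best match between the MWE's
    translation and any of its component words' translations."""
    scores = []
    for lang, mwe_text in mwe_translations.items():
        candidates = component_translations.get(lang, [])
        if candidates:
            scores.append(max(string_similarity(mwe_text, c) for c in candidates))
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical translations of "olive oil" and of its components.
mwe = {"de": "olivenöl", "fr": "huile d'olive"}
parts = {"de": ["olive", "öl"], "fr": ["olive", "huile"]}
print(translation_overlap(mwe, parts))
```

A higher score means the translated MWE looks like its translated parts, which under this assumption would point towards a more compositional expression.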

ZENODO