27 research outputs found

    Predicate Matrix: an interoperable lexical knowledge base for predicates

    183 p. The Predicate Matrix is a new lexical-semantic resource resulting from the integration of multiple knowledge sources, including FrameNet, VerbNet, PropBank and WordNet. It provides an extensive and robust lexicon that improves interoperability among these semantic resources. The Predicate Matrix was built by integrating Semlink with new mappings obtained through automatic methods that link semantic knowledge at the lexical and role levels. We have also extended the Predicate Matrix to cover nominal predicates (English, Spanish) and predicates in other languages (Spanish, Catalan and Basque). As a result, the Predicate Matrix provides a multilingual lexicon that enables interoperable semantic analysis across multiple languages.

    Automatic induction of FrameNet lexical units in Italian

    In this paper we investigate the applicability of automatic methods for frame induction to improve the coverage of IFrameNet, a novel lexical resource based on Frame Semantics in Italian. The experimental evaluations show that the adopted methods, based on neural word embeddings, pave the way for the assisted development of a large-scale lexical resource for our language.

    An approach to the automatic transfer of lexical units from English FrameNet to Spanish by using WordNet

    In the field of Natural Language Processing, linguistic resources are structured and detailed descriptions of a given language. They are considered key elements for studying languages and developing applications. However, these repositories are slow and difficult to build, and most of them focus on English. This work tries to alleviate the lack of linguistic resources in Spanish by transferring into Spanish the lexical units of the frames in the FrameNet project, an online resource for English based on frame semantics. For this purpose, we developed an automatic procedure that aligns the predicates of each frame with the WordNet synsets that best represent them, WordNet being a lexical database that organizes vocabulary by concepts and semantic relations. Our system reaches a precision of around 88% and makes it possible to reuse this semantic resource for linguistic studies in Spanish.
    This publication was funded by the project "Comunicación especializada y terminografía: usos terminológicos relacionados con los contenidos y perspectivas actuales de la semántica léxica" (Ref. FFI2014-54609-P) of the Programa Estatal de Fomento de la Investigación Científica y Técnica de Excelencia, Subprograma Estatal de Generación del Conocimiento (2014 call of the Ministerio de Economía y Competitividad), and is part of the project "Bases metodológicas y recursos digitales para la creación de un léxico relacional de usos terminológicos de la semántica léxica (TerLexNet)", submitted to the 2020 call for R&D&I Projects of the Ministerio de Ciencia e Innovación. It is also supported by the project "Lingüística y nuevas tecnologías de la información: la creación de un repositorio electrónico de documentación lingüística" (Ref. FEDER-UCA18-107788), part of the R&D&I Projects of the Programa Operativo FEDER Andalucía 2014-2020, and by "Lingüística y Humanidades Digitales: base de datos relacional de documentación lingüística" (Ref. PY18-FR-2511) of the 2018 call for R&D&I project grants ("Frontera Consolidado" modality) within the Plan Andaluz de Investigación, Desarrollo e Innovación (Junta de Andalucía, PAIDI 2020).
    Crespo Miguel, M. (2021). Aproximación al trasvase automático de predicados de FrameNet al español mediante WordNet. Revista de Lingüística y Lenguas Aplicadas, 16(1), 49-62. https://doi.org/10.4995/rlyla.2021.14408

    A Frame-Based Approach for Integrating Heterogeneous Knowledge Sources

    Cold-start universal information extraction

    Who? What? When? Where? Why? These are fundamental questions asked when gathering knowledge about and understanding a concept, topic, or event. The answers to these questions underpin the key information conveyed in the overwhelming majority, if not all, of language-based communication. At the core of my research in Information Extraction (IE) is the desire to endow machines with the ability to automatically extract, assess, and understand text in order to answer these fundamental questions. IE has been serving as one of the most important components for many downstream natural language processing (NLP) tasks, such as knowledge base completion, machine reading comprehension, and machine translation. The proliferation of the Web also intensifies the need to deal with enormous amounts of unstructured data from various sources, such as languages, genres and domains. When building an IE system, the conventional pipeline is to (1) ask expert linguists to rigorously define a target set of knowledge types we wish to extract by examining a large data set, (2) collect resources and human annotations for each type, and (3) design features and train machine learning models to extract knowledge elements. In practice, this process is very expensive, as each step involves extensive human effort which is not always available. For example, to specify the knowledge types for a particular scenario, both consumers and expert linguists need to examine a lot of data from that domain and write detailed annotation guidelines for each type. Hand-crafted schemas, which define the types and complex templates of the expected knowledge elements, often provide low coverage and fail to generalize to new domains. For example, none of the traditional event extraction programs, such as ACE (Automatic Content Extraction) and TAC-KBP, include "donation" and "evacuation" in their schemas in spite of their potential relevance to natural disaster management users. 
Additionally, these approaches are highly dependent on linguistic resources and human-labeled data tuned to pre-defined types, so they suffer from poor scalability and portability when moving to a new language, domain, or genre. The focus of this thesis is to develop effective theories and algorithms for IE which not only yield satisfactory quality by incorporating prior linguistic and semantic knowledge, but also achieve greater portability and scalability by moving away from the high cost and narrow focus of large-scale manual annotation. This thesis opens up a new research direction called Cold-Start Universal Information Extraction, where the full extraction and analysis starts from scratch and requires little or no prior manual annotation or pre-defined type schema. In addition to this new research paradigm, we also contribute effective algorithms and models towards resolving the following three challenges: How can machines extract knowledge without any pre-defined types or any human-annotated data? We develop an effective bottom-up and unsupervised Liberal Information Extraction framework based on the hypothesis that the meaning and underlying knowledge conveyed by linguistic expressions is usually embodied by their usages in language, which makes it possible to automatically induce a type schema based on rich contextual representations of all knowledge elements by combining their symbolic and distributional semantics using unsupervised hierarchical clustering. How can machines benefit from available resources, e.g., large-scale ontologies or existing human annotations? My research has shown that pre-defined types can also be encoded by rich contextual or structured representations, through which knowledge elements can be mapped to their appropriate types. 
Therefore, we design a weakly supervised Zero-shot Learning approach and a Semi-Supervised Vector Quantized Variational Auto-Encoder approach that frame IE as a grounding problem instead of classification, where knowledge elements are grounded into any types from an extensible and large-scale target ontology or induced from the corpora, with available annotations for a few types. How can IE approaches be extended to low-resource languages without any extra human effort? There are more than 6000 living languages in the real world, while public gold-standard annotations are only available for a few dominant languages. To facilitate the adaptation of these IE frameworks to other languages, especially low-resource languages, a Multilingual Common Semantic Space is further proposed to serve as a bridge for transferring existing resources and annotated data from dominant languages to more than 300 low-resource languages. Moreover, a Multi-Level Adversarial Transfer framework is also designed to learn language-agnostic features across various languages.

    Understanding Word Embedding Stability Across Languages and Applications

    Despite the recent popularity of word embedding methods, there is only a small body of work exploring the limitations of these representations. In this thesis, we consider several aspects of embedding spaces, including their stability. First, we propose a definition of stability, and show that common English word embeddings are surprisingly unstable. We explore how properties of data, words, and algorithms relate to instability. We extend this work to approximately 100 world languages, considering how linguistic typology relates to stability. Additionally, we consider contextualized output embedding spaces. Using paraphrases, we explore properties and assumptions of BERT, a popular embedding algorithm. Second, we consider how stability and other word embedding properties affect tasks where embeddings are commonly used. We consider both word embeddings used as features in downstream applications and corpus-centered applications, where embeddings are used to study characteristics of language and individual writers. In addition to stability, we also consider other word embedding properties, specifically batching and curriculum learning, and how methodological choices made for these properties affect downstream tasks. Finally, we consider how knowledge of stability affects how we use word embeddings. Throughout this thesis, we discuss strategies to mitigate instability and provide analyses highlighting the strengths and weaknesses of word embeddings in different scenarios and languages. We show areas where more work is needed to improve embeddings, and we show where embeddings are already a strong tool.
    PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/162917/1/lburdick_1.pd
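    The abstract above does not spell out its definition of stability. As a rough illustration only, the sketch below assumes one plausible reading: a word is stable if its nearest neighbours are the same in two embedding spaces trained separately. The function names, the toy vectors and the choice of cosine similarity are all assumptions made for this example, not details taken from the thesis.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(word, space, k):
    """Return the k words in `space` most similar to `word` (excluding itself)."""
    others = [w for w in space if w != word]
    ranked = sorted(others, key=lambda w: cosine(space[word], space[w]), reverse=True)
    return set(ranked[:k])


def stability(word, space_a, space_b, k=1):
    """Assumed measure: share of the k nearest neighbours common to both spaces."""
    shared = nearest_neighbors(word, space_a, k) & nearest_neighbors(word, space_b, k)
    return len(shared) / k


# Hypothetical toy spaces: identical except that "bus" has moved in space_b.
space_a = {"cat": (1.0, 0.0), "dog": (0.9, 0.1), "car": (0.0, 1.0), "bus": (0.1, 0.9)}
space_b = {"cat": (1.0, 0.0), "dog": (0.9, 0.1), "car": (0.0, 1.0), "bus": (-1.0, -0.1)}

for word in ("cat", "car"):
    print(word, stability(word, space_a, space_b, k=1))
```

    In this toy setup "cat" keeps "dog" as its nearest neighbour in both spaces, while "car" loses "bus", so the two words end up at opposite ends of the assumed scale.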

    D7.1. Criteria for evaluation of resources, technology and integration.

    This deliverable defines how evaluation is carried out at each integration cycle in the PANACEA project. Because PANACEA aims to produce large-scale resources, evaluation is both critical and challenging. It is critical because the quality of the results delivered to users must be assessed. It is challenging because the project explores rather new areas, and does so through a technical platform: some new methodologies will have to be explored or old ones adapted.

    Reflexive Space. A Constructionist Model of the Russian Reflexive Marker

    This study examines the structure of the Russian Reflexive Marker (-ся/-сь) and offers a usage-based model building on Construction Grammar and a probabilistic view of linguistic structure. Traditionally, reflexive verbs are accounted for relative to non-reflexive verbs. These accounts assume that linguistic structures emerge as pairs. Furthermore, these accounts assume directionality where the semantics and structure of a reflexive verb can be derived from the non-reflexive verb. However, this directionality does not necessarily hold diachronically. Additionally, the semantics and the patterns associated with a particular reflexive verb are not always shared with the non-reflexive verb. Thus, a model is proposed that can accommodate the traditional pairs as well as the possible deviations without postulating different systems. A random sample of 2000 instances marked with the Reflexive Marker was extracted from the Russian National Corpus, and the sample used in this study contains 819 unique reflexive verbs. This study moves away from the traditional pair account and introduces the concept of Neighbor Verb. A neighbor verb exists for a reflexive verb if they share the same phonological form excluding the Reflexive Marker. It is claimed here that the Reflexive Marker constitutes a system in Russian and the relation between the reflexive and neighbor verbs constitutes a cross-paradigmatic relation. Furthermore, the relation between the reflexive and the neighbor verb is argued to be of symbolic connectivity rather than directionality. Effectively, the relation holding between particular instantiations can vary. The theoretical basis of the present study builds on this assumption. Several new variables are examined in order to systematically model variability of this symbolic connectivity, specifically the degree and strength of connectivity between items. In usage-based models, the lexicon does not constitute an unstructured list of items. 
Instead, items are assumed to be interconnected in a network. This interconnectedness is defined as Neighborhood in this study. Additionally, each verb carves its own niche within the Neighborhood and this interconnectedness is modeled through rhyme verbs constituting the degree of connectivity of a particular verb in the lexicon. The second component of the degree of connectivity concerns the status of a particular verb relative to its rhyme verbs. The connectivity within the neighborhood of a particular verb varies and this variability is quantified by using the Levenshtein distance. The second property of the lexical network is the strength of connectivity between items. Frequency of use has been one of the primary variables in functional linguistics used to probe this. In addition, a new variable called Constructional Entropy is introduced in this study building on information theory. It is a quantification of the amount of information carried by a particular reflexive verb in one or more argument constructions. The results of the lexical connectivity indicate that the reflexive verbs have statistically greater neighborhood distances than the neighbor verbs. This distributional property can be used to motivate the traditional observation that the reflexive verbs tend to have idiosyncratic properties. A set of argument constructions, generalizations over usage patterns, are proposed for the reflexive verbs in this study. In addition to the variables associated with the lexical connectivity, a number of variables proposed in the literature are explored and used as predictors in the model. The second part of this study introduces the use of a machine learning algorithm called Random Forests. The performance of the model indicates that it is capable, up to a degree, of disambiguating the proposed argument construction types of the Russian Reflexive Marker. Additionally, a global ranking of the predictors used in the model is offered. 
Finally, most construction grammars assume that argument constructions form a network structure. A new method is proposed that establishes generalization over the argument constructions, referred to as Linking Construction. In sum, this study explores the structural properties of the Russian Reflexive Marker, and a new model is set forth that can accommodate both the traditional pairs and potential deviations from them in a principled manner.
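    The abstract above names two quantities without giving formulas: Levenshtein distance between verbs and an information-theoretic measure called Constructional Entropy. A minimal sketch of both ideas follows, assuming plain character-level edit distance and ordinary Shannon entropy over how often a verb occurs in each argument construction. The construction labels and counts are hypothetical, and the study's actual operationalisation may differ.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def shannon_entropy(counts: Counter) -> float:
    """Entropy in bits of the distribution given by raw frequency counts."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values() if n > 0)


# A reflexive verb and the form left after removing the marker -ся.
print(levenshtein("мыться", "мыть"))

# Hypothetical counts of one verb across two argument constructions.
usage = Counter({"construction_A": 2, "construction_B": 2})
print(shannon_entropy(usage))
```

    Under these assumptions, a verb attested in a single construction has zero entropy, while one spread evenly over several constructions has higher entropy.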