66 research outputs found

    Measuring associational thinking through word embeddings

    Full text link
    The development of a model to quantify semantic similarity and relatedness between words has been the major focus of many studies in various fields, e.g. psychology, linguistics and natural language processing. Unlike the measures proposed by most previous research, this article aims to estimate automatically the strength of association between words, whether or not they are semantically related. We demonstrate that the performance of the model depends not only on the combination of independently constructed word embeddings (namely, corpus- and network-based embeddings) but also on the way these word vectors interact. The research concludes that the weighted average of the cosine-similarity coefficients derived from independent word embeddings in a double vector space tends to yield high correlations with human judgements. Moreover, we demonstrate that evaluating word associations through a measure that relies not only on the rank ordering of word pairs but also on the strength of associations can reveal findings that go unnoticed by traditional measures such as Spearman's and Pearson's correlation coefficients.

    Financial support for this research has been provided by the Spanish Ministry of Science, Innovation and Universities [grant number RTC 2017-6389-5], the Spanish "Agencia Estatal de Investigación" [grant number PID2020-112827GB-I00 / AEI / 10.13039/501100011033], and the European Union's Horizon 2020 research and innovation program [grant number 101017861: project SMARTLAGOON]. Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.

    Periñán-Pascual, C. (2022). Measuring associational thinking through word embeddings. Artificial Intelligence Review. 55(3):2065-2102. https://doi.org/10.1007/s10462-021-10056-6
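    The combination described in the abstract can be made concrete with a minimal sketch: a weighted average of cosine similarities taken in two independently constructed embedding spaces. The function names, the default weight and the toy vectors below are all hypothetical; only the idea of combining corpus- and network-based embeddings comes from the abstract.

    ```python
    import numpy as np

    def cosine(u: np.ndarray, v: np.ndarray) -> float:
        """Cosine similarity between two dense word vectors."""
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def association_strength(word_a: str, word_b: str,
                             corpus_emb: dict, network_emb: dict,
                             weight: float = 0.5) -> float:
        """Weighted average of the cosine similarities computed in two
        independently constructed vector spaces (corpus- and network-based).
        The weight parameter is illustrative, not taken from the paper."""
        sim_corpus = cosine(corpus_emb[word_a], corpus_emb[word_b])
        sim_network = cosine(network_emb[word_a], network_emb[word_b])
        return weight * sim_corpus + (1.0 - weight) * sim_network

    # Toy 3-dimensional vectors, for illustration only.
    corpus_emb = {"tea": np.array([0.9, 0.1, 0.0]), "cup": np.array([0.7, 0.3, 0.1])}
    network_emb = {"tea": np.array([0.6, 0.4, 0.2]), "cup": np.array([0.5, 0.4, 0.3])}
    print(association_strength("tea", "cup", corpus_emb, network_emb, weight=0.6))
    ```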

    DEXTER: A workbench for automatic term extraction with specialized corpora

    Full text link
    Automatic term extraction has become a priority area of research within corpus processing. Despite the extensive literature in this field, there are still some outstanding issues that should be dealt with during the construction of term extractors, particularly those oriented to supporting research in terminology and terminography. In this regard, this article describes the design and development of DEXTER, an online workbench for the extraction of simple and complex terms from domain-specific corpora in English, French, Italian and Spanish. In this framework, three issues contribute to placing the most important terms in the foreground. First, unlike the elaborate morphosyntactic patterns proposed by most previous research, shallow lexical filters have been constructed to discard term candidates. Second, a large number of common stopwords are automatically detected by means of a method that relies on the IATE database together with the frequency distributions of the domain-specific corpus and a general corpus. Third, the term-ranking metric, which is grounded in the notions of salience, relevance and cohesion, is guided by the IATE database to display an adequate distribution of terms.

    Financial support for this research has been provided by the DGI, Spanish Ministry of Education and Science, grant FFI2014-53788-C3-1-P.

    Periñán-Pascual, C. (2018). DEXTER: A workbench for automatic term extraction with specialized corpora. Natural Language Engineering. 24(2):163-198. https://doi.org/10.1017/S1351324917000365
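    The second issue, detecting stopwords by comparing the frequency distribution of the domain-specific corpus with that of a general corpus, can be sketched roughly as below. This is a guess at the general idea only: DEXTER's actual method also consults the IATE database, and the relative-frequency ratio test and threshold used here are hypothetical.

    ```python
    from collections import Counter

    def likely_stopwords(domain_tokens, general_tokens, ratio_threshold=1.5):
        """Flag words whose relative frequency in the domain corpus is not
        markedly higher than in a general reference corpus (hypothetical
        criterion; the real method also uses the IATE term base)."""
        domain_freq = Counter(domain_tokens)
        general_freq = Counter(general_tokens)
        n_dom, n_gen = len(domain_tokens), len(general_tokens)
        flagged = set()
        for word, count in domain_freq.items():
            rel_dom = count / n_dom
            rel_gen = general_freq[word] / n_gen
            if rel_gen > 0 and rel_dom / rel_gen <= ratio_threshold:
                flagged.add(word)
        return flagged

    domain = "the term extraction of the term candidates in the corpus".split()
    general = "the cat sat on the mat and the dog sat too".split()
    print(likely_stopwords(domain, general))  # {'the'}: frequent everywhere, so flagged
    ```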

    Bridging the gap within text-data analytics: A computer environment for data analysis in linguistic research

    Full text link
    Since computer technology became widely available at universities during the last quarter of the twentieth century, language researchers have been successfully employing software to analyse usage patterns in corpora. However, although there has been a proliferation of software for the different disciplines within text-data analytics, e.g. corpus linguistics, statistics, natural language processing and text mining, this article demonstrates that any computer environment intended to support advanced linguistic research more effectively should be grounded in a user-centred approach that holistically integrates cross-disciplinary methods and techniques in a linguist-friendly manner. To this end, I examine not only the tasks that are derived from linguists' needs and goals but also the technologies that appropriately deal with the properties of linguistic data. This research results in the implementation of DAMIEN, an online workbench designed to conduct linguistic experiments on corpora.

    Financial support for this research has been provided by the DGI, Spanish Ministry of Education and Science, grant FFI2014-53788-C3-1-P.

    Periñán Pascual, C. (2017). Bridging the gap within text-data analytics: A computer environment for data analysis in linguistic research. LFE. Revista de Lenguas para Fines Específicos. 23(2):111-132. https://doi.org/10.20420/rlfe.2017.175

    The situated common-sense knowledge in FunGramKB

    Full text link
    It has been widely demonstrated that expectation-based schemata, along the lines of Lakoff's propositional Idealized Cognitive Models, play a crucial role in text comprehension. Discourse inferences are grounded in the shared generalized knowledge that is activated from the situational model underlying the surface dimension of the text. Adopting a cognitively plausible and linguistically aware approach to knowledge representation, FunGramKB stands out for being a dynamic repository of lexical, constructional and conceptual knowledge which contributes to simulating human-level reasoning. The objective of this paper is to present a script model as a carrier of the situated common-sense knowledge required to help knowledge engineers construct more "intelligent" natural language processing systems.

    Periñán Pascual, JC. (2012). The situated common-sense knowledge in FunGramKB. Review of Cognitive Linguistics. 10(1):184-214. https://doi.org/10.1075/rcl.10.1.06per

    The underpinnings of a composite measure for automatic term extraction: The case of SRC

    Full text link
    The corpus-based identification of the lexical units that serve to describe a given specialized domain is usually a complex task, in which an analysis oriented to the frequency of words and the likelihood of lexical associations is often ineffective. The goal of this article is to demonstrate that a user-adjustable composite metric such as SRC can accommodate the diversity of domain-specific glossaries to be constructed from small- and medium-sized specialized corpora of non-structured texts. Unlike most research in automatic term extraction, where single metrics are usually combined indiscriminately to produce the best results, SRC is grounded in the theoretical principles of salience, relevance and cohesion, which have been rationally implemented in the three components of this metric.

    Financial support for this research has been provided by the DGI, Spanish Ministry of Education and Science, grants FFI2011-29798-C02-01 and FFI2014-53788-C3-1-P.

    Periñán Pascual, JC. (2015). The underpinnings of a composite measure for automatic term extraction: The case of SRC. Terminology. 21(2):151-179. https://doi.org/10.1075/term.21.2.02per
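    Since the abstract presents SRC as a user-adjustable composite of salience, relevance and cohesion, one plausible reading is a weighted combination of three component scores. The sketch below assumes a linear combination purely for illustration; the paper defines the actual components and their aggregation, so every name, weight and value here is hypothetical.

    ```python
    def src_score(salience: float, relevance: float, cohesion: float,
                  w_s: float = 1/3, w_r: float = 1/3, w_c: float = 1/3) -> float:
        """User-adjustable composite of the three component scores.
        A weighted linear combination is assumed here, not taken from the paper."""
        assert abs(w_s + w_r + w_c - 1.0) < 1e-9, "weights should sum to 1"
        return w_s * salience + w_r * relevance + w_c * cohesion

    # A user tuning the metric towards salience for a small specialized corpus.
    print(src_score(salience=0.8, relevance=0.6, cohesion=0.4,
                    w_s=0.5, w_r=0.3, w_c=0.2))  # 0.66
    ```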

    Multilingualism and conceptual modelling

    Full text link
    One of the leading motivations behind the multilingual semantic web is to make resources accessible digitally in an online global multilingual context. Consequently, it is fundamental for knowledge bases to find a way to manage multilingualism and thus be equipped with procedures for its conceptual modelling. In this context, the goal of this paper is to discuss how common-sense knowledge and cultural knowledge are modelled in a multilingual framework. More particularly, multilingualism and conceptual modelling are dealt with from the perspective of FunGramKB, a lexico-conceptual knowledge base for natural language understanding. This project argues for a clear division between the lexical and the conceptual dimensions of knowledge. Moreover, the conceptual layer is organized into three modules, which result from a strong commitment to capturing semantic knowledge (Ontology), procedural knowledge (Cognicon) and episodic knowledge (Onomasticon). Cultural mismatches are discussed and formally represented at the three conceptual levels of FunGramKB.

    We would like to thank Guadalupe Aguado-de-Cea, Christopher Butler, Lachlan Mackenzie, Elena Montiel-Ponsoda and Brian Nolan for detailed comments on the first draft of this paper. Any remaining errors are ours. Financial support for this research has been provided by the Spanish Ministry of Education and Science, grants FFI2011-29798-C02-01 and FFI2014-53788-C3-1-P.

    Mairal-Usón, R.; Periñán-Pascual, C. (2016). Multilingualism and conceptual modelling. Círculo de Lingüística Aplicada a la Comunicación. 66:244-277. https://doi.org/10.5209/CLAC.52774

    A framework of analysis for the evaluation of automatic term extractors

    Full text link
    Following previous research on automatic term extraction, the primary aim of this paper is to propose a more robust and consistent framework of analysis for the comparative evaluation of term extractors. Among the different views of software quality outlined in ISO standards, our proposal focuses on the criterion of external quality, and in particular on the characteristics of functionality, usability and efficiency, together with the subcharacteristics of suitability, precision, operability and time behavior. The evaluation phase is completed by comparing four online open-access automatic term extractors: TermoStat, GaleXtract, BioTex and DEXTER. The latter resource forms part of the virtual functional laboratory for natural language processing (FUNK Lab) developed by our research group. Furthermore, the results obtained from the comparative analysis are discussed.

    Financial support for this research has been provided by the Spanish Ministry of Economy, Competitiveness and Science, grant FFI2014-53788-C3-1-P.

    Periñán-Pascual, C.; Mairal-Usón, R. (2018). A framework of analysis for the evaluation of automatic term extractors. VIAL. Vigo International Journal of Applied Linguistics. 15:105-125. https://doi.org/10.35869/vial.v0i15.88
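    Of the subcharacteristics listed, precision lends itself most directly to a worked example: one common way to operationalize it for a ranked list of term candidates is precision at k against a gold-standard term list. The sketch below is a generic illustration of that measure, not the paper's evaluation protocol, and all names and data are made up.

    ```python
    def precision_at_k(ranked_candidates, gold_terms, k=100):
        """Proportion of the top-k ranked candidates that are true terms."""
        top_k = ranked_candidates[:k]
        if not top_k:
            return 0.0
        return sum(term in gold_terms for term in top_k) / len(top_k)

    # Hypothetical extractor output evaluated against a hypothetical gold list.
    gold = {"term extraction", "word embedding", "knowledge base"}
    ranked = ["word embedding", "the results", "knowledge base", "data set"]
    print(precision_at_k(ranked, gold, k=4))  # 2 of 4 candidates are terms -> 0.5
    ```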

    Cognitive modules of an NLP knowledge base for language understanding

    Get PDF
    Some natural language processing applications, e.g. machine translation, require a knowledge base with conceptual representations that reflect the structure of the human cognitive system. In other systems, e.g. automatic indexing or information extraction, surface semantics may be sufficient, but the construction of a robust knowledge base guarantees its use in most natural language processing tasks, thus consolidating the concept of resource reuse. The objective of this paper is to describe FunGramKB, a multipurpose lexico-conceptual knowledge base for natural language processing systems. Particular attention is paid to the two main cognitive modules, i.e. the Ontology and the Cognicon.