
    Semantically Motivated Labeling of Verbal Relation Clusters

    Document clustering is a popular research field in Natural Language Processing, Data Mining and Information Retrieval. The problem of lexical unit (LU) clustering has been less addressed, and even less so the problem of labeling LU clusters. However, this problem is central to our application, which distills relational tuples from patent claims as input to block diagram or concept map drawing programs. The assessment of various document cluster labeling techniques suggests that, despite some significant differences that need to be taken into account, some of these techniques can also be applied to the labeling of the verbal relation clusters we are concerned with. To confirm this assumption, we carry out a number of experiments and evaluate their outcome against baselines and a gold standard of labeled clusters.
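
    As an illustration of the kind of baseline such labeling strategies are evaluated against, here is a minimal Python sketch that labels a cluster of semantically similar verbs with its most frequent member; the clusters and frequency counts are invented, and the paper's actual techniques are more elaborate.

        # Frequency baseline for labeling clusters of semantically similar
        # verbs: the most frequent member becomes the cluster label.
        from collections import Counter

        def label_cluster(cluster, freq):
            """Return the cluster member with the highest corpus frequency."""
            return max(cluster, key=lambda verb: freq.get(verb, 0))

        # invented verb frequencies and clusters, for illustration only
        freq = Counter({"connect": 540, "attach": 210, "couple": 90,
                        "transmit": 300, "send": 800, "forward": 150})
        clusters = [["connect", "attach", "couple"],
                    ["transmit", "send", "forward"]]
        for c in clusters:
            print(c, "->", label_cluster(c, freq))  # -> connect, send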

    Lexical co-occurrence and lexical inheritance. Emotion lexemes in German: a lexicographic case study

    In the present paper, we tackle the problem of the compact and efficient representation of restricted lexical co-occurrence information in the lexicon along semantic lines. The theoretical framework for this study is the Meaning-Text Theory (MTT) and, more specifically, the lexicographic part of MTT, the Explanatory Combinatorial Dictionary (ECD), which contains for each lexeme (i) its semantic definition, (ii) a systematic description of its restricted lexical co-occurrence in terms of Lexical Functions (LFs), and (iii) its Government Pattern. The data domain is the semantic field of emotion lexemes in German. In order to represent the restricted lexical co-occurrence (or collocations) of the lexemes in this field, we suggest the following procedure:
    1. Construct approximate descriptions of their meaning, i.e., what we call abridged lexicographic definitions. Formulated in terms of semantic features, these definitions are supposed to provide as much semantic information as necessary for establishing correlations between the semantic features of a lexeme and its collocates.
    2. Specify their syntactic Government Patterns, which are needed for a clearer picture of their co-occurrence, syntactic as well as lexical.
    3. Specify their restricted lexical co-occurrence with the chosen verbs.
    4. Establish correlations between the values of LFs and the semantic features in the abridged definitions of the emotion lexemes.
    5. Based on these correlations, extract recurrent values of LFs (and recurrent Government Patterns) from individual lexical entries and list them under what we call the generic lexeme of the semantic field under study, in this case GEFÜHL 'emotion'. This leads, on the one hand, to "compressed" lexical entries for emotion lexemes and, on the other hand, to the creation of a lexical entry of a new type: the "public" entry of a generic lexeme.
    Keywords: lexicography, lexicon, German emotion lexemes, lexical co-occurrence, collocations, Meaning-Text Theory, Lexical Functions, semantic features, semantico-lexical correlations, information extraction, inheritance, individual lexical subentry, public lexical subentry
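
    Step 5 of the procedure amounts to an inheritance mechanism. A minimal Python sketch of the idea follows; the entries, features and LF values are invented for illustration and do not reproduce the dictionary's actual content.

        # Recurrent Lexical Function (LF) values are stored once, in the
        # "public" entry of the generic lexeme GEFÜHL 'emotion'; individual
        # entries record only their own, non-inherited values.
        from dataclasses import dataclass, field

        @dataclass
        class Entry:
            lemma: str
            features: set = field(default_factory=set)     # abridged definition
            lf_values: dict = field(default_factory=dict)  # LF name -> collocates

        GENERIC = Entry("GEFÜHL", lf_values={"Oper1": ["empfinden", "haben"]})

        def lf(entry, name):
            """Look up an LF value, falling back to the generic entry."""
            return entry.lf_values.get(name, GENERIC.lf_values.get(name, []))

        angst = Entry("ANGST", features={"negative"}, lf_values={"Magn": ["groß"]})
        print(lf(angst, "Oper1"))  # inherited from GEFÜHL: ['empfinden', 'haben']
        print(lf(angst, "Magn"))   # own value: ['groß']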

    Classification of Grammatical Collocation Errors in Texts by Learners of Spanish

    Arbitrary recurrent word combinations (collocations) are key in language learning. However, even advanced students have difficulties when using them, and efficient collocation aiding tools would be of great help. Existing "collocation checkers" still struggle to offer corrections for miscollocations: they attempt to correct without distinguishing between the different types of errors and consequently provide heterogeneous lists of collocations as suggestions. Besides, they focus solely on lexical errors, leaving aside grammatical ones. The former attract more attention, but the latter cannot be ignored either if the goal is to develop a comprehensive collocation aiding tool, able to correct all kinds of miscollocations. We propose an approach to automatically classify grammatical collocation errors made by US learners of Spanish as a starting point for the design of specific correction strategies targeted at each type of error.
    This work has been funded by the Spanish Ministry of Economy and Competitiveness (MINECO) through a predoctoral grant (BES-2012-057036) in the framework of the HARenES project, under contract number FFI2011-30219-C02-02.
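
    As a rough illustration of such error classification, consider the following Python sketch, which trains a toy feature-based classifier; the feature set, error taxonomy and training instances are invented stand-ins and are not the classification scheme proposed in the paper.

        # Toy supervised classification of grammatical collocation errors
        # from hand-crafted surface features of each error instance.
        from sklearn.feature_extraction import DictVectorizer
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.pipeline import make_pipeline

        # invented training instances: (feature dict, error type)
        train = [({"base_pos": "N", "collocate_pos": "V",
                   "has_wrong_preposition": True}, "government"),
                 ({"base_pos": "N", "collocate_pos": "ADJ",
                   "number_mismatch": True}, "number agreement"),
                 ({"base_pos": "N", "collocate_pos": "ADJ",
                   "gender_mismatch": True}, "gender agreement")]

        X, y = zip(*train)
        clf = make_pipeline(DictVectorizer(), DecisionTreeClassifier())
        clf.fit(X, y)
        print(clf.predict([{"base_pos": "N", "collocate_pos": "ADJ",
                            "gender_mismatch": True}]))  # ['gender agreement']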

    Discourse structuring of dynamic content

    One of Natural Language Generation's continuing challenges is to adapt the structure and words of the generated linguistic output to the expertise of the user, the content, the appropriate genre, the style, etc. We focus on the determination of the discourse structure. Most often, it is assumed that the same discourse relation always holds between two content units. Approaches in which the choice of discourse relations and the ordering of propositions depend on the interpretation of the content are still scarce. However, such an interpretation is extremely important, especially if the content is highly dynamic, as, e.g., in the case of time series of data parameters. We present a text planner that takes into account the constraints imposed by dynamic data to make decisions at every stage of text planning, in particular for the selection of discourse relations and the ordering of propositions.
    The work reported on in this paper has been carried out in the framework of the MARQUIS project, funded by the European Commission under the eContent programme (contract number EDC-11258; duration: 2005-2007).
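
    To give an intuition of how dynamic data can constrain discourse structuring, the following Python sketch derives a discourse relation from the local trend of a time series; the rule and the two-relation inventory are deliberately simplified inventions, not the MARQUIS planner itself.

        # A reversal of the trend between adjacent readings triggers
        # CONTRAST; continuation of the trend triggers SEQUENCE.
        def trend(a, b):
            return "rise" if b > a else "fall" if b < a else "steady"

        def plan(values):
            """Return (relation, from_index, to_index) tuples over a series."""
            out = []
            for i in range(1, len(values) - 1):
                before = trend(values[i - 1], values[i])
                after = trend(values[i], values[i + 1])
                out.append(("CONTRAST" if before != after else "SEQUENCE",
                            i, i + 1))
            return out

        # rise, rise, fall, fall: exactly one trend reversal -> one CONTRAST
        print(plan([40, 55, 70, 60, 50]))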

    Dataset annotation in abusive language detection

    The last decade saw the rise of research in the area of hate speech and abusive language detection. Much research has been conducted, with new datasets introduced and new models put forward. However, contrastive studies of the annotation of different datasets have also revealed that some problematic issues remain: ambiguous and inconsistent definitions across studies make it difficult to evaluate model reproducibility and generalizability and require additional steps for dataset standardization. To overcome these challenges, the field needs a common understanding of concepts and problems, such that standard datasets and compatible approaches can be developed and inefficient, redundant research is avoided. This article attempts to identify persistent challenges and develop guidelines to help future annotation tasks. The challenges and guidelines identified and discussed in the article relate, among others, to concept subjectivity, the focus on overt hate speech, dataset integrity and the lack of ethical considerations.

    Multilingual Surface Realization Using Universal Dependency Trees

    We propose a shared task on multilingual Surface Realization, i.e., on mapping unordered and uninflected universal dependency trees to correctly ordered and inflected sentences in a number of languages. A second, deeper input will be available in which, in addition, functional words, fine-grained PoS and morphological information will be removed from the input trees. The first shared task on Surface Realization was carried out in 2011 with a similar setup, with a focus on English. We think that it is time for relaunching such a shared task effort in view of the arrival of Universal Dependencies annotated treebanks for a large number of languages on the one hand, and the increasing dominance of Deep Learning, which proved to be a game changer for NLP, on the other hand.
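
    For intuition, the following Python sketch linearizes a toy unordered dependency tree with a hand-written ordering rule; shared-task systems are expected to learn such orderings (and inflection) from Universal Dependencies treebanks instead, and the tree and rule below are invented.

        # Order a head and its dependents by dependency relation type:
        # determiners and subjects precede the head, objects follow it.
        ORDER = {"det": 0, "nsubj": 1, "head": 2, "obj": 3}

        def linearize(token, tree):
            """Recursively order a head and its dependents."""
            items = [(ORDER["head"], [token])]
            for child, rel in tree.get(token, []):
                items.append((ORDER.get(rel, 4), linearize(child, tree)))
            return [w for _, words in sorted(items) for w in words]

        # 'wrote' heads an unordered tree with a subject and an object
        tree = {"wrote": [("she", "nsubj"), ("letter", "obj")],
                "letter": [("a", "det")]}
        print(" ".join(linearize("wrote", tree)))  # she wrote a letter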

    Towards Weakly-Supervised Hate Speech Classification Across Datasets

    As pointed out by several scholars, current research on hate speech (HS) recognition is characterized by unsystematic data creation strategies and diverging annotation schemata. Subsequently, supervised-learning models tend to generalize poorly to datasets they were not trained on, and the performance of models trained on datasets labeled using different HS taxonomies cannot be compared. To ease this problem, we propose applying extremely weak supervision that relies only on the class name rather than on class samples from the annotated data. We demonstrate the effectiveness of a state-of-the-art weakly-supervised text classification model in various in-dataset and cross-dataset settings. Furthermore, we conduct an in-depth quantitative and qualitative analysis of the sources of poor generalizability of HS classification models.
    Comment: Accepted to WOAH 7 @ ACL 2023
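
    A minimal sketch of what "supervision from class names only" can look like is given below; the classes, cue words and matching rule are invented stand-ins, whereas the model used in the paper learns class representations from unlabeled text rather than relying on hand-picked keywords.

        # Each class is represented by its name plus a few cue words (an
        # assumption for this sketch); a document goes to the class whose
        # representation it overlaps most.
        CLASSES = {
            "hateful": {"hateful", "hate", "slur", "attack"},
            "offensive": {"offensive", "insult", "rude"},
            "neutral": {"neutral", "ordinary"},
        }

        def classify(document):
            tokens = set(document.lower().split())
            return max(CLASSES, key=lambda c: len(CLASSES[c] & tokens))

        print(classify("that was a rude insult"))  # offensive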

    Collocation Discovery Using Semantics

    Collocations are combinations of two lexically dependent elements, of which one (the base) is freely chosen because of its meaning, while the choice of the other (the collocate) depends on the base. Collocations are difficult for language learners to master. This difficulty becomes evident in that, even when learners know the meaning they want to express, they often struggle to choose the right collocate. Collocation dictionaries in which collocates are grouped into semantic categories are useful tools. However, they are scarce, since they are the result of cost-intensive manual elaboration. In this paper, we present an algorithm for Spanish that, given a base and a semantic category, automatically retrieves the corresponding collocates.
    The present work has been funded by the Spanish Ministry of Economy and Competitiveness (MINECO) through a predoctoral grant (BES-2012-057036) in the framework of the HARenES project (FFI2011-30219-C02-02) and the Maria de Maeztu Excellence Program (MDM-2015-0502).
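
    The following Python sketch conveys the core idea of semantically filtered collocate retrieval: candidate collocates observed with a base are ranked by the similarity of their vectors to a vector standing for the target semantic category. The two-dimensional toy vectors and the category inventory are invented for illustration and stand in for real embeddings.

        import math

        VEC = {  # toy embeddings: (intensity, causation)
            "profundo": (0.9, 0.1), "enorme": (0.8, 0.2),
            "causar": (0.1, 0.9), "provocar": (0.2, 0.8),
        }
        CATEGORY = {"intense": (1.0, 0.0), "cause": (0.0, 1.0)}

        def cos(u, v):
            dot = sum(a * b for a, b in zip(u, v))
            return dot / (math.hypot(*u) * math.hypot(*v))

        def collocates(candidates, category, k=2):
            """Rank candidate collocates by category similarity."""
            target = CATEGORY[category]
            return sorted(candidates, key=lambda w: cos(VEC[w], target),
                          reverse=True)[:k]

        # candidates observed next to the base 'tristeza' (sadness)
        print(collocates(["profundo", "enorme", "causar", "provocar"],
                         "intense"))  # ['profundo', 'enorme']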

    Combining Dictionary- and Corpus-Based Concept Extraction

    Concept extraction is an increasingly popular topic in deep text analysis. Concepts are individual content elements; their extraction thus offers an overview of the content of the material from which they were extracted. In the case of domain-specific material, concept extraction boils down to term identification. The most straightforward strategy for term identification is a look-up in existing terminological resources. In recent research, this strategy has a poor reputation because it is prone to scaling limitations due to neologisms, lexical variation, synonymy, etc., which subject the terminology to constant change. For this reason, many works have developed statistical techniques to extract concepts. But the existence of a crowdsourced resource such as Wikipedia is changing the landscape. We present a hybrid approach that combines state-of-the-art statistical techniques with the large-scale term acquisition tool BabelFy to perform concept extraction. The combination of the two allows us to boost performance compared to approaches that use these techniques separately.
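
    A minimal sketch of the combination idea follows; both helper functions are hypothetical stand-ins (a real system would use termhood measures such as C-value and a lookup against a resource like BabelFy), so only the merging logic is illustrative of the approach.

        def statistical_candidates(text):
            # stand-in ranker: score bigrams by relative frequency
            words = text.lower().split()
            bigrams = [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]
            return {b: bigrams.count(b) / len(bigrams) for b in set(bigrams)}

        def dictionary_candidates(text):
            # stand-in for a lookup in a large terminological resource
            known = {"concept extraction", "text analysis"}
            return {t for t in known if t in text.lower()}

        def combine(text, boost=1.0):
            """Merge both candidate sets; agreement raises the score."""
            scores = statistical_candidates(text)
            for term in dictionary_candidates(text):
                scores[term] = scores.get(term, 0.0) + boost
            return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

        print(combine("concept extraction is a topic in deep text analysis")[:3])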

    FootbOWL: Using a generic ontology of football competition for planning match summaries

    The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-642-21034-1_16
    Proceedings of the 8th Extended Semantic Web Conference, ESWC 2011, Heraklion, Crete, Greece, May 29-June 2, 2011
    We present a two-layer OWL ontology-based Knowledge Base (KB) that allows for flexible content selection and discourse structuring in Natural Language Generation (NLG) and discuss its use for these two tasks. The first layer of the ontology contains an application-independent base ontology. It models the domain and was not designed with NLG in mind. The second layer, which is added on top of the base ontology, models entities and events that can be inferred from the base ontology, including inferable logico-semantic relations between individuals. The nodes in the KB are weighted according to learnt models of content selection, such that a subset of them can be extracted. The extraction is done using templates that also consider the semantic relations between the nodes and a simple user profile. The discourse structuring submodule maps the semantic relations onto discourse relations and forms discourse units, which it then arranges into a coherent discourse graph. The approach is illustrated and evaluated on a KB that models the First Spanish Football League.
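
    As an illustration of the content selection and discourse structuring steps, here is a minimal Python sketch that keeps KB nodes above a weight threshold and maps the semantic relations between the survivors onto discourse relations; the node names, weights and mapping are invented, and the paper learns its selection models rather than thresholding hand-set weights.

        # Map logico-semantic edge labels from the KB onto discourse relations.
        SEM_TO_DISC = {"causes": "CAUSE", "follows": "SEQUENCE",
                       "contrasts": "CONTRAST"}

        nodes = {"early_goal": 0.9, "red_card": 0.8, "possession_stat": 0.2}
        edges = [("early_goal", "causes", "red_card"),
                 ("early_goal", "follows", "possession_stat")]

        def discourse_graph(threshold=0.5):
            """Keep nodes above the threshold and map edge labels."""
            selected = {n for n, w in nodes.items() if w >= threshold}
            return [(a, SEM_TO_DISC[rel], b) for a, rel, b in edges
                    if a in selected and b in selected]

        print(discourse_graph())  # [('early_goal', 'CAUSE', 'red_card')]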