15 research outputs found

    OpenTagger: A flexible and user-friendly linguistic tagger

    Linguistic annotation adds valuable information to a corpus. Annotated corpora are highly useful for linguists since they increase the range of linguistic phenomena that may be registered, categorised and retrieved. They are also significant for machines: Natural Language Processing applications involve working with well-annotated data (e.g. Imran, Mitra and Castillo 2016), and some machine learning classifiers employ annotated data to test or train new language annotation tools, among other uses. In this regard, Pustejovsky and Stubbs (2012) report on stages for building annotated corpora to train machine learning algorithms. This paper describes OpenTagger, a new linguistic tagger that allows users to attach any type of information to the paragraphs, sentences or words that compose a text. OpenTagger is characterised by its high usability and flexibility. It is a web application that allows users to manually annotate texts using their own predefined tagset or by creating a new one, so it can serve any need for a tailor-made annotation system. The tagset may include nested categories, and multiple layers of annotation are possible. The annotation process provides two options: i) selecting text and then tagging it; ii) selecting a tag and then annotating as much text as needed. OpenTagger also includes a search box to query the text and retrieve relevant sections for tagging. In sum, the open character of this tool and its user-friendliness allow the benefits of annotation to be extended to a wider variety of research questions. OpenTagger differs from other well-known taggers such as Nooj (Silberztein, 2005) in its simplicity and web access, as it is not specialised for grammar construction or other complex processes. Potential users range from novice linguistics researchers to experts. Last, further integration within the corpus analysis software ACTRES Corpus Manager (ACM; Sanjurjo-González, 2017) is planned. OpenTagger will make the process of building and querying custom annotated corpora using ACM more straightforward.
    ACTRES, TRALIMA/ITZULIK, GIU19/067, Gobierno Vasco IT1209/1
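The ideas of a nested tagset, multiple annotation layers and search-then-tag can be illustrated with a small stand-off annotation sketch. This is not OpenTagger's actual data model, which the abstract does not describe; the class names `Tag` and `Annotation`, the `annotate` helper, the layer name and the sample tagset are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Tag:
    """A tagset category that may contain nested subcategories (hypothetical model)."""
    name: str
    children: list[Tag] = field(default_factory=list)

    def paths(self, prefix: str = ""):
        """Yield the full path of this tag and of every nested tag."""
        path = f"{prefix}/{self.name}" if prefix else self.name
        yield path
        for child in self.children:
            yield from child.paths(path)


@dataclass
class Annotation:
    """A stand-off annotation: one tag applied to a character span in one layer."""
    start: int
    end: int
    tag_path: str
    layer: str


def annotate(text, phrase, tag_path, layer, valid_paths):
    """Search for `phrase` and tag every occurrence (pick a tag, then mark text)."""
    if tag_path not in valid_paths:
        raise ValueError(f"unknown tag: {tag_path}")
    found = []
    pos = text.find(phrase)
    while pos != -1:
        found.append(Annotation(pos, pos + len(phrase), tag_path, layer))
        pos = text.find(phrase, pos + 1)
    return found


# Illustrative tagset with one level of nesting.
tagset = Tag("evaluation", [Tag("positive"), Tag("negative")])
valid = set(tagset.paths())

text = "A great screen and a great battery, but a poor camera."
annotations = (
    annotate(text, "great", "evaluation/positive", "semantic", valid)
    + annotate(text, "poor", "evaluation/negative", "semantic", valid)
)
for a in annotations:
    print(a.layer, a.tag_path, repr(text[a.start:a.end]))
```

Keeping annotations as spans outside the text is one common way to let several layers coexist over the same passage without altering the source.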

    Creación de un Framework para el tratamiento de corpus lingüísticos = Development of a Framework for corpus linguistic analysis

    436 p.
    Despite undoubted recent advances in software for processing linguistic corpora, whether through the handling of ever larger corpora or the inclusion of more complex statistics, usability and the non-technical profile of the end user are still not taken into account. The situation is more evident when working with languages other than English and with combinations of languages, since their typology and specificity affect the software requirements, and for this reason fewer resources are available and they are of lower quality. The state of the art reveals that creating parallel or comparable bi-/multilingual corpora, as well as incorporating linguistic annotation into existing corpus processing frameworks, requires the user to have some programming knowledge, or at least to know how to run programs with limited usability and/or their own scripts, in order to adapt the corpus to the requirements of the framework used. If these conditions are not met, technical specialists with programming and Natural Language Processing (NLP) skills are indispensable. The aim of the doctoral thesis is therefore to develop a software tool, named ACTRES Corpus Manager, that enables linguists to build their own corpora (monolingual, parallel bi-/multilingual or comparable) with different layers of annotation (grammatical, semantic or rhetorical) and to obtain linguistic and statistical data without technical assistance at any point in the process, regardless of the user's technical skills. The strategy chosen for developing ACTRES Corpus Manager is the creation of a web-accessible framework made up of interconnected components. Each activity needed to create a corpus is assigned to one of these components, which makes them easy to modify and reuse. ACTRES Corpus Manager combines third-party software resources whose efficiency and validity have been demonstrated (e.g. The IMS Corpus Workbench, Treetagger, hunalign, etc.) with in-house software solutions for those processes that the state of the art has shown to be more immature and/or complex to integrate (rhetorical tagger, semantic tagger, etc.). Finally, the query interface of ACTRES Corpus Manager is inspired by P-ACTRES 2.0 and supports assisted complex queries based on regular expressions, as well as the extraction of the usual statistics, without requiring the user to have specific knowledge of the syntax of the query language used.

    Statistics and visualisations of theatre corpora using corpus analysis software

    38º Congreso Internacional AESLA (2021)
    Corpus linguistics is a powerful quantitative methodology that relies on frequency data and statistical procedures (Han 2019). According to Gries (2013), scientific quantitative research has three main goals: description, explanation and prediction of data. Within this frame, statistics makes sense of quantitative data by means of analysis and useful visualisations (Brezina 2018). Many techniques have been designed for monolingual corpora, such as the statistical identification of collocations or keywords. While most of these can also be applied to other types of corpora, such as parallel and comparable ones, a dedicated set of statistics related to the structural singularities of text types such as theatre plays seems to be missing. In this study, we propose a range of adaptations of statistics and visualisations that apply and interrelate theatre-specific filters. The division of dramatic texts into structural units is a specific feature of this genre (Andaluz-Pinedo and Sanjurjo-González in press). Utterances, speakers, stage directions and dialogues are an intrinsic part of these texts and must be taken into account when developing useful and descriptive statistical procedures. One example of such an adaptation is quantitative analysis based on the units of characters, utterances, stage directions and dialogues instead of on the text data as a whole. As Anthony (2013) states, “the functionality offered by software tools largely dictates what corpus linguistics research methods are available to a researcher”. In order to improve this functionality when theatre corpora are analysed, further work includes the integration of this approach into an existing corpus analysis tool that processes theatre play-texts, such as ACTRES Corpus Manager (Sanjurjo-González, 2017).
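As a rough illustration of statistics computed per structural unit rather than over the whole text, the following minimal sketch counts utterances and tokens per speaker and tokens per unit type. The tuple-based play representation, the sample lines and the function names are assumptions made for this example; they are not taken from the study or from ACTRES Corpus Manager.

```python
from collections import Counter

# Hypothetical play-text representation: (unit_type, speaker, text).
play = [
    ("stage_direction", None, "A room. Night."),
    ("utterance", "ANA", "Who is there?"),
    ("utterance", "LUIS", "It is me. Open the door."),
    ("stage_direction", None, "She opens the door."),
    ("utterance", "ANA", "Come in."),
]


def tokens(text):
    """Naive whitespace tokeniser that strips basic punctuation."""
    stripped = (word.strip(".,?!").lower() for word in text.split())
    return [word for word in stripped if word]


def stats_by_speaker(units):
    """Utterance count, token count and mean utterance length per speaker."""
    utterances, words = Counter(), Counter()
    for unit_type, speaker, text in units:
        if unit_type == "utterance":
            utterances[speaker] += 1
            words[speaker] += len(tokens(text))
    return {
        speaker: {
            "utterances": utterances[speaker],
            "tokens": words[speaker],
            "mean_length": words[speaker] / utterances[speaker],
        }
        for speaker in utterances
    }


def tokens_by_unit_type(units):
    """Token totals split by structural unit (utterances vs. stage directions)."""
    totals = Counter()
    for unit_type, _speaker, text in units:
        totals[unit_type] += len(tokens(text))
    return dict(totals)


speaker_stats = stats_by_speaker(play)
unit_totals = tokens_by_unit_type(play)
print(speaker_stats)
print(unit_totals)
```

Filtering by unit type before counting is what separates, for instance, a character's spoken vocabulary from the vocabulary of the stage directions.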

    Pragmatic annotation of a domain-restricted English-Spanish comparable corpus

    [EN] This paper explores the multi-layer annotation of a written domain-restricted English-Spanish comparable corpus (CLANES – Controlled LANguage English Spanish), focusing on pragmatic annotation. The annotation scheme draws on part-of-speech tagging and a semantic annotation scheme, i.e. the UCREL Semantic Analysis System, with some added categories to fit the food-and-drink domain represented in CLANES. These are used to build significant (pragmatic) metapatterns. Seven different pragmatic functions have been identified in our corpus. Computer scripts translate this linguistic information into regular expressions to be used in unsupervised annotation. Partial results indicate that applying lexical restrictors boosts the success rate considerably. However, metadata is preferred because of increased replicability and generality. Replicability issues and limitations encountered during testing are also addressed.
    This research has been funded by grant FFI2016-75672-R awarded by the Spanish Ministry of Science and Innovation and ERDF (European Regional Development Fund).
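The step of translating a metapattern into a regular expression can be sketched as follows. This is only an illustration under stated assumptions: the `word/POS/SEM` token encoding, the helper names `token_regex` and `metapattern_regex`, the sample sentence and the tag labels are all invented for the example and are not the scripts or tags used for CLANES.

```python
import re


def token_regex(pos=None, sem=None, words=None):
    """Regex for one token encoded as word/POS/SEM (hypothetical encoding).

    `words` acts as an optional lexical restrictor; `pos` and `sem` constrain
    the part-of-speech and semantic tags. Unspecified fields match anything.
    """
    any_field = r"[^/\s]+"
    if words:
        word_part = "(?i:" + "|".join(re.escape(w) for w in words) + ")"
    else:
        word_part = any_field
    pos_part = re.escape(pos) if pos else any_field
    sem_part = re.escape(sem) if sem else any_field
    # Lookarounds pin the match to whole whitespace-delimited tokens.
    return rf"(?<!\S){word_part}/{pos_part}/{sem_part}(?!\S)"


def metapattern_regex(slots):
    """Compile a sequence of slot descriptions into one regular expression."""
    return re.compile(r"\s+".join(token_regex(**slot) for slot in slots))


# Illustrative tagged sentence; the tag labels are placeholders.
tagged = "Serve/VV0/S8 chilled/JJ/O4 wine/NN1/F2 with/II/Z5 fish/NN1/F1"

# Metapattern: restricted verb + adjective + noun tagged as a drink (F2 here).
pattern = metapattern_regex([
    {"pos": "VV0", "words": ["serve", "enjoy"]},
    {"pos": "JJ"},
    {"pos": "NN1", "sem": "F2"},
])
match = pattern.search(tagged)
print(match.group(0) if match else None)
```

Dropping the `words` entry from the first slot gives the purely tag-based version of the same pattern, which is the trade-off the abstract describes between lexical restrictors and more general metadata.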

    Building a Spanish lexicon for corpus analysis

    This paper describes the creation of a Spanish lexicon with semantic annotation, intended to support the analysis of more extensive corpora in Spanish. The semantic resources most employed nowadays are WordNet, FrameNet, PDEV and USAS, but they have been used mainly for English language research. The creation of a large Spanish lexicon will allow a greater number of corpus studies in Spanish to be undertaken. The paper describes the steps followed to construct the lexicon, the difficulties encountered in its creation, and the solutions used to overcome them. Finally, the lexicon will allow specific research tasks to be carried out, such as metaphor analysis, Critical Discourse Analysis (CDA) studies and even Natural Language Processing (NLP) studies.

    Recursos de Humanidades Digitales para el estudio del teatro (traducido): bases de datos, corpus y herramientas desarrollados en TRALIMA/ITZULIK

    [EN] Theatre research framed within Digital Humanities depends to a great extent on the availability of objects of study in digital format, as well as on software that allows different analysis techniques to be applied. Within the TRALIMA/ITZULIK research group, databases, corpora and corpus processing tools have been developed. The creation of these materials and tools seeks to provide access to relevant contextual and textual data for research on translated theatre in Spain, an area that has usually received less attention than original theatre, despite its wide presence on stages. However, far from being restricted to the study of foreign authors and plays, the databases and corpora include references to several agents of the Spanish theatre system (translators, adaptors, directors) that may be of use in different disciplines. In addition, the tools are adjusted to the specific structure of play-texts to build and analyse monolingual, comparable and parallel corpora. Thus, the digital materials and tools may shed light on a variety of research questions. The purpose of this article is to present together the databases, corpora and software developed by members of TRALIMA/ITZULIK for the study of (translated) theatre. By doing so, we aim to contribute to the spread of Digital Humanities resources available to researchers interested in the study of theatre from the variety of perspectives that characterises this multidisciplinary field.
    Red de Excelencia CorpusNet (MINECO, FF12016-81934-RED/AEI); TRALIMA/ITZULIK Consolidated Research Group (Gobierno Vasco / Basque Government, IT1209/19); University of the Basque Country, UPV/EHU, research group GIU19/067 and pre-doctoral research grant PIF17/46

    Rhetorical structure and persuasive language in the subgenre of online advertisements

    p. 38-47
    This paper aims to reveal the rhetorical structure and the linguistic features of persuasive language in online advertisements of electronic products. Nowadays, the bulk of e-commerce is carried out in English, and it is often the case that non-native speakers are required to write different text types for various professional purposes, including promotional texts. This need has prompted the present study and the results have been used to build software to help native speakers of Spanish when writing promotional texts in English. The analysis reveals that these texts typically have two main rhetorical moves: one for identifying the product and another one for describing it. The latter move is further divided into two steps: one including objective features (size, weight, etc.) and the other focusing on persuading the potential customer. This is mainly achieved with the use of a relatively informal style (imperatives, contractions, clipping, subject/auxiliary omissions, etc.) and lexico-grammatical elements conveying positive evaluation (multiple modification, multal quantifying expressions, etc.). The findings show that online advertisements of electronic products may be regarded as a specific subgenre with particular macro- and microlinguistic characteristics, which have been identified in this paper for technical writing assistance.

    Using an Ontology-based Approach to Build Open Assisting Tools in Foreign Language Writing

    [EN] In today’s globalised world where there is a growing need for international communication, non-native speakers (NNS) from a wide range of professional fields are increasingly called upon to write specialised texts in English. More often than not, however, the linguistic competence required to do so is well beyond that of the majority of NNS. While software applications can serve to assist NNS in their English writing tasks, most of the applications available are designed for users of English for general purposes as opposed to English for professional purposes. Therefore, these applications lack the specific vocabulary, style guidelines and common structures required in more specialised documents. Necessary modifications to meet the needs of English for professional purposes tend to be viewed as representing an overly complex and expensive task. To overcome these challenges, we present a software tool called O-WEAA (Ontology-Writing English Assistant Architecture), which makes use of an ontology that represents the knowledge which, according to our formalisation, is required to write most types of specialised professional documents in the English language. Our formalisation of the required knowledge is based on an exhaustive linguistic analysis of several written genres. The proposed software is composed of two parts: i) a web application named Acquisition Interface Module, which allows experts to populate the ontology with new data, and ii) a user-friendly, general web interface named Writing Assistant Interface Module, which guides the user throughout the writing process of the English document in the specific domain described in the ontology.