Creación de un Framework para el tratamiento de corpus lingüísticos = Development of a Framework for corpus linguistic analysis
436 p. Despite the undeniable recent advances in software for processing linguistic corpora, whether through the processing of ever larger corpora or the inclusion of more complex statistics, usability and the non-technical profile of the end user are still not taken into account. The situation is more evident when working with languages other than English and with combinations of languages, since their typology and specificity affect the software requirements, and for this reason the available resources are fewer and of poorer quality.
The state of the art reveals that creating parallel or comparable bi-/multilingual corpora, as well as incorporating linguistic annotation into existing frameworks for corpus processing, requires the user to have some programming knowledge, or at least to know how to run programs with limited usability and/or their own scripts, in order to adapt the corpus to the requirements set by the framework in use. If these conditions are not met, it is essential to rely on technical specialists with programming and Natural Language Processing (NLP) skills.
The aim of this doctoral thesis is therefore to develop a piece of software, called ACTRES Corpus Manager, that allows linguists to build their own corpora (monolingual, parallel bi-/multilingual or comparable) with different layers of annotation (grammatical, semantic or rhetorical) and to obtain linguistic and statistical data without technical assistance at any point in the process, regardless of the user's technical skills.
The strategy chosen for developing ACTRES Corpus Manager is to create a web-accessible framework made up of interconnected components. Each activity required to create a corpus is assigned to one of these components, which makes them easy to modify and reuse. ACTRES Corpus Manager combines third-party software resources whose efficiency and validity have been demonstrated (e.g. The IMS Corpus Workbench, Treetagger, hunalign) with in-house software solutions for those processes that the state of the art has revealed to be more immature and/or complex to integrate (rhetorical tagger, semantic tagger, etc.).
Finally, the query interface of ACTRES Corpus Manager is inspired by P-ACTRES 2.0 and supports assisted complex queries based on regular expressions, as well as the extraction of the usual statistics, without requiring the user to have specific knowledge of the syntax of the query language used.
Corpus tools for parallel corpora of theatre plays: an introduction to TAligner and ACM-theatre
Software tools are of vital importance in corpus-based research, but they can also restrict the types of corpora supported and the range of analyses that can be performed. For example, corpus analysis tools, as general-purpose software, do not include specific features for processing corpora of theatre plays. The situation is even worse for parallel corpora of theatrical texts, since there is currently a lack of software that allows for both the alignment and the analysis of such corpora. In this contribution, we first outline the peculiarities of theatre texts and suggest three software features to address them: annotation of the structural units of plays, alignment at the utterance level, and concordances and statistics using the annotated units. Second, we present the specific functionalities of TAligner and ACM for building and analysing parallel corpora of play texts, showing how new avenues of research are opening up with the development of these tools.
Part of this study was funded by the Spanish Agency for Research, Development and Innovation (Ministry of Economy and Competitiveness) [FFI2016-75672-R]. At the time of writing, the co-author Olaia Andaluz-Pinedo is a doctoral student funded by the University of the Basque Country UPV/EHU, Spain.
OpenTagger: A flexible and user-friendly linguistic tagger
Linguistic annotation adds valuable information to a corpus. Annotated corpora are highly useful for linguists since they increase the range of linguistic phenomena that may be registered, categorised and retrieved. In addition, they are also significant for machines, as Natural Language Processing applications involve working with well-annotated data (e.g. Imran, Mitra and Castillo 2016) and some machine learning classifiers employ annotated data to test or train new language annotation tools, among other uses. In this regard, Pustejovsky and Stubbs (2012) report on stages for building annotated corpora to train machine learning algorithms.
This paper describes OpenTagger, a new linguistic tagger that allows users to add any type of information to the paragraphs, sentences, or words that make up a text. OpenTagger is characterised by its high usability and flexibility. It is a web application that allows users to annotate texts manually, either with their own predefined tagset or by creating a new one. It thus offers an answer to any need for a tailor-made annotation system. The tagset may include nested categories, and multiple layers of annotation are possible. The annotation process is simple and provides two options: i) selecting text and then tagging it; ii) selecting a tag and then annotating as much text as needed. OpenTagger also includes a search box to query the text and retrieve relevant sections for tagging. In sum, the open character of this tool and its user-friendliness allow the benefits of annotation to be extended to a wider variety of research questions.
OpenTagger differs from other well-known taggers such as Nooj (Silberztein, 2005) in its simplicity and web access, as it is not specialised for grammar construction or other complex processes. Potential users range from novice linguistics researchers to experts. Finally, further integration with the corpus analysis software ACTRES Corpus Manager (Sanjurjo-González, 2017) is planned for the future. OpenTagger will make the process of building and querying custom annotated corpora with ACM more straightforward.
ACTRES, TRALIMA/ITZULIK, GIU19/067, Gobierno Vasco IT1209/1
Statistics and visualisations of theatre corpora using corpus analysis software
38º Congreso Internacional AESLA (2021). Corpus linguistics is a powerful quantitative methodology that relies on frequency data and statistical procedures (Han 2019). According to Gries (2013), scientific quantitative research has three main goals: description, explanation and prediction of data. Within this frame, statistics makes sense of quantitative data by means of analysis and useful visualisations (Brezina 2018).
There are many techniques that have been designed for monolingual corpora such as statistical identification of collocations or keywords. While most of these can also be applied to different types of corpora, such as parallel and comparable ones, it seems that a dedicated set of statistics related to structural singularities of text types such as theatre plays is missing.
In this study, we propose a range of adaptations of statistics and visualisations that apply and interrelate theatre-specific filters. The division of dramatic texts into structural units is a specific feature of this genre (Andaluz-Pinedo and Sanjurjo-González in press). Utterances, speakers, stage directions and dialogues are an intrinsic part of these texts and must be taken into account when developing useful and descriptive statistical procedures. One example of such an adaptation is quantitative analysis based on the units of characters, utterances, stage directions and dialogues rather than on the text data as a whole.
As Anthony (2013) states, “the functionality offered by software tools largely dictates what corpus linguistics research methods are available to a researcher”. To improve this functionality for the analysis of theatre corpora, further work includes integrating this approach into existing corpus analysis software that processes theatre play-texts, such as ACTRES Corpus Manager (Sanjurjo-González, 2017).
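To make the idea of unit-based statistics concrete, the following is a minimal sketch that counts tokens per structural unit type and per speaker instead of over the whole text. The tuple layout `(unit_type, speaker, text)`, the invented mini play and the function names are all assumptions made for illustration; they do not describe how ACTRES Corpus Manager stores or processes play-texts.

```python
import re
from collections import Counter

# Invented mini play. Each unit is (unit_type, speaker, text);
# this layout is an assumption made for the example only.
PLAY = [
    ("utterance", "ANA", "Where did you put the letter?"),
    ("stage_direction", None, "She opens the drawer."),
    ("utterance", "LUIS", "On the table."),
    ("utterance", "ANA", "It is gone."),
]


def count_tokens(text):
    """Count word tokens with a simple regex tokeniser."""
    return len(re.findall(r"\w+", text))


def tokens_per_unit_type(units):
    """Sum token counts separately for each structural unit type."""
    totals = Counter()
    for unit_type, _speaker, text in units:
        totals[unit_type] += count_tokens(text)
    return totals


def tokens_per_speaker(units):
    """Sum token counts per speaker, ignoring non-utterance units."""
    totals = Counter()
    for unit_type, speaker, text in units:
        if unit_type == "utterance":
            totals[speaker] += count_tokens(text)
    return totals


if __name__ == "__main__":
    print(tokens_per_unit_type(PLAY))
    print(tokens_per_speaker(PLAY))
```

Keeping the unit type and speaker alongside each text span is what allows the same frequency measure to be filtered or compared across characters, utterances and stage directions.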
Pragmatic annotation of a domain-restricted English-Spanish comparable corpus
This paper explores the multi-layer annotation of a written domain-restricted English-Spanish comparable corpus (CLANES – Controlled LANguage English Spanish), focusing on pragmatic annotation. The annotation scheme draws on part-of-speech tagging and a semantic annotation scheme, i.e. the UCREL Semantic Analysis System, with some added categories to fit the food-and-drink domain represented in CLANES. These are used to build significant (pragmatic) metapatterns. Seven different pragmatic functions have been identified in our corpus. Computer scripts translate this linguistic information into regular expressions to be used in unsupervised annotation. Partial results indicate that applying lexical restrictors boosts the success rate considerably. However, metadata is preferred because of increased replicability and generality. Replicability issues and limitations encountered during testing are also addressed.
This research has been funded by grant FFI2016-75672-R awarded by the Spanish Ministry of Science and Innovation and ERDF (European Regional Development Fund).
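As a rough illustration of how a metapattern over part-of-speech and semantic tags might be turned into a regular expression, here is a minimal sketch. The `word/POS/SEM` token encoding, the function name `metapattern_to_regex`, the tag values and the example sentence are all assumptions made for this example; the scripts and patterns used for CLANES are not described in the abstract and may differ substantially.

```python
import re

# Assumed encoding: each token is written as word/POS/SEM,
# tokens are separated by whitespace. Tags are illustrative only.
TAGGED = "serve/VB/F1 the/DT/Z5 soup/NN/F1 hot/JJ/O4"

ANY = r"[^\s/]+"  # matches any single word form or tag


def metapattern_to_regex(pattern, lexical=None):
    """Compile a metapattern into a regular expression.

    pattern: list of (pos, sem) pairs; None means "any tag".
    lexical: optional dict mapping a position in the pattern to a list
             of permitted word forms (a lexical restrictor).
    """
    lexical = lexical or {}
    parts = []
    for i, (pos, sem) in enumerate(pattern):
        if i in lexical:
            word = "(?:" + "|".join(re.escape(w) for w in lexical[i]) + ")"
        else:
            word = ANY
        pos_re = re.escape(pos) if pos else ANY
        sem_re = re.escape(sem) if sem else ANY
        parts.append(f"{word}/{pos_re}/{sem_re}")
    # Anchor on token boundaries so partial tokens never match.
    return re.compile(r"(?<!\S)" + r"\s+".join(parts) + r"(?!\S)")


# Verb + determiner + noun carrying the (assumed) semantic tag F1.
PATTERN = [("VB", None), ("DT", None), ("NN", "F1")]

if __name__ == "__main__":
    open_match = metapattern_to_regex(PATTERN).search(TAGGED)
    restricted = metapattern_to_regex(PATTERN, {0: ["chill"]}).search(TAGGED)
    print(open_match.group(0) if open_match else None)
    print(restricted)
```

The optional `lexical` argument shows the trade-off mentioned above: restricting a slot to specific word forms narrows the matches, while the tag-only version stays more general.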
Building a Spanish lexicon for corpus analysis
This paper describes the creation of a Spanish lexicon with semantic annotation, intended to support the analysis of more extensive corpora in Spanish. The most widely used semantic resources today are WordNet, FrameNet, PDEV and USAS, but they have been used mainly for research on English. The creation of a large Spanish lexicon will allow a greater number of corpus studies in Spanish to be undertaken. The paper describes the steps followed in constructing the lexicon, the difficulties encountered, and the solutions used to overcome them. Finally, the lexicon will allow specific research tasks to be carried out, such as metaphor analysis, Critical Discourse Analysis (CDA) studies and even Natural Language Processing (NLP) studies.
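Recursos de Humanidades Digitales para el estudio del teatro (traducido): bases de datos, corpus y herramientas desarrollados en TRALIMA/ITZULIK = Digital Humanities resources for the study of (translated) theatre: databases, corpora and tools developed in TRALIMA/ITZULIK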
Theatre research framed within Digital Humanities depends to a great extent on the availability of objects of study in digital format, as well as on software that allows different analysis techniques to be applied. Within the TRALIMA/ITZULIK research group, databases, corpora and corpus processing tools have been developed. The creation of these materials and tools seeks to provide access to relevant contextual and textual data for research on translated theatre in Spain, an area that has usually received less attention than original theatre, despite its wide presence on stages. However, far from being restricted to the study of foreign authors and plays, the databases and corpora include references to several agents of the Spanish theatre system (translators, adaptors, directors) that may be of use in different disciplines. In addition, the tools are adjusted to the specific structure of play-texts to build and analyse monolingual, comparable and parallel corpora. Thus, the digital materials and tools may shed light on a variety of research questions. The purpose of this article is to present the databases, corpora and software developed by members of TRALIMA/ITZULIK for the study of (translated) theatre. By doing so, we aim to contribute to the spread of Digital Humanities resources available to researchers interested in the study of theatre from the variety of perspectives that characterises this multidisciplinary field.
CorpusNet Network of Excellence (MINECO, FFI2016-81934-RED/AEI); TRALIMA/ITZULIK Consolidated Research Group (Gobierno Vasco/Basque Government, IT1209/19); University of the Basque Country, UPV/EHU, research group GIU19/067 and pre-doctoral research grant PIF17/46.
Rhetorical structure and persuasive language in the subgenre of online advertisements
p. 38-47. This paper aims to reveal the rhetorical structure and the linguistic features of persuasive language in online advertisements of electronic products. Nowadays, the bulk of e-commerce is carried out in English, and it is often the case that non-native speakers are required to write different text types for various professional purposes, including promotional texts. This need has prompted the present study and the results have been used to build software to help native speakers of Spanish when writing promotional texts in English. The analysis reveals that these texts typically have two main rhetorical moves: one for identifying the product and another one for describing it. The latter move is further divided into two steps: one including objective features (size, weight, etc.) and the other focusing on persuading the potential customer. This is mainly achieved with the use of a relatively informal style (imperatives, contractions, clipping, subject/auxiliary omissions, etc.) and lexico-grammatical elements conveying positive evaluation (multiple modification, multal quantifying expressions, etc.). The findings show that online advertisements of electronic products may be regarded as a specific subgenre with particular macro- and microlinguistic characteristics, which have been identified in this paper for technical writing assistance.