257 research outputs found
Current trends
Deep parsing is the fundamental process aiming at the representation of the syntactic
structure of phrases and sentences. In the traditional methodology this process is
based on lexicons and grammars representing roughly properties of words and interactions
of words and structures in sentences. Several linguistic frameworks, such as Head-driven
Phrase Structure Grammar (HPSG), Lexical Functional Grammar (LFG), Tree Adjoining
Grammar (TAG), Combinatory Categorial Grammar (CCG), etc., offer different
structures and combining operations for building grammar rules. These already contain
mechanisms for expressing properties of Multiword Expressions (MWE), which, however,
need improvement in how they account for idiosyncrasies of MWEs on the one
hand and their similarities to regular structures on the other hand. This collaborative
book constitutes a survey on various attempts at representing and parsing MWEs in the
context of linguistic theories and applications.
Representation and parsing of multiword expressions
This book consists of contributions related to the definition, representation and parsing of MWEs. These reflect current trends in the representation and processing of MWEs. They cover various categories of MWEs (such as verbal, adverbial and nominal MWEs), various linguistic frameworks (e.g. tree-based and unification-based grammars), various languages (including English, French, Modern Greek, Hebrew and Norwegian), and various applications (namely MWE detection, parsing and automatic translation), using both symbolic and statistical approaches.
PARSEME Survey on MWE Resources
This paper summarizes the first results of an ongoing survey on multiword resources carried out within the IC1207 COST Action PARSEME (PARSing and Multi-word Expressions). Despite the availability of language resource catalogues and the inventory of multiword data-sets available at the SIGLEX-MWE website, multiword resources are scattered and prove difficult to find. In many cases, language resources such as corpora, treebanks or lexical databases include multiwords as part of their data or take them into consideration in their annotations. However, these resources need to be centralized so that other researchers can subsequently use them. The final aim of this survey is thus to create a portal where researchers may find multiword resources or multiword-aware language resources for their research. We report on how the survey was designed and analyze the data gathered so far. We also discuss the problems we have detected upon examination of the data and possible ways of enhancing the survey.
Doctoral thesis
The electronic version does not contain the appendices.
The goal of the dissertation ‘Construction and Representation of National Identity in the Speeches of the Presidents of the Baltic States: Corpus-Assisted Critical Discourse Analysis’ is to investigate the discursive construction of national identities in the presidential speeches of the Baltic States, namely which macro- and micro-structural elements of language and discourse are used in presidential rhetoric, as well as their functions and potential impact on the target audience. By combining qualitative and quantitative methods (a corpus approach and the Discourse-Historical Approach to Critical Discourse Analysis), the study not only analyses the content of the speeches, including their thematic areas, discursive strategies, and the linguistic means of realising these strategies, but also provides corpus-based statistical data that offer a detailed and objective insight into the individual linguistic profiles of the Presidents and their lexical choices, which point to the linguistic features of the multiple identities constructed in the speeches. Additionally, theoretical sources on the socio-political and historical context influencing the content of the selected speeches have been analysed, and interviews with the Presidents and their advisors, as well as opinion surveys of Latvian residents, have been conducted to investigate the explicit and implicit goals of the speeches and their potential effect.
Key words: presidential speeches, Baltic States, national identity, Critical Discourse Studies, Corpus Linguistics
Formulaic language
The notion of formulaicity has received increasing attention in disciplines and areas as diverse as linguistics, literary studies, art theory and art history. In recent years, linguistic studies of formulaicity have been flourishing and the very notion of formulaicity has been approached from various methodological and theoretical perspectives and with various purposes in mind. The linguistic approach to formulaicity is still in a state of rapid development and the objective of the current volume is to present the current explorations in the field. Papers collected in the volume make numerous suggestions for further development of the field and they are arranged into three complementary parts. The first part, with three chapters, presents new theoretical and methodological insights as well as their practical application in the development of custom-designed software tools for identification and exploration of formulaic language in texts. Two papers in the second part explore formulaic language in the context of language learning. Finally, the third part, with three chapters, showcases descriptive research on formulaic language conducted primarily from the perspectives of corpus linguistics and translation studies. The volume will be of interest to anyone involved in the study of formulaic language either from a theoretical or a practical perspective
Theories and methods
The notion of formulaicity has received increasing attention in disciplines and areas as diverse as linguistics, literary studies, art theory and art history. In recent years, linguistic studies of formulaicity have been flourishing and the very notion of formulaicity has been approached from various methodological and theoretical perspectives and with various purposes in mind. The linguistic approach to formulaicity is still in a state of rapid development and the objective of the current volume is to present the current explorations in the field. Papers collected in the volume make numerous suggestions for further development of the field and they are arranged into three complementary parts. The first part, with three chapters, presents new theoretical and methodological insights as well as their practical application in the development of custom-designed software tools for identification and exploration of formulaic language in texts. Two papers in the second part explore formulaic language in the context of language learning. Finally, the third part, with three chapters, showcases descriptive research on formulaic language conducted primarily from the perspectives of corpus linguistics and translation studies. The volume will be of interest to anyone involved in the study of formulaic language either from a theoretical or a practical perspective
Automatic text summarization with Maximal Frequent Sequences
In the last two decades, an exponential increase in the available electronic information
has created a pressing need to understand large volumes of information quickly.
This raises the importance of developing automatic methods for
detecting the most relevant content of a document in order to produce a shorter
text. Automatic Text Summarization (ATS) is an active research area dedicated to
generating abstractive and extractive summaries not only for a single document but
also for a collection of documents. A further need is to find methods for
ATS that are independent of language and domain.
In this book we consider extractive text summarization for the single-document
task. We have identified that a typical extractive summarization method consists
of four steps. The first step is term selection, where one decides what units
count as individual terms. The process of estimating the usefulness of the
individual terms is called the term weighting step. The next step is sentence
weighting, where each sentence receives a numerical measure according to
the usefulness of its terms. Finally, the process of selecting the most relevant sentences
is called sentence selection. Different extractive summarization methods can
be characterized by how they perform these steps.
In this book, for the term selection step, we describe how to detect multiword
descriptions using Maximal Frequent Sequences (MFSs), which bear important
meaning, while non-maximal Frequent Sequences (FSs), those that are
parts of another FS, should not be considered. An additional motivation was
a cost vs. benefit consideration: there are too many non-maximal FSs, while their
probability of bearing important meaning is lower. In any case, MFSs represent all FSs
in a compact way: all FSs can be obtained from all MFSs by bursting each MFS into
the set of all its subsequences.
New methods based on graph algorithms, genetic algorithms, and clustering
algorithms, which facilitate the text summarization task, are presented. We
have tested different combinations of term selection, term weighting, sentence
weighting and sentence selection options for language- and domain-independent
extractive single-document text summarization on a collection of news reports. We
analyzed several options based on MFSs, considering them with graph, genetic,
and clustering algorithms. We obtained results superior to the existing
state-of-the-art methods.
This book is addressed to students and scientists in the area of Computational
Linguistics, and also to anyone who wants to learn about recent developments in the
automatic generation of text summaries.
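The four steps and the MFS idea described in this abstract can be illustrated with a minimal Python sketch. It is not the book's implementation: it assumes that sequences are contiguous word n-grams, that a sequence is frequent when it occurs at least `min_freq` times, and that a term's weight is simply its frequency. All function names and the `min_freq` parameter are hypothetical, and the graph, genetic and clustering methods mentioned in the abstract are not modelled.

```python
from collections import Counter


def frequent_sequences(sentences, min_freq=2):
    """Count every contiguous word sequence; keep those seen >= min_freq times."""
    counts = Counter()
    for words in sentences:
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                counts[tuple(words[i:j])] += 1
    return {seq: c for seq, c in counts.items() if c >= min_freq}


def is_contiguous_part(short, long):
    """True if the tuple `short` occurs as a contiguous run inside the tuple `long`."""
    k = len(short)
    return any(long[i:i + k] == short for i in range(len(long) - k + 1))


def maximal_frequent_sequences(fss):
    """Keep only the frequent sequences that are not part of another frequent sequence."""
    return {s for s in fss
            if not any(s != t and is_contiguous_part(s, t) for t in fss)}


def burst(mfss):
    """Expand each MFS into the set of all its contiguous subsequences."""
    return {seq[i:j]
            for seq in mfss
            for i in range(len(seq))
            for j in range(i + 1, len(seq) + 1)}


def summarize(sentences, k=1, min_freq=2):
    """Extractive summary: the k highest-scoring sentences, in document order."""
    fss = frequent_sequences(sentences, min_freq)
    mfss = maximal_frequent_sequences(fss)           # step 1: term selection
    weights = {s: fss[s] for s in mfss}              # step 2: term weighting (assumed: frequency)
    scores = [sum(w for s, w in weights.items()      # step 3: sentence weighting
                  if is_contiguous_part(s, tuple(words)))
              for words in sentences]
    ranked = sorted(range(len(sentences)), key=lambda i: -scores[i])
    return [sentences[i] for i in sorted(ranked[:k])]  # step 4: sentence selection


if __name__ == "__main__":
    doc = [
        "the maximal frequent sequences carry meaning".split(),
        "we like maximal frequent sequences".split(),
        "weather is nice".split(),
    ]
    fss = frequent_sequences(doc)
    mfss = maximal_frequent_sequences(fss)
    print(mfss)
    print(burst(mfss) == set(fss))
    print(summarize(doc, k=2))
```

Under these assumptions, bursting the MFSs recovers exactly the set of frequent sequences, because every contiguous part of a frequent sequence occurs at least as often as the sequence itself.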
Proximity and impact of university-industry collaborations. A topic detection analysis of impact reports
The probability of initiating university-industry collaborations (UICs), as well as their intensity and quality, is influenced by the proximity between the collaboration partners. However, little is known about the relationship between collaborators' proximity and the impact of UICs. Building on an original database of 415 UICs in the United Kingdom, we analyse the association between collaborators' proximity and the extent to which UICs generate economic, social and knowledge impact. We find that geographical and institutional proximity are substitutes in relation to economic impact, cognitive and institutional proximity are substitutes in relation to knowledge impact, and social impact is associated with cognitive and institutional distance.