TuLiPA : a syntax-semantics parsing environment for mildly context-sensitive formalisms
In this paper we present a parsing architecture that allows processing of different mildly context-sensitive formalisms, in particular Tree-Adjoining Grammar (TAG), Multi-Component Tree-Adjoining Grammar with Tree Tuples (TT-MCTAG) and simple Range Concatenation Grammar (RCG). Furthermore, for tree-based grammars, the parser computes not only syntactic analyses but also the corresponding semantic representations.
Fluent APIs in Functional Languages (full version)
Fluent API is an object-oriented pattern for smart and elegant embedded DSLs. As fluent API designs typically rely on function overloading, they are hard to realize in functional programming languages. We show how to write functional fluent APIs using parametric polymorphism and unification instead of overloading. Our designs support all regular and deterministic context-free DSLs and beyond.
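The overloading-free idea can be loosely illustrated even in an untyped setting. The sketch below is our own toy, not the paper's construction: each method returns an object of a different class, so only the legal continuations of a tiny regular DSL, begin (set v)* end, are available at each point of the chain.

```python
# Toy fluent API for the DSL  begin (set v)* end .
# Each state of the DSL is its own class; illegal chains such as
# begin().end().set(3) fail at attribute lookup, with no overloading.

class Done:
    """Terminal state: the chain is finished, only the result remains."""
    def __init__(self, values):
        self.values = values

class Opened:
    """State after begin(): may accumulate values or close the chain."""
    def __init__(self, values):
        self._values = values

    def set(self, v):
        # stay in the Opened state, accumulating one more value
        return Opened(self._values + [v])

    def end(self):
        # only Opened exposes end(); Done exposes nothing further
        return Done(self._values)

def begin():
    return Opened([])

result = begin().set(1).set(2).end()
# result.values == [1, 2]
```

In a typed functional language the same shape is carried by type parameters rather than classes, so ill-formed chains are rejected at compile time instead of at attribute lookup.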
Interpretación tabular de autómatas para lenguajes de adjunción de árboles
[Abstract] Tree adjoining grammars are an extension of context-free grammars that use trees instead of
productions as the elementary structures and that are considered adequate to
describe most of the syntactic phenomena occurring in natural languages. These grammars generate
the class of tree adjoining languages, which is equivalent to the class of languages generated by
linear indexed grammars and other mildly context-sensitive formalisms.
In the first part of this dissertation, we introduce the problem of parsing tree adjoining
grammars and linear indexed grammars, creating, for both formalisms, a continuum from simple
pure bottom-up algorithms to complex predictive algorithms and showing what transformations
must be applied to each one in order to obtain the next one in the continuum.
In the second part, we define several models of automata that accept the class of tree adjoining
languages, proposing techniques for their efficient execution. The use of automata for
parsing is interesting because they allow us to separate the problem of the definition of parsing
algorithms from the problem of their execution. We have considered the following types of
automata:
• Top-down and bottom-up embedded push-down automata, two extensions of push-down
automata working on nested stacks. A new definition is provided in which the finite-state
control has been eliminated and several kinds of normalized transition have been defined,
preserving the equivalence with tree adjoining languages.
• Logical push-down automata restricted to the case of tree adjoining languages. Depending
on the set of allowed transitions, we obtain three different types of automata.
• Linear indexed automata, left-oriented and right-oriented to describe parsing strategies
in which adjunctions are recognized top-down and bottom-up, respectively, and strongly-driven
to define parsing strategies recognizing adjunctions top-down and/or bottom-up.
• 2-stack automata, an extension of push-down automata working on a pair of stacks, a
master stack driving the parsing process and an auxiliary stack restricting the set of
transitions that can be applied at a given moment. Strongly-driven 2-stack automata can
be used to describe bottom-up, top-down or mixed parsing strategies for tree adjoining
languages with respect to the recognition of the adjunctions. Bottom-up 2-stack automata
are specifically designed for parsing strategies recognizing adjunctions bottom-up.
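The two-stack idea can be illustrated with a minimal hand-written recognizer (a toy of our own, not one of the formal models defined in the dissertation) for the tree adjoining language a^n b^n c^n d^n with n >= 1: a master stack drives the run, while the counts are parked on an auxiliary stack between phases.

```python
def accepts(word: str) -> bool:
    """Toy deterministic two-stack recognizer for a^n b^n c^n d^n, n >= 1."""
    master, aux = [], []
    phase, order = "a", "abcd"
    for ch in word:
        if ch not in order:
            return False
        if order.index(ch) < order.index(phase):
            return False      # letters must appear in the order a, b, c, d
        phase = ch
        if ch == "a":
            master.append("A")              # count the a's on the master stack
        elif ch == "b":
            if not master:
                return False
            master.pop()
            aux.append("B")                 # park the count on the auxiliary stack
        elif ch == "c":
            if not aux:
                return False
            aux.pop()
            master.append("C")              # move it back while matching the c's
        else:  # ch == "d"
            if not master or master[-1] != "C":
                return False
            master.pop()                    # consume the count against the d's
    return phase == "d" and not master and not aux
```

Note how the auxiliary stack restricts which transitions apply at each moment, which is exactly the division of labour the 2-stack models formalize.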
Compilation schemata for these models of automata have been defined. A compilation
schema allows us to obtain the set of transitions corresponding to the implementation of a parsing
strategy for a given grammar.
All the presented automata can be executed in polynomial time with respect to the length
of the input string by applying tabulation techniques. A tabular technique makes it possible to
interpret an automaton by manipulating collapsed representations of configurations
(called items) instead of actual configurations. Items are stored in a table in order to be
reused, avoiding redundant computations.
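The tabulation idea can be sketched in the simpler context-free setting (an illustration of item-based parsing in general, not one of the dissertation's tabular techniques): items (A, i, j), meaning that nonterminal A derives the span from position i to j, are stored in a table so that each item is computed once and then reused.

```python
def cyk(word, grammar, start="S"):
    """Tabular (CYK) recognition. `grammar` maps a nonterminal to a list
    of right-hand sides, each either a terminal string or a pair of
    nonterminals (Chomsky normal form)."""
    n = len(word)
    # table[i][j] holds the items (A, i, j) as a set of nonterminals A
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, ch in enumerate(word):
        for lhs, rhss in grammar.items():
            if ch in rhss:                      # terminal rules A -> ch
                table[i][i + 1].add(lhs)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for lhs, rhss in grammar.items():
                    for rhs in rhss:
                        if (isinstance(rhs, tuple)
                                and rhs[0] in table[i][k]
                                and rhs[1] in table[k][j]):
                            table[i][j].add(lhs)  # item (lhs, i, j), stored once
    return start in table[0][n]

# Example grammar in CNF for { a^n b^n : n >= 1 }
AN_BN = {"S": [("A", "B"), ("A", "X")],
         "X": [("S", "B")],
         "A": ["a"],
         "B": ["b"]}
```

The automata-oriented techniques in the dissertation play the same game with items that collapse stack contents rather than derivation spans.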
Finally, we have studied the relations among the different classes of automata, the main
difference being the storage structure used: embedded stacks, index lists or coupled stacks.
According to the strategies that can be implemented, we can distinguish three kinds of automata:
bottom-up automata, including bottom-up embedded push-down automata, bottom-up
restricted logic push-down automata, right-oriented linear indexed automata and bottom-up
2-stack automata; top-down automata, including (top-down) embedded push-down automata,
top-down restricted logic push-down automata and left-oriented linear indexed automata; and
general automata, including strongly-driven linear indexed automata and strongly-driven 2-stack automata.
Design of a Controlled Language for Critical Infrastructures Protection
We describe a project for the construction of a controlled language for critical infrastructures protection (CIP). The project originates
from the need to coordinate and categorize communications on CIP at the European level. These communications can be physically
represented by official documents, reports on incidents, informal communications and plain e-mail. We explore the application of
traditional library science tools for the construction of controlled languages in order to achieve our goal. Our starting point is an
analogous work done during the sixties in the field of nuclear science, known as the Euratom Thesaurus. JRC.G.6 - Security technology assessment
Exploring formal models of linguistic data structuring. Enhanced solutions for knowledge management systems based on NLP applications
2010 - 2011. The principal aim of this research is to describe the extent to which formal models for linguistic data structuring are crucial in Natural Language Processing (NLP) applications. In this sense, we will pay particular attention to those Knowledge Management Systems (KMS) which are designed for the Internet, and also to the enhanced solutions they may require. In order to deal appropriately with these topics, we will describe how to achieve computational linguistics applications that help humans establish and maintain an advantageous relationship with technologies, especially those technologies which are based on, or produce, man-machine interactions in natural language.
We will explore the positive relationship which may exist between well-structured Linguistic Resources (LR) and KMS, in order to state that if the information architecture of a KMS is based on the formalization of linguistic data, then the system works better and is more consistent.
As for the topics we want to deal with, first of all it is indispensable to state that, in order to build efficient and effective Information Retrieval (IR) tools, understanding and formalizing natural language combinatory mechanisms is the first operation to carry out, not least because any piece of information produced by humans on the Internet is necessarily a linguistic act. Therefore, in this research work we will also discuss the NLP structuring of a linguistic formalization Hybrid Model, which we hope will prove to be a useful tool to support, improve and refine KMSs.
More specifically, in section 1 we will describe how to structure language resources implementable inside KMSs, to what extent they can improve the performance of these systems and how the problem of linguistic data structuring is dealt with by natural language formalization methods.
In section 2 we will proceed with a brief review of computational linguistics, paying particular attention to specific software packages such as Intex, Unitex, NooJ, and Cataloga, which are developed according to the Lexicon-Grammar (LG) method, a linguistic theory established during the 1960s by Maurice Gross.
In section 3 we will describe some specific works useful to monitor the state of the art in Linguistic Data Structuring Models, Enhanced Solutions for KMSs, and NLP Applications for KMSs.
In section 4 we will cope with problems related to natural language formalization methods, describing mainly Transformational-Generative Grammar (TGG) and LG, plus other methods based on statistical approaches and ontologies.
In section 5 we will propose a Hybrid Model usable in NLP applications in order to create effective enhanced solutions for KMSs. Specific features and elements of our hybrid model will be shown through some results of experimental research work. The case study we will present is a very complex yet little-explored NLP problem, i.e. the treatment of Multi-Word Units (MWUs).
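One basic ingredient of MWU treatment can be sketched as follows. This is a toy of our own with an assumed lexicon, not the Hybrid Model itself: longest-match lookup of multi-word units from a dictionary during tokenization, so that a unit such as "natural language processing" is handled as one lexical entry rather than three words.

```python
# Assumed toy lexicon of multi-word units (tuples of tokens).
MWU_LEXICON = {("knowledge", "management", "system"),
               ("natural", "language", "processing")}
MAX_LEN = max(len(u) for u in MWU_LEXICON)

def chunk(tokens):
    """Greedy longest-match: fuse token runs found in the MWU lexicon."""
    out, i = [], 0
    while i < len(tokens):
        # try the longest candidate span first, down to 2 tokens
        for span in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            cand = tuple(tokens[i:i + span])
            if cand in MWU_LEXICON:
                out.append("_".join(cand))   # emit one unit, not `span` words
                i += span
                break
        else:
            out.append(tokens[i])            # ordinary single token
            i += 1
    return out
```

For example, `chunk("a natural language processing tool".split())` keeps "a" and "tool" as plain tokens while fusing the MWU in the middle.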
In section 6 we will close our research by evaluating its results and presenting possible future work perspectives. [edited by author]
Formal Linguistic Models and Knowledge Processing. A Structuralist Approach to Rule-Based Ontology Learning and Population
2013 - 2014. The main aim of this research is to propose a structuralist approach to knowledge processing by means of ontology learning and population, starting from unstructured and structured texts. The method suggested includes distributional semantic approaches and NL formalization theories, in order to develop a framework that relies upon deep linguistic analysis... [edited by author]
Algorithms for XML stream processing : massive data, external memory and scalable performance
Many modern applications require processing of massive streams of XML data, which creates difficult technical challenges. Among these are the design and implementation of tools to optimize the processing of XPath queries and to provide an accurate cost estimation for these queries when processed on a massive stream of XML data. In this thesis, we propose a novel performance prediction model which a priori estimates the cost (in terms of space used and time spent) for any structural query belonging to Forward XPath. In doing so, we perform an experimental study to confirm the linear relationship between stream-processing and data-access resources. We then introduce a mathematical model (linear regression functions) to predict the cost of a given XPath query. Moreover, we introduce a new selectivity estimation technique. It consists of two elements. The first is the path tree synopsis: a concise, accurate, and convenient summary of the structure of an XML document. The second is the selectivity estimation algorithm: an efficient stream-querying algorithm that traverses the path tree synopsis to estimate the values of the cost parameters. Those parameters are then used by the mathematical model to determine the cost of a given XPath query. We compare the performance of our model with existing approaches. Furthermore, we present a use case for an online stream-querying system. The system uses our performance prediction model to estimate the cost of a given XPath query in terms of time and memory, and it provides an accurate answer to the query's sender. This use case illustrates the practical advantages of performance management with our techniques.
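The path-tree-synopsis idea can be sketched as follows (names and data shapes here are our assumptions, not the thesis's exact structures): summarize an XML document as occurrence counts per root-to-node tag path, then read the selectivity of a simple descendant-free path query directly off the counts, without touching the document again.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def path_tree_synopsis(xml_text):
    """Build a toy synopsis: occurrence count for each root-to-node tag path."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    def walk(node, prefix):
        path = prefix + "/" + node.tag
        counts[path] += 1           # one increment per node on this tag path
        for child in node:
            walk(child, path)
    walk(root, "")
    return counts

def estimate_matches(synopsis, query):
    """Selectivity of a simple child-axis path such as '/lib/book/title'."""
    return synopsis.get(query, 0)

doc = "<lib><book><title/><title/></book><book><title/></book></lib>"
syn = path_tree_synopsis(doc)
# estimate_matches(syn, "/lib/book/title") == 3
```

A real synopsis must of course survive streaming and bounded memory; this sketch only shows why a per-path summary suffices to estimate the cost parameters of such queries.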