8 research outputs found

    Towards flexible parsing of structured textual model representations

    Existing parsers for textual model representation formats such as XMI and HUTN are unforgiving and fail on even the smallest inconsistency between the structure and naming of metamodel elements and the contents of serialised models. In this paper, we demonstrate how a fuzzy parsing approach can transparently and automatically resolve a number of these inconsistencies, and how it can eventually turn XML into a human-readable and editable textual model representation format for particular classes of models.
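    The paper's own resolution strategies are not reproduced here; the minimal Python sketch below only illustrates one class of inconsistency a fuzzy parser could absorb: element names that differ from metamodel class names in case or separators. All names in the snippet are invented for illustration.

```python
# Hypothetical sketch, not the paper's algorithm: tolerate tag spellings
# that differ from metamodel class names in case or separators only.
import xml.etree.ElementTree as ET

METAMODEL_CLASSES = {"Book", "Chapter", "Paragraph"}  # assumed metamodel

def normalise(name):
    """Canonical form of a name: lowercase, separators removed."""
    return name.lower().replace("_", "").replace("-", "")

# Map canonical spellings back to the metamodel names they stand for.
LOOKUP = {normalise(c): c for c in METAMODEL_CLASSES}

def fuzzy_resolve(tag):
    """Return the metamodel class a tag plausibly refers to, or None."""
    return LOOKUP.get(normalise(tag))

doc = ET.fromstring("<book><CHAPTER><para-graph/></CHAPTER></book>")
for elem in doc.iter():
    print(elem.tag, "->", fuzzy_resolve(elem.tag))
# book -> Book, CHAPTER -> Chapter, para-graph -> Paragraph
```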

    Approximating the DTD of a set of XML documents

    The WWW contains a huge number of documents. Some of them share the same subject but are generated by different people or even different organizations. To guarantee the interchange of such documents, we can use XML, which allows sharing documents that do not have the same structure. However, this makes it difficult to understand the core of such heterogeneous documents (in general, no schema is available). In this paper, we offer a characterization and an algorithm to obtain the midpoint (in terms of a resemblance function) of a set of semi-structured, heterogeneous documents without optional elements. The trivial midpoint would be the set of elements common to all documents; however, with several heterogeneous documents this set may be empty. Thus, we consider that those elements present in a given proportion of the documents belong to the midpoint. Once we have such a midpoint, the algorithm is generalized to obtain repetitions and optional elements. An exact schema can thus always be found by generating optional elements. However, the exact schema of the whole set may result in overspecialization (lots of optional elements), which would make it useless.
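    A minimal sketch of the midpoint idea, assuming a plain threshold over element-name frequencies; the paper's resemblance function and its treatment of repetitions and optional elements are omitted, and all names below are illustrative.

```python
# Sketch: the "midpoint" as the element names occurring in at least a
# given fraction of the documents (threshold = 1.0 gives the trivial
# case of elements common to all documents).
import xml.etree.ElementTree as ET
from collections import Counter

def element_names(xml_text):
    """All distinct element names appearing in one document."""
    return {e.tag for e in ET.fromstring(xml_text).iter()}

def midpoint(documents, threshold=0.8):
    """Element names present in at least `threshold` of the documents."""
    counts = Counter()
    for doc in documents:
        counts.update(element_names(doc))
    needed = threshold * len(documents)
    return {name for name, n in counts.items() if n >= needed}

docs = [
    "<order><id/><date/><customer/></order>",
    "<order><id/><date/><comment/></order>",
    "<order><id/><date/></order>",
]
print(midpoint(docs, threshold=1.0))  # {'order', 'id', 'date'}
print(midpoint(docs, threshold=0.3))  # also admits 'customer', 'comment'
```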

    Bounded repairability for regular tree languages

    We study the problem of bounded repairability of a given restriction tree language R into a target tree language T. More precisely, we say that R is bounded repairable w.r.t. T if there exists a bound on the number of standard tree editing operations necessary to apply to any tree in R in order to obtain a tree in T. We consider a number of possible specifications for tree languages: bottom-up tree automata (on the curry encoding of unranked trees), which capture the class of XML Schemas, and DTDs. We also consider the special case where the restriction language R is universal, i.e., contains all trees over a given alphabet. We give an effective characterization of bounded repairability between pairs of tree languages represented by automata. This characterization introduces two tools, synopsis trees and a coverage relation between them, allowing one to reason about tree languages that undergo a bounded number of editing operations. We then employ this characterization to provide upper bounds on the complexity of deciding bounded repairability, and we show that these bounds are tight. In particular, when the input tree languages are specified with arbitrary bottom-up automata, the problem is coNEXPTIME-complete. The problem remains coNEXPTIME-complete even if we use deterministic non-recursive DTDs to specify the input languages. The complexity can be reduced if we assume that the alphabet (the set of node labels) is fixed: the problem becomes PSPACE-complete for non-recursive DTDs and coNP-complete for deterministic non-recursive DTDs. Finally, when the restriction tree language R is universal, we show that the bounded repairability problem becomes EXPTIME-complete if the target language is specified by an arbitrary bottom-up tree automaton, and becomes tractable (PTIME-complete, in fact) when a deterministic bottom-up automaton is used.
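    As a concrete reference point, the toy encoding below models the three standard tree editing operations whose count the repair bound limits: relabelling a node, deleting a node while promoting its children, and inserting a node. It is only an illustration of the operations, not the paper's synopsis-tree machinery.

```python
# Toy model of standard tree editing operations on unranked ordered
# trees. A tree is [label, [children]]; a path is a tuple of child
# indices from the root.
def node(label, *children):
    return [label, list(children)]

def child(tree, path):
    for i in path:
        tree = tree[1][i]
    return tree

def relabel(tree, path, new_label):
    child(tree, path)[0] = new_label

def delete(tree, path):
    # Remove the node, promoting its children into its parent's list.
    *up, i = path
    parent = child(tree, up)
    parent[1][i:i + 1] = parent[1][i][1]

def insert(tree, path, label, count=0):
    # Insert a fresh node before position i, adopting `count` siblings.
    *up, i = path
    parent = child(tree, up)
    adopted = parent[1][i:i + count]
    parent[1][i:i + count] = [node(label, *adopted)]

t = node("article", node("sec", node("p")), node("p"))
relabel(t, (1,), "sec")      # article(sec(p), sec)
delete(t, (0,))              # children promoted: article(p, sec)
insert(t, (0,), "abstract")  # article(abstract, p, sec)
print(t)
```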

    Impact of XML document structure on the matching process in semi-structured information retrieval

    This PhD thesis concerns structured information retrieval (SIR) and focuses on XML documents. SIR aims at returning to users document parts (instead of whole documents) that are relevant to their needs. Those needs are expressed through queries that can contain content conditions as well as structural constraints specifying the location of the needed information. In this work we study the contribution of document structure to the document-query matching process and propose several approaches to evaluate document-query structural similarity. Both query structural constraints and document structures can be represented as trees; based on this observation we propose two models which aim at matching these tree structures. As tree matching is historically linked with graph theory, our first proposition adapts a solution from that field. After an in-depth study of the existing graph theory algorithms with respect to SIR, we chose Tree Edit Distance (TED), which measures the degree of isomorphism (tree similarity) between two trees as the minimal sequence of node removal and substitution operations needed to turn one tree into the other. As the main drawback of TED algorithms is their time and space complexity, which impacts the overall matching runtime, we propose two ways to overcome these issues: first, a TED algorithm with lower space complexity; second, since the runtime depends on the cardinality (size) of the input trees, several tree summarization techniques. Moreover, since TED is normally used with fixed operation costs poorly adapted to our problem, and since its effectiveness strongly relies on these costs, we propose new costs based on distances in the graph formed by the grammar of the documents: the DTD. Our second proposition is based on language models, which are very effective IR models. Traditionally, they are used to assess content relevance through the probability that a document model (built upon document terms) generates the query. We instead take an approach based purely on structure and consider the document and query vocabularies as sets of transitions between document structure labels. To build these vocabularies, we extract and weight all structural relationships through a weighted relaxation process (relaxation being the loosening of constraints). Finally, since the relevance of a document part is judged first on the textual content it carries, we also propose a content evaluation process. Our method uses the document tree structure to propagate the relevance of the text, taking into account the environment of the traversed elements as well as the global context of the document. We evaluated our models on two tracks of the reference evaluation campaign of our domain, the Initiative for XML Retrieval (INEX), in which we participated in 2011. INEX tracks provide document collections, evaluation metrics, queries and relevance judgments that can be used to assess and compare SIR systems in a normalized setting. The tracks we use are:
    * The INEX 2005 SSCAS track, whose collection consists of IEEE scientific articles. We consider this collection text-oriented, as the structure expressed in its documents is similar to that of a book (paragraphs, sections).
    * The INEX 2010 Datacentric track, whose collection is extracted from the Internet Movie Database (IMDB). This collection is data-oriented, as document terms are very specific and little redundant while the structure carries semantic meaning.
    Our experiments show that the choice of matching strategy depends on the collection. For a text-oriented collection, the structure can be taken into account non-strictly, and several subtrees extracted from the document can be used simultaneously to assess structural similarity. Conversely, for a data-oriented collection, strict use of the structure is necessary: since the searched elements carry semantics, it is important to detect which part of the document is a priori relevant, and the structure to match must be as precise and minimal as possible. Finally, our structural similarity approaches proved effective and improved the relevance of the returned results compared with the state of the art, as long as the nature of the collection was taken into account when selecting the structural trees given as input to the matching process.
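    For readers unfamiliar with TED, the compact (and deliberately unoptimised) sketch below implements the classic rightmost-root forest recursion with unit costs. The thesis' lower-complexity algorithm, its summarization techniques and its DTD-derived operation costs are not reproduced here.

```python
# Tree edit distance on ordered trees via the standard forest recursion,
# memoised, with unit costs for delete, insert and relabel.
from functools import lru_cache

# A tree is (label, (child, ...)); a forest is a tuple of trees.

def size(forest):
    return sum(1 + size(children) for _, children in forest)

@lru_cache(maxsize=None)
def ted(f, g):
    if not f and not g:
        return 0
    if not f:
        return size(g)  # insert every node of g
    if not g:
        return size(f)  # delete every node of f
    (lv, cv), (lw, cw) = f[-1], g[-1]  # rightmost roots
    return min(
        ted(f[:-1] + cv, g) + 1,           # delete v, promote children
        ted(f, g[:-1] + cw) + 1,           # insert w
        ted(cv, cw) + ted(f[:-1], g[:-1])  # match v with w...
            + (0 if lv == lw else 1),      # ...relabelling if needed
    )

a = (("article", (("sec", (("p", ()),)), ("p", ()))),)  # article(sec(p), p)
b = (("article", (("sec", (("p", ()), ("p", ()))),)),)  # article(sec(p, p))
print(ted(a, b))  # 2: delete one <p>, insert one <p> inside <sec>
```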

    Representation and Processing of Composition, Variation and Approximation in Language Resources and Tools

    In my habilitation dissertation, meant to validate my capacity and maturity for directing research activities, I present a panorama of several topics in computational linguistics, linguistics and computer science. Over the past decade, I was notably concerned with the phenomena of compositionality and variability of linguistic objects. I illustrate the advantages of a compositional approach to language in the domain of emotion detection and I explain how some linguistic objects, most prominently multi-word expressions (MWEs), defy the compositionality principles. I demonstrate that the complex properties of MWEs, notably variability, are partially regular and partially idiosyncratic. This fact places MWEs on the frontier between different levels of linguistic processing, such as lexicon and syntax. I show the highly heterogeneous nature of MWEs by citing their two existing taxonomies. After an extensive state-of-the-art study of MWE description and processing, I summarize Multiflex, a formalism and a tool for high-quality lexical morphosyntactic description of MWUs. It uses a graph-based approach in which the inflection of a MWU is expressed as a function of the morphology of its components and of morphosyntactic transformation patterns. Due to unification, the inflection paradigms are represented compactly. Orthographic, inflectional and syntactic variants are treated within the same framework. The proposal is multilingual: it has been tested on six European languages of three different origins (Germanic, Romance and Slavic), and I believe that many others can also be successfully covered. Multiflex proves interoperable: it adapts to different morphological language models, token boundary definitions, and underlying modules for the morphology of single words. It has been applied to the creation and enrichment of linguistic resources, as well as to morphosyntactic analysis and generation, and it can be integrated into other NLP applications requiring the conflation of different surface realizations of the same concept.
    Another chapter of my activity concerns named entities, most of which are particular types of MWEs. Their rich semantic load turned them into a hot topic in the NLP community, which is documented in my state-of-the-art survey. I present the main assumptions, processes and results issued from large annotation tasks at two levels (for named entities and for coreference), parts of the construction of the National Corpus of Polish. I have also contributed to the development of both rule-based and probabilistic named entity recognition tools, and to an automated enrichment of Prolexbase, a large multilingual database of proper names, from open sources.
    With respect to multi-word expressions, named entities and coreference mentions, I pay special attention to nested structures. This problem sheds new light on the treatment of complex linguistic units in NLP. When these units start being modeled as trees (or, more generally, as acyclic graphs) rather than as flat sequences of tokens, long-distance dependencies, discontinuities, overlapping and other frequent linguistic properties become easier to represent. This calls for more complex processing methods which control larger contexts than what usually happens in sequential processing. Thus, both named entity recognition and coreference resolution come very close to parsing, and named entities or mentions with their nested structures are analogous to multi-word expressions with embedded complements.
    My parallel activity concerns finite-state methods for natural language and XML processing. My main contribution in this field, co-authored with two colleagues, is the first full-fledged method for tree-to-language correction, and more precisely for correcting XML documents with respect to a DTD. We have also produced interesting results in incremental finite-state algorithmics, particularly relevant to data evolution contexts such as dynamic vocabularies or user updates.
    Multilingualism is the leitmotif of my research. I have applied my methods to several natural languages, most importantly to Polish, Serbian, English and French. I have been among the initiators of a highly multilingual European scientific network dedicated to parsing and multi-word expressions, and I have used multilingual linguistic data in experimental studies. I believe that it is particularly worthwhile to design NLP solutions taking declension-rich (e.g. Slavic) languages into account, since this leads to more universal solutions, at least as far as nominal constructions (MWUs, NEs, mentions) are concerned. For instance, although Multiflex had been developed with Polish in mind, it could be applied as such to French, English, Serbian and Greek. Also, a French-Serbian collaboration led to substantial modifications in morphological modeling in Prolexbase in its early development stages, which allowed for its later application to Polish with very few adaptations of the existing model. Other researchers also stress the advantages of NLP studies on highly inflected languages, since their morphology encodes much more syntactic information than is the case e.g. in English.
    In this dissertation I also demonstrate my ability to play an active role in shaping the scientific landscape, on a local, national and international scale. I describe my: (i) various scientific collaborations and supervision activities, (ii) roles in over 10 regional, national and international projects, (iii) responsibilities in collective bodies such as program and organizing committees of conferences and workshops, PhD juries, and the National University Council (CNU), (iv) activity as an evaluator and a reviewer of European collaborative projects.
    The issues addressed in this dissertation open interesting scientific perspectives, in which a special impact is put on links among various domains and communities. These perspectives include: (i) integrating fine-grained language data into the linked open data, (ii) deep parsing of multi-word expressions, (iii) modeling multi-word expression identification in a treebank as a tree-to-language correction problem, and (iv) a taxonomy and an experimental benchmark for tree-to-language correction approaches.
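    The tree-to-language correction method itself is not reproduced here; the sketch below, using the third-party lxml package's standard DTD support, only illustrates the problem such a method starts from: a document that fails DTD validation and therefore calls for a (minimal) sequence of edits.

```python
# Validate documents against a DTD with lxml; the invalid one is the
# kind of input a tree-to-language correction algorithm would repair.
from io import StringIO
from lxml import etree  # requires the lxml package

dtd = etree.DTD(StringIO("""
<!ELEMENT book (title, chapter+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT chapter (#PCDATA)>
"""))

valid = etree.fromstring("<book><title/><chapter/></book>")
broken = etree.fromstring("<book><chapter/></book>")  # missing <title>

print(dtd.validate(valid))                    # True
print(dtd.validate(broken))                   # False
print(dtd.error_log.filter_from_errors()[0])  # why validation failed
```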

    An Integrated Environment For Automated Benchmarking And Validation Of XML-Based Applications

    Testing is the dominant software verification technique used in industry; it is a critical and costly process in software development, and as software complexity grows, the cost of testing rises rapidly. Faced with this problem, many researchers work on automated testing, attempting to find methods that execute the testing process automatically and cut its cost. Today's software systems are increasingly complex: some are composed of several different components, and some projects even require different systems to work together and support each other. XML was developed to facilitate data exchange and enhance interoperability among software systems, and with the development of XML technologies, XML-based systems are now used widely in many domains. In this thesis we present XPT (XML-based Partition Testing), a methodology for testing XML-based applications automatically by deriving XML instances from an XML Schema automatically and systematically. XPT is inspired by the category-partition method, a well-known approach to black-box test generation. We follow a similar idea of applying partitioning to an XML Schema in order to generate a suite of conforming instances; in addition, since the number of generated instances soon becomes unmanageable, we introduce a set of heuristics for reducing the suite while optimizing XML Schema coverage. The aim of our research is not only to devise a technique, but also to apply the XPT methodology to real applications. We have created a proof-of-concept tool, TAXI, which implements XPT. The tool has a graphical user interface that guides and helps testers, and it can be customized for specific applications to build the test environment and automate the whole testing process. The design of TAXI and case studies using it in different domains are presented in this thesis. The case studies cover three testing purposes: the first is functional correctness, where we apply the methodology to XSLT testing, using TAXI to build an automatic environment for testing XSLT transformations; the second is robustness testing, where we test a data transformation tool that maps and populates data from XML documents into an XML database; and the third is performance testing, where we show an XML benchmark that uses TAXI to benchmark XML-based applications.
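    The toy sketch below is in the spirit of category-partition instance generation: it enumerates conforming instances of a deliberately simplified schema by treating each optional element as a two-way choice. TAXI's real input is XML Schema and it applies coverage-driven reduction heuristics; the schema encoding here is invented for illustration.

```python
# Enumerate instances of a simplified schema: one instance per
# combination of optional elements (a crude partition of the input
# space). The SCHEMA encoding is invented for this example.
from itertools import product
import xml.etree.ElementTree as ET

# root element -> ordered children as (name, optional?) pairs
SCHEMA = ("order", [("id", False), ("discount", True), ("note", True)])

def instances(schema):
    root_name, children = schema
    optional = [name for name, opt in children if opt]
    for choice in product([False, True], repeat=len(optional)):
        keep = dict(zip(optional, choice))
        root = ET.Element(root_name)
        for name, opt in children:
            if not opt or keep[name]:
                ET.SubElement(root, name)
        yield ET.tostring(root, encoding="unicode")

for instance in instances(SCHEMA):
    print(instance)
# four conforming instances, one per combination of <discount> and <note>
```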

    Comparison and evolution of XML schemas

    XML has become the de facto format for data exchange. We aim at establishing a multi-system environment in which local original systems work in harmony with a global integrated system that is a conservative evolution of the local ones. Data exchange is possible in both directions, allowing activities on both levels. For this purpose, we need a schema mapping whose purpose is to ensure schema evolution and to guide the construction of a document translator, allowing automatic data adaptation with respect to type evolution. We propose a set of tools to help deal with XML database evolution. These tools are used: (i) to compute a mapping capable of obtaining a global schema which is a conservative extension of the original local schemas, and to adapt XML documents; (ii) to compute the set of integrity constraints for the global system on the basis of the local ones; (iii) to compare the XML types of two systems in order to replace one system by the one that contains it; (iv) to correct a new document that is invalid with respect to a system's XML schema, so that it can be added to the system. Experimental results on synthetic and real data are discussed, showing the efficiency of our methods in many situations.
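    As an illustration of the document-translator step only: once a mapping between a local schema and the global schema is known, documents can be adapted automatically. The rename-only mapping below is an assumption made for the example; the thesis computes such mappings and handles far more than element renames.

```python
# Adapt a document from a local schema to the global one by renaming
# elements according to a (here assumed) schema mapping.
import xml.etree.ElementTree as ET

LOCAL_TO_GLOBAL = {"client": "customer", "ref": "id"}  # assumed mapping

def adapt(xml_text):
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        elem.tag = LOCAL_TO_GLOBAL.get(elem.tag, elem.tag)
    return ET.tostring(root, encoding="unicode")

print(adapt("<client><ref>42</ref></client>"))
# <customer><id>42</id></customer>
```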