5 research outputs found

    Noisy-parallel and comparable corpora filtering methodology for the extraction of bi-lingual equivalent data at sentence level

    Text alignment and text quality are critical to the accuracy of Machine Translation (MT) systems, some NLP tools, and any other text processing task that requires bilingual data. This research proposes a language-independent bi-sentence filtering approach based on Polish-to-English experiments (Polish is not a position-sensitive language). The cleaning approach was developed on the TED Talks corpus and initially tested on the Wikipedia comparable corpus, but it can be used for any text domain or language pair. It implements various heuristics for sentence comparison, some of which leverage synonyms and semantic and structural analysis of the text as additional information. Data loss was kept to a minimum. The improvement in MT system score obtained with text processed by the tool is discussed. Comment: arXiv admin note: text overlap with arXiv:1509.09093, arXiv:1509.0888

    Configuration interactive et contraintes : connaissances, filtrage et extensions

    The value of our research work is rooted in the following observations: (1) the life cycle of products, systems, services and processes is tending to get shorter; (2) new designs and updates of products on the market are becoming more and more frequent, leading to increasingly short design cycles; (3) technologies are constantly changing, requiring permanent, ongoing acquisition of knowledge; (4) the diversity of products offered on the market is growing all the time, ranging from customizable or configurable to made-to-measure or designed to order.
    These trends, and the mass of information and knowledge that must be processed as a result, place heavy demands on designers, requiring ever more attentiveness and increasingly intense cognitive effort. The result is an increased risk that the product does not fully meet the customer's needs, that it is difficult to implement or manufacture, or that it will be prohibitively expensive. The aim of our work is thus to help the design process reduce these risks and errors by delivering software tools and methodological environments that capitalize on and exploit general, contextual, academic, expert or business knowledge.
    Our work on various complex industrial cases has led us to take into consideration two kinds of knowledge, involving on the one hand the "product domain" and on the other the "product diversity element". Each kind of knowledge leads to different industrial cases. The first kind encompasses the scientific and technical aspects, but also the specific rules governing the business in question. This knowledge is required in order to define the product itself, and involves issues that can be resolved by aiding the product/system/service design. The second kind relates to the diverse nature of the products, and involves issues of customization or configuration of the product/system/service.
    Our aim is to help in what might be called "routine" design, where knowledge of different kinds and various types exists because of the recurrent nature of the activity. We consider that aid in design or configuration can be formalized, either completely or partially, as a constraint satisfaction problem (CSP). In this context, we focus more specifically on interactive decision support, by introducing the principles of filtering or constraint propagation. Our objective is thus to support designers in building the solutions that best answer their problems, by progressively removing from the solution space those that are no longer consistent with the decisions taken, by evaluating solutions as they are built, and/or by optimizing them. The diversity of knowledge formalized as a CSP and the interaction with the user allow us to assemble and adapt filtering algorithms in a generic constraint propagation engine, integrated in our CoFiADe software solution.
    In addition, this CSP-based formalism is complemented by: ontologies to structure knowledge and facilitate its reuse throughout the development cycle; analogy-based approaches that take advantage of contextual knowledge encapsulated in cases, so as to make recommendations to the user on the choice of values; and evolutionary approaches to optimize the search for multi-criteria solutions.
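    The filtering principle mentioned in this abstract (removing from the solution space the values that are no longer consistent with the user's decisions) can be illustrated with a minimal sketch. It is not the CoFiADe engine: it assumes a textbook AC-3 style arc-consistency loop over binary constraints, and the variable names, the `frame_wheel` compatibility rule and the `choose` helper are hypothetical, introduced only for illustration.

```python
from collections import deque

def revise(domains, constraints, x, y):
    """Remove values of x that have no compatible value left in y."""
    allowed = constraints[(x, y)]
    removed = False
    for vx in list(domains[x]):
        if not any(allowed(vx, vy) for vy in domains[y]):
            domains[x].remove(vx)
            removed = True
    return removed

def propagate(domains, constraints):
    """AC-3 style filtering; returns False if some domain becomes empty."""
    queue = deque(constraints.keys())
    while queue:
        x, y = queue.popleft()
        if revise(domains, constraints, x, y):
            if not domains[x]:
                return False
            # The domain of x shrank, so re-check the other arcs pointing at x.
            queue.extend((z, w) for (z, w) in constraints if w == x and z != y)
    return True

def choose(domains, constraints, var, value):
    """Simulate one user decision on a copy of the domains, then filter."""
    new_domains = {v: set(d) for v, d in domains.items()}
    new_domains[var] = {value}
    ok = propagate(new_domains, constraints)
    return ok, new_domains

# Hypothetical configuration model: a race frame cannot take wide wheels.
def frame_wheel(frame, wheel):
    return not (frame == "race" and wheel == "wide")

domains = {"frame": {"city", "race"}, "wheel": {"thin", "wide"}}
constraints = {
    ("frame", "wheel"): frame_wheel,
    ("wheel", "frame"): lambda wheel, frame: frame_wheel(frame, wheel),
}

ok, filtered = choose(domains, constraints, "frame", "race")
print(ok, filtered)
```

    In this toy model, selecting the race frame leaves only the thin wheel selectable, which is the kind of interactive feedback the abstract describes. A real configurator would also need non-binary and numeric constraints.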

    Bayesian Network Approximation from Local Structures

    This work focuses on the problem of Bayesian network structure learning. Two main areas of this field are discussed.
    The first area is theoretical. We consider some aspects of the hardness of Bayesian network structure learning. In particular, we prove that the problem of finding a Bayesian network structure with a minimal number of edges that encodes the joint probability distribution of a given dataset is NP-hard. This result offers a view of the NP-hardness of Bayesian network structure learning that differs significantly from the standard one. The most notable results in this area so far focus mainly on a specific characterization of the problem, where the aim is to find a Bayesian network structure maximizing some given probabilistic criterion. These criteria arise from quite advanced considerations in statistics, and their interpretation might not be intuitive, especially for people unfamiliar with the Bayesian networks domain. In contrast, the criterion proposed here, for which NP-hardness is proved, does not require any advanced knowledge and is easy to understand.
    The second area concerns concrete algorithms. We focus on one of the most interesting branches in the history of Bayesian network structure learning methods, one that has led to very significant solutions: the branch of local Bayesian network structure learning methods, where the main aim is first to gather information describing local properties of the constructed networks, and then to use this information to construct the whole network structure. The algorithm at the root of this branch focuses on an important local characterization of Bayesian networks, the so-called Markov blankets. The Markov blanket of a given attribute consists of those other attributes that, in the probabilistic sense, form the set of its causes that is maximal in strength and minimal in size. This first algorithm is based on one important observation: subject to appropriate assumptions, it is possible to determine the optimal Bayesian network structure by examining relations between attributes only within the Markov blankets. For datasets derived from appropriately sparse distributions, where the Markov blanket of each attribute has a size bounded by some common constant, such a procedure leads to a Bayesian network structure learning approach that scales well in time.
    The local learning branch has mainly evolved in the direction of reducing the gathered local information to even smaller and more reliably learned patterns. This reduction has arisen from parallel progress in the field of Markov blanket approximation.
    The main result of this dissertation is a Bayesian network structure learning procedure that belongs to the branch of local learning methods and in fact introduces a fork at its root. The fundamental idea is to aggregate the local knowledge learned over the Markov blankets not in the form of dependencies derived within these blankets, as happens in the root method, but in the form of local Bayesian networks. The user thereby gains considerable influence over the character of this local knowledge, by choosing the Bayesian network structure learning method, suited to his or her needs, that is used to learn the local structures. The approach for merging local structures into a global one is justified theoretically and evaluated empirically, showing its ability to enhance even very advanced Bayesian network structure learning algorithms when they are applied locally in the proposed scheme.
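    The notion of a Markov blanket used in this abstract can be made concrete with a short sketch. It does not reproduce the dissertation's learning or merging procedure. It only assumes the standard graphical reading, in which the Markov blanket of a node in a known DAG is its parents, its children, and the other parents of its children. The example graph and the `markov_blanket` helper are hypothetical.

```python
def markov_blanket(parents, node):
    """Markov blanket of `node` in a DAG given as {child: [parents]}."""
    # Children are the nodes that list `node` among their parents.
    children = {c for c, ps in parents.items() if node in ps}
    # Spouses are the other parents of those children.
    spouses = {p for c in children for p in parents[c]}
    return (set(parents[node]) | children | spouses) - {node}

# Hypothetical DAG: A -> C, B -> C, C -> D, E -> D
parents = {
    "A": [],
    "B": [],
    "C": ["A", "B"],
    "D": ["C", "E"],
    "E": [],
}

for name in sorted(parents):
    print(name, sorted(markov_blanket(parents, name)))
```

    When every blanket is small, as in the sparse setting the abstract mentions, each local problem involves only a handful of attributes, which is what makes the local approach scale.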