32 research outputs found

    Correct your Text with Google

    No full text
    To appear in the Proceedings of the International Conference on Web Intelligence, IEEE 2007. With the increasing number of text files produced nowadays, spell checkers have become essential tools for the everyday tasks of millions of end users. Over the years, several tools have been designed that show decent performance. Grammar checkers may further improve the correction of texts, but they require large resources. We think that basic spell checking can be improved by using the Web as a corpus and by taking into account the context of words identified as potential misspellings. We propose to use the Google search engine and some machine learning techniques to design a flexible and dynamic spell checker that can evolve over time with new linguistic features.

    Discovering Patterns in Flows: a Privacy Preserving Approach with the ACSM Prototype

    No full text
    In this demonstration, we present the ACSM prototype, which discovers frequent patterns in the context of flow management problems. One important issue when working on such problems is preserving the private data collected from users. The approach presented here represents flows as probabilistic automata. Using efficient algebraic techniques, the ACSM prototype discovers sequential patterns under constraints from those automata. Unlike standard sequential pattern techniques that may be applied in such contexts, our prototype makes no use of individuals' data.

    Contributions de l'inférence grammaticale à la fouille de données séquentielles

    No full text
    In this thesis, we establish links between the models produced by grammatical inference algorithms and the knowledge extracted by sequential data mining algorithms. Starting from the observation that both settings manipulate data structured as sequences of symbols, we exploit the properties of probabilistic automata inferred from these sequences to make sequential data mining more efficient. We show that the raw exploitation of the original sequences, or of the probabilistic automata inferred from them, does not necessarily guarantee the extraction of relevant knowledge. We contribute several results, in the form of lower bounds and statistical constraints, that ensure a fruitful exploitation of sequences and probabilistic automata. Furthermore, our model provides an effective solution for applications that raise problems of preserving individuals' privacy.

    Sequence Mining Without Sequences: a New Way for Privacy Preserving

    No full text
    During the last decade, sequential pattern mining has been the core of numerous research efforts. It is now possible to efficiently discover users' behavior in various domains such as purchases in supermarkets, Web site visits, etc. Nevertheless, classical algorithms do not respect individuals' privacy, since they exploit personal information (name, IP address, etc.). We provide an original solution to privacy preservation by using a probabilistic automaton instead of the original data. An application in car flow modeling is presented, showing the ability of our algorithm to discover frequent routes without any individual information. A comparison with SPAM shows that, even if we sample from the automaton, our approach is more efficient.

    A Lower Bound on the Sample Size needed to perform a Significant Frequent Pattern Mining Task

    No full text
    During the past few years, the problem of assessing the statistical significance of frequent patterns extracted from a given set S of data has received much attention. Considering that S always consists of a sample drawn from an unknown underlying distribution, two types of risk can arise during a frequent pattern mining process: accepting a false frequent pattern or rejecting a true one. In this context, many approaches presented in the literature assume that the dataset size is an application-dependent parameter. In this case, there is a trade-off between the two errors, leading to solutions that control only one risk to the detriment of the other. On the other hand, many sampling-based methods have attempted to determine the optimal size of S ensuring a good approximation of the original (potentially infinite) database from which S is drawn. However, these approaches often resort to Chernoff bounds that do not allow independent control of the two risks. In this paper, we overcome these drawbacks by providing a lower bound on the sample size required to control both risks and achieve a significant frequent pattern mining task.
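For orientation, the kind of Chernoff-style bound that the abstract contrasts against can be sketched as follows. This is not the lower bound proposed in the paper, which the abstract does not state; it is only the classical two-sided Hoeffding inequality, P(|p_hat - p| >= epsilon) <= 2 exp(-2 n epsilon^2), solved for n. The function name and the parameters `epsilon` and `delta` are illustrative assumptions. Note that a single confidence parameter covers both kinds of error at once, which illustrates the limitation the abstract points out.

```python
import math


def hoeffding_sample_size(epsilon: float, delta: float) -> int:
    """Smallest n such that 2 * exp(-2 * n * epsilon**2) <= delta.

    epsilon: tolerated absolute error on a pattern's estimated frequency.
    delta:   tolerated probability that the error exceeds epsilon.
    Both names are illustrative; they do not come from the paper.
    """
    if not 0.0 < epsilon < 1.0 or not 0.0 < delta < 1.0:
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


# Example: frequency estimated within 0.01, with at most a 5% failure chance.
n = hoeffding_sample_size(0.01, 0.05)
print(n)
```

Halving `epsilon` multiplies the required sample size by four, while tightening `delta` only adds a logarithmic cost.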

    Mining Probabilistic Automata: A Statistical View of Sequential Pattern Mining

    No full text
    44 pages. During the past decade, sequential pattern mining has been the core of numerous research efforts. It is now possible to efficiently extract knowledge of users' behavior from a huge set of sequences collected over time. This has applications in various domains such as purchases in supermarkets, Web site visits, etc. However, sequence mining algorithms do little to control the risks of extracting false discoveries or overlooking true knowledge. In this paper, the theoretical conditions to achieve a relevant sequence mining process are examined. Then, the article offers a statistical view of sequence mining which has the following advantages: First, it uses a compact and generalized representation of the original sequences in the form of a probabilistic automaton. Second, it integrates statistical constraints to guarantee the extraction of significant patterns. Finally, it provides an interesting solution in a privacy-preserving context in order to respect individuals' information. An application in car flow modeling is presented, showing the ability of our algorithm (ACSM) to discover frequent routes without any private information. Comparisons with a classical sequence mining algorithm (SPAM) are made, showing the effectiveness of our approach.
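The idea of mining an automaton rather than the raw sequences can be illustrated with a deliberately simplified sketch. It is not ACSM: the abstract does not describe the inference or the algebraic techniques used, so the code below substitutes a first-order model with one state per symbol, and the toy routes and function names are invented for illustration. Once the transition probabilities are estimated, the likelihood of a route is read from the model alone and the individual sequences can be discarded.

```python
from collections import Counter, defaultdict


def build_automaton(sequences):
    """Estimate transition probabilities, using one state per symbol.

    This first-order model is a stand-in assumption, not the grammatical
    inference procedure used by the authors.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    return {
        state: {sym: c / sum(out.values()) for sym, c in out.items()}
        for state, out in counts.items()
    }


def path_probability(automaton, path):
    """Probability of following `path` step by step, given its first symbol."""
    prob = 1.0
    for cur, nxt in zip(path, path[1:]):
        prob *= automaton.get(cur, {}).get(nxt, 0.0)
    return prob


# Toy routes (made up for illustration); each letter is a road segment.
routes = [["A", "B", "C"], ["A", "B", "D"], ["A", "C", "D"], ["B", "C", "D"]]
automaton = build_automaton(routes)

# Only the aggregated model is consulted from here on.
p = path_probability(automaton, ["A", "B", "C"])
print(round(p, 3))
```

A route would then be reported as frequent when this probability exceeds a chosen threshold; how that threshold is set and statistically justified is the subject of the paper and is not reproduced here.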

    Contributions de l'inférence grammaticale à la fouille de données séquentielles

    No full text
    In this thesis, we establish links between the models obtained by grammatical inference algorithms and the knowledge inferred by sequential data mining techniques. Based on the observation that the common point between these two settings is the manipulation of data structured as sequences of symbols, we exploit the properties of probabilistic automata inferred from these sequences for the benefit of more effective sequential data mining. In this context, we show that the raw exploitation of the original sequences, and likewise of the probabilistic automata inferred from them, does not necessarily guarantee the extraction of relevant knowledge. We present several contributions, in the form of lower bounds and statistical constraints, that ensure a fruitful exploitation of sequences and probabilistic automata. Furthermore, our model offers an effective solution for certain applications involving the preservation of individuals' privacy.