432 research outputs found

    Data Mining Algorithms for Internet Data: from Transport to Application Layer

    Get PDF
    Nowadays we live in a data-driven world. Advances in data generation, collection and storage technology have enabled organizations to gather data sets of massive size. Data mining is a discipline that blends traditional data analysis methods with sophisticated algorithms to handle the challenges posed by these new types of data sets. The Internet is a complex and dynamic system with new protocols and applications that arise at a constant pace. All these characteristics designate the Internet a valuable and challenging data source and application domain for a research activity, both looking at Transport layer, analyzing network tra c flows, and going up to Application layer, focusing on the ever-growing next generation web services: blogs, micro-blogs, on-line social networks, photo sharing services and many other applications (e.g., Twitter, Facebook, Flickr, etc.). In this thesis work we focus on the study, design and development of novel algorithms and frameworks to support large scale data mining activities over huge and heterogeneous data volumes, with a particular focus on Internet data as data source and targeting network tra c classification, on-line social network analysis, recommendation systems and cloud services and Big data

    Information-theoretic causal inference of lexical flow

    Get PDF
    This volume seeks to infer large phylogenetic networks from phonetically encoded lexical data and contribute in this way to the historical study of language varieties. The technical step that enables progress in this case is the use of causal inference algorithms. Sample sets of words from language varieties are preprocessed into automatically inferred cognate sets, and then modeled as information-theoretic variables based on an intuitive measure of cognate overlap. Causal inference is then applied to these variables in order to determine the existence and direction of influence among the varieties. The directed arcs in the resulting graph structures can be interpreted as reflecting the existence and directionality of lexical flow, a unified model which subsumes inheritance and borrowing as the two main ways of transmission that shape the basic lexicon of languages. A flow-based separation criterion and domain-specific directionality detection criteria are developed to make existing causal inference algorithms more robust against imperfect cognacy data, giving rise to two new algorithms. The Phylogenetic Lexical Flow Inference (PLFI) algorithm requires lexical features of proto-languages to be reconstructed in advance, but yields fully general phylogenetic networks, whereas the more complex Contact Lexical Flow Inference (CLFI) algorithm treats proto-languages as hidden common causes, and only returns hypotheses of historical contact situations between attested languages. The algorithms are evaluated both against a large lexical database of Northern Eurasia spanning many language families, and against simulated data generated by a new model of language contact that builds on the opening and closing of directional contact channels as primary evolutionary events. The algorithms are found to infer the existence of contacts very reliably, whereas the inference of directionality remains difficult. This currently limits the new algorithms to a role as exploratory tools for quickly detecting salient patterns in large lexical datasets, but it should soon be possible for the framework to be enhanced e.g. by confidence values for each directionality decision

    Statistical methods for fine-grained retail product recognition

    Get PDF
    In recent years, computer vision has become a major instrument in automating retail processes with emerging smart applications such as shopper assistance, visual product search (e.g., Google Lens), no-checkout stores (e.g., Amazon Go), real-time inventory tracking, out-of-stock detection, and shelf execution. At the core of these applications lies the problem of product recognition, which poses a variety of new challenges in contrast to generic object recognition. Product recognition is a special instance of fine-grained classification. Considering the sheer diversity of packaged goods in a typical hypermarket, we are confronted with up to tens of thousands of classes, which, particularly if under the same product brand, tend to have only minute visual differences in shape, packaging texture, metric size, etc., making them very difficult to discriminate from one another. Another challenge is the limited number of available datasets, which either have only a few training examples per class that are taken under ideal studio conditions, hence requiring cross-dataset generalization, or are captured from the shelf in an actual retail environment and thus suffer from issues like blur, low resolution, occlusions, unexpected backgrounds, etc. Thus, an effective product classification system requires substantially more information in addition to the knowledge obtained from product images alone. In this thesis, we propose statistical methods for a fine-grained retail product recognition. In our first framework, we propose a novel context-aware hybrid classification system for the fine-grained retail product recognition problem. In the second framework, state-of-the-art convolutional neural networks are explored and adapted to fine-grained recognition of products. The third framework, which is the most significant contribution of this thesis, presents a new approach for fine-grained classification of retail products that learns and exploits statistical context information about likely product arrangements on shelves, incorporates visual hierarchies across brands, and returns recognition results as "confidence sets" that are guaranteed to contain the true class at a given confidence leve

    L'extraction d'information des sources de données non structurées et semi-structurées

    Get PDF
    L'objectif de la thèse: Dans le contexte des dépôts de connaissances de grandes dimensions récemment apparues, on exige l'investigation de nouvelles méthodes innovatrices pour résoudre certains problèmes dans le domaine de l'Extraction de l'Information (EI), tout comme dans d'autres sous-domaines apparentés. La thèse débute par un tour d'ensemble dans le domaine de l'Extraction de l'Information, tout en se concentrant sur le problème de l'identification des entités dans des textes en langage naturel. Cela constitue une démarche nécessaire pour tout système EI. L'apparition des dépôts de connaissances de grandes dimensions permet le traitement des sous-problèmes de désambigüisation au Niveau du Sens (WSD) et La Reconnaissance des Entités dénommées (NER) d'une manière unifiée. Le premier système implémenté dans cette thèse identifie les entités (les noms communs et les noms propres) dans un texte libre et les associe à des entités dans une ontologie, pratiquement, tout en les désambigüisant. Un deuxième système implémenté, inspiré par l'information sémantique contenue dans les ontologies, essaie, également, l'utilisation d'une nouvelle méthode pour la solution du problème classique de classement de texte, obtenant de bons résultats.Thesis objective: In the context of recently developed large scale knowledge sources (general ontologies), investigate possible new approaches to major areas of Information Extraction (IE) and related fields. The thesis overviews the field of Information Extraction and focuses on the task of entity recognition in natural language texts, a required step for any IE system. Given the availability of large knowledge resources in the form of semantic graphs, an approach that treats the sub-tasks of Word Sense Disambiguation and Named Entity Recognition in a unified manner is possible. The first implemented system using this approach recognizes entities (words, both common and proper nouns) from free text and assigns them ontological classes, effectively disambiguating them. A second implemented system, inspired by the semantic information contained in the ontologies, also attempts a new approach to the classic problem of text classification, showing good results

    Visualizing multidimensional data similarities:Improvements and applications

    Get PDF
    Multidimensional data is increasingly more prominent and important in many application domains. Such data typically consist of a large set of elements, each of which described by several measurements (dimensions). During the design of techniques and tools to process this data, a key component is to gather insights into their structure and patterns, which can be described by the notion of similarity between elements. Among these techniques, multidimensional projections and similarity trees can effectively capture similarity patterns and handle a large number of data elements and dimensions. However, understanding and interpreting these patterns in terms of the original data dimensions is still hard. This thesis addresses the development of visual explanatory techniques for the easy interpretation of similarity patterns present in multidimensional projections and similarity trees, by several contributions. First, we propose methods that make the computation of similarity trees efficient for large datasets, and also enhance its visual representation to allow the exploration of more data in a limited screen. Secondly, we propose methods for the visual explanation of multidimensional projections in terms of groups of similar elements. These are automatically annotated to describe which dimensions are more important to define their notion of group similarity. We show next how these explanatory mechanisms can be adapted to handle both static and time-dependent data. Our proposed techniques are designed to be easy to use, work nearly automatically, and are demonstrated on a variety of real-world large data obtained from image collections, text archives, scientific measurements, and software engineering

    The Role of Mutations in Protein Structural Dynamics and Function: A Multi-scale Computational Approach

    Get PDF
    abstract: Proteins are a fundamental unit in biology. Although proteins have been extensively studied, there is still much to investigate. The mechanism by which proteins fold into their native state, how evolution shapes structural dynamics, and the dynamic mechanisms of many diseases are not well understood. In this thesis, protein folding is explored using a multi-scale modeling method including (i) geometric constraint based simulations that efficiently search for native like topologies and (ii) reservoir replica exchange molecular dynamics, which identify the low free energy structures and refines these structures toward the native conformation. A test set of eight proteins and three ancestral steroid receptor proteins are folded to 2.7Ă… all-atom RMSD from their experimental crystal structures. Protein evolution and disease associated mutations (DAMs) are most commonly studied by in silico multiple sequence alignment methods. Here, however, the structural dynamics are incorporated to give insight into the evolution of three ancestral proteins and the mechanism of several diseases in human ferritin protein. The differences in conformational dynamics of these evolutionary related, functionally diverged ancestral steroid receptor proteins are investigated by obtaining the most collective motion through essential dynamics. Strikingly, this analysis shows that evolutionary diverged proteins of the same family do not share the same dynamic subspace. Rather, those sharing the same function are simultaneously clustered together and distant from those functionally diverged homologs. This dynamics analysis also identifies 77% of mutations (functional and permissive) necessary to evolve new function. In silico methods for prediction of DAMs rely on differences in evolution rate due to purifying selection and therefore the accuracy of DAM prediction decreases at fast and slow evolvable sites. Here, we investigate structural dynamics through computing the contribution of each residue to the biologically relevant fluctuations and from this define a metric: the dynamic stability index (DSI). Using DSI we study the mechanism for three diseases observed in the human ferritin protein. The T30I and R40G DAMs show a loss of dynamic stability at the C-terminus helix and nearby regulatory loop, agreeing with experimental results implicating the same regulatory loop as a cause in cataracts syndrome.Dissertation/ThesisPh.D. Physics 201

    Run-time thread management for large-scale distributed-memory multiprocessors

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1993.Includes bibliographical references (p. 214-216).by Daniel Nussbaum.Ph.D

    Information-theoretic causal inference of lexical flow

    Get PDF
    This volume seeks to infer large phylogenetic networks from phonetically encoded lexical data and contribute in this way to the historical study of language varieties. The technical step that enables progress in this case is the use of causal inference algorithms. Sample sets of words from language varieties are preprocessed into automatically inferred cognate sets, and then modeled as information-theoretic variables based on an intuitive measure of cognate overlap. Causal inference is then applied to these variables in order to determine the existence and direction of influence among the varieties. The directed arcs in the resulting graph structures can be interpreted as reflecting the existence and directionality of lexical flow, a unified model which subsumes inheritance and borrowing as the two main ways of transmission that shape the basic lexicon of languages
    • …
    corecore