
    Tracking repeats using significance and transitivity.

    Keywords: transitivity; extreme value distribution

    Motivation: Internal repeats in coding sequences correspond to structural and functional units of proteins. Moreover, duplication of fragments of coding sequences is known to be a mechanism to facilitate evolution. Identification of repeats is crucial to shed light on the function and structure of proteins, and explain their evolutionary past. The task is difficult because during the course of evolution many repeats diverged beyond recognition.

    Results: We introduce a new method, TRUST, for ab initio determination of internal repeats in proteins. It provides an improvement in prediction quality as compared to alternative state-of-the-art methods. The increased sensitivity and accuracy of the method is achieved by exploiting the concept of transitivity of alignments. Starting from significant local suboptimal alignments, the application of transitivity allows us to: 1) identify distant repeat homologues for which no alignments were found; 2) gain confidence about consistently well-aligned regions; and 3) recognize and reduce the contribution of nonhomologous repeats. This reassessment step enables us to derive a virtually noise-free profile representing a generalized repeat with high fidelity. We also obtained superior specificity by employing rigid statistical testing for self-sequence and profile-sequence alignments. Assessment was done using a database of repeat annotations based on structural superpositioning. The results show that TRUST is a useful and reliable tool for mining tandem and non-tandem repeats in protein sequence databases, able to predict multiple repeat types with varying intervening segments within a single sequence.
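    For context, a minimal sketch of the transitivity idea, assuming self-alignments have been reduced to pairs of equivalent residue positions: if position i aligns with j and j with k, then i and k are inferred to be related. This is an illustration only, not the TRUST implementation, which operates on suboptimal alignments, profiles, and significance scores.

```python
# Toy transitive closure over residue-position pairs taken from
# significant local self-alignments (illustrative, not TRUST itself).

def close_under_transitivity(aligned_pairs):
    """aligned_pairs: iterable of (i, j) residue-position pairs.
    Returns the set closed under transitivity: if (i, j) and (j, k)
    are aligned, (i, k) is inferred as well."""
    pairs = {tuple(sorted(p)) for p in aligned_pairs}
    changed = True
    while changed:
        changed = False
        neighbours = {}
        for i, j in pairs:
            neighbours.setdefault(i, set()).add(j)
            neighbours.setdefault(j, set()).add(i)
        new_pairs = set(pairs)
        for i, j in pairs:
            for k in neighbours.get(j, ()):
                if k != i:
                    new_pairs.add(tuple(sorted((i, k))))
        if new_pairs != pairs:
            pairs, changed = new_pairs, True
    return pairs

example = {(5, 40), (40, 75)}              # positions 5~40 and 40~75 align
print(close_under_transitivity(example))   # also infers the distant pair (5, 75)
```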

    Automated smoother for the numerical decoupling of dynamics models

    Background: Structure identification of dynamic models for complex biological systems is the cornerstone of their reverse engineering. Biochemical Systems Theory (BST) offers a particularly convenient solution because its parameters are kinetic-order coefficients which directly identify the topology of the underlying network of processes. We have previously proposed a numerical decoupling procedure that allows the identification of multivariate dynamic models of complex biological processes. While described here within the context of BST, this procedure has general applicability to signal extraction. Our original implementation relied on artificial neural networks (ANN), which caused slight, undesirable bias during the smoothing of the time courses. As an alternative, we propose here an adaptation of the Whittaker smoother and demonstrate its role within a robust, fully automated structure identification procedure.

    Results: In this report we propose a robust, fully automated solution for signal extraction from time series, which is the prerequisite for the efficient reverse engineering of biological systems models. The Whittaker smoother is reformulated within the context of information theory and extended by the development of adaptive signal segmentation to account for heterogeneous noise structures. The resulting procedure can be used on arbitrary time series with a nonstationary noise process; it is illustrated here with metabolic profiles obtained from in-vivo NMR experiments. The smoothed solution, free of parametric bias, permits differentiation, which is crucial for the numerical decoupling of systems of differential equations.

    Conclusion: The method is applicable to signal extraction from time series with a nonstationary noise structure and can be applied to the numerical decoupling of systems of differential equations into algebraic equations, and thus constitutes a rather general tool for the reverse engineering of mechanistic model descriptions from multivariate experimental time series.
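    For context, a minimal sketch of the Whittaker smoother in its penalized least-squares form, assuming evenly sampled data: the smooth signal z minimizes ||y - z||^2 + lam * ||Dz||^2 for a finite-difference operator D. The paper selects the smoothing parameter automatically via information theory and adds adaptive segmentation; in this sketch lam is simply fixed by hand.

```python
# Minimal Whittaker smoother sketch (penalized least squares),
# assuming evenly sampled data; `lam` is chosen manually here.
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def whittaker_smooth(y, lam=100.0, order=2):
    """Minimize ||y - z||^2 + lam * ||D z||^2, where D is the
    `order`-th finite-difference operator, and return the smooth z."""
    y = np.asarray(y, dtype=float)
    m = y.size
    D = sparse.eye(m, format="csr")
    for _ in range(order):
        D = D[1:] - D[:-1]                      # finite-difference operator
    A = sparse.eye(m, format="csr") + lam * (D.T @ D)
    return spsolve(A.tocsc(), y)

t = np.linspace(0, 4 * np.pi, 200)
noisy = np.sin(t) + 0.2 * np.random.randn(t.size)
smooth = whittaker_smooth(noisy, lam=50.0)
slope = np.gradient(smooth, t)                  # the smooth curve can be differentiated
```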

    Analyzing User Behavior in Collaborative Environments

    Discrete sequences are the building blocks for many real-world problems in domains including genomics, e-commerce, and social sciences. While there are machine learning methods to classify and cluster sequences, they fail to explain what makes groups of sequences distinguishable. Although in some cases having a black box model is sufficient, there is a need for increased explainability in research areas focused on human behaviors. For example, psychologists are less interested in having a model that predicts human behavior with high accuracy and more concerned with identifying differences between actions that lead to divergent human behavior. This dissertation presents techniques for understanding differences between classes of discrete sequences. We leveraged our developed approaches to study two online collaborative environments: GitHub, a software development platform, and Minecraft, a multiplayer online game. The first approach measures the differences between groups of sequences by comparing k-gram representations of sequences using the silhouette score and characterizing the differences by analyzing the distance matrix of subsequences. The second approach discovers subsequences that are significantly more similar to one set of sequences vs. other sets. This approach, which is called contrast motif discovery, first finds a set of motifs for each group of sequences and then refines them to include the motifs that distinguish that group from other groups of sequences. Compared to existing methods, our technique is scalable and capable of handling long event sequences. Our first case study is GitHub. GitHub is a social coding platform that facilitates distributed, asynchronous collaborations in open source software development. It has an open API to collect metadata about users, repositories, and the activities of users on repositories. To study the dynamics of teams on GitHub, we focused on discrete event sequences that are generated when GitHub users perform actions on this platform. Specifically, we studied the differences that automated accounts (aka bots) make on software development processes and outcomes. We trained black box supervised learning methods to classify sequences of GitHub teams and then utilized our sequence analysis techniques to measure and characterize differences between event sequences of teams with bots and teams without bots. Teams with bots have relatively distinct event sequences from teams without bots in terms of the existence and frequency of short subsequences. Moreover, teams with bots have more novel and less repetitive sequences compared to teams with no bots. In addition, we discovered contrast motifs for human-bot and human-only teams. Our analysis of contrast motifs shows that in human-bot teams, discussions are scattered throughout other activities while in human-only teams discussions tend to cluster together. For our second case study, we applied our sequence mining approaches to analyze player behavior in Minecraft, a multiplayer online game that supports many forms of player collaboration. As a sandbox game, it provides players with a large amount of flexibility in deciding how to complete tasks; this lack of goal-orientation makes the problem of analyzing Minecraft event sequences more challenging than event sequences from more structured games. 
Using our approaches, we were able to measure and characterize differences between low-level sequences of high-level actions, and despite variability in how different players accomplished the same tasks, we discovered contrast motifs for many player actions. Finally, we explored how the level of player collaboration affects the contrast motifs.
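    A toy sketch of the first approach described above, assuming each event history is already a list of discrete symbols: sequences are mapped to k-gram count vectors and the separability of labeled groups is summarized with the silhouette score. This is a simplified stand-in, not the dissertation's implementation.

```python
# Compare groups of event sequences via k-gram profiles and the
# silhouette score (simplified illustration).
from collections import Counter
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import silhouette_score

def kgram_counts(seq, k=3):
    """Count overlapping k-grams (length-k subsequences) of a sequence."""
    return Counter(tuple(seq[i:i + k]) for i in range(len(seq) - k + 1))

def group_separation(sequences, labels, k=3):
    """Silhouette score of the labeled groups when each sequence is
    represented by its k-gram count vector."""
    profiles = [{" ".join(g): c for g, c in kgram_counts(s, k).items()}
                for s in sequences]
    X = DictVectorizer(sparse=False).fit_transform(profiles)
    return silhouette_score(X, labels, metric="euclidean")

# Toy sequences: characters stand in for discrete events.
seqs = [list("ababab"), list("ababba"), list("ccddcc"), list("cdcdcd")]
print(group_separation(seqs, [0, 0, 1, 1], k=2))
```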

    Applications of high-frequency telematics for driving behavior analysis

    A thesis submitted in partial fulfillment of the requirements for the degree of Doctor in Information Management, specialization in Statistics and Econometrics.

    Processing driving data and investigating driving behavior has been receiving increasing interest in the last decades, with applications ranging from car insurance pricing to policy-making. A popular way of analyzing driving behavior is to move the focus to the maneuvers, as they give useful information about the driver who is performing them. Previous research on maneuver detection can be divided into two strategies, namely, 1) using fixed thresholds in inertial measurements to define the start and end of specific maneuvers or 2) using features extracted from rolling windows of sensor data in a supervised learning model to detect maneuvers. While the first strategy is not adaptable and requires fine-tuning, the second needs a dataset with labels (which is time-consuming) and cannot identify maneuvers with different lengths in time. To tackle these shortcomings, we investigate a new way of identifying maneuvers from vehicle telematics data, through motif detection in time series. Using a publicly available naturalistic driving dataset (the UAH-DriveSet), we conclude that motif detection algorithms are not only capable of extracting simple maneuvers such as accelerations, brakes, and turns, but also more complex maneuvers, such as lane changes and overtaking maneuvers, thus validating motif discovery as a worthwhile line for future research in driving behavior. We also propose TripMD, a system that extracts the most relevant driving patterns from sensor recordings (such as acceleration) and provides a visualization that allows for an easy investigation. We test TripMD on the same UAH-DriveSet dataset and show that (1) our system can extract a rich number of driving patterns from a single driver that are meaningful for understanding driving behaviors and (2) our system can be used to identify the driving behavior of an unknown driver from a set of drivers whose behavior we know.

    Over the last decades, the processing and analysis of driving data has received growing interest, with applications ranging from car insurance to regulation. Typically, driving analysis comprises the extraction and study of maneuvers, since these contain relevant information about the driver's performance. Previous research on this topic can be divided into two types of strategies, namely, 1) the use of fixed acceleration thresholds to define the start and end of each maneuver or 2) the use of supervised learning models over temporal windows. While the first type of strategy is inflexible and requires parameter tuning, the second needs annotated driving data (which is time-consuming) and cannot identify maneuvers of different durations. To mitigate these gaps, in this work we apply methods developed in the time-series research area to solve the maneuver detection problem. In particular, we explore the area of motif detection in time series and test whether these generic methods succeed in detecting maneuvers. We also propose TripMD, a system that extracts the most relevant driving patterns from a set of trips and provides a simple visualization. TripMD is tested on a public dataset (the UAH-DriveSet), from which we conclude that (1) our system is able to extract driving patterns/maneuvers from a single driver that are related to that driver's driving profile and (2) our system can be used to identify the driving profile of an unknown driver from a set of drivers whose behavior is known to us.
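    As a toy illustration of the motif-detection step this line of work builds on, the sketch below does a brute-force search for the closest pair of non-overlapping, z-normalized subsequences in a single acceleration channel. TripMD itself handles variable-length motifs and richer summaries; the names and data here are illustrative only.

```python
# Brute-force top-motif search on a 1-D signal (toy illustration).
import numpy as np

def znorm(x):
    s = x.std()
    return (x - x.mean()) / s if s > 0 else x - x.mean()

def top_motif(signal, m):
    """Return (i, j, dist): start indices of the closest pair of
    non-overlapping length-m subsequences and their distance."""
    n = len(signal) - m + 1
    subs = np.array([znorm(signal[i:i + m]) for i in range(n)])
    best = (None, None, np.inf)
    for i in range(n):
        for j in range(i + m, n):               # enforce non-overlap
            d = np.linalg.norm(subs[i] - subs[j])
            if d < best[2]:
                best = (i, j, d)
    return best

# Synthetic "acceleration" signal with repeated oscillatory patterns.
accel = np.sin(np.linspace(0, 8 * np.pi, 400)) + 0.1 * np.random.randn(400)
print(top_motif(accel, m=40))
```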

    Combating User Misbehavior on Social Media

    Social media encourages user participation and facilitates users' self-expression like never before. While enriching user behavior in a spectrum of ways, many social media platforms have become breeding grounds for user misbehavior. In this dissertation we focus on understanding and combating three specific threads of user misbehavior that widely exist on social media — spamming, manipulation, and distortion. First, we address the challenge of detecting spam links. Rather than rely on traditional blacklist-based or content-based methods, we examine the behavioral factors of both who is posting the link and who is clicking on the link. The core intuition is that these behavioral signals may be more difficult to manipulate than traditional signals. We find that this purely behavioral approach can achieve good performance for robust behavior-based spam link detection. Next, we deal with uncovering manipulated behavior of link sharing. We propose a four-phase approach to model, identify, characterize, and classify organic and organized groups who engage in link sharing. The key motivating insight is that group-level behavioral signals can distinguish manipulated user groups. We find that levels of organized behavior vary by link type and that the proposed approach achieves good performance measured by commonly-used metrics. Finally, we investigate a particular distortion behavior: making bullshit (BS) statements on social media. We explore the factors impacting the perception of BS and what leads users to ultimately perceive and call a post BS. We begin by preparing a crowdsourced collection of real social media posts that have been called BS. We then build a classification model that can determine what posts are more likely to be called BS. Our experiments suggest our classifier has the potential of leveraging linguistic cues for detecting social media posts that are likely to be called BS. We complement these three studies with a cross-cutting investigation of learning user topical profiles, which can shed light on what subjects each user is associated with and thus benefit the understanding of the connection between users and misbehavior. Concretely, we propose a unified model for learning user topical profiles that simultaneously considers multiple footprints, and we show how these footprints can be embedded in a generalized optimization framework. Through extensive experiments on millions of real social media posts, we find our proposed models can effectively combat user misbehavior on social media.
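    A minimal, hypothetical sketch of the behavior-based detection idea: each shared link is described by posting-side and click-side behavioral features and passed to an off-the-shelf classifier. The feature names and the placeholder data below are assumptions for illustration, not the dissertation's features or datasets.

```python
# Behavior-based spam-link classification sketch with placeholder data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical per-link behavioral features:
FEATURES = ["poster_post_rate", "poster_account_age_days",
            "n_distinct_clickers", "median_seconds_to_click"]

rng = np.random.default_rng(0)
X = rng.random((200, len(FEATURES)))      # placeholder behavioral feature matrix
y = rng.integers(0, 2, size=200)          # placeholder spam / benign labels

clf = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())   # ~chance on random data
```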

    Abundance of intrinsic disorder in SV-IV, a multifunctional androgen-dependent protein secreted from rat seminal vesicle

    The potent immunomodulatory, anti-inflammatory and procoagulant properties of the protein no. 4 secreted from the rat seminal vesicle epithelium (SV-IV) have been previously found to be modulated by a supramolecular monomer-trimer equilibrium. More structural details that integrate experimental data into a predictive framework have recently been reported. Unfortunately, homology modelling and fold-recognition strategies were not successful in creating a theoretical model of the structural organization of SV-IV. It was inferred that the global structure of SV-IV is not similar to any protein of known three-dimensional structure. Reversing the classical approach to the sequence-structure-function paradigm, in this paper we report on novel information obtained by comparing physicochemical parameters of SV-IV with two datasets made of intrinsically unfolded and ideally globular proteins. In addition, we have analysed the SV-IV sequence by several publicly available disorder-oriented predictors. Overall, disorder predictions and a re-examination of existing experimental data strongly suggest that SV-IV needs large plasticity to efficiently interact with the different targets that characterize its multifaceted biological function and should therefore be better classified as an intrinsically disordered protein.
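    One widely used physicochemical discriminant of the kind referred to above is the charge-hydropathy plot, which separates intrinsically unfolded from compact globular proteins. The sketch below is an illustration under that method's commonly cited scale and boundary constants, not the paper's actual analysis.

```python
# Charge-hydropathy (Uversky-style) disorder check, for illustration only.
KYTE_DOOLITTLE = {
    'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5, 'Q': -3.5,
    'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5, 'L': 3.8, 'K': -3.9,
    'M': 1.9, 'F': 2.8, 'P': -1.6, 'S': -0.8, 'T': -0.7, 'W': -0.9,
    'Y': -1.3, 'V': 4.2,
}

def mean_hydropathy(seq):
    """Kyte-Doolittle hydropathy rescaled to [0, 1] and averaged."""
    return sum((KYTE_DOOLITTLE[a] + 4.5) / 9.0 for a in seq) / len(seq)

def mean_net_charge(seq):
    """Absolute mean net charge near pH 7 (K, R positive; D, E negative)."""
    return abs(sum(seq.count(a) for a in "KR") -
               sum(seq.count(a) for a in "DE")) / len(seq)

def predicted_disordered(seq):
    """Charge-hydropathy rule: points above <R> = 2.785*<H> - 1.151
    are predicted natively unfolded / intrinsically disordered."""
    return mean_net_charge(seq) > 2.785 * mean_hydropathy(seq) - 1.151

print(predicted_disordered("MKKDEEDKKRRSSEEDD" * 3))   # charge-rich toy sequence -> True
```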

    Input variable selection in time-critical knowledge integration applications: A review, analysis, and recommendation paper

    This is the post-print version of the final paper published in Advanced Engineering Informatics. The published article is available from the link below. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. Copyright © 2013 Elsevier B.V.

    The purpose of this research is twofold: first, to undertake a thorough appraisal of existing Input Variable Selection (IVS) methods within the context of time-critical and computation resource-limited dimensionality reduction problems; second, to demonstrate improvements to, and the application of, a recently proposed time-critical sensitivity analysis method called EventTracker to an environmental science industrial use-case, i.e., sub-surface drilling. Producing time-critical accurate knowledge about the state of a system (effect) under computational and data acquisition (cause) constraints is a major challenge, especially if the knowledge required is critical to the system operation where the safety of operators or integrity of costly equipment is at stake. Understanding and interpreting a chain of interrelated events, predicted or unpredicted, that may or may not result in a specific state of the system, is the core challenge of this research. The main objective is then to identify which set of input data signals has a significant impact on the set of system state information (i.e., output). Through a cause-effect analysis technique, the proposed technique supports the filtering of unsolicited data that can otherwise clog up the communication and computational capabilities of a standard supervisory control and data acquisition system. The paper analyzes the performance of input variable selection techniques from a series of perspectives. It then expands the categorization and assessment of sensitivity analysis methods in a structured framework that takes into account the relationship between inputs and outputs, the nature of their time series, and the computational effort required. The outcome of this analysis is that established methods have limited suitability for use in time-critical variable selection applications. By way of a geological drilling monitoring scenario, the suitability of the proposed EventTracker sensitivity analysis method for use in high-volume and time-critical input variable selection problems is demonstrated.
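    As a generic illustration of the input variable selection task (not the EventTracker method itself), the sketch below ranks candidate input signals by their estimated mutual information with an output signal; the signal names and data are invented for the example.

```python
# Rank candidate input signals against an output signal (generic stand-in).
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def rank_inputs(X, y, names):
    """X: (n_samples, n_inputs) input signals sampled over time;
    y: (n_samples,) output / system-state signal.
    Returns input names sorted by estimated mutual information with y."""
    mi = mutual_info_regression(X, y, random_state=0)
    return sorted(zip(names, mi), key=lambda p: -p[1])

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 500)
x1 = np.sin(t) + 0.1 * rng.standard_normal(t.size)    # relevant input signal
x2 = rng.standard_normal(t.size)                       # irrelevant input signal
y = np.sin(t) ** 2 + 0.05 * rng.standard_normal(t.size)
print(rank_inputs(np.column_stack([x1, x2]), y, ["x1", "x2"]))
```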

    Record Linkage and Matching Problems in Forensics

    In forensics, evidence such as DNA, fingerprints, bullets and cartridge cases, shoeprints, or digital evidence is often compared to infer whether items come from the same or different sources. This helps to generate leads through database searches, where information from different investigations can be combined if pieces of evidence are judged to have come from the same source. For specific pairs of comparisons, such as whether a particular cartridge case comes from a suspect's gun, an inference of a match can also be used as testimony in court. We demonstrate how such matching problems fit into the record linkage framework commonly used in statistics and computer science, illustrating this with examples from DNA and firearms identification. We propose some ways that record linkage can inform forensic matching. Finally, we develop methodology to match accounts on anonymous marketplaces. In forensic matching, the stakes are high and the consequences of false arrests or wrongful convictions are severe. The field would benefit from a more principled way of developing matching methods.
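    A minimal sketch of how such comparisons map onto a Fellegi-Sunter-style record linkage score: each compared field contributes a log-likelihood-ratio weight for agreement or disagreement. The field names and m/u probabilities below are invented for illustration, not estimates from forensic data.

```python
# Fellegi-Sunter-style match scoring sketch with hypothetical fields.
import math

# Hypothetical per-field agreement probabilities:
#   m = P(fields agree | same source), u = P(fields agree | different sources)
FIELDS = {
    "caliber":       {"m": 0.98, "u": 0.30},
    "rifling_class": {"m": 0.95, "u": 0.10},
    "breech_marks":  {"m": 0.90, "u": 0.01},
}

def match_weight(record_a, record_b):
    """Sum of log2 likelihood-ratio weights over compared fields;
    large positive totals favour a common source."""
    total = 0.0
    for field, p in FIELDS.items():
        if record_a.get(field) == record_b.get(field):
            total += math.log2(p["m"] / p["u"])
        else:
            total += math.log2((1 - p["m"]) / (1 - p["u"]))
    return total

a = {"caliber": "9mm", "rifling_class": "6R", "breech_marks": "pattern_17"}
b = {"caliber": "9mm", "rifling_class": "6R", "breech_marks": "pattern_17"}
print(match_weight(a, b))   # high positive score -> evidence of same source
```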
