The Thermodynamics of Network Coding, and an Algorithmic Refinement of the Principle of Maximum Entropy
The principle of maximum entropy (Maxent) is often used to obtain prior
probability distributions, serving as a method to derive a Gibbs measure under
some restriction that gives the probability of a system being in a certain
state relative to the rest of the elements in the distribution. Because
classical entropy-based Maxent conflates all distinct degrees of randomness
and pseudo-randomness, here we take into consideration the generative
mechanisms of the systems in the ensemble, separating objects that may comply
with the principle under some restriction and whose entropy is maximal but
that can be generated recursively from those that are actually algorithmically
random, thereby offering a refinement of classical Maxent. We take advantage
of a causal algorithmic calculus to derive a thermodynamic-like result based
on how difficult it is to reprogram a computer code. Using the distinction
between computable and algorithmic randomness, we quantify the cost in
information loss associated with reprogramming. To illustrate this, we apply
the algorithmic refinement of Maxent to graphs and introduce a Maximal
Algorithmic Randomness Preferential Attachment (MARPA) algorithm, a
generalisation of previous approaches. We discuss practical implications for
the evaluation of network randomness. Our analysis suggests that the
reprogrammability asymmetry appears to originate from a non-monotonic
relationship to algorithmic probability, and it motivates further study of
the origin and consequences of these asymmetries, of reprogrammability, and
of computation.
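As a minimal illustration of the classical side of this construction (not the authors' algorithmic refinement), the Gibbs measure arising from Maxent under a mean constraint can be found by bisecting on the Lagrange multiplier; the die-face state space, bisection bounds, and iteration count below are assumptions for the sketch.

```python
import math

def gibbs(beta, states):
    # Unnormalised Gibbs weights e^{-beta * s} and their partition sum.
    w = [math.exp(-beta * s) for s in states]
    z = sum(w)
    return [x / z for x in w]

def maxent_with_mean(states, target_mean, lo=-50.0, hi=50.0, iters=200):
    # Bisect on the multiplier beta so the Gibbs mean matches the target.
    for _ in range(iters):
        mid = (lo + hi) / 2
        p = gibbs(mid, states)
        mean = sum(s * q for s, q in zip(states, p))
        # The Gibbs mean decreases in beta: too-large mean -> increase beta.
        if mean > target_mean:
            lo = mid
        else:
            hi = mid
    return gibbs((lo + hi) / 2, states)

# Toy example: die faces constrained to a mean of 4.5.
states = [1, 2, 3, 4, 5, 6]
p = maxent_with_mean(states, 4.5)
```

Because the target mean exceeds the uniform mean of 3.5, the resulting distribution tilts toward the larger faces, as the Gibbs form requires.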
People on Drugs: Credibility of User Statements in Health Communities
Online health communities are a valuable source of information for patients
and physicians. However, such user-generated resources are often plagued by
inaccuracies and misinformation. In this work we propose a method for
automatically establishing the credibility of user-generated medical statements
and the trustworthiness of their authors by exploiting linguistic cues and
distant supervision from expert sources. To this end we introduce a
probabilistic graphical model that jointly learns user trustworthiness,
statement credibility, and language objectivity. We apply this methodology to
the task of extracting rare or unknown side-effects of medical drugs, one of
the problems where large-scale non-expert data has the potential to complement
expert medical knowledge. We show that our method can reliably extract
side-effects and filter out false statements, while identifying trustworthy
users who are likely to contribute valuable medical information.
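A minimal sketch of the mutual-reinforcement idea behind such joint models, assuming a toy bipartite user-statement graph and hypothetical expert labels as distant supervision (an iterative fixed-point propagation, not the probabilistic graphical model from the paper):

```python
# Toy bipartite graph: which user asserted which statement (hypothetical data).
asserts = {
    "alice": ["s1", "s2"],
    "bob":   ["s2", "s3"],
    "carol": ["s3"],
}
# Distant supervision: expert-confirmed truth values for a few statements.
expert = {"s1": 1.0, "s3": 0.0}

users = list(asserts)
statements = sorted({s for ss in asserts.values() for s in ss})
trust = {u: 0.5 for u in users}  # uninformative prior trustworthiness

for _ in range(50):
    # Statement credibility: mean trust of its asserters, overridden by experts.
    cred = {}
    for s in statements:
        ts = [trust[u] for u in users if s in asserts[u]]
        cred[s] = expert.get(s, sum(ts) / len(ts))
    # User trustworthiness: mean credibility of the statements they asserted.
    trust = {u: sum(cred[s] for s in asserts[u]) / len(asserts[u])
             for u in users}
```

At the fixed point, users who asserted expert-confirmed statements end up more trusted than users who asserted refuted ones, which is the qualitative behaviour the joint model formalises.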
Metrics for Graph Comparison: A Practitioner's Guide
Comparison of graph structure is a ubiquitous task in data analysis and
machine learning, with diverse applications in fields such as neuroscience,
cyber security, social network analysis, and bioinformatics, among others.
Discovery and comparison of structures such as modular communities, rich clubs,
hubs, and trees in these data yield insight into the generative mechanisms and
functional properties of the graph.
Often, two graphs are compared via a pairwise distance measure, with a small
distance indicating structural similarity and vice versa. Common choices
include spectral distances and distances based on node affinities. However,
there has as yet been no comparative study
of the efficacy of these distance measures in discerning between common graph
topologies and different structural scales.
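For concreteness, a spectral distance of the kind discussed above can be sketched with plain numpy (a generic adjacency-spectrum distance, not any particular measure from the study; the two three-node graphs are toy examples):

```python
import numpy as np

def spectral_distance(A1, A2, k=None):
    # Euclidean distance between the top-k sorted adjacency eigenvalues.
    e1 = np.sort(np.linalg.eigvalsh(A1))[::-1]
    e2 = np.sort(np.linalg.eigvalsh(A2))[::-1]
    k = k or min(len(e1), len(e2))
    return float(np.linalg.norm(e1[:k] - e2[:k]))

# Path P3 vs. triangle C3: same node count, different structure.
path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)
tri  = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], float)
d = spectral_distance(path, tri)
```

The distance is zero for identical graphs and grows with structural dissimilarity, which is the behaviour any such pairwise measure is expected to exhibit.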
In this work, we compare commonly used graph metrics and distance measures,
and demonstrate their ability to discern between common topological features
found in both random graph models and empirical datasets. We put forward a
multi-scale picture of graph structure, in which the effect of global and local
structure upon the distance measures is considered. We make recommendations on
the applicability of different distance measures to empirical graph data
problems based on this multi-scale view. Finally, we introduce the Python
library NetComp, which implements the graph distances used in this work.
Canine Genomics and Genetics: Running with the Pack
The domestication of the dog from its wolf ancestors is perhaps the most complex genetic experiment in history, and certainly the most extensive. Beginning with the wolf, man has created dog breeds that are hunters or herders, big or small, lean or squat, and independent or loyal. Most breeds were established in the 1800s by dog fanciers, using a small number of founders that featured traits of particular interest. Popular sire effects, population bottlenecks, and strict breeding programs designed to expand populations with desirable traits led to the development of what are now closed breeding populations, with limited phenotypic and genetic heterogeneity, but which are ideal for genetic dissection of complex traits. In this review, we first discuss the advances in mapping and sequencing that accelerated the field in recent years. We then highlight findings of interest related to disease gene mapping and population structure. Finally, we summarize novel results on the genetics of morphologic variation.
Batch Testing, Adaptive Algorithms, and Heuristic Applications for Stable Marriage Problems
In this dissertation we focus on different variations of the stable matching (marriage) problem, initially posed by Gale and Shapley in 1962. In this problem, preference lists are used to match n men with n women in such a way that no (man, woman) pair exists that would both prefer each other over their current partners. These two would be considered a blocking pair, preventing a matching from being considered stable. In our research, we study three different versions of this problem. First, we consider batch testing of stable marriage solutions. Gusfield and Irving presented an open problem in their 1989 book The Stable Marriage Problem: Structure and Algorithms on whether, given a reasonable amount of preprocessing time, stable matching solutions could be verified in less than O(n^2) time. We answer this question affirmatively, showing an algorithm that will verify k different matchings in O((m + kn) log^2 n) time. Second, we show how the concept of an adaptive algorithm can be used to speed up running time in certain cases of the stable marriage problem where the disorder present in preference lists is limited. While a problem with identical lists can be solved in a trivial O(n) running time, we present an O(n+k) time algorithm where the women have identical preference lists, and the men have preference lists that differ in k positions from a set of identical lists. We also show a visualization program for better understanding the effects of changes in preference lists. Finally, we look at preference list based matching as a heuristic for cost based matching problems. In theory, this method can lead to arbitrarily bad solutions, but through empirical testing on different types of random sources of data, we show how to obtain reasonable results in practice using methods for generating preference lists "asymmetrically" that account for long-term ramifications of short-term decisions.
We also discuss several ways to measure the stability of a solution and how this might be used for bicriteria optimization approaches based on both cost and stability.
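The Gale-Shapley deferred-acceptance procedure underlying all of these variants can be sketched as follows (a textbook version, not the dissertation's batch-verification or adaptive algorithms; the two-couple instance is a made-up example):

```python
def gale_shapley(men_prefs, women_prefs):
    # Men propose in preference order; each woman holds her best offer so far.
    rank = {w: {m: i for i, m in enumerate(p)} for w, p in women_prefs.items()}
    free = list(men_prefs)                 # men not yet matched
    next_prop = {m: 0 for m in men_prefs}  # next woman each man proposes to
    engaged = {}                           # woman -> man
    while free:
        m = free.pop()
        w = men_prefs[m][next_prop[m]]
        next_prop[m] += 1
        if w not in engaged:
            engaged[w] = m
        elif rank[w][m] < rank[w][engaged[w]]:
            free.append(engaged[w])        # woman trades up; old partner freed
            engaged[w] = m
        else:
            free.append(m)                 # proposal rejected, m tries again
    return {m: w for w, m in engaged.items()}

men = {"a": ["x", "y"], "b": ["y", "x"]}
women = {"x": ["a", "b"], "y": ["b", "a"]}
match = gale_shapley(men, women)
```

The returned matching admits no blocking pair in the sense defined above: no man and woman both prefer each other to their assigned partners.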
Optimal Adaptation Principles In Neural Systems
Animal brains are remarkably efficient in handling complex computational tasks, which are intractable even for state-of-the-art computers. For instance, our ability to detect visual objects in the presence of substantial variability and clutter surpasses any algorithm. This ability seems even more surprising given the noisiness and biophysical constraints of neural circuits. This thesis focuses on understanding the theoretical principles governing how neural systems, at various scales, are adapted to the structure of their environment in order to interact with it and perform information processing tasks efficiently. Here, we study this question in three very different and challenging scenarios: i) how a sensory neural circuit, the olfactory pathway, is organised to efficiently process odour stimuli in a very high-dimensional space with complex structure; ii) how individual neurons in the sensory periphery exploit the structure in a fast-changing environment to utilise their dynamic range efficiently; iii) how the auditory system of whole organisms is able to efficiently exploit temporal structure in a noisy, fast-changing environment to optimise perception of ambiguous sounds. We also study the theoretical issues in developing principled measures of model complexity and extending classical complexity notions to explicitly account for the scale/resolution at which we observe a system.
Rule Mining and Sequential Pattern Based Predictive Modeling with EMR Data
Electronic medical record (EMR) data is collected on a daily basis at hospitals and other healthcare facilities to track patients' health situations including conditions, treatments (medications, procedures), diagnostics (labs) and associated healthcare operations. Besides being useful for individual patient care and hospital operations (e.g., billing, triaging), EMRs can also be exploited for secondary data analyses to glean discriminative patterns that hold across patient cohorts for different phenotypes. These patterns in turn can yield high level insights into disease progression with interventional potential. In this dissertation, using a large scale realistic EMR dataset of over one million patients visiting University of Kentucky healthcare facilities, we explore data mining and machine learning methods for association rule (AR) mining and predictive modeling with mood and anxiety disorders as use-cases. Our first work involves analysis of existing quantitative measures of rule interestingness to assess how they align with a practicing psychiatrist's sense of novelty/surprise corresponding to ARs identified from EMRs. Our second effort involves mining causal ARs with depression and anxiety disorders as target conditions through matching methods accounting for computationally identified confounding attributes. Our final effort involves efficient implementation (via GPUs) and application of contrast pattern mining to predictive modeling for mental conditions using various representational methods and recurrent neural networks. Overall, we demonstrate the effectiveness of rule mining methods in secondary analyses of EMR data for identifying causal associations and building predictive models for diseases.
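The association-rule mining step can be illustrated with a minimal support/confidence computation over hypothetical patient records (not the dissertation's causal or GPU-based methods; the codes and thresholds are made up for the sketch):

```python
from itertools import combinations

def mine_rules(transactions, min_support=0.5, min_conf=0.7):
    # Enumerate small itemsets, keep frequent ones, emit rules X -> y
    # whose support and confidence clear the thresholds.
    n = len(transactions)
    def support(itemset):
        return sum(itemset <= t for t in transactions) / n
    items = sorted(set().union(*transactions))
    rules = []
    for size in (2, 3):
        for combo in combinations(items, size):
            s = support(frozenset(combo))
            if s < min_support:
                continue
            for y in combo:
                x = frozenset(combo) - {y}
                conf = s / support(x)
                if conf >= min_conf:
                    rules.append((tuple(sorted(x)), y,
                                  round(s, 2), round(conf, 2)))
    return rules

# Hypothetical patient records as sets of condition/medication codes.
records = [frozenset(t) for t in [
    {"anxiety", "ssri"}, {"anxiety", "ssri", "insomnia"},
    {"depression", "ssri"}, {"anxiety", "insomnia"},
]]
rules = mine_rules(records)
```

Each emitted tuple is (antecedent, consequent, support, confidence); interestingness measures of the kind evaluated in the dissertation would then be applied on top of such rules.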
Improving Structural Features Prediction in Protein Structure Modeling
Proteins play a vital role in the biological activities of all living species. In nature, a protein folds into a specific and energetically favorable three-dimensional structure which is critical to its biological function. Hence, there has been a great effort by researchers in both experimentally determining and computationally predicting the structures of proteins.
The current experimental methods of protein structure determination are complicated, time-consuming, and expensive. On the other hand, the sequencing of proteins is fast, simple, and relatively less expensive. Thus, the gap between the number of known sequences and the number of determined structures is growing, and is expected to keep expanding. In contrast, computational approaches that can generate three-dimensional protein models with high resolution are attractive, due to their broad economic and scientific impacts. Accurately predicting protein structural features, such as secondary structures, disulfide bonds, and solvent accessibility, is a critical intermediate stepping stone toward ultimately obtaining correct three-dimensional models.
In this dissertation, we report a set of approaches for improving the accuracy of structural features prediction in protein structure modeling. First of all, we derive a statistical model to generate context-based scores characterizing the favorability of segments of residues in adopting certain structural features. Then, together with other information such as evolutionary and sequence information, we incorporate the context-based scores in machine learning approaches to predict secondary structures, disulfide bonds, and solvent accessibility. Furthermore, we take advantage of the emerging high performance computing architectures in GPU to accelerate the calculation of pairwise and high-order interactions in context-based scores. Finally, we make these prediction methods available to the public via web services and software packages.
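The notion of a context-based score can be illustrated with a simplified log-odds sketch over residue segments, assuming tiny hypothetical training segments rather than the statistical model derived in the dissertation:

```python
import math
from collections import Counter

def context_scores(labeled_segments, background_segments, k=2):
    # Log-odds of each k-mer of residues appearing in labeled vs.
    # background segments, with add-one smoothing for unseen k-mers.
    def kmer_counts(segs):
        c = Counter()
        for s in segs:
            for i in range(len(s) - k + 1):
                c[s[i:i + k]] += 1
        return c
    fg = kmer_counts(labeled_segments)
    bg = kmer_counts(background_segments)
    n_fg, n_bg = sum(fg.values()), sum(bg.values())
    vocab = set(fg) | set(bg)
    return {m: math.log(((fg[m] + 1) / (n_fg + len(vocab))) /
                        ((bg[m] + 1) / (n_bg + len(vocab))))
            for m in vocab}

# Hypothetical helix-labeled vs. background residue segments.
helix = ["AALA", "ALAA"]
other = ["GPGP", "PGPG"]
scores = context_scores(helix, other)
```

Segments rich in k-mers with positive scores are favorable for the labeled feature, which captures the "favorability of segments of residues" idea in a deliberately reduced form.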