422 research outputs found
Incorporating prior information into association studies.
UnlabelledRecent technological developments in measuring genetic variation have ushered in an era of genome-wide association studies which have discovered many genes involved in human disease. Current methods to perform association studies collect genetic information and compare the frequency of variants in individuals with and without the disease. Standard approaches do not take into account any information on whether or not a given variant is likely to have an effect on the disease. We propose a novel method for computing an association statistic which takes into account prior information. Our method improves both power and resolution by 8% and 27%, respectively, over traditional methods for performing association studies when applied to simulations using the HapMap data. Advantages of our method are that it is as simple to apply to association studies as standard methods, the results of the method are interpretable as the method reports p-values, and the method is optimal in its use of prior information in regards to statistical power.AvailabilityThe method presented herein is available at http://masa.cs.ucla.edu
Behavior-Based Outlier Detection for Network Access Control Systems
Network Access Control (NAC) systems manage the access of new devices into enterprise networks to prevent unauthorised devices from attacking network services. The main difficulty with this approach is that NAC cannot detect abnormal behaviour of devices connected to an enterprise network. These abnormal devices can be detected using outlier detection techniques. Existing outlier detection techniques focus on specific application domains such as fraud, event or system health monitoring. In this paper, we review attacks on Bring Your Own Device (BYOD) enterprise networks as well as existing clustering-based outlier detection algorithms along with their limitations. Importantly, existing techniques can detect outliers, but cannot detect where or which device is causing the abnormal behaviour. We develop a novel behaviour-based outlier detection technique which detects abnormal behaviour according to a device type profile. Based on data analysis with K-means clustering, we build device type profiles using Clustering-based Multivariate Gaussian Outlier Score (CMGOS) and filter out abnormal devices from the device type profile. The experimental results show the applicability of our approach as we can obtain a device type profile for five dell-netbooks, three iPads, two iPhone 3G, two iPhones 4G and Nokia Phones and detect outlying devices within the device type profile
Combining Strategies for Extracting Relations from Text Collections
Text documents often contain valuable structured data that is hidden in regular English sentences. This data is best exploited if available as a relational table that we could use for answering precise queries or for running data mining tasks. Our Snowball system extracts these relations from document collections starting with only a handful of user-provided example tuples. Based on these tuples, Snowball generates patterns that are used, in turn, to find more tuples. In this paper we introduce a new pattern and tuple generation scheme for Snowball, with different strengths and weaknesses than those of our original system. We also show preliminary results on how we can combine the two versions of Snowball to extract tuples more accurately
Multiple testing correction in linear mixed models.
BackgroundMultiple hypothesis testing is a major issue in genome-wide association studies (GWAS), which often analyze millions of markers. The permutation test is considered to be the gold standard in multiple testing correction as it accurately takes into account the correlation structure of the genome. Recently, the linear mixed model (LMM) has become the standard practice in GWAS, addressing issues of population structure and insufficient power. However, none of the current multiple testing approaches are applicable to LMM.ResultsWe were able to estimate per-marker thresholds as accurately as the gold standard approach in real and simulated datasets, while reducing the time required from months to hours. We applied our approach to mouse, yeast, and human datasets to demonstrate the accuracy and efficiency of our approach.ConclusionsWe provide an efficient and accurate multiple testing correction approach for linear mixed models. We further provide an intuition about the relationships between per-marker threshold, genetic relatedness, and heritability, based on our observations in real data
Recommended from our members
MET: An Experimental System for Malicious Email Tracking
Despite the use of state of the art methods to protect against malicious programs, they continue to threaten and damage computer systems around the world. In this paper we present MET, the Malicious Email Tracking system, designed to automatically report statistics on the flow behavior of malicious software delivered via email attachments both at a local and global level. MET can help reduce the spread of malicious software worldwide, especially self-replicating viruses, as well as provide further insight toward minimizing damage caused by malicious programs in the future. In addition, the system can help system administrators detect all of the points of entry of a malicious email into a network. The core of MET's operation is a database of statistics about the trajectory of email attachments in and out of a network system, and the culling together of these statistics across networks to present a global view of the spread of the malicious software. From a statistical perspective sampling only a small amount of traffic (for example, .1 %) of a very large email stream is sufficient to detect suspicious or otherwise new email viruses that may be undetected by standard signature-based scanners. Therefore, relatively few MET installations would be necessary to gather sufficient data in order to provide broad protection services. Small scale simulations are presented to demonstrate MET in operation and suggests how detection of new virus propagations via flow statistics can be automated
Assembly of non-unique insertion content using next-generation sequencing
Recent studies in genomics have highlighted the significance of sequence insertions in determining individual variation. Efforts to discover the content of these sequence insertions have been limited to short insertions and long unique insertions. Much of the inserted sequence in the typical human genome, however, is a mixture of repeated and unique sequence. Current methods are designed to assemble only unique sequence insertions, using reads that do not map to the reference. These methods are not able to assemble repeated sequence insertions, as the reads will map to the reference in a different locus
Identification of causal genes for complex traits.
MotivationAlthough genome-wide association studies (GWAS) have identified thousands of variants associated with common diseases and complex traits, only a handful of these variants are validated to be causal. We consider 'causal variants' as variants which are responsible for the association signal at a locus. As opposed to association studies that benefit from linkage disequilibrium (LD), the main challenge in identifying causal variants at associated loci lies in distinguishing among the many closely correlated variants due to LD. This is particularly important for model organisms such as inbred mice, where LD extends much further than in human populations, resulting in large stretches of the genome with significantly associated variants. Furthermore, these model organisms are highly structured and require correction for population structure to remove potential spurious associations.ResultsIn this work, we propose CAVIAR-Gene (CAusal Variants Identification in Associated Regions), a novel method that is able to operate across large LD regions of the genome while also correcting for population structure. A key feature of our approach is that it provides as output a minimally sized set of genes that captures the genes which harbor causal variants with probability ρ. Through extensive simulations, we demonstrate that our method not only speeds up computation, but also have an average of 10% higher recall rate compared with the existing approaches. We validate our method using a real mouse high-density lipoprotein data (HDL) and show that CAVIAR-Gene is able to identify Apoa2 (a gene known to harbor causal variants for HDL), while reducing the number of genes that need to be tested for functionality by a factor of 2.Availability and implementationSoftware is freely available for download at genetics.cs.ucla.edu/caviar
Recommended from our members
A linear mixed model approach to gene expression-tumor aneuploidy association studies.
Aneuploidy, defined as abnormal chromosome number or somatic DNA copy number, is a characteristic of many aggressive tumors and is thought to drive tumorigenesis. Gene expression-aneuploidy association studies have previously been conducted to explore cellular mechanisms associated with aneuploidy. However, in an observational setting, gene expression is influenced by many factors that can act as confounders between gene expression and aneuploidy, leading to spurious correlations between the two variables. These factors include known confounders such as sample purity or batch effect, as well as gene co-regulation which induces correlations between the expression of causal genes and non-causal genes. We use a linear mixed-effects model (LMM) to account for confounding effects of tumor purity and gene co-regulation on gene expression-aneuploidy associations. When applied to patient tumor data across diverse tumor types, we observe that the LMM both accounts for the impact of purity on aneuploidy measurements and identifies a new association between histone gene expression and aneuploidy
- …