MalGrid: Visualization Of Binary Features In Large Malware Corpora
The number of malware samples is constantly on the rise. Though most new samples
are modifications of existing ones, their sheer number is overwhelming. In
this paper, we present a novel system to visualize and map millions of malware
samples to points in a 2-dimensional (2D) spatial grid. This enables visualizing
relationships within large malware datasets that can be used to develop triage
solutions to screen different malware rapidly and provide situational
awareness. Our approach links two visualizations within an interactive display.
Our first view is a spatial point-based visualization of similarity among the
samples based on a reduced dimensional projection of binary feature
representations of malware. Our second spatial grid-based view provides a
better insight into similarities and differences between selected malware
samples in terms of the binary-based visual representations they share. We also
provide a case study where the effect of packing on the malware data is
correlated with the complexity of the packing algorithm.
Comment: Submitted version - MILCOM 2022 IEEE Military Communications
Conference. The high-quality images in this paper can be found on GitHub
(https://github.com/Mayachitra-Inc/MalGrid)
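The idea of placing samples on a 2D spatial grid can be illustrated with a minimal sketch. It assumes the samples have already been projected to 2D coordinates by some dimensionality-reduction step and simply quantizes those coordinates into grid cells. The function name `to_grid`, the `grid_size` parameter, and the uniform binning scheme are illustrative assumptions, not the MalGrid implementation.

```python
from collections import defaultdict


def to_grid(points, grid_size):
    """Assign each 2D point to a cell of a grid_size x grid_size grid.

    Hypothetical helper: uniform binning over the bounding box of the points.
    Returns a dict mapping (column, row) cells to lists of sample indices.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)

    def cell_index(value, lo, hi):
        if hi == lo:  # degenerate axis: everything falls in cell 0
            return 0
        idx = int((value - lo) / (hi - lo) * grid_size)
        return min(idx, grid_size - 1)  # clamp the maximum onto the last cell

    cells = defaultdict(list)
    for sample_id, (x, y) in enumerate(points):
        key = (cell_index(x, x_min, x_max), cell_index(y, y_min, y_max))
        cells[key].append(sample_id)
    return dict(cells)


# Made-up 2D coordinates standing in for projected binary features.
points = [(0.0, 0.0), (0.1, 0.05), (1.0, 1.0), (0.9, 0.2)]
grid = to_grid(points, grid_size=4)
```

Samples that land in the same cell are candidates for side-by-side comparison, which is the kind of triage view the abstract describes.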
PLAST-ncRNA: Partition function Local Alignment Search Tool for non-coding RNA sequences
Alignment-based programs are valuable tools for finding potential homologs in genome sequences. Previously, it has been shown that partition function posterior probabilities attuned to local alignment achieve a high accuracy in identifying distantly similar non-coding RNA sequences that are hidden in a large genome. Here, we present an online implementation of that alignment algorithm based on such probabilities. Our server takes as input a query RNA sequence and a large genome sequence, and outputs a list of hits that are above a mean posterior probability threshold. The output is presented in a format suited to local alignment. It can also be viewed within the PLAST alignment viewer applet that provides a list of all hits found and highlights regions of high posterior probability within each local alignment. The server is freely available at http://plastrna.njit.edu
GENIE: a software package for gene-gene interaction analysis in genetic association studies using multiple GPU or CPU cores
Background: Gene-gene interaction analysis in genetic association studies is computationally intensive when a large number of SNPs are involved. Most of the latest Central Processing Units (CPUs) have multiple cores, and Graphics Processing Units (GPUs), which have hundreds of cores, have recently been used to implement faster scientific software. However, there are currently no genetic analysis software packages that allow users to fully utilize the computing power of these multi-core devices for genetic interaction analysis of binary traits. Findings: Here we present a novel software package, GENIE, which utilizes the power of multiple GPU or CPU processor cores to parallelize the interaction analysis. GENIE reads an entire genetic association study dataset into memory and partitions the dataset into fragments with non-overlapping sets of SNPs. For each fragment, GENIE analyzes: 1) the interaction of SNPs within it in parallel, and 2) the interaction between the SNPs of the current fragment and other fragments in parallel. We tested GENIE on a large-scale candidate gene study on high-density lipoprotein cholesterol. Using an NVIDIA Tesla C1060 graphics card, the GPU mode of GENIE achieves a speedup of 27 times over its single-core CPU mode run. Conclusions: GENIE is open-source, economical, user-friendly, and scalable. Since the computing power and memory capacity of graphics cards are increasing rapidly while their cost is going down, we anticipate that GENIE will achieve greater speedups with faster GPU cards. Documentation, source code, and precompiled binaries can be downloaded from http://www.cceb.upenn.edu/~mli/software/GENIE/.
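The fragment scheme described above can be sketched as follows. This is a minimal illustration of how non-overlapping fragments give rise to within-fragment and between-fragment SNP pairs, not GENIE's actual code. The function names, the consecutive-index partitioning, and the choice to pair each fragment only with later fragments (so that no pair is counted twice) are assumptions made for the example.

```python
from itertools import combinations, product


def partition_snps(n_snps, fragment_size):
    """Split SNP indices 0..n_snps-1 into consecutive, non-overlapping fragments."""
    return [list(range(start, min(start + fragment_size, n_snps)))
            for start in range(0, n_snps, fragment_size)]


def interaction_tasks(fragments):
    """Yield (kind, i, j, pairs) for every within- and between-fragment task.

    Each task is independent of the others, so tasks could be handed to
    separate CPU or GPU workers.
    """
    for i, frag in enumerate(fragments):
        # 1) SNP pairs inside the current fragment
        yield ("within", i, i, list(combinations(frag, 2)))
        # 2) SNP pairs between the current fragment and each later fragment
        for j in range(i + 1, len(fragments)):
            yield ("between", i, j, list(product(frag, fragments[j])))


fragments = partition_snps(n_snps=10, fragment_size=4)
tasks = list(interaction_tasks(fragments))
all_pairs = [pair for _, _, _, pairs in tasks for pair in pairs]
```

With 10 SNPs the tasks together cover each of the 10 * 9 / 2 = 45 unordered SNP pairs exactly once, which is the property a parallel scheduler would rely on.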
Hydrogen bond networks determine emergent mechanical and thermodynamic properties across a protein family
Background: Gram-negative bacteria use periplasmic-binding proteins (bPBP) to transport nutrients through the periplasm. Despite immense diversity within the recognized substrates, all members of the family share a common fold that includes two domains separated by a conserved hinge. The hinge allows the protein to cycle between open (apo) and closed (ligated) conformations. Conformational changes within the proteins depend on a complex interplay of mechanical and thermodynamic response, which is manifested as an increase in thermal stability and a decrease in flexibility upon ligand binding. Results: We use a distance constraint model (DCM) to quantify the give and take between thermodynamic stability and mechanical flexibility across the bPBP family. Quantitative stability/flexibility relationships (QSFR) are readily evaluated because the DCM links mechanical and thermodynamic properties. We have previously demonstrated that QSFR is moderately conserved across a mesophilic/thermophilic RNase H pair, whereas the observed variance indicated that different enthalpy-entropy mechanisms allow similar mechanical response at their respective melting temperatures. Our predictions of heat capacity and free energy show marked diversity across the bPBP family. While backbone flexibility metrics are mostly conserved, cooperativity correlation (long-range couplings) also demonstrates a considerable amount of variation. Upon ligand removal, heat capacity, melting point, and mechanical rigidity are, as expected, lowered. Nevertheless, significant differences are found in molecular cooperativity correlations that can be explained by the detailed nature of the hydrogen bond network. Conclusion: Non-trivial mechanical and thermodynamic variation across the family is explained by differences within the underlying H-bond networks. The mechanism is simple: variation within the H-bond networks results in altered mechanical linkage properties that directly affect intrinsic flexibility. Moreover, varying numbers of H-bonds and their strengths control the likelihood of energetic fluctuations as H-bonds break and reform, thus directly affecting thermodynamic properties. Consequently, these results demonstrate how unexpectedly large differences, especially within cooperativity correlation, emerge from subtle differences within the underlying H-bond network. This inference is consistent with well-known results showing that allosteric response within a family generally varies significantly. Identifying the hydrogen bond network as a critical determining factor for these large variances may lead to new methods that can predict such effects.
MSACompro: protein multiple sequence alignment using predicted secondary structure, solvent accessibility, and residue-residue contacts
Background: Multiple Sequence Alignment (MSA) is a basic tool for bioinformatics research and analysis. It is used in almost all bioinformatics tasks, such as protein structure modeling, gene and protein function prediction, DNA motif recognition, and phylogenetic analysis. Therefore, improving the accuracy of multiple sequence alignment is important for advancing many bioinformatics fields. Results: We designed and developed a new method, MSACompro, to synergistically incorporate predicted secondary structure, relative solvent accessibility, and residue-residue contact information into the currently most accurate posterior probability-based MSA methods to improve the accuracy of multiple sequence alignments. The method differs from multiple sequence alignment methods (e.g. 3D-Coffee) that use the tertiary structure information of some sequences, since the structural information used by our method is fully predicted from sequences. To the best of our knowledge, applying predicted relative solvent accessibility and contact maps to multiple sequence alignment is novel. Rigorous benchmarking of our method on the standard benchmarks (i.e. BAliBASE, SABmark and OXBENCH) clearly demonstrated that incorporating predicted protein structural information improves multiple sequence alignment accuracy over the leading multiple protein sequence alignment tools that do not use this information, such as MSAProbs, ProbCons, Probalign, T-coffee, MAFFT and MUSCLE. The performance of the method is comparable to, though slightly lower than, that of the state-of-the-art method PROMALS, which uses structural features and additional homologous sequences. Conclusion: MSACompro is an efficient and reliable multiple protein sequence alignment tool that can effectively incorporate predicted protein structural information into multiple sequence alignment. The software is available at http://sysbio.rnet.missouri.edu/multicom_toolbox/.
Locally most powerful rank tests for comparison of two failure rates based on multiple type-II censored data
This article deals with locally most powerful rank tests for testing the hypothesis that two failure rates are equal against the alternative that one failure rate is greater than the other, when the combined ordered sample is multiple Type-II censored. A modified version of the Dupač and Hájek (1969) theorem is used to establish their asymptotic normality under fixed alternatives, since the score-generating functions associated with these rank test statistics have a finite number of jump discontinuities. The modified version, which leads to a simpler centering constant, was proved by Dupač (1970) using the results of Hájek (1968). The Pitman AREs of these rank tests based on censored data relative to the corresponding tests based on complete data are obtained under some Lehmann-type alternative distributions such that their failure rates dominate the failure rates of the respective null distributions. The AREs are computed numerically for single (left or right) and double censored data, and the extent of loss due to these censoring schemes is discussed. The rank tests considered here include the Mann-Whitney-Wilcoxon (MWW) test, the Savage test, and the linear combination of these two tests. For all the tests except the MWW test, it is found that the loss of efficiency due to left censoring is considerably less than that due to right censoring. For finite samples, Monte Carlo simulation results showing the empirical levels and empirical powers against some Lehmann alternatives are presented.
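For readers unfamiliar with the tests named above, the following minimal sketch computes a simple linear rank statistic for complete (uncensored) two-sample data with either Wilcoxon or Savage scores. It assumes no ties and does not implement the censored-data versions or the asymptotic theory studied in the article. The function names are illustrative, and the Savage scores use the standard form a(i) = sum of 1/j for j from N-i+1 to N.

```python
def wilcoxon_scores(n_total):
    """Wilcoxon scores: the score of rank i is i itself."""
    return [float(i) for i in range(1, n_total + 1)]


def savage_scores(n_total):
    """Savage scores: a(i) = sum_{j=N-i+1}^{N} 1/j for rank i = 1..N."""
    return [sum(1.0 / j for j in range(n_total - i + 1, n_total + 1))
            for i in range(1, n_total + 1)]


def linear_rank_statistic(x, y, scores):
    """Sum the scores attached to the ranks of the y-sample in the combined sample.

    Assumes all observations are distinct (no ties) and complete (no censoring).
    """
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    return sum(scores[rank]
               for rank, (_, label) in enumerate(combined)
               if label == 1)


# Made-up failure times for two small samples.
x = [1.2, 3.4, 0.7]
y = [2.5, 4.1]
n = len(x) + len(y)

wilcoxon_stat = linear_rank_statistic(x, y, wilcoxon_scores(n))
savage_stat = linear_rank_statistic(x, y, savage_scores(n))
```

Here the y-observations occupy ranks 3 and 5 in the combined sample, so the Wilcoxon statistic is their rank sum. A linear combination of the two tests corresponds to using a weighted sum of the two score vectors.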
Rank Tests for Two-Sample Problems Based on Multiple Type-II Censored Data
In this article, we study the effect of censoring on the asymptotic efficiency of two-sample rank tests based on multiple Type-II censored data. Since the score-generating functions associated with these test statistics have a finite number of jump discontinuities, we use a slightly modified version of a theorem of Dupač and Hájek (1969) to obtain their asymptotic distributions under fixed alternatives. This modified version, which leads to a simpler centering constant, was proved by Dupač (1970) in the light of results of Hoeffding (1968), an earlier version of Hoeffding (1973). Hence, we obtain the Pitman AREs of these rank tests relative to the corresponding tests based on the complete samples. The AREs are computed for some well-known rank tests for two-sample location and scale problems, when the combined ordered samples from different underlying distributions are censored using triple and lower-order Type-II censoring schemes. The effect of all these censoring schemes on the AREs of the different tests is examined numerically. It is found that there is a gain in efficiency due to censoring in many of the cases considered here. This suggests that in such cases it is possible to improve the efficiency of rank tests by discarding suitable portions of the data.