GRASE: Granulometry Analysis with Semi Eager Classifier to Detect Malware
Technological advancement in communication, leading to 5G, motivates everyone to connect to the internet, including 'Devices', a technology named the Web of Things (WoT). The community benefits from this large-scale network, which allows monitoring and controlling of physical devices. But this often comes at the cost of security, as MALicious softWARE (MalWare) developers try to invade the network; for them, these devices are like a 'backdoor' providing easy 'entry'. To stop invaders from entering the network, identifying malware and its variants is of great significance for cyberspace. Traditional methods of malware detection, both static and dynamic, detect malware but fall short against newer techniques used by malware developers, such as obfuscation, polymorphism and encryption. A machine learning approach in which a classifier is trained with handcrafted features is not potent against these techniques and demands substantial feature-engineering effort. This paper proposes malware classification using a visualization methodology wherein the disassembled malware code is transformed into grey images. It presents the efficacy of the Granulometry texture analysis technique for improving malware classification. Furthermore, a Semi Eager (SemiE) classifier, a combination of eager and lazy learning techniques, is used to obtain robust classification of malware families. The outcome of the experiment is promising, since the proposed technique requires less training time to learn the semantics of higher-level malicious behaviours; identifying malware (the testing phase) is also faster. Benchmark databases, Malimg and the Microsoft Malware Classification Challenge (BIG-2015), have been utilized to analyse the performance of the system. Overall average classification accuracies of 99.03% and 99.11%, respectively, are achieved.
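The visualization step can be sketched in a few lines of numpy: raw bytes become a grey-scale image, and a toy granulometry (morphological openings with growing square structuring elements) summarizes its texture. This is an illustrative sketch only; the paper's actual image width, structuring elements, and classifier are not specified here.

```python
import numpy as np

def bytes_to_grey_image(data: bytes, width: int = 16) -> np.ndarray:
    """Reshape a raw byte stream into a 2-D grey-scale image (one byte = one pixel)."""
    arr = np.frombuffer(data, dtype=np.uint8)
    rows = len(arr) // width
    return arr[: rows * width].reshape(rows, width)

def _erode(img, k):
    # Grey-scale erosion with a (2k+1)x(2k+1) square: minimum over shifted copies.
    padded = np.pad(img.astype(np.int16), k, mode="edge")
    h, w = img.shape
    res = np.full((h, w), 255, dtype=np.int16)
    for dy in range(2 * k + 1):
        for dx in range(2 * k + 1):
            res = np.minimum(res, padded[dy : dy + h, dx : dx + w])
    return res

def _dilate(img, k):
    # Grey-scale dilation: maximum over shifted copies.
    padded = np.pad(img.astype(np.int16), k, mode="edge")
    h, w = img.shape
    res = np.zeros((h, w), dtype=np.int16)
    for dy in range(2 * k + 1):
        for dx in range(2 * k + 1):
            res = np.maximum(res, padded[dy : dy + h, dx : dx + w])
    return res

def granulometry(img, max_size=3):
    """Pattern spectrum: total intensity surviving an opening at each scale."""
    return [float(_dilate(_erode(img, k), k).sum()) for k in range(1, max_size + 1)]

img = bytes_to_grey_image(bytes(range(256)) * 4, width=32)
spectrum = granulometry(img)
```

Each entry of the spectrum measures how much image mass survives opening at that scale; the resulting size distribution is the kind of texture feature a granulometry-based classifier consumes.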
Backpropagation Beyond the Gradient
Automatic differentiation is a key enabler of deep learning: previously, practitioners were limited to models
for which they could manually compute derivatives. Now, they can create sophisticated models with almost
no restrictions and train them using first-order, i.e. gradient, information. Popular libraries like PyTorch
and TensorFlow compute this gradient efficiently, automatically, and conveniently with a single line of
code. Under the hood, reverse-mode automatic differentiation, or gradient backpropagation, powers the
gradient computation in these libraries. Their entire design centers around gradient backpropagation.
These frameworks are specialized around one specific task: computing the average gradient in a mini-batch.
This specialization often complicates the extraction of other information like higher-order statistical moments
of the gradient, or higher-order derivatives like the Hessian. It limits practitioners and researchers to methods
that rely on the gradient. Arguably, this hampers the field from exploring the potential of higher-order
information, and there is evidence that focusing solely on the gradient has not led to significant recent
advances in deep learning optimization.
To advance algorithmic research and inspire novel ideas, information beyond the batch-averaged gradient
must be made available at the same level of computational efficiency, automation, and convenience.
This thesis presents approaches to simplify experimentation with rich information beyond the gradient
by making it more readily accessible. We present an implementation of these ideas as an extension to the
backpropagation procedure in PyTorch. Using this newly accessible information, we demonstrate possible use
cases by (i) showing how it can inform our understanding of neural network training by building a diagnostic
tool, and (ii) enabling novel methods to efficiently compute and approximate curvature information.
First, we extend gradient backpropagation for sequential feedforward models to Hessian backpropagation
which enables computing approximate per-layer curvature. This perspective unifies recently proposed block-
diagonal curvature approximations. Like gradient backpropagation, the computation of these second-order
derivatives is modular, and therefore simple to automate and extend to new operations.
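The modular recursion can be illustrated in numpy for two common layer types: a linear map, where H_in = W^T H_out W, and an elementwise activation, where the second derivative of the nonlinearity contributes a diagonal term. The finite-difference check below is a sketch of the idea under these simplifying assumptions, not the thesis' implementation.

```python
import numpy as np

def hbp_linear(W, H_out):
    """Backpropagate the input-space Hessian through z = W @ x: H_in = W^T H_out W."""
    return W.T @ H_out @ W

def hbp_elementwise(x, grad_out, H_out, phi_d1, phi_d2):
    """Backpropagate through z = phi(x) applied elementwise:
    H_in = diag(phi'(x)) H_out diag(phi'(x)) + diag(phi''(x) * grad_out)."""
    d1 = phi_d1(x)
    return d1[:, None] * H_out * d1[None, :] + np.diag(phi_d2(x) * grad_out)

# Toy check for f(x) = sum(sigmoid(W @ x)).
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
x = rng.normal(size=2)
sig = lambda t: 1.0 / (1.0 + np.exp(-t))

def f(v):
    return sig(W @ v).sum()

z = W @ x
# The loss is a plain sum of the activations, so the gradient w.r.t. them is
# all-ones and the Hessian w.r.t. them is zero; backpropagate layer by layer.
H_z = hbp_elementwise(z, np.ones(3), np.zeros((3, 3)),
                      lambda t: sig(t) * (1 - sig(t)),
                      lambda t: sig(t) * (1 - sig(t)) * (1 - 2 * sig(t)))
H_x = hbp_linear(W, H_z)

# Central finite differences as an independent reference.
eps = 1e-4
H_fd = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        e_i, e_j = np.eye(2)[i] * eps, np.eye(2)[j] * eps
        H_fd[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                      - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * eps * eps)
```

Because each layer only needs its local derivatives, the same pattern extends to new operations, which is the modularity the text refers to.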
Based on the insight that rich information beyond the gradient can be computed efficiently and at the
same time, we extend the backpropagation in PyTorch with the BackPACK library. It provides efficient and
convenient access to statistical moments of the gradient and approximate curvature information, often at a
small overhead compared to computing just the gradient.
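As a sketch of the kind of quantity BackPACK exposes, per-sample gradients for plain linear least squares can be computed in one vectorised pass, after which any statistical moment is a reduction over the sample axis. The real library hooks into PyTorch's backward pass; this numpy version is only illustrative.

```python
import numpy as np

def per_sample_gradients(w, X, y):
    """Individual gradients of 0.5 * (x_i @ w - y_i)**2 for every sample i,
    computed in a single vectorised pass."""
    residuals = X @ w - y            # shape (N,)
    return residuals[:, None] * X    # shape (N, D): row i is sample i's gradient

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = rng.normal(size=3)

grads = per_sample_gradients(w, X, y)
grad_mean = grads.mean(axis=0)   # what standard backprop returns
grad_var = grads.var(axis=0)     # second-moment information beyond the gradient
```

The mean recovers the ordinary batch gradient, while the variance is exactly the sort of higher-order statistical moment that standard frameworks discard.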
Next, we showcase the utility of such information to better understand neural network training. We build
the Cockpit library that visualizes what is happening inside the model during training through various
instruments that rely on BackPACK's statistics. We show how Cockpit provides a meaningful statistical
summary report to the deep learning engineer to identify bugs in their machine learning pipeline, guide
hyperparameter tuning, and study deep learning phenomena.
Finally, we use BackPACK's extended automatic differentiation functionality to develop ViViT, an approach
to efficiently compute curvature information, in particular curvature noise. It uses the low-rank structure
of the generalized Gauss-Newton approximation to the Hessian and addresses shortcomings in existing
curvature approximations. Through monitoring curvature noise, we demonstrate how ViViT's information
helps in understanding challenges to make second-order optimization methods work in practice.
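The low-rank structure can be sketched as follows: for a square loss the GGN factorises as G = V V^T with as many columns in V as samples, so its nonzero spectrum can be read off the much smaller Gram matrix V^T V. The numpy toy below assumes random per-sample Jacobians; ViViT itself obtains them from a PyTorch model.

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 5, 50                   # few samples, many parameters
J = rng.normal(size=(N, P))    # stand-in for per-sample Jacobians under a square loss

# GGN: G = (1/N) J^T J = V V^T with V = J^T / sqrt(N); rank at most N << P.
V = J.T / np.sqrt(N)
G = V @ V.T                    # P x P -- expensive to form explicitly

# Low-rank trick: the small N x N Gram matrix shares G's nonzero spectrum.
gram = V.T @ V                 # N x N
evals_gram = np.sort(np.linalg.eigvalsh(gram))[::-1]
evals_full = np.sort(np.linalg.eigvalsh(G))[::-1][:N]
```

Working with the N x N Gram matrix instead of the P x P curvature matrix is what makes exact spectral information affordable when the number of parameters dwarfs the batch size.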
This work develops new tools to experiment more easily with higher-order information in complex deep
learning models. These tools have impacted works on Bayesian applications with Laplace approximations,
out-of-distribution generalization, differential privacy, and the design of automatic differentiation
systems. They constitute one important step towards developing and establishing more efficient deep
learning algorithms.
Automated Mapping of Adaptive App GUIs from Phones to TVs
With the increasing interconnection of smart devices, users often desire to
adopt the same app on quite different devices for identical tasks, such as
watching the same movies on both their smartphones and TV.
However, the significant differences in screen size, aspect ratio, and
interaction styles make it challenging to adapt Graphical User Interfaces
(GUIs) across these devices.
Although there are millions of apps available on Google Play, only a few
thousand are designed to support smart TV displays.
Existing techniques to map a mobile app GUI to a TV either adopt a responsive
design, which struggles to bridge the substantial gap between phone and TV, or
use mirror apps for improved video display, which requires hardware support and
extra engineering efforts.
Instead of developing another app for supporting TVs, we propose a
semi-automated approach to generate corresponding adaptive TV GUIs, given the
phone GUIs as the input.
Based on our empirical study of GUI pairs for TV and phone in existing apps,
we synthesize a list of rules for grouping and classifying phone GUIs,
converting them to TV GUIs, and generating dynamic TV layouts and source code
for the TV display.
Our tool is not only beneficial to developers but also to GUI designers, who
can further customize the generated GUIs for their TV app development.
An evaluation and user study demonstrate the accuracy of our generated GUIs
and the usefulness of our tool. (30 pages, 15 figures)
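A minimal sketch of what one grouping-and-conversion rule might look like is given below; the rule, element kinds, and threshold are hypothetical illustrations, not the paper's synthesized rule list.

```python
from dataclasses import dataclass

@dataclass
class Element:
    kind: str       # e.g. "image", "text", "list" (illustrative kinds)
    x: float        # normalised [0, 1] phone coordinates
    y: float
    w: float
    h: float

def map_to_tv(elements):
    """Toy rule: a full-width vertical list on the phone becomes a horizontal
    'rail' suited to the wide TV screen; other elements keep their placement."""
    mapped = []
    for e in elements:
        if e.kind == "list" and e.w >= 0.9:
            # Swap the element's axes so the list scrolls horizontally on TV.
            mapped.append(Element("rail", e.y, e.x, e.h, e.w))
        else:
            mapped.append(Element(e.kind, e.x, e.y, e.w, e.h))
    return mapped

phone_gui = [Element("image", 0.0, 0.0, 1.0, 0.4),
             Element("list", 0.0, 0.4, 1.0, 0.6)]
tv_gui = map_to_tv(phone_gui)
```

A real rule set would also handle focus navigation and source-code generation, but the same classify-then-transform shape applies.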
CLIM4OMICS: a geospatially comprehensive climate and multi-OMICS database for maize phenotype predictability in the United States and Canada
The performance of numerical, statistical, and data-driven diagnostic and predictive crop production modeling relies heavily on data quality for input and calibration or validation processes. This
study presents a comprehensive database and the analytics used to
consolidate it as a homogeneous, consistent, multidimensional genotypic, phenotypic, and environmental database for maize phenotype modeling, diagnostics, and prediction. The data used are obtained from the Genomes to Fields (G2F) initiative, which provides multiyear genomic (G), environmental (E), and phenotypic (P) datasets that can be used to train and test crop growth models to understand the genotype by environment (GxE)
interaction phenomenon. A particular advantage of the G2F database is its
diverse set of maize genotype DNA sequences (G2F-G), phenotypic measurements (G2F-P), station-based environmental time series (mainly climatic data) observations collected during the maize-growing season (G2F-E), and metadata for each field trial (G2F-M) across the United States (US), the province of Ontario in Canada, and the state of Lower Saxony in Germany. The construction
of this comprehensive climate and genomic database incorporates the
analytics for data quality control (QC) and consistency control (CC) to
consolidate the digital representation of geospatially distributed
environmental and genomic data required for phenotype predictive analytics
and modeling of the GxE interaction. The two-phase QC-CC preprocessing
algorithm also includes a module to estimate environmental uncertainties.
Generally, this data pipeline collects raw files, checks their formats,
corrects data structures, and identifies and cures or imputes missing data.
This pipeline uses machine-learning techniques to fill the environmental
time series gaps, quantifies the uncertainty introduced by using other
data sources for gap imputation in G2F-E, discards the missing values in
G2F-P, and removes rare variants in G2F-G. Finally, an integrated and
enhanced multidimensional database was generated. The analytics for
improving the G2F database and the improved database called Climate for OMICS (CLIM4OMICS) follow findability, accessibility, interoperability, and reusability (FAIR) principles, and all data and codes are available at
https://doi.org/10.5281/zenodo.8002909 (Aslam et al., 2023a) and https://doi.org/10.5281/zenodo.8161662 (Aslam et al., 2023b), respectively.
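One ingredient of such a pipeline, imputing a large gap from a correlated series, can be sketched with a simple least-squares fit. This is a stand-in for the machine-learning gap filling described above; the variable names and the linear relationship are illustrative assumptions.

```python
import numpy as np

def fill_gap_with_correlated(target, helper):
    """Impute NaN runs in `target` by least-squares regression on a correlated,
    complete `helper` series (e.g. a nearby weather station's record)."""
    target = target.astype(float).copy()
    mask = np.isnan(target)
    # Fit target = a * helper + b on the observed portion only.
    A = np.column_stack([helper[~mask], np.ones((~mask).sum())])
    coef, *_ = np.linalg.lstsq(A, target[~mask], rcond=None)
    target[mask] = helper[mask] * coef[0] + coef[1]
    return target

t = np.linspace(0, 4 * np.pi, 200)
helper = 10 + 5 * np.sin(t)          # nearby station with a complete record
target = 2 * helper + 1              # perfectly correlated in this toy example
target_gappy = target.copy()
target_gappy[80:120] = np.nan        # a large gap, as after a sensor failure
filled = fill_gap_with_correlated(target_gappy, helper)
```

Quantifying the uncertainty this substitution introduces, as the pipeline's uncertainty module does, would then amount to examining the regression residuals on the observed portion.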
Facilitating software evolution through natural language comments and dialogue
Software projects are continually evolving, as developers incorporate changes to refactor code, support new functionality, and fix bugs. To uphold software quality amidst constant changes and also facilitate prompt implementation of critical changes, it is desirable to have automated tools for supporting and driving software evolution. In this thesis, we explore tasks and data and design machine learning approaches which leverage natural language to serve this purpose.
When developers make code changes, they sometimes fail to update the accompanying natural language comments documenting various aspects of the code, which can lead to confusion and vulnerability to bugs. We present our work on alerting developers of inconsistent comments upon code changes and suggesting updates by learning to correlate comments and code.
When a bug is reported, developers engage in a dialogue to collaboratively understand it and ultimately resolve it. While the solution is likely formulated within the discussion, it is often buried in a large amount of text, making it difficult to comprehend, which delays its implementation through the necessary repository changes. To guide developers in more easily absorbing information relevant towards making these changes and consequently expedite bug resolution, we investigate generating a concise natural language description of the solution by synthesizing relevant content as it emerges in the discussion. We benchmark models for generating solution descriptions and design a classifier for determining when sufficient context for generating an informative description becomes available. We investigate approaches for real-time generation, entailing separately trained and jointly trained classification and generation models. Furthermore, we also study techniques for deriving natural language context from bug report discussions and generated solution descriptions to guide models in generating suggested bug-resolving code changes.
Name Your Colour For the Task: Artificially Discover Colour Naming via Colour Quantisation Transformer
The long-standing theory that a colour-naming system evolves under dual
pressure of efficient communication and perceptual mechanism is supported by
more and more linguistic studies, including analysing four decades of
diachronic data from the Nafaanra language. This inspires us to explore whether
machine learning could evolve and discover a similar colour-naming system via
optimising the communication efficiency represented by high-level recognition
performance. Here, we propose a novel colour quantisation transformer,
CQFormer, that quantises colour space while maintaining the accuracy of machine
recognition on the quantised images. Given an RGB image, the Annotation Branch maps
it into an index map before generating the quantised image with a colour
palette; meanwhile, the Palette Branch utilises a key-point detection approach to
find proper colours in the palette among the whole colour space. By interacting
with colour annotation, CQFormer is able to balance both the machine vision
accuracy and colour perceptual structure such as distinct and stable colour
distribution for the discovered colour system. Very interestingly, we even observe
the consistent evolution pattern between our artificial colour system and basic
colour terms across human languages. Besides, our colour quantisation method
also offers an efficient quantisation method that effectively compresses the
image storage while maintaining high performance in high-level recognition
tasks such as classification and detection. Extensive experiments demonstrate
the superior performance of our method with extremely low bit-rate colours,
showing the potential to integrate into quantisation networks to quantise from
image to network activation. The source code is available at
https://github.com/ryeocthiv/CQForme
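The index-map-plus-palette representation can be sketched in numpy with a fixed palette and nearest-colour assignment. CQFormer learns the palette end to end; the four colours below are an arbitrary illustration.

```python
import numpy as np

def quantise(image, palette):
    """Map each RGB pixel to its nearest palette colour; return the index map
    and the quantised image (here with a fixed palette, not a learned one)."""
    pixels = image.reshape(-1, 3).astype(float)
    # Squared distance of every pixel to every palette entry: (num_pixels, num_colours)
    d = ((pixels[:, None, :] - palette[None, :, :].astype(float)) ** 2).sum(-1)
    index_map = d.argmin(axis=1).reshape(image.shape[:2])
    return index_map, palette[index_map]

palette = np.array([[0, 0, 0], [255, 0, 0], [0, 255, 0], [0, 0, 255]],
                   dtype=np.uint8)
image = np.array([[[250, 10, 10], [5, 5, 5]],
                  [[10, 240, 10], [20, 20, 250]]], dtype=np.uint8)
index_map, quantised = quantise(image, palette)
```

Storing the index map plus the small palette instead of full RGB values is what yields the compression benefit the abstract mentions.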
Advanced statistical methods for prognostic biomarkers and disease incidence models
Due to their prognostic value, biomarkers can support physicians in making the appropriate choice of therapy for a patient. In this thesis, several advanced statistical methods and machine learning algorithms were considered and applied to projects in collaboration with departments of the University Hospital Augsburg. A machine learning algorithm capturing hidden structures in binary
immunohistologically stained images of colon cancer was developed to identify patients with a high risk of occurrence of distant metastases. Further, generalized linear models were used to estimate the probability of the need for a permanent shunt in patients after an aneurysmatic subarachnoid hemorrhage. Patients with oligometastatic colon cancer were stratified by a score developed using approaches from survival analysis to investigate which groups might benefit from surgical removal of metastases with prolonged overall survival.
Another important point is the selection of suitable statistical models dependent on the structure of the data. We found that a linear regression may only be suited with a transformation of the response variable in the context of the association of a COVID-19 infection with lymphocyte subsets. In addition, modeling the course of daily reported new COVID-19 cases is a relevant task and requires suitable statistical models. We compared non-seasonal and seasonal ARIMA models and examined the performance of different log-linear autoregressive Poisson models. To add more structure and enable theoretical prognosis for the further course depending on nonpharmaceutical interventions, we fitted a Bayesian SEIR model with several change points and set the determined change points in context with the distribution of variants of the virus.
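The log-linear autoregressive Poisson model mentioned above can be sketched by simulation; the coefficients below are arbitrary illustrative choices, not fitted values from the thesis.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_log_linear_ar(beta0, beta1, n):
    """Simulate a log-linear autoregressive Poisson model for count data:
    log(lambda_t) = beta0 + beta1 * log(y_{t-1} + 1),  y_t ~ Poisson(lambda_t)."""
    y = np.zeros(n, dtype=int)
    for t in range(1, n):
        lam = np.exp(beta0 + beta1 * np.log(y[t - 1] + 1))
        y[t] = rng.poisson(lam)
    return y

cases = simulate_log_linear_ar(beta0=1.0, beta1=0.5, n=300)
```

The log link keeps the intensity positive while letting yesterday's count drive today's expectation, which is why this model family suits daily reported case counts.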
Inferring ecological interactions from dynamics in phage-bacteria communities
Characterizing how viruses interact with microbial hosts is critical to understanding microbial community structure and function. However, existing methods for quantifying bacteria-phage interactions are not widely applicable to natural communities. First, many bacteria are not culturable, preventing direct experimental testing. Second, "-omics"-based methods, while high in accuracy and specificity, have been shown to be extremely low in power. Third, inference methods based on time-series or co-occurrence data, while promising, have for the most part not been rigorously tested. This thesis work focuses on this final category of quantification strategies: inference methods.
In this thesis, we further our understanding of both the potential and limitations of several inference methods, focusing primarily on time-series data with high time resolution. We emphasize the quantification of efficacy by using time-series data from multi-strain bacteria-phage communities with known infection networks. We employ both in silico simulated bacteria-phage communities as well as an in vitro community experiment. We review existing correlation-based inference methods, extend theory and characterize tradeoffs for model-based inference which uses convex optimization, characterize pairwise interactions in a 5x5 virus-microbe community experiment using Markov chain Monte Carlo, and present analytic tools for microbiome time-series analysis when a dynamical model is unknown. Together, these chapters bridge gaps in existing literature in inference of ecological interactions from time-series data.
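One of the simplest correlation-based inference methods reviewed, lagged cross-correlation between host and phage time series, can be sketched as follows; the synthetic shifted signal is an illustrative stand-in for real abundance data.

```python
import numpy as np

def lagged_correlation(x, y, max_lag):
    """Pearson correlation of x against y for every lag in [-max_lag, max_lag]."""
    out = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[: len(x) - lag], y[lag:]
        else:
            a, b = x[-lag:], y[: len(y) + lag]
        out[lag] = float(np.corrcoef(a, b)[0, 1])
    return out

t = np.arange(200)
host = np.sin(2 * np.pi * t / 50)   # oscillating host abundance
phage = np.roll(host, 5)            # phage dynamics trail the host by 5 steps

corrs = lagged_correlation(host, phage, max_lag=10)
best_lag = max(corrs, key=lambda k: corrs[k])
```

A strong correlation peak at a positive lag is then read as evidence that the phage responds to the host, which is exactly the kind of signal, and the kind of ambiguity, that motivates testing such methods on communities with known infection networks.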
Data pre-processing to identify environmental risk factors associated with diabetes
Genetics, diet, obesity, and lack of exercise play a major role in the development of type II diabetes. Additionally, environmental conditions are also linked to type II diabetes. The aim of this research is to identify the environmental conditions associated with diabetes. To achieve this, the research study utilises hospital-admitted patient data in NSW integrated with weather, pollution, and demographic data. The environmental variables (air pollution and weather) change over time and space, necessitating spatiotemporal data analysis to identify associations. Moreover, the environmental variables are measured using sensors, and they often contain large gaps of missing values due to sensor failures. Therefore, enhanced methodologies in data cleaning and imputation are needed to facilitate research using this data. Hence, the objectives of this study are twofold: first, to develop a data cleaning and imputation framework with improved methodologies to clean and pre-process the environmental data, and second, to identify environmental conditions associated with diabetes. This study develops a novel data-cleaning framework that streamlines the practice of data analysis and visualisation, specifically for studying environmental factors such as climate change monitoring and the effects of weather and pollution. The framework is designed to efficiently handle data collected by remote sensors, enabling more accurate and comprehensive analyses of environmental phenomena that would otherwise not be possible. The study initially focuses on the Sydney Region, identifies missing data patterns, and utilises established imputation methods. It assesses the performance of existing techniques and finds that Kalman smoothing on structural time series models outperforms other methods. However, when dealing with larger gaps in missing data, none of the existing methods yield satisfactory results. 
To address this, the study proposes enhanced methodologies for filling substantial gaps in environmental datasets. The first proposed algorithm employs regularized regression models to fill large gaps in air quality data using a univariate approach. It is then extended to incorporate seasonal patterns and expand its applicability to weather data with similar patterns. Furthermore, the algorithm is enhanced by incorporating other correlated variables to accurately fill substantial gaps in environmental variables. The algorithm presented in this thesis consistently outperforms other methods in imputing large gaps. This algorithm is applicable for filling large gaps in air pollution and weather data, facilitating downstream analysis.
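The regularized-regression idea with seasonal structure can be sketched as ridge regression on Fourier features; the period, number of harmonics, and penalty below are illustrative assumptions, not the thesis' tuned settings.

```python
import numpy as np

def ridge_fill(series, period, alpha=1.0, n_harmonics=3):
    """Fill NaN gaps with ridge regression on seasonal Fourier features,
    a simplified, univariate version of the gap-filling idea."""
    t = np.arange(len(series))
    feats = [np.ones_like(t, dtype=float)]
    for k in range(1, n_harmonics + 1):
        feats.append(np.sin(2 * np.pi * k * t / period))
        feats.append(np.cos(2 * np.pi * k * t / period))
    X = np.column_stack(feats)
    mask = np.isnan(series)
    A, b = X[~mask], series[~mask]
    # Ridge solution: (A^T A + alpha I)^{-1} A^T b
    coef = np.linalg.solve(A.T @ A + alpha * np.eye(X.shape[1]), A.T @ b)
    filled = series.copy()
    filled[mask] = X[mask] @ coef
    return filled

t = np.arange(400, dtype=float)
truth = 20 + 8 * np.sin(2 * np.pi * t / 24)   # e.g. a daily temperature cycle
gappy = truth.copy()
gappy[150:250] = np.nan                        # a 100-point gap
imputed = ridge_fill(gappy, period=24)
```

Because the seasonal cycle is estimated from the observed portion of the record, the regression can bridge gaps far longer than local interpolation could, which is the failure mode of standard imputation methods described above.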