974 research outputs found

    GRASE: Granulometry Analysis with Semi Eager Classifier to Detect Malware

    Get PDF
    Technological advancement in communication, leading to 5G, motivates everyone to connect to the internet, including 'devices', in a paradigm named the Web of Things (WoT). The community benefits from this large-scale network, which allows monitoring and controlling of physical devices. However, this often comes at the cost of security, as MALicious softWARE (malware) developers try to invade the network; for them, these devices act as a 'backdoor' providing easy entry. To stop invaders from entering the network, identifying malware and its variants is of great significance for cyberspace. Traditional static and dynamic malware-detection methods detect malware but are weak against newer techniques used by malware developers, such as obfuscation, polymorphism and encryption. A machine learning approach in which the classifier is trained on handcrafted features is not potent against these techniques and demands substantial feature-engineering effort. This paper proposes malware classification using a visualization methodology in which the disassembled malware code is transformed into greyscale images. It presents the efficacy of the granulometry texture-analysis technique for improving malware classification. Furthermore, a Semi Eager (SemiE) classifier, which combines eager and lazy learning, is used to obtain robust classification of malware families. The experimental outcome is promising, since the proposed technique requires less training time to learn the semantics of higher-level malicious behaviours, and identifying malware (the testing phase) is also faster. Benchmark databases, Malimg and the Microsoft Malware Classification Challenge (BIG-2015), were used to analyse the performance of the system; overall average classification accuracies of 99.03% and 99.11% were achieved, respectively.
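The two core ideas of the abstract, turning raw bytes into a greyscale image and summarizing its texture with a granulometry (a pattern spectrum from morphological openings of increasing size), can be sketched in plain NumPy. This is an illustrative reconstruction, not the paper's implementation; the image width, opening sizes, and function names are assumptions:

```python
import numpy as np

def bytes_to_image(data: bytes, width: int = 16) -> np.ndarray:
    """Reshape a raw byte sequence into a 2-D greyscale image."""
    n = (len(data) // width) * width
    return np.frombuffer(data[:n], dtype=np.uint8).reshape(-1, width)

def grey_opening(img: np.ndarray, k: int) -> np.ndarray:
    """Grey-scale opening with a k x k flat structuring element:
    erosion (min filter) followed by dilation (max filter)."""
    pad = k // 2
    def filt(a, fn):
        p = np.pad(a, pad, mode="edge")
        out = np.empty_like(a)
        h, w = a.shape
        for i in range(h):
            for j in range(w):
                out[i, j] = fn(p[i:i + k, j:j + k])
        return out
    return filt(filt(img, np.min), np.max)

def pattern_spectrum(img: np.ndarray, sizes=(1, 3, 5)) -> list:
    """Granulometry: total intensity remaining after openings of
    increasing size; texture features for the classifier."""
    return [int(grey_opening(img, k).sum()) for k in sizes]
```

Because openings are anti-extensive, the spectrum typically decreases with size; the decay profile is what characterizes the texture of a malware family's image.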

    Backpropagation Beyond the Gradient

    Get PDF
    Automatic differentiation is a key enabler of deep learning: previously, practitioners were limited to models for which they could manually compute derivatives. Now, they can create sophisticated models with almost no restrictions and train them using first-order, i.e., gradient, information. Popular libraries like PyTorch and TensorFlow compute this gradient efficiently, automatically, and conveniently with a single line of code. Under the hood, reverse-mode automatic differentiation, or gradient backpropagation, powers the gradient computation in these libraries. Their entire design centers around gradient backpropagation. These frameworks are specialized around one specific task: computing the average gradient in a mini-batch. This specialization often complicates the extraction of other information like higher-order statistical moments of the gradient, or higher-order derivatives like the Hessian. It limits practitioners and researchers to methods that rely on the gradient. Arguably, this hampers the field from exploring the potential of higher-order information, and there is evidence that focusing solely on the gradient has not led to significant recent advances in deep learning optimization. To advance algorithmic research and inspire novel ideas, information beyond the batch-averaged gradient must be made available at the same level of computational efficiency, automation, and convenience. This thesis presents approaches to simplify experimentation with rich information beyond the gradient by making it more readily accessible. We present an implementation of these ideas as an extension to the backpropagation procedure in PyTorch. Using this newly accessible information, we demonstrate possible use cases by (i) showing how it can inform our understanding of neural network training by building a diagnostic tool, and (ii) enabling novel methods to efficiently compute and approximate curvature information.
First, we extend gradient backpropagation for sequential feedforward models to Hessian backpropagation, which enables computing approximate per-layer curvature. This perspective unifies recently proposed block-diagonal curvature approximations. Like gradient backpropagation, the computation of these second-order derivatives is modular, and therefore simple to automate and extend to new operations. Based on the insight that rich information beyond the gradient can be computed efficiently and at the same time, we extend the backpropagation in PyTorch with the BackPACK library. It provides efficient and convenient access to statistical moments of the gradient and approximate curvature information, often at a small overhead compared to computing just the gradient. Next, we showcase the utility of such information to better understand neural network training. We build the Cockpit library that visualizes what is happening inside the model during training through various instruments that rely on BackPACK’s statistics. We show how Cockpit provides a meaningful statistical summary report to the deep learning engineer to identify bugs in their machine learning pipeline, guide hyperparameter tuning, and study deep learning phenomena. Finally, we use BackPACK’s extended automatic differentiation functionality to develop ViViT, an approach to efficiently compute curvature information, in particular curvature noise. It uses the low-rank structure of the generalized Gauss-Newton approximation to the Hessian and addresses shortcomings in existing curvature approximations. Through monitoring curvature noise, we demonstrate how ViViT’s information helps in understanding challenges to make second-order optimization methods work in practice. This work develops new tools to experiment more easily with higher-order information in complex deep learning models.
These tools have impacted works on Bayesian applications with Laplace approximations, out-of-distribution generalization, differential privacy, and the design of automatic differentiation systems. They constitute one important step towards developing and establishing more efficient deep learning algorithms.
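The notion of "information beyond the batch-averaged gradient" can be made concrete with a toy example. The sketch below computes per-sample gradients and their first two statistical moments by hand for a least-squares model in plain NumPy; it is a conceptual illustration only, not BackPACK's API (BackPACK extracts these quantities inside PyTorch's backward pass without materializing a loop over samples):

```python
import numpy as np

def per_sample_gradients(w, X, y):
    """Per-sample gradients of the squared loss 0.5*(x.w - y)^2 w.r.t. w.
    Returns an (N, D) array with one gradient row per example."""
    residuals = X @ w - y              # shape (N,)
    return residuals[:, None] * X      # grad_i = (x_i.w - y_i) * x_i

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = np.zeros(3)

grads = per_sample_gradients(w, X, y)
mean_grad = grads.mean(axis=0)   # what standard backprop averages down to
grad_var = grads.var(axis=0)     # second moment: per-coordinate gradient noise
```

The batch-averaged gradient returned by a standard `backward()` call is exactly `mean_grad`; `grad_var` is the kind of higher-order statistical moment that specialized frameworks make hard to reach.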

    Automated Mapping of Adaptive App GUIs from Phones to TVs

    Full text link
    With the increasing interconnection of smart devices, users often desire to adopt the same app on quite different devices for identical tasks, such as watching the same movies on both their smartphones and TV. However, the significant differences in screen size, aspect ratio, and interaction styles make it challenging to adapt Graphical User Interfaces (GUIs) across these devices. Although there are millions of apps available on Google Play, only a few thousand are designed to support smart TV displays. Existing techniques to map a mobile app GUI to a TV either adopt a responsive design, which struggles to bridge the substantial gap between phone and TV, or use mirror apps for improved video display, which requires hardware support and extra engineering effort. Instead of developing another app for supporting TVs, we propose a semi-automated approach to generate corresponding adaptive TV GUIs, given the phone GUIs as the input. Based on our empirical study of GUI pairs for TV and phone in existing apps, we synthesize a list of rules for grouping and classifying phone GUIs, converting them to TV GUIs, and generating dynamic TV layouts and source code for the TV display. Our tool is not only beneficial to developers but also to GUI designers, who can further customize the generated GUIs for their TV app development. An evaluation and user study demonstrate the accuracy of our generated GUIs and the usefulness of our tool. Comment: 30 pages, 15 figures.
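One of the simplest imaginable conversion rules, proportional re-projection of a phone widget's bounds into a TV "safe area", can be sketched as follows. The function, resolutions, and overscan margin are illustrative assumptions and are not taken from the paper's actual rule set:

```python
def map_rect_phone_to_tv(rect, phone=(1080, 1920), tv=(1920, 1080), overscan=0.05):
    """Map an (x, y, w, h) phone rect into the TV's safe area by
    proportional scaling -- a toy stand-in for one layout-conversion rule."""
    px, py = phone
    tx, ty = tv
    # shrink the usable TV canvas by the overscan margin on all sides
    ox, oy = tx * overscan, ty * overscan
    ux, uy = tx - 2 * ox, ty - 2 * oy
    x, y, w, h = rect
    return (ox + x / px * ux, oy + y / py * uy, w / px * ux, h / py * uy)
```

A real mapping additionally has to regroup widgets and handle focus-based (D-pad) navigation, which is why the paper's approach is rule-based over GUI groups rather than a single geometric transform.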

    CLIM4OMICS: a geospatially comprehensive climate and multi-OMICS database for maize phenotype predictability in the United States and Canada

    Get PDF
    The performance of numerical, statistical, and data-driven diagnostic and predictive crop production modeling relies heavily on data quality for input and calibration or validation processes. This study presents a comprehensive database and the analytics used to consolidate it as a homogeneous, consistent, multidimensional genotype, phenotypic, and environmental database for maize phenotype modeling, diagnostics, and prediction. The data used are obtained from the Genomes to Fields (G2F) initiative, which provides multiyear genomic (G), environmental (E), and phenotypic (P) datasets that can be used to train and test crop growth models to understand the genotype by environment (GxE) interaction phenomenon. A particular advantage of the G2F database is its diverse set of maize genotype DNA sequences (G2F-G), phenotypic measurements (G2F-P), station-based environmental time series (mainly climatic data) observations collected during the maize-growing season (G2F-E), and metadata for each field trial (G2F-M) across the United States (US), the province of Ontario in Canada, and the state of Lower Saxony in Germany. The construction of this comprehensive climate and genomic database incorporates the analytics for data quality control (QC) and consistency control (CC) to consolidate the digital representation of geospatially distributed environmental and genomic data required for phenotype predictive analytics and modeling of the GxE interaction. The two-phase QC–CC preprocessing algorithm also includes a module to estimate environmental uncertainties. Generally, this data pipeline collects raw files, checks their formats, corrects data structures, and identifies and cures or imputes missing data. This pipeline uses machine-learning techniques to fill the environmental time series gaps, quantifies the uncertainty introduced by using other data sources for gap imputation in G2F-E, discards the missing values in G2F-P, and removes rare variants in G2F-G. 
Finally, an integrated and enhanced multidimensional database was generated. The analytics for improving the G2F database and the improved database called Climate for OMICS (CLIM4OMICS) follow findability, accessibility, interoperability, and reusability (FAIR) principles, and all data and codes are available at https://doi.org/10.5281/zenodo.8002909 (Aslam et al., 2023a) and https://doi.org/10.5281/zenodo.8161662 (Aslam et al., 2023b), respectively.
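A minimal stand-in for the gap-filling step in such a QC-CC pipeline might look like the following: linear interpolation across NaN gaps in a sensor series, returning a mask of which values were imputed so their added uncertainty can be tracked downstream. This is a sketch of the idea only; the actual pipeline uses machine-learning imputers and auxiliary data sources:

```python
import numpy as np

def impute_linear(series):
    """Fill NaN gaps in a 1-D series by linear interpolation between
    observed neighbours. Returns the filled series and a boolean mask
    marking which entries were imputed (for uncertainty bookkeeping)."""
    s = np.asarray(series, dtype=float)
    idx = np.arange(s.size)
    missing = np.isnan(s)
    filled = s.copy()
    filled[missing] = np.interp(idx[missing], idx[~missing], s[~missing])
    return filled, missing
```

Keeping the mask separate is what lets a pipeline like this quantify the uncertainty introduced by imputation instead of silently mixing imputed and observed values.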

    Name Your Colour For the Task: Artificially Discover Colour Naming via Colour Quantisation Transformer

    Full text link
    The long-standing theory that colour-naming systems evolve under the dual pressure of efficient communication and perceptual mechanism is supported by a growing number of linguistic studies, including an analysis of four decades of diachronic data from the Nafaanra language. This inspires us to explore whether machine learning can evolve and discover a similar colour-naming system by optimising communication efficiency, represented by high-level recognition performance. Here, we propose a novel colour quantisation transformer, CQFormer, which quantises the colour space while maintaining the accuracy of machine recognition on the quantised images. Given an RGB image, the Annotation Branch maps it into an index map before generating the quantised image with a colour palette; meanwhile, the Palette Branch uses a key-point detection approach to find suitable palette colours within the whole colour space. By interacting with colour annotation, CQFormer is able to balance machine-vision accuracy and colour perceptual structure, such as a distinct and stable colour distribution, in the discovered colour system. Interestingly, we even observe a consistent evolution pattern between our artificial colour system and the basic colour terms across human languages. In addition, our colour quantisation serves as an efficient compression method, substantially reducing image storage while maintaining high performance in high-level recognition tasks such as classification and detection. Extensive experiments demonstrate the superior performance of our method with extremely low-bit-rate colours, showing the potential to extend quantisation from images to network activations. The source code is available at https://github.com/ryeocthiv/CQForme
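The roles of the index map and colour palette can be illustrated with a plain nearest-colour quantiser. The learned CQFormer branches are of course not reproduced here; this only puts the abstract's nomenclature into code form:

```python
import numpy as np

def quantise(image, palette):
    """Map each RGB pixel to its nearest palette colour (Euclidean
    distance). Returns the index map and the quantised image."""
    h, w, _ = image.shape
    pix = image.reshape(-1, 3).astype(float)
    # squared distance from every pixel to every palette entry
    d = ((pix[:, None, :] - palette[None, :, :].astype(float)) ** 2).sum(-1)
    index_map = d.argmin(1)
    return index_map.reshape(h, w), palette[index_map].reshape(h, w, 3)
```

In CQFormer the palette is not fixed as it is here: the Palette Branch searches the colour space for entries that keep downstream recognition accuracy high.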

    Advanced statistical methods for prognostic biomarkers and disease incidence models

    Get PDF
    Due to their prognostic value, biomarkers can support physicians in making the appropriate choice of therapy for a patient. In this thesis, several advanced statistical methods and machine learning algorithms were considered and applied to projects in collaboration with departments of the University Hospital Augsburg. A machine learning algorithm capturing hidden structures in binary immunohistologically stained images of colon cancer was developed to identify patients with a high risk of occurrence of distant metastases. Further, generalized linear models were used to estimate the probability of the need for a permanent shunt in patients after an aneurysmatic subarachnoid hemorrhage. Patients with oligometastatic colon cancer were stratified by a score developed using approaches from survival analysis to investigate which groups might benefit from surgical removal of metastases with prolonged overall survival. Another important point is the selection of suitable statistical models depending on the structure of the data. In the context of the association of a COVID-19 infection with lymphocyte subsets, we found that linear regression is only suitable after a transformation of the response variable. In addition, modeling the course of daily reported new COVID-19 cases is a relevant task and requires suitable statistical models. We compared non-seasonal and seasonal ARIMA models and examined the performance of different log-linear autoregressive Poisson models. To add more structure and enable a theoretical prognosis of the further course depending on nonpharmaceutical interventions, we fitted a Bayesian SEIR model with several change points and related the determined change points to the distribution of virus variants.
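The model class in the last step can be sketched as a discrete-time SEIR system whose transmission rate switches at change points. All parameter values below are illustrative assumptions; the thesis fits change points and rates in a Bayesian framework rather than fixing them:

```python
def seir(beta_schedule, gamma=0.1, sigma=0.2, N=1e6, I0=10, days=100):
    """Discrete-time SEIR model whose transmission rate beta changes at
    given change points. beta_schedule: sorted list of (start_day, beta),
    starting at day 0. Returns the daily infectious compartment I."""
    S, E, I, R = N - I0, 0.0, float(I0), 0.0
    out = []
    for t in range(days):
        beta = [b for d, b in beta_schedule if d <= t][-1]  # active rate
        new_e = beta * S * I / N   # new exposures
        new_i = sigma * E          # exposed becoming infectious
        new_r = gamma * I          # recoveries
        S, E, I, R = S - new_e, E + new_e - new_i, I + new_i - new_r, R + new_r
        out.append(I)
    return out
```

With a change point that pushes beta below gamma (e.g. a nonpharmaceutical intervention), the infectious curve peaks and then declines, which is the qualitative behaviour the fitted change points capture.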

    Inferring ecological interactions from dynamics in phage-bacteria communities

    Get PDF
    Characterizing how viruses interact with microbial hosts is critical to understanding microbial community structure and function. However, existing methods for quantifying bacteria-phage interactions are not widely applicable to natural communities. First, many bacteria are not culturable, preventing direct experimental testing. Second, “-omics”-based methods, while high in accuracy and specificity, have been shown to be extremely low in power. Third, inference methods based on time-series or co-occurrence data, while promising, have for the most part not been rigorously tested. This thesis focuses on this final category of quantification strategies: inference methods. We further our understanding of both the potential and limitations of several inference methods, focusing primarily on time-series data with high time resolution. We emphasize the quantification of efficacy by using time-series data from multi-strain bacteria-phage communities with known infection networks. We employ both in silico simulated bacteria-phage communities and an in vitro community experiment. We review existing correlation-based inference methods, extend theory and characterize tradeoffs for model-based inference using convex optimization, characterize pairwise interactions in a 5×5 virus-microbe community experiment using Markov chain Monte Carlo, and present analytic tools for microbiome time-series analysis when a dynamical model is unknown. Together, these chapters bridge gaps in the existing literature on inference of ecological interactions from time-series data. (Ph.D. thesis)
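The simplest of the correlation-based methods reviewed here can be written down directly: a time-lagged Pearson correlation between a host's abundance and a phage's abundance some steps later. This is a sketch of the idea, not a complete inference pipeline; the lag and the interpretation threshold are assumptions:

```python
import numpy as np

def lagged_correlation(host, phage, lag=1):
    """Pearson correlation between host abundance and phage abundance
    `lag` time steps later. A strong positive value is (naively) read
    as evidence that the phage infects, and so tracks, that host."""
    x = np.asarray(host, float)[:-lag]
    y = np.asarray(phage, float)[lag:]
    x, y = x - x.mean(), y - y.mean()
    return float((x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum()))
```

Part of the thesis's point is that such pairwise scores can be misleading in multi-strain communities, which is why it benchmarks them against communities with known infection networks.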

    Data pre-processing to identify environmental risk factors associated with diabetes

    Get PDF
    Genetics, diet, obesity, and lack of exercise play a major role in the development of type II diabetes. Additionally, environmental conditions are also linked to type II diabetes. The aim of this research is to identify the environmental conditions associated with diabetes. To achieve this, the research study utilises hospital-admitted patient data in NSW integrated with weather, pollution, and demographic data. The environmental variables (air pollution and weather) change over time and space, necessitating spatiotemporal data analysis to identify associations. Moreover, the environmental variables are measured using sensors, and they often contain large gaps of missing values due to sensor failures. Therefore, enhanced methodologies in data cleaning and imputation are needed to facilitate research using this data. Hence, the objectives of this study are twofold: first, to develop a data cleaning and imputation framework with improved methodologies to clean and pre-process the environmental data, and second, to identify environmental conditions associated with diabetes. This study develops a novel data-cleaning framework that streamlines the practice of data analysis and visualisation, specifically for studying environmental factors such as climate change monitoring and the effects of weather and pollution. The framework is designed to efficiently handle data collected by remote sensors, enabling more accurate and comprehensive analyses of environmental phenomena that would otherwise not be possible. The study initially focuses on the Sydney Region, identifies missing data patterns, and utilises established imputation methods. It assesses the performance of existing techniques and finds that Kalman smoothing on structural time series models outperforms other methods. However, when dealing with larger gaps in missing data, none of the existing methods yield satisfactory results. 
To address this, the study proposes enhanced methodologies for filling substantial gaps in environmental datasets. The first proposed algorithm employs regularized regression models to fill large gaps in air quality data using a univariate approach. It is then extended to incorporate seasonal patterns and to apply to weather data with similar patterns. Furthermore, the algorithm is enhanced by incorporating other correlated variables to accurately fill substantial gaps in environmental variables. The algorithm presented in this thesis consistently outperforms other methods in imputing large gaps. It is applicable for filling large gaps in air pollution and weather data, facilitating downstream analysis.
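A covariate-plus-seasonality version of such a gap-filling algorithm can be sketched with closed-form ridge regression. The feature choices, period, and penalty below are illustrative assumptions rather than the thesis's exact design:

```python
import numpy as np

def ridge_fill(target, covariate, missing, period=24.0, alpha=1.0):
    """Fill gaps in `target` (entries flagged by boolean mask `missing`)
    via ridge regression on an intercept, one correlated covariate, and
    seasonal sin/cos features -- a simplified regularized-regression imputer."""
    target = np.asarray(target, float)
    covariate = np.asarray(covariate, float)
    t = np.arange(target.size)
    X = np.column_stack([
        np.ones(target.size),
        covariate,
        np.sin(2 * np.pi * t / period),
        np.cos(2 * np.pi * t / period),
    ])
    obs = ~missing
    A, b = X[obs], target[obs]
    # closed-form ridge solution: w = (A'A + alpha*I)^-1 A'b
    w = np.linalg.solve(A.T @ A + alpha * np.eye(X.shape[1]), A.T @ b)
    filled = target.copy()
    filled[missing] = X[missing] @ w
    return filled
```

Unlike pure interpolation, the regression can bridge a gap spanning many seasonal cycles because the seasonal features and the correlated series carry information across the whole gap.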