
    A Novel Machine Learning Approach For File Fragments Classification

    Identifying types of manipulated or corrupted file fragments in isolation from their context is an essential task in digital forensics. Traditional file type identification relies on metadata such as file extensions and header and footer signatures. These metadata-based approaches fail when metadata is missing or altered, so alternative strategies need to be applied or developed. One approach applies statistical techniques to extract features from the binary contents of file fragments and uses them as inputs to classification algorithms. This results in high dimensionality, which makes learning and classification time-consuming. Another approach is deep learning neural networks, which extract features automatically. File fragment classification is further complicated by the high number of possible file classes. In addition, some container file types, such as PowerPoint (PPT), include data belonging to other file types, such as JPEG, which can confuse the classification algorithms. In this thesis, we developed a hybrid method to address high feature dimensionality, using filters and wrappers to reduce the number of features. We explored the possible hierarchical relationships between file classes and represented them with a hierarchy tree to help narrow the uncertainties for challenging file types. We proposed a novel hybrid approach that combines hierarchical models with feature selection to improve the accuracy of file fragment classification. We also explored the use of deep learning techniques for this task. We tested our methods on a benchmark dataset, GovDocs. Hybrid feature selection reduced the number of features from 66,313 to 11–32 and improved accuracy compared to methods using all features. The accuracy increased from 69% using random forest to 75% using the DAG tree. We incorporated the hybrid feature selection into hierarchical modelling to generate trees that use only the most discriminative features, and found that these models outperformed classical machine-learning approaches. Finally, using deep learning for file fragment classification provided the highest accuracy of all techniques explored, obtaining accuracies of 86%.
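
As a rough illustration of what "statistical features from the binary contents of file fragments" can look like, here is a minimal Python sketch that computes a normalized byte-frequency histogram plus its Shannon entropy for one fragment. This is an assumption for illustration only: the abstract does not say which features the thesis uses, and the function names (`byte_histogram`, `shannon_entropy`, `fragment_features`) are hypothetical.

```python
import math
from collections import Counter


def byte_histogram(fragment: bytes) -> list[float]:
    """Return the relative frequency of each of the 256 possible byte values."""
    if not fragment:
        return [0.0] * 256
    counts = Counter(fragment)  # iterating bytes yields ints in 0..255
    total = len(fragment)
    return [counts.get(value, 0) / total for value in range(256)]


def shannon_entropy(hist: list[float]) -> float:
    """Shannon entropy of a byte distribution, in bits per byte (0 to 8)."""
    return -sum(p * math.log2(p) for p in hist if p > 0)


def fragment_features(fragment: bytes) -> list[float]:
    """Illustrative feature vector: 256 byte frequencies followed by entropy."""
    hist = byte_histogram(fragment)
    return hist + [shannon_entropy(hist)]


if __name__ == "__main__":
    uniform = bytes(range(256)) * 2   # every byte value equally likely
    constant = b"\x00" * 512          # a single repeated byte value
    print(len(fragment_features(uniform)))
    print(fragment_features(uniform)[-1], fragment_features(constant)[-1])
```

Even this small vector has 257 dimensions per fragment; adding n-gram statistics grows the count quickly, which is the kind of dimensionality that the feature selection described in the abstract is meant to reduce.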

    Structured Review of Code Clone Literature

    This report presents the results of a structured review of code clone literature. The aim of the review is to assemble a conceptual model of clone-related concepts that helps us reason about clones. This conceptual model unifies clone concepts from a wide range of literature, so that findings about clones can be compared with each other.

    Algorithmic Programming Language Identification

    Motivated by the amount of code that goes unidentified on the web, we introduce a practical method for algorithmically identifying the programming language of source code. Our work is based on supervised learning and intelligent statistical features. We also explored, but abandoned, a grammatical approach. In testing, our implementation greatly outperforms that of an existing tool that relies on a Bayesian classifier. Code is written in Python and available under an MIT license. Comment: 11 pages. Code: https://github.com/simon-weber/Programming-Language-Identificatio

    Gerstmann-Sträussler-Scheinker disease revisited: accumulation of covalently-linked multimers of internal prion protein fragments

    Despite their phenotypic heterogeneity, most human prion diseases belong to two broadly defined groups: Creutzfeldt-Jakob disease (CJD) and Gerstmann-Sträussler-Scheinker disease (GSS). While the structural characteristics of the disease-related proteinase K-resistant prion protein (resPrPD) associated with the CJD group are fairly well established, many features of GSS-associated resPrPD are unclear. Electrophoretic profiles of resPrPD associated with GSS variants typically show 6-8 kDa bands corresponding to the internal PrP fragments, as well as a variable number of higher molecular weight bands whose molecular nature has not been investigated. Here we have performed systematic studies of purified resPrPD species extracted from GSS cases with the A117V (GSSA117V) and F198S (GSSF198S) PrP gene mutations. The combined analysis, based on epitope mapping, deglycosylation treatment and direct amino acid sequencing by mass spectrometry, provided conclusive evidence that the high molecular weight resPrPD species seen in electrophoretic profiles represent covalently-linked multimers of the internal ~7 and ~8 kDa fragments. This finding reveals a mechanism of resPrPD aggregate formation that has not been previously established in prion diseases.

    Distribution, functional impact, and origin mechanisms of copy number variation in the barley genome

    BACKGROUND: There is growing evidence for the prevalence of copy number variation (CNV) and its role in phenotypic variation in many eukaryotic species. Here we use array comparative genomic hybridization to explore the extent of this type of structural variation in domesticated barley cultivars and wild barleys. RESULTS: A collection of 14 barley genotypes, including eight cultivars and six wild barleys, was used for comparative genomic hybridization. CNV affects 14.9% of all the sequences that were assessed. Higher levels of CNV diversity are present in the wild accessions relative to cultivated barley. CNVs are enriched near the ends of all chromosomes except 4H, which exhibits the lowest frequency of CNVs. CNV affects 9.5% of the coding sequences represented on the array, and the genes affected by CNV are enriched for sequences annotated as disease-resistance proteins and protein kinases. Sequence-based comparisons of CNV between cultivars Barke and Morex provided evidence that DNA repair mechanisms of double-strand breaks via single-stranded annealing and synthesis-dependent strand annealing play an important role in the origin of CNV in barley. CONCLUSIONS: We present the first catalog of CNVs in a diploid Triticeae species, which opens the door for future genome diversity research in a tribe that comprises the economically important cereal species wheat, barley, and rye. Our findings constitute a valuable resource for the identification of CNV affecting genes of agronomic importance. We also identify potential mechanisms that can generate variation in copy number in plant genomes. This work was financially supported by the following grants: project GABI-BARLEX, German Federal Ministry of Education and Research (BMBF), #0314000 to MP, US, KFXM and NS; Triticeae Coordinated Agricultural Project, USDA-NIFA #2011-68002-30029 to GJM; and Agriculture and Food Research Initiative Plant Genome, Genetics and Breeding Program of USDA’s Cooperative State Research and Extension Service, #2009-65300-05645 to GJM.