A Novel Machine Learning Approach For File Fragments Classification

Author: Algurashi Alia
Publication venue: University of East Anglia. School of Computing Sciences
Publication date: 01/11/2022

Identifying the types of manipulated or corrupted file fragments in isolation from their context is an essential task in digital forensics. Traditional file type identification relies on metadata, such as file extensions and header and footer signatures. These metadata-based approaches do not work where metadata is missing or altered, so alternative strategies need to be applied or developed. One approach is to apply statistical techniques to extract features from the binary contents of file fragments and then use them as inputs for classification algorithms. This results in high dimensionality, which makes learning and classification time-consuming. Another approach is deep learning neural networks, which extract features automatically. File fragment classification is further complicated by the high number of possible file classes. In addition, some container file types, such as PowerPoint (PPT), include data belonging to other file types, such as JPEG, which can confuse the classification algorithms. In this thesis, we develop a hybrid method to address high feature dimensionality, using filters and wrappers to reduce the number of features. We explore the possible hierarchical relationships between file classes and represent them with a hierarchy tree to help narrow the uncertainties for challenging file types. We propose a novel hybrid approach that combines hierarchical models with feature selection to improve the accuracy of file fragment classification. We also explore the use of deep learning techniques for this task. We test our methods on a benchmark dataset, GovDocs. The hybrid feature selection reduces the number of features from 66,313 to 11–32 and improves accuracy compared to methods using all features: accuracy increased from 69% using random forest to 75% using the DAG tree. We incorporate the hybrid feature selection into hierarchical modelling to generate trees that use only the most discriminative features, and find that these models outperform classical machine-learning approaches. Finally, deep learning provided the highest accuracy of all the techniques explored for file fragment classification, at 86%.
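
The statistical feature extraction mentioned in the abstract above can be illustrated with a minimal sketch. It is not the thesis's feature set: it assumes two commonly used fragment features, a normalised byte-frequency histogram and the Shannon entropy of the bytes, and all function names are hypothetical.

```python
import math
from collections import Counter


def byte_histogram(fragment: bytes) -> list[float]:
    """Return the relative frequency of each of the 256 possible byte values."""
    if not fragment:
        return [0.0] * 256
    counts = Counter(fragment)  # iterating over bytes yields ints in 0..255
    total = len(fragment)
    return [counts.get(value, 0) / total for value in range(256)]


def shannon_entropy(fragment: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    return -sum(p * math.log2(p) for p in byte_histogram(fragment) if p > 0)


def fragment_features(fragment: bytes) -> list[float]:
    """Illustrative 257-dimensional feature vector: histogram plus entropy."""
    return byte_histogram(fragment) + [shannon_entropy(fragment)]
```

Even this small vector has 257 dimensions per fragment; adding n-gram statistics grows the feature count quickly, which is the kind of dimensionality a feature-selection step is meant to reduce.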

University of East Anglia digital repository

Structured Review of Code Clone Literature

Author: Hordijk Wiebe
Ponisio María Laura
Wieringa Roel
Publication venue: Centre for Telematics and Information Technology, University of Twente
Publication date: 01/01/2008

This report presents the results of a structured review of code clone literature. The aim of the review is to assemble a conceptual model of clone-related concepts that helps us to reason about clones. This conceptual model unifies clone concepts from a wide range of literature, so that findings about clones can be compared with each other.

University of Twente Research Information

Algorithmic Programming Language Identification

Author: Klein David
Murray Kyle
Weber Simon
Publication date: 01/01/2011

Motivated by the amount of code that goes unidentified on the web, we introduce a practical method for algorithmically identifying the programming language of source code. Our work is based on supervised learning and intelligent statistical features. We also explored, but abandoned, a grammatical approach. In testing, our implementation greatly outperforms that of an existing tool that relies on a Bayesian classifier. Code is written in Python and available under an MIT license. Comment: 11 pages. Code: https://github.com/simon-weber/Programming-Language-Identificatio
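
The general idea of supervised language identification from statistical features can be sketched as follows. This is not the authors' implementation, and the abstract does not specify their features or classifier: the sketch assumes token-frequency profiles learned from labelled samples and a simple overlap score, and every name in it is hypothetical.

```python
import re
from collections import Counter

# Words (identifiers and keywords) or single punctuation characters; digits are ignored.
TOKEN_RE = re.compile(r"[A-Za-z_]+|[^\sA-Za-z_0-9]")


def token_counts(source: str) -> Counter:
    """Count the tokens that appear in a piece of source code."""
    return Counter(TOKEN_RE.findall(source))


def train(samples):
    """Build a normalised token-frequency profile per language.

    `samples` is an iterable of (language, source_code) pairs.
    """
    totals = {}
    for language, source in samples:
        totals.setdefault(language, Counter()).update(token_counts(source))
    profiles = {}
    for language, counts in totals.items():
        n = sum(counts.values())
        profiles[language] = {token: c / n for token, c in counts.items()}
    return profiles


def classify(profiles, source: str) -> str:
    """Return the language whose profile overlaps most with the input's tokens."""
    counts = token_counts(source)

    def score(profile):
        return sum(profile.get(token, 0.0) * c for token, c in counts.items())

    return max(profiles, key=lambda language: score(profiles[language]))


profiles = train([
    ("python", "def f(x):\n    return x\n"),
    ("c", "int main(void) { return 0; }"),
])
print(classify(profiles, "def g(y):\n    return y + 1\n"))
```

With only one training sample per language the profiles are toy-sized; a usable classifier would need many samples per language and a more robust scoring rule.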

arXiv.org e-Print Archive

CiteSeerX

Miniature transposable sequences are frequently mobilized in the bacterial plant pathogen Pseudomonas syringae pv. phaseolicola

Author: Añorga Maite
Bardaji Leire
Jackson Robert W.
Martínez-Bilbao Alejandro
Murillo Jesús
Yanguas-Casás Natalia
Publication venue: Public Library of Science (PLoS)
Publication date: 01/01/2011

Mobile genetic elements are widespread in Pseudomonas syringae, and often associate with virulence genes. Genome reannotation of the model bean pathogen P. syringae pv. phaseolicola 1448A identified seventeen types of insertion sequences and two miniature inverted-repeat transposable elements (MITEs) with a biased distribution, representing 2.8% of the chromosome, 25.8% of the 132-kb virulence plasmid and 2.7% of the 52-kb plasmid. Employing an entrapment vector containing sacB, we estimated that transposition frequency oscillated between 2.6 × 10^-5 and 1.1 × 10^-6, depending on the clone, although it was stable for each clone after consecutive transfers in culture media. Transposition frequency was similar for bacteria grown in rich or minimal media, and from cells recovered from compatible and incompatible plant hosts, indicating that growth conditions do not influence transposition in strain 1448A. Most of the entrapped insertions contained a full-length IS801 element, with the remaining insertions corresponding to sequences smaller than any transposable element identified in strain 1448A, and collectively identified as miniature sequences. From these, fragments of 229, 360 and 679-nt of the right end of IS801 ended in a consensus tetranucleotide and likely resulted from one-ended transposition of IS801. An average 0.7% of the insertions analyzed consisted of IS801 carrying a fragment of variable size from gene PSPPH_0008/PSPPH_0017, showing that IS801 can mobilize DNA in vivo. Retrospective analysis of complete plasmids and genomes of P. syringae suggests, however, that most fragments of IS801 are likely the result of reorganizations rather than one-ended transpositions, and that this element might preferentially contribute to genome flexibility by generating homologous regions of recombination. A further miniature sequence previously found to affect host range specificity and virulence, designated MITEPsy1 (100-nt), represented an average 2.4% of the total number of insertions entrapped in sacB, demonstrating for the first time the mobilization of a MITE in bacteria.

Central Archive at the University of Reading

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Academica-e

Charcoal identification in species-rich biomes: a protocol for Central Africa optimised for the Mayumbe forest

Author: Beeckman Hans
Hubau Wannes
Kitin Peter
Mees Florias
Van Acker Joris
Van den Bulcke Jan
Publication venue: Elsevier BV
Publication date: 01/01/2012

Ghent University Academic Bibliography

Gerstmann-Sträussler-Scheinker disease revisited: accumulation of covalently-linked multimers of internal prion protein fragments

Author: Cali Ignazio
Cracco Laura
Gambetti Pierluigi
Ghetti Bernardino
Lavrich Jody
Nemani Satish K.
Notari Silvio
Surewicz Witold K.
Xiao Xiangzhu
Publication venue: Springer Science and Business Media LLC
Publication date: 29/05/2019

Despite their phenotypic heterogeneity, most human prion diseases belong to two broadly defined groups: Creutzfeldt-Jakob disease (CJD) and Gerstmann-Sträussler-Scheinker disease (GSS). While the structural characteristics of the disease-related proteinase K-resistant prion protein (resPrPD) associated with the CJD group are fairly well established, many features of GSS-associated resPrPD are unclear. Electrophoretic profiles of resPrPD associated with GSS variants typically show 6-8 kDa bands corresponding to the internal PrP fragments as well as a variable number of higher molecular weight bands, the molecular nature of which has not been investigated. Here we have performed systematic studies of purified resPrPD species extracted from GSS cases with the A117V (GSSA117V) and F198S (GSSF198S) PrP gene mutations. The combined analysis based on epitope mapping, deglycosylation treatment and direct amino acid sequencing by mass spectrometry provided conclusive evidence that the high molecular weight resPrPD species seen in electrophoretic profiles represent covalently-linked multimers of the internal ~7 and ~8 kDa fragments. This finding reveals a mechanism of resPrPD aggregate formation that has not been previously established in prion diseases.

IUPUIScholarWorks

Distribution, functional impact, and origin mechanisms of copy number variation in the barley genome

Author: Ariyadasa Ruvini
Eichten Steven R
Jeddeloh Jeffrey A
Mascher Martin
Mayer Klaus FX
Muehlbauer Gary J
Muñoz-Amatriaín María
Nussbaumer Thomas
Platzer Matthias
Richmond Todd A
Scholz Uwe
Spannagl Manuel
Springer Nathan M
Stein Nils
Steuernagel Burkhard
Taudien Stefan
Wicker Thomas
Publication venue: Springer Science and Business Media LLC
Publication date: 10/12/2015

BACKGROUND: There is growing evidence for the prevalence of copy number variation (CNV) and its role in phenotypic variation in many eukaryotic species. Here we use array comparative genomic hybridization to explore the extent of this type of structural variation in domesticated barley cultivars and wild barleys. RESULTS: A collection of 14 barley genotypes including eight cultivars and six wild barleys were used for comparative genomic hybridization. CNV affects 14.9% of all the sequences that were assessed. Higher levels of CNV diversity are present in the wild accessions relative to cultivated barley. CNVs are enriched near the ends of all chromosomes except 4H, which exhibits the lowest frequency of CNVs. CNV affects 9.5% of the coding sequences represented on the array and the genes affected by CNV are enriched for sequences annotated as disease-resistance proteins and protein kinases. Sequence-based comparisons of CNV between cultivars Barke and Morex provided evidence that DNA repair mechanisms of double-strand breaks via single-stranded annealing and synthesis-dependent strand annealing play an important role in the origin of CNV in barley. CONCLUSIONS: We present the first catalog of CNVs in a diploid Triticeae species, which opens the door for future genome diversity research in a tribe that comprises the economically important cereal species wheat, barley, and rye. Our findings constitute a valuable resource for the identification of CNV affecting genes of agronomic importance. We also identify potential mechanisms that can generate variation in copy number in plant genomes.
This work was financially supported by the following grants: project GABI-BARLEX, German Federal Ministry of Education and Research (BMBF), #0314000 to MP, US, KFXM and NS; Triticeae Coordinated Agricultural Project, USDA-NIFA #2011-68002-30029 to GJM; and Agriculture and Food Research Initiative Plant Genome, Genetics and Breeding Program of USDA’s Cooperative State Research and Extension Service, #2009-65300-05645 to GJM.

The Australian National University