A Novel Machine Learning Approach For File Fragments Classification
Identifying types of manipulated or corrupted file fragments in isolation from their context is an essential task in digital forensics. In traditional file type identification, metadata, such as file extensions and header and footer signatures, is used. Traditional metadata-based approaches do not work where metadata is missing or altered, therefore some alternative strategies and approaches need to be applied or developed to solve the problem.
One approach is to apply statistical techniques to extract features from the binary contents of file fragments and then use them as inputs for classification algorithms. This results in high dimensionality, causing learning and classification to be time-consuming. Another approach is deep learning neural networks, which extract features automatically. File fragment classification is further complicated by the high number of possible file classes. Also, some container file types, such as PowerPoint (PPT), include data belonging to other file types, such as JPEG, which can confuse the classification algorithms.
In this thesis, we developed a hybrid method to address high feature dimensionality. We use filters and wrappers to reduce the number of features. We explored the possible hierarchical relationships between file classes and we represent them with a hierarchy tree to help narrow the uncertainties for challenging file types. We proposed a novel hybrid approach that combines hierarchical models with feature selection to improve the accuracy of file fragment classification. We also explored the use of deep learning techniques for this task.
We test our methods using a benchmark dataset, GovDocs. The results from hybrid feature selection show a reduction in the number of features from 66,313 to 11–32, and provide improved accuracy compared to methods using all features. The accuracy increased from 69% using random forest to 75% using the DAG tree. We incorporate the hybrid feature selection into hierarchical modelling to generate trees that use only the most discriminative features. We find that these models outperformed classical machine-learning approaches. Finally, using deep learning for file fragment classification provided the highest accuracy of all techniques explored, reaching 86%.
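As an illustration of the statistical feature extraction mentioned in this abstract, the sketch below computes a normalised byte histogram and the Shannon entropy of a fragment. It is a minimal example under stated assumptions, not the thesis's feature set: the function name `byte_features` and the choice of these two statistics are hypothetical.

```python
import math
from collections import Counter


def byte_features(fragment: bytes) -> list[float]:
    """Return 256 normalised byte frequencies followed by Shannon entropy.

    Hypothetical helper: the abstract does not specify which statistics
    the authors extract; byte frequency and entropy are common choices.
    """
    if not fragment:
        raise ValueError("fragment must not be empty")
    counts = Counter(fragment)  # iterating over bytes yields ints 0..255
    n = len(fragment)
    freqs = [counts.get(b, 0) / n for b in range(256)]
    # Entropy in bits per byte: 0 for a constant fragment, 8 for uniform bytes.
    entropy = -sum(p * math.log2(p) for p in freqs if p > 0)
    return freqs + [entropy]
```

Even this simple vector has 257 dimensions per fragment; adding n-gram counts multiplies that quickly, which is the dimensionality problem the abstract's feature selection step addresses.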
CEAI: CCM based Email Authorship Identification Model
In this paper we present a model for email authorship identification (EAI) by
employing a Cluster-based Classification (CCM) technique. Traditionally,
stylometric features have been successfully employed in various authorship
analysis tasks; we extend the traditional feature-set to include some more
interesting and effective features for email authorship identification (e.g.
the last punctuation mark used in an email, the tendency of an author to use
capitalization at the start of an email, or the punctuation after a greeting or
farewell). We also included content features selected by information gain.
It is observed that the use of such features in the authorship identification
process has a positive impact on the accuracy of the authorship identification
task. We performed experiments to justify our arguments and compared the
results with other baseline models. Experimental results reveal that the
proposed CCM-based email authorship identification model, along with the
proposed feature set, outperforms the state-of-the-art support vector machine
(SVM)-based models, as well as the models proposed by Iqbal et al. [1, 2]. The
proposed model attains an accuracy of 94% for 10 authors, 89% for 25
authors, and 81% for 50 authors on the Enron dataset, and 89.5% accuracy
on a real email dataset constructed by the authors. The Enron results were
obtained for a considerably larger number of authors than in the models
proposed by Iqbal et al. [1, 2].
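The email-specific features listed in this abstract can be sketched as follows. This is a rough illustration only: the function name `email_style_features`, the returned keys, and the assumption that the first non-empty line of the body is the greeting are all hypothetical and not taken from the paper.

```python
import string


def email_style_features(body: str) -> dict:
    """Extract three simple stylometric cues from an email body.

    Assumption (not from the paper): the greeting is the first
    non-empty line of the message.
    """
    text = body.strip()
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # Last punctuation mark used anywhere in the email ("" if none).
    last_punct = next(
        (ch for ch in reversed(text) if ch in string.punctuation), ""
    )
    # Whether the first alphabetic character of the email is upper case.
    first_alpha = next((ch for ch in text if ch.isalpha()), "")
    starts_capitalised = first_alpha.isupper()
    # Punctuation mark closing the greeting line ("" if none).
    greeting = lines[0] if lines else ""
    ends_with_punct = bool(greeting) and greeting[-1] in string.punctuation
    punct_after_greeting = greeting[-1] if ends_with_punct else ""
    return {
        "last_punct": last_punct,
        "starts_capitalised": starts_capitalised,
        "punct_after_greeting": punct_after_greeting,
    }
```

Features like these would be combined with the conventional stylometric and content features before classification; how the paper encodes them is not stated in the abstract.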
A traffic classification method using machine learning algorithm
Applying concepts from attack investigation in the IT industry, this work designs
a traffic classification method that uses data mining and machine learning
techniques to separate normal from malicious traffic. This classification will
help in learning about unknown attacks faced by the IT industry. Traffic classification
is not a new concept; plenty of work has been done to classify network traffic for
heterogeneous applications. Existing techniques (payload-based, port-based,
and statistical) have their own pros and cons, which are discussed later in this
work, but classification using machine learning techniques is still an open field to explore and has provided very promising results so far.
Does training with amplitude modulated tones affect tone-vocoded speech perception?
Temporal-envelope cues are essential for successful speech perception. We asked here whether training on stimuli containing temporal-envelope cues without speech content can improve the perception of spectrally-degraded (vocoded) speech in which the temporal envelope (but not the temporal fine structure) is mainly preserved. Two groups of listeners were trained on different amplitude-modulation (AM) based tasks, either AM detection or AM-rate discrimination (21 blocks of 60 trials during two days, 1260 trials; frequency range: 4 Hz, 8 Hz, and 16 Hz), while an additional control group did not undertake any training. Consonant identification in vocoded vowel-consonant-vowel stimuli was tested before and after training on the AM tasks (or at an equivalent time interval for the control group). Following training, only the trained groups showed a significant improvement in the perception of vocoded speech, but the improvement did not significantly differ from that observed for controls. Thus, we do not find convincing evidence that this amount of training with temporal-envelope cues without speech content provides significant benefit for vocoded speech intelligibility. Alternative training regimens using vocoded speech along the linguistic hierarchy should be explored.
Towards Automated Performance Bug Identification in Python
Context: Software performance is a critical non-functional requirement,
appearing in many fields such as mission-critical applications, financial
systems, and real-time systems. In this work we focused on early detection of
performance bugs; our software under study was a real-time system used in the
advertisement/marketing domain.
Goal: Find a simple, easy-to-implement solution for predicting performance
bugs.
Method: We built several models using four machine learning methods, commonly
used for defect prediction: C4.5 Decision Trees, Naïve Bayes, Bayesian
Networks, and Logistic Regression.
Results: Our empirical results show that a C4.5 model, using lines of code
changed, file's age and size as explanatory variables, can be used to predict
performance bugs (recall=0.73, accuracy=0.85, and precision=0.96). We show that
reducing the number of changes delivered in a commit can decrease the chance
of performance bug injection.
Conclusions: We believe that our approach can help practitioners to eliminate
performance bugs early in the development cycle. Our results are also of
interest to theoreticians, establishing a link between functional bugs and
(non-functional) performance bugs, and explicitly showing that attributes used
for prediction of functional bugs can be used for prediction of performance
bugs.
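For readers comparing the recall, accuracy, and precision figures reported in this abstract, the sketch below shows how the three metrics are derived from confusion-matrix counts. The function name `classification_metrics` is hypothetical, and the abstract does not give the underlying counts, so no attempt is made to reproduce the paper's numbers.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute precision, recall, and accuracy from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives,
    and true negatives, with "positive" meaning "performance bug".
    """
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("at least one observation is required")
    # Precision: share of predicted bugs that are real bugs.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of real bugs that the model finds.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Accuracy: share of all predictions that are correct.
    accuracy = (tp + tn) / total
    return {"precision": precision, "recall": recall, "accuracy": accuracy}
```

A model with high precision and lower recall, the pattern reported above, rarely raises false alarms but misses some of the real performance bugs.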
Algorithmic Programming Language Identification
Motivated by the amount of code that goes unidentified on the web, we
introduce a practical method for algorithmically identifying the programming
language of source code. Our work is based on supervised learning and
intelligent statistical features. We also explored, but abandoned, a
grammatical approach. In testing, our implementation greatly outperforms that
of an existing tool that relies on a Bayesian classifier. Code is written in
Python and available under an MIT license. Comment: 11 pages. Code:
https://github.com/simon-weber/Programming-Language-Identificatio