
    Machine learning approach for detection of nonTor traffic

    Intrusion detection has attracted considerable interest from researchers and industry. After many years of research, the community still faces the problem of building reliable and efficient intrusion detection systems (IDS) capable of handling large quantities of data with changing patterns in real-time situations. The Tor network is popular for providing privacy and security to end users by anonymizing their identities as they connect through a series of tunnels and nodes. This work addresses two problems: the classification of Tor and non-Tor traffic, to expose the activities within Tor traffic that undermine user protection, using the UNB-CIC Tor Network Traffic dataset; and the classification of Tor traffic flows in the network. This paper proposes a hybrid classifier: an Artificial Neural Network combined with the Correlation-based Feature Selection (CFS) algorithm for dimensionality reduction and improved classification performance. The reliability and efficiency of the proposed hybrid classifier are compared with Support Vector Machine and naïve Bayes classifiers in detecting non-Tor traffic in the UNB-CIC Tor Network Traffic dataset. Experimental results show that the hybrid classifier, ANN-CFS, proved a better classifier for detecting non-Tor traffic and classifying Tor traffic flows in the UNB-CIC Tor Network Traffic dataset.
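    The pipeline the abstract describes (correlation-based feature selection feeding a neural network) can be sketched as follows. This is a minimal illustration on synthetic data, not the paper's method: full CFS also penalizes feature-feature correlation, whereas this toy filter keeps only the features most correlated with the label, and all sizes and thresholds here are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for UNB-CIC flow features (dimensions are made up).
X, y = make_classification(n_samples=1000, n_features=30, n_informative=8,
                           n_redundant=10, random_state=0)

# Simplified correlation filter: rank features by |corr(feature, label)|.
corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
keep = np.argsort(corr)[-10:]          # retain the 10 strongest features

X_tr, X_te, y_tr, y_te = train_test_split(X[:, keep], y, random_state=0)
scaler = StandardScaler().fit(X_tr)

# Small feed-forward network on the reduced feature set.
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(scaler.transform(X_tr), y_tr)
acc = clf.score(scaler.transform(X_te), y_te)
print(f"held-out accuracy: {acc:.2f}")
```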

    Data-Adaptive Kernel Support Vector Machine

    In this thesis, we propose the data-adaptive kernel Support Vector Machine (SVM), a new method with a data-driven scaling kernel function based on real data sets. This two-stage approach to kernel function scaling can enhance the accuracy of a support vector machine, especially when the data are imbalanced. Following the standard SVM procedure in the first stage, the proposed method locally adapts the kernel function to data locations based on the skewness of the class outcomes. In the second stage, the decision rule is constructed with the data-adaptive kernel function and used as the classifier. This process enlarges the magnification effect directly on the Riemannian manifold within the feature space rather than the input space. The proposed data-adaptive kernel SVM technique is applied to binary classification and extended to multi-class situations where imbalance is a main concern. We conduct extensive simulation studies to assess the performance of the proposed methods, and a prostate cancer image study is employed as an illustration. The data-adaptive kernel is further applied in the feature selection process. We propose the data-adaptive kernel-penalized SVM, a new method for simultaneous feature selection and classification that penalizes data-adaptive kernels in SVMs. Instead of penalizing the standard cost function of SVMs in the usual way, the penalty is added directly to the dual objective function that contains the data-adaptive kernel. Classification results with sparse selected features can be obtained simultaneously. Different penalty terms in the data-adaptive kernel-penalized SVM are compared, and the oracle property of the estimator is examined. We conduct extensive simulation studies to assess the performance of all the proposed methods, and employ the method on a breast cancer data set as an illustration.
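    The two-stage idea can be sketched with a conformal kernel scaling: fit a standard SVM, then rescale the kernel so the metric is magnified near the decision boundary, and refit with the scaled kernel. This is only a toy version; the thesis's scaling function is data-driven and skewness-based, while the factor here (`c(x) = exp(-kappa * f(x)^2)`, with an arbitrary `kappa`) is a generic boundary-magnifying choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Imbalanced toy data (80/20 class weights).
X, y = make_classification(n_samples=400, n_features=5, weights=[0.8, 0.2],
                           random_state=0)

# Stage 1: standard RBF SVM to locate the decision boundary.
stage1 = SVC(kernel="rbf", gamma=0.5).fit(X, y)
f = stage1.decision_function(X)

# Conformal factor: largest where |f| is small, i.e. near the boundary,
# which magnifies the induced Riemannian metric there. kappa = 0.5 is
# illustrative, not tuned.
c = np.exp(-0.5 * f ** 2)

# Stage 2: scaled kernel K~(x, z) = c(x) c(z) K(x, z), fit as precomputed.
K = c[:, None] * c[None, :] * rbf_kernel(X, X, gamma=0.5)
stage2 = SVC(kernel="precomputed").fit(K, y)
acc = stage2.score(K, y)
print(f"training accuracy with scaled kernel: {acc:.2f}")
```

Note that at prediction time the same scaling must be applied to the test-versus-train kernel matrix, which is why `kernel="precomputed"` is the natural interface here.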

    A Review of Codebook Models in Patch-Based Visual Object Recognition

    The codebook model-based approach, while ignoring any structural aspect in vision, nonetheless provides state-of-the-art performance on current datasets. The key role of a visual codebook is to provide a way to map low-level features into a fixed-length vector in histogram space, to which standard classifiers can be directly applied. The discriminative power of such a visual codebook determines the quality of the codebook model, whereas the size of the codebook controls the complexity of the model. Thus, the construction of a codebook is an important step, usually done by cluster analysis. However, clustering is a process that retains regions of high density in a distribution, so the resulting codebook need not have discriminant properties. Clustering is also recognised as a computational bottleneck of such systems. In our recent work, we proposed a resource-allocating codebook that constructs a discriminant codebook in a one-pass design procedure, slightly outperforming more traditional approaches at drastically reduced computing times. In this review we survey several approaches proposed over the last decade, covering their use of feature detectors, descriptors, codebook construction schemes, choice of classifiers for recognising objects, and the datasets used in evaluating the proposed methods.
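    The core mapping the review describes, from a variable-size set of local descriptors to a fixed-length histogram over visual words, can be sketched with k-means clustering. The descriptor dimensionality and codebook size below are arbitrary placeholders (real SIFT descriptors, for instance, are 128-dimensional).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy stand-in for local patch descriptors pooled from training images.
train_descriptors = rng.normal(size=(500, 16))

# Build the codebook by clustering; each centroid is a "visual word".
# The codebook size (20 here) trades discrimination against complexity.
codebook = KMeans(n_clusters=20, n_init=5, random_state=0).fit(train_descriptors)

def encode(image_descriptors, codebook):
    """Map a variable-size descriptor set to a fixed-length histogram."""
    words = codebook.predict(image_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()           # L1-normalise for the classifier

# An image with 37 descriptors still yields a length-20 vector.
hist = encode(rng.normal(size=(37, 16)), codebook)
print(hist.shape)
```

Because every image becomes a same-length vector, any standard classifier (SVM, logistic regression) can then be trained directly on these histograms.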

    Tackling Uncertainties and Errors in the Satellite Monitoring of Forest Cover Change

    This study aims at improving the reliability of automatic forest change detection. Forest change detection is of vital importance for understanding global land cover as well as the carbon cycle. Remote sensing and machine learning have been widely adopted for such studies with increasing degrees of success. However, contemporary global studies still suffer from lower-than-satisfactory accuracies and robustness problems whose causes were largely unknown. Global geographical observations are complex, as a result of hidden, interweaving geographical processes. Is it possible that some geographical complexities were not anticipated by contemporary machine learning? Could they cause uncertainties and errors when contemporary machine learning theories are applied to remote sensing? This dissertation adopts the philosophy of error elimination. We start by explaining the mathematical origins of possible geographic uncertainties and errors in chapter two. Uncertainties are unavoidable but might be mitigated. Errors are hidden but might be found and corrected. Then in chapter three, experiments are specifically designed to assess whether or not contemporary machine learning theories can handle these geographic uncertainties and errors. In chapter four, we identify an unreported systemic error source: the proportion distribution of classes in the training set. A subsequent Bayesian optimal solution is designed to combine Support Vector Machine and Maximum Likelihood. Finally, in chapter five, we demonstrate how this type of error is widespread not just in classification algorithms, but also embedded in the conceptual definition of geographic classes before the classification. In chapter six, the sources of errors and uncertainties and their solutions are summarized, with theoretical implications for future studies. The most important finding is that how we design a classification largely pre-determines what we eventually get out of it. This applies to many contemporary popular classifiers, including various types of neural networks, decision trees, and support vector machines, and is a cause of the so-called overfitting problem in contemporary machine learning. Therefore, we propose that the emphasis of classification work be shifted to the planning stage before the actual classification. Geography should not just be the analysis of collected observations, but also the planning of observation collection. This is where geography, machine learning, and survey statistics meet.
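    The systemic error source named in chapter four, training-set class proportions acting as implicit priors, can be demonstrated in a few lines. This sketch is not the dissertation's SVM/Maximum-Likelihood combination; it uses a Gaussian naive Bayes classifier and a generic Bayesian prior correction (rescaling posteriors by the ratio of target to training priors) to show the same effect on synthetic data.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

def sample(n0, n1):
    """Two overlapping 2-D Gaussian classes (toy 'land cover' classes)."""
    X = np.vstack([rng.normal(0.0, 1.0, (n0, 2)),
                   rng.normal(1.5, 1.0, (n1, 2))])
    y = np.r_[np.zeros(n0), np.ones(n1)]
    return X, y

# The true landscape is 50/50, but the training set is sampled 90/10:
# the classifier silently absorbs the skewed proportions as priors.
X_tr, y_tr = sample(900, 100)
X_te, y_te = sample(1000, 1000)

clf = GaussianNB().fit(X_tr, y_tr)
biased = clf.score(X_te, y_te)

# Bayesian correction: rescale posteriors by (target prior / train prior).
post = clf.predict_proba(X_te) * np.array([0.5 / 0.9, 0.5 / 0.1])
corrected = (post.argmax(axis=1) == y_te).mean()
print(f"biased: {biased:.2f}  prior-corrected: {corrected:.2f}")
```

On this toy landscape the prior-corrected accuracy exceeds the naive one, illustrating why the training-set proportion distribution matters before any algorithm choice does.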

    Online Condition Monitoring of Electric Powertrains using Machine Learning and Data Fusion

    Safe and reliable operation of industrial machines is highly prioritized in industry. Typical industrial machines are complex systems, including electric motors, gearboxes and loads. A fault in critical industrial machines may lead to catastrophic failures, service interruptions and productivity losses; condition monitoring systems are therefore necessary in such machines. Conventional condition monitoring or fault diagnosis systems using signal processing and time- and frequency-domain analysis of vibration or current signals are widely used in industry, requiring an expensive, professional fault analysis team. Further, the traditional diagnosis methods mainly focus on single components in steady-state operation. Under dynamic operating conditions, the measured quantities are non-stationary, so those methods cannot provide reliable diagnosis results for complex gearbox-based powertrains, especially in multiple-fault contexts. In this dissertation, four main research problems in condition monitoring of gearboxes and powertrains have been identified, and novel solutions are provided based on a data-driven approach. The first research problem focuses on bearing fault diagnosis at early stages and under dynamic working conditions. The second problem is to increase the robustness of gearbox mixed-fault diagnosis under noisy conditions. Mixed-fault diagnosis under variable speeds and loads is considered as the third problem. Finally, the limited availability of labelled training or historical failure data in industry is identified as the main challenge for implementing data-driven algorithms. To address these problems, this study proposes data-driven fault diagnosis schemes based on order tracking, unsupervised and supervised machine learning, and data fusion. All the proposed fault diagnosis schemes are tested with experimental data, and key features of the proposed solutions are highlighted with comparative studies.
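    Order tracking, one of the techniques the dissertation builds on, addresses the non-stationarity problem by resampling a vibration signal from the time domain to the shaft-angle domain, so that fault-related spectral lines stay fixed as the speed varies. The following sketch uses made-up speeds and a synthetic 3rd-order component; only the resampling idea is the point.

```python
import numpy as np

fs = 1000.0                               # sample rate, Hz (illustrative)
t = np.arange(0, 2.0, 1 / fs)

# Shaft speeds up from 10 Hz to 20 Hz; a fault excites the 3rd order.
inst_freq = 10.0 + 5.0 * t                # instantaneous shaft frequency, Hz
phase = 2 * np.pi * np.cumsum(inst_freq) / fs   # shaft angle, rad
signal = np.sin(3 * phase)                # 3rd-order component, smeared in time

# Core of order tracking: resample at uniform shaft-angle increments.
angles = np.arange(0, phase[-1], 2 * np.pi / 64)   # 64 samples/revolution
resampled = np.interp(angles, phase, signal)

# In the angle domain the order spectrum shows a sharp line at order 3.
spectrum = np.abs(np.fft.rfft(resampled * np.hanning(len(resampled))))
orders = np.fft.rfftfreq(len(resampled), d=1 / 64)  # cycles per revolution
peak_order = orders[np.argmax(spectrum)]
print(f"dominant order: {peak_order:.2f}")
```

A plain FFT of the time-domain `signal` would smear this component across 30-60 Hz; the angle-domain spectrum concentrates it at a single order, which is what makes downstream machine-learning features stable under varying speed.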

    Predicting credit rating change using machine learning and natural language processing

    Corporate credit ratings provide standardized third-party information for market participants. They offer many benefits for issuers, intermediaries and investors and generally increase trust and efficiency in the market. Credit ratings are provided by credit rating agencies. In addition to quantitative information about companies (e.g. financial statements), the qualitative information in company-related textual documents is known to be a determinant in the credit rating process. However, the way in which the credit rating agencies interpret this data is not public information. The purpose of this thesis is to develop a supervised machine learning model that predicts credit rating changes as a binary classification problem, based on form 10-K annual reports of public U.S. companies. Before being used in the classification task, the form 10-K reports are pre-processed using natural language processing methods. More generally, this thesis aims to answer to what extent a change in a company's credit rating can be predicted based on the form 10-K reports, and whether the use of topic modeling can improve the results. A total of five different machine learning algorithms are used for the binary classification and their performances are compared: support vector machine, logistic regression, decision tree, random forest and naïve Bayes classifier. Topic modeling is implemented using latent semantic analysis. The studies of Hajek et al. (2016) and Chen et al. (2017) are the main sources of inspiration for this thesis, and the methods used here are for the most part similar to theirs. This thesis adds value to their findings by showing how the credit rating prediction methods in Hajek et al. (2016), the binary classification methods in Chen et al. (2017), and the utilization of form 10-K annual reports (used in both studies) can be combined into a binary credit rating classifier.
    The results of the study show that credit rating change can be predicted using 10-K data, but the predictions are not very accurate. The best classification results were obtained using a support vector machine, with an accuracy of 69.4% and an AUC of 0.6744. No significant improvement in classification performance was obtained using topic modeling.
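    The thesis's pipeline shape, TF-IDF features, latent semantic analysis, and an SVM classifier, can be sketched in scikit-learn, where `TruncatedSVD` on TF-IDF vectors implements LSA. The documents and labels below are tiny made-up stand-ins for 10-K filings and downgrade outcomes, not the thesis's data.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import LinearSVC

# Made-up "10-K"-style snippets; real inputs would be full filings.
docs = [
    "revenue declined and liquidity risk increased substantially",
    "strong cash flow and stable outlook for the coming year",
    "material weakness in internal controls and rising debt",
    "record earnings growth and reduced leverage this quarter",
] * 10
labels = [1, 0, 1, 0] * 10          # 1 = downgrade, 0 = no downgrade

# TF-IDF -> LSA (truncated SVD) -> linear SVM. The component count is
# arbitrary here; with real filings it would be in the hundreds.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=3, random_state=0),
    LinearSVC(),
)
model.fit(docs, labels)
pred = model.predict(["liquidity risk and internal control weakness"])
print(pred)
```

With real filings, the interesting question the thesis answers is whether the LSA step helps at all; on its data it reported no significant improvement over the plain TF-IDF features.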