
    Web Page Multiclass Classification

    As the internet age evolves, the volume of content hosted on the Web is rapidly expanding. With this ever-expanding content, the ability to accurately categorize web pages is an ongoing challenge that serves many use cases. This paper proposes a variation on the text preprocessing pipeline in which noun phrase extraction is performed first, followed by lemmatization, contraction expansion, removal of special characters, removal of extra white space, lowercasing, and removal of stop words. The initial noun phrase extraction step aims to reduce the set of terms to those that best describe what the web pages are about, improving the categorization capability of the model. Separately, text preprocessing using keyword extraction is evaluated. In addition to the text preprocessing techniques mentioned, feature reduction techniques are applied to optimize model performance. Several modeling techniques are examined using these two approaches and compared to a baseline model. The baseline model is a Support Vector Machine with a linear kernel, based on text preprocessing and feature reduction techniques that include neither noun phrase extraction nor keyword extraction and that use stemming rather than lemmatization. The recommended SVM One-Versus-One model, based on noun phrase extraction and lemmatization during text preprocessing, shows an accuracy improvement of nearly 1% over the baseline model and a 5-fold reduction in the misclassification of web pages into undesirable categories.
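
    A minimal sketch of this preprocessing order, assuming spaCy for noun phrase extraction and lemmatization; the contraction table, regular expressions, and the en_core_web_sm model are illustrative choices, not the paper's exact implementation.

```python
# Sketch of the described pipeline order: noun phrases first, then
# lemmatization, contraction expansion, special-character and whitespace
# cleanup, lowercasing, and stop-word removal. Assumes spaCy with the
# en_core_web_sm model installed; the contraction table is illustrative.
import re

import spacy

nlp = spacy.load("en_core_web_sm")
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}

def preprocess(text):
    # Step 1: keep only noun phrases, the terms most descriptive of a page.
    phrases = " ".join(chunk.text for chunk in nlp(text).noun_chunks)
    # Step 2: lemmatize the retained terms (instead of stemming).
    lemmas = " ".join(tok.lemma_ for tok in nlp(phrases))
    # Step 3: expand contractions from a lookup table.
    for short, full in CONTRACTIONS.items():
        lemmas = lemmas.replace(short, full)
    # Steps 4-6: strip special characters, collapse whitespace, lowercase.
    cleaned = re.sub(r"[^A-Za-z\s]", " ", lemmas)
    cleaned = re.sub(r"\s+", " ", cleaned).strip().lower()
    # Step 7: drop stop words.
    return [t for t in cleaned.split() if not nlp.vocab[t].is_stop]

print(preprocess("The web pages' contents can't be categorized easily."))
```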

    XSS attack detection based on machine learning

    As the popularity of web-based applications grows, so does the number of individuals who use them. The vulnerabilities of those applications, however, remain a concern. Cross-site scripting is a very prevalent attack that is simple to launch but difficult to defend against, which is why it merits study. The current study focuses on artificial systems, such as machine learning, which can function without human interaction. As technology advances, the need for maintenance increases, while the systems being maintained grow more complex; this is why machine learning technologies are becoming increasingly important in our daily lives. This study uses supervised machine learning to protect against cross-site scripting, allowing the computer to find an algorithm that can identify vulnerabilities. A large collection of datasets serves as the foundation for this technique. The model is equipped with features extracted from the datasets that allow it to learn the pattern of such an attack, filtering inputs by common JavaScript symbols and possible Document Object Model (DOM) syntax. As the research continues, the best combinations of algorithms for successfully defending against cross-site scripting will be identified through multiple comparisons between different classification methods, on their own or in combination, to determine which performs best.
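
    A minimal sketch of the kind of filtering described, counting common JavaScript symbols and DOM-related tokens as features for a supervised classifier; the token list, toy samples, and choice of random forest are assumptions, not the study's exact pipeline.

```python
# Sketch: represent each input by counts of common JavaScript symbols
# and DOM-related tokens, then train a supervised classifier. The token
# list and toy samples are illustrative, not the study's exact features.
from sklearn.ensemble import RandomForestClassifier

JS_TOKENS = ["<script", "alert(", "eval(", "document.cookie",
             "document.write", "onerror=", "onload=", "javascript:"]

def features(sample):
    s = sample.lower()
    return [s.count(tok) for tok in JS_TOKENS] + [len(s)]

samples = [
    "<script>alert(document.cookie)</script>",  # malicious
    "<img src=x onerror=alert(1)>",             # malicious
    "hello world, welcome to our site",         # benign
    "<p>plain paragraph content</p>",           # benign
]
labels = [1, 1, 0, 0]

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit([features(s) for s in samples], labels)
print(clf.predict([features("javascript:eval(document.cookie)")]))
```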

    Imbalanced data classification and its application in cyber security

    Cyber security, also known as information technology security or simply information security, aims to protect government organizations, companies, and individuals by defending their computers, servers, electronic systems, networks, and data from malicious attacks. With the advancement of client-side, on-the-fly web content generation techniques, it becomes easier for attackers to modify the content of a website dynamically and gain access to valuable information. The impact of cybercrime on the global economy is greater than ever and growing day by day. Among the various types of cybercrime, financial attacks are widespread and the financial sector is among the most targeted. Both corporations and individuals lose huge amounts of money each year. The majority of financial attacks are carried out by banking malware and web-based attacks. End users are not always skilled enough to differentiate between injected content and the actual contents of a webpage. Designing a real-time security system that ensures a safe browsing experience is a challenging task. Some existing solutions are designed for the client side, requiring every user to install them, which is difficult to achieve in practice. In addition, organizations and individuals use various platforms and tools, so different solutions need to be designed. Existing server-side solutions often focus on sanitizing and filtering inputs and will fail to detect obfuscated and hidden scripts. Because such a security system must run in real time, any significant delay will hamper the user experience, so finding the most optimized and efficient solution is very important. Easy installation and integration with existing systems are also critical factors to consider: a solution that is efficient but difficult to integrate may not be feasible for practical use. Unsupervised and supervised data classification techniques have been widely applied to design algorithms for solving cyber security problems. The performance of these algorithms varies depending on the type of cyber security problem and the size of the dataset. To date, existing algorithms do not achieve high accuracy in detecting malware activities. Datasets in cyber security, especially those from the financial sector, are predominantly imbalanced, as the number of malware activities is significantly smaller than the number of normal activities. This means that classifiers for imbalanced datasets can be used to develop supervised data classification algorithms to detect malware activities. The development of classifiers for imbalanced datasets has been a subject of research over the last decade. Most of these classifiers are based on oversampling and undersampling techniques and are inefficient in many situations because such techniques are applied globally. In this thesis, we develop two new algorithms for solving supervised data classification problems on imbalanced datasets and then apply them to malware detection problems. The first algorithm is designed using piecewise linear classifiers, formulating the problem as an optimization problem and applying the penalty function method; more specifically, we add a larger penalty to the objective function for misclassified points from minority classes. The second method is based on a combination of supervised and unsupervised (clustering) algorithms. Such an approach allows one to identify areas of the input space where minority classes are located and to apply local oversampling or undersampling, leading to the design of more efficient and accurate classifiers. The proposed algorithms are tested on real-world datasets, and the results clearly demonstrate the superiority of the newly introduced algorithms. We then apply these algorithms to design classifiers that detect malware.
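
    A minimal sketch of the first algorithm's core idea under stated assumptions: a linear classifier trained with per-sample weights that penalize minority-class misclassification more heavily. The weighting scheme stands in for the thesis's penalty-function formulation over piecewise linear classifiers, which is not reproduced here.

```python
# Sketch: penalize minority-class misclassification more heavily via
# per-sample weights on a hinge-loss linear classifier, a stand-in for
# the thesis's penalty-function formulation (not reproduced here).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Imbalanced toy data: ~5% "malware" (class 1), ~95% "normal" (class 0).
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)

# A heavier penalty on minority errors pushes the boundary toward them.
ratio = np.bincount(y)[0] / np.bincount(y)[1]
sample_weight = np.where(y == 1, ratio, 1.0)

clf = SGDClassifier(loss="hinge", random_state=0)
clf.fit(X, y, sample_weight=sample_weight)
print("minority recall:", (clf.predict(X[y == 1]) == 1).mean())
```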

    Cyber Security

    This open access book constitutes the refereed proceedings of the 18th China Annual Conference on Cyber Security, CNCERT 2022, held in Beijing, China, in August 2022. The 17 papers presented were carefully reviewed and selected from 64 submissions. The papers are organized according to the following topical sections: data security; anomaly detection; cryptocurrency; information security; vulnerabilities; mobile internet; threat intelligence; text recognition.

    A Deep-dive into Cryptojacking Malware: From an Empirical Analysis to a Detection Method for Computationally Weak Devices

    Cryptojacking is the act of using a victim's computation power without his or her consent. Unauthorized mining incurs extra electricity consumption and dramatically decreases the victim host's computational efficiency. In this thesis, we perform extensive research on cryptojacking malware from every aspect. First, we present a systematic overview of cryptojacking malware based on information obtained from a combination of academic research papers, two large cryptojacking sample datasets, and numerous major attack instances. Second, we create a dataset of 6269 websites containing cryptomining scripts in their source code to characterize the in-browser cryptomining ecosystem, differentiating permissioned from permissionless cryptomining samples. Third, we introduce an accurate and efficient IoT cryptojacking detection mechanism based on network traffic features that achieves an accuracy of 99%. Finally, we believe this thesis will greatly expand the scope of research and facilitate other novel solutions in the cryptojacking domain.
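
    A minimal sketch of a network-traffic-feature detector of the kind described; the features (packet size and inter-arrival summaries), the synthetic flows, and the gradient boosting model are assumptions common in this literature, not the thesis's exact design.

```python
# Sketch: classify flows as cryptojacking vs. benign from simple traffic
# statistics. Mining traffic tends to be small, regular bursts; the
# feature set and synthetic data here are illustrative assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def flow_features(pkt_sizes, inter_arrivals):
    """Summarize one flow by packet-size and timing statistics."""
    return [np.mean(pkt_sizes), np.std(pkt_sizes),
            np.mean(inter_arrivals), np.std(inter_arrivals)]

rng = np.random.default_rng(0)
X, y = [], []
for _ in range(200):  # synthetic "mining" flows: small, regular packets
    X.append(flow_features(rng.normal(120, 5, 50), rng.normal(2.0, 0.05, 50)))
    y.append(1)
for _ in range(200):  # synthetic "benign" flows: larger, bursty packets
    X.append(flow_features(rng.normal(800, 400, 50), rng.exponential(1.0, 50)))
    y.append(0)

clf = GradientBoostingClassifier(random_state=0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```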

    Click fraud: how to spot it, how to stop it?

    Online search advertising is currently the greatest source of revenue for many Internet giants such as Google™, Yahoo!™, and Bing™. The increased number of specialized websites and modern profiling techniques have contributed to an explosion in ad brokers' income from online advertising. The single biggest threat to this growth, however, is click fraud. Trained botnets and even individuals are hired by click-fraud specialists to maximize the revenue of certain users from the ads they publish on their websites, or to launch attacks between competing businesses. Most academics and consultants who study online advertising estimate that 15% to 35% of ads in pay-per-click (PPC) online advertising systems are not authentic. In the first two quarters of 2010, US marketers alone spent $5.7 billion on PPC ads, where PPC ads are between 45 and 50 percent of all online ad spending; on average, about $1.5 billion is wasted due to click fraud. These fraudulent clicks are believed to be initiated by users in poor countries, or by botnets, trained to click on specific ads. For example, according to a 2010 study from Information Warfare Monitor, the operators of Koobface, a program that installed malicious software to participate in click fraud, made over $2 million in just over a year. The process of making such illegitimate clicks to generate revenue is called click fraud. Search engines claim they filter out most questionable clicks and either do not charge for them or reimburse advertisers who have been wrongly billed. However, this is a hard task, despite claims that brokers' efforts are satisfactory. In the simplest scenario, a publisher continuously clicks on the ads displayed on his own website in order to make revenue. In a more complicated scenario, a travel agent may hire a large, globally distributed botnet to click on its competitor's ads, thereby depleting their daily budget. We analyzed these different types of click fraud methods and proposed new methodologies to detect and prevent them in real time. While traditional commercial approaches detect only some specific types of click fraud, the Collaborative Click Fraud Detection and Prevention (CCFDP) system, an architecture that we have implemented based on the proposed methodologies, can detect and prevent all major types of click fraud. The proposed solution collaboratively analyzes detailed user activities on both the server side and the client side to better describe the intention behind a click. Data fusion techniques are developed to combine evidence from several data mining models and obtain a better estimate of the quality of the click traffic. Our ideas are tested through the development of the CCFDP system. Experimental results show that the CCFDP system is better than an existing commercial click fraud solution in three major aspects: 1) it detects more click fraud, especially clicks generated by software; 2) it provides prevention capability; and 3) it introduces the concept of a click quality score for click quality estimation. In the initial version of CCFDP, we analyzed the performance of the click fraud detection and prediction model using a rule-based algorithm, similar to most existing systems.
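
    A minimal sketch of the fusion idea: combine scores from several detectors into a single click quality score rather than a hard fraud/genuine label. The detectors, features, and weights are illustrative assumptions, not the CCFDP implementation.

```python
# Sketch: fuse evidence from several detectors into one click quality
# score in [0, 1] instead of a hard fraud/genuine label. The detectors
# and fusion weights are illustrative, not the CCFDP implementation.
from dataclasses import dataclass

@dataclass
class Click:
    ip_repeats: int        # clicks from this IP in the last minute
    has_mouse_moves: bool  # client-side evidence of a human user
    session_depth: int     # pages visited in the session

def rule_score(c):      # server-side rule evidence
    return 0.1 if c.ip_repeats > 5 else 0.9

def client_score(c):    # client-side behavioral evidence
    return 0.9 if c.has_mouse_moves else 0.2

def session_score(c):   # session-depth evidence
    return min(1.0, 0.3 + 0.2 * c.session_depth)

WEIGHTS = [0.4, 0.4, 0.2]  # illustrative fusion weights

def quality(c):
    scores = [rule_score(c), client_score(c), session_score(c)]
    return sum(w * s for w, s in zip(WEIGHTS, scores))

print(quality(Click(ip_repeats=8, has_mouse_moves=False, session_depth=1)))
```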
    We assigned a quality score to each click instead of classifying the click as fraudulent or genuine, because it is hard to obtain solid evidence of click fraud from the collected data alone, and it is difficult to determine the real intention of the users who make the clicks. Results from the initial version revealed that the diversity of click fraud attack types makes it hard for a single countermeasure to prevent click fraud; it is therefore important to combine multiple measures capable of effective protection against click fraud. In the improved version of CCFDP, we accordingly compute the traffic quality score as a combination of evidence from several data mining algorithms. We tested the system with data from an actual ad campaign in 2007 and 2008 and compared the results with Google AdWords reports for the same campaign. The results show that a higher percentage of click fraud is present even with the most popular search engine; the multiple-model-based CCFDP always estimated less valid traffic than Google, sometimes by as much as 53%. Fast and efficient detection of duplicates is one of the most important requirements in any click fraud solution. Duplicate detection algorithms usually run in real time, so solution providers should use data structures that can be updated in real time with minimal space requirements. In this dissertation, we also address the problem of detecting duplicate clicks in pay-per-click streams. We propose a simple data structure, the Temporal Stateful Bloom Filter (TSBF), an extension of the regular Bloom Filter and Counting Bloom Filter in which the bit vector of the Bloom Filter is replaced with a status vector. The duplicate detection results of the TSBF method are compared with the Buffering, FPBuffering, and CBF methods. The false positive rate of TSBF is less than 1%, it has no false negatives, and its space requirement is the smallest among these solutions. Although Buffering has neither false positives nor false negatives, its space requirement increases exponentially with the size of the stream data. When the false positive rate of FPBuffering is set to 1%, its false negative rate jumps to around 5%, which most streaming data applications will not tolerate. We also compared the TSBF results with CBF: TSBF uses half the space or less of a standard CBF with the same false positive probability. One of the biggest successes of CCFDP is the discovery of a new mercantile click bot, the Smart ClickBot. We present a Bayesian approach for detecting Smart ClickBot-type clicks. The system combines evidence extracted from web server sessions to determine the final class of each click; some of this evidence can be used alone, while some can be used in combination with other features for click bot detection. During training and testing we also addressed the class imbalance problem. Our best classifier shows a recall of 94% and a precision of 89%, with an F1 measure of 92%. The high accuracy of our system demonstrates the effectiveness of the proposed methodology. Since the Smart ClickBot is a sophisticated click bot that manipulates every possible parameter to go undetected, the techniques discussed here can lead to the detection of other types of software bots as well.
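
    A minimal sketch of the TSBF idea as described: a Bloom-filter-like structure whose bit vector becomes a status vector carrying a coarse last-seen timestamp, so a repeat within the time window is flagged as a duplicate. The hashing scheme and window handling here are assumptions, not the dissertation's exact design.

```python
# Sketch of the TSBF idea: a Bloom-filter-like structure whose bit
# vector is replaced by a status vector holding a last-seen timestamp,
# so a repeated click within the window is flagged as a duplicate.
# Hashing and window handling are assumptions, not the exact design.
import hashlib

class TemporalStatefulBloomFilter:
    def __init__(self, size=1 << 16, hashes=4, window=60.0):
        self.size, self.hashes, self.window = size, hashes, window
        self.status = [0.0] * size  # last-seen time per slot, 0 = empty

    def _slots(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def seen(self, key, now):
        """Return True if key was seen within the window, then record it."""
        duplicate = all(self.status[s] > 0 and
                        now - self.status[s] <= self.window
                        for s in self._slots(key))
        for s in self._slots(key):
            self.status[s] = now
        return duplicate

tsbf = TemporalStatefulBloomFilter()
print(tsbf.seen("ad42|203.0.113.7", now=10.0))   # False: first sighting
print(tsbf.seen("ad42|203.0.113.7", now=25.0))   # True: repeat in window
print(tsbf.seen("ad42|203.0.113.7", now=100.0))  # False: outside window
```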
    Despite the enormous capabilities of modern machine learning and data mining techniques in modeling complicated problems, most available click fraud detection systems are rule-based. Click fraud solution providers keep their rules as a secret weapon and bargain with others to prove their superiority. We propose a validation framework that acquires another model of the click data, one that is not rule-dependent and instead learns the inherent statistical regularities of the data; the outputs of the two models are then compared. Due to the uniqueness of its architecture, the CCFDP system is better than current commercial solutions and search engine/ISP solutions. The system protects pay-per-click advertisers from click fraud and improves their return on investment (ROI). It can also provide an arbitration system for advertisers and PPC publishers whenever a click fraud dispute arises, and advertisers can gain confidence in PPC advertising by having a channel through which to contest traffic quality with large search engine publishers. The results of this system will bolster the internet economy by eliminating shortcomings of the PPC business model, and general consumers will gain confidence in internet business models through the reduction of the fraudulent activities that are rife in today's virtual internet world.
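
    A minimal sketch of the validation idea: alongside a rule-based labeler, fit a model that learns the statistical regularities of the click data and inspect where the two disagree; the rule, the features, and the naive Bayes model are illustrative assumptions, not the framework's implementation.

```python
# Sketch: next to a rule-based labeler, fit a statistical model of the
# click data, then compare outputs; disagreements flag clicks worth
# auditing. Rule and features are illustrative assumptions.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Feature per click: [clicks from same IP in last hour, seconds on page]
X = np.column_stack([rng.poisson(2, 500), rng.exponential(30, 500)])

def rule_label(x):  # illustrative rule: heavy repeats or instant bounce
    return int(x[0] > 3 or x[1] < 2)

rule_labels = np.array([rule_label(x) for x in X])

# A model trained on the rule's labels; where it disagrees, the click's
# features do not fit the statistical pattern the rule implies.
model = GaussianNB().fit(X, rule_labels)
disagree = model.predict(X) != rule_labels
print(f"{disagree.sum()} of {len(X)} clicks flagged for audit")
```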