
    Quantum inspired approach for early classification of time series

    Is it possible to apply some fundamental principles of quantum computing to time-series classification algorithms? This is the initial spark that became the research question I decided to pursue at the very beginning of my PhD studies. The idea came accidentally, after reading a note on the ability of entanglement to express the correlation between two particles even when they are far apart. The test problem was also at hand, because I was investigating possible algorithms for real-time bot detection, a challenging problem at the present day, by means of statistical approaches to sequential classification. The quantum-inspired algorithm presented in this thesis stemmed from an evolution of the statistical method mentioned above: it is a novel approach to binary and multinomial classification of an incoming data stream, inspired by the principles of quantum computing and designed to ensure the shortest decision time with high accuracy. The proposed approach exploits the analogy between the intrinsic correlation of two or more particles and the dependence of each item in a data stream on the preceding ones. Starting from the a-posteriori probability of each item belonging to a particular class, we can assign a qubit state representing a combination of these probabilities over all available observations of the time series. By leveraging superposition and entanglement on subsequences of growing length, it is possible to devise a measure of membership for each class, enabling the system to take a reliable decision as soon as a sufficient level of confidence is met. To provide an extensive and thorough analysis of the problem, a well-fitting approach for bot detection was replicated on our dataset and later compared with the statistical algorithm to determine the best option. The winner was subsequently examined against the new quantum-inspired proposal, showing the superior capability of the latter in both binary and multinomial classification of data streams. The validation of the quantum-inspired approach on a synthetically generated use case completes the research framework and opens new perspectives in on-the-fly time-series classification that we have just started to explore. To name a few, the algorithm is currently being tested, with encouraging results, in predictive maintenance and prognostics for automotive applications, in collaboration with the University of Bradford (UK), and in action recognition from video streams.
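The decision mechanism described above (per-item posteriors combined into a qubit-like joint state, with a decision taken as soon as confidence is sufficient) can be sketched as follows for the binary case. The amplitude formulas, the product-state combination, and the default threshold are illustrative assumptions, not the thesis's actual definitions:

```python
import math

def early_classify(posteriors, threshold=0.95):
    """Early binary classification of a data stream.

    posteriors: per-item a-posteriori probabilities of class 1.
    Each item is treated as a qubit with amplitudes (sqrt(p), sqrt(1-p));
    the joint (product) state over the growing prefix stands in for the
    entangled subsequence, and a decision is taken as soon as the
    squared, renormalised amplitude of one class passes the threshold.
    """
    log_a1 = log_a0 = 0.0   # log-amplitudes of the joint state
    conf = 0.5
    for t, p in enumerate(posteriors, start=1):
        log_a1 += math.log(max(p, 1e-12)) / 2        # amplitude sqrt(p)
        log_a0 += math.log(max(1.0 - p, 1e-12)) / 2  # amplitude sqrt(1-p)
        w1, w0 = math.exp(2 * log_a1), math.exp(2 * log_a0)
        conf = w1 / (w1 + w0)                        # membership measure
        if conf >= threshold:
            return 1, t                              # early decision: class 1
        if 1.0 - conf >= threshold:
            return 0, t                              # early decision: class 0
    # Stream exhausted before the confidence level was reached.
    return (1 if conf >= 0.5 else 0), len(posteriors)
```

On a stream whose posteriors consistently favour one class, the decision typically arrives after only a few items, which is the earliness the thesis targets.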

    A study of different web-crawler behaviour

    The article presents a study of web-crawler behaviour on different websites. A classification of web robots and information-gathering tools, together with methods for their detection, is provided. Well-known scrapers and their behaviour are analysed on the basis of a large set of web-server logs. Experimental results demonstrate that web robots can be distinguished from humans by feature analysis. The results of the research can serve as a basis for the development of a comprehensive intrusion detection and prevention system.
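The feature-analysis idea can be illustrated with a minimal sketch. The features below (a robots.txt hit, the ratio of asset requests, inter-request timing) are common choices in the robot-detection literature, and the thresholds are assumptions, not values taken from the article:

```python
def session_features(requests):
    """Compute simple per-session features from a web-server log.

    requests: list of (timestamp_seconds, path) tuples for one session.
    """
    n = len(requests)
    times = [t for t, _ in requests]
    gaps = [b - a for a, b in zip(times, times[1:])]
    return {
        # Crawlers often fetch robots.txt; browsers almost never do.
        "robots_txt": any(p.endswith("robots.txt") for _, p in requests),
        # Browsers request embedded assets; many bots fetch HTML only.
        "image_ratio": sum(p.endswith((".png", ".jpg", ".gif", ".css", ".js"))
                           for _, p in requests) / n,
        # Mean time between consecutive requests (bots are fast and regular).
        "mean_gap": sum(gaps) / len(gaps) if gaps else 0.0,
    }

def looks_like_robot(f):
    # Hand-set thresholds for illustration only.
    return f["robots_txt"] or (f["image_ratio"] < 0.1 and f["mean_gap"] < 1.0)
```

In practice such features would feed a trained classifier rather than fixed rules, but the sketch shows why feature analysis separates robots from humans.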

    A novel defense mechanism against web crawler intrusion

    Web robots, also known as crawlers or spiders, are used by search engines, hackers and spammers to gather information about web pages. Timely detection and prevention of unwanted crawlers increases the privacy and security of websites. In this research, a novel method for identifying web crawlers is proposed to prevent unwanted crawlers from accessing websites. The proposed method uses a five-factor identification process to detect unwanted crawlers. This study provides pretest and posttest results, along with a systematic evaluation of web pages protected by the proposed identification technique versus web pages without it. An experiment was performed with repeated measures for two groups, each containing ninety web pages. The output of the logistic regression analysis of the treatment and control groups confirms the novel five-factor identification process as an effective mechanism for preventing unwanted web crawlers. The study concludes that the proposed five-factor identification process is a very effective technique, as demonstrated by a successful outcome.
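The abstract does not name the five factors, so the sketch below uses five hypothetical stand-in factors and hand-set weights purely to illustrate the shape of a multi-factor identification score; none of these names or values come from the paper:

```python
# Hypothetical stand-ins for the paper's (unnamed) five factors.
FACTORS = {
    "requests_robots_txt": 2.0,
    "ignores_javascript": 1.5,
    "no_referrer_chain": 1.0,
    "high_request_rate": 1.5,
    "known_bot_user_agent": 3.0,
}

def crawler_score(observed):
    """observed: dict mapping factor name -> bool. Returns a weighted sum."""
    return sum(w for f, w in FACTORS.items() if observed.get(f))

def is_unwanted_crawler(observed, threshold=3.0):
    # A visitor triggering enough factors is classified as an unwanted crawler.
    return crawler_score(observed) >= threshold
```

A logistic regression over such binary factors, as the study performs, would learn the weights from labelled data instead of fixing them by hand.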

    Web Data Extraction, Applications and Techniques: A Survey

    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool for performing data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques allow the gathering of large amounts of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, which offers unprecedented opportunities to analyse human behaviour at a very large scale. We also discuss the potential of cross-fertilization, i.e., the possibility of re-using Web Data Extraction techniques originally designed for a given domain in other domains.

    Reverse Engineering Static Content and Dynamic Behaviour of E-Commerce Websites for Fun and Profit

    Nowadays electronic commerce websites are one of the main transaction tools between on-line merchants and consumers or businesses.
These e-commerce websites rely heavily on summarizing and analyzing the behavior of customers, making an effort to influence user actions towards the optimization of success metrics such as CTR (Click-through Rate), CPC (Cost per Conversion), Basket and Lifetime Value, and User Engagement. Knowledge extraction from existing e-commerce website datasets, using data mining and machine learning techniques, has greatly influenced Internet marketing activities. When faced with a new e-commerce website, the machine learning practitioner starts a web mining process by collecting historical and real-time data from the website and analyzing/transforming this data in order to extract information about the website's structure and content and its users' behavior. Only after this process are data scientists able to build relevant models and algorithms to enhance marketing activities. This is an expensive process in resources and time, since it always depends on the condition in which the data is presented to the data scientist: data of higher quality (i.e. without incomplete records) makes the data scientist's work easier and faster. On the other hand, in most cases, data scientists must resort to tracking domain-specific events throughout a user's visit to the website in order to discover the users' behavior; this requires code modifications to the pages themselves, which entails a larger risk of not capturing all the relevant information by failing to enable tracking mechanisms on certain pages.
For example, we may not know a priori that a visit to a Delivery Conditions page is relevant to predicting a user's willingness to buy, and would therefore not enable tracking on those pages. Within this problem context, the proposed solution consists of a methodology capable of extracting and combining information about an e-commerce website through a process of web mining, comprehending the structure as well as the content of the website's pages, relying mostly on identifying dynamic content and semantic information in predefined locations. This is complemented with the capability of extracting, from the users' access logs, more accurate models to predict users' future behavior. This allows for the creation of a data model representing an e-commerce website and its archetypical users, which can be useful, for example, in simulation systems.
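As one minimal illustration of predicting users' future behaviour from access logs, a first-order Markov model over page transitions can be built from sessionized logs; the thesis's actual predictive models may well be richer than this:

```python
from collections import Counter, defaultdict

def build_transition_model(sessions):
    """Build page-to-page transition probabilities from sessionized logs.

    sessions: list of page sequences, e.g. [["/", "/product", "/cart"], ...].
    Returns {page: {next_page: probability}}.
    """
    counts = defaultdict(Counter)
    for pages in sessions:
        for a, b in zip(pages, pages[1:]):
            counts[a][b] += 1          # count observed transitions
    return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for a, nxt in counts.items()}

def most_likely_next(model, page):
    """Predict the most probable next page, or None if page is terminal."""
    nxt = model.get(page)
    return max(nxt, key=nxt.get) if nxt else None
```

Such a model captures archetypical navigation patterns from logs alone, without any page-level event-tracking code.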

    Denial of Service in Web-Domains: Building Defenses Against Next-Generation Attack Behavior

    The existing state-of-the-art in the field of application-layer Distributed Denial of Service (DDoS) protection is generally designed, and thus effective, only for static web domains. To the best of our knowledge, our work is the first to study the problem of application-layer DDoS defense in web domains of dynamic content and organization, and for next-generation bot behaviour. In the first part of this thesis, we focus on the following research tasks: 1) we identify the main weaknesses of the existing application-layer anti-DDoS solutions proposed in the research literature and in industry; 2) we obtain a comprehensive picture of current-day as well as next-generation application-layer attack behaviour; and 3) we propose novel techniques, based on a multidisciplinary approach that combines offline machine learning algorithms and statistical analysis, for the detection of suspicious web visitors in static web domains. In the second part of the thesis, we propose and evaluate a novel anti-DDoS system that detects a broad range of application-layer DDoS attacks, both in static and dynamic web domains, through the use of advanced data-mining techniques. The key advantage of our system relative to other systems that resort to challenge-response tests (such as CAPTCHAs) in combating malicious bots is that it minimizes the number of tests presented to valid human visitors while succeeding in preventing most malicious attackers from accessing the web site. The results of the experimental evaluation of the proposed system demonstrate effective detection of current and future variants of application-layer DDoS attacks.
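The claimed key property (challenge-response tests shown only to suspicious visitors) can be sketched as a simple gating policy. The score below combines hypothetical behavioural features with hand-set weights; the thesis instead derives its detection models with offline machine learning and data mining:

```python
def anomaly_score(req_per_sec, error_ratio, dyn_page_ratio):
    """Combine per-visitor behavioural features into a score in [0, 1].

    The features and weights are illustrative assumptions: request rate,
    fraction of 4xx/5xx responses, and fraction of hits to dynamic pages.
    """
    s = (0.5 * min(req_per_sec / 10.0, 1.0)
         + 0.3 * error_ratio
         + 0.2 * dyn_page_ratio)
    return min(s, 1.0)

def handle_visitor(features, threshold=0.8):
    """Serve a CAPTCHA only when the visitor looks anomalous, so that
    most valid human visitors never see a challenge."""
    return "captcha" if anomaly_score(*features) >= threshold else "page"
```

The threshold trades false challenges to humans against missed bots; the thesis's contribution is making that score accurate enough that the trade-off rarely bites.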

    A planetary nervous system for social mining and collective awareness

    We present a research roadmap for a Planetary Nervous System (PNS), capable of sensing and mining the digital breadcrumbs of human activities and unveiling the knowledge hidden in big data in order to address the big questions about social complexity. We envision the PNS as a globally distributed, self-organizing, techno-social system for answering analytical questions about the status of world-wide society, based on three pillars: social sensing, social mining, and the idea of trust networks and privacy-aware social mining. We discuss the ingredients of the science and technology necessary to build the PNS upon these three pillars, beyond the limitations of their respective states of the art. Social sensing is aimed at developing better methods for harvesting big data from the techno-social ecosystem and making it available for mining, learning and analysis at a suitably high level of abstraction. Social mining is the problem of discovering patterns and models of human behaviour from the sensed data across the various social dimensions, by means of data mining, machine learning and social network analysis. Trusted networks and privacy-aware social mining aim at creating a new deal around the questions of privacy and data ownership, empowering individuals with full awareness of and control over their own personal data, so that users may allow access to and use of their data for their own good and the common good. The PNS will provide a goal-oriented knowledge discovery framework, made of technology and people, able to configure itself to answer questions about the pulse of global society. Given an analytical request, the PNS activates a process composed of a variety of interconnected tasks exploiting the social sensing and mining methods within the transparent ecosystem provided by the trusted network. The PNS we foresee is the key tool of individual and collective awareness for the knowledge society. We need such a tool for everyone to become fully aware of how powerful the knowledge of our society is that we can achieve by leveraging our wisdom as a crowd, and of how important it is that everybody participate, both as a consumer and as a producer of social knowledge, for it to become a trustable, accessible, safe and useful public good. Seventh Framework Programme (European Commission) (grant agreement No. 284709)