891 research outputs found

    WEB MINING IN E-COMMERCE

    Get PDF
    Recently, the web is becoming an important part of people’s life. The web is a very good place to run successful businesses. Selling products or services online plays an important role in the success of businesses that have a physical presence, like a reE-Commerce, Data mining, Web mining

    Critical infrastructure protection and the Domain Name Service (DNS) system

    Get PDF
    Components of the critical infrastructure of any system are natural targets for attack. Any inherent weakness of such components can potentially expose the entire system to vulnerability. The Domain Name System (DNS) is one component of the proper functioning of the Internet. Although DNS is a relatively simple, isolated component, it serves as a straightforward example for the study of distributed systems in general, and as such, we have explored properties of DNS to examine how enterprise-scale, critical infrastructure components are vulnerable to attack, what protections are afforded to defenders of such components, the inherent weaknesses of such systems, and what natural defenses are available to safeguard them and ensure the availability of these systems. Each of the works in this thesis analyzes some aspect of our efforts in this regard --Abstract, page iv

    MALICIOUS TRAFFIC DETECTION IN DNS INFRASTRUCTURE USING DECISION TREE ALGORITHM

    Get PDF
    Domain Name System (DNS) is an essential component in internet infrastructure to direct domains to IP addresses or conversely. Despite its important role in delivering internet services, attackers often use DNS as a bridge to breach a system. A DNS traffic analysis system is needed for early detection of attacks. However, the available security tools still have many shortcomings, for example broken authentication, sensitive data exposure, injection, etc. This research uses DNS analysis to develop anomaly-based techniques to detect malicious traffic on the DNS infrastructure. To do this, We look for network features that characterize DNS traffic. Features obtained will then be processed using the Decision Tree algorithm to classifyincoming DNS traffic. We experimented with 2.291.024 data traffic data matches the characteristics of BotNet and normal traffic. By dividing the data into 80% training and 20% testing data, our experimental results showed high detection aacuracy (96.36%) indicating the robustness of our method

    Distributed-based massive processing of activity logs for efficient user modeling in a Virtual Campus

    Get PDF
    This paper reports on a multi-fold approach for the building of user models based on the identification of navigation patterns in a virtual campus, allowing for adapting the campus’ usability to the actual learners’ needs, thus resulting in a great stimulation of the learning experience. However, user modeling in this context implies a constant processing and analysis of user interaction data during long-term learning activities, which produces huge amounts of valuable data stored typically in server log files. Due to the large or very large size of log files generated daily, the massive processing is a foremost step in extracting useful information. To this end, this work studies, first, the viability of processing large log data files of a real Virtual Campus using different distributed infrastructures. More precisely, we study the time performance of massive processing of daily log files implemented following the master-slave paradigm and evaluated using Cluster Computing and PlanetLab platforms. The study reveals the complexity and challenges of massive processing in the big data era, such as the need to carefully tune the log file processing in terms of chunk log data size to be processed at slave nodes as well as the bottleneck in processing in truly geographically distributed infrastructures due to the overhead caused by the communication time among the master and slave nodes. Then, an application of the massive processing approach resulting in log data processed and stored in a well-structured format is presented. We show how to extract knowledge from the log data analysis by using the WEKA framework for data mining purposes showing its usefulness to effectively build user models in terms of identifying interesting navigation patters of on-line learners. The study is motivated and conducted in the context of the actual data logs of the Virtual Campus of the Open University of Catalonia.Peer ReviewedPostprint (author's final draft

    Vector representation of Internet domain names using Word embedding techniques

    Get PDF
    Word embeddings is a well-known set of techniques widely used in natural language processing ( NLP ). This thesis explores the use of word embeddings in a new scenario. A vector space model ( VSM) for Internet domain names ( DNS) is created by taking core ideas from NLP techniques and applying them to real anonymized DNS log queries from a large Internet Service Provider ( ISP) . The main goal is to find semantically similar domains only using information of DNS queries without any other knowledge about the content of those domains. A set of transformations through a detailed preprocessing pipeline with eight specific steps is defined to move the original problem to a problem in the NLP field. Once the preprocessing pipeline is applied and the DNS log files are transformed to a standard text corpus, we show that state-of-the-art techniques for word embeddings can be successfully applied in order to build what we called a DNS-VSM (a vector space model for Internet domain names). Different word embeddings techniques are evaluated in this work: Word2Vec (with Skip-Gram and CBOW architectures), App2Vec (with a CBOW architecture and adding time gaps between DNS queries), and FastText (which includes sub-word information). The obtained results are compared using various metrics from Information Retrieval theory and the quality of the learned vectors is validated with a third party source, namely, similar sites service offered by Alexa Internet, Inc2 . Due to intrinsic characteristics of domain names, we found that FastText is the best option for building a vector space model for DNS. Furthermore, its performance (considering the top 3 most similar learned vectors to each domain) is compared against two baseline methods: Random Guessing (returning randomly any domain name from the dataset) and Zero Rule (returning always the same most popular domains), outperforming both of them considerably. The results presented in this work can be useful in many engineering activities, with practical application in many areas. Some examples include websites recommendations based on similar sites, competitive analysis, identification of fraudulent or risky sites, parental-control systems, UX improvements (based on recommendations, spell correction, etc.), click-stream analysis, representation and clustering of users navigation profiles, optimization of cache systems in recursive DNS resolvers (among others). Finally, as a contribution to the research community a set of vectors of the DNS-VSM trained on a similar dataset to the one used in this thesis is released and made available for download through the github page in [1]. With this we hope that further work and research can be done using these vectors.La vectorización de palabras es un conjunto de técnicas bien conocidas y ampliamente usadas en el procesamiento del lenguaje natural ( PLN ). Esta tesis explora el uso de vectorización de palabras en un nuevo escenario. Un modelo de espacio vectorial ( VSM) para nombres de dominios de Internet ( DNS ) es creado tomando ideas fundamentales de PLN, l as cuales son aplicadas a consultas reales anonimizadas de logs de DNS de un gran proveedor de servicios de Internet ( ISP) . El objetivo principal es encontrar dominios relacionados semánticamente solamente usando información de consultas DNS sin ningún otro conocimiento sobre el contenido de esos dominios. Un conjunto de transformaciones a través de un detallado pipeline de preprocesamiento con ocho pasos específicos es definido para llevar el problema original a un problema en el campo de PLN. Una vez aplicado el pipeline de preprocesamiento y los logs de DNS son transformados a un corpus de texto estándar, se muestra que es posible utilizar con éxito técnicas del estado del arte respecto a vectorización de palabras para construir lo que denominamos un DNS-VSM (un modelo de espacio vectorial para nombres de dominio de Internet). Diferentes técnicas de vectorización de palabras son evaluadas en este trabajo: Word2Vec (con arquitectura Skip-Gram y CBOW) , App2Vec (con arquitectura CBOW y agregando intervalos de tiempo entre consultas DNS ), y FastText (incluyendo información a nivel de sub-palabra). Los resultados obtenidos se comparan usando varias métricas de la teoría de Recuperación de Información y la calidad de los vectores aprendidos es validada por una fuente externa, un servicio para obtener sitios similares ofrecido por Alexa Internet, Inc . Debido a características intrínsecas de los nombres de dominio, encontramos que FastText es la mejor opción para construir un modelo de espacio vectorial para DNS . Además, su performance es comparada contra dos métodos de línea base: Random Guessing (devolviendo cualquier nombre de dominio del dataset de forma aleatoria) y Zero Rule (devolviendo siempre los mismos dominios más populares), superando a ambos de manera considerable. Los resultados presentados en este trabajo pueden ser útiles en muchas actividades de ingeniería, con aplicación práctica en muchas áreas. Algunos ejemplos incluyen recomendaciones de sitios web, análisis competitivo, identificación de sitios riesgosos o fraudulentos, sistemas de control parental, mejoras de UX (basada en recomendaciones, corrección ortográfica, etc.), análisis de flujo de clics, representación y clustering de perfiles de navegación de usuarios, optimización de sistemas de cache en resolutores de DNS recursivos (entre otros). Por último, como contribución a la comunidad académica, un conjunto de vectores del DNS-VSM entrenado sobre un juego de datos similar al utilizado en esta tesis es liberado y hecho disponible para descarga a través de la página github en [1]. Con esto esperamos a que más trabajos e investigaciones puedan realizarse usando estos vectores

    CopAS: A Big Data Forensic Analytics System

    Full text link
    With the advancing digitization of our society, network security has become one of the critical concerns for most organizations. In this paper, we present CopAS, a system targeted at Big Data forensics analysis, allowing network operators to comfortably analyze and correlate large amounts of network data to get insights about potentially malicious and suspicious events. We demonstrate the practical usage of CopAS for insider threat detection on a publicly available PCAP dataset and show how the system can be used to detect insiders hiding their malicious activity in the large amounts of networking data streams generated during the daily activities of an organization
    • …
    corecore