
    The SPIRIT collection: an overview of a large web collection

    Large-scale collections of web pages have been essential for research in information retrieval and related areas. This paper provides an overview of a large web collection used in the SPIRIT project for the design and testing of spatially-aware retrieval systems. Several statistics are derived and presented to show the characteristics of the collection.

    Performance comparison of clustered and replicated information retrieval systems

    The amount of information available over the Internet is increasing daily, as are the importance and scale of Web search engines. Systems based on a single centralised index present several problems (such as lack of scalability), which leads to the use of distributed information retrieval systems to effectively search for and locate the required information. A distributed retrieval system can be clustered and/or replicated. In this paper, using simulations, we present a detailed performance analysis, in terms of both throughput and response time, of a clustered system compared to a replicated system. In addition, we consider the effect of changes in the query topics over time. We show that the performance of a clustered system does not improve on that of the best replicated system. Indeed, the main advantage of a clustered system is the reduction of network traffic. However, the use of a switched network eliminates the network bottleneck, markedly improving the performance of the replicated systems. Moreover, we illustrate the negative performance effect of changes over time in the query topics when a distributed clustered system is used. In contrast, the performance of a distributed replicated system is query independent.

    Time-Aware Detection Systems

    Communication network data has grown over recent decades, and the spread of the Internet of Things (IoT) has accelerated that growth. The number of attacks on this kind of infrastructure has also increased as its relevance grows. As a result, it is vital to guarantee an adequate level of security and to detect threats as soon as possible. Classical methods emphasise detection without taking into account the number of records needed to successfully identify an attack. To address this, time-aware techniques may be used for both detection and measurement. In this work, well-known machine learning methods will be explored to detect attacks based on public datasets. To measure performance, classic metrics will be used, and the number of elements processed will also be taken into account in order to determine the time-aware performance of each method. Funding: Ministerio de Economía y Competitividad; TIN2015-70648-P. Xunta de Galicia; ED431G/01 2016-201

    Time Aware F-Score for Cybersecurity Early Detection Evaluation

    With the increase in the use of Internet-connected systems, security has become of utmost importance. One key element in guaranteeing an adequate level of security is being able to detect a threat as soon as possible, decreasing the risk of consequences derived from it. In this paper, a new metric for evaluating early detection systems that takes into account the delay in detection is defined. The time aware F-score (TaF) takes into account the number of items or individual elements processed to determine whether an element is an anomaly or is not relevant to be detected. The metric is validated by means of a dual approach to cybersecurity: Operating System (OS) scan attacks, as part of systems and network security, and the detection of depression in social media networks, as part of the protection of users. Different approaches oriented towards studying the impact of single-item selection are also applied to final decisions. This study establishes that the nitems selection method is usually the best option for early detection systems. The TaF metric also provides an adequate alternative for time-sensitive detection evaluation. This work was supported in part by the Ministry of Economy and Competitiveness of Spain and Fondo Europeo de Desarrollo Regional (FEDER) Funds of the European Union under Project PID2019-111388GB-I00; and in part by the Centro de Investigación de Galicia, Centro de Investigación en Tecnologías de la Información y las Comunicaciones (CITIC), funded by Xunta de Galicia and the European Union (European Regional Development Fund, Galicia 2014-2020 Program), under Grant ED431G 2019/01.

    Early Detection of Cyberbullying on Social Media Networks

    Cyberbullying is an important issue for our society and has a major negative effect on victims, which can be highly damaging because of the frequency and wide propagation that Information Technologies make possible. The early detection of cyberbullying in social networks is therefore crucial to mitigate the impact on victims. In this article, we explore different approaches that take time into account when detecting cyberbullying in social networks. We follow a supervised learning method with two specific early detection models, named threshold and dual. The former follows a simpler approach, while the latter requires two machine learning models. To the best of our knowledge, this is the first attempt to investigate the early detection of cyberbullying. We propose two groups of features and two early detection methods specifically designed for this problem. We conduct an extensive evaluation using a real-world dataset, following a time-aware evaluation that penalizes late detections. Our results show that we can improve baseline detection models by up to 42%. This research was supported by the Ministry of Economy and Competitiveness of Spain and FEDER funds of the European Union (Project PID2019-111388GB-I00) and by the Centro de Investigación de Galicia “CITIC”, funded by Xunta de Galicia (Galicia, Spain) and the European Union (European Regional Development Fund, Galicia 2014-2020 Program), by grant ED431G 2019/01.
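
    The abstract above gives no implementation details for the threshold model, so the following is only a minimal sketch of the general idea: read the items of a session in order and emit a positive decision as soon as a score reaches a cut-off. The function name, the 0.8 threshold, and the example scores are all hypothetical; the scores stand in for the output of some classifier re-run after each new post, and this is not the authors' method.

```python
def threshold_early_detection(scores, threshold=0.8):
    """Return (flagged, items_read) for one session.

    `scores` is a sequence of hypothetical classifier outputs, one per
    post read so far. The session is flagged at the first score that
    reaches `threshold` (an assumed value, not taken from the paper).
    """
    for items_read, score in enumerate(scores, start=1):
        if score >= threshold:
            # Decide early: stop reading as soon as the evidence suffices.
            return True, items_read
    # No score reached the threshold: negative decision after all items.
    return False, len(scores)


# Illustrative scores only; real values would come from a trained model.
session_scores = [0.10, 0.35, 0.85, 0.90]
decision = threshold_early_detection(session_scores)
```

    Returning the number of items read alongside the decision is what lets a time-aware evaluation penalize late detections.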

    Optimization Matrix Factorization Recommendation Algorithm Based on Rating Centrality

    Matrix factorization (MF) is extensively used to mine user preferences from explicit ratings in recommender systems. However, the reliability of explicit ratings is not always consistent, because many factors may affect a user's final evaluation of an item, including commercial advertising and a friend's recommendation. Mining the reliable ratings of users is therefore critical to further improving the performance of the recommender system. In this work, we analyze the degree to which each rating deviates from the overall rating distributions of the user and of the item, and propose the notions of user-based rating centrality and item-based rating centrality, respectively. Based on the rating centrality, we then measure the reliability of each user rating and provide an optimized matrix factorization recommendation algorithm. Experimental results on two popular recommendation datasets show that our method performs better than other matrix factorization recommendation algorithms, especially on sparse datasets.

    Auditoría Wi-Fi basada en placas de bajo coste (Wi-Fi auditing based on low-cost boards)

    In Proceedings of the XV Jornadas de Ingeniería Telemática (JITEL), A Coruña 2021. The use of wireless networks is currently growing exponentially in business environments of all kinds. Although many wireless network auditing solutions exist for large organizations, few exist for small companies; this, together with the lack of Information Technology (IT) knowledge and experience among the staff of such organizations, means that these companies usually face a high level of cybersecurity risk. In this context, we developed a tool for auditing wireless networks in business environments that is based on low-cost hardware and requires only a basic level of IT and cybersecurity knowledge from the user. The architectural design of the tool is based on a distributed system of low-cost devices that monitors and audits the wireless environment and presents the information obtained to the user in an intelligible form. In the current implementation we use Raspberry Pi 3B+ units as the low-cost boards, to which we connect external Wi-Fi antennas that facilitate the capture of network traffic. We then process this traffic and show the results to the user through a web interface. After completing the development of the tool, we tested it in both a real and a simulated environment, which allowed us to draw interesting conclusions about the work carried out. This research was funded by the Ministry of Economy and Competitiveness of Spain and FEDER funds of the EU (Project PID2019-111388GB-I00) and by the Centro de Investigación de Galicia "CITIC", funded by the Xunta de Galicia and the EU (European Regional Development Fund, Galicia 2014-2020 Program), through grant ED431G 2019/01.

    Site agnostic approach to early detection of cyberbullying on social media networks

    The rise in the use of social media networks has increased the prevalence of cyberbullying, and time is paramount in reducing the negative effects of such behaviour on any social media platform. This paper studies the early detection problem from a general perspective by carrying out experiments over two independent datasets (Instagram and Vine), exclusively using users’ comments. We used textual information from comments over baseline early detection models (fixed, threshold, and dual models) to apply three different methods of improving early detection. First, we evaluated the performance of Doc2Vec features. Finally, we also presented multiple instance learning (MIL) on early detection models and assessed its performance. We applied (Formula presented.) ((Formula presented.)) as an early detection metric to assess the performance of the presented methods. We conclude that the inclusion of Doc2Vec features improves the performance of baseline early detection models by up to 79.6%. Moreover, multiple instance learning shows an important positive effect for the Vine dataset, where post sizes are smaller and the English language is used less, with a further improvement of up to 13%, but no significant enhancement is shown for the Instagram dataset. This research was supported by the Ministry of Economy and Competitiveness of Spain and FEDER funds of the European Union (Project PID2019-111388GB-I00) and by the Centro de Investigación de Galicia “CITIC”, funded by Xunta de Galicia and the European Union (European Regional Development Fund, Galicia 2014-2020 Program), by grant ED431G 2019/01.

    Algorithms for within-cluster searches using inverted files

    Information retrieval over clustered document collections has two successive stages: first identifying the best-clusters, and then identifying the best-documents in these clusters, i.e., those most similar to the user query. In this paper, we assume that an inverted file over the entire document collection is used for the latter stage. We propose and evaluate algorithms for within-cluster searches, i.e., algorithms that integrate the best-clusters with the best-documents so that the final output includes the highest ranked documents only from the best-clusters. Our experiments on a TREC collection of 210,158 documents with several query sets show that an integration algorithm selected appropriately for the query length and system resources can significantly improve query evaluation efficiency.
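
    The two-stage setting described above can be illustrated with a small sketch. It assumes the simplest possible integration strategy (traverse the full postings lists and discard documents outside the best-clusters) and a plain term-frequency score; the function names, the toy documents, and the cluster assignments are hypothetical, and this is not one of the algorithms the paper proposes or evaluates.

```python
from collections import defaultdict


def build_inverted_file(docs):
    """Map each term to {doc_id: term frequency} over the whole collection."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def within_cluster_search(index, doc_cluster, best_clusters, query, k=10):
    """Rank documents by summed term frequency, keeping only documents
    whose cluster is in `best_clusters` (assumed to come from stage one)."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id, tf in index.get(term, {}).items():
            if doc_cluster[doc_id] in best_clusters:
                scores[doc_id] += tf
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))[:k]


# Toy collection and cluster assignment, for illustration only.
docs = {
    "d1": "inverted file search",
    "d2": "cluster search search",
    "d3": "web search engine",
}
doc_cluster = {"d1": 0, "d2": 0, "d3": 1}
index = build_inverted_file(docs)
results = within_cluster_search(index, doc_cluster, {0}, "search cluster")
```

    Filtering at posting-traversal time still reads every posting for each query term; the efficiency question raised in the abstract is precisely how to avoid or reduce that cost.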