    Using Big Data Analysis to Improve Cache Performance in Search Engines

    Web search engines process huge amounts of data to support search but must run under strict performance requirements (answering a query in a fraction of a second). To meet these requirements they implement optimization techniques such as caching, which may be applied at several levels. One of these levels is the intersection cache, which exploits frequently co-occurring pairs of query terms by keeping the results of intersecting the corresponding inverted lists in the memory of the search node. In this work we propose an optimization step that uses data mining techniques to decide which items should be cached and which should not. Our preliminary results show that it is possible to achieve extra cost savings in this already hyper-optimized field. Sociedad Argentina de Informática e Investigación Operativa (SADIO)
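
    To make the intersection-cache idea concrete, here is a minimal Python sketch. The toy posting lists are invented, and a simple frequency-threshold admission policy stands in for the data-mining step the paper proposes:

```python
# Toy inverted index: term -> sorted list of document IDs (illustrative data).
postings = {
    "web":    [1, 3, 5, 8, 13, 21],
    "search": [2, 3, 5, 8, 21, 34],
    "cache":  [3, 8, 13, 34],
}

intersection_cache = {}   # (term_a, term_b) -> cached intersection result
pair_frequency = {}       # how often each term pair has been requested

def intersect(a, b):
    """Linear merge of two sorted posting lists."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

def cached_intersection(term_a, term_b, admit_threshold=2):
    """Resolve a two-term query, caching intersections of frequent pairs.
    The frequency threshold is a stand-in for the paper's data-mining model,
    which decides which pairs are worth keeping in memory."""
    key = tuple(sorted((term_a, term_b)))
    if key in intersection_cache:
        return intersection_cache[key]            # cache hit: skip the merge
    pair_frequency[key] = pair_frequency.get(key, 0) + 1
    result = intersect(postings[term_a], postings[term_b])
    if pair_frequency[key] >= admit_threshold:    # admission policy
        intersection_cache[key] = result
    return result

print(cached_intersection("web", "search"))   # computed: [3, 5, 8, 21]
print(cached_intersection("web", "search"))   # computed again, then cached
```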

    Letter from the Editor

    Inaugural note of the Electronic Journal of SADIO, written for the first published volume of the journal. Sociedad Argentina de Informática e Investigación Operativa

    Improving LDA topic modeling in Twitter with graph community detection

    Texts can be characterized from their content using machine learning and natural language processing techniques. In particular, understanding their topic is useful for tasks such as personalized message recommendation, fake news detection, and public opinion monitoring. Latent Dirichlet Allocation (LDA) is an unsupervised generative model for topic decomposition: it represents texts as random mixtures over topics with a Dirichlet distribution, where each topic is characterized by a distribution over words. However, the method is challenging to apply when texts are short and sometimes incoherent, as is often the case with posts on social networks such as Twitter. Several works have shown that tweet pooling (aggregating tweets into longer documents) improves LDA results, but performance depends on the method used to aggregate the texts. We propose a new method to detect topics on Twitter: "Community pooling". In this novel scheme, we first define the retweet graph, where users are nodes and retweets between them are edges. We then apply the Louvain method for community detection to uncover communities (groups of users who interact mainly with each other and not with other groups). Finally, we aggregate all tweets authored by the users of a community into a single document. This drastically reduces the number of documents and yields a denser word co-occurrence matrix, which benefits the LDA algorithm. To evaluate our model, we created two datasets of tweets with different characteristics: a generic dataset covering topics such as music, health, and movies, and an event dataset corresponding to Biden's presidential inauguration day in the United States. We compare the performance of our model with state-of-the-art schemes and previous pooling models in terms of document retrieval performance, cluster quality, and supervised machine learning classification score. Community pooling performed best on all datasets and tasks, with the only exception of the retrieval task on the event dataset. Moreover, Community pooling was faster than all other aggregation techniques (less than half the running time), which is particularly useful in big data scenarios. Sociedad Argentina de Informática e Investigación Operativa
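
    A compact sketch of the Community pooling pipeline described above, assuming tweets arrive as (author, retweeted_user, text) triples; networkx's Louvain implementation and gensim's LDA stand in for whatever tooling the authors actually used, and the data is invented:

```python
import networkx as nx
from gensim import corpora, models

# Toy input: (author, retweeted_author, text) triples (illustrative data).
tweets = [
    ("alice", "bob",   "inauguration speech today"),
    ("bob",   "alice", "crowd at the inauguration"),
    ("carol", "dave",  "new album release tonight"),
    ("dave",  "carol", "tour dates announced"),
]

# 1. Retweet graph: users are nodes, retweets are edges.
G = nx.Graph()
G.add_edges_from((author, retweeted) for author, retweeted, _ in tweets)

# 2. Louvain community detection (available in networkx >= 2.8).
communities = nx.community.louvain_communities(G, seed=42)

# 3. Pool all tweets authored by members of a community into one document.
docs = []
for community in communities:
    tokens = [tok for author, _, text in tweets if author in community
                  for tok in text.split()]
    docs.append(tokens)

# 4. Standard LDA over the (far fewer, denser) pooled documents.
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)
print(lda.print_topics())
```

    Pooling by community rather than by hashtag or author shrinks the corpus to one document per community, which is what makes the reported speedup plausible.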

    Efficient Algorithms and Massive Data in Large-Scale Search

    The quantity, diversity, and dynamism of the information distributed by different Internet services pose multiple challenges to search systems. On the one hand, users require tools that help them solve problems in a timely manner. On the other, an ever larger and more complex scenario demands the design of algorithms and data structures that maintain (and improve) efficiency, both in answer quality and in response time. Although searches over massive collections of information can take many forms, one of the most widely used applications is the search engine: a high-performance distributed system built on highly efficient data structures and algorithms. Many questions in this area remain open, and new challenges appear while existing ones are being resolved. This project proposes the design and evaluation of efficient data structures and algorithms, together with massive data (big data) analysis, to improve the internal processes of a search engine. To this end, it explores and exploits the content and structure of the web as well as the behavior of its users. Track: Databases and Data Mining. Red de Universidades con Carreras en Informática (RedUNCI)

    Algorithmic Improvements and Data Structures for Highly Efficient Search

    The problem of searching the Internet poses constant challenges. Data are ever richer and more complex, are used and change in real time, and provide new value, but only if they are available in a timely manner. Users increasingly rely on search engines to satisfy their information, navigation, and transaction needs, requiring them to answer thousands of queries per second. To handle the size of a collection of documents crawled from the web, search engines use distributed data structures to make search efficient and caching techniques to optimize response times, as sketched below. This project proposes designing and evaluating advanced data structures together with new algorithmic techniques to improve search performance over web-scale data collections. Track: Distributed and Parallel Processing. Red de Universidades con Carreras en Informática (RedUNCI)
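
    As an illustration of the kind of result caching the abstract alludes to, a minimal LRU query-result cache can be sketched as follows (hypothetical Python, not the project's actual implementation):

```python
from collections import OrderedDict

class LRUResultCache:
    """Least-recently-used cache mapping queries to result lists (a sketch)."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.entries = OrderedDict()  # query -> cached results, oldest first

    def get(self, query):
        if query not in self.entries:
            return None                        # cache miss
        self.entries.move_to_end(query)        # mark as recently used
        return self.entries[query]

    def put(self, query, results):
        self.entries[query] = results
        self.entries.move_to_end(query)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict least recently used

cache = LRUResultCache(capacity=2)
cache.put("web search", [1, 3, 5])
cache.put("big data", [2, 8])
cache.put("caching", [4])          # evicts "web search", the oldest entry
print(cache.get("web search"))     # -> None
```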

    Toxicity, polarization, and cultural diversity in social networks: Using machine learning and natural language processing to analyze these phenomena in social networks

    Social media have increased both the amount of information people consume and the number of interactions between them. Nevertheless, most people tend to promote their favored narratives and hence form polarized groups. This encourages polarization and extremism, which can result in extreme violence. Against this backdrop, it is in our interest to find environments, strategies, and mechanisms that reduce toxicity on social media (defining "toxicity" as a rude, disrespectful, or unreasonable comment that is likely to make people leave a discussion). We address the hypothesis that higher cultural diversity among the users of a community reduces the toxicity of their messages. We use Reddit as a case study, since the platform is characterized by a variety of discussion sub-forums where users debate political and cultural issues. Using community2vec, we generate an embedding for each community that allows us to portray users along demographic and ideological dimensions. To analyze each user statement, we process the data with different models, obtaining the topics under debate and their levels of aggressiveness and negativity. Finally, we seek to corroborate the hypothesis by analyzing the relationship between the cultural diversity present in each discussion group and the toxicity found in its posts. Sociedad Argentina de Informática e Investigación Operativa
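
    One plausible way to operationalize the hypothesis (not necessarily the authors' exact metric) is to measure a group's cultural diversity as the mean pairwise cosine distance between its users' community2vec-style vectors and correlate it with the group's mean toxicity. All data below is invented:

```python
import numpy as np
from scipy.stats import pearsonr

def diversity(user_vectors):
    """Cultural diversity of a group: mean pairwise cosine distance between
    its users' embedding vectors (one plausible operationalization)."""
    X = np.asarray(user_vectors, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-normalize rows
    sims = X @ X.T                                     # pairwise cosine sims
    n = len(X)
    return 1.0 - (sims.sum() - n) / (n * (n - 1))      # drop the diagonal

# Toy data: per-community user vectors and mean toxicity scores (invented).
rng = np.random.default_rng(0)
groups = {
    "sub_a": (rng.random((20, 8)), 0.31),
    "sub_b": (rng.random((20, 8)), 0.12),
    "sub_c": (rng.random((20, 8)), 0.45),
    "sub_d": (rng.random((20, 8)), 0.27),
}
divs = [diversity(vectors) for vectors, _ in groups.values()]
tox = [toxicity for _, toxicity in groups.values()]
r, p = pearsonr(divs, tox)
print(f"diversity-toxicity correlation: r={r:.2f} (p={p:.2f})")
```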

    Measuring Controversy in Social Networks Through Jargon

    In this work we develop a methodology to quantify controversy in a social network exclusively through its jargon, that is, through the language its users employ. We present preliminary results of our experiments, in which we achieve accuracy comparable to state-of-the-art methods based on network structure. Sociedad Argentina de Informática e Investigación Operativa
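
    One way to make "controversy through jargon" concrete (an illustrative proxy, not the authors' actual methodology) is to measure how well a bag-of-words classifier can tell the two sides of a discussion apart from vocabulary alone:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def jargon_controversy(texts_side_a, texts_side_b):
    """Controversy proxy: cross-validated accuracy of a classifier separating
    the two sides by vocabulary. Accuracy near 0.5 means shared language
    (low controversy); near 1.0 means distinct jargons (high controversy)."""
    texts = texts_side_a + texts_side_b
    labels = [0] * len(texts_side_a) + [1] * len(texts_side_b)
    X = CountVectorizer().fit_transform(texts)
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, labels, cv=3).mean()

# Invented toy data: two sides of a debate with distinct vocabularies.
side_a = ["ban it now", "regulation works", "ban everything"] * 3
side_b = ["freedom first", "no state meddling", "freedom matters"] * 3
print(jargon_controversy(side_a, side_b))
```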