19 research outputs found

    Approximate TF–IDF based on topic extraction from massive message stream using the GPU

    The Web is a constantly expanding global information space that includes disparate types of data and resources. Recent trends demonstrate the urgent need to manage large amounts of streaming data, especially in specific application domains such as critical infrastructure systems, sensor networks, log file analysis, search engines and, more recently, social networks. All of these applications involve large-scale data-intensive tasks, often subject to time constraints and space complexity. Algorithms, data management and data retrieval techniques must be able to process data streams, i.e., process data as it becomes available and provide an accurate response based solely on the portion of the stream already seen. Data retrieval techniques often rely on the traditional storage-and-processing approach, i.e., all data must be available in the storage space in order to be processed. For instance, a widely used relevance measure is Term Frequency–Inverse Document Frequency (TF–IDF), which evaluates how important a word is within a collection of documents but requires a priori knowledge of the whole dataset. To address this problem, we propose an approximate version of the TF–IDF measure suitable for continuous data streams (such as message exchanges, tweets and sensor-based log files). The algorithm for calculating this measure makes two assumptions: a fast response is required, and memory is both limited and infinitely smaller than the size of the data stream. In addition, to meet the great computational power required to process massive data streams, we also present a parallel implementation of the approximate TF–IDF calculation using Graphics Processing Units (GPUs). The implementation was tested on generated and real data streams and was able to capture the most frequent terms. Our results demonstrate that the approximate version of the TF–IDF measure performs at a level comparable to the exact TF–IDF measure.
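    The exact measure assigns each term t in document d the weight tf(t, d) · log(N / df(t)), which needs the full corpus to fix N and df(t). A minimal sketch of how a bounded-memory streaming approximation might look, assuming a Space-Saving-style counter for document frequencies (the abstract does not specify the paper's actual summary structure; names and parameters here are illustrative):

```python
import math
from collections import Counter

class SpaceSavingCounter:
    """Bounded-memory counter (in the spirit of Metwally et al.'s
    Space-Saving algorithm): at most `capacity` terms are tracked,
    so memory stays fixed no matter how long the stream is."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}

    def add(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1
        else:
            # Evict the smallest counter; the newcomer inherits its count + 1,
            # which over-estimates rare terms but preserves frequent ones.
            victim = min(self.counts, key=self.counts.get)
            self.counts[item] = self.counts.pop(victim) + 1

def streaming_tf_idf(doc_stream, capacity=1000):
    """Single pass over a stream of tokenised documents, yielding per-document
    approximate TF-IDF scores based only on the data seen so far."""
    df = SpaceSavingCounter(capacity)   # approximate document frequencies
    n_docs = 0
    for tokens in doc_stream:
        n_docs += 1
        tf = Counter(tokens)
        for term in tf:                 # one df increment per document
            df.add(term)
        total = len(tokens)
        yield {t: max(0.0, (f / total) * math.log(n_docs / df.counts[t]))
               for t, f in tf.items() if t in df.counts}
```

    Each emitted score uses the current stream length as N, so estimates sharpen as more of the stream arrives; terms seen everywhere score near zero, rarer terms score higher.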

    Sentiment Analysis in Unstructured Textual Information with Deep Learning

    This document analyses the current state-of-the-art algorithms in the fields of Natural Language Processing and Sentiment Analysis. It continues with a step-by-step explanation of the development of pre-processing techniques and neural network architectures that make sentiment predictions (predicting rating stars) on Amazon.com customer reviews. An accuracy comparison was made between four different models to check their performance. The second part of the project was the development of a demo web application to show the potential of a Product Analytics Tool, which performs sentiment predictions for any product on the Amazon website. The app scrapes the reviews, loads the previously trained model and makes the predictions, generating insights such as the most positive and negative features of the product, based exclusively on the most reliable and objective data: customer reviews. The source code of the app can be found here: https://github.com/albergar2/SA_Project. At the end of the document, an appendix provides information and estimates of the cost and tasks required to replicate this project in a professional environment.
    Doble Grado en Ingeniería Informática y Administración de Empresas

    Infrastructuralization of social media: the effects of fragmentation, the filter bubble and echo chambers on the perception of history in the current state of social media

    Aiming to demonstrate the phenomena of fragmentation and filter bubbles in social media, this research project looks into content and detects communities in the network, together with the profiles that disseminate it, using RStudio.

    Sarcasm recognition survey and application based on Reddit comments

    Social media platforms are continuously increasing their number of users, and every day enormous amounts of data are produced online. Machine Learning (ML) techniques in the form of speech recognition are applied to analyze the polarity of this unstructured text data. However, sarcasm is broadly used across these platforms, reducing the accuracy of such systems, as the intention of the expressed message does not match the polarity that is measured. Throughout this work, a survey considering three different algorithms is performed: Logistic Regression, Neural Networks and Support Vector Machines. This final degree project proposes a preliminary analysis of the data using a sarcasm recognition classifier implemented with a support vector machine algorithm, achieving a mean accuracy of 71.21% and an F1-score of around 60%. Finally, an analysis of the planning and costs is performed, proposing future work that could complement this bachelor thesis.
    Ingeniería de la Energía
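    For context on the figures reported above, a minimal sketch of how the mean accuracy and F1-score of a binary sarcasm classifier are computed from its predictions (this is standard metric arithmetic, not the authors' code):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive (sarcastic) class."""
    tp = sum(t == p == positive for t, p in zip(y_true, y_pred))
    fp = sum(p == positive and t != positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

    An accuracy above the F1-score, as reported here, typically indicates that the classifier does better on the majority (non-sarcastic) class than on the positive one.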

    Music recommender systems. Proof of concept

    Data overload is a well-known problem due to the availability of big online distributed databases. While providing a wealth of information, the difficulty of finding the sought data and the time spent searching call for technological solutions. Classical search engines alleviate this problem and, at the same time, have transformed the way people access the information they are interested in. On the other hand, the Internet has also changed music consumption habits around the world: it is possible to find almost every recorded song or music piece. Over the last years, music streaming platforms like Spotify, Apple Music or Amazon Music have contributed to a substantial change in users' listening habits and the way music is commercialized and distributed. On-demand music platforms offer their users a huge catalogue, so they can do a quick search and listen to what they want or build up a personal library. In this context, music recommender systems may help users discover music that matches their tastes, making them a powerful tool to get the most out of an immense catalogue that no human could fully know. This project aims at testing different music recommendation approaches applied to the particular case of user playlists. Several recommender alternatives were designed and evaluated: collaborative filtering systems, content-based systems and hybrid recommender systems that combine both techniques. Two systems are proposed. One is content-based and uses correlation between tracks characterized by high-level descriptors; the other is a hybrid recommender that first applies a collaborative method to filter the database and then computes the final recommendation using Gaussian Mixture Models. Recommendations were evaluated using objective metrics and human evaluations, obtaining positive results.
    Ingeniería de Sistemas Audiovisuales
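    The content-based recommender described above ranks tracks by the correlation of their high-level descriptors. A minimal sketch of that idea, assuming each track is a fixed-length descriptor vector (the track names and catalogue here are illustrative, not the project's data):

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length descriptor vectors.
    Assumes neither vector is constant (otherwise the denominator is zero)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def recommend(seed, catalogue, k=2):
    """Return the k catalogue tracks whose descriptors correlate most
    strongly with the seed track's descriptors."""
    ranked = sorted(catalogue.items(),
                    key=lambda kv: pearson(seed, kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]
```

    A real system would compute descriptors (e.g. energy, danceability, timbre statistics) per track and aggregate over a playlist's seeds rather than a single one.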

    Mind the Gap: Newsfeed Visualisation with Metro Maps

    News overload has emerged as a growing problem in our increasingly connected digital information era. With complex long-running stories unfolding over weeks and months, young adults in particular are left overwhelmed and demotivated, which leads to their disengagement from politics and current events news. This dissertation presents a method for the automatic generation of metro maps based on news content obtained from user-specified RSS feeds. Metro maps are familiar to most adults, and they are intuitive visual metaphors for representing concepts which branch and diverge, such as news stories. The method described performs entity disambiguation and various other NLP techniques to extract a set of topics (metro lines) from a news corpus which provide a cohesive summary of its content. The difficulty of drawing unoccluded octilinear metro maps is a barrier to their current utility in InfoVis. Therefore, this dissertation also introduces a heuristic force-directed approach for drawing metro maps, which is refined using multicriteria optimisations taken from neighbouring literature in information cartography. The resultant system is demonstrated using the RSS feeds published by several popular British newspapers, and empirically evaluated in a user study. The results of the study support the hypothesis that metro map users demonstrate greater topic recall than users of an equivalent RSS reader. Lastly, areas for future research are discussed, followed by recommendations for the commercial development of this and similar systems.
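    The heuristic force-directed approach mentioned above can be illustrated with a single iteration of a basic spring layout, in which every pair of stations repels and connected stations attract (the dissertation's refinements, octilinearity and the multicriteria optimisations, are omitted; the constants here are illustrative):

```python
import math

def force_step(pos, edges, k=1.0, step=0.05):
    """One iteration of a basic force-directed layout:
    all node pairs repel with force k^2/d, edges attract with d^2/k
    (the classic Fruchterman-Reingold force model)."""
    disp = {v: [0.0, 0.0] for v in pos}
    nodes = list(pos)
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            dx = pos[u][0] - pos[v][0]
            dy = pos[u][1] - pos[v][1]
            d = math.hypot(dx, dy) or 1e-9
            f = k * k / d                        # repulsion
            disp[u][0] += f * dx / d; disp[u][1] += f * dy / d
            disp[v][0] -= f * dx / d; disp[v][1] -= f * dy / d
    for u, v in edges:
        dx = pos[u][0] - pos[v][0]
        dy = pos[u][1] - pos[v][1]
        d = math.hypot(dx, dy) or 1e-9
        f = d * d / k                            # attraction along edges
        disp[u][0] -= f * dx / d; disp[u][1] -= f * dy / d
        disp[v][0] += f * dx / d; disp[v][1] += f * dy / d
    return {v: (pos[v][0] + step * disp[v][0], pos[v][1] + step * disp[v][1])
            for v in pos}
```

    Iterating this step until displacements are small yields the raw layout; a metro-map drawer would then snap edge directions to the eight octilinear angles and resolve occlusions.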

    Artificial Intelligence and the Creative Industries

    Final Degree Project in Business Administration (Treball Final de Grau en Administració d'Empreses). Code: AE1049. Academic year 2022/2023
    In recent years, the world has witnessed the rise of Artificial Intelligence (AI) and its increasing impact on various sectors. One of the most intriguing areas where AI has been making waves is the creative industry. The use of AI in this sector is provoking significant changes in the way artists, designers, musicians, and writers work, think, and create. This paper explores the impact of AI on the creative industries and how it is reshaping the future of creativity. The study focuses on the ways in which AI is being used in the creative industries by analyzing the opportunities and challenges that come with its integration into the creative process, as well as the ethical dilemmas that arise from its usage. To accomplish this, an overview of relevant theoretical frameworks is presented alongside data analysis and a case study of the music industry and its relationship with this technology. Moreover, the paper explores the role of AI in promoting creativity and innovation, and how it is changing the nature of work in the creative industry, creating new job opportunities and requiring a new set of skills and competencies from creative professionals. Overall, this study provides a comprehensive overview of the current state of AI in the creative industries, its potential for growth, and the challenges and opportunities it presents. Therefore, this paper is suited for anyone interested in understanding the impact of AI on creativity and the future of the creative industry.

    Animating Truth

    Animating Truth examines the rise of animated documentary in the 21st century and addresses how non-photorealistic animation is increasingly used to depict and shape reality.

    Extracción de información de las autopsias verbales (Information Extraction from Verbal Autopsies)

    Civil registration and vital statistics systems register births and deaths and compile statistics. These statistics are a key factor in promoting public health policies and in recording the longevity and health of the population. Death certificates issued in health institutions are the main source for collecting the cause of death (CoD). Nevertheless, such counts are not straightforward; indeed, it is estimated that 65% of deaths in the world remain uncounted [D'Ambruoso, 2013]. In places with no access to health facilities and, hence, to death certificates, the World Health Organization (WHO) designed the Verbal Autopsy as an instrument to collect evidence for CoD statistics. A Verbal Autopsy (VA) consists of an interview with a relative or caregiver of the deceased. The VA comprises both an open response (OR) and closed questions (CQs). On the one hand, the OR is a free narrative of the events, expressed in natural language and without any pre-determined structure. On the other hand, the CQs are a set of a few hundred controlled questions, each with a small number of permitted answers (e.g. yes/no). InterVA is a suite of computer models included in the WHO 2016 instrument, which gathers several algorithms chosen by the WHO for the analysis of verbal autopsies. InterVA estimates the CoD based solely on the CQs, while the OR is disregarded. We hypothesize that the text provided by the OR conveys relevant information to discern the CoD and that, accordingly, InterVA could benefit from Natural Language Processing approaches. Empirical results corroborated that the CoD prediction capability of the InterVA algorithm is outperformed when the valuable information conveyed by the OR is taken into account. The experimental layout compares InterVA with other approaches well suited to processing structured inputs such as the CQs. Next, alternative approaches based on language models are employed to analyze the OR. Finally, the best approach for each facet (CQs and OR) was combined, leading to a multi-modal approach.
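    One simple way to combine the best approach for each facet, as described above, is late fusion of the two models' per-cause probability distributions. The sketch below uses a weighted average; this is an illustrative choice, since the abstract does not specify the actual combination method, and the cause labels are hypothetical:

```python
def late_fusion(p_cq, p_or, weight=0.5):
    """Weighted average of per-cause probabilities from the closed-questions
    model and the open-response model (one simple late-fusion choice)."""
    causes = set(p_cq) | set(p_or)
    fused = {c: weight * p_cq.get(c, 0.0) + (1 - weight) * p_or.get(c, 0.0)
             for c in causes}
    z = sum(fused.values()) or 1.0
    return {c: p / z for c, p in fused.items()}   # renormalise

def predict_cod(p_cq, p_or, weight=0.5):
    """Return the cause of death with the highest fused probability."""
    fused = late_fusion(p_cq, p_or, weight)
    return max(fused, key=fused.get)
```

    The weight can be tuned on held-out data to reflect how much each facet should be trusted.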

    Digital traces and urban research: Barcelona through social media data

    Most of the world's population now resides in urban areas, and it is expected that almost all of the planet's growth will be concentrated in them for the next 30 years, making the improvement of the quality of life in cities one of the big challenges of this century. To that end, it is crucial to have information on how people use the spaces of the city, allowing urban planning to respond successfully to their needs. This dissertation proposes using data shared voluntarily by the millions of users that make up social networks' communities as a valuable tool for studying the complexity of the city, because of its capacity to provide an unprecedented volume of urban information with geographic, temporal, semantic and multimedia components. However, the volume and variety of the data raise important challenges regarding its retrieval, manipulation, analysis and representation, requiring the adoption of best practices in data science, applied from multiple points of view to the field of urban studies, with a strong emphasis on the reproducibility of the developed methodologies. This research focuses on the case study of the city of Barcelona, using public data collected from Panoramio, Flickr, Twitter and Instagram. After a literature review, the methods to access the different services are discussed, along with their available data and limitations. Next, the retrieved data is analyzed at different spatial and temporal scales. The first approximation to the data focuses on the origins of the users who took geotagged pictures of Barcelona, geocoding the hometowns that appear in their Flickr public profiles, which allows the identification of the regions, countries and cities with the largest influx of visitors and relates the results to multiple indicators at a global scale.
    The next scale of analysis addresses the city as a whole, developing methodologies for representing the spatial distribution of the collected locations while avoiding the artifacts produced by overplotting. To this end, locations are aggregated in regular tessellations, whose size is determined empirically from their spatial distribution. Two spatial statistics techniques (Moran's I and Getis-Ord's G*) are used to visualize the local spatial autocorrelation of the areas with exceptionally high or low densities, under a statistical significance framework. Finally, kernel density estimation is introduced as a non-parametric alternative. The third level of detail follows the official administrative division of Barcelona into 73 neighborhoods and 12 districts, which obeys historical, morphological and functional criteria. Micromaps are introduced as a representation technique capable of providing a geographical context to commonly used statistical graphics, along with a methodology to produce these micromaps automatically. This technique is compared to annotated scatterplots to relate picture intensity to different urban indicators at the neighborhood scale. The hypothesis of spatial homogeneity is abandoned at the most detailed scale, focusing the analysis on the street network. Two techniques to assign events to road segments in the street graph are presented (directly by shortest distance, or by proxy through postal addresses), as well as the generalization of kernel density estimation from Euclidean space to a network topology.
    Beyond the spatial domain, the interactions of three temporal cycles are further analyzed using the timestamps available in the picture metadata: daytime/nighttime (daily cycle), work/leisure (weekly cycle) and seasonal (yearly cycle).
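    The local statistics used in this work (Moran's I and Getis-Ord's G*) build on the global Moran's I. A minimal sketch of the global statistic with binary contiguity weights, as it might be applied to counts aggregated on a regular tessellation (the toy values and neighbourhood matrix are illustrative):

```python
def morans_i(values, weights):
    """Global Moran's I for spatial autocorrelation.
    `weights[i][j]` is 1 if cells i and j are neighbours, else 0
    (binary contiguity weights); positive I means similar values cluster."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    w_sum = sum(weights[i][j] for i in range(n) for j in range(n))
    return (n / w_sum) * (num / den)
```

    Local variants decompose this sum per cell, which is what allows mapping the hot and cold spots described above under a significance test.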