
    A Study on the Activity of Social Network Users in Urban Areas: The Case of Makassar City, Indonesia

    This study aims to investigate the possibilities of using Twitter social media data as a source of knowledge for urban planning applications. The author analyzes 211,922 check-ins on Twitter. First, the dataset was used to analyze people's movement by comparing the population on Twitter with the real urban population; three data sources were used: check-ins, population census data, and questionnaire data. Second, a mapping approach was used to study the dynamic urban land-use pattern by combining check-in features and individual text-posting activities. Third, a grid-based aggregation method was used to analyze the location of the city center. Fourth, the mobility of urban inhabitants was quantified by examining individuals' movement patterns and calculating how far people travel in the city. Lastly, social media users in public spaces and public facilities were analyzed. The thesis concludes that location-based social media has great potential for helping to understand the shape and structure of a city. (The University of Kitakyushu)
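To make the grid-based aggregation and mobility measures concrete, here is a minimal sketch, not the author's code: it bins hypothetical check-in records into a degree grid, takes the densest cell as a proxy for the city center, and sums great-circle distances between a user's consecutive check-ins. The records, cell size, and function names are illustrative assumptions.

```python
import math
from collections import Counter

# Hypothetical check-in records: (user_id, latitude, longitude), ordered in time per user.
checkins = [
    ("u1", -5.135, 119.423), ("u1", -5.147, 119.432),
    ("u2", -5.140, 119.410), ("u2", -5.135, 119.423),
]

CELL = 0.01  # assumed grid cell size in degrees (~1.1 km)

def cell(lat, lon):
    """Map a coordinate to a grid cell index."""
    return (round(lat / CELL), round(lon / CELL))

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# City-centre proxy: the grid cell with the most check-ins.
density = Counter(cell(lat, lon) for _, lat, lon in checkins)
centre_cell, n = density.most_common(1)[0]
print("densest cell:", centre_cell, "with", n, "check-ins")

# Mobility proxy: distance travelled between a user's consecutive check-ins.
per_user = {}
for uid, lat, lon in checkins:
    per_user.setdefault(uid, []).append((lat, lon))
for uid, points in per_user.items():
    km = sum(haversine_km(*a, *b) for a, b in zip(points, points[1:]))
    print(uid, "travelled ~%.2f km" % km)
```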

    Design, analysis and implementation of advanced methodologies to measure the socio-economic impact of personal data in large online services

    The web ecosystem is enormous, and overall it is sustained by an intangible attribute that mainly supports the majority of free services: the exploitation of personal information.
Over the years, concerns about how services use personal data have increased and attracted the attention of media and users. This collection of personal information is the primary source of revenue on the Internet nowadays. On top of this, online advertising is the piece that supports it all. Without the existence of personal data in communion with online advertising, the Internet would probably not be the giant we know today. Online advertising is a very complex ecosystem in which multiple stakeholders take part. It is the engine that generates revenue on the web, and it has evolved in a few years to reach billions of users worldwide. While browsing, users generate valuable data about themselves that advertisers later use to offer them relevant products in which they could be interested. It is a two-way approach, since advertisers pay intermediaries to show ads to the public that is, in principle, most interested. However, this trading, sharing, and processing of personal data and behavior patterns, apart from opening up new advertising channels, exposes users' privacy. This incessant collection and commercialization of personal information usually happens behind an opaque wall, where users often do not know what their data is used for. Privacy and transparency initiatives have increased over the years to empower the user in this business that moves billions of US dollars in revenue. Not surprisingly, after several scandals, such as the Facebook Cambridge Analytica scandal, businesses and regulators have joined forces to create transparency and protect users against the harmful practices derived from the use of their personal information. For instance, the General Data Protection Regulation (GDPR) is the most promising example of a data protection regulation, affecting all the member states of the European Union (EU) and advocating for the protection of users. The content of this thesis uses this legislation as a reference. For all these reasons, the purpose of this thesis is to provide tools and methodologies that reveal inappropriate uses of personal data by large companies in the online advertising ecosystem and create transparency among users, while providing solutions that allow them to protect themselves. Thus, this thesis offers the design, analysis, and implementation of methodologies that measure the social and economic impact of online personal information on large Internet services. It focuses mainly on Facebook (FB), one of the largest social networks and services on the web, with more than 2.8B Monthly Active Users (MAU) worldwide and more than $84B in online advertising revenue in 2020 alone. First, this thesis presents a solution, in the form of a browser extension called the Data Valuation Tool for Facebook users (FDVT), to provide users with a personalized, real-time estimation of the money they are generating for FB. By analyzing the number of ads and interactions in a session, the user gets information on their value within this social network. The add-on has seen significant impact and adoption both by users, having been installed more than 10k times since its public launch in October 2016, and by the media, appearing in more than 100 media outlets.
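As a rough illustration of the kind of per-session estimate such a tool can compute, the sketch below prices hypothetical ad impressions and clicks with placeholder CPM and CPC figures. This is not the FDVT's calibrated valuation model; the constants and the function name are assumptions made only to show the shape of the computation.

```python
# Minimal sketch of a per-session ad-value estimate in the spirit of the FDVT:
# revenue ~ impressions priced at a CPM plus clicks priced at a CPC.
# The price figures below are placeholders, not FDVT's calibrated values.

ASSUMED_CPM_USD = 4.0   # hypothetical price per 1,000 ad impressions
ASSUMED_CPC_USD = 0.50  # hypothetical price per ad click

def session_value(ads_shown: int, ads_clicked: int) -> float:
    """Rough estimate of the revenue a user session generates for the platform."""
    return ads_shown / 1000.0 * ASSUMED_CPM_USD + ads_clicked * ASSUMED_CPC_USD

# Example: a session with 40 ads displayed and 2 clicked.
print(f"estimated session value: ${session_value(40, 2):.2f}")
```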
Second, the study of the potential risks associated with processing users' data should accompany the creation of such solutions. In this context, this thesis discovers and unveils striking results on the usage of personal information: (i) it quantifies the number of users affected by the usage of sensitive attributes used for advertising on FB, taking as reference the definition of sensitive data in the GDPR. This thesis relies on Natural Language Processing (NLP) to identify sensitive attributes, and it then uses the FB Ads Manager to retrieve the number of users assigned this sensitive information. Two-thirds of FB users are affected by the use of sensitive personal data attributed to them. Moreover, the legislation seems not to have affected this use of sensitive attributes by FB, which presents severe risks to users. (ii) It models how many non-Personally Identifiable Information (non-PII) attributes are enough to uniquely identify an individual in a database of billions of users, and proves that reaching a single user is plausible even without knowing any PII about them. The results demonstrate that 22 interests taken at random from a user are enough to identify them uniquely with 90% probability, and 4 when taking the least popular ones. Finally, this thesis was affected by the outbreak of the COVID-19 pandemic, which led to a side contribution analyzing how the online advertising market evolved during this period. The research shows that the online advertising market exhibits an almost perfectly inelastic supply and that it changed its composition due to a change in users' online behavior. It also illustrates the potential of using data from large online services, which already have a high adoption rate, and presents a protocol for tracing contacts who have been potentially exposed to people who tested positive for COVID-19, in contrast to the failure of newly deployed contact-tracing apps. In conclusion, the research in this thesis showcases the social and economic impact of online advertising and large online services on users. The methodology developed and deployed serves to highlight and quantify the risks derived from personal data in online services. It demonstrates the need for such tools and methodologies in line with new legislation and users' wishes. In the search for transparency and privacy, this thesis presents easily implementable solutions and measures to prevent these risks and empower users to control their personal information. This work was supported by the Ministerio de Educación, Cultura y Deporte, Spain, through the FPU Grant FPU16/05852, the Ministerio de Ciencia e Innovación, Spain, through the project ACHILLES Grant PID2019-104207RB-I00, the H2020 EU-Funded SMOOTH project under Grant 786741, and the H2020 EU-Funded PIMCITY project under Grant 871370. Doctoral Programme in Telematic Engineering, Universidad Carlos III de Madrid. Committee: President: David Larrabeiti López; Secretary: Gregorio Ignacio López López; Member: Noel Cresp
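The uniqueness finding in (ii) above can be illustrated with a small simulation on synthetic data. This sketch is not the thesis's FB-scale analysis; the population size, interest catalogue, popularity distribution, and subset size are arbitrary assumptions chosen only to show the mechanics of testing whether a random subset of a user's interests singles them out.

```python
import random

random.seed(0)

N_USERS = 20_000        # synthetic population (the thesis works at FB scale)
N_INTERESTS = 2_000     # synthetic interest catalogue
INTERESTS_PER_USER = 50
K = 8                   # how many randomly chosen interests to test

# Skewed popularity: low-index interests are far more common (rough Zipf-like weights).
weights = [1.0 / (i + 1) for i in range(N_INTERESTS)]
users = [frozenset(random.choices(range(N_INTERESTS), weights=weights, k=INTERESTS_PER_USER))
         for _ in range(N_USERS)]

def is_unique(target: int, subset: set) -> bool:
    """True if no other user holds every interest in `subset`."""
    return not any(subset <= u for i, u in enumerate(users) if i != target)

trials, unique = 200, 0
for _ in range(trials):
    idx = random.randrange(N_USERS)
    interests = list(users[idx])
    subset = set(random.sample(interests, min(K, len(interests))))
    unique += is_unique(idx, subset)

print(f"{K} random interests uniquely identify a user in "
      f"{unique / trials:.0%} of {trials} trials (synthetic data)")
```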

    Hardening High-Assurance Security Systems with Trusted Computing

    We are living in the time of the digital revolution, in which the world we know changes beyond recognition every decade. The positive aspect is that these changes also drive progress in the quality and availability of digital assets crucial for our societies. To name a few examples: broadly available communication channels allowing the quick exchange of knowledge over long distances, systems controlling the automatic sharing and distribution of renewable energy in international power grid networks, easily accessible applications for early disease detection enabling self-examination without burdening the health service, or governmental systems assisting citizens in settling official matters without leaving their homes. Unfortunately, however, digitalization also opens opportunities for malicious actors to threaten our societies if they gain control over these assets after successfully exploiting vulnerabilities in the complex computing systems that build them. Protecting these systems, which are called high-assurance security systems, is therefore of utmost importance. For decades, humanity has struggled to find methods to protect high-assurance security systems. Advancements in the computing systems security domain have led to the popularization of hardware-assisted security techniques, nowadays available in commodity computers, which open up possibilities for building more sophisticated defense mechanisms at lower cost. However, none of these techniques is a silver bullet. Each one targets particular use cases, suffers from limitations, and is vulnerable to specific attacks. I argue that some of these techniques are synergistic and, when used together, help overcome limitations and mitigate specific attacks. My reasoning is supported by regulations that legally bind the owners of high-assurance security systems to provide strong security guarantees. These requirements can be fulfilled with the help of diverse technologies that have been standardized in recent years. In this thesis, I introduce new techniques for hardening high-assurance security systems that execute in remote execution environments, such as public and hybrid clouds. I implemented these techniques as part of a framework that provides technical assurance that high-assurance security systems execute in a specific data center, on top of a trustworthy operating system, in a virtual machine controlled by a trustworthy hypervisor, or in strong isolation from other software. I demonstrated the practicality of my approach by leveraging the framework to harden real-world applications, such as machine learning applications in the eHealth domain. The evaluation shows that the framework is practical: it induces low performance overhead (<6%), supports software updates, requires no changes to the legacy application's source code, and can be tailored to individual trust boundaries with the help of security policies. The framework consists of a decentralized monitoring system that offers better scalability than traditional centralized monitoring systems. Each monitored machine runs a piece of code that verifies that the machine's integrity and geolocation conform to the given security policy.
This piece of code, which serves as a trusted anchor on that machine, executes inside a trusted execution environment, i.e., Intel SGX, to protect itself from the untrusted host, and uses trusted computing techniques, such as the trusted platform module (TPM), secure boot, and the integrity measurement architecture (IMA), to attest to the load-time and runtime integrity of the surrounding operating system running on a bare-metal machine or inside a virtual machine. The trusted anchor implements my novel, formally proven protocol enabling detection of the TPM cuckoo attack. The framework also implements a key distribution protocol that, depending on the individual security requirements, shares cryptographic keys only with high-assurance security systems executing in predefined security settings, i.e., inside trusted execution environments or inside an integrity-enforced operating system. Such an approach is particularly appealing in the context of machine learning systems, where some algorithms, like machine learning model training, require temporary access to large amounts of computing power. These algorithms can execute inside a dedicated, trusted data center at higher performance because they are not limited by the security features required in a shared execution environment. The evaluation of the framework showed that training a machine learning model on real-world datasets achieved 0.96x native performance on the GPU and a speedup of up to 1560x compared to the state-of-the-art SGX-based system. Finally, I tackled the problem of software updates, which make the operating system's integrity monitoring unreliable due to false positives, i.e., software updates move the updated system to an unknown (untrusted) state that is reported as an integrity violation. I solved this problem by introducing a proxy to a software repository that sanitizes software packages so that they can be safely installed. The sanitization consists of predicting and certifying the future state of the operating system, i.e., its state after the specific updates are installed. The evaluation of this approach showed that it supports 99.76% of the packages available in the Alpine Linux main and community repositories. The framework proposed in this thesis is a step forward in verifying and enforcing that high-assurance security systems execute in an environment compliant with regulations. I anticipate that the framework could be further integrated with industry-standard security information and event management tools as well as other security monitoring mechanisms to provide a comprehensive solution for hardening high-assurance security systems.
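The update-sanitization idea, predicting the operating system's post-update state so the integrity monitor does not raise false positives, can be sketched as follows. This is only an illustration under assumed package and allowlist formats, not the framework's actual implementation or its IMA policy handling.

```python
# Minimal sketch of update sanitization: before a package is installed, compute
# the digests of the files it would place on disk and extend the integrity-
# monitoring allowlist so that the post-update state is not reported as a
# violation. The tarball layout and allowlist format are assumptions, not the
# framework's real package or policy formats.
import hashlib
import tarfile

def predicted_measurements(package_path: str) -> dict[str, str]:
    """Map each regular file shipped in the package to its SHA-256 digest."""
    digests = {}
    with tarfile.open(package_path) as tar:
        for member in tar.getmembers():
            if member.isfile():
                data = tar.extractfile(member).read()
                digests["/" + member.name.lstrip("./")] = hashlib.sha256(data).hexdigest()
    return digests

def extend_allowlist(allowlist: dict[str, str], package_path: str) -> dict[str, str]:
    """Return an allowlist that also accepts the files the update will install."""
    updated = dict(allowlist)
    updated.update(predicted_measurements(package_path))
    return updated

# Usage sketch (hypothetical package name):
# allowlist = extend_allowlist(allowlist, "openssl-3.0.12.tar.gz")
```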

    Automated Machine Learning - Bayesian Optimization, Meta-Learning & Applications

    Automating machine learning by providing techniques that autonomously find the best algorithm, hyperparameter configuration, and preprocessing is helpful for both researchers and practitioners. It is therefore not surprising that automated machine learning has become a very interesting field of research. Bayesian optimization has proven to be a very successful tool for automated machine learning. In the first part of the thesis we present different approaches to improve Bayesian optimization by means of transfer learning. We present three ways of incorporating meta-knowledge into Bayesian optimization: search space pruning, initialization, and transfer surrogate models. Finally, we present a general framework for Bayesian optimization combined with meta-learning and conduct a comparison with existing work on two different meta-data sets. One conclusion is that the meta-target-driven approaches in particular provide better results; choosing algorithm configurations based on the improvement on the meta-knowledge combined with the expected improvement yields the best results. The second part of this thesis is more application-oriented. Bayesian optimization is applied to large data sets and used as a tool to participate in machine learning challenges. We compare its autonomous performance and its performance in combination with a human expert. At two ECML-PKDD Discovery Challenges, we are able to show that automated machine learning outperforms human machine learning experts. Finally, we present an approach that automates the process of creating an ensemble of several layers, different algorithms, and hyperparameter configurations. These kinds of ensembles are jokingly called Frankenstein ensembles and have proven their benefit on a wide variety of data sets in many machine learning challenges. We compare our approach, Automatic Frankensteining, with the current state of the art for automated machine learning on 80 different data sets and show that it outperforms them on the majority of them using the same training time. Furthermore, we compare Automatic Frankensteining on a large-scale data set to more than 3,500 machine learning expert teams and are able to outperform more than 3,000 of them within 12 CPU hours.
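One of the meta-learning ideas above, warm-starting the optimization with configurations that performed well on similar data sets, can be sketched briefly. The meta-knowledge, meta-features, and distance measure below are invented for illustration; they are not the thesis's meta-data sets or its transfer surrogate models.

```python
# Minimal sketch of meta-learning-based initialization for Bayesian optimization:
# warm-start the search with configurations that worked best on the most similar
# previously seen data sets (similarity measured on meta-features).
import numpy as np

# (meta_features, best_configuration) pairs from previously optimized data sets.
meta_knowledge = [
    (np.array([1000, 20, 0.3]), {"C": 1.0, "gamma": 0.01}),
    (np.array([50000, 300, 0.1]), {"C": 10.0, "gamma": 0.001}),
    (np.array([800, 15, 0.5]), {"C": 0.1, "gamma": 0.1}),
]

def warm_start(new_meta: np.ndarray, k: int = 2) -> list[dict]:
    """Return the best configurations of the k most similar past data sets."""
    dists = [np.linalg.norm(np.log1p(new_meta) - np.log1p(m)) for m, _ in meta_knowledge]
    order = np.argsort(dists)[:k]
    return [meta_knowledge[i][1] for i in order]

# These configurations would be evaluated first, before the surrogate model
# (e.g. a Gaussian process) takes over and proposes points by expected improvement.
print(warm_start(np.array([1200, 25, 0.35])))
```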

    The Algorithm Game

    Most of the discourse on algorithmic decisionmaking, whether it comes in the form of praise or warning, assumes that algorithms apply to a static world. But automated decisionmaking is a dynamic process. Algorithms attempt to estimate some difficult-to-measure quality about a subject using proxies, and the subjects in turn change their behavior in order to game the system and get better treatment for themselves (or, in some cases, to protest the system). These behavioral changes can then prompt the algorithm to make corrections. The moves and countermoves create a dance that has great import for the fairness and efficiency of a decision-making process. And this dance can be structured through law. Yet existing law lacks a clear policy vision or even a coherent language to foster productive debate. This Article provides the foundation. We describe gaming and countergaming strategies using credit scoring, employment markets, criminal investigation, and corporate reputation management as key examples. We then show how the law implicitly promotes or discourages these behaviors, with mixed effects on accuracy, distributional fairness, efficiency, and autonomy.

    On the use of multi-sensor digital traces to discover spatio-temporal human behavioral patterns

    Technology is already part of our lives, and every time we interact with it, whether in a phone call, a credit card payment, or our activity on social networks, digital traces are stored. In this thesis we are interested in those digital traces that also record people's geolocation as they carry out their daily activities. This information lets us understand how people interact with the city, which is highly valuable for urban planning, traffic management, public policy, and even for taking preventive action against natural disasters. This thesis aims to study human behavioral patterns from digital traces. Three massive datasets are used that record the activity of anonymized users in terms of phone calls, credit card purchases, and social media activity (check-ins, images, comments, and tweets). A methodology is proposed to extract human behavioral patterns using latent semantic models, Latent Dirichlet Allocation (LDA) and Dynamic Topic Models: the former to detect spatial patterns and the latter to detect spatio-temporal patterns. Additionally, a set of metrics is proposed to provide an objective method for evaluating the patterns obtained.
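The spatial-pattern step can be illustrated with a short sketch: each spatial zone is treated as a "document" whose "words" are the venue categories observed there, and LDA recovers latent activity patterns. The zone documents below are invented, and the sketch uses scikit-learn's LDA rather than the exact models and data of the thesis; the temporal analysis with Dynamic Topic Models is not shown.

```python
# Minimal sketch: LDA over zone "documents" built from check-in venue categories.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

zone_documents = [
    "restaurant bar nightclub restaurant cafe",   # zone 1 (hypothetical)
    "office office bank cafe restaurant",         # zone 2
    "park stadium gym park playground",           # zone 3
    "office bank office cafe bar",                # zone 4
]

counts = CountVectorizer().fit(zone_documents)
X = counts.transform(zone_documents)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Print the top venue categories of each latent activity pattern.
terms = counts.get_feature_names_out()
for t, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[::-1][:3]]
    print(f"pattern {t}: {top}")
```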

    Data Science Solution for User Authentication

    User authentication is considered a key factor in almost any software system and is often the first layer of security in the digital world. Authentication methods utilize one, or a combination of up to two, of the following factors: something you know, something you have, and something you are. To prevent the serious data breaches that have occurred under the traditional authentication methods, a fourth factor, something you do, is being discussed among researchers; unfortunately, methods that rely on this fourth factor have problems of their own. This thesis addresses the issues of the fourth authentication factor and proposes a data science solution for user authentication. The new solution is based on something you do and relies on analytic techniques to transform Big Data characteristics (volume, velocity, and variety) into relevant security user profiles. Users' information is analyzed to create behavioral profiles. Just-in-time challenge questions are generated from these behavioral profiles, enabling an authentication-on-demand feature. The proposed model assumes that data is received from different sources. This data is analyzed using collaborative filtering (CF), a learning technique that builds up knowledge by aggregating the collected users' transaction data to identify information of security potential. Four use case scenarios were evaluated as a proof of concept of the proposed model. Additionally, a web-based case study using the MovieLens public dataset was implemented. Results show that the proposed model is successful as a proof of concept. The experiment confirms the potential of applying the proposed approach in real life as a new authentication method, leveraging the characteristics of Big Data: volume, velocity, and variety.
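As a rough illustration of authentication on demand, the sketch below generates a challenge question from a user's transaction history, mixing in distractors the user never interacted with. The transaction log, catalogue, and question format are invented; the thesis builds the underlying behavioral profiles with collaborative filtering over aggregated transaction data, which is not reproduced here.

```python
# Minimal sketch of a just-in-time challenge question drawn from a behavioural
# profile: ask which item the user actually interacted with recently.
import random

transactions = {  # hypothetical user -> ordered list of items they interacted with
    "alice": ["grocery_store", "gas_station", "cinema", "bookshop"],
    "bob": ["hardware_store", "coffee_shop", "gym"],
}
catalogue = {"grocery_store", "gas_station", "cinema", "bookshop",
             "hardware_store", "coffee_shop", "gym", "florist"}

def challenge(user: str, n_options: int = 4) -> tuple[str, list[str]]:
    """Return (correct answer, shuffled options) for an on-demand challenge."""
    history = transactions[user]
    answer = history[-1]  # most recent interaction
    distractors = random.sample(sorted(catalogue - set(history)), n_options - 1)
    options = distractors + [answer]
    random.shuffle(options)
    return answer, options

answer, options = challenge("alice")
print("Which of these did you visit most recently?", options)
print("accepted answer:", answer)
```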