
    Mining large-scale human mobility data for long-term crime prediction

    Traditional crime prediction models based on census data are limited, as they fail to capture the complexity and dynamics of human activity. With the rise of ubiquitous computing, there is an opportunity to improve such models with data that are better proxies of human presence in cities. In this paper, we leverage large-scale human mobility data to craft an extensive set of features for crime prediction, as informed by theories in criminology and urban studies. We employ averaging and boosting ensemble techniques from machine learning to investigate their power in predicting yearly counts for different types of crimes occurring in New York City at the census tract level. Our study shows that spatial and spatio-temporal features derived from Foursquare venues and check-ins, subway rides, and taxi rides improve baseline models relying on census and POI data. The proposed models achieve absolute R^2 metrics of up to 65% (on a geographical out-of-sample test set) and up to 89% (on a temporal out-of-sample test set). This shows that, in addition to the residential population of an area, its ambient population is strongly predictive of the area's crime levels. We take a deep dive into the main crime categories and find that the predictive gain of the human-dynamics features varies across crime types: such features bring the biggest boost for grand larcenies, whereas assaults are already well predicted by the census features. Furthermore, we identify and discuss the top predictive features for the main crime categories. These results offer valuable insights for those responsible for urban policy or law enforcement.
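    A minimal sketch of the kind of setup this abstract describes: a gradient-boosting regressor predicting yearly crime counts per census tract, scored with R^2 on a geographical out-of-sample split. All file, column, and feature names below are hypothetical stand-ins, not the paper's actual feature set, and scikit-learn stands in for whichever boosting implementation the authors used.

    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import r2_score

    # Hypothetical table: one row per (census tract, year), with census, POI,
    # and human-mobility features plus the yearly crime count to predict.
    df = pd.read_csv("tract_features.csv")

    # Geographical out-of-sample split: hold out entire tracts, not random rows,
    # so the model is scored on areas it has never seen during training.
    train = df[df["tract_id"] % 5 != 0]
    test = df[df["tract_id"] % 5 == 0]

    features = ["census_population", "poi_count", "foursquare_checkins",
                "subway_rides", "taxi_pickups"]  # hypothetical feature names
    model = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05)
    model.fit(train[features], train["crime_count"])

    print("out-of-sample R^2:",
          r2_score(test["crime_count"], model.predict(test[features])))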

    Data Analytics 2016: Proceedings of the Fifth International Conference on Data Analytics


    Web browsing interactions inferred from a flow-level perspective

    Since its use became widespread in the mid-1990s, the web has probably been the most popular Internet service. In fact, for many lay users the web is almost a synonym for the Internet. Web users today access it from a myriad of devices, from traditional computers to smartphones, tablets, ebook readers, and even smart watches. Moreover, users have become accustomed to accessing many services through their web browsers instead of dedicated applications; this is the case, for example, of e-mail, video streaming, and office suites (such as the one provided by Google Docs). As a consequence, web traffic is nowadays complex, and its effect on networks is very important. The scientific community has reacted to this situation with many works that characterize the web and its traffic and propose ways of improving its operation. Nevertheless, studies focused on web traffic have often considered the traffic of web clients or servers as a whole in order to describe it statistically, or have delved into the application level by focusing on HTTP messages. Few works have attempted to describe the effect that website sessions and webpage visits have on a user's traffic. Those web browsing interactions are, however, the elements of web operation that the user actually experiences, and thus the ones that best represent their behavior.
    The work presented in this thesis revolves around these interactions, with a special focus on identifying them in user traffic. The problem is faced from a flow-level perspective: the study centers on a characterization of web traffic obtained on a per-connection basis, using information from the transport and network levels only, never application data. This flow-level perspective introduces certain constraints on the proposals developed, but pays off by enabling systems that are scalable, easy to deploy in any network, and that avoid accessing potentially sensitive user information. The chapters of this document introduce several methods for identifying website sessions and webpage downloads in user traffic. To develop these methods, web traffic was characterized from traces captured in several ways: by accessing pages automatically, with the help of volunteers in a controlled environment, and in the wild on the access link of the Public University of Navarre. The methods rely on connection-level parameters, such as flow start and end timestamps or server IP addresses, to find related connections in the traffic of web users.
    Validating the results of the different methods has been challenging because of the absence of ground truth: correctly labeled web traffic traces are hard to obtain, the labeling process is very complex, and the lack of similar proposals in the literature makes comparison with other authors impossible. As a consequence, specific validation methods had to be designed, and they are also described in this document. Identifying website sessions and webpage downloads in user traffic has immediate applications for network administrators and Internet service providers, as it would allow them to gather data on the browsing profiles of their users and even to block undesired traffic or prioritize important traffic; moreover, the advantages of working at the connection level apply especially in their case. Finally, the results obtained with the methods presented in this thesis could be used to design schemes that classify web traffic by the service that produced it, since the characteristics of multiple related connections could serve as input parameters for the classification.
    Programa Oficial de Doctorado en Tecnologías de las Comunicaciones (RD 1393/2007)
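    A minimal sketch of the flow-level grouping idea described above: connections whose start timestamps fall close together are assigned to the same webpage download, using nothing beyond transport- and network-level metadata. The flow-record layout and the time threshold below are illustrative assumptions, not the thesis' actual parameters.

    from dataclasses import dataclass

    @dataclass
    class Flow:
        start: float      # connection start timestamp (seconds)
        end: float        # connection end timestamp (seconds)
        server_ip: str    # server-side address of the connection

    def group_into_page_downloads(flows, gap=1.5):
        """Assume connections starting within `gap` seconds of the previous
        connection belong to the same webpage download."""
        groups, current = [], []
        for flow in sorted(flows, key=lambda f: f.start):
            if current and flow.start - current[-1].start > gap:
                groups.append(current)
                current = []
            current.append(flow)
        if current:
            groups.append(current)
        return groups

    flows = [Flow(0.0, 2.1, "198.51.100.7"), Flow(0.3, 1.8, "203.0.113.9"),
             Flow(8.0, 9.5, "198.51.100.7")]
    for i, page in enumerate(group_into_page_downloads(flows)):
        print(f"page {i}: {[f.server_ip for f in page]}")

    A real method would combine several such signals, for instance recurring server IPs across a session, rather than relying on a single time gap.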

    Mining Missing Hyperlinks from Human Navigation Traces: A Case Study of Wikipedia

    Hyperlinks are an essential feature of the World Wide Web. They are especially important for online encyclopedias such as Wikipedia: an article can often only be understood in the context of related articles, and hyperlinks make it easy to explore this context. But important links are often missing, and several methods have been proposed to alleviate this problem by learning a linking model based on the structure of the existing links. Here we propose a novel approach to identifying missing links in Wikipedia. We build on the fact that the ultimate purpose of Wikipedia links is to aid navigation. Rather than merely suggesting new links that are in tune with the structure of existing links, our method finds missing links that would immediately enhance Wikipedia's navigability. We leverage datasets of navigation paths collected through a Wikipedia-based human-computation game in which users must find a short path from a start article to a target article by clicking only the links encountered along the way. We harness these human navigational traces to identify a set of candidates for missing links and then rank the candidates. Experiments show that our procedure identifies missing links of high quality.
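    A minimal sketch of the candidate-mining step this abstract outlines: from human navigation paths, collect (source, target) article pairs that users connect indirectly but that lack a direct hyperlink, and rank them by how often the pair co-occurs on a path. The toy paths and link set are hypothetical, and the ranking here is plain frequency rather than the paper's actual scoring.

    from collections import Counter
    from itertools import combinations

    paths = [  # hypothetical navigation traces (sequences of clicked articles)
        ["Obama", "United_States", "Hawaii"],
        ["Obama", "President", "Hawaii"],
    ]
    existing_links = {("Obama", "United_States"), ("United_States", "Hawaii"),
                      ("Obama", "President"), ("President", "Hawaii")}

    candidates = Counter()
    for path in paths:
        for source, target in combinations(path, 2):  # source precedes target
            if (source, target) not in existing_links:
                candidates[(source, target)] += 1

    for (source, target), count in candidates.most_common(5):
        print(f"suggest link {source} -> {target} (seen on {count} paths)")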

    Predictive Process Monitoring for Lead-to-Contract Process Optimization

    Business processes today are supported by enterprise systems such as Enterprise Resource Planning (ERP) systems, including CRM systems for the sales process. These systems store large amounts of process execution log data that can be used to improve business processes across the organization. Process mining methods have been developed to analyze such logs and are capable of extracting models of the processes that were actually executed. These methods, in turn, have been applied in conjunction with predictive monitoring methods for the early detection of desired and undesired process outcomes. Although predictive monitoring has recently attracted attention and found application in recommendation engines that suggest how to improve business process outcomes, there is little research on how contextual data, such as clients' financial indicators and other external data, affects the quality of such predictions. This thesis examines whether including external data alongside the event data improves the accuracy of predictive monitoring for early predictions. The experiments reveal that external context data had a rather adverse effect on the performance of the learned models, whereas internal context data collected during the process had a positive effect. Furthermore, the study indicates that the first three events of a case, together with internal data, are sufficient to predict the label of an opportunity in the sales funnel.
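    A minimal sketch of prefix-based early outcome prediction, matching the finding above that three events suffice: encode the first k events of each case as a bag of events and train a classifier on the case outcome. The event names, log layout, and choice of scikit-learn are illustrative assumptions, not the thesis' actual pipeline.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    cases = [  # hypothetical lead-to-contract traces with a win/loss outcome
        (["lead_created", "call_made", "offer_sent"], 1),
        (["lead_created", "email_sent", "no_response"], 0),
        (["lead_created", "call_made", "meeting_set"], 1),
        (["lead_created", "no_response", "email_sent"], 0),
    ]
    k = 3  # prefix length: only the first three events of each case are used

    prefixes = [" ".join(events[:k]) for events, _ in cases]
    labels = [outcome for _, outcome in cases]

    vectorizer = CountVectorizer()           # bag-of-events encoding of prefixes
    X = vectorizer.fit_transform(prefixes)
    clf = LogisticRegression().fit(X, labels)

    new_case = " ".join(["lead_created", "call_made", "offer_sent"])
    print("win probability:",
          clf.predict_proba(vectorizer.transform([new_case]))[0, 1])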

    Automated Field Usability Evaluation Using Generated Task Trees

    Usability is an important aspect of any kind of product. This also applies to software such as desktop applications and websites, as well as apps on mobile devices and smart TVs. In a competitive market, the usability of a software product becomes a discriminator between success and failure. This is especially true for software, as alternatives are often close at hand and only one click away. Hence, software development must strive for highly usable products. Usability engineering allows for continuously measuring and improving the usability of a software product during its development and beyond. For this, it offers a broad variety of methods that support detecting usability issues both in early development stages at the prototype level and during the operation of the final software. Unfortunately, most of these methods must be applied manually, which makes them costly to use.
    In this thesis, we describe a fully automated approach for usability evaluation. It is a user-oriented method to be applied in the field, i.e., during the operation of a software product. The approach first traces the usage of the software by recording user actions at the keystroke level. From these recordings, it compiles a model of the Graphical User Interface (GUI) of the software, as well as a usage model in the form of task trees. Based on these models and the recorded actions, the approach detects 14 different so-called usability smells: patterns of unexpected user behavior that indicate usability issues. The result is a list of findings for each smell, providing detailed information about the user tasks affected by the related usability issues as well as the elements of the GUI that cause them.
    We validate the approach in three case studies on two websites and one desktop application. In these case studies, we verify whether task trees can be generated from recorded user actions and whether they are representative of the user behavior. Furthermore, we apply the usability smell detection and analyze the results with respect to their validity, comparing the findings with the results of established usability evaluation methods. From this, we derive conditions that findings of our approach must meet to be considered indicators of usability issues. The results of the case studies are promising: they show that the approach can find, fully automated, a broad range of usability issues, and that the findings reference in detail the GUI elements that cause an issue. The approach complements established usability engineering methods and can be applied with minimal effort, even on a large scale.
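    A minimal sketch of a single smell detector over recorded user actions, in the spirit of the approach above: repeated clicks on the same GUI element within a short time window, often a sign of missing system feedback. The event format and thresholds are assumptions; the actual approach builds task trees and detects 14 different smells.

    from dataclasses import dataclass

    @dataclass
    class Action:
        timestamp: float  # seconds since recording start
        kind: str         # e.g. "click", "key"
        target: str       # identifier of the GUI element

    def detect_repeated_clicks(actions, min_repeats=3, window=2.0):
        """Report GUI elements clicked `min_repeats` or more times in a row,
        each click within `window` seconds of the previous one."""
        findings, run = [], []
        for a in actions:
            if (a.kind == "click" and run and a.target == run[-1].target
                    and a.timestamp - run[-1].timestamp <= window):
                run.append(a)
            else:
                if len(run) >= min_repeats:
                    findings.append((run[0].target, len(run)))
                run = [a] if a.kind == "click" else []
        if len(run) >= min_repeats:
            findings.append((run[0].target, len(run)))
        return findings

    log = [Action(0.0, "click", "submitButton"), Action(0.7, "click", "submitButton"),
           Action(1.2, "click", "submitButton"), Action(9.0, "click", "menu")]
    print(detect_repeated_clicks(log))  # [('submitButton', 3)]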

    Analyzing repetitiveness in big code to support software maintenance and evolution

    Software systems inevitably contain a large amount of repeated artifacts at different levels of abstraction, from ideas, requirements, designs, and algorithms to implementations. This dissertation focuses on analyzing software repetitiveness at the implementation level and leveraging the derived knowledge to ease tasks in software maintenance and evolution such as program comprehension, API use, change understanding, API adaptation, and bug fixing. The guiding philosophy of this work is that, in a large corpus, code that conforms to specifications appears more frequently than code that does not, similar code is changed similarly, and similar code can have similar bugs that can be fixed similarly. We have developed different representations for software artifacts at the source code level, along with corresponding algorithms for measuring code similarity and mining repeated code. Our mining techniques are based on the key insight that code conforming to programming patterns and specifications appears more frequently than code that does not; thus, correct patterns and specifications can be mined from a large code corpus. We have also built program differencing techniques for analyzing changes in software evolution, based on the insight that similar code is likely changed in similar ways and likely has similar bugs that can be fixed similarly; therefore, learning changes and fixes from the past can help automatically detect and suggest changes and fixes to repeated code during development. Our empirical evaluation shows that our techniques can accurately and efficiently detect repeated code, mine useful programming patterns and API specifications, and recommend changes. They can also detect bugs, suggest fixes, and provide actionable insights that ease maintenance tasks. Specifically, our code clone detection tool detects more meaningful clones than other tools. Our mining tools recover high-quality programming patterns and API preconditions, and the mined results have been used to successfully detect many bugs violating patterns and specifications in mature open-source systems. The mined API preconditions are shown to help API specification writers identify missing preconditions in already-specified APIs and start building preconditions for not-yet-specified ones. The tools are scalable and analyze large systems in reasonable time. Our study on repeated changes gives useful insights for program auto-repair tools. Our automated change suggestion approach achieves a top-1 accuracy of 45%-51%, a relative improvement of more than 200% over the base approach. For a special type of change suggestion, API adaptation, our tool is highly accurate and useful.
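    A minimal sketch of the underlying repetitiveness measurement: fingerprint code fragments with token n-grams, normalize identifiers so renamed variables still match, and treat heavy fingerprint overlap as a clone candidate. The tokenizer, keyword list, and threshold are deliberate simplifications of what a production clone detector does.

    import re

    KEYWORDS = {"for", "int", "if", "while", "return"}  # crude keyword list

    def tokens(code):
        raw = re.findall(r"[A-Za-z_]\w*|\S", code)  # crude tokenizer
        # Replace identifiers with a placeholder so renamings still match
        # (so-called type-2 clones).
        return ["ID" if re.fullmatch(r"[A-Za-z_]\w*", t) and t not in KEYWORDS
                else t for t in raw]

    def fingerprints(code, n=4):
        ts = tokens(code)
        return {tuple(ts[i:i + n]) for i in range(len(ts) - n + 1)}

    def similarity(a, b):
        fa, fb = fingerprints(a), fingerprints(b)
        return len(fa & fb) / max(1, len(fa | fb))  # Jaccard overlap

    frag1 = "for (int i = 0; i < n; i++) { sum += a[i]; }"
    frag2 = "for (int j = 0; j < n; j++) { total += b[j]; }"

    score = similarity(frag1, frag2)
    print(f"similarity: {score:.2f}")  # 1.00: identical up to renaming
    if score > 0.8:  # hypothetical clone threshold
        print("clone candidate: same pattern up to identifier renaming")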