
    Turning data into information: assessing and reporting GIS metadata integrity using integrated computing technologies

    A Geographic Information System (GIS) serves as the tangible and intangible means by which spatially related phenomena can be created, analyzed and rendered. GIS metadata provides the formal framework for cataloguing information about a GIS data set, independent of the encoded spatial and attribute information. GIS metadata is a subset of electronic metadata, which catalogs electronic resources such as web pages and software applications; it differs from other forms of electronic metadata, however, in that each metadata file can be tied to a spatial component that other forms of metadata lack. Using open source technologies such as R, Perl and PHP, metadata for large GIS data sets (thousands of layers) can be harvested more quickly and efficiently than by manual review. In doing so, metrics expressing the integrity of both the metadata and the GIS data can be captured, displayed and compared for use in decision making. Supervised and unsupervised techniques allow users and computer algorithms to uncover trends in the GIS data that are not obvious to a human reviewer. The validity of these analyses was tested using a Technology Acceptance Model (TAM): responses from 40 GIS professionals were collected to relate the technology's Perceived Ease of Use, Perceived Usefulness, Attitude Towards Using and Intention to Further Use.
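
    To illustrate the kind of automated check involved, here is a minimal Python sketch (the thesis itself used R, Perl and PHP) that scores the completeness of FGDC-style XML metadata files. The element paths, required-element list, and folder name are illustrative assumptions, not the thesis's actual rubric.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical list of required FGDC (CSDGM) elements; the thesis's actual
# integrity rubric is not reproduced here.
REQUIRED = [
    "idinfo/citation/citeinfo/title",
    "idinfo/descript/abstract",
    "idinfo/timeperd/timeinfo",
    "idinfo/spdom/bounding",
    "metainfo/metc",
]

def completeness(xml_path: Path) -> float:
    """Return the fraction of required elements present and non-empty."""
    root = ET.parse(xml_path).getroot()
    found = 0
    for path in REQUIRED:
        node = root.find(path)
        # Count the element if it has text content or child elements.
        if node is not None and ((node.text or "").strip() or len(node) > 0):
            found += 1
    return found / len(REQUIRED)

if __name__ == "__main__":
    # Assumed: a folder of metadata XML files, one per GIS layer.
    for f in sorted(Path("metadata").glob("*.xml")):
        print(f"{f.name}: {completeness(f):.0%} complete")
```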

    Socioeconomic characteristics of cancer mortality in the United States of America: a spatial data mining approach

    Cancer is the second leading cause of death in the United States of America. Though it is generally known that cancer is influenced by environment, its relation to socioeconomic conditions is still widely debated. This research analyzed the spatial distribution of breast, colorectal, lung, and prostate cancer mortality, and its associated socioeconomic characteristics, using association rule mining. The mortality patterns were analyzed at the county and health service area levels, corresponding to the years 1999–2002 and 1988–1992, respectively. Association rule mining revealed distinct socioeconomic characteristics of cancer mortality: counties with very high breast cancer mortality also had a very low percentage of whites who walked to work; very high colorectal cancer mortality was associated with a very low percentage of foreign-born population; very high lung cancer mortality was associated with a very low percentage of whites who walked to work; and counties with very high prostate cancer mortality had a very low percentage of residents born in the West. The cancer mortality and socioeconomic variables were discretized using equal interval, natural breaks, and quantile methods to analyze the impact the discretization technique has on the patterns obtained. The three techniques produced patterns involving different rates of cancer mortality and socioeconomic characteristics. A 5-class natural breaks discretization achieved the highest discretization accuracy, while the equal interval method produced association rules with the highest support values. The research also analyzed the effect of scale on the patterns produced. At the county level, breast and lung cancers were associated with mode of transportation to work, whereas colorectal and prostate cancers were associated with place of birth. At the health service area level, the highest-support association rules for breast, colorectal, and prostate cancer mortality involved household and family characteristics, whereas high lung cancer mortality was associated with low educational attainment.
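
    To make the workflow concrete, the sketch below shows equal-interval and quantile discretization feeding association rule mining in Python. The file, column names, and thresholds are hypothetical, and mlxtend stands in for whatever software the study actually used; natural breaks discretization would additionally require a package such as jenkspy.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

df = pd.read_csv("county_data.csv")  # assumed file with county-level rates

# Discretize into 5 classes: equal interval via pd.cut, quantile via pd.qcut.
labels = ["very_low", "low", "medium", "high", "very_high"]
binned = pd.DataFrame({
    "lung_mortality": pd.cut(df["lung_mortality"], bins=5, labels=labels),
    "pct_walk_to_work": pd.qcut(df["pct_walk_to_work"], q=5, labels=labels),
})

# One-hot encode the class labels so apriori can treat them as items.
items = pd.get_dummies(binned).astype(bool)
frequent = apriori(items, min_support=0.05, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]].head())
```

    Re-running this with different binning choices (bins=3 vs. bins=5, cut vs. qcut) is exactly how one would observe the study's finding that the discretization method changes which rules, and which support values, emerge.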

    ExpRalytics: expressive and efficient analysis of RDF graphs

    Large (linked) open data sets are increasingly shared as RDF graphs today. However, such data does not yet reach its full potential in terms of sharing and reuse. The main obstacle lies in users' ability to explore, discover and grasp the content of RDF graphs, a complex task because such graphs are naturally heterogeneous and can be both voluminous and intricate. We provide new methods to meaningfully summarize data graphs, with a particular focus on RDF graphs. One class of tools for this task is structural RDF graph summaries, which allow users to grasp the different connections between RDF graph nodes. To this end, we introduce our novel RDFQuotient tool, which finds compact yet informative RDF graph summaries that can serve as first-sight visualizations of an RDF graph's structure. We also consider the problem of automatically identifying the k most interesting aggregate queries that can be evaluated on an RDF graph, given an integer k and a user-specified interestingness function. Aggregate queries are routinely used to learn insights from relational data warehouses, and some prior research has addressed the problem of automatically recommending interesting aggregate queries.
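
    One simple structural-summary idea can be sketched as follows: group nodes that share the same set of outgoing property labels, so each distinct property set becomes one summary node. This is only one possible equivalence relation, cruder than those RDFQuotient actually uses, and the input file name is assumed.

```python
from collections import defaultdict
from rdflib import Graph

g = Graph()
g.parse("data.ttl", format="turtle")  # assumed input RDF file

# Collect the set of outgoing property labels for each subject node.
out_props = defaultdict(set)
for s, p, o in g:
    out_props[s].add(p)

# Quotient step: nodes with identical property sets collapse together.
classes = defaultdict(list)
for node, props in out_props.items():
    classes[frozenset(props)].append(node)

for props, members in classes.items():
    print(len(members), "node(s) with properties:",
          sorted(str(p) for p in props))
```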

    Semantic interpretation of events in lifelogging

    The topic of this thesis is lifelogging, the automatic, passive recording of a person's daily activities, and in particular the semantic analysis and enrichment of lifelogged data. Our work centers on visual lifelog data, such as that captured by wearable cameras. Such cameras generate an archive of a person's day from a first-person viewpoint, but one of the problems with this is the sheer volume of information generated. To make this potentially very large volume of information more manageable, our analysis segments each day's lifelog data into discrete, non-overlapping events corresponding to activities in the wearer's day. To manage lifelog data at the event level, we define a set of concepts in an ontology appropriate to the wearer, automatically detect these concepts in the events, and then semantically enrich each detected event so that the concepts serve as an index into the events. Once this enrichment is complete, the lifelog can support semantic search for everyday media management, serve as a memory aid, or contribute to medical analysis of the activities of daily living (ADL). In the thesis we address the problem of selecting the concepts to be used for indexing events, and we propose a semantic, density-based algorithm to cope with concept selection for lifelogging. We then apply activity detection to classify everyday activities, employing the selected concepts as high-level semantic features. Finally, each activity is modeled by multi-context representations and enriched using Semantic Web technologies. The thesis includes an experimental evaluation using real data from users and shows the performance of our algorithms in capturing the semantics of everyday concepts and their efficacy in activity recognition and semantic enrichment.
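
    A minimal sketch of the event-segmentation step, assuming per-frame visual embeddings are already available: cut the image stream wherever the cosine similarity of consecutive frames drops below a threshold. The thesis's segmentation draws on richer visual and sensor cues; this only illustrates the basic idea.

```python
import numpy as np

def segment_events(embeddings: np.ndarray, threshold: float = 0.9):
    """Split a day's frames into (start, end) events at low-similarity cuts."""
    events, start = [], 0
    for i in range(1, len(embeddings)):
        a, b = embeddings[i - 1], embeddings[i]
        sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        if sim < threshold:          # likely scene change: close current event
            events.append((start, i))
            start = i
    events.append((start, len(embeddings)))
    return events

# Illustrative input: 1000 frames with 512-d visual features.
frames = np.random.rand(1000, 512)
print(segment_events(frames)[:5])
```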

    Information Technology and Lawyers. Advanced Technology in the Legal Domain, from Challenges to Daily Routine


    Detecting and Classifying Malevolent Dialogue Responses: Taxonomy, Data and Methodology

    Conversational interfaces are increasingly popular as a way of connecting people to information. Corpus-based conversational interfaces are able to generate more diverse and natural responses than template-based or retrieval-based agents. With the increased generative capacity of corpus-based conversational agents comes the need to classify and filter out malevolent responses that are inappropriate in terms of content and dialogue acts. Previous studies on recognizing and classifying inappropriate content have mostly focused on a single category of malevolence, or on single sentences rather than an entire dialogue. In this paper, we define the task of Malevolent Dialogue Response Detection and Classification (MDRDC) and make three contributions to advance research on it. First, we present a Hierarchical Malevolent Dialogue Taxonomy (HMDT). Second, we create a labelled multi-turn dialogue dataset and formulate the MDRDC task as a hierarchical classification task over this taxonomy. Third, we apply state-of-the-art text classification methods to the MDRDC task and report on extensive experiments assessing the performance of these approaches. (Under review at JASIS.)
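
    A common baseline for hierarchical classification, sketched below with scikit-learn, is to predict the fine-grained label and then map it to its coarse-grained parent. The texts, labels, and fine-to-coarse mapping here are invented for illustration; they are not the paper's HMDT categories, dataset, or models.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical two-level taxonomy: fine-grained label -> coarse parent.
PARENT = {
    "insult": "hate", "slur": "hate",
    "threat": "violence", "incitement": "violence",
    "none": "benign",
}

texts = ["you are useless", "I will find you", "thanks, that helps"]
fine_labels = ["insult", "threat", "none"]

# Train a flat classifier over the fine labels.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, fine_labels)

# At prediction time, derive the coarse label from the fine one.
for utt in ["that was a dumb idea", "have a nice day"]:
    fine = clf.predict([utt])[0]
    print(utt, "->", fine, "/", PARENT[fine])
```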

    New challenges in associative classification: Big Data and applications

    Associative classification arises from the union of two important machine learning tasks: the descriptive task of association rule mining, a mechanism for obtaining previously unknown and interesting information from a data set, combined with the predictive task of classification, which predicts a variable of interest from a set of known explanatory variables. The objectives of this doctoral thesis are: 1) to study and analyze the state of the art of both association rule mining and associative classification; 2) to propose new associative classification and association rule mining models that are accurate, interpretable and efficient, as well as flexible enough to incorporate subjective knowledge; and 3) given the enormous amount of data generated every day, to pay special attention to the treatment of very large data sets, also known as Big Data.
First, the state of the art of both associative classification and association rule mining was analyzed through an exhaustive study of the related literature. This laid the groundwork for the remaining objectives and revealed that associative classification lacked a mechanism for unified, comprehensive comparisons. To this end, a software tool was proposed that includes at least one algorithm from every category in the current taxonomy, enabling more diverse and complete comparisons than before; this had previously been an arduous task at best, since many algorithms were not available in executable form, let alone as open source. The tool also provides a very diverse set of metrics to quantify the quality of results from different perspectives and to unify future comparisons with other proposals.
Second, the analysis showed that existing approaches scale neither horizontally nor vertically to relatively large data sets. Given the growing academic and industrial interest in processing huge amounts of data, the thesis continues with an analysis of different proposals for Big Data. Special attention was paid to distributed computing, which has proved to be the only procedure that can handle very large amounts of data without resorting to sampling, and in particular to MapReduce methodologies, which decompose complex problems into divisible, parallelizable fractions that are later aggregated to obtain the final result. As a result, different algorithms were proposed that handle very large amounts of data without loss of accuracy or interpretability, scaling the load across many compute nodes; all were designed to run on the best-known open-source MapReduce implementations.
Third and last, a proposal was made to improve the state of the art of associative classification. Since association rules are the basis and determining factor of associative classifiers, a new proposal for association rule mining was developed first, combining the latest advances in distributed computing, such as MapReduce, with evolutionary algorithms, which have demonstrated excellent results in the area. In particular, grammar-guided genetic programming was used for its flexibility in encoding solutions and in introducing subjective knowledge into the search process, while also easing computational and memory requirements. This new algorithm significantly improves association rule mining: it obtains better results than existing proposals on different kinds of data, including truly Big Data sets, and across different interest metrics, and its sequential version was also compared against existing algorithms. The algorithm was then adapted to extract class association rules and to build a classifier from them. Again, grammar-guided genetic programming was used, allowing the user to introduce subjective knowledge not only into the form of the rules but also into the final form of the classifier. This proposal was compared with existing sequential associative classification algorithms, showing significant differences in accuracy, interpretability and efficiency, and with Big Data-specific proposals, obtaining excellent results while maintaining a trade-off among the conflicting objectives of interpretability, accuracy and efficiency.
This doctoral thesis was developed within an appropriate experimental framework, using a varied collection of freely available data sets of diverse dimensionality, number of instances and number of classes, including both small data and Big Data. All results were compared with the corresponding state of the art, and non-parametric statistical tests were used to verify that the differences found are statistically significant and not due to chance. All comparisons consider different perspectives, analyzing performance, efficiency, accuracy and interpretability in each study.
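
    For context, the core of a rule-based associative classifier can be sketched in a few lines (CBA-style: sort class association rules by confidence, then support, and label an instance with the first rule whose antecedent it satisfies). The rules below are hand-made toy examples; in the thesis they are instead evolved with grammar-guided genetic programming.

```python
from typing import Dict, List, Tuple

# A class association rule: (antecedent, class label, confidence, support).
Rule = Tuple[Dict[str, str], str, float, float]

rules: List[Rule] = [
    ({"income": "high", "urban": "yes"}, "low_risk", 0.92, 0.10),
    ({"income": "low"}, "high_risk", 0.85, 0.20),
]
default_class = "low_risk"  # fallback, e.g. the majority class

def classify(instance: Dict[str, str]) -> str:
    # Try rules in order of confidence, then support (both descending).
    for antecedent, label, _, _ in sorted(rules, key=lambda r: (-r[2], -r[3])):
        if all(instance.get(k) == v for k, v in antecedent.items()):
            return label
    return default_class

print(classify({"income": "low", "urban": "no"}))    # -> high_risk
print(classify({"income": "high", "urban": "yes"}))  # -> low_risk
```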

    Data analytics 2016: proceedings of the fifth international conference on data analytics
