
    Data Science Technologies for Vibrant Cities

    Smart cities have forced IT technologies to make a significant step forward in their development. A new generation of agile, knowledge-based software applications and systems has been successfully designed and implemented. The wide capabilities of these agile applications were sufficient to meet the complete set of requirements of smart cities. The fast transformation of modern cities from smart cities into vibrant cities poses new, even more complicated challenges to information technologies. While smart cities assumed wide use of agile means and tools for solving applied tasks, applications for vibrant cities must provide an agile environment for exploring and managing all types of data, information, and knowledge. This agile environment must be flexible enough to support iterative data processing and analysis procedures that can be easily reorganized or changed depending on context. The aim of creating and supporting the agile environment is to extend the set of mathematical, technological, and software solutions in use. The paper proposes building applications for vibrant cities using agile data science methodologies and toolsets within the commonly used approaches for developing agile information systems.

    An overview of recent distributed algorithms for learning fuzzy models in Big Data classification

    Nowadays, a huge amount of data is generated, often in very short time intervals and in various formats, by a number of different heterogeneous sources such as social networks and media, mobile devices, internet transactions, networked devices, and sensors. These data, identified as Big Data in the literature, are characterized by the popular "Vs" features: Value, Veracity, Variety, Velocity, and Volume. In particular, Value focuses on the useful knowledge that may be mined from data. Thus, in recent years, a number of data mining and machine learning algorithms have been proposed to extract knowledge from Big Data. These algorithms have generally been implemented using ad-hoc programming paradigms, such as MapReduce, on specific distributed computing frameworks, such as Apache Hadoop and Apache Spark. In the context of Big Data, fuzzy models are currently playing a significant role, thanks to their capability of handling vague and imprecise data and their innate characteristic of being interpretable. In this work, we give an overview of the most recent distributed learning algorithms for generating fuzzy classification models for Big Data. In particular, we first show some design and implementation details of these learning algorithms. Thereafter, we compare them in terms of accuracy and interpretability. Finally, we discuss their scalability.
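
    To make the interpretability point concrete, here is a minimal sketch, not taken from any of the surveyed algorithms: a toy fuzzy rule-based classifier over a single normalized feature. The triangular membership functions, linguistic terms, rules, and example inputs are all hypothetical, and the winner-take-all inference stands in for the distributed MapReduce/Spark learning that the overview actually discusses.

```python
# Minimal sketch (hypothetical rule base, not from the surveyed papers):
# a toy fuzzy rule-based classifier illustrating why fuzzy models are
# considered interpretable.

def triangular(x, a, b, c):
    """Triangular membership function with support [a, c] and peak at b."""
    return max(min((x - a) / (b - a + 1e-12), (c - x) / (c - b + 1e-12)), 0.0)

# Linguistic terms for one normalized feature (e.g. a sensor reading in [0, 1]).
TERMS = {
    "low":    (0.0, 0.0, 0.5),
    "medium": (0.0, 0.5, 1.0),
    "high":   (0.5, 1.0, 1.0),
}

# Human-readable rule base: IF feature IS <term> THEN <class>.
RULES = [("low", 0), ("medium", 0), ("high", 1)]

def classify(x):
    # Fire every rule and keep the strongest match (winner-take-all inference).
    strengths = [(triangular(x, *TERMS[term]), label) for term, label in RULES]
    return max(strengths)[1]

print(classify(0.2))  # -> 0
print(classify(0.9))  # -> 1
```
    Each prediction traces back to a single human-readable rule, which is the interpretability property the abstract refers to; the surveyed algorithms differ mainly in how such rule bases are learned at scale.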

    PROBABILISTIC SHORT TERM SOLAR DRIVER FORECASTING WITH NEURAL NETWORK ENSEMBLES

    Commonly utilized space weather indices and proxies drive predictive models for thermosphere density, directly impacting objects in low-Earth orbit (LEO) by influencing atmospheric drag forces. A set of solar proxies and indices (drivers), F10.7, S10.7, M10.7, and Y10.7, is created from a mixture of ground-based radio observations and satellite instrument data. These solar drivers represent heating in various levels of the thermosphere and are used as inputs by the JB2008 empirical thermosphere density model. The United States Air Force (USAF) operational High Accuracy Satellite Drag Model (HASDM) relies on JB2008, and on forecasts of solar drivers made by a linear algorithm, to produce forecasts of density. Density forecasts are useful to the space traffic management community and can be used to determine orbital state and probability of collision for space objects. In this thesis, we aim to provide improved and probabilistic forecasting models for these solar drivers, with a focus on providing first-time probabilistic models for S10.7, M10.7, and Y10.7. We introduce auto-regressive methods to forecast solar drivers using neural network ensembles with multi-layer perceptron (MLP) and long short-term memory (LSTM) models in order to improve on the current operational forecasting methods. We investigate input data manipulation methods such as backwards averaging, varied lookback, and PCA rotation for multivariate prediction. We also investigate the differences associated with multi-step and dynamic prediction methods. A novel method for splitting data, referred to as striped sampling, is introduced to produce statistically consistent machine learning data sets. We also investigate the effects of the loss function on forecasting performance and uncertainty estimates, as well as novel ensemble weighting methods. We show that the best models for univariate forecasting are ensemble approaches using multi-step prediction or a combination of multi-step and dynamic predictions. Nearly all univariate approaches offer an improvement, with the best models improving by 48 to 59% in relative mean squared error (MSE) with respect to persistence, which is used as the baseline model in this work. We also show that a stacked neural network ensemble approach significantly outperforms the operational linear method. When using MV-MLE (multivariate multi-lookback ensemble), we see improvements in performance error metrics over the operational method on all drivers. The multivariate approach also yields improvements in root mean squared error (RMSE) for F10.7, S10.7, M10.7, and Y10.7 of 17.7%, 12.3%, 13.8%, and 13.7%, respectively, over the current operational method. We additionally provide the first probabilistic forecasting models for S10.7, M10.7, and Y10.7. Ensemble approaches are leveraged to provide a distribution of predicted values, allowing an investigation into the robustness and reliability (R&R) of uncertainty estimates using the calibration error score (CES) metric and calibration curves. Univariate models provide uncertainty estimates similar to those in other works, while improving on performance metrics. We also produce probabilistic forecasts using MV-MLE, which are well calibrated for all drivers, providing an average CES of 5.63%.
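
    As a rough illustration of the evaluation described above, the sketch below computes an improvement in relative mean squared error with respect to a persistence baseline. It is not the thesis code: the synthetic F10.7-like series and the toy five-member "ensemble" are placeholders for the MLP/LSTM ensembles, and only the metric arithmetic is meant to carry over.

```python
# Minimal sketch (synthetic data, toy ensemble): persistence baseline and
# relative-MSE improvement, the kind of comparison quoted in the abstract.
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a daily F10.7-like solar driver series.
f107 = 120 + 10 * np.sin(np.linspace(0, 20, 500)) + rng.normal(0, 2, 500)

horizon = 3                       # forecast 3 steps ahead
truth = f107[horizon:]
persistence = f107[:-horizon]     # persistence: forecast = last observed value

# Placeholder ensemble: a few noisy "members" instead of trained MLP/LSTM nets.
members = [truth + rng.normal(0, 1.5, truth.size) for _ in range(5)]
forecast = np.mean(members, axis=0)   # ensemble mean prediction
spread = np.std(members, axis=0)      # ensemble spread -> uncertainty estimate

def mse(pred):
    return np.mean((truth - pred) ** 2)

improvement = 100 * (1 - mse(forecast) / mse(persistence))
print(f"relative MSE improvement over persistence: {improvement:.1f}%")
print(f"mean ensemble spread (1-sigma): {spread.mean():.2f}")
```
    The spread across ensemble members is the quantity a calibration error score would then compare against the observed forecast errors.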

    Ubiquitous intelligence for smart cities: a public safety approach

    Citizen-centered safety enhancement is an integral component of public safety and a top priority for decision makers in smart city development. However, public safety agencies are constantly faced with the challenge of deterring crime. While most smart city initiatives have placed emphasis on the use of modern technology for fighting crime, this may not be sufficient to achieve a sustainably safe and smart city in a resource-constrained environment, such as in Africa. In particular, crime series, which are sets of crimes considered to have been committed by the same offender, are currently less explored in developing nations and have great potential for helping to fight crime and promote safety in smart cities. This research focuses on detecting crime situations through data mining approaches that can be used to promote citizens' safety and assist security agencies in knowledge-driven decision support, such as crime series identification. While much research has been conducted on crime hotspots, not enough has been done on identifying crime series. This thesis presents a novel crime clustering model, CriClust, for crime series pattern (CSP) detection and mapping to derive useful knowledge from a crime dataset, drawing on sound scientific and mathematical principles as well as assumptions from theories of environmental criminology. The analysis is augmented using a dual-threshold model, and pattern prevalence information is encoded in similarity graphs. Clusters are identified by finding highly connected subgraphs using adaptive graph size and Monte Carlo heuristics in the Karger-Stein mincut algorithm. We introduce two new interest measures: (i) Proportion Difference Evaluation (PDE), which reveals the propagation effect of a series and the dominant series, and (ii) Pattern Space Enumeration (PSE), which reveals underlying strong correlations and defining features for a series. Our findings on an experimental quasi-real dataset, generated based on expert knowledge recommendations, reveal that identifying CSPs and statistically interpretable patterns could contribute significantly to strengthening public safety service delivery in smart city development. An evaluation was conducted to investigate: (i) the reliability of the model in identifying all inherent series in a crime dataset; (ii) the scalability of the model with varying crime record volumes; and (iii) unique features of the model compared to competing baseline algorithms and related research. It was found that the Monte Carlo technique and the adaptive graph size mechanism for crime similarity clustering yield substantial improvements. The study also found that proportion estimation (PDE) and PSE of series clusters can provide valuable insight into crime deterrence strategies. Furthermore, visual enhancement of clusters using graphical approaches to organising information and presenting a unified, viable view promotes prompt identification of important areas demanding attention. Our model particularly attempts to preserve desirable and robust statistical properties. This research presents considerable empirical evidence that the proposed crime cluster (CriClust) model is promising and can assist in deriving useful crime pattern knowledge, contributing knowledge services for public safety authorities and intelligence gathering organisations in developing nations, thereby promoting a sustainable "safe and smart" city.
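
    To make the graph-based clustering step concrete, the following is a minimal sketch, not the CriClust implementation: a handful of hypothetical crime records, a single attribute-match similarity threshold standing in for the dual-threshold model, and a Karger-style Monte Carlo contraction that searches for a weak cut separating candidate crime series.

```python
# Minimal sketch (hypothetical records, plain Karger contraction): crime
# similarity graph plus randomized min-cut to split candidate crime series.
import random
from itertools import combinations

crimes = {                        # id -> (modus operandi, weekday, area)
    1: ("burglary", "mon", "A"), 2: ("burglary", "tue", "A"),
    3: ("burglary", "mon", "A"), 4: ("robbery", "sat", "C"),
    5: ("robbery", "sun", "C"),
}

def similarity(a, b):
    # Fraction of matching attributes between two crime records.
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Single high threshold standing in for the dual-threshold model.
edges = [(i, j) for i, j in combinations(crimes, 2)
         if similarity(crimes[i], crimes[j]) >= 2 / 3]

def karger_partition(nodes, edges, trials=50, seed=0):
    """Monte Carlo heuristic: contract random edges until two groups remain,
    repeat several trials, and keep the partition with the smallest cut."""
    rng, best = random.Random(seed), None
    for _ in range(trials):
        parent = {v: v for v in nodes}

        def find(v):
            while parent[v] != v:
                v = parent[v]
            return v

        pool, groups = list(edges), len(nodes)
        while groups > 2 and pool:
            u, v = pool.pop(rng.randrange(len(pool)))
            if find(u) != find(v):          # contract edge (u, v)
                parent[find(v)] = find(u)
                groups -= 1
        cut = sum(find(u) != find(v) for u, v in edges)
        if best is None or cut < best[0]:
            roots = {v: find(v) for v in nodes}
            parts = [sorted(v for v in nodes if roots[v] == r)
                     for r in set(roots.values())]
            best = (cut, parts)
    return best

cut_size, series = karger_partition(list(crimes), edges)
print("cut size:", cut_size, "candidate series:", series)
```
    In CriClust the graph construction, thresholds, and Karger-Stein recursion are considerably richer; the sketch only illustrates why repeated random contractions tend to leave highly connected subgraphs intact.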

    Analytics of Sequential Time Data from Physical Assets

    RÉSUMÉ: With advances in sensor technologies and artificial intelligence, data analysis has become a source of information and knowledge that supports decision-making in industry. Making these decisions based solely on human expertise is no longer sufficient or desirable, and is sometimes even infeasible for new industries. The analysis of data collected from physical assets strengthens decision-making with practical knowledge grounded in real data. These data are used to accomplish two main tasks: diagnostics and prognostics. Both tasks pose a challenge, mainly because of the provenance of the data and their adequacy for exploitation, and also because of the difficulty of choosing the type of analysis. The latter requires an analyst with expertise in the different data analysis techniques as well as in the application domain. Data problems are due to the many unknown sources of variation interacting with the collected data, which may sometimes be caused by human error. The choice of modelling type is another challenge, since each model has its own assumptions, parameters, and limitations. This thesis proposes four new types of time-series analysis, two of which are supervised and the other two unsupervised. These analysis techniques are tested and applied to different industrial problems, and they aim to minimize the burden of choice imposed on the analyst. For supervised time-series analysis, the failure time of a physical asset is predicted with a technique named Logical Analysis of Survival Curves (LASC). This technique is used to adaptively stratify survival curves throughout an inspection process, which allows more precise modelling instead of using a single augmented model for all the data. The other supervised prognostic technique is a new bidirectional Long Short-Term Memory (LSTM) neural network called the Bidirectional Handshaking LSTM (BHLSTM). This model makes better use of short sequences by making a round pass through the data. Moreover, the network is trained using a new safety-oriented objective function that forces the network to make safer predictions. Finally, since LSTM is a supervised technique, a new approach for generating the remaining useful life (RUL) target is proposed; it requires fewer assumptions than previous approaches. For unsupervised diagnostic purposes, a new interpretable clustering technique is proposed, named Interpretable Clustering for Rule Extraction and Anomaly Detection (IC-READ). Interpretability means that the resulting clusters are formulated using simple conditional logic, which is practical when providing results to non-specialists and eases any hardware implementation if required. The proposed technique is also non-parametric, meaning that no tuning is required, and it can also be used in a one-class classification setting to build an anomaly detector. The other proposed unsupervised technique is a clustering approach for multivariate, variable-length time series using a modified Dynamic Time Warping (DTW) distance. The modified DTW gives higher matches to time series that have similar trends and magnitudes rather than focusing on only one of these properties. This technique is also non-parametric and uses hierarchical clustering to group time series in an unsupervised fashion, which is particularly useful for deciding on maintenance scheduling. It is also shown that it can be used with Kernel Principal Components Analysis (KPCA) to visualize variable-length sequences in two-dimensional plots.---------- ABSTRACT: Data analysis has become a necessity for industry. Working with inherited expertise alone has become insufficient, expensive, not easily transferable, and mostly unavailable for new industries and facilities. Data analysis can provide decision-makers with more insight into how to manage their production, maintenance, and personnel. Data collection requires the acquisition and storage of observational information about the state of the different production assets. Data collection usually takes place in a timely manner, which results in time series of observations. Depending on the type of data records available, the type of possible analyses will differ. Data labelled with previous human experience, in terms of identifiable faults or fatigue, can be used to build models that perform the expert's task in the future by means of supervised learning. Otherwise, if no human labelling is available, data analysis can provide insights about similar observations or visualize these similarities through unsupervised learning. Both are challenging types of analyses. The challenges are twofold: the first originates from the data and its adequacy, and the other is selecting the type of analysis, which is a decision made by the analyst. Data challenges are due to the substantial number of unknown sources of variation inherent in the collected data, which may sometimes include human errors. Deciding upon the type of modelling is another issue, as each model has its own assumptions, parameters to tune, and limitations. This thesis proposes four new types of time-series analysis, two of which are supervised, requiring data labelled with certain events such as failures, and the other two unsupervised, requiring no such labelling. These analysis techniques are tested and applied on various industrial applications, namely road maintenance, bearing outer race failure detection, cutting tool failure prediction, and turbo engine failure prediction. These techniques target minimizing the burden of choice laid on the analyst working with industrial data by providing reliable analysis tools that require fewer choices to be made by the analyst. This in turn allows different industries to easily make use of their data without requiring much expertise. For prognostic purposes, a proposed modification to the binary Logical Analysis of Data (LAD) classifier is used to adaptively stratify survival curves into long-survivor and short-life sets. This model requires no parameters to be chosen and relies completely on empirical estimations.
The proposed Logical Analysis of Survival Curves shows a 27% improvement in prediction accuracy, in terms of mean absolute error, over the results obtained by well-known machine learning techniques. The other prognostic model is a new bidirectional Long Short-Term Memory (LSTM) neural network termed the Bidirectional Handshaking LSTM (BHLSTM). This model makes better use of short sequences by making a round pass through the given data. Moreover, the network is trained using a new safety-oriented objective function which forces the network to make safer predictions. Finally, since LSTM is a supervised technique, a novel approach for generating the target Remaining Useful Life (RUL) is proposed, requiring fewer assumptions than previous approaches. The proposed network architecture shows an average 18.75% decrease in the mean absolute error of predictions on the NASA turbo engine dataset. For unsupervised diagnostic purposes, a new technique for interpretable clustering is proposed, named Interpretable Clustering for Rule Extraction and Anomaly Detection (IC-READ). Interpretation means that the resulting clusters are formulated using simple conditional logic. This is very important when providing the results to non-specialists, especially those in management, and eases any hardware implementation if required. The proposed technique is also non-parametric, which means no tuning is required, and shows an average 20% improvement in cluster purity over other clustering techniques applied on 11 benchmark datasets. The resulting clusters can also be used to build an anomaly detector. The last proposed technique is a whole multivariate, variable-length time-series clustering approach using a modified Dynamic Time Warping (DTW) distance. The modified DTW gives higher matches for time series that have similar trends and magnitudes rather than focusing on either property alone. This technique is also non-parametric and uses hierarchical clustering to group time series in an unsupervised fashion. This can be specifically useful for management when deciding on maintenance scheduling. It is also shown that it can be used along with Kernel Principal Components Analysis (KPCA) for visualizing variable-length sequences in two-dimensional plots. The unsupervised techniques can help, in cases where there is a lot of variation within certain classes, to ease the supervised learning task by breaking it into smaller problems of the same nature.
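
    The last of the four techniques is the most compact to sketch. The example below is a stand-in rather than the thesis code: it applies plain univariate DTW (the thesis uses a modified, magnitude-aware DTW on multivariate series) to synthetic variable-length signals and then groups them with average-linkage hierarchical clustering.

```python
# Minimal sketch (synthetic series, standard DTW): clustering variable-length
# time series with a DTW distance matrix and hierarchical average linkage.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def dtw(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Five signals of different lengths: three sine-like, two larger cosine-like.
series = [np.sin(np.linspace(0, 6, n)) for n in (40, 55, 70)]
series += [2.5 * np.cos(np.linspace(0, 6, n)) for n in (45, 60)]

k = len(series)
dist = np.zeros((k, k))
for i in range(k):
    for j in range(i + 1, k):
        dist[i, j] = dist[j, i] = dtw(series[i], series[j])

# Condensed distance matrix -> dendrogram -> two flat clusters.
labels = fcluster(linkage(squareform(dist), method="average"),
                  t=2, criterion="maxclust")
print(labels)  # sine-like series land in one cluster, cosine-like in the other
```
    Turning the same pairwise distances into a similarity kernel is the natural starting point for the KPCA-style two-dimensional visualization mentioned above.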

    An Integrated Process for Co-Developing and Implementing Written and Computable Clinical Practice Guidelines

    The goal of this article is to describe an integrated, parallel process for the co-development of written and computable clinical practice guidelines (CPGs) to accelerate adoption and increase the impact of guideline recommendations in clinical practice. From February 2018 through December 2021, interdisciplinary work groups were formed after an initial Kaizen event and, using expert consensus and available literature, produced a 12-phase integrated process (IP). The IP includes activities, resources, and iterative feedback loops for developing, implementing, disseminating, communicating, and evaluating CPGs. The IP incorporates guideline standards and informatics practices and clarifies how informaticians, implementers, health communicators, evaluators, and clinicians can help guideline developers throughout the development and implementation cycle to effectively co-develop written and computable guidelines. More efficient processes are essential to create actionable CPGs, disseminate and communicate recommendations to clinical end users, and evaluate CPG performance. Pilot testing is underway to determine how this IP expedites the implementation of CPGs into clinical practice and improves guideline uptake and health outcomes.

    IT-CODE: IT in COllaborative DEsign


    A STUDY IN THE INFORMATION CONTENT, CONSISTENCY, AND EXPRESSIVE POWER OF FUNCTION STRUCTURES IN MECHANICAL DESIGN

    In engineering design research, function structures are used to represent the intended functionality of technical artifacts. Function structures are graph-based representations in which the nodes are functions, or actions, and the edges are flows, or objects of those actions. For the consistent description of artifact functionality, multiple controlled vocabularies have been developed in previous research. The Functional Basis is one such vocabulary; it provides a set of verbs and a set of nouns organized in a three-level hierarchy, and it has been extensively studied in design research. Two major applications of this vocabulary are the Design Repository, a web-based archive of design information on consumer electro-mechanical products obtained through reverse engineering, and the functional decomposition grammar rules that synthesize sub-functions, or elementary actions, of a product from the overall function or goal of the product. However, despite the Functional Basis' popularity, the usefulness of its hierarchical structure has not been specifically tested. Additionally, although this vocabulary provides the verbs and nouns, no explicit guideline for using those terms in function structures has been proposed. Consequently, multiple representational inconsistencies can be found in the function structures within the Design Repository. The two research goals in this thesis are: (1) to investigate whether the hierarchy in the Functional Basis is useful for constructing function structures, and (2) to explore means of increasing the consistency and expressive power of the Functional Basis vocabulary. To address the first goal, an information metric for function structures and function vocabularies is developed based on the principles of information theory. This metric is applied to three function structures from the Design Repository to demonstrate that the secondary level of the Functional Basis is the most informative of the three levels. This finding is validated by an external empirical study, which shows that the secondary level is used most frequently in the Design Repository, ultimately indicating that the hierarchy is not useful for constructing function structures. To address the second research goal, a new representation of functions, including rules for the topological connections in a function structure, is presented. It is demonstrated through experiments that the new representation is more expressive than the text-based descriptions of functions used in the Functional Basis, as it formally describes which flows can be connected to which functions. It is also shown that the new representation reduces the uncertainty involved in individual function structures.
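
    The information metric itself is not reproduced here, but the idea behind it can be illustrated with Shannon entropy over term-usage counts: a vocabulary level whose terms are used with a fairly even spread carries more information per term choice than a level dominated by a few terms. The counts below are hypothetical, not data from the Design Repository.

```python
# Minimal sketch (hypothetical counts): Shannon entropy as an information
# measure for the levels of a controlled function vocabulary.
import math

def entropy_bits(counts):
    """Shannon entropy, in bits, of an empirical usage distribution."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

usage = {
    "primary":   [120, 80, 60],              # few broad verbs, used often
    "secondary": [40, 35, 30, 25, 20, 15],   # finer verbs, fairly even usage
    "tertiary":  [90, 5, 3, 2, 1, 1, 1, 1],  # many verbs, highly skewed usage
}

for level, counts in usage.items():
    print(f"{level:9s} entropy = {entropy_bits(counts):.2f} bits")
```
    With these illustrative counts the secondary level scores highest, mirroring the thesis' finding that the secondary level of the Functional Basis is the most informative.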

    Proceedings of the 12th International Conference on Digital Preservation

    The 12th International Conference on Digital Preservation (iPRES) was held on November 2-6, 2015 in Chapel Hill, North Carolina, USA. There were 327 delegates from 22 countries. The program included 12 long papers, 15 short papers, 33 posters, 3 demos, 6 workshops, 3 tutorials and 5 panels, as well as several interactive sessions and a Digital Preservation Showcase