5 research outputs found

    Supervised and Unsupervised Categorization of an Imbalanced Italian Crime News Dataset

    The automatic categorization of crime news is useful for producing statistics on the types of crime occurring in a given area. This task can be treated as a text categorization problem. Several studies have shown that word embeddings improve outcomes in many Natural Language Processing (NLP) tasks, including text categorization. This paper explores the use of word embeddings for categorizing Italian crime news. The approach compares different document pre-processing steps, Word2Vec models, and methods for obtaining word embeddings, including the extraction of bigrams and keyphrases. Supervised and unsupervised Machine Learning categorization algorithms are then applied and compared. In addition, the class imbalance of the input dataset is addressed by using the Synthetic Minority Oversampling Technique (SMOTE) to oversample the minority classes. Experiments on an Italian dataset of 17,500 crime news articles collected from 2011 to 2021 show very promising results. Supervised categorization outperforms unsupervised categorization, exceeding 80% in both precision and recall and reaching an accuracy of 0.86. Lemmatization, bigrams, and keyphrase extraction turn out not to be decisive. Finally, the model is available on GitHub together with the code used to extract word embeddings, which allows the approach to be replicated on other corpora, in Italian or in other languages.
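The oversampling step mentioned in this abstract can be illustrated with a minimal sketch. It is a simplified, standard-library variant of SMOTE, not the authors' code: the function name `smote`, its parameters, and the toy two-dimensional vectors are assumptions made for illustration. In the setting described above, the input vectors would presumably be document embeddings of a minority crime category.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating between a minority
    sample and one of its k nearest minority neighbours (simplified SMOTE)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical 2-D "embeddings" of a minority class.
minority_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_vectors = smote(minority_vectors, n_new=4, k=2)
```

Each synthetic vector lies on a segment between two existing minority vectors, so the minority class grows without duplicating articles verbatim.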

    Distributed Database Management Techniques for Wireless Sensor Networks

    In sensor networks, the large amount of data generated by sensors greatly influences the lifetime of the network. To manage this sensed data in an energy-efficient way, new methods of storage and data query are needed. The distributed database approach for sensor networks has proved to be one of the most energy-efficient data storage and query techniques. This paper surveys the state of the art of the techniques used to manage data and queries in wireless sensor networks based on the distributed paradigm. A classification of these techniques is also proposed.
The goal of this work is not only to present how data and query management techniques have advanced, but also to show their benefits and drawbacks and to identify open issues, providing guidelines for further contributions to this type of distributed architecture.
This work was partially supported by the Instituto de Telecomunicações, Next Generation Networks and Applications Group (NetGNA), Portugal, by the Ministerio de Ciencia e Innovacion, through the Plan Nacional de I+D+i 2008-2011 in the Subprograma de Proyectos de Investigacion Fundamental, project TEC2011-27516, by the Polytechnic University of Valencia, through the PAID-05-12 multidisciplinary projects, by the Government of the Russian Federation, Grant 074-U01, and by National Funding from the FCT-Fundacao para a Ciencia e a Tecnologia through the Pest-OE/EEI/LA0008/2013 Project.
Diallo, O.; Rodrigues, JJPC.; Sene, M.; Lloret, J. (2013). Distributed Database Management Techniques for Wireless Sensor Networks. IEEE Transactions on Parallel and Distributed Systems. PP(99):1-17. https://doi.org/10.1109/TPDS.2013.207

    FIN-DM: finantsteenuste andmekaeve protsessi mudel (FIN-DM: A Data Mining Process Model for Financial Services)

    Data mining is a set of rules, processes, and algorithms that allow companies to increase revenues, reduce costs, optimize products and customer relationships, and achieve other business goals by extracting actionable insights from the data they collect on a day-to-day basis. Data mining and analytics projects require well-defined methodology and processes. Several standard process models for conducting data mining and analytics projects are available. Among them, the most notable and widely adopted standard model is CRISP-DM. It is industry-agnostic and is often adapted to meet sector-specific requirements. Industry-specific adaptations of CRISP-DM have been proposed across several domains, including healthcare, education, industrial and software engineering, and logistics. However, until now there has been no adaptation of CRISP-DM for the financial services industry, which has its own set of domain-specific requirements. This PhD thesis addresses this gap by designing, developing, and evaluating a sector-specific data mining process for financial services (FIN-DM). The thesis investigates how standard data mining processes are used across various industry sectors and in financial services. The examination identified a number of adaptation scenarios of traditional frameworks. It also suggested that these approaches do not pay sufficient attention to turning data mining models into software products integrated into the organizations' IT architectures and business processes. In the financial services domain, the main adaptation scenarios discovered concerned technology-centric aspects (scalability), business-centric aspects (actionability), and human-centric aspects (mitigating discriminatory effects) of data mining.
Next, a case study in an actual financial services organization revealed 18 perceived gaps in the CRISP-DM process. Using the data and results from these studies, the thesis outlines an adaptation of CRISP-DM for the financial sector, named the Financial Industry Process for Data Mining (FIN-DM). FIN-DM extends CRISP-DM to support privacy-compliant data mining, to tackle AI ethics risks, to fulfill risk management requirements, and to embed quality assurance as part of the data mining life cycle.
https://www.ester.ee/record=b547227

    Performance assessment of real-time data management on wireless sensor networks

    Technological advances in recent years have brought Wireless Sensor Networks (WSNs), which perform environmental monitoring and data collection, to maturity. Such a network is composed of hundreds, thousands, or possibly even millions of tiny smart computers known as wireless sensor nodes, which may be battery powered and are equipped with sensors, a radio transceiver, a Central Processing Unit (CPU), and some memory. However, because the nodes must be small and low-cost, their resources, such as processing power, storage, and especially energy, are very limited. Once the sensors have taken their measurements from the environment, the problem of storing and querying the data arises: the sensors have restricted storage capacity, and the ongoing interaction between sensors and environment produces huge amounts of data. Techniques for data storage and query in WSNs can be based on either external storage or local storage. External storage, called the warehousing approach, is a centralized system in which the data gathered by the sensors are periodically sent to a central database server where user queries are processed. Local storage, called the distributed approach, exploits the computational capabilities of the sensors, which act as local databases. The data is stored in a central database server and in the devices themselves, enabling one to query both. WSNs are used in a wide variety of applications, which may perform certain operations on the collected sensor data. For certain applications, such as real-time applications, the sensor data must closely reflect the current state of the targeted environment. However, the environment changes constantly and the data is collected at discrete moments in time. As such, the collected data has a temporal validity: as time advances, it becomes less accurate, until it no longer reflects the state of the environment.
Thus, applications such as industrial automation, aviation, and sensor networks must query and analyze the data within a bounded time in order to make decisions and react efficiently. In this context, efficient real-time data management solutions are needed to deal with both time constraints and energy consumption. This thesis studies real-time data management techniques for WSNs. In particular, it focuses on the challenges of handling real-time data storage and query for WSNs and on efficient real-time data management solutions for WSNs. First, the main specifications of real-time data management are identified and the real-time data management solutions for WSNs available in the literature are presented. Second, in order to provide an energy-efficient real-time data management solution, the techniques used to manage data and queries in WSNs based on the distributed paradigm are studied in depth. Many research works argue that the distributed approach, rather than warehousing, is the most energy-efficient way of managing data and queries in WSNs. In addition, this approach can provide quasi real-time query processing because the most current data are retrieved from the network. Third, based on these two studies and considering the complexity of developing, testing, and debugging this kind of system, a model for a simulation framework for real-time database management on WSNs that uses a distributed approach is proposed, together with its implementation. This helps to explore various real-time database techniques on WSNs before deployment, saving money and time. Moreover, the proposed model may be improved by adding the simulation of protocols or by placing part of this simulator on another available simulator. To validate the model, a case study considering real-time constraints as well as energy constraints is discussed.
Fourth, a new architecture that combines statistical modeling techniques with the distributed approach, and a query processing algorithm that optimizes real-time user query processing, are proposed. This combination enables a query processing algorithm based on admission control that uses the error tolerance and the probabilistic confidence interval as admission parameters. Experiments on real-world as well as synthetic data sets demonstrate that the proposed solution optimizes real-time query processing, saving more energy while keeping latency low. Fundação para a Ciência e Tecnologi
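One way to read the admission-control idea in this abstract is sketched below. It is an assumed interpretation, not the thesis's algorithm: the function name `answer_query`, the normal-approximation confidence interval, and the sample readings are all hypothetical. A query is answered from recent readings only if they are still temporally valid and the confidence-interval half-width fits within the query's error tolerance. Otherwise the query is not admitted, and fresh data would have to be acquired from the network.

```python
import math
import statistics


def answer_query(readings, now, validity, error_tolerance, z=1.96):
    """readings: list of (timestamp, value) pairs.

    Returns (estimate, admitted). The estimate is the mean of the readings
    that are still within their temporal validity; the query is admitted
    only if the confidence-interval half-width is within error_tolerance.
    """
    fresh = [value for ts, value in readings if now - ts <= validity]
    if len(fresh) < 2:
        return None, False  # too little valid data to build an interval
    estimate = statistics.fmean(fresh)
    # Normal-approximation half-width; z = 1.96 corresponds to roughly 95%.
    half_width = z * statistics.stdev(fresh) / math.sqrt(len(fresh))
    return estimate, half_width <= error_tolerance


# Hypothetical temperature readings as (timestamp in seconds, degrees Celsius).
samples = [(0, 25.0), (8, 20.0), (9, 20.1), (10, 19.9), (11, 20.0)]
estimate, admitted = answer_query(samples, now=12, validity=5, error_tolerance=0.5)
```

Answering from already-stored readings avoids radio transmissions, which is where the energy saving in such a scheme would come from.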

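    Réception des données spatiales et leurs traitements : analyse d'images satellites pour la mise à jour des SIG par enrichissement du système de raisonnement spatial RCC8 (Reception and processing of spatial data: satellite image analysis for updating GIS by enriching the RCC8 spatial reasoning system)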

    Nowadays, the resolution of satellite images and the volume of available geographic databases are constantly growing. High-resolution remote sensing images are heterogeneous data sources that are increasingly necessary yet difficult to exploit. These images are considered very rich and useful sources for updating Geographic Information Systems (GIS). To update these databases, a change detection step is required. This thesis studies satellite image analysis by enriching the spatial reasoning system RCC8 (Region Connection Calculus) for the detection of topological changes in order to update GIS databases. The ultimate goal of this study is to exploit, detail, and enrich the topological relations of the RCC8 system. The value of enriching and describing the RCC8 relations in detail lies in the fact that they make it possible to detect automatically the different levels of topological detail and the topological changes between geographical regions represented on GIS digital maps and in satellite images. In this thesis, we propose and develop an extension of the Intersection and Difference (ID) topological model using three topological invariants: the separation number, the neighborhood, and the spatial element type. This extension enriches and details the relations of the RCC8 system at two levels of detail. At the first level, the RCC8 system is enriched with the topological invariant of the separation number, and the new system is called "system RCC-16 at level-1".
To avoid confusion between the relations of this new system, at the second level the "system RCC-16 at level-1" is enriched with the topological invariant of the spatial element type, and the new system is called "system RCC-16 at level-2". These two RCC-16 systems (level-1 and level-2) will be applied to satellite image analysis, change detection, and spatial analysis in GIS. We propose a new method for detecting changes between a new satellite image and an old GIS digital map. This method integrates topological analysis with the RCC-16 system to detect and identify changes between two satellite images, or between two vector maps produced at different dates. In this study of the enrichment of the RCC8 system, spatial regions have simple spatial representations. However, the spatial representation of regions and the topological relations between them in satellite images and GIS data are more complex, vague, and uncertain. To study the topological relations between fuzzy regions, a model called the Fuzzy topological model of Intersection and Difference (FID) is proposed and developed. 152 topological relations can be extracted using the FID model. These 152 relations are grouped into eight qualitative clusters of the RCC8 system: Disjoint (Disconnected), Meets (Externally Connected), Overlaps (Partially Overlapping), CoveredBy (Tangential Proper Part), Inside (Non-Tangential Proper Part), Covers (Tangential Proper Part Inverse), Contains (Non-Tangential Proper Part Inverse), and Equal. These relations will be evaluated and extracted from satellite images to give examples of their value in image analysis and in GIS.
The contribution of this thesis is the enrichment of the qualitative spatial reasoning system RCC8, giving rise to a new system, RCC-16; the implementation of a new change detection method; the FID model; and the clustering of the 152 fuzzy topological relations into the eight qualitative clusters of the RCC8 system.
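The eight RCC8 relations named in this abstract can be made concrete with a small sketch. As a deliberate simplification, regions are modeled here as closed one-dimensional intervals rather than the two-dimensional image and map regions studied in the thesis. The function names `rcc8` and `topological_change` are hypothetical, and the RCC-16 and FID extensions are not modeled.

```python
def rcc8(a, b):
    """Return the RCC8 relation of closed interval a relative to closed
    interval b, using the relation names listed in the abstract."""
    (a1, a2), (b1, b2) = a, b
    if a1 >= a2 or b1 >= b2:
        raise ValueError("intervals must satisfy start < end")
    if (a1, a2) == (b1, b2):
        return "Equal"
    if a2 < b1 or b2 < a1:
        return "Disjoint"
    if a2 == b1 or b2 == a1:
        return "Meets"  # only the boundaries touch
    if b1 < a1 and a2 < b2:
        return "Inside"  # a lies strictly within b
    if b1 <= a1 and a2 <= b2:
        return "CoveredBy"  # a lies within b and shares a boundary point
    if a1 < b1 and b2 < a2:
        return "Contains"
    if a1 <= b1 and b2 <= a2:
        return "Covers"
    return "Overlaps"


def topological_change(region_old, region_new, reference):
    """Compare a region's relation to a reference region at two dates and
    return (before, after) if the relation changed, else None."""
    before = rcc8(region_old, reference)
    after = rcc8(region_new, reference)
    return None if before == after else (before, after)


# A hypothetical parcel that grows until it touches a neighbouring region.
change = topological_change((0, 1), (0, 2), (2, 3))
```

Comparing the relation computed from an old map with the one computed from a new image is the basic pattern behind topological change detection: a change is flagged only when the qualitative relation itself differs.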