
    NCeSS Project : Data mining for social scientists

    We discuss the work being undertaken on the NCeSS data mining project, a one-year project at the University of Manchester which began at the start of 2007, to develop data mining tools of value to the social science community. Our primary goal is to produce a suite of data mining codes, supported by a web interface, that allow social scientists to mine their datasets in a straightforward way and hence gain new insights into their data. To fully define the requirements, we are examining a range of typical datasets to find out what forms they take and which applications and algorithms will be required. In this paper, we describe a number of these datasets and discuss how easily data mining techniques can extract information from the data that would either not be possible, or would be too time-consuming, to obtain by more standard methods.

    A machine learning approach for layout inference in spreadsheets

    Spreadsheet applications are among the most widely used tools for content generation and presentation in industry and on the Web. In spite of this success, there is no comprehensive approach to automatically extracting and reusing the wealth of data maintained in this format. The biggest obstacle is the lack of awareness of the structure of the data in spreadsheets, which otherwise could provide the means to automatically understand and extract knowledge from these files. In this paper, we propose a classification approach to discover the layout of tables in spreadsheets. To this end, we focus on the cell level, considering a wide range of features not covered before by related work. We evaluated the performance of our classifiers on a large dataset covering three corpora from various domains. Finally, our work includes a novel technique for detecting and repairing incorrectly classified cells in a post-processing step. The experimental results show that our approach delivers very high accuracy, bringing us a crucial step closer to automatic table extraction.
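The idea of cell-level layout classification can be illustrated with a minimal sketch. The feature names, cell roles, and decision rules below are illustrative assumptions, not the paper's method; in the actual approach, a trained classifier over a much richer feature set would replace the hand-written rules.

```python
def cell_features(value):
    """Derive simple layout features from a cell's raw string value."""
    stripped = value.strip()
    return {
        "is_numeric": stripped.replace(".", "", 1).isdigit(),
        "is_upper": stripped.isalpha() and stripped.isupper(),
        "length": len(stripped),
        "is_empty": stripped == "",
    }

def classify_cell(value):
    """Toy rule-based stand-in for a learned cell classifier."""
    f = cell_features(value)
    if f["is_empty"]:
        return "empty"
    if f["is_numeric"]:
        return "data"
    if f["is_upper"]:
        return "header"
    return "metadata"

row = ["YEAR", "POPULATION", "1841", ""]
print([classify_cell(c) for c in row])  # ['header', 'header', 'data', 'empty']
```

A post-processing repair step, as described in the abstract, could then re-examine cells whose predicted role disagrees with their row or column neighbours.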

    Visualization and analytics of codicological data of Hebrew books

    This research project was developed with the goal of applying data visualization and data mining techniques to the fields of Codicology and the Digital Humanities. The dissertation aims to retrieve information and structure from Hebrew manuscripts collected by codicologists. These manuscripts reflect the book production of a specific region, namely the "Sefarad" region, within the period between the 10th and 16th centuries. The goal is to provide a proper data model, using a common vocabulary, to reduce the heterogeneous nature of these datasets as well as their inherent uncertainty, caused by the descriptive nature of the field of Codicology. Using Hebrew manuscript data as a starting point, this dissertation proposes an environment for exploratory analysis to be used by Humanities experts to deepen their understanding of codicological data, to formulate new research hypotheses or verify existing ones, and to communicate their findings in a richer way. To broaden the scope of the visualizations and of knowledge discovery, we use data mining methods such as Association Rule Mining and Formal Concept Analysis.
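Association Rule Mining, one of the methods named above, can be sketched briefly. The manuscript attributes below are invented examples, and this frequent-itemset counter is a deliberately naive stand-in for a proper Apriori implementation; it only illustrates how support and confidence would surface patterns in codicological records.

```python
from itertools import combinations
from collections import Counter

# Toy codicological records: a set of attributes per manuscript.
records = [
    {"parchment", "sefardi_script", "two_columns"},
    {"parchment", "sefardi_script"},
    {"paper", "sefardi_script", "two_columns"},
    {"parchment", "two_columns"},
]

def frequent_itemsets(records, min_support=0.5, max_size=2):
    """Return itemsets (as sorted tuples) whose support meets the threshold."""
    counts = Counter()
    for r in records:
        for k in range(1, max_size + 1):
            for combo in combinations(sorted(r), k):
                counts[combo] += 1
    n = len(records)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

freq = frequent_itemsets(records)
# Confidence of the rule {parchment} -> {sefardi_script}:
conf = freq[("parchment", "sefardi_script")] / freq[("parchment",)]
print(round(conf, 3))  # 0.667
```

A rule like this ("two thirds of parchment manuscripts use Sefardi script" in the toy data) is exactly the kind of hypothesis the proposed environment would let an expert explore visually.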

    Data management for production quality deep learning models: Challenges and solutions

    Deep learning (DL) based software systems are difficult to develop and maintain in industrial settings due to several challenges. Data management is one of the most prominent challenges complicating DL in industrial deployments. DL models are data-hungry and require high-quality data, so the volume, variety, velocity, and quality of data cannot be compromised. This study aims to explore the data management challenges encountered by practitioners developing systems with DL components, identify potential solutions from the literature, and validate those solutions through a multiple case study. We identified 20 data management challenges experienced by DL practitioners through a multiple interpretive case study. Further, we identified 48 articles through a systematic literature review that discuss solutions for the data management challenges. In a second round of the multiple case study, we show that many of these solutions have limitations and are not used in practice due to a combination of four factors: high cost, lack of skill-set and infrastructure, inability to solve the problem completely, and incompatibility with certain DL use cases. Thus, data management for data-intensive DL models in production is complicated. Although DL technology has achieved very promising results, there is still a significant need for further research in the field of data management to build high-quality datasets and streams that can be used for building production-ready DL systems. Furthermore, we have classified the data management challenges into four categories based on the availability of the solutions. © 2022 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

    An intelligent classification system for land use and land cover mapping using spaceborne remote sensing and GIS

    The objectives of this study were to experiment with and extend current methods of Synthetic Aperture Radar (SAR) image classification, and to design and implement a prototype intelligent remote sensing image processing and classification system for land use and land cover mapping in wet season conditions in Bangladesh, which incorporates SAR images and other geodata. To meet these objectives, the problem of classifying spaceborne SAR images and integrating Geographic Information System (GIS) data and ground truth data was studied first. In this phase of the study, an extension to traditional techniques was made by applying a Self-Organizing feature Map (SOM) to include GIS data with the remote sensing data during image segmentation. The experimental results were compared with those of traditional statistical classifiers, such as Maximum Likelihood, Mahalanobis Distance, and Minimum Distance classifiers. The performances of the classifiers were evaluated in terms of classification accuracy with respect to the collected real-time ground truth data. The SOM neural network provided the highest overall accuracy when a GIS layer of land type classification (with respect to the period of inundation by regular flooding) was used in the network. Using this method, the overall accuracy was around 15% higher than that of the previously mentioned traditional classifiers. It also achieved higher accuracies for more classes in comparison to the other classifiers. However, it was also observed that different classifiers produced better accuracy for different classes. Therefore, the investigation was extended to consider Multiple Classifier Combination (MCC) techniques, a recently emerging research area in pattern recognition. The study tested some of these techniques to improve the classification accuracy by harnessing the strengths of the constituent classifiers.
    A rule-based contention resolution method of combination was developed, which exhibited an improvement in overall accuracy of about 2% in comparison to its best constituent (SOM) classifier. The next phase of the study involved the design of an architecture for an intelligent image processing and classification system (named ISRIPaC) that could integrate the extended methodologies mentioned above. Finally, the architecture was implemented in a prototype and its viability was evaluated using a set of real data. The originality of the ISRIPaC architecture lies in the realisation of the concept of a complete system that can intelligently cover all the steps of image processing and classification, and utilise standardised metadata in addition to a knowledge base in determining the appropriate methods and course of action for a given task. The implemented prototype of the ISRIPaC architecture is a federated system that integrates the CLIPS expert system shell, the IDRISI Kilimanjaro image processing and GIS software, and the domain experts' knowledge via a control agent written in Visual C++. It starts with data assessment and pre-processing and ends with image classification and accuracy assessment. The system is designed to run automatically: the user merely provides the initial information regarding the intended task and the source of available data, and the system itself acquires the necessary information about the data from metadata files in order to make decisions and perform tasks. The test and evaluation of the prototype demonstrate the viability of the proposed architecture and the possibility of extending the system to perform other image processing tasks and to use different sources of data. The system design presented in this study thus suggests some directions for the development of the next generation of remote sensing image processing and classification systems.
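Multiple Classifier Combination can be illustrated with a small sketch. The voting-with-precedence scheme below is a generic stand-in loosely inspired by, but not reproducing, the rule-based contention resolution described above; the classifier names and the precedence order are assumptions for illustration.

```python
from collections import Counter

# Assumed trust order among constituent classifiers (SOM performed best
# overall in the study, so it is consulted first on ties).
PRECEDENCE = ["SOM", "MaxLikelihood", "Mahalanobis", "MinDistance"]

def combine(votes):
    """Combine per-pixel class votes: majority wins; on a tie (contention),
    defer to the most trusted classifier among the tied labels.

    votes: dict mapping classifier name -> predicted class label.
    """
    tally = Counter(votes.values())
    best_count = max(tally.values())
    tied = [label for label, c in tally.items() if c == best_count]
    if len(tied) == 1:
        return tied[0]
    for clf in PRECEDENCE:
        if votes.get(clf) in tied:
            return votes[clf]
    return tied[0]

# Two-vs-two contention resolved in favour of SOM's vote:
print(combine({"SOM": "water", "MaxLikelihood": "crop",
               "Mahalanobis": "crop", "MinDistance": "water"}))  # water
```

Applying such a rule per pixel lets the ensemble keep the clear-majority decisions while breaking deadlocks consistently, which is the general intuition behind harnessing complementary classifiers.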

    The TXM Portal Software giving access to Old French Manuscripts Online

    Full text online: http://www.lrec-conf.org/proceedings/lrec2012/workshops/13.ProceedingsCultHeritage.pdf This paper presents the new TXM software platform giving online access to Old French manuscript images and tagged transcriptions for concordancing and text mining. The platform is able to import medieval sources encoded in XML according to the TEI Guidelines for linking manuscript images to transcriptions, and to encode several diplomatic levels of transcription, including abbreviations and word-level corrections. It includes a sophisticated tokenizer able to deal with TEI tags at different levels of the linguistic hierarchy. Words are tagged on the fly during the import process using the IMS TreeTagger tool with a specific language model. Synoptic editions displaying manuscript images and text transcriptions side by side are automatically produced during the import process. Texts are organized in a corpus with their own metadata (title, author, date, genre, etc.), and several word-property indexes are produced for the CQP search engine to allow efficient word-pattern search for building different types of frequency lists or concordances. For syntactically annotated texts, special indexes are produced for the TIGERSearch engine to allow efficient building of syntactic concordances. The platform has also been tested on classical Latin, ancient Greek, Old Slavonic, and Old Hieroglyphic Egyptian corpora (including various types of encoding and annotations).
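The concordancing the platform provides can be sketched with a minimal keyword-in-context (KWIC) builder. The naive whitespace-and-regex tokenizer below is a deliberate simplification standing in for the platform's TEI-aware tokenizer and CQP indexes, and the sample sentence is an invented Old-French-style example.

```python
import re

def concordance(text, keyword, window=3):
    """Return KWIC lines: each occurrence of keyword with `window`
    tokens of left and right context."""
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}")
    return lines

sample = "Li rois fu mout dolenz et li rois apela ses barons"
for line in concordance(sample, "rois"):
    print(line)
# li [rois] fu mout dolenz
# dolenz et li [rois] apela ses barons
```

In the real platform, queries run against precomputed CQP word-property indexes rather than a linear scan, which is what makes word-pattern search efficient on large corpora.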