
    NCeSS Project : Data mining for social scientists

    We discuss the work being undertaken on the NCeSS data mining project, a one-year project at the University of Manchester which began at the start of 2007, to develop data mining tools of value to the social science community. Our primary goal is to produce a suite of data mining codes, supported by a web interface, that allow social scientists to mine their datasets in a straightforward way and hence gain new insights into their data. To fully define the requirements, we are examining a range of typical datasets to find out what forms they take and which applications and algorithms will be required. In this paper, we describe a number of these datasets and discuss how easily data mining techniques can extract information from the data that would either not be possible, or would be too time-consuming, to obtain by more standard methods.

    A machine learning approach for layout inference in spreadsheets

    Spreadsheet applications are among the most widely used tools for content generation and presentation in industry and on the Web. In spite of this success, there is no comprehensive approach to automatically extracting and reusing the wealth of data maintained in this format. The biggest obstacle is the lack of awareness of the structure of the data in spreadsheets, which otherwise could provide the means to automatically understand and extract knowledge from these files. In this paper, we propose a classification approach to discover the layout of tables in spreadsheets. To this end, we focus on the cell level, considering a wide range of features not covered before by related work. We evaluated the performance of our classifiers on a large dataset covering three corpora from various domains. Finally, our work includes a novel technique for detecting and repairing incorrectly classified cells in a post-processing step. The experimental results show that our approach delivers very high accuracy, bringing us a crucial step closer to automatic table extraction.
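The idea of cell-level layout classification can be illustrated with a minimal sketch. The feature names, cell roles, and decision rules below are illustrative assumptions, not the paper's method; in the actual approach, a trained classifier over a much richer feature set would replace the hand-written rules.

```python
def cell_features(value):
    """Derive simple layout features from a cell's raw string value."""
    stripped = value.strip()
    return {
        "is_numeric": stripped.replace(".", "", 1).isdigit(),
        "is_upper": stripped.isalpha() and stripped.isupper(),
        "length": len(stripped),
        "is_empty": stripped == "",
    }

def classify_cell(value):
    """Toy rule-based stand-in for a learned cell classifier."""
    f = cell_features(value)
    if f["is_empty"]:
        return "empty"
    if f["is_numeric"]:
        return "data"
    if f["is_upper"]:
        return "header"
    return "metadata"

row = ["YEAR", "POPULATION", "1841", ""]
print([classify_cell(c) for c in row])  # ['header', 'header', 'data', 'empty']
```

A post-processing repair step, as described in the abstract, could then re-examine cells whose predicted role disagrees with their row or column neighbours.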

    Visualization and analytics of codicological data of Hebrew books

    This research project was developed with the goal of applying data visualization and data mining techniques to the fields of Codicology and the Digital Humanities. The dissertation aims to retrieve information and structure from Hebrew manuscripts collected by codicologists. These manuscripts reflect the book production of a specific region, namely the "Sefarad" region, within the period between the 10th and 16th centuries. The goal is to provide a proper data model, using a common vocabulary, to reduce the heterogeneous nature of these datasets as well as their inherent uncertainty, caused by the descriptive nature of the field of Codicology. Using Hebrew manuscript data as a starting point, this dissertation proposes an environment for exploratory analysis to be used by Humanities experts to deepen their understanding of codicological data, to formulate new research hypotheses or verify existing ones, and to communicate their findings in a richer way. To broaden the scope of the visualizations and of knowledge discovery, we use data mining methods such as Association Rule Mining and Formal Concept Analysis.
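Association Rule Mining, one of the methods named above, can be sketched briefly. The manuscript attributes below are invented examples, and this frequent-itemset counter is a deliberately naive stand-in for a proper Apriori implementation; it only illustrates how support and confidence would surface patterns in codicological records.

```python
from itertools import combinations
from collections import Counter

# Toy codicological records: a set of attributes per manuscript.
records = [
    {"parchment", "sefardi_script", "two_columns"},
    {"parchment", "sefardi_script"},
    {"paper", "sefardi_script", "two_columns"},
    {"parchment", "two_columns"},
]

def frequent_itemsets(records, min_support=0.5, max_size=2):
    """Return itemsets (as sorted tuples) whose support meets the threshold."""
    counts = Counter()
    for r in records:
        for k in range(1, max_size + 1):
            for combo in combinations(sorted(r), k):
                counts[combo] += 1
    n = len(records)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

freq = frequent_itemsets(records)
# Confidence of the rule {parchment} -> {sefardi_script}:
conf = freq[("parchment", "sefardi_script")] / freq[("parchment",)]
print(round(conf, 3))  # 0.667
```

A rule like this ("two thirds of parchment manuscripts use Sefardi script" in the toy data) is exactly the kind of hypothesis the proposed environment would let an expert explore visually.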

    Data management for production quality deep learning models: Challenges and solutions

    Deep learning (DL) based software systems are difficult to develop and maintain in industrial settings due to several challenges. Data management is one of the most prominent challenges complicating DL in industrial deployments. DL models are data-hungry and require high-quality data, so the volume, variety, velocity, and quality of data cannot be compromised. This study aims to explore the data management challenges encountered by practitioners developing systems with DL components, identify potential solutions from the literature, and validate those solutions through a multiple case study. We identified 20 data management challenges experienced by DL practitioners through a multiple interpretive case study. Further, we identified 48 articles through a systematic literature review that discuss solutions for the data management challenges. In a second round of the multiple case study, we show that many of these solutions have limitations and are not used in practice due to a combination of four factors: high cost, lack of skill-set and infrastructure, inability to solve the problem completely, and incompatibility with certain DL use cases. Thus, data management for data-intensive DL models in production is complicated. Although DL technology has achieved very promising results, there is still a significant need for further research in the field of data management to build high-quality datasets and streams that can be used for building production-ready DL systems. Furthermore, we have classified the data management challenges into four categories based on the availability of the solutions. © 2022 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

    An intelligent classification system for land use and land cover mapping using spaceborne remote sensing and GIS

    The objectives of this study were to experiment with and extend current methods of Synthetic Aperture Radar (SAR) image classification, and to design and implement a prototype intelligent remote sensing image processing and classification system for land use and land cover mapping in wet season conditions in Bangladesh, which incorporates SAR images and other geodata. To meet these objectives, the problem of classifying spaceborne SAR images and integrating Geographic Information System (GIS) data and ground truth data was studied first. In this phase of the study, an extension to traditional techniques was made by applying a Self-Organizing feature Map (SOM) to include GIS data with the remote sensing data during image segmentation. The experimental results were compared with those of traditional statistical classifiers, such as Maximum Likelihood, Mahalanobis Distance, and Minimum Distance classifiers. The performances of the classifiers were evaluated in terms of classification accuracy with respect to the collected real-time ground truth data. The SOM neural network provided the highest overall accuracy when a GIS layer of land type classification (with respect to the period of inundation by regular flooding) was used in the network. Using this method, the overall accuracy was around 15% higher than that of the previously mentioned traditional classifiers. It also achieved higher accuracies for more classes in comparison to the other classifiers. However, it was also observed that different classifiers produced better accuracy for different classes. Therefore, the investigation was extended to consider Multiple Classifier Combination (MCC) techniques, a recently emerging research area in pattern recognition. The study tested some of these techniques to improve the classification accuracy by harnessing the strengths of the constituent classifiers.
    A rule-based contention resolution method of combination was developed, which exhibited an improvement in overall accuracy of about 2% in comparison to its best constituent (SOM) classifier. The next phase of the study involved the design of an architecture for an intelligent image processing and classification system (named ISRIPaC) that could integrate the extended methodologies mentioned above. Finally, the architecture was implemented in a prototype and its viability was evaluated using a set of real data. The originality of the ISRIPaC architecture lies in the realisation of the concept of a complete system that can intelligently cover all the steps of image processing and classification, and utilise standardised metadata in addition to a knowledge base in determining the appropriate methods and course of action for a given task. The implemented prototype of the ISRIPaC architecture is a federated system that integrates the CLIPS expert system shell, the IDRISI Kilimanjaro image processing and GIS software, and the domain experts' knowledge via a control agent written in Visual C++. It starts with data assessment and pre-processing and ends with image classification and accuracy assessment. The system is designed to run automatically: the user merely provides the initial information regarding the intended task and the source of available data, and the system itself acquires the necessary information about the data from metadata files in order to make decisions and perform tasks. The test and evaluation of the prototype demonstrate the viability of the proposed architecture and the possibility of extending the system to perform other image processing tasks and to use different sources of data. The system design presented in this study thus suggests some directions for the development of the next generation of remote sensing image processing and classification systems.
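Multiple Classifier Combination can be illustrated with a small sketch. The voting-with-precedence scheme below is a generic stand-in loosely inspired by, but not reproducing, the rule-based contention resolution described above; the classifier names and the precedence order are assumptions for illustration.

```python
from collections import Counter

# Assumed trust order among constituent classifiers (SOM performed best
# overall in the study, so it is consulted first on ties).
PRECEDENCE = ["SOM", "MaxLikelihood", "Mahalanobis", "MinDistance"]

def combine(votes):
    """Combine per-pixel class votes: majority wins; on a tie (contention),
    defer to the most trusted classifier among the tied labels.

    votes: dict mapping classifier name -> predicted class label.
    """
    tally = Counter(votes.values())
    best_count = max(tally.values())
    tied = [label for label, c in tally.items() if c == best_count]
    if len(tied) == 1:
        return tied[0]
    for clf in PRECEDENCE:
        if votes.get(clf) in tied:
            return votes[clf]
    return tied[0]

# Two-vs-two contention resolved in favour of SOM's vote:
print(combine({"SOM": "water", "MaxLikelihood": "crop",
               "Mahalanobis": "crop", "MinDistance": "water"}))  # water
```

Applying such a rule per pixel lets the ensemble keep the clear-majority decisions while breaking deadlocks consistently, which is the general intuition behind harnessing complementary classifiers.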

    The TXM Portal Software giving access to Old French Manuscripts Online

    Full text online: http://www.lrec-conf.org/proceedings/lrec2012/workshops/13.ProceedingsCultHeritage.pdf This paper presents the new TXM software platform giving online access to Old French manuscript images and tagged transcriptions for concordancing and text mining. The platform is able to import medieval sources encoded in XML according to the TEI Guidelines for linking manuscript images to transcriptions, and to encode several diplomatic levels of transcription, including abbreviations and word-level corrections. It includes a sophisticated tokenizer able to deal with TEI tags at different levels of the linguistic hierarchy. Words are tagged on the fly during the import process using the IMS TreeTagger tool with a specific language model. Synoptic editions displaying manuscript images and text transcriptions side by side are automatically produced during the import process. Texts are organized in a corpus with their own metadata (title, author, date, genre, etc.), and several word-property indexes are produced for the CQP search engine to allow efficient word-pattern search for building different types of frequency lists or concordances. For syntactically annotated texts, special indexes are produced for the TIGERSearch engine to allow efficient building of syntactic concordances. The platform has also been tested on classical Latin, ancient Greek, Old Slavonic, and Old Hieroglyphic Egyptian corpora (including various types of encoding and annotations).
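The concordancing the platform provides can be sketched with a minimal keyword-in-context (KWIC) builder. The naive whitespace-and-regex tokenizer below is a deliberate simplification standing in for the platform's TEI-aware tokenizer and CQP indexes, and the sample sentence is an invented Old-French-style example.

```python
import re

def concordance(text, keyword, window=3):
    """Return KWIC lines: each occurrence of keyword with `window`
    tokens of left and right context."""
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}")
    return lines

sample = "Li rois fu mout dolenz et li rois apela ses barons"
for line in concordance(sample, "rois"):
    print(line)
# li [rois] fu mout dolenz
# dolenz et li [rois] apela ses barons
```

In the real platform, queries run against precomputed CQP word-property indexes rather than a linear scan, which is what makes word-pattern search efficient on large corpora.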