248 research outputs found

    Data Cleaning for XML Electronic Dictionaries via Statistical Anomaly Detection

    Get PDF
    Many important forms of data are stored digitally in XML format. Errors can occur in the textual content of the data in the fields of the XML. Fixing these errors manually is time-consuming and expensive, especially for large amounts of data. There is increasing interest in the research, development, and use of automated techniques for assisting with data cleaning. Electronic dictionaries are an important form of data frequently stored in XML format that frequently have errors introduced through a mixture of manual typographical entry errors and optical character recognition errors. In this paper we describe methods for flagging statistical anomalies as likely errors in electronic dictionaries stored in XML format. We describe six systems based on different sources of information. The systems detect errors using various signals in the data including uncommon characters, text length, character-based language models, word-based language models, tied-field length ratios, and tied-field transliteration models. Four of the systems detect errors based on expectations automatically inferred from content within elements of a single field type. We call these single-field systems. Two of the systems detect errors based on correspondence expectations automatically inferred from content within elements of multiple related field types. We call these tied-field systems. For each system, we provide an intuitive analysis of the type of error that it is successful at detecting. Finally, we describe two larger-scale evaluations using crowdsourcing with Amazon's Mechanical Turk platform and using the annotations of a domain expert. The evaluations consistently show that the systems are useful for improving the efficiency with which errors in XML electronic dictionaries can be detected.Comment: 8 pages, 4 figures, 5 tables; published in Proceedings of the 2016 IEEE Tenth International Conference on Semantic Computing (ICSC), Laguna Hills, CA, USA, pages 79-86, February 201

    Health systems data interoperability and implementation

    Get PDF
    Objective The objective of this study was to use machine learning and health standards to address the problem of clinical data interoperability across healthcare institutions. Addressing this problem has the potential to make clinical data comparable, searchable and exchangeable between healthcare providers. Data sources Structured and unstructured data has been used to conduct the experiments in this study. The data was collected from two disparate data sources namely MIMIC-III and NHanes. The MIMIC-III database stored data from two electronic health record systems which are CareVue and MetaVision. The data stored in these systems was not recorded with the same standards; therefore, it was not comparable because some values were conflicting, while one system would store an abbreviation of a clinical concept, the other would store the full concept name and some of the attributes contained missing information. These few issues that have been identified make this form of data a good candidate for this study. From the identified data sources, laboratory, physical examination, vital signs, and behavioural data were used for this study. Methods This research employed a CRISP-DM framework as a guideline for all the stages of data mining. Two sets of classification experiments were conducted, one for the classification of structured data, and the other for unstructured data. For the first experiment, Edit distance, TFIDF and JaroWinkler were used to calculate the similarity weights between two datasets, one coded with the LOINC terminology standard and another not coded. Similar sets of data were classified as matches while dissimilar sets were classified as non-matching. Then soundex indexing method was used to reduce the number of potential comparisons. Thereafter, three classification algorithms were trained and tested, and the performance of each was evaluated through the ROC curve. Alternatively the second experiment was aimed at extracting patient’s smoking status information from a clinical corpus. A sequence-oriented classification algorithm called CRF was used for learning related concepts from the given clinical corpus. Hence, word embedding, random indexing, and word shape features were used for understanding the meaning in the corpus. Results Having optimized all the model’s parameters through the v-fold cross validation on a sampled training set of structured data ( ), out of 24 features, only ( 8) were selected for a classification task. RapidMiner was used to train and test all the classification algorithms. On the final run of classification process, the last contenders were SVM and the decision tree classifier. SVM yielded an accuracy of 92.5% when the and parameters were set to and . These results were obtained after more relevant features were identified, having observed that the classifiers were biased on the initial data. On the other side, unstructured data was annotated via the UIMA Ruta scripting language, then trained through the CRFSuite which comes with the CLAMP toolkit. The CRF classifier obtained an F-measure of 94.8% for “nonsmoker” class, 83.0% for “currentsmoker”, and 65.7% for “pastsmoker”. It was observed that as more relevant data was added, the performance of the classifier improved. The results show that there is a need for the use of FHIR resources for exchanging clinical data between healthcare institutions. FHIR is free, it uses: profiles to extend coding standards; RESTFul API to exchange messages; and JSON, XML and turtle for representing messages. Data could be stored as JSON format on a NoSQL database such as CouchDB, which makes it available for further post extraction exploration. Conclusion This study has provided a method for learning a clinical coding standard by a computer algorithm, then applying that learned standard to unstandardized data so that unstandardized data could be easily exchangeable, comparable and searchable and ultimately achieve data interoperability. Even though this study was applied on a limited scale, in future, the study would explore the standardization of patient’s long-lived data from multiple sources using the SHARPn open-sourced tools and data scaling platformsInformation ScienceM. Sc. (Computing

    DINÁMICAS DE REPRODUCCIÓN DE ERRATAS: OPTIMIZACIÓN DE LA CORRECCIÓN FORMAL EN TRES DICCIONARIOS ESPECIALIZADOS BILINGÜES

    Get PDF
    It is only through an extreme concern for accuracy and the understanding of typographical errors that authors can turn specialised dictionaries into high quality reference works. This paper describes patterns of typographical error reproduction in three specialised English-Spanish dictionaries. We approach intratextual error reproduction (within a particular dictionary), either through related subentries or through non-related subentries. In addition, we compare the frequency of errors between dictionaries written by institutional lexicographers and works written by freelance professionals. The purpose is to provide a model for typographical error detection and analysis that may contribute to formal correctness in reference works. The reason is twofold: a) dictionaries are expected to be high-standard primary tools for language professionals; b) data quality is essential for a wide variety of utilities, ranging from dictionary writing systems and writing assistants to corpus tools.Los diccionarios especializados no pueden ser considerados obras de referencia de calidad si sus autores no prestan una especial atención a la corrección y si no entienden el fenómeno de la reproducción de las erratas. Este artículo describe patrones de reproducción de erratas en tres diccionarios especializados inglés-español. Abordamos la reproducción intratextual de erratas (en un diccionario en particular), tanto en subentradas relacionadas como no relacionadas. Además, comparamos la frecuencia de erratas en diccionarios elaborados por lexicógrafos institucionales con la de obras realizadas por profesionales independientes. El objetivo es ofrecer un modelo de detección y análisis de erratas que contribuya a la corrección formal en obras de referencia, por dos motivos: a) se supone que los diccionarios deben ser herramientas esenciales de alto nivel para los profesionales del lenguaje; b) la calidad de los datos es fundamental para una amplia gama de herramientas, desde programas de elaboración de diccionarios (dictionary writing systems) hasta asistentes de escritura y herramientas relacionadas con córpora.It is only through an extreme concern for accuracy and the understanding of typographical errors that authors can turn specialised dictionaries into high quality reference works. This paper describes patterns of typographical error reproduction in three specialised English-Spanish dictionaries. We approach intratextual error reproduction (within a particular dictionary), either through related subentries or through non-related subentries. In addition, we compare the frequency of errors between dictionaries written by institutional lexicographers and works written by freelance professionals. The purpose is to provide a model for typographical error detection and analysis that may contribute to formal correctness in reference works. The reason is twofold: a) dictionaries are expected to be high-standard primary tools for language professionals; b) data quality is essential for a wide variety of utilities, ranging from dictionary writing systems and writing assistants to corpus tools

    A Data-driven Methodology Towards Mobility- and Traffic-related Big Spatiotemporal Data Frameworks

    Get PDF
    Human population is increasing at unprecedented rates, particularly in urban areas. This increase, along with the rise of a more economically empowered middle class, brings new and complex challenges to the mobility of people within urban areas. To tackle such challenges, transportation and mobility authorities and operators are trying to adopt innovative Big Data-driven Mobility- and Traffic-related solutions. Such solutions will help decision-making processes that aim to ease the load on an already overloaded transport infrastructure. The information collected from day-to-day mobility and traffic can help to mitigate some of such mobility challenges in urban areas. Road infrastructure and traffic management operators (RITMOs) face several limitations to effectively extract value from the exponentially growing volumes of mobility- and traffic-related Big Spatiotemporal Data (MobiTrafficBD) that are being acquired and gathered. Research about the topics of Big Data, Spatiotemporal Data and specially MobiTrafficBD is scattered, and existing literature does not offer a concrete, common methodological approach to setup, configure, deploy and use a complete Big Data-based framework to manage the lifecycle of mobility-related spatiotemporal data, mainly focused on geo-referenced time series (GRTS) and spatiotemporal events (ST Events), extract value from it and support decision-making processes of RITMOs. This doctoral thesis proposes a data-driven, prescriptive methodological approach towards the design, development and deployment of MobiTrafficBD Frameworks focused on GRTS and ST Events. Besides a thorough literature review on Spatiotemporal Data, Big Data and the merging of these two fields through MobiTraffiBD, the methodological approach comprises a set of general characteristics, technical requirements, logical components, data flows and technological infrastructure models, as well as guidelines and best practices that aim to guide researchers, practitioners and stakeholders, such as RITMOs, throughout the design, development and deployment phases of any MobiTrafficBD Framework. This work is intended to be a supporting methodological guide, based on widely used Reference Architectures and guidelines for Big Data, but enriched with inherent characteristics and concerns brought about by Big Spatiotemporal Data, such as in the case of GRTS and ST Events. The proposed methodology was evaluated and demonstrated in various real-world use cases that deployed MobiTrafficBD-based Data Management, Processing, Analytics and Visualisation methods, tools and technologies, under the umbrella of several research projects funded by the European Commission and the Portuguese Government.A população humana cresce a um ritmo sem precedentes, particularmente nas áreas urbanas. Este aumento, aliado ao robustecimento de uma classe média com maior poder económico, introduzem novos e complexos desafios na mobilidade de pessoas em áreas urbanas. Para abordar estes desafios, autoridades e operadores de transportes e mobilidade estão a adotar soluções inovadoras no domínio dos sistemas de Dados em Larga Escala nos domínios da Mobilidade e Tráfego. Estas soluções irão apoiar os processos de decisão com o intuito de libertar uma infraestrutura de estradas e transportes já sobrecarregada. A informação colecionada da mobilidade diária e da utilização da infraestrutura de estradas pode ajudar na mitigação de alguns dos desafios da mobilidade urbana. Os operadores de gestão de trânsito e de infraestruturas de estradas (em inglês, road infrastructure and traffic management operators — RITMOs) estão limitados no que toca a extrair valor de um sempre crescente volume de Dados Espaciotemporais em Larga Escala no domínio da Mobilidade e Tráfego (em inglês, Mobility- and Traffic-related Big Spatiotemporal Data —MobiTrafficBD) que estão a ser colecionados e recolhidos. Os trabalhos de investigação sobre os tópicos de Big Data, Dados Espaciotemporais e, especialmente, de MobiTrafficBD, estão dispersos, e a literatura existente não oferece uma metodologia comum e concreta para preparar, configurar, implementar e usar uma plataforma (framework) baseada em tecnologias Big Data para gerir o ciclo de vida de dados espaciotemporais em larga escala, com ênfase nas série temporais georreferenciadas (em inglês, geo-referenced time series — GRTS) e eventos espacio- temporais (em inglês, spatiotemporal events — ST Events), extrair valor destes dados e apoiar os RITMOs nos seus processos de decisão. Esta dissertação doutoral propõe uma metodologia prescritiva orientada a dados, para o design, desenvolvimento e implementação de plataformas de MobiTrafficBD, focadas em GRTS e ST Events. Além de uma revisão de literatura completa nas áreas de Dados Espaciotemporais, Big Data e na junção destas áreas através do conceito de MobiTrafficBD, a metodologia proposta contem um conjunto de características gerais, requisitos técnicos, componentes lógicos, fluxos de dados e modelos de infraestrutura tecnológica, bem como diretrizes e boas práticas para investigadores, profissionais e outras partes interessadas, como RITMOs, com o objetivo de guiá-los pelas fases de design, desenvolvimento e implementação de qualquer pla- taforma MobiTrafficBD. Este trabalho deve ser visto como um guia metodológico de suporte, baseado em Arqui- teturas de Referência e diretrizes amplamente utilizadas, mas enriquecido com as característi- cas e assuntos implícitos relacionados com Dados Espaciotemporais em Larga Escala, como no caso de GRTS e ST Events. A metodologia proposta foi avaliada e demonstrada em vários cenários reais no âmbito de projetos de investigação financiados pela Comissão Europeia e pelo Governo português, nos quais foram implementados métodos, ferramentas e tecnologias nas áreas de Gestão de Dados, Processamento de Dados e Ciência e Visualização de Dados em plataformas MobiTrafficB

    Intelligent Sensor Networks

    Get PDF
    In the last decade, wireless or wired sensor networks have attracted much attention. However, most designs target general sensor network issues including protocol stack (routing, MAC, etc.) and security issues. This book focuses on the close integration of sensing, networking, and smart signal processing via machine learning. Based on their world-class research, the authors present the fundamentals of intelligent sensor networks. They cover sensing and sampling, distributed signal processing, and intelligent signal learning. In addition, they present cutting-edge research results from leading experts

    Data integration support for offshore decommissioning waste management

    Get PDF
    Offshore oil and gas platforms have a design life of about 25 years whereas the techniques and tools used for managing their data are constantly evolving. Therefore, data captured about platforms during their lifetimes will be in varying forms. Additionally, due to the many stakeholders involved with a facility over its life cycle, information representation of its components varies. These challenges make data integration difficult. Over the years, data integration technology application in the oil and gas industry has focused on meeting the needs of asset life cycle stages other than decommissioning. This is the case because most assets are just reaching the end of their design lives. Currently, limited work has been done on integrating life cycle data for offshore decommissioning purposes, and reports by industry stakeholders underscore this need. This thesis proposes a method for the integration of the common data types relevant in oil and gas decommissioning. The key features of the method are that it (i) ensures semantic homogeneity using knowledge representation languages (Semantic Web) and domain specific reference data (ISO 15926); and (ii) allows stakeholders to continue to use their current applications. Prototypes of the framework have been implemented using open source software applications and performance measures made. The work of this thesis has been motivated by the business case of reusing offshore decommissioning waste items. The framework developed is generic and can be applied whenever there is a need to integrate and query disparate data involving oil and gas assets. The prototypes presented show how the data management challenges associated with assessing the suitability of decommissioned offshore facility items for reuse can be addressed. The performance of the prototypes show that significant time and effort is saved compared to the state-of‐the‐art solution. The ability to do this effectively and efficiently during decommissioning will advance the oil the oil and gas industry’s transition toward a circular economy and help save on cost

    Advances in Data Mining Knowledge Discovery and Applications

    Get PDF
    Advances in Data Mining Knowledge Discovery and Applications aims to help data miners, researchers, scholars, and PhD students who wish to apply data mining techniques. The primary contribution of this book is highlighting frontier fields and implementations of the knowledge discovery and data mining. It seems to be same things are repeated again. But in general, same approach and techniques may help us in different fields and expertise areas. This book presents knowledge discovery and data mining applications in two different sections. As known that, data mining covers areas of statistics, machine learning, data management and databases, pattern recognition, artificial intelligence, and other areas. In this book, most of the areas are covered with different data mining applications. The eighteen chapters have been classified in two parts: Knowledge Discovery and Data Mining Applications

    Advanced document data extraction techniques to improve supply chain performance

    Get PDF
    In this thesis, a novel machine learning technique to extract text-based information from scanned images has been developed. This information extraction is performed in the context of scanned invoices and bills used in financial transactions. These financial transactions contain a considerable amount of data that must be extracted, refined, and stored digitally before it can be used for analysis. Converting this data into a digital format is often a time-consuming process. Automation and data optimisation show promise as methods for reducing the time required and the cost of Supply Chain Management (SCM) processes, especially Supplier Invoice Management (SIM), Financial Supply Chain Management (FSCM) and Supply Chain procurement processes. This thesis uses a cross-disciplinary approach involving Computer Science and Operational Management to explore the benefit of automated invoice data extraction in business and its impact on SCM. The study adopts a multimethod approach based on empirical research, surveys, and interviews performed on selected companies.The expert system developed in this thesis focuses on two distinct areas of research: Text/Object Detection and Text Extraction. For Text/Object Detection, the Faster R-CNN model was analysed. While this model yields outstanding results in terms of object detection, it is limited by poor performance when image quality is low. The Generative Adversarial Network (GAN) model is proposed in response to this limitation. The GAN model is a generator network that is implemented with the help of the Faster R-CNN model and a discriminator that relies on PatchGAN. The output of the GAN model is text data with bonding boxes. For text extraction from the bounding box, a novel data extraction framework consisting of various processes including XML processing in case of existing OCR engine, bounding box pre-processing, text clean up, OCR error correction, spell check, type check, pattern-based matching, and finally, a learning mechanism for automatizing future data extraction was designed. Whichever fields the system can extract successfully are provided in key-value format.The efficiency of the proposed system was validated using existing datasets such as SROIE and VATI. Real-time data was validated using invoices that were collected by two companies that provide invoice automation services in various countries. Currently, these scanned invoices are sent to an OCR system such as OmniPage, Tesseract, or ABBYY FRE to extract text blocks and later, a rule-based engine is used to extract relevant data. While the system’s methodology is robust, the companies surveyed were not satisfied with its accuracy. Thus, they sought out new, optimized solutions. To confirm the results, the engines were used to return XML-based files with text and metadata identified. The output XML data was then fed into this new system for information extraction. This system uses the existing OCR engine and a novel, self-adaptive, learning-based OCR engine. This new engine is based on the GAN model for better text identification. Experiments were conducted on various invoice formats to further test and refine its extraction capabilities. For cost optimisation and the analysis of spend classification, additional data were provided by another company in London that holds expertise in reducing their clients' procurement costs. This data was fed into our system to get a deeper level of spend classification and categorisation. This helped the company to reduce its reliance on human effort and allowed for greater efficiency in comparison with the process of performing similar tasks manually using excel sheets and Business Intelligence (BI) tools.The intention behind the development of this novel methodology was twofold. First, to test and develop a novel solution that does not depend on any specific OCR technology. Second, to increase the information extraction accuracy factor over that of existing methodologies. Finally, it evaluates the real-world need for the system and the impact it would have on SCM. This newly developed method is generic and can extract text from any given invoice, making it a valuable tool for optimizing SCM. In addition, the system uses a template-matching approach to ensure the quality of the extracted information

    Challenges in Cybersecurity and Privacy - the European Research Landscape

    Get PDF
    Cybersecurity and Privacy issues are becoming an important barrier for a trusted and dependable global digital society development. Cyber-criminals are continuously shifting their cyber-attacks specially against cyber-physical systems and IoT, since they present additional vulnerabilities due to their constrained capabilities, their unattended nature and the usage of potential untrustworthiness components. Likewise, identity-theft, fraud, personal data leakages, and other related cyber-crimes are continuously evolving, causing important damages and privacy problems for European citizens in both virtual and physical scenarios. In this context, new holistic approaches, methodologies, techniques and tools are needed to cope with those issues, and mitigate cyberattacks, by employing novel cyber-situational awareness frameworks, risk analysis and modeling, threat intelligent systems, cyber-threat information sharing methods, advanced big-data analysis techniques as well as exploiting the benefits from latest technologies such as SDN/NFV and Cloud systems. In addition, novel privacy-preserving techniques, and crypto-privacy mechanisms, identity and eID management systems, trust services, and recommendations are needed to protect citizens’ privacy while keeping usability levels. The European Commission is addressing the challenge through different means, including the Horizon 2020 Research and Innovation program, thereby financing innovative projects that can cope with the increasing cyberthreat landscape. This book introduces several cybersecurity and privacy research challenges and how they are being addressed in the scope of 15 European research projects. Each chapter is dedicated to a different funded European Research project, which aims to cope with digital security and privacy aspects, risks, threats and cybersecurity issues from a different perspective. Each chapter includes the project’s overviews and objectives, the particular challenges they are covering, research achievements on security and privacy, as well as the techniques, outcomes, and evaluations accomplished in the scope of the EU project. The book is the result of a collaborative effort among relative ongoing European Research projects in the field of privacy and security as well as related cybersecurity fields, and it is intended to explain how these projects meet the main cybersecurity and privacy challenges faced in Europe. Namely, the EU projects analyzed in the book are: ANASTACIA, SAINT, YAKSHA, FORTIKA, CYBECO, SISSDEN, CIPSEC, CS-AWARE. RED-Alert, Truessec.eu. ARIES, LIGHTest, CREDENTIAL, FutureTrust, LEPS. Challenges in Cybersecurity and Privacy - the European Research Landscape is ideal for personnel in computer/communication industries as well as academic staff and master/research students in computer science and communications networks interested in learning about cyber-security and privacy aspects

    Reinventing the Social Scientist and Humanist in the Era of Big Data

    Get PDF
    This book explores the big data evolution by interrogating the notion that big data is a disruptive innovation that appears to be challenging existing epistemologies in the humanities and social sciences. Exploring various (controversial) facets of big data such as ethics, data power, and data justice, the book attempts to clarify the trajectory of the epistemology of (big) data-driven science in the humanities and social sciences
    corecore