Input variable selection in time-critical knowledge integration applications: A review, analysis, and recommendation paper
This is the post-print version of the final paper published in Advanced Engineering Informatics. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms, may not be reflected in this document, and changes may have been made to this work since it was submitted for publication. Copyright © 2013 Elsevier B.V.
The purpose of this research is twofold: first, to undertake a thorough appraisal of existing Input Variable Selection (IVS) methods within the context of time-critical and computation-resource-limited dimensionality reduction problems; second, to demonstrate improvements to, and the application of, a recently proposed time-critical sensitivity analysis method called EventTracker in an environmental-science industrial use case, namely sub-surface drilling.
Producing time-critical, accurate knowledge about the state of a system (effect) under computational and data acquisition (cause) constraints is a major challenge, especially if the required knowledge is critical to system operation and the safety of operators or the integrity of costly equipment is at stake. Understanding and interpreting a chain of interrelated events, predicted or unpredicted, that may or may not result in a specific state of the system is the core challenge of this research. The main objective is therefore to identify which set of input data signals has a significant impact on the set of system state information (i.e., output). Through a cause-effect analysis technique, the proposed approach supports the filtering of unsolicited data that can otherwise clog up the communication and computational capabilities of a standard supervisory control and data acquisition system.
The paper analyzes the performance of input variable selection techniques from a series of perspectives. It then expands the categorization and assessment of sensitivity analysis methods into a structured framework that takes into account the relationship between inputs and outputs, the nature of their time series, and the computational effort required. The outcome of this analysis is that established methods have limited suitability for time-critical variable selection applications. By way of a geological drilling monitoring scenario, the suitability of the proposed EventTracker sensitivity analysis method for high-volume and time-critical input variable selection problems is demonstrated.
Machine Learning Approach for Risk-Based Inspection Screening Assessment
Risk-based inspection (RBI) screening assessment is used to identify equipment that makes a significant contribution to the system's total risk of failure (RoF), so that the RBI detailed assessment can focus on analyzing higher-risk equipment. Due to its qualitative nature and high dependency on sound engineering judgment, screening assessment is vulnerable to human biases and errors; it is thus subject to output variability, which threatens the integrity of the assets. This paper attempts to tackle these challenges by utilizing a machine learning approach to conduct screening assessment. A case study using a dataset of RBI assessments for oil and gas production and processing units is provided to illustrate the development of an intelligent system, based on a machine learning model, for performing RBI screening assessment. The best-performing model achieves accuracy and precision of 92.33% and 84.58%, respectively. A comparative analysis between the performance of the intelligent system and the conventional assessment is performed to examine the benefits of applying the machine learning approach to RBI screening assessment. The result shows that the machine learning approach can potentially improve the quality of the conventional RBI screening assessment output by reducing output variability and increasing accuracy and precision.
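The abstract does not specify which model family was used, but the core idea, mapping equipment attributes to a screening risk label, can be illustrated with a deliberately minimal sketch. The feature names, data values, and the 1-nearest-neighbour classifier below are all invented for illustration; they stand in for, and are not, the paper's actual model.

```python
# Hypothetical sketch of ML-based RBI screening: classify equipment as
# high/low risk from invented features via a 1-nearest-neighbour rule.

def distance(a, b):
    # Squared Euclidean distance between two feature vectors
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict(train, x):
    # 1-NN: return the label of the closest training record
    return min(train, key=lambda rec: distance(rec[0], x))[1]

# (age_years, operating_temp_C, corrosion_rate_mm_per_yr) -> risk label
train = [
    ((2.0, 80.0, 0.05), "low"),
    ((25.0, 300.0, 0.60), "high"),
    ((5.0, 120.0, 0.10), "low"),
    ((30.0, 350.0, 0.80), "high"),
]

test = [((4.0, 100.0, 0.08), "low"), ((28.0, 320.0, 0.70), "high")]
preds = [predict(train, x) for x, _ in test]
accuracy = sum(p == y for p, (_, y) in zip(preds, test)) / len(test)
print(preds, accuracy)  # ['low', 'high'] 1.0
```

A production system would of course use a richer feature set and a trained model with proper validation, as the reported 92.33% accuracy figure implies.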
Temporal Information in Data Science: An Integrated Framework and its Applications
Data science is a well-known buzzword that is in fact composed of two distinct keywords, i.e., data and science. Data itself is of great importance: each analysis task begins from a set of examples. Based on this consideration, the present work starts with the analysis of a real case scenario, considering the development of a data-warehouse-based decision support system for an Italian contact center company. Then, relying on the information collected in the developed system, a set of machine-learning-based analysis tasks was developed to answer specific business questions, such as employee work anomaly detection and automatic call classification. Although these initial applications rely on already available algorithms, as we shall see, some clever analysis workflows also had to be developed. Afterwards, continuously driven by real data and real-world applications, we turned to the question of how to handle temporal information within classical decision tree models. Our research led to the development of J48SS, a decision tree induction algorithm based on Quinlan's C4.5 learner, which is capable of dealing with temporal (e.g., sequential and time series) as well as atemporal (e.g., numerical and categorical) data during the same execution cycle. The decision tree has been applied to some real-world analysis tasks, proving its worth. A key characteristic of J48SS is its interpretability, an aspect that we specifically addressed through the study of an evolutionary decision tree pruning technique. Next, since a lot of work concerning the management of temporal information has already been done in the automated reasoning and formal verification fields, a natural direction in which to proceed was to investigate how such solutions may be combined with machine learning, following two main tracks.
First, we show, through the development of an enriched decision tree capable of encoding temporal information by means of interval temporal logic formulas, how a machine learning algorithm can successfully exploit temporal logic to perform data analysis. Then, we focus on the opposite direction, i.e., that of employing machine learning techniques to generate temporal logic formulas, considering a natural language processing scenario. Finally, as a conclusive development, the architecture of a system is proposed in which formal methods and machine learning techniques are seamlessly combined to perform anomaly detection and predictive maintenance tasks. Such an integration represents an original, thrilling research direction that may open up new ways of dealing with complex, real-world problems.
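J48SS itself mines sequential patterns and time-series shapelets inside the tree induction; a full reimplementation is out of scope here. The sketch below only illustrates the simpler underlying idea of letting temporal and atemporal attributes coexist in one learning cycle, by summarising each time series into scalar features that can sit next to categorical ones. The record fields and summary statistics are invented for illustration.

```python
# Simplified illustration of unifying temporal and atemporal attributes:
# summarise each time series into scalars, then append them to the
# categorical/numerical features. (This is not the J48SS algorithm, which
# extracts sequential patterns and shapelets during tree induction.)

def series_features(ts):
    mean = sum(ts) / len(ts)
    trend = ts[-1] - ts[0]  # crude slope proxy: last minus first value
    return [mean, trend]

record = {
    "department": "billing",         # atemporal, categorical
    "calls_per_hour": [4, 6, 9, 13]  # temporal
}

flat = [record["department"]] + series_features(record["calls_per_hour"])
print(flat)  # ['billing', 8.0, 9]
```

Any standard decision tree learner could then consume `flat` rows; the contribution of J48SS is doing this natively, without the lossy up-front summarisation step shown here.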
Integration of Social Media News Mining and Text Mining Techniques to Determine a Corporate's Competitive Edge
Market globalization has triggered far more severe challenges for corporations than ever before. Thus, how to survive in this highly fluctuating economic atmosphere is an attractive topic for corporate managers, especially when an economy goes into a severe recession. One of the most widely agreed conclusions is to tightly integrate a corporation's supply chain network, as this can facilitate knowledge circulation, reduce transportation costs, increase market share, and sustain customer loyalty. However, a corporation's supply chain relations are unapparent and opaque. To overcome this obstacle, this study integrates text mining (TM) and social network analysis (SNA) techniques to extract latent relations among corporations from social media news. Subsequently, this study examines the impact of these relations on corporate operating performance forecasting. The empirical result shows that the proposed mechanism is a promising alternative for performance forecasting. Public authorities and decision makers can thus consider the potential implications when forming future policy.
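The TM + SNA pipeline described above can be sketched in miniature: extract which firms co-occur in each news item, build a weighted co-mention network, and rank firms by degree. Everything below, firm names, snippets, and the substring matching standing in for real named-entity recognition, is invented for illustration.

```python
# Hedged sketch: inferring latent supply-chain relations from news by
# co-mention counts, then ranking firms by (weighted) degree centrality.
from itertools import combinations
from collections import Counter

firms = ["AcmeChips", "BoltAuto", "CargoCo"]
news = [
    "AcmeChips signs supply deal with BoltAuto",
    "BoltAuto ships via CargoCo",
    "AcmeChips expands fab capacity",
]

# Count each pair of firms mentioned in the same article (an edge)
edges = Counter()
for article in news:
    present = [f for f in firms if f in article]
    for a, b in combinations(sorted(present), 2):
        edges[(a, b)] += 1

# Weighted degree: sum of edge weights touching each firm
degree = Counter()
for (a, b), w in edges.items():
    degree[a] += w
    degree[b] += w

print(edges, degree.most_common(1))
```

In the study's setting, centrality-style features from such a network would then feed the performance-forecasting model; here `BoltAuto`, mentioned with both partners, comes out as the most connected node.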
Metadata-driven data integration
Cotutela: Universitat Politècnica de Catalunya and Université Libre de Bruxelles, IT4BI-DC programme for the joint Ph.D. degree in computer science.
Data has an undoubtable impact on society. Storing and processing large amounts of available data is currently one of the key success factors for an organization. Nonetheless, we have recently witnessed a change represented by huge and heterogeneous amounts of data. Indeed, 90% of the data in the world has been generated in the last two years. Thus, in order to carry out these data exploitation tasks, organizations must first perform data integration, combining data from multiple sources to yield a unified view over them. Yet, the integration of massive and heterogeneous amounts of data requires revisiting the traditional integration assumptions to cope with the new requirements posed by such data-intensive settings.
This PhD thesis aims to provide a novel framework for data integration in the context of data-intensive ecosystems, which entails dealing with vast amounts of heterogeneous data, from multiple sources and in their original format. To this end, we advocate an integration process consisting of sequential activities governed by a semantic layer, implemented via a shared repository of metadata. From a stewardship perspective, these activities are the deployment of a data integration architecture, followed by the population of the shared metadata. From a data consumption perspective, the activities are virtual and materialized data integration, the former an exploratory task and the latter a consolidation one. Following the proposed framework, we focus on providing contributions to each of the four activities.
We begin by proposing a software reference architecture for semantic-aware data-intensive systems. This architecture serves as a blueprint to deploy a stack of systems, its core being the metadata repository. Next, we propose a graph-based metadata model as a formalism for metadata management. We focus on supporting schema and data source evolution, a predominant factor in the heterogeneous sources at hand. For virtual integration, we propose query rewriting algorithms that rely on the previously proposed metadata model. We additionally consider semantic heterogeneities in the data sources, which the proposed algorithms are capable of automatically resolving. Finally, the thesis focuses on the materialized integration activity and, to this end, proposes a method to select intermediate results to materialize in data-intensive flows. Overall, the results of this thesis serve as a contribution to the field of data integration in contemporary data-intensive ecosystems.
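The semantic-layer idea can be made concrete with a toy example: a shared metadata repository maps a global concept to the attributes that hold it in each source, so a query over the unified view can be rewritten into one fetch per source. The concept name, source names, and attribute names below are invented; the thesis's actual rewriting algorithms operate over a richer graph model with evolution and heterogeneity handling.

```python
# Minimal sketch of metadata-driven query rewriting: the shared metadata
# maps global concepts to (source, attribute) pairs, and a query over a
# concept is rewritten into one query per contributing source.

metadata = {
    "Customer.email": [                      # global concept
        ("crm_db", "clients.mail"),
        ("web_logs", "user.contact_email"),
    ],
}

def rewrite(concept):
    """Rewrite a global-concept query into per-source queries."""
    return [f"SELECT {attr} FROM {src}" for src, attr in metadata[concept]]

queries = rewrite("Customer.email")
print(queries)
```

Schema evolution in this picture amounts to updating the `(source, attribute)` pairs in the repository, leaving queries over the global concept untouched, which is precisely the decoupling the thesis argues for.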
A generic predictive information system for resource planning and optimisation
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.
The purpose of this research work is to demonstrate the feasibility of creating a quick-response decision platform for middle management in industry. It utilises the strengths of, and creates a leap forward in, the theory and practice of Supervisory Control and Data Acquisition (SCADA) systems and Discrete Event Simulation and Modelling (DESM). The proposed research platform uses real-time data to create an automatic platform for real-time and predictive system analysis, giving current and ahead-of-time information on the performance of the system in an efficient manner. Data acquisition, as the back-end connection of a data integration system to the shop floor, faces both hardware and software challenges in coping with large-scale real-time data collection. The limited scope of SCADA systems makes them unsuitable candidates for this, and the cost, complexity, and narrow efficiency focus of proprietary solutions leave further challenges unaddressed. A Flexible Data Input Layer Architecture (FDILA) is proposed as a generic data integration platform so that a multitude of data sources can be connected to the data processing unit. The efficiency of the proposed integration architecture lies in decentralising and distributing services between different layers. A novel Sensitivity Analysis (SA) method called EvenTracker is proposed as an effective tool to measure the importance and priority of inputs to the system. The EvenTracker method is introduced to deal with complex systems in real time. The approach takes advantage of an event-based definition of the data involved in the process flow. The underpinning logic behind the EvenTracker SA method is capturing the cause-effect relationships between triggers (input variables) and events (output variables) over a period of time specified by an expert.
The approach does not require estimating data distributions of any kind, nor does it require the performance model to execute beyond real time. The proposed EvenTracker sensitivity analysis method has the lowest computational complexity compared with other popular sensitivity analysis methods. For proof of concept, a three-tier data integration system was designed and developed using National Instruments' LabVIEW programming language, Rockwell Automation's Arena simulation and modelling software, and OPC data communication software. A laboratory-based conveyor system with 29 sensors was installed to simulate a typical shop-floor production line. In addition, the EvenTracker SA method was applied to data extracted from 28 sensors on one manufacturing line in a real factory. The experiment found 14% of the input variables to be unimportant for evaluating the model outputs. The method achieved a time efficiency gain of 52% in the analysis of the filtered system once the unimportant input variables were no longer sampled. Compared to the entropy-based SA technique, the only other method usable for real-time purposes, the EvenTracker SA method is quicker, more accurate, and less computationally burdensome. Additionally, a theoretical estimation of the computational complexity of SA methods, based on both structural complexity and energy-time analysis, favoured the efficiency of the proposed EvenTracker SA method. Both the laboratory and factory-based experiments demonstrated the flexibility and efficiency of the proposed solution. Funded by the Engineering and Physical Sciences Research Council.
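The trigger-event logic described above can be sketched with a toy scoring rule: an input is deemed sensitive when its changes are followed by output changes within an expert-chosen time window. The scoring below is a simplification for illustration, not the published EvenTracker computation, and the sensor names and timestamps are invented.

```python
# Hedged sketch in the spirit of EvenTracker: score each input (trigger)
# by the fraction of its events that are followed by an output event
# within a time window. No data distributions are estimated.

def sensitivity(trigger_times, event_times, window=1.0):
    # A trigger "hits" if some output event falls in [t, t + window]
    hits = sum(
        any(0 <= e - t <= window for e in event_times)
        for t in trigger_times
    )
    return hits / len(trigger_times) if trigger_times else 0.0

sensor_a = [1.0, 4.0, 9.0]   # timestamps of input-signal changes
sensor_b = [2.5, 7.0]
output   = [1.5, 4.8, 9.2]   # timestamps of output-state changes

scores = {
    "sensor_a": sensitivity(sensor_a, output),
    "sensor_b": sensitivity(sensor_b, output),
}
print(scores)  # {'sensor_a': 1.0, 'sensor_b': 0.0}
```

Under this rule `sensor_b` scores zero and would be a candidate for filtering, mirroring how the thesis stops sampling the inputs found unimportant.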
SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary
The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is considered the "de facto" standard in the framework of learning from imbalanced data. This is due to the simplicity of its design, as well as its robustness when applied to different types of problems. Since its publication in 2002, SMOTE has proven successful in a variety of applications from several different domains. SMOTE has also inspired several approaches to counter the issue of class imbalance, and has significantly contributed to new supervised learning paradigms, including multilabel classification, incremental learning, semi-supervised learning, and multi-instance learning, among others. It is a standard benchmark for learning from imbalanced data and is featured in a number of different software packages, from open source to commercial. In this paper, marking the fifteen-year anniversary of SMOTE, we reflect on the SMOTE journey, discuss the current state of affairs with SMOTE and its applications, and identify the next set of challenges to extend SMOTE for Big Data problems.
This work has been partially supported by the Spanish Ministry of Science and Technology under projects TIN2014-57251-P, TIN2015-68454-R and TIN2017-89517-P; the Project BigDaP-TOOLS - Ayudas Fundación BBVA a Equipos de Investigación Científica 2016; and the National Science Foundation (NSF) Grant IIS-1447795.
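The core of SMOTE (Chawla et al., 2002) is a single interpolation step: a synthetic minority sample is placed on the line segment between a minority point x and one of its k nearest minority neighbours n, i.e. x_new = x + r(n - x) with r drawn uniformly from [0, 1]. The sketch below shows that step on invented 2-D points; production use would rely on a tested implementation such as imbalanced-learn's `SMOTE`.

```python
# Minimal SMOTE sketch: generate one synthetic minority sample by
# interpolating between a minority point and a nearby minority neighbour.
import random

def smote_sample(minority, k=2, rng=random.Random(42)):
    x = rng.choice(minority)
    # k nearest minority neighbours of x (squared Euclidean distance)
    neighbours = sorted(
        (p for p in minority if p is not x),
        key=lambda p: sum((a - b) ** 2 for a, b in zip(p, x)),
    )[:k]
    n = rng.choice(neighbours)
    r = rng.random()  # interpolation factor in [0, 1]
    return tuple(a + r * (b - a) for a, b in zip(x, n))

minority = [(1.0, 1.0), (1.2, 0.9), (2.0, 2.1), (0.8, 1.1)]
new_point = smote_sample(minority)
print(new_point)
```

Because the new point is a convex combination of two minority samples, it always lies inside the minority class's bounding region, which is exactly what distinguishes SMOTE from naive duplication of minority examples.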
Framework for data quality in knowledge discovery tasks
The creation and consumption of data continue to grow by leaps and bounds. Due to advances in Information and Communication Technologies (ICT), the data explosion in the digital universe is a new trend. Knowledge Discovery in Databases (KDD) has gained importance due to the abundance of data. A successful knowledge discovery process requires appropriate data preparation: experts affirm that the preprocessing phase takes 50% to 70% of the total time of the knowledge discovery process.
Software tools based on knowledge discovery methodologies offer algorithms for data preprocessing. According to the Gartner 2018 Magic Quadrant for Data Science and Machine Learning Platforms, KNIME, RapidMiner, SAS, Alteryx, and H2O.ai are the leading tools for knowledge discovery. These tools provide different techniques that facilitate the evaluation of datasets; however, they lack any kind of guidance as to which techniques can or should be used in which contexts. Consequently, selecting suitable data cleaning techniques is a headache for inexperienced users, who have no idea which methods can be confidently used and often resort to trial and error.
This thesis presents three contributions to address the aforementioned problems: (i) a conceptual framework that guides the user in addressing data quality issues in knowledge discovery tasks, (ii) a case-based reasoning system that recommends suitable algorithms for data cleaning, and (iii) an ontology that represents knowledge about data quality issues and data cleaning methods. This ontology also supports the case-based reasoning system in case representation and in the reuse phase.
Programa Oficial de Doctorado en Ciencia y Tecnología Informática. Committee: President Fernando Fernández Rebollo; Secretary Gustavo Adolfo Ramírez; Member Juan Pedro Caraça-Valente Hernández.
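The retrieve-and-reuse cycle of the case-based reasoning system can be sketched as follows: describe a new dataset by meta-features, retrieve the most similar past case, and reuse its recommended cleaning algorithm. The meta-features, case base, and algorithm names below are invented for illustration; the thesis's system additionally uses its ontology for case representation and adaptation.

```python
# Hedged sketch of CBR-based cleaning recommendation: retrieve the past
# case whose dataset meta-features are closest to the new dataset's,
# and reuse its recorded cleaning algorithm.

cases = [
    # (missing_ratio, outlier_ratio) -> cleaning algorithm that worked
    ((0.30, 0.02), "knn_imputation"),
    ((0.02, 0.20), "iqr_outlier_removal"),
    ((0.01, 0.01), "no_cleaning_needed"),
]

def recommend(meta):
    # Nearest case by squared Euclidean distance over meta-features
    sim = lambda case: sum((a - b) ** 2 for a, b in zip(case[0], meta))
    return min(cases, key=sim)[1]

suggestion = recommend((0.25, 0.03))  # new dataset: many missing values
print(suggestion)  # knn_imputation
```

After the recommended algorithm is applied and its outcome assessed, the new (meta-features, algorithm) pair can be retained as a fresh case, closing the classic retrieve-reuse-revise-retain loop.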
Predictive analytics applied to Alzheimer's disease: a data visualisation framework for understanding current research and future challenges
Dissertation submitted as a partial requirement for obtaining a master's degree in information management, with a specialisation in Business Intelligence and Knowledge Management.
Big Data is nowadays regarded as a tool for improving the healthcare sector in many areas, such as on its economic side, by searching for operational efficiency gaps, and in personalised treatment, by selecting the best drug for the patient, for instance. Data science can play a key role in identifying diseases at an early stage, or even before there are signs of them, tracking their progress, quickly identifying the efficacy of treatments, and suggesting alternative ones. Therefore, the preventive side of healthcare can be enhanced with the use of state-of-the-art predictive big data analytics and machine learning methods, integrating the available, complex, heterogeneous, yet sparse data from multiple sources, towards better identification of disease and pathology patterns. This can be applied to neurodegenerative disorders, which remain challenging to diagnose; identifying the patterns that trigger those disorders can make it possible to identify more risk factors, or biomarkers, in every human being. With that, we can improve the effectiveness of medical interventions, helping people to stay healthy and active for a longer period. In this work, a review of the state of science on predictive big data analytics is carried out, concerning its application to early diagnosis of Alzheimer's Disease. This is done by searching and summarising the scientific articles published in reputable online sources, bringing together information that is spread across the world wide web, with the goal of enhancing knowledge management and collaboration practices on the topic.
Furthermore, an interactive data visualisation tool to better manage and identify the scientific articles is developed, delivering a holistic visual overview of the developments in the important field of Alzheimer's Disease diagnosis.