    Improving Retrieval Performance of Case Based Reasoning Systems by Fuzzy Clustering

    Case-based reasoning (CBR), a classical reasoning methodology, has been applied in the medical domain, where it has allowed significant progress in resolving problems related to the diagnosis, therapy, and prediction of diseases. However, this methodology still faces some complicated problems, including determining a representation form for the case (given the complexity, uncertainty, and vagueness of medical information), preventing the case base from growing without bound as medical information is generated, and selecting the best retrieval technique. These limitations have pushed researchers to think about other ways of solving problems, and we have recently witnessed the integration of CBR with other techniques such as data mining. In this article, we develop a new approach integrating clustering (Fuzzy C-Means (FCM) and K-Means) in the CBR cycle. Clustering is one of the crucial challenges and has been successfully used in many areas to uncover innate structures and hidden patterns for data grouping [1]. The objective of the proposed approach is to address the limitations of CBR, particularly in the search for similar cases (the retrieval step). The approach is tested with the publicly available immunotherapy dataset. The experimental results show that integrating the FCM algorithm in the retrieval step reduces the search space (the large volume of information), resolves the problem of the vagueness of medical information, speeds up calculation and response time, and increases search efficiency, which further improves the performance of the retrieval step and, consequently, of the CBR system.
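
As a rough illustration of how fuzzy clustering can shrink the search space in a retrieval step, the sketch below runs a small Fuzzy C-Means loop and then compares the query only against cases with enough membership in the query's strongest cluster. This is not the authors' implementation: the function names, the fuzzifier `m=2.0`, the `threshold` value, and the toy case vectors are all assumptions made for the example.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two numeric case vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def memberships(point, centers, m=2.0):
    """Degree of membership of one point in every cluster (values sum to 1)."""
    dists = [distance(point, c) for c in centers]
    if any(d == 0.0 for d in dists):
        # A point lying exactly on a center belongs entirely to that center.
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]


def fuzzy_c_means(points, c, m=2.0, iterations=50, seed=0):
    """Alternate membership and center updates for a fixed number of rounds."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, c)]
    dims = len(points[0])
    for _ in range(iterations):
        u = [memberships(p, centers, m) for p in points]
        for k in range(c):
            weights = [row[k] ** m for row in u]
            total = sum(weights)
            if total > 0.0:
                centers[k] = [
                    sum(w * p[d] for w, p in zip(weights, points)) / total
                    for d in range(dims)
                ]
    return centers


def retrieve(query, cases, centers, m=2.0, threshold=0.3):
    """Return the nearest case and how many cases were actually compared.

    Only cases whose membership in the query's strongest cluster reaches
    `threshold` (a hypothetical cut-off) are searched.
    """
    query_u = memberships(query, centers, m)
    best = query_u.index(max(query_u))
    candidates = [c for c in cases if memberships(c, centers, m)[best] >= threshold]
    candidates = candidates or cases  # fall back to a full search if empty
    nearest = min(candidates, key=lambda c: distance(query, c))
    return nearest, len(candidates)


if __name__ == "__main__":
    toy_cases = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
    toy_centers = fuzzy_c_means(toy_cases, c=2)
    print(retrieve((4.9, 5.0), toy_cases, toy_centers))
```

Because membership is graded rather than binary, a case near a cluster boundary can remain a candidate for more than one cluster, which is one way to read the abstract's point about handling vague information.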

    Enhancing the interactivity of a clinical decision support system by using knowledge engineering and natural language processing

    Mental illness is a serious health problem that affects many people. Increasingly, Clinical Decision Support Systems (CDSS) are being used for diagnosis, and it is important to improve the reliability and performance of these systems. Missing a potential clue or making a wrong diagnosis can have a detrimental effect on the patient's quality of life and could lead to a fatal outcome. The context of this research is the Galatean Risk and Safety Tool (GRiST), a mental-health-risk assessment system. Previous research has shown that the success of a CDSS depends on its ease of use, reliability and interactivity. This research addresses these concerns for GRiST by deploying data mining techniques. Clinical narratives and numerical data have both been analysed for this purpose. Clinical narratives have been processed by natural language processing (NLP) technology to extract knowledge from them. SNOMED-CT was used as a reference ontology, and the performance of the different extraction algorithms has been compared. A new Ensemble Concept Mining (ECM) method has been proposed, which may eliminate the need for domain-specific phrase annotation. Word embedding has been used to filter phrases semantically and to build a semantic representation of each of the GRiST ontology nodes. The Chi-square and FP-growth methods have been used to find relationships between GRiST ontology nodes. Interesting patterns have been found that could be used to provide real-time feedback to clinicians. Information gain has been used effectively to explain the differences between the clinicians and the consensus risk. A new risk management strategy has been explored by analysing repeat assessments. A few novel methods have been proposed to perform automatic background analysis of the patient data and to improve the interactivity and reliability of GRiST and similar systems.
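
The information gain measure mentioned above can be sketched in a few lines. The example below is only an illustration under assumed data: the label values (`"agree"`, `"differ"`) and the feature answers are hypothetical and do not come from GRiST.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on one feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


if __name__ == "__main__":
    # Hypothetical data: does the clinician's risk judgement match the consensus?
    outcome = ["agree", "agree", "differ", "differ"]
    informative_node = ["yes", "yes", "no", "no"]
    uninformative_node = ["yes", "no", "yes", "no"]
    print(information_gain(informative_node, outcome))
    print(information_gain(uninformative_node, outcome))
```

A feature with high gain separates the two outcomes well, so ranking features by gain is one simple way to see which answers are most associated with a clinician departing from the consensus.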


    Customer churn can be described as the process by which consumers of goods and services discontinue the consumption of a product or service and switch over to a competitor. It is of great concern to many companies. Thus, decision support systems are needed to overcome this pressing issue and ensure a good return on investment for organizations. Decision support systems use analytical models to provide the intelligence needed to analyze an integrated customer record database, predict which customers will churn, and offer recommendations that will prevent them from churning. Customer churn prediction, unlike most conventional business intelligence techniques, deals with customer demographics, net worth value, and market opportunities. It is used to determine which customers are likely to churn and which are likely to remain loyal to the organization, and to predict future churn rates. Customer defection is naturally a slow-rate event, and it is not easily detected by most business intelligence solutions available in the market, especially when data is skewed, large, and distinct. Thus, accurate and precise prediction methods are needed to detect the churning trend. In this study, a churn model is proposed that applies business intelligence techniques to detect the possibility that a customer will churn, using churn trend analysis of customer records. The model applies clustering algorithms and enhanced SPRINT decision tree algorithms to explore the customer record database and identify customer profiles and behavior patterns. The model then predicts the possibility that a customer will churn. Additionally, it offers solutions for retaining customers and making them loyal to a business entity by recommending customer-relationship management measures.

    Rule-based preprocessing for data stream mining using complex event processing

    Data preprocessing is known to be essential to produce accurate data from which mining methods are able to extract valuable knowledge. When data constantly arrives from one or more sources, preprocessing techniques need to be adapted to efficiently handle these data streams. To help domain experts define and execute preprocessing tasks for data streams, this paper proposes the use of active rule-based systems and, more specifically, complex event processing (CEP) languages and engines. The main contribution of our approach is the formulation of preprocessing procedures as event detection rules, expressed in an SQL-like language, that give domain experts a simple way to manipulate temporal data. This idea is materialized into a publicly available solution that integrates a CEP engine with a library for online data mining. To evaluate our approach, we present three practical scenarios in which CEP rules preprocess data streams with the aim of adding temporal information, transforming features and handling missing values. Experiments show that CEP rules provide an effective language to express preprocessing tasks in a modular and high-level manner, without significant time and memory overheads. The resulting data streams not only help improve the predictive accuracy of classification algorithms, but in some cases also reduce the complexity of the decision models and the time needed for learning.
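
To make the idea of rule-like stream preprocessing concrete, the sketch below chains two small steps over a stream of events: one fills a missing value with the last observed one, the other adds a temporal feature. It uses plain Python generators as a stand-in, not a CEP engine or the paper's SQL-like rule language; the event layout (`timestamp`, `temp`) and the function names are assumptions made for the example.

```python
from datetime import datetime, timezone


def fill_missing(events, field):
    """Replace a missing reading with the last value seen on the stream."""
    last_seen = None
    for event in events:
        event = dict(event)  # copy so the caller's event is not mutated
        if event.get(field) is None:
            event[field] = last_seen  # stays None until a first value arrives
        else:
            last_seen = event[field]
        yield event


def add_hour(events):
    """Add the UTC hour of day derived from a Unix timestamp in seconds."""
    for event in events:
        event = dict(event)
        moment = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
        event["hour"] = moment.hour
        yield event


if __name__ == "__main__":
    stream = [
        {"timestamp": 0, "temp": 20.0},
        {"timestamp": 3600, "temp": None},
        {"timestamp": 7200, "temp": 22.0},
    ]
    for processed in add_hour(fill_missing(stream, "temp")):
        print(processed)
```

Each step consumes and yields one event at a time, so steps can be composed or reordered independently, which mirrors the modularity the abstract attributes to CEP rules.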

    A Model for Intrusion Detection in IoT using Machine Learning

    The Internet of Things is an open and comprehensive global network of intelligent objects that have the capacity to auto-organize and to share information, data and resources. There are currently over a billion devices connected to the Internet, and this number increases by the day. While these devices make our lives easier, safer and healthier, they are expanding the number of attack targets vulnerable to cyber-attacks from potential hackers and malicious software. Therefore, protecting these devices from adversaries and from unauthorized access and modification is very important. The purpose of this study is to develop a secure, lightweight intrusion and anomaly detection model for IoT to help detect threats in the environment. We propose the use of data mining and machine learning algorithms as a classification technique for detecting abnormal or malicious traffic transmitted between devices due to potential attacks such as DoS, Man-in-the-Middle and flooding attacks at the application level. This study makes use of two robust machine learning algorithms, namely C4.5 decision trees and K-means clustering, to develop an anomaly detection model. MATLAB Math Simulator was used for implementation. The study conducts a series of experiments in detecting abnormal and normal data in a dataset that contains gas concentration readings from a number of sensors deployed in an Italian city over a year. Thereafter, we examined the classification performance of our proposed anomaly detection model in terms of accuracy. Results drawn from the experiments indicate that increasing the size of the training sample improves the classification ability of the proposed model. We also found that the choice of discretization algorithm matters in the quest for optimal classification performance. The proposed model proved accurate in detecting anomalies in IoT and in distinguishing between normal and abnormal data. The proposed model has a classification accuracy of 96.51%, which proved to be higher than that of other algorithms such as Naïve Bayes. The model proved to be lightweight and efficient in terms of being faster at training and testing than Artificial Neural Networks. The conclusions drawn from this research are a perspective from a novice machine learning researcher, with valuable recommendations that ensure optimal classification of normal and abnormal IoT data.
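
The remark that the choice of discretization algorithm matters can be illustrated with two common schemes. The abstract does not say which discretization methods were compared, so equal-width and equal-frequency binning are assumptions chosen for the example, and the sensor readings are made up.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of identical width."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - low) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """Assign bins by rank so each bin holds roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * n_bins // len(values)
    return bins


if __name__ == "__main__":
    # Hypothetical gas readings with one extreme value.
    readings = [1.0, 2.0, 3.0, 4.0, 100.0]
    print(equal_width_bins(readings, 2))
    print(equal_frequency_bins(readings, 2))
```

With a single extreme reading, equal-width binning puts almost every value in the same bin, while equal-frequency binning spreads them out. A classifier trained on the two encodings sees quite different inputs, which is one plausible reason the discretization choice affects accuracy.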

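    Mehitamata õhusõiduki rakendamine põllukultuuride saagikuse ja maa harimisviiside tuvastamisel (Application of an unmanned aerial vehicle for identifying crop yield and land cultivation methods)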

    A thesis for applying for the degree of Doctor of Philosophy in Environmental Protection. This thesis examines how machine learning (ML) technologies have aided significant advances in image analysis in the area of precision agriculture (PA). These multimodal computing technologies extend the use of machine learning to a broader spectrum of data collection and selection for the advancement of agricultural practices (Nawar et al., 2017). These techniques will support more informed decisions in complicated cropping systems with less human intervention, and provide a scalable framework for incorporating expert knowledge of the PA system (Chlingaryan et al., 2018). Complexity, on the other hand, can be a disadvantage in crop trials: machine learning models require training/testing databases, and limited areas with insignificant sampling sizes, time and space specificity, and environmental factor interventions all complicate parameter selection and make using a single empirical model for an entire region impractical. During the early stages of this thesis, we used relatively traditional machine learning methods (i.e., random forest regression (RFR), support vector regression (SVR), and artificial neural network (ANN)) to address the regression problem of crop yield and biomass prediction, predicting dry matter (DM) yields of red clover. These methods obtained favourable results; however, the choice of hyperparameters, the lengthy algorithm selection process, data cleaning, and redundant collinearity issues significantly limited the application of machine learning. 
We further discuss the recent trend of automated machine learning (AutoML), which has been driving significant technological innovation in the application of artificial intelligence through automated algorithm selection and hyperparameter optimization of a deployable pipeline model for solving core problems. However, a knowledge gap exists in the integration of machine learning (ML) technology with unmanned aerial systems (UAS) and hyperspectral-based imaging data categorization and regression applications. In this thesis, we explored a state-of-the-art (SOTA) and entirely open-source AutoML framework, Auto-sklearn, which was built on one of the most frequently used machine learning systems, Scikit-learn. It was integrated with two unique AutoML visualization tools to examine the recognition and acceptance of multispectral vegetation indices (VI) data collected from UAS and hyperspectral narrow-band VIs across a varied spectrum of agricultural management practices (AMP). These practices comprise soil tillage method (STM), cultivation method (CM), and manure application (MA), applied on fields with four crop combinations (i.e., red clover-grass mixture, spring wheat, pea-oat mixture, and spring barley). Additionally, they have not been thoroughly evaluated and lack characteristics that are accessible in agricultural remote sensing applications. This thesis further explores the existing gaps in the knowledge base for several critical crop categories and cultivation management methods with regard to biomass and yield analysis, and seeks a better understanding of the potential of remotely sensed solutions on field-based and multifunctional platforms to meet precision agriculture demands. To overcome these knowledge gaps, this research introduces a rapid, non-destructive, and low-cost framework for field-based biomass and grain yield modelling, as well as for the identification of agricultural management practices. 
The results may aid agronomists and farmers in establishing more accurate agricultural methods and in monitoring environmental conditions more effectively. Publication of this thesis is supported by the Estonian University of Life Sciences and by the Doctoral School of Earth Sciences and Ecology created under the auspices of the European Social Fund.

    Advances in Artificial Intelligence: Models, Optimization, and Machine Learning

    The present book contains all the articles accepted and published in the Special Issue “Advances in Artificial Intelligence: Models, Optimization, and Machine Learning” of the MDPI Mathematics journal, which covers a wide range of topics connected to the theory and applications of artificial intelligence and its subfields. These topics include, among others, deep learning and classic machine learning algorithms, neural modelling, architectures and learning algorithms, biologically inspired optimization algorithms, algorithms for autonomous driving, probabilistic models and Bayesian reasoning, intelligent agents and multiagent systems. We hope that the scientific results presented in this book will serve as valuable sources of documentation and inspiration for anyone willing to pursue research in artificial intelligence, machine learning and their widespread applications.

    Global gene expression profiling of healthy human brain and its application in studying neurological disorders

    The human brain is the most complex structure known to mankind, and one of the greatest challenges in modern biology is to understand how it is built and organized. The power of the brain arises from its variety of cells and structures, and ultimately from where and when different genes are switched on and off throughout the brain tissue. In other words, brain function depends on the precise regulation of gene expression in its sub-anatomical structures. However, our understanding of the complexity and dynamics of the transcriptome of the human brain is still incomplete. To fill this need, we designed a gene expression model that accurately defines the consistent blueprint of the brain transcriptome, thereby identifying the core brain-specific transcriptional processes conserved across individuals. Functionally characterizing this model would provide profound insights into the transcriptional landscape, biological pathways and the expression distribution of neurotransmitter systems. In this dissertation, we developed an expression model by capturing the similarly expressed gene patterns across congruently annotated brain structures in six individual brains, using data from the Allen Brain Atlas (ABA). We found that 84% of genes are expressed in at least one of the 190 brain structures. By employing hierarchical clustering, we were able to show that distinct structures of a bigger brain region can cluster together while still retaining their expression identity. Further, weighted correlation network analysis identified 19 robust modules of coexpressing genes in the brain that demonstrated a wide range of functional associations. Since signatures of local phenomena can be masked by larger signatures, we performed local analysis on each distinct brain structure. Pathway and gene ontology enrichment analysis on these structures showed striking enrichment for brain-region-specific processes. 
We also mapped the structural distribution of the gene expression profiles of genes associated with major neurotransmission systems in humans. We also postulated the utility of healthy brain tissue gene expression to predict potential genes involved in a neurological disorder, in the absence of data from diseased tissues. To this end, we developed a supervised classification model, which achieved an accuracy of 84% and an AUC (Area Under the Curve) of 0.81 from ROC plots, for predicting autism-implicated genes using the healthy expression model as the baseline. This study represents the first use of healthy brain gene expression to predict the range of genes implicated in autism, and this generic methodology can be applied to predict genes involved in other neurological disorders.
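
For readers unfamiliar with the two metrics reported above, the sketch below computes accuracy and the area under the ROC curve for a binary classifier. It uses the pairwise-ranking definition of AUC (the probability that a randomly chosen positive scores higher than a randomly chosen negative) and is not the dissertation's evaluation code; the labels and scores are made up.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def roc_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical gene labels (1 = implicated) and classifier scores.
    truth = [1, 1, 0, 0]
    scores = [0.9, 0.4, 0.5, 0.1]
    predicted = [1 if s >= 0.5 else 0 for s in scores]
    print(accuracy(truth, predicted), roc_auc(truth, scores))
```

Accuracy depends on the chosen decision threshold, whereas AUC summarizes ranking quality over all thresholds, which is why the two numbers are reported separately.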

    J Biomed Inform

    We followed a systematic approach based on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses to identify existing clinical natural language processing (NLP) systems that generate structured information from unstructured free text. Seven literature databases were searched with a query combining the concepts of natural language processing and structured data capture. Two reviewers screened all records for relevance during two screening phases, and information about clinical NLP systems was collected from the final set of papers. A total of 7149 records (after removing duplicates) were retrieved and screened, and 86 were determined to fit the review criteria. These papers contained information about 71 different clinical NLP systems, which were then analyzed. The NLP systems address a wide variety of important clinical and research tasks. Certain tasks are well addressed by the existing systems, while others remain as open challenges that only a small number of systems attempt, such as extraction of temporal information or normalization of concepts to standard terminologies. This review has identified many NLP systems capable of processing clinical free text and generating structured output, and the information collected and evaluated here will be important for prioritizing development of new approaches for clinical NLP.
