42 research outputs found

    Granular Support Vector Machines Based on Granular Computing, Soft Computing and Statistical Learning

    With the emergence of biomedical informatics, Web intelligence, and E-business, new challenges are arising for knowledge discovery and data mining modeling problems. In this dissertation work, a framework named Granular Support Vector Machines (GSVM) is proposed to systematically and formally combine statistical learning theory, granular computing theory and soft computing theory to address challenging predictive data modeling problems effectively and/or efficiently, with a specific focus on binary classification problems. In general, GSVM works in 3 steps. Step 1 is granulation, which builds a sequence of information granules from the original dataset or from the original feature space. Step 2 is modeling Support Vector Machines (SVM) in some of these information granules when necessary. Finally, step 3 is aggregation, which consolidates the information in these granules at a suitable level of abstraction. A good granulation method that finds suitable granules is crucial for modeling a good GSVM. Under this framework, many different granulation algorithms, including the GSVM-CMW (cumulative margin width) algorithm, the GSVM-AR (association rule mining) algorithm, a family of GSVM-RFE (recursive feature elimination) algorithms, the GSVM-DC (data cleaning) algorithm and the GSVM-RU (repetitive undersampling) algorithm, are designed for binary classification problems with different characteristics. The empirical studies in the biomedical domain and many other application domains demonstrate that the framework is promising. As a preliminary step, this dissertation work will be extended in the future to build a Granular Computing based Predictive Data Modeling framework (GrC-PDM) with which we can create hybrid adaptive intelligent data mining systems for high quality prediction
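
    The three steps named in this abstract (granulation, local modeling where necessary, aggregation) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the equal-width binning on the first feature, the function names, and above all the nearest-centroid classifier, which stands in for the SVM so that the example needs only the standard library. It is not any of the GSVM algorithms listed above.

```python
from statistics import mean

def granulate(points, labels, n_granules=2):
    """Step 1 (granulation): equal-width bins on feature 0 (an assumed rule)."""
    lo = min(p[0] for p in points)
    hi = max(p[0] for p in points)
    width = (hi - lo) / n_granules or 1.0  # avoid a zero width

    def granule_of(p):
        idx = int((p[0] - lo) / width)
        return min(max(idx, 0), n_granules - 1)  # clamp out-of-range points

    granules = {g: [] for g in range(n_granules)}
    for p, y in zip(points, labels):
        granules[granule_of(p)].append((p, y))
    return granules, granule_of

def fit_local(samples):
    """Step 2 (modeling): a pure granule needs no model; a mixed granule
    gets a nearest-centroid classifier (stand-in for an SVM)."""
    classes = sorted({y for _, y in samples})
    if len(classes) == 1:
        only = classes[0]
        return lambda p: only
    centroids = {
        c: [mean(col) for col in zip(*(p for p, y in samples if y == c))]
        for c in classes
    }

    def predict(p):
        return min(
            classes,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
        )

    return predict

def fit_gsvm_sketch(points, labels, n_granules=2):
    """Step 3 (aggregation): route each query to the model of its granule."""
    granules, granule_of = granulate(points, labels, n_granules)
    fallback = fit_local(list(zip(points, labels)))  # used for empty granules
    models = {g: fit_local(s) if s else fallback for g, s in granules.items()}
    return lambda p: models[granule_of(p)](p)

# Toy data: the left granule is pure (class 0), the right one is mixed.
points = [(0.0, 0.0), (1.0, 0.5), (2.0, 0.2), (6.0, 0.0),
          (7.0, 0.1), (8.0, 5.0), (10.0, 5.5)]
labels = [0, 0, 0, 0, 0, 1, 1]
predict = fit_gsvm_sketch(points, labels)
print(predict((1.5, 9.0)), predict((9.0, 4.0)), predict((6.0, 1.0)))
```

    In this toy setup only the mixed granule receives a fitted model, which mirrors the "when necessary" wording of step 2.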

    Stable Feature Selection for Biomarker Discovery

    Feature selection techniques have long been the workhorse of biomarker discovery applications. Surprisingly, the stability of feature selection with respect to sampling variations has long been under-considered. Only recently has this issue received growing attention. In this article, we review existing stable feature selection methods for biomarker discovery using a generic hierarchical framework. We have two objectives: (1) providing an overview of this new yet fast-growing topic for convenient reference; (2) categorizing existing methods under an expandable framework for future research and development
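
    One way to make "stability with respect to sampling variations" concrete is to rerun a selector on several subsamples and measure how much the selected feature sets overlap. The sketch below is an assumption-laden illustration, not a method from the article: it uses the average pairwise Jaccard index as the stability score, a simple class-mean-gap ranking as the selector, and synthetic data in which only the first two features are informative.

```python
import random
from itertools import combinations
from statistics import mean

def jaccard(a, b):
    """Overlap of two feature sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def selection_stability(subsets):
    """Average pairwise Jaccard index over all selected feature sets."""
    return mean(jaccard(a, b) for a, b in combinations(subsets, 2))

def top_k_by_mean_gap(rows, labels, k):
    """Toy selector: rank features by |mean(class 1) - mean(class 0)|."""
    def gap(j):
        pos = [r[j] for r, y in zip(rows, labels) if y == 1]
        neg = [r[j] for r, y in zip(rows, labels) if y == 0]
        return abs(mean(pos) - mean(neg))
    return sorted(range(len(rows[0])), key=gap, reverse=True)[:k]

def stratified_subsample(rows, labels, frac, rng):
    """Draw the same fraction from each class so both classes stay present."""
    idx = []
    for c in set(labels):
        members = [i for i, y in enumerate(labels) if y == c]
        idx += rng.sample(members, max(1, int(frac * len(members))))
    return [rows[i] for i in idx], [labels[i] for i in idx]

# Synthetic data (assumed): features 0 and 1 shift with the label, 2-5 are noise.
rng = random.Random(0)
labels = [i % 2 for i in range(40)]
rows = [[rng.gauss(3.0 * y if j < 2 else 0.0, 1.0) for j in range(6)]
        for y in labels]

selected = []
for _ in range(10):
    sub_rows, sub_labels = stratified_subsample(rows, labels, 0.8, rng)
    selected.append(top_k_by_mean_gap(sub_rows, sub_labels, k=2))

stability = selection_stability(selected)
print(f"stability (mean pairwise Jaccard): {stability:.2f}")
```

    A score near 1 means the selector returns nearly the same features regardless of which samples it sees; a score near 0 means the selection is driven mostly by sampling noise.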

    Interpretability-oriented data-driven modelling of bladder cancer via computational intelligence

    Breast cancer disease classification using fuzzy-ID3 algorithm based on association function

    Breast cancer is the second leading cause of mortality among female cancer patients worldwide. Early detection of breast cancer is considered one of the most effective ways to prevent the disease from spreading and to support correct decisions about subsequent steps. Automatic diagnostic methods are frequently used to conduct breast cancer diagnoses in order to increase the accuracy and speed of detection. The fuzzy-ID3 algorithm with association function implementation (FID3-AF) is proposed as a classification technique for breast cancer detection. The FID3-AF algorithm is a hybridisation of the fuzzy system, the iterative dichotomizer 3 (ID3) algorithm, and the association function. The fuzzy-neural dynamic-bottleneck-detection (FUZZYDBD) method, an automatic fuzzy database definition method, aids in developing the fuzzy database for the data fuzzification process in FID3-AF. FID3-AF overcomes ID3's inability to handle continuous data. The association function is implemented to minimise overfitting and enhance generalisation ability. The results indicated that FID3-AF is robust in breast cancer classification. A thorough comparison of FID3-AF with numerous existing methods was conducted to validate the proposed method's competency. This study established that FID3-AF performed well and outperformed other methods in breast cancer classification
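
    Fuzzification is what lets a tree built on discrete attributes work with continuous measurements: each continuous value is mapped to degrees of membership in a few linguistic terms. The sketch below shows only that generic step with triangular membership functions. The term names and breakpoints are hypothetical and hand-picked; the abstract states that FUZZYDBD defines the fuzzy database automatically, and that procedure is not reproduced here.

```python
def triangular(x, left, peak, right):
    """Triangular membership: 0 outside (left, right), 1 at the peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

# Hypothetical fuzzy terms for a feature scaled to the range 0..10.
TERMS = {
    "low": (-5.0, 0.0, 5.0),
    "medium": (0.0, 5.0, 10.0),
    "high": (5.0, 10.0, 15.0),
}

def fuzzify(x, terms=TERMS):
    """Map one continuous value to its membership degree in every term."""
    return {name: triangular(x, *params) for name, params in terms.items()}

print(fuzzify(2.5))  # partly "low", partly "medium"
print(fuzzify(5.0))  # fully "medium"
```

    A fuzzy decision tree can then branch on the terms instead of on raw thresholds, with each sample following several branches in proportion to its membership degrees.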

    PREDICTION OF RECURRENCE AND MORTALITY OF ORAL TONGUE CANCER USING ARTIFICIAL NEURAL NETWORK (A case study of 5 hospitals in Finland and 1 hospital from Sao Paulo, Brazil)

    Cancer is a dreadful disease that has caused the death of millions of people. It is characterized by an uncontrollable growth of cells that form lumps or masses of tissue known as tumours. It is therefore a concern to everyone, as these tumours mostly release hormones that have a negative impact on the body. Data mining approaches, statistical methods and machine learning algorithms have been proposed for effective cancer data classification. Artificial Neural Networks (ANN) have been used in this thesis to predict the recurrence and mortality of oral tongue cancer in patients. ANN was also used to examine the diagnostic and prognostic factors, with the aim of determining which of these factors influence the prediction of recurrence and mortality of oral tongue cancer in patients. Three different ANNs were applied for the learning and testing phases in order to find the most effective technique: Elman, Feedforward, and Layer Recurrent neural networks. The Elman neural network was not able to make an acceptable prediction of the recurrence or the mortality of tongue cancer based on the data. In contrast, the Feedforward neural network captured the relationship between the prognostic factors and correctly predicted recurrence. However, it failed to predict mortality based on the patients' data. The Layer Recurrent neural network was very effective and successfully predicted both the recurrence and the mortality of oral tongue cancer in patients. The constructed Layer Recurrent neural network was used to investigate the correlation between the prognostic factors. It was found that, out of 11 prognostic factors in the data sheet, only 5 had a considerable impact on recurrence and mortality. These are grade, depth, budding, modified stage, and gender. 
Time in months and disease free months were also used to train the network.

    Risk prediction analysis for post-surgical complications in cardiothoracic surgery

    Cardiothoracic surgery patients are at risk of developing surgical site infections (SSIs), which cause hospital readmissions, increase healthcare costs and may lead to mortality. The first 30 days after hospital discharge are crucial for preventing these kinds of infections. As an alternative to a hospital-based diagnosis, an automatic digital monitoring system can help with the early detection of SSIs by analyzing daily images of patients' wounds. However, analyzing a wound automatically is one of the biggest challenges in medical image analysis. The proposed system is integrated into a research project called CardioFollow.AI, which developed a digital telemonitoring service to follow up the recovery of cardiothoracic surgery patients. The present work aims to tackle the problem of SSIs by predicting the existence of worrying alterations in wound images taken by patients, with the help of machine learning and deep learning algorithms. The developed system is divided into a segmentation model, which detects the wound region and categorizes the wound type, and a classification model, which predicts the occurrence of alterations in the wounds. The dataset consists of 1337 images of chest wounds (WC), drainage wounds (WD) and leg wounds (WL) from 34 cardiothoracic surgery patients. For segmenting the images, an architecture with a Mobilenet encoder and a Unet decoder was used to obtain the regions of interest (ROIs) and assign the wound class. The classification model was divided into three sub-classifiers, one for each wound type, in order to improve the model's performance. Color and textural features were extracted from the wounds' ROIs to feed one of three machine learning classifiers (random forest, support vector machine and K-nearest neighbors), which predicts the final output. The segmentation model achieved a final mean IoU of 89.9%, a dice coefficient of 94.6% and a mean average precision of 90.1%, showing good results. 
As for the algorithms that performed classification, the WL classifier exhibited the best results with a 87.6% recall and 52.6% precision, while WC classifier achieved a 71.4% recall and 36.0% precision. The WD had the worst performance with a 68.4% recall and 33.2% precision. The obtained results demonstrate the feasibility of this solution, which can be a start for preventing SSIs through image analysis with artificial intelligence.
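
    The metrics quoted in this abstract (IoU and Dice coefficient for segmentation, precision and recall for classification) have standard definitions, sketched below on tiny hand-made inputs. The masks and label lists are invented for illustration and have no connection to the study's data; the function names are likewise assumptions.

```python
def iou_and_dice(pred_mask, true_mask):
    """IoU = |P & T| / |P | T| and Dice = 2|P & T| / (|P| + |T|) for binary masks."""
    pred = {(r, c) for r, row in enumerate(pred_mask)
            for c, v in enumerate(row) if v}
    true = {(r, c) for r, row in enumerate(true_mask)
            for c, v in enumerate(row) if v}
    inter = len(pred & true)
    union = len(pred | true)
    total = len(pred) + len(true)
    iou = inter / union if union else 1.0    # two empty masks agree perfectly
    dice = 2 * inter / total if total else 1.0
    return iou, dice

def precision_recall(pred_labels, true_labels):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN); 1 is the positive class."""
    pairs = list(zip(pred_labels, true_labels))
    tp = sum(1 for p, t in pairs if p == 1 and t == 1)
    fp = sum(1 for p, t in pairs if p == 1 and t == 0)
    fn = sum(1 for p, t in pairs if p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented 3x3 masks: 3 overlapping pixels, 5 pixels in the union.
pred_mask = [[0, 1, 1],
             [0, 1, 1],
             [0, 0, 0]]
true_mask = [[0, 0, 1],
             [0, 1, 1],
             [0, 0, 1]]
iou, dice = iou_and_dice(pred_mask, true_mask)

# Invented labels: 2 true positives, 2 false positives, 1 false negative.
precision, recall = precision_recall([1, 1, 1, 0, 1, 0], [1, 0, 1, 0, 0, 1])
print(iou, dice, precision, recall)
```

    For a single pair of masks the two overlap scores are tied by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU. On the classification side, a recall well above the precision, as in the figures reported above, means that most true alterations are flagged at the cost of many false alarms.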

    Protein Tertiary Model Assessment Using Granular Machine Learning Techniques

    The automatic prediction of protein three-dimensional structures from their amino acid sequences has become one of the most important and most researched fields in bioinformatics. Because models are predictions rather than experimental structures determined with known accuracy, it is vital to estimate model quality. We attempt to solve this problem using machine learning techniques and information from both the sequence and the structure of the protein. The goal is to build a machine that learns structures from the PDB (Protein Data Bank) and, when given a new model, predicts whether it belongs to the same class as the PDB structures (correct or incorrect protein models). Different subsets of the PDB are considered for evaluating the prediction potential of the machine learning methods. Here we show two such machines, one using SVM (support vector machines) and another using fuzzy decision trees (FDT). First, using a preliminary encoding style, SVM reached around 70% accuracy in protein model quality assessment, and an improved Fuzzy Decision Tree (IFDT) reached above 80% accuracy. To reduce computational overhead, a multiprocessor environment and a basic feature selection method are used in the SVM-based machine learning algorithm. Next, an enhanced scheme is introduced using a new encoding style. In the new style, information such as the amino acid substitution matrix, polarity, secondary structure information and the relative distance between alpha carbon atoms is collected by spatially traversing the 3D structure to form training vectors. This guarantees that the properties of alpha carbon atoms that are close together in 3D space, and thus interacting, are used in vector formation. With the fuzzy decision tree, we obtained a training accuracy of around 90%. Prediction accuracy and execution time both improve significantly compared to the previous encoding technique. 
This outcome motivates further exploration of effective machine learning algorithms for accurate protein model quality assessment. Finally, these machines are tested using CASP8 and CASP9 templates and compared with other CASP competitors, with promising results. We further discuss the importance of model quality assessment and other protein information that could be considered for it

    Streaming Feature Grouping and Selection (SFGS) For Big Data Classification

    Real-time data has always been an essential element for organizations when the quickness of data delivery is critical to their businesses. Today, organizations understand the importance of real-time data analysis to maintain benefits from their generated data. Real-time data analysis is also known as real-time analytics, streaming analytics, real-time streaming analytics, and event processing. Stream processing is the key to getting results in real-time. It allows us to process the data stream in real-time as it arrives. The concept of streaming data means the data are generated dynamically, and the full stream is unknown or even infinite. This data becomes massive and diverse and forms what is known as a big data challenge. In machine learning, streaming feature selection has always been a preferred method in the preprocessing of streaming data. Recently, feature grouping, which can measure the hidden information between selected features, has begun gaining attention. This dissertation’s main contribution is in solving the issue of the extremely high dimensionality of streaming big data by delivering a streaming feature grouping and selection algorithm. Also, the literature review presents a comprehensive review of the current streaming feature selection approaches and highlights the state-of-the-art algorithms trending in this area. The proposed algorithm is designed with the idea of grouping together similar features to reduce redundancy and handle the stream of features in an online fashion. This algorithm has been implemented and evaluated using benchmark datasets against state-of-the-art streaming feature selection algorithms and feature grouping techniques. The results showed better performance regarding prediction accuracy than with state-of-the-art algorithms
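
    The idea described in this abstract, handling features one at a time as they arrive and grouping similar ones to reduce redundancy, can be sketched as follows. This is not the dissertation's algorithm: the class name, the use of absolute Pearson correlation for both relevance and similarity, the two thresholds and the toy data are all assumptions chosen to keep the example small.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx and vy else 0.0

class StreamingFeatureGrouper:
    """Hypothetical online grouper: one pass, one decision per arriving feature."""

    def __init__(self, target, redundancy_threshold=0.9, relevance_threshold=0.1):
        self.target = target
        self.redundancy_threshold = redundancy_threshold
        self.relevance_threshold = relevance_threshold
        self.groups = []  # each group keeps its members and one representative

    def add_feature(self, name, values):
        score = abs(pearson(values, self.target))
        if score < self.relevance_threshold:
            return  # irrelevant to the target: discard immediately
        for group in self.groups:
            similarity = abs(pearson(values, group["rep_values"]))
            if similarity >= self.redundancy_threshold:
                group["members"].append(name)  # redundant: join this group
                if score > group["rep_score"]:
                    group.update(rep=name, rep_values=values, rep_score=score)
                return
        self.groups.append({"members": [name], "rep": name,
                            "rep_values": values, "rep_score": score})

    def selected(self):
        """One representative per group is the current selected subset."""
        return [group["rep"] for group in self.groups]

target = [0, 0, 0, 1, 1, 1]
grouper = StreamingFeatureGrouper(target)
stream = [
    ("f1", [1.0, 1.2, 0.9, 3.0, 3.1, 2.9]),  # relevant
    ("f2", [1.0, 1.5, 0.5, 2.8, 3.5, 2.6]),  # noisier near-copy of f1
    ("f3", [5.0, 1.0, 3.0, 3.0, 1.0, 5.0]),  # uncorrelated with the target
    ("f4", [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]),  # weakly relevant, not redundant
]
for name, values in stream:
    grouper.add_feature(name, values)
print(grouper.selected())
```

    Because each feature is compared only with the current group representatives, the work per arriving feature grows with the number of groups rather than with the total number of features seen so far.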