
    ACO-based feature selection algorithm for classification

    A dataset with a small number of records but a large number of attributes represents a phenomenon known as the “curse of dimensionality”. The classification of this type of dataset requires Feature Selection (FS) methods to extract useful information. The modified graph clustering ant colony optimisation (MGCACO) algorithm is an effective FS method that was developed by grouping highly correlated features. However, the MGCACO algorithm has three main drawbacks in producing a feature subset: its clustering method, its parameter sensitivity, and its final subset determination. An enhanced graph clustering ant colony optimisation (EGCACO) algorithm is proposed to solve these three (3) problems. The proposed improvements include: (i) an ACO feature clustering method to obtain clusters of highly correlated features; (ii) an adaptive selection technique for subset construction from the clusters of features; and (iii) a genetic-based method for producing the final subset of features. The ACO feature clustering method exploits mechanisms such as intensification and diversification for local and global optimisation to provide clusters of highly correlated features. The adaptive selection technique enables the parameter to change adaptively based on feedback from the search space. The genetic method determines the final subset automatically, based on crossover and subset quality calculation. The performance of the proposed algorithm was evaluated on 18 benchmark datasets from the University of California, Irvine (UCI) repository and nine (9) deoxyribonucleic acid (DNA) microarray datasets against 15 benchmark metaheuristic algorithms.
The experimental results of the EGCACO algorithm on the UCI datasets are superior to the other benchmark optimisation algorithms in terms of the number of selected features for 16 of the 18 UCI datasets (88.89%), and the algorithm achieves the best classification accuracy on eight (8) of the datasets (44.44%). Further, experiments on the nine (9) DNA microarray datasets showed that the EGCACO algorithm is superior to the benchmark algorithms in terms of classification accuracy (first rank) for seven (7) datasets (77.78%) and yields the lowest number of selected features on six (6) datasets (66.67%). The proposed EGCACO algorithm can be utilised for FS in DNA microarray classification tasks that involve large datasets in various application domains.
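The interplay of pheromone-guided subset construction and best-subset reinforcement described in the abstract can be sketched generically. The following is a minimal, illustrative ACO subset-selection loop, not the EGCACO algorithm itself: the quality function, parameter values, and the toy "informative features" setup are hypothetical stand-ins, and the clustering, adaptive-selection, and genetic stages are omitted.

```python
import random

def aco_feature_selection(n_features, subset_size, quality,
                          n_ants=10, n_iters=30, evaporation=0.1, seed=0):
    """Minimal generic ACO feature-subset search (illustrative sketch)."""
    rng = random.Random(seed)
    pheromone = [1.0] * n_features
    best_subset, best_q = None, float("-inf")
    for _ in range(n_iters):
        for _ in range(n_ants):
            # each ant builds a subset via roulette-wheel sampling
            # (without replacement), weighted by pheromone
            candidates = list(range(n_features))
            weights = [pheromone[f] for f in candidates]
            subset = []
            for _ in range(subset_size):
                r = rng.uniform(0, sum(weights))
                acc = 0.0
                for i, w in enumerate(weights):
                    acc += w
                    if acc >= r:
                        subset.append(candidates.pop(i))
                        weights.pop(i)
                        break
            q = quality(subset)
            if q > best_q:
                best_subset, best_q = subset, q
        # evaporate pheromone, then reinforce the best-so-far subset
        pheromone = [(1 - evaporation) * p for p in pheromone]
        for f in best_subset:
            pheromone[f] += best_q
    return sorted(best_subset), best_q

# toy setting: 20 features, of which features 0-2 are "informative"
informative = {0, 1, 2}
subset, q = aco_feature_selection(20, 3, lambda s: len(informative & set(s)))
```

The evaporation step keeps early random choices from dominating, while reinforcement biases later ants toward features that appeared in good subsets.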

    Monte Carlo Method with Heuristic Adjustment for Irregularly Shaped Food Product Volume Measurement

    Volume measurement plays an important role in the production and processing of food products. Various methods based on 3D reconstruction have been proposed to measure the volume of food products with irregular shapes. However, 3D reconstruction incurs a high computational cost, and some of the volume measurement methods based on it have low accuracy. Another way to measure the volume of an object is the Monte Carlo method, which performs volume measurement using random points: it only requires information on whether random points fall inside or outside the object and does not require a 3D reconstruction. This paper proposes volume measurement using a computer vision system for irregularly shaped food products, without 3D reconstruction, based on the Monte Carlo method with heuristic adjustment. Five images of each food product were captured using five cameras and processed to produce binary images. Monte Carlo integration with heuristic adjustment was then performed to measure the volume based on the information extracted from the binary images. The experimental results show that the proposed method provides high accuracy and precision compared to the water displacement method. In addition, the proposed method is more accurate and faster than the space carving method.
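The core idea, estimating a volume purely from an inside/outside test on random points, can be shown with plain Monte Carlo integration. This is a generic sketch of that principle only: the paper's camera setup, binary-image inside test, and heuristic adjustment are not reproduced, and the sphere inside-test below is a stand-in for the test derived from the five binary images.

```python
import random

def monte_carlo_volume(inside, bounds, n_points=100_000, seed=42):
    """Estimate an object's volume by uniform sampling in a bounding box.

    `inside(x, y, z)` plays the role of the inside/outside test that the
    paper extracts from binary camera images.
    """
    rng = random.Random(seed)
    (x0, x1), (y0, y1), (z0, z1) = bounds
    box_volume = (x1 - x0) * (y1 - y0) * (z1 - z0)
    # fraction of random points landing inside, scaled by the box volume
    hits = sum(
        inside(rng.uniform(x0, x1), rng.uniform(y0, y1), rng.uniform(z0, z1))
        for _ in range(n_points)
    )
    return box_volume * hits / n_points

# sanity check on a unit sphere (true volume 4/3*pi, about 4.19)
vol = monte_carlo_volume(lambda x, y, z: x*x + y*y + z*z <= 1.0,
                         bounds=((-1, 1), (-1, 1), (-1, 1)))
```

The estimate's standard error shrinks as the square root of the number of points, which is why no explicit 3D surface reconstruction is needed.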

    New bioinformatic and statistical methods for the analysis of mass spectrometry-based phosphoproteomic data

    In living cells, reversible protein phosphorylation events propagate signals caused by external stimuli from the plasma membrane to their intracellular destinations. Aberrations in these signaling cascades can lead to diseases such as cancer. To identify and quantify phosphorylation events on a large scale, mass spectrometry (MS) has become the predominant technology. The large amount of data generated by MS requires efficient, tailor-made computational tools in order to draw meaningful biological conclusions. In this work, four new methods for analyzing MS-based phosphoproteomic data are presented. The first method, called SubExtractor, combines phosphoproteomic data with protein network information to identify differentially regulated subnetworks. The method is based on a Bayesian probabilistic model that accounts for information about both differential regulation and network topology, combined with a genetic algorithm and rigorous significance testing. The second method, called MeanRank test, is a global one-sample location test based on the mean ranks across replicates, which internally estimates and controls the false discovery rate. The test successfully deals with small numbers of replicates, missing values without the need for imputation, non-normally distributed expression levels, and non-identical distributions of up- and down-regulated features, while its statistical power scales well with the number of replicates. The third method is a biomarker discovery workflow that aims to identify, from phosphoproteomic data, a multivariate biomarker predicting the response of non-small cell lung cancer cell lines to treatment with the kinase inhibitor dasatinib (referred to as the NSCLC biomarker). An elaborate biomarker workflow based on robust feature selection in combination with a support vector machine (SVM) was designed in order to find a phosphorylation signature that accurately predicts the response to dasatinib.
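The mean-rank statistic at the heart of a MeanRank-style test is simple to state: rank features within each replicate, then average the ranks per feature, skipping missing values rather than imputing them. The sketch below shows only this statistic; the test's null model and false discovery rate estimation are omitted, and the input matrix is a made-up example (ties are broken arbitrarily in this sketch).

```python
def mean_ranks(matrix):
    """Average within-replicate ranks per feature, skipping missing values.

    `matrix` is a list of replicates, each a list of expression values per
    feature; missing values are given as None.
    """
    n_features = len(matrix[0])
    totals = [0.0] * n_features
    counts = [0] * n_features
    for replicate in matrix:
        # rank the observed values within this replicate (1 = smallest)
        observed = sorted(
            (v, i) for i, v in enumerate(replicate) if v is not None
        )
        for rank, (_, i) in enumerate(observed, start=1):
            totals[i] += rank
            counts[i] += 1
    # mean rank per feature; None if a feature was never observed
    return [totals[i] / counts[i] if counts[i] else None
            for i in range(n_features)]

# three replicates, four features; feature 3 is consistently highest and
# feature 2 has one missing value
ranks = mean_ranks([[0.1, 0.5, None, 2.0],
                    [0.2, 0.4, 0.3,  1.8],
                    [0.0, 0.6, 0.2,  2.1]])
```

Because only ranks enter the statistic, consistently extreme features stand out even when expression levels are non-normally distributed.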
The fourth method, called Pareto biomarker, extends the previous NSCLC biomarker workflow by optimizing not only a single objective (i.e. the best possible separation of responders and non-responders) but also two further objectives: signature size and relevance (i.e. the association of signature proteins with dasatinib’s main target). This is achieved by employing a multiobjective optimization algorithm based on the principle of Pareto optimality, which allows a simultaneous optimization of all three objectives. These novel data analysis methods were thoroughly validated using experimental data and compared to existing methods. They can be used on their own, or combined into a joint workflow, to efficiently answer complex biological questions in the field of large-scale omics in general and phosphoproteomics in particular.
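Pareto optimality, the principle behind the fourth method, means keeping every candidate that no other candidate beats on all objectives at once. The helper below illustrates that selection rule in general terms; the three-objective tuples (classification error, signature size, one minus relevance, all minimized) are hypothetical examples, not the thesis's actual candidates or algorithm.

```python
def pareto_front(points):
    """Return the non-dominated points (Pareto front), minimizing each objective."""
    def dominates(a, b):
        # a dominates b if it is no worse in every objective
        # and strictly better in at least one
        return (all(x <= y for x, y in zip(a, b))
                and any(x < y for x, y in zip(a, b)))

    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# hypothetical biomarker candidates:
# (classification error, signature size, 1 - relevance)
candidates = [(0.10, 12, 0.3),
              (0.08, 20, 0.3),
              (0.10, 12, 0.2),
              (0.15, 30, 0.5)]
front = pareto_front(candidates)
```

Here the smaller-but-slightly-less-accurate signature and the larger-but-more-accurate one both survive, which is exactly the trade-off surface a multiobjective workflow presents to the analyst.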