
    Random Subset Feature Selection for Ecological Niche Modeling of Wildfire Activity and the Monarch Butterfly

    Correlative ecological niche models (ENMs) are essential for investigating distributions of species and natural phenomena via environmental correlates across broad fields, including entomology and pyrogeography, both featured in this study. Feature (variable) selection is critical for producing more robust ENMs with greater transferability across space and time, but few studies evaluate formal feature selection algorithms (FSAs) for producing higher performance ENMs. Variability of ENMs arising from feature subsets is also seldom represented. A novel FSA, the random subset feature selection algorithm (RSFSA), is developed and evaluated. The RSFSA generates an ensemble of higher accuracy ENMs from different feature subsets, producing a feature subset ensemble (FSE). The RSFSA-selected FSEs are used in a novel way to represent ENM variability. Wildfire activity presence/absence databases for the western US prove ideal for evaluating RSFSA-selected MaxEnt ENMs. The RSFSA was effective in identifying FSEs of 15 of 90 variables with higher accuracy and information content than random FSEs. Selected FSEs were used to identify severe contemporary wildfire deficits and significant future increases in wildfire activity for many ecoregions. Migratory roosting localities of declining eastern North American monarch butterflies (Danaus plexippus) were used to spatially model migratory pathways, comparing RSFSA-selected MaxEnt ENMs and kernel density estimate models (KDEMs). The higher information content ENMs best correlated migratory pathways with nectar resources in grasslands. Higher accuracy KDEMs best revealed migratory pathways through less suitable desert environments. Monarch butterfly roadkill data were surveyed for Texas within the main Oklahoma to Mexico Central Funnel migratory pathway. A random FSE of MaxEnt roadkill ENMs was used to estimate a 2-3% loss of migrants to roadkill. Hotspots of roadkill in west Texas and Mexico were recommended for assessing roadkill mitigation to assist in monarch population recovery. The RSFSA effectively produces higher performance ENM FSEs for estimating optimal feature subset sizes and for comparing ENM algorithms, parameters, and environmental scenarios. The RSFSA also performed comparably to expert variable selection, confirming its value in the absence of expert information. The RSFSA should be compared with other FSAs for developing ENMs and in data mining applications across other disciplines, such as image classification and molecular bioinformatics.
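
    The abstract does not spell out the RSFSA's steps, so the following is only a minimal sketch of the general idea of scoring random feature subsets and keeping the best ones as an ensemble. The function name, the toy scoring function, and all parameter values are assumptions for illustration; a real application would score each subset by fitting and evaluating an ENM (for example with MaxEnt).

```python
import random
from statistics import mean


def random_subset_selection(features, score_fn, subset_size, n_subsets, keep, seed=0):
    """Score random feature subsets; return (top `keep` subsets, all scored subsets)."""
    rng = random.Random(seed)
    scored = []
    for _ in range(n_subsets):
        # Draw a random subset of the requested size without replacement.
        subset = tuple(sorted(rng.sample(features, subset_size)))
        scored.append((score_fn(subset), subset))
    # Highest score first; ties are broken by the subset tuple itself.
    scored.sort(reverse=True)
    return scored[:keep], scored


# Hypothetical stand-in for model accuracy: the share of three
# "informative" features that a subset happens to contain.
INFORMATIVE = {"f0", "f1", "f2"}


def toy_score(subset):
    return len(INFORMATIVE.intersection(subset)) / len(INFORMATIVE)


features = [f"f{i}" for i in range(20)]
ensemble, all_scored = random_subset_selection(
    features, toy_score, subset_size=5, n_subsets=200, keep=10
)
ensemble_mean = mean(score for score, _ in ensemble)
random_mean = mean(score for score, _ in all_scored)
print(ensemble_mean, random_mean)
```

    Comparing the mean score of the kept subsets with the mean over all random subsets mirrors, in a very loose way, the abstract's comparison of selected FSEs against random FSEs.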

    Scalable Feature Selection Applications for Genome-Wide Association Studies of Complex Diseases

    Personalized medicine will revolutionize our capabilities to combat disease. Working toward this goal, a fundamental task is the deciphering of genetic variants that are predictive of complex diseases. Modern studies, in the form of genome-wide association studies (GWAS), have given researchers the opportunity to reveal new genotype-phenotype relationships through the extensive scanning of genetic variants. These studies typically contain over half a million genetic features for thousands of individuals. Examining these data with methods other than univariate statistics is a challenging task requiring advanced algorithms that are scalable to the genome-wide level. In the future, next-generation sequencing (NGS) studies will contain an even larger number of common and rare variants. Machine learning-based feature selection algorithms have been shown to create effective predictive models for various genotype-phenotype relationships. This work explores the problem of selecting the genetic variant subsets that are most predictive of complex disease phenotypes through various feature selection methodologies, including filter, wrapper and embedded algorithms. The examined machine learning algorithms were demonstrated not only to be effective at predicting the disease phenotypes, but also to do so efficiently through the use of computational shortcuts. While much of the work could be run on high-end desktops, some of it was extended to run on parallel computers, helping to ensure that the methods will also scale to NGS data sets. Further, these studies analyzed the relationships between various feature selection methods and demonstrated the need for careful testing when selecting an algorithm. It was shown that there is no universally optimal algorithm for variant selection in GWAS; rather, methodologies need to be selected based on the desired outcome, such as the number of features to be included in the prediction model. It was also demonstrated that without proper model validation, for example using nested cross-validation, the models can yield overly optimistic prediction accuracies and decreased generalization ability. It is through the implementation and application of machine learning methods that one can extract predictive genotype-phenotype relationships and biological insights from genetic data sets.
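
    To make the point about nested cross-validation concrete, here is a minimal sketch in which the number of selected features is tuned in an inner loop and accuracy is estimated on outer folds that the selection never sees. The toy data, the class-mean-gap filter, the nearest-centroid classifier, and every name below are assumptions for illustration, not the methods used in the thesis.

```python
import random
from statistics import mean


def make_data(n=60, n_noise=8, seed=0):
    """Toy data: feature 0 carries signal, the rest is noise; labels alternate 0, 1, 0, 1, ..."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([label + rng.gauss(0, 0.5)] + [rng.gauss(0, 1) for _ in range(n_noise)])
        y.append(label)
    return X, y


def fit(X, y, idx, m):
    """Pick the m features with the largest class-mean gap, using only rows in idx."""
    pos = [X[i] for i in idx if y[i] == 1]
    neg = [X[i] for i in idx if y[i] == 0]
    n_feat = len(X[0])
    cpos = [mean(row[j] for row in pos) for j in range(n_feat)]
    cneg = [mean(row[j] for row in neg) for j in range(n_feat)]
    chosen = sorted(range(n_feat), key=lambda j: abs(cpos[j] - cneg[j]), reverse=True)[:m]
    return chosen, cpos, cneg


def accuracy(model, X, y, idx):
    """Nearest-centroid accuracy on rows in idx, restricted to the chosen features."""
    chosen, cpos, cneg = model
    hits = 0
    for i in idx:
        dpos = sum((X[i][j] - cpos[j]) ** 2 for j in chosen)
        dneg = sum((X[i][j] - cneg[j]) ** 2 for j in chosen)
        hits += (1 if dpos < dneg else 0) == y[i]
    return hits / len(idx)


def folds(idx, k):
    """Split idx into k interleaved folds (keeps both classes in each fold here)."""
    return [idx[i::k] for i in range(k)]


def nested_cv(X, y, sizes, outer_k=3, inner_k=2):
    idx = list(range(len(X)))
    outer_scores = []
    for test in folds(idx, outer_k):
        held_out = set(test)
        train = [i for i in idx if i not in held_out]

        def inner_score(m):
            # Inner loop: tune the subset size using the outer training rows only.
            scores = []
            for val in folds(train, inner_k):
                val_set = set(val)
                fit_idx = [i for i in train if i not in val_set]
                scores.append(accuracy(fit(X, y, fit_idx, m), X, y, val))
            return mean(scores)

        best_m = max(sizes, key=inner_score)
        # Outer loop: refit on all training rows, score on the untouched test fold.
        outer_scores.append(accuracy(fit(X, y, train, best_m), X, y, test))
    return mean(outer_scores)


X, y = make_data()
estimate = nested_cv(X, y, sizes=[1, 3, 9])
print(estimate)
```

    Because feature ranking and size tuning happen inside each outer training split, the outer score is not inflated by information from the test fold, which is the failure mode the abstract warns about.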

    Feature Selection for Document Classification : Case Study of Meta-heuristic Intelligence and Traditional Approaches

    Doctor of Philosophy (Computer Engineering), 2020
    The way people access news around the world has shifted from paper to electronic formats, and the rate at which newspapers and magazines publish on websites has increased dramatically. Meanwhile, text feature selection for automatic document classification (ADC) is becoming a major challenge because of the unstructured nature of text features, known as the “multi-dimension feature problem”. Various powerful schemes for text feature selection are being developed continuously, but there is still a research gap in the “optimization of feature selection problem (OFSP)”, that is, the search for globally optimal features. The capacity of meta-heuristic intelligence in the knowledge discovery process (KDP) has also become critical for overcoming the NP-hard OFSP by providing effective performance and efficient computation time. Therefore, this research proposes a meta-heuristic-based approach to feature selection optimization that searches for the globally optimal features for ADC. This thesis presents a case study of meta-heuristic intelligence and traditional approaches to feature selection optimization in document classification. It covers eleven meta-heuristic algorithms for searching for the optimal feature subset for document classification: Ant Colony search, Artificial Bee Colony search, Bat search, Cuckoo search, Evolutionary search, Elephant search, Firefly search, Flower search, Genetic search, Rhinoceros search, and Wolf search. The results of the proposed model are then compared with three traditional search algorithms: Best First search (BFS), Greedy Stepwise (GS), and Ranker search (RS). In addition, a data mining framework is applied. It involves data preprocessing, feature engineering, building the learning model, and evaluating the performance of the proposed meta-heuristic intelligence-based feature selection using various performance and computational complexity evaluation schemes. In data preprocessing, tokenization, stop-word handling, stemming and lemmatizing, and normalization are applied. In the feature engineering process, n-gram TF-IDF feature extraction is used to build the feature vector, and both filter and wrapper approaches are applied to observe different cases. Three different classifiers, J48, Naïve Bayes, and Support Vector Machine, are used to build the document classification model. According to the results, the proposed system can dramatically reduce the number of selected features, discarding those that can degrade learning model performance. In addition, the selected global feature subsets can yield better performance than traditional search according to the single objective function of the proposed model.

    Deciphering the genetic background of quantitative traits using machine learning and bioinformatics frameworks

    In this thesis, I developed two frameworks that can help highlight the genetic mechanisms underlying quantitative traits. 
In this regard, my focus was to design efficient methodologies to discover genotype-phenotype associations and then use these identified associations to describe the regulatory mechanisms that cause phenotypic differences among individuals. In the first framework, I investigated key regulatory mechanisms governing the development of eggshell strength. The aim was to highlight the temporal changes in the signaling cascades governing eggshell strength during the life of birds. I considered chicken eggshell strength at two different time points during the egg production cycle and studied the genotype-phenotype associations by applying the Random Forest algorithm to genotypic data. For the analysis of the corresponding genes, a well-established systems biology approach was adopted to delineate gene regulatory pathways and master regulators underlying this important trait. My results indicate that, while some of the master regulators (e.g., Slc22a1 and Sox11) and pathways are common to different laying stages of chickens, others (e.g., Scn11a, St8sia2, or the TGF-beta pathway) represent age-specific functions. Overall, my results provide: (i) significant insights into age-specific and common molecular mechanisms underlying the regulation of eggshell strength; and (ii) new breeding targets to improve eggshell quality during the later stages of the chicken production cycle. In my second framework, I combined Random Forests with a signal detection strategy to identify robust genotype-phenotype associations. The objective of this framework was to improve the efficiency of single-SNP-based association analysis. Genome-wide association studies (GWAS) are a well-established methodology for identifying genomic variants and genes that are responsible for traits of interest in all branches of the life sciences. Although this methodology has had a long time to mature, the reliable detection of genotype-phenotype associations is still a challenge for many quantitative traits, mainly because of the large number of genomic loci with weak individual effects on the trait under investigation. Thus, it can be hypothesized that many genomic variants that have a small but real effect remain unnoticed in many GWAS approaches. Here, we propose a two-step procedure to address this problem. In the first step, cubic splines are fitted to the test statistic values, and genomic regions with spline peaks that are higher than expected by chance are considered quantitative trait loci (QTL). Then the SNPs in these QTLs are prioritized with respect to the strength of their association with the phenotype using a Random Forests approach. As a case study, we apply our procedure to real data sets and find a plausible number of partly novel genomic variants and genes involved in various egg quality traits. 2021-10-1

    Artificial Neural Networks in Agriculture

    Modern agriculture needs to combine high production efficiency with high product quality. This applies to both crop and livestock production. To meet these requirements, advanced methods of data analysis are used more and more frequently, including those derived from artificial intelligence. Artificial neural networks (ANNs) are one of the most popular tools of this kind. They are widely used to solve various classification and prediction tasks and, for some time now, also in the broadly defined field of agriculture. They can form part of precision farming and decision support systems. Artificial neural networks can replace the classical methods of modelling many issues, and are one of the main alternatives to classical mathematical models. The spectrum of applications of artificial neural networks is very wide. For a long time now, researchers from all over the world have been using these tools to support agricultural production, making it more efficient and providing the highest-quality products possible.

    Data Mining Feature Subset Weighting and Selection Using Genetic Algorithms

    We present a simple genetic algorithm (sGA), developed under the Genetic Rule and Classifier Construction Environment (GRaCCE), to solve the feature subset selection and weighting problem and so improve the classification accuracy of the k-nearest neighbor (KNN) algorithm. Our hypotheses are that weighting the features will affect the performance of the KNN algorithm and will yield a better classification accuracy rate than binary feature selection. The weighted-sGA algorithm uses real-valued chromosomes to find the weights for features, and the binary-sGA uses integer-valued chromosomes to select a subset of features from the original feature set. A repair algorithm is developed for the weighted-sGA algorithm to guarantee the feasibility of chromosomes. By feasibility we mean that the values of the genes in a chromosome must sum to 1. To calculate the fitness value of each chromosome in the population, we use the KNN algorithm as our fitness function. The Euclidean distance from one individual to the other individuals is calculated in the d-dimensional feature space to classify an unknown instance. GRaCCE searches for good feature subsets and their associated weights. These feature weights are then multiplied by the normalized feature values, and the resulting values are used to calculate the distance between instances.
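
    The pieces described above (a repair step that makes weights sum to 1, a weighted Euclidean distance, and KNN accuracy as the fitness value) can be sketched as follows. The abstract does not say how the repair algorithm works, so rescaling by the sum is an assumption, as are the leave-one-out evaluation, the toy data, and all function names; the genetic search itself is omitted.

```python
import math
from collections import Counter


def repair(weights):
    """Assumed repair step: clip negatives, then rescale so the genes sum to 1."""
    clipped = [max(w, 0.0) for w in weights]
    total = sum(clipped)
    if total == 0:
        # Degenerate chromosome: fall back to equal weights.
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in clipped]


def weighted_distance(a, b, weights):
    """Euclidean distance after multiplying each normalized feature by its weight."""
    return math.sqrt(sum((w * x - w * y) ** 2 for w, x, y in zip(weights, a, b)))


def knn_predict(train, query, weights, k=3):
    """Majority label among the k training instances nearest to the query."""
    nearest = sorted(train, key=lambda item: weighted_distance(item[0], query, weights))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def fitness(weights, data, k=3):
    """Leave-one-out KNN accuracy, used here as the chromosome's fitness."""
    hits = 0
    for i, (x, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, x, weights, k) == label
    return hits / len(data)


# Toy data with features already scaled to [0, 1]:
# feature 0 separates the classes, feature 1 is noise.
data = [
    ((0.10, 0.90), "a"), ((0.20, 0.10), "a"), ((0.15, 0.50), "a"),
    ((0.80, 0.20), "b"), ((0.90, 0.80), "b"), ((0.85, 0.45), "b"),
]

informative = repair([2.0, 0.0])  # all weight on feature 0
noisy = repair([0.0, 5.0])        # all weight on feature 1
print(fitness(informative, data), fitness(noisy, data))
```

    A genetic algorithm would evolve the weight vectors and call `repair` after crossover and mutation so that every chromosome stays feasible before its fitness is evaluated.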

    Computational Optimizations for Machine Learning

    The present book contains the 10 articles accepted for publication in the Special Issue “Computational Optimizations for Machine Learning” of the MDPI journal Mathematics, which cover a wide range of topics connected to the theory and applications of machine learning, neural networks and artificial intelligence. These topics include, among others, various types of machine learning classes, such as supervised, unsupervised and reinforcement learning, deep neural networks, convolutional neural networks, GANs, decision trees, linear regression, SVM, K-means clustering, Q-learning, temporal difference, deep adversarial networks and more. It is hoped that the book will be interesting and useful to those developing mathematical algorithms and applications in the domain of artificial intelligence and machine learning, as well as to those who have the appropriate mathematical background and wish to become familiar with recent advances in the mathematics of computational optimization for machine learning, which has now permeated almost all sectors of human life and activity.

    Swarm Intelligence

    Swarm Intelligence has emerged as one of the most studied branches of artificial intelligence during the last decade, constituting the fastest growing stream in the bio-inspired computation community. An analysis of some of the most renowned scientific databases shows a clear trend: interest in this branch has increased at a notable pace in recent years. This book describes the prominent theories and recent developments of Swarm Intelligence methods, and their application in all fields covered by engineering. It offers a great opportunity for researchers, lecturers, and practitioners interested in Swarm Intelligence, optimization problems, and artificial intelligence.