122 research outputs found

    Comparative Analysis of Predictive Data-Mining Techniques

    This thesis compares five predictive data-mining techniques (four linear and one nonlinear) on four distinct data sets: the Boston Housing data set, a collinear data set (called "the COL" data set in this thesis), an airliner data set ("the Airliner" data) and a simulated data set ("the Simulated" data). These data sets are unique in combining the following characteristics: few predictor variables, many predictor variables, highly collinear variables, very redundant variables, and the presence of outliers. The nature of each data set is explored and its unique qualities defined; this is data pre-processing and preparation, and to a large extent it helps the miner/analyst choose which predictive technique to apply. The central problem is how to reduce the variables to a minimal number that can still fully predict the response variable.
    Five data-mining techniques were applied to each data set: multiple linear regression (MLR), based on the ordinary least-squares approach; principal component regression (PCR), an unsupervised technique based on principal component analysis; ridge regression, which uses a regularization coefficient (a smoothing technique); partial least squares (PLS), a supervised technique; and nonlinear partial least squares (NLPLS), which uses neural-network functions to map nonlinearity into the models. Each technique can be used in several ways; these variants were tried on each data set first, and the best variant of each technique was then used for the global comparison with the other techniques on the same data set.
    Based on the five model-adequacy criteria used, PLS outperformed all the other techniques on the Boston Housing data set: using only the first nine factors, it gave an MSE of 21.1395, a condition number below 29, and a modified coefficient of efficiency (E-mod) of 0.4408. The closest models were those built with all the variables in MLR, all PCs in PCR, and all factors in PLS. On mean absolute error (MAE) alone, ridge regression with a regularization parameter of 1 outperformed all other models, but the condition number (CN) of the nine-factor PLS model was better. With the COL data, a highly collinear data set, the best model based on the condition number (<100) and MSE (57.8274) was the PLS with two factors. Judged on MSE only, ridge regression with an alpha value of 3.08 would be best, with an MSE of 31.8292. The NLPLS model was not considered, even though it gave an MSE of 22.7552, because it mapped nonlinearity into the model and, in this case, the solution was not stable. With the Airliner data set, which is also highly ill-conditioned with redundant input variables, ridge regression with a regularization coefficient of 6.65 outperformed all the other models (MSE of 2.874 and condition number of 61.8195), giving a good compromise between smoothing and bias. The lowest MSE and MAE were recorded for PLS (all factors), PCR (all PCs), and MLR (all variables), but their condition numbers were far above 100. For the Simulated data set, the best model was the optimal PLS (eight factors) model, with an MSE of 0.0601, an MAE of 0.1942 and a condition number of 12.2668. The MSE and MAE were the same for the PCR model built with the PCs accounting for 90% of the variation in the data, but the condition numbers were all above 1000.
    Overall, PLS gave better models in most cases, both for ill-conditioned data sets and for data sets with redundant input variables. Principal component regression and ridge regression, methods designed for highly ill-conditioned data matrices, also performed well on the ill-conditioned data sets.
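    The comparison protocol described above can be sketched with off-the-shelf estimators. A minimal illustration, assuming scikit-learn and using its bundled diabetes data as a stand-in (the thesis's own data sets are not distributed with it); the single condition number of the scaled model matrix is a simplification of the per-model CN the thesis tracks:

        # Sketch: fit MLR, ridge, PCR and PLS on one data set and score each
        # by MSE, MAE and condition number, the criteria used in the thesis.
        import numpy as np
        from sklearn.datasets import load_diabetes            # bundled stand-in data
        from sklearn.model_selection import train_test_split
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler
        from sklearn.linear_model import LinearRegression, Ridge
        from sklearn.decomposition import PCA
        from sklearn.cross_decomposition import PLSRegression
        from sklearn.metrics import mean_squared_error, mean_absolute_error

        X, y = load_diabetes(return_X_y=True)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

        # Condition number of the standardized model matrix (simplification:
        # the thesis reports the CN of each model's effective matrix).
        cn = np.linalg.cond(StandardScaler().fit_transform(X_tr))

        models = {
            "MLR": make_pipeline(StandardScaler(), LinearRegression()),
            "Ridge (alpha=1)": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
            "PCR (5 PCs)": make_pipeline(StandardScaler(), PCA(5), LinearRegression()),
            "PLS (5 factors)": make_pipeline(StandardScaler(), PLSRegression(5)),
        }
        for name, model in models.items():
            pred = np.ravel(model.fit(X_tr, y_tr).predict(X_te))
            print(f"{name}: MSE={mean_squared_error(y_te, pred):.3f}, "
                  f"MAE={mean_absolute_error(y_te, pred):.3f}, CN={cn:.1f}")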

    Impact of Urban Surface Characteristics and Socio-Economic Variables on the Spatial Variation of Land Surface Temperature in Lagos City, Nigeria

    The urban heat island (UHI) and its consequences have become a key research focus across disciplines because of its negative externalities for urban ecology and the overall livability of cities. Identifying the spatial variation of land surface temperature (LST) gives a clear picture of the UHI phenomenon and helps in introducing appropriate mitigation techniques to address the adverse impact of UHI. The aim of this research is therefore to examine the spatial variation of LST with respect to the UHI phenomenon in rapidly urbanizing Lagos City. Four variables were examined to identify the impact of urban surface characteristics and socio-economic activities on LST. Gradient analysis was employed to assess the distribution of LST from the city center to rural areas over vegetation and built-up areas, and partial least squares (PLS) regression analysis was used to assess the correlation and statistical significance of the variables. Landsat data captured in 2002 and 2013 were used as the primary data sources, together with gridded data such as PD and FFCOE. The results show that the distribution pattern of LST changed between 2002 and 2013 as a result of changing urban surface characteristics (USC) and the influence of socio-economic activities. LST has a strong positive relationship with NDBI and a strong negative relationship with NDVI. The rapid development of Lagos City has been directly driven by the conversion of green areas to built-up areas over time, which has intensified the surface urban heat island (SUHI). Further, the growing population and its socio-economic activities, including industrialization and infrastructure development, have also had a significant impact on LST changes. We recommend that the results of this research be used as a proxy tool for appropriate landscape and town planning from a sustainability viewpoint, to make the urban environment of Lagos City, Nigeria healthier and more livable.
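    The index-plus-regression step is compact enough to sketch. A minimal example, assuming scikit-learn; the rasters below are synthetic stand-ins for the Landsat-derived red, NIR and SWIR1 reflectance bands and the LST layer, with the LST constructed to show the signs the study reports:

        # Sketch: derive NDVI and NDBI from reflectance bands, then regress
        # LST on them with PLS, mirroring the study's correlation analysis.
        import numpy as np
        from sklearn.cross_decomposition import PLSRegression

        rng = np.random.default_rng(0)
        red, nir, swir1 = (rng.uniform(0.01, 0.6, (100, 100)) for _ in range(3))

        ndvi = (nir - red) / (nir + red)        # vegetation index
        ndbi = (swir1 - nir) / (swir1 + nir)    # built-up index
        lst = 20 + 15 * ndbi - 10 * ndvi + rng.normal(0, 1, ndvi.shape)  # synthetic LST

        # Flatten the rasters into a sample-by-predictor matrix.
        X = np.column_stack([ndvi.ravel(), ndbi.ravel()])
        y = lst.ravel()

        pls = PLSRegression(n_components=2).fit(X, y)
        print(pls.score(X, y))   # share of LST variance explained
        print(pls.coef_)         # expected signs: negative NDVI, positive NDBI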

    Regulation of Microclimatic Conditions inside Native Beehives and Its Relationship with Climate in Southern Spain

    In this study, the Wbee Sensor System was used to record data from 10 Iberian beehives over two years in southern Spain. These data were used to identify climatic factors that potentially condition the internal regulatory behavior of the hive and its weight. Categorical principal components analysis (CATPCA) was used to determine the minimum number of factors able to capture the maximum percentage of variability in the recorded data. Categorical regression (CATREG) was then used to select the factors linearly related to hive internal humidity, temperature and weight, and to derive predictive regression equations for Iberian bees. Average relative humidity was 51.7% ± 10.4 in the brood nest and 54.2% ± 11.7 in the food area, while average temperatures were 34.3 °C ± 1.5 in the brood nest and 29.9 °C ± 5.8 in the food area. Average beehive weight was 38.2 kg ± 13.6. Some of our data, especially those related to humidity, contrast with previously published results from studies of bees in Central and northern Europe. In conclusion, certain combinations of climatic factors may condition within-hive humidity, temperature and hive weight. The brood-nest humidity regulatory capacity of southern Iberian honeybees may be lower than their brood-nest thermoregulatory capacity, which maintains values close to 34 °C even in dry conditions.
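    CATPCA and CATREG are SPSS optimal-scaling procedures; a loose scikit-learn stand-in (ordinal coding plus PCA plus linear regression in place of optimal scaling) conveys the shape of the workflow. All category labels and hive weights below are synthetic, for illustration only:

        # Sketch: reduce coded climate categories to a few components, then
        # regress hive weight on the component scores (CATPCA/CATREG analogue).
        import numpy as np
        import pandas as pd
        from sklearn.preprocessing import OrdinalEncoder, StandardScaler
        from sklearn.decomposition import PCA
        from sklearn.linear_model import LinearRegression

        rng = np.random.default_rng(1)
        climate = pd.DataFrame({
            "temp_band": rng.choice(["cold", "mild", "hot"], 500),
            "humidity_band": rng.choice(["dry", "normal", "humid"], 500),
            "wind_band": rng.choice(["calm", "breezy", "windy"], 500),
        })
        weight = 38.2 + rng.normal(0, 13.6, 500)           # hive weight, kg

        coded = OrdinalEncoder().fit_transform(climate)     # categories -> codes
        scores = PCA(n_components=2).fit_transform(
            StandardScaler().fit_transform(coded))          # minimal factor set
        model = LinearRegression().fit(scores, weight)
        print(model.score(scores, weight))                  # R^2 of the regression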

    Kern-basierte Lernverfahren für das virtuelle Screening (Kernel-Based Learning Methods for Virtual Screening)

    We investigate the utility of modern kernel-based machine learning methods for ligand-based virtual screening. In particular, we introduce a new graph kernel based on iterative graph similarity and optimal assignments, apply kernel principal component analysis to projection-error-based novelty detection, and discover a new selective agonist of the peroxisome proliferator-activated receptor gamma using Gaussian process regression.
    Virtual screening, the computational ranking of compounds with respect to a predicted property, is a cheminformatics problem relevant to the hit-generation phase of drug development. Its ligand-based variant relies on the similarity principle, which states that (structurally) similar compounds tend to have similar properties. We describe the kernel-based machine learning approach to ligand-based virtual screening; in this, we stress the role of molecular representations, including the (dis)similarity measures defined on them, investigate effects in high-dimensional chemical descriptor spaces and their consequences for similarity-based approaches, review literature recommendations on retrospective virtual screening, and present an example workflow.
    Graph kernels are formal similarity measures that are defined directly on graphs, such as the annotated molecular structure graph, and correspond to inner products. We review graph kernels, in particular those based on random walks, subgraphs, and optimal vertex assignments. Combining the latter with an iterative graph similarity scheme, we develop the iterative similarity optimal assignment graph kernel, give an iterative algorithm for its computation, prove convergence of the algorithm and the uniqueness of the solution, and provide an upper bound on the number of iterations necessary to achieve a desired precision. In a retrospective virtual screening study, our kernel consistently improved performance over chemical descriptors as well as other optimal assignment graph kernels.
    Chemical data sets often lie on manifolds of lower dimensionality than the embedding chemical descriptor space. Dimensionality reduction methods try to identify these manifolds, effectively providing descriptive models of the data. For spectral methods based on kernel principal component analysis, the projection error is a quantitative measure of how well new samples are described by such models. This can be used to identify compounds structurally dissimilar to the training samples, leading to projection-error-based novelty detection for virtual screening using only positive samples. We provide proof of principle by using principal component analysis to learn the concept of fatty acids.
    The peroxisome proliferator-activated receptor (PPAR) is a nuclear transcription factor that regulates lipid and glucose metabolism, playing a crucial role in the development of type 2 diabetes and dyslipidemia. We establish a Gaussian process regression model for PPAR gamma agonists using a combination of chemical descriptors and the iterative similarity optimal assignment kernel via multiple kernel learning. Screening of a vendor library and subsequent testing of 15 selected compounds in a cell-based transactivation assay resulted in 4 active compounds. One compound, a natural product with a cyclobutane scaffold, is a full selective PPAR gamma agonist (EC50 = 10 ± 0.2 µM, inactive on PPAR alpha and PPAR beta/delta at 10 µM).
    The study delivered a novel PPAR gamma agonist, de-orphanized a natural bioactive product, and hints at the natural-product origins of pharmacophore patterns in synthetic ligands.
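    The projection-error idea lends itself to a short sketch. The thesis measures the error in kernel feature space; the example below, assuming scikit-learn, uses the approximate pre-image reconstruction error as a proxy, with synthetic vectors standing in for chemical descriptors:

        # Sketch: one-class novelty detection via kernel PCA projection error,
        # trained on positive samples only, as described in the abstract.
        import numpy as np
        from sklearn.decomposition import KernelPCA

        rng = np.random.default_rng(2)
        train = rng.normal(0.0, 1.0, (200, 16))     # "known" compounds
        novel = rng.normal(4.0, 1.0, (10, 16))      # structurally dissimilar

        kpca = KernelPCA(n_components=8, kernel="rbf", gamma=0.05,
                         fit_inverse_transform=True).fit(train)

        def projection_error(model, X):
            # Reconstruct each sample from its kPCA projection; the residual
            # norm measures how poorly the learned manifold describes it.
            recon = model.inverse_transform(model.transform(X))
            return np.linalg.norm(X - recon, axis=1)

        print(projection_error(kpca, train).mean())  # low: well described
        print(projection_error(kpca, novel).mean())  # high: flagged as novel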
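    The prospective screening step can likewise be sketched. A minimal example assuming scikit-learn, where a plain RBF kernel on synthetic descriptors stands in for the multiple-kernel combination of chemical descriptors and the iterative similarity optimal assignment graph kernel used in the thesis:

        # Sketch: Gaussian process regression to rank a vendor library by
        # predicted activity and shortlist candidates for assay testing.
        import numpy as np
        from sklearn.gaussian_process import GaussianProcessRegressor
        from sklearn.gaussian_process.kernels import RBF, WhiteKernel

        rng = np.random.default_rng(3)
        X_train = rng.normal(size=(60, 16))               # known actives: descriptors
        y_train = X_train[:, 0] + rng.normal(0, 0.1, 60)  # measured activity (synthetic)
        library = rng.normal(size=(1000, 16))             # vendor library

        gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(),
                                       normalize_y=True).fit(X_train, y_train)
        mean, std = gpr.predict(library, return_std=True)  # predictions + uncertainty
        top = np.argsort(mean)[::-1][:15]                  # shortlist 15 compounds
        print(top, mean[top].round(2))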