154 research outputs found

    Fast missing value imputation using ensemble of SOMs

    This report presents a methodology for missing value imputation. The methodology is based on an ensemble of Self-Organizing Maps (SOMs), which is weighted using the Nonnegative Least Squares algorithm. Whereas a single SOM needs a lengthy validation procedure, the ensemble proceeds straight to final model building. The methodology therefore has a very low computational time while retaining accuracy. Its performance is compared to other state-of-the-art methodologies using two real-world databases from different fields.
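The weighting step can be illustrated in isolation. The sketch below is an assumption-laden illustration, not the report's implementation: it supposes that each ensemble member has produced estimates for a set of entries whose true values are known, and it finds nonnegative combination weights by projected gradient descent, which stands in for a dedicated Nonnegative Least Squares solver. The names `nnls_weights` and `combine` and the toy numbers are hypothetical.

```python
def nnls_weights(member_preds, targets, iters=20000):
    """Approximate argmin_w ||P w - t||^2 subject to w >= 0.

    member_preds[i][m] is the estimate of ensemble member m for sample i.
    Projected gradient descent is a stand-in for a dedicated NNLS solver.
    """
    n_members = len(member_preds[0])
    # Gram matrix G = P^T P and vector c = P^T t, computed once.
    gram = [[sum(row[a] * row[b] for row in member_preds)
             for b in range(n_members)] for a in range(n_members)]
    corr = [sum(row[a] * t for row, t in zip(member_preds, targets))
            for a in range(n_members)]
    # trace(G) bounds the largest eigenvalue, so 1/trace is a safe step size.
    step = 1.0 / sum(gram[a][a] for a in range(n_members))
    w = [0.0] * n_members
    for _ in range(iters):
        # Gradient of 0.5 * ||P w - t||^2 is G w - c.
        grad = [sum(gram[a][b] * w[b] for b in range(n_members)) - corr[a]
                for a in range(n_members)]
        # Gradient step followed by projection onto the nonnegative orthant.
        w = [max(0.0, w[a] - step * grad[a]) for a in range(n_members)]
    return w


def combine(member_row, weights):
    """Weighted ensemble estimate for one sample."""
    return sum(p * w for p, w in zip(member_row, weights))


# Toy data: three members, four entries with known values.
preds = [[1.0, 2.0, 0.0], [2.0, 1.0, 1.0], [3.0, 4.0, 0.0], [4.0, 3.0, 1.0]]
known = [1.3, 1.7, 3.3, 3.7]
weights = nnls_weights(preds, known)
```

Because the weights come from a single constrained least-squares fit, no per-model validation loop is needed, which is consistent with the low computational time the abstract reports.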

    Mutual Information Based Initialization of Forward-Backward Search for Feature Selection in Regression Problems

    Pure feature selection, where each variable is either included in or excluded from the training data set, remains an unsolved problem, especially when the dimensionality is high. Recently, the Forward-Backward Search algorithm, which uses the Delta Test to evaluate a candidate solution, was presented and showed good performance. However, because the search procedure is local, the starting point of the search becomes crucial for obtaining good results. This paper presents a new heuristic for finding a more adequate starting point that could lead to a better solution. The heuristic sorts the variables using the Mutual Information criterion and then performs parallel local searches. These local searches provide the starting point for the actual parallel Forward-Backward algorithm.
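A minimal sequential sketch of the search described above, under stated assumptions: the Delta Test is taken in its usual nearest-neighbor form (half the mean squared output difference between each sample and its nearest neighbor in the selected inputs), and the starting subset is simply passed in as `start`, where the paper would derive it from a Mutual Information ranking. Neither the Mutual Information estimate nor the parallelization is reproduced here, and the function names and toy data are hypothetical.

```python
def delta_test(X, y, subset):
    """Delta Test: half the mean squared output difference between each
    sample and its nearest neighbor in the selected input variables."""
    if not subset:
        return float("inf")  # no inputs selected: treat as worst case
    total = 0.0
    for i, xi in enumerate(X):
        best_dist, best_j = float("inf"), None
        for j, xj in enumerate(X):
            if i == j:
                continue
            dist = sum((xi[v] - xj[v]) ** 2 for v in subset)
            if dist < best_dist:
                best_dist, best_j = dist, j
        total += (y[i] - y[best_j]) ** 2
    return total / (2 * len(X))


def forward_backward_search(X, y, start):
    """Flip one variable at a time (add or remove), keep the best flip
    while it lowers the Delta Test, and stop at a local minimum."""
    current = set(start)
    current_score = delta_test(X, y, current)
    n_vars = len(X[0])
    while True:
        candidates = [current ^ {v} for v in range(n_vars)]
        scores = [delta_test(X, y, c) for c in candidates]
        best = min(range(n_vars), key=lambda k: scores[k])
        if scores[best] >= current_score:
            return current, current_score
        current, current_score = candidates[best], scores[best]


# Toy data: the output depends on variable 0 only; variable 1 is a distractor.
X = [[0.0, 0.0], [1.0, 40.0], [2.0, 10.0], [3.0, 50.0],
     [4.0, 20.0], [5.0, 60.0], [6.0, 30.0], [7.0, 70.0]]
y = [2.0 * row[0] for row in X]
selected, score = forward_backward_search(X, y, start={0, 1})
```

Since the loop only ever accepts single flips that improve the score, it stops at the first local minimum it reaches, which is why the choice of `start` matters.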

    Optimal pruned K-nearest neighbors: OP-KNN application to financial modeling

    The paper proposes a methodology called OP-KNN, which builds a one-hidden-layer feedforward neural network using nearest-neighbor neurons with extremely small computational time. The main strategy is to select the most relevant variables beforehand, then to build the model using KNN kernels. Multiresponse Sparse Regression (MRSR) is used as the second step in order to rank each kth nearest neighbor, and finally, as a third step, Leave-One-Out estimation is used to select the number of neighbors and to estimate the generalization performance. This new methodology is tested on a toy example and is applied to financial modeling.
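The third step, Leave-One-Out selection of the number of neighbors, can be shown on its own. This sketch uses plain averaging over the k nearest neighbors of a one-dimensional input instead of the MRSR-ranked KNN kernels of OP-KNN, which the abstract does not detail; `loo_mse`, `select_k`, and the toy data are hypothetical.

```python
def loo_mse(x, y, k):
    """Leave-one-out mean squared error of k-nearest-neighbor averaging."""
    total = 0.0
    for i in range(len(x)):
        # Rank all other samples by distance to the held-out sample i.
        others = sorted((j for j in range(len(x)) if j != i),
                        key=lambda j: abs(x[j] - x[i]))
        pred = sum(y[j] for j in others[:k]) / k
        total += (y[i] - pred) ** 2
    return total / len(x)


def select_k(x, y, candidates):
    """Return the candidate k with the smallest leave-one-out error."""
    return min(candidates, key=lambda k: loo_mse(x, y, k))


# Toy data: two noisy plateaus.
x = [0.0, 1.0, 2.1, 5.0, 6.0, 7.1]
y = [0.0, 1.0, 0.0, 10.0, 11.0, 10.0]
best_k = select_k(x, y, candidates=[1, 2, 3])
```

The leave-one-out error of the chosen k doubles as an estimate of generalization performance, matching the dual role the abstract gives this step.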

    The role of surfactants in Köhler theory reconsidered

    Atmospheric aerosol particles typically consist of inorganic salts and organic material. The inorganic compounds as well as their hygroscopic properties are well defined, but the effect of organic compounds on cloud droplet activation is still poorly characterized. The present study focuses on organic compounds that are surface active, i.e., they concentrate on the droplet surface and decrease droplet surface tension. Gibbsian surface thermodynamics was used to find out how partitioning in binary and ternary aqueous solutions affects the droplet surface tension and the droplet bulk concentration in droplets large enough to act as cloud condensation nuclei. Sodium dodecyl sulfate was used as a model compound together with sodium chloride to determine how correct evaluation of surfactant partitioning affects the solute effect (Raoult effect). While the partitioning is known to lead to higher surface tension compared to a case in which partitioning is neglected, the present results show that the partitioning also alters the solute effect, and that the change is large enough to further increase the critical supersaturation and hence decrease droplet activation. The fraction of surfactant partitioned to the droplet surface increases with decreasing droplet size, which suggests that surfactants might enhance the activation of larger particles relatively more, thus leading to less dense clouds. A cis-pinonic acid-ammonium sulfate aqueous solution was studied in order to relate the partitioning to a more realistic atmospheric situation and to find out the combined effects of dissolution and partitioning behaviour. The results show that correct consideration of partitioning alters the shape of the Köhler curve when compared to a situation in which the partitioning is neglected either completely or in the Raoult effect.

    Methodologies for time series prediction and missing value imputation

    The amount of collected data in the world is increasing all the time. More sophisticated measuring instruments and increases in computer processing power produce more and more data, which requires more capacity for collection, transmission and storage. Even though computers are faster, large databases also need good and accurate methodologies to be useful in practice. Some techniques are not feasible for very large databases or are not able to provide the necessary accuracy. As the title proclaims, this thesis focuses on two aspects encountered with databases: time series prediction and missing value imputation. The first one is a function approximation and regression problem, but can, in some cases, also be formulated as a classification task. Accurate prediction of future values is heavily dependent not only on a good model, which is well trained and validated, but also on preprocessing, input variable selection or projection, and output approximation strategy selection. The importance of all these choices made in the approximation process increases when the prediction horizon is extended further into the future. The second focus area deals with missing values in a database. The missing values can be a nuisance, but can also be a prohibiting factor in the use of certain methodologies and degrade the performance of others. Hence, missing value imputation is a very necessary part of the preprocessing of a database. This imputation has to be done carefully in order to retain the integrity of the database and not to insert any unwanted artifacts that aggravate the job of the final data analysis methodology. Furthermore, even though accuracy is always the main requisite for a good methodology, computational time has to be considered alongside the precision. In this thesis, a large variety of different strategies for output approximation and variable processing for time series prediction are presented. 
There is also a detailed presentation of new methodologies and tools for solving the problem of missing values. The strategies and methodologies are compared against the state-of-the-art ones and shown to be accurate and useful in practice.

    More and more data is produced in the world all the time. More advanced measuring instruments, faster computers, and increased transmission and storage capacities make it possible to collect, transfer, and store large masses of data. Although the computing power of computers grows continuously, good and accurate methods are still needed for processing large data sets. Not all methods are suited to processing huge data sets, or they do not produce sufficiently accurate results. This work focuses on two important areas of database processing: time series prediction and missing value imputation. The first of these is a regression problem, in which the future of a time series is estimated from preceding samples. In some cases the regression problem can also be formulated as a classification problem. Accurate time series prediction depends on a good and reliable prediction model. The model must be trained correctly, and its correctness and accuracy must be verified. In addition, the preprocessing of the time series, the input variable selection or projection method, and the prediction strategy must be chosen with care, and their suitability for use with the model must be carefully verified. The importance of these choices grows further the farther into the future one predicts. The second area of this work addresses the problem of missing values. Values missing from a database can degrade the results produced by a data analysis method or even prevent the use of some methods, so estimating and imputing missing values as part of preprocessing is advisable. Imputation must nevertheless be done with care, because deficient imputation very likely leads to inaccuracies in the final application and to unwanted structures within the database. Because this is preprocessing rather than the actual use of the data, the computation time spent on imputing missing values should be minimized while preserving accuracy. This dissertation presents various ways of predicting far into the future and means for selecting input variables. In addition, new methods for missing value imputation have been developed and compared with existing methods.

    The role of surfactants in Köhler theory reconsidered

    Atmospheric aerosol particles typically consist of inorganic salts and organic material. The inorganic compounds as well as their hygroscopic properties are well defined, but the effect of organic compounds on cloud droplet activation is still poorly characterized. The focus of the present study is the organic compounds that are surface active, i.e., tend to concentrate on the droplet surface and decrease the surface tension. Gibbsian surface thermodynamics was used to find out how partitioning between the droplet surface and the bulk of the droplet affects the surface tension and the surfactant bulk concentration in droplets large enough to act as cloud condensation nuclei. Sodium dodecyl sulfate (SDS) was used together with sodium chloride to investigate the effect of surfactant partitioning on the Raoult effect (solute effect). While accounting for the surface-to-bulk partitioning is known to lead to lowered bulk surfactant concentration and thereby to increased surface tension compared to a case in which the partitioning is neglected, the present results show that the partitioning also alters the Raoult effect, and that the change is large enough to further increase the critical supersaturation and hence decrease cloud droplet activation. The fraction of surfactant partitioned to the droplet surface increases with decreasing droplet size, which suggests that surfactants might enhance the activation of larger particles relatively more, thus leading to less dense clouds. Cis-pinonic acid-ammonium sulfate aqueous solutions were studied in order to examine the partitioning with compounds found in the atmosphere and to find out the combined effects of dissolution and partitioning behavior. The results show that the partitioning consideration presented in this paper alters the shape of the Köhler curve when compared to calculations in which the partitioning is neglected either completely or in the Raoult effect. 
In addition, critical supersaturation was measured for SDS particles with dry radii of 25-60 nm using a static parallel plate Cloud Condensation Nucleus Counter. The experimentally determined critical supersaturations agree very well with theoretical calculations taking the surface-to-bulk partitioning fully into account and are much higher than those calculated neglecting the partitioning.
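The direction of the two effects discussed above can be checked with the textbook dilute-droplet form of the Köhler curve, ln S = A/r - B/r^3, where A carries the surface tension (Kelvin term) and B the amount of solute in the bulk (Raoult term). This is a simplified sketch, not the paper's Gibbsian calculation: surface tension and bulk solute are held constant along the curve, and the input values are purely illustrative assumptions.

```python
import math

R_GAS = 8.314        # J mol^-1 K^-1
M_WATER = 0.018015   # kg mol^-1
RHO_WATER = 997.0    # kg m^-3


def kelvin_coefficient(surface_tension, temperature=298.15):
    """A in ln S = A/r - B/r^3 (curvature term), in metres."""
    return 2.0 * surface_tension * M_WATER / (R_GAS * temperature * RHO_WATER)


def raoult_coefficient(solute_moles):
    """B in ln S = A/r - B/r^3 (solute term), in cubic metres.
    solute_moles counts dissolved ions or molecules left in the bulk."""
    return 3.0 * solute_moles * M_WATER / (4.0 * math.pi * RHO_WATER)


def critical_point(surface_tension, solute_moles):
    """Critical radius (m) and critical supersaturation (%) of the curve."""
    a = kelvin_coefficient(surface_tension)
    b = raoult_coefficient(solute_moles)
    r_crit = math.sqrt(3.0 * b / a)                    # where d(ln S)/dr = 0
    ln_s_crit = math.sqrt(4.0 * a ** 3 / (27.0 * b))   # ln S at r_crit
    return r_crit, (math.exp(ln_s_crit) - 1.0) * 100.0


# Illustrative inputs only: surface tension in N/m, bulk solute in mol.
_, s_reference = critical_point(0.072, 1.0e-18)
_, s_reduced_tension = critical_point(0.050, 1.0e-18)
_, s_depleted_bulk = critical_point(0.072, 0.5e-18)
```

In this simplified picture a lower surface tension lowers the critical supersaturation, while moving solute out of the bulk raises it, which is the qualitative trade-off the abstract describes for surfactant partitioning.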

    Niche matters : The comparison between bone marrow stem cells and endometrial stem cells and stromal fibroblasts reveal distinct migration and cytokine profiles in response to inflammatory stimulus

    Objective: Intrinsic inflammatory characteristics play a pivotal role in stem cell recruitment and homing through migration, where the subsequent change in niche has been shown to alter these characteristics. The bone marrow mesenchymal stem cells (bmMSCs) have been demonstrated to migrate to the endometrium, contributing to the stem cell reservoir and regeneration of endometrial tissue. Thus, the aim of the present study was to compare the inflammation-driven migration and cytokine secretion profile of human bmMSCs to endometrial mesenchymal stem cells (eMSCs) and endometrial fibroblasts (eSFs).
    Materials and methods: The bmMSCs were isolated from bone marrow aspirates through culturing, whereas eMSCs and eSFs were FACS-isolated. All cell types were tested for their surface marker and proliferation profiles and for migration properties towards serum and inflammatory attractants. The cytokine/chemokine secretion profile of 35 targets was analysed in each cell type at basal level along with the lipopolysaccharide (LPS)-induced state.
    Results: Both stem cell types, bmMSCs and eMSCs, presented similar stem cell surface marker profiles and possessed high proliferation and migration potential compared to eSFs. In multiplex assays, the secretion of 16 cytokine targets was detected, and LPS stimulation expanded the cytokine secretion pattern by triggering the secretion of several targets. The bmMSCs exhibited higher cytokine secretion of vascular endothelial growth factor (VEGF)A, stromal cell-derived factor-1 alpha (SDF)-1 alpha, interleukin-1 receptor antagonist (IL-1RA), IL-6, interferon-gamma inducible protein (IP)-10, monocyte chemoattractant protein (MCP)1, macrophage inflammatory protein (MIP) 1 alpha and RANTES compared to eMSCs and/or eSFs after stimulation with LPS. The basal IL-8 secretion was higher in both endometrial cell types compared to bmMSCs.
    Conclusion: Our results highlight that, similar to bmMSCs, the eMSCs possess high migration activity, while the differentiation process towards stromal fibroblasts seemed to result in loss of stem cell surface markers, minimal migration activity and a subtler cytokine profile, likely contributing to normal endometrial function.

    A review and comparison of strategies for multi-step ahead time series forecasting based on the NN5 forecasting competition

    Multi-step ahead forecasting is still an open challenge in time series forecasting. Several approaches that deal with this complex problem have been proposed in the literature, but an extensive comparison on a large number of tasks is still missing. This paper aims to fill this gap by reviewing existing strategies for multi-step ahead forecasting and comparing them in theoretical and practical terms. To this end, we performed a large-scale comparison of these strategies using a large experimental benchmark (namely the 111 series from the NN5 forecasting competition). In addition, we considered the effects of deseasonalization, input variable selection, and forecast combination on these strategies and on multi-step ahead forecasting at large. The following three findings appear to be consistently supported by the experimental results: Multiple-Output strategies are the best performing approaches, deseasonalization leads to uniformly improved forecast accuracy, and input selection is more effective when performed in conjunction with deseasonalization.
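To make the notion of a forecasting strategy concrete, the sketch below contrasts two commonly described single-output strategies, recursive and direct, using a one-parameter least-squares model. The abstract does not list the strategies the paper reviews, so this choice, the model, and all function names are assumptions for illustration; the Multiple-Output strategies the paper favors are not implemented here.

```python
def fit_slope(inputs, outputs):
    """Least-squares slope of outputs ~ a * inputs (no intercept)."""
    return (sum(u * v for u, v in zip(inputs, outputs))
            / sum(u * u for u in inputs))


def recursive_forecast(series, horizon):
    """Fit one one-step model, then feed each forecast back as input."""
    a = fit_slope(series[:-1], series[1:])
    forecasts, last = [], series[-1]
    for _ in range(horizon):
        last = a * last
        forecasts.append(last)
    return forecasts


def direct_forecast(series, horizon):
    """Fit a separate model for every step ahead h = 1..horizon."""
    forecasts = []
    for h in range(1, horizon + 1):
        # Pair each value y_t with the value h steps later, y_{t+h}.
        a_h = fit_slope(series[:-h], series[h:])
        forecasts.append(a_h * series[-1])
    return forecasts


# Toy series that halves at every step.
series = [64.0, 32.0, 16.0, 8.0, 4.0, 2.0]
rec = recursive_forecast(series, 3)
direct = direct_forecast(series, 3)
```

On this noise-free series both strategies agree. In general the recursive strategy reuses its own forecasts as inputs, while the direct strategy fits each horizon separately from observed values only.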

    Adsorptive uptake of water by semisolid secondary organic aerosols

    Aerosol climate effects are intimately tied to interactions with water. Here we combine hygroscopicity measurements with direct observations of the phase of secondary organic aerosol (SOA) particles to show that water uptake by slightly oxygenated SOA is an adsorption-dominated process under subsaturated conditions, where low solubility inhibits water uptake until the humidity is high enough for dissolution to occur. This reconciles reported discrepancies in previous hygroscopicity closure studies. We demonstrate that the difference in SOA hygroscopic behavior in subsaturated and supersaturated conditions can lead to an effect of up to about 30% in the direct aerosol forcing, highlighting the need to implement correct descriptions of these processes in atmospheric models. Obtaining closure across the water saturation point is therefore a critical issue for accurate climate modeling.