
    Multi-tier framework for the inferential measurement and data-driven modeling

    A framework for inferential measurement and data-driven modeling has been proposed and assessed in several real-world application domains. The architecture of the framework is structured in multiple tiers to facilitate extensibility and the integration of new components. Each of the four proposed tiers has been assessed independently to verify its suitability. The first tier, dealing with exploratory data analysis, has been assessed through the characterization of the chemical space related to the biodegradation of organic chemicals. This analysis established relationships between physicochemical variables and biodegradation rates that were subsequently used for model development. At the preprocessing level, a novel feature selection method based on dissimilarity measures between Self-Organizing Maps (SOM) has been developed and assessed. The proposed method selects more features than others published in the literature but leads to models with improved predictive power. Single and multiple data imputation techniques based on the SOM have also been used to recover missing data in a wastewater treatment plant benchmark. A new dynamic method to adjust the centers and widths of radial basis function networks has been proposed to predict water quality; it outperformed other neural network architectures. The proposed modeling components have also been assessed in the development of prediction and classification models for biodegradation rates in different media. The results demonstrate the suitability of this approach for developing data-driven models when the complex dynamics of the process prevent the formulation of mechanistic models. The use of rule generation algorithms and Bayesian dependency models has been preliminarily screened to provide the framework with interpretation capabilities. Preliminary results from the classification of Modes of Toxic Action (MOA) indicate that using MOAs as proxy indicators of the human health effects of chemicals is a promising approach. Finally, the complete framework has been applied to three different modeling scenarios. A virtual sensor system capable of inferring product quality indices from primary process variables has been developed and assessed. The system was integrated with the control system of a real chemical plant, outperforming the multilinear correlation models usually adopted by chemical manufacturers. A model to predict carcinogenicity from molecular structure for a set of aromatic compounds has been developed and tested; after applying the SOM-dissimilarity feature selection method, it yielded better results than models published in the literature. Finally, the framework has been used to support a new approach to environmental modeling and risk management within geographical information systems (GIS). The SOM has been successfully used to characterize exposure scenarios and to estimate missing data through geographic interpolation. The combination of SOM and Gaussian mixture models enabled the formulation of a new probabilistic risk assessment approach.
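
    As a rough illustration of the SOM-based imputation idea mentioned above, the sketch below fills the missing entries of a record from the prototype that best matches its observed entries. It is only a sketch under stated assumptions: K-means centroids stand in for the SOM codebook described in the thesis, and the data are synthetic.

        # Prototype-based imputation sketch: K-means centroids stand in for a SOM codebook.
        import numpy as np
        from sklearn.cluster import KMeans

        rng = np.random.default_rng(0)
        X_complete = rng.normal(size=(200, 5))          # rows without missing values (synthetic)
        prototypes = KMeans(n_clusters=16, n_init=10, random_state=0).fit(
            X_complete).cluster_centers_                # stand-in for the SOM codebook vectors

        def impute_row(row, prototypes):
            """Fill missing entries from the best-matching prototype,
            matched on the observed entries only."""
            observed = ~np.isnan(row)
            d = np.linalg.norm(prototypes[:, observed] - row[observed], axis=1)
            bmu = prototypes[np.argmin(d)]              # best-matching unit
            filled = row.copy()
            filled[~observed] = bmu[~observed]
            return filled

        row = np.array([0.1, np.nan, -0.3, np.nan, 0.8])
        print(impute_row(row, prototypes))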

    An adaptation of the experiences in close relationships scale-revised for use with children and adolescents

    The investigation of attachment processes during middle childhood and early adolescence has been hampered by a relative lack of measures for this age group that differentiate between the two fundamental attachment dimensions, anxiety and avoidance. The aim of this study is to develop and validate a child version of the Experiences in Close Relationships Scale-Revised (referred to as the ECR-RC), a self-report questionnaire measuring attachment anxiety and avoidance. Two studies were conducted to examine the internal structure (Study 1, N = 514, and Study 2, N = 296) and the construct and predictive validity (Study 2) of the ECR-RC. The ECR-RC appears to be a promising instrument for measuring the two attachment dimensions in middle childhood and early adolescence.

    Integrated system to perform surrogate based aerodynamic optimisation for high-lift airfoil

    This work deals with the aerodynamic optimisation of a generic two-dimensional, three-element high-lift configuration. Although the high-lift system is deployed only during take-off and landing, in the low-speed phase of the flight, the cost efficiency of the airplane is strongly influenced by it [1]. The ultimate goal of an aircraft high-lift system design team is to define the simplest configuration which, for prescribed constraints, will meet the take-off, climb, and landing requirements, usually expressed in terms of maximum L/D and/or maximum CL. The ability of the calculation method to accurately predict changes in objective function value when gaps, overlaps and element deflections are varied is therefore critical. Despite advances in computer capacity, the enormous computational cost of running complex engineering simulations makes it impractical to rely exclusively on simulation for the purpose of design optimisation. To cut down the cost, surrogate models, also known as metamodels, are constructed from and then used in place of the actual simulation models. This work outlines the development of an integrated system to perform aerodynamic multi-objective optimisation for a three-element airfoil test case in high-lift configuration, making use of the surrogate models available in MACROS Generic Tools, which has been integrated into our design tool. Different metamodeling techniques have been compared based on multiple performance criteria. With MACROS it is possible to perform either optimisation of a model built from a predefined training sample (GSO) or iterative surrogate-based optimisation (SBO). In the first case the model is built independently of the optimisation and then used as a black box in the optimisation process. In the second case the CFD code must be callable from the optimisation process, and no model needs to be built beforehand, as it is built internally during the optimisation. Both approaches have been applied. A detailed analysis of the integrated design system and the methods, as well as th
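
    The sketch below illustrates the iterative surrogate-based optimisation (SBO) loop in generic terms. Since MACROS Generic Tools is proprietary, a scikit-learn Gaussian-process regressor stands in for its metamodels, a cheap analytic function stands in for the CFD evaluation of the high-lift airfoil, and the lower-confidence-bound infill rule is an assumption rather than the workflow used in the paper.

        # Generic iterative surrogate-based optimisation (SBO) loop.
        import numpy as np
        from sklearn.gaussian_process import GaussianProcessRegressor

        def cfd_objective(x):                      # placeholder for the expensive CFD run
            return np.sin(3 * x) + 0.5 * x ** 2    # e.g. a cost as a function of one gap setting

        rng = np.random.default_rng(1)
        X = rng.uniform(-2, 2, size=(5, 1))        # initial design of experiments
        y = cfd_objective(X).ravel()

        candidates = np.linspace(-2, 2, 401).reshape(-1, 1)
        for _ in range(15):                        # SBO iterations
            surrogate = GaussianProcessRegressor(alpha=1e-6).fit(X, y)
            mu, sigma = surrogate.predict(candidates, return_std=True)
            x_next = candidates[np.argmin(mu - 1.0 * sigma)]   # lower-confidence-bound infill
            X = np.vstack([X, x_next])
            y = np.append(y, cfd_objective(x_next))

        print("best design:", X[np.argmin(y)], "objective:", y.min())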

    Industrial process monitoring by means of recurrent neural networks and Self Organizing Maps

    Industrial manufacturing plants often suffer from reliability problems during their day-to-day operations, which can greatly impact the effectiveness and performance of the overall process and the sub-processes involved. Time-series forecasting of critical industrial signals offers a way to reduce this impact by extracting knowledge about the internal dynamics of the process and warning of process deviations before they affect production. In this paper, a novel industrial condition monitoring approach is proposed, based on the combination of Self-Organizing Maps for operating point codification and Recurrent Neural Networks for critical signal modeling. The combination of both methods presents a strong synergy: the information on the operating condition given by the interpretation of the maps helps the model to improve generalization, one of the drawbacks of recurrent networks, while assuring high accuracy and precision. Finally, the complete methodology is validated experimentally, in terms of performance and effectiveness, with real data from a copper rod industrial plant.
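
    A minimal sketch of the combined scheme described above: a codification of the operating point (K-means standing in for the SOM) is appended to the inputs of a recurrent network that forecasts a critical signal one step ahead. The data, layer sizes and training budget are illustrative assumptions, not the paper's configuration.

        # Operating-point codification + recurrent forecaster, on synthetic data.
        import numpy as np
        import torch
        import torch.nn as nn
        from sklearn.cluster import KMeans

        rng = np.random.default_rng(0)
        T, n_ops = 500, 3
        op_vars = rng.normal(size=(T, n_ops))                 # operating-point variables
        signal = np.sin(np.linspace(0, 30, T)) + 0.1 * rng.normal(size=T)

        # 1) codify the operating condition (K-means stands in for the SOM codebook index)
        codes = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(op_vars)
        code_onehot = np.eye(4)[codes]

        # 2) build windows of [signal, operating-condition code] to predict the next value
        window = 20
        feats = np.hstack([signal[:, None], code_onehot])
        X = np.stack([feats[t - window:t] for t in range(window, T)])
        y = signal[window:T]
        X_t = torch.tensor(X, dtype=torch.float32)
        y_t = torch.tensor(y, dtype=torch.float32).unsqueeze(1)

        class Forecaster(nn.Module):
            def __init__(self, n_in, hidden=32):
                super().__init__()
                self.rnn = nn.GRU(n_in, hidden, batch_first=True)
                self.head = nn.Linear(hidden, 1)
            def forward(self, x):
                out, _ = self.rnn(x)
                return self.head(out[:, -1])                  # one-step-ahead prediction

        model = Forecaster(feats.shape[1])
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        for epoch in range(50):
            opt.zero_grad()
            loss = nn.functional.mse_loss(model(X_t), y_t)
            loss.backward()
            opt.step()
        print("training MSE:", loss.item())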

    Scalable aggregation predictive analytics: a query-driven machine learning approach

    We introduce a predictive modeling solution that provides high-quality predictive analytics over aggregation queries in Big Data environments. Our predictive methodology is generally applicable in environments in which large-scale data owners may or may not restrict access to their data and allow only aggregation operators like COUNT to be executed over their data. In this context, our methodology is based on historical queries and their answers to accurately predict the answers to ad-hoc queries. We focus on the widely used set-cardinality aggregation query, i.e., COUNT, as COUNT is a fundamental operator both for internal data system optimizations and for aggregation-oriented data exploration and predictive analytics. We contribute a novel, query-driven Machine Learning (ML) model whose goals are to: (i) learn the query-answer space from past issued queries, (ii) associate the query space with local linear regression and associative function estimators, (iii) define query similarity, and (iv) predict the cardinality of the answer set of unseen incoming queries, referred to as the Set Cardinality Prediction (SCP) problem. Our ML model incorporates incremental ML algorithms to ensure high-quality prediction results. The significance of the contribution lies in the fact that it (i) is the only query-driven solution applicable over general Big Data environments, which include restricted-access data, (ii) offers incremental learning adjusted for arriving ad-hoc queries, which is well suited for query-driven data exploration, and (iii) offers performance (in terms of scalability, SCP accuracy, processing time, and memory requirements) superior to data-centric approaches. We provide a comprehensive performance evaluation of our model, evaluating its sensitivity, scalability and efficiency for quality predictive analytics. In addition, we report on the development and incorporation of our ML model in Spark, showing its superior performance compared to Spark's COUNT method.
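
    The sketch below conveys the query-driven idea in miniature: historical COUNT queries and their answers train local linear estimators over a partitioned query space, so unseen queries can be answered without touching the data. The query encoding, the clustering step and the plain linear regressors are simplifying assumptions, not the paper's exact model.

        # Query-driven COUNT prediction: learn from (query, answer) pairs only.
        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.linear_model import LinearRegression

        rng = np.random.default_rng(0)
        data = rng.normal(size=(20_000, 2))        # underlying data, used only to build the training workload

        def run_count_query(center, radius):
            """Ground-truth COUNT of points inside a ball-shaped range query."""
            return int((np.linalg.norm(data - center, axis=1) <= radius).sum())

        # historical workload: query vector [cx, cy, r] and its observed COUNT answer
        Q = np.column_stack([rng.uniform(-2, 2, (1_000, 2)), rng.uniform(0.1, 1.0, 1_000)])
        y = np.array([run_count_query(q[:2], q[2]) for q in Q])

        # partition the query space and fit one local linear estimator per partition
        km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(Q)
        local = {c: LinearRegression().fit(Q[km.labels_ == c], y[km.labels_ == c])
                 for c in range(10)}

        def predict_count(query):
            """Answer an unseen COUNT query from the query-driven model alone."""
            c = km.predict(query.reshape(1, -1))[0]
            return max(0.0, float(local[c].predict(query.reshape(1, -1))[0]))

        q_new = np.array([0.3, -0.5, 0.4])
        print("predicted:", predict_count(q_new), "actual:", run_count_query(q_new[:2], q_new[2]))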

    On the role of pre and post-processing in environmental data mining

    The quality of discovered knowledge depends heavily on data quality. Unfortunately, real data tend to contain noise, uncertainty, errors, redundancies and even irrelevant information. The more complex the reality to be analyzed, the higher the risk of obtaining low-quality data. Knowledge Discovery from Databases (KDD) offers a global framework for preparing data in the right form to perform correct analyses. On the other hand, the quality of decisions taken upon KDD results depends not only on the quality of the results themselves, but also on the capacity of the system to communicate those results in an understandable form. Environmental systems are particularly complex, and environmental users particularly require clarity in their results. This paper provides some details on how this can be achieved and discusses the role of pre- and post-processing in the whole process of Knowledge Discovery in environmental systems.
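
    As a small, generic illustration of the kind of pre-processing discussed here (not the paper's own pipeline), the sketch below fills sensor gaps by interpolation and drops a redundant, highly correlated variable before any mining step; the variables and thresholds are illustrative assumptions.

        # Generic pre-processing sketch: gap filling and redundancy removal.
        import numpy as np
        import pandas as pd

        df = pd.DataFrame(np.random.default_rng(0).normal(size=(300, 4)),
                          columns=["flow", "ph", "ph_dup", "temp"])
        df["ph_dup"] = df["ph"] * 1.01            # a redundant channel
        df.iloc[::20, 0] = np.nan                 # simulated sensor dropouts

        df = df.interpolate(limit_direction="both")          # fill gaps
        corr = df.corr().abs()
        redundant = [c for i, c in enumerate(corr.columns)
                     if any(corr.iloc[i, :i] > 0.95)]        # keep the first of each correlated pair
        clean = df.drop(columns=redundant)
        print("dropped:", redundant)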

    Hydrologic prediction using pattern recognition and soft-computing techniques

    Several studies indicate that data-driven models have proven to be potentially useful tools in hydrological modeling. Nevertheless, it is a common perception among researchers and practitioners that the usefulness of system-theoretic models is limited to forecast applications and that they cannot be used as a tool for scientific investigation. System-theoretic models are also believed to be less reliable, as they characterize hydrological processes by learning the input-output patterns embedded in the dataset rather than from a strong physical understanding of the system. These concerns need to be addressed before data-driven models can gain wider acceptance among researchers and practitioners. In this research, different methods and tools that can promote transparency in data-driven models are probed, with the objective of extending the usefulness of data-driven models beyond forecast applications, as tools for scientific investigation, by providing additional insight into the underlying input-output patterns on which the data-driven models base their decisions. In this regard, the utility of self-organizing networks (competitive learning and self-organizing maps) in learning the patterns in the input space is evaluated by developing a novel neural network model called the spiking modular neural network (SMNN). The performance of the SMNN is evaluated on its ability to characterize streamflows and the actual evapotranspiration process. The utility of a self-organizing algorithm, namely genetic programming (GP), is also evaluated with regard to its ability to promote transparency in data-driven models. The robustness of GP in evolving its own model structure with relevant parameters is illustrated by applying GP to characterize the actual evapotranspiration process. The results from this research indicate that self-organization in learning, both in terms of self-organizing networks and self-organizing algorithms, can be adopted to promote transparency in data-driven models. In pursuit of improving the reliability of data-driven models, different methods for incorporating uncertainty estimates as part of the data-driven model building exercise are evaluated. The local-scale models are shown to be more reliable than the global-scale models in characterizing the saturated hydraulic conductivity of soils. In addition, the importance of model structure uncertainty in geophysical modeling is emphasized by developing a framework to account for it. The contribution of the model structure uncertainty to the predictive uncertainty of the model is shown to be larger than the uncertainty associated with the model parameters. It is also demonstrated that increasing model complexity may lead to a better fit of the function, but at the cost of an increasing level of uncertainty. It is recommended that the effect of model structure uncertainty be considered when developing reliable hydrological models.
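
    The modular idea behind the SMNN can be sketched generically: a self-organizing or competitive layer partitions the input space, and a separate local model handles each partition. In the sketch below, K-means stands in for the competitive layer and ridge regressors for the local modules; the streamflow-like data are synthetic, so this illustrates the principle rather than the SMNN itself.

        # Modular "partition the input space, fit local experts" sketch.
        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.linear_model import Ridge

        rng = np.random.default_rng(0)
        X = rng.uniform(0, 1, size=(1_000, 3))                # e.g. rainfall, temperature, antecedent flow
        y = np.where(X[:, 0] > 0.5,                           # two synthetic hydrologic regimes
                     5 * X[:, 0] + X[:, 2],
                     0.5 * X[:, 0] + 2 * X[:, 1]) + 0.05 * rng.normal(size=1_000)

        km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
        experts = {c: Ridge().fit(X[km.labels_ == c], y[km.labels_ == c]) for c in range(4)}

        def predict(x):
            c = km.predict(x.reshape(1, -1))[0]               # which module is responsible
            return experts[c].predict(x.reshape(1, -1))[0]

        print(predict(np.array([0.8, 0.2, 0.4])))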

    A Protein Sequence-Properties Evaluation Framework for Crystallization Screen Design

    The goal of the research was to develop a Protein-Specific Properties Evaluation (PSPE) framework that would aid in the statistical evaluation of variables for predicting ranges of, and prior probability distributions for, protein crystallization conditions. Development of such a framework is motivated by the rapid growth and evolution of the Protein Data Bank. Features of the framework include: (1) it is an instantiation of the "scientific method" for the framing and testing of hypotheses in an informatics setting, (2) it makes use of hidden variables, and (3) a negative result is still useful. The hidden variables examined in this study are related to the estimated net charge (Q) of the proteins under consideration. Q is a function of the amino acid composition, the solution pH, and the assumed pKa values of the titratable amino acid residues. A protein's size clearly has a significant impact on the magnitude of Q. Therefore, two additional variables were introduced to mitigate this effect: the specific charge (Qbar) and the average surface charge density (sigma). The principal observation is that proteins appear to crystallize at low values of Qbar and sigma. One problem with this observation is that "low" is a relative term, and the frame of reference requires careful examination. The results are sufficiently weak that no prospective predictions appear possible, although information of this type could be included with other weak predictors in a Bayesian prediction scheme. Additional work would be required to establish this; however, that work is beyond the scope of the dissertation. Although many statistically significant correlations among Q-related quantities were noted, no evidence could be developed to suggest they were anything other than those expected from the additional information introduced with the hidden variables. Thus, the principal conclusion of this PSPE analysis is that Qbar, sigma and other Q-related variables are of limited value as prospective predictors of ranges of values of crystallization conditions. Although this is a negative result, it is still useful in that it allows attention to be directed into more productive avenues.
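
    The charge-derived descriptors can be written down directly. The sketch below computes the estimated net charge Q from Henderson-Hasselbalch charges of the titratable residues and termini, the specific charge Qbar as Q per residue, and a crude sigma proxy obtained by assuming the surface area scales as N**(2/3). The pKa set and the example sequence are common textbook assumptions, not necessarily those used in the dissertation.

        # Charge descriptors from sequence: Q, Qbar (specific charge), sigma (surface-density proxy).
        from collections import Counter

        PKA_ACIDIC = {"D": 3.65, "E": 4.25, "C": 8.3, "Y": 10.1, "Cterm": 3.6}
        PKA_BASIC = {"K": 10.5, "R": 12.5, "H": 6.0, "Nterm": 8.0}

        def net_charge(sequence, pH):
            """Estimated net charge Q via Henderson-Hasselbalch fractional charges."""
            counts = Counter(sequence.upper())
            counts["Cterm"] = counts["Nterm"] = 1
            q = 0.0
            for aa, pka in PKA_ACIDIC.items():
                q -= counts.get(aa, 0) / (1.0 + 10.0 ** (pka - pH))
            for aa, pka in PKA_BASIC.items():
                q += counts.get(aa, 0) / (1.0 + 10.0 ** (pH - pka))
            return q

        def charge_descriptors(sequence, pH=7.0):
            n = len(sequence)
            q = net_charge(sequence, pH)
            qbar = q / n                      # specific charge (charge per residue)
            sigma = q / (n ** (2.0 / 3.0))    # crude surface-charge-density proxy
            return q, qbar, sigma

        # arbitrary example sequence, for illustration only
        print(charge_descriptors("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", pH=7.0))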

    Variable selection in multivariate calibration based on clustering of variable concept

    Recently, we have proposed a new variable selection algorithm based on the clustering of variables concept (CLoVA) for classification problems. With the same idea, this concept has been applied to a regression problem and the obtained results have been compared with conventional variable selection strategies for PLS. The basic idea behind the clustering of variables is that the instrument channels are clustered into different clusters via clustering algorithms. Then, the spectral data of each cluster are subjected to PLS regression. Different real data sets (Cargill corn, Biscuit dough, ACE QSAR, Soy, and Tablet) have been used to evaluate the influence of the clustering of variables on the prediction performance of PLS. In almost all cases, the statistical parameters, especially the prediction error, show the superiority of CLoVA-PLS with respect to other variable selection strategies. Finally, synergy clustering of variables (sCLoVA-PLS), which uses combinations of clusters, has been proposed as an efficient modification of the CLoVA algorithm. The obtained statistical parameters indicate that variable clustering can separate the useful part from the redundant one, and that, based on the informative clusters, a stable model can be reached.
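
    A minimal sketch of the clustering-of-variables idea (not the authors' CLoVA-PLS code): spectral channels are clustered by the similarity of their profiles, a PLS model is fitted on each channel cluster, and the cluster with the best cross-validated error is retained. The synthetic data and the choice of K-means for the variable clustering are assumptions.

        # Cluster the spectral variables, then fit a PLS model per cluster of channels.
        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.cross_decomposition import PLSRegression
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(0)
        X = rng.normal(size=(80, 200))                               # samples x spectral channels
        y = X[:, 20:40].mean(axis=1) + 0.05 * rng.normal(size=80)    # informative band

        # cluster the variables (channels) by the similarity of their profiles across samples
        var_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X.T)

        scores = {}
        for c in range(5):
            cols = np.where(var_labels == c)[0]
            pls = PLSRegression(n_components=min(3, len(cols)))
            scores[c] = cross_val_score(pls, X[:, cols], y, cv=5,
                                        scoring="neg_root_mean_squared_error").mean()

        best = max(scores, key=scores.get)
        print("best channel cluster:", best, "CV RMSE:", -scores[best])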