9 research outputs found
Heterogeneous Kohonen networks
A large number of practical problems involve elements that are described by a mixture of qualitative and quantitative information, and whose description is possibly incomplete. The self-organizing map is an effective tool for the visualization of high-dimensional continuous data. In this work, we extend the network and its training algorithm to cope with heterogeneous information as well as missing values. The classification performance on a collection of benchmark data sets is compared across different configurations. Various visualization methods are suggested to help users interpret post-training results.
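The core mechanism for handling missing values in a self-organizing map can be sketched as follows: compare an input to each prototype only on its observed components, and update prototypes only on those components. This is a minimal illustration; the grid size, decay schedules, and function names are assumptions, and the heterogeneous (qualitative/quantitative) distance handling of the actual work is not reproduced here.

```python
import numpy as np

def masked_distance(x, w):
    # Compare only on observed components; NaN marks a missing value.
    mask = ~np.isnan(x)
    if not mask.any():
        return np.inf
    d = x[mask] - w[mask]
    return np.sqrt(np.sum(d * d))

def train_som(data, grid_shape=(5, 5), epochs=20, lr0=0.5, sigma0=2.0, seed=0):
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    dim = data.shape[1]
    weights = rng.random((rows * cols, dim))
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)                 # linearly decaying learning rate
        sigma = sigma0 * (1 - t / epochs) + 1e-3    # shrinking neighbourhood radius
        for x in data:
            # Best-matching unit under the masked distance.
            bmu = min(range(len(weights)),
                      key=lambda i: masked_distance(x, weights[i]))
            # Gaussian neighbourhood on the map grid.
            h = np.exp(-np.sum((coords - coords[bmu]) ** 2, axis=1) / (2 * sigma ** 2))
            obs = ~np.isnan(x)
            # Update only observed components, so missing values never corrupt weights.
            weights[:, obs] += lr * h[:, None] * (x[obs] - weights[:, obs])
    return weights
```

After training, each input can be assigned to its best-matching unit with the same masked distance, which is what makes classification and visualization possible despite incomplete descriptions.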
Non-Direct Encoding Method Based on Cellular Automata to Design Neural Network Architectures
Architecture design is a fundamental step in the successful application of feed-forward neural networks. In most cases a large number of neural network architectures suitable for solving a problem exist, and architecture design is, unfortunately, still a human expert's job. It depends heavily on the expert and on a tedious trial-and-error process. In recent years, much work has focused on the automatic design of neural network architectures. Most of the methods are based on evolutionary computation paradigms. Some of the proposed methods are based on direct representations of the parameters of the network. These representations do not scale well: representing large architectures requires very large structures. More interesting alternatives are indirect schemes, which codify a compact representation of the neural network. In this work, an indirect constructive encoding scheme is proposed. This scheme is based on cellular automata representations and is inspired by the idea that only a few seeds for the initial configuration of a cellular automaton can produce a wide variety of feed-forward neural network architectures. The cellular approach is experimentally validated in different domains and compared with a direct codification scheme.
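The seed-based idea — a few seed cells as the genotype, a cellular automaton evolving them into a grid, and the grid decoded into a network architecture — can be sketched as follows. The growth rule (Game-of-Life-style) and the row-sum decoding are illustrative assumptions, not the paper's actual encoding.

```python
import numpy as np

def evolve(grid, steps):
    """Grow a binary grid with a Game-of-Life-style rule (illustrative choice)."""
    g = grid.copy()
    for _ in range(steps):
        # Count the 8 neighbours of every cell, with wrap-around borders.
        n = sum(np.roll(np.roll(g, dr, axis=0), dc, axis=1)
                for dr in (-1, 0, 1) for dc in (-1, 0, 1)) - g
        g = (((g == 1) & ((n == 2) | (n == 3))) |
             ((g == 0) & (n == 3))).astype(int)
    return g

def decode_architecture(grid, n_inputs, n_outputs):
    """Map an evolved grid to feed-forward layer sizes: each nonzero row
    becomes a hidden layer whose width is the row's number of live cells."""
    hidden = [int(s) for s in grid.sum(axis=1) if s > 0]
    return [n_inputs] + hidden + [n_outputs]

# A handful of seed cells is the whole genotype.
seed = np.zeros((6, 8), dtype=int)
seed[2, 3] = seed[2, 4] = seed[3, 3] = seed[3, 4] = 1  # a 2x2 block of seeds
layers = decode_architecture(evolve(seed, steps=3), n_inputs=10, n_outputs=2)
```

The point of the indirect scheme is visible here: the genotype (four seed coordinates) is far smaller than the phenotype (a full layer-size list), so the representation scales to large architectures.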
Rejection-oriented learning without complete class information
Machine learning is commonly used to support decision-making in numerous, diverse contexts. Its usefulness in this regard is unquestionable: there are complex systems built on top of machine learning techniques whose descriptive and predictive capabilities go far beyond those of human beings. However, these systems still have limitations, whose analysis makes it possible to estimate their applicability and reliability in various cases. This matters because abstaining from providing a response is preferable to making a mistake. In the context of classification-like tasks, the indication of such an inconclusive output is called rejection. The research which culminated in this thesis led to the conception, implementation and evaluation of rejection-oriented learning systems for two distinct tasks: open set recognition and data stream clustering. These systems were derived from the WiSARD artificial neural network, which had rejection modelling incorporated into its functioning. This text details and discusses these realizations. It also presents experimental results which allow assessment of the scientific and practical importance of the proposed methodology.
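The rejection concept can be illustrated with a generic confidence-threshold rule: answer only when the winning class is convincing enough, otherwise abstain. This is a Chow-style reject option for illustration; WiSARD's discriminator-based mechanism is not reproduced here, and the threshold value is an assumption.

```python
import numpy as np

def classify_with_rejection(scores, threshold=0.7):
    """Return the top class index, or None (reject) when the top score
    falls below the confidence threshold. Illustrative reject option,
    not WiSARD's specific mechanism."""
    best = int(np.argmax(scores))
    return best if scores[best] >= threshold else None
```

An open set recognizer can treat the None outcome as "unknown class", while a data stream clusterer can treat it as a candidate for an emerging cluster.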
On the development of decision-making systems based on fuzzy models to assess water quality in rivers
There are many situations where a linguistic description of complex phenomena allows better assessments. It is well known that the assessment of water quality still depends heavily on subjective judgment and interpretation, despite the huge datasets available nowadays. The aim of this study has therefore been to introduce intelligent linguistic operations for analyzing databases and to produce self-interpretable water quality indicators that tolerate both imprecision and linguistic uncertainty. Such imprecision typically reflects the ambiguity of human thinking when perceptions need to be expressed. Environmental management concepts such as "water quality", "level of risk", or "ecological status" are ideally handled as linguistic variables. In the present thesis, the flexibility of computing with words offered by fuzzy logic has been applied to these management issues. Firstly, a multipurpose hierarchical water quality index has been designed using fuzzy reasoning. It integrates a wide set of indicators including organic pollution, nutrients, pathogens, physicochemical macro-variables, and priority micro-contaminants. The relative importance of the water quality indicators has been determined with the analytic hierarchy process, a decision-aiding method. Secondly, a hybrid methodology that combines fuzzy inference systems and artificial neural networks has been used to classify the ecological status of surface waters according to the Water Framework Directive. This methodology has made it possible to deal efficiently with the non-linearity and subjective nature of the variables involved in this classification problem. The complexity of the inference systems, the appropriate choice of linguistic rules, and the influence of the functions that transform numerical variables into linguistic variables have been studied. Thirdly, a concurrent neuro-fuzzy model based on screening-level ecological risk assessment has been developed.
It considers the presence of hazardous substances in rivers and incorporates an innovative ranking and scoring system, based on a self-organizing map, to account for the likely ecological hazards posed by chemical substances in freshwater ecosystems. Hazard factors are combined with environmental concentrations within fuzzy inference systems to compute ecological risk potentials under linguistic uncertainty. The estimation of ecological risk potentials allows the identification of substances requiring stricter controls and further rigorous risk assessment. Likewise, the aggregation of ecological risk potentials by means of empirical cumulative distribution functions has allowed the estimation of changes in water quality over time. The neuro-fuzzy approach has been validated by comparison with biological monitoring. Finally, a hierarchical fuzzy inference system for sediment-based ecological risk assessment has been designed. The study centered on sediments, since they yield findings complementary to water quality analysis, especially when temporal trends are required. Results from chemical and eco-toxicological analyses are used as inputs to two parallel inference systems which assess levels of contamination and toxicity, respectively. The results from both inference engines are then treated in a third inference engine which provides a final risk characterization, where the risk is expressed in linguistic terms with its respective degrees of certitude. Inputs to the risk system are the levels of potentially toxic substances, mainly metals and chlorinated organic compounds, and the toxicity measured with a screening test that uses the photo-luminescent bacterium Vibrio fischeri. The Ebro river basin was selected as the case study, although the methodologies explained here can easily be applied to other rivers.
In conclusion, this study has broadly demonstrated that the design of water quality indices based on fuzzy logic emerges as a suitable alternative tool to support decision makers involved in effective and sustainable river basin management plans.
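A fuzzy water quality index of the kind described — linguistic terms over measured variables, combined by IF-THEN rules — can be sketched with a tiny Mamdani-style system. The membership breakpoints, the two inputs (dissolved oxygen and BOD), and the consequent values are hypothetical; the thesis's actual hierarchical index integrates many more indicators.

```python
def tri(x, a, b, c):
    """Triangular membership function with support (a, c) and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def water_quality_index(do_mgL, bod_mgL):
    # Membership degrees for each linguistic term (hypothetical breakpoints).
    do_good = tri(do_mgL, 5, 9, 13)
    do_poor = tri(do_mgL, -1, 2, 6)
    bod_low = tri(bod_mgL, -1, 1, 4)
    bod_high = tri(bod_mgL, 2, 8, 14)
    # Mamdani-style rules: AND = min; output = weighted centroid of the
    # rule consequents ("good" -> 90, "poor" -> 20 on a 0-100 scale).
    rules = [(min(do_good, bod_low), 90.0),   # IF DO good AND BOD low THEN quality good
             (min(do_poor, bod_high), 20.0)]  # IF DO poor AND BOD high THEN quality poor
    num = sum(w * v for w, v in rules)
    den = sum(w for w, _ in rules)
    return num / den if den else 50.0  # neutral default when no rule fires
```

The output stays interpretable: every score can be traced back to which linguistic rules fired and how strongly, which is precisely the self-interpretability the study aims for.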
Heuristic methods for support vector machines with applications to drug discovery.
The contributions to computer science presented in this thesis were inspired by the analysis of the data generated in the early stages of drug discovery. These data sets are generated by screening compounds against various biological receptors, which gives a first indication of biological activity. To avoid screening inactive compounds, decision rules for selecting compounds are required. Such a decision rule is a mapping from a compound representation to an estimated activity. Hand-coding such rules is time-consuming, expensive and subjective. An alternative is to learn these rules from the available data. This is difficult since the compounds may be characterized by tens to thousands of physical, chemical, and structural descriptors, and it is not known which are most relevant to the prediction of biological activity. Further, the activity measurements are noisy, so the data can be misleading. The support vector machine (SVM) is a statistically well-founded learning machine that is not adversely affected by high-dimensional representations and is robust with respect to measurement inaccuracies. It thus appears to be ideally suited to the analysis of screening data. The novel application of the SVM to this domain highlights some shortcomings of the vanilla SVM. Three heuristics are developed to overcome these deficiencies: a stopping criterion, HERMES, that allows good solutions to be found in less time; an automated method, LAIKA, for tuning the Gaussian kernel SVM; and an algorithm, STAR, that outputs a more compact solution. These heuristics achieve their aims on public domain data and are broadly successful when applied to the drug discovery data. The heuristics and associated data analysis are thus of benefit to both pharmacology and computer science.
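The Gaussian-kernel tuning problem the abstract mentions can be illustrated with the widely used median heuristic for the kernel width. This is a generic rule of thumb, not the thesis's LAIKA procedure, and the function names are assumptions.

```python
import numpy as np

def median_heuristic_gamma(X):
    """Set the Gaussian kernel width to the median pairwise distance
    (a common rule of thumb, not the thesis's LAIKA method), and return
    the corresponding gamma = 1 / (2 * sigma^2)."""
    n = len(X)
    dists = [np.linalg.norm(X[i] - X[j])
             for i in range(n) for j in range(i + 1, n)]
    sigma = np.median(dists)
    return 1.0 / (2.0 * sigma ** 2)

def gaussian_kernel(X, gamma):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))
```

A poorly chosen gamma makes the Gram matrix either nearly the identity (overfitting) or nearly all ones (underfitting); data-driven choices such as this one give an SVM a sensible starting point on high-dimensional descriptor sets.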
Complexity and modeling power of insertion-deletion systems
The central objects of this thesis are insertion-deletion systems and their computational power. More specifically, we study language-generating models that use two string rewriting operations, contextual insertion and contextual deletion, and their extensions. We also consider a distributed variant of insertion-deletion systems in which the rules are separated among a finite number of nodes of a graph. Such systems are referred to as graph-controlled systems. These systems appear in many areas of computer science and play an important role in formal languages, linguistics, and bio-informatics. We vary the parameters of the size vector of insertion-deletion systems and study the decidability/universality of the obtained models. More precisely, we answer the most important question regarding the expressiveness of each computational model: whether or not it is Turing equivalent. We systematically approach the questions about the minimal sizes of insertion-deletion systems with and without graph control.
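The two rewriting operations can be made concrete with a small sketch: an insertion rule (u, x, v) rewrites w1 u v w2 into w1 u x v w2, and a deletion rule (u, x, v) rewrites w1 u x v w2 into w1 u v w2. The breadth-first enumeration below is an illustration of the derivation relation, truncated by word length; the function names are assumptions.

```python
def apply_insertions(word, rules):
    """One-step insertion: a rule (u, x, v) inserts x between contexts u and v."""
    out = set()
    for (u, x, v) in rules:
        ctx = u + v
        for i in range(len(word) - len(ctx) + 1):
            if word[i:i + len(ctx)] == ctx:
                out.add(word[:i + len(u)] + x + word[i + len(u):])
    return out

def apply_deletions(word, rules):
    """One-step deletion: a rule (u, x, v) deletes x between contexts u and v."""
    out = set()
    for (u, x, v) in rules:
        ctx = u + x + v
        for i in range(len(word) - len(ctx) + 1):
            if word[i:i + len(ctx)] == ctx:
                out.add(word[:i + len(u)] + word[i + len(u) + len(x):])
    return out

def generated_language(axiom, ins_rules, del_rules, max_len=6):
    """Breadth-first closure of the derivation relation, truncated by length."""
    seen, frontier = {axiom}, {axiom}
    while frontier:
        nxt = set()
        for w in frontier:
            for w2 in apply_insertions(w, ins_rules) | apply_deletions(w, del_rules):
                if len(w2) <= max_len and w2 not in seen:
                    nxt.add(w2)
        seen |= nxt
        frontier = nxt
    return seen
```

For example, the single insertion rule ("a", "ab", "b") with axiom "ab" generates the non-regular language {a^n b^n : n >= 1}, already hinting at how little machinery these systems need to climb the expressiveness hierarchy the thesis investigates.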