
    Computational representation and discovery of transcription factor binding sites

    Thesis by compendium of publications. Understanding how, when, and where proteins are produced has been one of the major challenges in molecular biology. Studies of the control of gene expression are essential for a better understanding of protein synthesis. Gene regulation is a highly controlled process that starts with DNA transcription. This process operates at the level of genes, the basic hereditary units, which are copied into primary ribonucleic acid (RNA). This first step is controlled by the binding of specific proteins, called Transcription Factors (TF), to DNA (deoxyribonucleic acid) sequences in the regulatory region of the gene. These DNA sequences are known as binding sites (BS). Binding site motifs are usually very short (5 to 20 bp long) and highly degenerate, and such sequences are expected to occur at random every few hundred base pairs. Moreover, a single TF can bind to several different sites. Owing to this high variability, it is difficult to establish a consensus sequence, so the study and identification of binding sites is important to clarify the control of gene expression. Given this importance, projects such as ENCODE (Encyclopedia of DNA Elements) have dedicated considerable effort to mapping the binding sites of large sets of transcription factors in order to identify regulatory regions. Access to genomic sequences and advances in gene expression analysis technologies have also enabled the development of computational methods for motif discovery, and in recent years a large number of algorithms have been applied to motif discovery in prokaryotes and simple eukaryotes. Even in these simple organisms, however, the rate of false positives is high with respect to true positives, so more sensitive methods are needed to study more complex organisms.

    In this thesis, we approach the problem of binding site detection from another angle: we have developed a toolkit for binding motif detection based on linear and non-linear models. Binding sites are first characterized using two approaches: the first is based on the information contained in each binding site position, while the second is based on a covariance model of an aligned set of binding site sequences. From these motif characterizations, we propose a new set of computational methods to detect binding sites. First, a new method was developed based on a parametric uncertainty measure (Rényi entropy). This detection algorithm evaluates the variation in the total Rényi entropy of a set of sequences when a candidate sequence is assumed to be a true binding site belonging to the set. The method was found to perform especially well on transcription factors whose binding sites show no correlation between positions. Correlation between binding site positions was then considered through one linear model, Q-residuals, and two non-linear models, alpha-Divergence and SIGMA. Q-residuals is a novel motif-finding method that constructs a subspace based on the covariance of numerically encoded DNA sequences; when the number of available sequences is small, it is significantly more accurate and faster than all the other methodologies. Alpha-Divergence is based on the variation of the total parametric divergence in a set of aligned sequences with binding evidence when a candidate sequence is added; given an optimal q-value, it behaves better than the other methodologies on most of the transcription factor binding sites studied. Finally, a new computational tool, SIGMA, was developed as a trade-off between the good generalisation properties of pure entropy methods and the ability of position-dependency metrics to improve detection power. In approximately 70% of the cases considered, SIGMA exhibited better performance, at comparable levels of computational resources, than the methods against which it was compared. This set of tools and models for the detection of transcription factor binding sites (TFBS) has been included in an R package called MEET.
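    To make the entropy-based detector concrete, here is a minimal Python sketch (illustrative only, not the MEET implementation; the function names, the order q = 2 default, and the pseudocount are assumptions). It scores a candidate by how much the total per-position Rényi entropy of the aligned binding sites changes when the candidate is assumed to belong to the set: true sites should change it very little.

```python
import numpy as np

ALPHABET = "ACGT"

def renyi_entropy(p, q=2.0):
    """Rényi entropy of order q (q != 1) of a probability vector p, in bits."""
    p = p[p > 0]
    return np.log2(np.sum(p ** q)) / (1.0 - q)

def total_motif_entropy(seqs, q=2.0, pseudocount=1.0):
    """Sum of per-position Rényi entropies over an aligned set of sequences."""
    total = 0.0
    for j in range(len(seqs[0])):
        counts = np.array([sum(s[j] == a for s in seqs) for a in ALPHABET],
                          dtype=float) + pseudocount
        total += renyi_entropy(counts / counts.sum(), q)
    return total

def entropy_variation(known_sites, candidate, q=2.0):
    """Variation in total Rényi entropy when the candidate is assumed to be
    a true binding site belonging to the set (lower = better fit)."""
    return (total_motif_entropy(known_sites + [candidate], q)
            - total_motif_entropy(known_sites, q))

sites = ["TATAAT", "TATGAT", "TATAAT", "TACAAT"]
print(entropy_variation(sites, "TATAAT"))  # small (even negative): fits the motif
print(entropy_variation(sites, "GCCGGC"))  # clearly larger: unlikely to be a site
```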

    Impact of alife simulation of Darwinian and Lamarckian evolutionary theories

    Dissertation presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Information Systems and Technologies Management. To this day, the scientific community has firmly rejected the Theory of Inheritance of Acquired Characteristics, a theory mostly associated with the name of Jean-Baptiste Lamarck (1744-1829). Though largely dismissed when applied to biological organisms, this theory has found its place in a young discipline called Artificial Life. Based on two abstract models of the Darwinian and Lamarckian evolutionary theories built using neural networks and genetic algorithms, this research aims to give a notion of the potential impact of implementing Lamarckian knowledge inheritance across disciplines. To obtain our results, we conducted a focus group discussion among experts in biology, computer science, and philosophy, and used their opinions as qualitative data in our research. Through this procedure, we identified implications of such an implementation in each of the disciplines mentioned. In synthetic biology, it means that we could engineer organisms precisely to our specific needs; at the moment, we can think of better drugs, greener fuels, and dramatic changes in the chemical industry. In computer science, Lamarckian evolutionary algorithms have been used for quite some years, and quite successfully; however, their application in strong ALife can only be approximated based on the existing roadmaps of futurists. In philosophy, creating artificial life seems consistent with nature and even God, if there is one. At the same time, this implementation may contradict the concept of free will, which is defined as the capacity of an agent to make choices whose outcome has not been determined by past events. This study has certain limitations: a larger focus group and better-prepared participants would provide more precise results.
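    The operational difference between the two regimes can be pinned down in a few lines of code. The sketch below is a hypothetical Python toy, not the dissertation's neural-network models: in both arms selection acts on fitness measured after lifetime learning, but only the Lamarckian arm writes the learned genome back into the germ line (the alternative arm, which inherits the unmodified genome, is the Baldwinian reading of Darwinian evolution with learning).

```python
import random

def fitness(genome):
    return sum(genome)  # toy objective: maximize the number of 1-bits

def local_learning(genome, steps=10):
    """Lifetime learning as hill climbing: returns an improved copy."""
    g = list(genome)
    for _ in range(steps):
        trial = list(g)
        i = random.randrange(len(trial))
        trial[i] = 1 - trial[i]
        if fitness(trial) > fitness(g):
            g = trial
    return g

def next_generation(pop, lamarckian, mut_rate=0.02):
    learned = [local_learning(g) for g in pop]
    # Selection uses post-learning (phenotypic) fitness in both regimes.
    ranked = sorted(zip(learned, pop), key=lambda t: fitness(t[0]), reverse=True)
    # Lamarckian: the learned genome is inherited.
    # Darwinian/Baldwinian: only the genome the individual was born with is.
    survivors = [l if lamarckian else b for l, b in ranked[:len(pop) // 2]]
    children = []
    while len(children) < len(pop):
        parent = random.choice(survivors)
        children.append([1 - bit if random.random() < mut_rate else bit
                         for bit in parent])
    return children

pop = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]
for _ in range(50):
    pop = next_generation(pop, lamarckian=True)  # flip to False to compare
print(max(fitness(g) for g in pop))
```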

    Automatic machine learning: methods, systems, challenges


    The role of parallel computing in bioinformatics

    The need to intelligibly capture, manage and analyse the ever-increasing amount of publicly available genomic data is one of the challenges facing bioinformaticians today. Such analyses are in fact impractical on uniprocessor machines, which has led to an increasing reliance on clusters of commodity-priced computers. An existing network of cheap, commodity PCs was utilised as a single computational resource for parallel computing. The performance of the cluster was investigated using a whole-genome-scanning program written in the Java programming language. The TSpaces framework, based on the Linda parallel programming model, was used to parallelise the application. Maximum speedup was achieved at between 30 and 50 processors, depending on the size of the genome being scanned. Together with the associated significant reductions in wall-clock time, this suggests that both parallel computing and Java have a significant role to play in the field of bioinformatics.
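    In the Linda model the master and workers coordinate through a shared tuple space: the master drops task tuples in with out(), workers block on in() to claim them, and results flow back the same way. The Python sketch below mimics that master/worker decomposition for genome scanning, with multiprocessing queues standing in for the tuple space; it is an illustration of the pattern, not the TSpaces Java API used in the study.

```python
from multiprocessing import JoinableQueue, Process, Queue

def scan_chunk(chunk, motif):
    """Count motif occurrences in one chunk (boundary-spanning matches
    are ignored here for brevity)."""
    return sum(chunk[i:i + len(motif)] == motif
               for i in range(len(chunk) - len(motif) + 1))

def worker(tasks, results, motif):
    while True:
        chunk = tasks.get()                    # analogous to Linda's in()
        if chunk is None:                      # poison pill: shut down
            tasks.task_done()
            break
        results.put(scan_chunk(chunk, motif))  # analogous to Linda's out()
        tasks.task_done()

if __name__ == "__main__":
    genome = "ACGT" * 250_000                  # stand-in for real sequence data
    motif, n_workers, chunk_size = "ACGTAC", 8, 100_000
    tasks, results = JoinableQueue(), Queue()
    pool = [Process(target=worker, args=(tasks, results, motif))
            for _ in range(n_workers)]
    for p in pool:
        p.start()
    n_chunks = 0
    for i in range(0, len(genome), chunk_size):  # master emits task tuples
        tasks.put(genome[i:i + chunk_size])
        n_chunks += 1
    for _ in pool:
        tasks.put(None)
    total = sum(results.get() for _ in range(n_chunks))
    tasks.join()
    for p in pool:
        p.join()
    print("matches:", total)
```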

    Temporal Information in Data Science: An Integrated Framework and its Applications

    Data science is a well-known buzzword that is in fact composed of two distinct keywords, i.e., data and science. Data itself is of great importance: each analysis task begins from a set of examples. Based on such a consideration, the present work starts with the analysis of a real case scenario, namely the development of a data warehouse-based decision support system for an Italian contact center company. Then, relying on the information collected in the developed system, a set of machine learning-based analysis tasks was developed to answer specific business questions, such as employee work anomaly detection and automatic call classification. Although these initial applications rely on already available algorithms, as we shall see, some clever analysis workflows also had to be developed. Afterwards, continuously driven by real data and real-world applications, we turned to the question of how to handle temporal information within classical decision tree models. Our research led to the development of J48SS, a decision tree induction algorithm based on Quinlan's C4.5 learner, which is capable of dealing with temporal (e.g., sequential and time series) as well as atemporal (such as numerical and categorical) data during the same execution cycle. The decision tree has been applied to several real-world analysis tasks, proving its worth. A key characteristic of J48SS is its interpretability, an aspect that we specifically addressed through the study of an evolutionary decision tree pruning technique. Next, since a lot of work concerning the management of temporal information has already been done in the automated reasoning and formal verification fields, a natural direction in which to proceed was to investigate how such solutions may be combined with machine learning, following two main tracks. First, we show, through the development of an enriched decision tree capable of encoding temporal information by means of interval temporal logic formulas, how a machine learning algorithm can successfully exploit temporal logic to perform data analysis. Then, we focus on the opposite direction, i.e., employing machine learning techniques to generate temporal logic formulas, considering a natural language processing scenario. Finally, as a conclusive development, the architecture of a system is proposed in which formal methods and machine learning techniques are seamlessly combined to perform anomaly detection and predictive maintenance tasks. Such an integration represents an original, thrilling research direction that may open up new ways of dealing with complex, real-world problems.
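    J48SS handles time series natively inside the C4.5 induction cycle. As a point of comparison, the minimal Python sketch below shows the simpler two-step alternative it is meant to subsume: each series is first summarized into scalar features, and a standard tree is then grown on the combined temporal and atemporal attributes (the data and feature choices are invented for illustration).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 200
# Toy call records: one numeric and one categorical attribute per call,
# plus a 60-step activity time series, labelled with a call class.
duration = rng.uniform(10, 600, n)                 # numeric attribute
queue_id = rng.integers(0, 3, n)                   # categorical (integer-coded)
series = rng.normal(0, 1, (n, 60)).cumsum(axis=1)  # time series per call
labels = (series[:, -1] > 0).astype(int)           # class depends on the series

# Two-step approach: summarize each series into scalars, then let one
# tree consume temporal and atemporal features together.
ts_features = np.column_stack([series.mean(axis=1),
                               series.std(axis=1),
                               series[:, -1] - series[:, 0]])  # overall trend
X = np.column_stack([duration, queue_id, ts_features])

clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, labels)
print("training accuracy:", clf.score(X, labels))
```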

    Supporting the design of sequences of cumulative activities impacting on multiple areas through a data mining approach : application to design of cognitive rehabilitation programs for traumatic brain injury patients

    Traumatic brain injury (TBI) is a leading cause of disability worldwide. It is the most common cause of death and disability during the first three decades of life and accounts for more productive years of life lost than cancer, cardiovascular disease and HIV/AIDS combined. Cognitive Rehabilitation (CR), as part of Neurorehabilitation, aims to reduce the cognitive deficits caused by TBI. CR treatment consists of sequentially organized tasks that require repetitive use of impaired cognitive functions. While task repetition is not the only important feature, it is becoming clear that neuroplastic change and functional improvement occur only after a number of specific tasks are performed in a certain order and with certain repetitions, and do not occur otherwise. Until now, there has been a significant lack of well-established criteria and field experience by which to identify the right number and order of tasks to propose to each individual patient. This thesis proposes the CMIS methodology to support health professionals in composing CR programs by selecting the most promising tasks in the right order. Two contributions to this topic were developed for specific steps of CMIS through innovative data mining techniques: the SAIMAP and NRRMR methodologies. SAIMAP (Sequences of Activities Improving Multi-Area Performance) proposes an innovative combination of data mining techniques in a hybrid generic methodological framework to find sequential patterns in a predefined set of activities and to associate them with multi-criteria improvement indicators regarding a predefined set of areas targeted by the activities. It combines data and prior knowledge with preprocessing, clustering, motif discovery and class post-processing to understand the effects of a sequence of activities on the targeted areas, provided that these activities have strong interactions and cumulative effects. Furthermore, this work introduces and defines the Neurorehabilitation Range (NRR) concept to determine the degree of performance expected for a CR task and the number of repetitions required to produce maximum rehabilitation effects on the individual. An operationalization of NRR is proposed by means of a visualization tool called SAP (Sectorized and Annotated Plane), introduced to identify areas where there is a high probability of a target event occurring. Three approaches to SAP are defined, implemented, applied to, and validated on a real case: Vis-SAP, DT-SAP and FT-SAP. Finally, the NRRMR (Neurorehabilitation Range Maximal Regions) problem is introduced as a generalization of the Maximal Empty Rectangle (MER) problem to identify maximal NRR over an FT-SAP. Combined in the CMIS methodology, these contributions permit the identification of a convenient pattern for a CR program (by means of a regular expression) and its instantiation as a real sequence of tasks in NRR that maximizes the expected improvement of patients, thus supporting the creation of CR plans. First, SAIMAP provides the general structure of successful CR sequences: the length of the sequence and the kind of task recommended at every position (attention, memory, or executive function tasks). Next, NRRMR provides task-specific information to help decide which particular task is placed at each position in the sequence, the number of repetitions, and the expected range of results to maximize improvement after treatment. From the Artificial Intelligence point of view, the proposed methodologies are general enough to be applied to similar problems in which a sequence of interconnected activities with cumulative effects is used to impact a set of areas of interest: for example, spinal cord injury patients following a physical rehabilitation program, elderly patients facing age-related cognitive decline through cognitive stimulation programs, or educational settings seeking the best way to combine mathematical drills in a program for a specific Mathematics course.
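    NRRMR generalizes the classical Maximal Empty Rectangle (MER) problem to the annotated plane produced by FT-SAP. As a reference point for the underlying geometry, here is a deliberately brute-force MER sketch in Python; the grid encoding and containment test are assumptions for illustration, and none of the NRR-specific criteria from the thesis are modelled.

```python
import numpy as np

def maximal_empty_rectangles(grid):
    """Enumerate axis-aligned all-empty rectangles in a binary grid
    (1 = obstacle) and keep those not contained in a larger empty one.
    Brute force, so only suitable for small illustrative grids."""
    rows, cols = grid.shape
    empties = [(t, l, b, r)
               for t in range(rows) for l in range(cols)
               for b in range(t, rows) for r in range(l, cols)
               if not grid[t:b + 1, l:r + 1].any()]
    def contains(a, b):
        return (a[0] <= b[0] and a[1] <= b[1] and
                a[2] >= b[2] and a[3] >= b[3])
    return [r for r in empties
            if not any(s != r and contains(s, r) for s in empties)]

grid = np.array([[0, 0, 1, 0],
                 [0, 0, 0, 0],
                 [1, 0, 0, 0]])
for rect in maximal_empty_rectangles(grid):
    print(rect)  # (top, left, bottom, right), inclusive grid coordinates
```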