19 research outputs found

    Analysis of Large-Scale Structural Changes in Proteins with focus on the Recovery Stroke Mechanism of Myosin II

    The mechanisms through which proteins achieve their functional three-dimensional structure starting from a string of amino acids, as well as the manner in which the interactions between different structural elements are orchestrated to mediate function, are largely unknown, despite the large amount of data accumulated from theoretical and experimental studies. One clear view emerging from these studies is that function is a result of intrinsic protein dynamics and flexibility, namely the motions of well-defined structural elements and their ability to change position and shape in space to allow the large conformational transitions necessary for function. Simulation techniques have been used increasingly in recent years in the endeavour to solve the structure-function puzzle, as they have proven to be powerful tools for investigating the dynamics of proteins. However, extracting useful dynamical information from the trajectories thus generated in order to draw functionally relevant conclusions is not always straightforward, especially when the protein function involves concerted movements of entire protein domains. This is due to the high dimensionality of the energy surface the proteins can explore. A decrease in complexity is therefore desirable and can be achieved in principle by reducing the number of dimensions to the ones capturing only the dominant motions of the protein. To this end, this thesis applies and compares two dimensionality-reducing techniques, Principal Component Analysis and Sammon mapping, on four proteins that undergo conformational changes of different amplitudes and mechanisms. In particular, the present thesis tackles the large conformational change occurring during the recovery stroke of myosin, using these methods and rigidity analysis algorithms in an attempt to elucidate in atomic detail the structural mechanism underlying the function of this protein, which couples ATP hydrolysis to the mechanical force needed to achieve muscle contraction. The results presented in this thesis show the successful applicability of certain dimensionality-reducing methods to large conformational changes and their suitability for analyzing and dissecting dynamical transitions in computationally generated trajectories. The findings regarding the recovery stroke step of the myosin cycle are consistent with experimental data from mutational studies and confirm the previously postulated communication mechanism between the active sites of the protein, thus representing a major contribution to the field of molecular motors and strong evidence of the importance of theoretical studies in complementing experimental investigations.
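    The thesis applies Principal Component Analysis to computationally generated trajectories to isolate the dominant collective motions. As a rough illustration of that general idea only (not the thesis's actual code or data), the following sketch projects a toy set of aligned conformations, flattened to coordinate vectors, onto their first two principal components; the array sizes and variable names are assumptions made for the example.

        # Minimal PCA sketch: project trajectory frames onto the two dominant motions.
        # Assumes `frames` is an (n_frames, n_atoms * 3) array of aligned coordinates.
        import numpy as np

        rng = np.random.default_rng(0)
        frames = rng.normal(size=(200, 3 * 50))   # placeholder for real MD frames

        mean = frames.mean(axis=0)
        centered = frames - mean                  # remove the average structure
        cov = np.cov(centered, rowvar=False)      # covariance of atomic fluctuations
        eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
        order = np.argsort(eigvals)[::-1]         # sort modes by variance explained
        pcs = eigvecs[:, order[:2]]               # two dominant collective modes
        projection = centered @ pcs               # (n_frames, 2) low-dimensional view
        print(projection.shape)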

    A general modeling and visualization tool for comparing different members of a group: application to studying tau-mediated regulation of microtubule dynamics

    Background: Innumerable biological investigations require comparing collections of molecules, cells or organisms to one another with respect to one or more of their properties. Almost all of these comparisons are performed manually, which can be susceptible to inadvertent bias as well as miss subtle effects. The development and application of computer-assisted analytical and interpretive tools could help address these issues and thereby dramatically improve these investigations.
    Results: We have developed novel computer-assisted analytical and interpretive tools and applied them to recent studies examining the ability of 3-repeat and 4-repeat tau to regulate the dynamic behavior of microtubules in vitro. More specifically, we have developed an automated and objective method to define growth, shortening and attenuation events from real time videos of dynamic microtubules, and demonstrated its validity by comparing it to manually assessed data. Additionally, we have used the same data to develop a general strategy of building different models of interest, computing appropriate dissimilarity functions to compare them, and embedding them on a two-dimensional plot for visualization and easy comparison. Application of these methods to assess microtubule growth rates and growth rate distributions established the validity of the embedding procedure and revealed non-linearity in the relationship between the tau:tubulin molar ratio and growth rate distribution.
    Conclusion: This work addresses the need of the biological community for rigorously quantitative and generally applicable computational tools for comparative studies. The two-dimensional embedding method retains the inherent structure of the data, and yet markedly simplifies comparison between models and parameters of different samples. Most notably, even in cases where numerous parameters exist by which to compare the different samples, our embedding procedure provides a generally applicable computational strategy to detect subtle relationships between different molecules or conditions that might otherwise escape manual analyses.
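    One generic way to realize the "compute a dissimilarity, then embed in two dimensions" strategy described above is sketched below; the paper's specific dissimilarity functions and embedding method are not reproduced here, so the sketch uses a simple L1 histogram distance between toy growth-rate samples and classical multidimensional scaling as stand-ins. All data and parameter choices are illustrative assumptions.

        # Sketch: compare growth-rate distributions via an L1 histogram distance,
        # then embed the resulting dissimilarity matrix in 2D with classical MDS.
        import numpy as np

        def histogram_distance(a, b, bins):
            ha, _ = np.histogram(a, bins=bins, density=True)
            hb, _ = np.histogram(b, bins=bins, density=True)
            return np.abs(ha - hb).sum()

        def classical_mds(d, dim=2):
            n = d.shape[0]
            j = np.eye(n) - np.ones((n, n)) / n     # centering matrix
            b = -0.5 * j @ (d ** 2) @ j             # double-centered squared distances
            vals, vecs = np.linalg.eigh(b)
            idx = np.argsort(vals)[::-1][:dim]
            return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))

        rng = np.random.default_rng(1)
        samples = [rng.normal(loc=mu, scale=0.3, size=100) for mu in (1.0, 1.2, 2.0)]
        bins = np.linspace(0, 3, 31)
        d = np.array([[histogram_distance(a, b, bins) for b in samples] for a in samples])
        print(classical_mds(d))                     # 2-D coordinates for comparison plots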

    DATA VISUALIZATION OF ASYMMETRIC DATA USING SAMMON MAPPING AND APPLICATIONS OF SELF-ORGANIZING MAPS

    Data visualization can be used to detect hidden structures and patterns in the data sets found in data mining applications. Although efficient data visualization algorithms for handling data sets with asymmetric proximities have been proposed, we develop an improved algorithm in this dissertation. In the first part, we develop a modified Sammon mapping approach that uses the upper triangular part and the lower triangular part of an asymmetric distance matrix simultaneously. Our proposed approach is applied to two asymmetric data sets: an American college selection data set, and a Canadian college selection data set that contains rank information. Compared to other approaches used in practice, our modified approach generates visual maps that have smaller distance errors and provide more reasonable representations of the data sets. In data visualization, self-organizing maps (SOM) have been used to cluster points. In the second part, we assess the performance of several software implementations of SOM-based methods. Viscovery SOMine is found to be helpful in determining the number of clusters and recovering the cluster structure of data sets. A genocide and politicide data set is analyzed using Viscovery SOMine, followed by another analysis of the public and private college data sets with the goal of identifying the schools that offer the best value.
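    The dissertation's exact objective function is not given in this abstract. As a sketch of the general idea of using both triangular parts of an asymmetric distance matrix simultaneously, the code below minimizes a Sammon-style stress summed over all ordered pairs (i, j), so that D[i, j] and D[j, i] both contribute; the toy distance matrix, optimizer and weighting are illustrative assumptions rather than the dissertation's method.

        # Sketch: a Sammon-style stress over an asymmetric distance matrix D,
        # where D[i, j] need not equal D[j, i].
        import numpy as np
        from scipy.optimize import minimize
        from scipy.spatial.distance import pdist, squareform

        def asymmetric_sammon_stress(flat_y, d):
            y = flat_y.reshape(d.shape[0], 2)
            e = squareform(pdist(y))                  # symmetric embedding distances
            mask = ~np.eye(d.shape[0], dtype=bool)    # every ordered pair i != j
            return np.sum((d[mask] - e[mask]) ** 2 / d[mask]) / d[mask].sum()

        rng = np.random.default_rng(2)
        d = rng.uniform(0.5, 2.0, size=(8, 8))        # toy asymmetric distances
        np.fill_diagonal(d, 0.0)
        y0 = rng.normal(size=8 * 2)                   # random initial 2-D layout
        result = minimize(asymmetric_sammon_stress, y0, args=(d,), method="L-BFGS-B")
        coords = result.x.reshape(8, 2)               # 2-D visual map coordinates
        print(coords)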

    Ranked centroid projection: A data visualization approach based on self-organizing maps

    The Self-Organizing Map (SOM) is an unsupervised neural network model that provides a topology-preserving mapping from high-dimensional input spaces onto a commonly two-dimensional output space. In this study, the clustering and visualization capabilities of the SOM, especially in the analysis of textual data, i.e. document collections, are reviewed and further developed. A novel clustering and visualization approach based on the SOM is proposed for the task of text data mining. The proposed approach first transforms the document space into a multi-dimensional vector space by means of document encoding. Then a growing hierarchical SOM (GHSOM) is trained and used as a baseline framework, which automatically produces maps with various levels of detail. Following the training of the GHSOM, a novel projection method, namely the Ranked Centroid Projection (RCP), is applied to project the input vectors onto a hierarchy of two-dimensional output maps. The projection of an input vector is treated as a vector interpolation into a two-dimensional regular map grid. A ranking scheme is introduced to select the nearest R units around the input vector in the original data space, whose positions are taken into account in computing the projection coordinates. The proposed approach can be used both as a data analysis tool and as a direct interface to the data. Its applicability has been demonstrated in this study using an illustrative data set and two real-world document clustering tasks, i.e. the SOM paper collection and the Anthrax paper collection. Based on the proposed approach, a software toolbox is designed for analyzing and visualizing document collections, which provides a user-friendly interface and several exploration and analysis functions. The presented SOM-based approach incorporates several unique features, such as the adaptive structure, the hierarchical training, the automatic parameter adjustment and the incremental clustering. Its advantages include the ability to convey a large amount of information in a limited space with comparatively low computational load, the potential to reveal conceptual relationships among documents, and the facilitation of perceptual inferences on both inter-cluster and within-cluster relationships.
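    The abstract describes the RCP as a rank-based interpolation of an input vector onto the map grid using its R nearest units in the original data space. A minimal sketch of that idea follows, assuming a flat SOM codebook and a simple 1/rank weighting; the paper's actual weighting scheme, GHSOM hierarchy and parameters are not given here, so these choices are assumptions.

        # Sketch: place an input vector on the 2-D map grid as a rank-weighted
        # average of the grid positions of its R nearest SOM units.
        import numpy as np

        def ranked_centroid_projection(x, codebook, grid_positions, r=3):
            dists = np.linalg.norm(codebook - x, axis=1)   # distances to all map units
            nearest = np.argsort(dists)[:r]                # indices of the R closest units
            weights = 1.0 / np.arange(1, r + 1)            # rank-based weights: 1, 1/2, 1/3, ...
            weights /= weights.sum()
            return weights @ grid_positions[nearest]       # interpolated (row, col) coordinate

        # Toy 4x4 map with 5-dimensional codebook vectors.
        rng = np.random.default_rng(3)
        codebook = rng.normal(size=(16, 5))
        grid_positions = np.array([(i, j) for i in range(4) for j in range(4)], dtype=float)
        print(ranked_centroid_projection(rng.normal(size=5), codebook, grid_positions))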

    CLASSIFIERS BASED ON A NEW APPROACH TO ESTIMATE THE FISHER SUBSPACE AND THEIR APPLICATIONS

    In this thesis we propose a novel classifier, and extensions of it, based on a new estimation of the Fisher subspace. The proposed classifiers have been developed to deal with high-dimensional, highly unbalanced datasets of low cardinality. The efficacy of the proposed techniques is demonstrated by the results achieved on real and synthetic datasets and by comparison with state-of-the-art predictors.
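    The abstract does not describe the proposed estimator itself. Purely as a reference point for what "Fisher subspace" denotes, the sketch below computes the classical Fisher discriminant subspace from between-class and within-class scatter matrices; it is not the thesis's novel estimation, and the toy data are assumptions.

        # Reference sketch: classical Fisher subspace (directions maximizing
        # between-class over within-class scatter), not the thesis's estimator.
        import numpy as np

        def fisher_subspace(x, labels, dim):
            classes = np.unique(labels)
            overall_mean = x.mean(axis=0)
            d = x.shape[1]
            s_w = np.zeros((d, d))                       # within-class scatter
            s_b = np.zeros((d, d))                       # between-class scatter
            for c in classes:
                xc = x[labels == c]
                mc = xc.mean(axis=0)
                s_w += (xc - mc).T @ (xc - mc)
                diff = (mc - overall_mean)[:, None]
                s_b += xc.shape[0] * diff @ diff.T
            # Solve S_b v = lambda S_w v via a pseudo-inverse for robustness.
            vals, vecs = np.linalg.eig(np.linalg.pinv(s_w) @ s_b)
            order = np.argsort(vals.real)[::-1]
            return vecs[:, order[:dim]].real

        rng = np.random.default_rng(4)
        x = np.vstack([rng.normal(0, 1, (20, 6)), rng.normal(2, 1, (20, 6))])
        labels = np.array([0] * 20 + [1] * 20)
        print(fisher_subspace(x, labels, dim=1).shape)   # (6, 1)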

    Efficient feature reduction and classification methods

    The growing amount of data available in a wide range of application areas significantly increases the effort required by many data mining applications. High-dimensional data in particular (data described by many different attributes) can pose a major problem for many data mining applications. Besides longer runtimes, this can cause further complications for both supervised and unsupervised classification algorithms (e.g., poor classification accuracy or poor clustering behavior). This creates a need for effective and efficient methods for dimensionality reduction. Feature selection (selecting a subset of the original attributes) and dimensionality reduction (transforming the original attributes into (linear) combinations of the original attributes) are two important approaches to reducing the dimensionality of data. Although many studies have addressed these methods in recent years, there are still many open questions in this research field, and the ever-increasing number of available and used attributes and features continually raises new problems in many application areas. The goal of this dissertation is to analyze several of these questions in detail and to develop improvements. In general, feature selection and dimensionality reduction methods are expected to be computationally efficient, and the resulting feature sets should represent the original data as compactly as possible. Moreover, in many application areas it is important to retain the interpretability of the original data. Finally, the process of dimensionality reduction should not have a negative effect on classification accuracy; ideally, it should even improve it. Open problems in this area concern, among other things, the relationship between dimensionality reduction methods and the resulting classification accuracy, where both a compact representation of the data and high classification accuracy should be achieved. As already mentioned, the large amount of data also increases the computational effort, which is why fast and effective methods for dimensionality reduction must be developed or existing methods improved; in addition, the computational cost of the classification methods used should of course be as low as possible. Furthermore, while feature sets remain interpretable when feature selection methods are used, the feature sets resulting from dimensionality reduction are usually linear combinations of the original features, so it is difficult to assess how much information individual original features contribute. This dissertation presents important contributions to the problems mentioned above: new, efficient initialization variants for the dimensionality reduction method nonnegative matrix factorization (NMF) were developed, which lead to a faster reduction of the approximation error than randomized initialization and state-of-the-art initialization methods.
    These initialization variants can moreover be combined with newly developed and very effective NMF-based classification algorithms. To further improve the runtime of NMF, different variants of NMF algorithms for multi-processor systems are presented, which support both task and data parallelism and lead to a considerable reduction of the NMF runtime. In addition, an effective improvement of the Matlab implementation of the ALS algorithm is presented. Furthermore, a technique from information retrieval, latent semantic indexing (LSI), is successfully applied as a classification algorithm for email data. Finally, a comprehensive empirical study of the relationship between different feature reduction methods (feature selection and dimensionality reduction) and the resulting classification accuracy of different learning algorithms is presented. The strong influence of different dimensionality reduction methods on the resulting classification accuracy underlines that further investigations are necessary to analyze the complex interplay of dimensionality reduction and classification in detail.
    The sheer volume of data today and its expected growth over the next years are some of the key challenges in data mining and knowledge discovery applications. Besides the huge number of data samples that are collected and processed, the high dimensional nature of data arising in many applications causes the need to develop effective and efficient techniques that are able to deal with this massive amount of data. In addition to the significant increase in the demand of computational resources, those large datasets might also influence the quality of several data mining applications (especially if the number of features is very high compared to the number of samples). As the dimensionality of data increases, many types of data analysis and classification problems become significantly harder. This can lead to problems for both supervised and unsupervised learning. Dimensionality reduction and feature (subset) selection methods are two types of techniques for reducing the attribute space. While in feature selection a subset of the original attributes is extracted, dimensionality reduction in general produces linear combinations of the original attribute set. In both approaches, the goal is to select a low dimensional subset of the attribute space that covers most of the information of the original data. During the last years, feature selection and dimensionality reduction techniques have become a real prerequisite for data mining applications. There are several open questions in this research field, and due to the often increasing number of candidate features for various application areas (e.g., email filtering or drug classification/molecular modeling) new questions arise. In this thesis, we focus on some open research questions in this context, such as the relationship between feature reduction techniques and the resulting classification accuracy and the relationship between the variability captured in the linear combinations of dimensionality reduction techniques (e.g., PCA, SVD) and the accuracy of machine learning algorithms operating on them.
Another important goal is to better understand new techniques for dimensionality reduction, such as nonnegative matrix factorization (NMF), which can be applied for finding parts-based, linear representations of nonnegative data. This "sum-of-parts" representation is especially useful if the interpretability of the original data should be retained. Moreover, performance aspects of feature reduction algorithms are investigated. As data grow, implementations of feature selection and dimensionality reduction techniques for high-performance parallel and distributed computing environments become more and more important. In this thesis, we focus on two types of open research questions: methodological advances without any specific application context, and application-driven advances for a specific application context. Summarizing, the new methodological contributions are the following: The utilization of nonnegative matrix factorization in the context of classification methods is investigated. In particular, it is of interest how the improved interpretability of NMF factors due to the non-negativity constraints (which is of central importance in various problem settings) can be exploited. Motivated by this problem context, two new fast initialization techniques for NMF based on feature selection are introduced. It is shown how approximation accuracy can be increased and/or how computational effort can be reduced compared to standard randomized seeding of the NMF and to state-of-the-art initialization strategies suggested earlier. For example, for a given number of iterations and a required approximation error, a speedup of 3.6 over standard initialization and a speedup of 3.4 over state-of-the-art initialization strategies could be achieved. Beyond that, novel classification methods based on NMF are proposed and investigated. We can show that they are not only competitive in terms of classification accuracy with state-of-the-art classifiers, but also provide important advantages in terms of computational effort (especially for low-rank approximations). Moreover, parallelization and distributed execution of NMF are investigated. Several algorithmic variants for efficiently computing NMF on multi-core systems are studied and compared to each other. In particular, several approaches for exploiting task and/or data-parallelism in NMF are studied. We show that for some scenarios new algorithmic variants clearly outperform existing implementations. Last, but not least, a computationally very efficient adaptation of the implementation of the ALS algorithm in Matlab 2009a is investigated. This variant reduces the runtime significantly (in some settings by a factor of 8) and also offers several possibilities for concurrent execution. In addition to purely methodological questions, we also address questions arising in the adaptation of feature selection and classification methods to two specific application problems: email classification and in silico screening for drug discovery. Different research challenges arise in the contexts of these different application areas, such as the dynamic nature of data for email classification problems, or the imbalance in the number of available samples of different classes for drug discovery problems. Application-driven advances of this thesis comprise the adaptation and application of latent semantic indexing (LSI) to the task of email filtering.
Experimental results show that LSI achieves significantly better classification results than the widespread de-facto standard method for this special application context. In the context of drug discovery problems, several groups of well-discriminating descriptors could be identified by utilizing the "sum-of-parts" representation of NMF. The number of important descriptors could be further increased when applying sparseness constraints on the NMF factors.
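    The abstract repeatedly refers to the nonnegative factorization V ≈ WH and its "sum-of-parts" interpretation. As a generic illustration only (the thesis's feature-selection-based initializations, parallel variants and ALS improvements are not reproduced), the sketch below runs plain multiplicative-update NMF on a toy nonnegative matrix.

        # Sketch: basic multiplicative-update NMF, V ~ W H with W, H >= 0.
        import numpy as np

        def nmf(v, rank, iterations=200, eps=1e-9):
            rng = np.random.default_rng(5)
            w = rng.uniform(size=(v.shape[0], rank))     # nonnegative basis ("parts")
            h = rng.uniform(size=(rank, v.shape[1]))     # nonnegative coefficients
            for _ in range(iterations):
                h *= (w.T @ v) / (w.T @ w @ h + eps)     # multiplicative updates keep
                w *= (v @ h.T) / (w @ h @ h.T + eps)     # both factors nonnegative
            return w, h

        v = np.abs(np.random.default_rng(6).normal(size=(30, 20)))
        w, h = nmf(v, rank=5)
        print(np.linalg.norm(v - w @ h))                 # approximation error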

    Novel neural approaches to data topology analysis and telemedicine

    The abstract is available in the attachment.

    Feature Selection and Classifier Development for Radio Frequency Device Identification

    The proliferation of simple and low-cost devices, such as IEEE 802.15.4 ZigBee and Z-Wave, in Critical Infrastructure (CI) increases security concerns. Radio Frequency Distinct Native Attribute (RF-DNA) Fingerprinting facilitates biometric-like identification of electronic devices from hardware-induced variances in their emissions. Developing reliable classifier models using RF-DNA fingerprints is thus important for device discrimination, enabling reliable Device Classification (a one-to-many "looks most like" assessment) and Device ID Verification (a one-to-one "looks how much like" assessment). AFIT's prior RF-DNA work focused on Multiple Discriminant Analysis/Maximum Likelihood (MDA/ML) and Generalized Relevance Learning Vector Quantized Improved (GRLVQI) classifiers. This work 1) introduces a new GRLVQI-Distance (GRLVQI-D) classifier that extends prior GRLVQI work by supporting alternative distance measures, 2) formalizes a framework for selecting competing distance measures for GRLVQI-D, 3) introduces response surface methods for optimizing GRLVQI and GRLVQI-D algorithm settings, 4) develops an MDA-based Loadings Fusion (MLF) Dimensional Reduction Analysis (DRA) method for improved classifier-based feature selection, 5) introduces the F-test as a DRA method for RF-DNA fingerprints, 6) provides a phenomenological understanding of test statistics and p-values, with KS-test and F-test statistic values being superior to p-values for DRA, and 7) introduces quantitative dimensionality assessment methods for DRA subset selection.
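    Item 5 above names the F-test as a DRA (feature-ranking) method for RF-DNA fingerprints, and item 6 notes that F-test statistic values outperform p-values for that purpose. As a sketch of that general idea only (not AFIT's actual pipeline), the code below ranks toy fingerprint features by a one-way ANOVA F-statistic across device classes and keeps the top k; all sizes and the value of k are assumptions.

        # Sketch: rank RF-DNA-style fingerprint features by F-statistic and keep top k.
        import numpy as np
        from scipy.stats import f_oneway

        def f_test_feature_ranking(fingerprints, device_labels, k):
            classes = np.unique(device_labels)
            f_stats = np.array([
                f_oneway(*(fingerprints[device_labels == c, j] for c in classes)).statistic
                for j in range(fingerprints.shape[1])
            ])
            return np.argsort(f_stats)[::-1][:k]    # indices of the k most discriminating features

        rng = np.random.default_rng(7)
        fingerprints = rng.normal(size=(90, 40))    # 90 emissions x 40 fingerprint features
        device_labels = np.repeat([0, 1, 2], 30)    # three devices
        fingerprints[device_labels == 2, :5] += 1.5 # make a few features discriminative
        print(f_test_feature_ranking(fingerprints, device_labels, k=5))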

    Nanostructural Materials with Rare Earth Ions: Synthesis, Physicochemical Characterization, Modification and Applications

    This Special Issue of "Nanostructural Materials with Rare Earth Ions: Synthesis, Physicochemical Characterization, Modification and Applications" is related to studies of nanometer-sized materials doped and co-doped with rare earth ions and the creation of periodically ordered nanostructures based on single nanoparticles. A small particle size implies high sensitivity and selectivity. These new effects and possibilities are mainly due to the quantum effects resulting from the increasing ratio of surface-to-volume atoms in low-dimensional systems. An important factor in this context is the design and fabrication of nanocomponents displaying new functionalities and characteristics for the improvement of existing materials, including photonic materials, conductive materials, polymers and biocomposites. With this concept in mind, the aim of the Special Issue is to publish research on innovative materials and their applications. Topics to be covered in this Special Issue include, but are not limited to, the following: Technology and applications of nanomaterials with rare earth ions; Advanced physicochemical properties, characterization and modification of nanomaterials with rare earth ions; Novel active materials, especially organic and inorganic materials, nanocrystalline materials, nanoceramics doped and co-doped with rare-earth ions with bio-related and emerging applications; Magnetic properties of nano-sized rare-earth compounds; Applications of nano-sized rare-earth-doped and co-doped optical materials.