4,767 research outputs found

    Laplacian Mixture Modeling for Network Analysis and Unsupervised Learning on Graphs

    Laplacian mixture models identify overlapping regions of influence in unlabeled graph and network data in a scalable and computationally efficient way, yielding useful low-dimensional representations. By combining Laplacian eigenspace and finite mixture modeling methods, they provide probabilistic or fuzzy dimensionality reductions or domain decompositions for a variety of input data types, including mixture distributions, feature vectors, and graphs or networks. Provable optimal recovery using the algorithm is analytically shown for a nontrivial class of cluster graphs. Heuristic approximations for scalable high-performance implementations are described and empirically tested. Connections to PageRank and community detection in network analysis demonstrate the wide applicability of this approach. The origins of fuzzy spectral methods, beginning with generalized heat or diffusion equations in physics, are reviewed and summarized. Comparisons to other dimensionality reduction and clustering methods for challenging unsupervised machine learning problems are also discussed. (13 figures, 35 references)
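
    As a rough illustration of the general recipe (Laplacian eigenspace embedding followed by finite mixture modeling), the sketch below embeds a small graph via the eigenvectors of its normalized Laplacian and fits a Gaussian mixture to obtain soft memberships. The example graph, the number of clusters, and the Gaussian mixture choice are illustrative assumptions, not the paper's exact algorithm.

```python
# Minimal sketch: spectral embedding + mixture model -> fuzzy node memberships.
import numpy as np
import networkx as nx
from sklearn.mixture import GaussianMixture

G = nx.karate_club_graph()                      # example graph (assumption)
L = nx.normalized_laplacian_matrix(G).toarray() # normalized graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)            # eigenvalues in ascending order

k = 2                                           # assumed number of clusters
X = eigvecs[:, 1:k + 1]                         # skip the trivial eigenvector

gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
memberships = gmm.predict_proba(X)              # soft membership per node
print(memberships[:5])                          # rows sum to 1
```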

    An overview of clustering methods with guidelines for application in mental health research

    Cluster analyses have been widely used in mental health research to decompose inter-individual heterogeneity by identifying more homogeneous subgroups of individuals. However, despite advances in new algorithms and increasing popularity, there is little guidance on model choice, analytical framework and reporting requirements. In this paper, we aimed to address this gap by introducing the philosophy, design, advantages/disadvantages and implementation of major algorithms that are particularly relevant in mental health research. Extensions of basic models, such as kernel methods, deep learning, semi-supervised clustering, and clustering ensembles are subsequently introduced. How to choose algorithms to address common issues as well as methods for pre-clustering data processing, clustering evaluation and validation are then discussed. Importantly, we also provide general guidance on clustering workflow and reporting requirements. To facilitate the implementation of different algorithms, we provide information on R functions and libraries.
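
    As a hedged sketch of the kind of workflow such guidelines describe (pre-clustering data processing, model fitting, internal validation), the snippet below standardizes features, fits k-means over a range of cluster counts, and selects one by silhouette score. The synthetic data and the silhouette criterion are illustrative assumptions, not the paper's recommendation.

```python
# Minimal clustering workflow: preprocess -> fit over candidate k -> validate.
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # toy data
X = StandardScaler().fit_transform(X)        # pre-clustering processing

scores = {}
for k in range(2, 8):                        # candidate numbers of clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # internal validation index

best_k = max(scores, key=scores.get)
print(f"best k by silhouette: {best_k}")
```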

    Developing DNS Tools to Study Channel Flow Over Realistic Plaque Morphology

    In a normal coronary artery, the flow is laminar and the velocity profile is parabolic. Over time, plaques deposit along the artery wall, narrowing the artery and creating an obstruction, a stenosis. As the stenosis grows, the characteristics of the flow change and transition occurs, resulting in turbulent flow distal to the stenosis. To date, direct numerical simulation (DNS) of turbulent flow has been performed in a number of studies to understand how stenosis modifies flow dynamics. However, the effect of the actual shape and size of the obstruction has been disregarded in these DNS studies. An ideal approach is to obtain geometrical information of the stenotic channel using medical imaging methods such as IVUS (intravascular ultrasound) and couple it with numerical solvers that simulate the flow in the stenotic channel. The purpose of the present thesis is to demonstrate the feasibility of coupling the IVUS geometry with a DNS solver. This preliminary research provides the necessary tools toward the long-term goal of developing a framework for assessing the effect of the morphological features of the stenosis on flow modifications in a diseased coronary artery. In the present study, the geometrical information of the stenotic plaque was provided by the medical imaging team at the Cleveland Clinic Foundation for 42 patients who underwent IVUS. The integration of the plaque geometry with the DNS was performed in three stages: 1) a fuzzy logic scheme was used to group the 42 patients into categories, 2) a meshing algorithm was developed to interface with the DNS solver, and 3) the existing DNS solver for channel flow was modified to account for inhomogeneity in the streamwise direction. A plaque classification system was developed using statistical k-means clustering with fuzzy logic. Four distinct morphological categories were found in the plaque measurements obtained from the 42 patients. Patients were then assigned a degree of membership in each category based on a fuzzy evaluation system. Flow simulations showed distinct turbulent flow characteristics when comparing the four categories, and similar characteristics within each category. An existing DNS solver that used the fourth-order velocity, second-order vorticity formulation of the Navier-Stokes equations was modified to account for inhomogeneity in the streamwise direction. A multigrid method was implemented, using Green's method to compute unknown boundary conditions at the walls via an influence matrix approach. The inflow is the free-stream laminar flow condition; the outflow is computed explicitly with a buffer domain and by parabolizing the Navier-Stokes equations. The transitional flow solver was tested using blowing and suction disturbances at the wall to generate the Tollmien-Schlichting waves predicted by linear stability theory. The toolset developed as part of this thesis demonstrates the feasibility of integrating realistic geometry with DNS. This tool can be used for patient-specific simulation of stenotic flow in coronary and carotid arteries. Additionally, within the field of fluid dynamics, this framework will contribute to the understanding of transition and turbulence in stenotic flows.
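
    The plaque classification step above (k-means centroids plus fuzzy degrees of membership) can be sketched as follows. The synthetic feature matrix, the four categories, and the fuzzifier m = 2 are assumptions for illustration, and the membership formula used here is the standard fuzzy c-means one rather than the thesis's exact evaluation system.

```python
# Minimal sketch: hard k-means centroids, then fuzzy membership per patient.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(42, 5))       # 42 patients x 5 morphology features

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(features)
d = np.linalg.norm(features[:, None, :] - km.cluster_centers_[None, :, :],
                   axis=2)                # patient-to-centroid distances
d = np.maximum(d, 1e-12)                  # guard against division by zero

m = 2.0                                   # fuzzifier (assumption)
u = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1)), axis=2)
print(u[0], u[0].sum())                   # memberships for patient 0 sum to 1
```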

    A ‘fuzzy clustering’ approach to conceptual confusion: how to classify natural ecological associations

    The concept of the marine ecological community has recently experienced renewed attention, mainly owing to a shift in conservation policies from targeting single and specific objectives (e.g. species) towards more integrated approaches. Despite the value of communities as distinct entities, e.g. for conservation purposes, there is still an ongoing debate on the nature of species associations. They are seen either as communities, cohesive units of non-randomly associated and interacting members, or as assemblages, groups of species that are randomly associated. We investigated this dualism using fuzzy logic applied to a large dataset in the German Bight (south-eastern North Sea). Fuzzy logic provides the flexibility needed to describe complex patterns of natural systems. By assigning objects to more than one class, it enables the depiction of transitions, avoiding the rigid division into communities or assemblages. We therefore identified areas with either structured or random species associations and mapped boundaries between communities or assemblages in this more natural way. We then described the impact of the chosen sampling design on community identification. Four communities, their core areas and probability of occurrence were identified in the German Bight: AMPHIURA-FILIFORMIS, BATHYPOREIA-TELLINA, GONIADELLA-SPISULA, and PHORONIS. They were assessed by estimating overlap and compactness and supported by analysis of beta-diversity. Overall, 62% of the study area was characterized by high species turnover and instability. These areas are very relevant for conservation issues, but become undetectable when studies choose sampling designs with little information or at small spatial scales.
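
    A minimal numpy sketch of fuzzy c-means, the style of fuzzy clustering used above, is given below: each sampling station receives a graded membership in every association rather than one hard label, so transitional areas show up as low maximum membership. The random abundance matrix, four clusters, and fuzzifier are illustrative assumptions.

```python
# Minimal fuzzy c-means: alternate weighted centroids and membership updates.
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((200, 10))            # 200 stations x 10 species abundances
c, m, n_iter = 4, 2.0, 100           # clusters, fuzzifier, iterations

U = rng.random((200, c))
U /= U.sum(axis=1, keepdims=True)    # random initial memberships
for _ in range(n_iter):
    W = U ** m
    centers = (W.T @ X) / W.sum(axis=0)[:, None]           # weighted means
    d = np.linalg.norm(X[:, None, :] - centers[None], axis=2).clip(1e-12)
    U = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1)), axis=2)

print(U.max(axis=1)[:5])  # near 1 = "core" station, low = transitional area
```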

    Unsupervised multiple kernel learning approaches for integrating molecular cancer patient data

    Cancer is the second leading cause of death worldwide. A characteristic of this disease is its complexity, leading to a wide variety of genetic and molecular aberrations in the tumors. This heterogeneity necessitates personalized therapies for the patients. However, currently defined cancer subtypes used in clinical practice for treatment decision-making are based on relatively few selected markers and thus provide only a coarse classification of tumors. The increasing availability of multi-omics data measured for cancer patients now offers the possibility of defining more informed cancer subtypes. Such a more fine-grained characterization of cancer subtypes harbors the potential of substantially expanding treatment options in personalized cancer therapy. In this thesis, we identify comprehensive cancer subtypes using multidimensional data. For this purpose, we apply and extend unsupervised multiple kernel learning methods. Three challenges of unsupervised multiple kernel learning are addressed: robustness, applicability, and interpretability. First, we show that regularization of the multiple kernel graph embedding framework, which enables the implementation of dimensionality reduction techniques, can increase the stability of the resulting patient subgroups. This improvement is especially beneficial for data sets with a small number of samples. Second, we adapt the objective function of kernel principal component analysis to enable the application of multiple kernel learning in combination with this widely used dimensionality reduction technique. Third, we improve the interpretability of kernel learning procedures by performing feature clustering prior to integrating the data via multiple kernel learning. On the basis of these clusters, we derive a score indicating the impact of a feature cluster on a patient cluster, thereby facilitating further analysis of the cluster-specific biological properties. All three procedures are successfully tested on real-world cancer data. Comparing our newly derived methodologies to established methods provides evidence that our work offers novel and beneficial ways of identifying patient subgroups and gaining insights into medically relevant characteristics of cancer subtypes.
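
    A minimal sketch of the unsupervised multiple-kernel idea described above: build one kernel per omics data type, combine them, and run kernel PCA on the combined kernel to embed patients. The uniform kernel weights and random data are illustrative assumptions; the thesis's methods learn the combination rather than fixing it.

```python
# Minimal sketch: per-data-type kernels -> combined kernel -> kernel PCA.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
expr = rng.normal(size=(100, 500))   # e.g. gene expression, 100 patients
meth = rng.normal(size=(100, 300))   # e.g. methylation, same patients

# Uniform weights are a placeholder; MKL methods would learn these.
K = 0.5 * rbf_kernel(expr) + 0.5 * rbf_kernel(meth)

kpca = KernelPCA(n_components=2, kernel="precomputed")
embedding = kpca.fit_transform(K)    # low-dimensional patient representation
print(embedding.shape)               # (100, 2), ready for clustering
```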

    A survey on pre-processing techniques: relevant issues in the context of environmental data mining

    One of the important issues in all types of data analysis, whether statistical data analysis, machine learning, data mining, data science or any other form of data-driven modeling, is data quality. The more complex the reality to be analyzed, the higher the risk of getting low-quality data. Unfortunately, real data often contain noise, uncertainty, errors, redundancies or even irrelevant information. Useless models will be obtained when built over incorrect or incomplete data. As a consequence, the quality of decisions made over these models also depends on data quality. This is why pre-processing is one of the most critical steps of data analysis in any of its forms. However, pre-processing has not yet been properly systematized, and little research focuses on it. In this paper, a survey of the most popular pre-processing steps required in environmental data analysis is presented, together with a proposal to systematize the process. Rather than providing technical details on specific pre-processing techniques, the paper focuses on providing general ideas to a non-expert user, who, after reading them, can decide which technique is the most suitable for his/her problem.
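
    As a hedged sketch of a typical pre-processing chain of the kind the survey systematizes (missing-value imputation, outlier handling, standardization), the snippet below strings three common steps together. The synthetic data and the IQR-based outlier rule are illustrative assumptions, not recommendations from the paper.

```python
# Minimal pre-processing chain: impute -> drop outlier rows -> standardize.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[rng.random(X.shape) < 0.05] = np.nan                   # ~5% missing values

X = SimpleImputer(strategy="median").fit_transform(X)    # imputation

q1, q3 = np.percentile(X, [25, 75], axis=0)
iqr = q3 - q1                                            # interquartile range
mask = ((X >= q1 - 1.5 * iqr) & (X <= q3 + 1.5 * iqr)).all(axis=1)
X = X[mask]                                              # remove outlier rows

X = StandardScaler().fit_transform(X)                    # standardization
print(X.shape)
```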

    Investigating Vector Space Embeddings for Database Schema Management

    Text generation in the area of natural language processing, as part of the artificial intelligence field, has been greatly improving over the last several years. Here we examine the application of vector space word embeddings to provide additional information and context during the text generation process, as a way to improve the resultant output, through the lens of database normalization. It is known that words encoded into vector space that are closer together in distance generally share meaning or have some semantic or symbolic relationship. This knowledge, paired with the known ability of recurrent neural networks to learn sequences, is used to examine how vectorizing words can benefit text generation. While the majority of database normalization has been automated, the naming of the generated normalized tables has not. This work seeks to use word embeddings, generated from the data columns of a database table, to give context to a recurrent neural network model while it learns to generate database table names. Using real-world data, a recurrent neural network-based model is paired with a context vector made of word embeddings to observe how effective word embeddings are at providing additional context information during the learning and generation processes. Several methods for generating the context vector are examined, such as how the word embeddings are generated and how they are combined. The exploration of these methods yielded very promising results in line with the overall goals of the work. Incorporating word embeddings to supply additional information during the text generation process allows for better learning, with the goal of generating more human-useful names for newly normalized database tables from their data column titles.
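
    One plausible reading of the context-vector construction is sketched below: the word embeddings of a table's column-name tokens are mean-pooled into a single fixed-size vector that can condition a recurrent decoder at each step. The toy embedding table and the mean-pooling choice are illustrative assumptions; the thesis compares several generation and combination methods.

```python
# Minimal sketch: column names -> token embeddings -> mean-pooled context vector.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["customer", "id", "order", "date", "total"]
emb = {w: rng.normal(size=16) for w in vocab}   # toy 16-d embeddings

columns = ["customer_id", "order_date", "order_total"]
tokens = [t for col in columns for t in col.split("_")]

context = np.mean([emb[t] for t in tokens], axis=0)  # mean-pooled context
print(context.shape)  # (16,) -- would be fed to the RNN during decoding
```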

    Classical gully spatial identification and slope stability modeling using high-resolution elevation and data mining technique

    It is widely known that soil erosion is an issue of concern for soil and water quality, affecting agriculture and natural resources. Thus, scientific efforts must take advantage of high-resolution elevation datasets in order to implement a precision conservation approach effectively. New advances such as LiDAR products have provided a basic source of information enabling researchers to identify small erosional landscape features, yet methods to identify gullies automatically from these data are still lacking. To fill this gap, this study developed a methodology based on data mining of hydrologic and topographic attributes, associated with concentrated flow path identification, to distinguish classic gully side walls and bed areas. For a 0.91 km2 region of the Keigley Branch-South Skunk River watershed, an area with gullies, we computed profile curvature, mean slope deviation, stream power index, and aspect, gridded in 1-m pixels from the Iowa LiDAR project. The CLARA (Clustering LARge Applications) algorithm, an unsupervised clustering approach, was employed on 913,495 points, splitting the dataset into six groups, a number in agreement with the within-group sum of squared error (WSS) statistical technique. In addition, a new threshold criterion termed gully concentrated flow (GCF), based upon the data distribution of flow accumulation and mean overall slope, was introduced to produce polylines that identified the main hydrographic flow paths, corresponding to the gully beds. Cluster #6 was classified as gully side walls. After distinguishing gully and cliff areas among points belonging to cluster 6, all six gullies were satisfactorily identified. The proposed methodology improves on existing techniques because it identifies distinct parts of gullies, including side walls and bed zones. Another important concept is assessing gully slope stability in order to generate useful information for precision conservation planning. Although the limit-equilibrium concept has been used widely in rock mechanics, its application to precision conservation structures is relatively new. This study evaluated two multi-temporal surveys in a Western Iowa gullied area under the approach of soil stability for precision conservation practice. The study computed the factor of safety (FS) over the gully area, including the headcut and gully side walls, using digital elevation models derived from surveys conducted in 1999 and 2014. Outcomes of this assessment revealed significantly less instability of the current slopes compared to the 1999 survey slopes. According to the sensitivity analysis, the internal friction angle (θ) had the largest effect on the slope stability factor (S.D.1999 = 0.18, S.D.2014 = 0.24), compared to variations of soil cohesion, failure plane angle and slab thickness. In addition, critically unstable slopes within the gully, identified using units of the slope standard deviation as a threshold, covered areas of 61 m2 and 396 m2 for thresholds of one and two slope standard deviations, respectively. The majority of these critical areas were located near the headcut and at the border of the side walls. Based on the current literature, the combination of processed material (geotextile) and crop cover with high root density might be an alternative for improving slope stability, but empirical tests are necessary to validate this approach. Nevertheless, slope instability assessment must include other factors that capture the dynamics of failure.
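
    The factor-of-safety computation can be illustrated with the classical infinite-slope limit-equilibrium form, sketched below, where FS < 1 flags likely failure. All soil parameter values here are illustrative assumptions, not the thesis's measured properties, and the thesis's exact failure-plane formulation may differ.

```python
# Minimal sketch: infinite-slope limit equilibrium, FS = resisting / driving.
import numpy as np

def factor_of_safety(slope_deg, cohesion, friction_deg, unit_weight, depth,
                     pore_pressure=0.0):
    """FS = (c + (gamma*z*cos^2(beta) - u) * tan(phi)) / (gamma*z*sin(beta)*cos(beta))."""
    beta = np.radians(slope_deg)     # failure plane angle
    phi = np.radians(friction_deg)   # internal friction angle
    normal = unit_weight * depth * np.cos(beta) ** 2 - pore_pressure
    resisting = cohesion + normal * np.tan(phi)
    driving = unit_weight * depth * np.sin(beta) * np.cos(beta)
    return resisting / driving

# Assumed example: 35 deg wall, c = 5 kPa, phi = 30 deg, gamma = 18 kN/m3, z = 2 m
print(factor_of_safety(35.0, 5.0, 30.0, 18.0, 2.0))  # ~1.1, marginally stable
```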