172 research outputs found

    Linear, Deterministic, and Order-Invariant Initialization Methods for the K-Means Clustering Algorithm

    Over the past five decades, k-means has become the clustering algorithm of choice in many application domains, primarily due to its simplicity, time/space efficiency, and invariance to the ordering of the data points. Unfortunately, the algorithm's sensitivity to the initial selection of the cluster centers remains its most serious drawback. Numerous initialization methods have been proposed to address this drawback. Many of these methods, however, have time complexity superlinear in the number of data points, which makes them impractical for large data sets. On the other hand, linear methods are often random and/or sensitive to the ordering of the data points. These methods are generally unreliable in that the quality of their results is unpredictable. Therefore, it is common practice to perform multiple runs of such methods and take the output of the run that produces the best results. Such a practice, however, greatly increases the computational requirements of the otherwise highly efficient k-means algorithm. In this chapter, we investigate the empirical performance of six linear, deterministic (non-random), and order-invariant k-means initialization methods on a large and diverse collection of data sets from the UCI Machine Learning Repository. The results demonstrate that two relatively unknown hierarchical initialization methods due to Su and Dy outperform the remaining four methods with respect to two objective effectiveness criteria. In addition, a recent method due to Erisoglu et al. performs surprisingly poorly. Comment: 21 pages, 2 figures, 5 tables, Partitional Clustering Algorithms (Springer, 2014). arXiv admin note: substantial text overlap with arXiv:1304.7465, arXiv:1209.196

    Preventive interventions in families with parental depression: children’s psychosocial symptoms and prosocial behaviour

    The aim is to document the effectiveness of a preventive family intervention (Family Talk Intervention, FTI) and a brief psychoeducational discussion with parents (Let’s Talk about the Children, LT) on children’s psychosocial symptoms and prosocial behaviour in families with parental mood disorder, when the interventions are practiced in psychiatric services for adults in the Finnish national health service. Patients with mood disorder were invited to participate with their families. Consenting families were randomized to the two intervention groups. The initial sample comprised 119 families and their children aged 8–16. Of these, 109 completed the interventions and the baseline evaluation. Mothers and fathers filled out questionnaires including standardized rating scales for children’s symptoms and prosocial behaviour at baseline and at 4, 10 and 18 months post-intervention. The final sample consisted of parental reports on 149 children with 83 complete data sets. Both interventions were effective in decreasing children’s emotional symptoms, anxiety, and marginally hyperactivity and in improving children’s prosocial behaviour. The FTI was more effective than the LT on emotional symptoms, particularly immediately after the intervention, while the effect of the LT emerged after a longer interval. The study supports the effectiveness of both interventions in families with depressed parents. The FTI is applicable in cultural settings other than the USA. Our findings provide support for including preventive child mental health measures as part of psychiatric services for mentally ill parents.

    Comparative Coastal Risk Index (CCRI): A multidisciplinary risk index for Latin America and the Caribbean

    As the world's population grows to a projected 11.2 billion by 2100, the number of people living in low-lying areas exposed to coastal hazards is projected to increase. Critical infrastructure and valuable assets continue to be placed in vulnerable areas, and in recent years, millions of people have been displaced by natural hazards. Impacts from coastal hazards depend on the number of people, value of assets, and presence of critical resources in harm's way. Risks related to natural hazards are determined by a complex interaction between physical hazards, the vulnerability of a society or social-ecological system and its exposure to such hazards. Moreover, these risks are amplified by challenging socioeconomic dynamics, including poorly planned urban development, income inequality, and poverty. This study employs a combination of machine learning clustering techniques (Self Organizing Maps and K-Means) and a spatial index to assess coastal risks in Latin America and the Caribbean (LAC) on a comparative scale. The proposed method meets multiple objectives, including the identification of hotspots and key drivers of coastal risk, and the ability to process large-volume multidimensional and multivariate datasets, effectively reducing sixteen variables related to coastal hazards, geographic exposure, and socioeconomic vulnerability into a single index. Our results demonstrate that in LAC, more than 500,000 people live in areas where coastal hazards, exposure (of people, assets and ecosystems) and poverty converge, creating the ideal conditions for a perfect storm. Hotspot locations of coastal risk, identified by the proposed Comparative Coastal Risk Index (CCRI), contain more than 300,000 people and include: El Oro, Ecuador; Sinaloa, Mexico; Usulutan, El Salvador; and Chiapas, Mexico. Our results provide important insights into potential adaptation alternatives that could reduce the impacts of future hazards. Effective adaptation options must not only focus on developing coastal defenses, but also on improving practices and policies related to urban development, agricultural land use, and conservation, as well as ameliorating socioeconomic conditions.

    Quantifying unpredictability: A multiple-model approach based on satellite imagery data from Mediterranean ponds.

    Fluctuations in environmental parameters are increasingly being recognized as essential features of any habitat. Quantifying whether environmental fluctuations are predominantly predictable or unpredictable is highly relevant to understanding the evolutionary responses of organisms. However, when characterizing the relevant features of natural habitats, ecologists typically face two problems: (1) gathering long-term data and (2) handling the hard-won data. This paper takes advantage of the free access to long-term recordings of remote sensing data (27 years, Landsat TM/ETM+) to assess a set of environmental models for estimating environmental predictability. The case study included 20 Mediterranean saline ponds and lakes, and the focal variable was the water-surface area. This study first aimed to produce a method for accurately estimating the water-surface area from satellite images. Saline ponds can develop salt-crusted areas that make it difficult to distinguish between soil and water. This challenge was addressed using a novel pipeline that combines band ratio water indices and the short near-infrared band as a salt filter. The study then extracted the predictable and unpredictable components of variation in the water-surface area. Two different approaches, each showing variations in the parameters, were used to obtain the stochastic variation around a regular pattern with the objective of dissecting the effect of assumptions on predictability estimations. The first approach, which is based on Colwell's predictability metrics, transforms the focal variable into a nominal one. The resulting discrete categories define the relevant variations in the water-surface area. In the second approach, we introduced Generalized Additive Model (GAM) fitting as a new metric for quantifying predictability. Both approaches produced a wide range of predictability for the studied ponds. Some model assumptions, which are considered very different a priori, had minor effects, whereas others produced predictability estimations that showed some degree of divergence. We hypothesize that these diverging estimations of predictability reflect the effect of fluctuations on different types of organisms. The fluctuation analysis described in this manuscript is applicable to a wide variety of systems, including both aquatic and nonaquatic systems, and will be valuable for quantifying and characterizing predictability, which is essential given the expected global increase in the unpredictability of environmental fluctuations. We advocate that a priori information for organisms of interest should be used to select the most suitable metrics for estimating predictability, and we provide some guidelines for this approach.

    Genome-wide characterization of simple sequence repeats in cucumber (Cucumis sativus L.)

    Background: Cucumber (Cucumis sativus L.) is an important vegetable crop worldwide. Until very recently, cucumber genetic and genomic resources, especially molecular markers, have been very limited, impeding progress of cucumber breeding efforts. Microsatellites are short tandemly repeated DNA sequences, which are frequently favored as genetic markers due to their high level of polymorphism and codominant inheritance. Data from previously characterized genomes have shown that these repeats vary in frequency, motif sequence, and genomic location across taxa. During the last year, the genomes of two cucumber genotypes were sequenced, including the Chinese fresh market type inbred line '9930' and the North American pickling type inbred line 'Gy14'. These sequences provide a powerful tool for developing markers on a large scale. In this study, we surveyed and characterized the distribution and frequency of perfect microsatellites in 203 Mbp assembled Gy14 DNA sequences, representing 55% of its nuclear genome, and in cucumber EST sequences. Similar analyses were performed in genomic and EST data from seven other plant species, and the results were compared with those of cucumber. Results: A total of 112,073 perfect repeats were detected in the Gy14 cucumber genome sequence, accounting for 0.9% of the assembled Gy14 genome, with an overall density of 551.9 SSRs/Mbp. While tetranucleotides were the most frequent microsatellites in genomic DNA sequence, dinucleotide repeats, which had more repeat units than any other SSR type, had the highest cumulative sequence length. Coding regions (ESTs) of the cucumber genome had fewer microsatellites compared to its genomic sequence, with trinucleotides predominating in EST sequences. AAG was the most frequent repeat in cucumber ESTs. Overall, AT-rich motifs prevailed in both genomic and EST data. Compared to the other species examined, cucumber genomic sequence had the highest density of SSRs (although comparable to the density of poplar, grapevine and rice), and was richest in AT dinucleotides. Using an electronic PCR strategy, we investigated the polymorphism between 9930 and Gy14 at 1,006 SSR loci, and found an unexpectedly high degree of polymorphism (48.3%) between the two genotypes. The level of polymorphism seems to be positively associated with the number of repeat units in the microsatellite. The in silico PCR results were validated empirically in 660 of the 1,006 SSR loci. In addition, primer sequences for more than 83,000 newly discovered cucumber microsatellites, and their exact positions in the Gy14 genome assembly, were made publicly available. Conclusions: The cucumber genome is rich in microsatellites; AT and AAG are the most abundant repeat motifs in genomic and EST sequences of cucumber, respectively. Considering all the species investigated, some commonalities were noted, especially within the monocot and dicot groups, although the distribution of motifs and the frequency of certain repeats were characteristic of the species examined. The large number of SSR markers developed from this study should be a significant contribution to the cucurbit research community.

    Transforming Growth Factor β Receptor Type 1 Is Essential for Female Reproductive Tract Integrity and Function

    The transforming growth factor β (TGFβ) superfamily proteins are principal regulators of numerous biological functions. Although recent studies have yielded tremendous insights into this growth factor family in female reproduction, the functions of the receptors in vivo remain poorly defined. TGFβ type 1 receptor (TGFBR1), also known as activin receptor-like kinase 5, is the major type 1 receptor for TGFβ ligands. Tgfbr1 null mice die embryonically, precluding functional characterization of TGFBR1 postnatally. To study TGFBR1-mediated signaling in female reproduction, we generated a mouse model with conditional knockout (cKO) of Tgfbr1 in the female reproductive tract using anti-Müllerian hormone receptor type 2 promoter-driven Cre recombinase. We found that Tgfbr1 cKO females are sterile. However, unlike its role in growth differentiation factor 9 (GDF9) signaling in vitro, TGFBR1 seems to be dispensable for GDF9 signaling in vivo. Strikingly, we discovered that the Tgfbr1 cKO females develop oviductal diverticula, which impair embryo development and transit of embryos to the uterus. Molecular analysis further demonstrated the dysregulation of several cell differentiation and migration genes (e.g., Krt12, Ace2, and MyoR) that are potentially associated with female reproductive tract development. Moreover, defective smooth muscle development was also revealed in the uteri of the Tgfbr1 cKO mice. Thus, TGFBR1 is required for female reproductive tract integrity and function, and disruption of TGFBR1-mediated signaling leads to catastrophic structural and functional consequences in the oviduct and uterus.

    Are one or two simple questions sufficient to detect depression in cancer and palliative care? A Bayesian meta-analysis

    The purpose of this study is to examine the value of one or two simple verbal questions in the detection of depression in cancer settings. This study is a systematic literature search of abstract and full text databases to January 2008. Key authors were contacted for unpublished studies. Seventeen analyses were found. Of these, 13 were conducted in late stage palliative settings. (1) Single depression question: across nine studies, the prevalence of depression was 16%. A single ‘depression’ question enabled the detection of depression in 160 out of 223 true cases, a sensitivity of 72%, and correctly reassured 964 out of 1166 non-depressed cancer sufferers, a specificity of 83%. The positive predictive value (PPV) was 44% and the negative predictive value (NPV) 94%. (2) Single interest question: there were only three studies examining the ‘loss-of-interest’ question, with a combined prevalence of 14%. This question allowed the detection of 60 out of 72 cases (sensitivity 83%) and excluded 394 of 459 non-depressed cases (specificity 86%). The PPV was 48% and the NPV 97%. (3) Two questions (low mood and low interest): five studies examined two questions with a combined prevalence of 17%. The two-question combination facilitated a diagnosis of depression in 138 of 151 true cases (sensitivity 91%) and gave correct reassurance to 645 of 749 non-cases (specificity 86%). The PPV was 57% and the NPV 98%. Simple verbal methods perform well at excluding depression in the non-depressed but perform poorly at confirming depression. The ‘two question’ method is significantly more accurate than either single question, but clinicians should not rely on these simple questions alone and should be prepared to assess the patient more thoroughly.
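
    The percentages quoted in this abstract can be recomputed from the pooled counts it reports. The sketch below is only an arithmetic check under the assumption that simple pooling of the stated counts is what produced the quoted figures; it is not the Bayesian analysis named in the title. The function name `screening_metrics` and the dictionary labels are illustrative, not taken from the paper.

```python
def screening_metrics(true_pos, cases, true_neg, non_cases):
    """Return screening statistics from pooled counts.

    true_pos  -- depressed patients correctly detected
    cases     -- all truly depressed patients
    true_neg  -- non-depressed patients correctly reassured
    non_cases -- all non-depressed patients
    """
    false_neg = cases - true_pos      # depressed patients who were missed
    false_pos = non_cases - true_neg  # non-depressed patients wrongly flagged
    return {
        "prevalence": cases / (cases + non_cases),
        "sensitivity": true_pos / cases,
        "specificity": true_neg / non_cases,
        "ppv": true_pos / (true_pos + false_pos),
        "npv": true_neg / (true_neg + false_neg),
    }


# Pooled counts exactly as stated in the abstract.
pooled_counts = {
    "single depression question": (160, 223, 964, 1166),
    "single interest question": (60, 72, 394, 459),
    "two questions": (138, 151, 645, 749),
}

for label, counts in pooled_counts.items():
    metrics = screening_metrics(*counts)
    summary = ", ".join(f"{name} {value:.0%}" for name, value in metrics.items())
    print(f"{label}: {summary}")
```

    Under that assumption, the low PPV follows directly from the low prevalence: false positives are drawn from the much larger non-depressed group, so they approach or exceed the true positives even when specificity is above 80%.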