1,671 research outputs found

    On the quantification of complexity and diversity from phenotypes to ecosystems

    A cornerstone of ecology and evolution is comparing and explaining the complexity of natural systems, be they genomes, phenotypes, communities, or entire ecosystems. These comparisons and explanations raise questions about how complexity should be quantified in theory and estimated in practice. Here I use diversity partitioning based on Hill (effective) numbers to advance the empirical side of the field, that is, the quantification of biological complexity. First, at the level of phenotypes, I show that traditional multivariate analyses ignore individual complexity and provide relatively abstract representations of variation among individuals. I then suggest using well-known diversity indices from community ecology to describe phenotypic complexity as the diversity of distinct subsidiary components of a trait. I show how total trait diversity can be partitioned into within-individual complexity (alpha diversity) and between-individual components (beta diversity) within a hierarchical framework. Second, I use simulations to demonstrate that naïve measures of standardized beta diversity such as turnover or local/regional dissimilarity are biased estimators when the number of sampled units (e.g., quadrats) is less than the “true” number of communities in a system (if it exists). I then propose using average pairwise dissimilarities and show that this measure is unbiased regardless of the number of sample units. Moreover, the measure is intuitively interpreted as the average proportional change in composition as one moves from one sample to the next. Finally, I apply a hierarchical Bayesian approach to the estimation of species abundances within and among samples, communities, or regions. This strategy accommodates difficult problems of bias and uncertainty in the estimation of the diversity of the underlying communities while providing integrated estimates of uncertainty. Moreover, multilevel hierarchies are possible. We can then use model comparison to determine whether patches/communities/habitats within regions or sets are distinct subcommunities of a metapopulation, or whether they are “arbitrary” distinctions from one contiguous system.
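The alpha/beta partition with Hill numbers described in this abstract can be illustrated with a minimal sketch. It is not the author's code: it assumes equally weighted samples and the standard multiplicative partition (gamma = alpha × beta), and the function names `hill_number` and `partition_diversity` are hypothetical. For the phenotypic case, each "sample" would be an individual and each "species" a trait component.

```python
import math

def hill_number(p, q):
    """Effective number of types of order q for relative abundances p."""
    p = [x for x in p if x > 0]
    if abs(q - 1.0) < 1e-12:
        # Limit q -> 1: exponential of Shannon entropy.
        return math.exp(-sum(x * math.log(x) for x in p))
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))

def partition_diversity(samples, q=1.0):
    """Return (alpha, beta, gamma) for equally weighted samples (assumption).

    samples: list of abundance vectors sharing the same type ordering.
    """
    n = len(samples)
    rel = [[a / sum(s) for a in s] for s in samples]
    # Gamma: diversity of the pooled (averaged) relative abundances.
    pooled = [sum(col) / n for col in zip(*rel)]
    gamma = hill_number(pooled, q)
    if abs(q - 1.0) < 1e-12:
        # Alpha of order 1: exponential of the mean Shannon entropy.
        mean_h = sum(-sum(x * math.log(x) for x in r if x > 0) for r in rel) / n
        alpha = math.exp(mean_h)
    else:
        mean_sum = sum(sum(x ** q for x in r if x > 0) for r in rel) / n
        alpha = mean_sum ** (1.0 / (1.0 - q))
    return alpha, gamma / alpha, gamma

# Two samples with no shared types, each with two equally common types.
alpha, beta, gamma = partition_diversity([[5, 5, 0, 0], [0, 0, 3, 3]], q=1.0)
print(alpha, beta, gamma)
```

Under these assumptions beta runs from 1 (all samples identical) to the number of samples (no shared types), which is what makes it readable as an effective number of distinct communities.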

    Detecting Poisoning Attacks on Hierarchical Malware Classification Systems

    Anti-virus software based on unsupervised hierarchical clustering (HC) of malware samples has been shown to be vulnerable to poisoning attacks. In this kind of attack, a malicious player degrades anti-virus performance by submitting to the database samples specifically designed to collapse the classification hierarchy used by the anti-virus (and constructed through HC) or otherwise deform it in a way that would render it useless. Though each poisoning attack needs to be tailored to the particular HC scheme deployed, existing research seems to indicate that no particular HC method by itself is immune. We present results on applying a new notion of entropy for combinatorial dendrograms to the problem of controlling the influx of samples into the database and deflecting poisoning attacks. In a nutshell, effective and tractable measures of change in hierarchy complexity are derived from this notion of entropy, enabling on-the-fly flagging and rejection of potentially damaging samples. The information-theoretic underpinnings of these measures ensure their indifference to which particular poisoning algorithm is being used by the attacker, rendering them particularly attractive in this setting.

    Regional surname affinity: a spatial network approach

    OBJECTIVE We investigate surname affinities among areas of modern-day China by constructing a spatial network and performing community detection. We report a geographical genealogy of the Chinese population that is the result of population origins, historical migrations, and societal evolution. MATERIALS AND METHODS We acquire data from the census records supplied by China's National Citizen Identity Information System, including the surname and regional information of 1.28 billion registered Chinese citizens. We propose a multilayer minimum spanning tree (MMST) to construct a spatial network based on the matrix of isonymic distances, which is often used to characterize the dissimilarity of surname structure among areas. We use the fast unfolding algorithm to detect network communities. RESULTS We obtain a 10-layer MMST network of 362 prefecture nodes and 3,610 edges derived from the matrix of the Euclidean distances among these areas. These prefectures are divided into eight groups in the spatial network via community detection. We measure the partition by comparing the inter-distances and intra-distances of the communities and obtain a meaningful regional ethnicity classification. DISCUSSION The visualization of the resulting communities on the map indicates that the prefectures in the same community are usually geographically adjacent. The formation of this partition is influenced by geographical factors, historic migrations, trade and economic factors, as well as isolation of culture and language. The MMST algorithm proves to be effective in geo-genealogy and ethnicity classification, for it retains essential information about surname affinity and highlights the geographical consanguinity of the population.
    Funding: National Natural Science Foundation of China, Grant/Award Numbers: 61773069, 71731002; National Social Science Foundation of China, Grant/Award Number: 14BSH024; Foundation of China of China Scholarships Council, Grant/Award Numbers: 201606045048, 201706040188, 201706040015; DOE, Grant/Award Number: DE-AC07-05Id14517; DTRA, Grant/Award Number: HDTRA1-14-1-0017; NSF, Grant/Award Numbers: CHE-1213217, CMMI-1125290, PHY-1505000.
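One plausible reading of a multilayer minimum spanning tree is sketched below: compute a minimum spanning tree on the distance matrix, set its edges aside, and repeat on the remaining edges for each layer. This is an assumption, not the authors' published procedure, although it matches the reported counts (10 layers × 361 tree edges on 362 nodes = 3,610 edges). The names `prim_mst` and `multilayer_mst` are hypothetical.

```python
import math

def prim_mst(dist, banned):
    """Prim's algorithm on a full distance matrix, skipping banned edges.

    Returns a list of (i, j) edges with i < j. Raises ValueError if the
    graph left after removing banned edges is disconnected.
    """
    n = len(dist)
    in_tree = [False] * n
    best = [math.inf] * n      # cheapest known link into the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        if best[u] == math.inf:
            raise ValueError("graph is disconnected without earlier layers")
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((min(u, parent[u]), max(u, parent[u])))
        for v in range(n):
            edge = (min(u, v), max(u, v))
            if not in_tree[v] and edge not in banned and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return edges

def multilayer_mst(dist, layers):
    """Union of `layers` edge-disjoint spanning trees, built greedily."""
    used = set()
    for _ in range(layers):
        used.update(prim_mst(dist, used))
    return used

# Toy example: six areas placed on a line; distance = absolute difference.
pos = [0, 1, 3, 6, 10, 15]
dist = [[abs(a - b) for b in pos] for a in pos]
network = multilayer_mst(dist, layers=2)
print(len(network), sorted(network))
```

Stacking several layers keeps the strongest affinities of the first tree while adding enough redundant edges for a community detection algorithm to work with.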

    Concepts and applications in functional diversity

    The use of functional diversity analyses in ecology has grown exponentially over the past two decades, broadening our understanding of biological diversity and its change across space and time. Virtually all ecological sub-disciplines recognise the critical value of looking at species and communities from a functional perspective, and this has led to a proliferation of methods for estimating contrasting dimensions of functional diversity. Differences between these methods and their development have generated terminological inconsistencies and confusion about the selection of the most appropriate approach for addressing any particular ecological question, hampering the potential for comparative studies, simulation exercises and meta-analyses. Two general mathematical frameworks for estimating functional diversity prevail: those based on dissimilarity matrices (e.g. Rao entropy, functional dendrograms) and those relying on multidimensional spaces, constructed as either convex hulls or probabilistic hypervolumes. We review these frameworks, discuss their strengths and weaknesses and provide an overview of the main R packages performing these calculations. In parallel, we propose a way of organising functional diversity metrics in a unified scheme to quantify the richness, divergence and regularity of species or individuals under each framework. This overview offers a roadmap for confidently approaching functional diversity analyses both theoretically and practically.
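As a small illustration of the dissimilarity-matrix framework mentioned in this abstract, Rao's quadratic entropy can be sketched as the expected dissimilarity between two individuals drawn at random with replacement. The sketch below is not taken from the reviewed R packages; it assumes the full double-sum convention (some authors sum over i < j only, which halves the value), and the function name `rao_quadratic_entropy` is hypothetical.

```python
def rao_quadratic_entropy(abundances, dist):
    """Rao's Q = sum_i sum_j d_ij * p_i * p_j (full double sum, an assumption).

    abundances: species abundances in one community.
    dist: symmetric trait dissimilarity matrix with a zero diagonal.
    """
    total = sum(abundances)
    p = [a / total for a in abundances]
    n = len(p)
    return sum(p[i] * p[j] * dist[i][j] for i in range(n) for j in range(n))

# Two equally abundant species at maximal dissimilarity (d = 1).
print(rao_quadratic_entropy([10, 10], [[0, 1], [1, 0]]))
```

With all off-diagonal dissimilarities set to 1, this convention reduces to the Gini-Simpson index, which is one way to see how Rao's Q generalises a taxonomic diversity measure to trait space.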

    Boundary Value Exploration for Software Analysis

    For software to be reliable and resilient, it is widely accepted that tests must be created and maintained alongside the software itself. One safeguard against vulnerabilities and failures in code is to ensure correct behavior on the boundaries between sub-domains of the input space. So-called boundary value analysis (BVA) and boundary value testing (BVT) techniques aim to exercise those boundaries and increase test effectiveness. However, the concepts of BVA and BVT themselves are not clearly defined, and it is not clear how to identify relevant sub-domains, and thus the boundaries delineating them, given a specification. This has limited adoption and hindered automation. We clarify BVA and BVT and introduce Boundary Value Exploration (BVE) to describe techniques that support them by helping to detect and identify boundary inputs. Additionally, we propose two concrete BVE techniques based on information-theoretic distance functions: (i) an algorithm for boundary detection and (ii) the use of software visualization to explore the behavior of the software under test and identify its boundary behavior. As an initial evaluation, we apply these techniques to a widely used and well-tested date handling library. Our results reveal questionable behavior at boundaries highlighted by our techniques. In conclusion, we argue that the boundary value exploration that our techniques enable is a step towards automated boundary value analysis and testing, which can foster their wider use and improve test effectiveness and efficiency.

    WISER Deliverable D3.3-2: The importance of invertebrate spatial and temporal variation for ecological status classification for European lakes

    European lakes are affected by many human-induced disturbances. In principle, ecological theories predict that the structure and functioning of the benthic invertebrate assemblage (one of the Biological Quality Elements in Water Framework Directive, WFD, terminology) change in response to the level of disturbance, making this biological element suitable for assessing the status and management of lake ecosystems. In practice, to set up assessment systems based on invertebrates, we need to distinguish community changes that are related to human pressures from those that reflect inherent natural variability. This task is complicated by the fact that invertebrate communities inhabiting the littoral and the profundal zones of lakes are constrained by different factors and respond unevenly to distinct human disturbances. For example, it is not yet clear how invertebrate assemblages respond to watershed and shoreline alterations, nor is the relative importance of spatial and temporal factors for assemblage dynamics and the relative bioindicator values of taxa understood, along with the habitat constraints on species traits and other taxonomic and methodological limitations. The current lack of knowledge of basic features of invertebrate temporal and spatial variation is limiting the fulfillment of the EU-wide intercalibration of lake ecological quality assessment systems in Europe, and thus compromising the basis for setting the environmental objectives as required by the WFD. The aim of this deliverable is to contribute to the understanding of basic sources of spatial and temporal variation of lake invertebrate assemblages. The report is structured around selected case studies, mainly involving the analysis of existing datasets collated within WISER. The case studies come from different European lake types in the Northern, Central, Alpine and Mediterranean regions. All chapters have an obvious applied objective, and our aim is to provide those dealing with WFD implementation at various levels with useful information to consider when designing monitoring programs and/or invertebrate-based classification systems.

    On-line relational and multiple relational SOM

    In some applications, and in order to address real-world situations better, data may be more complex than simple numerical vectors. In some examples, data can be known only through their pairwise dissimilarities or through multiple dissimilarities, each of them describing a particular feature of the data set. Several variants of the Self Organizing Map (SOM) algorithm were introduced to generalize the original algorithm to the framework of dissimilarity data. Whereas median SOM is based on a rough representation of the prototypes, relational SOM allows representing these prototypes by a virtual linear combination of all elements in the data set, referring to a pseudo-Euclidean framework. In the present article, an on-line version of relational SOM is introduced and studied. Similarly to the situation in the Euclidean framework, this on-line algorithm provides a better organization and is much less sensitive to prototype initialization than standard (batch) relational SOM. In a more general case, this stochastic version allows us to integrate an additional stochastic gradient descent step in the algorithm which can tune the respective weights of several dissimilarities in an optimal way: the resulting multiple relational SOM thus has the ability to integrate several sources of data of different types, or to reach a consensus between several dissimilarities describing the same data. The algorithms introduced in this manuscript are tested on several data sets, including categorical data and graphs. On-line relational SOM is currently available in the R package SOMbrero that can be downloaded at http://sombrero.r-forge.r-project.org or directly tested on its Web User Interface at http://shiny.nathalievilla.org/sombrero

    Finding Optimal Diverse Feature Sets with Alternative Feature Selection

    Feature selection is popular for obtaining small, interpretable, yet highly accurate prediction models. Conventional feature-selection methods typically yield one feature set only, which might not suffice in some scenarios. For example, users might be interested in finding alternative feature sets with similar prediction quality, offering different explanations of the data. In this article, we introduce alternative feature selection and formalize it as an optimization problem. In particular, we define alternatives via constraints and enable users to control the number and dissimilarity of alternatives. Next, we analyze the complexity of this optimization problem and show NP-hardness. Further, we discuss how to integrate conventional feature-selection methods as objectives. Finally, we evaluate alternative feature selection with 30 classification datasets. We observe that alternative feature sets may indeed have high prediction quality, and we analyze several factors influencing this outcome.
