    Scaling laws and fluctuations in the statistics of word frequencies

    In this paper we combine statistical analysis of large text databases with simple stochastic models to explain the appearance of scaling laws in the statistics of word frequencies. Besides the sublinear scaling of the vocabulary size with database size (Heaps' law), we report a new scaling of the fluctuations around this average (fluctuation scaling analysis). We explain both scaling laws by modeling the usage of words with simple stochastic processes in which the overall distribution of word frequencies is fat-tailed (Zipf's law) and the frequency of a single word is subject to fluctuations across documents (as in topic models). In this framework, the mean and the variance of the vocabulary size can be expressed as quenched averages, implying that: i) the inhomogeneous dissemination of words causes a reduction of the average vocabulary size in comparison to the homogeneous case, and ii) correlations in the co-occurrence of words lead to an increase in the variance, and the vocabulary size becomes a non-self-averaging quantity. We address the implications of these observations for the measurement of lexical richness. We test our results in three large text databases (Google-ngram, English Wikipedia, and a collection of scientific articles). Comment: 19 pages, 4 figures
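
    The link between a Zipf-distributed word-frequency distribution and sublinear vocabulary growth can be illustrated with a minimal sketch. It is not the authors' model: it covers only the homogeneous case (every token drawn independently from one fixed distribution, with no topic-like fluctuations across documents), and the names `zipf_weights`, `vocabulary_growth`, and `heaps_exponent`, as well as the parameter values `num_types=50_000` and `alpha=1.0`, are assumptions chosen for illustration.

```python
import math
import random


def zipf_weights(num_types, alpha=1.0):
    """Unnormalized Zipf weights: the word of rank r gets weight r**(-alpha)."""
    return [1.0 / (rank ** alpha) for rank in range(1, num_types + 1)]


def vocabulary_growth(num_tokens, num_types=50_000, alpha=1.0, seed=0):
    """Draw tokens independently from a Zipf-like distribution and record
    the number of distinct words seen after each token (homogeneous case)."""
    rng = random.Random(seed)
    weights = zipf_weights(num_types, alpha)
    tokens = rng.choices(range(num_types), weights=weights, k=num_tokens)
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


def heaps_exponent(growth, start=100):
    """Least-squares slope of log(vocabulary size) against log(text length),
    using text lengths from `start` up to len(growth)."""
    xs = [math.log(n) for n in range(start, len(growth) + 1)]
    ys = [math.log(growth[n - 1]) for n in range(start, len(growth) + 1)]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


if __name__ == "__main__":
    growth = vocabulary_growth(20_000)
    print("distinct words:", growth[-1])
    print("fitted Heaps exponent:", round(heaps_exponent(growth), 3))
```

    A fitted slope below 1 corresponds to the sublinear growth that Heaps' law describes. Studying the fluctuation scaling reported in the abstract would require repeating the draw over many seeds and, for the inhomogeneous case, letting the word probabilities vary from document to document, which this sketch does not attempt.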

    Challenges in Complex Systems Science

    The foundations of FuturICT are social science, complex systems science, and ICT. This paper lays out the main concerns and challenges in the science of complex systems in the context of FuturICT, with special emphasis on the Complex Systems route to Social Sciences. These include complex systems having: many heterogeneous interacting parts; multiple scales; complicated transition laws; unexpected or unpredicted emergence; sensitive dependence on initial conditions; path-dependent dynamics; networked hierarchical connectivities; interaction of autonomous agents; self-organisation; non-equilibrium dynamics; combinatorial explosion; adaptivity to changing environments; co-evolving subsystems; ill-defined boundaries; and multilevel dynamics. In this context, science is seen as the process of abstracting the dynamics of systems from data. This presents many challenges, including: data gathering by large-scale experiment, participatory sensing and social computation, managing huge distributed dynamic and heterogeneous databases; moving from data to dynamical models, going beyond correlations to cause-effect relationships, understanding the relationship between simple and comprehensive models with appropriate choices of variables, ensemble modeling and data assimilation, modeling systems of systems of systems with many levels between micro and macro; and formulating new approaches to prediction, forecasting, and risk, especially in systems that can reflect on and change their behaviour in response to predictions, and systems whose apparently predictable behaviour is disrupted by apparently unpredictable rare or extreme events. These challenges are part of the FuturICT agenda.

    Modeling views in the layered view model for XML using UML

    In data engineering, view formalisms are used to provide flexibility to users and user applications by allowing them to extract and elaborate data from the stored data sources. Meanwhile, since its introduction, Extensible Markup Language (XML) has been fast emerging as the dominant standard for storing, describing, and interchanging data among various web and heterogeneous data sources. In combination with XML Schema, XML provides rich facilities for defining and constraining user-defined data semantics and properties, a feature that is unique to XML. In this context, it is interesting to investigate traditional database features, such as view models and view design techniques, for XML. However, traditional view formalisms are strongly coupled to the data language and its syntax, so supporting views for semi-structured data models proves to be a difficult task. Therefore, in this paper we propose a Layered View Model (LVM) for XML with conceptual and schemata extensions. Our work is threefold. First, we propose an approach that separates the implementation and conceptual aspects of the views, providing a clear separation of concerns and allowing the analysis and design of views to be separated from their implementation. Secondly, we define representations to express and construct these views at the conceptual level. Thirdly, we define a view transformation methodology for XML views in the LVM, which carries out automated transformation to a view schema and a view query expression in an appropriate query language. Finally, to validate and apply the LVM concepts, methods, and transformations developed, we propose a view-driven application development framework with the flexibility to develop web and database applications for XML at varying levels of abstraction.