Stochastic model for the vocabulary growth in natural languages
We propose a stochastic model for the number of different words in a given
database which incorporates the dependence on the database size and historical
changes. The main feature of our model is the existence of two different
classes of words: (i) a finite number of core-words, which have higher frequency
and do not affect the probability that a new word will be used; and (ii) a
virtually infinite number of remaining noncore-words, which have lower frequency
and, once used, reduce the probability that a new word will be used in the future.
Our model relies on a careful analysis of the Google-ngram database of books
published over the last centuries, and its main consequence is a generalization
of Zipf's and Heaps' laws to two scaling regimes. We confirm that these
generalizations yield the best simple description of the data among generic
descriptive models and that the two free parameters depend only on the language
but not on the database. From the point of view of our model, the main change on
historical time scales is in the composition of the specific words included in
the finite list of core-words, which we observe to decay exponentially in time
at a rate of approximately 30 words per year for English.
Comment: corrected typos and errors in reference list; 10 pages text, 15 pages
supplemental material; to appear in Physical Review
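The two-class growth mechanism described in this abstract is easy to explore numerically. The Python sketch below implements a Simon-style process in the spirit of the abstract rather than the authors' exact model: the probability of introducing a new word is constant while the core vocabulary is being filled and decays afterwards, and repeated words are drawn proportionally to their past frequency. The parameter names and the power-law form of the decay are illustrative assumptions.

import random

def simulate_vocabulary_growth(n_tokens, n_core=1000, alpha=0.6, p0=0.1, seed=0):
    """Toy two-class Simon-style process inspired by the abstract above.

    While fewer than n_core distinct words have appeared, a new word is
    introduced with constant probability p0 (core regime).  Afterwards the
    new-word probability decays as p0 * (n_core / V)**alpha, where V is the
    current vocabulary size, so that already-used noncore-words suppress
    further innovation.  Repeated words are drawn with probability
    proportional to their past frequency (preferential attachment).
    """
    rng = random.Random(seed)
    tokens = []    # sequence of word ids written so far
    counts = {}    # word id -> frequency
    heaps = []     # (database size, vocabulary size) pairs
    next_id = 0
    for t in range(1, n_tokens + 1):
        V = len(counts)
        p_new = p0 if V < n_core else p0 * (n_core / V) ** alpha
        if V == 0 or rng.random() < p_new:
            word = next_id             # introduce a brand-new word
            next_id += 1
        else:
            word = rng.choice(tokens)  # repeat an old word, prob. proportional to frequency
        tokens.append(word)
        counts[word] = counts.get(word, 0) + 1
        if t % 1000 == 0:
            heaps.append((t, len(counts)))
    return heaps

if __name__ == "__main__":
    for n, v in simulate_vocabulary_growth(200_000)[::20]:
        print(f"tokens={n:>7d}  vocabulary={v:>6d}")

Plotting the recorded (database size, vocabulary size) pairs on log-log axes shows the crossover between a fast initial regime and a slower, sublinear Heaps-type regime.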
Scaling laws and fluctuations in the statistics of word frequencies
In this paper we combine statistical analysis of large text databases and
simple stochastic models to explain the appearance of scaling laws in the
statistics of word frequencies. Besides the sublinear scaling of the vocabulary
size with database size (Heaps' law), here we report a new scaling of the
fluctuations around this average (fluctuation scaling analysis). We explain
both scaling laws by modeling the usage of words by simple stochastic processes
in which the overall distribution of word-frequencies is fat tailed (Zipf's
law) and the frequency of a single word is subject to fluctuations across
documents (as in topic models). In this framework, the mean and the variance of
the vocabulary size can be expressed as quenched averages, implying that: (i)
the inhomogeneous dissemination of words causes a reduction of the average
vocabulary size in comparison to the homogeneous case, and (ii) correlations in
the co-occurrence of words lead to an increase in the variance, so that the
vocabulary size becomes a non-self-averaging quantity. We address the
implications of these observations for the measurement of lexical richness. We
test our results in three large text databases (Google-ngram, English
Wikipedia, and a collection of scientific articles).
Comment: 19 pages, 4 figures
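The homogeneous baseline against which the quenched averages are compared can be sketched in a few lines of Python: each replica is modeled as an independent Poisson sample from a Zipf-distributed vocabulary, and the mean and standard deviation of the vocabulary size are measured as a function of the database size. The vocabulary size, exponent, and number of replicas below are illustrative choices, not values from the paper.

import numpy as np

def zipf_frequencies(n_words=50_000, gamma=1.0):
    """Normalized Zipf-like word frequencies f_r proportional to r**(-gamma)."""
    f = np.arange(1, n_words + 1, dtype=float) ** (-gamma)
    return f / f.sum()

def vocabulary_stats(db_sizes, n_replicas=200, gamma=1.0, seed=0):
    """Mean and standard deviation of the vocabulary size V(N).

    Each replica of size N is modeled as independent Poisson sampling of word
    counts with expectations N * f_r (the homogeneous case); V is the number
    of word types with non-zero count.  Topic-like fluctuations of the
    single-word frequencies, as discussed in the abstract, would be added by
    drawing f_r at random per replica before the Poisson step.
    """
    rng = np.random.default_rng(seed)
    f = zipf_frequencies(gamma=gamma)
    stats = []
    for N in db_sizes:
        V = [(rng.poisson(N * f) > 0).sum() for _ in range(n_replicas)]
        stats.append((N, float(np.mean(V)), float(np.std(V))))
    return stats

if __name__ == "__main__":
    for N, mean_V, std_V in vocabulary_stats([10**3, 10**4, 10**5, 10**6]):
        print(f"N={N:>8d}  <V>={mean_V:10.1f}  std(V)={std_V:8.2f}")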
A network approach to topic models
One of the main computational and scientific challenges in the modern age is
to extract useful information from unstructured texts. Topic models are one
popular machine-learning approach which infers the latent topical structure of
a collection of documents. Despite their success, in particular of the most
widely used variant, Latent Dirichlet Allocation (LDA), and their numerous
applications in sociology, history, and linguistics, topic models are known to
suffer from severe conceptual and practical problems, e.g., a lack of
justification for the Bayesian priors, discrepancies with statistical
properties of real texts, and the inability to properly choose the number of
topics. Here we obtain a fresh view on the problem of identifying topical
structures by relating it to the problem of finding communities in complex
networks. This is achieved by representing text corpora as bipartite networks
of documents and words. By adapting existing community-detection methods --
using a stochastic block model (SBM) with non-parametric priors -- we obtain a
more versatile and principled framework for topic modeling (e.g., it
automatically detects the number of topics and hierarchically clusters both the
words and documents). The analysis of artificial and real corpora demonstrates
that our SBM approach leads to better topic models than LDA in terms of
statistical model selection. More importantly, our work shows how to formally
relate methods from community detection and topic modeling, opening the
possibility of cross-fertilization between these two fields.
Comment: 22 pages, 10 figures, code available at https://topsbm.github.io
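The authors' full implementation is available at https://topsbm.github.io. As a minimal illustration of the idea, and assuming the graph-tool library (https://graph-tool.skewed.de) is installed, the sketch below builds the bipartite word-document network from a toy corpus and fits a nested SBM by minimizing the description length; the word-side blocks then play the role of topics. Unlike the full method, this sketch does not constrain document and word vertices to separate groups.

import graph_tool.all as gt

def corpus_to_bipartite(docs):
    """Bipartite multigraph with one vertex per document and per word type;
    every token becomes an edge between its document and its word."""
    g = gt.Graph(directed=False)
    name = g.vp["name"] = g.new_vertex_property("string")
    kind = g.vp["kind"] = g.new_vertex_property("int")   # 0 = document, 1 = word
    word_vertex = {}
    for i, tokens in enumerate(docs):
        d = g.add_vertex()
        name[d], kind[d] = f"doc_{i}", 0
        for w in tokens:
            if w not in word_vertex:
                v = g.add_vertex()
                name[v], kind[v] = w, 1
                word_vertex[w] = v
            g.add_edge(d, word_vertex[w])
    return g

docs = [
    "the network has communities and blocks".split(),
    "topics are latent structure in documents".split(),
    "block models detect communities in the network".split(),
]
g = corpus_to_bipartite(docs)
state = gt.minimize_nested_blockmodel_dl(g)   # hierarchical SBM with non-parametric priors
state.print_summary()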
Using text analysis to quantify the similarity and evolution of scientific disciplines
We use an information-theoretic measure of linguistic similarity to
investigate the organization and evolution of scientific fields. An analysis of
almost 20M papers from the past three decades reveals that the linguistic
similarity is related to, but different from, expert- and citation-based
classifications, leading to an improved view of the organization of science. A
temporal analysis of the similarity of fields shows that some fields (e.g.,
computer science) are becoming increasingly central, but that on average the
similarity between pairs of fields has not changed in the last decades. This
suggests that tendencies of convergence (e.g., multi-disciplinarity) and
divergence (e.g., specialization) of disciplines are in balance.
Comment: 9 pages, 4 figures
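The abstract does not spell out the similarity measure; a common information-theoretic choice for comparing corpora, used here purely as an illustration, is one minus the Jensen-Shannon divergence between the word-frequency distributions of two fields. The toy corpora below are invented for the example.

from collections import Counter
import math

def word_distribution(texts):
    """Empirical word-frequency distribution over a list of documents."""
    counts = Counter(w for text in texts for w in text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def jensen_shannon_divergence(p, q):
    """JSD(p, q) in bits: 0 for identical distributions, 1 for disjoint vocabularies."""
    vocab = set(p) | set(q)
    jsd = 0.0
    for w in vocab:
        pw, qw = p.get(w, 0.0), q.get(w, 0.0)
        mw = 0.5 * (pw + qw)
        if pw > 0:
            jsd += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            jsd += 0.5 * qw * math.log2(qw / mw)
    return jsd

field_a = ["graph community detection in networks", "spectral clustering of networks"]
field_b = ["gene expression in regulatory networks", "protein interaction networks"]
similarity = 1.0 - jensen_shannon_divergence(word_distribution(field_a),
                                             word_distribution(field_b))
print(f"linguistic similarity = {similarity:.3f}")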
Extracting information from S-curves of language change
It is well accepted that the adoption of innovations is described by S-curves
(a slow start, an accelerating period, and a slow end). In this paper, we analyze how
much information on the dynamics of innovation spreading can be obtained from a
quantitative description of S-curves. We focus on the adoption of linguistic
innovations for which detailed databases of written texts from the last 200
years allow for an unprecedented statistical precision. Combining data analysis
with simulations of simple models (e.g., the Bass dynamics on complex networks)
we identify signatures of endogenous and exogenous factors in the S-curves of
adoption. We propose a measure to quantify the strength of these factors and
three different methods to estimate it from S-curves. We obtain cases in which
the exogenous factors are dominant (in the adoption of German orthographic
reforms and of one irregular verb) and cases in which endogenous factors are
dominant (in the adoption of conventions for romanization of Russian names and
in the regularization of most studied verbs). These results show that the shape
of the S-curve is not universal and contains information on the adoption mechanism.
(Published in J. R. Soc. Interface, vol. 11, no. 101 (2014) 1044; DOI:
http://dx.doi.org/10.1098/rsif.2014.1044)
Comment: 9 pages, 5 figures, Supplementary Material is available at
http://dx.doi.org/10.6084/m9.figshare.122178
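The role of exogenous and endogenous factors can be illustrated with the mean-field Bass dynamics mentioned in the abstract. In the sketch below, p models exogenous pressure (e.g., an official reform) and q models imitation of previous adopters; the parameter values are illustrative and are not estimates from the paper.

import numpy as np

def bass_curve(p, q, t_max=50.0, dt=0.01):
    """Integrate d(rho)/dt = (p + q*rho) * (1 - rho) with rho(0) = 0,
    where rho(t) is the fraction of adopters at time t."""
    steps = int(t_max / dt)
    rho = np.empty(steps)
    rho[0] = 0.0
    for i in range(1, steps):
        rho[i] = rho[i - 1] + dt * (p + q * rho[i - 1]) * (1.0 - rho[i - 1])
    return np.arange(steps) * dt, rho

if __name__ == "__main__":
    for label, p, q in [("exogenous-dominated", 0.200, 0.05),
                        ("endogenous-dominated", 0.005, 0.80)]:
        t, rho = bass_curve(p, q)
        t_half = t[np.searchsorted(rho, 0.5)]
        print(f"{label:22s}  p={p:.3f}  q={q:.3f}  50% adoption at t = {t_half:.1f}")

Exogenous-dominated curves rise from the start and saturate smoothly, whereas endogenous-dominated curves show the long latent phase and sharp take-off typical of imitation-driven spreading; this difference in shape is what the proposed measure exploits.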
On the influence of thermally induced radial pipe extension on the axial friction resistance
Within the design process of district heating networks, the maximum friction forces between the pipeline and the surrounding soil are calculated from the radial stress state and the coefficient of contact friction. For the estimation of the radial stresses, the soil unit weight, geometric properties such as the pipe's diameter and the depth of embedment, as well as the groundwater level are taken into account. For the coefficient of contact friction, different values are proposed, depending on the thermal loading condition of the pipeline. Although this assumption is of practical use, the coefficient of friction is physically a material constant. To re-examine the interaction behavior of the soil-pipeline system with respect to thermally induced radial pipe extension, a two-dimensional finite element model has been developed. Here, the frictional contact was established using Coulomb's friction law. For the embedment, sand at different states of relative density was considered. This noncohesive, granular material was described by the constitutive model HSsmall, which is able to predict the complex non-linear soil behavior in a realistic manner through stress-dependent stiffness as well as isotropic frictional and volumetric hardening. In addition to the basic Hardening Soil model, the HSsmall model accounts for an increased stiffness at small strains, which is crucial for the presented investigation. After a model validation, a parametric study was carried out in which a radial pipe displacement, caused by thermal changes of the transported medium, was applied. Different combinations of geometry and soil properties were studied. We conclude by presenting a corrective term that enables the incorporation of thermal expansion effects into the prediction of the maximum friction force.
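For orientation, the conventional design estimate summarized in the first sentences of this abstract can be written down in a few lines. The formula below is a standard simplification (effective overburden stress, at-rest lateral earth pressure, Coulomb friction over the pipe perimeter); the parameter names and default values are illustrative assumptions and do not reproduce the paper's FE-based corrective term.

import math

def max_axial_friction_force(depth_to_axis, diameter, unit_weight=18.0,
                             buoyant_unit_weight=10.0, groundwater_depth=None,
                             k0=0.5, mu=0.4):
    """Maximum axial friction force per metre of pipe [kN/m].

    depth_to_axis        burial depth to the pipe axis [m]
    diameter             outer casing diameter [m]
    unit_weight          soil unit weight above the groundwater table [kN/m^3]
    buoyant_unit_weight  effective unit weight below the groundwater table [kN/m^3]
    groundwater_depth    depth of the groundwater table [m], None if below the pipe
    k0                   coefficient of lateral earth pressure at rest [-]
    mu                   coefficient of contact friction between pipe and soil [-]
    """
    # Effective vertical stress at the pipe axis, accounting for groundwater.
    if groundwater_depth is None or groundwater_depth >= depth_to_axis:
        sigma_v = unit_weight * depth_to_axis
    else:
        sigma_v = (unit_weight * groundwater_depth
                   + buoyant_unit_weight * (depth_to_axis - groundwater_depth))
    # Mean radial (normal) stress around the circumference from vertical and
    # horizontal earth pressure.
    sigma_n = 0.5 * (1.0 + k0) * sigma_v
    # Friction force per unit length: shear stress mu * sigma_n over the perimeter.
    return mu * sigma_n * math.pi * diameter

print(f"F_max = {max_axial_friction_force(1.2, 0.4, groundwater_depth=0.8):.1f} kN/m")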
Business Intelligence & Analytics and Decision Quality - Insights on Analytics Specialization and Information Processing Modes
Leveraging the benefits of business intelligence and analytics (BI&A) and improving decision quality depends not only on establishing BI&A technology, but also on the organization and characteristics of decision processes. This research investigates new perspectives on these decision processes and establishes a link between characteristics of BI&A support and decision makers’ modes of information-processing behavior, and how these ultimately contribute to the quality of decision outcomes. We build on the heuristic–systematic model (HSM) of information processing as a central explanatory mechanism for linking BI&A support and decision quality. This allows us to examine the effects of decision makers’ systematic and heuristic modes of information processing in decision-making processes. We further elucidate the role of analytics experts in influencing decision makers’ utilization of analytic advice. The analysis of data from 136 BI&A-supported decisions reveals how high levels of analytics elaboration can have a negative effect on decision makers’ information-processing behavior. We further show how decision makers’ systematic processing contributes to decision quality and how heuristic processing restrains it. In this context we also find that the trustworthiness of the analytics expert plays an important role in the adoption of analytic advice.