15 research outputs found

    A high-reproducibility and high-accuracy method for automated topic classification

    Much of human knowledge sits in large databases of unstructured text. Leveraging this knowledge requires algorithms that extract and record metadata on unstructured text documents. Assigning topics to documents will enable intelligent search, statistical characterization, and meaningful classification. Latent Dirichlet allocation (LDA) is the state of the art in topic classification. Here, we perform a systematic theoretical and numerical analysis demonstrating that current optimization techniques for LDA often fail to infer the most suitable model parameters accurately. Adapting approaches for community detection in networks, we propose a new algorithm that displays high reproducibility, high accuracy, and high computational efficiency. We apply it to a large set of documents in the English Wikipedia and reveal its hierarchical structure. Our algorithm promises to make "big data" text analysis systems more reliable. Comment: 23 pages, 24 figures

    Duality between time series and networks.

    Studying the interaction between a system's components and the temporal evolution of the system are two common ways to uncover and characterize its internal workings. Recently, several maps from a time series to a network have been proposed with the intent of using network metrics to characterize time series. Although these maps demonstrate that different time series result in networks with distinct topological properties, it remains unclear how these topological properties relate to the original time series. Here, we propose a map from a time series to a network with an approximate inverse operation, making it possible to use network statistics to characterize time series and time series statistics to characterize networks. As a proof of concept, we generate an ensemble of time series ranging from periodic to random and confirm that applying the proposed map (or its inverse) retains much of the information encoded in the original time series (or networks). Our results suggest that network analysis can be used to distinguish different dynamic regimes in time series and, perhaps more importantly, that time series analysis can provide a powerful set of tools that augment the traditional network analysis toolkit to quantify networks in new and useful ways.
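
The abstract above does not spell out the map itself, but the figure captions in this listing say the networks are built "using quantiles". A minimal sketch of one plausible reading follows, under stated assumptions: each quantile bin of the series becomes a node, and the weight of edge i to j is the fraction of steps that move from bin i to bin j. The function name `forward_map`, the choice of bin boundaries, and the row normalization are illustrative assumptions, not the authors' implementation.

```python
import bisect


def forward_map(series, n_quantiles):
    """Sketch: map a time series to a weighted directed network.

    Returns (weights, cuts), where weights[i][j] is the fraction of
    transitions out of quantile bin i that land in bin j, and cuts holds
    the n_quantiles - 1 interior bin boundaries. Hypothetical helper.
    """
    ordered = sorted(series)
    n = len(ordered)
    # Interior boundaries chosen so the bins hold roughly equal numbers of points.
    cuts = [ordered[(k * n) // n_quantiles] for k in range(1, n_quantiles)]
    # Each point is assigned to the bin (node) it falls in.
    nodes = [bisect.bisect_right(cuts, x) for x in series]
    # Count transitions between the bins of consecutive points.
    counts = [[0] * n_quantiles for _ in range(n_quantiles)]
    for a, b in zip(nodes, nodes[1:]):
        counts[a][b] += 1
    # Normalize each row so outgoing weights sum to 1 (all-zero rows stay zero).
    weights = []
    for row in counts:
        total = sum(row)
        weights.append([c / total if total else 0.0 for c in row])
    return weights, cuts
```

Under these assumptions, a strictly periodic series yields a ring-like network in which every node has a single outgoing edge, while a random series spreads the weight across many edges.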

    Statistical properties of the time series presented in <b>Figure 9</b>, generated from the <i>Arabidopsis thaliana</i> network and the USA Internet 1997.

    <p>Note that the long-range correlations present in the metabolic network are well captured by the autocorrelation function and the corresponding power density spectrum, which displays a clear power-law scaling. By contrast, the results for the USA Internet 1997 network bear the footprint of the short-correlated signal generated by the Internet network, whose spectrum shows power-law scaling with a less steep slope.</p>

    Different realizations of the inverse map in the real networks.

    <p>We perform four realizations of the inverse map on the <i>Arabidopsis thaliana</i> metabolic network (100,000 points) and the USA Internet 1997 network (1,589 nodes and 100,000 points). Note the clear similarity of these time series to those presented in <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0023378#pone-0023378-g009" target="_blank">Figure 9</a>, demonstrating the robustness of the proposed inverse map.</p>

    Illustration of the proposed forward map to the problem of detecting differences in the data structures of patients in different health conditions.

    <p>We use 100-minute normalized heart rate time series from a healthy subject (upper panel) and a subject with severe congestive heart failure (lower panel), sampled at regular intervals (10,000 points) <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0023378#pone.0023378-Physionet1" target="_blank">[36]</a>. We construct the networks using quantiles by applying the forward map to the corresponding time series. The resulting networks display clear differences in topology, most visibly the relatively separated cluster in the network associated with the unhealthy subject. These differences in topology are confirmed by generating networks with different numbers of nodes (<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0023378#pone-0023378-g007" target="_blank">Fig. 7</a>) and by using time series from different healthy and unhealthy subjects (<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0023378#pone-0023378-g008" target="_blank">Fig. 8</a>).</p>

    Comparison of statistical properties of first generation and second generation time series.

    <p>We compare the means of these properties over 10 different realizations of first and second generation time series. Error bars denote the standard deviation across realizations. For both the first and second generation time series, the autocorrelation function and the power spectrum reveal a distinct signal when the time series are periodic, which disappears when the time series become random. As expected from a toy model that has no bias toward particular values, both the first and second generation time series have values that are uniformly distributed over their range for all values of the model parameter.</p>
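
The two statistics named in this caption can be illustrated with a short sketch using only the standard library. The function names, the normalization by the lag-0 value, and the naive discrete Fourier transform are assumptions made for illustration; the caption does not state which estimators the authors used.

```python
import cmath
import math


def autocorrelation(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag, normalized so lag 0 equals 1."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    var = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t + lag] for t in range(n - lag)) / var
        for lag in range(max_lag + 1)
    ]


def periodogram(series):
    """Power at frequencies k/n for k = 1..n//2, via a naive O(n^2) DFT."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    power = []
    for k in range(1, n // 2 + 1):
        coeff = sum(
            c * cmath.exp(-2j * math.pi * k * t / n)
            for t, c in enumerate(centered)
        )
        power.append(abs(coeff) ** 2 / n)
    return power
```

For a periodic input, the autocorrelation oscillates with the period of the signal and the periodogram concentrates its power at the matching frequency; for a random input, both are expected to flatten out, which is the contrast the caption describes.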

    Qualitative assessment of the faithfulness of the proposed map and its inverse.

    <p>We generate first generation time series from the toy time series model (Eq. 2), ranging from periodic to random. We then construct the first generation networks using quantiles by applying the forward map to the corresponding time series. Time series with different values of the model parameter result in networks with different topologies. As the toy time series becomes more random, the corresponding networks also become increasingly random. We construct the second generation time series and the second generation networks by sequentially applying the inverse map and the forward map, respectively. These panels suggest that the first and second generation time series and networks have similar properties, supporting the hypothesis that it may be possible to use time series analysis to characterize the topology of networks and network analysis to characterize the structure of time series.</p>
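
The caption does not say how the inverse map turns a network back into a time series. One plausible reading, sketched here purely as an assumption, is a random walk on the weighted network in which each visited node emits a value drawn uniformly from the value range of its bin. The name `inverse_map`, the `edges` argument (the bin boundaries, one more entry than there are nodes), and the uniform draw are all hypothetical.

```python
import random


def inverse_map(weights, edges, length, seed=0):
    """Sketch: generate a time series by a random walk on a weighted network.

    weights[i][j] is the transition weight from node i to node j.
    edges[i] and edges[i + 1] bound the values assigned to node i.
    Hypothetical helper, not the authors' implementation.
    """
    rng = random.Random(seed)
    n = len(weights)
    node = rng.randrange(n)  # start from an arbitrary node
    series = []
    for _ in range(length):
        # Emit a value from the range associated with the current node.
        series.append(rng.uniform(edges[node], edges[node + 1]))
        row = weights[node]
        if sum(row) > 0:
            # Step to a neighbor with probability proportional to edge weight.
            node = rng.choices(range(n), weights=row)[0]
        else:
            # Dead end: restart from a random node (an illustrative choice).
            node = rng.randrange(n)
    return series
```

Applied to a ring-shaped network, this walk visits the bins in a fixed cyclic order and so produces a noisy periodic series, consistent with the caption's observation that first and second generation time series share similar properties.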

    Application of the proposed forward map to the heart rate time series associated to different subjects.

    <p>We apply the forward map to time series from three healthy (left panels) and unhealthy (right panels) subjects. Regardless of the subject, the resulting networks are visually similar to those presented in <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0023378#pone-0023378-g006" target="_blank">Figure 6</a>. This is another demonstration of the robustness of the forward map.</p>

    Application of the proposed forward map to the heart rate time series using different number of quantiles.

    <p>We apply the forward map with two different numbers of nodes to time series from healthy (left panels) and unhealthy (right panels) subjects. Note the visual similarity of these networks to the networks presented in <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0023378#pone-0023378-g006" target="_blank">Figure 6</a>, attesting to the robustness of the results of the proposed forward map, regardless of the number of quantiles.</p>