
    Effect of extreme data loss on long-range correlated and anti-correlated signals quantified by detrended fluctuation analysis

    We investigate how extreme loss of data affects the scaling behavior of long-range power-law correlated and anti-correlated signals using the detrended fluctuation analysis (DFA) method. We introduce a segmentation approach to generate surrogate signals by randomly removing data segments from stationary signals with different types of correlations. These surrogate signals are characterized by: (i) the DFA scaling exponent α of the original correlated signal, (ii) the percentage p of the data removed, (iii) the average length μ of the removed (or remaining) data segments, and (iv) the functional form of the distribution of the length of the removed (or remaining) data segments. We find that the global scaling exponent of positively correlated signals remains practically unchanged even for extreme data loss of up to 90%. In contrast, the global scaling of anti-correlated signals changes to uncorrelated behavior even when a very small fraction of the data is lost. These observations are confirmed on examples of human gait and commodity price fluctuations. We systematically study the local scaling behavior of signals with missing data to reveal deviations across scales. We find that for anti-correlated signals even 10% data loss leads to deviations in the local scaling at large scales, from the original anti-correlated towards uncorrelated behavior. In contrast, positively correlated signals show no observable changes in the local scaling for up to 65% data loss, while for larger percentages the local scaling shows overestimated regions (with a higher local exponent) at small scales, followed by underestimated regions (with a lower local exponent) at large scales. Finally, we investigate how the scaling is affected by the statistics of the remaining data segments in comparison to the removed segments.
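
    As a rough illustration of the segmentation approach described above, the sketch below (not the authors' implementation) removes randomly placed segments from a signal and estimates the DFA scaling exponent α before and after the data loss. The exponential segment-length distribution, the white-noise test signal, and all parameter values are assumptions made only for illustration.

```python
import numpy as np

def dfa(signal, scales, order=1):
    """Detrended fluctuation analysis: slope of log F(n) vs log n, where F(n)
    is the RMS of residuals after detrending windows of size n."""
    x = np.asarray(signal, dtype=float)
    y = np.cumsum(x - x.mean())                      # integrated (profile) signal
    fluct = []
    for n in scales:
        n_win = len(y) // n
        segments = y[:n_win * n].reshape(n_win, n)
        t = np.arange(n)
        res = [seg - np.polyval(np.polyfit(t, seg, order), t) for seg in segments]
        fluct.append(np.sqrt(np.mean(np.square(res))))
    return np.polyfit(np.log(scales), np.log(fluct), 1)[0]

def remove_segments(signal, p, mu, seed=None):
    """Surrogate with data loss: delete randomly placed segments with
    exponentially distributed lengths (mean mu) until a fraction p of the
    data is removed, then stitch the remaining segments together."""
    rng = np.random.default_rng(seed)
    keep = np.ones(len(signal), dtype=bool)
    target = int(p * len(signal))
    removed = 0
    while removed < target:
        length = max(1, int(rng.exponential(mu)))
        start = rng.integers(0, len(signal) - length)
        removed += np.count_nonzero(keep[start:start + length])
        keep[start:start + length] = False
    return np.asarray(signal)[keep]

# Quick check on white noise (alpha ~ 0.5 expected before and after 90% loss)
x = np.random.default_rng(0).standard_normal(2 ** 16)
scales = np.unique(np.logspace(1.5, 3.2, 15).astype(int))
print(dfa(x, scales), dfa(remove_segments(x, p=0.9, mu=50, seed=1), scales))
```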

    Large expert-curated database for benchmarking document similarity detection in biomedical literature search

    Document recommendation systems for locating relevant literature have mostly relied on methods developed a decade ago. This is largely due to the lack of a large offline gold-standard benchmark of relevant documents that covers a variety of research fields, such that newly developed literature search techniques can be compared, improved and translated into practice. To overcome this bottleneck, we have established the RElevant LIterature SearcH (RELISH) consortium, consisting of more than 1500 scientists from 84 countries, who have collectively annotated the relevance of over 180 000 PubMed-listed articles with regard to their respective seed (input) article/s. The majority of annotations were contributed by highly experienced, original authors of the seed articles. The collected data cover 76% of all unique PubMed Medical Subject Headings descriptors. No systematic biases were observed across different experience levels, research fields or time spent on annotations. More importantly, annotations of the same document pairs contributed by different scientists were highly concordant. We further show that the three representative baseline methods used to generate recommended articles for evaluation (Okapi Best Matching 25, Term Frequency-Inverse Document Frequency and PubMed Related Articles) had similar overall performance. Additionally, we found that these methods each tend to produce distinct collections of recommended articles, suggesting that a hybrid method may be required to completely capture all relevant articles. The established database server, located at https://relishdb.ict.griffith.edu.au, is freely available for downloading the annotation data and for the blind testing of new methods. We expect that this benchmark will be useful for stimulating the development of new, powerful techniques for title- and title/abstract-based search engines for relevant articles in biomedical research.
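
    As a minimal illustration of the kind of baseline compared in this benchmark, the sketch below ranks candidate documents against a seed article by TF-IDF cosine similarity with scikit-learn. The toy corpus is hypothetical, and this is neither the RELISH pipeline nor the BM25 or PubMed Related Articles implementations evaluated in the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins for PubMed title/abstract records
corpus = {
    "seed":  "detrended fluctuation analysis of heart rate variability",
    "doc_a": "scaling analysis of heart rate fluctuations in healthy subjects",
    "doc_b": "nanopore sensors for single-molecule protein identification",
}

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(corpus.values())      # rows follow insertion order

# Rank the candidate documents by cosine similarity to the seed article
scores = cosine_similarity(tfidf[0], tfidf[1:]).ravel()
for name, score in sorted(zip(list(corpus)[1:], scores), key=lambda x: -x[1]):
    print(f"{name}: {score:.3f}")
```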

    Scaling laws and model of words organization in spoken and written language

    A broad range of complex physical and biological systems exhibits scaling laws. Human language is a complex system of words organization. Studies of written texts have revealed intriguing scaling laws that characterize the frequency of word occurrence, the rank of words, and the growth in the number of distinct words with text length. While studies have predominantly focused on the language system in its written form, such as books, little attention has been given to the structure of spoken language. Here we investigate a database of spoken-language transcripts and written texts, and we uncover that words organization in both spoken language and written texts exhibits scaling laws, although with different crossover regimes and scaling exponents. We propose a model that provides insight into words organization in spoken language and written texts, and successfully accounts for all scaling laws empirically observed in both language forms.
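
    The two scaling laws mentioned above can be measured directly from a tokenized text: the rank-frequency (Zipf) relation and the growth of the number of distinct words with text length (Heaps' law). The sketch below only illustrates these empirical quantities; the tokenizer and sample sentence are assumptions, not the paper's corpus or model.

```python
import re
from collections import Counter

def zipf_and_heaps(text):
    """Return (rank, frequency) pairs and the vocabulary-growth curve (T, N_T)."""
    words = re.findall(r"[a-z']+", text.lower())
    # Zipf: word frequencies sorted in decreasing order versus rank
    zipf = list(enumerate(sorted(Counter(words).values(), reverse=True), start=1))
    # Heaps: number of distinct words N_T after the first T words
    seen, heaps = set(), []
    for t, w in enumerate(words, start=1):
        seen.add(w)
        heaps.append((t, len(seen)))
    return zipf, heaps

sample = "the cat sat on the mat and the dog sat on the log"
zipf, heaps = zipf_and_heaps(sample)
print(zipf[:3])    # frequencies of the top-ranked words
print(heaps[-1])   # (T, N_T) for the full sample
```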

    Procedural Data Processing for Single-Molecule Identification by Nanopore Sensors

    Nanopores are promising single-molecule sensing devices that have been successfully used for DNA sequencing, protein identification, and virus/particle detection. It is important to understand and characterize the current pulses collected by nanopore sensors, which carry information about the analytes, including their size, structure, and surface charge. Therefore, a signal processing program, based on the MATLAB platform, was designed to characterize the ionic current signals of nanopore measurements. Within a movable data window, the selected current segment was analyzed with adaptive thresholds and corrected by multiple functions to reduce the noise obscuring the pulse signals. Accordingly, a set of single-molecule events was identified, and detailed information on the current signals, including dwell time, amplitude, and current pulse area, was exported for quantitative analysis. The program contributes to the efficient and fast processing of nanopore signals with a high signal-to-noise ratio, which promotes the development of nanopore sensing devices in various fields of diagnostic systems and precision medicine.
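
    The program described above is MATLAB-based; as a rough illustration of the same general idea (a movable data window, an adaptive threshold, and per-event dwell time, amplitude and area), here is a simplified Python sketch. The window size, threshold factor, and synthetic trace are assumptions, and events straddling a window boundary are split, a simplification the real program presumably handles more carefully.

```python
import numpy as np

def detect_pulses(current, fs, window=5000, k=5.0):
    """Flag samples below an adaptive threshold (baseline - k * noise) within
    each data window and report dwell time, amplitude and area per event."""
    current = np.asarray(current, dtype=float)
    events = []
    for start in range(0, len(current), window):
        seg = current[start:start + window]
        baseline = np.median(seg)
        sigma = 1.4826 * np.median(np.abs(seg - baseline))   # robust noise estimate
        below = seg < baseline - k * sigma                   # adaptive threshold
        # locate contiguous runs of below-threshold samples
        padded = np.concatenate(([False], below, [False]))
        changes = np.flatnonzero(np.diff(padded.astype(int)))
        for i0, i1 in zip(changes[::2], changes[1::2]):
            blockade = baseline - seg[i0:i1]
            events.append({
                "dwell_time_s": (i1 - i0) / fs,
                "amplitude": blockade.max(),      # deepest current blockade
                "area": blockade.sum() / fs,      # integrated blockade
            })
    return events

# Synthetic 1 s trace sampled at 100 kHz with three rectangular 3 ms blockades
fs = 100_000
trace = 1.0 + 0.01 * np.random.default_rng(0).standard_normal(fs)
for pos in (20_000, 50_000, 80_000):
    trace[pos:pos + 300] -= 0.3
print(len(detect_pulses(trace, fs)))              # typically prints 3
```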

    Detailed information on the database of Chinese and English books and corresponding words statistics for each book.

    T is the total number of words, and N_T is the vocabulary size of each book.

    Top 20 most frequently used English words and Chinese characters and their frequencies.

    A Chinese character can have different functions in the structure of a sentence and carry different meanings depending on the context, as shown in brackets following each Chinese character in the table. The frequencies are calculated using pooled data of all books in our database.

    Model parameters of Chinese and English books.

    The statistics show significant differences in the model parameters k_0, k_t and k_p between Chinese and English texts, indicating differences in the dynamic process underlying the language structure, words organization and the occurrence of new words with text growth.

    Functional relations between the empirically observed scaling exponents and model parameters.

    (a) Exponent β of the empirical probability distribution P(k) vs. model parameter k_p, indicating a linear functional dependence. (b) Heaps' law scaling exponents λ_2 for English books and spoken transcriptions, and λ_3 for Chinese books, vs. model parameter k_t, indicating an exponential functional dependence. Data points are obtained from the scaling analyses and simulation of all ten Chinese and English language books listed in Table 1, and English spoken language from Ref. [30]. The dotted lines indicate 95% confidence intervals of the data points obtained from empirical and model parameters for each separate book.
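
    A hedged sketch of how such functional dependences could be fitted: a linear model for β versus k_p and an exponential model for a Heaps' law exponent versus k_t, using scipy.optimize.curve_fit. The numeric arrays below are hypothetical placeholders, not the values measured from the books in Table 1.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (k_p, beta) and (k_t, lambda) pairs, one point per book
k_p  = np.array([0.2, 0.4, 0.6, 0.8, 1.0])
beta = np.array([1.1, 1.5, 2.0, 2.4, 2.9])
k_t  = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
lam  = np.array([0.95, 0.88, 0.80, 0.71, 0.62])

# (a) linear dependence: beta = a * k_p + b
(a, b), cov_lin = curve_fit(lambda x, a, b: a * x + b, k_p, beta)

# (b) exponential dependence: lambda = c * exp(-d * k_t)
(c, d), cov_exp = curve_fit(lambda x, c, d: c * np.exp(-d * x), k_t, lam, p0=(1.0, 0.5))

# one-standard-error uncertainties of the fitted parameters
print(a, b, np.sqrt(np.diag(cov_lin)))
print(c, d, np.sqrt(np.diag(cov_exp)))
```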