76 research outputs found

    Experiments on predictability of word in context and information rate in natural language

    Get PDF
    Based on data from a large-scale experiment with human subjects, we conclude that the logarithm of probability to guess a word in context (unpredictability) depends linearly on the word length. This result holds both for poetry and prose, even though with prose, the subjects don't know the length of the omitted word. We hypothesize that this effect reflects a tendency of natural language to have an even information rate

    Improving Statistical Language Model Performance with Automatically Generated Word Hierarchies

    Full text link
    An automatic word classification system has been designed which processes word unigram and bigram frequency statistics extracted from a corpus of natural language utterances. The system implements a binary top-down form of word clustering which employs an average class mutual information metric. Resulting classifications are hierarchical, allowing variable class granularity. Words are represented as structural tags --- unique nn-bit numbers the most significant bit-patterns of which incorporate class information. Access to a structural tag immediately provides access to all classification levels for the corresponding word. The classification system has successfully revealed some of the structure of English, from the phonemic to the semantic level. The system has been compared --- directly and indirectly --- with other recent word classification systems. Class based interpolated language models have been constructed to exploit the extra information supplied by the classifications and some experiments have shown that the new models improve model performance.Comment: 17 Page Paper. Self-extracting PostScript Fil

    Maximal information component analysis: a novel non-linear network analysis method.

    Get PDF
    BackgroundNetwork construction and analysis algorithms provide scientists with the ability to sift through high-throughput biological outputs, such as transcription microarrays, for small groups of genes (modules) that are relevant for further research. Most of these algorithms ignore the important role of non-linear interactions in the data, and the ability for genes to operate in multiple functional groups at once, despite clear evidence for both of these phenomena in observed biological systems.ResultsWe have created a novel co-expression network analysis algorithm that incorporates both of these principles by combining the information-theoretic association measure of the maximal information coefficient (MIC) with an Interaction Component Model. We evaluate the performance of this approach on two datasets collected from a large panel of mice, one from macrophages and the other from liver by comparing the two measures based on a measure of module entropy, Gene Ontology (GO) enrichment, and scale-free topology (SFT) fit. Our algorithm outperforms a widely used co-expression analysis method, weighted gene co-expression network analysis (WGCNA), in the macrophage data, while returning comparable results in the liver dataset when using these criteria. We demonstrate that the macrophage data has more non-linear interactions than the liver dataset, which may explain the increased performance of our method, termed Maximal Information Component Analysis (MICA) in that case.ConclusionsIn making our network algorithm more accurately reflect known biological principles, we are able to generate modules with improved relevance, particularly in networks with confounding factors such as gene by environment interactions
    • …
    corecore