
    The structure of PWMs.

    <p>We can generate six PWMs, each corresponding to a pattern order. For example, the first PWM on the left corresponds to the pattern order (, , ). Each row corresponds to a word and each column to a segment; each cell holds the frequency of that word in that segment.</p>

    DEMGD system architecture.

    <p>The input to the system is the Input Text, and the output is Summary Tables and Full Reports. The system consists of four modules: Text Pre-processing, Structured Data Representation, Classification, and Associations Extraction.</p>

    Combining Position Weight Matrices and Document-Term Matrix for Efficient Extraction of Associations of Methylated Genes and Diseases from Free Text

    <div><p>Background</p><p>In a number of diseases, certain genes are reported to be strongly methylated and can therefore serve as diagnostic markers in many cases. Scientific literature in digital form is an important source of information about methylated genes implicated in particular diseases. The large volume of electronic text makes it difficult and impractical to search for this information manually.</p> <p>Methodology</p><p>We developed a novel text mining methodology based on a new concept of position weight matrices (PWMs) for text representation and feature generation. We applied PWMs in conjunction with the document-term matrix to extract, with high accuracy, associations between methylated genes and diseases from free text. The performance results are based on a large, manually classified dataset. Additionally, we developed a web tool, DEMGD, which automates extraction of these associations from free text. DEMGD presents the extracted associations in summary tables and full reports, and tags evidence in the text with respect to genes, diseases and methylation words. The methodology we developed in this study can be applied to similar association extraction problems from free text.</p> <p>Conclusion</p><p>The new methodology developed in this study allows for efficient identification of associations between concepts. Our method, applied to methylated genes in different diseases, is implemented as a web tool, DEMGD, which is freely available at <a href="http://www.cbrc.kaust.edu.sa/demgd/" target="_blank">http://www.cbrc.kaust.edu.sa/demgd/</a>. The data is available for online browsing and download.</p> </div>

    Computing the scores.

    <p>The figure shows an example of a normalized PWM. To compute the score, we sum the weights of one word from each column. For example, the word ‘promoter’ appears in the first segment, so we take its weight from the first column of the PWM. The same step is applied to the second and third segments. Five words appear in the last segment, so we take the maximum of their weights. The score of the pattern is 0.2336+0.1619+0.1724+0.1315=0.6994.</p>
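The scoring rule described in this caption can be sketched in Python as follows. This is only an illustration under stated assumptions: the dictionary-per-column layout, the function name `score_pattern`, the placeholder words other than ‘promoter’, the four smaller weights in the last column, and the choice to give unknown words a weight of 0 are all hypothetical and not taken from the source.

```python
from typing import Dict, List

# Assumed layout: one dict per PWM column (segment), mapping word -> normalized weight.
PWM = List[Dict[str, float]]


def score_pattern(pwm: PWM, segments: List[List[str]]) -> float:
    """Sum, over all segments, the largest weight of any word in that segment."""
    total = 0.0
    for column, words in zip(pwm, segments):
        # Assumption: a word that is absent from the column contributes 0.
        weights = [column.get(word, 0.0) for word in words]
        # With a single word this is just its weight; with several, the maximum.
        total += max(weights, default=0.0)
    return total


# Toy PWM: 'promoter' and the four largest weights come from the caption;
# every other word and weight is a placeholder.
toy_pwm: PWM = [
    {"promoter": 0.2336},
    {"word_b": 0.1619},
    {"word_c": 0.1724},
    {"w1": 0.1315, "w2": 0.0900, "w3": 0.0500, "w4": 0.0200, "w5": 0.0100},
]
toy_segments = [
    ["promoter"],
    ["word_b"],
    ["word_c"],
    ["w1", "w2", "w3", "w4", "w5"],  # five words: only the maximum weight counts
]

score = score_pattern(toy_pwm, toy_segments)
print(round(score, 4))
```

Taking the maximum in every segment handles the one-word and many-word cases with the same code path, which keeps the sketch short.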

    PWM generation.

    <p>The PWM summarizes the frequency of words in each segment. For example, the words ‘CPG’ and ‘island’ appear in the first segment of the sentence, so the cells in the rows for these words and the first column are each incremented by one. The same step is applied to the words in the remaining three segments. The same matrix is then updated using other sentences with the same pattern order.</p>
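A minimal Python sketch of this counting step is given below. It assumes each sentence has already been split into four segments of word lists; the function name `build_pwm`, the word-to-row dictionary layout, and the placeholder words other than ‘CPG’ and ‘island’ are hypothetical. Every occurrence of a word is counted, which is one possible reading of the caption, and normalization is left out because the caption does not describe it.

```python
from collections import defaultdict
from typing import Dict, List


def build_pwm(sentences: List[List[List[str]]], n_segments: int = 4) -> Dict[str, List[int]]:
    """Count how often each word occurs in each segment position.

    `sentences` holds one entry per sentence; each entry is a list of
    `n_segments` word lists (an assumed pre-segmented input format).
    """
    counts: Dict[str, List[int]] = defaultdict(lambda: [0] * n_segments)
    for segments in sentences:
        for column, words in enumerate(segments):
            for word in words:
                # Row = word, column = segment: increment that cell by one.
                counts[word][column] += 1
    return dict(counts)


# Two toy sentences sharing the same pattern order; only 'CPG' and 'island'
# come from the caption, the other words are placeholders.
toy_sentences = [
    [["CPG", "island"], ["x"], ["y"], ["z"]],
    [["CPG"], ["x"], ["q"], ["z"]],
]
toy_counts = build_pwm(toy_sentences)
print(toy_counts["CPG"], toy_counts["island"])
```

Calling `build_pwm` once per pattern order, on the sentences that share that order, would give one count matrix per order.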

    Dataset representation using PWMs.

    <p>Each pattern in a sentence is represented with twelve features and a class label. The first six features correspond to the scores generated from the positive PWMs, and the following six features correspond to the scores generated from the negative PWMs.</p>
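The twelve-feature representation might be assembled as in the following Python sketch. It assumes that a pattern is scored against each of the six positive and six negative PWMs in turn; the names `score` and `featurize`, the dictionary-per-column PWM layout, the integer class label, and all toy weights are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed layout: one dict per PWM column (segment), mapping word -> weight.
PWM = List[Dict[str, float]]


def score(pwm: PWM, segments: List[List[str]]) -> float:
    """Sum the maximum word weight per segment (unknown words count as 0)."""
    return sum(
        max((column.get(word, 0.0) for word in words), default=0.0)
        for column, words in zip(pwm, segments)
    )


def featurize(
    segments: List[List[str]],
    positive_pwms: List[PWM],
    negative_pwms: List[PWM],
    label: int,
) -> Tuple[List[float], int]:
    """Return (twelve features, class label) for one pattern."""
    if len(positive_pwms) != 6 or len(negative_pwms) != 6:
        raise ValueError("expected six positive and six negative PWMs")
    # Features 1-6: scores from the positive PWMs; features 7-12: negative PWMs.
    features = [score(p, segments) for p in positive_pwms]
    features += [score(n, segments) for n in negative_pwms]
    return features, label


# Toy PWMs with made-up weights, four segments each.
toy_positive = [[{"gene": 0.1 * k}, {}, {}, {}] for k in range(1, 7)]
toy_negative = [[{"gene": 0.01 * k}, {}, {}, {}] for k in range(1, 7)]
features, label = featurize([["gene"], [], [], []], toy_positive, toy_negative, label=1)
print(len(features), label)
```

The resulting rows of twelve scores plus a label are the kind of table a standard classifier could be trained on.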