9,328 research outputs found

    Document boundary determination using structural and lexical analysis

    This thesis proposes a method for determining the boundaries of sequentially presented documents, using parallel analyses drawn from structural document understanding and information retrieval. Specifically, the method is intended to serve as a trainable system for determining where one document ends and another begins. Content analysis methods include the Vector Space Model as well as targeted analysis of content at the margins of document fragments. Structural analysis in this implementation is limited to simple and ubiquitous entities, such as software-generated zones, simple format-specific lines, and the appearance of page numbers. The analysis focuses on the change in similarity between comparisons, on the premise that the extremities of documents tend to contain significant structural and lexical changes that can be observed and quantified. The various features are combined using nonlinear approximation (a neural network), and the usefulness of the combinations is tested experimentally.
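    As a rough illustration of the Vector Space Model feature mentioned in this abstract, the sketch below scores adjacent page fragments by cosine similarity and flags a possible boundary wherever the score falls below a cutoff. The function names, the tokenizer, and the `threshold` value are all assumptions made for this example; the thesis itself combines several features with a neural network rather than applying a fixed threshold.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector for one page fragment."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def adjacent_similarities(pages):
    """Similarity between each page and the page that follows it."""
    vectors = [term_vector(p) for p in pages]
    return [
        cosine_similarity(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]


def candidate_boundaries(pages, threshold=0.2):
    """Indices i where a boundary is suspected between pages[i] and pages[i + 1].

    The threshold is a hypothetical value chosen for illustration only.
    """
    return [
        i for i, score in enumerate(adjacent_similarities(pages))
        if score < threshold
    ]


pages = [
    "invoice total amount due payment",
    "payment due invoice amount remit",
    "dear sir regarding your application letter",
]
sims = adjacent_similarities(pages)
boundaries = candidate_boundaries(pages)
```

    In a trainable system of the kind described, scores like these would be one input among several (structural cues, page numbers) rather than a decision rule on their own.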

    Improving Statistical Language Model Performance with Automatically Generated Word Hierarchies

    An automatic word classification system has been designed which processes word unigram and bigram frequency statistics extracted from a corpus of natural language utterances. The system implements a binary top-down form of word clustering which employs an average class mutual information metric. The resulting classifications are hierarchical, allowing variable class granularity. Words are represented as structural tags: unique n-bit numbers whose most significant bit patterns incorporate class information. Access to a structural tag immediately provides access to all classification levels for the corresponding word. The classification system has successfully revealed some of the structure of English, from the phonemic to the semantic level. The system has been compared, directly and indirectly, with other recent word classification systems. Class-based interpolated language models have been constructed to exploit the extra information supplied by the classifications, and some experiments have shown that the new models improve model performance. Comment: 17 Page Paper. Self-extracting PostScript Fil
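    The structural-tag idea in this abstract can be illustrated with a small sketch: if each binary split of a top-down clustering contributes one bit, starting from the most significant, then the class of a word at any depth is simply a prefix of its tag. The tag width, the example words, and their bit patterns below are invented for illustration and are not taken from the paper.

```python
N_BITS = 8  # hypothetical tag width; the paper only says "n-bit"

# Hypothetical tags: the leading bits encode the path through the binary hierarchy.
TAGS = {
    "monday": 0b10110010,
    "tuesday": 0b10110111,
    "run": 0b01100001,
}


def class_at_level(tag, level, n_bits=N_BITS):
    """Class identifier at a given depth: the `level` most significant bits."""
    if not 0 <= level <= n_bits:
        raise ValueError("level must be between 0 and n_bits")
    return tag >> (n_bits - level)


def shared_depth(tag_a, tag_b, n_bits=N_BITS):
    """Deepest level at which two tags still fall in the same class."""
    diff = tag_a ^ tag_b
    return n_bits - diff.bit_length()


coarse_monday = class_at_level(TAGS["monday"], 4)
coarse_tuesday = class_at_level(TAGS["tuesday"], 4)
depth_days = shared_depth(TAGS["monday"], TAGS["tuesday"])
depth_mixed = shared_depth(TAGS["monday"], TAGS["run"])
```

    Because a single shift recovers the class at any level, one stored integer per word is enough to support variable class granularity.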

    The Phenomenon of Syncretism in Ukrainian Linguistics

    In modern linguistics, the study of complex systemic relations and language dynamism is unlikely to be complete without considering syncretism. Traditionally, transitivity phenomena are treated as a combination of different types of entities formed as a result of transformation processes, or as the reflection of intermediate, syncretic facts that characterize the language system in its synchronic aspect.

    Automatic Population of Structured Reports from Narrative Pathology Reports

    Structured pathology reports have a number of advantages: they can ensure the accuracy and completeness of pathology reporting, and they make it easier for referring doctors to glean pertinent information. The goal of this thesis is to extract pertinent information from free-text pathology reports, automatically populate structured reports for cancer diseases, and identify the commonalities and differences in processing principles needed to obtain maximum accuracy. Three pathology corpora were annotated with entities and the relationships between them: the melanoma corpus, the colorectal cancer corpus and the lymphoma corpus. A supervised machine-learning-based approach, utilising conditional random fields learners, was developed to recognise medical entities in the corpora. Through feature engineering, the best feature configurations were attained, which boosted the F-scores significantly from 4.2% to 6.8% on the training sets. Without proper negation and uncertainty detection, the quality of the structured reports would be diminished, so negation and uncertainty detection modules were built to handle this problem. The modules obtained overall F-scores ranging from 76.6% to 91.0% on the test sets. A relation extraction system was presented to extract four relations from the lymphoma corpus. The system achieved very good performance on the training set, with a 100% F-score obtained by the rule-based module and a 97.2% F-score attained by the support vector machines classifier. Rule-based approaches were used to generate the structured outputs and populate predefined templates with them. The rule-based system attained F-scores of over 97% on the training sets. A pipeline system was implemented as an assembly of all the components described above. It achieved promising results in the end-to-end evaluations, with F-scores of 86.5%, 84.2% and 78.9% on the melanoma, colorectal cancer and lymphoma test sets respectively.