Document boundary determination using structural and lexical analysis
This thesis proposes a method for determining document boundaries in a sequentially presented stream of document fragments, using parallel analyses drawn from several facets of structural document understanding and information retrieval. Specifically, the method is intended to serve as a trainable system for determining where one document ends and another begins. Content analysis methods include the Vector Space Model, as well as targeted analysis of content on the margins of document fragments. Structural analysis in this implementation is limited to simple and ubiquitous entities, such as software-generated zones, simple format-specific lines, and the appearance of page numbers. The analysis focuses on the change in similarity between comparisons, emphasising that the extremities of documents tend to contain significant structural and lexical changes that can be observed and quantified. We combine the various features using nonlinear approximation (a neural network) and experimentally test the usefulness of the combinations.
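The Vector Space Model component mentioned in the abstract can be illustrated with a minimal sketch. It compares adjacent fragments by the cosine similarity of their term-frequency vectors and flags a possible boundary where similarity drops. The function names, the tokenisation, and the `threshold` value are assumptions made for illustration; the thesis itself combines several features with a neural network rather than applying a fixed cut-off.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector for one fragment (lowercased tokens)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def candidate_boundaries(pages, threshold=0.2):
    """Return indices i where a new document may start at pages[i].

    threshold is a hypothetical cut-off, not a value taken from the thesis.
    """
    vectors = [term_vector(p) for p in pages]
    return [
        i
        for i in range(1, len(pages))
        if cosine_similarity(vectors[i - 1], vectors[i]) < threshold
    ]
```

In a trainable system of the kind the abstract describes, the similarity value would more plausibly be one input feature among several rather than a decision on its own.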
Improving Statistical Language Model Performance with Automatically Generated Word Hierarchies
An automatic word classification system has been designed which processes
word unigram and bigram frequency statistics extracted from a corpus of natural
language utterances. The system implements a binary top-down form of word
clustering which employs an average class mutual information metric. Resulting
classifications are hierarchical, allowing variable class granularity. Words
are represented as structural tags --- unique n-bit numbers the most
significant bit-patterns of which incorporate class information. Access to a
structural tag immediately provides access to all classification levels for the
corresponding word. The classification system has successfully revealed some of
the structure of English, from the phonemic to the semantic level. The system
has been compared --- directly and indirectly --- with other recent word
classification systems. Class based interpolated language models have been
constructed to exploit the extra information supplied by the classifications
and some experiments have shown that the new models improve model performance. Comment: 17-page paper. Self-extracting PostScript file.
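The structural-tag idea in the abstract (fixed-width bit numbers whose most significant bits carry class information) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the 16-bit width, the function names, and the example tags are all invented here.

```python
N_BITS = 16  # hypothetical tag width; the abstract does not state the value


def class_at_level(tag, level, n_bits=N_BITS):
    """Return the class id of a tag at a given depth of the binary hierarchy.

    The top `level` bits of the tag identify the class; level 0 is the root
    class that contains every word.
    """
    if not 0 <= level <= n_bits:
        raise ValueError("level must be between 0 and n_bits")
    return tag >> (n_bits - level)


def same_class(tag_a, tag_b, level, n_bits=N_BITS):
    """True if two words share a class at the given granularity."""
    return class_at_level(tag_a, level, n_bits) == class_at_level(tag_b, level, n_bits)


# Made-up tags for illustration only; they are not taken from the paper.
tags = {
    "monday": 0b1010_0000_0000_0001,
    "tuesday": 0b1010_0000_0000_0010,
    "run": 0b0110_0000_0000_0001,
}
```

Because every coarser class is just a shorter bit prefix, a single stored tag gives access to all classification levels for a word, which is the property the abstract highlights.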
The Phenomenon of Syncretism in Ukrainian Linguistics
In modern linguistics, the study of complex systemic relations and the dynamism of language is unlikely to be complete without taking syncretism into account. Traditionally, transitivity phenomena are treated as a combination of different types of entities, formed as a result of transformation processes or as a reflection of intermediate, syncretic facts that characterize the language system in its synchronic aspect.
Automatic Population of Structured Reports from Narrative Pathology Reports
Structured pathology reports have a number of advantages: they can ensure the accuracy and completeness of pathology reporting, and they make it easier for referring doctors to glean pertinent information. The goal of this thesis is to extract pertinent information from free-text pathology reports, automatically populate structured reports for cancers, and identify the commonalities and differences in processing principles needed to obtain maximum accuracy. Three pathology corpora were annotated with entities and the relationships between them: the melanoma corpus, the colorectal cancer corpus and the lymphoma corpus. A supervised machine-learning-based approach, utilising conditional random field learners, was developed to recognise medical entities in the corpora. Through feature engineering, the best feature configurations were attained, which boosted the F-scores significantly from 4.2% to 6.8% on the training sets. Without proper negation and uncertainty detection, the quality of the structured reports would be diminished, so negation and uncertainty detection modules were built to handle this problem. The modules obtained overall F-scores ranging from 76.6% to 91.0% on the test sets. A relation extraction system was presented to extract four relations from the lymphoma corpus. The system achieved very good performance on the training set, with a 100% F-score obtained by the rule-based module and a 97.2% F-score attained by the support vector machine classifier. Rule-based approaches were used to generate the structured outputs and populate predefined templates with them. The rule-based system attained F-scores of over 97% on the training sets. A pipeline system was implemented by assembling all the components described above. It achieved promising results in the end-to-end evaluations, with F-scores of 86.5%, 84.2% and 78.9% on the melanoma, colorectal cancer and lymphoma test sets respectively.
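Since every result in the abstract is reported as an F-score, a short sketch of how such a score is computed for entity recognition may help. It assumes exact matching of (start, end, label) spans between gold and predicted annotations; the thesis may use a different matching criterion, and the function name and span representation are illustrative only.

```python
def f_score(gold, predicted, beta=1.0):
    """F-beta score over sets of annotations, using exact matching.

    gold and predicted are iterables of hashable items, for example
    (start, end, label) tuples. With beta=1.0 this is the harmonic mean
    of precision and recall.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With two correct spans out of four predicted and three gold spans, precision is 0.5 and recall is 2/3, which gives an F-score of 4/7.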