LHIP: Extended DCGs for Configurable Robust Parsing
We present LHIP, a system for incremental grammar development using an
extended DCG formalism. The system uses a robust island-based parsing method
controlled by user-defined performance thresholds. Comment: 10 pages, in Proc. Coling9
Apportioning Development Effort in a Probabilistic LR Parsing System through Evaluation
We describe an implemented system for robust domain-independent syntactic
parsing of English, using a unification-based grammar of part-of-speech and
punctuation labels coupled with a probabilistic LR parser. We present
evaluations of the system's performance along several different dimensions;
these enable us to assess the contribution that each individual part is making
to the success of the system as a whole, and thus prioritise the effort to be
devoted to its further enhancement. Currently, the system is able to parse
around 80% of sentences in a substantial corpus of general text containing a
number of distinct genres. On a random sample of 250 such sentences the system
has a mean crossing bracket rate of 0.71 and recall and precision of 83% and
84% respectively when evaluated against manually-disambiguated analyses. Comment: 10 pages, 1 Postscript figure. To appear in Proceedings of the
Conference on Empirical Methods in Natural Language Processing, University of
Pennsylvania, May 199
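The bracketing measures quoted above (crossing brackets, recall, precision) compare a parser's bracketing against a gold-standard analysis. A minimal illustration of how they are computed, using invented spans rather than the paper's data:

```python
def crossing(a, b):
    """True if spans a and b overlap without either containing the other."""
    (s1, e1), (s2, e2) = a, b
    return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1

def score(candidate, gold):
    """Unlabeled bracket precision/recall plus a crossing-bracket count."""
    cand, ref = set(candidate), set(gold)
    matched = cand & ref
    precision = len(matched) / len(cand)
    recall = len(matched) / len(ref)
    # candidate brackets that cross (overlap but do not nest with) gold ones
    crossings = sum(1 for c in cand if any(crossing(c, g) for g in ref))
    return precision, recall, crossings

# Toy example: spans are (start, end) word indices; the data is invented.
gold = [(0, 5), (0, 2), (2, 5)]
cand = [(0, 5), (0, 2), (1, 3)]
p, r, x = score(cand, gold)
```

Here the candidate recovers two of three gold brackets, and its third bracket (1, 3) crosses the gold bracket (0, 2).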
MBT: A Memory-Based Part of Speech Tagger-Generator
We introduce a memory-based approach to part of speech tagging. Memory-based
learning is a form of supervised learning based on similarity-based reasoning.
The part of speech tag of a word in a particular context is extrapolated from
the most similar cases held in memory. Supervised learning approaches are
useful when a tagged corpus is available as an example of the desired output of
the tagger. Based on such a corpus, the tagger-generator automatically builds a
tagger which is able to tag new text the same way, diminishing development time
for the construction of a tagger considerably. Memory-based tagging shares this
advantage with other statistical or machine learning approaches. Additional
advantages specific to a memory-based approach include (i) the relatively small
tagged corpus size sufficient for training, (ii) incremental learning, (iii)
explanation capabilities, (iv) flexible integration of information in case
representations, (v) its non-parametric nature, (vi) reasonably good results on
unknown words without morphological analysis, and (vii) fast learning and
tagging. In this paper we show that a large-scale application of the
memory-based approach is feasible: we obtain a tagging accuracy that is on a
par with that of known statistical approaches, and with attractive space and
time complexity properties when using {\em IGTree}, a tree-based formalism for
indexing and searching huge case bases. The use of IGTree has the additional
advantage that the optimal context size for disambiguation is computed dynamically. Comment: 14 pages, 2 Postscript figure
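The case-extrapolation step can be pictured with a toy sketch. This is not the authors' MBT or IGTree implementation: it is a plain feature-overlap 1-nearest-neighbour over an invented three-feature case representation (previous tag, focus word, next word).

```python
# Minimal sketch of memory-based tagging: store (context, tag) cases from
# a tagged corpus, then tag a new word by the most similar stored case
# under a simple feature-overlap metric. Features and data are illustrative.
from collections import Counter

def cases(tagged_sentence):
    """Turn a tagged sentence into (features, tag) cases."""
    out = []
    for i, (word, tag) in enumerate(tagged_sentence):
        prev_tag = tagged_sentence[i - 1][1] if i > 0 else "<s>"
        next_word = tagged_sentence[i + 1][0] if i + 1 < len(tagged_sentence) else "</s>"
        out.append(((prev_tag, word, next_word), tag))
    return out

def tag(memory, features):
    """1-nearest neighbour by feature overlap; ties broken by tag frequency."""
    def overlap(f):
        return sum(a == b for a, b in zip(f, features))
    best = max(overlap(f) for f, _ in memory)
    votes = Counter(t for f, t in memory if overlap(f) == best)
    return votes.most_common(1)[0][0]

memory = []
for sent in [[("the", "DET"), ("cat", "N"), ("sleeps", "V")],
             [("a", "DET"), ("dog", "N"), ("barks", "V")]]:
    memory.extend(cases(sent))

# Tag "dog" in "the dog sleeps": previous tag DET, next word "sleeps".
print(tag(memory, ("DET", "dog", "sleeps")))  # -> N
```

IGTree replaces this linear search with a decision-tree index over the same case base, which is what gives the attractive space and time complexity the abstract mentions.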
Cue Phrase Classification Using Machine Learning
Cue phrases may be used in a discourse sense to explicitly signal discourse
structure, but also in a sentential sense to convey semantic rather than
structural information. Correctly classifying cue phrases as discourse or
sentential is critical in natural language processing systems that exploit
discourse structure, e.g., for performing tasks such as anaphora resolution and
plan recognition. This paper explores the use of machine learning for
classifying cue phrases as discourse or sentential. Two machine learning
programs (Cgrendel and C4.5) are used to induce classification models from sets
of pre-classified cue phrases and their features in text and speech. Machine
learning is shown to be an effective technique for not only automating the
generation of classification models, but also for improving upon previous
results. When compared to manually derived classification models already in the
literature, the learned models often perform with higher accuracy and contain
new linguistic insights into the data. In addition, the ability to
automatically construct classification models makes it easier to comparatively
analyze the utility of alternative feature representations of the data.
Finally, the ease of retraining makes the learning approach more scalable and
flexible than manual methods. Comment: 42 pages, uses jair.sty, theapa.bst, theapa.st
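The kind of model C4.5 induces can be illustrated with a single entropy-minimizing split. The features ("pause", whether a prosodic pause precedes the cue, and "first", whether the cue is initial) and the toy training set below are invented for the illustration; the paper's actual feature representations differ.

```python
# Hedged sketch of one step of decision-tree induction: choose the feature
# whose split yields the lowest weighted entropy, then label each branch
# with its majority class (discourse vs. sentential).
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def stump(data, feature):
    """Split on one feature; return (weighted entropy, branch -> majority class)."""
    branches = {}
    for x, y in data:
        branches.setdefault(x[feature], []).append(y)
    h = sum(len(ys) / len(data) * entropy(ys) for ys in branches.values())
    rule = {v: Counter(ys).most_common(1)[0][0] for v, ys in branches.items()}
    return h, rule

# Toy pre-classified cue phrases (features are illustrative only).
data = [
    ({"pause": True,  "first": True},  "discourse"),
    ({"pause": True,  "first": False}, "discourse"),
    ({"pause": False, "first": True},  "sentential"),
    ({"pause": False, "first": False}, "sentential"),
    ({"pause": False, "first": True},  "sentential"),
]

best = min(["pause", "first"], key=lambda f: stump(data, f)[0])
_, rule = stump(data, best)  # here: split on "pause"
```

On this toy set the pause feature separates the classes perfectly, so the induced "model" is the single rule pause -> discourse, no pause -> sentential; C4.5 recurses on branches that remain impure.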
Korean Part-of-Speech Tagging Based on Syllables
With the rapid growth of the Internet, vast numbers of documents are being written on the bulletin boards, cafés, clubs, and blogs of the various portal sites. Personal blogs, for example, carry a wealth of postings on their authors' interests, and club boards receive postings related to each club's purpose every day. Analyzing and classifying these documents can turn them into valuable information for many more people, so the need for information processing such as document analysis and classification keeps growing. Accordingly, many researchers have studied, proposed, and deployed methods for analyzing and classifying documents more accurately (Manning et al., 2010). Among these methods, morphological analysis and part-of-speech tagging form the common lowest-level step of the various techniques that analyze and classify documents to exploit them as information.
Morphological analysis is the process of determining the transformations and segmentation boundaries of morphemes in an input document, and it is implemented to suit the characteristics of each language (Dale et al., 2000). Korean in particular exhibits a wide variety of morphological transformations arising from the combination of content morphemes and functional morphemes (…, 1996). For this reason, Korean morphological analyzers have a more complex structure than analyzers for languages such as English; morphological analysis of predicates involves highly complex handling of conjugation, irregular forms, and phonological alternations. Designing and implementing a morphological analyzer of such complexity demands intricate knowledge and a vast amount of dictionary information (Kim Jae-hoon and Lee Kong-joo, 2003). Moreover, because the implementation process is so demanding, maintaining an analyzer is in practice as hard as building one.
Some information-retrieval systems, however, extract and index only the nouns in a given sentence; depending on the application, not every kind of morphological analysis result is required. Part-of-speech tagging, in turn, selects from the multiple analyses produced by morphological analysis the one best suited to the given sentence, and is used across many applications.
To address these problems, there has been work on tagging Korean parts of speech at the syllable level (심광섭, 2011), but that approach has difficulty analyzing compound nouns, and because it relies on rules it suffers from rule ambiguity.
To solve these problems, this paper proposes a syllable-based part-of-speech tagging method that uses machine-learning techniques. Instead of performing morphological analysis with a language-processing system or large amounts of dictionary information, the method uses a machine-learning tool to build a model that assigns part-of-speech tags at the syllable level; it tags each syllable of an input sentence and marks eojeol (word-unit) boundaries, which also makes the analysis of compound nouns possible. A sentence with syllable-level tags is then passed through a syllable restorer, which recovers the surface forms altered by syllable transitions. Ambiguities that arise during syllable restoration are resolved with a Naïve Bayes classifier. Because the proposed morphological analysis and part-of-speech tagging rely on machine learning and are simple to implement, the system can be built in a short time, and it performs on a par with other, structurally more complex part-of-speech taggers.
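The Naïve Bayes disambiguation step can be sketched as follows. This is an illustrative reconstruction, not the thesis implementation: the context features, candidate restorations, and counts are invented for the example.

```python
# Sketch: when a tagged syllable admits more than one underlying
# restoration, pick the candidate with the highest Naive Bayes score
# over simple left/right context features (add-one smoothing).
from collections import Counter, defaultdict

class NaiveBayes:
    def __init__(self):
        self.class_counts = Counter()
        self.feat_counts = defaultdict(Counter)

    def fit(self, samples):
        """samples: iterable of (features, label) pairs."""
        for feats, label in samples:
            self.class_counts[label] += 1
            for f in feats:
                self.feat_counts[label][f] += 1

    def predict(self, feats):
        total = sum(self.class_counts.values())
        def score(label):
            p = self.class_counts[label] / total
            n = self.class_counts[label]
            for f in feats:
                p *= (self.feat_counts[label][f] + 1) / (n + 2)  # smoothing
            return p
        return max(self.class_counts, key=score)

# Toy ambiguity: surface syllable "해" restored as "하+어" (verb stem plus
# ending) or kept as the noun "해"; training pairs are invented.
train = [
    (("prev=을", "next=서"), "하+어"),
    (("prev=을", "next=요"), "하+어"),
    (("prev=뜬", "next=가"), "해"),
]
nb = NaiveBayes()
nb.fit(train)
print(nb.predict(("prev=을", "next=서")))  # -> 하+어
```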
This paper is organized as follows. Chapter 2 reviews existing morphological analysis and part-of-speech tagging methods and syllable-based language-processing methods, and Chapter 3 describes the construction and processing of the training corpus needed for machine learning. Chapter 4 discusses syllable-based morphological analysis using machine learning, and Chapter 5 evaluates the performance of the system implemented with the proposed method. Finally, Chapter 6 concludes and suggests directions for future work.
Chapter 1 Introduction
Chapter 2 Related Work
2.1 Morphological analysis and part-of-speech tagging
2.2 Korean morphological analysis methods
2.3 Korean part-of-speech tagging methods
2.4 Language processing using syllable information
2.4.1 Word segmentation and category determination
2.4.2 Korean part-of-speech tagging
2.4.3 Compound-noun decomposition
2.5 Korean part-of-speech tagging using CRFs
2.5.1 Syllable part-of-speech tagger
2.5.2 Base-form restoration using rules
2.5.3 Problems of the system
Chapter 3 Construction and Processing of the Training Corpus
3.1 Part-of-speech tag set
3.2 Composition of the training corpus
3.3 Building the training corpus
3.3.1 Alignment of eojeols with morphological-analysis results
3.3.2 Processing of the syllable corpus
Chapter 4 Syllable-Based Part-of-Speech Tagging Using Machine Learning
4.1 Syllable part-of-speech tagger
4.1.1 Feature extraction from the syllable tagging corpus
4.1.2 Machine-learning model
4.2 Syllable restorer
4.3 Morpheme restorer
4.4 Part-of-speech restorer
Chapter 5 Experiments and Evaluation
5.1 Machine-learning tools
5.2 Evaluation metrics
5.3 Performance evaluation
5.3.1 Overall system performance
5.3.2 Per-component performance
5.4 Error analysis
5.4.1 Errors in syllable part-of-speech tagging
5.4.2 Errors in syllable restoration
5.4.3 Errors in part-of-speech restoration
Chapter 6 Conclusion and Future Work
References
Appendix
The application of linguistic processing to automatic abstract generation
One approach to the problem of generating abstracts by computer is to extract from a source text those sentences which give a strong indication of the central subject matter and findings of the paper. Not surprisingly, concatenations of extracted sentences show a lack of cohesion, due partly to the frequent occurrence of anaphoric references. This paper describes the text processing which was necessary to identify these anaphors so that they may be utilised in the enhancement of the sentence selection criteria. It is assumed that sentences which contain non-anaphoric noun phrases and introduce key concepts into the text are worthy of inclusion in an abstract. The results suggest that the key concepts are indeed identified but the abstracts are too long. Further recommendations are made to continue this work in abstracting which makes use of text structure.
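The selection criterion described above (score sentences, but distrust those that open with anaphors) might be sketched as follows; the keyword weights and anaphor list are purely illustrative, not the paper's actual criteria.

```python
# Sketch of extraction-based abstracting with an anaphora filter:
# sentences beginning with an anaphoric reference would break cohesion
# if extracted, so they score zero.
ANAPHORS = {"it", "this", "these", "they", "such"}
KEYWORDS = {"results": 2.0, "method": 1.5, "corpus": 1.0}  # invented weights

def score(sentence):
    words = sentence.lower().rstrip(".").split()
    if words and words[0] in ANAPHORS:
        return 0.0  # anaphoric opener: unsafe to extract in isolation
    return sum(KEYWORDS.get(w, 0.0) for w in words)

def extract(sentences, k=2):
    ranked = sorted(sentences, key=score, reverse=True)
    chosen = [s for s in ranked[:k] if score(s) > 0]
    # keep original document order to preserve readability
    return [s for s in sentences if s in chosen]

doc = [
    "We evaluate the method on a large corpus.",
    "This gives strong results.",          # filtered: anaphoric opener
    "The results support the method.",
]
print(extract(doc))
```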