LHIP: Extended DCGs for Configurable Robust Parsing
We present LHIP, a system for incremental grammar development using an
extended DCG formalism. The system uses a robust island-based parsing method
controlled by user-defined performance thresholds. Comment: 10 pages, in Proc. Coling9
Apportioning Development Effort in a Probabilistic LR Parsing System through Evaluation
We describe an implemented system for robust domain-independent syntactic
parsing of English, using a unification-based grammar of part-of-speech and
punctuation labels coupled with a probabilistic LR parser. We present
evaluations of the system's performance along several different dimensions;
these enable us to assess the contribution that each individual part is making
to the success of the system as a whole, and thus prioritise the effort to be
devoted to its further enhancement. Currently, the system is able to parse
around 80% of sentences in a substantial corpus of general text containing a
number of distinct genres. On a random sample of 250 such sentences the system
has a mean crossing bracket rate of 0.71 and recall and precision of 83% and
84% respectively when evaluated against manually-disambiguated analyses. Comment: 10 pages, 1 Postscript figure. To appear in Proceedings of the
Conference on Empirical Methods in Natural Language Processing, University of
Pennsylvania, May 199
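The bracketing measures quoted above (crossing brackets, recall, precision) compare a parser's bracketing against a gold-standard analysis. A minimal illustration of how they are computed, using invented spans rather than the paper's data:

```python
def crossing(a, b):
    """True if spans a and b overlap without either containing the other."""
    (s1, e1), (s2, e2) = a, b
    return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1

def score(candidate, gold):
    """Unlabeled bracket precision/recall plus a crossing-bracket count."""
    cand, ref = set(candidate), set(gold)
    matched = cand & ref
    precision = len(matched) / len(cand)
    recall = len(matched) / len(ref)
    # candidate brackets that cross (overlap but do not nest with) gold ones
    crossings = sum(1 for c in cand if any(crossing(c, g) for g in ref))
    return precision, recall, crossings

# Toy example: spans are (start, end) word indices; the data is invented.
gold = [(0, 5), (0, 2), (2, 5)]
cand = [(0, 5), (0, 2), (1, 3)]
p, r, x = score(cand, gold)
```

Here the candidate recovers two of three gold brackets, and its third bracket (1, 3) crosses the gold bracket (0, 2).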
MBT: A Memory-Based Part of Speech Tagger-Generator
We introduce a memory-based approach to part of speech tagging. Memory-based
learning is a form of supervised learning based on similarity-based reasoning.
The part of speech tag of a word in a particular context is extrapolated from
the most similar cases held in memory. Supervised learning approaches are
useful when a tagged corpus is available as an example of the desired output of
the tagger. Based on such a corpus, the tagger-generator automatically builds a
tagger which is able to tag new text the same way, diminishing development time
for the construction of a tagger considerably. Memory-based tagging shares this
advantage with other statistical or machine learning approaches. Additional
advantages specific to a memory-based approach include (i) the relatively small
tagged corpus size sufficient for training, (ii) incremental learning, (iii)
explanation capabilities, (iv) flexible integration of information in case
representations, (v) its non-parametric nature, (vi) reasonably good results on
unknown words without morphological analysis, and (vii) fast learning and
tagging. In this paper we show that a large-scale application of the
memory-based approach is feasible: we obtain a tagging accuracy that is on a
par with that of known statistical approaches, and with attractive space and
time complexity properties when using {\em IGTree}, a tree-based formalism for
indexing and searching huge case bases. The use of IGTree has the additional
advantage that the optimal context size for disambiguation is computed dynamically. Comment: 14 pages, 2 Postscript figure
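The case-extrapolation step can be pictured with a toy sketch. This is not the authors' MBT or IGTree implementation: it is a plain feature-overlap 1-nearest-neighbour over an invented three-feature case representation (previous tag, focus word, next word).

```python
# Minimal sketch of memory-based tagging: store (context, tag) cases from
# a tagged corpus, then tag a new word by the most similar stored case
# under a simple feature-overlap metric. Features and data are illustrative.
from collections import Counter

def cases(tagged_sentence):
    """Turn a tagged sentence into (features, tag) cases."""
    out = []
    for i, (word, tag) in enumerate(tagged_sentence):
        prev_tag = tagged_sentence[i - 1][1] if i > 0 else "<s>"
        next_word = tagged_sentence[i + 1][0] if i + 1 < len(tagged_sentence) else "</s>"
        out.append(((prev_tag, word, next_word), tag))
    return out

def tag(memory, features):
    """1-nearest neighbour by feature overlap; ties broken by tag frequency."""
    def overlap(f):
        return sum(a == b for a, b in zip(f, features))
    best = max(overlap(f) for f, _ in memory)
    votes = Counter(t for f, t in memory if overlap(f) == best)
    return votes.most_common(1)[0][0]

memory = []
for sent in [[("the", "DET"), ("cat", "N"), ("sleeps", "V")],
             [("a", "DET"), ("dog", "N"), ("barks", "V")]]:
    memory.extend(cases(sent))

# Tag "dog" in "the dog sleeps": previous tag DET, next word "sleeps".
print(tag(memory, ("DET", "dog", "sleeps")))  # -> N
```

IGTree replaces this linear search with a decision-tree index over the same case base, which is what gives the attractive space and time complexity the abstract mentions.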
Cue Phrase Classification Using Machine Learning
Cue phrases may be used in a discourse sense to explicitly signal discourse
structure, but also in a sentential sense to convey semantic rather than
structural information. Correctly classifying cue phrases as discourse or
sentential is critical in natural language processing systems that exploit
discourse structure, e.g., for performing tasks such as anaphora resolution and
plan recognition. This paper explores the use of machine learning for
classifying cue phrases as discourse or sentential. Two machine learning
programs (Cgrendel and C4.5) are used to induce classification models from sets
of pre-classified cue phrases and their features in text and speech. Machine
learning is shown to be an effective technique for not only automating the
generation of classification models, but also for improving upon previous
results. When compared to manually derived classification models already in the
literature, the learned models often perform with higher accuracy and contain
new linguistic insights into the data. In addition, the ability to
automatically construct classification models makes it easier to comparatively
analyze the utility of alternative feature representations of the data.
Finally, the ease of retraining makes the learning approach more scalable and
flexible than manual methods. Comment: 42 pages, uses jair.sty, theapa.bst, theapa.st
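The kind of model C4.5 induces can be illustrated with a single entropy-minimizing split. The features ("pause", whether a prosodic pause precedes the cue, and "first", whether the cue is initial) and the toy training set below are invented for the illustration; the paper's actual feature representations differ.

```python
# Hedged sketch of one step of decision-tree induction: choose the feature
# whose split yields the lowest weighted entropy, then label each branch
# with its majority class (discourse vs. sentential).
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def stump(data, feature):
    """Split on one feature; return (weighted entropy, branch -> majority class)."""
    branches = {}
    for x, y in data:
        branches.setdefault(x[feature], []).append(y)
    h = sum(len(ys) / len(data) * entropy(ys) for ys in branches.values())
    rule = {v: Counter(ys).most_common(1)[0][0] for v, ys in branches.items()}
    return h, rule

# Toy pre-classified cue phrases (features are illustrative only).
data = [
    ({"pause": True,  "first": True},  "discourse"),
    ({"pause": True,  "first": False}, "discourse"),
    ({"pause": False, "first": True},  "sentential"),
    ({"pause": False, "first": False}, "sentential"),
    ({"pause": False, "first": True},  "sentential"),
]

best = min(["pause", "first"], key=lambda f: stump(data, f)[0])
_, rule = stump(data, best)  # here: split on "pause"
```

On this toy set the pause feature separates the classes perfectly, so the induced "model" is the single rule pause -> discourse, no pause -> sentential; C4.5 recurses on branches that remain impure.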
Korean Part-of-Speech Tagging Based on Syllables
With the rapid growth of the Internet, vast numbers of documents are being written on the bulletin boards, cafés, clubs, and blogs of the various portal sites. Personal blogs, for example, carry a wealth of postings on their authors' interests, and club boards receive postings related to each club's purpose every day. Analyzing and classifying these documents can turn them into valuable information for many more people, so the need for information processing such as document analysis and classification keeps growing. Accordingly, many researchers have studied, proposed, and deployed methods for analyzing and classifying documents more accurately (Manning et al., 2010). Among these methods, morphological analysis and part-of-speech tagging form the common lowest-level step of the various techniques that analyze and classify documents to exploit them as information.
Morphological analysis is the process of determining the transformations and segmentation boundaries of morphemes in an input document, and it is implemented to suit the characteristics of each language (Dale et al., 2000). Korean in particular exhibits a wide variety of morphological transformations arising from the combination of content morphemes and functional morphemes (…, 1996). For this reason, Korean morphological analyzers have a more complex structure than analyzers for languages such as English; morphological analysis of predicates involves highly complex handling of conjugation, irregular forms, and phonological alternations. Designing and implementing a morphological analyzer of such complexity demands intricate knowledge and a vast amount of dictionary information (Kim Jae-hoon and Lee Kong-joo, 2003). Moreover, because the implementation process is so demanding, maintaining an analyzer is in practice as hard as building one.
Some information-retrieval systems, however, extract and index only the nouns in a given sentence; depending on the application, not every kind of morphological analysis result is required. Part-of-speech tagging, in turn, selects from the multiple analyses produced by morphological analysis the one best suited to the given sentence, and is used across many applications.
To address these problems, there has been work on tagging Korean parts of speech at the syllable level (심광섭, 2011), but that approach has difficulty analyzing compound nouns, and because it relies on rules it suffers from rule ambiguity.
To solve these problems, this paper proposes a syllable-based part-of-speech tagging method that uses machine-learning techniques. Instead of performing morphological analysis with a language-processing system or large amounts of dictionary information, the method uses a machine-learning tool to build a model that assigns part-of-speech tags at the syllable level; it tags each syllable of an input sentence and marks eojeol (word-unit) boundaries, which also makes the analysis of compound nouns possible. A sentence with syllable-level tags is then passed through a syllable restorer, which recovers the surface forms altered by syllable transitions. Ambiguities that arise during syllable restoration are resolved with a Naïve Bayes classifier. Because the proposed morphological analysis and part-of-speech tagging rely on machine learning and are simple to implement, the system can be built in a short time, and it performs on a par with other, structurally more complex part-of-speech taggers.
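The Naïve Bayes disambiguation step can be sketched as follows. This is an illustrative reconstruction, not the thesis implementation: the context features, candidate restorations, and counts are invented for the example.

```python
# Sketch: when a tagged syllable admits more than one underlying
# restoration, pick the candidate with the highest Naive Bayes score
# over simple left/right context features (add-one smoothing).
from collections import Counter, defaultdict

class NaiveBayes:
    def __init__(self):
        self.class_counts = Counter()
        self.feat_counts = defaultdict(Counter)

    def fit(self, samples):
        """samples: iterable of (features, label) pairs."""
        for feats, label in samples:
            self.class_counts[label] += 1
            for f in feats:
                self.feat_counts[label][f] += 1

    def predict(self, feats):
        total = sum(self.class_counts.values())
        def score(label):
            p = self.class_counts[label] / total
            n = self.class_counts[label]
            for f in feats:
                p *= (self.feat_counts[label][f] + 1) / (n + 2)  # smoothing
            return p
        return max(self.class_counts, key=score)

# Toy ambiguity: surface syllable "해" restored as "하+어" (verb stem plus
# ending) or kept as the noun "해"; training pairs are invented.
train = [
    (("prev=을", "next=서"), "하+어"),
    (("prev=을", "next=요"), "하+어"),
    (("prev=뜬", "next=가"), "해"),
]
nb = NaiveBayes()
nb.fit(train)
print(nb.predict(("prev=을", "next=서")))  # -> 하+어
```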
This paper is organized as follows. Chapter 2 reviews existing morphological analysis and part-of-speech tagging methods and syllable-based language-processing methods, and Chapter 3 describes the construction and processing of the training corpus needed for machine learning. Chapter 4 discusses syllable-based morphological analysis using machine learning, and Chapter 5 evaluates the performance of the system implemented with the proposed method. Finally, Chapter 6 concludes and suggests directions for future work.
Chapter 1 Introduction
Chapter 2 Related Work
2.1 Morphological analysis and part-of-speech tagging
2.2 Korean morphological analysis methods
2.3 Korean part-of-speech tagging methods
2.4 Language processing using syllable information
2.4.1 Word segmentation and category determination
2.4.2 Korean part-of-speech tagging
2.4.3 Compound-noun decomposition
2.5 Korean part-of-speech tagging using CRFs
2.5.1 Syllable part-of-speech tagger
2.5.2 Base-form restoration using rules
2.5.3 Problems of the system
Chapter 3 Construction and Processing of the Training Corpus
3.1 Part-of-speech tag set
3.2 Composition of the training corpus
3.3 Building the training corpus
3.3.1 Alignment of eojeols with morphological-analysis results
3.3.2 Processing of the syllable corpus
Chapter 4 Syllable-Based Part-of-Speech Tagging Using Machine Learning
4.1 Syllable part-of-speech tagger
4.1.1 Feature extraction from the syllable tagging corpus
4.1.2 Machine-learning model
4.2 Syllable restorer
4.3 Morpheme restorer
4.4 Part-of-speech restorer
Chapter 5 Experiments and Evaluation
5.1 Machine-learning tools
5.2 Evaluation metrics
5.3 Performance evaluation
5.3.1 Overall system performance
5.3.2 Per-component performance
5.4 Error analysis
5.4.1 Errors in syllable part-of-speech tagging
5.4.2 Errors in syllable restoration
5.4.3 Errors in part-of-speech restoration
Chapter 6 Conclusion and Future Work
References
Appendix
The application of linguistic processing to automatic abstract generation
One approach to the problem of generating abstracts by computer is to extract from a source text those sentences which give a strong indication of the central subject matter and findings of the paper. Not surprisingly, concatenations of extracted sentences show a lack of cohesion, due partly to the frequent occurrence of anaphoric references. This paper describes the text processing which was necessary to identify these anaphors so that they may be utilised in the enhancement of the sentence selection criteria. It is assumed that sentences which contain non-anaphoric noun phrases and introduce key concepts into the text are worthy of inclusion in an abstract. The results suggest that the key concepts are indeed identified but the abstracts are too long. Further recommendations are made to continue this work in abstracting which makes use of text structure.
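The selection criterion described above (score sentences, but distrust those that open with anaphors) might be sketched as follows; the keyword weights and anaphor list are purely illustrative, not the paper's actual criteria.

```python
# Sketch of extraction-based abstracting with an anaphora filter:
# sentences beginning with an anaphoric reference would break cohesion
# if extracted, so they score zero.
ANAPHORS = {"it", "this", "these", "they", "such"}
KEYWORDS = {"results": 2.0, "method": 1.5, "corpus": 1.0}  # invented weights

def score(sentence):
    words = sentence.lower().rstrip(".").split()
    if words and words[0] in ANAPHORS:
        return 0.0  # anaphoric opener: unsafe to extract in isolation
    return sum(KEYWORDS.get(w, 0.0) for w in words)

def extract(sentences, k=2):
    ranked = sorted(sentences, key=score, reverse=True)
    chosen = [s for s in ranked[:k] if score(s) > 0]
    # keep original document order to preserve readability
    return [s for s in sentences if s in chosen]

doc = [
    "We evaluate the method on a large corpus.",
    "This gives strong results.",          # filtered: anaphoric opener
    "The results support the method.",
]
print(extract(doc))
```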