2,736 research outputs found
From Frequency to Meaning: Vector Space Models of Semantics
Computers understand very little of the meaning of human language. This
profoundly limits our ability to give instructions to computers, the ability of
computers to explain their actions to us, and the ability of computers to
analyse and process text. Vector space models (VSMs) of semantics are beginning
to address these limits. This paper surveys the use of VSMs for semantic
processing of text. We organize the literature on VSMs according to the
structure of the matrix in a VSM. There are currently three broad classes of
VSMs, based on term-document, word-context, and pair-pattern matrices, yielding
three classes of applications. We survey a broad range of applications in these
three categories and we take a detailed look at a specific open source project
in each category. Our goal in this survey is to show the breadth of
applications of VSMs for semantics, to provide a new perspective on VSMs for
those who are already familiar with the area, and to provide pointers into the
literature for those who are less familiar with the field
MBT: A Memory-Based Part of Speech Tagger-Generator
We introduce a memory-based approach to part of speech tagging. Memory-based
learning is a form of supervised learning based on similarity-based reasoning.
The part of speech tag of a word in a particular context is extrapolated from
the most similar cases held in memory. Supervised learning approaches are
useful when a tagged corpus is available as an example of the desired output of
the tagger. Based on such a corpus, the tagger-generator automatically builds a
tagger which is able to tag new text the same way, diminishing development time
for the construction of a tagger considerably. Memory-based tagging shares this
advantage with other statistical or machine learning approaches. Additional
advantages specific to a memory-based approach include (i) the relatively small
tagged corpus size sufficient for training, (ii) incremental learning, (iii)
explanation capabilities, (iv) flexible integration of information in case
representations, (v) its non-parametric nature, (vi) reasonably good results on
unknown words without morphological analysis, and (vii) fast learning and
tagging. In this paper we show that a large-scale application of the
memory-based approach is feasible: we obtain a tagging accuracy that is on a
par with that of known statistical approaches, and with attractive space and
time complexity properties when using {\em IGTree}, a tree-based formalism for
indexing and searching huge case bases.} The use of IGTree has as additional
advantage that optimal context size for disambiguation is dynamically computed.Comment: 14 pages, 2 Postscript figure
Information Retrieval with Finnish Case Law Embeddings
In this work, five text vectorisation models' capability in embedding Finnish case law texts to vector space for inter-textual similarity computation is studied. The embeddings and their computed similarities are used to create a Finnish case law retrieval system that allows effective querying with full documents.
A working web application is presented as a part of the work. The case law data for the work is provided by the Finnish Ministry of Justice, and the studied models are: TF-IDF, LDA, Word2Vec, Doc2Vec and Doc2vecC
Acquisition of morphological families and derivational series from a machine readable dictionary
The paper presents a linguistic and computational model aiming at making the
morphological structure of the lexicon emerge from the formal and semantic
regularities of the words it contains. The model is word-based. The proposed
morphological structure consists of (1) binary relations that connect each
headword with words that are morphologically related, and especially with the
members of its morphological family and its derivational series, and of (2) the
analogies that hold between the words. The model has been tested on the lexicon
of French using the TLFi machine readable dictionary.Comment: proceedings of the 6th D\'ecembrette
Access to recorded interviews: A research agenda
Recorded interviews form a rich basis for scholarly inquiry. Examples include oral histories, community memory projects, and interviews conducted for broadcast media. Emerging technologies offer the potential to radically transform the way in which recorded interviews are made accessible, but this vision will demand substantial investments from a broad range of research communities. This article reviews the present state of practice for making recorded interviews available and the state-of-the-art for key component technologies. A large number of important research issues are identified, and from that set of issues, a coherent research agenda is proposed
- âŚ