Learning Language from a Large (Unannotated) Corpus
A novel approach to the fully automated, unsupervised extraction of
dependency grammars and associated syntax-to-semantic-relationship mappings
from large text corpora is described. The suggested approach builds on the
authors' prior work with the Link Grammar, RelEx and OpenCog systems, as well
as on a number of prior papers and approaches from the statistical language
learning literature. If successful, this approach would enable the mining of
all the information needed to power a natural language comprehension and
generation system, directly from a large, unannotated corpus.
Comment: 29 pages, 5 figures, research proposal
Evolution of Efficient Symbolic Communication Codes
The paper explores how the structure of human natural language can be seen as a
product of the evolution of an inter-personal communication code, targeting
maximisation of culture-agnostic and cross-lingual metrics such as
anti-entropy, compression factor and cross-split F1 score. The exploration is
done as part of a larger unsupervised language learning effort; an attempt is
made to perform meta-learning in a space of hyper-parameters, maximising F1
score based on the "ground truth" language structure by means of maximising
the metrics mentioned above. The paper presents preliminary results of a
cross-lingual word-level segmentation or tokenisation study for Russian, Chinese
and English, as well as a subword segmentation or morphological parsing study for
English. It is found that language structure at the level of word segmentation
or tokenisation is driven by all of these metrics, with anti-entropy
being more relevant to English and Russian and compression factor more
specific to Chinese. The study of subword segmentation or morphological
parsing on the English lexicon revealed a direct association with
compression factor while, surprisingly, the same association with
anti-entropy turned out to be the inverse.
Comment: 9 pages, 6 figures