The structure of verbal sequences analyzed with unsupervised learning techniques
Data mining allows the exploration of sequences of phenomena, whereas one
usually tends to focus on isolated phenomena or on the relation between two
phenomena. It offers invaluable tools for theoretical analyses and exploration
of the structure of sentences, texts, dialogues, and speech. We report here the
results of an attempt at using it for inspecting sequences of verbs from French
accounts of road accidents. This analysis relies on an original unsupervised
learning approach that allows the discovery of the structure of sequential
data. The input to the analyzer consisted only of the verbs appearing in the
sentences. It provided a classification of the links between two successive
verbs into four distinct clusters, thus allowing text segmentation. We give
here an interpretation of these clusters by applying a statistical analysis to
independent semantic annotations.
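As a rough illustration of the input described above, the following sketch reduces one account to its verb sequence and enumerates the links between successive verbs. The function name `successive_links` and the sample verbs are hypothetical. The clustering of links into four classes reported in the abstract is not reproduced here.

```python
from collections import Counter


def successive_links(verbs):
    """Return the (previous, next) pair for every two successive verbs."""
    return list(zip(verbs, verbs[1:]))


# Hypothetical verb sequence for one account (illustrative data only).
account = ["drive", "turn", "brake", "hit", "stop"]

links = successive_links(account)

# Counting links across many accounts would give the raw material that a
# clustering step could later group; here there is a single account.
link_counts = Counter(links)
```

A sequence of n verbs yields n - 1 links, so segment boundaries can only fall between two verbs that appear in the input.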
Ordering-sensitive and Semantic-aware Topic Modeling
Topic modeling of textual corpora is an important and challenging problem.
Most previous work makes the "bag-of-words" assumption, which ignores
the ordering of words. This assumption simplifies the computation, but it
unrealistically discards the ordering information and the semantics of words in
context. In this paper, we present a Gaussian Mixture Neural Topic Model
(GMNTM) which incorporates both the ordering of words and the semantic meaning
of sentences into topic modeling. Specifically, we represent each topic as a
cluster of multi-dimensional vectors and embed the corpus into a collection of
vectors generated by the Gaussian mixture model. Each word is affected not only
by its topic, but also by the embedding vector of its surrounding words and the
context. The Gaussian mixture components and the topic of documents, sentences
and words can be learnt jointly. Extensive experiments show that our model can
learn better topics and more accurate word distributions for each topic.
Quantitatively, compared to state-of-the-art topic modeling approaches, GMNTM
obtains significantly better performance in terms of perplexity, retrieval
accuracy and classification accuracy.
Comment: To appear in proceedings of AAAI 201
Context and Keyword Extraction in Plain Text Using a Graph Representation
Document indexing is an essential task performed by archivists or automatic
indexing tools. To retrieve the documents relevant to a query, the keywords
describing each document have to be carefully chosen. Archivists have to find out the
right topic of a document before starting to extract the keywords. For an
archivist indexing specialized documents, experience plays an important role.
But indexing documents on different topics is much harder. This article
proposes an innovative method for an indexing support system. This system takes
as input an ontology and a plain text document and provides as output
contextualized keywords of the document. The method has been evaluated by
exploiting Wikipedia's category links as a termino-ontological resource
- …
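The input/output contract described above (an ontology and a plain text document in, contextualized keywords out) can be sketched as follows. This is only a simplified stand-in: the ontology is a hypothetical flat mapping from term to category, the context is chosen by a simple majority vote, and the graph representation named in the title is not built.

```python
from collections import Counter


def contextual_keywords(text, ontology):
    """Pick the dominant category of the text, then return that
    category together with the matched terms that belong to it."""
    tokens = text.lower().split()
    hits = [t for t in tokens if t in ontology]
    if not hits:
        return None, []
    # Majority vote over the categories of the matched terms.
    context = Counter(ontology[t] for t in hits).most_common(1)[0][0]
    keywords = sorted({t for t in hits if ontology[t] == context})
    return context, keywords


# Hypothetical toy ontology: term -> category (illustrative only).
ontology = {
    "neuron": "biology",
    "synapse": "biology",
    "cell": "biology",
    "network": "computing",
    "server": "computing",
}

text = "the neuron sends a signal across the synapse to another cell in the network"
context, keywords = contextual_keywords(text, ontology)
```

In this toy run the ambiguous term "network" is dropped because the surrounding terms point to a different context, which is the sense in which the returned keywords are contextualized.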