
2 research outputs found

Topic modeling of Russian-language texts based on lemmas and lexical constructions

Author: Sedova Anastasiia
Седова Анастасия Георгиевна
Publication date: 01/01/2017

This graduation qualification paper is devoted to improving probabilistic topic modeling methods aimed at uncovering latent relations between words, documents, and topics in text collections. In most topic models, topics are represented exclusively by unigrams, which in some cases reduces accuracy and makes the extracted topics harder to interpret. We propose a new algorithm based on the classic LDA model that automatically extracts two-word collocations (bigrams) from a corpus and adds them to the topic models. The practical part of the study describes the algorithm in operation and presents the results of applying it to two Russian-language corpora: a corpus of texts on radio electronics, rocket engineering, and technology, and a corpus of texts on linguistics.
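The abstract does not say how bigrams are selected or how they enter the model, so the following is only a minimal sketch of one common preprocessing approach: count adjacent token pairs and merge the frequent ones into single tokens before the documents are passed to a topic model. The function name `merge_frequent_bigrams`, the `min_count` threshold, and the raw-frequency criterion are all assumptions made for illustration; the LDA step itself is omitted.

```python
from collections import Counter


def merge_frequent_bigrams(docs, min_count=2):
    """Merge adjacent token pairs seen at least `min_count` times.

    `docs` is a list of token lists (assumed to be lemmatized already).
    The raw-frequency criterion is an illustrative assumption, not the
    selection rule used in the thesis.
    """
    # Count every adjacent pair across the whole collection.
    pair_counts = Counter()
    for tokens in docs:
        pair_counts.update(zip(tokens, tokens[1:]))
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}

    merged_docs = []
    for tokens in docs:
        out, i = [], 0
        while i < len(tokens):
            # Greedy left-to-right merge of a frequent pair into one token.
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in frequent:
                out.append(tokens[i] + "_" + tokens[i + 1])
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        merged_docs.append(out)
    return merged_docs


docs = [
    ["topic", "model", "find", "latent", "topic"],
    ["topic", "model", "use", "word", "count"],
]
merged = merge_frequent_bigrams(docs, min_count=2)
```

The merged token lists can then be fed to any LDA implementation, so that a topic may contain entries such as `topic_model` alongside unigrams.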

Saint Petersburg State University

Automated Topic Naming: Supporting Cross-project Analysis of Software Maintenance Activities (Empirical Software Engineering, manuscript draft)

Authors: Abram Hindle, Ph.D.; Michael W. Godfrey, Ph.D.; Neil A. Ernst, Ph.D.
Si Msr

Software repositories provide a deluge of software artifacts to analyze. Researchers have attempted to summarize, categorize, and relate these artifacts by using semi-unsupervised machine-learning algorithms, such as Latent Dirichlet Allocation (LDA), which are used in concept and topic analysis to suggest candidate word-lists or topics that describe and relate software artifacts. However, these word-lists and topics are difficult to interpret in the absence of meaningful summary labels. Current topic modeling techniques assume manual labelling and do not use domain-specific knowledge to improve, contextualize, or describe results for developers. We propose a solution: automated labelled topic extraction. Topics are extracted using LDA from commit-log comments recovered from source control systems. These topics are given labels from a generalizable cross-project taxonomy consisting of non-functional requirements. Our approach was evaluated with experiments and case studies on three large-scale Relational Database Management System (RDBMS) projects: MySQL, PostgreSQL and MaxDB. The case studies show that labelled topic extraction can produce appropriate, context-sensitive labels that are relevant to these projects, and provid
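As a rough illustration of the labelling idea described in this abstract, the sketch below assigns non-functional-requirement labels to a topic's word list by keyword overlap. The keyword sets in `NFR_KEYWORDS`, the function name `label_topic`, and the any-overlap matching rule are hypothetical and chosen only for this example; the abstract does not specify the taxonomy's contents or the matching procedure.

```python
# Hypothetical keyword lists per non-functional requirement (assumption:
# these are illustrative and are not the taxonomy used in the paper).
NFR_KEYWORDS = {
    "portability": {"port", "platform", "windows", "linux"},
    "efficiency": {"performance", "speed", "cache", "memory"},
    "reliability": {"crash", "fail", "error", "recover"},
}


def label_topic(topic_words, taxonomy=NFR_KEYWORDS):
    """Return the sorted labels whose keywords overlap the topic's words."""
    words = set(topic_words)
    return sorted(label for label, keywords in taxonomy.items() if words & keywords)


# A topic as it might come out of LDA on commit-log comments (made-up words).
topic = ["fix", "crash", "windows", "build"]
labels = label_topic(topic)
```

A topic with no overlapping keywords simply receives an empty label list, which leaves it open to manual inspection.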

CiteSeerX