
    Topic Modelling of Russian-Language Texts Based on Lemmas and Lexical Constructions

    This graduation thesis is devoted to improving probabilistic topic modelling methods, which aim to uncover latent relations between words, documents and topics in text collections. In most topic models, topics are represented exclusively by unigrams, which in some cases reduces accuracy and makes the extracted topics harder to interpret. We propose a new algorithm based on the LDA method that automatically extracts two-word collocations (bigrams) from a corpus and incorporates them into the topic models. The practical part of the study describes the algorithm in operation and reports the results of applying it to two Russian corpora: a corpus of texts on radio electronics, rocketry and engineering, and a corpus of texts on linguistics.
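As a rough illustration of the bigram step mentioned in the abstract, the sketch below merges frequent adjacent word pairs into single tokens before any topic model is trained. It is not the authors' algorithm: the pointwise mutual information (PMI) score, the thresholds `min_count` and `min_pmi`, and the function name `merge_frequent_bigrams` are all assumptions made for this example, and the input is assumed to be already tokenized and lemmatized.

```python
from collections import Counter
import math


def merge_frequent_bigrams(docs, min_count=2, min_pmi=1.0):
    """Join adjacent token pairs that pass a count and PMI threshold.

    docs is a list of token lists. Returns new token lists in which each
    accepted pair (w1, w2) is replaced by the single token "w1_w2".
    PMI scoring is an illustrative choice, not the method of the thesis.
    """
    unigrams = Counter(tok for doc in docs for tok in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    total_tokens = sum(unigrams.values())
    total_pairs = sum(bigrams.values())

    def pmi(pair):
        # log2 of observed pair probability over the independence estimate
        p_pair = bigrams[pair] / total_pairs
        p_first = unigrams[pair[0]] / total_tokens
        p_second = unigrams[pair[1]] / total_tokens
        return math.log2(p_pair / (p_first * p_second))

    accepted = {
        pair
        for pair, count in bigrams.items()
        if count >= min_count and pmi(pair) >= min_pmi
    }

    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            # Greedy left-to-right merge; an accepted pair consumes two tokens
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in accepted:
                merged.append(doc[i] + "_" + doc[i + 1])
                i += 2
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs
```

The merged token lists can then be passed to any LDA implementation, so that a topic's word list may contain entries such as `topic_model` alongside unigrams.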

    Automated Topic Naming: Supporting Cross-project Analysis of Software Maintenance Activities (Empirical Software Engineering, manuscript draft)

    Software repositories provide a deluge of software artifacts to analyze. Researchers have attempted to summarize, categorize, and relate these artifacts by using semi-unsupervised machine-learning algorithms, such as Latent Dirichlet Allocation (LDA), which are used for concept and topic analysis to suggest candidate word-lists or topics that describe and relate software artifacts. However, these word-lists and topics are difficult to interpret in the absence of meaningful summary labels. Current topic modeling techniques assume manual labelling and do not use domain-specific knowledge to improve, contextualize, or describe results for the developers. We propose a solution: automated labelled topic extraction. Topics are extracted using LDA from commit-log comments recovered from source control systems. These topics are given labels from a generalizable cross-project taxonomy consisting of non-functional requirements. Our approach was evaluated with experiments and case studies on three large-scale Relational Database Management System (RDBMS) projects: MySQL, PostgreSQL and MaxDB. The case studies show that labelled topic extraction can produce appropriate, context-sensitive labels that are relevant to these projects, and provid
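The labelling idea in the abstract above can be sketched as a simple word-list match: a topic receives every label whose vocabulary overlaps the topic's top words. This is only a minimal sketch under stated assumptions. The function `label_topic`, the three label names, and their word lists are hypothetical and are not taken from the paper's taxonomy, and the paper may use a different matching rule.

```python
def label_topic(topic_words, taxonomy):
    """Return, in sorted order, every label whose word list shares a word with the topic.

    topic_words: the top words of one extracted topic.
    taxonomy: dict mapping a label to a set of indicator words
              (hypothetical lists, for illustration only).
    """
    words = set(topic_words)
    return sorted(label for label, vocab in taxonomy.items() if words & vocab)


# Hypothetical non-functional-requirement word lists (assumed, not from the paper)
example_taxonomy = {
    "portability": {"port", "platform", "windows"},
    "efficiency": {"performance", "speed", "optimize"},
    "reliability": {"crash", "bug", "fail"},
}

labels = label_topic(["fix", "crash", "windows"], example_taxonomy)
```

A topic with no matching words gets an empty list, which leaves it unlabelled rather than forcing a label onto it.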