In this paper, we study the problem of extracting technical paraphrases from a parallel software corpus, namely, a collection of duplicate bug reports. Paraphrase acquisition is a fundamental task in the emerging area of text mining for software engineering. Existing paraphrase extraction methods are not entirely suitable here due to the noisy nature of bug reports. We propose a number of techniques to address the noisy data problem. The empirical evaluation shows that our method significantly improves an existing method by up to 58%. 

JIANG, Jing

LO, David

Mei, Hong

WANG, Xiaoyin

ZHANG, LU

Institutional Knowledge at Singapore Management University

Singapore Management UniversityInstitutional Knowledge at Singapore Management UniversityResearch Collection School Of Information Systems School of Information Systems8-2009Extracting Paraphrases of Technical Terms fromNoisy Parallel Software CorpusXiaoyin WANGDavid LOSingapore Management University, davidlo@smu.edu.sgJing JIANGSingapore Management University, jingjiang@smu.edu.sgLU ZHANGHong MeiFollow this and additional works at: https://ink.library.smu.edu.sg/sis_researchPart of the Software Engineering CommonsThis Conference Paper is brought to you for free and open access by the School of Information Systems at Institutional Knowledge at SingaporeManagement University. It has been accepted for inclusion in Research Collection School Of Information Systems by an authorized administrator ofInstitutional Knowledge at Singapore Management University. For more information, please email libIR@smu.edu.sg.CitationWANG, Xiaoyin; LO, David; JIANG, Jing; ZHANG, LU; and Mei, Hong. Extracting Paraphrases of Technical Terms from NoisyParallel Software Corpus. (2009). CLShort '09: Proceedings of the ACL-IJCNLP 2009 Conference Short Papers. 197-200. ResearchCollection School Of Information Systems.Available at: https://ink.library.smu.edu.sg/sis_research/472Extracting Paraphrases of Technical Termsfrom Noisy Parallel Software CorporaXiaoyin Wang1,2, David Lo1, Jing Jiang1, Lu Zhang2, Hong Mei21School of Information Systems, Singapore Management University, Singapore, 178902{xywang, davidlo, jingjiang}@smu.edu.sg2Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of EducationBeijing, 100871, China{zhanglu, meih}@sei.pku.edu.cnAbstractIn this paper, we study the problem of ex-tracting technical paraphrases from a par-allel software corpus, namely, a collec-tion of duplicate bug reports. Paraphraseacquisition is a fundamental task in theemerging area of text mining for softwareengineering. Existing paraphrase extrac-tion methods are not entirely suitable heredue to the noisy nature of bug reports. Wepropose a number of techniques to addressthe noisy data problem. The empiricalevaluation shows that our method signifi-cantly improves an existing method by upto 58%.1 IntroductionUsing natural language processing (NLP) tech-niques to mine software corpora such as code com-ments and bug reports to assist software engineer-ing (SE) is an emerging and promising researchdirection (Wang et al., 2008; Tan et al., 2007).Paraphrase extraction is one of the fundamentalproblems that have not been addressed in this area.It has many applications including software ontol-ogy construction and query expansion for retriev-ing relevant technical documents.In this paper, we study automatic paraphrase ex-traction from a large collection of software bug re-ports. Most large software projects have bug track-ing systems, e.g., Bugzilla1, to help global users todescribe and report the bugs they encounter whenusing the software. However, since the same bugmay be seen by many users, many duplicate bugreports are sent to bug tracking systems. The du-plicate bug reports are manually tagged and asso-ciated to the original bug report by either the sys-tem manager or software developers. These fam-ilies of duplicate bug reports form a semi-parallel1http://www.bugzilla.org/Parallel bug reports with a pair of true paraphrases1: connector extend with a straight line in full screenmode2: connector show straight line in presentation modeNon-parallel bug reports referring to the same bug1: Settle language for part of text and spellcheckingpart of text2: Feature requested to improve the management of amulti-language documentContext-peculiar paraphrases (shown in italics)1: status bar appear in the middle of the screen2: maximizing window create phantom status bar inmiddle of documentTable 1: Bug Report Examplescorpus and therefore a good candidate for extrac-tion of paraphrases of technical terms. Hence, bugreports interest us because (1) they are abundantand freely available,(2) they naturally form a semi-parallel corpus, and (3) they contain many techni-cal terms.However, bug reports have characteristics thatraise many new challenges. Different from manyother parallel corpora, bug reports are noisy. Weobserve at least three types of noise common inbug reports. First, many bug reports have manyspelling, grammatical and sentence structure er-rors. To address this we extend a suitable state-of-the-art technique that is robust to such cor-pora, i.e. (Barzilay and McKeown, 2001). Sec-ond, many duplicate bug report families containsentences that are not truly parallel. An exam-ple is shown in Table 1 (middle). We handle thisby considering lexical similarity between dupli-cate bug reports. Third, even if the bug reports areparallel, we find many cases of context-peculiarparaphrases, i.e., a pair of phrases that have thesame meaning in a very narrow context. An exam-ple is shown in Table 1 (bottom). To address this,we introduce two notions of global context-basedscore and co-occurrence based score which takeinto account all good and bad occurrences of thephrases in a candidate paraphrase in the corpus.These scores are then used to identify and removecontext-peculiar paraphrases.The contributions of our work are twofold.First, we studied the important problem of para-phrase extraction from a noisy semi-parallel soft-ware corpus, which has not been studied either inthe NLP or the SE community. Second, takinginto consideration the special characteristics of ournoisy data, we proposed several improvements toan existing general paraphrase extraction method,resulting in a significant performance gain – up to58% relative improvement in precision.2 Related WorkIn the area of text mining for software engineer-ing, paraphrases have been used in many tasks,e.g., (Wang et al., 2008; Tan et al., 2007). How-ever, most paraphrases used are obtained manu-ally. A recent study using synonyms from Word-Net highlights the fact that these are not effectivein software engineering tasks due to domain speci-ficity (Sridhara et al., 2008). Therefore, an auto-matic way to derive technical paraphrases specificto software engineering is desired.Paraphrases can be extracted from non-parallelcorpora using contextual similarity (Lin, 1998).They can also be obtained from parallel corporaif such data is available (Barzilay and McKeown,2001; Ibrahim et al., 2003). Recently, there arealso a number of studies that extract paraphrasesfrom multilingual corpora (Bannard and Callison-Burch, 2005; Zhao et al., 2008).The approach in (Barzilay and McKeown,2001) does not use deep linguistic analysis andtherefore is suitable to noisy corpora like ours.Due to this reason, we build our technique on topof theirs. The following provides a summary oftheir technique.Two types of paraphrase patterns are defined:(1) Syntactic patterns which consist of the POStags of the phrases. For example, the paraphrases“a VGA monitor” and “a monitor” are representedas “DT1 JJ NN2” ↔ “DT1 NN2”, where the sub-scripts denote common words. (2) Contextual pat-terns which consist of the POS tags before and af-ter the phrases. For example, the contexts “in themiddle of” and “in middle of” in Table 1 (bottom)are represented as “IN1 DT NN2 IN3” ↔ “IN1NN2 IN3”.During pre-processing, the parallel corpus isaligned to give a list of parallel sentence pairs.The sentences are then processed by a POS tag-ger and a chunker. The authors first used identi-cal words and phrases as seeds to find and scorecontextual patterns. The patterns are scored basedon the following formula: (n+)/n, in which, n+refers to the number of positively labeled para-phrases satisfying the patterns and n refers to thenumber of all paraphrases satisfying the patterns.Only patterns with scores above a threshold areconsidered. More paraphrases are identified usingthese contextual patterns, and more patterns arethen found and scored using the newly-discoveredparaphrases. This co-training algorithm is em-ployed in an iterative fashion to find more patternsand positively labeled paraphrases.3 MethodologyOur paraphrase extraction method consists ofthree components: sentence selection, globalcontext-based scoring and co-occurrence-basedscoring. We marry the three components togetherinto a holistic solution.Selection of Parallel Sentences Our corpus con-sists of short bug report summaries, each contain-ing one or two sentences only, grouped by thebugs they report. Each group corresponds to re-ports pertaining to a single bug and are duplicateof one another. Therefore, reports belonging to thesame group can be naturally regarded as parallelsentences.However, these sentences are only partially par-allel because two users may describe the same bugin very different ways. An example is shown in Ta-ble 1 (middle). This kind of sentence pairs shouldnot be regarded as parallel. To address this prob-lem, we take a heuristic approach and only selectsentence pairs that have strong similarities. Oursimilarity score is based on the number of com-mon words, bigrams and trigrams shared betweentwo parallel sentences. We use a threshold of 5 tofilter out non-parallel sentences.Global Context-Based Scoring Our context-based paraphrase scoring method is an extensionof (Barzilay and McKeown, 2001) described inSec. 2. Parallel bug reports are usually noisy.At times, some words might be detected as para-phrases incidentally due to the noise. In (Barzi-lay and McKeown, 2001), a paraphrase is reportedas long as there is a single good supporting pairof sentences. Although this works well for a rel-atively clean parallel corpus considered in theirwork, i.e., novels, this does not work well for bugreports. Consider the context-peculiar example inTable 1 (bottom). For a context-peculiar para-phrase, there can be many sentences containingthe pair of phrases but very few support them tobe a paraphrase. We develop a technique to off-set this noise by computing a global context-basedscore for two phrases being a paraphrase over alltheir parallel occurrences. This is defined by thefollowing formula: Sg = 1nΣni=1si, where n isthe number of parallel bug reports with the twophrases occurring in parallel, and si is the scorefor the i’th occurrence. si is computed as follows:1. We compute the set of patterns with affixedpattern scores based on (Barzilay and McK-eown, 2001).2. For the i’th parallel occurrence of the pair ofphrases we want to score, we try to find a pat-tern that matches the occurrence and assignthe pattern score to the pair of phrases as si.If no such pattern exists, we set si to 0.By taking the average of si as the global scorefor a pair of phrases, we do not rely much on a sin-gle si and can therefore prevent context-peculiarparaphrases to some degree.Co-occurrence-Based Scoring We also consideranother global co-occurrence-based score that iscommonly used for finding collocations. A gen-eral observation is that noise tends to appear inrandom but random things do not occur in thesame way often. It is less likely for randomlypaired words or paraphrases to co-occur togethermany times. To compute the likelihood of twophrases occurring together, we use the followingcommonly used co-occurrence-based score:Sc =P (w1, w2)P (w1)P (w2). (1)The expression P (w1, w2) refers to the probabilityof a pair of phrases w1 and w2 appearing together.It is estimated based on the proportion of the cor-pus containing both w1 and w2 in parallel. Sim-ilarly, P (w1) and P (w2) each corresponds to theprobability of w1 and w2 appearing respectively.We normalize the co-occurrence to the range of 0-1 by dividing it with the size of the corpus.Holistic Solution We employ the parallel sen-tence selection as a pre-processing step, and mergeco-occurrence-based scoring with global context-based scoring. For each parallel sentence pairs, achunker is used to get chunks from each sentence.All possible pairings of chunks are then formed.This set of chunk pairs are later fed to the methodin (Barzilay and McKeown, 2001) to produce aset of patterns with affixed scores. With this wecompute our global-context based scores. The co-occurrence based scores are computed followingthe approach described above.Two thresholds are used and candidate para-phrases whose scores are below the respectivethresholds are removed. Alternatively, one of thescore is used as a filter, while the other is used torank the candidates. The next section describesour experimental results.4 EvaluationData Set Our bug report corpus is built fromOpenOffice2. OpenOffice is a well-known opensource software which has similar functionalitiesas Microsoft Office. We use the bug reports thatare submitted before Jan 1, 2008. Also, we onlyuse the summary part of the bug reports.We build our corpus in the following steps. Wecollect a total of 13,898 duplicate bug reports fromOpenOffice. Each duplicate bug report is associ-ated to a master report—there is one master re-port for each unique bug. From this information,we create duplicate bug report groups where eachmember of a group is a duplicate of all other mem-bers in the same group. Finally, we extract dupli-cate bug report pairs by pairing each two membersof each group. We get in total 53,363 duplicatebug report pairs.As the first step, we employ parallel sentenceselection, described in Sec. 3, to remove non-parallel duplicate bug report pairs. After this step,we find 5,935 parallel duplicate bug report pairs.Experimental Setup The baseline method weconsider is the one in (Barzilay and McKeown,2001) without sentence alignment – as the bug re-ports are usually of one sentence long. We call itBL. As described in Sec. 2, BL utilizes a thresholdto control the number of patterns mined. Thesepatterns are later used to select paraphrases. In theexperiment, we find that running BL using theirdefault threshold of 0.95 on the 5,935 parallel bugreports only gives us 18 paraphrases. This num-ber is too small for practical purposes. Therefore,we reduce the threshold to get more paraphrases.For each threshold in the range of 0.45-0.95 (stepsize: 0.05), we extract paraphrases and computethe corresponding precision.In our approach, we first form chunk pairs fromthe 5,935 pairs of parallel sentences and then usethe baseline approach at a low threshold to ob-2http://www.openoffice.org/tain patterns. Using these patterns we computethe global context-based scores Sg. We also com-pute the co-occurrence scores Sc. We rank andextract top-k paraphrases based on these scores.We consider 4 different methods: We can use ei-ther Sg or Sc to rank the discovered paraphrases.We call them Rk-Sg and Rk-Sc. We also considerusing one of the scores for ranking and the otherfor filtering bad candidate paraphrases. A thresh-old of 0.05 is used for filtering. We call these twomethods Rk-Sc+Ft-Sg and Rk-Sg+Ft-Sc. Withranked lists from these 4 methods, we can com-pute precision@k for the top-k paraphrases.Results The comparison among these methodsis plotted in Figure 1. From the figure we cansee that our holistic approach using global-contextscore to rank and co-occurrence score to filter(i.e., Rk-Sg+Ft-Sc) has higher precision than thebaseline approach (i.e., BL) in all ks. In general,the other holistic configuration (i.e., Rk-Sc+Ft-Sg)also works well for most of the ks considered. In-terestingly, the graph shows that using only one ofthe scores alone (i.e., Rk-Sg and Rk-Sc) does notresult in a significantly higher precision than thebaseline approach. A holistic approach by merg-ing global-context score and co-occurrence scoreis needed to yield higher precision.In Table 2, we show some examples of the para-phrases our algorithm extracted from the bug re-port corpus. As we can see, most of the para-phrases are very technical and only make sense inthe software domain. It demonstrates the effec-tiveness of our method. 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 50  100  150  200  250  300  350  400  450precision at kkBLRk-SgRk-ScRk-Sc+Ft-SgRk-Sg+Ft-ScFigure 1: Precision@k for a range of k.5 ConclusionIn this paper, we develop a new technique to ex-tract paraphrases of technical terms from softwarebug reports. Paraphrases of technical terms havebeen shown to be useful for various software en-the edit-field ↔ input line fieldpresentation mode ↔ full screen modeword separator ↔ a word delimiterapplication ↔ appfreeze ↔ crashmru file list ↔ recent file listmultiple monitor ↔ extended desktopxl file ↔ excel fileTable 2: Examples of paraphrases of technicalterms mined from bug reports.gineering tasks. These paraphrases could not beobtained via general purpose thesaurus e.g., Word-Net. Interestingly, there is a wealth of text data,in particular bug reports, available for analysis inopen-source software repositories. Despite theiravailability, a good technique is needed to extractparaphrases from these corpora as they are oftennoisy. We develop several approaches to addressnoisy data via parallel sentence selection, global-context based scoring and co-occurrence basedscoring. To show the utility of our approach, weexperimented with many parallel bug reports froma large software project. The preliminary exper-iment result is promising as it could significantlyimproves an existing method by up to 58%.ReferencesC. Bannard and C. Callison-Burch. 2005. Paraphras-ing with bilingual parallel corpora. In ACL: AnnualMeet. of Assoc. of Computational Linguistics.R. Barzilay and K. R. McKeown. 2001. Extractingparaphrases from a parallel corpus. In ACL: AnnualMeet. of Assoc. of Computational Linguistics.A. Ibrahim, B. Katz, and J. Lin. 2003. Extract-ing structural paraphrases from aligned monolingualcorpora. In Int. Workshop on Paraphrasing.D. Lin. 1998. Automatic retrieval and clustering ofsimilar words. In ACL: Annual Meet. of Assoc. ofComputational Linguistics.G. Sridhara, E. Hill, L. Pollock, and K. Vijay-Shanker.2008. Identifying word relations in software: Acomparative study of semantic similarity tools. InICPC: Int. Conf. on Program Comprehension.L. Tan, D. Yuan, G. Krishna, and Y. Zhou. 2007./*icomment: bugs or bad comments?*/. In SOSP:Symp. on Operating System Principles.X. Wang, L. Zhang, T. Xie, J. Anvik, and J. Sun. 2008.An approach to detecting duplicate bug reports us-ing natural language and execution information. InICSE: Int. Conf. on Software Engineering.S. Zhao, H. Wang, T. Liu, and S. Li. 2008. Pivot ap-proach for extracting paraphrase patterns from bilin-gual corpora. In ACL: Annual Meet. of Assoc. ofComputational Linguistics.

Extracting Paraphrases of Technical Terms from Noisy Parallel Software Corpus

https://ink.library.smu.edu.sg/cgi/viewcontent.cgi?article=1471&amp;context=sis_research

Extracting Paraphrases of Technical Terms from Noisy Parallel Software Corpus

Abstract

Similar works

Full text

Available Versions

Institutional Knowledge at Singapore Management University