Visual Cortex Inspired CNN Model for Feature Construction in Text Analysis
Biologically inspired models have recently been proposed for problems in text analysis. Convolutional neural networks (CNNs) are hierarchical artificial neural networks built from a variety of multilayer perceptrons. According to biological research, CNNs can be improved by incorporating the attention modulation and memory processing of the primate visual cortex. In this paper, we use these properties of the primate visual cortex to improve a CNN and propose a biological-mechanism-driven-feature-construction based answer recommendation method (BMFC-ARM), which recommends the best answer for a given question in community question answering. BMFC-ARM is an improved CNN with four channels representing questions, answers, asker information and answerer information, respectively, and comprises two main stages: biological mechanism driven feature construction (BMFC) and answer ranking. BMFC imitates attention modulation by introducing the asker and answerer information for a given question and the similarity between them, and imitates memory processing by bringing in the answerers' reputation information. The feature vector for answer ranking is then constructed by fusing the asker-answerer similarities, the answerer's reputation and the corresponding vectors of the question, answer, asker and answerer. Finally, Softmax is applied to the feature vector at the answer-ranking stage to select the best answers. The experimental results of answer recommendation on the Stackexchange dataset show that BMFC-ARM exhibits better performance
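The feature fusion and Softmax ranking described in this abstract can be illustrated with a minimal sketch. It is not the BMFC-ARM implementation: the CNN channels are replaced by plain input vectors, the scoring step is a simple linear layer, and the names build_feature, rank_answers and weights are hypothetical, introduced only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_feature(question, answer, asker, answerer, reputation):
    """Fuse asker-answerer similarity, answerer reputation and the four vectors.

    Assumption: fusion is plain concatenation; the abstract does not specify it.
    """
    similarity = cosine(asker, answerer)
    return [similarity, reputation] + question + answer + asker + answerer


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_answers(features, weights):
    """Score each candidate's feature vector linearly, then rank by softmax probability.

    Assumption: a linear scorer stands in for the trained network.
    """
    scores = [sum(w * x for w, x in zip(weights, f)) for f in features]
    probs = softmax(scores)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order, probs
```

Here rank_answers returns the candidate indices from most to least probable, together with the probabilities; in a real system the weights would be learned rather than supplied by hand.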
Software expert discovery via knowledge domain embeddings in a collaborative network
© 2018 Elsevier B.V. Community Question Answering (CQA) websites can be regarded as the foremost venues for knowledge sharing and the most effective way of exchanging knowledge at present. Because a massive number of users participate online and generate huge amounts of data, managing this knowledge systematically can be challenging. Expert recommendation is one of the major challenges: it highlights CQA users with potential expertise, which may help match unresolved questions with existing high-quality answers and may also give external services such as human resource systems another reference for evaluating their candidates. In this work, we propose an approach to discovering experts on CQA websites. We use recent distributed word representation techniques to summarize text chunks and, from a semantic view, exploit the relationships between natural language phrases to extract latent knowledge domains. Within each domain, users' expertise is determined from their historical performance, and a ranking can be computed to give recommendations accordingly. Stack Overflow is chosen as the dataset to test and evaluate our work, where experiments demonstrate the competence of our approach.
CQARank: Jointly Model Topics and Expertise in Community Question Answering
Community Question Answering (CQA) websites, where people share expertise on open platforms, have become large repositories of valuable knowledge. To bring the best value out of these knowledge repositories, it is critically important for CQA services to know how to find the right experts, retrieve archived similar questions and recommend best answers to new questions. To tackle this cluster of closely related problems in a principled way, we propose the Topic Expertise Model (TEM), a novel probabilistic generative model with GMM hybrid, to jointly model topics and expertise by integrating textual content modelling and link structure analysis. Based on the TEM results, we propose CQARank to measure user interests and expertise scores under different topics. Leveraging the question answering history based on long-term community reviews and voting, our method can find experts with both similar topical preference and high topical expertise. Experiments carried out on Stack Overflow data, the largest CQA site focused on computer programming, show that our method achieves significant improvement over existing methods on multiple metrics. Copyright is held by the owner/author(s).
Expert2Vec: Distributed Expert Representation Learning in Question Answering Community
© 2019, Springer Nature Switzerland AG. Community question answering (CQA) has attracted increasing attention recently due to its potential as a de facto knowledge base. Expert finding on CQA websites also has considerably broad applications. Stack Overflow is one of the most popular question answering platforms and is often used in recent studies on domain expert recommendation. Despite the substantial recent progress, research on the direct representation of expert users is still lacking. We therefore propose Expert2Vec, a distributed expert representation learning approach for question answering communities, to improve domain expert recommendation. Word2Vec is used to preprocess the Stack Overflow dataset, which helps to generate representations of domain topics. Weight rankings are then extracted based on domains, and a variational autoencoder (VAE) is used to generate representations of user-topic information. Finally, a reinforcement learning framework is applied to the user-topic matrix to refine it internally. Experiments show adequate performance of the proposed approach in the recommendation system.
Predicting best answerers for new questions: An approach leveraging topic modeling and collaborative voting
Workshop of Quality, Motivation and Coordination of Open Collaboration
Community and Thread Methods for Identifying Best Answers in Online Question Answering Communities
Much research has recently investigated the measurement of quality answers in Question Answering (Q&A) communities in the form of automatic best answer identification. Previous approaches have focused on manual user annotations and diverse features based on intuition for identifying best answers and proved relatively successful despite considering best answer identification as a general classification problem.
Best answer modelling is generally distanced from community studies about what users regard as important for identifying quality content. In particular, previous research tends to focus only on the automatic aspects of best answer identification models by applying generic learning algorithms.
This thesis introduces the concepts of qualitative and structural design in order to investigate if features derived from community questionnaires can enrich the understanding of best answer identification in Q&A communities and if the thread-like structure of Q&A communities can be exploited for better results. Two different approaches for exploiting the thread structure of Q&A communities are proposed and two new, previously unstudied, features are introduced. First, a measure of question complexity is introduced as a proxy measure of answerer knowledge. Second, different models of contribution effort are proposed for representing the answering reactivity of contributors.
The experiments are systematically conducted on datasets from three different communities that vary in size, content and structure. The results show that the newly proposed features allow for a better understanding of what constitutes a best answer. The findings also reveal that the thread-wise algorithms and optimisation techniques created from the structural design methodology correlate with best answers. In general, both structural and qualitative design appear to improve best answer identification, meaning that structural and qualitative methods may improve unrelated classification tasks.
Code similarity and clone search in large-scale source code data
Software development benefits tremendously from the Internet through online code corpora that enable instant sharing of source code, along with online developer guides and documentation. Nowadays, duplicated code (i.e., code clones) exists not only within or across software projects but also between online code repositories and websites. We call them "online code clones." Like classic code clones between software systems, they can lead to license violations, bug propagation, and re-use of outdated code. Unfortunately, they are difficult to locate and fix since the search space in online code corpora is large and no longer confined to a local repository. This thesis presents a combined study of code similarity and online code clones. We empirically show that many code snippets on Stack Overflow are cloned from open source projects. Several of them become outdated or violate their original license and are possibly harmful to reuse. To develop a solution for finding online code clones, we study various code similarity techniques to gain insights into their strengths and weaknesses. A framework, called OCD, for evaluating code similarity and clone search tools is introduced and used to compare 34 state-of-the-art techniques on pervasively modified code and boiler-plate code. We also found that clone detection techniques can be enhanced by compilation and decompilation. Using the knowledge from the comparison of code similarity analysers, we create and evaluate Siamese, a scalable token-based clone search technique via multiple code representations. Our evaluation shows that Siamese scales to large-scale source code data of 365 million lines of code and offers high search precision and recall. Its clone search precision is comparable to seven state-of-the-art clone detection tools on the OCD framework.
Finally, we demonstrate the usefulness of Siamese by applying the tool to find online code clones, automatically analyse clone licenses, and recommend tests for reuse.
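The general idea of token-based clone search mentioned in this abstract can be sketched as follows. This is not the Siamese tool: it uses a single code representation (token trigrams), an in-memory inverted index and Jaccard similarity, all of which are assumptions chosen for brevity. The names CloneIndex, tokenize and ngrams are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: a crude language-agnostic tokenizer (identifiers, numbers, single symbols).
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code):
    """Split source text into tokens, discarding whitespace."""
    return TOKEN_RE.findall(code)


def ngrams(tokens, n=3):
    """Return the set of contiguous token n-grams."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


class CloneIndex:
    """Inverted index from token n-grams to the snippets that contain them."""

    def __init__(self, n=3):
        self.n = n
        self.grams = {}                   # snippet name -> set of n-grams
        self.postings = defaultdict(set)  # n-gram -> set of snippet names

    def add(self, name, code):
        grams = ngrams(tokenize(code), self.n)
        self.grams[name] = grams
        for gram in grams:
            self.postings[gram].add(name)

    def search(self, code, threshold=0.5):
        """Return (name, Jaccard score) pairs at or above threshold, best first."""
        query = ngrams(tokenize(code), self.n)
        # Only snippets sharing at least one n-gram with the query are scored.
        candidates = set()
        for gram in query:
            candidates |= self.postings.get(gram, set())
        results = []
        for name in candidates:
            grams = self.grams[name]
            union = len(query | grams)
            score = len(query & grams) / union if union else 0.0
            if score >= threshold:
                results.append((name, score))
        return sorted(results, key=lambda r: r[1], reverse=True)
```

The inverted index keeps a query from being compared against every stored snippet, which is the property that lets index-based search scale; a production tool would persist the index and normalise tokens far more carefully.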