
    Predicting Good Configurations for GitHub and Stack Overflow Topic Models

    Software repositories contain large amounts of textual data, ranging from source code comments and issue descriptions to questions, answers, and comments on Stack Overflow. To make sense of this textual data, topic modelling is frequently used as a text-mining tool for the discovery of hidden semantic structures in text bodies. Latent Dirichlet allocation (LDA) is a commonly used topic model that aims to explain the structure of a corpus by grouping texts. LDA requires multiple parameters to work well, and there are only rough and sometimes conflicting guidelines available on how these parameters should be set. In this paper, we contribute (i) a broad study of parameters to arrive at good local optima for GitHub and Stack Overflow text corpora, (ii) an a-posteriori characterisation of text corpora related to eight programming languages, and (iii) an analysis of corpus feature importance via per-corpus LDA configuration. We find that (1) popular rules of thumb for topic modelling parameter configuration are not applicable to the corpora used in our experiments, (2) corpora sampled from GitHub and Stack Overflow have different characteristics and require different configurations to achieve good model fit, and (3) we can predict good configurations for unseen corpora reliably. These findings support researchers and practitioners in efficiently determining suitable configurations for topic modelling when analysing textual data contained in software repositories.
    Comment: to appear as full paper at MSR 2019, the 16th International Conference on Mining Software Repositories

    Constructing experimental indicators for Open Access documents

    The ongoing paradigm change in the scholarly publication system ('science is turning to e-science') makes it necessary to construct alternative evaluation criteria/metrics which appropriately take into account the unique characteristics of electronic publications and other research output in digital formats. Today, major parts of scholarly Open Access (OA) publications and the self-archiving area are not well covered in the traditional citation and indexing databases. The growing share and importance of freely accessible research output demands new approaches/metrics for measuring and evaluating these new types of scientific publications. In this paper we propose a simple quantitative method which establishes indicators by measuring the access/download pattern of OA documents and other web entities of a single web server. The experimental indicators (search engine, backlink and direct access indicator) are constructed based on standard local web usage data. This new type of web-based indicator is developed to meet the specific demand for better study/evaluation of the accessibility, visibility and interlinking of openly accessible documents. We conclude that e-science will need new stable e-indicators.
    Comment: 9 pages, 3 figures

    Web Mining Functions in an Academic Search Application

    This paper deals with Web mining and its different categories, namely content, structure and usage mining. It discusses the application of Web mining in an academic search application and concludes with open problems related to Web mining. The present work can be a useful input to Web users and Web administrators in a university environment.
    Keywords: Database, HITS, IR, NLP, Web mining

    Is Stack Overflow Overflowing With Questions and Tags

    Programming question and answer (Q&A) websites, such as Quora, Stack Overflow, and Yahoo! Answers, help us understand programming concepts easily and quickly in a way that has been tested and applied by many software developers. Stack Overflow is one of the most frequently used programming Q&A websites, where the questions and answers posted are presently analyzed manually, which requires a huge amount of time and resources. To save this effort, we present a topic-modeling-based technique that analyzes the words of the original texts to discover the themes that run through them. We also propose a method to automate the process of reviewing the quality of questions in a Stack Overflow dataset, in order to avoid ballooning Stack Overflow with insignificant questions. The proposed method also recommends appropriate tags for a new post, which averts the creation of unnecessary tags on Stack Overflow.
    Comment: 11 pages, 7 figures, 3 tables. Presented at Third International Symposium on Women in Computing and Informatics (WCI-2015)

    APIs and Your Privacy

    Application programming interfaces, or APIs, have been the topic of much recent discussion. Newsworthy events, including those involving Facebook’s API and Cambridge Analytica obtaining information about millions of Facebook users, have highlighted the technical capabilities of APIs for prominent websites and mobile applications. At the same time, media coverage of ways that APIs have been misused has sparked concern for potential privacy invasions and other issues of public policy. This paper seeks to educate consumers on how APIs work and how they are used within popular websites and mobile apps to gather, share, and utilize data. APIs are used in mobile games, search engines, social media platforms, news and shopping websites, video and music streaming services, dating apps, and mobile payment systems. If a third-party company, like an app developer or advertiser, would like to gain access to your information through a website you visit or a mobile app or online service you use, what data might they obtain about you through APIs and how? This report analyzes 11 prominent online services to observe general trends and provide you an overview of the role APIs play in collecting and distributing information about consumers. For example, how might your data be gathered and shared when using your Facebook account login to sign up for Venmo or to access the Tinder dating app? How might advertisers use Pandora’s API when you are streaming music? After explaining what APIs are and how they work, this report categorizes and characterizes different kinds of APIs that companies offer to web and app developers. Services may offer content-focused APIs, feature APIs, unofficial APIs, and analytics APIs that developers of other apps and websites may access and use in different ways. Likewise, advertisers can use APIs to target a desired subset of a service’s users and possibly extract user data. 
    This report explains how websites and apps can create user profiles based on your online behavior and generate revenue from advertiser-access to their APIs. The report concludes with observations on how various companies and platforms connecting through APIs may be able to learn information about you and aggregate it with your personal data from other sources when you are browsing the internet or using different apps on your smartphone or tablet. While the paper does not make policy recommendations, it demonstrates the importance of approaching consumer privacy from a broad perspective that includes first parties and third parties, and that considers the integral role of APIs in today’s online ecosystem.