6,099 research outputs found

    Parsing Using the Role and Reference Grammar Paradigm

    Much effort has been put into finding ways of parsing natural language. Role and Reference Grammar (RRG) is a linguistic paradigm that has credibility in linguistic circles. In this paper we give a brief overview of RRG and show how it can be implemented in a standard rule-based parser. We used a chart parser to test the concept on sentences from student work. We present results that show the potential of this method for parsing ungrammatical sentences.
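    The chart-parsing idea the abstract relies on can be sketched with a minimal CKY recognizer over a toy grammar. The grammar and lexicon below are invented illustrations, not the RRG rule set used in the paper.

```python
from itertools import product

# Minimal CKY chart recognizer for a grammar in Chomsky normal form.
# Hypothetical toy lexicon and rules, for illustration only.
LEXICON = {
    "students": {"NP"},
    "write": {"V"},
    "essays": {"NP"},
}
RULES = {  # binary rules A -> B C, keyed by (B, C)
    ("V", "NP"): {"VP"},
    ("NP", "VP"): {"S"},
}

def cky_parse(words):
    n = len(words)
    # chart[i][j] holds the nonterminals that span words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(LEXICON.get(w, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for b, c in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= RULES.get((b, c), set())
    return "S" in chart[0][n]
```

A real system would keep backpointers in the chart to recover parse trees; the recognizer above only reports whether a sentence is derivable.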

    Automatic Genre Classification in Web Pages Applied to Web Comments

    Automatic Web comment detection could significantly facilitate information retrieval systems, e.g., a focused Web crawler. In this paper, we propose a text genre classifier for Web text segments as an intermediate step for Web comment detection in Web pages. Different feature types and classifiers are analyzed for this purpose. We compare the two-level approach to state-of-the-art techniques operating on the whole Web page text and show that accuracy can be improved significantly. Finally, we illustrate the applicability for information retrieval systems by evaluating our approach on Web pages retrieved by a Web crawler.
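    The segment-level genre classification step can be illustrated with a tiny word-feature Naive Bayes classifier. The two classes and the training snippets below are invented examples, not the paper's corpus or feature set.

```python
import math
from collections import Counter

# Toy two-class multinomial Naive Bayes over word features, illustrating
# the idea of classifying Web text segments by genre (comment vs. article).

def train(docs):
    """docs: list of (label, text). Returns per-class word counts and doc totals."""
    counts = {}          # label -> Counter of words
    totals = Counter()   # label -> number of documents
    for label, text in docs:
        totals[label] += 1
        counts.setdefault(label, Counter()).update(text.lower().split())
    return counts, totals

def classify(counts, totals, text, alpha=1.0):
    """Return the label with the highest log-posterior (Laplace-smoothed)."""
    vocab = {w for wc in counts.values() for w in wc}
    n_docs = sum(totals.values())
    best, best_lp = None, float("-inf")
    for label, wc in counts.items():
        lp = math.log(totals[label] / n_docs)
        denom = sum(wc.values()) + alpha * len(vocab)
        for w in text.lower().split():
            lp += math.log((wc[w] + alpha) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best
```

The paper's two-level approach would first segment the page and then apply such a classifier per segment, rather than labeling the whole page text at once.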

    The Evolution of Wikipedia's Norm Network

    Social norms have traditionally been difficult to quantify. In any particular society, their sheer number and complex interdependencies often limit a system-level analysis. One exception is that of the network of norms that sustain the online Wikipedia community. We study the fifteen-year evolution of this network using the interconnected set of pages that establish, describe, and interpret the community's norms. Despite Wikipedia's reputation for ad hoc governance, we find that its normative evolution is highly conservative. The earliest users create norms that both dominate the network and persist over time. These core norms govern both content and interpersonal interactions using abstract principles such as neutrality, verifiability, and assume good faith. As the network grows, norm neighborhoods decouple topologically from each other, while increasing in semantic coherence. Taken together, these results suggest that the evolution of Wikipedia's norm network is akin to bureaucratic systems that predate the information age. Comment: 22 pages, 9 figures. Matches published version. Data available at http://bit.ly/wiki_nor

    Name and Subject Heading Reconciliation to Linked Open Data Authorities using Virtual International Authority File and Library of Congress Linked Data Service APIs: A Case Study featuring Emblematica Online

    Libraries are actively exploring ways to use Linked Open Data (LOD) services to enhance discovery and facilitate the use of collections. Emblematica Online, which provides integrated discovery of digitized emblem books, incorporates LOD in its design. As an implementation prerequisite, the Virtual International Authority File (VIAF) and Library of Congress (LC) Linked Data Service APIs were used to reconcile name and subject strings from legacy catalog records with global authoritative links from LOD resources. This case study reports on the automated reconciliation process used and examines the efficacy of the APIs in reconciling name and subject heading entities. While a majority of strings were successfully reconciled, analysis suggests that data cleanup, rigorously consistent formatting of metadata strings, and addressing challenges in existing LOD resources and services could improve results for this corpus.
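    A reconciliation lookup of the kind described can be sketched against the public VIAF AutoSuggest and LC Linked Data Service suggest endpoints. The endpoint paths and the response shape assumed in `reconcile_name` are best-effort readings of the public documentation, not the exact pipeline used by Emblematica Online.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed public endpoints (verify against current VIAF / id.loc.gov docs).
VIAF_SUGGEST = "https://viaf.org/viaf/AutoSuggest"
LC_SUBJECT_SUGGEST = "https://id.loc.gov/authorities/subjects/suggest/"

def viaf_query_url(name):
    """Build the VIAF AutoSuggest URL for a name string from a catalog record."""
    return VIAF_SUGGEST + "?" + urlencode({"query": name})

def lc_subject_query_url(heading):
    """Build the LC suggest URL for a subject heading string."""
    return LC_SUBJECT_SUGGEST + "?" + urlencode({"q": heading})

def reconcile_name(name):
    """Return the top VIAF ID for a name, or None if nothing matches.

    Assumes the AutoSuggest response is JSON with a 'result' list whose
    entries carry a 'viafid' key.
    """
    with urlopen(viaf_query_url(name)) as resp:
        data = json.load(resp)
    results = data.get("result") or []
    return results[0].get("viafid") if results else None
```

A production batch job would also log unmatched strings for the data-cleanup pass the study recommends, rather than silently returning None.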

    Learning from Jesus’ Wife: What Does Forgery Have to Do with the Digital Humanities?

    McGrath’s chapter on the so-called Gospel of Jesus’ Wife sets aside as settled the question of the papyrus’ authenticity, and explores instead what we can learn about the Digital Humanities and scholarly interaction in a digital era from the way the discussions and investigations of that work unfolded, and how issues that arose were handled. As news of purported new finds can spread around the globe instantaneously, facilitated by current technology and social media, how can academics use similar technology to evaluate authenticity and, even more importantly, inform the broader public about the importance of provenance and the need for skepticism toward finds that surface via the antiquities market?

    Human evaluation of Kea, an automatic keyphrasing system.

    This paper describes an evaluation of the Kea automatic keyphrase extraction algorithm. Tools that automatically identify keyphrases are desirable because document keyphrases have numerous applications in digital library systems, but are costly and time consuming to assign manually. Keyphrase extraction algorithms are usually evaluated by comparison to author-specified keywords, but this methodology has several well-known shortcomings. The results presented in this paper are based on subjective evaluations of the quality and appropriateness of keyphrases by human assessors, and make a number of contributions. First, they validate previous evaluations of Kea that rely on author keywords. Second, they show that Kea's performance is comparable to that of similar systems that have been evaluated by human assessors. Finally, they justify the use of author keyphrases as a performance metric by showing that authors generally choose good keywords.

    Automatic Prediction of Rejected Edits in Stack Overflow

    The content quality of shared knowledge in Stack Overflow (SO) is crucial in supporting software developers with their programming problems. Thus, SO allows its users to suggest edits to improve the quality of a post (i.e., question and answer). However, existing research shows that many suggested edits in SO are rejected due to undesired contents/formats or violating edit guidelines. Such a scenario frustrates or demotivates users who would like to conduct good-quality edits. Therefore, our research focuses on assisting SO users by offering them suggestions on how to improve their editing of posts. First, we manually investigate 764 (382 questions + 382 answers) edits rejected by rollbacks and produce a catalog of 19 rejection reasons. Second, we extract 15 text- and user-based features to capture those rejection reasons. Third, we develop four machine learning models using those features. Our best-performing model can predict rejected edits with 69.1% precision, 71.2% recall, 70.1% F1-score, and 69.8% overall accuracy. Fourth, we introduce an online tool named EditEx that works with the SO edit system. EditEx can assist users while editing posts by suggesting the potential causes of rejections. We recruit 20 participants to assess the effectiveness of EditEx. Half of the participants (i.e., the treatment group) use EditEx and the other half (i.e., the control group) use the SO standard edit system to edit posts. According to our experiment, EditEx can help the SO standard edit system prevent 49% of rejected edits, including the commonly rejected ones. Even in free-form regular edits, it can prevent 12% of rejections. The treatment group finds the potential rejection reasons identified by EditEx influential. Furthermore, the median workload of suggesting edits using EditEx is half that of the SO edit system. Comment: Accepted for publication in Empirical Software Engineering (EMSE) journal
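    A feature extractor for a suggested edit might look like the sketch below. The three features (length delta, code-fence change, all-caps ratio) are hypothetical stand-ins chosen for illustration; they are not the 15 text- and user-based features used in the paper.

```python
import re

# Illustrative features comparing a post's original and edited text.
# Feature names and choices here are assumptions, not the paper's set.

def edit_features(original, edited):
    def code_blocks(text):
        # count paired ``` fences
        return len(re.findall(r"`{3}", text)) // 2

    words = edited.split()
    caps = sum(1 for w in words if w.isupper() and len(w) > 1)
    return {
        "length_delta": len(edited) - len(original),
        "code_block_delta": code_blocks(edited) - code_blocks(original),
        "caps_ratio": caps / len(words) if words else 0.0,
    }
```

Vectors like these would then be fed to a standard classifier (the paper trains four machine learning models) to predict whether an edit is likely to be rolled back.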

    Investigating the Quality Aspects of Crowd-Sourced Developer Forum: A Case Study of Stack Overflow

    Technical question and answer (Q&A) websites have changed how developers seek information on the web and have become more popular due to the shortcomings in official documentation and alternative knowledge sharing resources. Stack Overflow (SO) is one of the largest and most popular online Q&A websites for developers, where they can share knowledge by answering questions and learn new skills by asking questions. Unfortunately, a large number of questions (up to 29%) are not answered at all, which might hurt the quality or purpose of this community-oriented knowledge base. In this thesis, we first attempt to detect potentially unanswered questions during their submission using machine learning models. We compare unanswered and answered questions quantitatively and qualitatively. The quantitative analysis suggests that the topics discussed in the question, the experience of the question submitter, and the readability of question texts could often determine whether a question would be answered or not. Our qualitative study also reveals why questions remain unanswered, which could guide novice users to improve their questions. While analyzing the questions of SO, we see that many of them remain unanswered and unresolved because they contain code segments that could potentially have programming issues (e.g., errors, unexpected behavior); unfortunately, these issues could not always be reproduced by other users. This irreproducibility of issues might prevent questions of SO from getting answers or appropriate answers. In our second study, we thus conduct an exploratory study on the reproducibility of the issues discussed in questions and the correlation between issue reproducibility status (of questions) and corresponding answer meta-data such as the presence of an accepted answer. According to our analysis, a question with reproducible issues has at least a three times higher chance of receiving an accepted answer than a question with irreproducible issues.
However, users can improve the quality of questions and answers by editing. Unfortunately, such edits may be rejected (i.e., rolled back) due to undesired modifications and ambiguities. We thus offer a comprehensive overview of reasons and ambiguities in the SO rollback edits. We identify 14 reasons for rollback edits and eight ambiguities that are often present in those edits. We also develop algorithms to detect ambiguities automatically. During the above studies, we find that about half of the questions that received working solutions have negative scores. About 18% of the accepted answers also do not score the maximum votes. Furthermore, many users complain about the downvotes cast on their questions and answers. All these findings cast serious doubts on the reliability of the evaluation mechanism employed at SO. We thus concentrate on the assessment mechanism of SO to ensure a non-biased, reliable quality assessment. This study compares the subjective assessment of questions with their objective assessment using 2.5 million questions and ten text analysis metrics. We also develop machine learning models to classify the promoted and discouraged questions and predict them at submission time. We believe that the findings from our studies and proposed techniques have the potential to (1) help users ask better questions with appropriate code examples, and (2) improve the editing and assessment mechanisms of SO to promote better content quality.
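    One kind of objective text-analysis metric mentioned in the thesis can be sketched as a crude readability measure built from average sentence and word length. The exact ten metrics used in the study are not reproduced here; this is an illustrative stand-in.

```python
import re

# Hypothetical readability features for a question body: average sentence
# length (in words) and average word length (in characters).

def crude_readability(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    if not sentences or not words:
        return {"avg_sentence_len": 0.0, "avg_word_len": 0.0}
    return {
        "avg_sentence_len": len(words) / len(sentences),
        "avg_word_len": sum(len(w) for w in words) / len(words),
    }
```

Metrics like these give the objective side of the comparison; the subjective side comes from the community's votes on the same questions.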

    Walking across Wikipedia: a scale-free network model of semantic memory retrieval.

    Semantic knowledge has been investigated using both online and offline methods. One common online method is category recall, in which members of a semantic category like "animals" are retrieved in a given period of time. The order, timing, and number of retrievals are used as assays of semantic memory processes. One common offline method is corpus analysis, in which the structure of semantic knowledge is extracted from texts using co-occurrence or encyclopedic methods. Online measures of semantic processing, as well as offline measures of semantic structure, have yielded data resembling inverse power law distributions. The aim of the present study is to investigate whether these patterns in data might be related. A semantic network model of animal knowledge is formulated on the basis of Wikipedia pages and their overlap in word probability distributions. The network is scale-free, in that node degree is related to node frequency as an inverse power law. A random walk over this network is shown to simulate a number of results from a category recall experiment, including power law-like distributions of inter-response intervals.  Results are discussed in terms of theories of semantic structure and processing.
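    The retrieval model can be sketched as a random walk over a semantic network in which each first visit to a node counts as a "retrieval", and the steps between first visits play the role of inter-response intervals. The small hub-and-spoke graph below is an invented toy, not the Wikipedia-derived network from the paper.

```python
import random

# Toy semantic network: "animal" acts as a high-degree hub, loosely
# mimicking the scale-free structure described in the paper.
GRAPH = {
    "animal": ["dog", "cat", "fish", "bird"],
    "dog": ["animal", "cat"],
    "cat": ["animal", "dog"],
    "fish": ["animal"],
    "bird": ["animal"],
}

def category_recall(start, n_steps, seed=0):
    """Censored random walk: record each node's first visit and the
    number of steps since the previous new retrieval (the IRT)."""
    rng = random.Random(seed)
    node, visited, irts = start, {start}, []
    steps_since_new = 0
    for _ in range(n_steps):
        node = rng.choice(GRAPH[node])
        steps_since_new += 1
        if node not in visited:           # a new retrieval
            visited.add(node)
            irts.append(steps_since_new)  # inter-response interval
            steps_since_new = 0
    return sorted(visited), irts
```

On a large scale-free network, the sequence of inter-response intervals produced this way grows heavy-tailed, which is the power law-like pattern the paper compares against human category-recall data.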