19 research outputs found

    Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia

    Wikipedia's content is based on reliable and published sources. To date, relatively little is known about which sources Wikipedia relies on, in part because extracting citations and identifying cited sources is challenging. To close this gap, we release Wikipedia Citations, a comprehensive dataset of citations extracted from Wikipedia. A total of 29.3M citations were extracted from 6.1M English Wikipedia articles as of May 2020 and classified as referring to books, journal articles, or Web content. We were thus able to extract 4.0M citations to scholarly publications with known identifiers -- including DOI, PMC, PMID, and ISBN -- and to equip a further 261K citations with DOIs from Crossref. As a result, we find that 6.7% of Wikipedia articles cite at least one journal article with an associated DOI, and that Wikipedia cites just 2% of all articles with a DOI currently indexed in the Web of Science. We release our code to allow the community to build upon our work and update the dataset in the future.
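    The extraction step described here can be approximated from raw wikitext, where most references live inside {{cite ...}} templates. Below is a minimal sketch using the mwparserfromhell wikitext parser; the function name extract_identifiers and the sample snippet are ours for illustration, and the released pipeline is considerably more thorough (e.g., classifying citation types and querying Crossref for missing DOIs).

```python
# Minimal sketch: pull citation identifiers (DOI, PMID, PMC, ISBN) out of
# {{cite ...}} templates in raw wikitext. Illustrative only -- the released
# pipeline also classifies citations and looks up missing DOIs via Crossref.
import mwparserfromhell

IDENTIFIER_PARAMS = ("doi", "pmid", "pmc", "isbn")

def extract_identifiers(wikitext: str) -> list[dict]:
    """Return one record per citation template found in the wikitext."""
    citations = []
    for template in mwparserfromhell.parse(wikitext).filter_templates():
        name = str(template.name).strip().lower()
        if not name.startswith("cite"):
            continue  # skip infoboxes and other non-citation templates
        record = {"template": name}
        for param in IDENTIFIER_PARAMS:
            if template.has(param):
                record[param] = str(template.get(param).value).strip()
        citations.append(record)
    return citations

sample = "{{cite journal |title=Example |doi=10.1000/xyz123 |pmid=12345}}"
print(extract_identifiers(sample))
# -> [{'template': 'cite journal', 'doi': '10.1000/xyz123', 'pmid': '12345'}]
```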

    Conflict and Computation on Wikipedia: a Finite-State Machine Analysis of Editor Interactions

    What is the boundary between a vigorous argument and a breakdown of relations? What drives a group of individuals across it? Taking Wikipedia as a test case, we use a hidden Markov model to approximate the computational structure and social grammar of more than a decade of cooperation and conflict among its editors. Across a wide range of pages, we discover a bursty war/peace structure where the system can become trapped, sometimes for months, in a computational subspace associated with significantly higher levels of conflict-tracking "revert" actions. Distinct patterns of behavior characterize the lower-conflict subspace, including tit-for-tat reversion. While a fraction of the transitions between these subspaces are associated with top-down actions taken by administrators, the effects are weak. Surprisingly, we find no statistical signal that transitions are associated with the appearance of particularly anti-social users, and only weak association with significant news events outside the system. These findings are consistent with transitions being driven by decentralized processes with no clear locus of control. Models of belief revision in the presence of a common resource for information-sharing predict the existence of two distinct phases: a disordered high-conflict phase, and a frozen phase with spontaneously broken symmetry. The bistability we observe empirically may be a consequence of editor turnover, which drives the system to a critical point between them.
    Comment: 23 pages, 3 figures. Matches published version. Code for HMM fitting available at http://bit.ly/sfihmm ; time series and derived finite state machines at bit.ly/wiki_hm
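    The modeling idea, fitting a hidden Markov model to a page's conflict signal and reading off a low- and a high-conflict regime, can be sketched with off-the-shelf tools. The toy example below uses hmmlearn on a synthetic revert-count series; it stands in for, and does not reproduce, the authors' sfihmm code linked above.

```python
# Toy sketch (not the authors' sfihmm code): fit a two-state Gaussian HMM
# to a synthetic per-day revert count series and identify which days fall
# into the high-conflict ("war") regime.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)
# Stand-in for a page's edit history: calm stretches with a bursty episode.
calm = rng.poisson(0.5, size=300)
war = rng.poisson(6.0, size=60)
reverts = np.concatenate([calm, war, calm]).reshape(-1, 1).astype(float)

model = hmm.GaussianHMM(n_components=2, covariance_type="diag",
                        n_iter=200, random_state=0)
model.fit(reverts)
states = model.predict(reverts)        # most likely state per day (Viterbi)

high = int(np.argmax(model.means_.ravel()))  # state with the higher mean
print(f"mean reverts/day in high-conflict state: {model.means_.ravel()[high]:.2f}")
print(f"fraction of days in high-conflict state: {(states == high).mean():.2%}")
```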

    A Semantic Wiki-based Platform for IT Service Management

    The book investigates the use of a semantic wiki for IT Service Management (ITSM) within the IT department of an SME. It emphasizes the design and prototypical implementation of tools for integrating ITSM-relevant information into the semantic wiki, as well as tools for interactions between the wiki and external programs. The result is a platform for agile, semantic wiki-based ITSM for the IT administration teams of SMEs.

    Approaches for enriching and improving textual knowledge bases

    [no abstract]

    Approaches to Automatic Text Structuring

    Structured text helps readers to better understand the content of documents. In classic newspaper texts or books, some structure already exists. In the Web 2.0, the amount of textual data, especially user-generated data, has increased dramatically. As a result, there exists a large amount of textual data which lacks structure, making it more difficult to understand. In this thesis, we explore techniques for automatic text structuring to help readers fulfill their information needs. Useful techniques for automatic text structuring are keyphrase identification, table-of-contents generation, and link identification. We improve on state-of-the-art results for approaches to text structuring on several benchmark datasets. In addition, we present new representative datasets for users’ everyday tasks. We evaluate the quality of text structuring approaches with regard to these scenarios and discover that the quality of approaches highly depends on the dataset to which they are applied. In the first chapter of this thesis, we establish the theoretical foundations regarding text structuring. We describe our findings from a user survey regarding web usage, from which we derive three typical scenarios of Internet users. We then proceed to the three main contributions of this thesis. We evaluate approaches to keyphrase identification, both extracting and assigning keyphrases, for English and German datasets. We find that unsupervised keyphrase extraction yields stable results, but for datasets with predefined keyphrases, additional keyphrase filtering and assignment approaches yield even better results. We present a decompounding extension, which further improves results for datasets with shorter texts. We construct hierarchical tables of contents for documents from three English datasets and discover that the results for hierarchy identification are sufficient for an automatic system, but segment title generation requires user interaction based on suggestions. We investigate approaches to link identification, including the subtasks of identifying the mention (anchor) of the link and linking the mention to an entity (target). Approaches that make use of the Wikipedia link structure perform best, as long as sufficient training data is available. For identifying links to sense inventories other than Wikipedia, approaches that do not make use of the link structure outperform those using existing links. We further analyze the effect of senses on computing similarities. In contrast to entity linking, where most entities can be discriminated by their name, we consider cases where multiple entities with the same name exist. We discover that similarity depends on the selected sense inventory. To foster future evaluation of natural language processing components for text structuring, we present two prototypes of text structuring systems, which integrate techniques for automatic text structuring in a wiki setting and in an e-learning setting with eBooks.
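    As a flavor of the unsupervised keyphrase extraction evaluated in theses like this one, the sketch below ranks a document's uni- and bigrams by TF-IDF against a small background collection. The toy documents are ours, and the actual systems described above add further steps such as decompounding and keyphrase filtering.

```python
# Sketch of a TF-IDF keyphrase baseline: score the target document's
# uni- and bigrams against a background collection and keep the top five.
# The toy documents are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer

background = [
    "wikipedia is a collaboratively edited online encyclopedia",
    "readers use a table of contents to navigate long documents",
    "link identification connects mentions in text to target entities",
]
target = "automatic keyphrase extraction supports text structuring in wikis"

vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
tfidf = vectorizer.fit_transform(background + [target])

scores = tfidf[len(background)].toarray().ravel()  # scores for the target
terms = vectorizer.get_feature_names_out()
for score, term in sorted(zip(scores, terms), reverse=True)[:5]:
    if score > 0:
        print(f"{term}: {score:.3f}")
```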

    Information Management and Market Engineering. Vol. II

    The research program Information Management and Market Engineering focuses on the analysis and design of electronic markets. Taking a holistic view of the conceptualization and realization of solutions, the research integrates the disciplines of business administration, economics, computer science, and law. Topics of interest range from the implementation, quality assurance, and advancement of electronic markets to their integration into business processes and legal frameworks.

    Proceedings der 11. Internationalen Tagung Wirtschaftsinformatik (WI2013) - Band 1

    The two volumes represent the proceedings of the 11th International Conference on Wirtschaftsinformatik (WI2013, Business Information Systems). They include 118 papers from ten research tracks, a general track, and the Student Consortium. All submissions underwent a double-blind review procedure with three reviews per paper, resulting in an overall acceptance rate of 25 percent. WI2013 was organized at the University of Leipzig from February 27th to March 1st, 2013, under the main themes Innovation, Integration, and Individualization. Tracks include:
    Track 1: Individualization and Consumerization
    Track 2: Integrated Systems in Manufacturing Industries
    Track 3: Integrated Systems in Service Industries
    Track 4: Innovations and Business Models
    Track 5: Information and Knowledge Management

    How is encyclopaedia authority established?

    I embarked on this research because I wanted to explore the basis of textual authority. Such an understanding is particularly important in a world with such an overload of information that it is a challenge for the public to identify which publications to choose when looking for specific information. I decided to look at the case of encyclopaedias because of the widespread belief that encyclopaedias are the ultimate authorities. I also made this choice based on the observation that, aside from research on Wikipedia, the scientific community seems to overlook encyclopaedias, despite the role the latter play as key sources of information for the general public. Two theories are combined to serve as a framework for the thesis. On the one hand, there is the theory of cognitive authority as defined by Józef Maria Bocheński, Richard De George, and Patrick Wilson. On the other hand, there is the theory of quality as defined from the various frameworks recommended by librarians and information scientists for assessing the quality of reference works. These two theoretical frameworks are used to deconstruct the concept of authority and to highlight aspects of authority which may be particularly worthy of investigation. In this thesis, studies were conducted on the following: (1) a literature review on the origin and evolution of encyclopaedia authority throughout the history of the encyclopaedia, (2) a review of previous research pertaining to the quality and authority of Wikipedia, (3) an analysis of the publishing and dissemination of science and technology encyclopaedias published in the 21st century across worldwide libraries, (4) a survey of the perspectives of encyclopaedia authors on the role of encyclopaedias in society and on the communication of scientific uncertainties and controversies, and (5) an analysis of book reviews towards a general assessment of encyclopaedia quality. The thesis illustrates how a concept such as authority, which is typically taken for granted, can actually be more complex and more problematic than it appears, thereby challenging widespread beliefs in society. In particular, the thesis pinpoints potential contradictions regarding the importance of the author and the publisher in ensuring encyclopaedia authority. On a theoretical level, the thesis revisits the concept of cognitive authority and initiates a discussion on the complex interaction between authority and quality. On a more pragmatic level, it contributes towards the creation of guidelines for encyclopaedia development. As an exploratory study, the thesis also identifies a range of areas which should be priorities for future research.

    Wiktionary: The Metalexicographic and the Natural Language Processing Perspective

    Dictionaries are the main reference works for our understanding of language. They are used by humans and by computational methods alike. So far, the compilation of dictionaries has almost exclusively been the profession of expert lexicographers. The ease of collaboration on the Web and the rise of initiatives for collecting open-licensed knowledge, such as Wikipedia, have given rise to a new type of dictionary that is voluntarily created by large communities of Web users. This collaborative construction approach presents a new paradigm for lexicography that poses new research questions for dictionary research on the one hand and provides a very valuable knowledge source for natural language processing applications on the other. The subject of our research is Wiktionary, currently the largest collaboratively constructed dictionary project. In the first part of this thesis, we study Wiktionary from the metalexicographic perspective. Metalexicography is the scientific study of lexicography, including the analysis and criticism of dictionaries and lexicographic processes. To this end, we discuss three contributions related to this area of research: (i) We first provide a detailed analysis of Wiktionary and its various language editions and dictionary structures. (ii) We then analyze the collaborative construction process of Wiktionary. Our results show that the traditional phases of the lexicographic process do not apply well to Wiktionary, which is why we propose a novel process description based on the frequent and continual revision and discussion of the dictionary articles and the lexicographic instructions. (iii) We perform a large-scale quantitative comparison of Wiktionary and a number of other dictionaries regarding the covered languages, lexical entries, word senses, pragmatic labels, lexical relations, and translations. We conclude the metalexicographic perspective by finding that the collaborative Wiktionary is not an appropriate replacement for expert-built dictionaries, due to its inconsistencies, quality flaws, one-size-fits-all approach, and strong dependence on expert-built dictionaries. However, Wiktionary's rapid and continual growth, its high coverage of languages, newly coined words, domain-specific vocabulary, and non-standard language varieties, as well as the kind of evidence based on the authors' intuition, provide promising opportunities for both lexicography and natural language processing. In particular, we find that Wiktionary and expert-built wordnets and thesauri contain largely complementary entries. In the second part of the thesis, we study Wiktionary from the natural language processing perspective, with the aim of making its linguistic knowledge available for computational applications. Such applications require vast amounts of structured, high-quality data. Expert-built resources have been found to suffer from insufficient coverage and high construction and maintenance cost, whereas fully automatic extraction from corpora or the Web often yields resources of limited quality. Collaboratively built encyclopedias present a viable solution, but do not cover the linguistically oriented knowledge found in dictionaries well. That is why we propose extracting linguistic knowledge from Wiktionary, which we achieve through the following three main contributions: (i) We propose the novel multilingual ontology OntoWiktionary, created by extracting and harmonizing the weakly structured dictionary articles in Wiktionary. A particular challenge in this process is the ambiguity of semantic relations and translations, which we resolve by automatic word sense disambiguation methods. (ii) We automatically align Wiktionary with WordNet 3.0 at the word sense level. The largely complementary information from the two dictionaries yields an aligned resource with higher coverage and an enriched representation of word senses. (iii) We represent Wiktionary according to the ISO standard Lexical Markup Framework, which we adapt to the peculiarities of collaborative dictionaries. This standardized representation is of great importance for fostering the interoperability of resources and hence the dissemination of Wiktionary-based research. Our work thus presents a foundational step towards the large-scale integrated resource UBY, which facilitates unified access to a number of standardized dictionaries by means of a shared web interface for human users and an application programming interface for natural language processing applications. A user can, in particular, switch between and combine information from Wiktionary and other dictionaries without completely changing the software. Our final resource and the accompanying datasets and software are publicly available and can be employed for multiple natural language processing applications. It particularly fills the gap between the small expert-built wordnets and the large amount of encyclopedic knowledge in Wikipedia. We provide a survey of previous works utilizing Wiktionary, and we exemplify the usefulness of our work in two case studies on measuring verb similarity and detecting cross-lingual marketing blunders, both of which make use of our Wiktionary-based resource and the results of our metalexicographic study. We conclude the thesis by emphasizing the usefulness of collaborative dictionaries when combined with expert-built resources, which holds much untapped potential.
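    The word-sense alignment between Wiktionary and WordNet 3.0 can be illustrated with a deliberately simple gloss-overlap heuristic; the thesis itself uses stronger similarity measures, so this is only a sketch of the task. The sample gloss and helper names below are ours, and the example requires NLTK's WordNet data.

```python
# Deliberately simple gloss-overlap alignment between a (hypothetical)
# Wiktionary sense gloss and WordNet 3.0 synsets. Requires the NLTK
# WordNet corpus: nltk.download("wordnet").
from nltk.corpus import wordnet as wn

STOPWORDS = {"a", "an", "the", "of", "in", "to", "or", "and", "for", "that", "is"}

def content_words(text: str) -> set[str]:
    """Lowercased tokens minus punctuation and a tiny stopword list."""
    return {w.strip(".,;()") for w in text.lower().split()} - STOPWORDS

def align_sense(word: str, gloss: str):
    """Pick the WordNet synset whose definition overlaps the gloss most."""
    gloss_words = content_words(gloss)
    return max(wn.synsets(word),
               key=lambda s: len(gloss_words & content_words(s.definition())),
               default=None)

best = align_sense("bank", "an institution that accepts deposits and lends money")
if best is not None:
    print(best.name(), "-", best.definition())
```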