
    BlogForever D2.6: Data Extraction Methodology

    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction, and extends the inquiry through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting the semantics available in blogs and demonstrates the benefits of exploiting standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.
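    Markup standards such as microdata make those semantics directly machine-readable. As a minimal sketch (not the BlogForever implementation; the sample markup and property names below are invented for illustration), schema.org-style microdata can be pulled out of a blog post's HTML with Python's standard-library parser:

    ```python
    # Minimal microdata extraction sketch: collect itemprop values inside
    # itemscope elements. Illustrative only; real-world markup (nested
    # scopes, itemprop on links/meta tags) needs more handling.
    from html.parser import HTMLParser

    class MicrodataParser(HTMLParser):
        """Collects itemprop text values, one dict per itemscope element."""
        def __init__(self):
            super().__init__()
            self.items = []      # one dict of properties per itemscope
            self._prop = None    # itemprop currently awaiting its text

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if "itemscope" in attrs:
                self.items.append({})
            if "itemprop" in attrs and self.items:
                self._prop = attrs["itemprop"]

        def handle_data(self, data):
            if self._prop and data.strip():
                self.items[-1][self._prop] = data.strip()
                self._prop = None

    sample = """
    <article itemscope itemtype="http://schema.org/BlogPosting">
      <h1 itemprop="headline">Preserving the Blogosphere</h1>
      <span itemprop="author">A. Blogger</span>
    </article>
    """
    parser = MicrodataParser()
    parser.feed(sample)
    print(parser.items)
    # → [{'headline': 'Preserving the Blogosphere', 'author': 'A. Blogger'}]
    ```

    The same approach generalizes to microformats, where class attributes rather than itemprop attributes carry the property names.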

    A blog mining framework

    Blogs have become increasingly popular, and new blogs are created every day. Much of their content is useful for applications in various domains, such as business, politics, research, social work, and linguistics. However, automatically collecting and analyzing blogs is not straightforward, owing to the large size and dynamic nature of the blogosphere. In this article, the authors propose a framework for blog mining that includes spiders, parsers, analyzers, and visualizers, and present several examples of blog mining applications based on their framework. © 2006 IEEE.
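    The four components can be read as a linear pipeline: spiders fetch pages, parsers extract posts, analyzers compute statistics, and visualizers render the results. The sketch below illustrates that flow only; all function names and the toy data are hypothetical, not the paper's actual implementation:

    ```python
    # Hypothetical four-stage blog-mining pipeline: spider -> parser ->
    # analyzer -> visualizer. Each stage is a stand-in for the real thing.

    def spider(urls):
        """Stand-in crawler: yields (url, raw_html) pairs for given URLs."""
        fake_pages = {u: f"<html><p>post from {u}</p></html>" for u in urls}
        return fake_pages.items()

    def parse(raw_html):
        """Stand-in parser: strips the toy markup to recover post text."""
        return raw_html.replace("<html><p>", "").replace("</p></html>", "")

    def analyze(posts):
        """Stand-in analyzer: counts word frequencies across all posts."""
        counts = {}
        for post in posts:
            for word in post.split():
                counts[word] = counts.get(word, 0) + 1
        return counts

    def visualize(counts, width=20):
        """Stand-in visualizer: renders counts as a text bar chart."""
        top = max(counts.values())
        return {w: "#" * (width * c // top) for w, c in counts.items()}

    urls = ["blog.example/a", "blog.example/b"]
    posts = [parse(html) for _, html in spider(urls)]
    stats = analyze(posts)
    bars = visualize(stats)
    print(stats["post"])   # → 2: the word appears once per fetched page
    ```

    Keeping the stages decoupled like this is what lets the framework swap in different analyzers (topic, sentiment, link analysis) over the same crawled corpus.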

    Mining Meaning from Wikipedia

    Wikipedia is a goldmine of information, not just for its many readers but also for the growing community of researchers who recognize it as a resource of exceptional scale and utility. It represents a vast investment of manual effort and judgment: a huge, constantly evolving tapestry of concepts and relations that is being applied to a host of tasks. This article provides a comprehensive description of this work. It focuses on research that extracts and makes use of the concepts, relations, facts and descriptions found in Wikipedia, and organizes the work into four broad categories: applying Wikipedia to natural language processing; using it to facilitate information retrieval; using it for information extraction; and treating it as a resource for ontology building. The article addresses how Wikipedia is being used as is, how it is being improved and adapted, and how it is being combined with other structures to create entirely new resources. We identify the research groups and individuals involved, and how their work has developed in the last few years. We provide a comprehensive list of the open-source software they have produced. Accepted for publication in the International Journal of Human-Computer Studies.

    Content analysis of web sites from 2000 to 2004: a thematic meta-analysis

    The rise of the World Wide Web attracted the attention of social science scholars, especially those in communication schools, who studied it with various methods such as content analysis. However, the dynamic environment of the World Wide Web challenged this traditional research method, and, in turn, scholars tried to devise valid solutions, which are summarized in the literature review section. After 2000, few studies focused on the content analysis of Web sites, even as the World Wide Web developed rapidly and affected people's everyday life. This study conducted a thematic meta-analysis to examine how researchers applied content analysis to the World Wide Web after 2000. A total of 39 studies that used content analysis to study Web sites were identified from three sources; data were then collected and analyzed. This study found that, from 2000 to 2004, content analysis of the World Wide Web proliferated, and content analysis scholars created new strategies to cope with the challenges posed by the WWW. The suggestions made in this study form guidelines for the steps of content analysis research design, potentially helping future content analysis research on Web sites to develop valid methods for studying the fast-paced WWW.

    The document/book as a form of curatorial creativity

    This thesis discusses the document/book as an act of recording that can serve as a form of curatorial creativity. First, it explores the document as a space in a hybrid analog and digital era. It then introduces concrete examples of how curating the gallery and the book has changed in the 20th and 21st centuries, followed by a critical analysis of society as one big accumulation of documents. It proposes the invention of writing as the basis of our current digital spaces, and the space of the book as architecture. With respect to the curatorial discourse, it focuses on Springer's proposition of engaging with the library in order to develop new ways of organizing, collecting, and reassembling information. The first chapter introduces Benjamin Bratton's diagram of "The Stack", which serves to explore the physical spaces of information, describing how the infrastructure of books has expanded significantly from clay to paper, and now to the Cloud. It proposes the codex-book, a stack of paper sheets, as an analogy of the stack, through the example of artist Irfan Hendrian's "Some Other Matter" exhibition. It also proposes the page, a place of inventory and invention, as the first virtual space of humanity. The second chapter discusses the library's primary functions of storage and retrievability, proposing the Library of Alexandria as the first information organization. It then returns to an example of how the old model of the library can be used to create a new display for the gallery, as well as to give value to its collection through physical activation. Finally, it explores some of the invisible systems (covers, algorithms, tags) that now build our digital libraries. The third chapter focuses on copy and print as essential tools for recording, preservation, and building collections. It introduces the history of mass digitization and the changes it has brought to analog documents. 
    It also explores the space of digital and print through Kenneth Goldsmith's curatorial project, which called for printing out the entire internet. This example leads to a discussion of the history of the A4-size paper sheet as the first completely standardized product. The fourth chapter presents the "neutral" containers, starting from the concept of the "gallery-book" proposed by Bernard Teyssendier as a place of movement, pleasure, and learning. It also explores architecture and design as curatorial infrastructure for exhibitions happening both in a gallery space and on a blank document. Finally, it draws a parallel between the white paper page and the white gallery wall as places of artistic intervention which, far from being invisible, follow specific predefined structures. The fifth chapter presents projects that propose new curated writing and reading contexts between print and digital. Here, Brian O'Doherty's issue of "Aspen" magazine is proposed as proto-hypertext, or as a premonition of the website. The website-as-gallery concept is then explored through the example of Kadist's "One Sentence Exhibition" project, which leads to exploring the fragility and impermanence of the hyperlink, in contrast to its printed counterparts. The final chapter presents three projects that use the infrastructure of the book and the library as a curatorial agency, proposing new methods for curating information through collection, organization, and research: "Intercalations", a paginated exhibition series by Anna-Sophie Springer and Etienne Turpin; "MAP", a folded encyclopedia by the David A. Garcia architecture studio; and "Carte(s) Mémoire(s)" by ExposerPublier, which proposes the exhibition as a moment of activation.

    GAUGING PUBLIC INTEREST FROM SERVER LOGS, SURVEYS AND INLINKS: A Multi-Method Approach to Analyze News Websites

    As the World Wide Web (the Web) has turned into a full-fledged medium for disseminating news, it is important for journalism and information science researchers to investigate how Web users access online news reports and how to interpret such usage patterns. This doctoral thesis collected and analyzed Web server log statistics, online survey results, online reprints of the top 50 news reports, and external inlink data of a leading comprehensive online newspaper (the People's Daily Online) in China, one of the biggest Web/information markets in today's world. The aim of the thesis was to explore various methods of gauging public interest from a Webometrics perspective. A total of 129 days of Web server log statistics, including the top 50 Chinese and English news stories with the highest daily pageview numbers, the comments attracted by these news items, and the emailed frequencies of the same stories, were collected from October 2007 to September 2008. These top 50 news items' positions on the Chinese and English homepages and the top 50 queries submitted to the website search engine of the People's Daily Online were also retrieved. Results of the two online surveys launched in March 2008 and March 2009 were collected after their respective closing dates. The external inlinks to the People's Daily Online were retrieved via Yahoo! (Chinese and English versions), and the online reprints were retrieved via Google. Besides the general usage patterns identified from the top 50 news stories, this study, by conducting statistical tests on the data sets, also reveals the following findings. First, the editors' choices and the readers' favorites do not always match; thus the content of a news title is more important than its homepage position in attracting online visits. Second, the Chinese and English readers' interests in the same events differ. 
    Third, the pageview numbers and comments posted to the news items reflect the unfavorable attitudes of the Chinese people toward the United States and Japan, which might offer a method to investigate public interest in other issues or nations after necessary modifications. More importantly, some publicly available data, such as the comments posted to the news stories and the online survey results, further show that the pageview measure does reflect readers' interests and needs truthfully, as demonstrated by the strong correlations between the top news reports and relevant top queries. The external inlinks to the news websites and the online reprints of the top news items help examine readers' interests from other perspectives, as well as establish online profiles of the news websites. Such publicly accessible information could serve as an alternative data source for researchers studying readers' interests when Web server log data are not available. This doctoral thesis not only shows the usefulness of Web server log statistics, survey results, and other publicly accessible data in studying Web users' information needs, but also offers practical suggestions for online news sites to improve their content and homepage designs. However, no single method can draw a complete picture of online news readers' interests; the above-mentioned research methodologies should be employed together in order to reach more comprehensive conclusions. Future research is especially needed to investigate the continuing rapid growth of "Mobile News Readers," which poses both challenges and opportunities to the press industry in the 21st century.
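    The kind of correlation the thesis reports between pageview rankings and search-query rankings is commonly measured with Spearman's rank correlation. A self-contained sketch (the numbers below are invented for illustration; the thesis's actual data are not reproduced here):

    ```python
    # Spearman's rho between two rankings, using the classic formula
    # rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid when there are no ties.

    def spearman(xs, ys):
        """Spearman's rank correlation for two equal-length lists, no ties."""
        n = len(xs)
        def ranks(vals):
            # rank 1 = largest value, mirroring "top story" / "top query" lists
            order = sorted(range(n), key=lambda i: vals[i], reverse=True)
            r = [0] * n
            for rank, i in enumerate(order, start=1):
                r[i] = rank
            return r
        rx, ry = ranks(xs), ranks(ys)
        d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
        return 1 - 6 * d2 / (n * (n * n - 1))

    pageviews  = [9500, 7200, 6100, 3300, 1200]  # story pageview counts
    query_hits = [880, 700, 640, 210, 90]        # related query frequencies
    rho = spearman(pageviews, query_hits)
    print(round(rho, 2))   # → 1.0: identical orderings, perfect correlation
    ```

    A rho near 1 over many days would support the thesis's claim that pageviews truthfully track the interests readers also express through search queries.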

    Intellectual property rights in a knowledge-based economy

    Intellectual property rights (IPR) were created as economic mechanisms to facilitate ongoing innovation by granting inventors a temporary monopoly in return for disclosure of technical know-how. Since the beginning of the 1980s, IPR have come under scrutiny as new technological paradigms appeared with the emergence of knowledge-based industries. Knowledge-based products are intangible, non-excludable and non-rivalrous goods; consequently, it is difficult for their creators to control their dissemination and use. In particular, many information goods are based on network externalities and on the creation of market standards. At the same time, information technologies are generic in the sense of being useful in many places in the economy. Hence, policy makers often describe current IPR regimes in the context of new technologies as both over- and under-protective: over-protective in that they prevent the dissemination of information that has a very high social value, and under-protective in that they do not give inventors strong control over the appropriation of rents from their inventions and thus may not provide strong incentives to innovate. During the 1980s, attempts to assess the role of IPR in the process of technological learning found that even though firms in high-tech sectors do use patents as part of their strategy for intellectual property protection, these sectors rely on patents as an information source for innovation less than traditional industries do. Intellectual property rights are based mainly on patents for technical inventions and on copyrights for artistic works. Patents are granted only if inventions display minimal levels of utility, novelty and non-obviousness of technical know-how. By contrast, copyrights protect only final works and their derivatives, but guarantee protection for longer periods, according to the Berne Convention. 
    Licensing is a legal instrument that allows the use of patented technology by other firms in return for royalty fees paid to the inventor. Licensing can be contracted on an exclusive or non-exclusive basis, but in most countries patented knowledge can be exclusively held by its inventors, as legal provisions for compulsory licensing of technologies do not exist. The fair use doctrine aims to prevent the formation of perfect monopolies over technological fields and copyrighted artefacts as a result of IPR application. Hence, the use of patented and copyrighted works is permissible in academic research, education and the development of technologies that are complementary to core technologies. Trade secrecy is meant to prevent inadvertent technology transfer to rival firms and is based on contracts between companies and employees. However, as trade secrets prohibit the transfer of knowledge within industries, regulators have attempted to foster the disclosure of technical know-how by the institutional means of patents, copyrights and sui-generis laws. Indeed, following the provisions formed by IPR regulation, firms have shifted from trade secrecy toward patenting strategies to achieve improved protection of intellectual property, as well as to acquire competitive advantages in the market through the monopolization of technological advances.

    EBSLG Annual General Conference, 18. - 21.05.2010, Cologne. Selected papers

    On 18-21 May 2010, the Annual General Conference of the European Business Schools Librarians Group (EBSLG) took place at the Universitäts- und Stadtbibliothek (USB) Köln. The EBSLG is a relatively small but exclusive group of library directors and librarians in leadership positions from the libraries of leading business schools. The conference centered on two main themes: the first dealt with library portals and library search engines; the second with questions of library organization, such as a library's organizational structure, outsourcing, and relationship management. This proceedings volume contains selected conference papers.
