12 research outputs found

    Web Data Extraction, Applications and Techniques: A Survey

    Full text link
    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool to perform data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques allow researchers to gather large amounts of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, offering unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential of cross-fertilization, i.e., the possibility of reusing Web Data Extraction techniques originally designed to work in a given domain in other domains. Comment: Knowledge-Based Systems

    How comprehensive is the PubMed Central Open Access full-text database?

    Get PDF
    The comprehensiveness of a database is a prerequisite for the quality of scientific work built on this increasingly significant infrastructure. This is especially so for large-scale text-mining analyses of scientific publications facilitated by open-access full-text scientific databases. Given the lack of research concerning the comprehensiveness of this type of academic resource, we conducted a project to analyze the coverage of materials in the PubMed Central Open Access Subset (PMCOAS), a popular source for open-access scientific publications, in terms of the PubMed database. The preliminary results show that PMCOAS coverage has increased rapidly in recent years, despite vast differences across MeSH descriptors.
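The coverage analysis described above reduces, at its core, to a per-descriptor ratio of PMCOAS records to PubMed records. A minimal sketch of that computation, using made-up placeholder counts rather than the study's actual figures:

```python
# Illustrative coverage computation: fraction of PubMed records for a MeSH
# descriptor that also appear in the PMC Open Access Subset. The counts
# below are made-up placeholders, not the study's data.
def coverage(pmcoas_count: int, pubmed_count: int) -> float:
    """Return PMCOAS coverage as a fraction of PubMed records."""
    return pmcoas_count / pubmed_count if pubmed_count else 0.0

# Hypothetical per-descriptor counts: (PMCOAS, PubMed)
counts = {"Neoplasms": (120_000, 400_000), "Humans": (900_000, 9_000_000)}
for mesh, (oa, pm) in counts.items():
    print(f"{mesh}: {coverage(oa, pm):.1%}")
```

Real counts could be obtained by querying the two databases per descriptor, which is how vast differences across descriptors would surface.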

    Adverse Drug Event Detection, Causality Inference, Patient Communication and Translational Research

    Get PDF
    Adverse drug events (ADEs) are injuries resulting from a medical intervention related to a drug. ADEs are responsible for nearly 20% of all the adverse events that occur in hospitalized patients. ADEs have been shown to increase the cost of health care and the length of hospital stays. Therefore, detecting and preventing ADEs for pharmacovigilance is an important task that can improve the quality of health care and reduce costs in a hospital setting. In this dissertation, we focus on the development of ADEtector, a system that identifies ADEs and medication information from electronic medical records and FDA Adverse Event Reporting System reports. The ADEtector system employs novel natural language processing approaches for ADE detection and provides a user interface to display ADE information. The ADEtector employs machine learning techniques to automatically process narrative text and identify the adverse event (AE) and medication entities that appear in it. The system analyzes the recognized entities to infer causal relations between AEs and medications by automating the elements of the Naranjo score using knowledge- and rule-based approaches. The Naranjo Adverse Drug Reaction Probability Scale is a validated tool for assessing the causality of a drug-induced adverse event or ADE. The scale calculates the likelihood that an adverse event is related to a drug based on a list of weighted questions. The ADEtector also presents the user with evidence for ADEs by extracting figures that contain ADE-related information from the biomedical literature. A brief summary is generated for each extracted figure to help users comprehend it and the ADE information it conveys.
The ADEtector also helps patients better understand narrative text by recognizing complex medical jargon and abbreviations that appear in it and providing definitions and explanations from external knowledge resources. This system could help clinicians and researchers discover novel ADEs and drug relations and hypothesize new research questions within the ADE domain.
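The Naranjo scale referred to above is a published, fixed questionnaire, so its scoring step can be sketched directly. The weights and interpretation thresholds below follow the published scale (Naranjo et al., 1981); the function and variable names are illustrative and not taken from ADEtector:

```python
# Sketch of the Naranjo Adverse Drug Reaction Probability Scale scoring.
# Weights and thresholds follow the published scale; names are hypothetical.

# (yes, no, unknown) points for each of the ten questions
NARANJO_WEIGHTS = [
    (+1,  0, 0),  # 1. Previous conclusive reports on this reaction?
    (+2, -1, 0),  # 2. Did the event appear after the drug was given?
    (+1,  0, 0),  # 3. Did it improve when the drug was stopped?
    (+2, -1, 0),  # 4. Did it reappear on readministration?
    (-1, +2, 0),  # 5. Are there alternative causes for the event?
    (-1, +1, 0),  # 6. Did it reappear with placebo?
    (+1,  0, 0),  # 7. Was the drug detected at a toxic concentration?
    (+1,  0, 0),  # 8. Was the reaction dose-dependent?
    (+1,  0, 0),  # 9. Similar reaction to the same/similar drugs before?
    (+1,  0, 0),  # 10. Confirmed by any objective evidence?
]

ANSWER_INDEX = {"yes": 0, "no": 1, "unknown": 2}

def naranjo_score(answers):
    """answers: list of ten strings ('yes' / 'no' / 'unknown')."""
    return sum(NARANJO_WEIGHTS[i][ANSWER_INDEX[a]] for i, a in enumerate(answers))

def causality(score):
    """Map a total score to the scale's causality category."""
    if score >= 9:
        return "definite"
    if score >= 5:
        return "probable"
    if score >= 1:
        return "possible"
    return "doubtful"

answers = ["yes", "yes", "yes", "unknown", "no", "unknown",
           "unknown", "yes", "no", "yes"]
score = naranjo_score(answers)
print(score, causality(score))
```

Automating the scale, as the dissertation describes, amounts to answering these ten questions from the clinical narrative with NLP rather than by hand; the arithmetic itself is this simple.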

    Mapping Scholarly Communication Infrastructure: A Bibliographic Scan of Digital Scholarly Communication Infrastructure

    Get PDF
    This bibliography scan covers a lot of ground. In it, I have attempted to capture relevant recent literature across the whole of the digital scholarly communications infrastructure. I have used that literature to identify significant projects and then document them with descriptions and basic information. Structurally, this review has three parts. In the first, I begin with a diagram showing the way the projects reviewed fit into the research workflow; then I cover a number of topics and functional areas related to digital scholarly communication. I make no attempt to be comprehensive, especially regarding the technical literature; rather, I have tried to identify major articles and reports, particularly those addressing the library community. The second part of this review is a list of projects or programs arranged by broad functional categories. The third part lists individual projects and the organizations—both commercial and nonprofit—that support them. I have identified 206 projects. Of these, 139 are nonprofit and 67 are commercial. There are 17 organizations that support multiple projects, and six of these—Artefactual Systems, Atypon/Wiley, Clarivate Analytics, Digital Science, Elsevier, and MDPI—are commercial. The remaining 11—Center for Open Science, Collaborative Knowledge Foundation (Coko), LYRASIS/DuraSpace, Educopia Institute, Internet Archive, JISC, OCLC, OpenAIRE, Open Access Button, Our Research (formerly Impactstory), and the Public Knowledge Project—are nonprofit. Andrew W. Mellon Foundation.

    A General Architecture to Enhance Wiki Systems with Natural Language Processing Techniques

    Get PDF
    Wikis are web-based software applications that allow users to collaboratively create and edit web page content through a Web browser using a simplified syntax. The ease-of-use and “open” philosophy of wikis has brought them to the attention of organizations and online communities, leading to widespread adoption as a simple and “quick” way of collaborative knowledge management. However, these characteristics of wiki systems can act as a double-edged sword: When wiki content is not properly structured, it can turn into a “tangle of links”, making navigation, organization and content retrieval difficult for end-users. Since wiki content is mostly written in unstructured natural language, we believe that existing state-of-the-art techniques from the Natural Language Processing (NLP) and Semantic Computing domains can help mitigate these common problems when using wikis and improve their users’ experience by introducing new features. The challenge, however, is to find a solution for integrating novel semantic analysis algorithms into the multitude of existing wiki systems, without the need to modify their engines. In this research work, we present a general architecture that allows wiki systems to benefit from NLP services made available through the Semantic Assistants framework – a service-oriented architecture for brokering NLP pipelines as web services. Our main contributions in this thesis include an analysis of wiki engines, the development of collaboration patterns between wikis and NLP, and the design of a cohesive integration architecture. As a concrete application, we deployed our integration to MediaWiki – the powerful wiki engine behind Wikipedia – to prove its practicability.
Finally, we evaluate the usability and efficiency of our integration through a number of user studies performed in real-world projects from various domains, including cultural heritage data management, software requirements engineering, and biomedical literature curation.

    B!SON: A Tool for Open Access Journal Recommendation

    Get PDF
    Finding a suitable open access journal to publish scientific work is a complex task: Researchers have to navigate a constantly growing number of journals, institutional agreements with publishers, funders’ conditions and the risk of predatory publishers. To help with these challenges, we introduce a web-based journal recommendation system called B!SON. It is developed based on a systematic requirements analysis, built on open data, gives publisher-independent recommendations and works across domains. It suggests open access journals based on the title, abstract and references provided by the user. The recommendation quality has been evaluated using a large test set of 10,000 articles. Development by two German scientific libraries ensures the longevity of the project.

    Figure summarizer browser extensions for PubMed Central

    No full text
    Summary: Figures in biomedical articles present visual evidence for research facts and help readers understand the article better. However, when figures are taken out of context, it is difficult to understand their content. We developed a summarization algorithm to summarize the content of figures and used it in our figure search engine (http://figuresearch.askhermes.org/). In this article, we report on the development of web browser extensions for Mozilla Firefox, Google Chrome and Apple Safari to display summaries for figures in PubMed Central and NCBI Images.

    Knowledge Management, Trust and Communication in the Era of Social Media

    Get PDF
    The article entitled "Selected Aspects of Evaluating Knowledge Management Quality in Contemporary Enterprises" broadens the understanding of knowledge management and evaluates selected aspects of knowledge management quality in modern enterprises from theoretical and practical perspectives. The seventh article presents the results of pilot studies on the involvement of the four largest Information and Communication Technology (ICT) companies in promoting the Sustainable Development Goals (SDGs) through social media, examining which communication strategies the companies use. The primary purpose of the eighth article is to present the relationship between trust and knowledge sharing, taking into account the importance of this issue for business efficiency. The results showed that trust is vital in sharing knowledge and essential in achieving a high level of performance efficiency. The ninth article presents the impact of social media on consumer choices in tourism and the specificity of tourist products. The study's main purpose was to identify the social media most commonly used by Generation Y in selecting a tourist destination and planning their travel. The tenth article aims to identify the most important purposes for which women use social media, according to respondents' age and their countries' level of economic development. The research was conducted through an online survey in 2017–2018, followed by an analysis of the results from eight countries. The article entitled "Integrated Question-Answering System for Natural Disaster Domains Based on Social Media Messages Posted at the Time of Disaster" presents the framework of a question-answering system developed using a Twitter dataset of more than 9 million tweets compiled during the Osaka North Earthquake of 18 June 2018.
The authors also study the structure of the questions posed and develop methods for classifying them into particular categories to find answers from the dataset using an ontology, word similarity, keyword frequency, and natural language processing. The book provides a theoretical and practical background on trust, knowledge management, and communication in the era of social media. The editor believes that this collection of articles will be relevant to professionals, researchers, and students. The authors diagnose the current situation and point out new challenges and future directions in this area.
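The keyword-frequency classification step mentioned above can be illustrated with a minimal sketch. The category names and keyword lists here are invented for demonstration; the paper's ontology- and similarity-based matching is not reproduced:

```python
# Illustrative keyword-based classifier for disaster-related questions.
# Categories and keyword sets are invented, not taken from the paper.
CATEGORY_KEYWORDS = {
    "transport": {"train", "subway", "bus", "running", "line"},
    "utilities": {"power", "electricity", "gas", "water", "outage"},
    "safety":    {"aftershock", "evacuation", "shelter", "safe"},
}

def classify(question: str) -> str:
    """Assign the category whose keywords overlap the question the most."""
    words = set(question.lower().split())
    scores = {cat: len(words & kw) for cat, kw in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"

print(classify("is the subway line running again"))  # transport
print(classify("when will power be restored"))       # utilities
```

The paper's actual pipeline would then retrieve candidate answers from the tweet dataset within the predicted category, using word similarity and an ontology to rank them.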

    Preface

    Get PDF

    The Future of Information Sciences : INFuture2015 : e-Institutions – Openness, Accessibility, and Preservation

    Get PDF