
    Building domain-specific web collections for scientific digital libraries: A meta-search enhanced focused crawling method

    Collecting domain-specific documents from the Web using focused crawlers has been considered one of the most important strategies for building digital libraries that serve the scientific community. However, because most focused crawlers use local search algorithms to traverse the Web space, they can easily become trapped within a limited sub-graph of the Web surrounding the starting URLs, producing domain-specific collections that are not comprehensive and diverse enough for scientists and researchers. In this study, we investigated the problems that local search algorithms cause for traditional focused crawlers and proposed a new crawling approach, meta-search enhanced focused crawling, to address them. We conducted two user evaluation experiments to examine the performance of the proposed approach; the results showed that it builds domain-specific collections of higher quality than traditional focused crawling techniques.
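    For illustration, the following Python sketch captures the general idea under stated assumptions: relevance is plain keyword overlap against an example term list, and meta_search() is a hypothetical stub standing in for calls to external search engines; it is not the authors' implementation.

    # Minimal sketch of a meta-search enhanced focused crawler (assumed design).
    # Best-first crawling by keyword relevance, with periodic injection of fresh
    # seeds from a meta-search step to escape the local Web sub-graph.
    import heapq, re, urllib.parse
    import requests
    from bs4 import BeautifulSoup

    DOMAIN_TERMS = {"nanotechnology", "nanomaterial", "graphene"}  # example topic

    def relevance(text):
        words = set(re.findall(r"[a-z]+", text.lower()))
        return len(words & DOMAIN_TERMS) / len(DOMAIN_TERMS)

    def meta_search(query, k=10):
        # Hypothetical hook: query one or more search-engine APIs, return URLs.
        return []

    def crawl(seed_urls, budget=200, inject_every=50):
        frontier = [(-1.0, u) for u in seed_urls]     # max-heap via negated scores
        heapq.heapify(frontier)
        seen, collection, fetched = set(seed_urls), [], 0
        while frontier and fetched < budget:
            if fetched and fetched % inject_every == 0:
                # Meta-search injection: seeds outside the current sub-graph.
                for u in meta_search(" ".join(DOMAIN_TERMS)):
                    if u not in seen:
                        seen.add(u)
                        heapq.heappush(frontier, (-1.0, u))
            _, url = heapq.heappop(frontier)
            try:
                page = requests.get(url, timeout=10)
            except requests.RequestException:
                continue
            fetched += 1
            soup = BeautifulSoup(page.text, "html.parser")
            score = relevance(soup.get_text(" "))
            if score > 0:
                collection.append((url, score))
            for a in soup.find_all("a", href=True):
                link = urllib.parse.urljoin(url, a["href"])
                if link.startswith("http") and link not in seen:
                    seen.add(link)
                    heapq.heappush(frontier, (-score, link))  # best-first expansion
        return collection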

    Research on Cognitive Pattern of the Concept of Smart City with Crawler Technology

    The smart city is a new form of the information city and the digital city, and an innovative means of urban planning and management; its theoretical research and construction practice have entered a period of rapid development. An in-depth understanding of the concepts related to the smart city helps avoid one-sidedness and blindness in smart city construction. This paper collects and analyzes social media data by means of web crawler technology. We then build a cognitive model of the smart city concept using e-commerce user-profiling (portrait) technology and discuss the definition and information labels of the smart city. Finally, the technology-based smart city and the sustainable smart city are compared and analyzed using this cognitive model. The purpose is to provide insights for the future development of smart cities.

    Mining the Automotive Industry: A Network Analysis of Corporate Positioning and Technological Trends

    The digital transformation is driving revolutionary innovations, and new market entrants threaten established sectors of the economy such as the automotive industry. Following the need to monitor shifting industries, we present a network-centred analysis of car manufacturer web pages. Exploiting solely publicly available information, we construct large networks from web pages and hyperlinks. The network properties disclose the internal corporate positioning of the three largest automotive manufacturers, Toyota, Volkswagen and Hyundai, with respect to innovative trends and their international outlook. We tag web pages concerned with topics such as e-mobility and environment or autonomous driving, and investigate their relevance in the network. Sentiment analysis on individual web pages uncovers a relationship between page linking and the use of positive language, particularly with respect to innovative trends. Web pages of the same country domain form clusters of different sizes in the network that reveal strong correlations with sales market orientation. Our approach maintains the hierarchical structure that the web page networks impose on the web content. It thus presents a method to reveal hierarchical structures in unstructured text obtained from web scraping. It is highly transparent, reproducible and data-driven, and could be used to gain complementary insights into the innovation strategies of firms and competitive landscapes that would not be detectable by the analysis of web content alone. Comment: Preprint version to be published in Springer Nature (presented at CompleNet 2020).
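    As a rough illustration of the network-centred approach (not the paper's actual pipeline), the Python sketch below builds a directed page/hyperlink graph with networkx, tags pages using small assumed topic lexicons, and uses PageRank as a proxy for a topic's structural relevance; the pages mapping is a hypothetical pre-crawled input of the form url -> (page text, list of out-links).

    import networkx as nx

    # Assumed topic lexicons; the paper's actual tagging scheme is not reproduced.
    TOPIC_TERMS = {"e-mobility": {"electric", "battery", "charging"},
                   "autonomous driving": {"autonomous", "self-driving", "lidar"}}

    def build_network(pages):
        g = nx.DiGraph()
        for url, (text, links) in pages.items():
            tags = [t for t, terms in TOPIC_TERMS.items()
                    if any(w in text.lower() for w in terms)]
            g.add_node(url, topics=tags)
            g.add_edges_from((url, dst) for dst in links if dst in pages)
        return g

    def topic_relevance(g):
        # PageRank as a proxy for a page's structural importance in the network.
        pr = nx.pagerank(g)
        return {topic: sum(pr[n] for n, d in g.nodes(data=True) if topic in d["topics"])
                for topic in TOPIC_TERMS}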

    An extensible framework for automatic knowledge extraction from student blogs

    This article introduces a framework for automatically extracting knowledge from student blogs and injecting it into a shared resource, namely a wiki. This is motivated by the need to preserve knowledge generated by students beyond their time of study. The framework is described in the context of the Bachelor of Creative Technologies degree at the Auckland University of Technology in New Zealand, where it is being deployed alongside an existing blogging and ePortfolio process. The framework uses an extensible, layered architecture that allows for incremental development of system components to enhance functionality over time. The current implementation is in beta testing and uses simple heuristics in the core components. This article presents a road map for extending this functionality to improve the quality of knowledge extraction by introducing techniques from the field of artificial intelligence.
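    A hedged sketch of what such an extensible, layered pipeline could look like in Python follows; the stage functions and the naive paragraph heuristic are hypothetical stand-ins for the framework's components, not its actual code.

    import re
    from typing import Callable, List

    Stage = Callable[[str], str]

    def simple_extract(blog_html: str) -> str:
        # Naive heuristic layer: keep only text found between <p> tags.
        return "\n".join(re.findall(r"<p>(.*?)</p>", blog_html, re.DOTALL))

    def to_wiki_markup(text: str) -> str:
        # Transformation layer: wrap extracted text in wiki markup.
        return "== Student notes ==\n" + text

    def publish_stub(wiki_text: str) -> str:
        # Placeholder for the injection layer (e.g. a call to the wiki's edit API).
        return wiki_text

    def run_pipeline(blog_html: str, stages: List[Stage]) -> str:
        # Each layer can be replaced independently as smarter components mature.
        for stage in stages:
            blog_html = stage(blog_html)
        return blog_html

    print(run_pipeline("<p>Shader tips from week 3.</p>",
                       [simple_extract, to_wiki_markup, publish_stub]))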

    Applying digital content management to support localisation

    The retrieval and presentation of digital content such as that on the World Wide Web (WWW) is a substantial area of research. While recent years have seen huge expansion in the size of web-based archives that can be searched efficiently by commercial search engines, the presentation of potentially relevant content is still limited to ranked document lists represented by simple text snippets or image keyframe surrogates. There is growing interest in techniques for personalising the presentation of content to improve the richness and effectiveness of the user experience. One of the most significant challenges to achieving this is the increasingly multilingual nature of this data and the need to provide suitably localised responses to users based on this content. The Digital Content Management (DCM) track of the Centre for Next Generation Localisation (CNGL) seeks to develop technologies that support advanced personalised access to and presentation of information by combining elements from the existing research areas of Adaptive Hypermedia and Information Retrieval. The combination of these technologies is intended to produce significant improvements in the way users access information. We review key features of these technologies and introduce early ideas for how they can support localisation and localised content, before concluding with some impressions of future directions in DCM.

    Revisiting the Technology Challenges and Proposing Enhancements in Ambient Assisted Living for the Elderly

    Several social and technical trends support the elderly's desire to live independently in their preferred environment, despite their increasing medical needs, and to enhance their quality of life at home. Ambient assisted living (AAL) has the capability to support the elderly and to decrease their dependency on formal or informal caregivers. We provide a review of the technological challenges identified as inhibiting factors over the past decade and then present recent technological advances such as cloud computing, machine learning, artificial intelligence, and the Internet of Things. We also fill a gap in the current literature with regard to specific AAL solutions and propose a fourth-generation AAL technology design. We find that most informal caregivers are family members who are medically untrained and that the use of advanced analytical processes on AAL-generated data could significantly improve symptom identification. We also present the implications and remaining challenges, along with recommendations for future research.

    NLP-Based Techniques for Cyber Threat Intelligence

    In the digital era, threat actors employ sophisticated techniques that often leave digital traces in the form of textual data. Cyber Threat Intelligence (CTI) covers the solutions for data collection, processing, and analysis that help understand a threat actor's targets and attack behavior. CTI currently plays an increasingly crucial role in identifying and mitigating threats and enabling proactive defense strategies. In this context, NLP, a branch of artificial intelligence, has emerged as a powerful tool for enhancing threat intelligence capabilities. This survey paper provides a comprehensive overview of NLP-based techniques applied in the context of threat intelligence. It begins by describing the foundational definitions and principles of CTI as a major tool for safeguarding digital assets. It then undertakes a thorough examination of NLP-based techniques for CTI data crawling from Web sources, CTI data analysis, relation extraction from cybersecurity data, CTI sharing and collaboration, and the security threats of CTI itself. Finally, the challenges and limitations of NLP in threat intelligence are exhaustively examined, including data quality issues and ethical considerations. This survey provides a complete framework and serves as a valuable resource for security professionals and researchers seeking to understand state-of-the-art NLP-based threat intelligence techniques and their potential impact on cybersecurity.
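    As a small, self-contained illustration of one CTI text-analysis step covered by such surveys, the Python sketch below extracts common indicators of compromise (IP addresses, CVE identifiers, file hashes, domains) from free-text reports with regular expressions; the patterns and example text are illustrative assumptions, and real NLP-based pipelines add named-entity recognition, relation extraction, and more.

    import re

    # Illustrative IOC patterns; production extractors are far more thorough.
    IOC_PATTERNS = {
        "ipv4":   re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
        "cve":    re.compile(r"\bCVE-\d{4}-\d{4,7}\b", re.IGNORECASE),
        "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
        "domain": re.compile(r"\b[a-z0-9-]+\.(?:com|net|org|io)\b", re.IGNORECASE),
    }

    def extract_iocs(report_text):
        return {name: sorted(set(p.findall(report_text)))
                for name, p in IOC_PATTERNS.items()}

    print(extract_iocs("Actor used 185.220.101.4 exploiting CVE-2023-23397 via evil-c2.net."))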

    Acquisition des contenus intelligents dans l’archivage du Web

    Web sites are dynamic by nature, with content and structure changing over time; many pages on the Web are produced by content management systems (CMSs). The tools currently used by Web archivists to preserve the content of the Web blindly crawl and store Web pages, disregarding the CMS the site is based on and whatever structured content the pages contain. We first present an application-aware helper (AAH) that fits into an archiving crawl processing chain to perform intelligent and adaptive crawling of Web applications, given a knowledge base of common CMSs. The AAH has been integrated into two Web crawlers in the framework of the ARCOMEM project: the proprietary crawler of the Internet Memory Foundation and a customized version of Heritrix. We then propose ACEBot (Adaptive Crawler Bot for data Extraction), an efficient unsupervised, structure-driven crawler that exploits the inner structure of pages and guides the crawling process based on the importance of their content. ACEBot works in two phases: in the offline phase, it constructs a dynamic site map (limiting the number of URLs retrieved) and learns a traversal strategy based on the importance of navigation patterns (selecting those leading to valuable content); in the online phase, ACEBot performs massive downloading following the chosen navigation patterns. The AAH and ACEBot make 7 and 5 times fewer HTTP requests, respectively, than a generic crawler, without compromising effectiveness. We finally propose OWET (Open Web Extraction Toolkit), a free platform for semi-supervised data extraction. OWET allows a user to extract the data hidden behind Web forms.
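    The Python sketch below illustrates the structure-driven idea behind ACEBot's offline phase under simplifying assumptions (it is not the actual system): sampled URLs are grouped into navigation patterns by URL path template, each pattern is scored by how much textual content its pages carry, and only the highest-value patterns are kept for the bulk online download.

    import re
    from collections import defaultdict

    def path_template(url):
        # e.g. https://site/forum/topic/123 -> /forum/topic/<n>
        path = re.sub(r"^https?://[^/]+", "", url)
        return re.sub(r"\d+", "<n>", path)

    def learn_patterns(sampled_pages, keep=3):
        # sampled_pages: list of (url, extracted_text) from the offline crawl.
        word_counts = defaultdict(list)
        for url, text in sampled_pages:
            word_counts[path_template(url)].append(len(text.split()))
        ranked = sorted(word_counts,
                        key=lambda t: sum(word_counts[t]) / len(word_counts[t]),
                        reverse=True)
        return set(ranked[:keep])   # navigation patterns worth following online

    def should_fetch(url, good_patterns):
        # Online phase filter: download only URLs matching a learned pattern.
        return path_template(url) in good_patterns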