
    Enhance Crawler For Efficiently Harvesting Deep Web Interfaces

    The web is changing rapidly and the volume of web resources keeps growing, so efficiency has become a challenging problem when crawling such data. Hidden web content is data that cannot be indexed by search engines because it stays behind searchable web interfaces. The proposed system aims to develop a focused-crawler framework for efficiently gathering hidden web interfaces. First, the crawler performs site-based searching for center pages with the help of web search engines, which avoids visiting a large number of irrelevant pages. To obtain more specific results for a focused crawl, the proposed crawler ranks websites, giving higher priority to those more relevant to a given search. The crawler then achieves fast in-site searching by following the more relevant links with an adaptive link ranking. A spell checker is incorporated to correct the input query, and reverse searching with incremental site prioritizing is applied for wide coverage of hidden web sites.
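    A minimal sketch of the adaptive link-ranking idea described above, assuming a toy set of topic keywords and a simple anchor-text score (these, and all names below, are illustrative, not the authors' implementation): candidate in-site links are scored against the topic and visited in priority order, so more relevant links are crawled first.

```python
# Sketch of adaptive link ranking: score links by topic relevance of their anchor
# text and always expand the highest-scoring unvisited link first.
import heapq

TOPIC_KEYWORDS = {"search", "query", "database", "form"}  # hypothetical topic terms

def link_score(anchor_text: str) -> float:
    """Fraction of topic keywords that appear in the link's anchor text."""
    words = set(anchor_text.lower().split())
    return len(words & TOPIC_KEYWORDS) / max(len(TOPIC_KEYWORDS), 1)

class LinkFrontier:
    """Priority queue that always yields the most relevant unvisited link."""
    def __init__(self):
        self._heap = []      # (negative score, url) so higher scores pop first
        self._seen = set()

    def push(self, url: str, anchor_text: str) -> None:
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (-link_score(anchor_text), url))

    def pop(self):
        return heapq.heappop(self._heap)[1] if self._heap else None

frontier = LinkFrontier()
frontier.push("http://example.com/about", "about us")
frontier.push("http://example.com/search", "advanced search form")
print(frontier.pop())  # -> http://example.com/search (more topic-relevant)
```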

    Personalization of Search Engine by Using Cache based Approach

    As the deep web grows at a fast pace, there has been increased interest in techniques that help efficiently locate deep web interfaces. However, due to the large volume of web resources and the dynamic nature of the deep web, achieving wide coverage and high efficiency is a challenging problem. This project proposes a three-stage framework for efficiently harvesting deep web interfaces. In the first stage, the web crawler performs site-based searching for center pages with the help of search engines, avoiding visits to a large number of pages. To achieve more accurate results for a focused crawl, the crawler ranks websites to prioritize highly relevant ones for a given topic. In the second stage, the proposed system fetches the pages within the application with the help of the Jsoup API and preprocesses them; it then performs word counts on the queried web pages. In the third stage, the system performs frequency analysis based on TF and IDF, and uses the combined TF*IDF weight for ranking web pages. To eliminate the bias toward visiting only a few highly relevant links in hidden web directories, the project also proposes a link tree data structure to achieve wider coverage of a site. Experimental results on a set of representative domains show the agility and accuracy of the proposed crawler framework, which efficiently retrieves deep web interfaces from large-scale sites and achieves higher harvest rates than other crawlers using Naïve Bayes algorithms.
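    The abstract mentions fetching and preprocessing pages with the Jsoup API (a Java HTML parser); the sketch below covers only the third-stage TF*IDF scoring, written in Python with plain whitespace tokenization, as an illustration rather than the project's code.

```python
# Illustrative TF*IDF ranking: tokenize each page, combine term frequency with a
# smoothed inverse document frequency, and rank pages by the summed weight of
# the query terms.
import math
from collections import Counter

def tokenize(text: str):
    return text.lower().split()

def rank_pages(pages: dict, query: str):
    """pages: {url: page_text}; returns urls sorted by TF*IDF score for the query."""
    docs = {url: Counter(tokenize(text)) for url, text in pages.items()}
    n_docs = len(docs)
    scores = {}
    for url, counts in docs.items():
        total_terms = sum(counts.values()) or 1
        score = 0.0
        for term in tokenize(query):
            tf = counts[term] / total_terms
            df = sum(1 for c in docs.values() if term in c)
            idf = math.log((n_docs + 1) / (df + 1)) + 1  # smoothed IDF
            score += tf * idf
        scores[url] = score
    return sorted(scores, key=scores.get, reverse=True)

pages = {"a.html": "deep web search interface form",
         "b.html": "contact us and company news"}
print(rank_pages(pages, "search interface"))  # a.html ranks first
```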

    An Efficient Method for Deep Web Crawler based on Accuracy

    As the deep web grows at a very fast pace, there has been increased interest in techniques that help efficiently locate deep-web interfaces. However, due to the large volume of web resources and the dynamic nature of the deep web, achieving wide coverage and high efficiency is a challenging issue. We propose a three-stage framework for efficiently harvesting deep-web interfaces. Experimental results on a set of representative domains show the agility and accuracy of our proposed crawler framework, which efficiently retrieves deep-web interfaces from large-scale sites and achieves higher harvest rates than other crawlers using the Naïve Bayes algorithm. In this paper we also survey how web crawlers work and the methodologies available in existing systems from different researchers.
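    The framework is compared against crawlers that use the Naïve Bayes algorithm. As a rough illustration of how a Naïve Bayes relevance classifier for pages or links can work (the training examples, labels, and function names below are made up, not taken from the paper), here is a small from-scratch sketch.

```python
# Illustrative multinomial Naive Bayes: word counts from labelled example pages
# estimate P(word | class), and a new page gets the class with the higher
# log-posterior, using Laplace smoothing for unseen words.
import math
from collections import Counter, defaultdict

def train(samples):
    """samples: list of (text, label); returns class counts and per-class word counts."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in samples:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return class_counts, word_counts

def classify(text, class_counts, word_counts):
    vocab = {w for counts in word_counts.values() for w in counts}
    best, best_lp = None, float("-inf")
    for label in class_counts:
        lp = math.log(class_counts[label] / sum(class_counts.values()))
        total = sum(word_counts[label].values())
        for w in text.lower().split():
            # Laplace smoothing so unseen words do not zero out the probability
            lp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

samples = [("searchable database query form", "relevant"),
           ("photo gallery and news archive", "irrelevant")]
priors, likelihoods = train(samples)
print(classify("advanced query form", priors, likelihoods))  # -> relevant
```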

    A Framework to Evaluate Information Quality in Public Administration Website

    This paper presents a framework aimed at assessing the capacity of Public Administration bodies (PA) to offer a good quality of information and service on their web portals. Our framework is based on the extraction of “.it” domain names registered by Italian public institutions and the subsequent analysis of their relative websites. The analysis foresees an automatic gathering of the web pages of PA portals by means of web crawling and an assessment of the quality of their online information services. This assessment is carried out by verifying their compliance with current legislation on the basis of the criteria established in government guidelines [1]. This approach provides an ongoing monitoring process of the PA websites that can contribute to the improvement of their overall quality. Moreover, our approach can also hopefully be of benefit to local governments in other countries. Available at: https://aisel.aisnet.org/pajais/vol5/iss3/3
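    The abstract does not list the concrete guideline criteria, so the sketch below uses placeholder checks (the section names and regular expressions are assumptions); it only illustrates the general shape of the automated compliance step, i.e. scanning crawled HTML for required sections.

```python
# Hypothetical compliance check: for each crawled page, report which (assumed,
# illustrative) required sections can be detected in the page text. The criteria
# below are placeholders, not the actual government checklist.
import re

REQUIRED_SECTIONS = {                      # illustrative criteria only
    "transparency": r"amministrazione\s+trasparente",
    "privacy_policy": r"privacy",
    "contacts": r"contatti",
}

def check_compliance(html: str) -> dict:
    """Return which required sections are detectable in the page text."""
    text = html.lower()
    return {name: bool(re.search(pattern, text))
            for name, pattern in REQUIRED_SECTIONS.items()}

sample_page = "<a href='/trasparenza'>Amministrazione Trasparente</a>"
print(check_compliance(sample_page))
# {'transparency': True, 'privacy_policy': False, 'contacts': False}
```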

    A novel defense mechanism against web crawler intrusion

    Web robots, also known as crawlers or spiders, are used by search engines, hackers, and spammers to gather information about web pages. Timely detection and prevention of unwanted crawlers increases the privacy and security of websites. In this research, a novel method to identify web crawlers is proposed in order to prevent unwanted crawlers from accessing websites. The proposed method suggests a five-factor identification process to detect unwanted crawlers. This study provides pretest and posttest results along with a systematic evaluation of web pages with the proposed identification technique versus web pages without it. An experiment was performed with repeated measures for two groups, each containing ninety web pages. The results of the logistic regression analysis of the treatment and control groups confirm the novel five-factor identification process as an effective mechanism to prevent unwanted web crawlers. The study concludes that the proposed five-factor identification process is a very effective technique, as demonstrated by the successful outcome.
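    The five identification factors are not named in the abstract, so the sketch below uses hypothetical factors and unfitted weights; it only illustrates how a factor-based logistic model can score a visitor as a likely crawler, not the study's actual model.

```python
# Sketch of a factor-based crawler check with a logistic model. The five feature
# names and the coefficients are hypothetical placeholders, since the abstract
# does not list the actual identifiers or fitted values used in the study.
import math

WEIGHTS = {                       # illustrative coefficients, not fitted values
    "ignores_robots_txt": 2.0,
    "no_javascript_execution": 1.5,
    "abnormal_request_rate": 2.5,
    "missing_referer_header": 1.0,
    "known_bot_user_agent": 3.0,
}
BIAS = -4.0

def crawler_probability(features: dict) -> float:
    """Logistic model: P(visitor is an unwanted crawler | observed factors)."""
    z = BIAS + sum(WEIGHTS[name] * int(value) for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

visitor = {"ignores_robots_txt": True, "no_javascript_execution": True,
           "abnormal_request_rate": False, "missing_referer_header": True,
           "known_bot_user_agent": False}
print(round(crawler_probability(visitor), 3))  # ~0.62 for this made-up visitor
```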

    NLP-Based Techniques for Cyber Threat Intelligence

    In the digital era, threat actors employ sophisticated techniques for which digital traces, often in the form of textual data, are available. Cyber Threat Intelligence (CTI) covers the solutions for data collection, processing, and analysis that help understand a threat actor's targets and attack behavior. CTI is assuming an increasingly crucial role in identifying and mitigating threats and enabling proactive defense strategies. In this context, NLP, a branch of artificial intelligence, has emerged as a powerful tool for enhancing threat intelligence capabilities. This survey paper provides a comprehensive overview of NLP-based techniques applied in the context of threat intelligence. It begins by describing the foundational definitions and principles of CTI as a major tool for safeguarding digital assets. It then undertakes a thorough examination of NLP-based techniques for CTI data crawling from Web sources, CTI data analysis, relation extraction from cybersecurity data, CTI sharing and collaboration, and security threats to CTI. Finally, the challenges and limitations of NLP in threat intelligence are exhaustively examined, including data quality issues and ethical considerations. The survey draws a complete framework and serves as a valuable resource for security professionals and researchers seeking to understand state-of-the-art NLP-based threat intelligence techniques and their potential impact on cybersecurity.
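    As a small, rule-based illustration of one CTI extraction step discussed in the survey (the surveyed approaches typically rely on trained NER and relation-extraction models rather than regular expressions), the following sketch pulls CVE identifiers and IPv4 addresses out of threat-report text.

```python
# Minimal rule-based indicator extraction from a threat report: regexes stand in
# for the NLP models a real CTI pipeline would use, and only show the kind of
# structured output such systems produce.
import re

CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,7}")
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def extract_indicators(report: str) -> dict:
    return {"cves": CVE_PATTERN.findall(report),
            "ips": IPV4_PATTERN.findall(report)}

report = "The actor exploited CVE-2021-44228 and beaconed to 203.0.113.7."
print(extract_indicators(report))
# {'cves': ['CVE-2021-44228'], 'ips': ['203.0.113.7']}
```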

    Web competitive intelligence methodology

    Master’s Degree Dissertation. The present dissertation covers academic concerns about disruptive change that causes value displacement in today’s competitive economic environment. To enhance their survival capabilities, organizations are increasing efforts in less traditional business value assets such as intellectual capital and competitive intelligence. Dynamic capabilities, a recent strategy theory, states that companies have to develop adaptive capabilities to survive disruptive change and to increase competitive advantage during incremental change phases. Taking advantage of the large amount of information on the World Wide Web, a methodology is proposed for developing applications that gather, filter and analyze web data and turn it into usable intelligence (WeCIM). In order to enhance the quality of information search and management, the use of ontologies that allow computers to “understand” particular knowledge domains is proposed. Two case studies were conducted with satisfactory results, and two software prototypes were developed according to the proposed methodology. It is suggested that an even bigger step can be made: not only was the success of the methodology demonstrated, but common software architecture elements are also present, which suggests that a solid base can be designed for different field applications based on web competitive intelligence tools.
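    WeCIM's internals are not described in the abstract, so the sketch below is only a toy illustration of the ontology-driven filtering idea: a made-up concept-to-terms map stands in for a real domain ontology and is used to keep only the gathered documents that mention a concept of interest.

```python
# Toy ontology-driven filter (illustrative, not the WeCIM implementation): keep
# only documents that mention at least one concept from the domain "ontology".
DOMAIN_ONTOLOGY = {                 # purely illustrative concept -> terms map
    "competitor": {"rival", "competitor", "market share"},
    "product": {"launch", "release", "prototype"},
}

def matched_concepts(text: str) -> set:
    text = text.lower()
    return {concept for concept, terms in DOMAIN_ONTOLOGY.items()
            if any(term in text for term in terms)}

documents = ["Rival firm announces product launch next quarter.",
             "Local weather report for the weekend."]
relevant = [doc for doc in documents if matched_concepts(doc)]
print(relevant)  # keeps only the first document
```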

    ACUTE WEB SPIDER: AN INTENSIFIED APPROACH FOR PROFOUND ACCUMULATION

    Using WebSpider, we determine the topical relevance of a site based on the contents of its homepage. Whenever a new site is encountered, its homepage content is extracted and parsed by removing stop words and applying stemming. As the deep web grows at a very fast pace, there has been increased interest in techniques that help efficiently locate deep-web interfaces. However, because of the large volume of web sources and the dynamic nature of the deep web, achieving wide coverage and high efficiency is a challenging issue. We propose a two-stage framework, namely WebSpider, for efficiently harvesting deep-web interfaces. In the first stage, WebSpider performs site-based searching for center pages with the help of search engines, avoiding visits to a large number of pages. To achieve better results for a focused crawl, WebSpider ranks websites to prioritize highly relevant ones for a given topic. In the second stage, WebSpider accomplishes fast in-site searching by digging out the most relevant links with an adaptive link ranking. To eliminate the bias toward visiting only a few highly relevant links in hidden websites, we design a link tree data structure to achieve wider coverage of a website. Focused crawlers such as the Form-Focused Crawler (FFC) and the Adaptive Crawler for Hidden-web Entries (ACHE) can automatically search online databases for a specific topic. FFC is designed with link, page, and form classifiers for focused crawling of web forms, and is extended by ACHE with additional components for form filtering and adaptive link learning.
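    The link tree data structure is not specified in detail in the abstract. A plausible illustrative reconstruction, assuming the goal is to spread visits across a site's directory branches rather than drilling into one of them, could look like the following (not the paper's exact structure).

```python
# Sketch of a link tree: URLs are grouped by their path segments, and the next
# URL is taken from the least-visited branch, which spreads coverage across the
# site's directories instead of repeatedly following one highly relevant branch.
from urllib.parse import urlparse
from collections import defaultdict

class LinkTree:
    def __init__(self):
        self.children = defaultdict(LinkTree)
        self.urls = []
        self.visits = 0

    def add(self, url: str) -> None:
        node = self
        for segment in urlparse(url).path.strip("/").split("/")[:-1]:
            node = node.children[segment]
        node.urls.append(url)

    def next_url(self):
        """Descend into the least-visited branch and return one of its URLs."""
        node = self
        node.visits += 1
        while node.children and not node.urls:
            node = min(node.children.values(), key=lambda c: c.visits)
            node.visits += 1
        return node.urls.pop() if node.urls else None

tree = LinkTree()
for u in ["http://s.com/db/search.php", "http://s.com/db/list.php",
          "http://s.com/forum/post.php"]:
    tree.add(u)
print(tree.next_url())  # a URL from the /db/ branch; the next call balances to /forum/
```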