Performance and Comparative Analysis of the Two Contrary Approaches for Detecting Near Duplicate Web Documents in Web Crawling
Recent years have witnessed the rapid growth of the World Wide Web (WWW). Information is accessible at one's fingertips, anytime and anywhere, through this massive web repository. The performance and reliability of web search engines therefore face serious problems due to the enormous amount of web data. The sheer volume of web documents leads to search results that are less relevant to the user. In addition, the presence of duplicate and near-duplicate web documents creates additional overhead for search engines and critically affects their performance. The demand for integrating data from heterogeneous sources leads to the problem of near-duplicate web pages, and the detection of near-duplicate documents within a collection has recently become an area of great interest. In this research, we present an efficient approach for detecting near-duplicate web pages in web crawling, which uses keywords and a distance measure. G.S. Manku et al.'s fingerprint-based approach, proposed in 2007, is considered one of the "state-of-the-art" algorithms for finding near-duplicate web pages. We implemented both approaches and conducted an extensive comparative study between our similarity-score-based approach and G.S. Manku et al.'s fingerprint-based approach. We analyzed the results in terms of time complexity, space complexity, memory usage, and the confusion matrix parameters. Taking these performance factors into account, the comparison shows our approach to be the better (less complex) of the two. DOI: http://dx.doi.org/10.11591/ijece.v2i6.1746
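The abstract does not spell out either algorithm, so the following is only a minimal sketch of the general fingerprint idea (a simhash-style fingerprint compared by Hamming distance) that is usually associated with the fingerprint-based approach. The whitespace tokenization, the use of MD5 as the token hash, the unweighted tokens, the 64-bit width, and the threshold `k = 3` are all assumptions made for illustration, not details taken from this listing.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Return a simhash-style fingerprint of the whitespace tokens in text."""
    weights = [0] * bits
    for token in text.lower().split():
        # Hash each token to a stable 64-bit integer (MD5 is an illustrative choice).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(doc_a: str, doc_b: str, k: int = 3) -> bool:
    """Treat two documents as near duplicates if their fingerprints differ in at most k bits."""
    return hamming_distance(simhash(doc_a), simhash(doc_b)) <= k
```

On very short texts a single changed token can flip many bits, so a sketch like this is only meaningful for documents with a reasonable number of tokens.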
Extracting News Events from Microblogs
Twitter stream has become a large source of information for many people, but
the magnitude of tweets and the noisy nature of its content have made
harvesting the knowledge from Twitter a challenging task for researchers for a
long time. Aiming at overcoming some of the main challenges of extracting the
hidden information from tweet streams, this work proposes a new approach for
real-time detection of news events from the Twitter stream. We divide our
approach into three steps. The first step is to use a neural network or deep
learning to detect news-relevant tweets from the stream. The second step is to
apply a novel streaming data clustering algorithm to the detected news tweets
to form news events. The third and final step is to rank the detected events
based on the size of the event clusters and growth speed of the tweet
frequencies. We evaluate the proposed system on a large, publicly available
corpus of annotated news events from Twitter. As part of the evaluation, we
compare our approach with a related state-of-the-art solution. Overall, our
experiments and user-based evaluation show that our approach to detecting
current (real) news events delivers state-of-the-art performance.
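The second and third steps described above can be sketched roughly as follows. This is not the authors' algorithm: the single-pass clustering by token Jaccard similarity, the `threshold` value, the `window` length, and the scoring formula that combines cluster size with recent tweet rate are all hypothetical choices made for illustration. The first step (the neural news classifier) is omitted, so the input is assumed to be tweets already judged news-relevant.

```python
from dataclasses import dataclass, field


@dataclass
class EventCluster:
    tokens: set = field(default_factory=set)       # vocabulary seen in the cluster
    timestamps: list = field(default_factory=list)  # arrival time of each tweet


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def add_tweet(clusters: list, text: str, timestamp: float, threshold: float = 0.3) -> EventCluster:
    """Assign a tweet to the most similar cluster, or open a new one."""
    tokens = set(text.lower().split())
    best = max(clusters, key=lambda c: jaccard(tokens, c.tokens), default=None)
    if best is None or jaccard(tokens, best.tokens) < threshold:
        best = EventCluster()
        clusters.append(best)
    best.tokens |= tokens
    best.timestamps.append(timestamp)
    return best


def rank_events(clusters: list, now: float, window: float = 60.0) -> list:
    """Order clusters by size, boosted by the tweet rate inside the recent window."""
    def score(cluster: EventCluster) -> float:
        size = len(cluster.timestamps)
        recent = sum(1 for t in cluster.timestamps if now - t <= window)
        growth = recent / window  # tweets per time unit in the window
        return size * (1.0 + growth)
    return sorted(clusters, key=score, reverse=True)
```

A caller would feed tweets to `add_tweet` as they arrive and call `rank_events` periodically to obtain the current ordering.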
A Survey to Fix the Threshold and Implementation for Detecting Duplicate Web Documents
The rapid growth of the information accessible on the World Wide Web has made automated tools for locating information resources of interest, and for tracking and analyzing them, a necessity. Web mining is the branch of data mining that deals with the analysis of the World Wide Web. Concepts from areas such as data mining, Internet technology and the World Wide Web, and more recently the Semantic Web, can be regarded as the origins of web mining. Web mining can be defined as the process of discovering hidden yet potentially useful knowledge from the data accessible on the web. Web mining comprises three sub-areas: web content mining, web structure mining, and web usage mining. Web content mining is the process of mining knowledge from web pages and other web objects. Web structure mining is the process of mining knowledge about the link structure connecting web pages and other web objects. Web usage mining is the process of mining the usage patterns created by users accessing web pages.
Search engine technology has driven the development of the World Wide Web. Search engines are the chief gateways for accessing information on the web. The ability to locate content of particular interest within a huge heap has made businesses more profitable and productive. Search engines respond to queries by relying on web crawling, which populates an indexed repository of web pages. The crawling programs construct a local repository of the portion of the web that they visit by navigating the web graph and retrieving pages.
There are two main types of crawling, namely generic and focused crawling. Generic crawlers crawl documents and links across diverse topics. Focused crawlers limit the number of pages with the aid of previously obtained specialized knowledge. Systems that index, mine, and otherwise analyze pages (such as search engines) take their input from the repositories of web pages built by web crawlers. The rapid growth of the Internet and the growing need to integrate heterogeneous data are accompanied by the problem of near-duplicate data. Near-duplicate data are not bitwise identical, yet they are remarkably similar. Duplicate and near-duplicate web pages increase index storage space, slow down serving, or increase serving costs, which annoys users and causes serious problems for web search engines. Hence it is essential to design algorithms to detect such pages.
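The abstract does not say how a threshold would be applied, so the following is only a hedged illustration of what "not bitwise identical, yet remarkably similar" can mean in practice. It uses word shingles and Jaccard resemblance, a common technique that is assumed here rather than taken from the paper. The shingle size `k` and the `threshold` value are placeholders.

```python
def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles (contiguous word windows) of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def classify(a: str, b: str, threshold: float = 0.9) -> str:
    """Label a pair as duplicate, near-duplicate, or distinct."""
    if a == b:
        return "duplicate"  # identical content
    return "near-duplicate" if resemblance(a, b) >= threshold else "distinct"
```

Raising `threshold` reduces false positives at the cost of missing looser near duplicates, which is the trade-off a threshold study has to settle.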
Duplicate Question Retrieval and Confirmation Time Prediction in Software Communities
Community Question Answering (CQA) in different domains is growing at a large
scale because of the availability of several platforms and huge shareable
information among users. With the rapid growth of such online platforms, a
massive amount of archived data makes it difficult for moderators to retrieve
possible duplicates for a new question and identify and confirm existing
question pairs as duplicates at the right time. This problem is even more
critical in CQAs corresponding to large software systems like askubuntu where
moderators need to be experts to comprehend something as a duplicate. Note that
the prime challenge in such CQA platforms is that the moderators are themselves
experts and are therefore usually extremely busy with their time being
extraordinarily expensive. To facilitate the task of the moderators, in this
work, we have tackled two significant issues for the askubuntu CQA platform:
(1) retrieval of duplicate questions given a new question and (2) duplicate
question confirmation time prediction. In the first task, we focus on
retrieving duplicate questions from a question pool for a particular newly
posted question. In the second task, we solve a regression problem to rank a
pair of questions that could potentially take a long time to get confirmed as
duplicates. For duplicate question retrieval, we propose a Siamese neural
network based approach by exploiting both text and network-based features,
which outperforms several state-of-the-art baseline techniques. Our method
outperforms DupPredictor and DUPE by 5% and 7% respectively. For duplicate
confirmation time prediction, we have used both the standard machine learning
models and neural network along with the text and graph-based features. We
obtain Spearman's rank correlation of 0.20 and 0.213 (statistically
significant) for text and graph based features respectively.
Comment: Full paper accepted at ASONAM 2023: The 2023 IEEE/ACM International
Conference on Advances in Social Networks Analysis and Mining
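For readers unfamiliar with the metric reported above, Spearman's rank correlation is the Pearson correlation computed on ranks. The sketch below shows how it could be calculated for predicted versus actual confirmation times. The function names and the toy numbers in the usage note are illustrative and do not come from the paper.

```python
def average_ranks(values: list) -> list:
    """Rank values from 1..n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def spearman(x: list, y: list) -> float:
    """Spearman's rho: Pearson correlation of the rank vectors of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)
```

For example, `spearman([5, 1, 3, 2, 4], [4, 2, 3, 1, 5])` evaluates to 0.8 (up to floating-point rounding), since the squared rank differences sum to 4 over five items.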
Linking Representations with Multimodal Contrastive Learning
Many applications require grouping instances contained in diverse document
datasets into classes. Most widely used methods do not employ deep learning and
do not exploit the inherently multimodal nature of documents. Notably, record
linkage is typically conceptualized as a string-matching problem. This study
develops CLIPPINGS (Contrastively Linking Pooled Pre-trained Embeddings), a
multimodal framework for record linkage. CLIPPINGS employs end-to-end training
of symmetric vision and language bi-encoders, aligned through contrastive
language-image pre-training, to learn a metric space where the pooled
image-text representation for a given instance is close to representations in
the same class and distant from representations in different classes. At
inference time, instances can be linked by retrieving their nearest neighbor
from an offline exemplar embedding index or by clustering their
representations. The study examines two challenging applications: constructing
comprehensive supply chains for mid-20th century Japan through linking firm
level financial records - with each firm name represented by its crop in the
document image and the corresponding OCR - and detecting which image-caption
pairs in a massive corpus of historical U.S. newspapers came from the same
underlying photo wire source. CLIPPINGS outperforms widely used string matching
methods by a wide margin and also outperforms unimodal methods. Moreover, a
purely self-supervised model trained on only image-OCR pairs also outperforms
popular string-matching methods without requiring any labels.
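The inference step described above (link an instance by retrieving its nearest neighbor in an exemplar index) might look roughly like the following sketch. It assumes the image and text embeddings have already been produced by some encoder and that pooling is a simple average of the normalized vectors. Both are assumptions for illustration, and the function names, the `min_similarity` cutoff, and the toy vectors in the usage note are hypothetical.

```python
import math


def l2_normalize(v: list) -> list:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def pool(image_emb: list, text_emb: list) -> list:
    """Average the normalized image and text embeddings, then renormalize."""
    img, txt = l2_normalize(image_emb), l2_normalize(text_emb)
    return l2_normalize([(a + b) / 2.0 for a, b in zip(img, txt)])


def cosine(a: list, b: list) -> float:
    """Dot product, which equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def link(query: list, index: dict, min_similarity: float = 0.0):
    """Return the label of the nearest exemplar, or None if none is similar enough."""
    best_label, best_sim = None, min_similarity
    for label, exemplar in index.items():
        sim = cosine(query, exemplar)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label
```

With a two-exemplar index built as `{"firm_a": pool([1, 0], [1, 0]), "firm_b": pool([0, 1], [0, 1])}`, a query pooled from `[0.9, 0.1]` and `[1.0, 0.2]` links to `firm_a`. Setting a high `min_similarity` lets the caller leave weak matches unlinked.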