
    A Novel Web Scraping Approach Using the Additional Information Obtained from Web Pages

    Web scraping is the process of extracting valuable and interesting text information from web pages. Most current studies targeting this task focus on automated web data extraction: they first build a DOM tree and then access the necessary data through it. Constructing this tree adds a time cost that depends on the structure of the document, and the web scraping literature largely ignores time efficiency. This study proposes a novel approach, UzunExt, which extracts content quickly using string methods and additional information, without building a DOM tree. The string methods consist of three consecutive steps: searching for a given pattern, calculating the number of closing HTML elements for that pattern, and extracting the content for the pattern. During crawling, the approach collects additional information, including the starting position (to speed up searching), the number of inner tags (to improve extraction), and tag repetition (to terminate extraction early). The string methods of this approach are about 60 times faster than extraction with the DOM-based method. Moreover, using this additional information improves extraction time by a factor of 2.35 compared to using the string methods alone. Furthermore, the approach can easily be adapted to other DOM-based studies/parsers for this task to enhance their time efficiency. © 2013 IEEE
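    The string-based extraction step can be illustrated with a short sketch. The code below is a minimal, hypothetical rendition of the three steps the abstract names (search for an opening-tag pattern, balance nested tags of the same element, extract the enclosed content); the function name and the simplified tag handling are our own assumptions, not the UzunExt implementation.

```python
# Minimal sketch of string-based content extraction without a DOM tree.
# Assumption: the pattern is an opening tag such as '<div class="content">'
# in well-formed HTML; this is NOT the UzunExt code, only an illustration.

def extract_by_string(html: str, pattern: str) -> str:
    start = html.find(pattern)                    # step 1: search for the pattern
    if start == -1:
        return ""
    element = pattern[1:].split()[0].rstrip(">")  # e.g. 'div' (simplified parsing)
    open_tag, close_tag = f"<{element}", f"</{element}>"
    pos = start + len(pattern)
    depth = 1                                     # step 2: balance nested tags
    while depth and pos < len(html):
        next_open = html.find(open_tag, pos)
        next_close = html.find(close_tag, pos)
        if next_close == -1:
            break
        if next_open != -1 and next_open < next_close:
            depth += 1
            pos = next_open + len(open_tag)
        else:
            depth -= 1
            pos = next_close + len(close_tag)
    return html[start:pos]                        # step 3: extract the content

html = '<div class="content"><div>inner</div>text</div><div>other</div>'
print(extract_by_string(html, '<div class="content">'))
```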

    Prediction of new outlinks for focused Web crawling

    Discovering new hyperlinks enables Web crawlers to find new pages that have not yet been indexed. This is especially important for focused crawlers because they strive to provide a comprehensive analysis of specific parts of the Web, thus prioritizing the discovery of new pages over the discovery of changes in content. In the literature, changes in hyperlinks and content have usually been considered simultaneously. However, there is also evidence suggesting that these two types of changes are not necessarily related. Moreover, many studies about predicting changes assume that a long history of a page is available, which is unattainable in practice. The aim of this work is to provide a methodology for detecting new links effectively using a short history. To this end, we use a dataset of ten crawls at intervals of one week. Our study consists of three parts. First, we obtain insight into the data by analyzing empirical properties of the number of new outlinks. We observe that these properties are, on average, stable over time, but there is a large difference between the emergence of hyperlinks to pages within and outside the domain of a target page (internal and external outlinks, respectively). Next, we provide statistical models for three targets: the link change rate, the presence of new links, and the number of new links. These models include the features used earlier in the literature, as well as new features introduced in this work. We analyze the correlation between the features and investigate their informativeness. A notable finding is that, if the history of the target page is not available, our new features, which represent the history of related pages, are the most predictive of new links on the target page. Finally, we propose ranking methods as guidelines for focused crawlers to efficiently discover new pages; these methods achieve excellent performance with respect to the corresponding targets.
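    To make the ranking idea concrete, the sketch below scores frontier pages by the predicted probability that they gained new outlinks, using a simple logistic model. The feature names are hypothetical stand-ins for the kinds of features the abstract mentions (short page history, history of related pages); this is not the authors' model.

```python
# Minimal sketch: prioritizing a crawl frontier by predicted presence of
# new outlinks. Features and training data are illustrative placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [new links in last crawl, avg new links of related pages,
#            share of internal outlinks] -- all hypothetical features.
X_train = np.array([[0, 0.2, 0.5],
                    [3, 1.5, 0.8],
                    [1, 0.9, 0.3],
                    [5, 2.0, 0.9]])
y_train = np.array([0, 1, 0, 1])  # did the page gain new links next week?

model = LogisticRegression().fit(X_train, y_train)

frontier = {"page_a": [0, 0.1, 0.4], "page_b": [4, 1.8, 0.7]}
scores = {url: model.predict_proba([feats])[0, 1]
          for url, feats in frontier.items()}

# Crawl the most promising pages first.
for url in sorted(scores, key=scores.get, reverse=True):
    print(url, round(scores[url], 3))
```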

    Integrated system based on the automation of the extraction and characterization, using natural language processing, of user ratings in sales platforms

    [Abstract] At present, more and more importance is given to what people think and what their preferences are. With the rise of social networks and online stores, these data are more accessible than ever, in a simple and global way, so it is increasingly important to know how to manage and analyze all this information. It is thus not surprising that over time more companies are focusing their interest on sentiment mining, which allows them to identify possible business opportunities, maintain a good reputation on social networks, or improve the marketing of their products. For all these reasons, we are interested in building a tool capable of extracting, from all the reviews that exist across the Internet, the prevailing opinion of a product/service (its most praised characteristics, its greatest shortcomings…), so that, through good analysis, we can obtain important information in an automated way. This paper discusses how to build a tool capable of retrieving and analyzing reviews from the World Wide Web, by building two main systems: one for extracting reviews with scraping tools, and another for analyzing sentiment in texts with machine learning and natural language processing. Final degree project. Computer Engineering. Academic year 2021/202
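    A minimal sketch of the two-part pipeline the abstract describes follows: a scraper that pulls review texts from a page, and a sentiment step that scores them. The URL and CSS selector are hypothetical, and NLTK's VADER analyzer stands in for the machine-learning component actually built in the project.

```python
# Sketch only: scrape reviews, then score their sentiment.
import requests
from bs4 import BeautifulSoup
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)

def scrape_reviews(url: str, selector: str = "div.review-text") -> list[str]:
    """Extract review texts from a product page (selector is an assumption)."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return [node.get_text(strip=True) for node in soup.select(selector)]

def score_reviews(reviews: list[str]) -> list[tuple[str, float]]:
    """Attach a compound sentiment score in [-1, 1] to each review."""
    sia = SentimentIntensityAnalyzer()
    return [(text, sia.polarity_scores(text)["compound"]) for text in reviews]

if __name__ == "__main__":
    reviews = scrape_reviews("https://example.com/product/123")  # hypothetical URL
    for text, score in score_reviews(reviews):
        print(f"{score:+.2f}  {text[:60]}")
```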

    Integração automatizada de informação de horários de transportes (Automated integration of transport schedule information)

    The ever-growing Web contains a vast amount of data, which is useful when combined with applications that can refine it and use it to improve users' lives. However, using the available data is not an easy task, since most of the information is not represented in machine-friendly formats. Instead, it is represented in formats ideal for human readers, so additional effort is required for machines to interpret, extract, and integrate it, while at the same time ensuring the consistency of information from different sources. In this project, a solution combining ontology-based integration with extraction by web robots automates the process of updating public transport schedule information. An already existing application receives that information and uses it to calculate efficient routes for commuters. The proposed solution can extract information from multiple online sources and transform it into different formats. It can extract and transform information from both PDFs and HTML. The system provides a web service for the exportation of these formats to a route optimization system. This document contains the detailed process of the design and construction of the integration system. It describes the alternatives and selections that led to the application created. Lastly, it evaluates the solution by performing extraction from several sources relevant to the project's domain.
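    The extraction step can be illustrated by turning a human-oriented HTML timetable into machine-friendly records. The table layout below (route, stop, departure columns) is a hypothetical example, not a format from the project.

```python
# Sketch: convert an HTML timetable into structured records.
from bs4 import BeautifulSoup

HTML = """
<table id="timetable">
  <tr><th>Route</th><th>Stop</th><th>Departure</th></tr>
  <tr><td>205</td><td>Campus</td><td>08:15</td></tr>
  <tr><td>205</td><td>Downtown</td><td>08:40</td></tr>
</table>
"""

def extract_timetable(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.select("table#timetable tr")
    headers = [th.get_text(strip=True).lower() for th in rows[0].find_all("th")]
    return [dict(zip(headers, (td.get_text(strip=True) for td in row.find_all("td"))))
            for row in rows[1:]]

# Records like {'route': '205', 'stop': 'Campus', 'departure': '08:15'}
# could then be validated against the ontology and served to the route
# optimization system via a web service.
print(extract_timetable(HTML))
```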

    An integrating text retrieval framework for Digital Ecosystems Paradigm

    The purpose of the research is to provide effective information retrieval services for digital "organisms" in a digital ecosystem by leveraging the power of Web searching technology. A novel integrating digital ecosystem search framework (a new digital organism) is proposed, which employs Web search technology and traditional database searching techniques to provide economic organisms with comprehensive, dynamic, and organization-oriented information retrieval ranging from the Internet to the personal (semantic) desktop.

    Topical web-page classification with similarity neural networks

    The purpose of this project is to solve the problem of website topic categorization: being able to discern the topic of the content a website offers, given a discrete set of categories. To achieve this goal, we use Neural Networks and Word Embeddings together with a Crawler.
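    One way to combine embeddings with a similarity criterion is sketched below: a page is represented by the average of its word embeddings and assigned to the category whose prototype vector is most similar. The toy 3-dimensional embeddings and category prototypes are placeholders; the project itself uses trained embeddings and a neural similarity model.

```python
# Sketch: similarity-based topic classification with averaged embeddings.
import numpy as np

EMB = {  # toy word embeddings (real ones would be e.g. 300-d)
    "goal":   np.array([0.9, 0.1, 0.0]),
    "match":  np.array([0.8, 0.2, 0.1]),
    "stock":  np.array([0.1, 0.9, 0.2]),
    "market": np.array([0.0, 0.8, 0.3]),
}
CATEGORIES = {  # toy category prototype vectors
    "sports":  np.array([0.85, 0.15, 0.05]),
    "finance": np.array([0.05, 0.85, 0.25]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify(tokens: list[str]) -> str:
    vecs = [EMB[t] for t in tokens if t in EMB]
    page_vec = np.mean(vecs, axis=0)  # average embedding of the page
    return max(CATEGORIES, key=lambda c: cosine(page_vec, CATEGORIES[c]))

print(classify(["goal", "match"]))    # -> sports
print(classify(["stock", "market"]))  # -> finance
```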

    Let's Discover More API Relations: A Large Language Model-based AI Chain for Unsupervised API Relation Inference

    APIs have intricate relations that can be described in text and represented as knowledge graphs to aid software engineering tasks. Existing relation extraction methods have limitations, such as a limited API text corpus, and are affected by the characteristics of the input text. To address these limitations, we propose utilizing large language models (LLMs) (e.g., GPT-3.5) as a neural knowledge base for API relation inference. This approach leverages the entire Web used to pre-train LLMs as a knowledge base and is insensitive to the context and complexity of input texts. To ensure accurate inference, we design our analytic flow as an AI chain with three AI modules: API FQN Parser, API Knowledge Extractor, and API Relation Decider. The accuracies of the API FQN Parser and API Relation Decider modules are 0.81 and 0.83, respectively. Using the generative capacity of the LLM and our approach's inference capability, we achieve an average F1 of 0.76 across the three datasets, significantly higher than the state-of-the-art method's average F1 of 0.40. Compared to a CoT-based method, our AI chain design improves inference reliability by 67%, and the AI-crowd-intelligence strategy enhances the robustness of our approach by 26%.
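    The three-module chain can be sketched as a sequence of LLM calls. In the sketch below, `call_llm` is a placeholder for any chat-completion client, and the prompts are illustrative stand-ins, not the prompts used in the paper.

```python
# Minimal sketch of a three-module AI chain for API relation inference.
def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM (e.g., GPT-3.5) and return its text."""
    raise NotImplementedError("wire up your LLM client here")

def parse_fqn(api_mention: str) -> str:
    # Module 1: API FQN Parser -- resolve a mention to a fully qualified name.
    return call_llm(f"Resolve this API mention to its fully qualified name: {api_mention}")

def extract_knowledge(fqn: str) -> str:
    # Module 2: API Knowledge Extractor -- elicit what the LLM knows about the API.
    return call_llm(f"Describe the purpose and behavior of the API {fqn}.")

def decide_relation(fqn_a: str, fqn_b: str, knowledge: str) -> str:
    # Module 3: API Relation Decider -- infer the relation between two APIs.
    return call_llm(
        f"Given this knowledge:\n{knowledge}\n"
        f"What is the relation between {fqn_a} and {fqn_b}?"
    )

def infer_relation(mention_a: str, mention_b: str) -> str:
    fqn_a, fqn_b = parse_fqn(mention_a), parse_fqn(mention_b)
    knowledge = extract_knowledge(fqn_a) + "\n" + extract_knowledge(fqn_b)
    return decide_relation(fqn_a, fqn_b, knowledge)
```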

    Scalable and Declarative Information Extraction in a Parallel Data Analytics System

    Information extraction (IE) on very large data sets requires highly complex, scalable, and adaptive systems. Although numerous IE algorithms exist, their seamless and extensible combination in a scalable system is still a major challenge. This work presents a query-based IE system for a parallel data analysis platform that is configurable for specific application domains and scales to terabyte-sized text collections. First, configurable operators are defined for basic IE and Web analytics tasks, which can be used to express complex IE tasks in the form of declarative queries. All operators are characterized in terms of their properties to highlight the potential and importance of optimizing non-relational, user-defined operators (UDFs) in dataflows. Subsequently, we survey the state of the art in optimizing non-relational dataflows and show that a comprehensive optimization of UDFs is still a challenge. Based on this observation, an extensible, logical optimizer (SOFA) is introduced, which incorporates the semantics of UDFs into the optimization process. SOFA analyzes a compact set of operator properties and combines automated analysis with manual UDF annotations to enable comprehensive optimization of dataflows. SOFA is able to logically optimize arbitrary dataflows from different application areas, resulting in significant runtime improvements compared to other techniques. Finally, the applicability of the presented system to terabyte-sized corpora is investigated: we systematically evaluate the scalability and robustness of the employed methods and tools in order to pinpoint the most critical challenges in building an IE system for very large data sets.
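    The core idea of property-driven logical optimization can be sketched in a few lines: operators carry properties (here, selectivity and reorderability), and an optimizer may reorder them when semantics permit, for instance pushing a cheap, selective filter ahead of an expensive extraction UDF. The operator names and properties below are illustrative, not the system's actual API.

```python
# Sketch: a declarative dataflow of UDF operators with a toy optimizer rule.
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Operator:
    name: str
    fn: Callable[[Iterable], Iterable]
    selective: bool      # does it shrink the input?
    reorderable: bool    # may the optimizer move it past other operators?

def run(plan: list, data: Iterable) -> list:
    for op in plan:
        data = op.fn(data)
    return list(data)

def optimize(plan: list) -> list:
    # Toy rule: move reorderable, selective operators to the front.
    front = [op for op in plan if op.selective and op.reorderable]
    rest = [op for op in plan if not (op.selective and op.reorderable)]
    return front + rest

filter_lang = Operator("filter_english",
                       lambda docs: (d for d in docs if d["lang"] == "en"),
                       selective=True, reorderable=True)
extract = Operator("extract_entities",
                   lambda docs: ({**d, "ents": d["text"].split()} for d in docs),
                   selective=False, reorderable=False)

docs = [{"lang": "en", "text": "Berlin is big"},
        {"lang": "de", "text": "Berlin ist groß"}]
plan = optimize([extract, filter_lang])  # filter now runs before extraction
print([op.name for op in plan], run(plan, docs))
```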