65 research outputs found

    Web scraping technologies in an API world

    Web services are the de facto standard in biomedical data integration. However, there are data integration scenarios that cannot be fully covered by Web services. A number of Web databases and tools do not support Web services, and existing Web services do not cover all possible user data demands. As a consequence, Web data scraping, one of the oldest techniques for extracting Web contents, is still in a position to offer a valid and valuable service to a wide range of bioinformatics applications, ranging from simple extraction robots to online meta-servers. This article reviews existing scraping frameworks and tools, identifying their strengths and limitations in terms of extraction capabilities. The main focus is on showing how straightforward it is today to set up a data scraping pipeline with minimal programming effort and answer a number of practical needs. For exemplification purposes, we introduce a biomedical data extraction scenario where the desired data sources, well known in clinical microbiology and similar domains, do not yet offer programmatic interfaces. Moreover, we describe the operation of WhichGenes and PathJam, two bioinformatics meta-servers that use scraping as a means to support gene set enrichment analysis. This work was partially funded by (i) the [TIN2009-14057-C03-02] project from the Spanish Ministry of Science and Innovation, the Plan E from the Spanish Government and the European Union through the European Regional Development Fund (ERDF); (ii) the Portugal-Spain cooperation action sponsored by the Foundation of Portuguese Universities [E 48/11] and the Spanish Ministry of Science and Innovation [AIB2010PT-00353]; and (iii) the Agrupamento INBIOMED [2012/273] from the DXPCTSUG (Dirección Xeral de Promoción Científica e Tecnolóxica do Sistema Universitario de Galicia) of the Galician Government and the European Union through the ERDF ("unha maneira de facer Europa"). H. L. F. was supported by a pre-doctoral fellowship from the University of Vigo.
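The abstract above argues that a scraping pipeline now takes minimal programming effort. A minimal sketch with the widely used BeautifulSoup parser illustrates the idea; the page layout below is a hypothetical stand-in for a web database with no programmatic interface (in a real pipeline the HTML would be fetched over HTTP rather than embedded):

```python
from bs4 import BeautifulSoup

# Hypothetical results fragment from a web database that offers no API.
HTML = """
<table id="results">
  <tr><th>Gene</th><th>Score</th></tr>
  <tr><td>BRCA1</td><td>0.92</td></tr>
  <tr><td>TP53</td><td>0.87</td></tr>
</table>
"""

def extract_rows(html: str) -> list[tuple[str, float]]:
    """Parse the results table and return (gene, score) pairs."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find("table", id="results").find_all("tr")[1:]  # skip header row
    return [(tds[0].text, float(tds[1].text))
            for tds in (row.find_all("td") for row in rows)]

print(extract_rows(HTML))  # [('BRCA1', 0.92), ('TP53', 0.87)]
```

An extraction robot of the kind the review describes would wrap this parsing step in a fetch loop over result pages.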

    Development and Application of a Tool for Extracting and Storing Twitter Data in a Social Context of Political Violence

    Research assistantship. This project focused on building a web tool for extracting and storing data from the social network Twitter, which will in the future, with the support of external or integrated software, make it possible to run statistical analyses of these data tailored to the user's needs, and also to encourage the development of new solutions. Contents: Introduction; 1. General background; 2. Project objectives; 3. Frame of reference; 4. Conceptual framework; 5. Methodology; 6. Project development; 7. Results; Conclusions; Recommendations; Bibliography; Annexes; Glossary. Undergraduate thesis, Systems Engineering.
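The storage side of a tweet-collection tool like the one described can be sketched with Python's standard sqlite3 module; the table schema here is a hypothetical illustration, and the extraction step (Twitter's API or a scraper) would feed rows into it:

```python
import sqlite3

def init_db(conn: sqlite3.Connection) -> None:
    """Create the tweet store; schema is an illustrative assumption."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS tweets (
            id TEXT PRIMARY KEY,      -- tweet identifier
            author TEXT,
            created_at TEXT,          -- ISO-8601 timestamp
            text TEXT
        )""")

def store_tweets(conn, tweets):
    """Insert tweets, silently skipping ones already collected."""
    conn.executemany(
        "INSERT OR IGNORE INTO tweets VALUES (:id, :author, :created_at, :text)",
        tweets)
    conn.commit()

conn = sqlite3.connect(":memory:")
init_db(conn)
store_tweets(conn, [
    {"id": "1", "author": "a", "created_at": "2023-01-01T10:00:00Z", "text": "hola"},
    {"id": "1", "author": "a", "created_at": "2023-01-01T10:00:00Z", "text": "hola"},
])
print(conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0])  # 1
```

The `INSERT OR IGNORE` on the primary key makes repeated collection runs idempotent, which matters when the same tweets are scraped more than once.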

    The benefits of in silico modeling to identify possible small-molecule drugs and their off-target interactions

    Accepted for publication in a future issue of Future Medicinal Chemistry. Research into the use of small molecules as drugs continues to be a key driver in the development of molecular databases, computer-aided drug design software, and collaborative platforms. The evolution of computational approaches is driven by the essential criteria that a drug molecule has to fulfill, from affinity to its targets to minimal side effects, while having adequate absorption, distribution, metabolism, and excretion (ADME) properties. A combination of ligand- and structure-based drug development approaches is already used to obtain consensus predictions of small-molecule activities and their off-target interactions. Further integration of these methods into easy-to-use workflows informed by systems biology could realize the full potential of available data in drug discovery and reduce the attrition of drug candidates. Peer reviewed.

    Web Data Extraction in Audit Data Analytics: Technology Artifact Development from a Design Science Research Perspective

    The growing implementation of Information and Communication Technology (ICT) as part of organizations' internal control has pushed auditors to develop Audit Data Analytics (ADA) as a body of knowledge and practice for obtaining audit evidence and other information from collections of electronic data across all stages of audit work. At the same time, organizations increasingly present their data through web-based applications. Given the role of web pages as a data source (audit evidence), techniques for extracting data from them, known as web data extraction, have been developed. Using the design science research methodology, this study proposes artifacts in the form of a model and an instantiation of web data extraction for ADA implementation. The results are expected to serve as an additional reference in audit practice, in the form of an instantiated artifact for using web data extraction to acquire data as audit evidence from web pages, whether from intranet- or internet-based applications. The study also contributes a practical framework for implementing web data extraction as part of ADA in audit work. In addition, it is expected to serve as a reference for the use of the design science research methodology, which has so far seen little application in audit research in Indonesia.

    The Impact of the COVID-19 Pandemic on Faculty Productivity and Gender Inequalities in STEM Disciplines

    Women and minorities within STEM disciplines have historically encountered obstacles to academic advancement, a situation compounded by the COVID-19 pandemic through the imposition of additional responsibilities such as caregiving. This study probes the pandemic's influence on traditional academic productivity metrics – specifically publication and submission frequency, citation volume, and leadership in scholarly bodies – by employing Natural Language Processing to extract and analyze data from key journals across various scientific domains. A critical finding is a notable downturn in publication activity during 2021, potentially attributable to pandemic-induced disruptions, with a compensatory surge observed in 2022. Although a gradual trend toward gender parity in academic authorship was observed, the path to substantive equality faces future challenges, including policy shifts and societal factors. This investigation not only illuminates nuanced disparities in academic publishing but also aims to guide institutional strategies towards genuinely equitable promotion and tenure policies and practices, ensuring that the academic merit of all scholars, regardless of gender or minority status, is acknowledged and rewarded.

    Development of a Tool for Searching Metadata Associated with Transcripts

    When analyzing a transcriptome, the first step is computational pre-processing of the transcript reads generated by high-throughput sequencers, followed by obtaining the differential expression of the genes associated with control and treated samples. The quantified genes must then be characterized in order to obtain all possible information from the genome sequence, so that they can be functionally annotated. In this regard, developing a search tool for metadata associated with each read, customized for each experiment and/or species, brings major advantages. When processing the data, starting from the gene name or sequence one can obtain, for example, the protein description, GO terms, UniProt pathways, and summaries of functional categories. In this work, a differential expression table obtained from the bioinformatic processing of the biological collection PRJNA417324 was used as a starting point, and by means of web scraping with the Beautiful Soup library, metadata associated with each transcript was retrieved from the UniProt database. Sociedad Argentina de Informática e Investigación Operativa.
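Once a UniProt record has been retrieved per transcript, the metadata table is built by flattening a few nested fields. A sketch of that flattening step follows; the JSON layout is a simplified stand-in for a UniProt entry, not the service's exact schema:

```python
import json

# Simplified stand-in for a UniProt entry (not the real schema); in
# practice one record would be retrieved per transcript accession.
RECORD = json.loads("""
{
  "accession": "P12345",
  "protein": {"name": "Aspartate aminotransferase"},
  "go_terms": ["GO:0005739", "GO:0008483"]
}
""")

def transcript_metadata(record: dict) -> dict:
    """Flatten the fields the annotation table needs for one transcript."""
    return {
        "accession": record["accession"],
        "protein_name": record["protein"]["name"],
        "go_terms": ";".join(record["go_terms"]),
    }

print(transcript_metadata(RECORD))
```

Each flattened row can then be joined onto the differential expression table by gene or accession.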

    Computing Infrastructures Based on Web-Scraping Techniques: The Xerka Online and Minerva Applications

    Computational systems that facilitate users' tasks are widely used, both for professional and personal purposes. However, in some cases there is a large gap between users' needs and what the computer system offers. This article describes two computing infrastructures built with web-scraping techniques that improve the functionality of the original infrastructures. Xerka Online eases the creation and maintenance of a researcher's curriculum vitae (CV) by automating its main task: finding the researcher's publications and attaching up-to-date quality indicators (impact factor and citation counts) to them. Minerva manages the quality-assessment reports produced in the Faculty of Engineering of Vitoria-Gasteiz. To that end, it automatically downloads the closed grade records from GAUR (a web application of the University of the Basque Country, UPV/EHU), calculates statistics on the grades awarded, and merges the reports generated at the different levels of the quality-assessment process. The main advantages of both applications are the reduction of the time these tasks require and the avoidance of human error.

    Strategies to access web-enabled urban spatial data for socioeconomic research using R functions

    This version of the article has been accepted for publication, after peer review (when applicable) and is subject to Springer Nature's AM terms of use, but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections. The Version of Record is available online at: http://dx.doi.org/10.1007/s10109-019-00309-y. Since the introduction of the World Wide Web in the 1990s, the information available for research purposes has increased exponentially, leading to a significant proliferation of research based on web-enabled data. Nowadays the use of internet-enabled databases, obtained either from primary online surveys or from secondary official and non-official registers, is common. However, the availability of information varies by data category and country, and in particular the collection of microdata at a low geographical level for urban analysis can be a challenge. The most common difficulties when working with secondary web-enabled data fall into two categories: accessibility and availability problems. Accessibility problems arise when the way data are published on the servers blocks or delays the download process, which becomes a tedious, repetitive task prone to errors in the construction of large databases. Availability problems usually arise when official agencies restrict access to the information for statistical confidentiality reasons. To overcome some of these problems, this paper presents different strategies based on URL parsing, PDF text extraction, and web scraping. A set of functions, available under a GPL-2 license, was built into an R package to extract and organize databases at the municipality level (NUTS 5) in Spain for population, unemployment, vehicle fleet, and firm characteristics. This work was supported by the Spanish Ministry of Economics and Competitiveness (ECO2015-65758-P) and the Regional Government of Extremadura (Spain).
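The paper's functions are written in R, but the URL-parsing strategy it describes, generating one well-formed download URL per municipality instead of clicking through a server interface, can be sketched in a few lines of Python. The endpoint and parameter names below are hypothetical:

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the real package targets Spanish official
# registers at the municipality (NUTS 5) level.
BASE = "https://stats.example.org/api/municipality"

def build_urls(codes, year):
    """One download URL per municipality code, replacing manual reiteration."""
    return [f"{BASE}?{urlencode({'code': c, 'year': year})}" for c in codes]

urls = build_urls(["06015", "10037"], 2018)
print(urls[0])  # https://stats.example.org/api/municipality?code=06015&year=2018
```

A batch downloader would iterate over such a list with a polite delay between requests, which addresses the accessibility problems (blocked or delayed downloads) the abstract mentions.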

    Core characteristics of Muslim-friendly accommodation service quality in Norway: international visitor’s opinions

    Purpose: Competition among Muslim-friendly accommodation services has increased sharply worldwide. This study aims to identify the core characteristics of Muslim-friendly accommodation service quality and what shapes the level of service quality, based on the opinions of international visitors to Norway. Design/methodology/approach: The study uses a mixed-method analysis approach to examine 500 reviews with Leximancer software. The data were gathered from tripadvisor.com using apify.com, an online web data scraping tool. Findings: The qualitative analysis yielded seven themes: accommodation, room, food, staff, location, cleanliness, and facilities. The study further contributes to understanding which attributes conceptualize a Muslim-friendly accommodation in Norway and how international visitors' perceived service quality relates to themes linked to customer satisfaction or dissatisfaction. Originality/value: The study provides valuable insights into Muslim-friendly accommodations in Norway from the perspective of international visitors, based on user-generated content, and identifies dominant themes linked to the values underlying ratings. Keywords: Muslim-friendly accommodations, online content analysis, satisfaction, Leximancer.

    Web Scraping and Review Analytics. Extracting Insights from Commercial Data

    Web scraping has numerous applications. It can be used in combination with APIs to extract useful data from web pages. For instance, commercial data is abundant but not always usable in the form presented on websites. In this paper, we propose the use of web scraping techniques (namely, two popular libraries, BeautifulSoup and Selenium) to extract data from the web, and of other Python libraries and techniques (the vaderSentiment package with its SentimentIntensityAnalyzer, nltk, and n-gram analysis of consecutive words) to analyze the reviews and obtain useful insights from the data. A web scraper is built that extracts prices and tracks their variations. Furthermore, the reviews are extracted and analyzed in order to identify relevant opinions, including customer complaints.