
    LiveRank: How to Refresh Old Datasets

    This paper considers the problem of refreshing a dataset. More precisely, given a collection of nodes gathered at some time (Web pages, users from an online social network) along with some structure (hyperlinks, social relationships), we want to identify a significant fraction of the nodes that still exist at present time. The liveness of an old node can be tested through an online query at present time. We call LiveRank a ranking of the old pages such that active nodes are more likely to appear first. The quality of a LiveRank is measured by the number of queries necessary to identify a given fraction of the active nodes when using the LiveRank order. We study different scenarios, from a static setting where the LiveRank is computed before any query is made, to dynamic settings where the LiveRank can be updated as queries are processed. Our results show that building on PageRank can lead to efficient LiveRanks, for Web graphs as well as for online social networks.
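
    As a rough illustration of the quality measure described above, the sketch below (not the paper's code; node names and the ranking are made up) counts how many liveness queries are needed, when probing in a given order, before a target fraction of the active nodes is found.

```python
# Minimal sketch of the LiveRank quality measure: the number of liveness
# queries needed to collect a given fraction of the still-active nodes
# when probing in the proposed order.

def queries_needed(ranking, active_nodes, fraction=0.5):
    """Return how many nodes must be queried, following `ranking`,
    before `fraction` of the active nodes have been found."""
    target = int(fraction * len(active_nodes))
    found = 0
    for queries, node in enumerate(ranking, start=1):
        if node in active_nodes:
            found += 1
            if found >= target:
                return queries
    return len(ranking)  # fraction not reachable with this ranking

# Hypothetical toy example: nodes ranked by some PageRank-like score.
ranking = ["a", "b", "c", "d", "e", "f"]
active = {"b", "c", "f"}
print(queries_needed(ranking, active, fraction=2/3))  # -> 3 (a, b, c)
```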

    A comparison between public-domain search engines

    The enormous amount of information available today on the Internet requires the use of search tools such as search engines, meta-search engines and directories for rapid retrieval of useful and appropriate information. Indexing a website's content with a search engine allows its information to be located quickly and improves the site's usability. In the case of a large number of pages distributed over different systems (e.g. an organization with several autonomous branches/departments), a local search engine rapidly provides a comprehensive overview of all information and services offered. Local indexing generally has fewer requirements than global indexing (i.e. resources, performance, code optimization), so public-domain software can be used effectively. In this paper, we compare four open-source search engines available in the Unix environment in order to evaluate their features and effectiveness, and to understand any problems that may arise in an operative environment. Specifically, the comparison includes: the software features (installation, configuration options, scalability); user interfaces; the overall performance when indexing a sample page set; the effectiveness of searches; the state of development and maintenance; and documentation and support.
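
    The effectiveness of searches, one of the comparison criteria listed above, is typically measured against a judged sample. The sketch below is a minimal, assumed illustration (hypothetical engine names and result lists, not the paper's methodology) of computing precision and recall for one query.

```python
# Minimal sketch: comparing the search effectiveness of two engines on a
# single judged query via precision and recall.

def precision_recall(retrieved, relevant):
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result lists for one query on a sample page set.
relevant = {"p1", "p4", "p7"}
engines = {
    "engine_a": ["p1", "p2", "p4"],
    "engine_b": ["p3", "p5", "p6"],
}
for name, results in engines.items():
    p, r = precision_recall(results, relevant)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```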

    Dynamic OSINT System Sourcing from Social Networks

    Nowadays, the World Wide Web (WWW) is simultaneously an accumulator and a provider of huge amounts of information, which is delivered to users through news, blogs, social networks, etc. The exponential growth of information is a major challenge for the community in general, since the frequent demand for and correlation of news becomes a repetitive task, potentially tedious and prone to errors. Although information scrutiny is still performed manually and on a regular basis by most people, the emergence in recent years of Open-Source Intelligence (OSINT) systems for monitoring, selecting and extracting textual information from social networks and the Web promises to change this. These systems are now very popular and useful tools for professionals from different areas, such as the cyber-security community, where staying updated with the latest news and trends can have a direct impact on threat response. This work aims to address this problem through the implementation of a dynamic OSINT system. For this system, two algorithms were developed: one to dynamically add, remove and rate user accounts with relevant tweets in the computer security area, and another to classify the publications of those users. The relevance of a user depends not only on how frequently he or she publishes, but also on his or her importance (status) in the social network, as well as on the relevance of the information published. Text mining functions are proposed herein to measure the relevance of text segments. The proposed approach is innovative, involving dynamic management of the relevance of users and their publications, thus ensuring a more reliable and relevant information-source framework. Apart from the algorithms and the functions on which they are built (also proposed in the scope of this work), this dissertation describes several experiments and tests used in their evaluation. The qualitative results are very interesting and demonstrate the practical usefulness of the approach. In terms of human-machine interface, a mural of information, generated dynamically and automatically from the social network Twitter, is provided to the end user. In the current version of the system, the mural is presented in the form of a web page, highlighting the news by its relevance (red for high relevance, yellow for moderate relevance, and green for low relevance). The main contributions of this work are the two proposed algorithms and their evaluation. A fully working prototype of a system implementing them, along with a mural for showing selected news, is another important output of this work.
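
    As a rough sketch of the ideas described above, the code below combines posting frequency, network status and content relevance into a single user score, and maps a relevance value to the mural's traffic-light classes. The weights, normalisations and thresholds are assumptions for illustration, not the dissertation's actual algorithms.

```python
# Hypothetical sketch: score a Twitter account by posting frequency, network
# status, and the average relevance of its publications, then map a relevance
# value to the mural's traffic-light classes (red/yellow/green).

def user_relevance(tweets_per_day, followers, avg_tweet_relevance,
                   w_freq=0.2, w_status=0.3, w_content=0.5):
    # Normalise raw signals into [0, 1] with simple, assumed saturation points.
    freq = min(tweets_per_day / 10.0, 1.0)
    status = min(followers / 10_000.0, 1.0)
    return w_freq * freq + w_status * status + w_content * avg_tweet_relevance

def traffic_light(relevance):
    if relevance >= 0.66:
        return "red"     # high relevance
    if relevance >= 0.33:
        return "yellow"  # moderate relevance
    return "green"       # low relevance

score = user_relevance(tweets_per_day=4, followers=2_500, avg_tweet_relevance=0.8)
print(score, traffic_light(score))
```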

    Look back, look around: A systematic analysis of effective predictors for new outlinks in focused Web crawling

    Small and medium enterprises rely on detailed Web analytics to be informed about their market and competition. Focused crawlers meet this demand by crawling and indexing specific parts of the Web. Critically, a focused crawler must quickly find new pages that have not yet been indexed. Since a new page can be discovered only by following a new outlink, predicting new outlinks is very relevant in practice. In the literature, many feature designs have been proposed for predicting changes in the Web. In this work we provide a structured analysis of this problem, using new outlinks as our running prediction target. Specifically, we unify earlier feature designs in a taxonomic arrangement of features along two dimensions: static versus dynamic features, and features of a page versus features of the network around it. Within this taxonomy, complemented by our new (mainly dynamic network) features, we identify the best predictors for new outlinks. Our main conclusion is that the most informative features are the recent history of new outlinks on a page itself and on its content-related pages. Hence, we propose a new 'look back, look around' (LBLA) model that uses only these features. With the obtained predictions, we design a number of scoring functions to guide a focused crawler to pages with the most new outlinks, and compare their performance. The LBLA approach proved extremely effective, outperforming other models including those that use the most complete set of features. One of the learners we use is the recent NGBoost method, which assumes a Poisson distribution for the number of new outlinks on a page and learns its parameters. This connects two so far unrelated avenues in the literature: predictions based on features of a page, and those based on probabilistic modelling. All experiments were carried out on an original dataset, made available by a commercial focused crawler.
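
    The Poisson assumption can be illustrated with a small regression sketch. The paper uses NGBoost; the code below substitutes scikit-learn's PoissonRegressor as a simpler stand-in, and the 'look back, look around' style features and counts are invented for illustration.

```python
# Minimal sketch of Poisson-based prediction of new-outlink counts from
# "look back, look around" style features. A plain Poisson regression is used
# here as a simple stand-in for the paper's NGBoost learner; the data is made up.
import numpy as np
from sklearn.linear_model import PoissonRegressor

# Assumed features per page: new outlinks seen on the page itself in the last
# two crawls, and new outlinks seen on its content-related pages.
X = np.array([
    [5, 3, 4],
    [0, 0, 1],
    [2, 1, 0],
    [7, 6, 5],
    [1, 0, 2],
])
# Target: new outlinks observed at the next crawl.
y = np.array([4, 0, 1, 8, 1])

model = PoissonRegressor(alpha=1e-3).fit(X, y)
print(model.predict(np.array([[3, 2, 2]])))  # expected number of new outlinks
```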

    An intelligent real-time help system for clinical reasoning in virtual reality environment based on emotional analysis

    Clinical reasoning is one of the most important skills in medical practice. Hypocrates is a medical evaluation and emotional analysis software platform built on a virtual reality environment. Through this platform, medical students can evaluate their medical knowledge using well-designed virtual clinical cases from a case database. During the evaluation, electroencephalogram signals are collected simultaneously to assess the student's emotional state, which helps researchers study how the student's emotions change throughout the evaluation process. Previous studies showed that maintaining a peaceful and positive emotional state is necessary for good performance, and one possible way to maintain such a state is to help students avoid the negative emotions (frustration, stress, confusion, etc.) that can arise from errors during the evaluation. In this research, we studied, designed and developed a real-time help system and integrated it into Hypocrates to form its next version, Hypocrates+. The help system is intelligent in that it provides personalized and highly related help content when the platform determines that such content is necessary to maintain a student's peaceful and positive emotional state. The help system is real-time in that it has a very short response delay, so the student can obtain useful medical knowledge very quickly. The help content is generated from the Internet, or more precisely from online Wiki pages, so that it is always up to date. The C# language and the Visual Studio IDE were used to develop the real-time help system: a console application functions as a server providing services to its clients, the other parts of the Hypocrates+ platform. Techniques such as information retrieval, artificial intelligence and UDP network programming were widely used in developing the intelligent server. With the desired real-time help system integrated, Hypocrates+ evolved from a simple evaluation platform into a virtual medical education platform, an approach that is becoming more and more popular in modern medical education. Tests and experiments were performed on the real-time help system and Hypocrates+ to investigate the quality and usefulness of the generated help content and the response time. Results show that the content quality is good and that the response is very fast, with an average response time of 1.5 seconds. The results of the emotional analysis show that the help content reduced the negative emotions of 4 out of 5 participants. We conclude that a real-time help system with good help-content quality enriches the functionality of a virtual reality medical education platform and probably helps medical students reduce negative emotions during clinical reasoning.
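
    The abstract describes a console server that answers its clients over UDP with help content retrieved from online Wiki pages. The sketch below only illustrates that request/response pattern in Python (the real system is written in C#), with a hard-coded dictionary standing in for the Wiki-based retrieval component.

```python
# Minimal sketch of the request/response pattern described above: a console
# server that receives a query term over UDP and answers with a help snippet.
# The lookup table is a placeholder for the actual Wiki-based retrieval.
import socket

HELP_SNIPPETS = {  # placeholder content, not the platform's real data
    "pneumonia": "Pneumonia is an infection that inflames the air sacs ...",
}

def serve(host="127.0.0.1", port=9999):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    print(f"help server listening on {host}:{port}")
    while True:
        data, addr = sock.recvfrom(4096)
        term = data.decode("utf-8").strip().lower()
        answer = HELP_SNIPPETS.get(term, "no help content found")
        sock.sendto(answer.encode("utf-8"), addr)

if __name__ == "__main__":
    serve()
```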

    Effective web crawlers

    Web crawlers are the component of a search engine that must traverse the Web, gathering documents in a local repository for indexing by a search engine so that they can be ranked by their relevance to user queries. Whenever data is replicated in an autonomously updated environment, there are issues with maintaining up-to-date copies of documents. When documents are retrieved by a crawler and have subsequently been altered on the Web, the effect is an inconsistency in user search results. While the impact depends on the type and volume of change, many existing algorithms do not take the degree of change into consideration, instead using simple measures that consider any change as significant. Furthermore, many crawler evaluation metrics do not consider index freshness or the amount of impact that crawling algorithms have on user results. Most of the existing work makes assumptions about the change rate of documents on the Web, or relies on the availability of a long history of change. Our work investigates approaches to improving index consistency: detecting meaningful change, measuring the impact of a crawl on collection freshness from a user perspective, developing a framework for evaluating crawler performance, determining the effectiveness of stateless crawl ordering schemes, and proposing and evaluating the effectiveness of a dynamic crawl approach. Our work is concerned specifically with cases where there is little or no past change statistics with which predictions can be made. Our work analyses different measures of change and introduces a novel approach to measuring the impact of recrawl schemes on search engine users. Our schemes detect important changes that affect user results. Other well-known and widely used schemes have to retrieve around twice the data to achieve the same effectiveness as our schemes. Furthermore, while many studies have assumed that the Web changes according to a model, our experimental results are based on real web documents. We analyse various stateless crawl ordering schemes that have no past change statistics with which to predict which documents will change, none of which, to our knowledge, has been tested to determine effectiveness in crawling changed documents. We empirically show that the effectiveness of these schemes depends on the topology and dynamics of the domain crawled and that no one static crawl ordering scheme can effectively maintain freshness, motivating our work on dynamic approaches. We present our novel approach to maintaining freshness, which uses the anchor text linking documents to determine the likelihood of a document changing, based on statistics gathered during the current crawl. We show that this scheme is highly effective when combined with existing stateless schemes. When we combine our scheme with PageRank, our approach allows the crawler to improve both freshness and quality of a collection. Our scheme improves freshness regardless of which stateless scheme it is used in conjunction with, since it uses both positive and negative reinforcement to determine which document to retrieve. Finally, we present the design and implementation of Lara, our own distributed crawler, which we used to develop our testbed
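
    The anchor-text scheme with positive and negative reinforcement can be illustrated roughly as follows; the scoring below is a hypothetical sketch, not the thesis's algorithm.

```python
# Hypothetical sketch of anchor-text terms as change predictors with
# positive/negative reinforcement: terms in anchors pointing to documents that
# turned out to have changed gain weight, terms pointing to unchanged documents
# lose weight, and the frontier is reordered by the accumulated evidence.
from collections import defaultdict

term_weight = defaultdict(float)

def reinforce(anchor_terms, changed):
    """Update term weights after fetching a document and observing whether
    it changed since the last crawl."""
    delta = 1.0 if changed else -1.0
    for term in anchor_terms:
        term_weight[term] += delta

def change_score(anchor_terms):
    """Estimated likelihood-of-change score for a frontier document."""
    return sum(term_weight[t] for t in anchor_terms)

# Toy usage: observations gathered during the current crawl ...
reinforce({"news", "today"}, changed=True)
reinforce({"about", "contact"}, changed=False)
# ... then pick the frontier document with the highest score next.
frontier = {"docA": {"news", "sports"}, "docB": {"about", "history"}}
print(max(frontier, key=lambda d: change_score(frontier[d])))  # -> docA
```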

    NewsView: A Recommender System for Usenet based on FAST Data Search

    This thesis combines aspects from two approaches to information access, information filtering and information retrieval, in an effort to improve the signal to noise ratio in interfaces to conversational data. These two ideas are blended into one system by augmenting a search engine indexing Usenet messages with concepts and ideas from recommender systems theory. My aim is to achieve a situation where the overall result relevance is improved by exploiting the qualities of both approaches. Important issues in this context are obtaining ratings, evaluating relevance rankings and the application of useful user profiles. An architecture called NewsView has been designed as part of the work on this thesis. NewsView describes a framework for interfaces to Usenet with information retrieval and information filtering concepts built into it, as well as extensive navigational possibilities within the data. My aim with this framework is to provide a testbed for user interface, information filtering and information retrieval issues, and, most importantly, combinations of the three
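
    One simple way to blend the two approaches, sketched below with a hypothetical linear combination rather than NewsView's actual ranking, is to mix a normalised retrieval score with a normalised community rating so that well-rated messages surface earlier in the result list.

```python
# Hypothetical sketch: combine a search engine's relevance score with a
# recommender-style average rating for Usenet messages.

def combined_score(retrieval_score, avg_rating, alpha=0.7):
    """Blend normalised retrieval relevance (0..1) with a normalised
    average user rating (0..1); alpha weights the retrieval side."""
    return alpha * retrieval_score + (1 - alpha) * avg_rating

messages = [
    {"id": "msg1", "retrieval": 0.9, "rating": 0.2},
    {"id": "msg2", "retrieval": 0.7, "rating": 0.9},
]
ranked = sorted(messages,
                key=lambda m: combined_score(m["retrieval"], m["rating"]),
                reverse=True)
print([m["id"] for m in ranked])  # -> ['msg2', 'msg1']
```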

    Models and methods for web archive crawling

    Web archives offer a rich and plentiful source of information to researchers, analysts, and legal experts. For this purpose, they gather Web sites as the sites change over time. In order to maintain high standards of data quality, Web archives would have to collect all versions of the Web sites, but due to limited resources and technical constraints this is not possible. Therefore, Web archives consist of versions archived at various time points without guarantee of mutual consistency. This thesis presents a model for assessing the data quality in Web archives as well as a family of crawling strategies yielding high-quality captures. We distinguish between single-visit crawling strategies for exploratory purposes and visit-revisit crawling strategies for evidentiary purposes. Single-visit strategies download every page exactly once, aiming for an "undistorted" capture of the ever-changing Web. We express the quality of the resulting capture with the "blur" quality measure. In contrast, visit-revisit strategies download every page twice. The initial downloads of all pages form the visit phase of the crawling strategy; the second downloads are grouped together in the revisit phase. These two phases enable us to check which pages changed during the crawling process, and thus to identify the pages that are consistent with each other. The quality of visit-revisit captures is expressed by the "coherence" measure. Quality-conscious strategies are based on predictions of the change behaviour of individual pages. We model Web site dynamics by Poisson processes with page-specific change rates. Furthermore, we show that these rates can be statistically predicted. Finally, we propose visualization techniques for exploring the quality of the resulting Web archives. A fully functional prototype demonstrates the practical viability of our approach.
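
    The Poisson change model can be illustrated with a short sketch: estimate a page-specific change rate from past observations and derive the probability that the page changes within a given crawl interval. The numbers are illustrative, not from the thesis.

```python
# Minimal sketch of the Poisson change model: each page is assumed to change
# according to a Poisson process with a page-specific rate, estimated from
# past observations and used to compute the probability of at least one
# change within a crawl interval.
import math

def estimate_rate(num_changes, observation_days):
    """Maximum-likelihood estimate of the change rate (changes per day)."""
    return num_changes / observation_days

def prob_change_within(rate, days):
    """P(at least one change in `days`) for a Poisson process with `rate`."""
    return 1.0 - math.exp(-rate * days)

rate = estimate_rate(num_changes=6, observation_days=30)   # 0.2 changes/day
print(prob_change_within(rate, days=7))                    # ~0.75
```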