    A three-year study on the freshness of Web search engine databases

    This paper deals with one aspect of the index quality of search engines: index freshness. The purpose is to analyse the update strategies of the major Web search engines Google, Yahoo, and MSN/Live.com. We conducted a test of the updates of 40 daily updated pages and 30 irregularly updated pages. We used data from a time span of six weeks in the years 2005, 2006, and 2007. We found that the best search engine in terms of up-to-dateness changes over the years and that none of the engines has an ideal solution for index freshness. Frequency distributions for the pages’ ages are skewed, which means that search engines do differentiate between often- and seldom-updated pages. This is confirmed by the difference between the average ages of daily updated pages and our control group of pages. Indexing patterns are often irregular, and there seems to be no clear policy regarding when to revisit Web pages. A major problem identified in our research is the delay in making crawled pages available for searching, which differs from one engine to another.
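
    Freshness in studies of this kind is typically quantified as the age of the indexed copy of a page, i.e. the time elapsed between the copy's crawl date and the day of observation. A minimal Python sketch of aggregating such ages per engine follows; the engines, pages, and dates are illustrative assumptions, not data from the paper.

from datetime import date
from statistics import mean, median

# Hypothetical observations: (page id, crawl date of the copy held by the engine),
# all recorded on one fixed observation day, grouped per search engine.
observations = {
    "engine_a": [("page1", date(2007, 2, 10)), ("page2", date(2007, 2, 3))],
    "engine_b": [("page1", date(2007, 2, 12)), ("page2", date(2007, 1, 30))],
}
observation_day = date(2007, 2, 14)

for engine, pages in observations.items():
    # Age of a copy = days between its crawl date and the observation day.
    ages = [(observation_day - crawled).days for _, crawled in pages]
    print(engine, "mean age:", mean(ages), "median age:", median(ages))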

    An Evaluation of Breast Cancer Website: Assessing the Readability of Breast Cancer Websites for The Public.

    Patient participation in the management of one's own health, including by breast cancer patients, is now an important practice, and patients need information about their condition in order to make informed decisions about their health. This information can be sought through various media, and the internet has been found to be the most important medium, even for cancer patients. The literature recommends that online health information be written at a sixth-grade reading level. Websites were selected by mimicking how the public searches for breast cancer information on the internet. These websites were then evaluated using readability tests. This study found that readability is poor, with all of the websites written above the recommended grade level for health information. The public can find information about breast cancer on the internet, but the readability of online health information remains a serious issue. Keywords: readability, informed patient, online health information, internet
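
    Readability tests of the kind applied here map surface features of a text onto a school grade. The Flesch-Kincaid Grade Level is a common choice: 0.39 * (words per sentence) + 11.8 * (syllables per word) - 15.59. A minimal Python sketch follows; the crude syllable counter and the sample sentence are illustrative assumptions, and published tools count syllables more carefully.

import re

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level: 0.39*(words/sentence) + 11.8*(syllables/word) - 15.59."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    # Crude syllable estimate: runs of consecutive vowels per word, at least one per word.
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    n_words = max(1, len(words))
    return 0.39 * (n_words / sentences) + 11.8 * (syllables / n_words) - 15.59

sample = "Breast cancer screening helps find cancer early. Early treatment often works better."
print(round(flesch_kincaid_grade(sample), 1))  # estimated grade level of the sample text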

    Time series analysis of the dynamics of news websites

    The content of news websites changes frequently and rapidly, and its relevance tends to decay with time. To be of value to users, tools such as search engines have to cope with these evolving websites and detect their changes in a timely manner. In this paper we apply time series analysis to study the properties and the temporal patterns of the change rates of the content of three news websites. Our investigation shows that changes are characterized by large fluctuations with periodic patterns and time-dependent behavior. The time series describing the change rate is decomposed into trend, seasonal, and irregular components, and models of each component are then identified. The trend and seasonal components describe the daily and weekly patterns of the change rates. Trigonometric polynomials best fit these deterministic components, whereas the class of ARMA models represents the irregular component. The resulting models can be used to describe the dynamics of the changes and to predict future change rates.
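
    The modelling pipeline described above (decompose the change-rate series into trend, seasonal, and irregular components, then fit an ARMA model to the irregular part) can be sketched with standard time-series tooling. The synthetic hourly change-rate series below, with a 24-hour cycle, is an illustrative assumption rather than the paper's data.

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima.model import ARIMA

# Synthetic hourly "change rate" series with a daily (24-hour) cycle plus noise.
rng = np.random.default_rng(0)
hours = pd.date_range("2024-01-01", periods=24 * 28, freq="h")
rate = 10 + 3 * np.sin(2 * np.pi * np.arange(len(hours)) / 24) + rng.normal(0, 1, len(hours))
series = pd.Series(rate, index=hours)

# Decompose into trend, seasonal, and irregular (residual) components.
decomposition = seasonal_decompose(series, period=24)
irregular = decomposition.resid.dropna()

# Fit an ARMA(p, q) model to the irregular component (ARIMA with d = 0).
arma = ARIMA(irregular, order=(2, 0, 1)).fit()
print(arma.params)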

    A novel defense mechanism against web crawler intrusion

    Web robots, also known as crawlers or spiders, are used by search engines, hackers, and spammers to gather information about web pages. Timely detection and prevention of unwanted crawlers increases the privacy and security of websites. In this research, a novel method to identify web crawlers is proposed in order to prevent unwanted crawlers from accessing websites. The proposed method uses a five-factor identification process to detect unwanted crawlers. This study provides pretest and posttest results, along with a systematic evaluation of web pages with the proposed identification technique versus web pages without it. An experiment was performed with repeated measures for two groups, each containing ninety web pages. The results of a logistic regression analysis of the treatment and control groups confirm the five-factor identification process as an effective mechanism for preventing unwanted web crawlers. The study concluded that the proposed five-factor identification process is an effective technique, as demonstrated by the successful outcome.
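
    The evaluation described above rests on a logistic regression over per-session features. The five features below (request rate, share of HEAD requests, robots.txt access, image fetching, User-Agent validity) are illustrative assumptions, not the paper's actual identification factors; the sketch only shows the shape of such a classifier.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: five hypothetical identification factors for one visitor session, e.g.
# [requests per minute, fraction of HEAD requests, robots.txt fetched, images fetched, valid User-Agent].
X = np.array([
    [120, 0.6, 1, 0, 0],   # crawler-like sessions
    [ 90, 0.4, 1, 0, 1],
    [  3, 0.0, 0, 1, 1],   # human-like sessions
    [  5, 0.1, 0, 1, 1],
])
y = np.array([1, 1, 0, 0])  # 1 = crawler, 0 = human visitor

model = LogisticRegression().fit(X, y)
new_session = np.array([[80, 0.5, 1, 0, 0]])
print("P(crawler):", model.predict_proba(new_session)[0, 1])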

    Using an ontology to improve the web search experience

    The search terms that a user passes to a search engine are often ambiguous, referring to homonyms. The results in these cases are a mixture of links to documents that contain different meanings of the search terms. Current search engines provide suggested query completions in a dropdown list. However, such lists are not well organized, mixing completions for different meanings, and the suggested search phrases are not discriminating enough. Moreover, current search engines often return an unexpected number of results: zero hits are naturally undesirable, while too many hits are likely to be overwhelming and of low precision. This dissertation aims at providing a better Web search experience for users by addressing the problems described above. To improve the search for homonyms, suggested completions are well organized and visually separated. In addition, this approach supports the use of negative terms to disambiguate the suggested completions in the list. The dissertation presents an algorithm to generate the suggested search completion terms using an ontology, along with new ways of displaying homonymous search results. These algorithms have been implemented in the Ontology-Supported Web Search (OSWS) System for famous people. The dissertation presents a method for dynamically building the necessary ontology of famous people by mining the suggested completions of a search engine, combined with data from DBpedia. To enhance the OSWS ontology, Facebook is used as a secondary data source: information from people's public pages is mined, and Facebook attributes are cleaned up and mapped to the OSWS ontology. To control the size of the result sets returned by the search engines, the dissertation demonstrates a query rewriting method for generating alternative query strings and implements a model for predicting the number of search engine hits for each alternative query string, based on the English-language frequencies of the words in the search terms. Evaluation experiments of the hit-count prediction model are presented for three major search engines. The dissertation also discusses and quantifies how far the Google, Yahoo!, and Bing search engines diverge from monotonic behavior, considering negative and positive search terms separately.
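
    One component described above is a model that predicts a search engine's hit count for a query from the English-language frequencies of its words. A minimal Python sketch of such a predictor under a naive word-independence assumption follows; the frequency values, back-off constant, and index size are illustrative assumptions, not the dissertation's fitted model.

# Relative frequencies of words in English text (illustrative values).
word_frequency = {"jaguar": 2e-5, "car": 8e-4, "animal": 3e-4}

# Assumed number of indexed documents (illustrative).
index_size = 50_000_000_000

def predict_hits(query: str) -> float:
    """Naive estimate: hits ~ index size times the product of per-word frequencies,
    treating word occurrences as independent (a strong simplifying assumption)."""
    estimate = index_size
    for word in query.lower().split():
        estimate *= word_frequency.get(word, 1e-6)  # back-off value for unknown words
    return estimate

print(f"{predict_hits('jaguar car'):.0f} estimated hits")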

    Web dynamics and their ramifications for the development of Web search engines

    The World Wide Web has become the largest hypertext system in existence, providing an extremely rich collection of information resources. Compared with conventional information sources, the Web is highly dynamic with respect to four factors: size (i.e., the growing number of Web sites and pages), Web pages (page content and page existence), hyperlink structures, and users' searching needs. As the most popular and important tools for finding information on the Web, Web search engines have to face many challenges arising from Web dynamics. This paper surveys the research issues on Web dynamics and discusses how search engines address the four factors of Web dynamics. We then briefly discuss the main issues and directions for the future development of Web search engines.

    Behaviour on Linked Data - Specification, Monitoring, and Execution

    People, organisations, and machines around the globe make use of web technologies to communicate. For instance, 4.16 bn people with access to the internet made 4.6 bn pages on the web accessible using the transfer protocol HTTP, organisations such as Amazon built ecosystems around the HTTP-based access to their businesses under the headline RESTful APIs, and the Linking Open Data movement has put billions of facts on the web available in the data model RDF via HTTP. Moreover, under the headline Web of Things, people use RDF and HTTP to access sensors and actuators on the Internet of Things. The necessary communication requires interoperable systems at a truly global scale, for which web technologies provide the necessary standards regarding the transfer and the representation of data: the HTTP protocol specifies how to transfer messages, besides defining the semantics of sending/receiving different types of messages, and the RDF family of languages specifies how to represent the data in the messages, besides providing means to elaborate the semantics of the data in the messages. The combination of HTTP and RDF (together with the shared assumption of HTTP and RDF to use URIs as identifiers) is called Linked Data. While the representation of static data in the context of Linked Data has been formally grounded in mathematical logic, a formal treatment of dynamics and behaviour on Linked Data is largely missing. We regard behaviour in this context as the way in which a system (e.g. a user agent or server) works, and this behaviour manifests itself in dynamic data. Using a formal treatment of behaviour on Linked Data, we could specify applications that use or provide Linked Data in a way that allows for formal analysis (e.g. expressivity, validation, verification). Using an experimental treatment of behaviour, or a treatment of the behaviour's manifestation in dynamic data, we could better design the handling of Linked Data in applications. Hence, in this thesis, we investigate the notion of behaviour in the context of Linked Data. Specifically, we investigate the research question of how to capture the dynamics of Linked Data to inform the design of applications. The first contribution is a corpus that we built and analysed to monitor dynamic Linked Data on the web to study the update behaviour. We provide an extensive analysis to set up a long-term study of the dynamics of Linked Data on the web. We analyse data from the long-term study for dynamics on the level of accessing changing documents and on the level of changes within the documents. The second contribution is a model of computation for Linked Data that allows for expressing executable specifications of application behaviour. We provide a mapping from the conceptual foundations of the standards around Linked Data to Abstract State Machines, a Turing-complete model of computation rooted in mathematical logic. The third contribution is a workflow ontology and corresponding operational semantics to specify applications that execute and monitor behaviour in the context of Linked Data. Our approach allows for monitoring and executing behaviour specified in workflow models and respects the assumptions of the standards and practices around Linked Data. We evaluate our findings using the experimental corpus of dynamic Linked Data on the web and a synthetic benchmark from the Internet of Things, specifically the domain of building automation.
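
    Linked Data as characterised above is RDF obtained by dereferencing URIs over HTTP. A minimal Python sketch of accessing one such resource with the rdflib library follows; the choice of DBpedia resource is just an example, and the snippet assumes network access and that the server still serves RDF via content negotiation.

from rdflib import Graph, URIRef

# Dereference a Linked Data resource over HTTP and parse the RDF it returns.
resource = URIRef("http://dbpedia.org/resource/Berlin")
graph = Graph()
graph.parse(str(resource))  # content negotiation selects an RDF serialisation

# Print a few triples whose subject is the dereferenced resource.
for i, (s, p, o) in enumerate(graph.triples((resource, None, None))):
    print(p, o)
    if i >= 9:
        break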

    Digitale Methoden in der Kommunikationswissenschaft

    What do big data investigations mean for the development of theories and for research ethics? How can public traces of digital communication be captured, analyzed, and interpreted? How can the metrics of social media platforms be translated into empirically grounded research? What strategies allow us to look into algorithmic black boxes such as search engines and news feeds? This volume deals with these and many similar questions that arise in communication research in the digital age. The book brings together theoretical and ethical discussions as well as articles documenting empirical research on digital communication. Always at the centre: the practices that adapt to the medium and explore its objects, actors, and infrastructures, in other words what we call the "digital methods" of communication science.

    Suchmaschinen - Eine industrieökonomische Analyse der Konzentration und ihrer Ursachen

    The main topic of this doctoral thesis is the concentration in search engine markets. It investigates whether the structural characteristics of these markets favor a natural concentration (monopoly/oligopoly) or whether abuse of a dominant position is responsible for it. Furthermore, the (qualitative) efficiency of the providers is examined on the basis of quality and satisfaction studies. Following the introduction, Chapter 2 presents background information to support a better understanding of the analysis. Chapter 3 describes the structure and operation of a search engine; the service is split into sub-processes, and individual functions are analyzed with regard to their contribution to search engine quality. Chapter 4 analyzes the demand-side characteristics, focusing on possible switching costs, network effects, and platform properties; the argument is supported by behavioral studies. Chapter 5 considers the supply side of a search engine, in particular the cost structure of maintaining a search engine service, in order to capture possible economies of scale and scope. In Chapter 6, the high concentration of the search engine markets is analyzed empirically. Chapter 7 then asks whether this concentration can be explained by search engine quality or by the economic characteristics identified earlier; barriers to entry and the concentration process (winner takes all) are distinguished. In the final analytical section, the concentration factors are analyzed in chronological sequence and the search engines are examined within the framework of the theory of contestable markets. The thesis concludes with a summary of the findings and a discussion of regulatory proposals.
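
    Market concentration of the kind analyzed here is commonly quantified with the Herfindahl-Hirschman Index (HHI), the sum of squared market shares. A minimal Python sketch follows; the market shares are illustrative assumptions, not figures from the thesis.

# Illustrative search engine market shares in percent (not the thesis's data).
shares = {"engine_a": 85.0, "engine_b": 9.0, "engine_c": 4.0, "engine_d": 2.0}

# Herfindahl-Hirschman Index: sum of squared percentage shares, ranging from
# near 0 (atomistic competition) to 10,000 (monopoly); values above 2,500
# are conventionally read as a highly concentrated market.
hhi = sum(share ** 2 for share in shares.values())
print(f"HHI = {hhi:.0f}")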