18 research outputs found

    A Comparison of Techniques for Sampling Web Pages

    As the World Wide Web grows rapidly, it is becoming increasingly challenging to gather representative information about it. Instead of crawling the web exhaustively, one has to resort to other techniques, such as sampling, to determine the properties of the web. A uniform random sample of the web would be useful to determine the percentage of web pages in a specific language, on a topic, or in a top-level domain. Unfortunately, no approach has been shown to sample web pages in an unbiased way. Three promising web sampling algorithms are based on random walks. Each has been evaluated individually, but because those evaluations used different data sets, their results cannot be compared. We directly compare these algorithms in this paper. We performed three random walks on the web under the same conditions and analyzed their outcomes in detail. We discuss the strengths and weaknesses of each algorithm and propose improvements based on experimental results.
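To illustrate why random-walk sampling is not unbiased by default, the following minimal sketch computes where a plain random walk spends its time on a small undirected toy graph. It is not one of the three algorithms compared in the paper, which the abstract does not name; the graph, the function name `lazy_walk_distribution`, and the use of a lazy walk (staying put with probability 1/2 so the iteration converges) are all assumptions made for illustration.

```python
def lazy_walk_distribution(graph, steps=1000):
    """Iterate the visit probabilities of a lazy random walk.

    graph maps each node to the list of its neighbours (undirected,
    so every edge appears in both adjacency lists). At each step the
    walker stays put with probability 1/2, otherwise it moves to a
    neighbour chosen uniformly at random.
    """
    nodes = sorted(graph)
    # Start from a uniform distribution over the nodes.
    prob = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(steps):
        # Half of the probability mass stays on the current node.
        nxt = {n: 0.5 * prob[n] for n in nodes}
        for n in nodes:
            # The other half is split evenly among the neighbours.
            share = 0.5 * prob[n] / len(graph[n])
            for m in graph[n]:
                nxt[m] += share
        prob = nxt
    return prob


# Hypothetical toy graph: "a" is a hub, "d" is a leaf.
TOY_GRAPH = {
    "a": ["b", "c", "d"],
    "b": ["a", "c"],
    "c": ["a", "b"],
    "d": ["a"],
}

if __name__ == "__main__":
    for node, p in lazy_walk_distribution(TOY_GRAPH).items():
        print(node, round(p, 4))
```

On an undirected graph the walk visits each node in proportion to its degree, so here the hub "a" is visited with probability 3/8 and the leaf "d" with probability 1/8, rather than the uniform 1/4 each. Sampling the real web is harder still, since its links are directed.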

    Purely URL-based Topic Classification

    Given only the URL of a web page, can we identify its topic? This is the question that we examine in this paper. Usually, web pages are classified using their content, but a URL-only classifier is preferable (i) when speed is crucial, (ii) to enable content filtering before an (objectionable) web page is downloaded, (iii) when a page's content is hidden in images, (iv) to annotate hyperlinks in a personalized web browser without fetching the target page, and (v) when a focused crawler wants to infer the topic of a target page before devoting bandwidth to downloading it. We apply a machine learning approach to the topic identification task and evaluate its performance in extensive experiments on categorized web pages from the Open Directory Project (ODP). When training separate binary classifiers for each topic, we achieve typical F-measure values between 80 and 85, and a typical precision of around 85. We also ran experiments on a small data set of university web pages. For the task of classifying these pages into faculty, student, course, and project pages, our methods improve over previous approaches by 13.8 points of F-measure.
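As a rough idea of what classifying from the URL alone can look like, here is a minimal sketch. The abstract does not specify the features or the learner, and the paper trains separate binary classifiers per topic; this sketch instead uses a single multi-class naive Bayes model with add-one smoothing over word tokens. The names `url_tokens` and `UrlNaiveBayes` and the example URLs are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def url_tokens(url):
    """Split a URL into lowercase alphabetic tokens (assumed feature scheme)."""
    parts = urlsplit(url)
    text = f"{parts.netloc} {parts.path} {parts.query}".lower()
    # Break on anything that is not a letter; drop one-letter fragments.
    return [t for t in re.split(r"[^a-z]+", text) if len(t) > 1]


class UrlNaiveBayes:
    """Multinomial naive Bayes over URL tokens (illustrative only)."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token frequencies
        self.doc_counts = Counter()               # label -> number of URLs
        self.vocab = set()

    def fit(self, urls, labels):
        for url, label in zip(urls, labels):
            tokens = url_tokens(url)
            self.token_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, url):
        tokens = url_tokens(url)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.token_counts[label]
            # Add-one smoothing keeps unseen tokens from zeroing the score.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    # Made-up training URLs, not data from the paper.
    train_urls = [
        "http://www.espn.com/football/scores",
        "http://sports.example.org/tennis/news",
        "http://www.example.com/recipes/pasta",
        "http://cooking.example.net/recipes/soup",
    ]
    train_labels = ["sports", "sports", "cooking", "cooking"]
    model = UrlNaiveBayes().fit(train_urls, train_labels)
    print(model.predict("http://example.org/football/results"))
```

Because nothing is downloaded, prediction costs only a string split and a few dictionary lookups, which is the speed advantage the abstract points to.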

    Understanding the Web

    The World Wide Web is one of the most widely used information resources. Understanding the web better will enable us to benefit more from it. In this thesis we develop techniques to learn properties of web pages, such as language and topic, using only their URLs. Furthermore, we compare and evaluate web page sampling algorithms to learn about web properties such as content length, top-level domain, and outdegree distribution. In the first part of this thesis, we develop high-performance classifiers for web page language classification using only the URL of web pages. We make a comprehensive study of features and algorithms and test the performance of our classifiers on various real data sets. For language classification, the quality of our URL-based classifiers rivals that of classifiers based on content. Language classification from URLs is useful when the content of the web page is not available and when classification speed is important. Language classifiers based on URLs can be used by crawlers of general and language-specific search engines to avoid wasting bandwidth. In the second part of this thesis, we investigate whether web page topic classification can be done with the URL alone. We explore this problem along various dimensions, experimenting with different algorithms, features, data sets, and topics. URL-based topic classification is useful when the content of the web page is not available or is hidden in images. Topic classifiers based on URLs can be used to filter information and in applications such as topic-focused crawlers. Although content-based topic classifiers perform better, our URL-based topic classifiers work reasonably well and can be used as a signal to improve the performance of content-based classifiers. In the third part of this thesis, we compare state-of-the-art web page sampling algorithms and analyze the samples they return using web properties such as content length, top-level domain, and outdegree distribution. We discuss the strengths and weaknesses of each algorithm and propose improvements based on experimental results. The sampling algorithms we run on the web are influenced by the structure of the web. We investigate the relationship between the properties of the web and its structure. A uniform random sample of the web would be quite useful for learning about the composition and development of the web, as it is not possible to download all web pages to determine the properties of the web.

    PERCEPTION OF THE INFLUENCES OF TOURISM ON LOCAL CULTURE BY LOCAL PUBLIC (AN APPLICATION DIRECTED TO THE URGUP REGION)

    Tourists may bring their own cultures with them during their visits. Through their contact with tourists, the local people of the region being visited may be influenced by the different culture they encounter and may come to desire a different lifestyle. Such a situation can affect the local culture of tourist regions positively or negatively. The main objective of this study is to determine the effects of tourism on local culture. To this end, a literature review was conducted on the general effects of tourism on society, its sociological aspects, and its effects on local culture, and a questionnaire on the effects of tourism on local culture was developed as a data collection tool for the applied part of the study. The questionnaire was administered in face-to-face interviews to 697 local residents of the Ürgüp district of Nevşehir province. The personal characteristics of the participants and their levels of agreement with the questionnaire statements are reported as percentages and frequencies. The hypotheses were tested with the t-test and analysis of variance (ANOVA). In addition, the Tukey test was applied to identify which groups the observed differences came from. As a result, the effects of tourism on local culture were identified, the views of local residents on the subject were found to differ significantly according to their personal characteristics, and recommendations were developed accordingly.

    Recent Research on Database System Performance Advanced Research Topics in Databases

    The performance of databases depends heavily on how data is stored and accessed. Designing a data layout scheme that performs well under different query types and changing workloads, with concurrent queries submitted to the system, is a challenging task. In this report, we present recent research proposals that aim to overcome performance bottlenecks in databases.

    Web page language identification based on URLs

    Given only the URL of a web page, can we identify its language? This is the question that we examine in this paper. Such a language classifier is, for example, useful for crawlers of web search engines, which frequently try to satisfy certain language quotas. To determine the language of an uncrawled web page, they have to download it, which is wasteful if the page is not in the desired language. With URL-based language classifiers, these redundant downloads can be avoided. We apply a variety of machine learning algorithms to the language identification task and evaluate their performance in extensive experiments for five languages: English, French, German, Spanish, and Italian. Our best methods achieve an F-measure, averaged over all languages, of around .90 both for a random sample of 1,260 web pages from a large web crawl and for 25k pages from the ODP directory. For 5k pages of web search engine results we even achieve an F-measure of .96. The achieved recall for these collections is .93, .88, and .95, respectively. Two independent human evaluators performed considerably worse on the task, with an F-measure of .75 and a typical recall of a mere .67. Using only country-code top-level domains, such as .de or .fr, yields good precision, but a typical recall below .60 and an F-measure of around .68.
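The country-code baseline mentioned at the end of the abstract can be sketched in a few lines, together with the F-measure used to score it. This is a minimal sketch under assumptions: only .de and .fr come from the abstract, the other entries in `CCTLD_TO_LANGUAGE` are illustrative guesses, and the function names are hypothetical. The F-measure is taken to be the usual harmonic mean of precision and recall.

```python
from urllib.parse import urlsplit

# Illustrative mapping; only "de" and "fr" are named in the abstract.
CCTLD_TO_LANGUAGE = {
    "de": "German",
    "fr": "French",
    "es": "Spanish",
    "it": "Italian",
    "uk": "English",
}


def cctld_language(url):
    """Guess a page's language from its country-code TLD, or return None."""
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    # Generic TLDs such as .com or .org carry no language signal here.
    return CCTLD_TO_LANGUAGE.get(tld)


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(cctld_language("http://www.spiegel.de/politik"))
    print(cctld_language("http://www.example.com/de/index.html"))
    print(f_measure(1.0, 0.60))
```

The baseline makes no prediction for pages on generic domains such as .com, which is why its recall is low. With a recall of .60, even perfect precision gives an F-measure of only 1.2 / 1.6 = .75.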