
    News Items Extraction Using a Pattern Based Strategy

    As the internet continues to grow, the number of online news sites available on the web grows with it, and these sites provide a great deal of important information. Various text mining techniques can be applied to get more value out of this information, for example page-based clustering or keyword-based search. However, pages of online newspapers usually contain many news items on mutually unrelated topics, so page-based clustering yields suboptimal results. In this final assignment, the news items on web pages are extracted individually and mined separately using a pattern based strategy. The approach uses URL text and anchor text patterns to extract news links, and a crawler to follow those links and retrieve the full story of each news item. The analysis and testing phase shows that the pattern based strategy can extract full stories from news web pages, although not every news item's full story can be extracted. The extraction of news items reaches its optimum when the links on the input web pages are homogeneous. Keywords: pattern based strategy, URL text, anchor text, newspaper web, full story
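    A minimal sketch of the pattern based extraction idea, assuming a date-style URL pattern and a minimum anchor-text length as the heuristics (both are illustrative choices, not the patterns used in the thesis):

```python
import re
from html.parser import HTMLParser

# Heuristic URL pattern for news stories: a date-like path segment followed by a slug,
# e.g. /2012/05/17/some-headline.html. This is an illustrative assumption, not the
# thesis's actual pattern set.
NEWS_URL_PATTERN = re.compile(r"/20\d{2}/\d{2}/\d{2}/[\w-]+")

class NewsLinkExtractor(HTMLParser):
    """Collects (url, anchor_text) pairs whose URL matches the news-story pattern."""

    def __init__(self):
        super().__init__()
        self._current_href = None
        self._current_text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if NEWS_URL_PATTERN.search(href):
                self._current_href = href
                self._current_text = []

    def handle_data(self, data):
        if self._current_href is not None:
            self._current_text.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            anchor_text = " ".join(t for t in self._current_text if t)
            # Anchor-text heuristic: headlines are usually several words long.
            if len(anchor_text.split()) >= 3:
                self.links.append((self._current_href, anchor_text))
            self._current_href = None

# Usage: feed the raw HTML of a news front page, then crawl each extracted
# link to fetch the full story (the crawler itself is omitted here).
parser = NewsLinkExtractor()
parser.feed('<a href="/2012/05/17/flood-hits-city.html">Flood hits city centre</a>')
print(parser.links)
```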

    A Taxonomy of Hyperlink Hiding Techniques

    Hidden links are designed solely for search engines rather than visitors. To obtain high search engine rankings, link hiding techniques are typically used for the profit of black-market industries, such as illicit game servers, false medical services, illegal gambling, and other less reputable but high-profit businesses. This paper investigates hyperlink hiding techniques on the Web and gives a detailed taxonomy. We believe the taxonomy can help in developing appropriate countermeasures. A study of 5,583,451 Chinese sites' home pages indicates that link hiding techniques are very prevalent on the Web. We also explored Google's attitude towards link hiding spam by analyzing the PageRank values of the related links. The results show that more should be done to penalize hidden link spam. Comment: 12 pages, 2 figures
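    A toy illustration of how a crawler might flag a few of the hiding tricks such a taxonomy covers; the rule names and regular expressions below are assumptions for illustration, not the paper's actual categories:

```python
import re

# A few common hiding tricks, expressed as inline-style checks. The rules and their
# names are illustrative assumptions rather than the paper's full taxonomy.
HIDING_RULES = {
    "invisible_style": re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I),
    "zero_size": re.compile(r"(width|height|font-size)\s*:\s*0(px)?\b", re.I),
    "offscreen": re.compile(r"(left|top)\s*:\s*-\d{3,}px", re.I),
}

def classify_hidden_link(anchor_html: str) -> list[str]:
    """Return the names of hiding rules triggered by the anchor's inline style."""
    style_match = re.search(r'style\s*=\s*"([^"]*)"', anchor_html, re.I)
    style = style_match.group(1) if style_match else ""
    matched = [name for name, rule in HIDING_RULES.items() if rule.search(style)]
    # An anchor with no visible text is another common hiding trick.
    if re.search(r">\s*</a>", anchor_html):
        matched.append("empty_anchor_text")
    return matched

print(classify_hidden_link('<a href="http://spam.example" style="display:none">casino</a>'))
```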

    Setting per-field normalisation hyper-parameters for the named-page finding search task

    Per-field normalisation has been shown to be effective for Web search tasks, e.g. named-page finding. However, per-field normalisation also suffers from having hyper-parameters to tune on a per-field basis. In this paper, we argue that the purpose of per-field normalisation is to adjust the linear relationship between field length and term frequency. We experiment with standard Web test collections, using three document fields, namely the body of the document, its title, and the anchor text of its incoming links. From our experiments, we find that, across different collections, the linear correlation values given by the optimised hyper-parameter settings are proportional to the maximum negative linear correlation. Based on this observation, we devise an automatic method for setting the per-field normalisation hyper-parameter values without the use of relevance assessments for tuning. According to the evaluation results, this method is shown to be effective for the body and title fields. In addition, the difficulty in setting the per-field normalisation hyper-parameter for the anchor text field is explained.
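    For reference, per-field normalisation in the DFR framework (Normalisation 2 applied to each field, then combined with field weights) can be sketched as follows; the field names, weights, and hyper-parameter values are made-up example inputs, not the settings derived in the paper:

```python
import math

def per_field_normalised_tf(tf, field_length, avg_field_length, c):
    """DFR Normalisation 2 applied to one field: tfn = tf * log2(1 + c * avg_l / l).
    c is the per-field hyper-parameter that the paper proposes to set automatically."""
    return tf * math.log2(1.0 + c * avg_field_length / field_length)

def combined_tfn(field_stats, weights, hyper_params):
    """Weighted sum of the per-field normalised frequencies, as in a PL2F/BM25F-style
    field combination. The field names and example values are illustrative."""
    return sum(
        weights[f] * per_field_normalised_tf(tf, l, avg_l, hyper_params[f])
        for f, (tf, l, avg_l) in field_stats.items()
    )

# Example inputs: per field, (term frequency, field length, average field length).
stats = {"body": (3, 850, 900.0), "title": (1, 7, 8.0), "anchor": (2, 15, 12.0)}
print(combined_tfn(stats,
                   {"body": 1.0, "title": 2.0, "anchor": 1.5},
                   {"body": 1.0, "title": 10.0, "anchor": 10.0}))
```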

    Ensemble clustering for result diversification

    This paper describes the participation of the University of Twente in the Web track of TREC 2012. Our baseline approach uses the Mirex toolkit, an open source tool that sequentially scans all the documents. For result diversification, we experimented with improving the quality of clusters through ensemble clustering. We combined clusters obtained by different clustering methods (such as LDA and K-means) and clusters obtained by using different types of data (such as document text and anchor text). Our two-layer ensemble run performed better than the LDA-based diversification and also better than a non-diversification run.
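    A minimal sketch of the ensemble idea, assuming a simple co-association matrix over the base clusterings and threshold-based consensus grouping (the paper's actual two-layer combination may differ):

```python
import numpy as np

def co_association(labelings):
    """Fraction of base clusterings in which each pair of documents shares a cluster."""
    labelings = np.asarray(labelings)          # shape: (n_clusterings, n_docs)
    n = labelings.shape[1]
    co = np.zeros((n, n))
    for labels in labelings:
        co += (labels[:, None] == labels[None, :]).astype(float)
    return co / len(labelings)

def consensus_clusters(co, threshold=0.5):
    """Group documents whose co-association exceeds the threshold (connected components)."""
    n = co.shape[0]
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if co[i, j] >= threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]

# Base clusterings could come from LDA topics, K-means on document text, and
# K-means on anchor text; here they are small made-up label vectors.
base = [[0, 0, 1, 1, 2], [0, 0, 1, 2, 2], [1, 1, 0, 0, 2]]
print(consensus_clusters(co_association(base)))
```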

    University of Glasgow at WebCLEF 2005: experiments in per-field normalisation and language specific stemming

    We participated in the WebCLEF 2005 monolingual task. In this task, a search system aims to retrieve relevant documents from a multilingual corpus of Web documents from Web sites of European governments. Both the documents and the queries are written in a wide range of European languages. A challenge in this setting is to detect the language of documents and topics, and to process them appropriately. We develop a language specific technique for applying the correct stemming approach, as well as for removing the correct stopwords from the queries. We represent documents using three fields, namely content, title, and anchor text of incoming hyperlinks. We use a technique called per-field normalisation, which extends the Divergence From Randomness (DFR) framework, to normalise the term frequencies and to combine them across the three fields. We also employ the length of the URL path of Web documents. The ranking is based on combinations of both the language specific stemming, if applied, and the per-field normalisation. We use our Terrier platform for all our experiments. The overall performance of our techniques is outstanding, achieving the overall top four performing runs, as well as the top performing run without metadata in the monolingual task. The best run only uses per-field normalisation, without applying stemming.
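    A small sketch of the language specific processing step, assuming the document's language code is already known and using NLTK's Snowball stemmers and stopword lists as stand-ins for the system's actual components:

```python
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords   # requires: nltk.download("stopwords")

# Map language codes to Snowball stemmer / stopword-list names. The mapping, and the
# assumption that language detection has already happened, are illustrative only.
LANGUAGES = {"nl": "dutch", "de": "german", "es": "spanish", "en": "english"}

def language_specific_terms(tokens, lang_code):
    """Remove the language's stopwords and apply its Snowball stemmer; fall back to
    no stemming when the language is unsupported (the best run in fact skips stemming)."""
    name = LANGUAGES.get(lang_code)
    if name is None:
        return tokens
    stop = set(stopwords.words(name))
    stemmer = SnowballStemmer(name)
    return [stemmer.stem(t) for t in tokens if t.lower() not in stop]

print(language_specific_terms(["de", "europese", "overheden"], "nl"))
```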

    University of Twente @ TREC 2009: Indexing half a billion web pages

    This report presents results for the TREC 2009 adhoc task, the diversity task, and the relevance feedback task. We present ideas for unsupervised tuning of search systems, an approach for spam removal, and the use of categories and query log information for diversifying search results.
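    As a rough illustration of category-based diversification (one of the ideas mentioned), a simple round-robin over document categories could look like the following; the function and category assignments are hypothetical stand-ins, not the report's actual method:

```python
from collections import defaultdict

def diversify_by_category(ranked_docs, categories, k=10):
    """Round-robin over categories: repeatedly take the best remaining document
    from each category in turn, preserving the original relevance order within
    a category. A simplified stand-in for category/query-log based diversification."""
    buckets = defaultdict(list)
    for doc in ranked_docs:                 # ranked_docs is already relevance-ordered
        buckets[categories.get(doc, "other")].append(doc)
    result = []
    while len(result) < k and any(buckets.values()):
        for cat in list(buckets):
            if buckets[cat]:
                result.append(buckets[cat].pop(0))
                if len(result) == k:
                    break
    return result

docs = ["d1", "d2", "d3", "d4", "d5"]
cats = {"d1": "sports", "d2": "sports", "d3": "news", "d4": "news", "d5": "tech"}
print(diversify_by_category(docs, cats, k=4))
```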