
    Web knowledge bases

    Knowledge is key to natural language understanding. References to specific people, places and things in text are crucial to resolving ambiguity and extracting meaning. Knowledge Bases (KBs) codify this information for automated systems, enabling applications such as entity-based search and question answering. This thesis explores the idea that sites on the web may act as a KB, even if that is not their primary intent. Dedicated KBs like Wikipedia are a rich source of entity information, but are built and maintained at an ongoing cost in human effort. As a result, they are generally limited in the breadth and depth of knowledge they index about entities. Web knowledge bases offer a distributed solution to the problem of aggregating entity knowledge. Social networks aggregate content about people, news sites describe events with tags for organizations and locations, and a diverse assortment of web directories aggregate statistics and summaries for long-tail entities notable within niche movie, music and sports domains. We aim to develop the potential of these resources for both web-centric entity Information Extraction (IE) and structured KB population. We first investigate the problem of Named Entity Linking (NEL), where systems must resolve ambiguous mentions of entities in text to their corresponding nodes in a structured KB. We demonstrate that entity disambiguation models derived from inbound web links to Wikipedia can complement, and in some cases completely replace, resources typically derived from the KB itself. Building on this work, we observe that any page on the web which reliably disambiguates inbound web links may act as an aggregation point for entity knowledge. To uncover these resources, we formalize the task of Web Knowledge Base Discovery (KBD) and develop a system to automatically infer the existence of KB-like endpoints on the web. While extending our framework to multiple KBs increases the breadth of available entity knowledge, we must still consolidate references to the same entity across different web KBs. We investigate this task of Cross-KB Coreference Resolution (KB-Coref) and develop models for efficiently clustering coreferent endpoints across web-scale document collections. Finally, assessing the gap between unstructured web knowledge resources and those of a typical KB, we develop a neural machine translation approach which transforms entity knowledge between unstructured textual mentions and traditional KB structures. The web has great potential as a source of entity knowledge. In this thesis we aim first to discover, then distill and finally transform this knowledge into forms which will be useful in downstream language understanding tasks.
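    As a rough illustration of the link-derived disambiguation models described above, the sketch below builds the kind of anchor-text "commonness" prior that such models are typically seeded with: how often a given mention string links to each candidate KB page. This is a minimal sketch in Python; the toy link data and function names are hypothetical illustrations, not the thesis system.

    from collections import Counter, defaultdict

    # Toy corpus of (anchor text, target page) pairs harvested from inbound
    # web links; in the thesis setting these are links into Wikipedia or
    # other web KB endpoints.
    links = [
        ("jaguar", "Jaguar_(animal)"),
        ("jaguar", "Jaguar_Cars"),
        ("jaguar", "Jaguar_Cars"),
        ("big cat", "Jaguar_(animal)"),
    ]

    # Count how often each anchor text points at each target page.
    counts: dict[str, Counter] = defaultdict(Counter)
    for anchor, target in links:
        counts[anchor.lower()][target] += 1

    def link_probability(mention: str, candidate: str) -> float:
        """Estimate P(candidate page | mention string) from link counts."""
        c = counts[mention.lower()]
        total = sum(c.values())
        return c[candidate] / total if total else 0.0

    def disambiguate(mention: str) -> str | None:
        """Resolve a mention to its most frequently linked page."""
        c = counts[mention.lower()]
        return c.most_common(1)[0][0] if c else None

    print(disambiguate("jaguar"))                         # Jaguar_Cars
    print(link_probability("jaguar", "Jaguar_(animal)"))  # 0.333...

    In practice such link-derived priors are combined with contextual features, but even on their own they form a strong baseline for NEL.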

    Modeling Visual Rhetoric and Semantics in Multimedia

    Recent advances in machine learning have enabled computer vision algorithms to model complicated visual phenomena with accuracies unthinkable a mere decade ago. Their high performance on a wide range of vision-related tasks has enabled computer vision researchers to move beyond traditional visual recognition problems to tasks requiring higher-level image understanding. However, most computer vision research still focuses on describing what images, text, or other media literally portray. In contrast, in this dissertation we focus on learning how and why such content is portrayed. Rather than viewing media for its content alone, we recast the problem as understanding visual communication and visual rhetoric. For example, the same content may be portrayed in different ways in order to present the story the author wishes to convey. We thus seek to model not only the content of the media, but also its authorial intent and latent messaging. Understanding how and why visual content is portrayed a certain way requires understanding higher-level abstract semantic concepts which are themselves latent within visual media. By latent, we mean the concept is not readily visually accessible within a single image (e.g. right- vs. left-leaning political bias), in contrast to explicit visual semantic concepts such as objects. Specifically, we study the problems of modeling photographic style (how professional photographers portray their subjects), understanding visual persuasion in image advertisements, modeling political bias in multimedia (image and text) news articles, and learning cross-modal semantic representations. While most past research in vision and natural language processing studies the case where visual content and paired text are highly aligned (as in image captions), we target the case where each modality conveys complementary information to tell a larger story. We particularly focus on learning cross-modal representations from multimedia exhibiting weak alignment between the image and text modalities. We present a variety of techniques which improve the modeling of multimedia rhetoric in real-world data and enable more robust artificially intelligent systems.
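    The cross-modal representation learning mentioned above is commonly formulated as a two-tower contrastive objective: paired image and text embeddings are pulled together in a shared space while mismatched pairs are pushed apart. The Python (PyTorch) sketch below illustrates that generic formulation only; the encoders, dimensions and toy batch are illustrative assumptions, not the dissertation's model.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TwoTowerEncoder(nn.Module):
        """Projects image and text features into a shared embedding space.

        Illustrative stand-in: a real system would use CNN/transformer
        encoders; here each modality is a pre-extracted feature vector.
        """
        def __init__(self, img_dim: int, txt_dim: int, emb_dim: int = 128):
            super().__init__()
            self.img_proj = nn.Linear(img_dim, emb_dim)
            self.txt_proj = nn.Linear(txt_dim, emb_dim)

        def forward(self, img_feats, txt_feats):
            z_img = F.normalize(self.img_proj(img_feats), dim=-1)
            z_txt = F.normalize(self.txt_proj(txt_feats), dim=-1)
            return z_img, z_txt

    def contrastive_loss(z_img, z_txt, temperature: float = 0.07):
        """Symmetric InfoNCE: true image/text pairs should outscore mismatches."""
        logits = z_img @ z_txt.t() / temperature
        targets = torch.arange(len(z_img))
        return (F.cross_entropy(logits, targets)
                + F.cross_entropy(logits.t(), targets)) / 2

    # Toy batch of 8 weakly aligned image/text feature pairs.
    model = TwoTowerEncoder(img_dim=512, txt_dim=300)
    loss = contrastive_loss(*model(torch.randn(8, 512), torch.randn(8, 300)))
    loss.backward()

    Under weak alignment, the pairing signal is noisy, which is one reason such objectives are typically trained over large batches of article-level image/text pairs rather than caption-level pairs.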

    Main Content Extraction from Local Government Websites Using a Template-Based Approach and Naïve-Bayes Classification

    The Internet and the World Wide Web offer capabilities that can significantly extend what governments are able to do: local governments can distribute information, and citizens can receive up-to-date information about local government affairs, cheaply and conveniently. This is commonly referred to as E-Government. The use of E-Government in Indonesia is fully supported by instruction of the President of the Republic of Indonesia. Egovbench is an application that monitors and measures the performance of the official websites and social media accounts of local governments in Indonesia. To do this, Egovbench must extract information from every web page of every official local government website. Each web page has a main content: a section, segment or block of text or multimedia that is unique to a single page and does not belong to a landing page or homepage. Important information about local governance generally lies within the main content, so a web content extractor is needed to retrieve it. In this research, we combine two existing approaches to main content extraction: a template-based approach and a machine learning approach using a Naïve-Bayes classifier. Previous research has generally used only one type of approach, either template-based or machine learning. The contribution of this research is an evaluation of main content extraction using the combination of the template-based approach and Naïve-Bayes classification. The main challenge is that the structure of local government web pages can hamper the template-based approach, in particular the effects of Content Management Systems (CMS) on page structure. The results show that combining the two approaches yields more accurate predictions of web page categories, reaching 68% accuracy compared to 59% for the approach currently used in Egovbench.
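    Below is a minimal sketch of the Naïve-Bayes half of such a hybrid pipeline, assuming bag-of-words features over page blocks that the template-based step has already segmented (that segmentation step is omitted here); the training snippets, labels and test block are invented for illustration.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Hypothetical training blocks from local government pages, labeled as
    # main content vs. template boilerplate (menus, footers, copyright).
    blocks = [
        "Walikota meresmikan pembangunan jembatan baru di kecamatan ...",
        "Pemerintah daerah mengumumkan jadwal pelayanan publik terbaru ...",
        "Beranda | Profil | Berita | Kontak",
        "Copyright 2019 Pemerintah Kabupaten. All rights reserved.",
    ]
    labels = ["main", "main", "boilerplate", "boilerplate"]

    # Bag-of-words features feeding a multinomial Naive Bayes classifier.
    clf = make_pipeline(CountVectorizer(), MultinomialNB())
    clf.fit(blocks, labels)

    new_block = "Pemerintah daerah meresmikan pelayanan publik baru ..."
    print(clf.predict([new_block])[0])  # -> "main"

    In the combined system, blocks that the template-based pass cannot resolve would be routed to a classifier of this kind.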