
    Boilerplate Removal using a Neural Sequence Labeling Model

    Full text link
    The extraction of main content from web pages is an important task for numerous applications, ranging from usability aspects, like reader views for news articles in web browsers, to information retrieval and natural language processing. Existing approaches fall short because they rely on large numbers of hand-crafted features for classification. This results in models that are tailored to a specific distribution of web pages, e.g. from a certain time frame, but lack generalization power. We propose a neural sequence labeling model that does not rely on any hand-crafted features but takes only the HTML tags and words that appear in a web page as input. This allows us to present a browser extension which highlights the content of arbitrary web pages directly within the browser using our model. In addition, we create a new, more current dataset to show that our model is able to adapt to changes in the structure of web pages and outperform the state-of-the-art model. Comment: WWW'20 demo paper
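
    The abstract does not spell out the architecture, so the following is a minimal, hypothetical sketch of sequence labeling over HTML tokens in PyTorch: each tag or word token is embedded, passed through a bidirectional LSTM, and tagged as main content or boilerplate. All names and dimensions are illustrative assumptions, not the authors' model.

    import torch
    import torch.nn as nn

    class BoilerplateTagger(nn.Module):
        """Toy token-level tagger: embed -> BiLSTM -> per-token logits."""
        def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, num_labels=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                                bidirectional=True)
            self.classify = nn.Linear(2 * hidden_dim, num_labels)

        def forward(self, token_ids):              # (batch, seq_len)
            states, _ = self.lstm(self.embed(token_ids))
            return self.classify(states)           # (batch, seq_len, num_labels)

    # Tag a random toy sequence; label 1 = main content, 0 = boilerplate.
    tagger = BoilerplateTagger(vocab_size=10_000)
    logits = tagger(torch.randint(0, 10_000, (1, 50)))
    labels = logits.argmax(dim=-1)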

    Revising Knowledge Discovery for Object Representation with Spatio-Semantic Feature Integration

    Get PDF
    Web objects have become increasingly popular in large social networks. Classifying and representing multimedia objects is a necessary step in multimedia information retrieval: these web objects must be indexed and organized for convenient browsing and search, and to effectively reveal interesting patterns. For all these tasks, classifying the web objects into manipulable semantic categories is an essential procedure. One important issue for the classification of objects is the representation of images. To perform supervised classification tasks, knowledge is extracted from unlabeled objects through unsupervised learning. In order to represent the images in a more meaningful and effective way than the basic Bag-of-Words (BoW) model, a novel image representation model called Bag-of-Visual-Phrases (BoP) is used. In this model, visual words are obtained using hierarchical clustering, and visual phrases are generated by a vector classifier over the visual words. To obtain spatio-semantic correlation knowledge, frequently co-occurring pairs are calculated from the visual vocabulary. After the object representation step, the tags, comments, and descriptions of web objects are separated using a maximum-likelihood method. The spatial and semantic discriminative power of image features can be enhanced via this BoP model and likelihood method. DOI: 10.17762/ijritcc2321-8169.15065
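
    As a rough illustration of the BoP idea (not the paper's exact pipeline), the sketch below clusters stand-in local descriptors into visual words with hierarchical clustering, then keeps frequently co-occurring word pairs as visual phrases; the data, image grouping, and threshold are all placeholder assumptions.

    from collections import Counter
    from itertools import combinations
    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    rng = np.random.default_rng(0)
    descriptors = rng.random((200, 16))          # stand-in local image features

    # Hierarchical clustering of descriptors into 10 visual words.
    words = fcluster(linkage(descriptors, method="ward"), t=10, criterion="maxclust")

    # Treat every 20 descriptors as one "image"; count word-pair co-occurrences.
    pair_counts = Counter()
    for image_words in np.array_split(words, 10):
        for pair in combinations(sorted(set(image_words)), 2):
            pair_counts[pair] += 1

    phrases = [pair for pair, n in pair_counts.items() if n >= 3]  # frequent pairs
    print(phrases[:5])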

    Retrieval Models for Genre Classification

    Get PDF
    Genre provides a characterization of a document with respect to its form or functional trait. Genre is orthogonal to topic, rendering genre information a powerful filter technology for information seekers in digital libraries. However, an efficient means for genre classification is an open and controversially discussed issue. This paper gives an overview and presents new results related to automatic genre classification of text documents. We present a comprehensive survey which contrasts the genre retrieval models that have been developed for Web and non-Web corpora. With the concept of genre-specific core vocabularies, the paper provides an original contribution related to computational aspects and classification performance of genre retrieval models: we show how such vocabularies are acquired automatically and introduce new concentration measures that quantify the vocabulary distribution in a sensible way. Based on these findings we construct lightweight genre retrieval models and evaluate their discriminative power and computational efficiency. The presented concepts go beyond the existing utilization of vocabulary-centered, genre-revealing features and open new possibilities for the construction of genre classifiers that operate in real time.
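
    The paper's own concentration measures are not given in this abstract; the sketch below illustrates the general idea with a stand-in measure, the share of a term's occurrences falling in its most frequent genre, used to extract a genre-specific core vocabulary from toy documents.

    from collections import Counter

    genre_docs = {
        "news": ["the minister said the report was due", "markets fell today"],
        "faq":  ["how do i reset my password", "how do i contact support"],
    }

    # Per-genre term frequencies.
    genre_tf = {g: Counter(w for doc in docs for w in doc.split())
                for g, docs in genre_docs.items()}

    def concentration(term):
        """Share of the term's occurrences inside its most frequent genre."""
        counts = [tf[term] for tf in genre_tf.values()]
        return max(counts) / sum(counts)

    def core_vocabulary(genre, threshold=0.8):
        """Terms whose occurrences are concentrated in this genre."""
        best = {w for w in genre_tf[genre]
                if genre_tf[genre][w] == max(tf[w] for tf in genre_tf.values())}
        return sorted(w for w in best if concentration(w) >= threshold)

    for genre in genre_docs:
        print(genre, core_vocabulary(genre))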

    Library of medium-resolution fiber optic echelle spectra of F, G, K, and M field dwarfs to giant stars

    Get PDF
    We present a library of Penn State Fiber Optic Echelle (FOE) observations of a sample of field stars with spectral types F to M and luminosity classes V to I. The spectral coverage is from 3800 AA to 10000 AA with a nominal resolving power of 12,000. These spectra include many of the spectral lines most widely used as optical and near-infrared indicators of chromospheric activity, such as the Balmer lines (H_alpha, H_beta), Ca II H & K, the Mg I b triplet, Na I D_{1} and D_{2}, He I D_{3}, and the Ca II IRT lines. There are also a large number of photospheric lines, which can likewise be affected by chromospheric activity, and temperature-sensitive photospheric features such as TiO bands. The spectra have been compiled with the goal of providing a set of standards observed at medium resolution. We have extensively used such data for the study of active-chromosphere stars by applying a spectral subtraction technique. However, the data set presented here can also be utilized in a wide variety of ways, ranging from radial velocity templates to the study of variable stars and stellar population synthesis. This library can also be used for spectral classification purposes and the determination of atmospheric parameters (T_eff, log{g}, [Fe/H]). A digital version of all the fully reduced spectra is available via ftp and the World Wide Web (WWW) in FITS format. Comment: Latex file with 17 pages, 4 figures. Full postscript (text and figures) available at http://www.ucm.es/info/Astrof/fgkmsl/FOEfgkmsl.html. To be published in ApJ
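
    As a rough illustration of the spectral subtraction technique mentioned above, the sketch below subtracts a synthetic inactive reference spectrum from a synthetic active-star spectrum around H_alpha, leaving the chromospheric excess emission; all line parameters are made-up stand-ins for real library spectra.

    import numpy as np

    wavelength = np.linspace(6540.0, 6585.0, 500)       # window around H_alpha (AA)

    def absorption(depth, center=6562.8, width=1.5):
        """Continuum-normalized spectrum with one Gaussian absorption line."""
        return 1.0 - depth * np.exp(-0.5 * ((wavelength - center) / width) ** 2)

    reference = absorption(depth=0.6)                   # inactive template star
    active = absorption(depth=0.6) + 0.25 * np.exp(     # same line, partly filled
        -0.5 * ((wavelength - 6562.8) / 0.8) ** 2)      # in by chromospheric emission

    excess = active - reference                         # subtracted spectrum
    step = wavelength[1] - wavelength[0]
    equivalent_width = excess.sum() * step              # excess emission EW (AA)
    print(f"excess H_alpha equivalent width ~ {equivalent_width:.2f} AA")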

    Satellite Image Spoofing: Creating Remote Sensing Dataset with Generative Adversarial Networks (Short Paper)

    Get PDF
    The rise of Artificial Intelligence (AI) has brought up both opportunities and challenges for today's evolving GIScience. Its ability in image classification, object detection and feature extraction has been frequently praised. However, it may also be applied to falsifying geospatial data. To demonstrate the thrilling power of AI, this research explored the potential of deep learning algorithms to capture geographic features and create fake satellite images according to the learned 'sense'. Specifically, Generative Adversarial Networks (GANs) are used to capture geographic features of a certain place from a group of web maps and satellite images, and transfer the features to another place. Corvallis is selected as the study area, and fake datasets with a 'learned' style from three big cities (i.e. New York City, Seattle and Beijing) are generated through CycleGAN. The empirical results show that GANs can 'remember' a certain 'sense of place' and further apply that 'sense' to another place. With this paper, we would like to raise both the public's and GIScientists' awareness of the potential occurrence of fake satellite images and its impact on various geospatial applications, such as environmental monitoring, urban planning, and land use development.
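
    CycleGAN's core trick, hinted at above, is a cycle-consistency loss between two mappings trained in opposite directions. The PyTorch sketch below shows that loss with tiny placeholder generators; these are illustrative and bear no relation to the networks actually trained in the paper.

    import torch
    import torch.nn as nn

    def make_generator():
        """Tiny stand-in generator; real CycleGANs use deep ResNet/U-Net nets."""
        return nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                             nn.Conv2d(16, 3, 3, padding=1))

    G = make_generator()                           # maps domain A -> B
    F = make_generator()                           # maps domain B -> A
    l1 = nn.L1Loss()

    a = torch.rand(1, 3, 64, 64)                   # e.g. a map tile of place A
    b = torch.rand(1, 3, 64, 64)                   # e.g. a satellite tile of B

    # Cycle consistency: translating there and back should reconstruct the input.
    cycle_loss = l1(F(G(a)), a) + l1(G(F(b)), b)
    cycle_loss.backward()                          # combined with adversarial terms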

    Malware distributions and graph structure of the Web

    Full text link
    Knowledge about the graph structure of the Web is important for understanding this complex socio-technical system and for devising proper policies supporting its future development. Knowledge about the differences between clean and malicious parts of the Web is important for understanding potential threats to its users and for devising protection mechanisms. In this study, we apply data science methods to a large crawl of surface and deep Web pages with the aim of increasing such knowledge. To accomplish this, we answer the following questions. Which theoretical distributions explain important local characteristics and network properties of websites? How do these characteristics and properties differ between clean and malicious (malware-affected) websites? What is the predictive power of local characteristics and network properties for classifying malware websites? To the best of our knowledge, this is the first large-scale study describing the differences in global properties between malicious and clean parts of the Web. In other words, our work builds on and bridges the gap between Web science, which tackles large-scale graph representations, and Web cyber security, which is concerned with malicious activities on the Web. The results presented herein can also help antivirus vendors in devising approaches to improve their detection algorithms.
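
    To make the third question concrete, the sketch below computes a few local properties of nodes in a toy host graph and feeds them to an off-the-shelf classifier; the graph, the labels, and the feature choice are stand-in assumptions, not the study's actual data or model.

    import networkx as nx
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    multigraph = nx.scale_free_graph(200, seed=1)   # toy stand-in for a host graph
    graph = nx.DiGraph(multigraph)                  # collapse parallel edges
    pagerank = nx.pagerank(graph)

    # One feature row per website: in-degree, out-degree, PageRank.
    features = np.array([[graph.in_degree(n), graph.out_degree(n), pagerank[n]]
                         for n in graph])
    labels = np.random.default_rng(1).integers(0, 2, len(features))  # fake labels

    clf = LogisticRegression(max_iter=1000).fit(features, labels)
    print("training accuracy:", clf.score(features, labels))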

    Feature Extraction from Degree Distribution for Comparison and Analysis of Complex Networks

    Full text link
    The degree distribution is an important characteristic of complex networks. In many data analysis applications, networks must be represented as fixed-length feature vectors, and therefore feature extraction from the degree distribution is a necessary step. Moreover, many applications need a similarity function for the comparison of complex networks based on their degree distributions. Such a similarity measure has many applications, including classification and clustering of network instances, evaluation of network sampling methods, anomaly detection, and the study of epidemic dynamics. The existing methods are unable to effectively capture the similarity of degree distributions, particularly when the corresponding networks have different sizes. Based on our observations about the structure of the degree distributions in networks over time, we propose a feature extraction method and a similarity function for the degree distributions in complex networks. We propose to calculate the feature values based on the mean and standard deviation of the node degrees in order to decrease the effect of the network size on the extracted features. The proposed method is evaluated using different artificial and real network datasets, and it outperforms the state-of-the-art methods with respect to the accuracy of the distance function and the effectiveness of the extracted features. Comment: arXiv admin note: substantial text overlap with arXiv:1307.362
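
    A minimal sketch of the idea described above (not the authors' exact method): standardize node degrees by their mean and standard deviation to reduce sensitivity to network size, then bin them into a fixed-length histogram that can be compared across networks of different sizes.

    import networkx as nx
    import numpy as np

    def degree_features(graph, bins=8, value_range=(-2.0, 6.0)):
        """Fixed-length, size-robust feature vector from the degree distribution."""
        degrees = np.array([d for _, d in graph.degree()], dtype=float)
        z = (degrees - degrees.mean()) / degrees.std()   # mean/std standardization
        hist, _ = np.histogram(z, bins=bins, range=value_range, density=True)
        return hist

    small = degree_features(nx.barabasi_albert_graph(500, 3, seed=0))
    large = degree_features(nx.barabasi_albert_graph(5000, 3, seed=0))
    print("L1 distance:", np.abs(small - large).sum())   # small despite 10x size gap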