
    TopSig: Topology Preserving Document Signatures

    Performance comparisons between File Signatures and Inverted Files for text retrieval have previously shown several significant shortcomings of file signatures relative to inverted files. The inverted file approach underpins most state-of-the-art search engine algorithms, such as Language and Probabilistic models. It has been widely accepted that traditional file signatures are inferior alternatives to inverted files. This paper describes TopSig, a new approach to the construction of file signatures. Many advances in semantic hashing and dimensionality reduction have been made in recent times, but so far they have not been linked to general-purpose, signature-file-based search engines. This paper introduces a different signature file approach that builds upon and extends these recent advances. We demonstrate significant improvements in the performance of signature-file-based indexing and retrieval, performance that is comparable to that of state-of-the-art inverted-file-based systems, including Language models and BM25. These findings suggest that file signatures offer a viable alternative to inverted files in suitable settings, and from a theoretical perspective they position the file signature model in the class of Vector Space retrieval models. Comment: 12 pages, 8 figures, CIKM 201
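
    For illustration, the sketch below shows the general idea of random-projection bit signatures: each term contributes a pseudo-random ±1 vector, the sum is reduced to sign bits, and documents are compared by Hamming distance. It is not the TopSig algorithm itself; the signature width, hashing scheme, and absence of term weighting are illustrative assumptions.

```python
# Minimal sketch of a random-projection document signature (illustrative only,
# not TopSig): terms map to pseudo-random +/-1 vectors, the sum is reduced to
# sign bits, and documents are compared by Hamming distance.
import hashlib
import numpy as np

SIG_BITS = 64  # signature width; an arbitrary illustrative choice

def term_vector(term: str, bits: int = SIG_BITS) -> np.ndarray:
    """Deterministic pseudo-random +/-1 vector for a term, seeded by its hash."""
    seed = int.from_bytes(hashlib.sha1(term.encode()).digest()[:4], "big")
    rng = np.random.default_rng(seed)
    return rng.choice([-1.0, 1.0], size=bits)

def document_signature(text: str) -> np.ndarray:
    """Sum the term vectors of a document and keep only the sign of each dimension."""
    acc = np.zeros(SIG_BITS)
    for term in text.lower().split():
        acc += term_vector(term)
    return (acc >= 0).astype(np.uint8)  # one bit per dimension

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    return int(np.count_nonzero(a != b))

sig1 = document_signature("signature files for text retrieval")
sig2 = document_signature("inverted files for text retrieval")
print(hamming_distance(sig1, sig2))  # smaller distance suggests more similar documents
```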

    A Survey to Fix the Threshold and Implementation for Detecting Duplicate Web Documents

    The drastic growth of the information accessible on the World Wide Web has made the use of automated tools to locate information resources of interest, and to track and analyze them, a necessity. Web mining is the branch of data mining that deals with the analysis of the World Wide Web. It draws on concepts from several areas, such as data mining, Internet technology, the World Wide Web and, more recently, the Semantic Web. Web mining can be defined as the procedure of discovering hidden yet potentially beneficial knowledge from the data accessible on the web. It comprises three sub-areas: web content mining, web structure mining, and web usage mining. Web content mining is the process of mining knowledge from web pages and other web objects. Web structure mining is the process of mining knowledge about the link structure connecting web pages and other web objects. Web usage mining is the process of mining the usage patterns created by users accessing web pages. Search engine technology has developed alongside the World Wide Web, and search engines are the chief gateways for accessing information on the web. The ability to locate content of particular interest amid a huge heap of pages has made businesses beneficial and productive. Search engines respond to queries by employing web crawling, which populates an indexed repository of web pages. Crawlers construct a local repository of the segment of the web that they visit by navigating the web graph and retrieving pages. There are two main types of crawling: generic and focused. Generic crawlers crawl documents and links of diverse topics, while focused crawlers limit the pages they visit with the aid of previously obtained specialized knowledge. Systems that index, mine, and otherwise analyze pages (such as search engines) take their input from the repositories of web pages built by web crawlers. The rapid development of the Internet and the growing need to integrate heterogeneous data are accompanied by the problem of near-duplicate data. Even though near-duplicate pages are not bit-wise identical, they are remarkably similar. Duplicate and near-duplicate web pages increase index storage space, slow down serving or increase serving costs, and annoy users, causing significant problems for web search engines. It is therefore essential to design algorithms to detect such pages.
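
    As a rough illustration of near-duplicate detection, the sketch below compares documents by word shingles and Jaccard similarity against a fixed threshold. The shingle size and threshold are arbitrary placeholders, not values drawn from the survey.

```python
# Minimal sketch of near-duplicate detection via word shingling and a Jaccard
# similarity threshold (illustrative only; k and the threshold are arbitrary).
def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a | b) else 1.0

def near_duplicates(docs: dict, threshold: float = 0.8) -> list:
    """Return pairs of document ids whose shingle-set similarity meets the threshold."""
    sigs = {doc_id: shingles(text) for doc_id, text in docs.items()}
    ids = sorted(sigs)
    pairs = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if jaccard(sigs[a], sigs[b]) >= threshold:
                pairs.append((a, b))
    return pairs

pages = {
    "p1": "web mining is the branch of data mining that deals with the web",
    "p2": "web mining is the branch of data mining dealing with the web",
    "p3": "search engines respond to queries using an indexed repository",
}
print(near_duplicates(pages, threshold=0.5))  # p1 and p2 are near duplicates
```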

    Ground correlation investigation of thruster spacecraft interactions to be measured on the IAPS flight test

    Preliminary ground correlation testing has been conducted with an 8 cm mercury ion thruster and diagnostic instrumentation replicating, to a large extent, the IAPS flight test hardware, configuration, and electrical grounding/isolation. Thruster efflux deposition retained at 25 C was measured and characterized. Thruster ion efflux was characterized with retarding potential analyzers. Thruster-generated plasma currents, the spacecraft common (SCC) potential, and ambient plasma properties were evaluated with a spacecraft potential probe (SPP). All the measured thruster/spacecraft interactions, or their IAPS measurements, depend critically on the SCC potential, which can be controlled by a neutralizer ground switch and by operation of the SPP.

    An Analysis of C/C++ Datasets for Machine Learning-Assisted Software Vulnerability Detection

    As machine learning-assisted vulnerability detection research matures, it is critical to understand the datasets being used by existing papers. In this paper, we explore 7 C/C++ datasets and evaluate their suitability for machine learning-assisted vulnerability detection. We also present a new dataset, named Wild C, containing over 10.3 million individual open-source C/C++ files – a sufficiently large sample to be reasonably considered representative of typical C/C++ code. To facilitate comparison, we tokenize all of the datasets and perform the analysis at this level. We make three primary contributions. First, while all the datasets differ from our Wild C dataset, some do so to a greater degree. This includes divergence in file lengths and token usage frequency. Additionally, none of the datasets contain the entirety of the C/C++ vocabulary; these missing tokens account for up to 11% of all token usage. Second, we find that all the datasets contain duplication, some a significant amount. In the Juliet dataset, we describe augmentations of test cases that make the dataset susceptible to data leakage. This augmentation occurs so frequently that a random 80/20 split has roughly 58% overlap between the test and training data. Finally, we collect and process a large dataset of C code, named Wild C. This dataset is designed to serve as a representative sample of all C/C++ code and is the basis for our analyses.
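
    To illustrate the kind of leakage described above, the sketch below fingerprints token sequences by hashing and counts a test file as leaked when its fingerprint also appears on the training side of a random 80/20 split. The fingerprinting and split procedure are assumptions for illustration, not the paper's pipeline.

```python
# Minimal sketch of measuring train/test leakage caused by duplicate files
# (illustrative only; not the procedure used in the paper).
import hashlib
import random

def token_fingerprint(tokens: list) -> str:
    """Hash a token sequence so exact duplicates collapse to the same key."""
    return hashlib.sha256(" ".join(tokens).encode()).hexdigest()

def test_overlap_fraction(files: list, seed: int = 0) -> float:
    """Fraction of test files whose token sequence also appears in the training split."""
    rng = random.Random(seed)
    shuffled = files[:]
    rng.shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    train, test = shuffled[:cut], shuffled[cut:]
    train_keys = {token_fingerprint(t) for t in train}
    if not test:
        return 0.0
    leaked = sum(token_fingerprint(t) in train_keys for t in test)
    return leaked / len(test)

# Toy corpus with heavy duplication; duplicates that straddle the split count as leakage.
corpus = [["int", "main", "(", ")", "{", "}"]] * 8 + [["void", "f", "(", ")", ";"]] * 2
print(test_overlap_fraction(corpus))
```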

    Extensive Copy-Number Variation of Young Genes across Stickleback Populations

    MM received funding from the Max Planck innovation funds for this project. PGDF was supported by a Marie Curie European Reintegration Grant (proposal nr 270891). CE was supported by German Science Foundation grants (DFG, EI 841/4-1 and EI 841/6-1). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

    BagMinHash - Minwise Hashing Algorithm for Weighted Sets

    Minwise hashing has become a standard tool to calculate signatures which allow direct estimation of Jaccard similarities. While very efficient algorithms already exist for the unweighted case, the calculation of signatures for weighted sets is still a time-consuming task. BagMinHash is a new algorithm that can be orders of magnitude faster than the current state of the art without any particular restrictions or assumptions on weights or data dimensionality. Applied to the special case of unweighted sets, it represents the first efficient algorithm producing independent signature components. A series of tests finally verifies the new algorithm and also reveals limitations of other approaches published in the recent past. Comment: 10 pages, KDD 201
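
    For background, the sketch below shows classic unweighted minwise hashing: each signature component is the minimum of a seeded hash over the set's elements, and the fraction of matching components estimates the Jaccard similarity. It does not implement BagMinHash itself; the hash construction and signature length are illustrative assumptions.

```python
# Minimal sketch of classic (unweighted) minwise hashing for Jaccard estimation
# (background illustration only; not the BagMinHash algorithm for weighted sets).
import hashlib

def _h(element: str, seed: int) -> int:
    """Seeded 64-bit hash of an element."""
    return int.from_bytes(hashlib.sha256(f"{seed}:{element}".encode()).digest()[:8], "big")

def minhash_signature(items: set, num_hashes: int = 128) -> list:
    """One signature component per seeded hash: the minimum hash over the set."""
    return [min(_h(x, seed) for x in items) for seed in range(num_hashes)]

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of equal components approximates |A ∩ B| / |A ∪ B|."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = set("the quick brown fox jumps over the lazy dog".split())
b = set("the quick brown fox sleeps beside the lazy dog".split())
print(estimated_jaccard(minhash_signature(a), minhash_signature(b)))
# Exact Jaccard here is 6/10 = 0.6; the estimate converges as num_hashes grows.
```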