Search CORE

3,238 research outputs found

From Frequency to Meaning: Vector Space Models of Semantics

Author: Pantel Patrick
Turney Peter D.
Publication venue: 'AI Access Foundation'
Publication date: 01/01/2010
Field of study

Computers understand very little of the meaning of human language. This profoundly limits our ability to give instructions to computers, the ability of computers to explain their actions to us, and the ability of computers to analyse and process text. Vector space models (VSMs) of semantics are beginning to address these limits. This paper surveys the use of VSMs for semantic processing of text. We organize the literature on VSMs according to the structure of the matrix in a VSM. There are currently three broad classes of VSMs, based on term-document, word-context, and pair-pattern matrices, yielding three classes of applications. We survey a broad range of applications in these three categories and we take a detailed look at a specific open source project in each category. Our goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs for those who are already familiar with the area, and to provide pointers into the literature for those who are less familiar with the field

arXiv.org e-Print Archive

CiteSeerX

NRC Publications Archive

Crossref

Automatically Generating Searchable Fingerprints For WordPress Plugins Using Static Program Analysis

Author: Li Chuang
Publication venue: CORE Scholar
Publication date: 01/01/2022
Field of study

This thesis introduces a novel method to automatically generate fingerprints for WordPress plugins. Our method performs static program analysis using Abstract Syntax Trees (ASTs) of WordPress plugins. The generated fingerprints can be used for identifying these plugins using search engines, which have support critical applications such as proactively identifying web servers with vulnerable WordPress plugins. We have used our method to generate fingerprints for over 10,000 WordPress plugins and analyze the resulted fingerprints. Our fingerprints have also revealed 453 websites that are potentially vulnerable. We have also compared fingerprints for vulnerable plugins and those for vulnerability-free plugins

CORE

A Survey to Fix the Threshold and Implementation for Detecting Duplicate Web Documents

Author: Bhimireddy Manojreddy
Gandi Krishna Pavan
Hicks Reuven
Veeramachaneni Bhargav Roy
Publication venue: OPUS Open Portal to University Scholarship
Publication date: 01/10/2015
Field of study

The drastic development in the information accessible on the World Wide Web has made the employment of automated tools to locate the information resources of interest, and for tracking and analyzing the same a certainty. Web Mining is the branch of data mining that deals with the analysis of World Wide Web. The concepts from various areas such as Data Mining, Internet technology and World Wide Web, and recently, Semantic Web can be said as the origin of web mining. Web mining can be defined as the procedure of determining hidden yet potentially beneficial knowledge from the data accessible in the web. Web mining comprise the sub areas: web content mining, web structure mining, and web usage mining. Web content mining is the process of mining knowledge from the web pages besides other web objects. The process of mining knowledge about the link structure linking web pages and some other web objects is defined as Web structure mining. Web usage mining is defined as the process of mining the usage patterns created by the users accessing the web pages. The search engine technology has led to the development of World Wide. The search engines are the chief gateways for access of information in the web. The ability to locate contents of particular interest amidst a huge heap has turned businesses beneficial and productive. The search engines respond to the queries by employing the process of web crawling that populates an indexed repository of web pages. The programs construct a confined repository of the segment of the web that they visit by navigating the web graph and retrieving pages. There are two main types of crawling, namely, Generic and Focused crawling. Generic crawlers crawls documents and links of diverse topics. Focused crawlers limit the number of pages with the aid of some prior obtained specialized knowledge. The systems that index, mine, and otherwise analyze pages (such as, the search engines) are provided with inputs from the repositories of web pages built by the web crawlers. The drastic development of the Internet and the growing necessity to incorporate heterogeneous data is accompanied by the issue of the existence of near duplicate data. Even if the near duplicate data don’t exhibit bit wise identical nature they are remarkably similar. The duplicate and near duplicate web pages either increase the index storage space or slow down or increase the serving costs which annoy the users, thus causing huge problems for the web search engines. Hence it is inevitable to design algorithms to detect such pages

Governors State University

CHORUS Deliverable 4.3: Report from CHORUS workshops on national initiatives and metadata

Author: Dosch Christoph
Karlgren Jussi
Ortgies Robert
Rudström Åsa
Publication venue: Chorus Project Consortium
Publication date: 01/01/2007
Field of study

Minutes of the following Workshops: • National Initiatives on Multimedia Content Description and Retrieval, Geneva, October 10th, 2007. • Metadata in Audio-Visual/Multimedia production and archiving, Munich, IRT, 21st – 22nd November 2007 Workshop in Geneva 10/10/2007 This highly successful workshop was organised in cooperation with the European Commission. The event brought together the technical, administrative and financial representatives of the various national initiatives, which have been established recently in some European countries to support research and technical development in the area of audio-visual content processing, indexing and searching for the next generation Internet using semantic technologies, and which may lead to an internet-based knowledge infrastructure. The objective of this workshop was to provide a platform for mutual information and exchange between these initiatives, the European Commission and the participants. Top speakers were present from each of the national initiatives. There was time for discussions with the audience and amongst the European National Initiatives. The challenges, communalities, difficulties, targeted/expected impact, success criteria, etc. were tackled. This workshop addressed how these national initiatives could work together and benefit from each other. Workshop in Munich 11/21-22/2007 Numerous EU and national research projects are working on the automatic or semi-automatic generation of descriptive and functional metadata derived from analysing audio-visual content. The owners of AV archives and production facilities are eagerly awaiting such methods which would help them to better exploit their assets.Hand in hand with the digitization of analogue archives and the archiving of digital AV material, metadatashould be generated on an as high semantic level as possible, preferably fully automatically. All users of metadata rely on a certain metadata model. All AV/multimedia search engines, developed or under current development, would have to respect some compatibility or compliance with the metadata models in use. The purpose of this workshop is to draw attention to the specific problem of metadata models in the context of (semi)-automatic multimedia search

RISE – Research Institutes of Sweden

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Swedish Institute of Computer Science Publications Database

SURVEY OF PRIVATE COPYRIGHT DOCUMENTATION SYSTEMS AND PRACTICES

Author: De Martin Juan Carlos
Hsu S.
Morando Federico
Ouma M.
Ricolfi M.
Rubiano C.
Publication venue: World Intellectual Property Organization
Publication date: 01/01/2011
Field of study

PORTO Publications Open Repository TOrino

Plagiarism detection for Indonesian texts

Author: Krisnawati Lucia Dwi
Publication venue: Ludwig-Maximilians-Universität München
Publication date: 18/05/2016
Field of study

As plagiarism becomes an increasing concern for Indonesian universities and research centers, the need of using automatic plagiarism checker is becoming more real. However, researches on Plagiarism Detection Systems (PDS) in Indonesian documents have not been well developed, since most of them deal with detecting duplicate or near-duplicate documents, have not addressed the problem of retrieving source documents, or show tendency to measure document similarity globally. Therefore, systems resulted from these researches are incapable of referring to exact locations of ``similar passage'' pairs. Besides, there has been no public and standard corpora available to evaluate PDS in Indonesian texts. To address the weaknesses of former researches, this thesis develops a plagiarism detection system which executes various methods of plagiarism detection stages in a workflow system. In retrieval stage, a novel document feature coined as phraseword is introduced and executed along with word unigram and character n-grams to address the problem of retrieving source documents, whose contents are copied partially or obfuscated in a suspicious document. The detection stage, which exploits a two-step paragraph-based comparison, is aimed to address the problems of detecting and locating source-obfuscated passage pairs. The seeds for matching source-obfuscated passage pairs are based on locally-weighted significant terms to capture paraphrased and summarized passages. In addition to this system, an evaluation corpus was created through simulation by human writers, and by algorithmic random generation. Using this corpus, the performance evaluation of the proposed methods was performed in three scenarios. On the first scenario which evaluated source retrieval performance, some methods using phraseword and token features were able to achieve the optimum recall rate 1. On the second scenario which evaluated detection performance, our system was compared to Alvi's algorithm and evaluated in 4 levels of measures: character, passage, document, and cases. The experiment results showed that methods resulted from using token as seeds have higher scores than Alvi's algorithm in all 4 levels of measures both in artificial and simulated plagiarism cases. In case detection, our systems outperform Alvi's algorithm in recognizing copied, shaked, and paraphrased passages. However, Alvi's recognition rate on summarized passage is insignificantly higher than our system. The same tendency of experiment results were demonstrated on the third experiment scenario, only the precision rates of Alvi's algorithm in character and paragraph levels are higher than our system. The higher Plagdet scores produced by some methods in our system than Alvi's scores show that this study has fulfilled its objective in implementing a competitive state-of-the-art algorithm for detecting plagiarism in Indonesian texts. Being run at our test document corpus, Alvi's highest scores of recall, precision, Plagdet, and detection rate on no-plagiarism cases correspond to its scores when it was tested on PAN'14 corpus. Thus, this study has contributed in creating a standard evaluation corpus for assessing PDS for Indonesian documents. Besides, this study contributes in a source retrieval algorithm which introduces phrasewords as document features, and a paragraph-based text alignment algorithm which relies on two different strategies. One of them is to apply local-word weighting used in text summarization field to select seeds for both discriminating paragraph pair candidates and matching process. The proposed detection algorithm results in almost no multiple detection. This contributes to the strength of this algorithm

Text Summarization

Author: Umar Siti Noor Arfah
Publication venue: Universiti Teknologi PETRONAS
Publication date: 01/06/2006
Field of study

Text summarization is the process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks) [2]. By providing a text summarization system that will simplify the bulk of information and producing only the most important points, the task of reading and understanding a text would inevitably be made easier and faster. With a large volume of text documents, a summary of each document greatly facilitates the task of finding the desired documents and the desired data from the documents. As a solution for the above matter, this project objective is to simplify the texts from a previous text summarization system and further reducing the number of words in a sentence, shortening the sentences and eliminating sentences with similar meanings and also produce grammar rules that generate sentences that are human-like. The waterfall model is chosen as the project development life cycle. A detailed research has been conducted during the requirement definition phase and the system prototype is designed in the system and software design phase. During the development phase, the coding implementation will be conducted and the unit testing part will be done throughout that development process. After the entire unit has been tested, they will be integrated together and the system testing can be done as a whole. The complete program is put through thorough test and evaluation to ensure its functionality and efficiency. As the conclusion, this project should be able to produce a summarized text as the output product and meet the project requirements and objectives

UTPedia

Retrieval of case law to provide layman with information about liability:Preliminary results of the BEST-project

Author: Klein M.C.A.
Lodder A.R.
Sie R.L.L.
Uijttenbroek E.M.
van Harmelen F.A.H.
Wildeboer Gwen R.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2008
Field of study

VU Research Portal