    Automatic Classification of Text Databases through Query Probing

    Many text databases on the web are "hidden" behind search interfaces, and their documents are only accessible through querying. Search engines typically ignore the contents of such search-only databases. Recently, Yahoo-like directories have started to manually organize these databases into categories that users can browse to find these valuable resources. We propose a novel strategy to automate the classification of search-only text databases. Our technique starts by training a rule-based document classifier, and then uses the classifier's rules to generate probing queries. The queries are sent to the text databases, which are then classified based on the number of matches that they produce for each query. We report initial exploratory experiments showing that our approach is a promising way to automatically characterize the contents of text databases accessible on the web. (Comment: 7 pages, 1 figure)
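
    As a minimal sketch of the probing idea, the snippet below assumes a hypothetical count_matches(db, query) callable standing in for a database's search interface, plus a toy table of probing queries derived from classifier rules; none of these names or queries come from the paper itself.

        from collections import defaultdict

        # Hypothetical probing queries: each rule of a trained classifier,
        # "if a document contains these terms, assign this category",
        # becomes one query for that category.
        PROBE_QUERIES = {
            "Health": ["cancer treatment", "blood pressure"],
            "Computers": ["operating system", "source code"],
        }

        def classify_database(db, count_matches):
            """Probe a database and assign the best-matching category."""
            scores = defaultdict(int)
            for category, queries in PROBE_QUERIES.items():
                for query in queries:
                    # The interface reports how many documents match.
                    scores[category] += count_matches(db, query)
            return max(scores, key=scores.get)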

    Testing Market Response to Auditor Change Filings: a comparison of machine learning classifiers

    The use of textual information contained in company filings with the Securities and Exchange Commission (SEC), including annual reports on Form 10-K, quarterly reports on Form 10-Q, and current reports on Form 8-K, has gained the increased attention of finance and accounting researchers. In this paper we use a set of machine learning methods to predict the market response to changes in a firm's auditor as reported in public filings. We vectorize the text of 8-K filings to test whether the resulting feature matrix can explain the sign of the market response to the filing. Specifically, using classification algorithms and a sample consisting of the Item 4.01 text of 8-K documents, which provides information on changes in auditors of companies that are registered with the SEC, we predict the sign of the cumulative abnormal return (CAR) around 8-K filing dates. We report the correct classification performance and time efficiency of the classification algorithms. Our results show some improvement over the naïve classification method.
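
    A minimal sketch of the vectorize-then-classify setup described above, using scikit-learn as an illustrative stand-in for the paper's methods; filings (the Item 4.01 texts) and car_sign (the sign of the CAR around each filing date) are hypothetical inputs, not the paper's data.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import accuracy_score

        def classify_car_sign(filings, car_sign):
            """Predict the sign of the CAR from 8-K Item 4.01 text."""
            # Turn the filing texts into a sparse term-weight matrix.
            X = TfidfVectorizer(stop_words="english").fit_transform(filings)
            X_tr, X_te, y_tr, y_te = train_test_split(X, car_sign,
                                                      test_size=0.2)
            clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
            # Correct classification rate on the held-out filings.
            return accuracy_score(y_te, clf.predict(X_te))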

    ANTIDS: Self-Organized Ant-based Clustering Model for Intrusion Detection System

    Security of computers and the networks that connect them is of increasing significance. Computer security is defined as the protection of computing systems against threats to confidentiality, integrity, and availability. There are two types of intruders: external intruders, who are unauthorized users of the machines they attack, and internal intruders, who have permission to access the system with some restrictions. Because it is increasingly improbable for a system administrator to recognize an attack and intervene manually to stop it, there is growing recognition that intrusion detection (ID) systems have much to gain from following the basic principles that govern the behavior of complex natural systems, in particular self-organization, which allows for a truly distributed and collective perception of these phenomena. With that aim in mind, the present work presents a self-organized ant colony based intrusion detection system (ANTIDS) to detect intrusions in a network infrastructure. Its performance is compared against conventional soft computing paradigms such as Decision Trees, Support Vector Machines, and Linear Genetic Programming for modeling fast, online, and efficient intrusion detection systems. (Comment: 13 pages, 3 figures, Swarm Intelligence and Patterns (SIP) special track at WSTST 2005, Muroran, Japan)
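
    For readers unfamiliar with the underlying mechanism, here is a compressed sketch of Lumer-Faieta-style ant clustering, the family of self-organizing algorithms that systems like ANTIDS build on. All parameters, the similarity measure, and the assumption that records are normalized numpy feature vectors are generic choices for illustration, not the paper's configuration.

        import random
        import numpy as np

        def local_density(grid, cell, item, size, s=1):
            """Mean similarity of `item` to records on surrounding cells."""
            total, (x, y) = 0.0, cell
            for dx in range(-s, s + 1):
                for dy in range(-s, s + 1):
                    if dx == 0 and dy == 0:
                        continue                  # skip the cell itself
                    other = grid.get(((x + dx) % size, (y + dy) % size))
                    if other is not None:
                        total += max(0.0, 1.0 - np.linalg.norm(item - other))
            return total / ((2 * s + 1) ** 2 - 1)

        def ant_cluster(records, size=25, n_ants=10, steps=50000,
                        k1=0.1, k2=0.15):
            """Ants move records on a toroidal grid until similar records
            sit close together, forming clusters without supervision."""
            free = [(x, y) for x in range(size) for y in range(size)]
            random.shuffle(free)
            grid = {free.pop(): r for r in records}   # cell -> record
            load = [None] * n_ants                    # carried records
            for _ in range(steps):
                for a in range(n_ants):
                    cell = (random.randrange(size), random.randrange(size))
                    if load[a] is None and cell in grid:
                        # Pick up records surrounded by dissimilar ones.
                        f = local_density(grid, cell, grid[cell], size)
                        if random.random() < (k1 / (k1 + f)) ** 2:
                            load[a] = grid.pop(cell)
                    elif load[a] is not None and cell not in grid:
                        # Drop a record where similar ones already sit.
                        f = local_density(grid, cell, load[a], size)
                        if random.random() < (f / (k2 + f)) ** 2:
                            grid[cell], load[a] = load[a], None
            for a in range(n_ants):                   # force-drop leftovers
                while load[a] is not None:
                    cell = (random.randrange(size), random.randrange(size))
                    if cell not in grid:
                        grid[cell], load[a] = load[a], None
            return grid                # nearby occupied cells ~ one cluster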

    Humanising pedagogy: An alternative approach to curriculum design that enhances rigour in a B.Ed. programme

    The Minimum Requirements for Teacher Education Qualifications (MRTEQ) policy draws attention to the complexity of teaching as an activity that is premised upon the acquisition, integration and application of different types of knowledge practices or learning. As such, all initial teacher education programmes in South Africa should be designed to include disciplinary knowledge, pedagogical knowledge, practical knowledge, fundamental knowledge and situational knowledge. These types of knowledge underpin a teacher’s ability to facilitate meaningful learning in the classroom, which in turn facilitates higher education’s responsiveness to societal needs. In this article, we reflect on the faculty’s recent curriculum renewal journey towards designing a coherent and rigorous B.Ed. programme. We locate our curriculum renewal journey in critical theory, and the new curriculum itself is grounded in humanising pedagogies, critical reflection and inquiry. We also describe the consultation and collaborative processes we engaged in to ensure that our new B.Ed. programme would be responsive to the needs of our students and society.

    Malware Detection Using a Heterogeneous Distance Function

    Classification of automatically generated malware is an active research area. The amount of new malware is growing exponentially, and since manual investigation is not possible, automated malware classification is necessary. In this paper, we present a static malware detection system for the detection of unknown malicious programs, based on a combination of the weighted k-nearest neighbors classifier and the statistical scoring technique from [12]. We extracted the most relevant features from the portable executable (PE) file format using gain ratio, and designed a heterogeneous distance function that can handle both linear and nominal features. Our proposed detection method was evaluated on a dataset with tens of thousands of malicious and benign samples, and the experimental results show that the accuracy of our classifier is 98.80%. In addition, preliminary results indicate that the proposed similarity metric on our feature space could be used for clustering malware into families.
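
    A minimal sketch of a heterogeneous distance of the kind described: nominal features compared with the overlap metric, linear features by range-normalized difference, and each dimension weighted (for example, by its gain ratio). The feature layout, weights, and voting scheme below are illustrative assumptions, not the paper's exact function.

        import numpy as np

        def heterogeneous_distance(a, b, is_nominal, ranges, weights):
            """Weighted distance over mixed nominal/linear feature vectors."""
            total = 0.0
            for i, (x, y) in enumerate(zip(a, b)):
                if is_nominal[i]:
                    d = 0.0 if x == y else 1.0      # overlap metric
                else:
                    d = abs(x - y) / ranges[i]      # normalized difference
                total += weights[i] * d * d
            return np.sqrt(total)

        def knn_classify(query, samples, labels, k, **dist_kwargs):
            """Weighted k-NN vote: nearer neighbors contribute more."""
            dists = [heterogeneous_distance(query, s, **dist_kwargs)
                     for s in samples]
            votes = {}
            for i in np.argsort(dists)[:k]:
                votes[labels[i]] = (votes.get(labels[i], 0.0)
                                    + 1.0 / (1e-9 + dists[i]))
            return max(votes, key=votes.get)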

    A Semi-automatic and Low Cost Approach to Build Scalable Lemma-based Lexical Resources for Arabic Verbs

    This work presents a method that enables the Arabic NLP community to build scalable lexical resources. The proposed method is low cost and time efficient, in addition to being scalable and extendible: it can proceed incrementally on both fronts, processing resources and generating lexicons. Starting from a corpus, tokens are first drawn and lemmatized. Second, finite state transducers (FSTs) are generated semi-automatically. Finally, the FSTs are used to produce all possible inflected verb forms with their full morphological features. Among the algorithm's strengths is its ability to generate transducers with 184 transitions, which would be very cumbersome to design manually. A second strength is a new inflection scheme for Arabic verbs, which increases the efficiency of the FST generation algorithm. The experimentation uses a representative corpus of Modern Standard Arabic; the number of semi-automatically generated transducers is 171. The coverage of the resulting open lexical resources is high: they cover more than 70% of Arabic verbs, and contain 16,855 verb lemmas and 11,080,355 fully, partially, and non-vocalized inflected verb forms. All these resources are being made public and are currently used as an open package in the Unitex framework, available under the LGPL license.
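
    As a toy illustration of the generation step, the sketch below collapses a transducer to a mapping from morphological features to affixes applied to a lemma stem. The feature tuples and transliterated suffixes are placeholders to show the mechanism, not the paper's actual inflection classes or FST format.

        # Hypothetical suffix table for one perfective conjugation class.
        SUFFIXES = {
            ("3", "masc", "sing"): "a",    # katab + a  -> kataba
            ("3", "masc", "plur"): "uu",   # katab + uu -> katabuu
            ("1", "-", "sing"): "tu",      # katab + tu -> katabtu
        }

        def inflect(stem, suffixes=SUFFIXES):
            """Yield (morphological features, inflected form) for one stem."""
            for features, suffix in suffixes.items():
                yield features, stem + suffix

        # list(inflect("katab")) enumerates the class's inflected forms,
        # each paired with its full morphological features.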

    Dialogue Act Recognition Approaches

    This paper deals with automatic dialogue act (DA) recognition. Dialogue acts are sentence-level units that represent states of a dialogue, such as questions, statements, hesitations, etc. Knowledge of the dialogue acts realized in a discourse or dialogue is part of the speech understanding and dialogue analysis process, and it is of great importance for many applications: dialogue systems, speech recognition, automatic machine translation, etc. The main goal of this paper is to survey existing work on DA recognition and to discuss its respective advantages and drawbacks. A major concern in the DA recognition domain is that, although a few DA annotation schemes now seem to be emerging as standards, these DA tag-sets most often have to be adapted to the specificities of a given application, which prevents the deployment of standardized DA databases and evaluation procedures. This review focuses on the various kinds of information that can be used to recognize DAs, such as prosodic and lexical cues, and on the types of models proposed so far to capture this information. Combining these information sources now appears to be a prerequisite for recognizing DAs.
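
    A minimal sketch of the source-combination idea the review closes on: lexical and prosodic features are concatenated and fed to a single classifier. The feature choices (n-grams plus a numeric matrix of cues such as mean F0, energy, and duration) and the model are illustrative assumptions, not drawn from any specific system surveyed.

        from scipy.sparse import hstack
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.linear_model import LogisticRegression

        def train_da_classifier(utterances, prosody, da_labels):
            """utterances: transcript strings; prosody: (n, d) array of
            prosodic cues; da_labels: dialogue act tag per utterance."""
            vec = CountVectorizer(ngram_range=(1, 2))
            X_lex = vec.fit_transform(utterances)   # lexical evidence
            X = hstack([X_lex, prosody])            # combined sources
            clf = LogisticRegression(max_iter=1000).fit(X, da_labels)
            return vec, clf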

    Statistical Acquisition of Content Selection Rules for Natural Language Generation

    A Natural Language Generation system produces text from semantic data given as input. One of its very first tasks is to decide which pieces of information to convey in the output. This task, called Content Selection, is quite domain dependent, requiring considerable re-engineering to port the system from one scenario to another. In (Duboue and McKeown, 2003), we presented a method to acquire content selection rules automatically from a corpus of text and associated semantics. Our proposed technique was evaluated by comparing its output with the information selected by human authors in unseen texts, where we were able to filter out half the input data set without loss of recall. This report contains additional technical information about our system.
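
    A toy sketch of statistical content selection in this spirit: for each semantic attribute, estimate how often the aligned human-written texts verbalize it, and keep the attribute if that rate clears a threshold. The verbalized() alignment test and the threshold are illustrative assumptions, not the paper's exact learning procedure.

        from collections import Counter

        def learn_selection_rules(records, texts, verbalized, threshold=0.5):
            """records: attribute->value dicts; texts: aligned texts;
            verbalized(attr, value, text) -> bool says whether the pair
            was mentioned. Returns the attributes worth conveying."""
            mentioned, seen = Counter(), Counter()
            for record, text in zip(records, texts):
                for attr, value in record.items():
                    seen[attr] += 1
                    if verbalized(attr, value, text):
                        mentioned[attr] += 1
            # Keep attributes that human authors usually chose to convey.
            return {a for a in seen if mentioned[a] / seen[a] >= threshold}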