    Automatic Classification of Text Databases through Query Probing

    Many text databases on the web are "hidden" behind search interfaces, and their documents are only accessible through querying. Search engines typically ignore the contents of such search-only databases. Recently, Yahoo-like directories have started to manually organize these databases into categories that users can browse to find these valuable resources. We propose a novel strategy to automate the classification of search-only text databases. Our technique starts by training a rule-based document classifier, and then uses the classifier's rules to generate probing queries. The queries are sent to the text databases, which are then classified based on the number of matches that they produce for each query. We report initial exploratory experiments showing that our approach is a promising way to automatically characterize the contents of text databases accessible on the web. (Comment: 7 pages, 1 figure)
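
    As a minimal sketch of the probing idea, the snippet below assumes a hypothetical count_matches(db, query) callable standing in for a database's search interface, plus a toy table of probing queries derived from classifier rules; none of these names or queries come from the paper itself.

        from collections import defaultdict

        # Hypothetical probing queries: each rule of a trained classifier,
        # "if a document contains these terms, assign this category",
        # becomes one query for that category.
        PROBE_QUERIES = {
            "Health": ["cancer treatment", "blood pressure"],
            "Computers": ["operating system", "source code"],
        }

        def classify_database(db, count_matches):
            """Probe a database and assign the best-matching category."""
            scores = defaultdict(int)
            for category, queries in PROBE_QUERIES.items():
                for query in queries:
                    # The interface reports how many documents match.
                    scores[category] += count_matches(db, query)
            return max(scores, key=scores.get)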

    Testing Market Response to Auditor Change Filings: a comparison of machine learning classifiers

    The use of textual information contained in company filings with the Securities and Exchange Commission (SEC), including annual reports on Form 10-K, quarterly reports on Form 10-Q, and current reports on Form 8-K, has gained the increased attention of finance and accounting researchers. In this paper we use a set of machine learning methods to predict the market response to changes in a firm's auditor as reported in public filings. We vectorize the text of 8-K filings to test whether the resulting feature matrix can explain the sign of the market response to the filing. Specifically, using classification algorithms and a sample consisting of the Item 4.01 text of 8-K documents, which provides information on changes in auditors of companies that are registered with the SEC, we predict the sign of the cumulative abnormal return (CAR) around 8-K filing dates. We report the correct classification performance and time efficiency of the classification algorithms. Our results show some improvement over the naïve classification method.
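
    A minimal sketch of the vectorize-then-classify setup described above, using scikit-learn as an illustrative stand-in for the paper's methods; filings (the Item 4.01 texts) and car_sign (the sign of the CAR around each filing date) are hypothetical inputs, not the paper's data.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import accuracy_score

        def classify_car_sign(filings, car_sign):
            """Predict the sign of the CAR from 8-K Item 4.01 text."""
            # Turn the filing texts into a sparse term-weight matrix.
            X = TfidfVectorizer(stop_words="english").fit_transform(filings)
            X_tr, X_te, y_tr, y_te = train_test_split(X, car_sign,
                                                      test_size=0.2)
            clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
            # Correct classification rate on the held-out filings.
            return accuracy_score(y_te, clf.predict(X_te))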

    ANTIDS: Self-Organized Ant-based Clustering Model for Intrusion Detection System

    Security of computers and the networks that connect them is of increasing significance. Computer security is defined as the protection of computing systems against threats to confidentiality, integrity, and availability. There are two types of intruders: external intruders, who are unauthorized users of the machines they attack, and internal intruders, who have permission to access the system with some restrictions. Because it is increasingly improbable for a system administrator to recognize an attack and intervene manually to stop it, there is growing recognition that intrusion detection (ID) systems have much to gain from following the basic principles that govern the behavior of complex natural systems, in particular self-organization, which allows for a truly distributed and collective perception of these phenomena. With that aim in mind, the present work presents a self-organized ant colony based intrusion detection system (ANTIDS) to detect intrusions in a network infrastructure. Its performance is compared against conventional soft computing paradigms such as Decision Trees, Support Vector Machines, and Linear Genetic Programming for modeling fast, online, and efficient intrusion detection systems. (Comment: 13 pages, 3 figures, Swarm Intelligence and Patterns (SIP) special track at WSTST 2005, Muroran, Japan)
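
    For readers unfamiliar with the underlying mechanism, here is a compressed sketch of Lumer-Faieta-style ant clustering, the family of self-organizing algorithms that systems like ANTIDS build on. All parameters, the similarity measure, and the assumption that records are normalized numpy feature vectors are generic choices for illustration, not the paper's configuration.

        import random
        import numpy as np

        def local_density(grid, cell, item, size, s=1):
            """Mean similarity of `item` to records on surrounding cells."""
            total, (x, y) = 0.0, cell
            for dx in range(-s, s + 1):
                for dy in range(-s, s + 1):
                    if dx == 0 and dy == 0:
                        continue                  # skip the cell itself
                    other = grid.get(((x + dx) % size, (y + dy) % size))
                    if other is not None:
                        total += max(0.0, 1.0 - np.linalg.norm(item - other))
            return total / ((2 * s + 1) ** 2 - 1)

        def ant_cluster(records, size=25, n_ants=10, steps=50000,
                        k1=0.1, k2=0.15):
            """Ants move records on a toroidal grid until similar records
            sit close together, forming clusters without supervision."""
            free = [(x, y) for x in range(size) for y in range(size)]
            random.shuffle(free)
            grid = {free.pop(): r for r in records}   # cell -> record
            load = [None] * n_ants                    # carried records
            for _ in range(steps):
                for a in range(n_ants):
                    cell = (random.randrange(size), random.randrange(size))
                    if load[a] is None and cell in grid:
                        # Pick up records surrounded by dissimilar ones.
                        f = local_density(grid, cell, grid[cell], size)
                        if random.random() < (k1 / (k1 + f)) ** 2:
                            load[a] = grid.pop(cell)
                    elif load[a] is not None and cell not in grid:
                        # Drop a record where similar ones already sit.
                        f = local_density(grid, cell, load[a], size)
                        if random.random() < (f / (k2 + f)) ** 2:
                            grid[cell], load[a] = load[a], None
            for a in range(n_ants):                   # force-drop leftovers
                while load[a] is not None:
                    cell = (random.randrange(size), random.randrange(size))
                    if cell not in grid:
                        grid[cell], load[a] = load[a], None
            return grid                # nearby occupied cells ~ one cluster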

    Humanising pedagogy: An alternative approach to curriculum design that enhances rigour in a B.Ed. programme

    The Minimum Requirements for Teacher Education Qualifications (MRTEQ) policy draws attention to the complexity of teaching as an activity that is premised upon the acquisition, integration and application of different types of knowledge practices or learning. As such, all initial teacher education programmes in South Africa should be designed to include disciplinary knowledge, pedagogical knowledge, practical knowledge, fundamental knowledge and situational knowledge. These types of knowledge underpin a teacher’s ability to facilitate meaningful learning in the classroom, which in turn facilitates higher education’s responsiveness to societal needs. In this article, we reflect on the faculty’s recent curriculum renewal journey towards designing a coherent and rigorous B.Ed. programme. We locate our curriculum renewal journey in critical theory, and the new curriculum itself is grounded in humanising pedagogies, critical reflection and inquiry. We also describe the consultation and collaborative processes we engaged in to ensure that our new B.Ed. programme would be responsive to the needs of our students and society.

    Malware Detection Using a Heterogeneous Distance Function

    Classification of automatically generated malware is an active research area. The amount of new malware is growing exponentially, and since manual investigation is not possible, automated malware classification is necessary. In this paper, we present a static malware detection system for the detection of unknown malicious programs, based on a combination of the weighted k-nearest neighbors classifier and the statistical scoring technique from [12]. We extracted the most relevant features from the portable executable (PE) file format using gain ratio, and designed a heterogeneous distance function that can handle both linear and nominal features. Our proposed detection method was evaluated on a dataset with tens of thousands of malicious and benign samples, and the experimental results show that the accuracy of our classifier is 98.80%. In addition, preliminary results indicate that the proposed similarity metric on our feature space could be used for clustering malware into families.
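
    A minimal sketch of a heterogeneous distance of the kind described: nominal features compared with the overlap metric, linear features by range-normalized difference, and each dimension weighted (for example, by its gain ratio). The feature layout, weights, and voting scheme below are illustrative assumptions, not the paper's exact function.

        import numpy as np

        def heterogeneous_distance(a, b, is_nominal, ranges, weights):
            """Weighted distance over mixed nominal/linear feature vectors."""
            total = 0.0
            for i, (x, y) in enumerate(zip(a, b)):
                if is_nominal[i]:
                    d = 0.0 if x == y else 1.0      # overlap metric
                else:
                    d = abs(x - y) / ranges[i]      # normalized difference
                total += weights[i] * d * d
            return np.sqrt(total)

        def knn_classify(query, samples, labels, k, **dist_kwargs):
            """Weighted k-NN vote: nearer neighbors contribute more."""
            dists = [heterogeneous_distance(query, s, **dist_kwargs)
                     for s in samples]
            votes = {}
            for i in np.argsort(dists)[:k]:
                votes[labels[i]] = (votes.get(labels[i], 0.0)
                                    + 1.0 / (1e-9 + dists[i]))
            return max(votes, key=votes.get)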

    A Semi-automatic and Low Cost Approach to Build Scalable Lemma-based Lexical Resources for Arabic Verbs

    This work presents a method that enables the Arabic NLP community to build scalable lexical resources. The proposed method is low cost and time efficient, in addition to being scalable and extendible: it can proceed incrementally on both fronts, processing resources and generating lexicons. Starting from a corpus, tokens are first drawn and lemmatized. Second, finite state transducers (FSTs) are generated semi-automatically. Finally, the FSTs are used to produce all possible inflected verb forms with their full morphological features. Among the algorithm's strengths is its ability to generate transducers with 184 transitions, which would be very cumbersome to design manually. A second strength is a new inflection scheme for Arabic verbs, which increases the efficiency of the FST generation algorithm. The experimentation uses a representative corpus of Modern Standard Arabic; the number of semi-automatically generated transducers is 171. The coverage of the resulting open lexical resources is high: they cover more than 70% of Arabic verbs, and contain 16,855 verb lemmas and 11,080,355 fully, partially, and non-vocalized inflected verb forms. All these resources are being made public and are currently used as an open package in the Unitex framework, available under the LGPL license.
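
    As a toy illustration of the generation step, the sketch below collapses a transducer to a mapping from morphological features to affixes applied to a lemma stem. The feature tuples and transliterated suffixes are placeholders to show the mechanism, not the paper's actual inflection classes or FST format.

        # Hypothetical suffix table for one perfective conjugation class.
        SUFFIXES = {
            ("3", "masc", "sing"): "a",    # katab + a  -> kataba
            ("3", "masc", "plur"): "uu",   # katab + uu -> katabuu
            ("1", "-", "sing"): "tu",      # katab + tu -> katabtu
        }

        def inflect(stem, suffixes=SUFFIXES):
            """Yield (morphological features, inflected form) for one stem."""
            for features, suffix in suffixes.items():
                yield features, stem + suffix

        # list(inflect("katab")) enumerates the class's inflected forms,
        # each paired with its full morphological features.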

    Dialogue Act Recognition Approaches

    This paper deals with automatic dialogue act (DA) recognition. Dialogue acts are sentence-level units that represent states of a dialogue, such as questions, statements, hesitations, etc. Knowledge of the dialogue acts realized in a discourse or dialogue is part of the speech understanding and dialogue analysis process, and it is of great importance for many applications: dialogue systems, speech recognition, automatic machine translation, etc. The main goal of this paper is to survey existing work on DA recognition and to discuss its respective advantages and drawbacks. A major concern in the DA recognition domain is that, although a few DA annotation schemes now seem to be emerging as standards, these DA tag-sets most often have to be adapted to the specificities of a given application, which prevents the deployment of standardized DA databases and evaluation procedures. This review focuses on the various kinds of information that can be used to recognize DAs, such as prosodic and lexical cues, and on the types of models proposed so far to capture this information. Combining these information sources now appears to be a prerequisite for recognizing DAs.
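
    A minimal sketch of the source-combination idea the review closes on: lexical and prosodic features are concatenated and fed to a single classifier. The feature choices (n-grams plus a numeric matrix of cues such as mean F0, energy, and duration) and the model are illustrative assumptions, not drawn from any specific system surveyed.

        from scipy.sparse import hstack
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.linear_model import LogisticRegression

        def train_da_classifier(utterances, prosody, da_labels):
            """utterances: transcript strings; prosody: (n, d) array of
            prosodic cues; da_labels: dialogue act tag per utterance."""
            vec = CountVectorizer(ngram_range=(1, 2))
            X_lex = vec.fit_transform(utterances)   # lexical evidence
            X = hstack([X_lex, prosody])            # combined sources
            clf = LogisticRegression(max_iter=1000).fit(X, da_labels)
            return vec, clf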

    Statistical Acquisition of Content Selection Rules for Natural Language Generation

    A Natural Language Generation system produces text from semantic data given as input. One of its very first tasks is to decide which pieces of information to convey in the output. This task, called Content Selection, is quite domain dependent, requiring considerable re-engineering to port the system from one scenario to another. In (Duboue and McKeown, 2003), we presented a method to acquire content selection rules automatically from a corpus of text and associated semantics. Our proposed technique was evaluated by comparing its output with the information selected by human authors in unseen texts, where we were able to filter out half the input data set without loss of recall. This report contains additional technical information about our system.
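
    A toy sketch of statistical content selection in this spirit: for each semantic attribute, estimate how often the aligned human-written texts verbalize it, and keep the attribute if that rate clears a threshold. The verbalized() alignment test and the threshold are illustrative assumptions, not the paper's exact learning procedure.

        from collections import Counter

        def learn_selection_rules(records, texts, verbalized, threshold=0.5):
            """records: attribute->value dicts; texts: aligned texts;
            verbalized(attr, value, text) -> bool says whether the pair
            was mentioned. Returns the attributes worth conveying."""
            mentioned, seen = Counter(), Counter()
            for record, text in zip(records, texts):
                for attr, value in record.items():
                    seen[attr] += 1
                    if verbalized(attr, value, text):
                        mentioned[attr] += 1
            # Keep attributes that human authors usually chose to convey.
            return {a for a in seen if mentioned[a] / seen[a] >= threshold}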