
    Adaptive text mining: Inferring structure from sequences

    Text mining is about inferring structure from sequences representing natural language text, and may be defined as the process of analyzing text to extract information that is useful for particular purposes. Although hand-crafted heuristics are a common practical approach for extracting information from text, a general, and generalizable, approach requires adaptive techniques. This paper studies the way in which the adaptive techniques used in text compression can be applied to text mining. It develops several examples: extraction of hierarchical phrase structures from text, identification of keyphrases in documents, locating proper names and quantities of interest in a piece of text, text categorization, word segmentation, acronym extraction, and structure recognition. We conclude that compression forms a sound unifying principle that allows many text mining problems to be tackled adaptively.
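
    To make the "compression as a unifying principle" idea concrete, the sketch below classifies a piece of text by asking which category's adaptive character model encodes it in the fewest bits. The order-1 model with add-one smoothing is a deliberately simplified stand-in for the PPM-style models used in the paper, and the training snippets and class names are invented for illustration.

        import math
        from collections import defaultdict

        class CharModel:
            """Order-1 adaptive character model with add-one smoothing.

            A deliberately simple stand-in for a PPM-style model: it estimates
            P(next char | previous char) and reports the cost, in bits, of
            encoding a string.
            """
            def __init__(self):
                self.counts = defaultdict(lambda: defaultdict(int))
                self.totals = defaultdict(int)
                self.alphabet = set()

            def train(self, text):
                for prev, cur in zip(text, text[1:]):
                    self.counts[prev][cur] += 1
                    self.totals[prev] += 1
                    self.alphabet.update((prev, cur))

            def bits(self, text):
                """Cross-entropy cost of encoding `text` under this model."""
                v = max(len(self.alphabet), 1)
                cost = 0.0
                for prev, cur in zip(text, text[1:]):
                    p = (self.counts[prev][cur] + 1) / (self.totals[prev] + v)
                    cost += -math.log2(p)
                return cost

        def classify(text, models):
            """Assign `text` to the category whose model compresses it best."""
            return min(models, key=lambda label: models[label].bits(text))

        # Invented training snippets for two invented categories.
        models = {"sport": CharModel(), "finance": CharModel()}
        models["sport"].train("the striker scored twice in the second half")
        models["finance"].train("the central bank raised interest rates again")
        print(classify("rates and bonds rallied after the bank meeting", models))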

    A New Feature Extraction Algorithm to Extract Differentiate Information and Improve KNN-based Model Accuracy on Aquaculture Dataset

    In aquaculture, understanding the condition of a pond is very important for farmers when deciding what action to take to prevent adverse conditions from occurring. The condition of a pond can be judged by measuring many water parameters, which can be divided into three categories: physical, chemical and biological. A physical parameter is any physical quantity that can be measured in the pond. A chemical parameter is any kind of chemical substance dissolved in the water. A biological parameter is any organic matter that lives in the water. However, these parameters are not very distinguishable in representing the condition of a pond, so farmers experience difficulties in judging the condition and taking proper action for their pond. Even with the help of the K-Nearest Neighbors (KNN) algorithm combined with grid search optimization to model the data, the result is still not satisfying: the model only achieves an accuracy of 0.701 under leave-one-out validation. To overcome this problem, a feature extraction algorithm is needed to extract more information and make the data more discriminative in representing the condition of the pond. With the help of our proposed feature extraction algorithm, the optimized KNN can model the data more easily and achieve higher accuracy. In the experiments, the proposed feature extraction algorithm gives an impressive performance, increasing the accuracy to 0.741. A comparison with other feature extraction algorithms, namely Principal Component Analysis (PCA), Non-negative Matrix Factorization (NMF), and Singular Value Decomposition (SVD), is also conducted to validate how good the proposed feature extraction algorithm is. As a result, the proposed algorithm surpasses the other algorithms, which only achieve accuracies of 0.707, 0.718, and 0.718, respectively.
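
    The evaluation protocol described here, a grid-searched KNN scored with leave-one-out validation on top of an interchangeable feature extraction step, can be sketched as follows in scikit-learn. The synthetic data, number of components and parameter grid are assumptions standing in for the aquaculture dataset, and the paper's own feature extraction algorithm is not reproduced.

        from sklearn.datasets import make_classification
        from sklearn.decomposition import NMF, PCA, TruncatedSVD
        from sklearn.model_selection import GridSearchCV, LeaveOneOut, cross_val_score
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.pipeline import Pipeline
        from sklearn.preprocessing import MinMaxScaler

        # Synthetic stand-in for the pond water-quality dataset (3 condition classes).
        X, y = make_classification(n_samples=120, n_features=10, n_informative=6,
                                   n_classes=3, random_state=0)

        extractors = {
            "raw features": "passthrough",
            "PCA": PCA(n_components=5),
            "NMF": NMF(n_components=5, max_iter=1000, random_state=0),
            "SVD": TruncatedSVD(n_components=5),
        }

        for name, extractor in extractors.items():
            pipe = Pipeline([
                ("scale", MinMaxScaler()),          # keeps inputs non-negative for NMF
                ("extract", extractor),
                ("knn", GridSearchCV(KNeighborsClassifier(),
                                     {"n_neighbors": [3, 5, 7, 9]}, cv=5)),
            ])
            acc = cross_val_score(pipe, X, y, cv=LeaveOneOut()).mean()
            print(f"{name}: leave-one-out accuracy = {acc:.3f}")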

    Smoothing in Probability Estimation Trees

    Classification learning is a type of supervised machine learning technique that uses a classification model (e.g. a decision tree) to predict unknown class labels for previously unseen instances. In many applications it can be very useful to additionally obtain class probabilities for the different class labels. Decision trees that yield these probabilities are also called probability estimation trees (PETs). Smoothing is a technique used to improve the probability estimates. There are several existing smoothing methods, such as the Laplace correction, M-Estimate smoothing and M-Branch smoothing. Smoothing does not just apply to PETs: in the field of text compression, and PPM in particular, smoothing methods play an important role. This thesis migrates smoothing methods from text compression to PETs. The newly migrated methods are compared with the best of the existing smoothing methods considered in this thesis under different experimental setups. Unpruned, pruned and bagged trees are considered in the experiments. The main finding is that the PPM-based methods yield the best probability estimates when used with bagged trees, but not when used with individual (pruned or unpruned) trees.
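
    As a brief illustration of what leaf-level smoothing does, the snippet below applies two of the methods named above, the Laplace correction and M-Estimate smoothing, to the class counts of a single hypothetical PET leaf. The counts and priors are made up, and the PPM-derived methods studied in the thesis are not shown.

        def laplace(counts):
            """Laplace correction: add one to every class count."""
            total, k = sum(counts.values()), len(counts)
            return {c: (n + 1) / (total + k) for c, n in counts.items()}

        def m_estimate(counts, priors, m=5.0):
            """M-Estimate smoothing: shrink leaf frequencies towards the class priors."""
            total = sum(counts.values())
            return {c: (n + m * priors[c]) / (total + m) for c, n in counts.items()}

        leaf_counts = {"yes": 3, "no": 0}       # raw counts at a small leaf
        priors = {"yes": 0.6, "no": 0.4}        # class distribution, e.g. at the root
        print(laplace(leaf_counts))             # {'yes': 0.8, 'no': 0.2}
        print(m_estimate(leaf_counts, priors))  # {'yes': 0.75, 'no': 0.25}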

    Identifying hazardousness of sewer pipeline gas mixture using classification methods: a comparative study

    In this work, we formulated a real-world problem related to sewer pipeline gas detection using classification-based approaches. The primary goal of this work was to identify the hazardousness of a sewer pipeline in order to offer safe, non-hazardous access to sewer pipeline workers, so that the human fatalities that occur due to toxic exposure to sewer gas components can be avoided. The dataset, acquired through laboratory tests, experiments, and various literature sources, was organized to design a predictive model able to identify/classify hazardous and non-hazardous situations in a sewer pipeline. To design such a prediction model, several classification algorithms were used and their performances were evaluated and compared, both empirically and statistically, over the collected dataset. In addition, the performances of several ensemble methods were analyzed to understand the extent of improvement offered by these methods. The results of this comprehensive study show that the instance-based learning algorithm performed better than many other algorithms, such as the multilayer perceptron, radial basis function network, support vector machine and reduced-error pruning tree. Similarly, it was observed that the multi-scheme ensemble approach enhanced the performance of the base predictors.
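
    The comparison itself follows a standard pattern: fit several classifiers and an ensemble on the same data and score them by cross-validation. The sketch below illustrates that pattern with scikit-learn; the synthetic two-class data merely stands in for the sewer gas measurements, and the algorithms and ensemble scheme are approximations of those named above rather than the exact configurations used in the study.

        from sklearn.datasets import make_classification
        from sklearn.ensemble import VotingClassifier
        from sklearn.model_selection import cross_val_score
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.neural_network import MLPClassifier
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler
        from sklearn.svm import SVC
        from sklearn.tree import DecisionTreeClassifier

        # Synthetic hazardous / non-hazardous data standing in for gas-sensor readings.
        X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                                   random_state=1)

        models = {
            "k-NN (instance-based)": KNeighborsClassifier(n_neighbors=5),
            "Multilayer perceptron": MLPClassifier(max_iter=2000, random_state=1),
            "Support vector machine": SVC(),
            "Pruned tree": DecisionTreeClassifier(max_depth=4, random_state=1),
        }
        models["Voting ensemble"] = VotingClassifier(list(models.items()))

        for name, clf in models.items():
            pipe = make_pipeline(StandardScaler(), clf)
            score = cross_val_score(pipe, X, y, cv=10).mean()
            print(f"{name}: {score:.3f}")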

    Adaptive Text Entry for Mobile Devices


    Referrer Graph: A cost-effective algorithm and pruning method for predicting web accesses

    This paper presents the Referrer Graph (RG) web prediction algorithm and a pruning method for the associated graph as a low-cost solution to predict web users' next accesses. RG is aimed at being used in a real web system with prefetching capabilities without degrading its performance. The algorithm learns from users' accesses and builds a Markov model; this kind of algorithm uses the sequence of user accesses to make predictions. Unlike previous Markov-model-based proposals, the RG algorithm differentiates dependencies on objects of the same page from dependencies on objects of different pages by using the object URI and the referrer in each request. Although its design permits building a simple data structure that is easier to handle and, consequently, needs lower computational cost in comparison with other algorithms, a pruning mechanism has been devised to avoid the continuous growth of this data structure. Results show that, compared with the best prediction algorithms proposed in the open literature, the RG algorithm achieves similar precision values and page latency savings while requiring much less computational and memory resources. Furthermore, when pruning is applied, additional and notable savings in resource consumption can be achieved without degrading the original performance: the pruning mechanism reduces the resource consumption of the baseline system without degrading its latency savings. © 2013 Elsevier B.V. All rights reserved. This work has been partially supported by the Spanish Ministry of Science and Innovation under Grant TIN2009-08201. The authors would also like to thank the technical staff of the School of Computer Science at the Polytechnic University of Valencia for providing recent and customized trace files logged by their web server. De La Ossa Perez, B. A.; Gil Salinas, J. A.; Sahuquillo Borrás, J.; Pont Sanjuan, A. (2013). Referrer Graph: a cost-effective algorithm and pruning method for predicting web accesses. Computer Communications, 36(8), 881-894. https://doi.org/10.1016/j.comcom.2013.02.005
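
    The core idea, as described in the abstract, can be sketched as a small first-order predictor in which the referrer of each request decides whether an object is attached to the page that embeds it or to the previously visited page. The snippet below is only an illustration of that idea; node and edge weighting details, confidence thresholds and the pruning mechanism of the actual RG algorithm are omitted, and the example log is invented.

        from collections import Counter, defaultdict

        class ReferrerGraph:
            """Toy first-order predictor keyed by page, in the spirit of RG."""
            def __init__(self):
                # edges[u] counts how often object v was requested after u
                self.edges = defaultdict(Counter)

            def learn(self, requests):
                """requests: iterable of (uri, referrer) pairs from an access log."""
                previous_page = None
                for uri, referrer in requests:
                    if referrer is not None:
                        # Embedded object: depends on the page named in the referrer.
                        self.edges[referrer][uri] += 1
                    else:
                        if previous_page is not None:
                            # New page: depends on the previously visited page.
                            self.edges[previous_page][uri] += 1
                        previous_page = uri

            def predict(self, uri, k=2):
                """The k objects most likely to be requested after `uri`."""
                return [v for v, _ in self.edges[uri].most_common(k)]

        rg = ReferrerGraph()
        rg.learn([("/index.html", None), ("/logo.png", "/index.html"),
                  ("/news.html", None), ("/banner.png", "/news.html")])
        print(rg.predict("/index.html"))   # ['/logo.png', '/news.html']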

    Predicting real-time roadside CO and NO2 concentrations using neural networks

    The main aim of this paper is to develop a model based on neural network (NN) theory to estimate real-time roadside CO and NO2 concentrations using traffic and meteorological condition data. The location of the study site is at a road intersection in Melton Mowbray, which is a town in Leicestershire, U.K. Several NNs, which can be classified into three types, namely the multilayer perceptron, the radial basis function, and the modular network, were developed to model the nonlinear relationships that exist in the pollutant concentrations. Their performances are analyzed and compared. The transferability of the developed models is studied using data collected from a road intersection in another city. It was concluded that all NNs provide reliable estimates of pollutant concentrations using limited information and noisy data.
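
    For a concrete picture of one of the three network types compared, the sketch below trains a small multilayer perceptron regressor to map traffic and meteorological inputs to a pollutant concentration. The feature names and the synthetic data are illustrative assumptions, not the Melton Mowbray data, and the radial basis function and modular networks are not shown.

        import numpy as np
        from sklearn.model_selection import train_test_split
        from sklearn.neural_network import MLPRegressor
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler

        rng = np.random.default_rng(0)
        n = 500
        # Assumed inputs: traffic flow (veh/h), mean speed (km/h), wind speed (m/s), temperature (C).
        X = rng.uniform([0, 0, 0, -5], [2000, 80, 15, 30], size=(n, 4))
        # Synthetic CO concentration loosely tied to the inputs, plus noise.
        co = 2.0 + 0.002 * X[:, 0] - 0.01 * X[:, 1] - 0.2 * X[:, 2] + rng.normal(0, 0.3, n)

        X_train, X_test, y_train, y_test = train_test_split(X, co, random_state=0)
        model = make_pipeline(StandardScaler(),
                              MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000,
                                           random_state=0))
        model.fit(X_train, y_train)
        print("R^2 on held-out data:", round(model.score(X_test, y_test), 3))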

    Identification of cover songs using information theoretic measures of similarity

    13 pages, 5 figures, 4 tables. v3: Accepted version

    QoS-aware fine-grained power management in networked computing systems

    Power is a major design concern of today's networked computing systems, from low-power battery-powered mobile and embedded systems to high-power enterprise servers. Embedded systems are required to be power efficient because most of them are powered by batteries with limited capacity. A similar concern about power expenditure arises in enterprise server environments due to cooling requirements, power delivery limits, electricity costs and environmental pollution. The power consumed by networked computing systems includes that consumed on the circuit board and that spent on communication. In the context of networked real-time systems, the power dissipated on wireless communication is more significant than that on the circuit board. We focus on packet scheduling for wireless real-time systems with renewable energy resources. In such a scenario, data with higher levels of importance must be transmitted periodically. We formulate this packet scheduling problem as an NP-hard reward maximization problem with time and energy constraints. An optimal solution with pseudo-polynomial time complexity is presented; in addition, we propose a sub-optimal solution with polynomial time complexity.

    Circuit board, and especially processor, power consumption is still the major source of system power consumption. We provide a general-purpose, practical and comprehensive power management middleware for networked computing systems that manages circuit board power consumption and thereby affects system-level power consumption. It offers power and performance monitoring, power management (PM) policy selection and PM control, as well as energy efficiency analysis. The middleware includes an extensible PM policy library. We implemented a prototype of this middleware on Base Band Units (BBUs) with three PM policies enclosed. These policies have been validated on different platforms, such as enterprise servers, virtual environments and BBUs.

    In enterprise environments, the power dissipated on the circuit board dominates, so regulating the computing resources on board has a significant impact on power consumption. Dynamic Voltage and Frequency Scaling (DVFS) is an effective technique for conserving energy. We investigate system-level power management in order to avoid system failures due to power capacity overload or overheating. This management needs to control power consumption in an accurate and responsive manner, which cannot be achieved by existing black-box feedback control. We therefore present a model-predictive feedback controller that regulates processor frequency so that the power budget can be satisfied without significant loss of performance.

    In addition to providing a power guarantee alone, performance with respect to service-level agreements (SLAs) must be guaranteed as well. The proliferation of virtualization technology imposes new challenges on power management due to resource sharing, and it is hard to optimize both power and performance on shared infrastructures because of system dynamics. We propose vPnP, a feedback-control-based coordination approach that provides guarantees on application-level performance and on the power consumption of the underlying physical host in virtualized environments. The system adapts gracefully to workload changes. Preliminary results show its flexibility in achieving different levels of trade-off between power and performance, as well as its robustness over a variety of workloads.

    Finally, it is desirable to improve the energy efficiency of systems, such as BBUs, that host soft real-time applications. We propose a power management strategy for controlling delay and minimizing power consumption using DVFS. We use the Robbins-Monro (RM) stochastic approximation method to estimate the delay quantile, and couple a fuzzy controller with the RM algorithm to scale the CPU frequency so that performance is maintained within the specified QoS.
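
    The last contribution combines a Robbins-Monro recursion that tracks a delay quantile online with a controller that adjusts the CPU frequency. The sketch below shows the quantile-tracking recursion and a deliberately crude threshold rule in place of the fuzzy controller; the constant gain, target quantile, QoS threshold, frequency table and simulated delays are illustrative assumptions.

        import random

        def track_quantile(delays, p=0.95, gain=1.0, q0=0.0):
            """Robbins-Monro style recursion tracking the p-th quantile of a delay stream.

            The classical scheme uses a diminishing step size; a small constant gain
            is used here so the estimate can follow a changing workload.
            """
            q = q0
            for d in delays:
                q += gain * (p - (1.0 if d <= q else 0.0))
            return q

        # Simulated request delays in milliseconds (mean 20 ms).
        random.seed(0)
        delays = [random.expovariate(1 / 20.0) for _ in range(5000)]
        q95 = track_quantile(delays)
        print("estimated 95th-percentile delay:", round(q95, 1), "ms")

        # Crude threshold rule standing in for the fuzzy controller: run at the
        # lowest frequency while the tracked quantile meets the QoS target,
        # otherwise switch to the highest one.
        target_ms, frequencies_ghz = 70.0, [0.8, 1.2, 1.6, 2.0]
        chosen = frequencies_ghz[0] if q95 <= target_ms else frequencies_ghz[-1]
        print("chosen CPU frequency:", chosen, "GHz")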