Parallel computing in information retrieval - An updated review
The progress of parallel computing in Information Retrieval (IR) is reviewed. In particular, we stress the motivation for using parallel computing in text retrieval. We analyse parallel IR systems using a classification due to Rasmussen [1] and describe some parallel IR systems. We give a description of the retrieval models used in parallel information processing, and we describe areas of research which we believe are still needed.
Methods of Hierarchical Clustering
We survey agglomerative hierarchical clustering algorithms and discuss efficient implementations that are available in R and other software environments. We look at hierarchical self-organizing maps and mixture models. We review grid-based clustering, focusing on hierarchical density-based approaches. Finally, we describe a recently developed, very efficient (linear time) hierarchical clustering algorithm, which can also be viewed as a hierarchical grid-based algorithm.
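As a minimal illustration of the agglomerative approach surveyed above (not any specific algorithm from the survey), the sketch below implements naive single-linkage clustering of one-dimensional points in pure Python: every point starts as its own cluster, and the two clusters whose closest members are nearest are merged repeatedly, recording the merge height at each step.

```python
# Naive single-linkage agglomerative clustering (illustrative sketch).
# Each point begins as a singleton cluster; at every step the pair of
# clusters with the smallest inter-point distance is merged, and the
# merge height is recorded, yielding a dendrogram as a merge history.

def single_linkage(points):
    """Return the merge history [(cluster_a, cluster_b, height), ...]."""
    clusters = {i: [p] for i, p in enumerate(points)}
    merges = []
    next_id = len(points)  # ids for newly formed clusters
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-linkage distance.
        best = None
        for a in clusters:
            for b in clusters:
                if a < b:
                    d = min(abs(x - y) for x in clusters[a] for y in clusters[b])
                    if best is None or d < best[0]:
                        best = (d, a, b)
        d, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges

merges = single_linkage([0.0, 0.1, 1.0, 1.1, 5.0])
```

This quadratic-per-step formulation is for exposition only; the linear-time algorithm described in the abstract relies on far more careful data structures.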
Distributed Inverted Files and Performance: A Study of Parallelism and Data Distribution Methods in IR
The study investigates the performance of parallel information retrieval (IR) algorithms under different data distribution methods for inverted files, to identify which is best for the requirements of specific IR tasks. We define a data distribution method as a way of distributing inverted file data to local disks on a parallel machine. A data distribution method may be on-the-fly (with one copy of the index held), replication (all nodes hold all of the index) or partitioning (the index data is split amongst nodes). Partitioning of inverted file data can be done in many ways, but we consider only two: by term (TermId) and by document (DocId). TermId partitioning assigns all the data for a unique word to a single partition, while DocId partitioning assigns all the data for a unique document to a single partition. We consider the issue of improving the performance of standard IR algorithms under these data distribution methods by looking at sequential rather than concurrent job service, i.e. sequential rather than concurrent query service. This methodology rules out some distribution methods for some of the tasks studied. We consider the following main IR tasks: indexing, search, passage retrieval, inverted file update, and query optimisation for routing/filtering. We produce a synthetic performance model for each of these tasks for the purposes of comparison. We have two subsidiary aims: first, to demonstrate the portability of our implemented data structures and algorithms across different parallel machines; second, to study the possibility of increased retrieval effectiveness by examining a larger section of the search space for both passage retrieval and routing/filtering. We also consider the implications of concurrency in updates to inverted files.
Our theoretical and empirical results show that in most cases the DocId partitioning method is the best data distribution method, apart from routing/filtering, where replication was found to be superior.
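The two partitioning schemes compared in the study can be sketched in a few lines; the code below is a toy illustration (the function names and the hash-based assignment are assumptions, not the thesis's implementation). TermId partitioning sends a term's entire postings list to one node; DocId partitioning lets each node build a complete local index over its own subset of the documents.

```python
# Toy sketch of TermId vs DocId partitioning of an inverted file.

def build_inverted_file(docs):
    """docs: {doc_id: text}. Returns {term: sorted list of doc_ids}."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.split()):
            index.setdefault(term, []).append(doc_id)
    return {t: sorted(ids) for t, ids in index.items()}

def partition_by_term(index, n_nodes):
    # TermId: all postings for a term live on one node (chosen here by hash).
    parts = [dict() for _ in range(n_nodes)]
    for term, postings in index.items():
        parts[hash(term) % n_nodes][term] = postings
    return parts

def partition_by_doc(docs, n_nodes):
    # DocId: each node indexes its own subset of the documents.
    parts = []
    for node in range(n_nodes):
        subset = {d: t for d, t in docs.items() if d % n_nodes == node}
        parts.append(build_inverted_file(subset))
    return parts

docs = {0: "a b", 1: "b c", 2: "a c"}
index = build_inverted_file(docs)
term_parts = partition_by_term(index, 2)
doc_parts = partition_by_doc(docs, 2)
```

The operational difference follows directly: under TermId a single-term query touches exactly one node, while under DocId every node must be consulted but each holds a shorter postings list.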
Characterisation of Condition Monitoring Information for Diagnosis and Prognosis using Advanced Statistical Models
This research focuses on the classification of categorical events using advanced statistical models, primarily to detect and identify individual component faults and deviations from normal, healthy operation of reciprocating compressors. Effective condition monitoring ensures optimal efficiency and reliability while maintaining the highest possible safety standards and reducing the costs and inconvenience caused by impaired performance.
The variability of operating conditions is revealed through examination of vibration signals recorded at strategic points in the process. Analysis of these signals informs expectations with respect to tolerable degrees of imperfection in specific components.
Isolating inherent process variability from extraneous variability affords a reliable means of ascertaining system health and functionality; vibration envelope spectra offer highly responsive model parameters for diagnostic purposes.
This thesis examines novel approaches to alleviating the computational burden of large-scale data analysis through investigation of the potential input variables. Three methods are investigated, as follows:
Method one employs multivariate variable clustering to ascertain homogeneity amongst input variables. A series of heterogeneous groups is formed, from each of which explanatory input variables are selected.
Data reduction techniques, method two, offer an alternative means of constructing predictive classifiers. A reduced number of reconstructed explanatory variables provides enhanced modelling capability and ensures algorithmic convergence.
The final novel approach combines both of these methods with wavelet data compression techniques, simplifying both the number of input parameters and the volume of each signal while retaining the information crucial to classification performance.
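The idea behind method one can be sketched with a simple correlation-based variable selection: group input variables by absolute pairwise correlation and keep one representative per group. This is an illustrative stand-in, not the thesis's actual clustering procedure; all names and the 0.9 threshold are assumptions.

```python
# Illustrative sketch of variable clustering for input reduction:
# greedily keep one representative per group of highly correlated variables.

def correlation(x, y):
    """Pearson correlation of two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def select_representatives(variables, threshold=0.9):
    """variables: {name: samples}. Keep a variable only if it is not
    highly correlated (|r| >= threshold) with any variable already kept."""
    kept = {}
    for name, values in variables.items():
        if all(abs(correlation(values, v)) < threshold for v in kept.values()):
            kept[name] = values
    return list(kept)

vibration = [1.0, 2.0, 3.0, 4.0]
variables = {
    "rms": vibration,
    "rms_scaled": [2 * v for v in vibration],  # redundant: perfectly correlated
    "kurtosis": [4.0, 1.0, 3.0, 2.0],          # carries distinct information
}
reps = select_representatives(variables)
```

Here the redundant scaled copy is dropped while the uncorrelated variable survives, shrinking the classifier's input set without discarding distinct information.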
A new framework for clustering
The difficulty of clustering and the variety of clustering methods suggest the need for a theoretical study of clustering. Using the idea of a standard statistical framework, we propose a new framework for clustering.
For a well-defined clustering goal we assume that the data to be clustered come from an underlying distribution and we aim to find a high-density cluster tree. We regard this tree as a parameter of interest for the underlying distribution. However, it is not obvious how to determine a connected subset in a discrete distribution whose support is located in a Euclidean space. Building a cluster tree for such a distribution is an open problem and presents interesting conceptual and computational challenges. We solve this problem using graph-based approaches and further parameterize clustering using the high-density cluster tree and its extension.
Motivated by the connection between clustering outcomes and graphs, we propose a graph family framework, which plays an important role in our clustering framework. A direct application of the graph family framework is a new cluster-tree distance measure. This distance measure can be written as an inner product or kernel, which enables our clustering framework to perform statistical assessment of clustering via simulation. Other applications, such as a method for integrating partitions into a cluster tree and methods for cluster-tree averaging and bagging, are also derived from the graph family framework.
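The inner-product idea can be illustrated on flat partitions (a toy case, not the thesis's actual cluster-tree measure): represent each clustering as the edge set of its co-membership graph, take the inner product to be the size of the edge intersection, and derive the induced squared distance from it.

```python
# Toy graph-based view of a clustering: points are joined by an edge iff
# they share a cluster; the inner product of two clusterings is the size
# of their edge-set intersection, which induces a squared distance.

def comembership_edges(labels):
    """labels: list of cluster labels, one per point. Returns the edge set."""
    return {(i, j)
            for i in range(len(labels))
            for j in range(i + 1, len(labels))
            if labels[i] == labels[j]}

def inner_product(labels_a, labels_b):
    return len(comembership_edges(labels_a) & comembership_edges(labels_b))

def distance(labels_a, labels_b):
    # Squared distance induced by the inner product:
    # ||a - b||^2 = <a, a> - 2<a, b> + <b, b>.
    aa = inner_product(labels_a, labels_a)
    bb = inner_product(labels_b, labels_b)
    ab = inner_product(labels_a, labels_b)
    return aa - 2 * ab + bb

a = [0, 0, 1, 1]  # clusters {0,1} and {2,3}
b = [0, 0, 0, 1]  # clusters {0,1,2} and {3}
```

Because the measure is an inner product, identical clusterings are at distance zero, and the kernel form is what permits simulation-based statistical assessment.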
Cluster analysis applied to success/failure in mathematics
According to [Mirkin B., 1996], classification is an existing or ideal grouping of those that resemble one another (or are similar), and a separation of those that are dissimilar. The purposes of classification are: (1) to form and acquire knowledge, (2) to analyse the structure of the phenomenon, and (3) to relate different aspects of the phenomenon in question to one another.
In studying success/failure in Mathematics, our objectives implicitly include "classifying" students according to the factors expected to be determinant for their results in Mathematics. On the other hand, we turn to classification again when we wish to establish the types of factors that determine results in Mathematics.
The objectives of cluster analysis are: (1) to analyse the structure of the data; (2) to verify/relate aspects of the data to one another; (3) to assist in the design of the classification.
We believed that this exploratory data analysis technique could be a very powerful tool for studying success/failure in Mathematics in basic education (Ensino Básico).
The work developed in this dissertation demonstrates that cluster analysis responds adequately to the questions that can be posed when trying to frame success/failure in Mathematics socially and pedagogically.
Rita Vasconcelo