10 research outputs found

    Generalised median of a set of correspondences based on the Hamming distance.

    A correspondence is a set of mappings that establishes a relation between the elements of two data structures (e.g. sets of points, strings, trees or graphs). If we consider several correspondences between the same two structures, one option for defining a representative of them is the generalised median correspondence. In general, computing the generalised median is an NP-complete task. In this paper, we present two methods to calculate the generalised median correspondence of multiple correspondences. The first obtains the optimal solution in cubic time, but is restricted to the Hamming distance. The second obtains a sub-optimal solution through an iterative approach, but places no restrictions on the distance used. We compare both proposals in terms of their distance to the true generalised median and their runtime.
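A minimal sketch of the vote-based approach for the Hamming case (assuming correspondences are encoded as permutations of element indices; the brute-force assignment step below stands in for the Hungarian algorithm, which is what would give the cubic-time bound):

```python
from itertools import permutations

def hamming(f, g):
    """Number of elements the two correspondences map differently."""
    return sum(fi != gi for fi, gi in zip(f, g))

def median_correspondence(corrs):
    """Generalised median correspondence under the Hamming distance.

    votes[i][j] counts how many input correspondences map element i to j;
    the median is the bijection with the highest total vote, i.e. a linear
    assignment problem.  Brute force over permutations keeps this sketch
    dependency-free; solving the same assignment with the Hungarian
    algorithm would run in cubic time.
    """
    n = len(corrs[0])
    votes = [[0] * n for _ in range(n)]
    for f in corrs:
        for i, j in enumerate(f):
            votes[i][j] += 1
    best = max(permutations(range(n)),
               key=lambda p: sum(votes[i][p[i]] for i in range(n)))
    return list(best)
```

Maximising the total vote is equivalent to minimising the summed Hamming distance to all inputs, which is why a single assignment solve suffices for this distance.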

    Online learning the consensus of multiple correspondences between sets.

    When several subjects solve the assignment problem between two sets, the correspondences they compute may differ. These differences arise from several factors: for example, one subject may give more importance to some of the elements' attributes than another, or the assignment problem may be computed by a suboptimal algorithm, so different non-optimal correspondences can appear. In this paper, we present a methodology to deduce the consensus of several correspondences between two sets. Moreover, we also present an online learning algorithm to deduce weights that gauge the impact of each initial correspondence on the consensus. In the experimental section, we show the evolution of these parameters together with the evolution of the consensus accuracy. We observe a clear dependence of the learned weights on the quality of the initial correspondences. Moreover, we also observe that in the first iterations of the learning algorithm the consensus accuracy increases drastically and then stabilises.
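A minimal sketch of weighted consensus with an online weight update (the multiplicative, weighted-majority-style rule below is used for illustration only; the paper's actual update need not take this form):

```python
def weighted_consensus(corrs, weights):
    """Consensus mapping: each element goes where the weighted vote is largest."""
    n = len(corrs[0])
    out = []
    for i in range(n):
        votes = {}
        for f, w in zip(corrs, weights):
            votes[f[i]] = votes.get(f[i], 0.0) + w
        out.append(max(votes, key=votes.get))
    return out

def update_weights(corrs, weights, truth, eta=0.5):
    """Down-weight correspondences that disagree with the ground truth,
    then renormalise -- an illustrative online update rule."""
    new = []
    for f, w in zip(corrs, weights):
        errs = sum(1 for i in range(len(truth)) if f[i] != truth[i])
        new.append(w * (1 - eta) ** errs)
    s = sum(new)
    return [w / s for w in new]
```

Repeating the update over a stream of labelled examples reproduces the qualitative behaviour described in the abstract: weights of low-quality correspondences shrink quickly, then the consensus stabilises.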

    AN ENSEMBLE APPROACH FOR SENTIMENT CLASSIFICATION: VOTING FOR CLASSES AND AGAINST THEM

    Sentiment denotes a person's opinion or feeling towards the subject they are discussing. Sentiment analysis has been one of the most researched and industrially promising fields in natural language processing. Several methods are employed for performing sentiment analytics, and since this classification problem involves natural language processing, every solution has its own advantages and disadvantages. Hence, a combination of these methods usually provides better results, and various such ensemble approaches exist. The objective of this work is to design a better ensemble approach that uses a richer voting method, in which classifiers are given the right to vote not only in favour of classes but also against them. This in turn gives a chance to algorithms that are weak at classifying a sentence towards a particular class but good at rejecting it. The performance of the ensemble is compared with that of the individual classifiers used in the ensemble, and with other simple voting ensembles, to verify whether it performs better. The designed ensemble is currently implemented for sentiment analytics, but it can also be used for other classification problems where generalisation is required for better results.

    Obtaining the consensus of multiple correspondences between graphs through online learning.

    In structural pattern recognition, it is usual to compare a pair of objects by generating a correspondence between the elements of their local parts. To do so, one of the most natural ways to represent these objects is through attributed graphs. Several graph extraction methods could be applied, and thus numerous graphs, which may differ not only in their node and edge structure but also in their attribute domains, could be created from the same object. Afterwards, a matching process generates the correspondence between two attributed graphs, and depending on the selected graph matching method, a different correspondence may be generated from the same pair of attributed graphs. The combination of these factors leads to a potentially large number of correspondences between the two original objects. This paper presents a method that tackles this problem by combining multiple correspondences into a single one, called a consensus correspondence, eliminating the incongruences introduced by both the graph extraction and the graph matching processes. Additionally, through the application of an online learning algorithm, it is possible to deduce weights that influence the generation of the consensus correspondence. This means that the algorithm automatically learns the quality of both the attribute domain and the correspondence for every initial correspondence proposal considered in the consensus, and defines a set of weights based on this quality. We show that the method automatically tends to assign larger weights to high-quality initial proposals, and is therefore capable of deducing better consensus correspondences.

    Correspondence consensus of two sets of correspondences through optimisation functions.

    We present a consensus method which, given two correspondences between sets of elements generated by separate entities, produces a final consensus correspondence that accounts for the existence of outliers. Our method is based on an optimisation technique that minimises the cost of the correspondence while keeping it as close as possible to the mean of the two original correspondences. The method decides the mapping of the elements on which the original correspondences disagree, and returns the common mapping when both correspondences agree. We first show the validity of the method through an experiment in ideal conditions based on palmprint identification, and subsequently present two practical experiments based on image retrieval.
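A highly simplified sketch of the decision rule (assuming a user-supplied per-mapping cost `cost(i, j)`, which is hypothetical here; the actual method optimises over the whole correspondence rather than element by element):

```python
def consensus_two(f, g, cost):
    """Consensus of two correspondences f and g over the same elements.

    Where f and g agree, keep the common mapping; where they disagree,
    choose the candidate mapping with the lower cost.  An element-wise
    simplification of the optimisation described in the abstract.
    """
    return [f[i] if f[i] == g[i] or cost(i, f[i]) <= cost(i, g[i]) else g[i]
            for i in range(len(f))]
```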

    Optimized classification predictions with a new index combining machine learning algorithms

    Voting is a commonly used ensemble method that aims to optimise classification predictions by combining the results of individual base classifiers. However, the selection of appropriate classifiers to participate in the voting algorithm is currently an open issue. In this study we developed a novel Dissimilarity-Performance (DP) index which incorporates two important criteria for selecting base classifiers to participate in voting: their differential response in classification (dissimilarity) when combined in triads, and their individual performance. To develop this empirical index we first used a range of different datasets to evaluate the relationship between voting results and measures of dissimilarity among classifiers of different types (rules, trees, lazy classifiers, functions and Bayes). Second, we computed the combined effect on voting performance of classifiers with different individual performance and/or diverse results. Our DP index was able to rank the classifier combinations according to their voting performance and thus to suggest the optimal combination. The proposed index is recommended to individual machine learning users as a preliminary tool for identifying which classifiers to combine in order to achieve more accurate classification predictions, while avoiding a computationally intensive and time-consuming search.
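The two criteria can be illustrated with a toy computation (the product combination below is an assumption for illustration only; the paper fits an empirical index rather than this formula):

```python
from itertools import combinations

def diversity(preds):
    """Mean pairwise disagreement rate among the classifiers' predictions."""
    pairs = list(combinations(preds, 2))
    n = len(preds[0])
    return sum(sum(a != b for a, b in zip(p, q)) / n
               for p, q in pairs) / len(pairs)

def dp_index(preds, accuracies):
    """Toy stand-in for the Dissimilarity-Performance idea: reward triads
    whose members are both individually accurate and mutually diverse.
    The product form is illustrative, not the paper's fitted index."""
    return diversity(preds) * (sum(accuracies) / len(accuracies))
```

Ranking candidate triads by such a score before running the full voting ensemble is the kind of preliminary screening the abstract recommends.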

    Learning the Consensus of Multiple Correspondences between Data Structures

    In this work, we present a framework to learn the consensus given multiple correspondences. It is assumed that the several parties involved have generated these correspondences separately, and our system acts as a mechanism that gauges several characteristics and considers different parameters to learn the best mappings and thus form a correspondence with the highest possible accuracy at a reasonable computational cost. The consensus framework is presented gradually, starting from the most basic approaches, which used exclusively well-known concepts or only a pair of correspondences, up to the final model, which is able to consider multiple correspondences and automatically learn some weighting parameters. Each step of the framework is evaluated using databases of varied nature to demonstrate that it can address different matching scenarios. In addition, two supplementary advances related to correspondences are presented in this work. First, a new distance metric for correspondences has been developed, which led to a new strategy for the weighted mean correspondence search. Second, a framework specifically designed for correspondence generation in the image registration field has been established, where one of the images is considered to be a full image and the other a small sample of it. The conclusion presents insights into how our consensus framework can be enhanced, and how these two parallel developments can converge with it.

    Document hyperlinking with Support Vector Machines

    Today, information is accessed through hyperlinks, which interconnect texts only when a relation exists between them. Several researchers have studied how humans create hyperlinks and have tried to replicate this behaviour, specifically on the Wikipedia collection. Hyperlinks have been regarded as a promising resource for information retrieval, inspired by citation analysis in the literature (Merlino-Santesteban, 2003). According to Dreyfus (2003), hyperlinking follows no specific criterion and no hierarchy. Consequently, when everything can be linked indiscriminately, without serving any particular purpose or meaning, the size of the network and the arbitrariness of its hyperlinks make it extremely difficult for a user to find exactly the kind of information they are looking for. In organisations, familiarity and trust have long been identified as the credibility dimensions of an information source in advertising (Haley, 1996). A hyperlink, as a form of information, can therefore have a greater impact when it is presented by a known target (Stewart & Zhang, 2003). Meanwhile, hyperlinks between websites can generate trust between the sender and the receiver of the link, so these interactions have positive reputation effects for the recipient (Stewart, 2006; Lee, Lee, & Hwang, 2014). The study of documents through their hyperlinks is an important research area in data mining; in a social network, the hyperlinks often carry a great deal of structural information, creating shared nodes within the community.
    Important applications of data-mining methods for social networks include social recommendation based on similar user experiences (Alhajj & Rokne, 2014). Marketing and advertising exploit cascades in social networks and profit from models of information propagation (Domingos & Richardson, 2001). Advertising companies are interested in quantifying the value of a single node in the network, given that its actions can trigger cascades among its neighbouring nodes. The results of (Allan, 1997), (Bellot et al., 2013), (Agosti, Crestani, & Melucci, 1997) and (Blustein, Webber, & Tague-Sutcliffe, 1997) suggest that automated hyperlink discovery is not a solved problem, and that any evaluation of Wikipedia hyperlink-discovery systems should be based on manual assessment, not on the existing hyperlinks.

    A priori synthetic sampling for increasing classification sensitivity in imbalanced data sets

    Building accurate classifiers for predicting group membership is made difficult when data is skewed or imbalanced, as is typical of real-world data sets. As a result, the classifier tends to be biased towards the over-represented group. This is known as the class imbalance problem, and it induces bias into the classifier, particularly when the imbalance is high. Class-imbalanced data usually suffers from intrinsic data properties beyond the imbalance alone, and the problem is intensified at the larger levels of imbalance most commonly found in observational studies. Extreme cases of class imbalance occur in many domains, including fraud detection, cancer mammography and post-term births. These rare events are usually the most costly, or carry the highest level of risk, and are therefore of most interest. To combat class imbalance, the machine learning community has relied upon embedded, data-preprocessing and ensemble learning approaches. Exploratory research has linked several factors that perpetuate misclassification in class-imbalanced data; however, among the competing approaches there remains a lack of understanding of the relationship between the learner and the imbalanced data. Data-preprocessing approaches are appealing because they divide the problem space in two, which allows for simpler models; however, most of them have little theoretical basis, although in some cases there is empirical evidence of improvement. The main goal of this research is to introduce newly proposed a priori-based re-sampling methods that improve concept learning within class-imbalanced data. The results highlight the robustness of these techniques on publicly available data sets from different domains containing various levels of imbalance. The theoretical and empirical reasons for this behaviour are explored and discussed.
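The general re-sampling idea can be sketched with SMOTE-style interpolation (the standard synthetic-oversampling technique, shown here for illustration; it is not the a priori method proposed in the thesis, and the function name and parameters are hypothetical):

```python
import random

def synthetic_oversample(minority, n_new=4, seed=0):
    """Create synthetic minority-class points by interpolating between
    randomly chosen pairs of existing minority samples (real SMOTE
    interpolates towards k nearest neighbours; random pairs keep this
    sketch short).  Each sample is a tuple of numeric features."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_new):
        a = rng.choice(minority)
        b = rng.choice(minority)
        t = rng.random()
        out.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return out
```

The synthetic points lie on segments between existing minority samples, so they enlarge the minority region the classifier sees without simply duplicating observations.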

    Chemometrics and statistical analysis in Raman spectroscopy-based biological investigations

    As mentioned in Chapter 1, chemometrics has become an essential tool in Raman spectroscopy-based biological investigations and has significantly enhanced the sensitivity of Raman spectroscopy-based detection. However, some open issues remain in applying chemometrics to Raman spectroscopy-based biological investigations. An automatic procedure is needed to optimize the parameters of mathematical baseline correction. A spectral reconstruction algorithm is required to recover a fluorescence-free Raman spectrum from the two Raman spectra measured at different excitation wavelengths in the shifted-excitation Raman difference spectroscopy (SERDS) technique. Guidelines are necessary for reliable model optimization and rigorous model evaluation, to ensure high accuracy and robustness in Raman spectroscopy-based biological detection. Computational methods are required to enable a trained model to successfully predict new data that differs significantly from the training data due to inter-replicate variations. These tasks were tackled in this thesis; the related investigations concern three main topics: baseline correction, statistical modeling, and model transfer.
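The mathematical baseline correction in question can be illustrated with a standard iterative polynomial-fitting scheme (a common textbook method, not necessarily the one whose parameters the thesis optimizes; the function name and defaults are illustrative):

```python
import numpy as np

def iterative_polynomial_baseline(y, degree=3, iters=20):
    """Estimate a fluorescence-like baseline under a spectrum.

    Repeatedly fit a polynomial to the spectrum and clip the spectrum to
    the fit, so that Raman peaks stop pulling the baseline estimate up;
    the fit converges towards the smooth lower envelope of the signal.
    """
    x = np.arange(len(y), dtype=float)
    work = np.asarray(y, dtype=float).copy()
    for _ in range(iters):
        coef = np.polyfit(x, work, degree)
        fit = np.polyval(coef, x)
        work = np.minimum(work, fit)
    return fit
```

Subtracting the returned baseline from the raw spectrum leaves the peaks on a roughly flat background; choosing `degree` and `iters` well is exactly the parameter-optimization problem the abstract mentions.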