8 research outputs found

    Covariance and PCA for Categorical Variables

    Covariances between categorical variables are defined using a regular simplex expression for the categories. The method follows Gini's definition of variance and gives the covariance as the solution of simultaneous equations. The calculated results give reasonable values for test data. A method of principal component analysis (RS-PCA) based on regular simplex expressions is also proposed, which allows easy interpretation of the principal components. The proposed methods are applied to a variable selection problem on the categorical USCensus1990 data. They give an appropriate criterion for the variable selection problem of categorical data. Comment: 12 pages, 5 figures
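
    As a rough illustration of the Gini-style variance mentioned in this abstract, the sketch below treats any two different categories as being at distance 1 and takes half the mean pairwise squared distance. The function names are hypothetical, and the scaled one-hot coordinates are only one possible way to place categories on a regular simplex with unit edges; the paper's own construction and its covariance equations are not reproduced here.

```python
from collections import Counter
from math import sqrt


def gini_variance(values):
    """Half the mean pairwise mismatch rate of a categorical sample (assumed form)."""
    n = len(values)
    if n == 0:
        raise ValueError("empty sample")
    counts = Counter(values)
    # (1 / (2 n^2)) * sum over pairs of [x_a != x_b] equals (1 - sum_k p_k^2) / 2
    return 0.5 * (1.0 - sum((c / n) ** 2 for c in counts.values()))


def gini_variance_pairwise(values):
    """Same quantity computed directly from all ordered pairs, as a cross-check."""
    n = len(values)
    mismatches = sum(1 for a in values for b in values if a != b)
    return mismatches / (2.0 * n * n)


def simplex_vertex(index, k):
    """Vertex `index` of a unit-edge regular simplex, using scaled one-hot coordinates."""
    return [1.0 / sqrt(2.0) if i == index else 0.0 for i in range(k)]


if __name__ == "__main__":
    sample = ["a", "a", "b", "b"]
    print(gini_variance(sample), gini_variance_pairwise(sample))
```

    With this choice of coordinates, the squared distance between any two distinct vertices is 1, so the pairwise form above matches the usual numeric variance formula applied to the simplex coordinates.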

    Potential risk factors associated with human encephalitis: application of canonical correlation analysis

    Background: Infection of the CNS is considered to be the major cause of encephalitis, and more than 100 different pathogens have been recognized as causative agents. Although encephalitis is identified worldwide as an important public health concern, studies on it are very few and often focus on particular types (with respect to causative agents) of encephalitis (e.g. West Nile, Japanese, etc.). Moreover, a number of other infectious and non-infectious conditions present with similar symptoms, and distinguishing encephalitis from these other conditions continues to be a challenging task.
    Methods: We used canonical correlation analysis (CCA) to assess associations between a set of exposure variables and a set of symptom and diagnostic variables in human encephalitis. The data consist of 208 confirmed cases of encephalitis from a prospective multicenter study conducted in the United Kingdom. We used a covariance matrix based on Gini's measure of similarity and used permutation-based approaches to test the significance of the canonical variates.
    Results: Only weak pair-wise correlation exists between the risk factor (exposure and demographic) variables and the symptom/laboratory variables. However, the first canonical variate from CCA revealed a strong multivariate correlation (ρ = 0.71, se = 0.03, p = 0.013) between the two sets. We found a moderate correlation (ρ = 0.54, se = 0.02) between the variables in the second canonical variate; however, the value is not statistically significant (p = 0.68). Our results also show that a very small amount of the variation in the symptom sets is explained by the exposure variables. This indicates that host factors, rather than environmental factors, might be important for understanding the etiology of encephalitis and for facilitating early diagnosis and treatment of encephalitis patients.
    Conclusions: There is no standard laboratory diagnostic strategy for investigation of encephalitis, and even experienced physicians are often uncertain about the cause, appropriate therapy and prognosis of encephalitis. Exploration of human encephalitis data using advanced multivariate statistical modelling approaches that can capture the inherent complexity in the data is, therefore, crucial in understanding the causes of human encephalitis. Moreover, application of multivariate exploratory techniques will generate clinically important hypotheses and offer useful insight into the number and nature of variables worthy of further consideration in a confirmatory statistical analysis.
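
    The permutation approach to testing a canonical correlation, as described in this abstract, might be sketched as follows. This is a minimal illustration that assumes ordinary mean-centred numeric data and the standard QR/SVD route to canonical correlations; it does not use the Gini-based covariance matrix from the study, and the function names and default parameters are hypothetical.

```python
import numpy as np


def first_canonical_correlation(X, Y):
    """Largest canonical correlation between the columns of X and Y.

    Assumes both matrices have full column rank after centring.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases of the two column spaces
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # Singular values of Qx^T Qy are the canonical correlations
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(min(s[0], 1.0))


def permutation_p_value(X, Y, n_permutations=999, seed=0):
    """Permutation p-value for the first canonical correlation.

    Rows of Y are shuffled to break any association with X.
    """
    rng = np.random.default_rng(seed)
    observed = first_canonical_correlation(X, Y)
    exceed = 0
    for _ in range(n_permutations):
        perm = rng.permutation(Y.shape[0])
        if first_canonical_correlation(X, Y[perm]) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive
    return observed, (exceed + 1) / (n_permutations + 1)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))
    Y = X[:, :2] + 0.1 * rng.normal(size=(100, 2))
    print(permutation_p_value(X, Y, n_permutations=199))
```

    Shuffling whole rows of one variable set preserves the correlation structure within each set while destroying the association between sets, which is what the null hypothesis of no multivariate correlation requires.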

    Machine Learning Methods for Social Signal Processing


    Partial Least Squares and Principal Component Analysis with Non-metric Variables for Composite Indices

    A composite index is an aggregated variable made up of individual indicators and weights, where the weights represent the relative importance of each indicator. Composite indices are often used to describe latent phenomena or to summarize complex information in a small number of variables. Choosing appropriate weights for the variables that form a composite index is of great importance. Principal component analysis (PCA) is a popular approach for deriving weights, but it is unsuitable when the informative variation accounts for only small variances of the variables in a composite index. This study therefore proposes applying partial least squares (PLS), which exploits the relationship between target variables and the variables in a composite index. Our simulation study shows that PLS performs as well as PCA or outperforms it considerably. In addition, the variables in a composite index are often non-metric in practice. Such variables require special procedures before PCA or PLS can be applied. This study examines several PCA and PLS algorithms for non-metric variables from the existing literature and compares them through extensive simulation studies in order to give recommendations for practice. Dummy coding often shows satisfactory performance compared with more complicated methods. As applications, we consider wealth, globalization, gender equality and corruption, using PCA- and PLS-based composite indices. PLS produces composite indices tailored to the respective target variables, and these often performed better than PCA. A comparison of PCA and PLS weights and coefficients shows which variables are particularly relevant for the respective target variables

    On the Viability of Quantitative Assessment Methods in Software Engineering and Software Services

    IT help desk operations are expensive. Costs associated with IT operations present challenges to profit goals. Help desk managers need a way to plan staffing levels so that labor costs are minimized while problems are resolved efficiently. An incident prediction method is needed for planning staffing levels. The potential value of a solution to this problem is important to an IT service provider since software failures are inevitable and their timing is difficult to predict. In this research, a cost model for help desk operations is developed. The cost model relates predicted incidents to labor costs using real help desk data. Incidents are predicted using software reliability growth models. Cluster analysis is used to group products with similar help desk incident characteristics. Principal Components Analysis is used to determine one product per cluster for the prediction of incidents for all members of the cluster. Incident prediction accuracy is demonstrated using cluster representatives, and is done so successfully for all clusters with accuracy comparable to making predictions for each product in the portfolio. Linear regression is used with cost data for the resolution of incidents to relate incident predictions to help desk labor costs. Following a series of four pilot studies, the cost model is validated by successfully demonstrating cost prediction accuracy for one month prediction intervals over a 22 month period

    The Application of Data Mining Techniques to Learning Analytics and Its Implications for Interventions with Small Class Sizes

    There has been significant progress in the development of techniques to deliver effective technology enhanced learning systems in education, with substantial progress in the field of learning analytics. These analyses are able to support academics in the identification of students at risk of failure or withdrawal. The early identification of students at risk is critical to giving academic staff and institutions the opportunity to make timely interventions. This thesis considers established machine learning techniques, as well as a novel method, for the prediction of student outcomes and the support of interventions, including the presentation of a variety of predictive analyses and of a live experiment. It reviews the status of technology enhanced learning systems and the associated institutional obstacles to their implementation and deployment. Many courses are comprised of relatively small student cohorts, with institutional privacy protocols limiting the data readily available for analysis. It appears that very little research attention has been devoted to this area of analysis and prediction. I present an experiment conducted on a final year university module, with a student cohort of 23, where the data available for prediction is limited to lecture/tutorial attendance, virtual learning environment accesses and intermediate assessments. I apply and compare a variety of machine learning analyses to assess and predict student performance, applied at appropriate points during module delivery. Despite some mixed results, I found potential for predicting student performance in small student cohorts with very limited student attributes, with accuracies comparing favourably with published results using large cohorts and significantly more attributes. I propose that the analyses will be useful to support module leaders in identifying opportunities to make timely academic interventions. Student data may include a combination of nominal and numeric data. 
A large variety of techniques are available to analyse numeric data; however, there are fewer techniques applicable to nominal data. I summarise the results of what I believe to be a novel technique to analyse nominal data by making a systematic comparison of data pairs. In this thesis I have surveyed existing intelligent learning/training systems and explored the contemporary AI techniques which appear to offer the most promising contributions to the prediction of student attainment. I have researched and catalogued the organisational and non-technological challenges to be addressed for successful system development and implementation, and proposed a set of critical success criteria to apply. This dissertation is supported by published work

    Representações euclidianas de dados: uma abordagem para variáveis heterogéneas

    Doctoral thesis, Medicine (Biomathematics), Universidade de Lisboa, Faculdade de Medicina, 2009. Available in the document