109 research outputs found

    Dealing with Intransitivity, Non-Convexity, and Algorithmic Bias in Preference Learning

    Rankings are ubiquitous since they are a natural way to present information to people who are making decisions. There are seemingly countless scenarios where rankings arise, such as deciding whom to hire at a company, determining what movies to watch, purchasing products, understanding human perception, judging science fair projects, voting for political candidates, and so on. In many of these scenarios, the number of items under consideration is prohibitively large, such that asking someone to rank all of the choices is essentially impossible. On the other hand, collecting preference data on a small subset of the items is feasible, e.g., collecting answers to "Do you prefer item A or item B?" or "Is item A closer to item B or item C?". Therefore, an important machine learning task is to learn a ranking of the items based on this preference data. This thesis theoretically and empirically addresses three key challenges of preference learning: intransitivity in preference data, non-convex optimization, and algorithmic bias. Chapter 2 addresses the challenge of learning a ranking given pairwise comparison data that violates the assumptions of rational choice, for example through intransitivity. Our key observation is that two items compared in isolation from other items may be compared based on only a salient subset of features. Formalizing this framework, we propose the salient feature preference model and prove a sample complexity result for learning the parameters of our model and the underlying ranking with maximum likelihood estimation. Chapter 3 addresses the non-convexity of an optimization problem inspired by ordinal embedding, which is a preference learning task. We aim to understand the landscape, that is, the local and global minimizers, of the non-convex objective, which corresponds to the hinge loss arising from quadratic constraints. 
Under certain assumptions, we give necessary conditions for non-global, local minimizers of our objective and additionally show that in two dimensions, every local minimizer is a global minimizer. Chapters 4 and 5 address the challenge of algorithmic bias. We consider training machine learning models that are fair in the sense that their performance is invariant under certain sensitive perturbations to the inputs. For example, the performance of a resume screening system should be invariant under changes to the gender and ethnicity of the applicant. We formalize this notion of algorithmic fairness as a variant of individual fairness. In Chapter 4, we consider classification and develop a distributionally robust optimization approach, SenSR, that enforces this notion of individual fairness during training and provably learns individually fair classifiers. Chapter 5 builds upon Chapter 4. We develop a related algorithm, SenSTIR, to train provably individually fair learning-to-rank (LTR) models. The proposed approach ensures items from minority groups appear alongside similar items from majority groups. This notion of fair ranking is based on the individual fairness definition considered in Chapter 4 for the supervised learning context and is more nuanced than prior fair LTR approaches that simply provide underrepresented items with a basic level of exposure. The crux of our method is an optimal transport-based regularizer that enforces individual fairness and an efficient algorithm for optimizing the regularizer.
PhD, Applied and Interdisciplinary Mathematics, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/166120/1/amandarg_1.pd

    Fuzzy Logic

    Fuzzy logic is becoming an essential method of solving problems in all domains. It has a tremendous impact on the design of autonomous intelligent systems. The purpose of this book is to introduce hybrid algorithms, techniques, and implementations of fuzzy logic. The book consists of thirteen chapters highlighting models and principles of fuzzy logic and issues concerning its techniques and implementations. The intended readers of this book are engineers, researchers, and graduate students interested in fuzzy logic systems.

    Reliable statistical modeling of weakly structured information

    The statistical analysis of "real-world" data is often confronted with the fact that most standard statistical methods were developed under some kind of idealization of the data that is often not adequate in practical situations. This concerns, among others, i) the potentially deficient quality of the data, which can arise for example from measurement error, non-response in surveys, or data processing errors, and ii) the scale quality of the data, which is idealized as "the data have some clear scale of measurement that can be uniquely located within the scale hierarchy of Stevens (or that of Narens and Luce or Orth)". Modern statistical methods such as correction techniques for measurement error or robust methods cope with issue i). In the context of missing or coarsened data, imputation techniques and methods that explicitly model the missing/coarsening process are nowadays well-established tools of refined data analysis. Concerning ii), the typical statistical viewpoint is a more pragmatic one: in case of doubt, one simply presumes the strongest scale of measurement that is clearly "justified". In more complex situations, for example in the analysis of ranking data, statisticians often do not worry too much about purely measurement-theoretic reservations, but instead embed the data structure in an appropriate, easy-to-handle space, such as a metric space, and then use all statistical tools available for this space. Against this background, the present cumulative dissertation tries to contribute from different perspectives to the appropriate handling of data that challenge the above-mentioned idealizations. The focus is, on the one hand, on the analysis of interval-valued and set-valued data within the methodology of partial identification and, on the other hand, on the analysis of data with values in a partially ordered set (poset-valued data). 
Further tools of statistical modeling treated in the dissertation are necessity measures in the context of possibility theory and concepts of stochastic dominance for poset-valued data. The present dissertation consists of 8 contributions, which are discussed in detail in the following sections: Contribution 1 analyzes different identification regions for partially identified linear models under interval-valued responses and develops a further kind of identification region (as well as a corresponding estimator). Estimates for the identification regions are compared to each other and also to classical statistical approaches for a data set on wine quality. Contribution 2 deals with logistic regression under coarsened responses, analyzes point-identifying assumptions, and develops likelihood-based estimators for the identified set. The methods are illustrated with data from a wave of the panel study "Labor Market and Social Security" (PASS). Contribution 3 analyzes the combinatorial structure of the extreme points and the edges of a polytope (called credal set or core in the literature) that plays a crucial role in imprecise probability theory. Furthermore, an efficient algorithm for enumerating all extreme points is given and compared to existing standard methods. Contribution 4 develops a quantile concept for data or random variables with values in a complete lattice, which is applied in Contribution 5 to the case of ranking data in the context of a data set on the wisdom-of-the-crowd phenomenon. In Contribution 6, a framework for evaluating the quality of different aggregation functions of social choice theory is developed, which enables an analysis of quality as a function of group-specific homogeneity. In a simulation study, selected aggregation functions, including an aggregation function based on the concepts of Contribution 4 and Contribution 5, are analyzed. 
Contribution 7 supplies a linear program that allows for detecting stochastic dominance for poset-valued random variables, gives proposals for inference and regularization, and generalizes the approach to the general task of optimizing a linear function on a closure system. The generality of the developed methods is illustrated with data examples in the context of multivariate inequality analysis, item impact and differential item functioning in item response theory, the analysis of distributional differences in spatial statistics, and guided regularization in cognitive diagnosis models. Contribution 8 uses concepts of stochastic dominance to establish a descriptive approach for a relational analysis of person ability and item difficulty in the context of multidimensional item response theory. All developed methods have been implemented in the language R ([R Development Core Team, 2014]) and are available from the author upon request. The application examples corroborate the usefulness of the weak types of statistical modeling examined in this thesis, which, beyond their flexibility in dealing with many kinds of data deficiency, can still lead to informative subject-matter conclusions that are then more reliable due to the weak modeling.
The statistical analysis of real collected data is often confronted with the fact that customary standard statistical methods were developed under a strong idealization of the data situation that is often not adequate in practice. This concerns i) the possibly deficient quality of the data, caused for example by measurement error, by systematic non-response in social science surveys, or by errors during data processing, and ii) the scale quality of the data as such: many data situations cannot be placed within the simple scale hierarchies of Stevens (or those of Narens and Luce or Orth). 
More modern statistical procedures, such as measurement error correction techniques or robust methods, attempt to account for the idealization of data quality after the fact. In the context of missing or interval-censored data, imputation procedures for completing missing values and procedures that explicitly model the process generating the coarsened data have become established. With regard to scale quality, statistics usually proceeds rather pragmatically: in case of doubt, the lowest scale level that is clearly justified is chosen. In more complex multivariate situations, such as the analysis of ranking data, which can hardly be forced into Stevens' "corset", one often resorts to the simple idea of embedding the data in a suitable metric space in order to then use all the tools of metric modeling. Against this background, the cumulative dissertation presented here aims to contribute, from different perspectives, to the adequate handling of data that challenge those idealizations. On the side of deficient data quality, the focus is primarily on the analysis of interval-valued and set-valued data by means of partial identification, while with regard to scale quality the case of lattice-valued data is treated. As further tools of statistical modeling, necessity measures within the framework of imprecise probabilities and concepts of stochastic dominance for random variables with values in a partially ordered set are considered in particular. The present dissertation comprises 8 contributions, which are discussed in more detail in the following chapters: Contribution 1 analyzes different identification regions for partially identified linear models with an interval-valued observed response variable and proposes a new identification region (including an estimator). 
For a data set describing the quality of various red wines, as given by expert judgments, as a function of various physicochemical properties, estimates of the identification regions are analyzed. The results are also compared with those of classical methods for interval data. Contribution 2 treats logistic regression under a coarsened response variable, analyzes point-identifying assumptions, and develops likelihood-based estimators for the corresponding identification regions. The method is illustrated with data from a wave of the panel study "Labor Market and Social Security" (PASS). Contribution 3 analyzes the combinatorial structure of the extreme points and edges of a polytope (the so-called structure or core of an interval probability or of a non-additive set function) that is of essential importance in many areas of imprecise probability theory. An efficient algorithm for enumerating all extreme points is also given and compared with existing standard enumeration methods. Contribution 4 presents a quantile concept for lattice-valued data or random variables. In Contribution 5, this quantile concept is applied to ranking data in connection with a data set investigating the "wisdom of the crowd" phenomenon. Contribution 6 develops a method for the probabilistic analysis of the "quality" of different aggregation functions of social choice theory. The analysis is carried out as a function of the homogeneity of the groups considered. In a simulation-based study, various classical aggregation functions, as well as a new aggregation function based on Contributions 4 and 5, are compared by way of example. Contribution 7 presents an approach for checking whether stochastic dominance holds between two random variables. The approach uses linear programming techniques. 
Furthermore, proposals for statistical inference and regularization are made. The method is then also extended to the more general case of optimizing a linear function on a closure system. The flexible applicability is illustrated by various application examples. Contribution 8 uses ideas of stochastic dominance to analyze data sets from multidimensional item response theory relationally, by developing pairs of mutually empirically supporting ability relations of the persons and difficulty relations of the items. All developed methods were implemented in R ([R Development Core Team, 2014]). The application examples show the flexibility of the methods of relational or "weak" modeling considered here, in particular for handling deficient data, and underline the fact that nontrivial subject-matter conclusions are often still possible with methods of weak modeling, conclusions that are then also far more robust because of the more cautious modeling.

    A Ubiquitous Framework for Statistical Ranking Systems

    Ranking systems are everywhere. The thesis will often select sports as its motivating applications, given their accessibility; however, schools and universities, the harms of drugs, and the quality of wines are all ranked, arguably with far greater importance. As such, the methodology is kept necessarily general throughout. In this thesis, a novel conceptual framework for statistical ranking systems is proposed, which separates ranking methodology into two distinct classes: absolute systems and relative systems. Part I of the thesis deals with absolute systems, with a large portion of the methodology centred on extreme value theory. The methodology is applied to elite swimming, and a statistical ranking system is developed which ranks swimmers, based initially on their personal best times, across different swimming events. A challenge when using extreme value theory in practice is the small number of extreme data, which are by definition rare. By introducing a continuous data-driven covariate, the swim-time can be adjusted for the distance, gender category, or stroke, thereby allowing all data across all 34 individual events to be pooled into a single model. This results in more efficient inference, and therefore more precise estimates of physical quantities, such as the fastest time possible to swim a particular event. Further increasing inference efficiency, the model is then expanded to include data comprising all the performances of each swimmer, rather than just personal bests. The data therefore have a longitudinal structure, also known as panel data, containing repeated measurements from multiple independent subjects. This work serves as the first attempt at statistical modelling of the extremes of longitudinal data in general and the unique forms of dependence that naturally arise due to the structure of the data. 
The model can capture a range of extremal dependence structures (asymptotic dependence and asymptotic independence), with this characteristic determined by the data. With this longitudinal model, inference can be made about the careers of individual swimmers, such as the probability that an individual will break the world record or swim the fastest time next year. In Part II, the thesis then addresses relative systems. Here, the focus is on incorporating intransitivity into statistical ranking systems. In transitive systems, an object A ranked higher than B implies that A is expected to exhibit preference over B. This is not true in intransitive systems, where pairwise relationships can differ from what is expected from the underlying rankings alone. In some intransitive systems, a single underlying and unambiguous ranking may not even exist. The seminal Bradley-Terry model is expanded on to allow for intransitivity, and then applied to baseball data as a motivating example. It is found that baseball does indeed contain intransitive elements, and those pairs of teams exhibiting the largest degree of intransitivity are identified. Including intransitivity improves prediction performance for future pairwise comparisons. The thesis ultimately concludes by harmonising the two parts, acknowledging that in reality, there is always some relative element to an absolute system. Forging the armistice between these system types could enflame research into the areas connecting them, which until now remains barren.
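
For orientation, the standard (transitive) Bradley-Terry model that this abstract takes as its starting point assigns each item i a positive strength p_i and sets P(i beats j) = p_i / (p_i + p_j). The sketch below fits those strengths with the usual fixed-point iteration on a win-count matrix. It is only an illustration of the baseline model, not the intransitive extension developed in the thesis; the function names `fit_bradley_terry` and `win_probability` and the example counts are hypothetical.

```python
def fit_bradley_terry(wins, n_iter=200, tol=1e-10):
    """Estimate Bradley-Terry strengths from a win-count matrix.

    wins[i][j] is the number of times item i beat item j (illustrative input).
    Returns strengths normalised to sum to 1.
    """
    n = len(wins)
    p = [1.0 / n] * n  # start from equal strengths
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each comparison between i and j contributes 1 / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = sum(new_p)
        new_p = [x / scale for x in new_p]
        converged = max(abs(a - b) for a, b in zip(p, new_p)) < tol
        p = new_p
        if converged:
            break
    return p


def win_probability(p, i, j):
    """Model probability that item i beats item j."""
    return p[i] / (p[i] + p[j])


if __name__ == "__main__":
    # Hypothetical counts for three teams; not data from the thesis.
    example_wins = [[0, 7, 9], [3, 0, 6], [1, 4, 0]]
    strengths = fit_bradley_terry(example_wins)
    print(strengths, win_probability(strengths, 0, 2))
```

Because every pairwise probability here is derived from a single strength per item, the fitted model is transitive by construction, which is the restriction the thesis relaxes.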

    Pairwise relations in principle and in practice

    Pairwise relations are ubiquitous. They occur in any population where individual items are identifiable and interact somehow. Based on these pairwise interactions, it is often of interest to discern a rating (a value reflecting the degree to which an item has some quality) or a ranking (an ordering with respect to some quality), or to identify communities of items and the nature of those. In this thesis, I am primarily interested in examples where we might be guided in this practice by something other than purely pragmatic concerns of computational efficiency, predictive ability or familiarity, but rather by more principled motivations. The thesis comprises five chapters, each an independent reflection related to this subject. The first three chapters deal with aspects of those principled motivations. Chapter 1 looks at motivations for a particular well-known statistical rating model, the Bradley-Terry model. Chapter 2 takes a fundamental question in the philosophy of ranking, looking at the nature of the relation that ranking exercises often seek to model by addressing the philosophical controversy over the transitivity of the ‘better than’ relation. Chapter 3 presents an argument, based in sports philosophy, for principles that should guide the selection of ranking models in the context of competitive sport. In Chapter 4, I turn to an example of the practice of ranking based on pairwise comparison, investigating the statistical measures used to assess the reliability of rating exercises in Comparative Judgement, an educational assessment practice. In Chapter 5, I consider an example of community detection, identifying the relative importance of different factors in the propensity for school rugby union fixtures to exist and thus giving clues as to the nature and extent of the ‘old boy’ network.

    Satisficing: Integrating two traditions


    A diffusion model analysis of transitivity and lexicographic semiorder

    The primary aim of the present dissertation is to examine the underlying cognitive processes of transitivity and lexicographic semiorders. To this end, I apply the diffusion model to preferential choice data, where transitivity or lexicographic semiorders are typically considered to model the choice data. In the literature, transitivity is often associated with rationality, whereas lexicographic semiorders are usually considered an alternative way to make decisions, specifically when the given task seems daunting. Despite their clearly different decision-making processes, little empirical evidence of such different cognitive processes has been reported, so I run a diffusion model analysis to provide empirical evidence of the cognitive processes underlying these two models. To do that, I reparameterize the drift rate of the diffusion model in terms of the subjective values (or utilities) of the alternatives. I then conduct a simulation study to test the new diffusion model's ability to recover the data-generating parameter values. Finally, I apply the diffusion model to three sets of real data, one from Cavagnaro and Davis-Stober's (2014) experiment and two from my own experiment. The results imply that people classified as transitive tend to integrate more information to make a decision than those classified as following lexicographic semiorders. More details about the results and implications are discussed in Chapters 4 and 5.
    Includes bibliographical references.
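
To make the idea of a utility-based drift rate concrete, the sketch below simulates single trials of a two-boundary diffusion process. The abstract does not give the exact reparameterization, so the code assumes, purely for illustration, that the drift rate equals the utility difference u_a - u_b. The function name `simulate_trial`, the parameter names, and all default values are hypothetical and are not taken from the dissertation.

```python
import math
import random


def simulate_trial(u_a, u_b, boundary=1.5, noise=1.0, dt=0.001,
                   non_decision=0.3, max_time=10.0, rng=random):
    """Simulate one choice between alternatives A and B.

    Evidence starts midway between the boundaries and drifts toward the
    upper boundary (choose A) or the lower boundary at 0 (choose B).
    """
    drift = u_a - u_b  # ASSUMPTION: drift rate is the utility difference
    x = boundary / 2.0
    t = 0.0
    step_sd = noise * math.sqrt(dt)
    # Euler-Maruyama steps until a boundary is crossed or time runs out.
    while 0.0 < x < boundary and t < max_time:
        x += drift * dt + step_sd * rng.gauss(0.0, 1.0)
        t += dt
    if x >= boundary:
        choice = "A"
    elif x <= 0.0:
        choice = "B"
    else:
        choice = None  # no boundary reached within max_time
    return choice, t + non_decision


if __name__ == "__main__":
    rng = random.Random(1)
    trials = [simulate_trial(2.0, 1.5, rng=rng) for _ in range(500)]
    share_a = sum(1 for c, _ in trials if c == "A") / len(trials)
    mean_rt = sum(rt for _, rt in trials) / len(trials)
    print(share_a, mean_rt)
```

Under this assumption, a larger utility difference produces faster and more consistent choices, while a wider boundary separation corresponds to accumulating more evidence before deciding.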