In the modern era of big data, academic institutions, business organizations and government agencies increasingly need to deal with substantial amounts of heterogeneous data. It has become necessary to develop effective methodologies to extract meaningful insights from this type of data. Among the many available methods, mixture modeling is one of the most popular tools and has been successfully adapted to many scientific domains in recent decades. One of its appealing features is the ability to perform data clustering in a well-principled manner. The importance of mixture models is evident in the plethora of publications on the applied and theoretical aspects of mixture modeling in the statistics and general scientific literature. Fields in which mixture models have been applied with success include economics, astronomy, biology, engineering, psychology, ecology, computer science and neuroscience, among many others in the physical, biological and social sciences.
Our specific contributions to the rich literature on mixture models are as follows. The first chapter provides an application of mixture modeling to a complex dataset of solar flares on the surface of the Sun. Solar flares are sudden explosions of extremely hot plasma that occur where the Sun's magnetic fields erupt from localized areas known as active regions, which are of great interest to physicists. We demonstrate how to explicitly model the heterogeneous patterns of active regions using mixture models. To our knowledge, this approach has not yet been pursued in the space weather literature. Because energetic solar flares are extremely rare compared to low-energy flares, which occur orders of magnitude more frequently, statistical inference for this type of data needs to address the data imbalance issue. Another contribution of our work is therefore to show how to deal with the imbalance problem within the Expectation-Maximization framework. In the second chapter, we extend an existing identifiability result for well-specified finite mixture models to a setting where the underlying mixture density comes from two different kernel families. This setting is motivated by the fact that many datasets in scientific domains consist of a signal component and a background component. In the latter part of the second chapter, we provide theoretical results on the behavior of mixture models under misspecification. The results begin with the setting of a single Student-t or normal distribution. We then move to the main result, which is specific to the setting where the data population is a mixture of two Student-t distributions but the statistician chooses to model it as a mixture of two normal distributions. The third chapter uses simulation studies to continue the story from the second chapter. Simulation studies are computer experiments that involve creating data by pseudo-random sampling from known probability distributions.
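As a rough illustration of what such a simulation study can look like, the sketch below draws data from a mixture of two Student-t distributions, fits a misspecified two-component normal mixture with a plain EM algorithm, and averages the difference between the estimates and the known generating values. All function names, the sample sizes, the degrees of freedom, the locations and the mixing weight are hypothetical choices made for this example; they are not taken from the thesis, and the EM routine is a textbook version rather than the procedure used in the chapters.

```python
import numpy as np


def sample_t_mixture(n, weight, locs, scale, df, rng):
    """Draw n points from a two-component location-scale Student-t mixture.

    A point comes from the first component (location locs[0]) with
    probability `weight`, otherwise from the second (location locs[1]).
    """
    from_first = rng.random(n) < weight
    noise = rng.standard_t(df, size=n) * scale
    return np.where(from_first, locs[0], locs[1]) + noise


def fit_two_normal_em(x, n_iter=200):
    """Fit a two-component normal mixture by plain EM.

    Returns (weight of component 0, means, standard deviations).
    Component 0 is initialised at the lower quartile, so it tracks
    the left-hand cluster when the clusters are well separated.
    """
    w = 0.5
    mu = np.quantile(x, [0.25, 0.75])
    sd = np.array([x.std(), x.std()])
    for _ in range(n_iter):
        # E-step: responsibilities, computed from log densities for stability
        log_p0 = np.log(w) - np.log(sd[0]) - 0.5 * ((x - mu[0]) / sd[0]) ** 2
        log_p1 = np.log1p(-w) - np.log(sd[1]) - 0.5 * ((x - mu[1]) / sd[1]) ** 2
        diff = np.clip(log_p1 - log_p0, -700.0, 700.0)
        r0 = 1.0 / (1.0 + np.exp(diff))
        r1 = 1.0 - r0
        # M-step: weighted updates of weight, means and standard deviations
        w = r0.mean()
        mu = np.array([np.sum(r0 * x) / r0.sum(), np.sum(r1 * x) / r1.sum()])
        var0 = np.sum(r0 * (x - mu[0]) ** 2) / r0.sum()
        var1 = np.sum(r1 * (x - mu[1]) ** 2) / r1.sum()
        sd = np.sqrt(np.array([var0, var1]))
    return w, mu, sd


def run_study(n_rep=20, n=2000, seed=0):
    """Estimate the bias of the fitted weight and locations by Monte Carlo."""
    rng = np.random.default_rng(seed)
    true_w, true_locs = 0.4, np.array([-3.0, 3.0])  # hypothetical "truth"
    w_hats, mu_hats = [], []
    for _ in range(n_rep):
        x = sample_t_mixture(n, true_w, true_locs, scale=1.0, df=5, rng=rng)
        w_hat, mu_hat, _ = fit_two_normal_em(x)
        w_hats.append(w_hat)
        mu_hats.append(mu_hat)
    bias_w = np.mean(w_hats) - true_w
    bias_mu = np.mean(mu_hats, axis=0) - true_locs
    return bias_w, bias_mu


if __name__ == "__main__":
    bias_w, bias_mu = run_study()
    print("estimated bias of mixing weight:", bias_w)
    print("estimated bias of locations:", bias_mu)
```

Because the generating values are fixed in `run_study`, the averaged differences give a direct Monte Carlo estimate of the bias. Changing the degrees of freedom, the separation between the locations or the mixing weight gives a simple way to explore how the misspecified fit behaves.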
A key advantage of simulation studies is that some “truth” (about some parameters of interest) is known from the process of generating the data. This allows us to examine statistical properties such as bias in a relatively straightforward fashion. In this chapter, the bias behavior of the mixture locations and the mixing weight is studied for scenarios where an analytical analysis of the bias is difficult to obtain.
PhD, Statistics, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/178081/1/vietdo_1.pd