Data mining techniques must be developed and applied to analyse the large
public databases containing hundreds to thousands of millions of entries. The aim
of this study is to develop methods for locating previously unknown stellar
clusters in the UKIDSS Galactic Plane Survey catalogue data. Cluster
candidates are searched for computationally in pre-filtered catalogue data with
a method that fits a mixture model of Gaussian densities and background noise
using the Expectation Maximization algorithm. The catalogue data contains a
significant number of false sources clustered around bright stars. A large
fraction of these artefacts were automatically filtered out before or during
the cluster search. The UKIDSS data reduction pipeline tends to classify
marginally resolved stellar pairs and objects seen against variable surface
brightness as extended objects (or "galaxies" in the archive parlance). Of the
sources in the UKIDSS GPS catalogue brighter than magnitude 17 in the K band,
10% (66 x 10^6) are classified as "galaxies". Young embedded clusters
create variable NIR surface brightness because the gas/dust clouds in which
they were formed scatter the light from the cluster members. Such clusters
therefore appear as clusters of "galaxies" in the catalogue and can be found
using only a subset of the catalogue data. The detected "galaxy clusters" were
finally screened visually to eliminate the remaining false detections due to
data artefacts. Besides the embedded clusters, the search also located sites
of non-clustered embedded star formation. The search covered an area of 1302
square degrees, and 137 previously unknown cluster candidates and 30 previously
unknown sites of star formation were found.
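
To illustrate the kind of fit described above, the following is a minimal sketch of an Expectation Maximization loop for a mixture of one two-dimensional Gaussian and a uniform background. It is not the pipeline used in this study. The single Gaussian component, the uniform background over the bounding box of the data, the histogram-based initialisation, the synthetic test field, and all names (such as `fit_cluster_plus_background`) are assumptions made for illustration only.

```python
import numpy as np


def fit_cluster_plus_background(xy, n_iter=200):
    """EM fit of one 2D Gaussian plus a uniform background (illustrative only)."""
    xy = np.asarray(xy, dtype=float)
    n = len(xy)

    # Assumption: the background is uniform over the bounding box of the data.
    lo, hi = xy.min(axis=0), xy.max(axis=0)
    bg_density = 1.0 / np.prod(hi - lo)

    # Assumption: start the Gaussian at the densest cell of a coarse histogram.
    counts, xe, ye = np.histogram2d(xy[:, 0], xy[:, 1], bins=10)
    i, j = np.unravel_index(counts.argmax(), counts.shape)
    mean = np.array([0.5 * (xe[i] + xe[i + 1]), 0.5 * (ye[j] + ye[j + 1])])
    cov = np.diag([(xe[1] - xe[0]) ** 2, (ye[1] - ye[0]) ** 2])
    weight = 0.1  # initial guess for the fraction of sources in the cluster

    for _ in range(n_iter):
        # E-step: probability that each source belongs to the Gaussian.
        diff = xy - mean
        maha = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
        gauss = np.exp(-0.5 * maha) / (2 * np.pi * np.sqrt(np.linalg.det(cov)))
        resp = weight * gauss / (weight * gauss + (1 - weight) * bg_density)

        # M-step: update weight, mean and covariance from the responsibilities.
        total = resp.sum()
        weight = total / n
        mean = resp @ xy / total
        diff = xy - mean
        cov = np.einsum("i,ij,ik->jk", resp, diff, diff) / total
        cov += 1e-12 * np.eye(2)  # small ridge to keep the matrix invertible

    return weight, mean, cov, resp


# Synthetic field: 400 uniform background sources plus a 100-source cluster.
rng = np.random.default_rng(42)
background = rng.uniform(0.0, 1.0, size=(400, 2))
true_center = np.array([0.62, 0.37])
cluster = true_center + 0.02 * rng.standard_normal((100, 2))
xy = np.vstack([background, cluster])

weight, mean, cov, resp = fit_cluster_plus_background(xy)
print("cluster fraction:", weight)
print("cluster centre:", mean)
```

The array `resp` holds, for each source, the estimated probability of belonging to the Gaussian component rather than to the background. A real search would need several components, a treatment of the artefacts around bright stars, and a criterion for accepting a candidate, none of which this sketch attempts.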