    Predicting Good Configurations for GitHub and Stack Overflow Topic Models

    Software repositories contain large amounts of textual data, ranging from source code comments and issue descriptions to questions, answers, and comments on Stack Overflow. To make sense of this textual data, topic modelling is frequently used as a text-mining tool for the discovery of hidden semantic structures in text bodies. Latent Dirichlet allocation (LDA) is a commonly used topic model that aims to explain the structure of a corpus by grouping texts. LDA requires multiple parameters to work well, and there are only rough and sometimes conflicting guidelines available on how these parameters should be set. In this paper, we contribute (i) a broad study of parameters to arrive at good local optima for GitHub and Stack Overflow text corpora, (ii) an a-posteriori characterisation of text corpora related to eight programming languages, and (iii) an analysis of corpus feature importance via per-corpus LDA configuration. We find that (1) popular rules of thumb for topic modelling parameter configuration are not applicable to the corpora used in our experiments, (2) corpora sampled from GitHub and Stack Overflow have different characteristics and require different configurations to achieve good model fit, and (3) we can predict good configurations for unseen corpora reliably. These findings support researchers and practitioners in efficiently determining suitable configurations for topic modelling when analysing textual data contained in software repositories. Comment: to appear as full paper at MSR 2019, the 16th International Conference on Mining Software Repositories

    Data mining based cyber-attack detection

    Fracture toughness testing data: A technology survey

    Technical abstracts are presented for about 90 significant documents relating to fracture toughness testing of various structural materials, including information on plane strain and on the developing areas of mixed-mode and plane-stress test conditions. An overview of the state of the art represented in the abstracted documents is included. The abstracts in the report are mostly for publications from the period April 1962 through April 1974. The purpose of this report is to provide, in quick-reference form, a dependable source of current information in the subject field.

    Graph Theory and Networks in Biology

    In this paper, we present a survey of the use of graph theoretical techniques in Biology. In particular, we discuss recent work on identifying and modelling the structure of bio-molecular networks, as well as the application of centrality measures to interaction networks and research on the hierarchical structure of such networks and network motifs. Work on the link between structural network properties and dynamics is also described, with emphasis on synchronization and disease propagation. Comment: 52 pages, 5 figures, Survey Paper
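
    As an illustration of the centrality measures mentioned in the abstract above, the following minimal sketch computes normalised degree centrality for a small undirected interaction network. The edge list and node names are hypothetical and are not taken from the survey; degree centrality is only one of several measures such a survey might cover.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Normalised degree centrality for an undirected simple graph.

    Each node's score is its number of distinct neighbours divided by
    (n - 1), where n is the number of nodes that appear in `edges`.
    """
    neighbours = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # self-loops do not count towards the degree
        neighbours[u].add(v)
        neighbours[v].add(u)
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}


# Hypothetical toy interaction network (illustrative node names only).
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
centrality = degree_centrality(edges)
hub = max(centrality, key=centrality.get)  # most connected node
```

    In this toy network, node A interacts with three of the four other nodes and is therefore the hub.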

    Soil, grain and water chemistry and human selenium imbalances in Enshi district, Hubei Province, China

    Many elements which are essential to human and other animal health in small doses can be toxic if ingested in excess. Selenium (Se), a naturally occurring metalloid element, is found in all natural materials on earth including rocks, soils, waters, air, plant and animal tissues. Since the early 1930s, it has been recognised that Se toxicity causes hoof disorders and hair loss in livestock. Se was also identified as an essential trace element to humans and other animals in the late 1950s. It forms a vital constituent of the biologically important enzyme glutathione peroxidase, which acts as an anti-oxidant preventing cell degeneration. Se deficiency has been implicated in the aetiology of several diseases including cancer, muscular dystrophy, muscular sclerosis and cystic fibrosis. Se can be assimilated in humans through several pathways including food, drinking water and inhalation of Se-bearing particles from the atmosphere. In the majority of situations, food is the most important source of Se, as levels in water are very low. The narrow range between deficiency levels (<40 µg per day) and toxic levels in susceptible people (>900 µg per day) makes it necessary to carefully control the amount of Se in the diet. In China, Se deficiency has been linked to an endemic degenerative heart disease known as Keshan Disease (KD) and an endemic osteoarthropathy which causes deformity of affected joints, known as Kaschin-Beck Disease. These diseases occur in a geographic belt stretching from Heilongjiang Province in north-east China to Yunnan Province in the south-west. In the period between 1959 and 1970, peak KD incidence rates exceeded 40 per 100 000 (approximately 8500 cases per annum) with 1400-3000 deaths recorded each year. Incidence rates have since fallen to less than 5 per 100 000 with approximately 1000 new cases reported annually (Levander, 1986).
    Se toxicity (selenosis), resulting in hair and nail loss and disorders of the nervous system in the human population, has also been recorded in Enshi District, Hubei Province and in Ziyang County, Shanxi Province. China possesses one of the best epidemiological databases in the world on Se-related diseases, which has been used in conjunction with geochemical data to demonstrate a significant geochemical control on human Se exposure. However, the precise geographical areas at risk and the geochemical controls on selenium availability have yet to be established.

    NDE: An effective approach to improved reliability and safety. A technology survey

    Technical abstracts are presented for about 100 significant documents relating to nondestructive testing of aircraft structures or related structural testing and the reliability of the more commonly used evaluation methods. Particular attention is directed toward acoustic emission, liquid penetrant, magnetic particle, ultrasonics, eddy current, and radiography. The introduction of the report includes an overview of the state of the art represented in the abstracted documents.

    AKARI Infrared Camera Survey of the Large Magellanic Cloud. I. Point Source Catalog

    We present a near- to mid-infrared point source catalog of 5 photometric bands at 3.2, 7, 11, 15 and 24 µm for a 10 deg² area of the Large Magellanic Cloud (LMC) obtained with the Infrared Camera (IRC) onboard the AKARI satellite. To cover the survey area, the observations were carried out in 3 separate seasons: 2006 May to June, 2006 October to December, and 2007 March to July. The 10-sigma limiting magnitudes of the present survey are 17.9, 13.8, 12.4, 9.9, and 8.6 mag at 3.2, 7, 11, 15 and 24 µm, respectively. The photometric accuracy is estimated to be about 0.1 mag at 3.2 µm and 0.06-0.07 mag in the other bands. The position accuracy is 0.3" at 3.2, 7 and 11 µm and 1.0" at 15 and 24 µm. The sensitivities at 3.2, 7, and 24 µm are roughly comparable to those of the Spitzer SAGE LMC point source catalog, while the AKARI catalog provides the data at 11 and 15 µm, covering the mid-infrared spectral range contiguously. Two types of catalog are provided: a Catalog and an Archive. The Archive contains all the detected sources, while the Catalog only includes the sources that have a counterpart in the Spitzer SAGE point source catalog. The Archive contains about 650,000, 140,000, 97,000, 43,000, and 52,000 sources at 3.2, 7, 11, 15, and 24 µm, respectively. Based on the catalog, we discuss the luminosity functions at each band, the color-color diagram, and the color-magnitude diagram using the 3.2, 7, and 11 µm band data. Stars without circumstellar envelopes, dusty C-rich and O-rich stars, young stellar objects, and background galaxies are located at distinct regions in the diagrams, suggesting that the present catalog is useful for the classification of objects towards the LMC. Comment: 59 pages, 12 figures, accepted for the Astronomical Journal
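
    To put the limiting magnitudes quoted above in perspective, the sketch below applies the standard Pogson relation, under which a difference of 5 magnitudes within one band corresponds to a flux ratio of 100. The function name and the 12.9 mag example source are hypothetical; only the 17.9 mag limit at 3.2 µm comes from the abstract. The comparison is only meaningful within a single band, since each band has its own zero point.

```python
def flux_ratio(mag_a, mag_b):
    """Flux of source A divided by flux of source B, both in the same band.

    Uses the Pogson relation: m_a - m_b = -2.5 * log10(F_a / F_b).
    Smaller magnitudes mean brighter sources.
    """
    return 10 ** (-0.4 * (mag_a - mag_b))


# 10-sigma limiting magnitude at 3.2 µm, as quoted in the abstract.
LIMIT_3_2_UM = 17.9

# Hypothetical source that is 5 mag brighter than the survey limit.
example_mag = 12.9
ratio = flux_ratio(example_mag, LIMIT_3_2_UM)  # roughly 100 times the limiting flux
```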

    Galaxy And Mass Assembly (GAMA): end of survey report and data release 2

    The Galaxy And Mass Assembly (GAMA) survey is one of the largest contemporary spectroscopic surveys of low redshift galaxies. Covering an area of ~286 deg² (split among five survey regions) down to a limiting magnitude of r < 19.8 mag, we have collected spectra and reliable redshifts for 238 000 objects using the AAOmega spectrograph on the Anglo-Australian Telescope. In addition, we have assembled imaging data from a number of independent surveys in order to generate photometry spanning the wavelength range 1 nm-1 m. Here, we report on the recently completed spectroscopic survey and present a series of diagnostics to assess its final state and the quality of the redshift data. We also describe a number of survey aspects and procedures, or updates thereof, including changes to the input catalogue, redshifting and re-redshifting, and the derivation of ultraviolet, optical and near-infrared photometry. Finally, we present the second public release of GAMA data. In this release, we provide input catalogue and targeting information, spectra, redshifts, ultraviolet, optical and near-infrared photometry, single-component Sérsic fits, stellar masses, Hα-derived star formation rates, environment information, and group properties for all galaxies with r < 19.0 mag in two of our survey regions, and for all galaxies with r < 19.4 mag in a third region (72 225 objects in total). The data base serving these data is available at http://www.gama-survey.org/
