2,062 research outputs found

    "Medium" LMs of Code in the Era of LLMs: Lessons From StackOverflow

    Large pre-trained neural language models have brought immense progress to both NLP and software engineering. Models in OpenAI's GPT series now dwarf Google's BERT and Meta's RoBERTa, which previously set new benchmarks on a wide range of NLP applications. These models are trained on massive corpora of heterogeneous data from web crawls, which enables them to learn general language patterns and semantic relationships. However, the largest models are both expensive to train and deploy and are often closed-source, so we lack access to their data and design decisions. We argue that this trend towards large, general-purpose models should be complemented with single-purpose, more modestly sized pre-trained models. In this work, we take StackOverflow (SO) as a domain example in which large volumes of rich aligned code and text data are available. We adopt standard practices for pre-training large language models, including using a very large context size (2,048 tokens), batch size (0.5M tokens) and training set (27B tokens), coupled with a powerful toolkit (Megatron-LM), to train two models: SOBertBase, with 109M parameters, and SOBertLarge, with 762M parameters, at a budget of just $187 and $800, respectively. We compare the performance of our models with both the previous SOTA model trained on SO data exclusively as well as general-purpose BERT models and OpenAI's ChatGPT on four SO-specific downstream tasks: question quality prediction, closed question prediction, named entity recognition and obsoletion prediction (a new task we introduce). Not only do our models consistently outperform all baselines, the smaller model is often sufficient for strong results. Both models are released to the public. These results demonstrate that pre-training both extensively and properly on in-domain data can yield a powerful and affordable alternative to leveraging closed-source general-purpose models.

    A new mixture model approach to analyzing allelic-loss data using Bayes factors

    BACKGROUND: Allelic-loss studies record data on the loss of genetic material in tumor tissue relative to normal tissue at various loci along the genome. As the deletion of a tumor suppressor gene can lead to tumor development, one objective of these studies is to determine which, if any, chromosome arms harbor tumor suppressor genes. RESULTS: We propose a large class of mixture models for describing the data, and we suggest using Bayes factors to select a reasonable model from the class in order to classify the chromosome arms. Bayes factors are especially useful in the case of testing that the number of components in a mixture model is n_0 versus n_1. In these cases, frequentist test statistics based on the likelihood ratio statistic have unknown distributions and are therefore not applicable. Our simulation study shows that Bayes factors favor the right model most of the time when tumor suppressor genes are present. When no tumor suppressor genes are present and background allelic-loss varies, the Bayes factors are often inconclusive, although this results in a markedly reduced false-positive rate compared to that of standard frequentist approaches. Application of our methods to three data sets of esophageal adenocarcinomas yields interesting differences from those results previously published. CONCLUSIONS: Our results indicate that Bayes factors are useful for analyzing allelic-loss data.

    A Swarm-based Approach To Medical Image Analysis

    Image segmentation is an indispensable part of the visualization of human tissues, particularly during analysis of Magnetic Resonance (MR) images. Unfortunately, images always contain a significant amount of noise due to operator performance, equipment, and the environment, which can lead to serious inaccuracies in segmentation. A segmentation technique based on an extension to the traditional fuzzy C-means (FCM) clustering algorithm is proposed in this paper. A neighborhood attraction, which depends on the relative location and features of neighboring pixels, is considered. The degree of attraction is optimized by a Particle Swarm Optimization model. The paper demonstrates the superiority of the proposed technique over the FCM-based method. This segmentation method is a component of an MR image-based classification system for tumors, currently being developed.
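
For orientation, the baseline that the abstract extends can be sketched as follows. This is a minimal sketch of plain fuzzy C-means on scalar pixel intensities only; it does not implement the paper's neighborhood attraction or its Particle Swarm Optimization step, and the function name `fcm_1d`, the fuzzifier `m = 2`, and the sample intensities are assumptions chosen for illustration.

```python
def fcm_1d(values, centers, m=2.0, iterations=50, eps=1e-9):
    """Plain fuzzy C-means on scalar intensities.

    Baseline only: no neighborhood attraction and no PSO (assumed
    simplification, not the method proposed in the abstract).
    """
    centers = list(centers)
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(iterations):
        # Membership update: u_ik = 1 / sum_j (d_ik / d_jk)^(2 / (m - 1)).
        # eps keeps distances strictly positive when a value equals a center.
        memberships = []
        for x in values:
            dists = [abs(x - c) + eps for c in centers]
            row = [1.0 / sum((d_i / d_j) ** exponent for d_j in dists)
                   for d_i in dists]
            memberships.append(row)
        # Center update: mean of the values weighted by u_ik^m.
        for i in range(len(centers)):
            weights = [row[i] ** m for row in memberships]
            centers[i] = sum(w * x for w, x in zip(weights, values)) / sum(weights)
    return centers, memberships


# Hypothetical intensities: three dark pixels and three bright pixels.
pixels = [10, 12, 11, 200, 205, 198]
centers, memberships = fcm_1d(pixels, centers=[0.0, 255.0])
```

Each row of `memberships` holds one pixel's degree of belonging to every cluster and sums to 1, which is the property a neighborhood-attraction term would then adjust using nearby pixels.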

    Odontogenic tumors: a study of 120 cases in an Indian teaching hospital

    Objective: Studies on odontogenic tumors published from many parts of the world show a distinct geographic variation; however, there is little information available in the English-language literature on the relative frequency of odontogenic tumors in India. This retrospective study was designed to determine the relative frequency of odontogenic tumors in an Indian population and compare it with various reports from other parts of the world. Study design: The histopathology records of the Department of Oral Pathology and Microbiology of Government Dental College and Hospital, Mumbai were retrieved retrospectively for the period from January 2001 to July 2010. A total of 120 lesions classified as odontogenic tumors were reviewed. These were analyzed for age, gender, site of tumor and histopathologic typing. The criteria used were those of the World Health Organization (WHO) 2005 classification. The mandible and maxilla were divided into 4 anatomic regions, and the distribution of each odontogenic tumor among these regions was recorded and analyzed. Results: A total of 120 cases of odontogenic tumors were reported in this period. Odontogenic tumors in the present study constituted 5.78% of all the 2075 registered biopsies. The most frequent histological type was ameloblastoma (40.83%), followed by keratocystic odontogenic tumor (37.5%), odontome (11.66%) and adenomatoid odontogenic tumor (5.8%). In general, the odontogenic tumors showed a predilection for the mandible and the posterior regions of the jaws. Ameloblastomas occurred with a marked predilection for the mandible, while adenomatoid odontogenic tumor showed a predilection for the maxilla, anterior regions of the jaws, and young females. Conclusion: A frequency of 5.78% of odontogenic tumors was observed in this study. Ameloblastoma was the single most common of all odontogenic tumors. This study observed geographic variations in the frequency and distribution of odontogenic tumors. © Medicina Oral S. L.

    Bulk Current Injection Testing of Cable Noise Reduction Techniques, 50 kHz to 400 MHz

    This paper presents empirical results of cable noise reduction techniques as demonstrated using bulk current injection (BCI) techniques with radiated fields from 50 kHz to 400 MHz. It is a follow-up to the two-part paper series presented at the Asia Pacific EMC Conference that focused on TEM cell signal injection. This paper discusses the effects of cable types, shield connections, and chassis connections on cable noise. For each topic, well-established theories are compared with data from a real-world physical system.

    An Examplar Based Video Inpainting using Dictionary Based Method

    Inpainting is the technique of rebuilding lost or selected parts of an image based on related or available information. Reconstruction of missing parts in videos is used extensively nowadays. A method for video inpainting using exemplar-based inpainting is introduced in the system. Exemplar-based inpainting samples and copies the best-matching texture patches using texture synthesis. Matching patches are extracted from the known parts of the video frames. Input frames are extracted and inpainted using the exemplar-based method. For this purpose, a dictionary consisting of legal patches is maintained. The input picture is inpainted several times with different parameters. The results are then combined and details are recovered to produce the final inpainted video.

    Formononetin Treatment in Type 2 Diabetic Rats Reduces Insulin Resistance and Hyperglycemia

    Type 2 diabetes mellitus is a multifactorial metabolic disorder affecting a large population around the world. There is an urgent unmet need for new, cost-effective treatment strategies for type 2 diabetes mellitus with few or no side effects. Phenolic compounds, including isoflavones, are known for their beneficial effects in metabolic disorders. The present work was intended to determine the efficacy of treatment with formononetin, an isoflavone, in an experimental model of type 2 diabetes. Type 2 diabetes mellitus was induced by feeding a high-fat diet for 2 weeks prior to streptozotocin administration in Sprague Dawley rats. Diabetic animals were treated orally with formononetin for 28 days at three dose levels, i.e., 10, 20, and 40 mg/kg body weight. The effect of formononetin treatment on various parameters such as plasma glucose, glucose tolerance, insulin, HOMA-IR, lipid profile, hepatic glycogen content, glycohaemoglobin and SIRT1 expression in pancreatic tissue was measured. Histopathological changes in pancreatic tissue were also studied. The results of the study demonstrate that formononetin treatment reduces blood glucose level significantly (p < 0.001) at all three dose levels. It also improved glucose tolerance, insulin sensitivity and lipid profile, along with a reduction in glycohaemoglobin content in blood. Formononetin treatment also improved hepatic glycogen level profoundly in diabetic rats. Determination of SIRT1 expression in pancreatic tissue by immunohistochemical analysis showed that formononetin treatment increases the expression of SIRT1 in pancreatic tissue. Histopathological study showed that treatment with formononetin protects pancreatic beta cells from necro-degeneration and atrophic effects. It can be concluded that formononetin treatment reduces insulin resistance and attenuates hyperglycemia in type 2 diabetes, which may be due to increased expression of SIRT1 in pancreatic tissue.
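
The abstract lists HOMA-IR among the measured parameters without saying how it was computed. As a minimal sketch, assuming the widely used homeostasis model assessment formula (fasting glucose in mmol/L times fasting insulin in µU/mL, divided by 22.5), the index could be calculated as below. The function name and the sample values are hypothetical and are not taken from the study.

```python
def homa_ir(fasting_glucose_mmol_l, fasting_insulin_uU_ml):
    """HOMA-IR = glucose (mmol/L) * insulin (µU/mL) / 22.5.

    Assumed standard formula; the abstract does not state which
    variant the authors used.
    """
    if fasting_glucose_mmol_l <= 0 or fasting_insulin_uU_ml <= 0:
        raise ValueError("glucose and insulin must be positive")
    return fasting_glucose_mmol_l * fasting_insulin_uU_ml / 22.5


# Hypothetical readings, for illustration only.
index = homa_ir(fasting_glucose_mmol_l=5.0, fasting_insulin_uU_ml=9.0)
```

A higher index indicates greater insulin resistance, so a treatment that lowers glucose and insulin together lowers the index.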