Search CORE

7 research outputs found

Challenges and Barriers of Using Low Code Software for Machine Learning

Author: Alamin Md Abdullah Al
Uddin Gias
Publication venue
Publication date: 08/11/2022
Field of study

As big data grows ubiquitous across many domains, more and more stakeholders seek to develop Machine Learning (ML) applications on their data. The success of an ML application usually depends on the close collaboration of ML experts and domain experts. However, the shortage of ML engineers remains a fundamental problem. Low-code Machine learning tools/platforms (aka, AutoML) aim to democratize ML development to domain experts by automating many repetitive tasks in the ML pipeline. This research presents an empirical study of around 14k posts (questions + accepted answers) from Stack Overflow (SO) that contained AutoML-related discussions. We examine how these topics are spread across the various Machine Learning Life Cycle (MLLC) phases and their popularity and difficulty. This study offers several interesting findings. First, we find 13 AutoML topics that we group into four categories. The MLOps topic category (43% questions) is the largest, followed by Model (28% questions), Data (27% questions), Documentation (2% questions). Second, Most questions are asked during Model training (29%) (i.e., implementation phase) and Data preparation (25%) MLLC phase. Third, AutoML practitioners find the MLOps topic category most challenging, especially topics related to model deployment & monitoring and Automated ML pipeline. These findings have implications for all three AutoML stakeholders: AutoML researchers, AutoML service vendors, and AutoML developers. Academia and Industry collaboration can improve different aspects of AutoML, such as better DevOps/deployment support and tutorial-based documentation

arXiv.org e-Print Archive

Innovative Heuristics to Improve the Latent Dirichlet Allocation Methodology for Textual Analysis and a New Modernized Topic Modeling Approach

Author: Zimmerman Jamie T.
Publication venue: AFIT Scholar
Publication date: 01/06/2022
Field of study

Natural Language Processing is a complex method of data mining the vast trove of documents created and made available every day. Topic modeling seeks to identify the topics within textual corpora with limited human input into the process to speed analysis. Current topic modeling techniques used in Natural Language Processing have limitations in the pre-processing steps. This dissertation studies topic modeling techniques, those limitations in the pre-processing, and introduces new algorithms to gain improvements from existing topic modeling techniques while being competitive with computational complexity. This research introduces four contributions to the field of Natural Language Processing and topic modeling. First, this research identifies a requirement for a more robust “stopwords” list and proposes a heuristic for creating a more robust list. Second, a new dimensionality-reduction technique is introduced that exploits the number of words within a document to infer importance to word choice. Third, an algorithm is developed to determine the number of topics within a corpus and demonstrated using a standard topic modeling data set. These techniques produce a higher quality result from the Latent Dirichlet Allocation topic modeling technique. Fourth, a novel heuristic utilizing Principal Component Analysis is introduced that is capable of determining the number of topics within a corpus that produces stable sets of topic words

AFTI Scholar (Air Force Institute of Technology)

Using heuristics to estimate an appropriate number of latent topics in source code analysis

Author: Antoniol
Asuncion
Bartholomew
Blei
Bollen
Bradford
Comon
Cordy
David B. Skillicorn
Grant
Griffiths
Heinrich
Hindle
James R. Cordy
Kuhn
Linstead
Linstead
Lukins
Maletic
Maletic
Marcus
Maskeri
Oliveto
Roy
Roy
Schölkopf
Scott Grant
Steinwart
Thomas
Thomas
Wallach
Publication venue: 'Elsevier BV'
Publication date
Field of study

Crossref

Improving software modularization via automated analysis of latent topics and dependencies

Author: Andrea de Lucia
Anquetil N.
Antoniol G.
Chen T.
Denys Poshyvanyk
Fanta R.
Fowler M.
Gabriele Bavota
Harman M.
Hindle A.
Lanza M.
Lee Y.
Li W.
Malcom Gethers
Maletic J. I.
Mancoridis S.
Panichella A.
Pashov I.
Pressman R.
Rocco Oliveto
Shtern M.
Yourdon E.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date
Field of study

Crossref

A Data Mining Toolbox for Collaborative Writing Processes

Author: Southavilay Vilaythong
Publication venue: Faculty of Engineering and Information Technologies, School of Information Technologies
Publication date: 01/01/2013
Field of study

Collaborative writing (CW) is an essential skill in academia and industry. Providing support during the process of CW can be useful not only for achieving better quality documents, but also for improving the CW skills of the writers. In order to properly support collaborative writing, it is essential to understand how ideas and concepts are developed during the writing process, which consists of a series of steps of writing activities. These steps can be considered as sequence patterns comprising both time events and the semantics of the changes made during those steps. Two techniques can be combined to examine those patterns: process mining, which focuses on extracting process-related knowledge from event logs recorded by an information system; and semantic analysis, which focuses on extracting knowledge about what the student wrote or edited. This thesis contributes (i) techniques to automatically extract process models of collaborative writing processes and (ii) visualisations to describe aspects of collaborative writing. These two techniques form a data mining toolbox for collaborative writing by using process mining, probabilistic graphical models, and text mining. First, I created a framework, WriteProc, for investigating collaborative writing processes, integrated with the existing cloud computing writing tools in Google Docs. Secondly, I created new heuristic to extract the semantic nature of text edits that occur in the document revisions and automatically identify the corresponding writing activities. Thirdly, based on sequences of writing activities, I propose methods to discover the writing process models and transitional state diagrams using a process mining algorithm, Heuristics Miner, and Hidden Markov Models, respectively. Finally, I designed three types of visualisations and made contributions to their underlying techniques for analysing writing processes. All components of the toolbox are validated against annotated writing activities of real documents and a synthetic dataset. I also illustrate how the automatically discovered process models and visualisations are used in the process analysis with real documents written by groups of graduate students. I discuss how the analyses can be used to gain further insight into how students work and create their collaborative documents

Sydney eScholarship