150 research outputs found

    On cross-domain social semantic learning

    Approximately 2.4 billion people are now connected to the Internet, generating massive amounts of data through laptops, mobile phones, sensors and other electronic devices. Not surprisingly, ninety percent of the world's digital data was created in the last two years. This explosion of data provides a tremendous opportunity to study, model and improve the conceptual and physical systems from which the data is produced. It also permits scientists to test pre-existing hypotheses in various fields with large-scale experimental evidence. Developing computational algorithms that automatically explore this data is thus the holy grail of the current generation of computer scientists. Making sense of this data algorithmically is a complex process, for three main reasons. First, the data is generated by different devices, captures different aspects of information and resides on different web resources/platforms. Therefore, even if two pieces of data are conceptually similar, their generation, format and domain of existence on the web can make them appear considerably dissimilar. Second, since humans are social creatures, the data often possesses inherent but obscure correlations, arising from direct or indirect social interactions. This drastically changes what algorithms must achieve: they need an intelligent comprehension of the underlying social nature and semantic contexts within disparate domain data, and a quantifiable way of transferring knowledge gained from one domain to another. Third, the data is often encountered as a stream rather than as static pages on the Internet; we must therefore learn, and re-learn, as the stream propagates.

    The main objective of this dissertation is to develop learning algorithms that can identify specific patterns in one domain of data which can consequently augment predictive performance in another domain. The research explores the existence of specific data domains that can function in synergy with one another and, more importantly, proposes models to quantify the synergetic information transfer among such domains. We include large-scale data from various domains in our study: social media data from Twitter, multimedia video data from YouTube, video search query data from Bing Videos, natural language search queries from the web, Internet resources in the form of web logs (blogs), and spatio-temporal social trends from Twitter. Our work presents a series of solutions to the key challenges in cross-domain learning, particularly for social and semantic data. We propose the concept of bridging media from disparate sources by building a common latent topic space, which represents one of the first attempts to answer sociological problems using cross-domain (social) media. This allows information transfer between social and non-social domains, fostering real-time socially relevant applications. We also engineer a concept network from the semantic web, called semNet, that can assist in identifying concept relations and modeling information granularity for robust natural language search. Further, by studying spatio-temporal patterns in this data, we can discover categorical concepts that stimulate collective attention within user groups.
    Includes bibliographical references (pages 210-214)
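    As one concrete reading of the dissertation's bridging idea, the sketch below pools toy corpora from two domains (tweets and video titles), fits a single topic model over both, and measures cross-domain similarity in the shared latent topic space. The corpora and the use of scikit-learn's LDA are illustrative assumptions, not the author's actual pipeline.

```python
# Minimal sketch: bridge two domains through one shared latent topic space.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

tweets = ["earthquake shakes the city tonight", "new phone camera is amazing"]
video_titles = ["city earthquake footage", "phone camera review and samples"]

# Fit one topic model over the pooled corpora so both domains share a space.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(tweets + video_titles)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Project each domain into the shared topic space.
theta = lda.transform(X)
tweet_topics, video_topics = theta[: len(tweets)], theta[len(tweets):]

# Cross-domain similarity: which video is topically closest to each tweet?
print(cosine_similarity(tweet_topics, video_topics))
```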

    A Systematic Review of Automated Query Reformulations in Source Code Search

    Fixing software bugs and adding new features are two of the major software maintenance tasks. Software bugs and features are reported as change requests. Developers consult these requests and often choose a few keywords from them as an ad hoc query. They then execute the query with a search engine to find the exact locations within the software code that need to be changed. Unfortunately, even experienced developers often fail to choose appropriate queries, which leads to costly trial and error during code search. Over the years, many studies have attempted to reformulate developers' ad hoc queries to support them. In this systematic literature review, we carefully select 70 primary studies on query reformulation from 2,970 candidate studies, perform an in-depth qualitative analysis (e.g., Grounded Theory), and then answer seven research questions with major findings. First, to date, eight major methodologies (e.g., term weighting, term co-occurrence analysis, thesaurus lookup) have been adopted to reformulate queries. Second, the existing studies suffer from several major limitations (e.g., lack of generalizability, the vocabulary mismatch problem, subjective bias) that might prevent their wide adoption. Finally, we discuss best practices and future opportunities to advance the state of research in search query reformulation.
    Comment: 81 pages, accepted at TOSEM
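    To make the most common methodology the review identifies concrete, the sketch below reformulates a change request via term weighting: it scores the request's words with TF-IDF learned from a code corpus and keeps the top-ranked terms as the query. The corpus, report text, and cut-off are illustrative assumptions.

```python
# Term-weighting-based query reformulation (illustrative sketch).
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

code_corpus = [
    "parse config file and reload settings",
    "render user profile page avatar upload",
    "retry failed network request with backoff",
]
bug_report = "app crashes when avatar upload fails on the profile page"

vectorizer = TfidfVectorizer(stop_words="english")
vectorizer.fit(code_corpus)                    # learn corpus-level term weights
scores = vectorizer.transform([bug_report]).toarray()[0]

terms = np.array(vectorizer.get_feature_names_out())
top = np.argsort(scores)[::-1][:3]             # keep the 3 highest-weighted terms
print("reformulated query:", " ".join(terms[i] for i in top if scores[i] > 0))
```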

    Discovering the value of unstructured data in business settings

    With the increasing amount of unstructured data in business settings, the analysis of unstructured data is reshaping business practices in many industries. Unstructured data analysis will eventually have a dominant presence in every department of an organisation, contributing to organisational value. This dissertation focuses on the most widely utilised form of unstructured data within organisations: textual data. A variety of techniques have been applied in three studies to discover the information within unstructured textual data. Study I proposed a dynamic model that incorporates topic membership, an outcome variable from Latent Dirichlet Allocation (a probabilistic topic model), with sentiment analysis for rating prediction. A variety of machine learning algorithms are employed to validate the model. Study II focused on the exploration of online customer reviews in the online food delivery (OFD) domain. In addition, this study examines the outcomes of franchising in the service sector from the customer's perspective. Using a large-scale dataset, it identifies key issues in the processes of producing and delivering products/services from service providers to customers in service industries. Study III extends the data scope to firm-level data. Latent signals are discovered from companies' self-descriptions. In addition, the association between these signals and the organisational context of entrepreneurship is examined, which could reveal the heterogeneity of various signals across different organisational contexts.
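    A minimal sketch of Study I's modeling idea follows: topic memberships from LDA are concatenated with a sentiment score and fed to a rating predictor. The reviews, ratings, toy sentiment lexicon, and choice of linear regression are illustrative stand-ins for the dissertation's data and models.

```python
# Combine LDA topic memberships with sentiment for rating prediction (sketch).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LinearRegression

reviews = [
    "great food fast delivery",
    "cold food late delivery terrible",
    "friendly driver tasty meal",
    "wrong order rude support awful",
]
ratings = np.array([5.0, 1.0, 4.0, 1.0])

# Topic membership features from a probabilistic topic model (LDA).
X_counts = CountVectorizer().fit_transform(reviews)
topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(X_counts)

# Toy lexicon-based sentiment score (a stand-in for a real sentiment analyzer).
LEXICON = {"great": 1, "tasty": 1, "friendly": 1, "fast": 1,
           "terrible": -1, "awful": -1, "rude": -1, "cold": -1,
           "wrong": -1, "late": -1}
sentiment = np.array([[sum(LEXICON.get(w, 0) for w in r.split())] for r in reviews])

# Concatenate topic memberships with sentiment and fit a rating predictor.
X = np.hstack([topics, sentiment])
model = LinearRegression().fit(X, ratings)
print(model.predict(X))
```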

    Software Engineering in the Age of App Stores: Feature-Based Analyses to Guide Mobile Software Engineers

    Mobile app stores are becoming the dominant distribution platform for mobile applications. Due to their rapid growth, their impact on software engineering practices is not yet well understood, and no comprehensive study has explored the effect of the mobile app store ecosystem on those practices. Therefore, this thesis, as its first contribution, empirically studies the app store phenomenon from the developers' perspective to investigate the extent to which app stores affect software engineering tasks. The study highlights the importance of a mobile application's features as a deliverable unit from developers to users. It uncovers the involvement of app stores in requirements elicitation, perfective maintenance and domain analysis, in the form of discoverable features written as text in descriptions and user reviews. Developers discover possible features to include by searching the app store, and in interviews they revealed the cost of such tasks given the highly prolific user base that major app stores exhibit. Therefore, the thesis, in its second contribution, uses techniques to extract features from unstructured natural language artefacts. This is motivated by the observation that developers monitor similar applications, in terms of the features they provide, to understand user expectations in a given application domain. The thesis then devises a semantics-aware technique for representing mobile applications using their textual functionality descriptions. This representation is shown to successfully cluster mobile applications, uncovering a finer-grained, functionality-based grouping of mobile apps. The thesis furthermore compares baseline techniques for feature extraction from textual artefacts against three main criteria: the silhouette width measure, human judgement and execution time. Finally, in its last contribution, the thesis shows that features do indeed migrate in the app store beyond category boundaries, and discovers a set of migratory characteristics and their relationship to price, rating and popularity in the app stores studied.
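    The functionality-based clustering and the silhouette-width criterion can be illustrated as follows; the app descriptions, plain TF-IDF representation, and number of clusters are assumptions for illustration (the thesis's own representation is semantics-aware rather than plain TF-IDF).

```python
# Cluster apps by functionality text and score the grouping with silhouette width.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

descriptions = [
    "track your runs and calories burned",
    "log workouts and monitor heart rate",
    "scan receipts and manage expenses",
    "budget planner with spending reports",
]

X = TfidfVectorizer(stop_words="english").fit_transform(descriptions)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Silhouette width: higher means tighter, better-separated functional groups.
print(labels, silhouette_score(X, labels))
```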

    Developing a Framework to Identify Professional Skills Required for Banking Sector Employee in UK using Natural Language Processing (NLP) Techniques

    The banking sector is changing dramatically, and recent studies reveal that many financial institutions face challenges keeping up with technological advancements and an acute shortage of skilled workers. The banking industry is turning into a dynamic field where success requires a wide range of talents. For the industry to properly analyse, match, and develop personnel, a robust skill-identification process is needed. The objective of this research is to establish a framework for determining the competencies needed by banking industry professionals by extracting data from job postings on UK websites.

    Data is extracted from job vacancy websites using web-based annotation tools and Natural Language Processing (NLP) techniques. The study begins with a thorough examination of the literature to investigate the theoretical underpinnings of NLP techniques, their applications in talent management and human resources within the banking industry, and their potential for skill identification. Next, textual data from job ads is processed with NLP techniques to extract and categorise the skills specific to these roles. Advanced algorithms and approaches are used to automatically extract skills from unstructured textual material, ensuring that the skills gathered are accurate and relevant to the needs of the banking industry. To make sure the NLP-driven skill identification is accurate and up to date, the extracted skills are verified through expert feedback. In the final phase, machine learning models are employed to predict the skills required of banking sector employees. The study examines various machine learning techniques, which are implemented within the framework. After preprocessing and training on skills extracted from job advertisements, these models are evaluated for their effectiveness in skill prediction. The results offer a detailed analysis of each model's performance, using metrics such as recall, precision, and F1-score. This comprehensive examination underscores the potential of machine learning for skill identification and highlights its relevance to the banking sector.
    Key Words: Machine Learning, Banking Sector, Employability, Data Mining, NLP, Semantic Analysis, Skill Assessment, Skill Recognition, Talent Management
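    The extraction-and-evaluation loop might look like the sketch below: match a skill vocabulary against a job ad, then score the predictions against expert-verified labels with precision, recall, and F1. The vocabulary, ad, and gold labels are invented for illustration and are not the study's data.

```python
# Dictionary-based skill extraction plus precision/recall/F1 evaluation (sketch).
from sklearn.metrics import precision_score, recall_score, f1_score

SKILLS = ["python", "sql", "risk management", "stakeholder management", "aml"]

def extract_skills(ad):
    """Match the skill vocabulary in text (stand-in for the NLP pipeline)."""
    text = ad.lower()
    return {s for s in SKILLS if s in text}

ad = "Analyst role: SQL and Python required, AML experience a plus."
predicted = extract_skills(ad)
gold = {"sql", "python", "aml", "risk management"}   # expert-verified skills

# Binary vectors over the vocabulary for metric computation.
y_true = [int(s in gold) for s in SKILLS]
y_pred = [int(s in predicted) for s in SKILLS]
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```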

    E-Quality: An Analysis of Digital Equity Discourse and Co-Production in the Era of COVID-19

    The digital divide refers to the social stratification due to an unequal ability to access, adapt, and create knowledge via information and communication technologies (Andreasson, 2015). Digitally disadvantaged individuals have inadequate access to services and resources, exacerbating existing vulnerabilities. The COVID-19 pandemic instigated a new model of digital equity policymaking that leverages co-production between numerous actors. As citizens faced new financial and community constraints and governments reached their administrative capacities, both the digital divide and the policymaking process evolved. This inductive study explores how digital equity policymaking shifted to a co-production model (Ostrom, 1996) amid the pandemic. Using a sequential mixed-methods approach, this research considers the interconnections of digital equity, co-production, and crisis policymaking. Digital divide discourse was first examined through a large-scale text analysis of verified tweets, using natural language processing techniques, regression modeling, and unsupervised machine learning topic modeling. Descriptive and inferential analyses demonstrate a statistically significant increase in policy discourse as well as a diversification of topics, though they suggest a disconnect between policy outputs and on-the-ground needs. Next, semi-structured interviews were conducted with City of Boston policymakers, and the resulting data was open-coded and axially coded to reveal insights into the design and implementation of co-productive solutions. The interviews also detail the conditions that contribute to successful outcomes when working with limited time, knowledge, and resources. The analyses reveal that co-productive behavior is critical to coping with the effects of the pandemic and highlight the influential role of community-based organizations. Furthermore, the study provides contextual information on co-production prerequisites that were previously understood, and sheds light on interpersonal conditions that Ostrom does not address. This dissertation contributes to the developing body of scholarly literature on the digital divide in the era of COVID-19. The case study also advances theoretical knowledge, offers methodological innovations, and provides concrete policy recommendations to promote more egalitarian digital use.
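    The inferential step reporting a significant rise in policy discourse can be sketched as a regression of weekly tweet counts on time; the counts below are toy placeholders, and the study's actual models (fit on a large corpus of verified tweets) are richer.

```python
# Test for a time trend in discourse volume with a simple OLS regression.
import numpy as np
import statsmodels.api as sm

weeks = np.arange(12)
tweet_counts = np.array([3, 4, 3, 5, 8, 9, 14, 13, 18, 21, 22, 27])  # toy counts

X = sm.add_constant(weeks)       # intercept + linear time trend
fit = sm.OLS(tweet_counts, X).fit()
print(fit.params)                # positive slope indicates growing discourse volume
print(fit.pvalues)               # a small slope p-value marks the rise as significant
```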

    Supporting Source Code Search with Context-Aware and Semantics-Driven Query Reformulation

    Software bugs and failures cost trillions of dollars every year, and can even lead to deadly accidents (e.g., the Therac-25 accident). During maintenance, software developers fix numerous bugs and implement hundreds of new features by making necessary changes to the existing software code. Once an issue report (e.g., bug report, change request) is assigned to a developer, she chooses a few important keywords from the report as a search query and then attempts to find the exact locations in the software code that need to be either repaired or enhanced. As part of this maintenance, developers also often construct ad hoc queries on the fly and attempt to locate reusable code from the Internet that could assist them in bug fixing or feature implementation. Unfortunately, even experienced developers often fail to construct the right search queries. Even when developers come up with ad hoc queries, most of the queries require frequent modification, which costs significant development time and effort. Thus, constructing an appropriate query for localising software bugs, programming concepts or reusable code is a major challenge. In this thesis, we address this query-construction challenge through six studies and develop a novel, effective code search solution (BugDoctor) that assists developers in localising the software code of interest (e.g., bugs, concepts and reusable code) during software maintenance. In particular, we reformulate a given search query (1) by designing novel keyword selection algorithms (e.g., CodeRank) that outperform traditional alternatives (e.g., TF-IDF), (2) by leveraging the bug report quality paradigm and source document structures, which were previously overlooked, and (3) by exploiting crowd knowledge and word semantics derived from the Stack Overflow Q&A site, which were previously untapped. Our experiments using 5,000+ search queries (bug reports, change requests, and ad hoc queries) suggest that our approach can significantly improve the given queries through automated reformulation. Comparison with 10+ existing studies on bug localization, concept location and Internet-scale code search suggests that our approach outperforms the state of the art by a significant margin.
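    A hedged sketch of graph-based keyword selection in the spirit of CodeRank follows: terms from a bug report are ranked by PageRank over a term co-occurrence graph rather than by TF-IDF. The report text and window size are illustrative assumptions, not the thesis's exact design.

```python
# Rank bug-report terms by graph centrality (PageRank over co-occurrence edges).
import networkx as nx

report = ("null pointer exception when saving user profile "
          "profile save fails after avatar upload").split()

# Connect terms that appear next to each other (a sliding window of size 2).
G = nx.Graph()
G.add_edges_from(
    (report[i], report[i + 1])
    for i in range(len(report) - 1)
    if report[i] != report[i + 1]   # skip self-loops from repeated terms
)

# Rank terms by centrality rather than raw frequency or TF-IDF.
ranks = nx.pagerank(G)
query = sorted(ranks, key=ranks.get, reverse=True)[:4]
print("reformulated query:", " ".join(query))
```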

    Mapping the evolving landscape of child-computer interaction research: structures and processes of knowledge (re)production

    Implementing an iterative sequential mixed-methods design (Quantitative → Qualitative → Quantitative) framed within a sociology-of-knowledge approach to discourse, this study offers an account of the structure of the field of Child-Computer Interaction (CCI), its development over time, and the practices through which researchers have (re)structured the knowledge comprising the field. The thematic structure of knowledge within the field, and its evolution over time, is quantified by applying a Correlated Topic Model (CTM), an automated inductive content analysis method, to 4,771 CCI research papers published between 2003 and 2021. A detailed understanding of the practices through which researchers (re)structure knowledge within the field, including the factors influencing these practices, is obtained through thematic analysis of online workshops involving prominent contributors to the field (n=7). The strategic practices researchers use to negotiate tensions impeding the integration of novel concepts are investigated by analysing semantic features of the retrieved papers with linear and negative binomial regression models. Contributing an extensive mapping, the results portray CCI as a varied research landscape, comprising 48 major themes of study, which has evolved dynamically over time. Research priorities throughout the field have been subject to influence from a range of endogenous and exogenous factors, which researchers actively negotiate through their research and publication practices. By tacitly structuring research practices, these factors have broadly sustained a technology-driven, novelty-dominated paradigm that has failed to substantively progress cumulative knowledge. Through strategic negotiation of the persistent tensions arising from these factors, researchers have nonetheless effected structural change within the field, contributing to a shift towards a user-needs-driven agenda and the progression of knowledge therein. The findings demonstrate that CCI is proceeding through an intermediary phase of maturation, forming an increasingly distinct disciplinary shape and identity through the cumulative structuring effect of community members' continued negotiation of tensions.
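    One of the study's quantitative steps, a negative binomial regression over semantic features of papers, can be sketched as below. The feature, the outcome (citation counts), and the data are illustrative assumptions; the study's actual variables are not specified here.

```python
# Negative binomial regression of an overdispersed count outcome on a
# per-paper semantic feature (toy data for illustration only).
import numpy as np
import statsmodels.api as sm

novelty_terms = np.array([0, 1, 1, 2, 3, 3, 4, 5, 6, 8])  # per-paper feature
citations = np.array([2, 3, 5, 4, 9, 7, 12, 15, 14, 30])  # overdispersed counts

X = sm.add_constant(novelty_terms)
fit = sm.GLM(citations, X, family=sm.families.NegativeBinomial()).fit()
print(fit.params, fit.pvalues)
```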

    An EBD-enabled design knowledge acquisition framework

    Having enough knowledge and keeping it up to date enables designers to execute design assignments effectively and gives them a competitive advantage in the design profession. Knowledge elicitation or acquisition is a crucial component of system design, particularly for tasks requiring transdisciplinary or multidisciplinary cooperation. In system design, extracting domain-specific information is exceedingly difficult for designers. This thesis presents three works that attempt to bridge the gap between designers and domain expertise. First, a systematic literature review on data-driven demand elicitation is conducted using the Environment-Based Design (EBD) approach. The review addresses two research objectives: (i) to investigate the present state of computer-aided requirement knowledge elicitation in the engineering domains; (ii) to integrate the EBD methodology into the conventional literature review framework by providing a well-structured research question generation methodology. The second study describes a data-driven interview transcript analysis strategy that employs EBD environment analysis, unsupervised machine learning, and a range of natural language processing (NLP) approaches to assist designers and qualitative researchers in extracting needs when domain expertise is lacking. This study also proposes a transfer-learning-based qualitative text analysis framework that aids researchers in extracting valuable knowledge from interview data for healthcare promotion decision-making. The third work is an EBD-enabled design lexical knowledge acquisition framework that automatically constructs a semantic network, RomNet, from an extensive collection of abstracts from engineering publications. Applying RomNet can improve design information retrieval quality and communication between the parties involved in a design project. In conclusion, this thesis integrates artificial intelligence techniques, such as natural language processing (NLP) methods, machine learning techniques, and rule-based systems, to build a knowledge acquisition framework that supports manual, semi-automatic, and automatic extraction of design knowledge from different types of textual data sources.
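    A RomNet-style lexical network can be sketched as term co-occurrence within sentences, weighted by frequency; querying a node's strongest neighbours then surfaces concept relations. The abstracts, tokenisation, and stop-word list are illustrative assumptions, not the thesis's construction of RomNet.

```python
# Build a small term co-occurrence network from abstracts and query relations.
import itertools
import re
import networkx as nx

abstracts = [
    "The gearbox transmits torque from the motor to the drive shaft.",
    "A flexible coupling connects the motor shaft and reduces vibration.",
]
STOP = {"the", "a", "to", "and", "from"}

G = nx.Graph()
for sentence in abstracts:
    terms = set(re.findall(r"[a-z]+", sentence.lower())) - STOP
    # Link every pair of terms sharing a sentence; weight counts co-occurrences.
    for u, v in itertools.combinations(sorted(terms), 2):
        if G.has_edge(u, v):
            G[u][v]["weight"] += 1
        else:
            G.add_edge(u, v, weight=1)

# Concept relations: terms most strongly connected to "motor".
neighbours = sorted(G["motor"], key=lambda t: G["motor"][t]["weight"], reverse=True)
print(neighbours[:5])
```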

    Trustworthiness in Social Big Data Incorporating Semantic Analysis, Machine Learning and Distributed Data Processing

    This thesis presents several state-of-the-art approaches constructed for the purposes of (i) studying the trustworthiness of users on Online Social Network platforms, (ii) deriving concealed knowledge from their textual content, and (iii) classifying and predicting the domain knowledge of users and their content. The developed approaches are refined through proof-of-concept experiments, several benchmark comparisons, and appropriate and rigorous evaluation metrics to verify and validate their effectiveness and efficiency, and hence those of the applied frameworks.