2,653 research outputs found

    Analytical Challenges in Modern Tax Administration: A Brief History of Analytics at the IRS

    Get PDF

    Analysing Longitudinal Social Science Questionnaires: Topic modelling with BERT-based Embeddings

    Get PDF
    Unsupervised topic modelling is a useful, unbiased mechanism for topic labelling of complex longitudinal questionnaires covering multiple domains such as social science and medicine. Manual tagging of such complex datasets increases the propensity for incorrect or inconsistent labels and is a barrier to scaling the processing of longitudinal questionnaires for the provision of question banks to data collection agencies. Towards this effort, we propose a tailored BERTopic framework that takes advantage of its novel sentence embeddings for creating interpretable topics, and extend it with an enhanced visualisation for comparing the topic model labels with the tags manually assigned to the question literals. The resulting topic clusters uncover instances of mislabelled question tags, while also showcasing the semantic shifts and evolution of topics across the time span of the longitudinal questionnaires. The tailored BERTopic framework outperforms existing topic modelling baselines on the quantitative evaluation metrics of topic coherence and diversity, while also being 18 times faster than the next best-performing baseline.

    Application of Clustering Techniques to the Group Decision-Making Context

    Get PDF
    Nowadays, decisions made by executives and managers are primarily made in groups. Group decision-making is thus a process in which a group of people, called participants, work together to analyse a set of variables, considering and evaluating a set of alternatives in order to select one or more solutions. Many problems are associated with group decision-making, namely when the participants cannot meet for any reason, ranging from schedule incompatibility to being in different countries with different time zones. To support this process, Group Decision Support Systems (GDSS) evolved into what we today call web-based GDSS. In a GDSS, argumentation is ideal, since it makes it easier to use justifications and explanations in interactions between decision-makers so they can sustain their opinions. Aspect-Based Sentiment Analysis (ABSA) is a subfield of Argument Mining closely related to Natural Language Processing; it aims to classify opinions at the aspect level and to identify the elements of an opinion. Applying ABSA techniques to the group decision-making context results in, for example, the automatic identification of alternatives and criteria. This automatic identification is essential to reduce the time decision-makers need to familiarize themselves with a Group Decision Support System, and it offers them various insights into the discussion in which they participate, one of which is the set of arguments used by the decision-makers about an alternative. This dissertation therefore proposes a methodology that uses an unsupervised technique, clustering, to segment the participants of a discussion based on the arguments they use, so that knowledge can be produced from the information currently in the GDSS. The methodology can be hosted in a web service that follows a micro-service architecture and combines data preprocessing and intra-sentence segmentation with clustering to achieve the objectives of the dissertation.
Word embedding is needed when applying clustering techniques to natural language text, to transform the text into vectors usable by the clustering techniques. In addition to word embedding, dimensionality reduction techniques were tested to improve the results. The best approach was found by keeping the same preprocessing steps while varying the clustering technique, word embedder, and dimensionality reduction technique. It consists of the KMeans++ clustering technique, using SBERT as the word embedder with UMAP dimensionality reduction, reducing the number of dimensions to 2. This experiment achieved a Silhouette Score of 0.63 with 8 clusters on the baseball dataset, which yielded good cluster results based on manual review and word clouds. The same approach obtained a Silhouette Score of 0.59 with 16 clusters on the car brand dataset, which was used as a validation dataset for the approach.
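The winning configuration reported above (SBERT embeddings, UMAP reduced to 2 dimensions, KMeans++, silhouette scoring) can be sketched as follows. Since sentence-transformers and umap-learn may not be installed, TF-IDF and TruncatedSVD act as stand-ins for SBERT and UMAP, and the argument texts are invented for illustration.

```python
# Hedged sketch of the dissertation's best pipeline: embed arguments,
# reduce to 2 dimensions, cluster with k-means++, score with silhouette.
# Stand-ins: TF-IDF replaces SBERT and TruncatedSVD replaces UMAP here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

arguments = [
    "the price is too high for this alternative",
    "cost and price make this option expensive",
    "delivery time is fast and reliable",
    "shipping is quick with reliable delivery",
]
X = TfidfVectorizer().fit_transform(arguments)          # text -> vectors
X2 = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
km = KMeans(n_clusters=2, init="k-means++", n_init=10,
            random_state=0).fit(X2)
score = silhouette_score(X2, km.labels_)                # cluster quality
```

In the dissertation's setting, each cluster would group participants by the arguments they used in the discussion.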

    Wasco Environmental Chamber System Manifold

    Get PDF
    This report details the process of designing and producing a manifold that connects ultra-high purity (UHP) pressure and vacuum switches to an air supply so that the parts may be tested in an environmental chamber. Using the DMAIC (define, measure, analyze, improve, and control) methodology of problem solving, progress on the project includes thorough research about the problem, preliminary design solutions, iterative prototyping, and testing of specific functionalities. Manufacturing engineering topics of tooling, fixturing, metrology, quality, and machining with manual and computer numerical control (CNC) varieties of both mills and lathes are applied during the course of the project. The report concludes with a final manifold design and a comparison against the old manifold.

    Exploring quantitative modelling of semantic factors for content marketing

    Get PDF
    Developments in business analytics, as well as the increased availability of data, have allowed digital marketers to better understand and capitalize on consumer behavior to maximize engagement with marketing materials. However, because most previous studies in this field have focused on consumer behavior theory, they have been largely limited in scope due to small datasets and reliance on human-labeled data. This study aims to explore the potential of using a machine-learning language model to generate vector embeddings, representing the semantics in text, to model engagement in a quantitative way. By clustering the semantic vector embeddings, the study was able to generate datasets on different topics, on which regression models were estimated to gauge the impact of the represented variables. Many of the parameters in the models were shown to be significant, implying both explanatory potential in text semantics and the presented methods' ability to model it. This expands on theories in the literature regarding how semantic factors affect consumer perception, and highlights that text semantics contains information that can help inform marketing decision-making. The paper contributes a methodology that can allow academics and marketers alike to model these semantics and thus gain insights into how topics and language affect consumer engagement. Further investigation into similar methods might allow digital marketers to improve their understanding of how different consumers perceive and engage with their marketing content.
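The two-stage procedure described here, clustering semantic embeddings into topic datasets and then estimating regression models on each, can be illustrated on synthetic data. All vectors and the engagement signal below are fabricated for the sketch.

```python
# Hedged sketch: cluster semantic vectors into topic groups, then fit a
# regression of engagement on the embedding within each cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Synthetic "semantic embeddings": two topic blobs in 5-D space.
emb = np.vstack([rng.normal(0, 0.1, (50, 5)),
                 rng.normal(1, 0.1, (50, 5))])
# Synthetic engagement: driven by one embedding dimension plus noise.
engagement = 2.0 * emb[:, 0] + rng.normal(0, 0.05, 100)

# Stage 1: split the corpus into topic datasets by clustering.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(emb)

# Stage 2: estimate a regression model per topic dataset.
scores = {}
for k in (0, 1):
    mask = labels == k
    model = LinearRegression().fit(emb[mask], engagement[mask])
    scores[k] = model.score(emb[mask], engagement[mask])  # in-sample R^2
```

A full study would test parameter significance (e.g., with statsmodels) rather than report in-sample fit, but the data flow is the same.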

    Scalable manifold learning by uniform landmark sampling and constrained locally linear embedding

    Full text link
    As a pivotal approach in machine learning and data science, manifold learning aims to uncover the intrinsic low-dimensional structure within complex nonlinear manifolds in high-dimensional space. By exploiting the manifold hypothesis, various techniques for nonlinear dimension reduction have been developed to facilitate visualization, classification, clustering, and the gaining of key insights. Although existing manifold learning methods have achieved remarkable successes, they still suffer from extensive distortions of the global structure, which hinders the understanding of underlying patterns. Scalability issues also limit their applicability for handling large-scale data. Here, we propose a scalable manifold learning (scML) method that can manipulate large-scale and high-dimensional data in an efficient manner. It starts by seeking a set of landmarks to construct the low-dimensional skeleton of the entire data, and then incorporates the non-landmarks into the learned space based on constrained locally linear embedding (CLLE). We empirically validated the effectiveness of scML on synthetic datasets and real-world benchmarks of different types, and applied it to analyze single-cell transcriptomics and detect anomalies in electrocardiogram (ECG) signals. scML scales well with increasing data sizes and embedding dimensions, and exhibits promising performance in preserving the global structure. The experiments demonstrate notable robustness in embedding quality as the sample rate decreases. Comment: 33 pages, 10 figures
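The landmark idea can be sketched in a simplified form: embed a uniformly sampled landmark subset, then place every remaining point using its nearest landmarks. Note that the placement step below uses simple inverse-distance weights rather than the paper's constrained locally linear embedding, and all sizes and parameters are illustrative.

```python
# Hedged sketch of landmark-based manifold learning: build a skeleton
# from a uniform landmark sample, then place non-landmarks as weighted
# averages of nearby landmark coordinates (a simplified stand-in for CLLE).
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))                        # toy data
idx = rng.choice(len(X), size=100, replace=False)    # uniform landmark sample
landmarks = X[idx]

# Step 1: low-dimensional skeleton from the landmarks only.
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2, random_state=0)
Y_land = lle.fit_transform(landmarks)

# Step 2: embed every point via inverse-distance weights over its
# 5 nearest landmarks (landmarks land back on their own coordinates).
nn = NearestNeighbors(n_neighbors=5).fit(landmarks)
dist, nbr = nn.kneighbors(X)
w = 1.0 / (dist + 1e-12)
w /= w.sum(axis=1, keepdims=True)
Y = (w[:, :, None] * Y_land[nbr]).sum(axis=1)
```

Fitting only on the landmark subset is what gives the approach its scalability: the expensive eigendecomposition runs on 100 points instead of 300.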

    LEVERAGING SOCIAL NETWORK DATA FOR ANALYTICAL CRM STRATEGIES - THE INTRODUCTION OF SOCIAL BI

    Get PDF
    The skyrocketing trend for social media on the Internet is greatly altering analytical Customer Relationship Management (CRM). Against this backdrop, the purpose of this paper is to advance the conceptual design of Business Intelligence (BI) systems with data identified from social networks. We develop an integrated social network data model based on an in-depth analysis of Facebook. The data model can inform the design of data warehouses in order to offer new opportunities for CRM analyses, leading to a more consistent and richer picture of customers' characteristics, needs, wants, and demands. Four major contributions are offered. First, Social CRM and Social BI are introduced as emerging fields of research. Second, we develop a conceptual data model to identify and systematize the data available in online social networks. Third, based on the identified data, we design a multidimensional data model as an early contribution to the conceptual design of Social BI systems and demonstrate its application by developing management reports in a retail scenario. Fourth, intellectual challenges for advancing Social CRM and Social BI are discussed.

    Web Data Extraction, Applications and Techniques: A Survey

    Full text link
    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool for performing data analysis in Business and Competitive Intelligence systems, as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather the large amounts of structured data continuously generated and disseminated by Web 2.0, Social Media, and Online Social Network users, offering unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential for cross-fertilization, i.e., the possibility of reusing Web Data Extraction techniques originally designed to work in a given domain in other domains. Comment: Knowledge-Based Systems
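A minimal wrapper-style extractor, of the kind such ad-hoc approaches rely on, can be written with the standard library alone; the markup, class names, and record fields below are invented for illustration.

```python
# Hedged sketch of a wrapper-style Web Data Extraction step: pull
# (title, price) records out of repeated product markup using the
# standard library's HTMLParser.
from html.parser import HTMLParser

class ProductExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.records = []       # extracted structured records
        self._field = None      # field the next text node belongs to

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if tag == "span" and cls in ("title", "price"):
            self._field = cls

    def handle_data(self, data):
        if self._field == "title":
            self.records.append({"title": data.strip()})
        elif self._field == "price":
            self.records[-1]["price"] = data.strip()
        self._field = None

page = """
<div class="product"><span class="title">Widget</span><span class="price">9.99</span></div>
<div class="product"><span class="title">Gadget</span><span class="price">19.99</span></div>
"""
parser = ProductExtractor()
parser.feed(page)
```

Real systems replace the hard-coded class names with learned or induced wrappers, which is exactly where the maintenance cost the survey discusses comes from.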

    Supporting the Life Cycle of WLAN Positioning

    Get PDF
    The advent of GPS positioning at the turn of the millennium provided consumers with worldwide access to outdoor location information. For the purposes of indoor positioning, however, the GPS signal rarely penetrates buildings well enough to maintain the same level of positioning granularity as outdoors. Arriving around the same time, wireless local area networks (WLAN) gained widespread support both in terms of infrastructure deployments and client proliferation. A promising approach to bridging the location context, then, has been positioning based on WLAN signals. In addition to being readily available in most environments needing support for location information, the adoption of a WLAN positioning system is financially low-cost compared to dedicated infrastructure approaches, partly due to operating on an unlicensed frequency band. Furthermore, the accuracy provided by this approach is sufficient for a wide range of location-based services, such as navigation and location-aware advertisements. In spite of this attractive proposition and extensive research in both academia and industry, WLAN positioning has yet to become the de facto choice for indoor positioning, despite over 20 000 publications and the foundation of several companies. The main reasons for this include: (i) the cost of deployment, and re-deployment, which is often significant, if not prohibitive, in terms of work hours; (ii) the complex propagation of the wireless signal, which, through interaction with the environment, renders it inherently stochastic; (iii) the use of an unlicensed frequency band, which means the wireless medium faces fierce competition from other technologies, and even unintentional radiators, that can impair traffic in unforeseen ways and impact positioning accuracy.
This thesis addresses these issues by developing novel solutions for reducing the effort of deployment, including optimizing the indoor location topology for the use of WLAN positioning, as well as automatically detecting sources of cross-technology interference. These contributions pave the way for WLAN positioning to become as ubiquitous as the underlying technology.
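The dominant fingerprinting approach to WLAN positioning can be sketched in a few lines: a radio map of received signal strength (RSS) measurements at known locations is queried with k-nearest-neighbour matching. The access points, positions, and dBm values below are made up for illustration.

```python
# Hedged sketch of WLAN fingerprint positioning: match an observed RSS
# vector against a radio map and average the k nearest reference locations.
import math

# Radio map: (x, y) position -> RSS in dBm from three access points.
radio_map = {
    (0.0, 0.0): [-40, -70, -80],
    (0.0, 5.0): [-70, -40, -80],
    (5.0, 0.0): [-80, -70, -40],
    (5.0, 5.0): [-70, -55, -55],
}

def locate(rss, k=2):
    """Estimate position as the mean of the k closest fingerprints."""
    ranked = sorted(radio_map, key=lambda p: math.dist(radio_map[p], rss))
    nearest = ranked[:k]
    return (sum(p[0] for p in nearest) / k,
            sum(p[1] for p in nearest) / k)

pos = locate([-45, -68, -75])
```

The deployment cost the thesis targets comes from building and maintaining the radio map, which must be re-surveyed whenever the environment or access points change.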