Text mining aplicado à gestão de fundos públicos (Text mining applied to the management of public funds)

Author: Chinita Luís Henrique Broncas
Publication date: 28/12/2022

This work analyzes the textual documents submitted by Portuguese companies when applying to public business incentive programs. The analysis aims to extract and select relevant features from the texts that have predictive power regarding the future actions of the accepted applicant companies over the course of their projects. The concrete goal is to predict the cancellation of projects with allocated funds during their expected duration. This required building a text classification pipeline that applies a variety of natural language processing techniques, feature extraction and selection, classifier selection and use, and evaluation metrics. Baseline feature extraction techniques such as TF and TF-IDF values were used, alongside experiments with features based on topic generation, textual similarity analysis, lexical diversity analysis, exploration of specific vocabulary, and other kinds of textual content analysis. Exploring the features created in these experiments reveals hidden characteristics in the data, for example a higher incidence of projects with high levels of similarity in certain districts of the country. The main objective was to achieve the best possible performance on the metrics derived from the confusion matrix (accuracy; precision; recall; F1-Score) when predicting project cancellation. The best results were obtained with a set of features drawn from several extraction methods, using the Naïve Bayes classifier: 79% accuracy; 77% precision; 71% recall; 74% F1-Score. The work thus demonstrates the benefit of mixing features from different extraction methods in this text classification application.
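The four confusion-matrix metrics named in the abstract can be illustrated with a minimal Python sketch. The function names and the example counts below are assumptions made for illustration; they are not taken from the thesis, which does not publish its confusion matrix here. The last lines only check that the reported precision (77%) and recall (71%) are arithmetically consistent with the reported F1-Score.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


def confusion_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from binary confusion-matrix counts.

    tp/fp/fn/tn are counts for the positive class, which in the abstract's
    setting would be "project cancelled".
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }


# Hypothetical counts, chosen only to exercise the function.
m = confusion_metrics(tp=8, fp=2, fn=2, tn=8)

# Consistency check on the figures quoted in the abstract.
reported_f1 = f1_from(0.77, 0.71)
print(m, round(reported_f1, 2))
```

With the abstract's figures, the harmonic mean of 0.77 and 0.71 is about 0.739, which rounds to the reported 74%.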

Repositório Institucional do ISCTE-IUL

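Evaluation des systèmes d'intelligence épidémiologique appliqués à la détection précoce des maladies infectieuses au niveau mondial (Evaluation of epidemic intelligence systems applied to the early detection of infectious diseases worldwide)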

Author: Barboza Philippe
Publication venue: HAL CCSD
Publication date: 16/12/2014

Our work demonstrated the performance of epidemic intelligence systems used for the early detection of infectious diseases worldwide, the specific added value of each system, the greater intrinsic sensitivity of moderated systems, and the variability in the types of information sources used. The creation of a combined virtual system incorporating the best result of the seven systems showed the gains in sensitivity and timeliness that would result from integrating these individual systems into a supra-system. It showed the limits of these tools, in particular the low positive predictive value of the raw signals detected, the variability of detection capacities for the same disease, and the significant influence of the type of pathology, the language, and the region of occurrence on the detection of infectious events. It established the wide variety of epidemic intelligence strategies used by public health institutions to meet their specific needs, and the impact of these strategies on the nature, geographic origin, and number of events reported. It also showed that, under conditions close to routine, epidemic intelligence allowed infectious events to be detected on average one to two weeks before their official notification, making it possible to alert health authorities and anticipate the implementation of any control measures. Our work opens new fields of investigation whose applications could be important for both users and systems.
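One way to read the "combined virtual system incorporating the best result" of the individual systems is sketched below in Python. This is an interpretation under stated assumptions, not the thesis's method: the data layout, the function names, the two system labels and all numbers are hypothetical. Each system maps an event to the number of days by which it preceded official notification, and the combined system keeps the largest lead per event.

```python
def combine_systems(detections):
    """Merge per-system detections, keeping the best (largest) lead per event.

    detections: {system_name: {event_id: days_before_official_notification}}
    """
    combined = {}
    for per_event in detections.values():
        for event, lead in per_event.items():
            combined[event] = max(lead, combined.get(event, lead))
    return combined


def sensitivity(detected, all_events):
    """Share of the reference events that were detected at all."""
    return len(set(detected) & set(all_events)) / len(all_events)


# Hypothetical reference events and detections from two made-up systems.
events = ["e1", "e2", "e3", "e4"]
systems = {
    "A": {"e1": 5, "e2": 10},
    "B": {"e1": 12, "e3": 7},
}

combined = combine_systems(systems)
print(sensitivity(systems["A"], events), sensitivity(combined, events))
```

In this toy data the combined system detects more events than either system alone and never has a shorter lead time, which is the kind of gain in sensitivity and timeliness the abstract describes.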

Thèses en Ligne