Text mining applied to the management of public funds (original title: Text mining aplicado à gestão de fundos públicos)
This work analyzes the textual documents submitted by Portuguese companies when
applying for public business incentive programs. It aims to extract and select relevant
features from these texts that have predictive power regarding the future actions of the
companies whose applications were accepted, over the course of their projects. The concrete
goal is to predict the cancellation of funded projects during their expected
duration. This required building a text classification pipeline that applies natural language
processing, feature extraction and selection techniques, classification algorithms, and
evaluation metrics. Baseline feature extraction techniques were used, namely TF and TF-IDF
values, alongside experiments with features based on topic generation, textual similarity
analysis, lexical diversity analysis, and the identification of specific vocabulary, among
other analyses of textual content. Exploring these features reveals hidden
characteristics in the data, such as a higher incidence of projects with high
levels of similarity in certain districts of the country. The main objective was to achieve
the best possible performance in predicting project cancellation, as measured by the metrics
derived from the confusion matrix (accuracy, precision, recall, F1-score). The best
prediction results were obtained with a set of features drawn from several extraction methods,
using the Naïve Bayes classifier: 79% accuracy, 77% precision, 71% recall, and
74% F1-score. This demonstrates the benefit of combining features from different
extraction methods in this text classification application.
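The four evaluation metrics named in the abstract all follow from the counts in a binary confusion matrix. The sketch below is illustrative only and is not the author's code: the function names and the toy counts are assumptions. The final check uses only the precision and recall figures reported above, to show that they are consistent with the reported F1-score.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def confusion_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from binary confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives and
    true negatives, with "project cancelled" taken as the positive class
    (an assumption; the abstract does not state which class is positive).
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }


# Toy counts, purely hypothetical (not taken from the thesis).
toy = confusion_metrics(tp=8, fp=2, fn=4, tn=6)
print(toy)

# Consistency check on the reported figures: 77% precision and 71% recall.
print(round(f1_score(0.77, 0.71), 2))  # 0.74
```

With 77% precision and 71% recall, the harmonic mean rounds to 0.74, which matches the 74% F1-score quoted in the abstract.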
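Evaluation of epidemic intelligence systems applied to the early detection of infectious diseases worldwide (original title: Evaluation des systèmes d'intelligence épidémiologique appliqués à la détection précoce des maladies infectieuses au niveau mondial)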
Our work demonstrated the performance of the epidemic intelligence systems used for the early detection of infectious diseases worldwide, the specific added value of each system, the greater intrinsic sensitivity of moderated systems, and the variability in the types of information sources used. The creation of a combined virtual system incorporating the best result of the seven systems showed the gains in sensitivity and timeliness that would result from integrating these individual systems into a supra-system. The results also showed the limits of these tools, in particular the low positive predictive value of the raw signals detected, the variability of detection capacities for the same disease, and the significant influence of the type of pathology, the language, and the region of occurrence on the detection of infectious events. They established the wide variety of epidemic intelligence strategies used by public health institutions to meet their specific needs, and the impact of these strategies on the nature, geographic origin, and number of events reported. They also showed that, under conditions close to routine use, epidemic intelligence detected infectious events on average one to two weeks before their official notification, making it possible to alert health authorities and to anticipate the implementation of any control measures. Our work opens new fields of investigation whose applications could be important for both users and systems.
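The abstract does not say how the combined virtual system was computed. As a rough sketch, one plausible reading of "the best result of the seven systems" is that an event counts as detected if any system detected it, and that its timeliness is the earliest detection among them. Everything below is hypothetical: the system names, event identifiers, lead times, and function names are assumptions made for illustration, not data or methods from the thesis.

```python
def sensitivity(detections, reference_events):
    """Share of reference events that a system detected."""
    if not reference_events:
        return 0.0
    detected = sum(1 for event in reference_events if event in detections)
    return detected / len(reference_events)


def mean_lead_days(detections):
    """Average number of days between detection and official notification."""
    if not detections:
        return 0.0
    return sum(detections.values()) / len(detections)


def combine(systems):
    """Virtual system keeping, per event, the largest lead time of any system."""
    combined = {}
    for detections in systems.values():
        for event, lead in detections.items():
            if event not in combined or lead > combined[event]:
                combined[event] = lead
    return combined


# Hypothetical reference list of officially notified events.
reference_events = ["e1", "e2", "e3", "e4"]

# Hypothetical detections: event -> days detected before official notification.
systems = {
    "system_a": {"e1": 10, "e2": 3},
    "system_b": {"e2": 12, "e3": 7},
}

combined = combine(systems)
for name, detections in list(systems.items()) + [("combined", combined)]:
    print(name, sensitivity(detections, reference_events), mean_lead_days(detections))
```

In this toy data each individual system detects half of the reference events, while the combined system detects three of the four and keeps the earlier of the two detections of `e2`. That is the kind of gain in sensitivity and timeliness the abstract attributes to a supra-system.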