11 research outputs found
Flecha: um sistema de recomendação de questões de concurso público
Undergraduate thesis (trabalho de conclusão de curso), Universidade de Brasília, Instituto de Ciências Exatas, Departamento de Ciência da Computação, 2017. This research presents a system that recommends questions from past Brazilian civil service exams on Information Technology and Administrative Law, based on how often each subject appeared in the exams. Candidates can use the tool to optimize their study time by choosing exercises on the subjects most frequently covered in past exams.
The system extracts questions from unstructured sources such as PDFs, classifies them automatically, and makes recommendations such that the number of questions recommended is proportional to what was covered in past exams. For classification, SVM classifiers were trained with the SVC and LinearSVC implementations, and experiments were performed with different parameters. Different types of preprocessing were also tested. For recommendation, the exam questions are clustered into groups of similar questions, which are recommended in proportion to the size of each cluster. Experiments examined different numbers of clusters, the effectiveness of a decay function, and how recommendation quality behaves as the number of recommended questions increases. In classification, the best results were obtained with LinearSVC. Recommendation obtained the best results without the decay function and with a small number of clusters.
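The proportional recommendation step described above can be sketched as follows. This is a minimal illustration rather than the thesis's implementation: the function name `allocate_recommendations`, the example cluster labels, and the largest-remainder rounding rule are all assumptions made for the example.

```python
from collections import Counter


def allocate_recommendations(cluster_labels, total):
    """Split `total` recommendations across clusters in proportion to cluster size.

    `cluster_labels` holds one cluster label per past exam question.
    Rounding uses the largest-remainder method so the quotas always sum to
    `total` (this rounding rule is an assumption, not taken from the thesis).
    """
    sizes = Counter(cluster_labels)
    n = sum(sizes.values())
    if n == 0:
        return {}
    # Exact (possibly fractional) share of each cluster.
    exact = {c: total * size / n for c, size in sizes.items()}
    # Start from the integer part of each share.
    quotas = {c: int(share) for c, share in exact.items()}
    leftover = total - sum(quotas.values())
    # Give the remaining slots to the largest fractional parts,
    # breaking ties in favour of the larger cluster.
    by_remainder = sorted(
        exact, key=lambda c: (exact[c] - quotas[c], sizes[c]), reverse=True
    )
    for c in by_remainder[:leftover]:
        quotas[c] += 1
    return quotas
```

With six questions in cluster `"a"`, three in `"b"` and one in `"c"`, asking for ten recommendations gives quotas of 6, 3 and 1. The sketch does not cap a quota at the cluster's size, which a real system would need when `total` exceeds the number of available questions.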
FP-tree Based Spatial Co-location Pattern Mining
A co-location pattern is a set of spatial features frequently located together in space. A frequent pattern is a set of items that frequently appears in a transaction database. Since its introduction, the paradigm of frequent pattern mining has shifted from candidate generation-and-test approaches to projection-based approaches. Co-location patterns resemble frequent patterns in many respects. However, the lack of a transaction concept, which is crucial in frequent pattern mining, makes a similar paradigm shift in co-location pattern mining very difficult. This thesis investigates a projection-based co-location pattern mining paradigm. In particular, an FP-tree based co-location mining framework and an algorithm called FP-CM (FP-tree based co-location miner) are proposed. FP-CM is proved to be complete and correct, and to require only a small constant number of database scans. The experimental results show that FP-CM outperforms a candidate generation-and-test based co-location miner by an order of magnitude.
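To make the role of transactions concrete, the sketch below counts support in a small transaction database and runs a generate-and-test pass for patterns of size two. It illustrates the classical transaction-based setting only, not FP-CM or any FP-tree structure; the function names and the toy data are assumptions for the example.

```python
from itertools import combinations


def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def frequent_pairs(transactions, min_support):
    """Generate-and-test for size-2 patterns.

    Every pair of items is generated as a candidate, then tested
    against the database; only pairs meeting `min_support` are kept.
    """
    items = sorted({item for t in transactions for item in t})
    result = {}
    for a, b in combinations(items, 2):
        s = support(transactions, (a, b))
        if s >= min_support:
            result[(a, b)] = s
    return result
```

Support is well defined here because each transaction is a ready-made unit to count over. Spatial data has no such unit, since feature instances sit in continuous space, which is the difficulty the abstract points to.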
Design and analysis of clustering algorithms for numerical, categorical and mixed data
In recent times, several machine learning techniques have been applied successfully to discover useful knowledge from data. Cluster analysis, which aims at finding similar subgroups in a large heterogeneous collection of records, is one of the most useful and popular of the available data mining techniques. The purpose of this research is to design and analyse clustering algorithms for numerical, categorical and mixed data sets. Most clustering algorithms are limited to either numerical or categorical attributes. Data sets with mixed types of attributes are common in real life, so designing and analysing clustering algorithms for mixed data sets is timely. Determining the optimal solution to the clustering problem is NP-hard. Therefore, it is necessary to quickly find solutions that are regarded as "good enough". Similarity is a fundamental concept for the definition of a cluster. It is very common to calculate the similarity or dissimilarity between two features using a distance measure. Attributes with large ranges implicitly contribute more to the metric than attributes with small ranges. There are only a few papers especially devoted to normalisation methods. Usually data is scaled to unit range. This does not secure equal average contributions of all features to the similarity measure. For that reason, a main part of this thesis is devoted to normalisation.
EThOS - Electronic Theses Online Service, GB, United Kingdom
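The point about unit-range scaling can be checked with a small sketch. It scales two columns to [0, 1] and measures each column's average contribution to a distance as the mean absolute difference over all pairs of records. That measure, the function names and the toy columns are assumptions chosen for illustration, not the normalisation proposed in the thesis.

```python
from itertools import combinations


def min_max_scale(column):
    """Scale a numeric column to the unit range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)
    return [(v - lo) / (hi - lo) for v in column]


def mean_pairwise_abs_diff(column):
    """Average |a - b| over all pairs of values in the column.

    Used here as a stand-in for the column's average contribution
    to a city-block style distance between records.
    """
    pairs = list(combinations(column, 2))
    if not pairs:
        return 0.0
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


# Two columns with very different distributions (toy data).
evenly_spread = min_max_scale([0, 1, 2, 3, 4])
skewed = min_max_scale([0, 0, 0, 0, 100])

# Both now have range exactly 1, yet their average contributions differ.
contribution_even = mean_pairwise_abs_diff(evenly_spread)
contribution_skewed = mean_pairwise_abs_diff(skewed)
```

Both scaled columns span exactly [0, 1], but the evenly spread column contributes 0.5 on average while the skewed one contributes 0.4, so equal ranges do not imply equal average contributions.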
An investigation in efficient spatial patterns mining
Technical progress in computerized spatial data acquisition and storage has resulted in the growth of vast spatial databases. Faced with large and increasing amounts of spatial data, an end user has more difficulty understanding them without helpful knowledge extracted from spatial databases. Thus, spatial data mining has been brought under the umbrella of data mining and is attracting more attention.
Spatial data mining presents challenges. Unlike conventional data, spatial data includes not only positional data and attribute data, but also spatial relationships among spatial events. Further, the instances of spatial events are embedded in a continuous space and share a variety of spatial relationships, so the mining of spatial patterns demands new techniques.
In this thesis, several contributions were made. Some new techniques were proposed, namely fuzzy co-location mining, the CPI-tree (Co-location Pattern Instance Tree), maximal co-location pattern mining, AOI-ags (Attribute-Oriented Induction based on Attributes' Generalization Sequences), and fuzzy association prediction. Three algorithms were put forward for co-location pattern mining: the fuzzy co-location mining algorithm, the CPI-tree based co-location mining algorithm (CPI-tree algorithm) and the order-clique-based maximal prevalence co-location mining algorithm (order-clique-based algorithm). An attribute-oriented induction algorithm based on attributes' generalization sequences (AOI-ags algorithm) is further given, which unifies the attribute thresholds and the tuple thresholds. On two real-world databases with time-series data, a fuzzy association prediction algorithm is designed. A cell-based spatial object fusion algorithm is also proposed. Two fuzzy clustering methods using domain knowledge were proposed: the Natural Method and the Graph-Based Method, both of which are controlled by a threshold. The threshold was confirmed by polynomial regression. Finally, a prototype system for spatial co-location pattern mining was developed; it shows the relative efficiencies of the proposed co-location techniques.
The techniques presented in the thesis focus on improving the feasibility, usefulness, effectiveness, and scalability of the related algorithms. In the design of the fuzzy co-location mining algorithm, a new data structure, the binary partition tree, was proposed to improve the process of fuzzy equivalence partitioning. A prefix-based approach that partitions the prevalent event set search space into subsets, where each sub-problem can be solved in main memory, was also presented. The scalability of the CPI-tree algorithm is guaranteed since it does not require expensive spatial joins or instance joins for identifying co-location table instances. In the order-clique-based algorithm, the co-location table instances do not need to be stored after computing the Pi value of the corresponding co-location, which dramatically reduces the execution time and space of mining maximal co-locations. Some techniques, for example partitions, equivalence partition trees, pruning optimization strategies and interestingness, were used to improve the efficiency of the AOI-ags algorithm. To implement the fuzzy association prediction algorithm, the "growing window" and proximity computation pruning were introduced to reduce both I/O and CPU costs in computing the fuzzy semantic proximity between time series.
For the new techniques and algorithms, theoretical analysis and experimental results on synthetic and real-world data sets are presented and discussed in the thesis.
EThOS - Electronic Theses Online Service, GB, United Kingdom
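The abstract mentions a "Pi value" for a co-location without defining it. Assuming it refers to the participation index commonly used as the prevalence measure in co-location mining, the sketch below computes it for a pattern of two features by brute force. The function name, the `(feature, x, y)` point format and the distance threshold are assumptions for the example, and none of the thesis's tree-based or clique-based optimizations are reproduced.

```python
from math import hypot


def participation_index(points, feature_a, feature_b, max_dist):
    """Participation index of the size-2 co-location {feature_a, feature_b}.

    `points` is a list of (feature, x, y) tuples. Two instances are
    neighbours when their Euclidean distance is at most `max_dist`.
    The participation ratio of a feature is the share of its instances
    that take part in at least one neighbouring pair; the index is the
    minimum of the two ratios.
    """
    a_pts = [(i, p) for i, p in enumerate(points) if p[0] == feature_a]
    b_pts = [(j, p) for j, p in enumerate(points) if p[0] == feature_b]
    if not a_pts or not b_pts:
        return 0.0
    a_involved, b_involved = set(), set()
    # Brute-force neighbour search over all A-B instance pairs.
    for i, (_, ax, ay) in a_pts:
        for j, (_, bx, by) in b_pts:
            if hypot(ax - bx, ay - by) <= max_dist:
                a_involved.add(i)
                b_involved.add(j)
    ratio_a = len(a_involved) / len(a_pts)
    ratio_b = len(b_involved) / len(b_pts)
    return min(ratio_a, ratio_b)
```

A pattern would then be reported as prevalent when its index meets a user-chosen minimum threshold. The quadratic pair scan is only suitable for toy inputs.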