A Method for Determining Semantic Relatedness
This work addresses the problem of measuring the semantic relatedness of English-language concepts using text corpora. We first give a brief overview of existing approaches to the problem and consider the main expert-annotated benchmark corpora. We then describe our own method and the main classes of hypotheses on which it is based. The paper proposes and describes more than 70 hypotheses that can be used to compute semantic relatedness, together with a new, high-performance relatedness model based on machine learning and these hypotheses. The model allows flexible selection of subsets of the hypotheses and performs well across different benchmark test sets.
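As a rough illustration of how a model might combine a selectable subset of hypotheses into one relatedness score, here is a minimal sketch. Everything in it is an assumption made for illustration: the toy co-occurrence counts, the two example hypotheses (`cosine`, `jaccard`), and the fixed `WEIGHTS`, which stand in for parameters that a trained model would learn. It is not the method described in the paper.

```python
import math

# Hypothetical toy data: word -> {context word: co-occurrence count}.
COOC = {
    "car":   {"road": 8, "drive": 6, "engine": 5},
    "truck": {"road": 7, "drive": 4, "engine": 6, "cargo": 3},
    "apple": {"fruit": 9, "eat": 7, "tree": 4},
}

def cosine(a, b):
    """Hypothesis 1 (illustrative): cosine of the co-occurrence vectors."""
    u, v = COOC[a], COOC[b]
    dot = sum(count * v.get(ctx, 0) for ctx, count in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def jaccard(a, b):
    """Hypothesis 2 (illustrative): overlap of the sets of context words."""
    u, v = set(COOC[a]), set(COOC[b])
    return len(u & v) / len(u | v)

HYPOTHESES = {"cosine": cosine, "jaccard": jaccard}
# Assumed weights; a real model would learn these from annotated benchmarks.
WEIGHTS = {"cosine": 0.7, "jaccard": 0.3}

def relatedness(a, b, active=("cosine", "jaccard")):
    """Weighted average of the scores of the active hypotheses only."""
    total = sum(WEIGHTS[name] for name in active)
    return sum(WEIGHTS[name] * HYPOTHESES[name](a, b) for name in active) / total

full_score = relatedness("car", "truck")
subset_score = relatedness("car", "truck", active=("jaccard",))
```

Passing a different `active` tuple switches the subset of hypotheses without touching the rest of the model, which is the kind of flexibility the abstract describes.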
Word Embeddings: A Survey
This work lists and describes the main recent strategies for building
fixed-length, dense and distributed representations for words, based on the
distributional hypothesis. These representations are now commonly called word
embeddings and, in addition to encoding surprisingly good syntactic and
semantic information, have proven useful as extra features in many
downstream NLP tasks. Comment: 10 pages, 2 tables, 1 image
Low-rank Tensor Assisted K-space Generative Model for Parallel Imaging Reconstruction
Although recent deep learning methods, especially generative models, have
shown good performance in fast magnetic resonance imaging, there is still much
room for improvement in high-dimensional generation. Considering that internal
dimensions in score-based generative models have a critical impact on
estimating the gradient of the data distribution, we present a new idea,
low-rank tensor assisted k-space generative model (LR-KGM), for parallel
imaging reconstruction. This means that we transform original prior information
into high-dimensional prior information for learning. More specifically, the
multi-channel data is constructed into a large Hankel matrix and the matrix is
subsequently folded into a tensor for prior learning. In the testing phase, the
low-rank rotation strategy is used to impose low-rank constraints on the tensor
output of the generative network. Furthermore, we alternate between traditional
generative iterations and low-rank high-dimensional tensor iterations for
reconstruction. Experimental comparisons with state-of-the-art methods
demonstrated that the proposed LR-KGM method achieved better performance.
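The Hankel construction and folding step can be sketched on a toy problem. This is a simplified 1D illustration under stated assumptions, not the LR-KGM pipeline: real k-space data is 2D and complex-valued, and the helper names (`hankel_1d`, `multichannel_hankel`), the window size, and the two synthetic channels are all hypothetical.

```python
import numpy as np

def hankel_1d(signal, window):
    """Hankel matrix with H[i, j] = signal[i + j] and `window` columns."""
    rows = len(signal) - window + 1
    return np.array([signal[i:i + window] for i in range(rows)])

def multichannel_hankel(channels, window):
    """Place the per-channel Hankel blocks side by side in one large matrix."""
    return np.hstack([hankel_1d(ch, window) for ch in channels])

window = 3
n = np.arange(8)
base = 0.9 ** n                           # geometric signal: rank-1 Hankel
channels = np.stack([base, 0.5 * base])   # second channel is a scaled copy

H = multichannel_hankel(channels, window)              # shape (6, 6)
# Fold the matrix into a 3-way tensor: (rows, channel, window position).
tensor = H.reshape(H.shape[0], len(channels), window)  # shape (6, 2, 3)
rank = np.linalg.matrix_rank(H)
```

Because both toy channels come from the same geometric sequence, the stacked Hankel matrix has rank 1, which is the kind of redundancy a low-rank constraint is meant to exploit.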
Feature Selection by Singular Value Decomposition for Reinforcement Learning
Solving reinforcement learning problems using value function approximation requires having good state features, but constructing them manually is often difficult or impossible. We propose Fast Feature Selection (FFS), a new method for automatically constructing good features in problems with high-dimensional state spaces but low-rank dynamics. Such problems are common when, for example, controlling simple dynamic systems using direct visual observations with states represented by raw images. FFS relies on domain samples and singular value decomposition to construct features that can be used to approximate the optimal value function well. Compared with earlier methods, such as LFD, FFS is simpler and enjoys better theoretical performance guarantees. Our experimental results show that our approach is also more stable, computes better solutions, and can be faster when compared with prior work.
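To make the idea of SVD-based features for low-rank dynamics concrete, here is a minimal sketch. It assumes the features are the leading left singular vectors of a transition matrix; the abstract does not spell out the exact construction, so this is an illustration rather than the FFS algorithm. The matrices `W` and `D` and the helper `svd_features` are hypothetical, and the transition matrix is given directly instead of being estimated from samples.

```python
import numpy as np

# Hypothetical rank-2 dynamics: each of 6 states mixes 2 latent modes (W),
# and each mode has its own next-state distribution (D).
W = np.array([[1.0, 0.0], [0.8, 0.2], [0.5, 0.5],
              [0.2, 0.8], [0.0, 1.0], [0.6, 0.4]])
D = np.array([[0.5, 0.3, 0.2, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.1, 0.2, 0.3, 0.4]])
P = W @ D  # 6x6 row-stochastic transition matrix of rank 2

def svd_features(transitions, k):
    """Return the k leading left singular vectors as a feature matrix."""
    U, _singular_values, _Vt = np.linalg.svd(transitions)
    return U[:, :k]

Phi = svd_features(P, k=2)       # one 2-dimensional feature row per state
projected = Phi @ Phi.T @ P      # project the dynamics onto the feature span
```

With rank-2 dynamics, two features are enough for the projection to reproduce `P` up to floating-point error; fitting a value function on `Phi` is a separate step not shown here.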
Methods for Automatic Detection of Extrinsic Plagiarism in Large Texts
Plagiarism in documents, books, and art in general has serious consequences for society. The
existence of dishonest people in academia, industry, and the press who appropriate the
intellectual property of others has led some organizations to produce rules against plagiarism
and to adopt technological means to confront it and prevent its spread.
Plagiarism Automatic Detection (PAD) systems are undoubtedly the main means used to identify
cases of plagiarism in text documents available on the Web.
To obfuscate the fraud (that is, to hide the plagiarism) in a large text document, plagiarists
sometimes extract short sentences, which are then manipulated: converted from active to passive
voice and vice versa, with words replaced by synonyms and antonyms [ASA12, AIAA15, ASI+17].
On the other hand, with larger pairs of texts, the text alignment process is tedious, which makes
it less efficient and even less effective, especially when obfuscation has been attempted.
This work aimed to propose less complex PAD methods that make the Detailed Analysis process
more efficient and more effective. To that end, we developed two PAD methods. The first is a
plagiarism detection method that recursively segments the source document into three blocks in
order to identify small and large paraphrased plagiarized segments effectively and with high
time efficiency. The second is Plagiarism Search by Vector Scanning. This method uses word
embeddings (word2vec) without resorting to matrix calculations and can detect both small and
large plagiarized segments, even under heavy obfuscation, efficiently and effectively.
The results presented in Chapter 4 demonstrate the effectiveness and efficiency of the methods
proposed in this dissertation.
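A recursive three-block segmentation could look roughly like the sketch below. The abstract gives no algorithmic details, so the matching criterion (share of words also found in the suspicious text), the `threshold` and `min_len` parameters, and the function names are all assumptions made for illustration, not the dissertation's method.

```python
def shared_ratio(block, suspicious_words):
    """Fraction of the words in `block` that also occur in the suspicious text."""
    if not block:
        return 0.0
    return sum(word in suspicious_words for word in block) / len(block)

def find_suspect_spans(source, suspicious_words, start=0, min_len=4, threshold=0.8):
    """Return (begin, end) word offsets of source blocks that look copied.

    A block is reported when enough of its words appear in the suspicious
    text; otherwise it is split into three sub-blocks that are examined
    recursively. Blocks of at most `min_len` words (min_len >= 1) are not
    split any further.
    """
    if shared_ratio(source, suspicious_words) >= threshold:
        return [(start, start + len(source))]
    if len(source) <= min_len:
        return []
    third = -(-len(source) // 3)  # ceiling division: size of each block
    spans = []
    for offset in range(0, len(source), third):
        spans += find_suspect_spans(source[offset:offset + third], suspicious_words,
                                    start + offset, min_len, threshold)
    return spans

source = ("alpha beta gamma delta epsilon zeta eta theta "
          "iota kappa lambda mu").split()
suspicious = set("foo epsilon zeta eta theta bar".split())
spans = find_suspect_spans(source, suspicious)
```

In this toy run the whole 12-word source does not match, so it is split into three 4-word blocks and only the middle one, word offsets 4 to 8, is reported. Handling paraphrase and synonym substitution would need a softer matching criterion than exact word overlap.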