Time series analysis is an important research topic of great interest in many fields. Recently, the Matrix Profile method, and particularly one of its implementationsthe SCRIMP algorithm-has become a state-of-the-art approach in this field. This is a technique that brings the possibility of obtaining exact motifs from a time series, which can be used to infer events, predict outcomes, detect anomalies and more. However, the memory-bound nature of the SCRIMP algorithm limits the execution performance in some processor architectures. In this paper, we analyze the SCRIMP algorithm from the performance viewpoint in the context of the Intel Xeon Phi Knights Landing architecture (KNL), which integrates high-bandwidth memory (HBM) modules, and we combine several techniques aimed at exploiting the potential of this architecture. On the one hand, we exploit the multi-threading and vector capabilities of the architecture. On the other hand, we explore how to allocate data in order to take advantage of the available hybrid memory architecture that conjugates both the high-bandwidth 3D-stacked HBM and the DDR4 memory modules. The experimental evaluation shows a performance improvement up to 190 × with respect to the sequential execution and that the use of the HBM memory improves performance in a factor up to 5 × with respect to the DDR4 memory.
Introduction and related work
Time series analysis is one of the most important techniques to extract information in domains such as genetic analysis [1] , seismology [2] , traffic [3] and energy [4] . Recently, Matrix Profile [5] was introduced as a powerful tool for time series analysis. Among other features, this tool is able to detect motifs and anomalies in a time series.
According to the literature, there are two major approaches for time series motif discovery: approximate algorithms [6] and anytime (exact) algorithms [5] (such as Matrix Profile). Approximate algorithms usually take less execution time than anytime ones. However, for large time series, an approximate algorithm can provide inaccurate results as they are based on probabilistic assumptions. As a result, the user has to set several (even not intuitive) parameters, trying to get results accordingly to the expected ones, a fact which is not always possible.
Multiple techniques for time series data mining, and more specifically, for anomaly detection, can be found in the literature. Chandola et al. wrote a survey on different techniques for anomaly detection on time series [7] up to 2009. The most recent techniques are based on deep learning [8] , machine learning [9] , neural networks [10] and motif discovery [5] . In addition to the Matrix Profile, we can find other approaches based on time series motif discovery, such as a probabilistic solution [11] , spatiotemporal models [12] , an indexing solution [13] and a symbolic representation [14] . Not many works can be found accelerating time series processing algorithms on the Intel Xeon Phi architecture. For instance, in [15] the Xeon Phi KNC [16] (Knights Corner-a previous version of the Xeon Phi architecture) is used to speed up the dynamic time warping (DTW) technique to measure the similarity between time series. However, their approach is limited by a master-worker scheme, and this target architecture, being older, does not include ultra-high-bandwidth 3D-stacked memory modules.
We focus our attention on the Matrix Profile technique and, in particular, the SCRIMP algorithm [17] . At this time, there are three major different implementations of the matrix Profile: STAMP [5] , STOMP [18] and SCRIMP [17] . The SCRIMP algorithm combines the best properties of STAMP, which computes the diagonals of the Matrix in random order, with STOMP, which is faster than STAMP but computes the diagonals in sequential order. By the time of writing this paper, SCRIMP is the algorithm that offers better results.
The authors also presented PreSCRIMP [17] , a FFT-based implementation of the Matrix Profile (as STAMP and STOMP), which exploits the Consecutive Neighborhood Preserving (CNP) and allows obtaining very approximate Matrix Profiles in time series that consists in a pattern that is repeated over the time, at a fraction of the execution time of SCRIMP. They named SCRIMP++ the combination of the PreSCRIMP phase with SCRIMP. We focus our interest on the SCRIMP algorithm because our main aim is anomaly detection on non-predictable time series.
With the purpose of speeding up the Matrix Profile algorithms, the authors have developed a GPU-based implementation of STOMP [18] and PreSCRIMP [17] . By the time of writing this paper, there is no parallel version of Matrix Profile available using 3D-stacked memory architectures. SCRIMP can compute diagonals in order, but it also can compute them in random order. If having a partial solution of Matrix Profile is sufficient for the target application, using a random order for the diagonals allows stopping the execution of the algorithm before it finishes. In contrast, if the application needs the oracle (exact) Matrix Profile, we can use the in-order implementation which leads to a lower execution time thanks to data locality. Figure 1 shows a time series and its Matrix Profile. We can observe that the series presents a periodic pattern, but with an anomaly between the values 100 and 120, approximately. The Matrix Profile of this time series returns low values for the periodic part of the series and higher values where the anomaly appears. Figure 2 , suited to the Intel Xeon Phi Knights Landing many-core processor (KNL) [19] , shows the observed arithmetic intensity of SCRIMP (that is, arithmetic operations per byte accessed in memory) and measured performance in GFLOP/s (details about this evaluation are described in Sect. 3). The KNL processor integrates ultra-high-bandwidth memory modules on chip (HBM type) and has access to external DDR4 memory modules. The point 1 in the graph represents the performance of SCRIMP. We observe how the low arithmetic intensity prevents the application from using the full computing potential of the many-core processor. However, despite the arithmetic intensity being low, the SCRIMP bandwidth usage is close to the theoretical limit of the platform (see Sect. 4.3 for more details). This indicates that the memory bandwidth is the main bottleneck limiting the performance of the application, and shows that there is ample room for improvement. Two different options for performance improvement could be considered. A first approach The anomaly is marked with a rectangle in the time series, where there is a lower value that does not fit with the regular pattern of the whole series. In the Matrix Profile, this appears as a high value of distance, which means low similarity with respect to the other subsequences of the series could be modifying the algorithm suitably in order to increase its arithmetic intensity (i.e., incrementing the computation performed for each memory access). This could move the performance to the point 2 in Fig. 2 , but still limited by memory bandwidth. If the arithmetic intensity is improved enough, the application can even achieve the peak performance offered by the platform. However, modifying the algorithm requires a huge programming effort and it may not be even possible. Another approach consists in exploiting the HBM memory available in the processor. HBM is a high-performance RAM interface for a 3D-stacked DRAM [20] . As depicted in Fig. 3 , this memory organization consists of several layers of active silicon bonded with vertical interconnections, known as TSVs (Through-Silicon Vias) and microbumps. In this way and for our target platform, a peak memory bandwidth of about 450 GB/s can be achieved, in contrast to the 80 GB/s offered by the DDR4 memory. By using HBM memory either combined with or instead of DDR4, and assuming that the same arithmetic intensity is maintained, the performance can increase to When using DDR4, performance is limited by the memory bandwidth 1 . Redesigning the algorithm to increase the arithmetic intensity can lead to a performance improvement ( 2 ) . Using the integrated HBM memory, performance improves without redesigning the algorithm 3 . Finally, the use of the HBM memory and redesigning the algorithm could lead to the ideal configuration (maximum fused multiplyadd (FMA) peak available in the platform), marked as 4 This paper is focused on how much parallel architectures with HBM memory can accelerate bandwidth-bound algorithms, such as SCRIMP, by bringing into play all computing resources of the target platform and selecting suitable mappings for data allocation in memory.
The paper is organized as follows. In Sect. 2, we present some background related to the problem. Section 3 explains how the algorithm has been implemented to extract maximum performance from the platform. In Sect. 4, we present the experimental evaluation and discuss the benchmark results. Finally, Sect. 5 draws some conclusions and directions of future work.
Background

Matrix Profile and SCRIMP
In this subsection, we provide a summary of the Matrix Profile and the SCRIMP algorithm. More details about the Matrix Profile can be found in [5] , while the complete description of SCRIMP is available in [17] . A formal definition of the distance d i,j will be provided later, but for the moment we can consider it as a measurement of similarity. Figure 5 shows the calculation of the Matrix Profile for the sequence in Fig. 4 . Given a length m, a Matrix Profile P of a time series T is a vector containing the distance between every subsequence of T and its nearest subsequence (i.e., P i = d i,j if T j,m is the closest subsequence to T i,m ). The Matrix Profile index I is a vector of indexes where I i = j if P i = d i,j . This way, P contains the minimum distances between subsequences of T while I acts as a vector of "pointers" to the location The size of such matrix is huge for large time series, and thus, it is not convenient to store it in memory. In addition, the cache hierarchy could not benefit from having the whole matrix in memory because the relative size of the window with respect to the cache block when performing the calculation between the subsequences could lead to consecutive misses. Furthermore, storing such matrix in memory results in contention for the cache resources between the distance matrix, the Matrix Profile P, the Matrix Profile index I and the time series T. For these reasons, Matrix Profile algorithms are designed to store only the Matrix Profile P and the Matrix Profile index I, and they calculate the distances d i,j as required.
Note that the neighboring subsequences of T i,m are similar to such subsequence (i.e., d i,i+1 ≈ 0 ). As the purpose of the matrix profile is to find similar subsequences after a given distance, neighboring subsequences are excluded from the computation of P. This is done by defining an exclusion zone for each subsequence. In general, the exclusion zone for a subsequence T i,m of length m is m / 2 after the ending of such subsequence. Figure 4 shows two subsequences for a given time series T and the exclusion zone for one of them.
In Sect. 1 of this paper, we discuss the 3 most relevant implementations of the Matrix Profile: STAMP [5] , STOMP [18] and SCRIMP [17] . From the complexity point of view (i.e., the amount of data to compute), the STAMP algorithm allows for the calculus of Matrix Profile with an O(n 2 log(n)) computational complexity (n is the time series length). Lately, the STOMP algorithm was proposed to calculate the Matrix Profile with an O(n 2 ) complexity. While being slower, the main advantage of STAMP over STOMP is that it is a fast anytime converging algorithm which is able to compute accurate partial solutions in much less time. This way, for many SCRIMP is an algorithm developed by Zhu et al. [17] to compute the Matrix Profile and Matrix Profile index, provided a time series and a subsequence length (or window length) m. To compute the distance d i,j of two subsequences, SCRIMP draws on the z-normalized Euclidean distance calculated as follows:
In this formula, Q i,j is the dot product of subsequences i and j and can be calculated directly or, if a previous dot product is known, using the expression [18] for more details). Thus, calculating Q i,j creates a dependency between elements in the same diagonal in Fig. 5 . This way, the dot product only has to be calculated for the first row (or column) of the table, and the remaining values can be calculated using the previous ones. Symbols k and k represent the mean and standard deviation of the subsequence T k,m , respectively, provided a window size m. These values can be calculated beforehand following the technique introduced in [21] . Figure 6 shows a pseudo-code of the SCRIMP algorithm. Initially, the mean and standard deviation are precalculated (line 2), and the Matrix Profile and Matrix Profile index are initialized (line 3). Then, the diagonals (see Fig. 5 ) are calculated (lines 4-9). The variable Diagonals is the set of all diagonals needed to calculate P and I. These diagonals can be grabbed randomly, allowing SCRIMP providing partial results in case the algorithm is not computed until completion, or in order, which makes it lack this property but allowing further optimizations (see details in [17] ). Both options are explored in this work. When processing each element of the diagonal, we need either to compute the dot product if it is the first element (line 7), or use previous results (line 9). Once the distance is calculated (line 11), it replaces the previous distance for this diagonal provided that the new one is smaller (lines [12] [13] [14] . As the Matrix Profile stores the distance d i,j between two subsequences T i,m and T j,m , a second condition checks whether the distance d j,i needs to be updated (lines [15] [16] [17] . This checking phase is necessary because there is no guarantee of reciprocity, in terms of best similarity, between two subsequences.
Intel Xeon Phi KNL
A SuperMicro Superserver 5038K-i [22] was used as the target platform, which includes the Intel Xeon Phi 7210 processor [23] . This architecture features a combination of both HBM and DDR4 memory technologies, allowing programmers to allocate data in any of them as required [24] . The processor, based on the Knights Landing architecture (KNL), includes 64 Airmont cores (Atom) and up to 256 hardware threads (4 threads per core). It features 16 GB HBM integrated in the CPU chip, providing a much higher bandwidth than the external DDR4 modules. The HBM is based on Multi-Channel DRAM (MCDRAM) and consists of four memory banks with an aggregated peak bandwidth of more than 450 GB/s. The DDR4 memory, whose capacity is 192 GB, provides 6 channels with a peak raw bandwidth of 80 GB/s. The MCDRAM can be configured in three ways: Flat, Cache or Hybrid modes. The mode has to be selected at boot time. If the Flat mode is enabled, the MCDRAM acts as a separate addressable space. In the Cache mode, the MCDRAM is a transparent last-level cache. By setting the Hybrid mode, we can take advantage of both modes. Hybrid mode allows us to use part of the MCDRAM as addressable memory and the rest as cache. In this work, the Flat mode has been selected, as our goal is to find a suitable data mapping using both the MCDRAM and the DDR4 memories, and consequently aggregating bandwidth and storage of both memory spaces.
The cores are bonded in pairs conforming 32 tiles. Both cores in each tile shares an L2-cache of 1 MB. Tiles can be clustered in several NUMA configurations. In this work, the quadrant cluster has been used, in which the array of tiles is organized in four quadrants, each one connected to a MCDRAM controller. This configuration reduces the L2-cache miss latency, and the four blocks of MCDRAM appear as contiguous memory spaces.
Two basic levels of parallelism can be exploited: task/thread parallelism between the cores, and data parallelism using the AVX-512 SIMD extensions. These extensions allow computing 16 single-or 8 double-precision floating-point operations at the same time by means of two out-of-order vector processing units (VPUs) that are available per tile. The theoretical peak performance of the KNL is 6 TFLOP/s in single precision and 3 TFLOP/s in double precision.
3
Accelerating time series motif discovery in the Intel Xeon… 3 Optimizing SCRIMP on KNL As described in Sect. 2.1, SCRIMP calculates distances between subsequences following diagonals, as shown in Fig. 5 . This has two main advantages: (1) Elements in a diagonal can be calculated from previous elements of the same diagonal, and (2) by calculating diagonals in random order, we can obtain a good sampled partial result of the Matrix Profile. SCRIMP is highly parallelizable because the computation of the different diagonals can be carried out independently. The simplest way for speeding up the algorithm in a many-core system is distributing the diagonals among the executing threads. However, as diagonals have different length such assignment may lead to workload imbalance. We deal with this situation creating a pool of diagonals to be processed, which consists in a list of indexes and the size depends on the time series length. From the thread scheduling point of view, it consumes 1% of the execution time. Every time a computing thread is idle, it retrieves a diagonal from this pool (task queue) and processes it. This way, threads are busy most of the time and the imbalance is reduced to those diagonals at the end of the computation when few of them remains to be processed.
When processing a diagonal, a thread calculates distances between subsequences and updates the Matrix Profile P and the Matrix Profile index I. Note that both P and I are to be calculated per column. This way, it could be possible that two threads are computing concurrently two different distances in two different diagonals, but with such distances belonging to the same column (i.e., the same element in the profile and profile index). Thus, both threads may try to update the same element in P and I simultaneously which may lead in inconsistencies if not managed properly (writeafter-read data dependency). To deal with these potential data conflicts, two solutions were considered: lock-based mutual exclusion and data privatization.
An important drawback to be addressed is the low arithmetic intensity exhibited by the SCRIMP algorithm. We tackle this by: (1) incrementing the arithmetic intensity of the algorithm via vectorization, and (2) selecting the most appropriate memory space (i.e., HBM or DDR4) for each variable in order to benefit from the aggregated bandwidth offered by the KNL architecture. In this sense, the work of Khaldi et al. [25] explores the use of HBM and DDR4 memories from an automatic perspective depending on the application. However, our paper focuses on the use of these memory spaces to solve memory-bounded problems, such as the Matrix Profile, profiling which data structures and variables fits better in each memory space. The different design approaches we followed are discussed in the next subsections.
Updating P and I
Potential concurrent updates to the P and I data structures can be solved either by mutual exclusion or by privatizing the critical data. The performance of the first approach may be impacted strongly by the synchronization pressure. On the other hand, although the synchronization effort is negligible for privatization, it requires an extra amount of memory. Both possibilities are analyzed next. Fig. 7 , if a thread needs to update the Matrix Profile P and the Matrix Profile index I (lines 12-15) with a new distance for a pair of subsequences, it has to acquire the two locks (lines 4-11), one for the first position and the other for the second position, releasing it after the update occurs (lines [16] [17] . In case both locks cannot be acquired, they are released (line 9) to prevent deadlocks. Note that this solution may lead to livelocks, but we rely on the fairness of the OS scheduler to prevent this situation. Profile and index privatization A second implementation is based on privatizing the accesses to the shared data structures. Our goal is to avoid the use of mutual exclusion methods, looking for performance boosting. In this case, the use of synchronization between threads is minimal. This is achieved by expanding [26] the Matrix Profile, creating a replica (row) per thread as depicted in Fig. 8 . This is computationally safe as long as the performed operation per iteration is commutative and associative, i.e., it is a reduction loop [27] . Although shared, each replica can be seen as a private storage during the calculation of the Matrix Profile, in such a way that each thread computes its private Matrix Profile, storing the computed partial results in it. The only imperative synchronization is a barrier to wait for all threads to finish its private computation. Once all threads have calculated their private Matrix Profiles, a final reduction operation (per column) is carried out to compute the final results. No synchronization is necessary in this stage because each thread is in charge of a different column, so it can be done fully in parallel. Note that an analogous approach is applicable to the Fig. 8 . The expansion involves to declare two matrices that can be accessed by threads: private profile and profile index, as the code in Fig. 9 shows. Each thread updates the new distances in the profile corresponding to a row based on their thread identifier (lines [20] [21] [22] [23] [24] [25] [26] [27] [28] , and the final reduction stage is performed in parallel with no synchronization (lines 33-43).
Atomics
Increasing arithmetic intensity
Additionally to the thread parallelization of the algorithm, it is possible to reduce the execution time even more taking advantage of the SIMD support available in the KNL platform. For this purpose, several distance calculations and updates have to be packed into groups instead of calculating and updating them individually. In particular, we optimized the dot product calculation Q i,j of Eq. (1). It must be taken into account that there are data dependencies between the calculation of consecutive elements of a single diagonal. Consequently, this part of the computation will be carried out in a non-vectorized unrolled loop, and later used to calculate the non-dependent part of the computation that can be vectorized inside another loop. Figure 10 shows how the computation is performed. First we precalculate the component of the value with no data dependencies (lines 3-6) in a vectorized loop. After that, we can update the values, taking account of the dependencies and using an unrolled loop (lines 8-11). Finally, distance values are obtained using the values previously precalculated, again in a vectorized loop (lines [13] [14] [15] [16] [17] . Note that the constant ARIT_FACT is architecture dependent and it guides the compiler when generating the machine code. In our case, this constant is set to eight, in order to take advantage of the 512-bit SIMD instructions (eight 64-bit double-precision floating-point numbers stored in a single vector register). 
3
Accelerating time series motif discovery in the Intel Xeon…
Memory allocation policy and scalability
In order to gain advantage of the maximum memory bandwidth available (HBM plus DDR4), our approach allocates the most frequently accessed variables in the HBM space, where the highest bandwidth is available. Specifically, HBM will store the computed average and standard deviation for each subsequence (parameters and in Eq. (1)), as well as the privatized profile and profile indexes structures. Note that the code changes required to allocate HBM memory are minimal: Intel's KNL libraries provided the hbw_malloc functions to allocate HBM, while a regular malloc allocates DDR4 memory. However, the original time series and the final profile and profile indexes are allocated in the DDR4 memory.
Using this approach, the DDR4 memory serves the time series to the threads while the calculation of the diagonals is performed (only reads), and after that, the DDR4 is used again to keep the final results obtained in the final reduction stage. In this way, both the HBM and DDR4 bandwidths are aggregated, leading to an efficient memory exploitation, and saving space in the HBM. This approach becomes more relevant in very large time series, because for shorter ones there is no significant improvement from the execution time point of view, and there is no need for saving space in the HBM space, being enough for those time series. Moreover, when processing very large time series, where the privatized structure of the whole series does not fit in the HBM space, we need to process the time series in several chunks of diagonals. In this sense, we take advantage of the DDR bandwidth and space for reading the original time series and storing the final result that is going to be updated after every chunk is processed.
The analysis of scalability using several Intel Xeon Phi KNL processors in the same machine or distributed computing techniques is out of the scope of this manuscript, because we focus on how a memory-bounded algorithm benefits from the use of 3D stacked memories. Despite of that, the nature of our implementation makes both approaches very straightforward to follow, where the diagonals are distributed over the available computational resources and only a final reduction (which can also be done in parallel) is needed.
Experimental evaluation
Experimental setup
Experiments were conducted in the Intel Xeon Phi 7210 as described in Sect. 2.2. Codes were compiled with the Intel C++ compiler version 18.0.2, enabling the flag -O3 for the highest optimization and the flag -xHost to generate code with the widest available instruction set (in this case, the AVX-512, when possible). The results below correspond to the average of ten executions.
Speedup results
In this subsection, we present the experimental evaluation of our proposals. First, we carried out experiments related to speedup and memory bandwidth usage with fixed time series length and fixed window size, using the different parallel implementations described in Sect. 3. After that, we measured the memory bandwidth usage varying the number of threads in the implementation that brought the best results from the performance point of view. Finally, we executed our private-structurebased implementation varying time series length, window size and memory allocation policies allowing a comparison with previous works.
With respect to the speedup, we initially tested our SCRIMP implementation based on software atomics. In this case, our goal is to study the scalability with the number of threads. Speedup is calculated with respect to the original sequential version. Taking account of the Matrix Profile calculations, the nature of the data does not affect the execution times, so there is no need to particularize for a specific time series. In this sense, we used a random time series of 2 17 (131,072) elements, and a fixed window size of 1024. Regarding the allocation of the data, we used only the DDR4 space, because using the HBM memory in this approach does not improve the performance of the execution. The explanation to this is that the memory bandwidth usage of this implementation is very low in comparison with the available bandwidth, as we will show later.
Results of this experiment correspond with the lowest plot in Fig. 11 . As expected, this implementation is far away from the ideal speedup. The maximum speedup with this approach is only about 16 times faster than the original and sequential version using 128 threads. Also, for obtaining similar execution times than the sequential version, the atomic-based implementation needs 4 threads. The explanation to this behavior is that the atomic accesses to the common profile structure slow down the thread because such atomics are implemented in software. Also, this approach leads to a huge amount of cache traffic and waiting time for the threads, which also contribute to slow down the execution.
The second experiment tests our implementation based on privatizing the calculation of the Matrix Profile. As in the previous experiment, we used a random time series of 2 17 elements and a fixed window size of 1024. In this case, four configurations were evaluated. On the one hand, we tested either the combination of the HBM plus the DDR4 memory, or using the DDR4 memory only. On the other hand, when computing the diagonals, we considered a random order or the sequential one.
The observed speedup of these four configurations is shown in Fig. 11 . Setting the random order gives the advantage of the anytime property, which allows the user to stop the computation in any part of it, having a partial (not exact) result. The drawback of this approach is that it is expected a lower reuse of data in caches and that will explain why computing diagonals in the sequential order results in a better performance.
Furthermore, we can notice that both of the DDR4 cases (random and sequential order) obtain similar performance, growing significantly up to 16 threads and even starting to decrease from 64 threads. This is a scenario of over-threading, where increasing the non-payload part of the computation (more threads asking for memory requests) brings lower performance. Thus, the DDR4 bandwidth is not enough to serve the threads the data they need. In contrast, the HBM cases keep improving with the number of threads, and from 16 threads, they start to be far from the DDR4 ones. Also, we have to take into account that the number of physical cores in the target platform is 64, and an amount of threads beyond this number imply the use of hyperthreading. This fact results in a lower performance increment if compared with the case of a thread per actual core. Finally, note that our single-thread execution is faster than the sequential implementation from the authors of Matrix Profile (concretely, 2.6 × faster), due to the increase in the arithmetic intensity through taking advantage of vectorization (the use of AVX-512 instructions) in most of the computations, as explained in Sect. 3.2.
Finally, we compare our implementation with the GPU-STOMP implementation from authors of Matrix Profile [18] . The results are shown in Table 1 , where we define a fixed window size of 256 elements and the same data type than the GPU implementation (single precision). The diagonals are computed in order by 256 threads and the implementation uses the HBM space. Note that the speedup of SCRIMP in the Xeon Phi tends to be lower with respect to GPU-STOMP in larger series than in shorter ones.
Memory bandwidth results
By using the Intel VTune profiling tool [28] , the memory bandwidth usage of our implementation has been measured. Figure 12 shows the normalized execution time for every bandwidth usage given a random time series of 2 18 elements, which allows more accurate profiler results. We used a window size of 1024, 256 hardware threads and the same five configurations than in the previous subsection. In the atomic-based implementation, the memory usage is very low, since most of the time the threads are competing for the locks. The two cases corresponding to the DDR4-only implementations present a high memory bandwidth usage taking account of the maximum of this type of memory. This maximum is not enough for taking advantage of the sequential order of the diagonals. With respect to the HBM plus DDR4 based implementations, we can notice how whereas a random order of the diagonals uses a high memory bandwidth, the sequential order nearly achieves the maximum for the HBM memory most of the time.
In another set of experiments, we measured the bandwidth utilization of our privatization-based implementations varying the number of threads, using the same time series and parameters than in the previous experiment. With these experiments, we wanted to test how is the memory bandwidth usage for different number of threads.
The first experiment is shown in Fig. 13 , where we used only the DDR4 memory and a sequential order for computing the diagonals. We can see that the maximum bandwidth is obtained using 64 threads, which agrees with the point where the speedup graph becomes plain, as shown in Fig. 11 .
The second experiment is shown in Fig. 14 , where we used the combination of the HBM plus the DDR4 bandwidth and a sequential order for the diagonals. In this case, the bandwidth starts to grow from the beginning until the maximum number of threads, where it achieves the theoretical maximum bandwidth for non-sequential accesses.
Sensibility to time series length and window size
Tables from 2, 3, 4 and 5 show the sensibility of the implementations when varying the time series length and the window size. All of these tests use the number of threads that achieves the best performance. To conduct these experiments, we defined random time series with representative sizes. In particular, we used time [17] , as well as the most common window sizes (m). Table 2 shows the results of executing SCRIMP with the DDR4 memory and setting a random order for the diagonals. As the results suggest, there is not a clear correlation between the window size and the execution time, but depends on the number of elements that fits in caches.
In the case we use a sequential order for computing the diagonals, we obtain similar execution times than in the previous case, as shown in Table 3 . Taking account that we use software prefetching for random and sequential order for the diagonals, there are no significant performance benefits for short series, but for bigger ones, we can achieve up to 25% of better execution times.
We have also evaluated the combination of the HBM plus the DDR4 memories. If we use the HBM memory for allocating the most used variables and a random order for the diagonals, we obtained from 2.8 × to 4.6 × of speedup with respect to only using the DDR4, depending on the window size and the time series length. Larger time series obtain more benefit from the use of HBM than the smaller ones, as shown in Table 4 . The experiments using the sequential order for the diagonals and the combination of the HBM plus the DDR4 memory are shown in Table 5 . In this case, we obtain up to 58% of better execution times than the same memory configuration with random order for the diagonals. Table 6 summarizes the observed floating-point operations per second that our implementation can achieve. Such results has been obtained with the Intel Advisor tool [29] using the same 2
Floating-point performance
18
-element random time series used previously with a window size of 1024 and 256 threads. We observe that there is no significant benefit in terms of GFLOP/s between computing in random or sequential order using only the DDR4 memory, because the bandwidth is already saturated even with the sequential order and data locality is not exploited. In contrast, if combined both the DDR4 and the HBM memory, there is a significant difference (around 20% of more throughput from the FLOP/s viewpoint) between using or not random order for the diagonals, as explained before. Nevertheless, the results remain far away from the processor peak performance, which is a common issue in memory-bounded problems.
Real-world applications
Our implementations have been tested taking as reference some of the datasets used in [5, 18] and [17] , which allows us to compare and validate the results. These datasets come from real-world applications. We are interested in the exact solution of the Matrix Profile, so our executions finish when SCRIMP converges (no partial solutions). Exact solutions are specially relevant in problems such as anomaly detection, which in some cases are detected at the final stage of the execution.
The first case study we evaluated is seismology. In Fig. 15 , we present the dataset (upper graph) which consists of about 40,000 elements. Our multi-threaded and vectorized implementation of SCRIMP returns the Matrix Profile shown in the lower graph, in which there are represented several peaks corresponding with the most significant discords, where the distance of the corresponding subsequence is higher (the similarity with respect to the whole time series is lower).
Using a window size of 100 elements, our implementation of HBM plus DDR4 memory allocation policy and sequential order for the diagonals took only 0.25 s in obtaining the Matrix Profile, instead of 22.5 s of the original sequential implementation. Another application used for testing our implementation is penguin data, which consists of a dataset of 110,000 elements approximately, obtained from an accelerometer. In Fig. 16 , we show the corresponding Matrix Profile of the dataset, from which a biologist could infer when the penguin is diving or walking, for example.
In this case, the execution took 0.93 s using our implementation of HBM plus DDR4 memory allocation policy and sequential order for the diagonals, instead of 39.53 s of the original sequential version, using a window size of 800 elements.
Finally, we tested a neuroscience dataset, which contains 1,030,000 elements approximately, represented in the upper part of Fig. 17 . Note that there are several discontinuities (missing values) in the dataset, which are represented as a value of 0 in the graph for readability. The Matrix Profile algorithm can give us a coherent results even with this outliers, which is a proof of the robustness of the algorithm. Fig. 15 Seismology data. This dataset, which consists of approximately 40,000 elements, presents several peaks corresponding to earthquakes. The Matrix Profile is able to accurately identify them, which are represented as higher values of distance for the corresponding subsequences, as they are discords with respect to the whole time series Fig. 16 Penguin data. This dataset, which consists of approximately 110,000 elements, presents several motifs that can help a biologist to identify when the penguin is diving, for example Using our implementation of HBM plus DDR4 memory allocation policy and sequential order for the diagonals, the Matrix Profile took 81 s to be calculated, whereas the original sequential implementation took 15,108.12 s, using a window size of 5400.
Conclusions and future work
In this work, we have introduced a novel implementation of the SCRIMP Matrix Profile algorithm tuned for an Intel Xeon Phi KNL architecture, provided with 3D-stacked high-bandwidth memory. In this implementation, we exploit multithreading, vectorization and the use of aggregated HBM plus DDR4 memory bandwidth. Two approaches have been implemented and tested. First, we propose the parallelization of the processing of diagonals in SCRIMP, dynamically distributing their computation across the cores in a many-core machine. Secondly, we use privatization and reduction techniques to avoid unnecessary thread synchronization and improve the scalability. Additionally, we increase the arithmetic intensity of our implementations using vector operations. Finally, we propose the distribution of data across the DDR4 and HBM memory spaces, allocating private and more frequently accessed data in HBM and shared read-only data in DDR4.
Experiments show that our approach can improve performance in up to 190 × with respect to sequential execution (which does not take advantage of vectorization nor HBM space) using a 64-core Xeon Phi 7210 (KNL). Furthermore, our techniques present better scalability than traditional lock-based synchronization mechanisms. Lastly, the implementation using both HBM and DDR4 memories is able to outperform by 5 × the DDR4-only solution, proving the benefits of HBM memory for bandwidth-bounded problems.
Future research directions involve the study of specific hardware to compute memory-bounded problems as SCRIMP. One promising solution is processing in Fig. 17 Neuroscience data. This dataset, which consists of approximately 1,030,000 elements, presents several discontinuities which are represented in the graph as values of 0 for readability. The Matix Profile is able to obtain coherent results even with those discontinuities, proof of robustness of the algorithm memory (PIM) [30] . With PIM, a logic layer can be stored within (or very close) to memory to perform basic operations. The goal is to assess the validity of the PIM approach to compute memory-bounded algorithms such as SCRIMP.
