Abstract: In blind detection, a set of candidates has to be decoded within a strict time constraint to identify which transmissions are directed at the user equipment. Blind detection is required by the 3GPP LTE/LTE-Advanced standard, and it will be required in the 5th generation wireless communication standard (5G) as well. Polar codes have been selected for use in 5G: thus, the issue of blind detection of polar codes must be addressed. We propose a polar code blind detection scheme where the user ID is transmitted instead of some of the frozen bits. A first, coarse decoding phase helps select a subset of candidates that is decoded by a more powerful algorithm: an early stopping criterion is also introduced for the second decoding phase. Simulation results show good missed detection and false alarm rates, along with substantial latency gains thanks to early stopping. We then propose an architecture to implement the devised blind detection scheme, based on a tunable decoder that can be used for both phases. The architecture is synthesized and implementation results are reported for various system parameters. The reported area occupation and latency, obtained in 65 nm CMOS technology, are able to meet 5G requirements, and are guaranteed to meet them with even less resource usage in the latest technology nodes.
I. INTRODUCTION
Blind decoding, also known as blind detection, requires the receiver of a set of bits to identify whether said bits compose a codeword of a particular channel code. In the 3GPP LTE/LTE-Advanced standards, blind detection is used by the user equipment (UE) to receive control information related to the downlink shared channel. The UE attempts the decoding of a set of candidates to identify whether one of the candidates holds its control information. Blind detection will be required in the 5th generation wireless communication standard (5G) as well: ongoing discussions are considering a substantial reduction of the time frame allocated to blind detection, from 16 µs to 4 µs. Blind detection must be performed very frequently, and given the high number of decoding attempts that must be performed in a limited time [1], it can lead to large implementation costs and high energy consumption. Blind detection solutions for codes adopted in previous generation standards can be found in [2]-[4].
Polar codes are a class of capacity-achieving error correcting codes, introduced by Arıkan in [5]. They are characterized by simple encoding and decoding algorithms, and have been selected for use in 5G [6]. In [5], the successive-cancellation (SC) decoding algorithm has been proposed as well. It is optimal for infinite code lengths, but its error-correction performance degrades quickly at moderate and short code lengths. In its original formulation, it also suffers from long decoding latency. SC list (SCL) decoding has been proposed in [7] to improve the error-correction performance of SC, at the cost of increased decoding latency. In [8]-[11], a series of techniques has been proposed, aimed at improving the decoding speed of both SC and SCL without sacrificing error-correction performance.
Blind detection of polar codes has been recently addressed in [12], where a blind detection scheme fitting within 3GPP LTE-A and future 5G requirements has been proposed. It is based on a two-step scheme: a first SC decoding phase helps select a set of candidates, which are subsequently decoded with SCL. An early stopping criterion for SCL is also proposed to reduce average latency. Another recent work on polar code blind detection [13] detaches itself from 4G-5G standard requirements, and proposes a metric on which the outcome of the blind detection can be based.
In this work, we extend the blind detection scheme presented in [12] and its early stopping criterion by considering SCL also in the first decoding phase, and provide improved detection accuracy results. We then propose an architecture to implement the blind detection scheme: it relies on an SCL decoder with tunable list size, that can be used for both the first and second decoding stages. The architecture is synthesized and implementation results are reported for various system parameters.
The rest of the paper is organized as follows. Section II introduces background information on polar codes and blind detection. Section III details the proposed blind detection scheme, and provides simulation results to evaluate its performance. The architecture of the blind detection system is detailed in Section IV, and implementation results are given in Section V. Finally, Section VI draws the conclusion.
II. PRELIMINARIES

A. Polar Codes
A polar code $\mathcal{P}(N, K)$ is a linear block code of length $N = 2^n$ and rate $K/N$, and it can be expressed as the concatenation of two polar codes of length $N/2$. This is due to the fact that the encoding process is represented by a modulo-2 matrix multiplication as

$$x = u G^{\otimes n}, \tag{1}$$

where $u = \{u_0, u_1, \ldots, u_{N-1}\}$ is the input vector, $x = \{x_0, x_1, \ldots, x_{N-1}\}$ is the codeword, and the generator matrix $G^{\otimes n}$ is the $n$-th Kronecker power of the polarizing matrix $G = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$. The polarization effect brought by polar codes allows the $N$-bit input vector $u$ to be divided between reliable and unreliable bit-channels. The $K$ information bits are assigned to the most reliable bit-channels of $u$, while the remaining $N - K$, called frozen bits, are set to a predefined value, usually 0. Codeword $x$ is transmitted through the channel, and the decoder receives the logarithmic likelihood ratio (LLR) vector $y = \{y_0, y_1, \ldots, y_{N-1}\}$.
Fig. 1: Binary tree example for $\mathcal{P}(16, 8)$. White circles at $s = 0$ are frozen bits, black circles at $s = 0$ are information bits.
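The recursive structure behind (1) can be illustrated with a minimal Python sketch. The function name `polar_encode` is hypothetical and not part of the original work. The sketch assumes the plain Kronecker-power generator with no bit-reversal permutation, and uses the fact that a length-$N$ codeword is the concatenation of two length-$N/2$ codewords.

```python
def polar_encode(u):
    """Compute x = u * G^(Kronecker power n) over GF(2), with G = [[1, 0], [1, 1]].

    Illustrative sketch: the codeword is built as the concatenation of two
    half-length polar encodings, as implied by the block structure of G.
    """
    length = len(u)
    if length == 0 or length & (length - 1):
        raise ValueError("input length must be a power of two")
    if length == 1:
        return list(u)
    half = length // 2
    # Left half encodes (u_left XOR u_right), right half encodes u_right.
    left = [a ^ b for a, b in zip(u[:half], u[half:])]
    right = list(u[half:])
    return polar_encode(left) + polar_encode(right)
```

Since $G^{\otimes n}$ is its own inverse over GF(2), applying the function twice returns the original input vector, which gives a simple sanity check.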
In the seminal work on polar codes [5], the SC decoder is proposed. The SC-based decoding process can be represented as a binary tree search, in which the tree is explored depth first, with priority given to the left branches. Fig. 1 shows an example of SC decoding tree for $\mathcal{P}(16, 8)$, where nodes at stage $s$ contain $2^s$ bits. White leaf nodes are frozen bits, while black leaf nodes are information bits. Fig. 2 portrays the message passing among SC tree nodes. Parents pass LLR values $\alpha$ to children, which send in return the hard bit estimates $\beta$. The left and right branch messages $\alpha^l$ and $\alpha^r$, in the hardware-friendly version of [14], are computed as

$$\alpha^l_i = \text{sgn}(\alpha_i)\,\text{sgn}(\alpha_{i+2^{s-1}}) \min(|\alpha_i|, |\alpha_{i+2^{s-1}}|), \tag{2}$$

$$\alpha^r_i = \alpha_{i+2^{s-1}} + (1 - 2\beta^l_i)\,\alpha_i, \tag{3}$$

while $\beta$ is computed as

$$\beta_i = \begin{cases} \beta^l_i \oplus \beta^r_i, & \text{if } i < 2^{s-1}, \\ \beta^r_{i-2^{s-1}}, & \text{otherwise}, \end{cases} \tag{4}$$

where $\oplus$ denotes the bitwise XOR. The SC operations are scheduled according to the following order: each node receives $\alpha$ first, then sends $\alpha^l$, receives $\beta^l$, sends $\alpha^r$, receives $\beta^r$, and finally sends $\beta$. When a leaf node is reached, $\beta_i$ is set as the estimated bit $\hat{u}_i$:

$$\hat{u}_i = \begin{cases} 0, & \text{when } i \in \mathcal{F} \text{ or } \alpha_i \geq 0, \\ 1, & \text{otherwise}, \end{cases} \tag{5}$$

where $\mathcal{F}$ is the set of frozen bits. The SC decoding process requires full tree exploration: however, in [8], [15] it has been shown that it is possible to prune the tree by identifying patterns in the sequence of frozen and information bits, achieving substantial speed increments. This improved SC decoding is called fast simplified SC (Fast-SSC).
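The SC schedule described above can be sketched as a short recursive routine. This is an illustrative sketch only, not the decoder used in this work: the names `f_min_sum`, `g_update` and `sc_decode` are hypothetical, no tree pruning is applied, and an LLR of zero is assumed to map to bit 0.

```python
def f_min_sum(a, b):
    """Left-branch message: sign product times the smaller magnitude."""
    sign = -1.0 if (a < 0) != (b < 0) else 1.0
    return sign * min(abs(a), abs(b))


def g_update(a, b, beta_left):
    """Right-branch message, using the hard estimate returned by the left child."""
    return b + (1 - 2 * beta_left) * a


def sc_decode(alpha, frozen):
    """Decode one subtree depth first, left branch before right branch.

    alpha  : LLRs received from the parent (length is a power of two)
    frozen : booleans, True where the corresponding leaf is a frozen bit
    Returns (u_hat, beta): leaf estimates and hard bits sent to the parent.
    """
    if len(alpha) == 1:
        bit = 0 if frozen[0] or alpha[0] >= 0 else 1
        return [bit], [bit]
    half = len(alpha) // 2
    alpha_l = [f_min_sum(alpha[i], alpha[i + half]) for i in range(half)]
    u_l, beta_l = sc_decode(alpha_l, frozen[:half])
    alpha_r = [g_update(alpha[i], alpha[i + half], beta_l[i]) for i in range(half)]
    u_r, beta_r = sc_decode(alpha_r, frozen[half:])
    # Combine the children's hard bits into the message for the parent.
    beta = [bl ^ br for bl, br in zip(beta_l, beta_r)] + beta_r
    return u_l + u_r, beta
```

For a noiseless length-4 example with the first two bits frozen, the LLR vector `[2.0, -2.0, 2.0, -2.0]` corresponds to the codeword `[0, 1, 0, 1]`, and the routine recovers the input vector `[0, 0, 1, 1]`.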
SC decoding suffers from modest error-correction performance with moderate and short code lengths. To improve it, the SCL algorithm was proposed in [7]. It is based on the same process as SC, but each time a bit is estimated at a leaf node, both its possible values 0 and 1 are considered. A set of $L$ codeword candidates is stored, so that a bit estimation results in $2L$ new candidates, half of which must be discarded. To this purpose, a path metric (PM) is associated with each candidate and updated at every new estimate: the $L$ paths with the lowest PM survive. In the LLR-based SCL proposed in [16], the hardware-friendly formulation of the PM is

$$\text{PM}_{i_l} = \begin{cases} \text{PM}_{i-1_l}, & \text{if } \hat{u}_{i_l} = \frac{1}{2}\left(1 - \text{sgn}(\alpha_{i_l})\right), \\ \text{PM}_{i-1_l} + |\alpha_{i_l}|, & \text{otherwise}, \end{cases} \tag{6}$$

where $l$ is the path index and $\hat{u}_{i_l}$ is the estimate of bit $i$ at path $l$. As with SC decoding, SCL tree pruning techniques relying on the identification of frozen-information bit patterns have been proposed in [9], [11], called simplified SCL (SSCL) and Fast-SSCL.
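The path split and the PM-based pruning can be sketched as follows. The helper names `path_metric` and `split_and_prune` are hypothetical. The sketch assumes the usual LLR-based rule in which a decision that disagrees with the sign of the leaf LLR is penalized by the LLR magnitude, and it keeps the `list_size` paths with the lowest PM.

```python
def path_metric(pm_prev, alpha, bit):
    """Penalize a bit decision that disagrees with the sign of the leaf LLR."""
    hard_decision = 0 if alpha >= 0 else 1
    return pm_prev if bit == hard_decision else pm_prev + abs(alpha)


def split_and_prune(paths, alphas, list_size):
    """Extend every path with both bit values and keep the lowest-PM survivors.

    paths  : list of (pm, bits) tuples, one per active path
    alphas : leaf LLR seen by each path, in the same order as paths
    """
    extended = []
    for (pm, bits), alpha in zip(paths, alphas):
        for bit in (0, 1):
            extended.append((path_metric(pm, alpha, bit), bits + [bit]))
    extended.sort(key=lambda entry: entry[0])
    return extended[:list_size]
```

Starting from a single empty path with metric 0 and calling the function once per information bit reproduces the doubling and halving of the candidate list described in the text.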
B. Blind Detection
The physical downlink control channel (PDCCH) is used in 3GPP LTE/LTE-Advanced to transmit the downlink control information (DCI) related to the downlink shared channel. The DCI carries information regarding the channel resource allocation, transport format and hybrid automatic repeat request, and allows the UE to receive, demodulate and decode.
A cyclic redundancy check (CRC) is attached to the DCI payload before transmission. The CRC is masked according to an ID, like the radio network temporary identifier (RNTI), of the UE to which the transmission is directed, or according to one of the system-wide IDs. Finally, the DCI is encoded with a convolutional code. The UE is not aware of the format with which the DCI has been transmitted: it thus has to explore a combination of PDCCH locations, PDCCH formats, and DCI formats in the common search space (CSS) and UE-specific search space (UESSS) and attempt decoding to identify useful DCIs. This process is called blind decoding, or blind detection. For each PDCCH candidate in the search space, the UE performs channel decoding, and demasks the CRC with its ID. If no error is found in the CRC, the DCI is considered as carrying the UE control information.
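The CRC masking and demasking step can be illustrated with a small sketch. It assumes a 16-bit CRC with generator polynomial $x^{16} + x^{12} + x^5 + 1$, a zero initial register, and a mask applied as a bitwise XOR between the CRC and the ID. The function names are hypothetical, and the sketch omits channel coding entirely.

```python
CRC16_POLY = 0x1021  # x^16 + x^12 + x^5 + 1 (assumed generator polynomial)


def crc16(bits):
    """Bitwise CRC over a list of 0/1 values, MSB first, zero initial state."""
    reg = 0
    for bit in bits:
        feedback = ((reg >> 15) & 1) ^ bit
        reg = (reg << 1) & 0xFFFF
        if feedback:
            reg ^= CRC16_POLY
    return reg


def to_bits(value, width=16):
    """Expand an integer into a list of bits, MSB first."""
    return [(value >> (width - 1 - i)) & 1 for i in range(width)]


def attach_masked_crc(payload_bits, ue_id):
    """Append the CRC of the payload, masked (XORed) with the 16-bit ID."""
    return list(payload_bits) + to_bits(crc16(payload_bits) ^ ue_id)


def is_addressed_to(received_bits, ue_id):
    """Demask the received CRC with the ID and compare with the recomputed CRC."""
    payload, tail = received_bits[:-16], received_bits[-16:]
    received_crc = int("".join(str(b) for b in tail), 2)
    return (received_crc ^ ue_id) == crc16(payload)
```

With an error-free payload, the check succeeds only for the ID used as the mask, which is why a UE can use it to recognize its own DCI among the candidates.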
Based on LTE standard R8 [1], the performance specifications for the blind detection process are the following:
• The DCI of PDCCH is from 8 to 57 bits plus 16-bit CRC, masked by 16-bit ID.
• In UESSS, a maximum of 2 DCI formats can be sent per transmission time interval (TTI) for 2 potential frame lengths. Therefore, 16 candidate locations in UESSS → 32 candidates.
• In CSS, a maximum of 2 DCI formats can be sent per TTI for 2 potential frame lengths. Therefore, 6 candidate locations in CSS → 12 candidates.
• Code length could be between 72 and 576 bits.
• Information length (including 16-bit CRC) could be between 24 and 73 bits.
• Target signal-to-noise ratio (SNR) is dependent on the targeted block error rate (BLER): $10^{-2}$.
• There are two types of false-alarm scenarios: Type-1, when the UE ID is not transmitted but detected, and Type-2, when the UE ID is transmitted but another one is detected. The target false-alarm rate (FAR) is below $1.52 \times 10^{-5}$.
• Missed detection occurs when the UE ID is transmitted but not detected. The missed detection rate (MDR) should be close to the BLER curve.
• The available time frame for blind detection is 16 µs.
III. BLIND DETECTION SCHEME
In [12], polar codes have been considered within a blind detection framework, and a blind detection scheme has been proposed. Some frozen bit positions are used to transmit the RNTI instead. Fig. 3 shows the block diagram of the devised blind detection scheme. $C_1$ candidates are received at the same time: in this case, $C_1 = 44$, i.e., the 32 UESSS candidates plus the 12 CSS candidates. The $C_1$ candidates are decoded with the simple SC algorithm, and a PM is obtained for each candidate, equivalent to the LLR of the last decoded bit: thanks to the serial nature of SC decoding, the LLR of the last bit can be interpreted as a reliability measure on the decoding process. The PMs are then sorted, to help the selection of the best candidates to forward to the following decoding phase. $C_2$ candidates are in fact selected to be decoded with the more powerful SCL decoding algorithm, which guarantees a better error-correction performance at a higher implementation complexity. The $C_2$ candidates are chosen as follows:
1) All candidates whose ID, after the first phase, matches the one assigned to the UE. If more than $C_2$ are present, the ones with the highest PMs are selected.
2) If free slots among the $C_2$ remain, the candidates with the smallest PMs are selected.
The candidates with large PMs have a higher probability of being correctly decoded: if their ID does not match the one assigned to the UE, it is probably a different one. On the other hand, candidates with small PMs have a higher chance of being incorrectly decoded, and a transmission to the UE might be hiding among them. After the SCL decoding phase, if one of the $C_2$ candidates matches the UE ID, it is selected, otherwise no selection is attempted.
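The two selection rules can be sketched in a few lines. The function name `select_second_phase` and the tuple layout are hypothetical. As in the description of the SC-based first phase, the PM is treated here as a reliability value, so a larger PM means a more reliable decoding.

```python
def select_second_phase(candidates, ue_id, c2):
    """Pick the indices of the candidates forwarded to the second phase.

    candidates : list of (index, decoded_id, pm) tuples from the first phase
    ue_id      : ID assigned to the UE
    c2         : number of second-phase slots
    """
    # Rule 1: candidates with a matching ID, most reliable first.
    matching = sorted((c for c in candidates if c[1] == ue_id),
                      key=lambda c: -c[2])
    chosen = matching[:c2]
    # Rule 2: fill the free slots with the least reliable remaining candidates.
    if len(chosen) < c2:
        others = sorted((c for c in candidates if c[1] != ue_id),
                        key=lambda c: c[2])
        chosen += others[:c2 - len(chosen)]
    return [c[0] for c in chosen]
```

When no candidate matches the UE ID after the first phase, the function simply returns the `c2` least reliable candidates, which are the ones most likely to hide an incorrectly decoded transmission.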
In [12], an early stopping criterion has been proposed as well, to reduce the latency and energy expenditure of the second phase of the blind detection scheme. The first phase requires the full decoding of each candidate, to identify the $C_2$ codewords that will be sent to the second phase. In the second phase, however, all codewords whose ID does not match the UE ID will be discarded. Thus, as soon as the ID is shown to be different, the decoding can be interrupted. Since SC-based decoding algorithms estimate codeword bits sequentially, the ID evaluation can be performed every time an ID bit is estimated. In case the estimated bit is different from the UE ID bit, the decoding is stopped.
Three methods have been described in [12] to choose the bits assigned to the ID:
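A minimal sketch of the early stopping check for a single decoding path is given below. The name `bits_before_stop` and the input format are hypothetical: the decoder output is modeled as a sequence of (position, bit) pairs in decoding order, rather than being produced by an actual SC or SCL decoder.

```python
def bits_before_stop(estimated_bits, id_positions, ue_id_bits):
    """Count how many bits are estimated before an ID mismatch stops decoding.

    estimated_bits : list of (position, bit) pairs in decoding order
    id_positions   : positions that carry ID bits
    ue_id_bits     : expected UE ID bit for each entry of id_positions
    Returns (number_of_estimated_bits, completed_flag).
    """
    expected = dict(zip(id_positions, ue_id_bits))
    for count, (position, bit) in enumerate(estimated_bits, start=1):
        if position in expected and bit != expected[position]:
            return count, False  # mismatch: decoding is interrupted here
    return len(estimated_bits), True
```

Dividing the returned count by the total number of non-frozen bits gives the kind of estimated-bit percentage used to quantify the latency saving.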
• ID mode 1: the ID bits are the 16 most reliable bits after the K information bits.
• ID mode 2: the ID bits are the 16 most reliable bits, while the K information bits are the most reliable bits after the 16 ID bits.
• ID mode 3: considering the order with which bits are decoded in SC-based algorithms, the ID bits are the first 16 to be decoded among the K + 16 most reliable bits.
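The three ID modes can be compared with a short sketch. The function `assign_id_bits` is hypothetical. It assumes that a reliability ordering of the bit-channels is already available (most reliable first) and that SC-based decoders estimate bits in increasing index order.

```python
def assign_id_bits(reliability_order, k, mode, n_id=16):
    """Split the k + n_id most reliable positions into information and ID bits.

    reliability_order : bit-channel indices, most reliable first
    Returns (info_positions, id_positions), both sorted by index.
    """
    best = list(reliability_order[:k + n_id])
    if mode == 1:
        # K most reliable for information, the next n_id for the ID.
        info, ident = best[:k], best[k:]
    elif mode == 2:
        # n_id most reliable for the ID, the next K for information.
        ident, info = best[:n_id], best[n_id:]
    elif mode == 3:
        # ID on the positions that are decoded first among the k + n_id best.
        in_decoding_order = sorted(best)
        ident, info = in_decoding_order[:n_id], in_decoding_order[n_id:]
    else:
        raise ValueError("mode must be 1, 2 or 3")
    return sorted(info), sorted(ident)
```

For a toy ordering `[7, 3, 6, 5, 4, 2, 1, 0]` with two information bits and two ID bits, mode 1 places the ID on positions 5 and 6, mode 2 on positions 3 and 7, and mode 3 on positions 3 and 5.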
The three techniques yield negligible differences in terms of error-correction performance, while ID mode 3 yields considerable advantages over mode 1 and mode 2 when early stopping is applied. In fact, since the ID bits are decoded earlier, the average percentage of estimated bits decreases, and the reduction in average latency is more substantial. In this work, we generalize the blind detection scheme proposed in [12] by considering SCL also for the first decoding phase. In particular, we consider a list size $L_1 \geq 1$ for the first decoding phase, and a list size $L_{max} > L_1$ for the second decoding phase. It should be noted that when $L_1 = 1$, the blind detection scheme reverts to that of [12].
A. Simulation Results
To evaluate the effectiveness of the proposed blind detection scheme, simulations were performed. The BLER, MDR, and FAR have been measured on the additive white Gaussian noise (AWGN) channel, with binary phase-shift keying (BPSK) modulation, for varying code parameters. We focused on polar codes with block lengths $N \in \{256, 512\}$, since in [12] it has been shown that they constitute the most critical cases in terms of speed. Four information lengths $K \in \{8, 16, 32, 57\}$ have been considered, while the number of ID bits has been set to 16. The 3GPP standardization committee has decided that information bits in polar codes must be assigned to the $K$ most reliable bit-channels [17]: thus, the ID bits have been assigned according to ID mode 1. The ID values assigned to the $C_1$ candidates are randomly selected over 16 bits. While different numbers of candidates passed to the second phase have been considered in [12], we have focused here on $C_2 = 5$, for which a good tradeoff between accuracy and latency is found. At the same time, we set $L_{max} = 8$ and $L_1 = 2$: it is a representative case for which $L_{max}$ guarantees good error-correction performance, and at which SCL decoders can be implemented with reasonable complexity. Fig. 4 plots the BLER curves for all the considered code lengths and rates. As expected, their error-correction performance improves as the code length increases and the code rate decreases. In Fig. 5, the first of the metrics specific to the blind detection problem, the MDR, is depicted. The MDR can be defined as the number of missed detections divided by the number of transmissions in which the UE ID was sent. The curves in Fig. 5 have been obtained considering $C_1/2$ candidates of length $N_1 = 256$ and $C_1/2$ candidates of length $N_2 = 512$ in each transmission, with $K_1 = K_2$ information bits. Together with the MDR, Fig. 5 portrays the BLER curves relative to the aggregate transmissions.
It can be seen that the MDR curve is always lower than the relative BLER curve.
The FAR curves for the considered case study are portrayed in Fig. 6. The system target FAR is equivalent to the FAR obtained with a 16-bit CRC: in 5G, a CRC at least 16 bits long is foreseen. Here, we evaluate the additional contribution that the proposed blind detection scheme can bring in lowering the FAR on top of the CRC. It can be seen that the FAR is kept below the $10^{-4}$ threshold at SNR values for which the BLER is still very high, and decreases as the channel conditions improve. In the blind detection method presented in [13], the FAR increases as the MDR decreases. On the other hand, the proposed scheme allows both to be decreased at the same time, thus avoiding performance limitations that could make it unappealing for 5G standard applications.
The impact of the devised early stopping criterion on the average number of estimated bits is shown in Fig. 7, for $K = 32$ and $K = 57$. These results consider each of the $C_2$ candidates separately, since the number of candidates of length $N_1$ and $N_2$ in the second phase depends on the PMs received from the first phase, and thus on channel SNR. The solid curves have been obtained for cases in which the UE ID was sent through the considered code, while the dashed curves refer to cases in which it was not sent through the code.
• For $N = 256$ (curves with a circle marker), it is possible to observe the same behavior noted in [12] for $N = 128$ as well. In case the UE ID was sent, as the channel conditions improve, the number of estimated bits increases until stabilizing at a maximum average value. This phenomenon can be explained by the fact that when the SNR is low, it is more likely that the codeword carrying the UE ID is not selected to be among the $C_2$ candidates. Thus the decoders in the second phase easily encounter ID bits different from the UE ID early in the decoding process. As the channel conditions improve, the codeword with the UE ID falls among the $C_2$ candidates with rising probability. Consequently, the decoder tasked with its decoding does not interrupt the process, reaching 100% estimated bits, while the remaining $C_2 - 1$ decoders stop the decoding early, thus averaging the estimated bit percentage at a stable value (67% for $K = 32$ and 61% for $K = 57$). The dashed curves show instead a stable value regardless of channel conditions: since among the $C_2$ candidates there is never one carrying the UE ID, all second phase decoders tend to stop the decoding early, at a percentage independent of the SNR, and mostly influenced by the position of bits assigned to the ID.
• For $N = 512$ (curves with a cross marker), a similar behavior to the $N = 256$ case can be observed when the UE ID is not sent, with the average number of estimated bits stable at all the considered SNR values. On the other hand, when the UE ID is sent, the trend is different: at low SNR values, the percentage of estimated bits is very close to 100%. As the SNR value increases, the average starts to decrease, until it settles on a stable value. This behavior is due to the fact that at low SNR, it is very unlikely that a codeword with $N = 512$ is among the $C_2$ second phase candidates if the UE ID is not matching: the longer code length and lower rate contribute to a higher decoding reliability during the first phase, which allows unlikely candidates to be screened out better than in the $N = 256$ case.
IV. HARDWARE ARCHITECTURE
To evaluate the implementation cost of the devised blind detection scheme, we designed a decoder architecture that supports it, portrayed in Fig. 8. An array of flexible list size SCL decoders handles both the first and second decoding phase. A dedicated module selects the $C_2$ candidates for the second phase according to the criteria described in Section III.
A. Flexible list size SCL decoder
We based our SCL decoder architecture on that of [11], [18]: the decoding process follows the one described in Section II-A for a list size $L_{max}$. Most of the datapath and memories are instantiated $L_{max}$ times: multiple candidates are stored at the same time, with the best candidate being selected at the end of the decoding. While in [11], [18] the final candidate is selected according to a CRC check, in the proposed architecture no CRC is considered, and the validity of the final candidate is based on the matching ID and PM value.
The SC decoding tree is descended by computing (2) and (3) at each stage $s$, with priority being given to left branches. These calculations are performed by $L_{max}$ parallel sets of $P$ processing elements (PEs), with $P$ being a power of 2. In the stages for which $2^s > 2P$, the operations in (2) and (3) are performed over $2^s/(2P)$ steps, while a single step is needed otherwise. Internal memories store the updated LLR values between stages.
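The number of steps spent at each stage follows directly from the rule above. The sketch below uses the hypothetical helper `steps_for_stage` and assumes, as stated in the text, that $P$ is a power of 2.

```python
def steps_for_stage(stage, num_pes):
    """Steps needed to compute the LLR updates of one node at a given stage.

    A node at stage s holds 2**s LLRs and needs 2**s / 2 PE operations;
    with num_pes PEs available, large nodes are processed over several steps.
    """
    node_size = 1 << stage
    if node_size > 2 * num_pes:
        return node_size // (2 * num_pes)
    return 1
```

As an example, with 64 PEs a stage-9 node of a length-512 code takes 4 steps, while every node at stage 7 or below takes a single step.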
PEs get two LLR values as input, and concurrently compute both $\alpha^l$ and $\alpha^r$ according to (2) and (3), respectively. The correct output is selected depending on the index of the leaf node to be estimated. When a leaf node is reached, the decoder controller module identifies the leaf node as either an information bit or a frozen bit. If a frozen bit is found, the paths are not split, the bit is estimated only as 0, and the $L$ memories are updated with the same bit or LLR values. Instead, in case of an information bit, both 0 and 1 are considered, so that paths are split, and the PMs are updated for the $2L$ candidates according to (6). Afterwards, the PMs are sorted, identifying the $L$ surviving paths.
All memories in the decoder are registers, enabling the internal LLR and β values to be read, updated by the PEs, and written back in a single clock cycle. At the same time, the paths are either updated or split and updated, and the new PMs computed. In the following clock cycle, in case the paths were split, the PMs are sorted and the surviving paths selected.
Codes with different code lengths can be decoded by storing the appropriate memory offsets for every considered code in a dedicated memory.
This baseline decoder has been modified to better fit the needs of the proposed blind detection scheme. In order to maximize resource sharing, the SCL decoder has been sized for $L_{max} > L_1$, and the effective list size can be selected through a dedicated input. The $L_{max} - L_1$ paths that are not used in the first decoding phase are used to decode up to $(L_{max} - L_1)/L_1$ additional candidates at the same time. In order to exploit the unused paths, additional functional modules are necessary.
• The baseline decoder uses a single memory to store the channel LLR values, sharing it among the different paths. If different codewords have to be decoded at the same time, the channel memory needs to be instantiated not once, but $L_{max}/L_1$ times.
• The decoder relies on sorting and selection logic that identifies the $L_{max}$ surviving paths after paths are split. To support the parallel decoding of $L_{max}/L_1$ candidates, as many sorting and selection modules targeting the selection of $L_1$ paths out of $2L_1$ are instantiated.
If $L_1 = 1$ is selected, the path splitting and PM sorting steps are bypassed, reverting decoders to the standard SC case. Since a single set of SCL decoders can handle both decoding phases, the total number of decoders is $N_{SCLmax}$ (see Fig. 8). However, the effective number of decoders for the first decoding phase is

$$N_{SCL1} = \frac{L_{max}}{L_1} \times N_{SCLmax}.$$
The early stopping technique described in Section III has also been implemented. The decoder receives as input the position of the ID bits and the value of the UE ID: every time a bit in an ID position is estimated, the bit value is compared to the expected UE ID bit. All paths whose estimated bit does not match the UE ID bit are deactivated. This operation is performed after the $L$ surviving paths have been selected, in order not to force the survival of unlikely paths and increase the FAR. In case all paths have been deactivated, the decoding is stopped. The early stopping logic can be activated and deactivated by means of a dedicated control signal. Since the same hardware is used for both decoding phases, early stopping is enabled only during the second one.
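The path deactivation step can be sketched as a pure function applied after the surviving paths have been selected. The name `apply_early_stopping` and the flag-based representation are hypothetical and only model the control logic, not the hardware implementation.

```python
def apply_early_stopping(active, estimated_bits, expected_bit):
    """Deactivate surviving paths whose ID bit differs from the UE ID bit.

    active         : one boolean per surviving path (False if already deactivated)
    estimated_bits : bit estimated by each surviving path at the ID position
    expected_bit   : UE ID bit expected at that position
    Returns (updated_active_flags, stop_decoding).
    """
    updated = [alive and bit == expected_bit
               for alive, bit in zip(active, estimated_bits)]
    # Decoding stops only when no active path is left.
    return updated, not any(updated)
```

Because the check runs on the survivors only, a path that matches the UE ID but has a poor PM is not kept alive artificially, in line with the motivation given in the text.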
B. PM sorting and candidate selection
Fig. 9 depicts the architecture of the PM sorting and candidate selection block. It processes the output of the first decoding phase to select the $C_2$ candidates for the second phase, and selects the overall system output based on the results from the second phase. For each of the $N_{SCL1}$ first phase decoders, a PM and a flag signalling a UE ID match are received. They are stored every time the respective Valid signal is raised by the decoder. The Valid signal is also used as an enable for the PM and UE ID match register address counter, and for the counter keeping track of how many codewords had a matching UE ID after the first phase. When all the $C_1$ candidates have gone through the first decoding phase, a Valid signal is issued to the sorter module, which receives as input all the stored PMs. The sorter module returns the $C_2$ minimum PMs in as many clock cycles: each PM is compared to all the others, and a single clock cycle is necessary to identify the minimum one, which is excluded from the subsequent comparison. When the $C_2$ minima have been found, the selector module considers how many candidates had a matching UE ID after the first phase, and selects the $C_2$ candidates for the second phase among them and those with the minimum PM values. The $C_2$ candidates are sent to the $N_{SCLmax}$ decoders by means of a dedicated counter. Returning PMs and UE ID match flags are received and compared by another selector: when all $C_2$ candidates have been decoded, the selected codeword, if any, is output.
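The behavior of the sorter module can be modeled in software as a repeated minimum search, with one loop iteration standing in for one clock cycle. The function name `c2_minima` is hypothetical, and the all-to-all comparison performed in hardware is replaced here by a sequential scan.

```python
def c2_minima(pms, c2):
    """Return the indices of the c2 smallest PMs, one per modeled clock cycle.

    Each iteration finds the minimum among the PMs not yet selected and
    excludes it from the following comparisons.
    """
    remaining = list(range(len(pms)))
    picked = []
    for _ in range(min(c2, len(pms))):
        best = min(remaining, key=lambda idx: pms[idx])
        picked.append(best)
        remaining.remove(best)
    return picked
```

The number of iterations equals $C_2$ regardless of $C_1$, which mirrors the statement that the sorter returns the $C_2$ minima in as many clock cycles.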
V. IMPLEMENTATION RESULTS
The architecture proposed in Section IV has been described in VHDL and synthesized in TSMC 65 nm CMOS technology. Table I reports the synthesis results for the architecture sized for a maximum code length $N_{max} = 512$, a maximum list size $L_{max} = 8$, $C_2 = 5$, and a target frequency $f = 1$ GHz. Various $N_{SCLmax}$ values have been considered, leading to different latencies and area occupations. Since during the first decoding phase $L_1 = 2$, the effective number of decoders $N_{SCL1}$ is equal to $4N_{SCLmax}$, even if only $N_{SCLmax}$ are physically instantiated. Regarding the area, the $N_{SCLmax}$ SCL decoders contribute to the majority of the complexity, ranging from 97.8% when $N_{SCLmax} = 1$ to 99.7% when $N_{SCLmax} = 5$. The logic complexity of the PM sorting and candidate selection module remains almost unchanged as $N_{SCLmax}$ varies, being mainly affected by $C_1$ and $C_2$. Memories have been synthesized with registers only, without the use of RAM, and account for 36% of the total area occupation.
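The relation between physically instantiated decoders and effective first-phase decoders can be checked with a one-line helper. The function name is hypothetical, and the sketch assumes that $L_{max}$ is an integer multiple of $L_1$.

```python
def effective_first_phase_decoders(l_max, l_1, n_scl_max):
    """Effective number of first-phase decoders when each L_max-sized
    decoder is split into L_max / L_1 independent list-L_1 decoders."""
    if l_1 < 1 or l_max % l_1:
        raise ValueError("l_max must be a positive multiple of l_1")
    return (l_max // l_1) * n_scl_max
```

With $L_{max} = 8$ and $L_1 = 2$, one physical decoder acts as 4 first-phase decoders, and five physical decoders act as 20.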
The worst case latency of the proposed blind detection system can be found as
where $T^1_{SCL}$ and $T^2_{SCL}$ are the SCL decoding latencies for codes of length $N_1$ and $N_2$, respectively, while $T_{sort}$ is the number of time steps required to sort the PMs of the first decoding phase and obtain the $C_2$ candidates out of the $C_1$ candidate locations. Also, it is worth remembering that for the proposed architecture, $N_{SCL1} = (L_{max}/L_1) \times N_{SCLmax}$. The SCL decoding latency can be found as [16]
for $x \in \{1, 2\}$. From the results presented in Table I, it is possible to see that even when considering the relatively old 65 nm technology node, the 16 µs worst case latency target can be reached with a single SCL decoder running at a frequency of 1 GHz, while $N_{SCLmax} = 5$ guarantees a worst case latency of 3.6 µs, meeting the 4 µs target as well.
Fig. 9: PM sorting and candidate selection architecture.
However, considering only the worst case latency is an unrealistic scenario. To begin with, while there is no guarantee on how the $C_2$ candidates are distributed among $N_1$ and $N_2$, simulation results have shown that we can expect the $C_2$ candidates either to favor the shorter code length, or to be equally divided between $N_1$ and $N_2$ candidates. Thus, the factor in (7) that represents the contribution of the second decoding phase could be better expressed as:
Note that this is still a conservative assumption, since it entails the $C_2$ candidates being equally divided between the two code lengths. We can refine this assumption by taking into account the effect of early stopping. We can approximate the latency reduction with a multiplicative factor $E_x$ applied to $T^x_{SCL}$, as expressed in (9). Considering the number of UEs connected to the shared channel, blind detection is dominated by instances in which a particular UE ID is not sent. Thus, we can set $E_x$ as the fraction of bits expressed by the dashed curves in Fig. 7. The average latency results in Table I show a substantial reduction with respect to the worst case latency, within a more realistic framework. Even within the 65 nm technology node, with $N_{SCLmax} \geq 4$, the average latency is below 4 µs. With the latest technology nodes, a substantially higher frequency will be easy to achieve, along with proportionally smaller area occupation. It is consequently safe to assume that the 4 µs worst case latency target can be easily met for $N_{SCLmax} \geq 3$, and the average latency target with $N_{SCLmax} \geq 2$.
VI. CONCLUSION
In this work, we propose a polar codes blind detection scheme. The candidates go through a first, coarser decoding phase, that helps to select a few of them for a second, finer decoding phase. An early stopping criterion is proposed for the second phase, to reduce average latency. We evaluate the effectiveness of the blind detection scheme, and propose an architecture to implement it. It is based on an SCL decoder with tunable list size, that can be used for both decoding stages. The architecture is synthesized and implementation results are reported for various system parameters. The reported area occupation and latency, obtained in 65 nm CMOS technology, are able to meet 5G requirements, and are guaranteed to meet them with even less resource usage in the latest technology nodes.
