Massive Machine Type Communication Pilot-Hopping Sequence Detection
  Architectures Based on Non-Negative Least Squares for Grant-Free Random
  Access by Sarband, Narges Mohammadi et al.
SUBMITTED TO IEEE OPEN JOURNAL OF CIRCUITS AND SYSTEMS 1
Massive Machine Type Communication Pilot-Hopping Sequence
Detection Architectures Based on Non-Negative Least Squares for
Grant-Free Random Access
Narges Mohammadi Sarband, Student Member, IEEE, Ema Becirovic, Student Member, IEEE, Mattias Krysander,
Erik G. Larsson, Fellow, IEEE, and Oscar Gustafsson, Senior Member, IEEE
User activity detection in grant-free random access massive machine type communication (mMTC) using pilot-hopping sequences
can be formulated as solving a non-negative least squares (NNLS) problem. In this work, two architectures using different algorithms
to solve the NNLS problem are proposed. The algorithms are implemented using a fully parallel approach and fixed-point arithmetic,
leading to high detection rates and low power consumption. The first algorithm, fast projected gradients, converges faster to the
optimal value. The second algorithm, multiplicative updates, is partially implemented in the logarithmic domain, and provides a
smaller chip area and lower power consumption. For a detection rate of about one million detections per second, the chip area for
the fast algorithm is about 0.7 mm² compared to about 0.5 mm² for the multiplicative algorithm when implemented in a 28 nm
FD-SOI standard cell process at 1 V power supply voltage. The energy consumption is about 300 nJ/detection for the fast projected
gradient algorithm using 256 iterations, leading to a convergence close to the theoretical. With 128 iterations, about 250 nJ/detection
is required, with a detection performance on par with 192 iterations of the multiplicative algorithm for which about 100 nJ/detection
is required.
Index Terms—5G mobile communication, Base stations, Internet of Things, Machine-to-machine communications, MIMO.
I. INTRODUCTION
MASSIVE machine type communication (mMTC) is a core use case in 5G and later generations of wireless
communication systems, where a large number of users are
expected to be served [1]–[5]. These users are expected to
intermittently send small amounts of data, primarily in the
uplink, i.e., from the users to the base station. As the mMTC
users send a limited amount of data, the impact of overhead signaling
such as random access or scheduling requests is large.
Therefore, a grant-free random access scheme is beneficial for
these types of users.
Grant-free random access with massive MIMO, i.e., base
stations with a large number of antennas, has been widely stud-
ied in many papers [6]–[15]. If the number of users exceeds
the dimension (in samples) of the channel coherence interval,
the users cannot be allocated mutually orthogonal pilots, which
leads to pilot collisions. There are two fundamentally different
approaches for solving this problem. In the first, pilots are
drawn at random and allowed to be non-orthogonal [7], [8],
[10], [11], [13]–[15]. In the other, the pilots are orthogonal, but
the data transmission is spread over many coherence intervals
[6], [12], i.e., pilot-hopping sequences are detected instead of
pilots.
The scenario for detecting pilot-hopping sequences for
grant-free random access was introduced in [6]. However,
in [6] perfect detection of the pilot-hopping sequences was
assumed. In [12], it was suggested that the non-negative least-
squares (NNLS) is an appropriate problem formulation for the
pilot-hopping sequence detection problem.
Manuscript received September 7, 2020. This work was supported in part
by the ELLIIT strategic research environment.
The authors are with the Department of Electrical Engineering,
Linköping University, SE–581 83 Linköping, Sweden, e-mails:
{narges.mohammadi.sarband, ema.becirovic, mattias.krysander, erik.g.larsson,
oscar.gustafsson}@liu.se.
User 3: φ1 D3(1) | φ2 D3(2) | φ1 D3(3) | φ1 D3(4)
User 2: φ1 D2(1) | φ1 D2(2) | φ2 D2(3) | φ2 D2(4)
User 1: φ2 D1(1) | φ1 D1(2) | φ1 D1(3) | φ2 D1(4)
Fig. 1. Three users transmitting pilot-hopping sequences of length four using
two pilots, φ1 and φ2. Additionally, in coherence interval t the user k also
transmits uplink data Dk(t).
The active users transmit one pilot per coherence interval
(the time-frequency interval in which the channel can be as-
sumed to be constant), according to a predetermined sequence.
The sequences are unique for every user but the pilots might
collide in the coherence interval. The pilots in each coherence
interval are accompanied by an uplink data part. An example
of this scheme can be found in Fig. 1, with three users, two
pilots and a pilot-hopping sequence of length four.
In this paper, two architectures for solving this problem
using the approach in [12] are proposed. To the best of the
authors’ knowledge, the only other implementation of a similar
detector is [16]. However, this is based on the other principle:
having non-orthogonal pilot sequences drawn from random
sources, using the algorithm in [10].
A preliminary version of this work was presented in [17].
In addition to a more detailed presentation and more detailed
results, an alternative NNLS algorithm is also considered
here. The new algorithm is shown to require smaller chip
area and less energy, although it has slightly worse detection
performance for hard channel conditions.
In the next section, the detection algorithm is reviewed,
algorithms for solving NNLS problems discussed, and the
general proposed architectures introduced. Then, in Section III,
implementation results and decisions are presented for a con-
sidered scenario. Finally, some concluding remarks are given
in Section IV.
arXiv:2009.02089v1 [eess.SP] 4 Sep 2020
[Figure 2 labels: K users (active and inactive), user k with channel g_k, base station with M antennas.]
Fig. 2. System model.
II. METHODS AND PROCEDURES
A. Algorithm to Detect Pilot-Hopping Sequence
We consider a single-cell massive MIMO system where
the base station is equipped with M antennas and serves
K mMTC users, each with a single antenna, see Fig. 2.
We assume that only a subset of these users is active at
once. We consider a block fading channel model where the
channel of user k in coherence interval t is denoted by
$g_k^t \in \mathbb{C}^M$. The users transmit pilot-hopping sequences that are
T coherence intervals long. There are $\tau_p$ pilots, which are of length $\tau_p$,
mutually orthogonal, and have unit norm. At the pilot phase
of coherence interval t, the base station receives
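$$Y^t = \sum_{k=1}^{K} \sum_{j=1}^{\tau_p} \alpha_k S^t_{j,k} \sqrt{\tau_p p_k}\, g^t_k \phi_j^H + N^t, \tag{1}$$
where
$$\alpha_k = \begin{cases} 1, & \text{if user } k \text{ is active}, \\ 0, & \text{otherwise}, \end{cases} \tag{2}$$
$$S^t_{j,k} = \begin{cases} 1, & \text{if user } k \text{ sends pilot } j \text{ at pilot phase } t, \\ 0, & \text{otherwise}, \end{cases} \tag{3}$$
$p_k$ is the transmit power of user $k$, $\phi_j \in \mathbb{C}^{\tau_p}$ is the $j$:th pilot,
and $N^t \in \mathbb{C}^{M \times \tau_p}$ is noise with i.i.d. $\mathcal{CN}(0, \sigma^2)$ elements.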
The model uses the assumption that the users start their pilot-
hopping sequences at the same coherence interval.
In each coherence interval the base station computes an
estimate of the received signal energy over each pilot as
$$E_{i,t} = \frac{(Y^t \phi_i)^H (Y^t \phi_i)}{M} - \sigma^2 = \frac{\|Y^t \phi_i\|^2}{M} - \sigma^2. \tag{4}$$
With block Rayleigh fading, $g_k \sim \mathcal{CN}(0, \beta_k I_M)$, we have the
channel hardening, $\|g_k\|^2 / M \to \beta_k$ as $M \to \infty$, and favorable
propagation, $g_k^H g_{k'} / M \to 0$ as $M \to \infty$, $k \neq k'$, properties [18]
of massive MIMO. Using these properties, we can state the
“asymptotic energies” as
$$E_{i,t} = \frac{\|Y^t \phi_i\|^2}{M} - \sigma^2 \to \sum_{k=1}^{K} \alpha_k S^t_{i,k} \tau_p p_k \beta_k \quad \text{as } M \to \infty. \tag{5}$$
From this limit, we can see that the asymptotic energies are
linear in the user activity parameters. We assume that the users
are using statistical channel inversion power control [19],
$$p_k = p \frac{\min_{k'} \beta_{k'}}{\beta_k} = p \frac{\beta_{\min}}{\beta_k}, \tag{6}$$
where $\beta_{\min} = \min_{k'} \beta_{k'}$ is the large-scale fading of the
weakest user and $p$ is a system-wide power parameter. With
statistical channel inversion, the received signal power at the
base station is, on average, the same for every user.
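The effect of (6) on (5) can be made explicit with one substitution. The step below is a sketch that follows from the two equations as stated; it is not written out in the original text.

```latex
p_k \beta_k = p \frac{\beta_{\min}}{\beta_k} \beta_k = p \beta_{\min}
\quad\Longrightarrow\quad
E_{i,t} \to \tau_p p \beta_{\min} \sum_{k=1}^{K} \alpha_k S^t_{i,k}
\quad \text{as } M \to \infty.
```

Dividing the asymptotic energies by $\tau_p p \beta_{\min}$ therefore leaves, for each pilot and coherence interval, the number of active users that sent that pilot.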
We use the asymptotic energies and the fact that the users
perform statistical channel inversion to form the vector
$$b = \frac{1}{\tau_p p \beta_{\min}} \begin{pmatrix} E_{1,1} \\ \vdots \\ E_{\tau_p,1} \\ \vdots \\ E_{i,t} \\ \vdots \\ E_{1,T} \\ \vdots \\ E_{\tau_p,T} \end{pmatrix} \tag{7}$$
and the matrix
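$$A = \begin{pmatrix}
S^1_{1,1} & \dots & S^1_{1,K} \\
\vdots & \ddots & \vdots \\
S^1_{\tau_p,1} & \dots & S^1_{\tau_p,K} \\
\vdots & \ddots & \vdots \\
S^T_{1,1} & \dots & S^T_{1,K} \\
\vdots & \ddots & \vdots \\
S^T_{\tau_p,1} & \dots & S^T_{\tau_p,K}
\end{pmatrix}. \tag{8}$$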
With this vector and matrix, we can state the asymptotic energy
equation as
$$Ax \to b \quad \text{as } M \to \infty, \tag{9}$$
where
$$x = \begin{pmatrix} \alpha_1 & \dots & \alpha_k & \dots & \alpha_K \end{pmatrix}^T.$$
We use the fact that not all the users are active at the
same time, i.e., that x is sparse, to solve the pilot-hopping
sequence detection problem. In [12], it is shown that NNLS
algorithms produce sparse solutions that solve this problem.
In the following, these algorithms are described in detail.
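As a minimal sketch of how the matrix in (8) can be assembled, the Python snippet below builds A for the three-user example of Fig. 1 and evaluates the noise-free limit b = Ax from (9). The helper name `build_A`, the 0-based pilot indices, and the choice of users 1 and 3 as active are illustrative assumptions, not part of the original design.

```python
def build_A(seqs, tau_p):
    """Build the 0/1 matrix of (8) as a list of rows.

    seqs[k][t] is the (0-based) pilot index that user k sends in
    coherence interval t. Rows are ordered as in (8): all pilots of
    interval 0 first, then all pilots of interval 1, and so on.
    """
    K = len(seqs)
    T = len(seqs[0])
    A = [[0] * K for _ in range(tau_p * T)]
    for k, seq in enumerate(seqs):
        for t, j in enumerate(seq):
            A[t * tau_p + j][k] = 1
    return A

# Pilot-hopping sequences of Fig. 1 (phi_1 -> 0, phi_2 -> 1).
seqs = [
    [1, 0, 0, 1],  # user 1: phi_2 phi_1 phi_1 phi_2
    [0, 0, 1, 1],  # user 2: phi_1 phi_1 phi_2 phi_2
    [0, 1, 0, 0],  # user 3: phi_1 phi_2 phi_1 phi_1
]
A = build_A(seqs, tau_p=2)

# Assumed activity pattern: users 1 and 3 active, user 2 inactive.
x = [1, 0, 1]

# Noise-free asymptotic energies: each entry counts the active users
# that sent a given pilot in a given coherence interval.
b = [sum(a * xi for a, xi in zip(row, x)) for row in A]
print(b)
```

Every column of A contains exactly T ones, one per coherence interval, which is why the later adder counts depend only on K, T, and the number of pilots.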
B. Algorithms for Solving NNLS
The non-negative least squares (NNLS) problem can be
stated as
$$\begin{aligned} \text{minimize} \quad & \|Ax - b\|^2 \\ \text{subject to} \quad & x \geq 0, \end{aligned} \tag{10}$$
where the design matrix, A, and the observation vector, b, are
given. The solution vector, x, minimizes the error of Ax−b
in the least squares sense subject to all elements in x being
non-negative. Over the years, many different algorithms have
been proposed to solve NNLS problems [20].
Probably the most well-known approach is the active set
method by Lawson and Hanson [21]. However, an active
set based algorithm relies heavily on branching, making it
unsuitable for a parallel high-speed implementation since parts
of the execution will be sequential in nature. Besides, the
execution is data dependent, making it non-deterministic.
Instead, we here use two algorithms with a much higher
degree of computational parallelism. As they are iterative,
there are still limitations, but, as seen later, a very high degree
of parallelism is still obtained. The execution in each iteration
is deterministic, simplifying the design of any control.
1) Fast Projected Gradient for Solving NNLS – Fast
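The fast projected gradient algorithm [22] is described as
$$x^{(0)} = \mathbf{0}, \quad p^{(0)} = \mathbf{0} \tag{11}$$
$$x^{(k+1)} = \max\left\{ \mathbf{0},\, p^{(k)} - \frac{1}{L} A^T \left( A p^{(k)} - b \right) \right\} \tag{12}$$
$$p^{(k+1)} = x^{(k+1)} + s^{(k)} \left( x^{(k+1)} - x^{(k)} \right), \tag{13}$$
where
$$c^{(k)} = \frac{1 + \sqrt{1 + 4 \left( c^{(k-1)} \right)^2}}{2}, \quad s^{(k)} = \frac{c^{(k-1)} - 1}{c^{(k)}}, \tag{14}$$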
with $c^{(-1)} = 0$, and $\mathbf{0}$ denotes a vector with all zeros. $L$ is
the Lipschitz constant, which controls the step size in each
iteration. It should be larger than or equal to the spectral radius
of $A^T A$, i.e., the largest absolute eigenvalue. For a given $A$,
this can be computed off-line. Replacing $p^{(k)}$ with $x^{(k)}$ in
(12) and not using (13) and (14) gives the standard projected
gradient algorithm for NNLS. The fast part comes from (13),
where the previous iteration value is used to accelerate the
convergence. $s^{(k)}$ is the step length of the fast part and $p^{(k)}$
is denoted the predictor [22].
Note here that the core of the algorithm, the matrix-vector
multiplications, will be simple additions since the elements of
A are 0 or 1 as seen from (3) and (8).
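The iteration (11)–(14) can be sketched in a few lines of plain Python. This is a floating-point illustration only, not the fixed-point architecture; the small matrix is the Fig. 1 example with users 1 and 3 assumed active, and L is chosen here as the product of the largest row sum and the largest column sum of A, which is a conservative upper bound on the largest eigenvalue of $A^T A$ rather than the exact spectral radius. The detection threshold of 0.5 is likewise an assumption.

```python
import math

def fast_projected_gradient(A, b, L, iterations):
    """Iterate (11)-(14) for a 0/1 matrix A given as a list of rows."""
    rows, cols = len(A), len(A[0])
    x = [0.0] * cols   # x(0) = 0
    p = [0.0] * cols   # p(0) = 0
    c_prev = 0.0       # c(-1) = 0
    for _ in range(iterations):
        # Residual A p - b: with 0/1 entries every product is a plain sum.
        r = [sum(p[k] for k in range(cols) if A[i][k]) - b[i]
             for i in range(rows)]
        # Gradient A^T r, again additions only.
        g = [sum(r[i] for i in range(rows) if A[i][k]) for k in range(cols)]
        # Gradient step followed by projection onto x >= 0, see (12).
        x_new = [max(0.0, p[k] - g[k] / L) for k in range(cols)]
        # Step length of the fast part, see (14).
        c = (1.0 + math.sqrt(1.0 + 4.0 * c_prev ** 2)) / 2.0
        s = (c_prev - 1.0) / c
        # Predictor update, see (13).
        p = [x_new[k] + s * (x_new[k] - x[k]) for k in range(cols)]
        x, c_prev = x_new, c
    return x

# Fig. 1 example: rows are (interval, pilot) pairs, columns are users.
A = [[0, 1, 1], [1, 0, 0], [1, 1, 0], [0, 0, 1],
     [1, 0, 1], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
b = [1, 1, 1, 1, 2, 0, 1, 1]   # noise-free energies, users 1 and 3 active

# Assumed step-size constant: ||A||_1 * ||A||_inf bounds the largest
# eigenvalue of A^T A from above.
L = max(sum(row) for row in A) * max(sum(col) for col in zip(*A))

x_hat = fast_projected_gradient(A, b, L, iterations=300)
detected = [v > 0.5 for v in x_hat]
print(x_hat, detected)
```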
2) Multiplicative Updates for Solving NNLS – Mult
In the multiplicative updates algorithm, the updates in
each iteration are instead performed using multiplications and
divisions, and are described as [23]–[25]:
$$x^{(0)} = \mathbf{1} \tag{15}$$
$$d = A^T b \tag{16}$$
$$e^{(k+1)} = A^T A x^{(k)} \tag{17}$$
$$x^{(k+1)} = d \odot x^{(k)} \oslash e^{(k+1)}, \tag{18}$$
where $\mathbf{1}$ denotes a vector with all ones and $\odot$ and $\oslash$
represent component-wise (Hadamard) vector multiplication and
division, respectively.
Similar to the Fast algorithm, the matrix-vector multiplica-
tions consist of additions.
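A floating-point sketch of (15)–(18) is given below, using the same Fig. 1 example with users 1 and 3 assumed active. The function name and the 0.5 detection threshold are illustrative assumptions. Note how the entry of the inactive user decays only slowly toward zero, since its update factor approaches one.

```python
def multiplicative_updates(A, b, iterations):
    """Iterate (15)-(18) for a 0/1 matrix A given as a list of rows."""
    rows, cols = len(A), len(A[0])
    x = [1.0] * cols                                            # (15)
    # d = A^T b is constant over the iterations.                  (16)
    d = [sum(b[i] for i in range(rows) if A[i][k]) for k in range(cols)]
    for _ in range(iterations):
        # e = A^T (A x), computed as two passes of additions.     (17)
        Ax = [sum(x[k] for k in range(cols) if A[i][k])
              for i in range(rows)]
        e = [sum(Ax[i] for i in range(rows) if A[i][k])
             for k in range(cols)]
        # Component-wise multiply and divide.                      (18)
        # e[k] stays positive as long as x stays positive and every
        # column of A contains at least one 1.
        x = [d[k] * x[k] / e[k] for k in range(cols)]
    return x

# Fig. 1 example: rows are (interval, pilot) pairs, columns are users.
A = [[0, 1, 1], [1, 0, 0], [1, 1, 0], [0, 0, 1],
     [1, 0, 1], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
b = [1, 1, 1, 1, 2, 0, 1, 1]   # noise-free energies, users 1 and 3 active

x_hat = multiplicative_updates(A, b, iterations=200)
detected = [v > 0.5 for v in x_hat]
print(x_hat, detected)
```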
C. Proposed Architectures
The proposed architectures are isomorphic mappings of
a single iteration of the respective algorithms. The primary
reason to do this is to obtain a high degree of parallelism
and to easily benefit from the fact that A only consists of 0
and 1 entries, leading to additions only. Since the matrices $A$
and $A^T$ are constant in all iterations, these constant
matrix-vector multiplications can be converted to shift-and-add
structures using sub-expression sharing, which reduces the area
and power consumption of the implementation.
1) Fast
The resulting architecture is shown in Fig. 3. It should be
noted that most signals are vectors. There are registers to store
b, which is given at the start of a detection process, and p(k)
and x(k), which are initialized to zero at the start as seen in
(11).
Fig. 3. Proposed architecture of Fast algorithm with number of parallel signals
annotated.
To implement the multiplication by $s^{(k)}$, its value must be
available in each iteration. As visible from (14), calculating
$s^{(k)}$ is costly in hardware, but it is possible to compute the
$s^{(k)}$ values off-line and store them in a memory. Furthermore,
$s^{(k)}$ converges after some time, so storing all $s^{(k)}$ values
in the memory is not required.
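The off-line computation of the step lengths can be sketched as follows. The function name `step_lengths` is hypothetical, and the values are kept in floating point; the quantization actually used for the stored values is not modeled here.

```python
import math

def step_lengths(n):
    """Return [s(0), ..., s(n-1)] according to (14), with c(-1) = 0."""
    c_prev = 0.0
    table = []
    for _ in range(n):
        c = (1.0 + math.sqrt(1.0 + 4.0 * c_prev ** 2)) / 2.0
        table.append((c_prev - 1.0) / c)
        c_prev = c
    return table

s = step_lengths(256)
# With c(-1) = 0, (14) gives c(0) = 1, so s(0) = -1 and s(1) = 0.
# After that the values increase monotonically toward, but never reach, one.
print(s[:4], s[-1])
```

Because the sequence flattens out, a table only needs distinct entries for the first iterations; once the quantized value stops changing, a single constant can be reused.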
2) Mult
To avoid the relatively costly multiplications and divisions
of the Mult algorithm, a logarithmic number system (LNS)
[26] is used for those parts. This reduces the multiplications
and divisions to additions and subtractions, respectively, at
the expense of converting the numbers between the linear and
logarithmic domains.
The multiplicative updates algorithm, partly in the logarith-
mic domain, is described as:
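$$\hat{x}^{(0)} = \mathbf{0} \tag{19}$$
$$\hat{d} = \log\left( A^T b \right) \tag{20}$$
$$\hat{e}^{(k+1)} = \log\left( A^T A x^{(k)} \right) \tag{21}$$
$$\hat{x}^{(k+1)} = \hat{d} + \hat{x}^{(k)} - \hat{e}^{(k+1)} \tag{22}$$
$$x^{(k+1)} = \exp\left( \hat{x}^{(k+1)} \right). \tag{23}$$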
The resulting architecture is shown in Fig. 4. Note that $\hat{d}$ is
computed in the first clock cycle and that the first iteration is
performed in the second clock cycle. Hence, one more clock
cycle is required here compared to the Fast algorithm for the
same number of iterations.
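To illustrate how (19)–(23) replace the multiplication and division by an addition and a subtraction, the sketch below runs the partially logarithmic iteration with exact `math.log` and `math.exp` calls. These exact functions stand in for the table or Mitchell approximation and are an assumption of this example, as are the Fig. 1 test case and the 0.5 threshold.

```python
import math

def mult_log_domain(A, b, iterations):
    """Iterate (19)-(23): sums in the linear domain, update in the log domain."""
    rows, cols = len(A), len(A[0])
    x = [1.0] * cols
    x_log = [0.0] * cols                                        # (19)
    # d_hat = log(A^T b), computed once.                          (20)
    d_log = [math.log(sum(b[i] for i in range(rows) if A[i][k]))
             for k in range(cols)]
    for _ in range(iterations):
        # The matrix-vector products stay in the linear domain.
        Ax = [sum(x[k] for k in range(cols) if A[i][k])
              for i in range(rows)]
        e_log = [math.log(sum(Ax[i] for i in range(rows) if A[i][k]))
                 for k in range(cols)]                          # (21)
        # Multiplication and division become addition and subtraction.
        x_log = [d_log[k] + x_log[k] - e_log[k] for k in range(cols)]  # (22)
        x = [math.exp(v) for v in x_log]                        # (23)
    return x

# Fig. 1 example: rows are (interval, pilot) pairs, columns are users.
A = [[0, 1, 1], [1, 0, 0], [1, 1, 0], [0, 0, 1],
     [1, 0, 1], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
b = [1, 1, 1, 1, 2, 0, 1, 1]   # noise-free energies, users 1 and 3 active

x_hat = mult_log_domain(A, b, iterations=200)
detected = [v > 0.5 for v in x_hat]
print(x_hat, detected)
```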
To compute the logarithm and exponentiation, either a
lookup table or Mitchell’s logarithm approximation [27] is
considered. There are many methods suggested to either perform
approximations of the logarithm and exponent functions
using smaller tables or to improve the accuracy of Mitchell’s
algorithm. However, as seen in the results section, the accuracy
of the Mitchell approximation is sufficient.
Fig. 4. Proposed architecture of Mult algorithm with number of parallel
signals annotated.
The idea of Mitchell’s logarithm approximation is to utilize
the approximation
$$\log_2(1 + x) \approx x, \quad 0 \leq x < 1. \tag{24}$$
To benefit from this, the position of the leading one¹ is
determined and the input is shifted such that the leading one
is in a known position. The position of the first one also
determines the integer part of the logarithm. Then, the leading
one is removed and the fractional part of the logarithm is
obtained as per (24).
For computing the exponent, the opposite procedure is
performed. A one is put in front of the fractional part and the
final result is obtained by shifting the fractional result with the
leading one according to the number of integer bits.
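The two procedures can be sketched on plain integers as follows. The function names, the use of Python integers instead of the U(I, F) input format, and the `frac_bits` parameter are assumptions of this example; for a fixed-point input with F fractional bits, the integer part would additionally be offset by F.

```python
import math

def mitchell_log2(n, frac_bits):
    """Approximate log2(n) for an integer n > 0.

    The result is a fixed-point integer with frac_bits fractional bits.
    """
    k = n.bit_length() - 1        # position of the leading one = integer part
    mantissa = n - (1 << k)       # remove the leading one
    # Align the k bits below the leading one to frac_bits fractional bits.
    if k >= frac_bits:
        frac = mantissa >> (k - frac_bits)
    else:
        frac = mantissa << (frac_bits - k)
    return (k << frac_bits) | frac

def mitchell_exp2(y, frac_bits):
    """Approximate 2**(y / 2**frac_bits) for a fixed-point y >= 0."""
    k = y >> frac_bits                       # integer part
    frac = y & ((1 << frac_bits) - 1)        # fractional part
    m = (1 << frac_bits) | frac              # put a one in front of it
    # Shift according to the integer part.
    if k >= frac_bits:
        return m << (k - frac_bits)
    return m >> (frac_bits - k)

approx = mitchell_log2(12, 8) / 2 ** 8       # 12 = 2^3 * (1 + 0.5)
print(approx, math.log2(12))
print(mitchell_exp2(mitchell_log2(12, 8), 8))
```

The approximation never exceeds the true logarithm, and its error is largest roughly midway between two powers of two.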
3) A Note on Pipelining
It should be noted here that pipelining will not provide any
advantage as we need the result of one iteration before being
able to start the next one. If required, one may interleave
several detections by introducing pipelining, although that is
not considered here since it will increase the detection latency
and storage requirements. Unfolding may provide a small
increase of detection rate as unfolding N times will most likely
not decrease the clock frequency N times since the synthesis
tool probably can optimize some parts better. Naturally, fully
unfolding the architecture for the required number of iterations
will enable pipelining and streaming processing. However, the
obtained detection rate is most likely more than enough for
most practical current and envisioned scenarios.
III. RESULTS
For the implementation example, we consider a system with
parameters as shown in Table I. Synthesis results are for
28 nm FD-SOI standard cells with a power supply voltage
of 1.0 V, using Design Compiler. Power estimates for the
final designs are obtained by simulating the synthesized netlist
using SDF data to produce switching activities that are
back-annotated into the synthesis tool. Compared to the preliminary
version [17], here the low leakage library with a slow-slow
characterization at 125 °C is used.
TABLE I
PARAMETERS FOR THE IMPLEMENTED EXAMPLE.
Parameter                            Value
Number of users, K                   1024
Number of antennas, M                96
Number of orthogonal pilots, τp      16
Number of transmitted pilots, T      8
User activity                        2^−6
Fig. 5. Convergence of $\hat{x}$ using Fast (dashed) and Mult (solid) algorithms
decoding an activity detection estimation successfully.
¹ Here, the input is always non-negative, so the negative case is not
considered.
A. Number of Iterations
Determining the number of iterations required for robust
activity detection is important since this relates the clock
frequency of the implementation with the number of detections
per second. Here, we will consider floating-point implemen-
tation of the algorithms, while fixed-point implementation is
further discussed in Section III-B.
First, the behavior of the algorithms is illustrated. When
iterating the algorithms, the values in $\hat{x}$ corresponding to active
users will be close to one, while the values corresponding to
inactive users will be zero. A typical sequence of values for a
successful decoding is shown in Fig. 5, where the values of $\hat{x}$
are shown after each iteration. Here, one can see that the Mult
algorithm initially has a faster convergence for active users, but
that around iteration 50, the Fast algorithm has caught up and
reaches the final values slightly faster. One can see that with
about 100 iterations, it is possible to clearly separate the active
and inactive users.
Typically, one does not use a fixed-threshold value, but
studies the effect of the number of active users that are not
detected (missed detections) and number of inactive users that
are detected as active (false alarms) with a varying threshold.
The probabilities of missed detection, $p_m$, and false alarm, $p_{fa}$,
are calculated as
$$p_m = \frac{\#\text{undetected active users}}{\#\text{active users}} \tag{25}$$
$$p_{fa} = \frac{\#\text{detected inactive users}}{\#\text{inactive users}} \tag{26}$$
Fig. 6. Missed detection, pm, versus false alarm rates, pfa, for 4000 iterations
results of floating-point realizations of Fast and Mult algorithms.
(a) (b)
Fig. 7. Convergence of $\hat{x}$ using (a) Fast and (b) Mult algorithms where the
active and non-active users are not clearly separated.
and plotted against each other in a receiver operating charac-
teristic (ROC) curve. The ROC curves for the two considered
algorithms are shown in Fig. 6 when using 4000 iterations,
clearly more than in Fig. 5, and a large number of instances.
Here, it can be seen that the Fast algorithm has a slightly
better behavior compared to the Mult algorithm. Each point
on the curve corresponds to a threshold, and its position gives
the missed detection and false alarm rates obtained if that threshold
is used. Hence, the values in the top left corner correspond
to large thresholds and the values in the bottom right corner
correspond to small thresholds. For example, if a threshold
of, say, 0.9 is used, there will almost never be any false alarms,
but it is quite common that there are missed detections.
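A single point of such a ROC curve follows directly from (25) and (26). The sketch below uses a hypothetical helper `roc_point` and made-up estimate values purely for illustration; sweeping the threshold over a range of values and many problem instances would trace out the full curve.

```python
def roc_point(x_hat, active, threshold):
    """Return (p_m, p_fa) for one threshold, following (25) and (26)."""
    detected = [v > threshold for v in x_hat]
    n_active = sum(1 for a in active if a)
    n_inactive = len(active) - n_active
    missed = sum(1 for d, a in zip(detected, active) if a and not d)
    false_alarms = sum(1 for d, a in zip(detected, active) if d and not a)
    return missed / n_active, false_alarms / n_inactive

# Illustrative values only: two active and two inactive users.
x_hat = [0.95, 0.02, 0.6, 0.3]
active = [True, False, True, False]

high = roc_point(x_hat, active, threshold=0.9)   # large threshold
mid = roc_point(x_hat, active, threshold=0.5)
low = roc_point(x_hat, active, threshold=0.1)    # small threshold
print(high, mid, low)
```

In this toy case the large threshold gives missed detections but no false alarms, and the small threshold gives the opposite, matching the corners of the curve described above.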
Hence, one can realize that the case in Fig. 5 is not so
relevant to study further, as when the detection converges that
well, one of the missed detection and false alarm values will
always be zero, independent of the threshold.
Instead, consider a case with bad convergence, as shown in
Fig. 7. Here, the channel conditions and the selection of active
users interact in such a way that it is very hard to distinguish
the active and inactive users. In these cases, one can expect
that the curves of some active users are actually below the
curves of some inactive users and that there is no distinct
value that will separate the two groups.
To study this further, the same case as in Fig. 7 was solved
using 10000 iterations, with the results shown in Fig. 8. Here,
it can be seen that the Fast algorithm in Fig. 8(a) reaches a
more or less stable value in about 1000 iterations, while for the
Mult algorithm in Fig. 8(b) some variables have not reached
a stable value even after 10000 iterations.
Therefore, it can be argued that although Mult gives worse
results in Fig. 6, the degradation comes primarily from these
cases where the results will contain detection errors anyway.
Using even more iterations will make the curves in Fig. 6
identical. Hence, we will use the Fast algorithm curve in Fig. 6
as a reference curve for a converged solution.
Fig. 8. Convergence of $\hat{x}$ using (a) Fast and (b) Mult algorithms where the
active and non-active users are not clearly separated. Extended version of
Fig. 7 covering 10000 iterations.
Fig. 9. Missed detection, pm, versus false alarm rates, pfa, for floating-point
implementations, 128 iterations, 192 iterations and 256 iterations for (a) Fast
algorithm, and (b) Mult algorithm.
In Fig. 9, the missed detection, pm, and false alarm, pfa, rates
are shown for the floating-point implementations with 128,
192, and 256 iterations, for the Fast and Mult algorithms. As
seen, the results are better for the Fast algorithm, but we know
from previous discussions that the Mult algorithm converges
faster in the good cases, so this will also be further considered.
B. Word Length Optimization
A careful word length optimization has been performed to
determine the minimum resolution of the signals while still
being able to operate correctly. In addition, the maximum
values have been determined by a combination of interval
arithmetic and simulations. One example of the latter is the
output of the A block that theoretically may require seven
additional MSBs compared to the input as up to 87 p(k)-values
are summed for the selected A since the maximum number
of ones in a row of the selected A is 87. However, simulations
show that at most five additional MSBs are required in practice
as not all the p(k)-values will be close to one. Similarly, for
the Fast algorithm, the subtraction of x(k) from x(k+1) results
in a shorter output word length than the input word length. It
can be seen from Figs. 5 and 7 that the values are correlated
and do not change much between two iterations.
In the following, S and U denote signed and unsigned
representation, respectively. The numbers following S and U
are the number of integer and fractional bits, respectively. A
negative number of integer bits denotes that the corresponding
number of leading fractional bits is not required. As examples,
$S(2, 2)$ denotes a four-bit signed number with values between
$-\frac{2^{4-1}}{2^2} = -2$ and $\frac{2^{4-1}-1}{2^2} = 1.75$, and $U(-2, 4)$
denotes a two-bit unsigned number with values between 0
and $\frac{2^2-1}{2^4} = 0.1875$. In general, an $S(I, F)$ number goes from
$-\frac{2^{I+F-1}}{2^F} = -2^{I-1}$ to $\frac{2^{I+F-1}-1}{2^F} = 2^{I-1} - 2^{-F}$, and a $U(I, F)$
number goes from 0 to $\frac{2^{I+F}-1}{2^F} = 2^I - 2^{-F}$.
Fig. 10. Missed detection, pm, versus false alarm pfa, rates for converged
floating-point and fixed-point with 128, 192, and 256 iterations, for (a)
Fast algorithm, (b) Mult algorithm using lookup tables for logarithm and
exponent computations, and (c) Mult algorithm using Mitchell’s approximated
logarithm.
For the nodes where quantization is performed, i.e., the
number of fractional bits is reduced, rounding is used.
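The S(I, F) and U(I, F) ranges and the rounding step can be checked with a short sketch. The helper names are hypothetical, and rounding ties upward is an assumption made here, since the text does not state how ties are handled.

```python
import math

def signed_range(i_bits, f_bits):
    """Smallest and largest value of an S(I, F) number."""
    return -(2.0 ** (i_bits - 1)), 2.0 ** (i_bits - 1) - 2.0 ** (-f_bits)

def unsigned_range(i_bits, f_bits):
    """Smallest and largest value of a U(I, F) number."""
    return 0.0, 2.0 ** i_bits - 2.0 ** (-f_bits)

def quantize_round(value, f_bits):
    """Round to the nearest multiple of 2**-f_bits (ties rounded up, assumed)."""
    scale = 2 ** f_bits
    return math.floor(value * scale + 0.5) / scale

s22 = signed_range(2, 2)       # four-bit signed example from the text
u_24 = unsigned_range(-2, 4)   # two-bit unsigned example from the text
q = quantize_round(0.3, 2)     # keep two fractional bits
print(s22, u_24, q)
```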
The resulting ROC curves for the resulting word lengths,
further discussed below, are shown in Fig. 10. Here, one can
see that 256 and 192 iterations are required for the Fast and
Mult algorithms to reach close to the floating-point converged
performance, respectively. On the other hand, 128 iterations
are enough for the Fast algorithm to reach a performance
similar to the Mult algorithm.
1) Fast
The value of L for the considered A is 520.74. Strictly,
this should be rounded up to guarantee convergence under all
possible inputs, but simulations show that selecting L = 512
works well, resulting in a shift instead of a multiplication. The
values of $s^{(k)}$ are pre-computed and stored in a memory (all
values for $k \geq 58$ are $1 - 2^{-5}$). The values for $p^{(k)}$ closely
follow the values of $x^{(k)}$. As seen from Fig. 7(a), sometimes
the values for $x^{(k)}$, and therefore $p^{(k)}$, can take on values above
one. Hence, we saturate the values to at most one, using a
single guard bit. This operation is combined with the max
operation in (12) to the min(max(.))-operation in Fig. 11. It
can also be noted that (13) on rare occasions results in a small
negative value, which is confirmed by simulations. Hence,
min(max(.))-operations are performed before storing $p^{(k)}$ as
well, to keep the word length limited and enable storage of
unsigned values only.
The resulting architecture with annotated word lengths is
shown in Fig. 11. Rounding is here also performed for
the input to the matrix-vector multiplication by A, which
significantly reduces the complexity of the computation by
halving the input word length.
Fig. 11. Modified proposed architecture for the Fast algorithm with annotated
word lengths for the implemented example.
The implemented Fast algorithm can therefore be described
as
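$$x^{(0)} = \mathbf{0}, \quad p^{(0)} = \mathbf{0}$$
$$x^{(k+1)} = \min\left\{ m, \max\left\{ \mathbf{0},\, p^{(k)} - \frac{1}{L} A^T \left( A p^{(k)} - b \right) \right\} \right\}$$
$$p^{(k+1)} = \min\left\{ m, \max\left\{ \mathbf{0},\, x^{(k+1)} + s^{(k)} \left( x^{(k+1)} - x^{(k)} \right) \right\} \right\}$$
where $m$ is a vector of the largest positive value representable,
$1 - 2^{-12}$.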
2) Mult
As discussed earlier, the Mult architecture is based on
partially logarithmic computation. The results in Figs. 10(b)
and 10(c) show that the Mitchell logarithm approximation
gives the same activity detection performance as using a
lookup table.
As part of the word length optimization, the initial value
of $x^{(0)}$ was reduced from 1 to $2^{-7}$. This reduces the values
of $e^{(1)}$, allowing a reduced word length, without reducing the
detection performance. In addition, saturation was introduced
in order to handle potential overflows of $\hat{x}^{(k+1)}$.
This leads to the implemented Mult algorithm, partially
in the logarithmic domain, being described as
$$\hat{x}^{(0)} = s \tag{27}$$
$$\hat{d} = \log_2\left( A^T b \right) \tag{28}$$
$$\hat{e}^{(k+1)} = \log_2\left( A^T A x^{(k)} \right) \tag{29}$$
$$\hat{x}^{(k+1)} = \mathrm{sat}\left( \hat{d} + \hat{x}^{(k)} - \hat{e}^{(k+1)} \right) \tag{30}$$
$$x^{(k+1)} = 2^{\hat{x}^{(k+1)}} \tag{31}$$
where $s$ is a vector with the initial value, $\log_2\left( 2^{-7} \right) = -7$,
and sat corresponds to saturating the result in case of overflow.
Fig. 12. Modified proposed architecture for the Mult algorithm with annotated
word lengths for the implemented example.
The resulting architecture with annotated word lengths is
shown in Fig. 12.
The structure of the used logarithm approximation with base
2 is shown in Fig. 13. It consists of a leading one detector
(LOD) that outputs the position of the first one in the input
word, with the MSB corresponding to 0. As there in general
are I integer bits, the value must be offset by I−1 to determine
the correct integer part of the logarithmic value. The LOD
value is also used to left-shift the input such that the leading
one comes in the MSB position. This leading one is removed
and the remaining part is used as the fractional part of the
logarithm and combined with the integer part. Finally, as a
zero input value will output the most negative output value, a
simple saturation multiplexer is added.
To determine the best way to realize the configurable shift,
three different approaches were evaluated. First, it can be realized
using a single 11-to-1-multiplexer as shown in Fig. 14(a).
This approach is denoted 1-MUX. If realized using 2-to-1-multiplexers,
it requires ten 2-to-1-multiplexers. The second
approach, 2-MUX, is to use a 4-to-1-multiplexer and a 3-to-1-multiplexer
as shown in Fig. 14(b). This requires five 2-to-1-multiplexers.
Finally, the third approach, 4-MUX, illustrated
in Fig. 14(c), requires only four 2-to-1-multiplexers. For comparison,
a plain lookup table of the logarithm function was
also implemented, as well as a Mitchell approximation where
the shift and integer part generation is directly expressed in the
LOD, instead of the LOD returning the position explicitly. This
is denoted Combined and is included to give more freedom to
the synthesis tool.
Fig. 13. Realization of the Mitchell logarithm approximation with a U(I, F )
input format and zero input saturating to the most negative output value.
Fig. 14. Different implementations of configurable left-shift in Fig. 13:
(a) using one 11-to-1-multiplexer, 1-MUX, (b) using cascaded 4-to-1-multiplexer
and 3-to-1-multiplexer, 2-MUX, and (c) using four cascaded
2-to-1-multiplexers, 4-MUX.
The synthesis results for the different logarithm conver-
sion blocks are shown in Fig. 15, where it is clear that
the lookup table approach is significantly larger and more
power consuming than the Mitchell logarithm approach. In
addition, the maximum clock frequency is significantly higher
for the Mitchell-based approaches. It can also be seen that the
combined approach did not provide any advantages.
To distinguish between the different shift approaches, the
detailed results are shown in Fig. 16. Here, it is clear that the
approaches using a single 11-to-1-multiplexer, 1-MUX, or four
2-to-1-multiplexers, 4-MUX, are advantageous compared to the
2-MUX approach and the combined approach.
This is despite the 11-to-1-multiplexer corresponding to ten
2-to-1-multiplexers and illustrates the importance of synthesizing
the different parts rather than relying on high-level measures.
Fig. 15. (a) Area and (b) power results at different clock frequencies for
Mitchell’s logarithm approximation using different realizations of the left-shift
as well as a plain lookup table realization and a combined LOD, integer part
generation and shift realization.
Fig. 16. (a) Area and (b) power results at different clock frequencies for
Mitchell’s logarithm approximation using different realizations of the left-shift.
Finally, the 1-MUX and 4-MUX programmable left-shifters
were incorporated into the whole design. The results are shown
in Fig. 17 and it is clear that the 1-MUX approach has a
slightly lower total area and most of the time a slightly lower
power consumption. The power consumption behavior of 1-MUX
in Fig. 17(b) is hard to explain, but the lower area and
the small difference justify the selection. 4-MUX enables
a slightly higher clock frequency, but the detection rate is
already 1 million detections per second when operating at 192
MHz. Hence, the 1-MUX approach is used in the following.
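The relation between clock frequency and detection rate can be checked with a short calculation. It assumes one algorithm iteration per clock cycle, as implied by mapping a single iteration to hardware, plus the one extra cycle the Mult architecture needs for computing the logarithm of $A^T b$; the function name is hypothetical.

```python
def detections_per_second(clock_hz, iterations, extra_cycles=0):
    """Detection rate when each iteration takes one clock cycle."""
    return clock_hz / (iterations + extra_cycles)

# Mult architecture: 192 iterations plus one initial cycle at 192 MHz.
rate = detections_per_second(192e6, iterations=192, extra_cycles=1)
print(round(rate))
```

The result is within about half a percent of one million detections per second.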
Fig. 17. (a) Area and (b) power consumption at different clock frequencies
for the implementation of Mult algorithm, using Mitchell’s logarithm
approximation with 1-MUX and 4-MUX realizations of the left-shift.
TABLE II
NUMBER OF ADDERS FOR THE MATRIX-VECTOR MULTIPLICATIONS
WITHOUT AND WITH SUB-EXPRESSION SHARING IN THE IMPLEMENTED
EXAMPLE.
Matrix    Adders without sharing    Adders with sharing    Reduction
A         8064                      5699                   29.3%
A^T       7168                      4709                   34.3%
Total     15232                     10408                  31.7%
(a) (b)
Fig. 18. (a) Area and (b) power results at different clock frequencies for the
matrix-vector multiplication with A with and without using sub-expression
sharing.
C. Sub-Expression Sharing
As mentioned earlier, the number of adders for perform-
ing the matrix-vector multiplications by A and AT can be
reduced. Here, we use an iterative two-term sub-expression
sharing approach with the constraint that each addition should
be performed at the minimum depth to keep the power
consumption low. The results of this are shown in Table II. As
seen, a significant reduction is obtained compared to imple-
menting the additions straightforwardly. For comparison, the
other computations for the Fast approach are 3K+τpT = 3200
additions/subtractions and 2K multiplications, of which K
are reduced to a simple shift, so K = 1024 multiplications.
For the Mult approach, 2K = 2048 additions/subtractions,
K = 1024 logarithm approximations, and K = 1024 exponent
approximations are used.
To further clarify the consequences of sub-expression sharing,
the $A$ and $A^T$ matrix-vector multiplications have been
synthesized separately using the word lengths of the Fast
implementation, i.e., the word lengths in Fig. 11. The area
and power results are shown in Figs. 18 and 19 for $A$ and
$A^T$, respectively. It can be seen that although the reduction in
area is not exactly the ≈ 30% expected, being larger for $A$ and
smaller for $A^T$, sub-expression sharing provides a significant
area and power reduction.
1) Selecting A
The number of ones in A is always KT, so the number of
adders without sharing is independent of the pilot-hopping
sequences: KT − τ_p T for A and KT − K for A^T. However,
by selecting different pilot-hopping
sequences, different sub-expression sharing results can be
obtained. Therefore, one can try to optimize A to give fewer
adders, while still having good detection performance. One
may note that, qualitatively, detection performance improves
the more the pilot-hopping sequences differ, while the
sharing increases with similarities between the sequences.
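The adder counts can be checked numerically. The sketch below assumes that A has τ_p T rows and K columns, with one row per (time slot, pilot) pair and a one where a user sends that pilot in that slot; this layout is inferred from the counts above rather than taken from the implementation. The sequences are drawn at random and are not forced to be unique, and all names are hypothetical.

```python
import random


def random_pilot_hopping_matrix(K, T, tau_p, seed=0):
    """Binary (tau_p * T) x K matrix: user k picks one of tau_p pilots per slot."""
    rng = random.Random(seed)
    A = [[0] * K for _ in range(tau_p * T)]
    for k in range(K):
        for t in range(T):
            A[t * tau_p + rng.randrange(tau_p)][k] = 1
    return A


def adders_without_sharing(M):
    """Each non-empty row with n ones needs n - 1 two-input adders."""
    return sum(sum(row) - 1 for row in M if any(row))


def transpose(M):
    return [list(col) for col in zip(*M)]


if __name__ == "__main__":
    K, T, tau_p = 1024, 8, 16
    A = random_pilot_hopping_matrix(K, T, tau_p)
    print(sum(map(sum, A)))                      # number of ones, K*T
    print(adders_without_sharing(A))             # K*T - tau_p*T if no row is empty
    print(adders_without_sharing(transpose(A)))  # K*T - K
```

With K = 1024, T = 8, and τ_p = 16, the counts agree with the "without sharing" entries of Table II regardless of the seed, as long as every pilot is used by at least one user in every slot.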
Fig. 19. (a) Area and (b) power consumption at different clock frequencies
for the matrix-vector multiplication with A^T with and without using
sub-expression sharing.
Fig. 20. Missed detection, p_m, versus false alarm rates, p_fa, of the three
considered implementation examples and the converged floating-point results.
Similarly, one can see that decreasing the number of pilot
hopping sequences T will lead to fewer adders, assuming that
the number of users is constant. In [12], it was shown that detection
performance scales with τ_p T. Hence, to decrease the
implementation complexity with similar detection performance, one
may decrease T and increase τ_p. However, this leaves
less time available to transmit data.
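The trade-off can be tabulated directly from the closed-form counts KT − τ_p T and KT − K. The sketch below keeps the product τ_p T fixed at 128, the value of the implemented example, and varies T. The alternative (T, τ_p) pairs are hypothetical design points, not configurations evaluated in this work.

```python
def adders_without_sharing(K, T, tau_p):
    """Return (adders for A, adders for A^T) before sub-expression sharing."""
    ones = K * T                      # number of ones in A
    return ones - tau_p * T, ones - K


if __name__ == "__main__":
    K, product = 1024, 128            # product = tau_p * T is held constant
    for T in (16, 8, 4, 2):
        tau_p = product // T
        a, a_t = adders_without_sharing(K, T, tau_p)
        print(f"T={T:2d}  tau_p={tau_p:2d}  A: {a:5d}  A^T: {a_t:5d}  total: {a + a_t:5d}")
```

Since the number of ones is KT, the adder count before sharing falls roughly in proportion to T when τ_p T is held constant.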
D. Implementation Results
Based on the previous analyses, three different design cases
are considered. A design using the Fast algorithm and 256
iterations provides detection results close to the converged
floating-point results, as shown in Fig. 10(a). For the Mult
algorithm, 192 iterations are used since increasing the number
of iterations beyond that only gives a limited detection gain.
As discussed earlier, the Mult algorithm converges faster for
the good cases, but slower for the bad cases, so depending
on, e.g., channel conditions, the actual detection performance
may differ, and the different approaches may be beneficial
in different scenarios. Finally, the Fast algorithm with 128
iterations gives about the same detection performance as the
Mult algorithm with 192 iterations and is also included. The
design is identical to the 256-iteration Fast design; only the
number of iterations in the power simulations differs. The
corresponding ROC curves are shown in Fig. 20.
The designs were synthesized for increasing clock frequency
and power simulated. The results in terms of area and power
are shown in Fig. 21. It can be seen that the area and power
are significantly smaller for the Mult algorithm, although the
maximum clock frequency is slightly higher for the Fast
algorithm, 320 MHz, as compared to 300 MHz for the Mult
algorithm.²
Fig. 21. (a) Area and (b) power consumption at different clock frequencies
for the implementations of Fast and Mult algorithms.
Fig. 22. (a) Average power and (b) accumulated energy consumption over
blocks of 16 iterations for Fast and Mult approaches.
It may at first seem counterintuitive that the power
consumption for the Fast approach using 128 iterations is higher
than when using 256 iterations. However, it should be noted
that these results are plotted against clock frequency, so the approach
with 128 iterations performs twice as many detections per second
at a given clock frequency. Also, as illustrated in the following,
the switching activity, and therefore the power consumption, is
reduced when the results converge. Similarly, the Mult approach
using 192 iterations can perform more detections per second than
the Fast approach with 256 iterations, but fewer than the Fast
approach with 128 iterations.
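The relation between clock frequency, iteration count, and detection rate can be sketched as below. The model assumes one iteration per clock cycle and no idle cycles between detections; this is consistent with the one million detections per second quoted at 192 MHz for 192 iterations, but is otherwise an assumption. The function name is hypothetical.

```python
def detections_per_second(f_clk_hz: float, iterations: int) -> float:
    """Detection rate, assuming one iteration per clock cycle and no overhead."""
    return f_clk_hz / iterations


if __name__ == "__main__":
    print(detections_per_second(192e6, 192))  # the 192 MHz, 192-iteration case
    for name, iters in (("Fast, 128", 128), ("Mult, 192", 192), ("Fast, 256", 256)):
        rate = detections_per_second(250e6, iters)
        print(f"{name} iterations at 250 MHz: {rate:,.0f} detections/s")
```

At equal clock frequency, halving the number of iterations doubles the number of detections, so the power curves of the three cases in Fig. 21 correspond to different amounts of work per second.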
In the following, the 250 MHz clock frequency designs are
used. Considering the area curves in Fig. 21, this should be
close to the minimum area design, as synthesizing for a lower
clock rate will not decrease the area significantly.
To provide further insights here, the average power con-
sumption was analyzed in blocks of 16 iterations. The results
are shown in Fig. 22(a). As expected, the power consumption
is high in the first iterations and decreases as the results
converge.
Similarly, one can compute the energy required for one
detection instance. This is shown in Fig. 22(b), where the
16-iteration block averages are used as underlying data. Here,
it is clear that the total energy required for one detection is
significantly lower for the Mult algorithm compared to the
Fast algorithm. Even if 128 iterations are used for the Fast
algorithm, about 250 nJ is required, compared to about 100 nJ
for the Mult algorithm. Still, the Fast algorithm has better
detection performance when more iterations are performed.
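The accumulation behind a curve such as Fig. 22(b) can be sketched as follows: each 16-iteration block contributes its average power multiplied by the block duration. The sketch assumes one iteration per clock cycle at 250 MHz, and it does not use the measured power profile. Instead, it derives the average power implied by the reported energies and checks that a flat profile at that level reproduces them. All names are hypothetical.

```python
def accumulated_energy(block_power_w, f_clk_hz, iters_per_block=16):
    """Running energy in joules after each block of iterations."""
    block_time_s = iters_per_block / f_clk_hz  # one iteration per cycle assumed
    energy_j, curve = 0.0, []
    for p in block_power_w:
        energy_j += p * block_time_s
        curve.append(energy_j)
    return curve


def implied_average_power(energy_j, iterations, f_clk_hz):
    """Average power in watts that yields energy_j over the given iterations."""
    return energy_j * f_clk_hz / iterations


if __name__ == "__main__":
    f_clk = 250e6
    reported = (("Mult", 192, 100e-9), ("Fast", 128, 250e-9), ("Fast", 256, 300e-9))
    for name, iters, energy in reported:
        p_avg = implied_average_power(energy, iters, f_clk)
        total = accumulated_energy([p_avg] * (iters // 16), f_clk)[-1]
        print(f"{name}, {iters} iterations: {p_avg * 1e3:.0f} mW average, "
              f"{total * 1e9:.0f} nJ per detection")
```

Under these assumptions, the implied average power is higher for 128 than for 256 Fast iterations, in line with the observation that the power consumption decreases as the results converge.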
² The maximum clock frequency of the Mult design can be increased to
320 MHz by using 4-MUX with a small area increase, as shown in Fig. 17.
Finally, it should be pointed out that the current approaches
can perform about one million detections per second. This is
most likely more than current settings require. However, it is
possible that the future brings scenarios with higher detection
rate requirements. In the meantime, the excess speed can be
traded for lower power and energy consumption through power
supply voltage scaling.
E. Comparison with Previous Results
As discussed earlier, to the best of the authors’ knowledge,
the only previous implementation of an mMTC
activity detection algorithm is [16]. However, it targets the
other type of scenario, non-orthogonal sequences, instead of
the pilot-hopping scheme considered here. Also, it considers
the case where each user can send different sequences and
thereby transmit a few bits of information (four bits in [16]).
Hence, the comparison is not fully relevant as such. The
reported detection performance is better for [16]. However,
apart from the algorithm, this also depends on, e.g., SNR,
sequence length, number of users, and user activity.
The implementation in [16] can perform about 10000
detections per second, i.e., about a hundred times fewer than the
approaches proposed here. It occupies about
0.9 mm² chip area in the same standard cell process, memories
not included, compared to about 0.7 mm² and 0.4 mm² for
the Fast and Mult algorithms, respectively. Hence, the Mult
algorithm occupies less than half the chip area.
The energy consumption for one detection is about 150 µJ
for [16]. The Fast algorithm presented here requires about
250 nJ and 300 nJ for 128 and 256 iterations, respectively. The
Mult algorithm requires about 100 nJ. Hence, the proposed
approaches are significantly more energy efficient.
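The size of the gap can be made explicit with a short calculation from the figures quoted above. All inputs are "about" values, so the ratios are order-of-magnitude indications only, and the caveat that the two scenarios are not fully comparable still applies. The dictionary keys and function names are hypothetical labels.

```python
# Energy per detection in nJ and detection rates per second, as quoted in the text.
ENERGY_NJ = {"[16]": 150_000, "Fast-256": 300, "Fast-128": 250, "Mult-192": 100}
RATE_PER_S = {"[16]": 10_000, "proposed": 1_000_000}


def energy_ratio(design: str) -> float:
    """How many times less energy per detection than [16]."""
    return ENERGY_NJ["[16]"] / ENERGY_NJ[design]


def rate_ratio() -> float:
    """How many times more detections per second than [16]."""
    return RATE_PER_S["proposed"] / RATE_PER_S["[16]"]


if __name__ == "__main__":
    for design in ("Fast-256", "Fast-128", "Mult-192"):
        print(f"{design}: about {energy_ratio(design):.0f}x less energy per detection")
    print(f"detection rate: about {rate_ratio():.0f}x higher")
```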
A further advantage of the approach in [16], besides its better
reported detection performance, is that the pilot
sequences are easy to change. In the approaches proposed
in the current work, they are hard-coded into the structure
of the matrix-vector multiplications with A and A^T. While
this provides a very efficient implementation, it may have
other drawbacks. However, one may imagine that a system
comes with prepared user modules. This will, for example,
guarantee that all users have unique pilot-hopping sequences.
One can also consider synthesizing the proposed architectures
to FPGAs, in which case the sequences are readily configurable
through a re-synthesis of the A and A^T matrix-vector
multiplications.³
³ The spectral radius is about the same for all matrices with the same
parameters that have been tested.
IV. CONCLUSION
Two different architectures for activity detection in grant-free
random access massive machine type communication
based on pilot-hopping sequences have been proposed. They
both solve a non-negative least squares problem, but with
different algorithms. Both algorithms are iterative with
deterministic computations, enabling a high-speed implementation.
It is shown that the fixed-point implementation of the fast
projected gradient algorithm provides detection results close
to the converged floating-point realization. The multiplicative
updates algorithm is shown to converge faster for good cases,
but slower for bad cases. It is partially implemented in the
logarithmic domain to simplify multiplications and divisions.
Here, 192 iterations are selected, which gives a detection
performance similar to 128 iterations of the Fast algorithm.
However, the area and power/energy consumption of the
multiplicative updates algorithm are significantly lower. The
energy consumed for one detection is about 100, 250, and
300 nJ for the multiplicative updates algorithm with 192
iterations, the fast projected gradient algorithm with 128
iterations, and the fast projected gradient algorithm with 256
iterations, respectively, in a scenario with 1024 users and
eight pilots, each with a length of 16.
The detection rate is more than one million detections per
second, which is most likely more than contemporary
scenarios require. However, future scenarios will most likely require
an increased detection speed. It is also possible to reduce the
clock frequency and power supply voltage to benefit from a
lower energy consumption. Currently, the pilot-hopping
matrices are hard-coded in the structure, requiring that the terminals
send the correct sequences. When implementing the same
architecture on an FPGA, it is possible to change the
pilot-hopping matrices by reprogramming the FPGA. This is readily
achieved since code generators have been developed for the
architectures.
REFERENCES
[1] IMT Vision – Framework and overall objectives of the future develop-
ment of IMT for 2020 and beyond, ITU-R Std. M.2083-0, 2015.
[2] C. Bockelmann, N. K. Pratas, G. Wunder, S. Saur, M. Navarro, D. Gre-
goratti, G. Vivier, E. De Carvalho, Y. Ji, C. Stefanovic, P. Popovski,
Q. Wang, M. Schellmann, E. Kosmatos, P. Demestichas, M. Raceala-
Motoc, P. Jung, S. Stanczak, and A. Dekorsy, “Towards massive con-
nectivity support for scalable mMTC communications in 5G networks,”
IEEE Access, vol. 6, pp. 28 969–28 992, 2018.
[3] K. Mikhaylov, V. Petrov, R. Gupta, M. A. Lema, O. Galinina, S. An-
dreev, Y. Koucheryavy, M. Valkama, A. Pouttu, and M. Dohler, “Energy
efficiency of multi-radio massive machine-type communication (MR-
MMTC): Applications, challenges, and solutions,” IEEE Commun. Mag.,
vol. 57, no. 6, pp. 100–106, 2019.
[4] S. Dang, O. Amin, B. Shihada, and M.-S. Alouini, “What should 6G
be?” Nature Electron., vol. 3, no. 1, pp. 20–29, 2020.
[5] W. Saad, M. Bennis, and M. Chen, “A vision of 6G wireless systems:
Applications, trends, technologies, and open research problems,” IEEE
Network, vol. 34, no. 3, pp. 134–142, 2020.
[6] E. de Carvalho, E. Björnson, J. H. Sørensen, E. G. Larsson, and
P. Popovski, “Random pilot and data access in massive MIMO for
machine-type communications,” IEEE Trans. Wireless Commun., vol. 16,
no. 12, pp. 7703–7717, Dec. 2017.
[7] Z. Chen, F. Sohrabi, and W. Yu, “Sparse activity detection for massive
connectivity,” IEEE Trans. Signal Process., vol. 66, no. 7, pp. 1890–
1904, Apr. 2018.
[8] L. Liu and W. Yu, “Massive connectivity with massive MIMO–part I:
Device activity detection and channel estimation,” IEEE Trans. Signal
Process., vol. 66, no. 11, pp. 2933–2946, Jun. 2018.
[9] L. Liu, E. G. Larsson, W. Yu, P. Popovski, C. Stefanovic, and E. de Car-
valho, “Sparse signal processing for grant-free massive connectivity: A
future paradigm for random access protocols in the internet of things,”
IEEE Signal Process. Mag., vol. 35, no. 5, pp. 88–99, Sep. 2018.
[10] K. Şenel and E. G. Larsson, “Grant-free massive MTC-enabled massive
MIMO: A compressive sensing approach,” IEEE Trans. Commun.,
vol. 66, no. 12, pp. 6164–6175, Dec. 2018.
[11] S. Haghighatshoar, P. Jung, and G. Caire, “Improved scaling law
for activity detection in massive MIMO systems,” CoRR, vol.
abs/1803.02288, 2018. [Online]. Available: http://arxiv.org/abs/1803.
02288
[12] E. Becirovic, E. Björnson, and E. G. Larsson, “Detection of
pilot-hopping sequences for grant-free random access in massive MIMO
systems,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process.,
May 2019, pp. 8380–8384.
[13] Z. Chen, F. Sohrabi, Y.-F. Liu, and W. Yu, “Covariance based joint
activity and data detection for massive random access with massive
MIMO,” in IEEE Int. Conf. Commun. IEEE, 2019, pp. 1–6.
[14] X. Shao, X. Chen, and R. Jia, “Low-complexity design of massive device
detection via Riemannian pursuit,” in 2019 IEEE Global Commun. Conf.,
2019, pp. 1–6.
[15] J. Ahn, B. Shim, and K. B. Lee, “EP-based joint active user detection and
channel estimation for massive machine-type communications,” IEEE
Trans. Commun., vol. 67, no. 7, pp. 5178–5189, Jul. 2019.
[16] M. Tran, O. Gustafsson, P. Källström, K. Şenel, and E. G. Larsson, “An
architecture for grant-free massive MIMO MTC based on compressive
sensing,” in Proc. Asilomar Conf. Signals, Systems, Computers, Nov.
2019.
[17] N. M. Sarband, E. Becirovic, M. Krysander, E. G. Larsson, and
O. Gustafsson, “Pilot-hopping sequence detection architecture for grant-
free random access using massive MIMO,” in Proc. IEEE Int. Symp.
Circuits Systems, 2020.
[18] T. L. Marzetta, E. G. Larsson, H. Yang, and H. Q. Ngo, Fundamentals
of Massive MIMO. Cambridge University Press, 2016.
[19] E. Björnson, J. Hoydis, and L. Sanguinetti, “Massive MIMO networks:
Spectral, energy, and hardware efficiency,” Foundations and Trends®
in Signal Processing, vol. 11, no. 3–4, pp. 154–655, 2017.
[20] M. Slawski, “Problem-specific analysis of non-negative least squares
solvers with a focus on instances with sparse solutions (working
paper),” Mar. 2013. [Online]. Available: https://sites.google.com/site/
slawskimartin/nnlsalgorithms.pdf
[21] C. L. Lawson and R. J. Hanson, Solving Least Squares Problems.
SIAM, 1987.
[22] R. A. Polyak, “Projected gradient method for non-negative least square,”
Contemp. Math., vol. 636, pp. 167–179, 2015.
[23] M. E. Daube-Witherspoon and G. Muehllehner, “An iterative image
space reconstruction algorthm suitable for volume ECT,” IEEE Trans.
Med. Imag., vol. 5, no. 2, pp. 61–66, 1986.
[24] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix
factorization,” in Advances Neural Inf. Process. Syst., 2001, pp. 556–
562.
[25] F. Sha, Y. Lin, L. K. Saul, and D. D. Lee, “Multiplicative updates
for nonnegative quadratic programming,” Neural Computation, vol. 19,
no. 8, pp. 2004–2031, 2007.
[26] E. E. Swartzlander and A. G. Alexopoulos, “The sign/logarithm number
system,” IEEE Trans. Comput., vol. C-24, no. 12, pp. 1238–1242, 1975.
[27] J. N. Mitchell, “Computer multiplication and division using binary
logarithms,” IRE Trans. Electronic Comput., vol. EC-11, no. 4, pp. 512–
517, 1962.
