ABSTRACT Memory-augmented recurrent neural networks (M-RNNs) have proven empirically attractive for many applications, but a good theoretical understanding of their behavior is still lacking. In this paper, three analytical indicators, named duration, addressability, and capacity, are formalized for general forms of the additional memory in M-RNNs. An analysis of the interactions among these indicators reveals that it is hard for an M-RNN to simultaneously provide good performance on more than two of the three indicators. The duration, addressability, and capacity are then applied to analyze and compare two M-RNNs, long short term memory and the neural Turing machine, in different cases. The comparison shows that neither model performs better than the other on any single indicator in all cases. Moreover, it is found that separating a memory system into sub-memories increases the duration and addressability and decreases the capacity of the whole memory system.
INDEX TERMS Memory, recurrent neural networks, duration, addressability, capacity.
I. INTRODUCTION
Recurrent neural networks (RNNs) [1]-[3] are a rich family of neural networks with recurrent connections, as shown in the left part of figure 1. These recurrent connections give RNNs an internal state that can memorize context information, enabling them to capture temporal correlations in sequential data. However, the memory that is implicitly encoded by the hidden states and recurrent connections of RNNs is limited. This limitation can be described from two perspectives. One is that the duration of the memory is often short because of the problems of vanishing and exploding gradients [4], [5]. The other is the limited capacity and the high complexity of memory accessing, because the number of trainable parameters grows quadratically with the size of the memory [6].
In recent years, many research works have focused on extending the memory of RNNs by introducing an additional memory which can be read and written. In this study, we refer to this class of models as memory-augmented recurrent neural networks (M-RNNs). In figure 1, an RNN is shown on the left and an M-RNN on the right. Analogous to the recurrent connections in the hidden layer of an RNN, the additional memory block of an M-RNN is designed to store temporal information, but with longer duration and larger capacity. Prime examples of M-RNNs include long short term memory (LSTM) [7], RNNsearch [8], end-to-end memory networks (MemN2Ns) [9], neural Turing machines (NTMs) [10], dynamic memory networks (DMNs) [11], fast associative memory [12] and neural semantic encoders (NSEs) [13]. These M-RNNs have achieved success in many machine learning tasks, such as machine translation [8], [14], speech recognition [15], [16], image captioning [17], [18], question answering [11], [19], visual question answering [20], dialogue response generation [21], end-to-end dialogue [22] and action recognition [23].
In contrast to the empirical success of M-RNNs, theoretical analysis of M-RNNs is relatively lacking. To our knowledge, two properties have been analyzed in previous studies [6], [24], [25]. On the one hand, the size of the additional memory is often independent of the trainable parameters of the RNN [6]. This is equivalent to increasing the capacity of the memory in RNNs. On the other hand, once a piece of information is stored in the memory, it is copied from time step to time step and thus can stay there for a very long time [24]. This is equivalent to creating shortcut connections, as pointed out by [25]. As a result, it can reduce the effect of vanishing and exploding gradients.
These various extensions and qualitative studies imply the following three indicators which can be used for understanding and measuring M-RNNs.
• Duration. How long can the content of memory persist?
• Addressability. How complex is memory accessing?
• Capacity. How much information in the memory can be used over a period of time?
Motivated by the question ''can we design an M-RNN which simultaneously provides good performance on all three of the above indicators for its memory?'', this paper attempts a quantitative analysis of the three indicators of M-RNNs and discusses the interactions among them.
To address this, we first derive a general definition of M-RNNs based on LSTM and NTM. Then, we give the definitions of the three indicators of M-RNNs: duration, addressability and capacity (DAC). With careful derivation, we find that the three indicators can be characterized as functions of the size and updating frequency of the additional memory. This characterization offers a bridge for analyzing the interactions among these indicators. Finally, by checking these interactions, we arrive at a principle (called the DAC principle) which answers the above question. This principle reveals that it is hard for an M-RNN to simultaneously provide good performance on more than two of the three indicators.
In applications of DAC, we first derive minimum memory requirements for an M-RNN when learning long term dependencies (LTDs) in sequential data; then we compare LSTM with NTM on the three indicators. The comparison shows that NTM does not have better performance on duration (or capacity, or addressability) than LSTM in all cases, and vice versa. Additionally, we investigate the impact of separating a memory system into sub-memories.
The main contributions of this work are summarized as follows.
• We propose a general definition of M-RNNs which covers internal M-RNNs (e.g. LSTM) and external M-RNNs (e.g. NTM) as special cases.
• We propose three indicators (called duration, capacity and addressability), and use them to help understand and measure M-RNNs.
• We identify a principle (called the DAC principle) which reveals that it is hard for an M-RNN to simultaneously provide good performance on more than two of the three indicators.
• Furthermore, we prove that separating a memory system into sub-memories increases the duration and addressability and decreases the capacity of the whole memory system.
The rest of this paper is organized as follows. In section II, we introduce the vanilla RNN and its two variants: LSTM and NTM. In section III, we derive a generalized definition of M-RNNs which covers LSTM and NTM as special cases. In section IV, we propose definitions of the three indicators (duration, addressability, and capacity) and obtain the DAC principle by deriving the interactions among these indicators. In section V, we analyze the nature of LTDs and derive the minimum memory requirements for learning LTDs with an M-RNN; we also calculate the values of DAC for LSTM and NTM and compare them under different settings. In section VI, we investigate how the values of DAC change when a memory system is separated into a group of sub-memories. Finally, the last section gives the conclusion and future work.
II. RECURRENT NEURAL NETWORKS
In this section we describe the vanilla RNN and its two variants, i.e. LSTM and NTM.
A vanilla RNN is a function $R : X \times H \to Y \times H$, where $X$ and $Y$ are the input and output spaces, respectively, and $H$ is the hidden state space. On a variable-length input sequence $(x^{(1)}, \ldots, x^{(T)}) \in X^T$ and with an initial hidden state $h^{(0)} \in H$, the RNN transitions internally into states $(h^{(1)}, \ldots, h^{(T)})$ and outputs a sequence $(y^{(1)}, \ldots, y^{(T)})$ according to
where θ denotes the set of trainable parameters (i.e. weight matrices and biases).
Long short term memory (LSTM) [7] is a particular variant of the RNN hidden unit. It has two types of hidden variables, $c^{(t)}$ and $h^{(t)}$, where $c^{(t)}$ is the ''memory'' cell, which is designed to be maintained for a long time when necessary. The function of LSTM [26] is defined as follows:
where $\odot$ is the element-wise product; the output function is omitted; $i$, $f$, and $o$ are the input gates, forget gates and output gates, respectively. These gates are defined by
where $v \in \{i, f, o\}$ and σ is the logistic function. All the gates have the same size as the hidden vector $h$ and the memory cell $c$. The weight matrices $W_{cv}$ are diagonal.
The neural Turing machine (NTM) [10], or its latest version, the differentiable neural computer (DNC) [27], extends RNNs by introducing an external memory $M$. Like a digital computer, it has read/write heads for explicitly reading content from or writing content into $M$ dynamically. The function of NTM is defined by
where the update equation of $h$ is omitted; $E$ is an all-ones matrix; $e$ and $a$ are the parameters for writing $M$; $r$ is the read vector; $w$ is the address for reading from (or writing into) $M$. The address $w$ is calculated by an addressing function,
where k is the query key.
III. MEMORY AUGMENTED RECURRENT NEURAL NETWORKS
Vanilla RNNs governed by equation (1) and their two extensions, LSTM in equation (2) and NTM in equation (7), contain two types of memory: 1) internal memory, i.e. the hidden state vector $h^{(t-1)}$ or $c^{(t-1)}$, whose size depends on the number of trainable parameters of the RNN; 2) external memory $M^{(t)}$, whose size is independent of the number of trainable parameters of the RNN. This section derives a generalized definition for these extensions. For convenience, some notations are given in table 1.
Definition 1 (Memory-Augmented Recurrent Neural Networks (M-RNNs)): Let $M = \{m_i \in E \mid i = 1, 2, \ldots, N\}$ be a set of memory elements, where $E$ can be any space of vectors or matrices and $N$ is the size of the memory. Suppose there are a read function γ, a write function ω, and an addressing function $\alpha : E^N \times E \to \mathbb{R}^N$, such that, given a context (or query) key $k \in E$ produced by an RNN $R$ and a memory set $M$, we can get the address vector $w \in \mathbb{R}^N$ by $w \leftarrow \alpha(M, k)$, read information $r \in E$ by $r \leftarrow \gamma(M, w)$, and update the memory by $M \leftarrow \omega(M, w, a)$, where $a \in E^C$ is the content vector produced by $R$. Then the quadruple $\langle M, \alpha, \gamma, \omega \rangle$ is called an augmented memory $\mathcal{M}$ of $R$, and $R$ is called a memory-augmented RNN, which is re-denoted as $\mathcal{M}_R$.
First, $\mathcal{M}_R$ receives the input $x^{(t)}$ and combines it with $h^{(t-1)}$ to produce an updated hidden state $h^{(t)}$ and the parameters $(k^{(t)}, a^{(t)})$ which are used to access the memory $M$; then the address $w^{(t)}$ is calculated by the addressing function α and the memory $M^{(t-1)}$ is updated by the write function ω; finally, $\mathcal{M}_R$ combines $h^{(t)}$ with an additional read vector $r^{(t)}$, produced by the read function γ, to produce the output $y^{(t)}$.
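The access cycle described in Definition 1 (address, then write, then read) can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration rather than the paper's formulation: the dot-product similarity inside `soft_address`, the blending rule inside `write`, and the function names themselves are hypothetical stand-ins for α, ω, and γ.

```python
import math

def soft_address(memory, key):
    """alpha: softmax over dot-product scores between the key and each slot
    (the dot-product similarity is an assumption of this sketch)."""
    scores = [sum(m_j * k_j for m_j, k_j in zip(slot, key)) for slot in memory]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def write(memory, w, content):
    """omega: blend `content` into every slot in proportion to its address
    weight (an assumed update rule, chosen for simplicity)."""
    return [[(1.0 - w_i) * m_j + w_i * a_j for m_j, a_j in zip(slot, content)]
            for slot, w_i in zip(memory, w)]

def read(memory, w):
    """gamma: address-weighted sum of the slots."""
    dim = len(memory[0])
    return [sum(w_i * slot[j] for slot, w_i in zip(memory, w)) for j in range(dim)]

def memory_step(memory, key, content):
    """One access cycle: compute the address, update the memory, then read."""
    w = soft_address(memory, key)
    memory = write(memory, w, content)
    r = read(memory, w)
    return memory, w, r

# Toy memory with N = 3 slots of dimension 2; key and content stand in for
# the parameters (k, a) that the controller RNN would emit.
memory = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
memory, w, r = memory_step(memory, key=[4.0, 0.0], content=[0.5, 0.5])
```

In this toy run the key is most similar to the first slot, so the first slot receives most of the address weight and most of the written content.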
If $w$ satisfies the condition of a softmax distribution:
then α is called a soft addressing. If $w$ satisfies the condition:
then α is called a hard addressing. Soft addressing offers $O(N)$ time complexity of memory accessing, and hard addressing offers $O(1)$ time complexity of memory accessing [6].
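The contrast between the two addressing modes can be sketched as follows. The sketch assumes that a soft address is a softmax vector (non-negative entries summing to 1) and that a hard address is a one-hot vector; both conditions are the usual readings of these terms, not formulas taken from the text. The function names and the scalar-valued memory are hypothetical.

```python
import math

def soft_address(scores):
    """Softmax address: every entry lies in (0, 1) and the entries sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hard_address(scores):
    """One-hot address: exactly one entry is 1 and the rest are 0."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1.0 if i == best else 0.0 for i in range(len(scores))]

def soft_read(memory, w):
    """Weighted sum over all N slots, so the cost grows linearly with N."""
    return sum(w_i * m_i for w_i, m_i in zip(w, memory))

def hard_read(memory, index):
    """Single slot lookup; independent of N once the index is known."""
    return memory[index]

memory = [10.0, 20.0, 30.0, 40.0]   # scalar slots keep the example small
scores = [0.1, 2.0, 0.3, 0.2]       # hypothetical similarity scores

w_soft = soft_address(scores)
w_hard = hard_address(scores)
r_soft = soft_read(memory, w_soft)
r_hard = hard_read(memory, w_hard.index(1.0))
```

Note that this sketch still scans all scores to find the hard address; the O(1) figure cited from [6] concerns the access itself, and how the index is produced cheaply is outside the scope of this example.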
IV. DURATION, ADDRESSABILITY, CAPACITY, AND DAC PRINCIPLE
In this section, we propose the definitions of the three indicators for M-RNNs: duration, addressability, and capacity. With detailed derivation, we prove that the three indicators can be characterized as functions of the size and updating frequency of the memory. Based on this characterization, the DAC principle is obtained at the end of this section.
A. DURATION

Definition 2 (Duration):
Given an $\mathcal{M}_R$, the duration $D$ of $\mathcal{M}$ is defined as the average time interval on each element $m_i$ between two successive write operations $\omega_{t,i}$ and $\omega_{t',i}$, i.e.
where $N_\delta(\tau)$ is the total number of time intervals; $t'$ is the earliest time step after $t$ (with $t < t' \le \tau$) at which $m_i$ is written again; δ is defined as
and $f_D$ has the following properties:
where 0 < p(ω) < 1. 
and the sum of time intervals on each element m i between two successive write operations over a period of time τ can be written as,
Substituting equation (15) and equation (16) into equation (13), we have,
Thus f D holds equation (14) . Two properties can be deduced directly from the equation:
In practical applications, $N$ often takes a fixed value. In order to calculate the bounds of the duration, let us consider three extreme cases of $p(\omega)$:
Case 0: $p(\omega) = 0$: this case does not make sense because the memory can never be written.
Case 1: $p(\omega) \approx 0$: over all the time steps $t = 1, \ldots, \tau$, each memory element $m_i$, $i = 1, \ldots, N$, is written only once.
Case 2: $p(\omega) = 1$: at every time step $t$, there is a write operation $\omega_{t,i}$ that updates the value of a memory element $m_i$.
For any τ, we can find a multiple of $N$ which satisfies $nN = \tau$. Then $D(\mathcal{M})$ can be written as below,
With respect to Case 1, D(M) can be rewritten as
Then, we can get the upper bound of
With respect to Case 2, D(M) can be rewritten as
Thus, the lower bound of f D is,
= N .
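Definition 2 can also be checked empirically by simulating write operations and averaging the intervals between successive writes to the same element. The sketch below makes two assumptions that are not taken from the paper: at each time step a write happens with probability `p_write` (a Bernoulli model of the write frequency), and writes visit the slots in round-robin order. The function name `empirical_duration` is hypothetical.

```python
import random

def empirical_duration(n_slots, p_write, steps, seed=0):
    """Average interval between successive writes to the same slot.

    Assumed model: each time step triggers one write with probability
    `p_write`, and writes cycle through the slots in round-robin order.
    """
    rng = random.Random(seed)
    last_write = [None] * n_slots   # time step of the latest write per slot
    intervals = []
    next_slot = 0
    for t in range(1, steps + 1):
        if rng.random() < p_write:
            i = next_slot
            next_slot = (next_slot + 1) % n_slots
            if last_write[i] is not None:
                intervals.append(t - last_write[i])
            last_write[i] = t
    if not intervals:
        return float("inf")         # no slot was written twice
    return sum(intervals) / len(intervals)

# Case 2 in the text: a write at every time step.
d_always = empirical_duration(n_slots=8, p_write=1.0, steps=800)   # 8.0
# A lower write frequency lengthens the measured duration.
d_sparse = empirical_duration(n_slots=8, p_write=0.25, steps=8000)
```

With a write at every step and round-robin placement, every slot is rewritten exactly every N steps, which agrees with the lower bound N stated for Case 2. Under this particular model the measured duration grows roughly like N divided by the write frequency; that ratio is a property of the simulation and should not be read as the exact expression of Lemma 1.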
B. ADDRESSABILITY
Definition 3 (Addressability): Given an $\mathcal{M}_R$, the addressability $A$ of $\mathcal{M}$ is defined as a monotonically decreasing function of the sum of the time complexities $O_\alpha$, $O_\gamma$ and $O_\omega$ of the addressing function α, the read function γ and the write function ω on the memory set, i.e.
$$A(\mathcal{M}) = f(O_\alpha + O_\gamma + O_\omega) \quad (17)$$
where $f$ can be any monotonically decreasing function, such as $f(x) = -x^c$, where $c$ ($c \ge 1$) is a constant. Definition 3 reflects the average time complexity of memory accessing for the whole system.
Lemma 2: For any $\mathcal{M}_R$, suppose the addressability of $\mathcal{M}$ satisfies Definition 3; then $A(\mathcal{M})$ is a monotonically decreasing function $f_A$ of the memory size $N$.
Proof: From a previous study [6], we have $O_\alpha = N$ and
Here $g(N)$ is another monotonically increasing function of $N$, such as $\log(N)$ in [28]. $A$ can be written as
Because $f$ is a monotonically decreasing function, $f_A$ is a monotonically decreasing function of the memory size $N$ as well.
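As a small numerical illustration of Definition 3 and Lemma 2, the sketch below evaluates the addressability with the example choice f(x) = −x^c. For the soft-addressing case it assumes that all three costs equal N, which is the setting the paper later uses for NTM; the function names are hypothetical.

```python
def addressability(o_alpha, o_gamma, o_omega, c=1.0):
    """A = f(O_alpha + O_gamma + O_omega) with the example f(x) = -x**c."""
    return -((o_alpha + o_gamma + o_omega) ** c)

def soft_addressability(n_slots, c=1.0):
    """Soft addressing, assuming addressing, read and write each cost N."""
    return addressability(n_slots, n_slots, n_slots, c)

# Larger memories are harder to address: the value decreases as N grows.
values = [soft_addressability(n) for n in (10, 100, 1000)]
```

Because the costs enter only through their sum, any increase in memory size lowers the addressability regardless of which of the three operations becomes more expensive.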
C. CAPACITY
Definition 4 (Capacity): Given a M R , the capacity C of M is defined as the total bits of available information from M over a period of time τ , i.e. [τ, τ + τ ]. Let each element m in memory set M contain k (k ≥ 1) bits of information, i.e. I (m) = k, then the capacity C is defined as below,
Lemma 3: For any M R , suppose the capacity C of M satisfies Definition 4, then C(M) is a function f C of memory size N and write frequency p(ω). More precisely, f C can be written as,
This means that at time step τ the total available information is $kN$. Over the time steps of the period that follows, the amount of newly available information depends on the number of write operations, which in turn depends on the write frequency $p(\omega)$, so we have
Thus f C holds equation (18) .
In a standard information storage system, e.g., the disk of a digital computer, the amount of stored information is an explicit quantity. However, Lemma 3 reveals a key insight about the capacity defined in Definition 4: the updating frequency of the memory can increase the capacity of the whole system.
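The dependence of capacity on write frequency can be made concrete with a short sketch. It assumes the form that the paper later states for NTM in Property 6, namely k·N bits already held in memory plus k·p(ω) bits per time step of the observation period; applying that form to a generic memory, and the names `capacity` and `window`, are assumptions of this example.

```python
def capacity(n_slots, p_write, window, bits_per_slot=1):
    """Bits available over a period of `window` time steps.

    Assumed form (taken from Property 6): information already stored,
    k * N, plus information written during the period, k * p * window.
    """
    stored = bits_per_slot * n_slots
    newly_written = bits_per_slot * p_write * window
    return stored + newly_written

# Same memory size, different write frequencies over a 50-step period.
c_slow = capacity(n_slots=100, p_write=0.2, window=50)
c_fast = capacity(n_slots=100, p_write=0.8, window=50)
```

Under this form, a memory that is rewritten more often makes more information available over the same period even though its size is unchanged.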
Theorem 1 implies that the three indicators can be characterized as functions of the size $N$ and updating (write) frequency $p(\omega)$ of the additional memory. This offers two insights for designing and training an M-RNN:
• Tuning the hyper-parameter (memory size $N$) changes the values of DAC.
• Different training algorithms can force M-RNNs to learn different write frequencies $p(\omega)$, which leads to different values of DAC.
V. APPLICATIONS OF DAC
This section uses the DAC to help design M-RNNs for learning long term dependencies in sequential data, and to compare two M-RNNs: LSTM and NTM.
A. DAC FOR HELPING DESIGN M-RNNs FOR LEARNING LONG TERM DEPENDENCIES
Long term dependency (LTD) [4] , also called long memory process [29] , [30] or long range dependency (persistence) [31] , [32] , is a phenomenon that may arise in the analysis of spatial or time series data. It relates to the dependency of two points with long time interval or spatial distance between the points.
Definition 5 (Long Term Dependencies (LTDs) [4]): If the prediction of the desired output $d^{(t_i)}$ of a sequence-to-sequence task $T$ at time step $t_i$ depends on the inputs $\{x^{(t_{j_1})}, x^{(t_{j_2})}, \ldots, x^{(t_{j_{\kappa(t_i)}})}\}$ presented at earlier time steps $T(t_i) = \{t_{j_1}, t_{j_2}, \ldots, t_{j_{\kappa(t_i)}}\}$, and
where $\kappa(t_i)$ is the number of dependency time steps at time step $t_i$ and $\min T(t_i)$ is the time step of the input which is the longest dependency of the output $d^{(t_i)}$, then the task $T$ displays long term dependencies.
If a sequential processing task $T$ displays LTDs, learning $T$ using vanilla RNNs with stochastic gradient descent is a challenge due to the vanishing gradient problem [4], [5]. M-RNNs can avoid this problem by writing the relevant inputs into the additional memory and reading them back when necessary. Here, we aim to use the DAC to analyze the nature of LTDs, as well as the requirements when using M-RNNs to learn LTDs in sequential data.
Definition 6 (Dependency Span):
If an output $d^{(t)}$ of a sequence-to-sequence task $T$ at time step $t$ depends on the inputs presented at earlier time steps $T(t) = \{t_{j_1}, t_{j_2}, \ldots, t_{j_{\kappa(t)}}\}$, then the dependency span of the output $d^{(t)}$ is defined as
where $\min T(t)$ is the time step of the input which is the longest dependency of the current output $d^{(t)}$.
Definition 7 (Dependency Matrix):
A dependency matrix $D$ is used to denote the dependencies over a period of time steps $\{1, 2, \ldots, \tau\}$ between two sequences, the desired outputs d[1:τ] and the inputs x[1:τ]. Each element of the matrix is defined as follows: $D_{j,i} = 1$ if the output $d^{(t_i)}$ depends on the input $x^{(t_j)}$, and $D_{j,i} = 0$ otherwise.
Here, we propose the definition of the dependency matrix just for analyzing the nature of LTDs and for helping design an $\mathcal{M}_R$ to capture the LTDs. In practical applications, e.g. machine translation, the exact values of a particular dependency matrix are actually unknown, but the expected value of the dependency matrix can be estimated from data.
A dependency matrix is usually an upper triangular matrix because the current output $d^{(t_i)}$ always depends on the inputs $x^{(t_j)}$ at previous time steps ($t_j \le t_i$). Figure 4 visualizes the 12 × 12 dependency matrix $D$ of a sequence pair (x[1:12], d[1:12]). Consider a sequence-to-sequence task $T$ which displays LTDs. The input sequence is x[1:τ] and the corresponding desired output sequence is d[1:τ]. Let $D$ be their dependency matrix. For any time step $t$, $1 \le t \le \tau$, $\kappa(t)$ can be calculated by the following equation,
and the dependency time steps set T (t) can be calculated as,
For example, consider a sequence pair (x[1:12], d[1:12]) which is governed by the dependency matrix shown in figure 4,
FIGURE 5. Curves of memory demand when learning sequential data. $\kappa(t)$ is the number of inputs on which the current output $d^{(t)}$ depends, which equals the minimum memory demand for making the prediction at time step $t$; $N^*(t)$ is the total memory demand at time step $t$.
We can also calculate the duration demand $D_x$ of each input
where $D_x^+(t)$ is defined as,
In figure 4, we get $D_x(1) = 12 - 1 = 11$ and $D_x(12) = 12 - 12 = 0$. With the duration demand of each input, the total memory demand over a period of time can be calculated as,
where $I$ is the indicator function. In figure 4, $N^*(1) = 1$ and
With the calculations of $N^*(t)$ and $\kappa(t)$, we have $N^*(t) \ge \kappa(t)$ for $t = 1, 2, \ldots, \tau$. Figure 5 shows the values of $N^*(t)$ and $\kappa(t)$ for the sequence pair (x[1:12], d[1:12]) which is governed by the dependency matrix shown in figure 4. $N^*(t)$ and $\kappa(t)$ offer two conditions for designing an $\mathcal{M}_R$.
Condition 1 (Minimum Memory Requirement for Whole Sequence Prediction):
Suppose an $\mathcal{M}_R$ is used to capture the LTDs in a sequence pair (x[1:τ], d[1:τ]) which is governed by a dependency matrix $D$; then the memory size $N$ of $\mathcal{M}_R$ must satisfy
Condition 2 (Minimum Memory Requirement for Single Prediction):
Suppose an $\mathcal{M}_R$ is used to make a prediction at time step $t$; then the memory size $N$ of $\mathcal{M}_R$ must satisfy
In practical applications, the value of $N^*(t)$ is unknown, because the exact durations $D_x(t)$ of each input $x^{(t)}$ are unknown. The value of $\kappa(t)$ is also unknown because the dependency time step set $T(t)$ of each output $d^{(t)}$ is unknown.
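When a dependency matrix is available, for instance in a synthetic task, the quantities of this subsection can be computed directly. The sketch below is one reading of the definitions, under stated assumptions: rows index inputs and columns index outputs; κ(t) is the number of inputs that output t depends on; the duration demand of input t is the gap to the last output that uses it; and N*(t) counts the inputs already seen at step t that are still needed at step t or later. The 4 × 4 matrix is a hypothetical toy example, not the matrix of figure 4, and all function names are illustrative.

```python
def kappa(dep, t):
    """Number of inputs that output t depends on (t is 1-based)."""
    return sum(dep[j][t - 1] for j in range(len(dep)))

def dependency_steps(dep, t):
    """Time steps of the inputs that output t depends on."""
    return [j + 1 for j in range(len(dep)) if dep[j][t - 1]]

def last_use(dep, t):
    """Latest output step that depends on input t, or 0 if none does."""
    users = [i + 1 for i in range(len(dep)) if dep[t - 1][i]]
    return max(users) if users else 0

def duration_demand(dep, t):
    """How long input t has to stay in memory after it arrives."""
    return max(last_use(dep, t) - t, 0)

def memory_demand(dep, t):
    """Number of inputs seen up to step t that are still needed at step t
    or later (the assumed reading of N*(t))."""
    return sum(1 for j in range(1, t + 1) if last_use(dep, j) >= t)

# dep[j][i] == 1 means output i+1 depends on input j+1 (hypothetical data).
dep = [
    [1, 0, 0, 1],   # x1 is used by d1 and d4
    [0, 1, 1, 0],   # x2 is used by d2 and d3
    [0, 0, 1, 0],   # x3 is used by d3
    [0, 0, 0, 1],   # x4 is used by d4
]
steps = range(1, len(dep) + 1)
kappas = [kappa(dep, t) for t in steps]              # [1, 1, 2, 2]
durations = [duration_demand(dep, t) for t in steps] # [3, 1, 0, 0]
demands = [memory_demand(dep, t) for t in steps]     # [1, 2, 3, 2]
min_memory_size = max(demands)                       # 3
```

In this toy case the total memory demand is never smaller than κ(t), in line with the inequality stated above, and a memory of at least three elements would be needed to hold every input for as long as some output still depends on it.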
B. DAC FOR COMPARING LSTM AND NTM
Before using DAC to compare LSTM and NTM, we first need to calculate their DAC values.
1) CALCULATING THE DAC VALUES FOR LSTM
In order to calculate the DAC values for LSTM, let us analyze the equivalences between LSTM and M-RNN.
Theorem 2 (An LSTM as an M-RNN):
There exist equivalents in an LSTM which correspond to the quadruple $\langle M, \alpha, \gamma, \omega \rangle$ in an $\mathcal{M}_R$.
Proof: In order to prove the equivalence, we need to show that an LSTM contains a ''memory'' $M$ for storing information, an ''addressing'' function for obtaining an address, a ''write'' function for updating the memory, and a ''read'' function for getting information.
(1) Firstly, let the memory cells $c$ in LSTM be the $M$ in $\mathcal{M}_R$, i.e. $c = M$.
(2) Secondly, let the gate functions of LSTM in equation (6) be a kind of addressing function α in $\mathcal{M}_R$,
where k
(3) Thirdly, let $a^{(t)}$ be the content for updating the memory, i.e. $a^{(t)} = i^{(t)} \odot c^{(t)}$; then equation (4) can be regarded as the write function ω of the memory $M$,
where $w^{(t)} = f^{(t)}$ is a write address.
(4) Finally, let $h^{(t)}$ be the read content $r^{(t)}$; then equation (5) can be considered as a read function γ,
where $w^{(t)} = o^{(t)}$ is a read address.
Let $N_l$ be the size of the memory of an LSTM $\mathcal{M}_{lstm}$, i.e. the hidden layer size is $N_l$, and let the write frequency be $p(\omega_l)$. We have the following properties of $\mathcal{M}_{lstm}$.
Property 1:
Property 3:
TABLE 3. Comparing LSTM with NTM on duration D, addressability A and capacity C under the different settings of memory size N and write frequency p(ω). 'l' represents LSTM; 'n' represents NTM; '=' means equal to; '>' means greater than; '<' means less than; '<>' means uncertain.
2) CALCULATING THE DAC VALUES FOR NTM
In order to calculate the DAC values for NTM, we need to analyze the equivalences between NTM and M-RNN.
Theorem 3 (An NTM as an M-RNN):
There exist equivalents in an NTM which correspond to the quadruple $\langle M, \alpha, \gamma, \omega \rangle$ in an $\mathcal{M}_R$.
Proof: Let $M$ in equation (7) be the memory set in Definition 1. Then equation (10), equation (9) and equation (8) can be considered as the addressing function α, the read function γ, and the write function ω, respectively.
Let $N_n$ be the size of the memory of an NTM $\mathcal{M}_{ntm}$ and $p(\omega_n)$ be the write frequency. We have the following properties of $\mathcal{M}_{ntm}$.
Property 4:
We have $O_\alpha = N_n$, $O_\gamma = N_n$ and $O_\omega = N_n$ because the addressing function α of $\mathcal{M}_{ntm}$ is a soft addressing mechanism. Property 5 is obtained if we substitute these values into equation (17) and let $f(x) = -x^c$.
Property 6: C(M ntm ) = kN n + kp(ω n ) τ .
3) COMPARISON BETWEEN LSTM AND NTM USING DAC
In order to compare LSTM with NTM using DAC, let each memory element in NTM contain one bit of information, i.e. $k = 1$, and let the time intervals τ used for the capacity of the two models be equal. Table 3 shows the comparison results of the DAC values of the two models under the different settings of memory size $N$ and write frequency $p(\omega)$. For example, in case 1 in the table, if the memory sizes are equal, i.e., $N_l = N_n$, and the write frequencies are equal, i.e., $p(\omega_l) = p(\omega_n)$, then the durations are equal, i.e., $D(\mathcal{M}_{lstm}) = D(\mathcal{M}_{ntm})$; the addressability of LSTM is less than the addressability of NTM, i.e., $A(\mathcal{M}_{lstm}) < A(\mathcal{M}_{ntm})$; and the capacities are equal, i.e., $C(\mathcal{M}_{lstm}) = C(\mathcal{M}_{ntm})$.
Two observations can be deduced directly from the nine cases in table 3:
(1) M lstm does not have better performance in D (or A, or C) than M ntm in all the cases, and vice versa;
(2) A(M ntm ) is better than A(M lstm ) when N l ≥ N n . 
VI. DISCUSSION
We here discuss the question: ''how do the values of DAC change when we separate the memory of an $\mathcal{M}$ into a group of sub-memories?''
Suppose $\mathcal{M}'$ is a separation of $\mathcal{M}$, and it has two sub-memories $\mathcal{M}^{(1)}$ and $\mathcal{M}^{(2)}$. Let $N_1$ and $N_2$ be their memory sizes, and $p(\omega_1)$ and $p(\omega_2)$ be their write frequencies. For comparing $\mathcal{M}'$ with $\mathcal{M}$, we have the following constraints:
$\mathcal{M}^{(1)}$ and $\mathcal{M}^{(2)}$ have the following properties by Lemmas 1-3.
The properties of $\mathcal{M}'$ can be derived from the properties of its sub-memories $\mathcal{M}^{(1)}$ and $\mathcal{M}^{(2)}$.
Theorem 4: Let $\mathcal{M}'$ be a separation of $\mathcal{M}$, and suppose its sub-memories satisfy the constraints (19)-(23); then this separation
• increases the duration and addressability, and decreases the capacity;
• and makes a trade-off among DAC when the sub-memory size $N_1$ (or $N_2$) is changed.
Proof:
The DAC values of $\mathcal{M}'$ are the weighted sums of the DAC values of its two sub-memories $\mathcal{M}^{(1)}$ and $\mathcal{M}^{(2)}$.
Substituting Properties 7-12 and equality constraints (21) and (22) into the above equations, we obtain
The DAC values of $\mathcal{M}'$ are functions of the memory size $N_1$ and write frequency $p(\omega_1)$ of $\mathcal{M}^{(1)}$, i.e., $f_v(N_1, p(\omega_1))$, $v \in \{D, A, C\}$. Figure 6 visualizes these functions. For all the sub-figures, $N$ in equality constraint (21) is 100 and $p(\omega)$ in equality constraint (22) is 0.2. Consider the region defined by inequality constraints (19) and (20); by checking the gradients of $f_v(N_1, p(\omega_1))$ with respect to $N_1$ and $p(\omega_1)$, we come to the conclusions:
(1) Compared with $\mathcal{M}$,
(2) Changing the value of $N_1$ (or $N_2$) can make a trade-off among the DAC values of $\mathcal{M}'$.
VII. CONCLUSION AND FUTURE WORK
In this study, we introduced a general framework for memory-augmented recurrent neural networks (M-RNNs) and proposed three indicators, duration, addressability, and capacity, for measuring and understanding the ''memory'' of an M-RNN. To our knowledge, this is the first work to study M-RNNs from the perspective of these indicators. The three indicators measure three aspects of M-RNNs: the duration of memory content, the complexity of memory accessing, and the information available from memory over a period of time. Based on these indicators, we obtained the DAC principle, which reveals that it is hard for an M-RNN to simultaneously provide good performance on more than two of the three indicators. The DAC was applied to help design M-RNNs for learning long term dependencies in sequential data, and to analyze and compare LSTM with NTM in different cases. The comparison shows that NTM does not have better performance on duration (or addressability, or capacity) than LSTM in all cases, and vice versa. Moreover, we showed that separating a memory system into sub-memories increases the duration and addressability and decreases the capacity of the whole memory system.
As future work, the definition and calculation of DAC need more detailed consideration. We also plan to carry out quantitative and qualitative analyses of different read, write, and addressing functions of the memory; to study the impact of different memory structures; and to design different models under the guidance of DAC and test them on a wide range of machine learning tasks. Finally, we plan to verify the sub-memory property in a series of natural language processing tasks, such as relation extraction, semantic role labeling, language modeling, text comprehension, and question answering.

He is currently pursuing the Ph.D. degree with Southeast University, Nanjing. His current research interests include machine learning and natural language processing.

His current research interests include material science and manufacturing, numerical simulation of heat transfer and fluid flow, CAD, CG, visualization, VR, game theory, agent oriented programming, Semantic Web, knowledge capture, web mining, military simulation, philosophy, traveling, basketball, and Taijiquan. He was concentrated in human-like and conversational agent, multi-agent system, agent oriented programming, and machine learning. A project was launched in 2003 about interactive multi-agent system used for shooter training simulation, in which agents are hosted by augmented reality. He has to construct 3-D buildings and terrains, design and debug scenarios, and recognize behavior of shooters in the real world. Although this project is more complicated than designing ordinary CG games, it is much more interesting and challenging. He is also involved in 973 Project for semantic grid. Recently, he is responsible for two National Science Foundation of China projects, which are related to ontology learning and inductive logic programming.
Speech recognition, text understanding and image recognition are faced with the same puzzles, and natural language processing is one of the best test beds for various innovative AI ideas. Machine Learning may play a key role in these fields in the future. In addition, semantic annotation of 3-D environments for multi-agents to plan and act "rationally" is an important branch of Semantic Web. He firmly believes that a new era for search engine based on deep NLP, machine learning, and large scale reasoning is coming.
WEILI ZENG received the B.S. degree in applied mathematics from Jishou University, Jishou, China, in 2007, and the M.S. degree in applied mathematics and the Ph. D degree in intelligent transportation systems from Southeast University, Nanjing, China, in 2009 and 2013, respectively.
He is currently an Associate Professor and a Master's Supervisor with the College of Civil Aviation, Nanjing University of Aeronautics and Astronautics, Nanjing. His current research interests include intelligent transportation systems, computation vision, and machine learning.
XUELIAN LI received the B.S. degree in information and computing science and the M.S. degree in applied linguistics from Nanjing Tech University, Nanjing, China, in 2010 and 2014, respectively.
She is currently pursuing the Ph.D. degree in computer soft science and technology with Southeast University, Nanjing. Her research interests include information extraction, natural language processing, and artificial intelligence. Since 2016, she has been with the School of Computer Science, Nanjing University of Posts and Telecommunications, China. Her research interests include information extraction, medicare fraud detection, statistical machine learning, and Semantic Web.
