With the advent of artificial intelligence and machine learning, many industries have undergone their own revolutions. Convolutional neural networks, for example, have drastically changed the conventional ways in which computers extract features from images and video, the field known as computer vision. In the audio domain, artificial intelligence is widely used in areas such as sound classification and speech-to-text conversion. In this work I focus on the use of artificial intelligence for urban sound analysis and processing, where learned models have been shown to outperform conventional methods considerably.

Unlike images or videos, analog sound must be sampled and quantized before it can be stored in digital form. This work is concerned only with digital audio, since neural networks operate on discrete numerical values. Digital sound carries its own set of attributes, such as sampling frequency and bit depth, and prior research has also exploited frequency-domain features such as bandwidth. One important attribute, the sampling frequency, is typically 8 kHz or higher. This raises a practical issue for audio processing: a single second of audio contains at least several thousand discrete values. To process such long streams of samples sequentially, this work focuses on recurrent neural networks, a class of architectures whose built-in memory mechanism allows them to model long-term dependencies.

Within this scope I address two topics: audio captioning and audio synthesis. Firstly, AI-based captioning is already widespread in computer vision, and audio captioning would likewise help people with hearing impairments perceive sound information. Secondly, collecting audio data can be time-consuming and costly.
However, by learning audio patterns and their inter-dependencies, a sound-synthesis model can generate new audio data far more efficiently.
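To make the data-volume point above concrete, the following sketch (purely illustrative; the rate, bit depth, and tone frequency are assumptions, not values from this work) samples and quantizes one second of a sine tone at 8 kHz with 16-bit depth, yielding 8,000 discrete integer values for that single second:

```python
import math

# Illustrative sketch: sample and quantize one second of a 440 Hz sine
# tone at an assumed 8 kHz sampling rate and 16-bit depth.
SAMPLE_RATE = 8_000  # samples per second (assumed rate)
BIT_DEPTH = 16       # bits per sample (assumed depth)
FREQ_HZ = 440        # tone frequency (arbitrary choice)

max_amp = 2 ** (BIT_DEPTH - 1) - 1  # 32767 for signed 16-bit audio

# Sample the continuous sine at discrete instants t = n / SAMPLE_RATE,
# then quantize each amplitude to a signed integer.
samples = [
    round(max_amp * math.sin(2 * math.pi * FREQ_HZ * n / SAMPLE_RATE))
    for n in range(SAMPLE_RATE)  # one second of audio
]

print(len(samples))  # 8000 discrete values for a single second
```

Even at this low rate, a few minutes of audio already runs to millions of values, which is why sequence models that scale to long inputs matter here.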
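The memory mechanism of a recurrent network can be sketched in a few lines. This is a minimal Elman-style recurrence with made-up scalar weights (not trained parameters from this work): the hidden state h carries a summary of all inputs seen so far, which is what lets such networks capture dependencies across many time steps.

```python
import math

# Hypothetical fixed scalar weights, chosen only for illustration.
W_X, W_H, BIAS = 0.5, 0.9, 0.0

def rnn_step(x_t, h_prev):
    """One Elman-style recurrence: mix the new input with the old state."""
    return math.tanh(W_X * x_t + W_H * h_prev + BIAS)

def run(sequence):
    """Fold a whole sequence through the cell; return the final state."""
    h = 0.0  # initial hidden state (empty memory)
    for x_t in sequence:
        h = rnn_step(x_t, h)
    return h

# The final state depends on early inputs, not just the most recent one:
print(run([1.0, 0.0, 0.0, 0.0]))  # early input still echoes in h
print(run([0.0, 0.0, 0.0, 0.0]))  # an all-zero sequence stays at zero
```

In practice the weights are matrices learned by gradient descent, and gated variants such as LSTMs replace the plain tanh cell to preserve information over much longer spans.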