1,789 research outputs found

    Exploring the Symbolic/Subsymbolic Continuum: A case study of RAAM

    Get PDF
    Exoplanets orbiting pre-main-sequence stars are laboratories for studying planet evolution processes, including atmospheric loss, orbital migration, and radiative cooling. V1298 Tau, a young solar analog with an age of 23 ± 4 Myr, is one such laboratory. The star is already known to host a Jupiter-sized planet on a 24 day orbit. Here, we report the discovery of three additional planets, all between the sizes of Neptune and Saturn, based on our analysis of K2 Campaign 4 photometry. Planets c and d have sizes of 5.6 and 6.4 R⊕, respectively, and with orbital periods of 8.25 and 12.40 days reside 0.25% outside of the nominal 3:2 mean-motion resonance. Planet e is 8.7 R⊕ in size but only transited once in the K2 time series and thus has a period longer than 36 days, but likely shorter than 223 days. The V1298 Tau system may be a precursor to the compact multiplanet systems found to be common by the Kepler mission. However, the large planet sizes stand in sharp contrast to the vast majority of Kepler multiplanet systems, which have planets smaller than 3 R⊕. Simple dynamical arguments suggest total masses of <28 M⊕ and <120 M⊕ for the c–d and d–b planet pairs, respectively. The implied low masses suggest that the planets may still be radiatively cooling and contracting, and perhaps losing atmosphere. The V1298 Tau system offers rich prospects for further follow-up, including atmospheric characterization by transmission or eclipse spectroscopy, dynamical characterization through transit-timing variations, and measurements of planet masses and obliquities by radial velocities.
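The quoted proximity to resonance is simple arithmetic on the two periods. A quick check using the rounded periods given in the abstract (the quoted 0.25% comes from the unrounded ephemerides, so the rounded values land slightly differently):

```python
# Fractional offset of the c-d period ratio from an exact 3:2 commensurability,
# using the rounded periods quoted in the abstract.
P_c = 8.25   # orbital period of planet c, days
P_d = 12.40  # orbital period of planet d, days

ratio = P_d / P_c         # ~1.503
offset = ratio / 1.5 - 1  # fractional distance outside exact 3:2
print(f"ratio = {ratio:.4f}, offset = {offset:.2%}")
```

The pair sits just wide of the resonance (offset positive and a fraction of a percent), which is the configuration the abstract describes.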

    Deep Learning Techniques for Music Generation -- A Survey

    Full text link
    This paper is a survey and an analysis of different ways of using deep learning (deep artificial neural networks) to generate musical content. We propose a methodology based on five dimensions for our analysis:
    Objective - What musical content is to be generated? Examples are: melody, polyphony, accompaniment or counterpoint. - For what destination and for what use? To be performed by a human(s) (in the case of a musical score), or by a machine (in the case of an audio file).
    Representation - What are the concepts to be manipulated? Examples are: waveform, spectrogram, note, chord, meter and beat. - What format is to be used? Examples are: MIDI, piano roll or text. - How will the representation be encoded? Examples are: scalar, one-hot or many-hot.
    Architecture - What type(s) of deep neural network is (are) to be used? Examples are: feedforward network, recurrent network, autoencoder or generative adversarial networks.
    Challenge - What are the limitations and open challenges? Examples are: variability, interactivity and creativity.
    Strategy - How do we model and control the process of generation? Examples are: single-step feedforward, iterative feedforward, sampling or input manipulation.
    For each dimension, we conduct a comparative analysis of various models and techniques, and we propose a tentative multidimensional typology. This typology is bottom-up, based on the analysis of many existing deep-learning-based systems for music generation selected from the relevant literature. These systems are described and are used to exemplify the various choices of objective, representation, architecture, challenge and strategy. The last section includes some discussion and some prospects.
    Comment: 209 pages. This paper is a simplified version of the book: J.-P. Briot, G. Hadjeres and F.-D. Pachet, Deep Learning Techniques for Music Generation, Computational Synthesis and Creative Systems, Springer, 201
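The encoding choices named under Representation (one-hot vs. many-hot) are easy to make concrete. A minimal sketch over a hypothetical 12-pitch-class vocabulary (not taken from any system in the survey): a single melody note gets a one-hot vector, while a chord with several simultaneous pitches gets a many-hot vector:

```python
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def one_hot(pitch):
    """Encode a single melody note: exactly one active position."""
    return [1 if p == pitch else 0 for p in PITCH_CLASSES]

def many_hot(pitches):
    """Encode a chord: one active position per sounding pitch class."""
    active = set(pitches)
    return [1 if p in active else 0 for p in PITCH_CLASSES]

melody_step = one_hot("E")           # a single note
c_major = many_hot(["C", "E", "G"])  # a chord as one many-hot vector
```

A scalar encoding, by contrast, would store the pitch as a single number; the choice determines what the network's input and output layers look like.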

    Learning a Narrative Text Generator from Visual Stories via Latent Embedding

    Get PDF
    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, February 2019. Advisor: Byoung-Tak Zhang. The ability to understand stories is one of the capacities that make humans unique among primates and other animals. Story understanding is crucial for AI agents that live with people in everyday life and must understand their context. However, most research on story AI focuses on automated story generation over manually designed closed worlds, which are widely used for computational authoring. Machine learning techniques applied to story corpora face the same problems as natural language processing in general, such as omitted details and missing commonsense knowledge. Since the remarkable success of deep learning in computer vision has increased interest in research bridging vision and language, vision-grounded story data can potentially improve the performance of story understanding and narrative text generation. Assume that an AI agent is situated in an environment in which sensing information arrives through a camera: the agent observes its surroundings, translates them into a story in natural language, and predicts the following event, or several events, sequentially. This dissertation studies the related problems of learning stories and generating narrative text from image streams or videos. The first problem is to generate a narrative text from a sequence of ordered images. As a solution, we introduce GLAC Net (Global-Local Attention Cascading Network). It translates image sequences into narrative paragraphs using an encoder-decoder framework in a sequence-to-sequence setting, with convolutional neural networks for extracting information from images and recurrent neural networks for text generation. We introduce visual cue encoders built from stacked bidirectional LSTMs, and the outputs of each layer are aggregated into contextualized image vectors that extract visual clues.
    The coherency of the generated text is further improved by cascading the information of the previous sentence into the next sentence serially in the decoders. We evaluate the model on the Visual Storytelling (VIST) dataset: it outperforms other state-of-the-art results and achieved the best total score, and the best score in all six aspects, in the human evaluation of the visual storytelling challenge. The second problem is to predict the following events or narrative texts from the earlier parts of a story; prediction should be possible at any step and for an arbitrary number of steps. We propose recurrent event retrieval models as a solution. They train a context accumulation function and two embedding functions that bring the cumulative context at the current time close to the next probable events in a latent space. They update the cumulative context with each new event input using bilinear operations, and the updated cumulative context is used to find candidates for the next event. Evaluated on the Story Cloze Test, they show competitive performance overall and the best performance in the open-ended generation setting; we also demonstrate working examples in an interactive setting. The third problem concerns composite representation learning of semantics and order for video stories. We embed each episode as a trajectory-like sequence of events in the latent space, and propose ViStoryNet to regenerate video stories from these embeddings (story completion tasks). We convert event sentences into thought vectors, and train functions that embed successive events close to each other so that episodes form trajectories. Bidirectional LSTMs are trained as sequence models, with GRU decoders to generate event sentences. Experiments on the PororoQA dataset show that most episodes take the form of trajectories. Using them to complete blocked parts of stories yields results that are imperfect but broadly similar to the originals.
    These results can be applied to AI agents that sense their living area with cameras, describe situations as stories, infer unobserved parts, and predict the future story.
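The recurrent event retrieval idea above, accumulating context with a bilinear update and then retrieving nearby events in the latent space, can be sketched in a few lines. This is an illustrative toy, not the dissertation's trained model: the fixed tensor W, the tanh squashing, and the cosine ranking are assumptions standing in for the learned functions:

```python
import math

def bilinear_update(context, event, W):
    """Fold a new event into the cumulative context.
    Output dimension k is tanh(context^T W[k] event): one bilinear form per dim."""
    return [
        math.tanh(sum(W[k][i][j] * context[i] * event[j]
                      for i in range(len(context))
                      for j in range(len(event))))
        for k in range(len(W))
    ]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def rank_candidates(context, candidates):
    """Sort (name, embedding) candidate events by similarity to the context."""
    return sorted(candidates, key=lambda item: cosine(context, item[1]), reverse=True)

# Toy 2-d latent space; W is a fixed identity-like tensor instead of learned weights.
W = [[[1 if i == j == k else 0 for j in range(2)] for i in range(2)] for k in range(2)]
ctx = bilinear_update([1.0, 0.2], [1.0, 0.1], W)
ranked = rank_candidates(ctx, [("walk home", [0.9, 0.1]), ("fly away", [-0.1, 1.0])])
```

The event embedding closest to the updated context ("walk home" here) is returned first, which mirrors the retrieval step described in the abstract.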

    Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation

    Get PDF
    This paper surveys the current state of the art in Natural Language Generation (NLG), defined as the task of generating text or speech from non-linguistic input. A survey of NLG is timely in view of the changes that the field has undergone over the past decade or so, especially in relation to new (usually data-driven) methods, as well as new applications of NLG technology. This survey therefore aims to (a) give an up-to-date synthesis of research on the core tasks in NLG and the architectures in which such tasks are organised; (b) highlight a number of relatively recent research topics that have arisen partly as a result of growing synergies between NLG and other areas of artificial intelligence; (c) draw attention to the challenges in NLG evaluation, relating them to similar challenges faced in other areas of Natural Language Processing, with an emphasis on different evaluation methods and the relationships between them.
    Comment: Published in Journal of AI Research (JAIR), volume 61, pp 75-170. 118 pages, 8 figures, 1 table
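The definition above, text generated from non-linguistic input, is easiest to see with a toy data-to-text example. A minimal sketch, assuming a made-up weather record and a hand-written template realiser rather than any system from the survey:

```python
def realise(record):
    """Turn a non-linguistic input (a dict of weather readings) into a sentence."""
    # A crude content-determination rule, then template-based surface realisation.
    trend = "rising" if record["temp_max"] > record["temp_min"] + 5 else "steady"
    return (f"In {record['city']}, temperatures reach {record['temp_max']} degrees "
            f"with a {trend} trend and {record['sky']} skies.")

sentence = realise({"city": "Aberdeen", "temp_max": 14, "temp_min": 6, "sky": "overcast"})
```

Real NLG systems decompose this into stages (content selection, planning, realisation) and increasingly replace the hand-written rules with data-driven models, which is the shift the survey tracks.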

    An original framework for understanding human actions and body language by using deep neural networks

    Get PDF
    The evolution of both fields of Computer Vision (CV) and Artificial Neural Networks (ANNs) has allowed the development of efficient automatic systems for the analysis of people's behaviour. By studying hand movements it is possible to recognize gestures, often used by people to communicate information in a non-verbal way. These gestures can also be used to control or interact with devices without physically touching them. In particular, sign language and semaphoric hand gestures are the two foremost areas of interest due to their importance in Human-Human Communication (HHC) and Human-Computer Interaction (HCI), respectively. The processing of body movements, meanwhile, plays a key role in the action recognition and affective computing fields. The former is essential to understand how people act in an environment, while the latter tries to interpret people's emotions based on their poses and movements; both are essential tasks in many computer vision applications, including event recognition and video surveillance. In this Ph.D. thesis, an original framework for understanding actions and body language is presented. The framework is composed of three main modules: in the first one, a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) based method for the recognition of sign language and semaphoric hand gestures is proposed; the second module presents a solution based on 2D skeletons and two-branch stacked LSTM-RNNs for action recognition in video sequences; finally, in the last module, a solution for basic non-acted emotion recognition using 3D skeletons and Deep Neural Networks (DNNs) is provided. The performance of LSTM-RNNs is explored in depth, due to their ability to model the long-term contextual information of temporal sequences, making them suitable for analysing body movements. All the modules were tested on challenging datasets, well known in the state of the art, showing remarkable results compared to the current literature methods.

    Practical deep learning

    Get PDF
    Deep learning is experiencing a revolution with tremendous progress because of the availability of large datasets and computing resources. The development of deeper and larger neural network models has made significant progress recently in boosting the accuracy of many applications, such as image classification, image captioning, object detection, and language translation. However, despite the opportunities they offer, existing deep learning approaches are impractical for many applications due to the following challenges. Many applications have only limited amounts of annotated training data, or collecting labelled training data is too expensive. Such scenarios impose significant drawbacks for deep learning methods, which are not designed for limited data and suffer from performance decay. This is especially true for generative tasks, because the data for many such tasks is difficult to obtain from the real world and the generated results are difficult to control. As deep learning algorithms become more complicated, the workload for researchers to train neural network models and manage the life cycle of deep learning workflows, including the model, dataset, and training pipeline, increases, and the demand for efficient deep learning development rises. Practical deep learning should achieve adequate performance from limited training data and rest on efficient deep learning development processes. In this thesis, we propose several novel methods to improve the practicability of deep generative models and development processes, leading to four contributions. First, we improve the visual quality of synthesising images conditioned on text descriptions without requiring more manually labelled data, which provides controllable generated results using object attribute information from text descriptions.
    Second, we achieve unsupervised image-to-image translation that synthesises images conditioned on input images without requiring paired images to supervise the training, which provides controllable generated results using semantic visual information from input images. Third, we deliver semantic image synthesis that synthesises images conditioned on both images and text descriptions without requiring ground-truth images to supervise the training, which provides controllable generated results using both semantic visual and object attribute information. Fourth, we develop a research-oriented deep learning library called TensorLayer to reduce the workload of researchers in defining models, implementing new layers, and managing the deep learning workflow comprising the dataset, model, and training pipeline. In 2017, this library won the Best Open Source Software Award issued by ACM Multimedia (MM).
    Open Access
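The workflow decomposition described in the fourth contribution, keeping dataset, model, and training pipeline as separately managed parts, can be sketched generically. This is an illustrative sketch of that separation only, not TensorLayer's actual API; the one-parameter model and least-squares loop are stand-ins:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Dataset:
    samples: List[Tuple[float, float]]  # (input, label) pairs

@dataclass
class Model:
    weight: float = 0.0                 # a one-parameter "model" for illustration
    def predict(self, x: float) -> float:
        return self.weight * x

@dataclass
class Pipeline:
    dataset: Dataset
    model: Model
    lr: float = 0.1
    def train_step(self) -> float:
        """One pass of least-squares gradient descent; returns mean squared error."""
        grad, err, n = 0.0, 0.0, len(self.dataset.samples)
        for x, y in self.dataset.samples:
            diff = self.model.predict(x) - y
            grad += 2 * diff * x / n
            err += diff * diff / n
        self.model.weight -= self.lr * grad
        return err

# Wiring the three parts together: the pipeline owns the dataset and the model.
pipe = Pipeline(Dataset([(1.0, 2.0), (2.0, 4.0)]), Model())
errors = [pipe.train_step() for _ in range(50)]
```

Keeping the three objects distinct is what lets a library swap any one of them (a new dataset, a new architecture, a new training loop) without touching the other two.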