# 480DeepMind.ipynb

COSC 480 - Deep Learning

Fall 2018

Very minor adaptation from: https://github.com/keras-rl/keras-rl/examples/dqn_atari.py

Based on: Mnih et al., "Human-level control through deep reinforcement learning" (2015)

Notebook for DeepMind/Game Playing lecture.

Installation notes:
You will need OpenAI's gym package plus a handful of others that you may not have installed yet. On Linux and macOS, running the following should be enough:

pip3 install gym<br>
pip3 install gym[atari]<br>
pip3 install h5py<br>
pip3 install Pillow<br>
pip3 install keras-rl

For Windows? ¯\\_(ツ)_/¯

In [None]:
#imports needed for processing, OpenAI gym, DQN
from __future__ import division

from PIL import Image
import numpy as np
import gym

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten, Convolution2D, Permute
from keras.optimizers import Adam
import keras.backend as K

from rl.agents.dqn import DQNAgent
from rl.policy import LinearAnnealedPolicy, BoltzmannQPolicy, EpsGreedyQPolicy
from rl.memory import SequentialMemory
from rl.core import Processor
from rl.callbacks import FileLogger, ModelIntervalCheckpoint

# arguments for our simulation
mode = 'train'
envname = 'BreakoutDeterministic-v4'
weights = None  # path to a previously saved weights file; set it and switch mode to 'test' to evaluate
INPUT_SHAPE = (84, 84)
WINDOW_LENGTH = 4

In [None]:
#processing the input and observations from the atari gym instance - note it's an image
class AtariProcessor(Processor):
    def process_observation(self, observation):
        assert observation.ndim == 3  # (height, width, channel)
        img = Image.fromarray(observation)
        img = img.resize(INPUT_SHAPE).convert('L')  # resize and convert to grayscale
        processed_observation = np.array(img)
        assert processed_observation.shape == INPUT_SHAPE
        return processed_observation.astype('uint8')  # saves storage in experience memory

    def process_state_batch(self, batch):
        # We could perform this processing step in `process_observation`. In this case, however,
        # we would need to store a `float32` array instead, which is 4x more memory intensive than
        # an `uint8` array. This matters if we store 1M observations.
        processed_batch = batch.astype('float32') / 255.
        return processed_batch

    def process_reward(self, reward):
        return np.clip(reward, -1., 1.)
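
The comment in `process_state_batch` claims that storing `float32` frames would cost four times the memory of `uint8` frames. A rough back-of-the-envelope check, assuming one 84x84 frame is stored per step and ignoring any per-object overhead (the helper name `replay_memory_bytes` is made up for this sketch):

```python
# Rough replay-memory sizes. Assumes one stored frame per step and
# ignores Python/NumPy object overhead, so treat the numbers as estimates.
FRAME_SHAPE = (84, 84)      # INPUT_SHAPE from the notebook
MEMORY_LIMIT = 1000000      # the 1M observations mentioned in the comment
BYTES_PER_ITEM = {"uint8": 1, "float32": 4}


def replay_memory_bytes(dtype, shape=FRAME_SHAPE, limit=MEMORY_LIMIT):
    """Raw bytes needed to hold `limit` frames of `shape` with the given dtype."""
    height, width = shape
    return height * width * BYTES_PER_ITEM[dtype] * limit


for dtype in ("uint8", "float32"):
    gigabytes = replay_memory_bytes(dtype) / 1e9
    print("{}: {:.2f} GB".format(dtype, gigabytes))
```

Under these assumptions the `uint8` buffer needs roughly 7 GB and the `float32` buffer roughly 28 GB, which is why the scaling to `[0, 1]` is deferred until a batch is sampled.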


In [5]:
# Get the environment and extract the number of actions.
env = gym.make(envname)
np.random.seed(123)
env.seed(123)
nb_actions = env.action_space.n

# Next, we build our model. We use the same model that was described by Mnih et al. (2015).
input_shape = (WINDOW_LENGTH,) + INPUT_SHAPE
model = Sequential()
if K.image_dim_ordering() == 'tf':
    # (width, height, channels)
    model.add(Permute((2, 3, 1), input_shape=input_shape))
elif K.image_dim_ordering() == 'th':
    # (channels, width, height)
    model.add(Permute((1, 2, 3), input_shape=input_shape))
else:
    raise RuntimeError('Unknown image_dim_ordering.')
model.add(Convolution2D(32, (8, 8), strides=(4, 4)))
model.add(Activation('relu'))
model.add(Convolution2D(64, (4, 4), strides=(2, 2)))
model.add(Activation('relu'))
model.add(Convolution2D(64, (3, 3), strides=(1, 1)))
model.add(Activation('relu'))
model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear'))

# Finally, we configure and compile our agent. You can use every built-in Keras optimizer and
# even the metrics!
memory = SequentialMemory(limit=1000000, window_length=WINDOW_LENGTH)
processor = AtariProcessor()

# Select a policy. We use eps-greedy action selection, which means that a random action is selected
# with probability eps. We anneal eps from 1.0 to 0.1 over the course of 1M steps. This is done so that
# the agent initially explores the environment (high eps) and then gradually sticks to what it knows
# (low eps). We also set a dedicated eps value that is used during testing. Note that we set it to 0.05
# so that the agent still performs some random actions. This ensures that the agent cannot get stuck.
policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), attr='eps', value_max=1., value_min=.1, value_test=.05,
                              nb_steps=1000000)

# The trade-off between exploration and exploitation is difficult and an ongoing research topic.
# If you want, you can experiment with the parameters or use a different policy. Another popular one
# is Boltzmann-style exploration:
# policy = BoltzmannQPolicy(tau=1.)
# Feel free to give it a try!

dqn = DQNAgent(model=model, nb_actions=nb_actions, policy=policy, memory=memory,
               processor=processor, nb_steps_warmup=50000, gamma=.99, target_model_update=10000,
               train_interval=4, delta_clip=1.)
dqn.compile(Adam(lr=.00025), metrics=['mae'])

if mode == 'train':
    # Okay, now it's time to learn something! We capture the interrupt exception so that training
    # can be prematurely aborted. Notice that you can use the built-in Keras callbacks!
    # Lower nb_steps to shorten the run time.
    weights_filename = 'dqn_{}_weights.h5f'.format(envname)
    checkpoint_weights_filename = 'dqn_' + envname + '_weights_{step}.h5f'
    log_filename = 'dqn_{}_log.json'.format(envname)
    callbacks = [ModelIntervalCheckpoint(checkpoint_weights_filename, interval=250000)]
    callbacks += [FileLogger(log_filename, interval=100)]
    dqn.fit(env, callbacks=callbacks, nb_steps=1750000, log_interval=10000, visualize=True)

    # After training is done, we save the final weights one more time.
    dqn.save_weights(weights_filename, overwrite=True)

    # Finally, evaluate our algorithm for 10 episodes.
    dqn.test(env, nb_episodes=10, visualize=False)
elif mode == 'test':
    weights_filename = 'dqn_{}_weights.h5f'.format(envname)
    if weights:
        weights_filename = weights
    dqn.load_weights(weights_filename)
    dqn.test(env, nb_episodes=10, visualize=True)


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
permute_2 (Permute)          (None, 84, 84, 4)         0         
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 20, 20, 32)        8224      
_________________________________________________________________
activation_6 (Activation)    (None, 20, 20, 32)        0         
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 9, 9, 64)          32832     
_________________________________________________________________
activation_7 (Activation)    (None, 9, 9, 64)          0         
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 7, 7, 64)          36928     
_________________________________________________________________
activation_8 (Activation)    (None, 7, 7, 64)          0         
__________
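
The output shapes and parameter counts in the summary above can be checked by hand. The sketch below assumes unpadded ("valid") convolutions, which is the Keras default when no `padding` argument is given. The helper names are made up for illustration.

```python
def conv_output_size(size, kernel, stride):
    """Spatial output size of an unpadded ('valid') convolution."""
    return (size - kernel) // stride + 1


def conv_param_count(kernel, in_channels, filters):
    """kernel * kernel * in_channels weights per filter, plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


# (filters, kernel, stride) for the three Convolution2D layers in the model
LAYERS = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]

size, channels = 84, 4  # INPUT_SHAPE and WINDOW_LENGTH
results = []
for filters, kernel, stride in LAYERS:
    size = conv_output_size(size, kernel, stride)
    params = conv_param_count(kernel, channels, filters)
    results.append((size, filters, params))
    channels = filters  # this layer's filters are the next layer's input channels
    print("output: ({0}, {0}, {1})  params: {2}".format(size, filters, params))
```

The three printed rows should match the `conv2d_4`, `conv2d_5` and `conv2d_6` lines of the summary. By the same arithmetic, the `Flatten` layer would pass 7 * 7 * 64 = 3136 values to `Dense(512)`.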

48 episodes - episode_reward: 1.896 [0.000, 7.000] - loss: 0.001 - mean_absolute_error: 0.116 - mean_q: 0.131 - mean_eps: 0.771 - ale.lives: 2.930

Interval 27 (260000 steps performed)
49 episodes - episode_reward: 1.755 [0.000, 5.000] - loss: 0.001 - mean_absolute_error: 0.129 - mean_q: 0.155 - mean_eps: 0.762 - ale.lives: 2.948

Interval 28 (270000 steps performed)
46 episodes - episode_reward: 2.196 [0.000, 5.000] - loss: 0.001 - mean_absolute_error: 0.139 - mean_q: 0.177 - mean_eps: 0.753 - ale.lives: 2.829

Interval 29 (280000 steps performed)
47 episodes - episode_reward: 1.936 [0.000, 6.000] - loss: 0.001 - mean_absolute_error: 0.158 - mean_q: 0.202 - mean_eps: 0.744 - ale.lives: 2.961

Interval 30 (290000 steps performed)
44 episodes - episode_reward: 2.432 [0.000, 6.000] - loss: 0.001 - mean_absolute_error: 0.176 - mean_q: 0.222 - mean_eps: 0.735 - ale.lives: 3.080

Interval 31 (300000 steps performed)
43 episodes - episode_reward: 2.605 [0.000, 8.000] - loss: 0.001 - mean_abs

32 episodes - episode_reward: 4.812 [1.000, 11.000] - loss: 0.001 - mean_absolute_error: 0.692 - mean_q: 0.933 - mean_eps: 0.492 - ale.lives: 2.965

Interval 58 (570000 steps performed)
Testing for 10 episodes ...
Episode 1: reward: 17.000, steps: 751
Episode 2: reward: 17.000, steps: 748
Episode 3: reward: 17.000, steps: 748
Episode 4: reward: 17.000, steps: 748


KeyboardInterrupt: 
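
The `mean_eps` values in the training log follow from the linear annealing schedule configured in `LinearAnnealedPolicy` (`value_max=1.`, `value_min=.1`, `nb_steps=1000000`). The function below is a minimal sketch of such a schedule, not keras-rl's own code, and `annealed_eps` is a hypothetical name:

```python
def annealed_eps(step, value_max=1.0, value_min=0.1, nb_steps=1000000):
    """Linearly decay eps from value_max to value_min over nb_steps, then hold."""
    slope = -(value_max - value_min) / float(nb_steps)
    return max(value_min, value_max + slope * step)


for step in (0, 265000, 500000, 1000000, 1750000):
    print("step {:>7}: eps = {:.4f}".format(step, annealed_eps(step)))
```

At step 265000, the midpoint of the interval that starts at 260000 steps, this gives roughly 0.7615, which is consistent with the `mean_eps: 0.762` logged for that interval. After 1M steps the value stays at 0.1 for the rest of training.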