core.common package

Submodules

core.common.agent module

class core.common.agent.Agent(processor=None)

Bases: object

Abstract base class for all implemented agents.

Each agent interacts with the environment (as defined by the Env class) by first observing the state of the environment. Based on this observation the agent changes the environment by performing an action.

Do not use this abstract base class directly but instead use one of the concrete agents implemented. Each agent realizes a reinforcement learning algorithm. Since all agents conform to the same interface, you can use them interchangeably.

To implement your own agent, you have to implement the following methods:

  • forward
  • backward
  • compile
  • load_weights
  • save_weights
# Arguments
processor (Processor instance): See [Processor](#processor) for details.
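For orientation, a minimal sketch of such a subclass follows. The method names and the processor argument come from this interface; the RandomAgent name, the nb_actions constructor argument, and the random-action behaviour are illustrative assumptions, not part of the module.

    import numpy as np
    from core.common.agent import Agent

    class RandomAgent(Agent):
        """Sketch: acts uniformly at random and learns nothing."""

        def __init__(self, nb_actions, processor=None):
            super(RandomAgent, self).__init__(processor=processor)
            self.nb_actions = nb_actions  # assumed constructor argument for this example

        def forward(self, observation):
            # Choose the next action from the current observation.
            return np.random.randint(self.nb_actions)

        def backward(self, reward, terminal):
            # No learning here: return an empty list of metric values.
            return []

        def compile(self, optimizer, metrics=[]):
            # Nothing to compile for a random policy.
            self.compiled = True

        def load_weights(self, filepath, filename=None):
            pass  # no weights to load

        def save_weights(self, filepath, filename=None, overwrite=False):
            pass  # no weights to save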
append_replay_memory(reward, terminal)

Saves states to the replay buffer after having executed the action returned by forward.

# Arguments
reward (float): The observed reward after executing the action returned by forward.
terminal (boolean): True if the new state of the environment is terminal.
# Returns
List of metrics values
backward(reward, terminal)

Updates the agent after having executed the action returned by forward. If the policy is implemented by a neural network, this corresponds to a weight update using back-prop.

# Arguments
reward (float): The observed reward after executing the action returned by forward.
terminal (boolean): True if the new state of the environment is terminal.
# Returns
List of metrics values
compile(optimizer, metrics=[])

Compiles an agent and the underlying models to be used for training and testing.

# Arguments
optimizer (keras.optimizers.Optimizer instance): The optimizer to be used during training.
metrics (list of functions lambda y_true, y_pred: metric): The metrics to run during training.
forward(observation)

Takes an observation from the environment and returns the action to be taken next. If the policy is implemented by a neural network, this corresponds to a forward (inference) pass.

# Argument
observation (object): The current observation from the environment.
# Returns
The next action to be executed in the environment.
layers

Returns all layers of the underlying model(s).

If the concrete implementation uses multiple internal models, this method returns them in a concatenated list.

# Returns
A list of the model’s layers
load_weights(filepath, filename)

Loads the weights of an agent from an HDF5 file.

# Arguments
filepath (str or list): The path to the HDF5 file(s). For algorithms that use multiple models, this can be a list of paths, one per model.
filename (str or list): The name of the HDF5 file(s). For algorithms that use multiple models, this can be a list of names, one per model.
reset_states()

Resets all internally kept states after an episode is completed.

run(env, nb_steps, shared_cb_params={}, train_mode=True, action_repetition=1, callbacks=None, verbose=1, visualize=False, nb_max_start_steps=0, random_policy=None, log_interval=10000, nb_max_episode_steps=None, nb_episodes=0)

Runs the agent on the given environment, training it when train_mode is True.

# Arguments

env: Environment instance that the agent interacts with.
nb_steps (integer): Number of training steps to be performed.
shared_cb_params (dict): Shared (key, value) parameters that can be used in callbacks.
action_repetition (integer): Number of times the agent repeats the same action without observing the environment again.
callbacks (list of keras.callbacks.Callback or rl.callbacks.Callback instances): List of callbacks to apply during training. See [callbacks](/callbacks) for details.
verbose (integer): 0 for no logging, 1 for interval logging (compare log_interval), 2 for episode logging.
visualize (boolean): If True, the environment is visualized during training. However, this is likely to slow down training significantly and is thus intended as a debugging instrument.
nb_max_start_steps (integer): Maximum number of steps that the agent performs at the beginning of each episode using start_step_policy. Note that this is an upper limit, since the exact number of steps to perform is sampled uniformly from [0, nb_max_start_steps] at the beginning of each episode.
start_step_policy (lambda observation: action): The policy to follow if nb_max_start_steps > 0. If set to None, a random action is performed.
log_interval (integer): If verbose = 1, the number of steps that are considered to be an interval.
nb_max_episode_steps (integer): Number of steps per episode that the agent performs before the environment is automatically reset. Set to None if each episode should run (potentially indefinitely) until the environment signals a terminal state.
# Returns
A keras.callbacks.History instance that recorded the entire training process.
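As a hedged usage sketch of this training loop: the Gym environment name, the make_my_agent factory, and the Keras 2-style optimizer are placeholders for whatever concrete agent and environment you actually use.

    import gym
    from keras.optimizers import Adam

    # Placeholder setup: any concrete Agent subclass and compatible env would do here.
    env = gym.make('CartPole-v1')
    agent = make_my_agent(env)            # hypothetical factory, not part of core.common
    agent.compile(Adam(lr=1e-3))          # see compile() above

    # Train for a fixed number of environment steps and keep the Keras History.
    history = agent.run(env, nb_steps=50000, train_mode=True,
                        nb_max_episode_steps=500, log_interval=10000, verbose=1)

    # Evaluate without learning by re-running with train_mode disabled.
    agent.run(env, nb_steps=5000, train_mode=False, visualize=True)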
save_weights(filepath, filename=None, overwrite=False)

Saves the weights of an agent as an HDF5 file.

# Arguments
filepath (str): The path to where the weights should be saved.
overwrite (boolean): If False and filepath already exists, raises an error.

core.common.callback module

class core.common.callback.Callback(agent=None, *args, **kwargs)

Bases: keras.callbacks.Callback

on_action_begin(action, logs={})

Called at beginning of each action

on_action_end(action, logs={})

Called at end of each action

on_episode_begin(episode, logs={})

Called at beginning of each episode

on_episode_end(episode, logs={})

Called at end of each episode

on_step_begin(step, logs={})

Called at beginning of each step

on_step_end(step, logs={})

Called at end of each step
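As a sketch of how these hooks are typically used, the subclass below prints episode information; the exact contents of logs are not specified here, so the example only echoes what it receives.

    from core.common.callback import Callback

    class EpisodePrinter(Callback):
        """Minimal sketch: report episode boundaries during a run."""

        def on_episode_begin(self, episode, logs={}):
            print('Episode %d started' % episode)

        def on_episode_end(self, episode, logs={}):
            # Inspect logs for whatever the agent reports at episode end.
            print('Episode %d finished, logs: %s' % (episode, logs))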

class core.common.callback.CallbackList(callbacks=None, queue_length=10)

Bases: keras.callbacks.CallbackList

on_action_begin(action, logs={})

Called at beginning of each action for each callback in callbackList

on_action_end(action, logs={})

Called at end of each action for each callback in callbackList

on_episode_begin(episode, logs={})

Called at beginning of each episode for each callback in callbackList

on_episode_end(episode, logs={})

Called at end of each episode for each callback in callbackList

on_step_begin(step, logs={})

Called at beginning of each step for each callback in callbackList

on_step_end(step, logs={})

Called at end of each step for each callback in callbackList

core.common.cartpole_dqn_PER module

class core.common.cartpole_dqn_PER.DQNAgent(state_size, action_size)

Bases: object

append_sample(state, action, reward, next_state, done)
build_model()
get_action(state)
optimizer()
train_model(beta)
update_target_model()
class core.common.cartpole_dqn_PER.MinSegmentTree(capacity)

Bases: core.common.cartpole_dqn_PER.SegmentTree

min(start=0, end=None)

Returns min(arr[start], …, arr[end])

class core.common.cartpole_dqn_PER.PrioritizedReplayBuffer(size, alpha)

Bases: core.common.cartpole_dqn_PER.ReplayBuffer

add(*args, **kwargs)

See ReplayBuffer.store_effect

sample(batch_size, beta)

Sample a batch of experiences. Compared to ReplayBuffer.sample it also returns importance weights and idxes of the sampled experiences.

# Arguments
batch_size (int): How many transitions to sample.
beta (float): To what degree to use importance weights (0 - no corrections, 1 - full correction).
# Returns
obs_batch (np.array): Batch of observations.
act_batch (np.array): Batch of actions executed given obs_batch.
rew_batch (np.array): Rewards received as results of executing act_batch.
next_obs_batch (np.array): Next set of observations seen after executing act_batch.
done_mask (np.array): done_mask[i] = 1 if executing act_batch[i] resulted in the end of an episode and 0 otherwise.
weights (np.array): Array of shape (batch_size,) and dtype np.float32 denoting the importance weight of each sampled transition.
idxes (np.array): Array of shape (batch_size,) and dtype np.int32 with the indexes in the buffer of the sampled experiences.
update_priorities(idxes, priorities)

Update priorities of sampled transitions. Sets the priority of the transition at index idxes[i] in the buffer to priorities[i].

# Arguments
idxes ([int]): List of indexes of sampled transitions.
priorities ([float]): List of updated priorities corresponding to the transitions at the sampled indexes denoted by idxes.
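A hedged sketch of the usual prioritized-replay training pattern with this buffer follows. It assumes the baselines-style return tuple documented above; the dummy transitions and the random stand-in for TD errors are placeholders for whatever the surrounding agent computes.

    import numpy as np
    from core.common.cartpole_dqn_PER import PrioritizedReplayBuffer

    buffer = PrioritizedReplayBuffer(size=10000, alpha=0.6)

    # Fill the buffer with a few dummy transitions (placeholders for real experience).
    for _ in range(100):
        obs, next_obs = np.zeros(4), np.zeros(4)
        buffer.add(obs, 0, 0.0, next_obs, False)

    # During a training step, sample with importance weights and indexes.
    (obs_b, act_b, rew_b, next_obs_b, done_b,
     weights, idxes) = buffer.sample(32, beta=0.4)

    # Compute new priorities from this batch's TD errors (random stand-in here),
    # then write them back so future sampling reflects the updated errors.
    td_errors = np.random.rand(32)
    buffer.update_priorities(idxes, np.abs(td_errors) + 1e-6)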
class core.common.cartpole_dqn_PER.ReplayBuffer(size)

Bases: object

add(obs_t, action, reward, obs_tp1, done)
sample(batch_size)

Sample a batch of experiences.

# Arguments
batch_size (int): How many transitions to sample.
# Returns
obs_batch (np.array): Batch of observations.
act_batch (np.array): Batch of actions executed given obs_batch.
rew_batch (np.array): Rewards received as results of executing act_batch.
next_obs_batch (np.array): Next set of observations seen after executing act_batch.
done_mask (np.array): done_mask[i] = 1 if executing act_batch[i] resulted in the end of an episode and 0 otherwise.
class core.common.cartpole_dqn_PER.SegmentTree(capacity, operation, neutral_element)

Bases: object

reduce(start=0, end=None)

Returns the result of applying self.operation to a contiguous subsequence of the array:

self.operation(arr[start], operation(arr[start+1], operation(… arr[end])))

# Arguments
start (int): Beginning of the subsequence.
end (int): End of the subsequence.
# Returns
reduced (obj): Result of reducing self.operation over the specified range of array elements.
class core.common.cartpole_dqn_PER.SumSegmentTree(capacity)

Bases: core.common.cartpole_dqn_PER.SegmentTree

find_prefixsum_idx(prefixsum)
Find the highest index i in the array such that

sum(arr[0] + arr[1] + … + arr[i - 1]) <= prefixsum

If the array values are probabilities, this function allows sampling indexes according to the discrete probability distribution efficiently.

# Arguments
prefixsum (float): Upper bound on the sum of the array prefix.
# Returns
idx (int): Highest index satisfying the prefixsum constraint.
sum(start=0, end=None)

Returns arr[start] + … + arr[end]
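To illustrate how the prefix-sum lookup supports proportional sampling (as used by the prioritized buffer above), here is a small sketch built on the documented methods. Item assignment (tree[i] = p) and the power-of-two capacity follow the baselines-style implementation this module appears to mirror, so treat them as assumptions.

    import random
    from core.common.cartpole_dqn_PER import SumSegmentTree

    tree = SumSegmentTree(capacity=8)        # capacity assumed to be a power of two
    priorities = [1.0, 2.0, 4.0, 1.0]
    for i, p in enumerate(priorities):
        tree[i] = p                          # assumed __setitem__, as in baselines

    # Draw an index with probability proportional to its priority:
    # pick a uniform "mass" in [0, total priority) and locate its prefix-sum index.
    mass = random.random() * tree.sum()
    idx = tree.find_prefixsum_idx(mass)
    print('sampled index:', idx)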

core.common.cartpole_dqn_PER.load_sample(memory, file_path)
core.common.cartpole_dqn_PER.save_sample(memory, file_path)

core.common.memory module

class core.common.memory.Experience(state0, action, reward, state1, terminal1)

Bases: tuple

action

Alias for field number 1

reward

Alias for field number 2

state0

Alias for field number 0

state1

Alias for field number 3

terminal1

Alias for field number 4

class core.common.memory.Memory(window_length, ignore_episode_boundaries=False)

Bases: object

append(observation, action, reward, terminal, training=True)
get_config()

Return configuration (window_length, ignore_episode_boundaries) for Memory

# Return
A dict with keys window_length and ignore_episode_boundaries
get_recent_state(current_observation)

Return list of last observations

# Argument
current_observation (object): Last observation
# Returns
A list of the last observations
sample(batch_size, batch_idxs=None)
class core.common.memory.RingBuffer(maxlen)

Bases: object

append(v)

Append an element to the buffer

# Argument
v (object): Element to append
length()

Return the length of the internal deque.

# Argument
None
# Returns
The length of the deque.
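A short, hedged example of the ring-buffer behaviour; the overwrite-oldest semantics once the buffer is full is the usual ring-buffer convention and an assumption here.

    from core.common.memory import RingBuffer

    rb = RingBuffer(maxlen=3)
    for v in [1, 2, 3, 4]:
        rb.append(v)        # once full, the oldest element is expected to be overwritten

    print(rb.length())      # 3: the buffer never grows beyond maxlen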
core.common.memory.sample_batch_indexes(low, high, size)

Return a sample of (size) unique elements between low and high

# Arguments
low (int): The minimum value for our samples
high (int): The maximum value for our samples
size (int): The number of samples to pick
# Returns
A list of samples of length size, with values between low and high
core.common.memory.zeroed_observation(observation)

Return an array of zeros with the same shape as the given observation

# Argument
observation (list): List of observation
# Return
A np.ndarray of zeros with observation.shape

core.common.policy module

class core.common.policy.Policy

Bases: object

Abstract base class for all implemented policies.

Each policy helps with the selection of the action to take in an environment.

Do not use this abstract base class directly but instead use one of the concrete policies implemented. To implement your own policy, you have to implement the following methods:

  • select_action
# Arguments
agent (rl.core.Agent): Agent used
get_config()

Return configuration of the policy

# Returns
Configuration as dict
metrics
metrics_names
on_episode_end(episode, logs={})
reset_states()
select_action(**kwargs)
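Since select_action receives its inputs via **kwargs, a concrete policy defines what it expects. The epsilon-greedy sketch below assumes it is called with a q_values array; that keyword, and the EpsGreedyPolicy name, are assumptions for illustration only.

    import numpy as np
    from core.common.policy import Policy

    class EpsGreedyPolicy(Policy):
        """Sketch: pick a random action with probability eps, else the greedy one."""

        def __init__(self, eps=0.1):
            super(EpsGreedyPolicy, self).__init__()
            self.eps = eps

        def select_action(self, q_values=None, **kwargs):
            nb_actions = q_values.shape[0]
            if np.random.uniform() < self.eps:
                return np.random.randint(nb_actions)
            return int(np.argmax(q_values))

        def get_config(self):
            return {'eps': self.eps}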

core.common.processor module

class core.common.processor.Processor

Bases: object

Abstract base class for implementing processors.

A processor acts as a coupling mechanism between an Agent and its Env. This can be necessary if your agent has different requirements with respect to the form of the observations, actions, and rewards of the environment. By implementing a custom processor, you can effectively translate between the two without having to change the underlying implementation of the agent or environment.

Do not use this abstract base class directly but instead use one of the concrete implementations or write your own.

metrics

The metrics of the processor, which will be reported during training.

# Returns
List of lambda y_true, y_pred: metric functions.
metrics_names

The human-readable names of the agent’s metrics. Must return as many names as there are metrics (see also compile).

process_action(action)

Processes an action predicted by an agent before it is executed in the environment.

# Arguments
action (int): Action given to the environment
# Returns
Processed action given to the environment
process_info(info)

Processes the info as obtained from the environment for use in an agent and returns it.

# Arguments
info (dict): An info as obtained by the environment
# Returns
The processed info.
process_observation(observation, state_size=None)

Processes the observation as obtained from the environment for use in an agent and returns it.

# Arguments
observation (object): An observation as obtained by the environment
# Returns
The processed observation.
process_reward(reward)

Processes the reward as obtained from the environment for use in an agent and returns it.

# Arguments
reward (float): A reward as obtained by the environment
# Returns
The processed reward.
process_state_batch(batch)

Processes an entire batch of states and returns it.

# Arguments
batch (list): List of states
# Returns
Processed list of states
process_step(observation, reward, done, info)

Processes an entire step by applying the processor to the observation, reward, and info arguments.

# Arguments
observation (object): An observation as obtained by the environment.
reward (float): A reward as obtained by the environment.
done (boolean): True if the environment is in a terminal state, False otherwise.
info (dict): The debug info dictionary as obtained by the environment.
# Returns
The tuple (observation, reward, done, info) with all elements processed.
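The sketch below shows a typical processor that casts observations and clips rewards; the specific transformations are examples, not part of the library. An instance would normally be passed to an agent through its processor argument (see Agent above).

    import numpy as np
    from core.common.processor import Processor

    class ClipRewardProcessor(Processor):
        """Sketch: cast observations to float32 and clip rewards to [-1, 1]."""

        def process_observation(self, observation, state_size=None):
            return np.asarray(observation, dtype=np.float32)

        def process_reward(self, reward):
            return float(np.clip(reward, -1.0, 1.0))

        def process_state_batch(self, batch):
            return np.asarray(batch, dtype=np.float32)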

core.common.random module

class core.common.random.AdaptiveParamNoiseSpec(initial_stddev=0.1, desired_action_stddev=0.2, adaptation_coefficient=1.01)

Bases: object

adapt(distance)
get_stats()
class core.common.random.AnnealedGaussianProcess(mu, sigma, sigma_min, n_steps_annealing)

Bases: core.common.random.RandomProcess

current_sigma
class core.common.random.GaussianWhiteNoiseProcess(mu=0.0, sigma=1.0, sigma_min=None, n_steps_annealing=1000, size=1)

Bases: core.common.random.AnnealedGaussianProcess

sample()
class core.common.random.OrnsteinUhlenbeckProcess(theta, mu=0.0, sigma=1.0, dt=0.01, size=1, sigma_min=None, n_steps_annealing=1000)

Bases: core.common.random.AnnealedGaussianProcess

reset_states()
sample()
class core.common.random.RandomProcess

Bases: object

reset_states()
class core.common.random.SimpleOUNoise(size=1, mu=0, theta=0.05, sigma=0.25)

Bases: object

Ornstein-Uhlenbeck process.

reset_states()

Reset the internal state (= noise) to mean (mu).

sample()

Update internal state and return it as a noise sample.
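As a sketch of how these noise processes are typically used for exploration with continuous actions (e.g. in DDPG-style agents), noise sampled per step is added to the deterministic action. The action bounds and the example action below are placeholders.

    import numpy as np
    from core.common.random import SimpleOUNoise

    noise = SimpleOUNoise(size=2, mu=0.0, theta=0.05, sigma=0.25)
    noise.reset_states()                     # reset the internal state at episode start

    def explore(action):
        # Add temporally correlated noise to a (placeholder) deterministic action and
        # keep the result inside assumed action bounds of [-1, 1].
        return np.clip(action + noise.sample(), -1.0, 1.0)

    print(explore(np.array([0.0, 0.5])))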

core.common.random.ddpg_distance_metric(actions1, actions2)

Compute the “distance” between actions taken by two policies at the same states. Expects numpy arrays.

core.common.util module

class core.common.util.AdditionalUpdatesOptimizer(optimizer, additional_updates)

Bases: keras.optimizers.Optimizer

get_config()
get_updates(params, loss)
class core.common.util.OPS

Bases: enum.Enum

An enumeration.

ACTION_REPETITION = '-act-rept'
BATCH_SIZE = '-batch-size'
BUFFER_SIZE = '-buf_size'
DISCOUNT_FACTOR = '-df'
DOUBLE = '-double'
DUELING = '-dueling'
ENTROPY_LOSS = '-ent_loss'
EPOCHS = '-epo'
FRAMES_PER_STEP = '-frms-per-step'
GAMMA = '-gamma'
LEARNING_ACTOR_RATE = '-actor-lr'
LEARNING_CRITIC_RATE = '-critic-lr'
LEARNING_RATE = '-lr'
MARGINAL_SPACE = '-m-s'
MOVE_ANG = '-move_a'
MOVE_DIST = '-move_d'
NO_GUI = '-no-gui'
N_STEPS = '-nsteps'
OU_SIGMA = '-ou-sm'
OU_THETA = '-ou-tt'
PER = '-PER'
POLICY = '-p'
PURE_ACTION_RATIO = '-ratio-pure-action'
REPLAY_MEMORY_SIZE = '-repm-size'
REWARD_HEIGHT_RANK_WEIGHT = '-r-w'
REWARD_SCALE = '-rc'
REWARD_VERSION = '-rv'
TARGET_NETWORK_UPDATE_INTERVAL = '-tn-u-invl'
TIME_PENALTY_WEIGHT = '-p-t'
TIME_WINDOW = '-t-w'
USE_PARAMETERIZED_NOISE = '-p-noise'
WINDOW_LENGTH = '-window-length'
class core.common.util.PopArtLayer(beta=0.0001, epsilon=0.0001, stable_rate=0.1, min_steps=1000, **kwargs)

Bases: keras.engine.base_layer.Layer

Automatic network output scale adjuster that keeps the output of the network consistent as the moving average and variance of discounted returns are updated. Part of the PopArt algorithm described in DeepMind’s paper “Learning values across many orders of magnitude” (https://arxiv.org/abs/1602.07714)

build(input_shape)

Creates the layer weights.

Must be implemented on all layers that have weights.

# Arguments
input_shape: Keras tensor (future input to layer)
or list/tuple of Keras tensors to reference for weight shape computations.
call(inputs, **kwargs)

This is where the layer’s logic lives.

# Arguments
inputs: Input tensor, or list/tuple of input tensors.
**kwargs: Additional keyword arguments.
# Returns
A tensor or list/tuple of tensors.
compute_output_shape(input_shape)

Computes the output shape of the layer.

Assumes that the layer will be built to match that input shape provided.

# Arguments
input_shape: Shape tuple (tuple of integers) or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
# Returns
An output shape tuple.
de_normalize(x: numpy.ndarray) → numpy.ndarray

Converts previously normalized data into original values.

pop_art_update(x: numpy.ndarray) → Tuple[float, float]

Performs ART (Adaptively Rescaling Targets) update, adjusting normalization parameters with respect to new targets x. Updates running mean, mean of squares and returns new mean and standard deviation for later use.

update_and_normalize(x: numpy.ndarray) → Tuple[numpy.ndarray, float, float]

Normalizes given tensor x and updates parameters associated with PopArt: running means (art) and network’s output scaling (pop).
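For intuition, the ART half of the algorithm (the running statistics of the targets) can be sketched outside Keras as below. This is a standalone numerical sketch of the normalization bookkeeping only; it omits the POP part, where the layer rescales its own weights so that de-normalized outputs are preserved, and it is not the layer's actual implementation.

    import numpy as np

    class PopArtSketch:
        """Standalone sketch of the ART bookkeeping (running moments of the targets)."""

        def __init__(self, beta=1e-4):
            self.beta = beta
            self.mean, self.mean_sq = 0.0, 1.0   # running first and second moments

        @property
        def std(self):
            return float(np.sqrt(max(self.mean_sq - self.mean ** 2, 1e-8)))

        def pop_art_update(self, targets):
            # Exponential moving averages of the targets and their squares.
            for g in np.asarray(targets, dtype=np.float64).ravel():
                self.mean = (1.0 - self.beta) * self.mean + self.beta * g
                self.mean_sq = (1.0 - self.beta) * self.mean_sq + self.beta * g * g
            return self.mean, self.std

        def update_and_normalize(self, targets):
            mean, std = self.pop_art_update(targets)
            return (np.asarray(targets) - mean) / std, mean, std

        def de_normalize(self, y):
            return np.asarray(y) * self.std + self.mean

    norm = PopArtSketch()
    normed, mean, std = norm.update_and_normalize([10.0, 12.0, 8.0])
    print(norm.de_normalize(normed))   # approximately recovers the original targets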

class core.common.util.RunningMeanStd(my, epsilon=0.01, shape=())

Bases: object

update(x)
core.common.util.auto_executor(params, filename)

Usage:

  1. Add the option you want to use to the OPS enum class.
  2. Put the values for each option into params as a list, e.g. params[OPS.<added option>.value] = [list of values to run].
  3. Pass the filename of the script to execute.
  4. Handle the argument parsing in the invoked script, for example:

    import argparse
    from core.common.util import OPS

    parser = argparse.ArgumentParser(description='DQN Configuration including setting dqn / double dqn / double dueling dqn')

    parser.add_argument(OPS.NO_GUI.value, help='gui', type=bool, default=False)
    parser.add_argument(OPS.DOUBLE.value, help='double dqn', default=False, action='store_true')
    parser.add_argument(OPS.DUELING.value, help='dueling dqn', default=False, action='store_true')
    parser.add_argument(OPS.DRQN.value, help='drqn', default=False, action='store_true')
    parser.add_argument(OPS.BATCH_SIZE.value, type=int, default=128, help='batch size')
    parser.add_argument(OPS.REPLAY_MEMORY_SIZE.value, type=int, default=8000, help='replay memory size')
    parser.add_argument(OPS.LEARNING_RATE.value, type=float, default=0.001, help='learning rate')
    parser.add_argument(OPS.TARGET_NETWORK_UPDATE_INTERVAL.value, type=int, default=60, help='target_network_update_interval')
    # ... add further options as needed:
    # parser.add_argument('<added option>', type=<type>, default=<default>, help='<help text>')

    args = parser.parse_args()

    dict_args = vars(args)
    post_fix = ''
    for k in dict_args.keys():
        if k == 'no_gui':
            continue
        post_fix += '_' + k + '_' + str(dict_args[k])

  5. Use args as appropriate in the script.
  6. It is recommended to append post_fix to the output file name.
class core.common.util.cLogger

Bases: object

static getLogger(loggerName='not_init', loggerFile=None)
core.common.util.clipped_masked_error(y_true, y_pred, mask, delta_clip)
core.common.util.clone_model(model, custom_objects={})

    model_copy = keras.models.clone_model(model)
    model_copy.set_weights(model.get_weights())

core.common.util.clone_optimizer(optimizer)
core.common.util.denormalize(x, stats)
core.common.util.display_param_dic(params={})
core.common.util.display_param_list(params=[])
core.common.util.gen_agent_params(params={}, filepath=None)
core.common.util.get_kv_from_agent(agent)
core.common.util.get_logger(logger_name, logger_level, use_stream_handler=True, use_file_handler=False)
core.common.util.get_soft_target_model_updates(target, source, tau)
core.common.util.gradients(loss, variables, grad_ys)

Returns the gradients of loss w.r.t. variables.

# Arguments
loss: Scalar tensor to minimize.
variables: List of variables.
# Returns
A gradients tensor.
core.common.util.gumbel_softmax(logits, temperature=1, hard=False)

Referenced from https://github.com/ericjang/gumbel-softmax

Sample from the Gumbel-Softmax distribution and optionally discretize.

Args:
logits: [batch_size, n_class] unnormalized log-probs
temperature: non-negative scalar
hard: if True, take argmax, but differentiate w.r.t. soft sample y
Returns:
[batch_size, n_class] sample from the Gumbel-Softmax distribution. If hard=True, then the returned sample will be one-hot, otherwise it will be a probability distribution that sums to 1 across classes.
core.common.util.gumbel_softmax_sample(logits, temperature)

Draw a sample from the Gumbel-Softmax distribution
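The two functions above are Keras/TensorFlow graph ops; for intuition, here is a standalone NumPy sketch of the same sampling scheme (following the referenced ericjang implementation). The function names are illustrative and not part of the module.

    import numpy as np

    def sample_gumbel_np(shape, eps=1e-20):
        """Sample Gumbel(0, 1) noise via the inverse-CDF trick."""
        u = np.random.uniform(low=0.0, high=1.0, size=shape)
        return -np.log(-np.log(u + eps) + eps)

    def gumbel_softmax_np(logits, temperature=1.0):
        """Draw a relaxed one-hot sample from the Gumbel-Softmax distribution."""
        y = (logits + sample_gumbel_np(np.shape(logits))) / temperature
        e = np.exp(y - np.max(y, axis=-1, keepdims=True))   # numerically stable softmax
        return e / np.sum(e, axis=-1, keepdims=True)

    print(gumbel_softmax_np(np.log([0.1, 0.3, 0.6]), temperature=0.5))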

core.common.util.huber_loss(y_true, y_pred, clip_value)
core.common.util.mmdd24hhmmss()
core.common.util.normalize(x, stats)
core.common.util.pickle_to_plot(pickle_name, png_filename, title='', time_window=100, overwrite=False)
core.common.util.reward_moving_avg_plot(y_data_list=[], title='', label='reward', window=100, filepath='plot.png', shadow_color_index=5)

Save a reward plot.

:param y_data_list: reward list
:param window: moving average time window
:param filepath: plot file to be saved
:param shadow_color_index: color palette index number
:return: None

core.common.util.reward_quantile_plot(y_data_list=[], title=None, label='reward', window=100, filepath='plot.png', shadow_color_index=5)
Parameters:
  • y_data_list – reward list
  • title – optional, title
  • label – y axis label
  • window – moving average time window
  • filepath – plot file to be saved
  • shadow_color_index – color palette index number

Returns:
None

core.common.util.sample_gumbel(shape, eps=1e-20)

Sample from Gumbel(0, 1)

core.common.util.save_ci_graph(y_data_list=[], title='Some Graph', xlabel='episode', ylabel='reward', window=100, filepath='plot.png', y_data=[], y_index=[], figsize=(12, 8), title_font_size=20)

Save a reward plot.

:param window: moving average time window
:param filepath: plot file to be saved
:return: None

core.common.util.save_ci_graph_from_tuple(y_data_list=[], graph_title=[], window=1000, filepath='2plot.png', y_data_legend=[], y_index=[], figsize=(12, 8), title_font_size=20)
core.common.util.save_plot(FILE_NAME)
core.common.util.smoothL1(y_true, y_pred)

https://stackoverflow.com/questions/44130871/keras-smooth-l1-loss
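For reference, smooth L1 (equivalent to the Huber loss with delta = 1) can be sketched with the Keras backend as in the linked answer. This is an illustrative reimplementation under that assumption, not necessarily identical to the module's version.

    import keras.backend as K

    def smooth_l1_sketch(y_true, y_pred):
        # Quadratic for |error| < 1, linear beyond; Huber loss with delta = 1.
        error = K.abs(y_true - y_pred)
        quadratic = K.clip(error, 0.0, 1.0)
        linear = error - quadratic
        return K.mean(0.5 * K.square(quadratic) + linear, axis=-1)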

core.common.util.yyyymmdd24hhmmss()

Module contents