core.common package

Submodules

core.common.agent module

class core.common.agent.Agent(processor=None)

Bases: object

Abstract base class for all implemented agents.

Each agent interacts with the environment (as defined by the Env class) by first observing the state of the environment. Based on this observation the agent changes the environment by performing an action.

Do not use this abstract base class directly but instead use one of the concrete agents implemented. Each agent realizes a reinforcement learning algorithm. Since all agents conform to the same interface, you can use them interchangeably.

To implement your own agent, you have to implement the following methods:

  • forward
  • backward
  • compile
  • load_weights
  • save_weights
# Arguments
processor (Processor instance): See [Processor](#processor) for details.
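For orientation, a minimal sketch of such a subclass follows. The method names and the processor argument come from this interface; the RandomAgent name, the nb_actions constructor argument, and the random-action behaviour are illustrative assumptions, not part of the module.

    import numpy as np
    from core.common.agent import Agent

    class RandomAgent(Agent):
        """Sketch: acts uniformly at random and learns nothing."""

        def __init__(self, nb_actions, processor=None):
            super(RandomAgent, self).__init__(processor=processor)
            self.nb_actions = nb_actions  # assumed constructor argument for this example

        def forward(self, observation):
            # Choose the next action from the current observation.
            return np.random.randint(self.nb_actions)

        def backward(self, reward, terminal):
            # No learning here: return an empty list of metric values.
            return []

        def compile(self, optimizer, metrics=[]):
            # Nothing to compile for a random policy.
            self.compiled = True

        def load_weights(self, filepath, filename=None):
            pass  # no weights to load

        def save_weights(self, filepath, filename=None, overwrite=False):
            pass  # no weights to save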
append_replay_memory(reward, terminal)

Saves states to the replay buffer after having executed the action returned by forward.

# Arguments
reward (float): The observed reward after executing the action returned by forward.
terminal (boolean): True if the new state of the environment is terminal.
# Returns
List of metrics values
backward(reward, terminal)

Updates the agent after having executed the action returned by forward. If the policy is implemented by a neural network, this corresponds to a weight update using back-prop.

# Arguments
reward (float): The observed reward after executing the action returned by forward.
terminal (boolean): True if the new state of the environment is terminal.
# Returns
List of metrics values
compile(optimizer, metrics=[])

Compiles an agent and the underlying models to be used for training and testing.

# Arguments
optimizer (keras.optimizers.Optimizer instance): The optimizer to be used during training.
metrics (list of functions lambda y_true, y_pred: metric): The metrics to run during training.
forward(observation)

Takes an observation from the environment and returns the action to be taken next. If the policy is implemented by a neural network, this corresponds to a forward (inference) pass.

# Argument
observation (object): The current observation from the environment.
# Returns
The next action to be executed in the environment.
layers

Returns all layers of the underlying model(s).

If the concrete implementation uses multiple internal models, this method returns them in a concatenated list.

# Returns
A list of the model’s layers
load_weights(filepath, filename)

Loads the weights of an agent from an HDF5 file.

# Arguments
filepath (str or list): The path to the HDF5 file(s). For algorithms that use multiple models, this can be a list of paths, one per model.
filename (str or list): The name of the HDF5 file(s). For algorithms that use multiple models, this can be a list of names, one per model.
reset_states()

Resets all internally kept states after an episode is completed.

run(env, nb_steps, shared_cb_params={}, train_mode=True, action_repetition=1, callbacks=None, verbose=1, visualize=False, nb_max_start_steps=0, random_policy=None, log_interval=10000, nb_max_episode_steps=None, nb_episodes=0)

Runs the agent on the given environment, training it when train_mode is True.

# Arguments

env: Environment instance that the agent interacts with.
nb_steps (integer): Number of training steps to be performed.
shared_cb_params (dict): Shared (key, value) parameters that can be used in callbacks.
action_repetition (integer): Number of times the agent repeats the same action without observing the environment again.
callbacks (list of keras.callbacks.Callback or rl.callbacks.Callback instances): List of callbacks to apply during training. See [callbacks](/callbacks) for details.
verbose (integer): 0 for no logging, 1 for interval logging (compare log_interval), 2 for episode logging.
visualize (boolean): If True, the environment is visualized during training. However, this is likely to slow down training significantly and is thus intended as a debugging instrument.
nb_max_start_steps (integer): Maximum number of steps that the agent performs at the beginning of each episode using start_step_policy. Note that this is an upper limit, since the exact number of steps to perform is sampled uniformly from [0, nb_max_start_steps] at the beginning of each episode.
start_step_policy (lambda observation: action): The policy to follow if nb_max_start_steps > 0. If set to None, a random action is performed.
log_interval (integer): If verbose = 1, the number of steps that are considered to be an interval.
nb_max_episode_steps (integer): Number of steps per episode that the agent performs before the environment is automatically reset. Set to None if each episode should run (potentially indefinitely) until the environment signals a terminal state.
# Returns
A keras.callbacks.History instance that recorded the entire training process.
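As a hedged usage sketch of this training loop: the Gym environment name, the make_my_agent factory, and the Keras 2-style optimizer are placeholders for whatever concrete agent and environment you actually use.

    import gym
    from keras.optimizers import Adam

    # Placeholder setup: any concrete Agent subclass and compatible env would do here.
    env = gym.make('CartPole-v1')
    agent = make_my_agent(env)            # hypothetical factory, not part of core.common
    agent.compile(Adam(lr=1e-3))          # see compile() above

    # Train for a fixed number of environment steps and keep the Keras History.
    history = agent.run(env, nb_steps=50000, train_mode=True,
                        nb_max_episode_steps=500, log_interval=10000, verbose=1)

    # Evaluate without learning by re-running with train_mode disabled.
    agent.run(env, nb_steps=5000, train_mode=False, visualize=True)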
save_weights(filepath, filename=None, overwrite=False)

Saves the weights of an agent as an HDF5 file.

# Arguments
filepath (str): The path to where the weights should be saved.
overwrite (boolean): If False and filepath already exists, raises an error.

core.common.callback module

class core.common.callback.Callback(agent=None, *args, **kwargs)

Bases: keras.callbacks.Callback

on_action_begin(action, logs={})

Called at beginning of each action

on_action_end(action, logs={})

Called at end of each action

on_episode_begin(episode, logs={})

Called at beginning of each episode

on_episode_end(episode, logs={})

Called at end of each episode

on_step_begin(step, logs={})

Called at beginning of each step

on_step_end(step, logs={})

Called at end of each step
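As a sketch of how these hooks are typically used, the subclass below prints episode information; the exact contents of logs are not specified here, so the example only echoes what it receives.

    from core.common.callback import Callback

    class EpisodePrinter(Callback):
        """Minimal sketch: report episode boundaries during a run."""

        def on_episode_begin(self, episode, logs={}):
            print('Episode %d started' % episode)

        def on_episode_end(self, episode, logs={}):
            # Inspect logs for whatever the agent reports at episode end.
            print('Episode %d finished, logs: %s' % (episode, logs))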

class core.common.callback.CallbackList(callbacks=None, queue_length=10)

Bases: keras.callbacks.CallbackList

on_action_begin(action, logs={})

Called at beginning of each action for each callback in callbackList

on_action_end(action, logs={})

Called at end of each action for each callback in callbackList

on_episode_begin(episode, logs={})

Called at beginning of each episode for each callback in callbackList

on_episode_end(episode, logs={})

Called at end of each episode for each callback in callbackList

on_step_begin(step, logs={})

Called at beginning of each step for each callback in callbackList

on_step_end(step, logs={})

Called at end of each step for each callback in callbackList

core.common.cartpole_dqn_PER module

class core.common.cartpole_dqn_PER.DQNAgent(state_size, action_size)

Bases: object

append_sample(state, action, reward, next_state, done)
build_model()
get_action(state)
optimizer()
train_model(beta)
update_target_model()
class core.common.cartpole_dqn_PER.MinSegmentTree(capacity)

Bases: core.common.cartpole_dqn_PER.SegmentTree

min(start=0, end=None)

Returns min(arr[start], …, arr[end])

class core.common.cartpole_dqn_PER.PrioritizedReplayBuffer(size, alpha)

Bases: core.common.cartpole_dqn_PER.ReplayBuffer

add(*args, **kwargs)

See ReplayBuffer.store_effect

sample(batch_size, beta)

Sample a batch of experiences. Compared to ReplayBuffer.sample it also returns importance weights and idxes of the sampled experiences.

# Arguments
batch_size (int): How many transitions to sample.
beta (float): To what degree to use importance weights (0 - no corrections, 1 - full correction).
# Returns
obs_batch (np.array): Batch of observations.
act_batch (np.array): Batch of actions executed given obs_batch.
rew_batch (np.array): Rewards received as results of executing act_batch.
next_obs_batch (np.array): Next set of observations seen after executing act_batch.
done_mask (np.array): done_mask[i] = 1 if executing act_batch[i] resulted in the end of an episode and 0 otherwise.
weights (np.array): Array of shape (batch_size,) and dtype np.float32 denoting the importance weight of each sampled transition.
idxes (np.array): Array of shape (batch_size,) and dtype np.int32 with the indexes in the buffer of the sampled experiences.
update_priorities(idxes, priorities)

Update priorities of sampled transitions. Sets the priority of the transition at index idxes[i] in the buffer to priorities[i].

# Arguments
idxes ([int]): List of indexes of sampled transitions.
priorities ([float]): List of updated priorities corresponding to the transitions at the sampled indexes denoted by idxes.
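A hedged sketch of the usual prioritized-replay training pattern with this buffer follows. It assumes the baselines-style return tuple documented above; the dummy transitions and the random stand-in for TD errors are placeholders for whatever the surrounding agent computes.

    import numpy as np
    from core.common.cartpole_dqn_PER import PrioritizedReplayBuffer

    buffer = PrioritizedReplayBuffer(size=10000, alpha=0.6)

    # Fill the buffer with a few dummy transitions (placeholders for real experience).
    for _ in range(100):
        obs, next_obs = np.zeros(4), np.zeros(4)
        buffer.add(obs, 0, 0.0, next_obs, False)

    # During a training step, sample with importance weights and indexes.
    (obs_b, act_b, rew_b, next_obs_b, done_b,
     weights, idxes) = buffer.sample(32, beta=0.4)

    # Compute new priorities from this batch's TD errors (random stand-in here),
    # then write them back so future sampling reflects the updated errors.
    td_errors = np.random.rand(32)
    buffer.update_priorities(idxes, np.abs(td_errors) + 1e-6)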
class core.common.cartpole_dqn_PER.ReplayBuffer(size)

Bases: object

add(obs_t, action, reward, obs_tp1, done)
sample(batch_size)

Sample a batch of experiences.

# Arguments
batch_size (int): How many transitions to sample.
# Returns
obs_batch (np.array): Batch of observations.
act_batch (np.array): Batch of actions executed given obs_batch.
rew_batch (np.array): Rewards received as results of executing act_batch.
next_obs_batch (np.array): Next set of observations seen after executing act_batch.
done_mask (np.array): done_mask[i] = 1 if executing act_batch[i] resulted in the end of an episode and 0 otherwise.
class core.common.cartpole_dqn_PER.SegmentTree(capacity, operation, neutral_element)

Bases: object

reduce(start=0, end=None)

Returns the result of applying self.operation to a contiguous subsequence of the array:

self.operation(arr[start], operation(arr[start+1], operation(… arr[end])))

# Arguments
start (int): Beginning of the subsequence.
end (int): End of the subsequence.
# Returns
reduced (obj): Result of reducing self.operation over the specified range of array elements.
class core.common.cartpole_dqn_PER.SumSegmentTree(capacity)

Bases: core.common.cartpole_dqn_PER.SegmentTree

find_prefixsum_idx(prefixsum)
Find the highest index i in the array such that

sum(arr[0] + arr[1] + … + arr[i - 1]) <= prefixsum

If the array values are probabilities, this function allows sampling indexes according to the discrete probability distribution efficiently.

# Arguments
prefixsum (float): Upper bound on the sum of the array prefix.
# Returns
idx (int): Highest index satisfying the prefixsum constraint.
sum(start=0, end=None)

Returns arr[start] + … + arr[end]
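To illustrate how the prefix-sum lookup supports proportional sampling (as used by the prioritized buffer above), here is a small sketch built on the documented methods. Item assignment (tree[i] = p) and the power-of-two capacity follow the baselines-style implementation this module appears to mirror, so treat them as assumptions.

    import random
    from core.common.cartpole_dqn_PER import SumSegmentTree

    tree = SumSegmentTree(capacity=8)        # capacity assumed to be a power of two
    priorities = [1.0, 2.0, 4.0, 1.0]
    for i, p in enumerate(priorities):
        tree[i] = p                          # assumed __setitem__, as in baselines

    # Draw an index with probability proportional to its priority:
    # pick a uniform "mass" in [0, total priority) and locate its prefix-sum index.
    mass = random.random() * tree.sum()
    idx = tree.find_prefixsum_idx(mass)
    print('sampled index:', idx)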

core.common.cartpole_dqn_PER.load_sample(memory, file_path)
core.common.cartpole_dqn_PER.save_sample(memory, file_path)

core.common.memory module

class core.common.memory.Experience(state0, action, reward, state1, terminal1)

Bases: tuple

action

Alias for field number 1

reward

Alias for field number 2

state0

Alias for field number 0

state1

Alias for field number 3

terminal1

Alias for field number 4

class core.common.memory.Memory(window_length, ignore_episode_boundaries=False)

Bases: object

append(observation, action, reward, terminal, training=True)
get_config()

Return configuration (window_length, ignore_episode_boundaries) for Memory

# Return
A dict with keys window_length and ignore_episode_boundaries
get_recent_state(current_observation)

Return list of last observations

# Argument
current_observation (object): Last observation
# Returns
A list of the last observations
sample(batch_size, batch_idxs=None)
class core.common.memory.RingBuffer(maxlen)

Bases: object

append(v)

Append an element to the buffer

# Argument
v (object): Element to append
length()

Return the length of the internal deque.

# Argument
None
# Returns
The length of the deque.
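A short, hedged example of the ring-buffer behaviour; the overwrite-oldest semantics once the buffer is full is the usual ring-buffer convention and an assumption here.

    from core.common.memory import RingBuffer

    rb = RingBuffer(maxlen=3)
    for v in [1, 2, 3, 4]:
        rb.append(v)        # once full, the oldest element is expected to be overwritten

    print(rb.length())      # 3: the buffer never grows beyond maxlen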
core.common.memory.sample_batch_indexes(low, high, size)

Return a sample of (size) unique elements between low and high

# Arguments
low (int): The minimum value for our samples
high (int): The maximum value for our samples
size (int): The number of samples to pick
# Returns
A list of samples of length size, with values between low and high
core.common.memory.zeroed_observation(observation)

Return an array of zeros with the same shape as the given observation

# Argument
observation (list): List of observation
# Return
A np.ndarray of zeros with observation.shape

core.common.policy module

class core.common.policy.Policy

Bases: object

Abstract base class for all implemented policies.

Each policy helps with the selection of the action to take in an environment.

Do not use this abstract base class directly but instead use one of the concrete policies implemented. To implement your own policy, you have to implement the following methods:

  • select_action
# Arguments
agent (rl.core.Agent): Agent used
get_config()

Return configuration of the policy

# Returns
Configuration as dict
metrics
metrics_names
on_episode_end(episode, logs={})
reset_states()
select_action(**kwargs)
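Since select_action receives its inputs via **kwargs, a concrete policy defines what it expects. The epsilon-greedy sketch below assumes it is called with a q_values array; that keyword, and the EpsGreedyPolicy name, are assumptions for illustration only.

    import numpy as np
    from core.common.policy import Policy

    class EpsGreedyPolicy(Policy):
        """Sketch: pick a random action with probability eps, else the greedy one."""

        def __init__(self, eps=0.1):
            super(EpsGreedyPolicy, self).__init__()
            self.eps = eps

        def select_action(self, q_values=None, **kwargs):
            nb_actions = q_values.shape[0]
            if np.random.uniform() < self.eps:
                return np.random.randint(nb_actions)
            return int(np.argmax(q_values))

        def get_config(self):
            return {'eps': self.eps}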

core.common.processor module

class core.common.processor.Processor

Bases: object

Abstract base class for implementing processors.

A processor acts as a coupling mechanism between an Agent and its Env. This can be necessary if your agent has different requirements with respect to the form of the observations, actions, and rewards of the environment. By implementing a custom processor, you can effectively translate between the two without having to change the underlying implementation of the agent or environment.

Do not use this abstract base class directly but instead use one of the concrete implementations or write your own.

metrics

The metrics of the processor, which will be reported during training.

# Returns
List of lambda y_true, y_pred: metric functions.
metrics_names

The human-readable names of the agent’s metrics. Must return as many names as there are metrics (see also compile).

process_action(action)

Processes an action predicted by an agent before it is executed in the environment.

# Arguments
action (int): Action given to the environment
# Returns
Processed action given to the environment
process_info(info)

Processes the info as obtained from the environment for use in an agent and returns it.

# Arguments
info (dict): An info as obtained by the environment
# Returns
The processed info.
process_observation(observation, state_size=None)

Processes the observation as obtained from the environment for use in an agent and returns it.

# Arguments
observation (object): An observation as obtained by the environment
# Returns
The processed observation.
process_reward(reward)

Processes the reward as obtained from the environment for use in an agent and returns it.

# Arguments
reward (float): A reward as obtained by the environment
# Returns
The processed reward.
process_state_batch(batch)

Processes an entire batch of states and returns it.

# Arguments
batch (list): List of states
# Returns
Processed list of states
process_step(observation, reward, done, info)

Processes an entire step by applying the processor to the observation, reward, and info arguments.

# Arguments
observation (object): An observation as obtained by the environment.
reward (float): A reward as obtained by the environment.
done (boolean): True if the environment is in a terminal state, False otherwise.
info (dict): The debug info dictionary as obtained by the environment.
# Returns
The tuple (observation, reward, done, info) with all elements processed.
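The sketch below shows a typical processor that casts observations and clips rewards; the specific transformations are examples, not part of the library. An instance would normally be passed to an agent through its processor argument (see Agent above).

    import numpy as np
    from core.common.processor import Processor

    class ClipRewardProcessor(Processor):
        """Sketch: cast observations to float32 and clip rewards to [-1, 1]."""

        def process_observation(self, observation, state_size=None):
            return np.asarray(observation, dtype=np.float32)

        def process_reward(self, reward):
            return float(np.clip(reward, -1.0, 1.0))

        def process_state_batch(self, batch):
            return np.asarray(batch, dtype=np.float32)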

core.common.random module

class core.common.random.AdaptiveParamNoiseSpec(initial_stddev=0.1, desired_action_stddev=0.2, adaptation_coefficient=1.01)

Bases: object

adapt(distance)
get_stats()
class core.common.random.AnnealedGaussianProcess(mu, sigma, sigma_min, n_steps_annealing)

Bases: core.common.random.RandomProcess

current_sigma
class core.common.random.GaussianWhiteNoiseProcess(mu=0.0, sigma=1.0, sigma_min=None, n_steps_annealing=1000, size=1)

Bases: core.common.random.AnnealedGaussianProcess

sample()
class core.common.random.OrnsteinUhlenbeckProcess(theta, mu=0.0, sigma=1.0, dt=0.01, size=1, sigma_min=None, n_steps_annealing=1000)

Bases: core.common.random.AnnealedGaussianProcess

reset_states()
sample()
class core.common.random.RandomProcess

Bases: object

reset_states()
class core.common.random.SimpleOUNoise(size=1, mu=0, theta=0.05, sigma=0.25)

Bases: object

Ornstein-Uhlenbeck process.

reset_states()

Reset the internal state (= noise) to mean (mu).

sample()

Update internal state and return it as a noise sample.
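As a sketch of how these noise processes are typically used for exploration with continuous actions (e.g. in DDPG-style agents), noise sampled per step is added to the deterministic action. The action bounds and the example action below are placeholders.

    import numpy as np
    from core.common.random import SimpleOUNoise

    noise = SimpleOUNoise(size=2, mu=0.0, theta=0.05, sigma=0.25)
    noise.reset_states()                     # reset the internal state at episode start

    def explore(action):
        # Add temporally correlated noise to a (placeholder) deterministic action and
        # keep the result inside assumed action bounds of [-1, 1].
        return np.clip(action + noise.sample(), -1.0, 1.0)

    print(explore(np.array([0.0, 0.5])))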

core.common.random.ddpg_distance_metric(actions1, actions2)

Compute the “distance” between actions taken by two policies at the same states. Expects numpy arrays.

core.common.util module

class core.common.util.AdditionalUpdatesOptimizer(optimizer, additional_updates)

Bases: keras.optimizers.Optimizer

get_config()
get_updates(params, loss)
class core.common.util.OPS

Bases: enum.Enum

An enumeration.

ACTION_REPETITION = '-act-rept'
BATCH_SIZE = '-batch-size'
BUFFER_SIZE = '-buf_size'
DISCOUNT_FACTOR = '-df'
DOUBLE = '-double'
DUELING = '-dueling'
ENTROPY_LOSS = '-ent_loss'
EPOCHS = '-epo'
FRAMES_PER_STEP = '-frms-per-step'
GAMMA = '-gamma'
LEARNING_ACTOR_RATE = '-actor-lr'
LEARNING_CRITIC_RATE = '-critic-lr'
LEARNING_RATE = '-lr'
MARGINAL_SPACE = '-m-s'
MOVE_ANG = '-move_a'
MOVE_DIST = '-move_d'
NO_GUI = '-no-gui'
N_STEPS = '-nsteps'
OU_SIGMA = '-ou-sm'
OU_THETA = '-ou-tt'
PER = '-PER'
POLICY = '-p'
PURE_ACTION_RATIO = '-ratio-pure-action'
REPLAY_MEMORY_SIZE = '-repm-size'
REWARD_HEIGHT_RANK_WEIGHT = '-r-w'
REWARD_SCALE = '-rc'
REWARD_VERSION = '-rv'
TARGET_NETWORK_UPDATE_INTERVAL = '-tn-u-invl'
TIME_PENALTY_WEIGHT = '-p-t'
TIME_WINDOW = '-t-w'
USE_PARAMETERIZED_NOISE = '-p-noise'
WINDOW_LENGTH = '-window-length'
class core.common.util.PopArtLayer(beta=0.0001, epsilon=0.0001, stable_rate=0.1, min_steps=1000, **kwargs)

Bases: keras.engine.base_layer.Layer

Automatic network output scale adjuster that keeps the output of the network consistent as the moving average and variance of discounted returns are updated. Part of the PopArt algorithm described in DeepMind’s paper “Learning values across many orders of magnitude” (https://arxiv.org/abs/1602.07714)

build(input_shape)

Creates the layer weights.

Must be implemented on all layers that have weights.

# Arguments
input_shape: Keras tensor (future input to layer)
or list/tuple of Keras tensors to reference for weight shape computations.
call(inputs, **kwargs)

This is where the layer’s logic lives.

# Arguments
inputs: Input tensor, or list/tuple of input tensors.
**kwargs: Additional keyword arguments.
# Returns
A tensor or list/tuple of tensors.
compute_output_shape(input_shape)

Computes the output shape of the layer.

Assumes that the layer will be built to match that input shape provided.

# Arguments
input_shape: Shape tuple (tuple of integers) or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
# Returns
An output shape tuple.
de_normalize(x: numpy.ndarray) → numpy.ndarray

Converts previously normalized data into original values.

pop_art_update(x: numpy.ndarray) → Tuple[float, float]

Performs ART (Adaptively Rescaling Targets) update, adjusting normalization parameters with respect to new targets x. Updates running mean, mean of squares and returns new mean and standard deviation for later use.

update_and_normalize(x: numpy.ndarray) → Tuple[numpy.ndarray, float, float]

Normalizes given tensor x and updates parameters associated with PopArt: running means (art) and network’s output scaling (pop).
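For intuition, the ART half of the algorithm (the running statistics of the targets) can be sketched outside Keras as below. This is a standalone numerical sketch of the normalization bookkeeping only; it omits the POP part, where the layer rescales its own weights so that de-normalized outputs are preserved, and it is not the layer's actual implementation.

    import numpy as np

    class PopArtSketch:
        """Standalone sketch of the ART bookkeeping (running moments of the targets)."""

        def __init__(self, beta=1e-4):
            self.beta = beta
            self.mean, self.mean_sq = 0.0, 1.0   # running first and second moments

        @property
        def std(self):
            return float(np.sqrt(max(self.mean_sq - self.mean ** 2, 1e-8)))

        def pop_art_update(self, targets):
            # Exponential moving averages of the targets and their squares.
            for g in np.asarray(targets, dtype=np.float64).ravel():
                self.mean = (1.0 - self.beta) * self.mean + self.beta * g
                self.mean_sq = (1.0 - self.beta) * self.mean_sq + self.beta * g * g
            return self.mean, self.std

        def update_and_normalize(self, targets):
            mean, std = self.pop_art_update(targets)
            return (np.asarray(targets) - mean) / std, mean, std

        def de_normalize(self, y):
            return np.asarray(y) * self.std + self.mean

    norm = PopArtSketch()
    normed, mean, std = norm.update_and_normalize([10.0, 12.0, 8.0])
    print(norm.de_normalize(normed))   # approximately recovers the original targets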

class core.common.util.RunningMeanStd(my, epsilon=0.01, shape=())

Bases: object

update(x)
core.common.util.auto_executor(params, filename)

Usage:

  1. Add the option you want to use to the OPS enum class.
  2. Put the values for each option into params as a list, e.g. params[OPS.<added option>.value] = [list of values to run].
  3. Pass the filename of the script to execute.
  4. Handle the argument parsing in the invoked script, for example:

    import argparse
    from core.common.util import OPS

    parser = argparse.ArgumentParser(description='DQN Configuration including setting dqn / double dqn / double dueling dqn')

    parser.add_argument(OPS.NO_GUI.value, help='gui', type=bool, default=False)
    parser.add_argument(OPS.DOUBLE.value, help='double dqn', default=False, action='store_true')
    parser.add_argument(OPS.DUELING.value, help='dueling dqn', default=False, action='store_true')
    parser.add_argument(OPS.DRQN.value, help='drqn', default=False, action='store_true')
    parser.add_argument(OPS.BATCH_SIZE.value, type=int, default=128, help='batch size')
    parser.add_argument(OPS.REPLAY_MEMORY_SIZE.value, type=int, default=8000, help='replay memory size')
    parser.add_argument(OPS.LEARNING_RATE.value, type=float, default=0.001, help='learning rate')
    parser.add_argument(OPS.TARGET_NETWORK_UPDATE_INTERVAL.value, type=int, default=60, help='target_network_update_interval')
    # ... add further options as needed:
    # parser.add_argument('<added option>', type=<type>, default=<default>, help='<help text>')

    args = parser.parse_args()

    dict_args = vars(args)
    post_fix = ''
    for k in dict_args.keys():
        if k == 'no_gui':
            continue
        post_fix += '_' + k + '_' + str(dict_args[k])

  5. Use args as appropriate in the script.
  6. It is recommended to append post_fix to the output file name.
class core.common.util.cLogger

Bases: object

static getLogger(loggerName='not_init', loggerFile=None)
core.common.util.clipped_masked_error(y_true, y_pred, mask, delta_clip)
core.common.util.clone_model(model, custom_objects={})

    model_copy = keras.models.clone_model(model)
    model_copy.set_weights(model.get_weights())

core.common.util.clone_optimizer(optimizer)
core.common.util.denormalize(x, stats)
core.common.util.display_param_dic(params={})
core.common.util.display_param_list(params=[])
core.common.util.gen_agent_params(params={}, filepath=None)
core.common.util.get_kv_from_agent(agent)
core.common.util.get_logger(logger_name, logger_level, use_stream_handler=True, use_file_handler=False)
core.common.util.get_soft_target_model_updates(target, source, tau)
core.common.util.gradients(loss, variables, grad_ys)

Returns the gradients of loss w.r.t. variables.

# Arguments
loss: Scalar tensor to minimize.
variables: List of variables.
# Returns
A gradients tensor.
core.common.util.gumbel_softmax(logits, temperature=1, hard=False)

Referenced from https://github.com/ericjang/gumbel-softmax

Sample from the Gumbel-Softmax distribution and optionally discretize.

Args:
logits: [batch_size, n_class] unnormalized log-probs
temperature: non-negative scalar
hard: if True, take argmax, but differentiate w.r.t. soft sample y
Returns:
[batch_size, n_class] sample from the Gumbel-Softmax distribution. If hard=True, then the returned sample will be one-hot, otherwise it will be a probability distribution that sums to 1 across classes.
core.common.util.gumbel_softmax_sample(logits, temperature)

Draw a sample from the Gumbel-Softmax distribution
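The two functions above are Keras/TensorFlow graph ops; for intuition, here is a standalone NumPy sketch of the same sampling scheme (following the referenced ericjang implementation). The function names are illustrative and not part of the module.

    import numpy as np

    def sample_gumbel_np(shape, eps=1e-20):
        """Sample Gumbel(0, 1) noise via the inverse-CDF trick."""
        u = np.random.uniform(low=0.0, high=1.0, size=shape)
        return -np.log(-np.log(u + eps) + eps)

    def gumbel_softmax_np(logits, temperature=1.0):
        """Draw a relaxed one-hot sample from the Gumbel-Softmax distribution."""
        y = (logits + sample_gumbel_np(np.shape(logits))) / temperature
        e = np.exp(y - np.max(y, axis=-1, keepdims=True))   # numerically stable softmax
        return e / np.sum(e, axis=-1, keepdims=True)

    print(gumbel_softmax_np(np.log([0.1, 0.3, 0.6]), temperature=0.5))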

core.common.util.huber_loss(y_true, y_pred, clip_value)
core.common.util.mmdd24hhmmss()
core.common.util.normalize(x, stats)
core.common.util.pickle_to_plot(pickle_name, png_filename, title='', time_window=100, overwrite=False)
core.common.util.reward_moving_avg_plot(y_data_list=[], title='', label='reward', window=100, filepath='plot.png', shadow_color_index=5)

Save a reward plot.

:param y_data_list: reward list
:param window: moving average time window
:param filepath: plot file to be saved
:param shadow_color_index: color palette index number
:return: None

core.common.util.reward_quantile_plot(y_data_list=[], title=None, label='reward', window=100, filepath='plot.png', shadow_color_index=5)
Parameters:
  • y_data_list – reward list
  • title – optional, title
  • label – y axis label
  • window – moving average time window
  • filepath – plot file to be saved
  • shadow_color_index – color palette index number

Returns:
None

core.common.util.sample_gumbel(shape, eps=1e-20)

Sample from Gumbel(0, 1)

core.common.util.save_ci_graph(y_data_list=[], title='Some Graph', xlabel='episode', ylabel='reward', window=100, filepath='plot.png', y_data=[], y_index=[], figsize=(12, 8), title_font_size=20)

Save a reward plot.

:param window: moving average time window
:param filepath: plot file to be saved
:return: None

core.common.util.save_ci_graph_from_tuple(y_data_list=[], graph_title=[], window=1000, filepath='2plot.png', y_data_legend=[], y_index=[], figsize=(12, 8), title_font_size=20)
core.common.util.save_plot(FILE_NAME)
core.common.util.smoothL1(y_true, y_pred)

https://stackoverflow.com/questions/44130871/keras-smooth-l1-loss
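For reference, smooth L1 (equivalent to the Huber loss with delta = 1) can be sketched with the Keras backend as in the linked answer. This is an illustrative reimplementation under that assumption, not necessarily identical to the module's version.

    import keras.backend as K

    def smooth_l1_sketch(y_true, y_pred):
        # Quadratic for |error| < 1, linear beyond; Huber loss with delta = 1.
        error = K.abs(y_true - y_pred)
        quadratic = K.clip(error, 0.0, 1.0)
        linear = error - quadratic
        return K.mean(0.5 * K.square(quadratic) + linear, axis=-1)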

core.common.util.yyyymmdd24hhmmss()

Module contents