core.common package¶
Submodules¶
core.common.agent module¶
-
class
core.common.agent.
Agent
(processor=None)¶ Bases:
object
Abstract base class for all implemented agents.
Each agent interacts with the environment (as defined by the Env class) by first observing the state of the environment. Based on this observation the agent changes the environment by performing an action.
Do not use this abstract base class directly but instead use one of the concrete agents implemented. Each agent realizes a reinforcement learning algorithm. Since all agents conform to the same interface, you can use them interchangeably.
To implement your own agent, you have to implement the following methods:
- forward
- backward
- compile
- load_weights
- save_weights
- # Arguments
- processor (Processor instance): See [Processor](#processor) for details.
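A minimal sketch of a custom agent, assuming only the interface described above (forward, backward, compile, load_weights, save_weights); the class name and the random action choice are illustrative, not part of the library:
import random
from core.common.agent import Agent

class MyRandomAgent(Agent):
    """Toy agent that acts uniformly at random, showing which methods to override."""
    def __init__(self, nb_actions, processor=None):
        super(MyRandomAgent, self).__init__(processor=processor)
        self.nb_actions = nb_actions

    def compile(self, optimizer, metrics=[]):
        pass                                        # no trainable model in this toy example

    def forward(self, observation):
        return random.randrange(self.nb_actions)    # choose the next action

    def backward(self, reward, terminal):
        return []                                   # a real agent would update its model and return metric values

    def load_weights(self, filepath, filename):
        pass

    def save_weights(self, filepath, filename=None, overwrite=False):
        pass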
-
append_replay_memory
(reward, terminal)¶ Saves states to the replay buffer after the action returned by forward has been executed.
- # Arguments
- reward (float): The observed reward after executing the action returned by forward.
- terminal (boolean): True if the new state of the environment is terminal.
- # Returns
- List of metrics values
-
backward
(reward, terminal)¶ Updates the agent after having executed the action returned by forward. If the policy is implemented by a neural network, this corresponds to a weight update using back-prop.
- # Arguments
- reward (float): The observed reward after executing the action returned by forward.
- terminal (boolean): True if the new state of the environment is terminal.
- # Returns
- List of metrics values
-
compile
(optimizer, metrics=[])¶ Compiles an agent and the underlying models to be used for training and testing.
- # Arguments
- optimizer (keras.optimizers.Optimizer instance): The optimizer to be used during training.
- metrics (list of functions lambda y_true, y_pred: metric): The metrics to run during training.
-
forward
(observation)¶ Takes an observation from the environment and returns the action to be taken next. If the policy is implemented by a neural network, this corresponds to a forward (inference) pass.
- # Argument
- observation (object): The current observation from the environment.
- # Returns
- The next action to be executed in the environment.
-
layers
¶ Returns all layers of the underlying model(s).
If the concrete implementation uses multiple internal models, this method returns them in a concatenated list.
- # Returns
- A list of the model’s layers
-
load_weights
(filepath, filename)¶ Loads the weights of an agent from an HDF5 file.
- # Arguments
- filepath (str or list): The path to the HDF5 file(s). For algorithms using multiple models, this can be a list of paths, one per model.
- filename (str or list): The name of the HDF5 file(s). For algorithms using multiple models, this can be a list of names, one per model.
-
reset_states
()¶ Resets all internally kept states after an episode is completed.
-
run
(env, nb_steps, shared_cb_params={}, train_mode=True, action_repetition=1, callbacks=None, verbose=1, visualize=False, nb_max_start_steps=0, random_policy=None, log_interval=10000, nb_max_episode_steps=None, nb_episodes=0)¶ Trains the agent on the given environment.
- # Arguments
- env: Environment instance that the agent interacts with.
- nb_steps (integer): Number of training steps to be performed.
- shared_cb_params (map): Shared (key, value) parameters that can be used in callbacks.
- action_repetition (integer): Number of times the agent repeats the same action without observing the environment again.
- callbacks (list of keras.callbacks.Callback or rl.callbacks.Callback instances): List of callbacks to apply during training. See [callbacks](/callbacks) for details.
- verbose (integer): 0 for no logging, 1 for interval logging (compare log_interval), 2 for episode logging.
- visualize (boolean): If True, the environment is visualized during training. However, this is likely going to slow down training significantly and is thus intended to be a debugging instrument.
- nb_max_start_steps (integer): Number of maximum steps that the agent performs at the beginning of each episode using start_step_policy. Notice that this is an upper limit since the exact number of steps to be performed is sampled uniformly from [0, max_start_steps] at the beginning of each episode.
- start_step_policy (lambda observation: action): The policy to follow if nb_max_start_steps > 0. If set to None, a random action is performed.
- log_interval (integer): If verbose = 1, the number of steps that are considered to be an interval.
- nb_max_episode_steps (integer): Number of steps per episode that the agent performs before automatically resetting the environment. Set to None if each episode should run (potentially indefinitely) until the environment signals a terminal state.
- # Returns
- A keras.callbacks.History instance that recorded the entire training process.
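A hedged usage sketch of run, assuming a Gym-style environment is accepted and using a placeholder for a concrete, already-implemented agent class (SomeConcreteAgent is not a real name in this package):
import gym
from keras.optimizers import Adam

env = gym.make('CartPole-v1')
agent = SomeConcreteAgent(...)                 # placeholder: any concrete Agent subclass
agent.compile(Adam(lr=1e-3), metrics=[])       # compile before running
history = agent.run(env, nb_steps=50000, train_mode=True,
                    action_repetition=1, verbose=1,
                    log_interval=10000, nb_max_episode_steps=500)
# history is a keras.callbacks.History instance recording the run.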
-
save_weights
(filepath, filename=None, overwrite=False)¶ Saves the weights of an agent as an HDF5 file.
- # Arguments
- filepath (str): The path to where the weights should be saved.
- overwrite (boolean): If False and filepath already exists, raises an error.
core.common.callback module¶
-
class
core.common.callback.
Callback
(agent=None, *args, **kwargs)¶ Bases:
keras.callbacks.Callback
-
on_action_begin
(action, logs={})¶ Called at beginning of each action
-
on_action_end
(action, logs={})¶ Called at end of each action
-
on_episode_begin
(episode, logs={})¶ Called at beginning of each episode
-
on_episode_end
(episode, logs={})¶ Called at end of each episode
-
on_step_begin
(step, logs={})¶ Called at beginning of each step
-
on_step_end
(step, logs={})¶ Called at end of each step
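A small sketch of a custom callback built only on the hook names documented above; the assumption that the step logs carry a 'reward' entry is illustrative and should be checked against the actual logs passed by the agent:
from core.common.callback import Callback

class EpisodeRewardLogger(Callback):
    """Illustrative callback that accumulates the reward of each episode."""
    def on_episode_begin(self, episode, logs={}):
        self.episode_reward = 0.0

    def on_step_end(self, step, logs={}):
        self.episode_reward += logs.get('reward', 0.0)   # assumed log key

    def on_episode_end(self, episode, logs={}):
        print('episode {} finished with total reward {:.2f}'.format(episode, self.episode_reward))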
-
-
class
core.common.callback.
CallbackList
(callbacks=None, queue_length=10)¶ Bases:
keras.callbacks.CallbackList
-
on_action_begin
(action, logs={})¶ Called at beginning of each action for each callback in callbackList
-
on_action_end
(action, logs={})¶ Called at end of each action for each callback in callbackList
-
on_episode_begin
(episode, logs={})¶ Called at beginning of each episode for each callback in callbackList
-
on_episode_end
(episode, logs={})¶ Called at end of each episode for each callback in callbackList
-
on_step_begin
(step, logs={})¶ Called at beginning of each step for each callback in callbackList
-
on_step_end
(step, logs={})¶ Called at end of each step for each callback in callbackList
-
core.common.cartpole_dqn_PER module¶
-
class
core.common.cartpole_dqn_PER.
DQNAgent
(state_size, action_size)¶ Bases:
object
-
append_sample
(state, action, reward, next_state, done)¶
-
build_model
()¶
-
get_action
(state)¶
-
optimizer
()¶
-
train_model
(beta)¶
-
update_target_model
()¶
-
-
class
core.common.cartpole_dqn_PER.
MinSegmentTree
(capacity)¶ Bases:
core.common.cartpole_dqn_PER.SegmentTree
-
min
(start=0, end=None)¶ Returns min(arr[start], …, arr[end])
-
-
class
core.common.cartpole_dqn_PER.
PrioritizedReplayBuffer
(size, alpha)¶ Bases:
core.common.cartpole_dqn_PER.ReplayBuffer
-
add
(*args, **kwargs)¶ See ReplayBuffer.add
-
sample
(batch_size, beta)¶ Sample a batch of experiences. Compared to ReplayBuffer.sample it also returns importance weights and idxes of the sampled experiences.
- # Arguments
- batch_size (int): How many transitions to sample.
- beta (float): To what degree to use importance weights (0 - no corrections, 1 - full correction).
- # Returns
- obs_batch (np.array): Batch of observations.
- act_batch (np.array): Batch of actions executed given obs_batch.
- rew_batch (np.array): Rewards received as results of executing act_batch.
- next_obs_batch (np.array): Next set of observations seen after executing act_batch.
- done_mask (np.array): done_mask[i] = 1 if executing act_batch[i] resulted in the end of an episode and 0 otherwise.
- weights (np.array): Array of shape (batch_size,) and dtype np.float32 denoting the importance weight of each sampled transition.
- idxes (np.array): Array of shape (batch_size,) and dtype np.int32 giving the indexes in the buffer of the sampled experiences.
-
update_priorities
(idxes, priorities)¶ Update priorities of sampled transitions. Sets the priority of the transition at index idxes[i] in the buffer to priorities[i].
- # Arguments
- idxes ([int]): List of idxes of sampled transitions.
- priorities ([float]): List of updated priorities corresponding to the transitions at the sampled idxes.
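A hedged usage sketch of PrioritizedReplayBuffer, assuming add takes the same transition tuple as ReplayBuffer.add and that sample returns the batch followed by weights and idxes, as documented above:
import numpy as np
from core.common.cartpole_dqn_PER import PrioritizedReplayBuffer

buffer = PrioritizedReplayBuffer(size=10000, alpha=0.6)

# Store a transition (obs_t, action, reward, obs_tp1, done).
buffer.add(np.zeros(4), 0, 1.0, np.ones(4), False)

# Sample with importance-sampling correction controlled by beta.
obs_b, act_b, rew_b, next_obs_b, done_b, weights, idxes = buffer.sample(batch_size=1, beta=0.4)

# After computing TD errors for the sampled transitions, refresh their priorities.
td_errors = np.array([0.5])                         # placeholder TD errors
buffer.update_priorities(idxes, np.abs(td_errors) + 1e-6)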
-
-
class
core.common.cartpole_dqn_PER.
ReplayBuffer
(size)¶ Bases:
object
-
add
(obs_t, action, reward, obs_tp1, done)¶
-
sample
(batch_size)¶ Sample a batch of experiences.
- # Arguments
- batch_size (int): How many transitions to sample.
- # Returns
- obs_batch (np.array): Batch of observations.
- act_batch (np.array): Batch of actions executed given obs_batch.
- rew_batch (np.array): Rewards received as results of executing act_batch.
- next_obs_batch (np.array): Next set of observations seen after executing act_batch.
- done_mask (np.array): done_mask[i] = 1 if executing act_batch[i] resulted in the end of an episode and 0 otherwise.
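A minimal usage sketch of the plain ReplayBuffer, based only on the add and sample signatures documented above; the transition contents are made up:
import numpy as np
from core.common.cartpole_dqn_PER import ReplayBuffer

buffer = ReplayBuffer(size=5000)
for _ in range(100):
    obs_t = np.random.rand(4)
    obs_tp1 = np.random.rand(4)
    buffer.add(obs_t, action=0, reward=1.0, obs_tp1=obs_tp1, done=False)

obs_b, act_b, rew_b, next_obs_b, done_b = buffer.sample(batch_size=32)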
-
-
class
core.common.cartpole_dqn_PER.
SegmentTree
(capacity, operation, neutral_element)¶ Bases:
object
-
reduce
(start=0, end=None)¶ Returns the result of applying self.operation to a contiguous subsequence of the array: self.operation(arr[start], operation(arr[start+1], operation(... arr[end]))).
- # Arguments
- start (int): Beginning of the subsequence.
- end (int): End of the subsequence.
- # Returns
- reduced (obj): Result of reducing self.operation over the specified range of array elements.
-
-
class
core.common.cartpole_dqn_PER.
SumSegmentTree
(capacity)¶ Bases:
core.common.cartpole_dqn_PER.SegmentTree
-
find_prefixsum_idx
(prefixsum)¶ Find the highest index i in the array such that sum(arr[0] + arr[1] + ... + arr[i - 1]) <= prefixsum.
If array values are probabilities, this function allows sampling indexes according to the discrete probability distribution efficiently.
- # Arguments
- prefixsum (float): Upper bound on the sum of the array prefix.
- # Returns
- idx (int): Highest index satisfying the prefixsum constraint.
-
sum
(start=0, end=None)¶ Returns arr[start] + … + arr[end]
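A short sketch of priority-proportional sampling with SumSegmentTree, assuming item assignment (tree[i] = value) is supported by the underlying SegmentTree as in the OpenAI baselines implementation this module mirrors:
import random
from core.common.cartpole_dqn_PER import SumSegmentTree

capacity = 8                       # assumed to be a power of two, as in the baselines layout
tree = SumSegmentTree(capacity)
priorities = [0.1, 0.4, 0.2, 0.3]
for i, p in enumerate(priorities):
    tree[i] = p                    # assumes SegmentTree.__setitem__

# Draw an index with probability proportional to its priority.
mass = random.random() * tree.sum()
idx = tree.find_prefixsum_idx(mass)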
-
-
core.common.cartpole_dqn_PER.
load_sample
(memory, file_path)¶
-
core.common.cartpole_dqn_PER.
save_sample
(memory, file_path)¶
core.common.memory module¶
-
class
core.common.memory.
Experience
(state0, action, reward, state1, terminal1)¶ Bases:
tuple
-
action
¶ Alias for field number 1
-
reward
¶ Alias for field number 2
-
state0
¶ Alias for field number 0
-
state1
¶ Alias for field number 3
-
terminal1
¶ Alias for field number 4
-
-
class
core.common.memory.
Memory
(window_length, ignore_episode_boundaries=False)¶ Bases:
object
-
append
(observation, action, reward, terminal, training=True)¶
-
get_config
()¶ Return configuration (window_length, ignore_episode_boundaries) for Memory
- # Return
- A dict with keys window_length and ignore_episode_boundaries
-
get_recent_state
(current_observation)¶ Return list of last observations
- # Argument
- current_observation (object): Last observation
- # Returns
- A list of the last observations
-
sample
(batch_size, batch_idxs=None)¶
-
-
class
core.common.memory.
RingBuffer
(maxlen)¶ Bases:
object
-
append
(v)¶ Append an element to the buffer
- # Argument
- v (object): Element to append
-
length
()¶ Return the length of the internal deque
- # Argument
- None
- # Returns
- The length of the deque
-
-
core.common.memory.
sample_batch_indexes
(low, high, size)¶ Return a sample of (size) unique elements between low and high
- # Argument
- low (int): The minimum value for our samples
- high (int): The maximum value for our samples
- size (int): The number of samples to pick
- # Returns
- A list of samples of length size, with values between low and high
-
core.common.memory.
zeroed_observation
(observation)¶ Return an array of zeros with same shape as given observation
- # Argument
- observation (list): List of observation
- # Return
- A np.ndarray of zeros with observation.shape
core.common.policy module¶
-
class
core.common.policy.
Policy
¶ Bases:
object
Abstract base class for all implemented policies.
Each policy helps with the selection of the action to take in an environment.
Do not use this abstract base class directly but instead use one of the concrete policies implemented. To implement your own policy, you have to implement the following methods:
- select_action
- # Arguments
- agent (rl.core.Agent): Agent used
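A minimal sketch of a custom policy; the assumption that select_action receives per-action value estimates through a q_values keyword argument is illustrative and depends on the agent that calls it:
import numpy as np
from core.common.policy import Policy

class EpsGreedyPolicy(Policy):
    """Illustrative epsilon-greedy policy."""
    def __init__(self, eps=0.1):
        super(EpsGreedyPolicy, self).__init__()
        self.eps = eps

    def select_action(self, q_values=None, **kwargs):
        # 'q_values' is assumed to be a 1-D array of per-action value estimates.
        if np.random.uniform() < self.eps:
            return np.random.randint(len(q_values))
        return int(np.argmax(q_values))

    def get_config(self):
        config = super(EpsGreedyPolicy, self).get_config()
        config['eps'] = self.eps
        return config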
-
get_config
()¶ Return configuration of the policy
- # Returns
- Configuration as dict
-
metrics
¶
-
metrics_names
¶
-
on_episode_end
(episode, logs={})¶
-
reset_states
()¶
-
select_action
(**kwargs)¶
core.common.processor module¶
-
class
core.common.processor.
Processor
¶ Bases:
object
Abstract base class for implementing processors.
A processor acts as a coupling mechanism between an Agent and its Env. This can be necessary if your agent has different requirements with respect to the form of the observations, actions, and rewards of the environment. By implementing a custom processor, you can effectively translate between the two without having to change the underlying implementation of the agent or environment.
Do not use this abstract base class directly but instead use one of the concrete implementations or write your own.
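A hedged sketch of a concrete processor; the casting, clipping, and pass-through choices below are purely illustrative:
import numpy as np
from core.common.processor import Processor

class ClippingProcessor(Processor):
    """Illustrative processor that rescales observations and clips rewards."""
    def process_observation(self, observation, state_size=None):
        # Cast to float32; a real processor might also resize or stack frames.
        return np.asarray(observation, dtype=np.float32)

    def process_reward(self, reward):
        # Clip rewards to [-1, 1], a common trick for DQN-style agents.
        return float(np.clip(reward, -1.0, 1.0))

    def process_action(self, action):
        # Pass the action through unchanged.
        return action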
-
metrics
¶ The metrics of the processor, which will be reported during training.
- # Returns
- List of lambda y_true, y_pred: metric functions.
-
metrics_names
¶ The human-readable names of the agent’s metrics. Must return as many names as there are metrics (see also compile).
-
process_action
(action)¶ Processes an action predicted by an agent before it is executed in the environment.
- # Arguments
- action (int): Action given to the environment
- # Returns
- Processed action given to the environment
-
process_info
(info)¶ Processes the info as obtained from the environment for use in an agent and returns it.
- # Arguments
- info (dict): An info as obtained by the environment
- # Returns
- Processed info obtained from the environment
-
process_observation
(observation, state_size=None)¶ Processes the observation as obtained from the environment for use in an agent and returns it.
- # Arguments
- observation (object): An observation as obtained by the environment
- # Returns
- Processed observation obtained from the environment
-
process_reward
(reward)¶ Processes the reward as obtained from the environment for use in an agent and returns it.
- # Arguments
- reward (float): A reward as obtained by the environment
- # Returns
- Processed reward obtained from the environment
-
process_state_batch
(batch)¶ Processes an entire batch of states and returns it.
- # Arguments
- batch (list): List of states
- # Returns
- Processed list of states
-
process_step
(observation, reward, done, info)¶ Processes an entire step by applying the processor to the observation, reward, and info arguments.
- # Arguments
- observation (object): An observation as obtained by the environment.
- reward (float): A reward as obtained by the environment.
- done (boolean): True if the environment is in a terminal state, False otherwise.
- info (dict): The debug info dictionary as obtained by the environment.
- # Returns
- The tuple (observation, reward, done, info) with all elements after being processed.
-
core.common.random module¶
-
class
core.common.random.
AdaptiveParamNoiseSpec
(initial_stddev=0.1, desired_action_stddev=0.2, adaptation_coefficient=1.01)¶ Bases:
object
-
adapt
(distance)¶
-
get_stats
()¶
-
-
class
core.common.random.
AnnealedGaussianProcess
(mu, sigma, sigma_min, n_steps_annealing)¶ Bases:
core.common.random.RandomProcess
-
current_sigma
¶
-
-
class
core.common.random.
GaussianWhiteNoiseProcess
(mu=0.0, sigma=1.0, sigma_min=None, n_steps_annealing=1000, size=1)¶ Bases:
core.common.random.AnnealedGaussianProcess
-
sample
()¶
-
-
class
core.common.random.
OrnsteinUhlenbeckProcess
(theta, mu=0.0, sigma=1.0, dt=0.01, size=1, sigma_min=None, n_steps_annealing=1000)¶ Bases:
core.common.random.AnnealedGaussianProcess
-
reset_states
()¶
-
sample
()¶
-
-
class
core.common.random.
SimpleOUNoise
(size=1, mu=0, theta=0.05, sigma=0.25)¶ Bases:
object
Ornstein-Uhlenbeck process.
-
reset_states
()¶ Reset the internal state (= noise) to mean (mu).
-
sample
()¶ Update internal state and return it as a noise sample.
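A usage sketch of the noise processes for exploration in a continuous-action agent; the deterministic action below is a placeholder for an actor network's output:
import numpy as np
from core.common.random import OrnsteinUhlenbeckProcess

# Temporally correlated exploration noise, commonly paired with DDPG-style agents.
noise = OrnsteinUhlenbeckProcess(theta=0.15, mu=0.0, sigma=0.2, size=2)
deterministic_action = np.array([0.3, -0.1])    # placeholder actor output
exploratory_action = deterministic_action + noise.sample()
noise.reset_states()                            # typically called at episode boundaries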
-
-
core.common.random.
ddpg_distance_metric
(actions1, actions2)¶ Compute the “distance” between actions taken by two policies at the same states. Expects numpy arrays.
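The docstring above does not fix the exact metric; a common choice for adapting parameter-space noise (and a plausible reading of this helper) is the root-mean-square difference between the two action batches, sketched here with made-up data:
import numpy as np

actions1 = np.random.uniform(-1, 1, size=(64, 2))   # actions from the unperturbed policy
actions2 = np.random.uniform(-1, 1, size=(64, 2))   # actions from the noise-perturbed policy
distance = np.sqrt(np.mean(np.square(actions1 - actions2)))
# AdaptiveParamNoiseSpec.adapt(distance) can then grow or shrink the parameter noise.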
core.common.util module¶
-
class
core.common.util.
AdditionalUpdatesOptimizer
(optimizer, additional_updates)¶ Bases:
keras.optimizers.Optimizer
-
get_config
()¶
-
get_updates
(params, loss)¶
-
-
class
core.common.util.
OPS
¶ Bases:
enum.Enum
An enumeration.
-
ACTION_REPETITION
= '-act-rept'¶
-
BATCH_SIZE
= '-batch-size'¶
-
BUFFER_SIZE
= '-buf_size'¶
-
DISCOUNT_FACTOR
= '-df'¶
-
DOUBLE
= '-double'¶
-
DUELING
= '-dueling'¶
-
ENTROPY_LOSS
= '-ent_loss'¶
-
EPOCHS
= '-epo'¶
-
FRAMES_PER_STEP
= '-frms-per-step'¶
-
GAMMA
= '-gamma'¶
-
LEARNING_ACTOR_RATE
= '-actor-lr'¶
-
LEARNING_CRITIC_RATE
= '-critic-lr'¶
-
LEARNING_RATE
= '-lr'¶
-
MARGINAL_SPACE
= '-m-s'¶
-
MOVE_ANG
= '-move_a'¶
-
MOVE_DIST
= '-move_d'¶
-
NO_GUI
= '-no-gui'¶
-
N_STEPS
= '-nsteps'¶
-
OU_SIGMA
= '-ou-sm'¶
-
OU_THETA
= '-ou-tt'¶
-
PER
= '-PER'¶
-
POLICY
= '-p'¶
-
PURE_ACTION_RATIO
= '-ratio-pure-action'¶
-
REPLAY_MEMORY_SIZE
= '-repm-size'¶
-
REWARD_HEIGHT_RANK_WEIGHT
= '-r-w'¶
-
REWARD_SCALE
= '-rc'¶
-
REWARD_VERSION
= '-rv'¶
-
TARGET_NETWORK_UPDATE_INTERVAL
= '-tn-u-invl'¶
-
TIME_PENALTY_WEIGHT
= '-p-t'¶
-
TIME_WINDOW
= '-t-w'¶
-
USE_PARAMETERIZED_NOISE
= '-p-noise'¶
-
WINDOW_LENGTH
= '-window-length'¶
-
-
class
core.common.util.
PopArtLayer
(beta=0.0001, epsilon=0.0001, stable_rate=0.1, min_steps=1000, **kwargs)¶ Bases:
keras.engine.base_layer.Layer
Automatic network output scale adjuster, which is supposed to keep the output of the network up to date as we keep updating moving average and variance of discounted returns. Part of the PopArt algorithm described in DeepMind’s paper “Learning values across many orders of magnitude” (https://arxiv.org/abs/1602.07714)
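A simplified sketch of the bookkeeping behind PopArt (not this layer's exact code): running mean and mean of squares are tracked with rate beta, targets are normalized with the resulting mean/stddev (ART), and the output layer's kernel and bias are rescaled so that un-normalized predictions are preserved (POP):
import numpy as np

class PopArtSketch:
    """Illustrative PopArt statistics update, assuming scalar outputs."""
    def __init__(self, beta=1e-4):
        self.beta, self.mean, self.mean_sq = beta, 0.0, 1.0

    def std(self):
        return np.sqrt(max(self.mean_sq - self.mean ** 2, 1e-8))

    def update(self, targets, kernel, bias):
        old_mean, old_std = self.mean, self.std()
        # ART: adaptively rescale targets via running mean / mean of squares.
        self.mean = (1 - self.beta) * self.mean + self.beta * np.mean(targets)
        self.mean_sq = (1 - self.beta) * self.mean_sq + self.beta * np.mean(np.square(targets))
        new_std = self.std()
        # POP: preserve outputs precisely by rescaling the output layer.
        kernel = kernel * (old_std / new_std)
        bias = (old_std * bias + old_mean - self.mean) / new_std
        return (targets - self.mean) / new_std, kernel, bias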
-
build
(input_shape)¶ Creates the layer weights.
Must be implemented on all layers that have weights.
- # Arguments
- input_shape: Keras tensor (future input to layer) or list/tuple of Keras tensors to reference for weight shape computations.
-
call
(inputs, **kwargs)¶ This is where the layer’s logic lives.
- # Arguments
- inputs: Input tensor, or list/tuple of input tensors.
- **kwargs: Additional keyword arguments.
- # Returns
- A tensor or list/tuple of tensors.
-
compute_output_shape
(input_shape)¶ Computes the output shape of the layer.
Assumes that the layer will be built to match that input shape provided.
- # Arguments
- input_shape: Shape tuple (tuple of integers) or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
- # Returns
- An input shape tuple.
-
de_normalize
(x: numpy.ndarray) → numpy.ndarray¶ Converts previously normalized data into original values.
-
pop_art_update
(x: numpy.ndarray) → Tuple[float, float]¶ Performs ART (Adaptively Rescaling Targets) update, adjusting normalization parameters with respect to new targets x. Updates running mean, mean of squares and returns new mean and standard deviation for later use.
-
update_and_normalize
(x: numpy.ndarray) → Tuple[numpy.ndarray, float, float]¶ Normalizes given tensor x and updates parameters associated with PopArt: running means (art) and network’s output scaling (pop).
-
-
core.common.util.
auto_executor
(params, filename)¶ Usage:
1. Add the option you want to use to the OPS enum class.
2. Put the enum values into params as lists, e.g. params[OPS.YOUR_OPTION.value] = [list of values to run].
3. Pass the filename of the script to execute.
- Handle the arguments in the called script, for example:
import argparse
from core.common.util import OPS

parser = argparse.ArgumentParser(description='DQN Configuration including setting dqn / double dqn / double dueling dqn')
parser.add_argument(OPS.NO_GUI.value, help='gui', type=bool, default=False)
parser.add_argument(OPS.DOUBLE.value, help='double dqn', default=False, action='store_true')
parser.add_argument(OPS.DUELING.value, help='dueling dqn', default=False, action='store_true')
parser.add_argument(OPS.DRQN.value, help='drqn', default=False, action='store_true')
parser.add_argument(OPS.BATCH_SIZE.value, type=int, default=128, help='batch size')
parser.add_argument(OPS.REPLAY_MEMORY_SIZE.value, type=int, default=8000, help='replay memory size')
parser.add_argument(OPS.LEARNING_RATE.value, type=float, default=0.001, help='learning rate')
parser.add_argument(OPS.TARGET_NETWORK_UPDATE_INTERVAL.value, type=int, default=60, help='target_network_update_interval')
...
parser.add_argument('<added option>', type=<type>, default=<default value>, help='<help text>')

args = parser.parse_args()
dict_args = vars(args)
post_fix = ''
for k in dict_args.keys():
    if k == 'no_gui':
        continue
    post_fix += '_' + k + '_' + str(dict_args[k])
- Use args in the script as appropriate.
- It is helpful to append post_fix to the output file name.
-
class
core.common.util.
cLogger
¶ Bases:
object
-
static
getLogger
(loggerName='not_init', loggerFile=None)¶
-
-
core.common.util.
clipped_masked_error
(y_true, y_pred, mask, delta_clip)¶
-
core.common.util.
clone_model
(model, custom_objects={})¶
model_copy = keras.models.clone_model(model)
model_copy.set_weights(model.get_weights())
-
core.common.util.
clone_optimizer
(optimizer)¶
-
core.common.util.
denormalize
(x, stats)¶
-
core.common.util.
display_param_dic
(params={})¶
-
core.common.util.
display_param_list
(params=[])¶
-
core.common.util.
gen_agent_params
(params={}, filepath=None)¶
-
core.common.util.
get_kv_from_agent
(agent)¶
-
core.common.util.
get_logger
(logger_name, logger_level, use_stream_handler=True, use_file_handler=False)¶
-
core.common.util.
get_soft_target_model_updates
(target, source, tau)¶
-
core.common.util.
gradients
(loss, variables, grad_ys)¶ Returns the gradients of loss w.r.t. variables.
- # Arguments
- loss: Scalar tensor to minimize.
- variables: List of variables.
- # Returns
- A gradients tensor.
-
core.common.util.
gumbel_softmax
(logits, temperature=1, hard=False)¶ Referenced from https://github.com/ericjang/gumbel-softmax
Sample from the Gumbel-Softmax distribution and optionally discretize.
- Args:
- logits: [batch_size, n_class] unnormalized log-probs
- temperature: non-negative scalar
- hard: if True, take argmax, but differentiate w.r.t. soft sample y
- Returns:
- [batch_size, n_class] sample from the Gumbel-Softmax distribution. If hard=True, the returned sample will be one-hot; otherwise it will be a probability distribution that sums to 1 across classes.
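An illustrative numpy version of the sampling scheme these helpers implement (the module itself presumably operates on Keras tensors; this sketch is not its code):
import numpy as np

def sample_gumbel_np(shape, eps=1e-20):
    # Gumbel(0, 1) noise via the inverse-CDF trick.
    u = np.random.uniform(size=shape)
    return -np.log(-np.log(u + eps) + eps)

def gumbel_softmax_np(logits, temperature=1.0, hard=False):
    # Softmax of (logits + Gumbel noise) / temperature gives a relaxed one-hot sample.
    y = logits + sample_gumbel_np(logits.shape)
    y = np.exp((y - y.max(axis=-1, keepdims=True)) / temperature)
    y = y / y.sum(axis=-1, keepdims=True)
    if hard:
        # Discretize to a one-hot vector; the straight-through gradient trick only
        # applies in the differentiable, tensor-based version.
        y = (y == y.max(axis=-1, keepdims=True)).astype(np.float32)
    return y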
-
core.common.util.
gumbel_softmax_sample
(logits, temperature)¶ Draw a sample from the Gumbel-Softmax distribution
-
core.common.util.
huber_loss
(y_true, y_pred, clip_value)¶
-
core.common.util.
mmdd24hhmmss
()¶
-
core.common.util.
normalize
(x, stats)¶
-
core.common.util.
pickle_to_plot
(pickle_name, png_filename, title='', time_window=100, overwrite=False)¶
-
core.common.util.
reward_moving_avg_plot
(y_data_list=[], title='', label='reward', window=100, filepath='plot.png', shadow_color_index=5)¶ Save a reward plot.
Parameters: - rewards – reward list
- window – moving average time window
- filepath – plot file to be saved
- shadow_color_index – color palette index number
Returns: None
-
core.common.util.
reward_quantile_plot
(y_data_list=[], title=None, label='reward', window=100, filepath='plot.png', shadow_color_index=5)¶ Parameters: - y_data_list – reward list
- title – optional, title
- label – y axis label
- window – moving average time window
- filepath – plot file to be saved
- shadow_color_index – color palette index number
Returns:
-
core.common.util.
sample_gumbel
(shape, eps=1e-20)¶ Sample from Gumbel(0, 1)
-
core.common.util.
save_ci_graph
(y_data_list=[], title='Some Graph', xlabel='episode', ylabel='reward', window=100, filepath='plot.png', y_data=[], y_index=[], figsize=(12, 8), title_font_size=20)¶ Save a reward plot.
Parameters: - window – moving average time window
- filepath – plot file to be saved
Returns: None
-
core.common.util.
save_ci_graph_from_tuple
(y_data_list=[], graph_title=[], window=1000, filepath='2plot.png', y_data_legend=[], y_index=[], figsize=(12, 8), title_font_size=20)¶
-
core.common.util.
save_plot
(FILE_NAME)¶
-
core.common.util.
smoothL1
(y_true, y_pred)¶ https://stackoverflow.com/questions/44130871/keras-smooth-l1-loss
-
core.common.util.
yyyymmdd24hhmmss
()¶