core package

Submodules

core.callbacks module

class core.callbacks.DrawTrainMovingAvgPlotCallback(file_path, plot_interval=10000, time_window=1000, l_label=['reward', 'kill_cnts', 'hps'], save_raw_data=False, title='')

Bases: core.common.callback.Callback

on_episode_end(episode, logs)
Parameters:
  • episode – episode index
  • logs – map of logged values keyed by name; each key is also used as its plot label.
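
A minimal usage sketch, assuming the constructor and logs mapping behave as documented above; the file_path value and the logged numbers are illustrative:

    from core.callbacks import DrawTrainMovingAvgPlotCallback

    # Keys of the logs mapping double as plot labels, matching l_label.
    plot_cb = DrawTrainMovingAvgPlotCallback(
        file_path='train_moving_avg.png',   # illustrative output path
        plot_interval=10000,
        time_window=1000,
        l_label=['reward', 'kill_cnts', 'hps'],
    )
    plot_cb.on_episode_end(episode=0, logs={'reward': 1.0, 'kill_cnts': 2, 'hps': 30})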

class core.callbacks.DrawTrainPlotCallback(file_path=None, plot_interval=10000, data_for_plot=['episode_reward', 'nb_episode_steps'])

Bases: core.common.callback.Callback

on_episode_end(episode, logs)

Called at end of each episode

class core.callbacks.FileLogger(filepath, interval=None)

Bases: core.common.callback.Callback

on_episode_begin(episode, logs={})

Initialize metrics at the beginning of each episode

on_episode_end(episode, logs={})

Compute and print metrics at the end of each episode

on_step_end(step, logs={})

Append metric at the end of each step

on_train_begin(logs={})

Initialize model metrics before training

on_train_end(logs={})

Save model at the end of training

save_data()

Save metrics in a json file

class core.callbacks.History(agent=None, *args, **kwargs)

Bases: core.common.callback.Callback

Callback that records events into a History object.

This callback is automatically applied to every Keras model. The History object gets returned by the fit method of models.

on_epoch_end(epoch, logs=None)
on_train_begin(logs=None)
class core.callbacks.ModelIntervalCheckpoint(filepath, step_interval=None, episode_interval=None, condition=None, condition_count=0, verbose=0, **kwargs)

Bases: core.common.callback.Callback

on_episode_end(episode, logs={})

Called at end of each episode

on_step_end(step, logs={})

Called at end of each step

class core.callbacks.TestLogger(agent=None, *args, **kwargs)

Bases: core.common.callback.Callback

Logger class for testing

on_episode_end(episode, logs={})

Print logs at end of each episode

on_train_begin(logs={})

Print logs at beginning of training

class core.callbacks.TrainEpisodeLogger

Bases: core.common.callback.Callback

on_episode_begin(episode, logs={})

Reset environment variables at beginning of each episode

on_episode_end(episode, logs={})

Compute and print training statistics of the episode when done

on_step_end(step, logs={})

Update statistics of episode after each step

on_train_begin(logs={})

Print training values at beginning of training

on_train_end(logs={})

Print training time at end of training

class core.callbacks.TrainIntervalLogger(interval=10000)

Bases: core.common.callback.Callback

on_episode_end(episode, logs={})

Update reward value at the end of each episode

on_step_begin(step, logs={})

Print metrics if interval is over

on_step_end(step, logs={})

Update progression bar at the end of each step

on_train_begin(logs={})

Initialize training statistics at beginning of training

on_train_end(logs={})

Print training duration at end of training

reset()

Reset statistics

core.memories module

class core.memories.MinSegmentTree(capacity)

Bases: core.memories.SegmentTree

min(start=0, end=None)

Returns min(arr[start], …, arr[end])

class core.memories.SegmentTree(capacity, operation, neutral_element)

Bases: object

reduce(start=0, end=None)

Returns the result of applying self.operation to a contiguous subsequence of the array:

self.operation(arr[start], self.operation(arr[start+1], self.operation(…, arr[end])))

Parameters:
  • start (int) – beginning of the subsequence
  • end (int) – end of the subsequence
Returns:
  • reduced (obj) – result of reducing self.operation over the specified range of array elements.
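
A self-contained sketch of the reduce semantics (a naive left fold; the real class maintains a binary tree so the query runs in O(log capacity)):

    import operator

    def naive_reduce(arr, operation, neutral_element, start=0, end=None):
        """Return operation(arr[start], operation(arr[start+1], ... arr[end]))."""
        if end is None:
            end = len(arr) - 1
        result = neutral_element
        for value in arr[start:end + 1]:
            result = operation(result, value)
        return result

    data = [3.0, 1.0, 4.0, 1.0, 5.0]
    print(naive_reduce(data, operator.add, 0.0))        # like SumSegmentTree.sum()
    print(naive_reduce(data, min, float('inf'), 1, 3))  # like MinSegmentTree.min(1, 3)
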
class core.memories.SequentialMemory(limit, enable_per=False, per_alpha=0.6, per_beta=0.4, **kwargs)

Bases: core.common.memory.Memory

append(observation, action, reward, terminal, training=True)

Append an observation to the memory

# Argument
observation (dict): Observation returned by the environment
action (int): Action taken to obtain this observation
reward (float): Reward obtained by taking this action
terminal (boolean): Whether the state is terminal
get_config()

Return configurations of SequentialMemory

# Returns
Dict of config
nb_entries

Return number of observations

# Returns
Number of observations
sample(batch_size, batch_idxs=None)

Return a randomized batch of experiences

# Argument
batch_size (int): Size of the batch
batch_idxs (int): Indexes to extract
per_beta (float): Prioritized Experience Replay hyperparameter controlling how strongly importance weights are applied (0 – no correction, 1 – full correction)
# Returns
A list of experiences randomly selected
update_priorities(idxes, priorities)

Update the priorities of sampled transitions: sets the priority of the transition at index idxes[i] in the buffer to priorities[i].

Parameters:
  • idxes ([int]) – indexes of the sampled transitions
  • priorities ([float]) – updated priorities corresponding to the transitions at the sampled idxes
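
A hedged usage sketch relying only on the methods documented above; the structure of the experiences returned by sample(), and whether PER also returns indexes and importance weights, is not specified here:

    from core.memories import SequentialMemory

    memory = SequentialMemory(limit=50000, enable_per=True, per_alpha=0.6, per_beta=0.4)

    # Append transitions as the environment loop produces them.
    memory.append(observation={'screen': None}, action=1, reward=0.5,
                  terminal=False, training=True)

    if memory.nb_entries >= 32:
        batch = memory.sample(batch_size=32)
        # After computing TD errors for the sampled transitions, feed the new
        # priorities back (idxes must identify the sampled transitions):
        # memory.update_priorities(idxes, priorities)
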
class core.memories.SumSegmentTree(capacity)

Bases: core.memories.SegmentTree

find_prefixsum_idx(prefixsum)
Find the highest index i in the array such that

sum(arr[0] + arr[1] + … + arr[i - 1]) <= prefixsum

If the array values are probabilities, this function allows sampling indexes according to the discrete probability distribution efficiently.

Parameters:
  • prefixsum (float) – upper bound on the sum of the array prefix
Returns:
  • idx (int) – highest index satisfying the prefixsum constraint
sum(start=0, end=None)

Returns arr[start] + … + arr[end]
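
A hedged sketch of proportional sampling with SumSegmentTree. Only sum() and find_prefixsum_idx() are documented above; item assignment (tree[i] = priority) is assumed, as in the segment-tree implementations this API mirrors:

    import random
    from core.memories import SumSegmentTree

    tree = SumSegmentTree(capacity=8)   # capacity is typically a power of two
    for i, priority in enumerate([0.1, 0.4, 0.2, 0.3]):
        tree[i] = priority              # assumed __setitem__

    # Draw an index with probability proportional to its priority.
    mass = random.random() * tree.sum()
    idx = tree.find_prefixsum_idx(mass)
    print(idx)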

core.policies module

class core.policies.AdvEpsGreedyPolicy(max_score, min_score=0, score_queue_size=100, score_name='episode_reward', score_type='mean', str_eps=1, nb_agents=1, **kwargs)

Bases: core.policies.LinearAnnealedPolicy

Implement the AdvEpsGreedyPolicy

Eps Greedy policy either:

  • takes a random action with probability epsilon
  • takes current best action with prob (1 - epsilon)

epsilon is calculated as max(annealed epsilon-greedy value, score-based value).
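
A purely illustrative sketch of that rule; the score-based schedule used by AdvEpsGreedyPolicy is not documented here, so the linear-in-score term below is an assumption:

    import numpy as np

    def adv_epsilon(step, nb_steps, eps_max, eps_min,
                    recent_scores, max_score, min_score=0):
        # Linearly annealed epsilon-greedy component (inherited schedule).
        annealed = max(eps_min, eps_max - (eps_max - eps_min) * step / nb_steps)
        # Assumed score-based component: explore more while scores are low.
        mean_score = np.mean(recent_scores) if recent_scores else min_score
        score_based = 1.0 - (mean_score - min_score) / (max_score - min_score)
        return max(annealed, float(np.clip(score_based, 0.0, 1.0)))

    print(adv_epsilon(step=50000, nb_steps=100000, eps_max=1.0, eps_min=0.05,
                      recent_scores=[120.0, 150.0], max_score=200.0))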

get_current_value()

Return current annealing value

# Returns
Value to use in annealing
on_episode_end(episode, logs={})
class core.policies.BoltzmannGumbelQPolicy(C=1.0)

Bases: core.common.policy.Policy

Implements Boltzmann-Gumbel exploration (BGE) adapted for Q learning based on the paper Boltzmann Exploration Done Right (https://arxiv.org/pdf/1705.10257.pdf).

BGE is invariant with respect to the mean of the rewards but not their variance. The parameter C, which defaults to 1, can be used to correct for this, and should be set to the least upper bound on the standard deviation of the rewards.

BGE is only available for training, not testing. For testing purposes, you can achieve approximately the same result as BGE after training for N steps on K actions with parameter C by using the BoltzmannQPolicy and setting tau = C/sqrt(N/K).
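
A sketch of that test-time substitution (C, N and K below are example values):

    import math
    from core.policies import BoltzmannQPolicy

    C, N, K = 1.0, 1000000, 6            # training parameter C, N steps, K actions
    test_policy = BoltzmannQPolicy(tau=C / math.sqrt(N / K))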

get_config()

Return configurations of BoltzmannGumbelQPolicy

# Returns
Dict of config
select_action(q_values)

Return the selected action

# Arguments
q_values (np.ndarray): List of the estimations of Q for each action
# Returns
Selected action
class core.policies.BoltzmannQPolicy(tau=1.0, clip=(-500.0, 500.0))

Bases: core.common.policy.Policy

Implement the Boltzmann Q Policy

Boltzmann Q Policy builds a probability distribution over actions from the Q-values and returns an action sampled from that distribution.
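
A minimal numpy sketch of that selection rule; applying the clip bounds to q/tau before exponentiation (to avoid overflow) is an assumption about where the documented clip parameter acts:

    import numpy as np

    def boltzmann_select(q_values, tau=1.0, clip=(-500.0, 500.0)):
        q = np.asarray(q_values, dtype=np.float64)
        exp_q = np.exp(np.clip(q / tau, clip[0], clip[1]))   # softmax with temperature tau
        probs = exp_q / exp_q.sum()
        return int(np.random.choice(len(q), p=probs))

    print(boltzmann_select([1.0, 2.0, 0.5], tau=1.0))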

get_config()

Return configurations of BoltzmannQPolicy

# Returns
Dict of config
select_action(q_values)

Return the selected action

# Arguments
q_values (np.ndarray): List of the estimations of Q for each action
# Returns
Selected action
class core.policies.EpsGreedyQPolicy(eps=0.1)

Bases: core.common.policy.Policy

Implement the epsilon greedy policy

Eps Greedy policy either:

  • takes a random action with probability epsilon
  • takes current best action with prob (1 - epsilon)
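
A minimal sketch of that rule:

    import numpy as np

    def eps_greedy_select(q_values, eps=0.1):
        q = np.asarray(q_values)
        if np.random.uniform() < eps:
            return int(np.random.randint(len(q)))   # random action with probability eps
        return int(np.argmax(q))                    # best action with probability 1 - eps

    print(eps_greedy_select([0.1, 0.7, 0.2], eps=0.1))
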
get_config()

Return configurations of EpsGreedyQPolicy

# Returns
Dict of config
select_action(q_values)

Return the selected action

# Arguments
q_values (np.ndarray): List of the estimations of Q for each action
# Returns
Selected action
class core.policies.GreedyQPolicy

Bases: core.common.policy.Policy

Implement the greedy policy

Greedy policy returns the current best action according to q_values

select_action(q_values)

Return the selected action

# Arguments
q_values (np.ndarray): List of the estimations of Q for each action
# Returns
Selected action
class core.policies.LinearAnnealedPolicy(inner_policy, attr, value_max, value_min, value_test, nb_steps)

Bases: core.common.policy.Policy

Implement the linear annealing policy

Linear Annealing Policy computes a current threshold value and passes it to an inner policy which chooses the action. The threshold value follows a linear function that decreases over time.
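
A sketch of the schedule; the exact formula is an assumption consistent with the description above (value_max decays linearly to value_min over nb_steps, and value_test is used outside training):

    def annealed_value(step, value_max, value_min, nb_steps,
                       training=True, value_test=0.05):
        if not training:
            return value_test
        slope = (value_min - value_max) / float(nb_steps)
        return max(value_min, value_max + slope * step)

    print(annealed_value(step=250000, value_max=1.0, value_min=0.1, nb_steps=1000000))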

get_config()

Return configurations of LinearAnnealedPolicy

# Returns
Dict of config
get_current_value()

Return current annealing value

# Returns
Value to use in annealing
metrics

Return metric values

# Returns
List of metric values
metrics_names

Return names of metrics

# Returns
List of metric names
select_action(**kwargs)

Choose an action to perform

# Returns
Action to take (int)
class core.policies.MA_BoltzmannQPolicy(tau=1.0, clip=(-500.0, 500.0))

Bases: core.common.policy.Policy

get_config()

Return configuration of the policy

# Returns
Configuration as dict
select_action(q_values)
select_action_agent(q_value)
class core.policies.MA_EpsGreedyQPolicy(eps=0.1)

Bases: core.common.policy.Policy

get_config()

Return configuration of the policy

# Returns
Configuration as dict
select_action(q_values)
class core.policies.MA_GreedyQPolicy

Bases: core.common.policy.Policy

select_action(q_values)
class core.policies.MA_MaxBoltzmannQPolicy(eps=0.1, tau=1.0, clip=(-500.0, 500.0))

Bases: core.common.policy.Policy

A combination of the eps-greedy and Boltzmann Q-policy.

Wiering, M.: Explorations in Efficient Reinforcement Learning. PhD thesis, University of Amsterdam, Amsterdam (1999)

https://pure.uva.nl/ws/files/3153478/8461_UBA003000033.pdf

get_config()

Return configuration of the policy

# Returns
Configuration as dict
select_action(q_values)
select_action_agent(q_value)
class core.policies.MaxBoltzmannQPolicy(eps=0.1, tau=1.0, clip=(-500.0, 500.0))

Bases: core.common.policy.Policy

A combination of the eps-greedy and Boltzmann Q-policy.

Wiering, M.: Explorations in Efficient Reinforcement Learning. PhD thesis, University of Amsterdam, Amsterdam (1999)

https://pure.uva.nl/ws/files/3153478/8461_UBA003000033.pdf

get_config()

Return configurations of MaxBoltzmannQPolicy

# Returns
Dict of config
select_action(q_values)

Return the selected action. The action is chosen with the BoltzmannQPolicy with probability epsilon, and with the greedy policy with probability (1 - epsilon).

# Arguments
q_values (np.ndarray): List of the estimations of Q for each action
# Returns
Selected action
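
A sketch of that combination (Boltzmann sampling with probability epsilon, greedy otherwise):

    import numpy as np

    def max_boltzmann_select(q_values, eps=0.1, tau=1.0, clip=(-500.0, 500.0)):
        q = np.asarray(q_values, dtype=np.float64)
        if np.random.uniform() < eps:
            exp_q = np.exp(np.clip(q / tau, clip[0], clip[1]))
            return int(np.random.choice(len(q), p=exp_q / exp_q.sum()))
        return int(np.argmax(q))
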
class core.policies.NoisePolicy(random_process, ratio_of_pure_action=1.0)

Bases: core.common.policy.Policy

Implements a policy based on an Ornstein-Uhlenbeck process. The policy returns the action with added noise, for exploration in DDPG.
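
A self-contained sketch of Ornstein-Uhlenbeck exploration noise added to a deterministic action; how ratio_of_pure_action scales the pure action is an assumption, since only "action plus noise" is described above:

    import numpy as np

    class OUNoise:
        """Ornstein-Uhlenbeck process: temporally correlated exploration noise."""
        def __init__(self, size, theta=0.15, mu=0.0, sigma=0.2, dt=1e-2):
            self.theta, self.mu, self.sigma, self.dt = theta, mu, sigma, dt
            self.x = np.full(size, mu, dtype=np.float64)

        def sample(self):
            dx = (self.theta * (self.mu - self.x) * self.dt
                  + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape))
            self.x = self.x + dx
            return self.x

    noise = OUNoise(size=2)
    pure_action = np.array([0.3, -0.1])
    noisy_action = 1.0 * pure_action + noise.sample()   # 1.0 = assumed ratio_of_pure_action
    print(noisy_action)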

reset_states()
select_action(pure_action)

Return the selected action

# Arguments
pure_action : Action chosen by the actor, to which noise from the random process is added
# Returns
Selected action
class core.policies.starcraft_multiagent_eGreedyPolicy(nb_agents, nb_actions, eps=0.1)

Bases: core.common.policy.Policy

Implement the epsilon greedy policy

Eps Greedy policy either:

  • takes a random action with probability epsilon
  • takes current best action with prob (1 - epsilon)

nb_actions = (64*64, 3)

get_config()

Return configurations of starcraft_multiagent_eGreedyPolicy

# Returns
Dict of config
select_action(q_values)

Return the selected action

# Arguments
q_values (list): [action_xy (np.array), action_type (np.array)] with shapes [(1, nb_agents, actions), (1, nb_agents, actions)]
# Returns
Selected action: [(x, y), nothing/attack/move] with shapes [(nb_agents, 1), (nb_agents, 1)]
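
A sketch of per-agent epsilon-greedy selection over the two Q heads, using the shapes documented above (per-agent exploration is an assumption; names are illustrative):

    import numpy as np

    def multiagent_eps_greedy(q_xy, q_type, eps=0.1):
        nb_agents = q_xy.shape[1]
        action_xy = np.empty((nb_agents, 1), dtype=np.int64)
        action_type = np.empty((nb_agents, 1), dtype=np.int64)
        for agent in range(nb_agents):
            if np.random.uniform() < eps:
                action_xy[agent, 0] = np.random.randint(q_xy.shape[2])
                action_type[agent, 0] = np.random.randint(q_type.shape[2])
            else:
                action_xy[agent, 0] = np.argmax(q_xy[0, agent])
                action_type[agent, 0] = np.argmax(q_type[0, agent])
        return [action_xy, action_type]

    q_values = [np.random.rand(1, 3, 64 * 64), np.random.rand(1, 3, 3)]
    action = multiagent_eps_greedy(*q_values, eps=0.1)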

Module contents