Multi-Objective Decision Making. Diederik M. Roijers
Чтение книги онлайн.

Читать онлайн книгу Multi-Objective Decision Making - Diederik M. Roijers страница 9

СКАЧАТЬ (right). There are two possible actions per agent, denoted by a dot (1) and a bar (1).

Image

      We now consider the multi-objective setting:

      • D and A are the same as in a CoG, but,

      • U = {u1,…,uρ} is now the set of ρ, d-dimensional local payoff functions. The total team payoff is the sum of local vector-valued payoffs: Image.

      For example, payoffs for a MO-CoG structured as in Figure 2.1, i.e.,

Image

      are provided in Table 2.2.

Image

      The only difference between a MO-CoG and a CoG is the dimensionality of the payoff functions. A CoG is thus a special case of a MO-CoG, i.e., a MO-CoG in which d = 1. Furthermore, when preferences are known, it may be possible to scalarize a MO-CoG and thus transform it into a CoG. For example, if we know the scalarization function is linear, i.e., f(u, w) = w · u, and its parameter vector w = (0.75,0.25) is, then we can scalarize the multi-objective payoff functions of Table 2.2 to the single-objective payoff functions of Table 2.1 before planning.

      In a deterministic policy for a MO-CoG, the agents pick one joint action. In a stochastic policy, the agents draw a joint action from a probability distribution. Note that such a probability distribution is in general defined over joint actions, and there can thus still be coordination between the agents when the policy is stochastic.

      When f and/or w are unknown in the planning or learning phase—as is the case in the unknown weights and decision support scenarios discussed in Section 1.1—we need to produce a set of policies that contains at least one optimal policy for each possible f and w. The solution of a MO-CoG is thus a coverage set of policies. In general, this can contain both deterministic and stochastic policies. We explain why this is important for MO-CoGs (but not for CoGs) in Chapter 3.

      The second class of MODPs that we treat is the multi-objective Markov decision process (MOMDP), in which a single agent needs to perform a sequence of actions over time. This sequence of actions typically takes place in an environment that is affected by these actions. Therefore, the agent has to consider not only its immediate reward, but also the reward it will be able obtain later in the states it visits in the future.

      Consider the robot (shown as a Pacman symbol) in Figure 2.2 that needs to navigate in a socially appropriate way to reach the blue target zone in the upper right corner. We want the robot to reach the target as soon as possible, i.e., minimize the time to reach the target, but also minimize the hindrance that the robot causes to the other person by avoiding her personal space (indicated in red) along the way. By driving through the personal space of the person, it can obtain a higher value with respect to the first objective but a lower value in the second objective. Which policy is optimal thus depends on how much we value the first objective relative to the second objective.

      Figure 2.2: A social robot navigation problem MDP: a robot (indicated by the pacman symbol) needs to reach the target zone in the upper right corner (objective 1). However, the robots needs to avoid the personal space of the person standing in between the start position of the robot and the target zone (objective 2).

      The single-objective version of an MOMDP is an MDP:

      • S is the state space, i.e., the set of possible states the environment can be in,

      • A is the action space, i.e., the set of actions the agent can take,

      • T : S × A × S → [0,1] is the transition function, giving the probability of a next state given an action and a current state,

      • R : S × A × S → ℝ is the reward function, specifying the immediate expected scalar reward corresponding to a transition.

      • μ0 is the distribution over initial states s0.

      At each timestep t, the agent observes the current state of the environment s ∈ S. When the agent takes an action a ∈ A the environment transitions to a new state s′.

      The state in an MDP is Markovian, i.e., the current state s of the environment and the current action of the agent a are a sufficient statistic for predicting the next transition probabilities T(s′|s, a) and the associated expected immediate reward. The agent’s history, i.e., the states and actions that led to the current state, do not provide additional information in that respect.

      The agent’s goal in an MDP is to find a policy π that maximizes the return Rt, which is some function of the rewards received from timestep t and onward. In the broadest sense, a policy can condition on everything that is known to the agent.

      A state-indepedent value function Vπ specifies the expected return when following π from the initial state:

      We further specify first the return, Rt, and then the policy, π.

      Typically, the return is additive [Boutilier et al., 1999], i.e., it is a sum of the rewards attained at each timestep. When the returns are additive, Vπ becomes an expectation over this sum. In a finite-horizon setting there is a limited number of timesteps, h, СКАЧАТЬ