Artificial Intelligent Techniques for Wireless Communication and Networking. Group of authors

Gradient, simplified, works as follows:

      1 Takes a state as input and outputs the probability of each action based on prior experience

      2 Chooses the most probable action

      3 Repeats until the end of the game and evaluates the total reward

      4 Uses backpropagation to update the connection weights based on the rewards.
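      As a minimal sketch of these four steps, the following Python code trains a softmax policy on a toy game. The game, its reward rule (action 1 pays 1, action 0 pays 0), the function names and the hyperparameters are all assumptions made for illustration. Two simplifications are worth noting: actions are sampled from the policy rather than always taking the most probable one, which keeps some exploration, and because the policy is a single softmax layer the gradient is written out by hand instead of using backpropagation through a deeper network.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def run_episode(logits, rng, steps=5):
    """Play one toy game; return the actions taken and the total reward."""
    probs = softmax(logits)                     # step 1: action probabilities
    actions, total_reward = [], 0.0
    for _ in range(steps):                      # step 3: repeat until the game ends
        # step 2: pick an action (sampled here, not argmax; see the note above)
        a = rng.choices(range(len(probs)), weights=probs)[0]
        actions.append(a)
        total_reward += 1.0 if a == 1 else 0.0  # hypothetical reward rule
    return actions, total_reward

def train(episodes=2000, lr=0.1, seed=0):
    """Policy-gradient updates on the two logits of the toy policy."""
    rng = random.Random(seed)
    logits = [0.0, 0.0]
    for _ in range(episodes):
        actions, total_reward = run_episode(logits, rng)
        probs = softmax(logits)
        # step 4: move the weights along total_reward * grad log pi(a)
        for a in actions:
            for k in range(len(logits)):
                grad_log_pi = (1.0 if k == a else 0.0) - probs[k]
                logits[k] += lr * total_reward * grad_log_pi
    return softmax(logits)

final_probs = train()
```

      After training, `final_probs[1]` should be close to 1, since action 1 is the only rewarded action in this toy game.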


      Figure 1.5 Policy based learning.

      1.4.1 Applications

      Deep RL techniques have been shown to tackle a wide range of problems that were previously unsolved. Some of the best-known achievements are beating previous computer programmes at backgammon, reaching superhuman-level performance in Atari games directly from pixels, mastering the game of Go, and beating professional poker players at heads-up no-limit Texas Hold’em with Libratus and DeepStack.

      RL is also relevant to fields where one might assume that supervised learning alone is adequate, such as sequence prediction. Designing the right neural architecture for supervised learning tasks has also been cast as an RL problem. Note that evolutionary techniques can also be applied to these types of tasks. Finally, deep RL also has prospects for classical and fundamental algorithmic problems in computer science, such as the travelling salesman problem. This is an NP-complete problem, and the ability to tackle it with deep RL illustrates the potential impact it could have on many other NP-complete problems, provided that the structure of these problems can be exploited [2, 12].

      1.4.2 Challenges

       Off-Line Learning

      Training is often not possible directly online, so learning happens offline, using logs from a previous version of the control system. Broadly speaking, we would like the new version of the system to perform better than the old one, which means we need off-policy evaluation (predicting performance before running the policy on the real system). There are several approaches to this, including importance sampling. The deployment of the first RL version (the initial policy) is a special case to consider; there is often a minimum performance threshold to be met before this is allowed to happen. Warm-start performance is therefore another important property to be able to assess.
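      To make the off-policy evaluation step concrete, here is a minimal sketch of an ordinary importance sampling estimator in Python. The log format (a list of episodes, each a list of `(state, action, behaviour_prob, reward)` tuples) and the function names are assumptions for illustration, not a format prescribed by the text.

```python
def importance_sampling_estimate(logged_episodes, new_policy_prob):
    """Estimate the new policy's average return from logs of an old policy.

    logged_episodes: list of episodes; each episode is a list of
        (state, action, behaviour_prob, reward) tuples, where behaviour_prob
        is the probability the logging policy gave to the logged action.
    new_policy_prob: function (state, action) -> probability under the new policy.
    """
    total = 0.0
    for episode in logged_episodes:
        weight, episode_return = 1.0, 0.0
        for state, action, behaviour_prob, reward in episode:
            # Re-weight by how much more (or less) likely the new policy
            # is to take the logged action than the old one was.
            weight *= new_policy_prob(state, action) / behaviour_prob
            episode_return += reward
        total += weight * episode_return
    return total / len(logged_episodes)

# Hypothetical logs: a uniform random policy over two actions, one step per
# episode, where action 1 earns reward 1 and action 0 earns reward 0.
logs = [
    [("s", 0, 0.5, 0.0)],
    [("s", 1, 0.5, 1.0)],
    [("s", 0, 0.5, 0.0)],
    [("s", 1, 0.5, 1.0)],
]

def candidate_policy(state, action):
    # Hypothetical new policy: picks action 1 with probability 0.9.
    return 0.9 if action == 1 else 0.1

estimate = importance_sampling_estimate(logs, candidate_policy)
```

      In this toy case the estimate equals 0.9, the true expected reward of the candidate policy. On real logs the estimator can have high variance, particularly for long episodes, which is one reason other estimators are also used.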

       Learning From Limited Samples

      Many real systems have no separate training and evaluation environments. All training data comes from the real system, and the agent cannot have a separate exploration policy during training because its exploratory actions do not come for free. Given this higher cost of exploration, and the fact that the logs available for learning are likely to cover very little of the state space, policy learning needs to be data-efficient. Control frequencies (opportunities to take action) may be one hour or even multi-month time steps, with reward horizons that are longer still. One simple way to measure a model’s data efficiency is to look at the amount of data needed to reach a certain performance threshold.
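      The data-efficiency measure described above can be sketched as follows, assuming the learning curve is available as a list of `(num_samples, performance)` pairs in increasing order of sample count. The function name and the numbers are hypothetical.

```python
def samples_to_threshold(learning_curve, threshold):
    """Return the smallest sample count at which performance reaches the
    threshold, or None if the curve never gets there."""
    for num_samples, performance in learning_curve:
        if performance >= threshold:
            return num_samples
    return None

# Hypothetical learning curve: (samples used, measured performance).
curve = [(100, 0.20), (200, 0.50), (400, 0.80), (800, 0.90)]
needed = samples_to_threshold(curve, 0.75)       # 400
never = samples_to_threshold(curve, 0.95)        # None
```

      A lower value means the method is more data-efficient at that performance level.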

       High-Dimensional State and Action Spaces

      Many practical real-world problems have large and continuous state and action spaces, which can pose serious problems for traditional RL algorithms. One technique is to generate a candidate action vector and then run a nearest-neighbour search to find the closest available real action.
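      A minimal sketch of the nearest-neighbour step, assuming actions are fixed-length numeric vectors and the set of valid actions is small enough for a linear scan (a large action set would call for an approximate nearest-neighbour index instead). The function name and the example actions are illustrative.

```python
import math

def nearest_valid_action(candidate, valid_actions):
    """Map a continuous candidate action vector to the closest valid action
    under Euclidean distance."""
    return min(valid_actions, key=lambda action: math.dist(action, candidate))

# Hypothetical discrete action set and a candidate produced by the policy.
valid_actions = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
chosen = nearest_valid_action((0.9, 0.2), valid_actions)   # (1.0, 0.0)
```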

       Safety Constraints

      Many control systems must operate under safety constraints, even during exploratory learning phases. Constrained MDPs (Markov Decision Processes) make it possible to specify constraints on states and actions. Budgeted MDPs let the constraint level be learned rather than hard-wired, so that the trade-off between constraint level and performance can be explored. Another solution is to add a safety layer to the network that prevents any safety violations.
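      As a deliberately simple illustration of the safety-layer idea, the sketch below clamps each component of a proposed action to fixed bounds before it reaches the system. This assumes the safe region is a per-dimension box with known limits, which is a strong simplification; safety layers in practice may depend on the current state and solve a small correction problem instead. All names and values are hypothetical.

```python
def safety_layer(action, lower, upper):
    """Clamp each action component into [lower, upper] so that the action
    passed on to the system never leaves the assumed safe box."""
    return [min(max(a, lo), hi) for a, lo, hi in zip(action, lower, upper)]

# Hypothetical raw policy output and safe operating limits.
raw_action = [1.5, -0.2, 0.3]
safe_action = safety_layer(raw_action, [0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
# safe_action is [1.0, 0.0, 0.3]
```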

       Partial Observability

      Almost all real systems where we would like to deploy reinforcement learning are only partially observable. For example, the efficiency of mechanical parts may degrade over time, ‘identical’ widgets may show performance variations given the same control inputs, or the state of certain parts of the system may simply be unknown (e.g. the mental state of the users of a recommender system).

      Two common strategies for dealing with partial observability are including a history of observations in the input and modelling the history with recurrent networks in the model. In addition, Robust MDP formalisms provide explicit mechanisms to ensure that agents are robust to sensor and action noise and delays. If a given deployment setting may have initially unknown but learnable noise sources, then system identification techniques may be used to train a policy that can learn which environment it is operating in.
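      The first strategy, feeding a history of observations to the policy, can be sketched with a fixed-length buffer. The class name, the history length and the padding value are assumptions for illustration.

```python
from collections import deque

class HistoryStack:
    """Keep the most recent observations so that the policy input carries
    some memory of the past."""

    def __init__(self, history_len, empty_obs):
        # Pre-fill with a padding observation so the input size is constant.
        self.frames = deque([empty_obs] * history_len, maxlen=history_len)

    def push(self, observation):
        """Add the newest observation and return the stacked policy input."""
        self.frames.append(observation)   # the oldest entry drops out
        return tuple(self.frames)

history = HistoryStack(history_len=3, empty_obs=0.0)
first_input = history.push(1.0)           # (0.0, 0.0, 1.0)
for obs in (2.0, 3.0):
    history.push(obs)
latest_input = history.push(4.0)          # (2.0, 3.0, 4.0)
```

      A recurrent network plays a similar role but learns what to remember instead of keeping a fixed window.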

       Reward Functions

      System or product owners do not always have a clear picture of what they want to optimize. The reward function is often multidimensional and involves different sub-goals that need to be balanced. Another great insight here (which reminds me of machine latency discussions) is that average performance (i.e. the expectation) is often an inadequate measure, and the system needs to perform well for all task instances. A common approach to evaluating the full distribution of rewards across groups is to use a Conditional Value at Risk (CVaR) objective, which looks at a given percentile of the reward distribution rather than the expected reward.
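      A minimal sketch of an empirical CVaR computation follows, assuming a list of per-episode rewards and a tail fraction `alpha`. Strictly speaking, CVaR is the average reward within the worst `alpha` fraction of outcomes, i.e. the tail below the chosen percentile. The function name and the sample values are hypothetical.

```python
import math

def cvar(rewards, alpha):
    """Empirical CVaR: mean of the worst alpha-fraction of rewards."""
    ordered = sorted(rewards)                        # worst outcomes first
    tail_size = max(1, math.ceil(alpha * len(ordered)))
    tail = ordered[:tail_size]
    return sum(tail) / len(tail)

# Hypothetical per-episode rewards: mostly good, with two bad episodes.
rewards = [10, 9, 8, 9, 10, 1, 9, 0]
mean_reward = sum(rewards) / len(rewards)            # 7.0
tail_reward = cvar(rewards, alpha=0.25)              # 0.5
```

      The mean looks healthy here, while the CVaR exposes the poor episodes that an expectation-only objective would hide.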

       Explainability/Interpretability

      Real systems are owned and operated by humans who need to be reassured about the controller’s behaviour and need insight into failure cases. For this reason, policy explainability is critical for real-world policies. To obtain stakeholder buy-in, it is necessary to understand the longer-term intent of the policy, particularly in cases where the policy may find an alternative and unexpected approach to controlling a system.

       Real-Time Inference

      Policy inference has to occur within the system’s control frequency. This could be on the order of milliseconds or shorter. This rules out computationally expensive methods that cannot meet the constraint (for example, certain types of model-based planning). Of course, systems with