Title: Artificial Intelligent Techniques for Wireless Communication and Networking
Author: Group of authors
Publisher: John Wiley & Sons Limited
Genre: Software
ISBN: 9781119821786
1 Takes a state as input and outputs the probability of each action based on prior experience
2 Chooses the most probable action
3 Repeats until the end of the game and evaluates the total rewards
4 Uses backpropagation to update the connection weights based on the rewards.
Figure 1.5 Policy based learning.
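The four steps above can be sketched in a few lines of Python. This is only an illustrative REINFORCE-style sketch under several assumptions that are not in the text: the environment is a hypothetical two-state toy game in which the action matching the state earns a reward of 1, the policy is a table of softmax preferences rather than a neural network (so the backpropagation step reduces to the analytic gradient of the log-softmax), and the action is sampled from the probabilities rather than always taking the most probable one, so that the agent keeps exploring.

```python
import math
import random

def softmax(prefs):
    """Turn a list of action preferences into probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def run_episode(theta, rng, steps=5):
    """Play one short game in a hypothetical toy environment."""
    trajectory, total_reward = [], 0.0
    for _ in range(steps):
        state = rng.randrange(len(theta))          # toy assumption: random state
        probs = softmax(theta[state])              # step 1: action probabilities
        action = rng.choices(range(len(probs)), weights=probs)[0]  # step 2 (sampled)
        total_reward += 1.0 if action == state else 0.0  # toy reward rule
        trajectory.append((state, action, probs))
    return trajectory, total_reward

def train(episodes=3000, lr=0.1, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0], [0.0, 0.0]]               # preferences[state][action]
    baseline = 0.0                                 # running average of returns
    for _ in range(episodes):
        trajectory, total = run_episode(theta, rng)  # step 3: play to the end
        advantage = total - baseline
        baseline += 0.05 * (total - baseline)
        for state, action, probs in trajectory:    # step 4: gradient update
            for a, p in enumerate(probs):
                indicator = 1.0 if a == action else 0.0
                # gradient of log pi(action | state) w.r.t. theta[state][a]
                theta[state][a] += lr * advantage * (indicator - p)
    return theta
```

After training, `softmax(theta[0])` and `softmax(theta[1])` should put most of their probability on the rewarded action for each state. The running baseline is a common variance-reduction choice, not something the text prescribes.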
1.4 Applications and Challenges of Applying Reinforcement Learning to Real-World
1.4.1 Applications
Deep RL techniques have been shown to tackle a wide range of problems that were previously unsolved. Some of the best-known accomplishments are beating previous computer programmes at backgammon, achieving superhuman-level performance in Atari games directly from the pixels, mastering the game of Go, and beating professional poker players at heads-up no-limit Texas Hold’em (Libratus and DeepStack).
Such achievements in popular games are important because they demonstrate the effectiveness of deep RL on a variety of large and nuanced tasks that require operating with high-dimensional inputs. Deep RL has also shown a great deal of potential for real-world applications such as robotics, self-driving vehicles, finance, smart grids, dialogue systems, etc. Deep RL systems are already running in production environments. Facebook, for instance, uses deep RL for push notifications and for faster video loading with smart prefetching.
RL is also relevant to fields where one might assume that supervised learning alone is adequate, such as sequence prediction. Designing the right neural architecture for supervised learning tasks has also been cast as an RL problem. Note that evolutionary techniques can also be applied to these types of tasks. Finally, deep RL has prospects for classical and fundamental algorithmic problems in computer science, such as the travelling salesman problem. This is an NP-complete problem (in its decision form), and the ability to tackle it with deep RL illustrates the potential effect it could have on many other NP-complete problems, provided that the structure of these problems can be exploited [2, 12].
1.4.2 Challenges
Off-Line Learning
Training is often not possible directly online, so learning happens offline, using logs from a previous version of the control system. Broadly speaking, we would like the new version of the system to perform better than the old one, which implies that we need to perform off-policy evaluation (estimating performance without running the policy on the actual system). There are a few approaches for doing this, including importance sampling. The deployment of the first RL version (the initial policy) is one special case to consider; there is often a minimum performance threshold to be met before this is allowed to happen. The warm-start performance is therefore another important quality to be able to assess.
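As an illustration of off-policy evaluation, the sketch below applies ordinary importance sampling to logged data. It assumes a simplified one-step setting in which each log record stores the probability that the old (logging) policy assigned to the action it took; the record layout and the names `importance_sampling_estimate` and `target_policy` are hypothetical and chosen only for this example.

```python
def importance_sampling_estimate(logs, target_policy):
    """Estimate the mean reward of a new policy from old logs.

    logs: list of (state, action, reward, behaviour_prob) tuples, where
          behaviour_prob is the probability the logging policy gave to
          the logged action (assumed to be recorded and non-zero).
    target_policy: function (state, action) -> probability under the
          new policy being evaluated.
    """
    if not logs:
        raise ValueError("no logged data to evaluate on")
    total = 0.0
    for state, action, reward, behaviour_prob in logs:
        # Re-weight each logged reward by how much more (or less) likely
        # the new policy is to take the logged action.
        weight = target_policy(state, action) / behaviour_prob
        total += weight * reward
    return total / len(logs)

# Hypothetical logs from a uniform logging policy over actions 0 and 1,
# where only action 1 is rewarded.
logs = [("s", 0, 0.0, 0.5), ("s", 1, 1.0, 0.5),
        ("s", 1, 1.0, 0.5), ("s", 0, 0.0, 0.5)]

def always_one(state, action):
    """Candidate policy that always chooses action 1."""
    return 1.0 if action == 1 else 0.0

estimate = importance_sampling_estimate(logs, always_one)
```

Here the estimate equals 1.0, the reward the candidate policy would earn on every step, even though the logged policy earned only 0.5 on average. In practice the weights can have high variance when the two policies differ strongly, which is one reason several estimators exist.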
Learning From Limited Samples
Many real systems do not have separate training and evaluation environments. All training data comes from the real system, and the agent cannot have a separate exploration policy during training because its exploratory actions do not come for free. Given this higher cost of exploration, and the fact that the logs available for learning are likely to cover very little of the state space, policy learning needs to be data-efficient. Control frequencies (opportunities to take action) may range from 1 h to multi-month time steps, with even longer reward horizons. One easy way to measure a model’s data efficiency is to look at the amount of data needed to reach a certain performance threshold.
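The data-efficiency measure described above can be sketched as a small helper. The learning curve below is made-up illustrative data, and the function name `samples_to_threshold` is an assumption for this example only.

```python
def samples_to_threshold(learning_curve, threshold):
    """Return the smallest sample count at which performance reaches
    the threshold, or None if it is never reached.

    learning_curve: list of (num_samples, performance) pairs, assumed
                    to be sorted by num_samples.
    """
    for num_samples, performance in learning_curve:
        if performance >= threshold:
            return num_samples
    return None

# Hypothetical learning curve: performance after N training samples.
curve = [(100, 0.20), (500, 0.55), (1000, 0.80), (2000, 0.90)]
needed = samples_to_threshold(curve, 0.75)
```

With this curve, `needed` is 1000. Comparing this number across algorithms gives a rough ranking of their data efficiency at a fixed performance target.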
High-Dimensional State and Action Spaces
Many practical real-world problems have large and continuous state and action spaces, which can pose serious problems for traditional RL algorithms. One technique is to generate a candidate action vector and then do a nearest-neighbour search to find the closest available real action.
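The nearest-neighbour step can be sketched as follows. The sketch assumes the real actions are available as a small list of coordinate tuples and uses a brute-force Euclidean search; a large action set would normally need an approximate nearest-neighbour index instead. The name `nearest_valid_action` is hypothetical.

```python
import math

def nearest_valid_action(candidate, valid_actions):
    """Return the real action closest (Euclidean distance) to the
    candidate action vector produced by the policy."""
    return min(valid_actions, key=lambda a: math.dist(candidate, a))

# Hypothetical discrete action set embedded in a 2-D space.
valid_actions = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
chosen = nearest_valid_action((0.9, 0.2), valid_actions)
```

Here `chosen` is `(1.0, 0.0)`, the available action nearest to the candidate `(0.9, 0.2)`.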
Safety Constraints
Many control systems must operate under safety constraints, even during exploratory learning phases. Constrained MDPs (Markov Decision Processes) make it possible to define constraints on states and actions. Budgeted MDPs allow the constraint/performance trade-off to be explored rather than simply hard-wired, by letting constraint levels be learned. Another solution is to add a safety layer to the network that prevents any safety violations.
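A very simple form of safety layer can be sketched as a function that corrects the action proposed by the policy before it reaches the system. The sketch assumes a hypothetical one-dimensional system whose next level is the current level plus the action, with known safe bounds; real safety layers usually have to learn or approximate this relationship.

```python
def safe_action(proposed_action, level, low, high):
    """Clamp the proposed action so that level + action stays in
    the safe range [low, high] (assumed additive dynamics)."""
    min_action = low - level    # most negative action that stays safe
    max_action = high - level   # most positive action that stays safe
    return max(min_action, min(max_action, proposed_action))

# Hypothetical tank at level 8.0 with safe range [0.0, 10.0].
corrected = safe_action(5.0, level=8.0, low=0.0, high=10.0)
```

Here the policy asked for +5.0, which would overflow the range, so `corrected` is 2.0. Actions that are already safe pass through unchanged, so the layer only interferes when a violation would otherwise occur.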
Partial Observability
Almost all real systems where we would like to deploy reinforcement learning are partially observable. For example, the efficiency of mechanical parts may deteriorate over time, ‘identical’ widgets may exhibit performance variations given the same control inputs, or the state of certain parts of the system may simply be unknown (e.g. the mental state of users of a recommender system).
Two common strategies for dealing with partial observability are including history in the input, and modelling history using recurrent networks in the model. In addition, Robust MDP formalisms provide explicit mechanisms to ensure that agents are robust to sensor and action noise and delays. If a given deployment setting may have initially unknown but learnable noise sources, then system identification techniques may be used to train a policy that can learn which environment it is operating in.
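The first strategy, including history in the input, can be sketched with a fixed-length buffer of recent observations. The class name `HistoryStacker` and the choice to pad the initial history with zeros are assumptions made for this example.

```python
from collections import deque

class HistoryStacker:
    """Keep the last `history_len` observations and expose them as
    one flat input vector (oldest observation first)."""

    def __init__(self, history_len, obs_dim):
        # Assumption: start with an all-zero history.
        self.buffer = deque(
            ([0.0] * obs_dim for _ in range(history_len)),
            maxlen=history_len,
        )

    def push(self, observation):
        """Add a new observation and return the stacked policy input."""
        self.buffer.append(list(observation))  # oldest entry drops out
        return [x for obs in self.buffer for x in obs]

stacker = HistoryStacker(history_len=3, obs_dim=2)
first = stacker.push([1, 2])
second = stacker.push([3, 4])
```

After the two calls, `second` holds one zero-padded slot followed by the two real observations. The policy then receives this stacked vector instead of the latest observation alone, which lets it infer quantities such as rates of change that a single observation does not reveal.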
Reward Functions
System or product owners do not always have a clear picture of what they want to optimize. The reward function is often multidimensional and involves different sub-goals to be balanced. Another great insight here (reminiscent of discussions of machine latency) is that ‘average performance’ (i.e. expectation) is often an inadequate measure, and the system needs to perform well for all task instances. A common approach is to use a Conditional Value at Risk (CVaR) objective to assess the full distribution of rewards across groups, which looks at a given percentile of the reward distribution rather than the expected reward.
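To make the contrast with the expected reward concrete, the sketch below computes an empirical CVaR as the mean of the worst fraction of observed rewards. This is one common empirical definition, used here as an assumption; the sample rewards are made up and the function name `cvar` is hypothetical.

```python
import math

def cvar(rewards, alpha=0.1):
    """Empirical CVaR: mean of the worst alpha-fraction of rewards."""
    if not rewards:
        raise ValueError("rewards must not be empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    ordered = sorted(rewards)                    # worst rewards first
    k = max(1, math.ceil(alpha * len(ordered)))  # size of the lower tail
    tail = ordered[:k]
    return sum(tail) / len(tail)

# Hypothetical per-episode rewards with one catastrophic outcome.
rewards = [10, 9, 8, 7, 6, 5, 4, 3, 2, -50]
mean_reward = sum(rewards) / len(rewards)
tail_risk = cvar(rewards, alpha=0.2)
```

The mean reward here is 0.4, while the CVaR at the 20% level is -24.0, which exposes the rare catastrophic episode that the average hides.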
Explainability/Interpretability
Real systems are owned and operated by humans who need to be reassured about the controller's behaviour and need insights into failure cases. For this reason, policy explainability is critical for real-world policies. In order to obtain stakeholder buy-in, it is necessary to understand the longer-term intent of the policy, particularly in cases where the policy may find an alternative and unexpected approach to controlling a system.
Real-Time Inference
Policy inference has to occur within the system’s control frequency. This could be on the order of milliseconds or shorter. This prevents us from using computationally expensive methods that do not meet the constraints (for example, certain types of model-based planning). Of course, systems with