
While frameworks such as MXNet simplify the lives of DL practitioners, the practical implementations of Reinforcement Learning are still relatively young. The advent of RL frameworks, however, has already started, and we can select from many projects right now that greatly encourage the use of specialized RL techniques. Frameworks such as TensorFlow or PyTorch have appeared in recent years to help transform pattern recognition into a product, making deep learning easier for practitioners to try and use [17].

      In the Reinforcement Learning arena, a similar pattern is starting to play out. We are starting to see the emergence of many open source libraries and tools that address this need, both by helping to create new pieces (rather than writing them from scratch) and, above all, by combining prebuilt algorithmic components. As a consequence, by providing high-level abstractions of the core components of an RL algorithm, these Reinforcement Learning frameworks support engineers [7].

      A significant number of Deep Reinforcement Learning algorithms rely on simulations, which adds another multiplicative dimension to the time cost of Deep Learning itself. This is especially true for architectures we have not yet seen in this sequence, such as, among others, distributed actor-critic methods or multi-agent behaviors. But even choosing the best model involves tuning hyperparameters and searching across different hyperparameter settings, which can be expensive. All this creates the need for supercomputers based on distributed systems of heterogeneous servers (with multi-core CPUs and hardware accelerators such as GPUs or TPUs) to provide high computing power [18].
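      As a toy illustration of such a hyperparameter search, the sketch below runs a small random search over the learning rate and discount factor. The function evaluate_config is a hypothetical stand-in for a full (possibly distributed) training run and simply returns a placeholder score; the names and search ranges are assumptions, not values from the text.

```python
import random

# Hypothetical stand-in for a full training run: a real experiment would
# train a deep RL agent with these settings and report its mean return.
def evaluate_config(learning_rate, gamma, seed):
    rng = random.Random(seed)
    return rng.uniform(0.0, 1.0)  # placeholder score

search_space = {
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "gamma": [0.95, 0.99, 0.999],
}

best_score, best_config = float("-inf"), None
for trial in range(10):
    # Sample one setting per hyperparameter (random search).
    config = {name: random.choice(values) for name, values in search_space.items()}
    score = evaluate_config(config["learning_rate"], config["gamma"], seed=trial)
    if score > best_score:
        best_score, best_config = score, config

print("best configuration:", best_config, "score:", best_score)
```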

      1.2.3 Choice of the Learning Algorithm and Function Approximator Selection

      In deep learning, the function approximator characterizes how the features are processed into higher levels of abstraction (and can therefore give certain features more or less weight). In the first layers of a deep neural network, for example, if there is an attention mechanism, the mapping formed by those first layers can act as a feature-selection stage. On the one hand, an asymptotic bias can occur if the function approximator used for the value function and/or the policy and/or the model is too simple. On the other hand, there will be a significant error due to the limited size of the data (overfitting) when the function approximator generalizes poorly.
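      As a minimal illustration of this capacity trade-off, the PyTorch sketch below defines two Q-value approximators of very different sizes for a hypothetical 4-dimensional state and 2 actions; the dimensions are assumptions for illustration only.

```python
import torch.nn as nn

# A very small Q-network: cheap, but may be too simple to represent the
# optimal value function (risk of asymptotic bias).
small_q_net = nn.Sequential(
    nn.Linear(4, 8), nn.ReLU(),
    nn.Linear(8, 2),
)

# A much larger Q-network: more expressive, but with limited data it is
# more prone to overfitting.
large_q_net = nn.Sequential(
    nn.Linear(4, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 2),
)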

      A particularly good choice of function approximator, in either a model-based or a model-free method, may infer that the state’s y-coordinate is less essential than its x-coordinate and generalize that to the policy. Depending on the task, it can be helpful to share a performant function approximator between a model-free and a model-based approach. The choice of how much to rely on one or the other method is therefore also a key factor in improving generalization [13, 19].

      One way to eliminate non-informative features is to compel the agent to acquire a set of symbolic rules tailored to the task and to reason at a more abstract level. This abstract-level reasoning and improved generalization have the potential to enable high-level cognitive functions such as analogical reasoning and transfer. For example, the feature space of the environment may integrate a relational learning system and thus extend the notion of contextual reinforcement learning.

      1.2.3.1 Auxiliary Tasks

      In the context of reinforcement learning, augmenting a deep reinforcement learning agent with auxiliary tasks within a jointly learned representation can substantially improve sample efficiency.

      This is accomplished by maximizing simultaneously several pseudo-reward functions, such as immediate reward prediction (γ = 0), predicting pixel changes in the next observation, or predicting the activation of some hidden unit of the agent’s neural network.
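      A minimal sketch of this idea, assuming a PyTorch setup with made-up state and action dimensions and an arbitrary auxiliary-loss weight, is a shared encoder feeding both a Q-value head and an immediate-reward-prediction head, so that gradients from the auxiliary task shape the jointly learned representation:

```python
import torch
import torch.nn as nn

class AgentWithAuxiliaryTask(nn.Module):
    """Shared representation with a Q-value head and an auxiliary
    immediate-reward-prediction head (one of the pseudo-reward tasks above)."""

    def __init__(self, state_dim=8, num_actions=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.q_head = nn.Linear(64, num_actions)   # main RL objective
        self.reward_head = nn.Linear(64, 1)        # auxiliary task

    def forward(self, state):
        features = self.encoder(state)
        return self.q_head(features), self.reward_head(features)

model = AgentWithAuxiliaryTask()
state = torch.randn(32, 8)             # a batch of states (illustrative)
q_target = torch.randn(32, 4)          # stand-in TD targets
reward_target = torch.randn(32, 1)     # observed immediate rewards

q_values, reward_pred = model(state)
# Combined loss: main Q-learning term plus a weighted auxiliary term
# (the 0.5 weight is an arbitrary choice for this sketch).
loss = nn.functional.mse_loss(q_values, q_target) \
     + 0.5 * nn.functional.mse_loss(reward_pred, reward_target)
loss.backward()                        # gradients flow into the shared encoder
```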

      1.2.3.2 Modifying the Objective Function

      In order to improve the policy learned by a deep RL algorithm, one can optimize an objective function that differs from the true objective. Doing so typically introduces a bias, although it can help with generalization in some situations. The main approaches to modifying the objective function are

       i) Reward shaping

      Reward shaping is a heuristic that modifies the reward of the task to ease and speed up learning. Reward shaping incorporates prior knowledge by providing intermediate rewards for actions that lead toward the desired outcome. This approach is also used in deep reinforcement learning to strengthen the learning process in environments with sparse and delayed rewards.
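      One common concrete form is potential-based reward shaping, where a term γΦ(s′) − Φ(s) is added to the environment reward; this preserves the optimal policy while still providing intermediate incentives. The snippet below is an illustrative sketch only: the potential function (negative distance to a hypothetical goal) is an assumption, not part of the text.

```python
def shaped_reward(reward, potential_s, potential_next_s, gamma=0.99):
    """Potential-based reward shaping: r' = r + gamma * Phi(s') - Phi(s)."""
    return reward + gamma * potential_next_s - potential_s

# Illustrative potential: negative distance to a hypothetical goal position.
def potential(state, goal=10.0):
    return -abs(goal - state)

# The environment gives no reward for this step, but the shaped reward is
# positive because the agent moved closer to the goal (from 3.0 to 4.0).
r = shaped_reward(reward=0.0,
                  potential_s=potential(3.0),
                  potential_next_s=potential(4.0))
print(r)  # > 0
```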

       ii) Tuning the discount factor

      When the model available to the agent is estimated from data, the policy found using a shorter planning horizon can actually be better than a policy found with the true horizon. On the one hand, artificially reducing the planning horizon introduces a bias, since the objective function is modified. On the other hand, if a long planning horizon is targeted (the discount factor is close to 1), there is a greater risk of overfitting. This overfitting can be intuitively understood as the accumulation of errors in the transitions and rewards estimated from data, relative to the true transition and reward probabilities [4].
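      A quick numerical illustration: for a trajectory of constant unit rewards, the discounted return under γ = 0.99 still weights distant rewards heavily, while γ = 0.9 effectively truncates the planning horizon (the reward values below are made up for the example).

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over the trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [1.0] * 50  # a trajectory of 50 unit rewards (illustrative)

# gamma close to 1: long effective horizon, more exposure to accumulated
# model errors; smaller gamma: shorter horizon, but a biased objective.
print(discounted_return(rewards, gamma=0.99))  # ~39.5
print(discounted_return(rewards, gamma=0.90))  # ~9.9
```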

      1.3.1 Value-Based Method

      Figure 1.4 Value-based learning.

      1 Take the state image, convert it to grayscale, and crop the unnecessary parts.

      2 Run the image through a series of convolutions and pooling operations to extract the important features that will help the agent make its decision.

      3 Calculate each possible action’s Q-Value.

      4 To obtain the most accurate Q-values, conduct back-propagation (a minimal sketch of these steps follows this list).
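      The PyTorch sketch below walks through the four steps; the crop region, network sizes, and the zero stand-in TD target are illustrative assumptions rather than settings prescribed by the text.

```python
import torch
import torch.nn as nn

def preprocess(frame):
    """Step 1: convert an RGB frame (H, W, 3) to grayscale and crop it."""
    gray = frame.float().mean(dim=-1)          # simple grayscale approximation
    cropped = gray[20:180, :]                  # crop rows assumed uninformative
    return cropped.unsqueeze(0).unsqueeze(0)   # shape (1, 1, 160, W)

class QNetwork(nn.Module):
    """Steps 2-3: convolution + pooling feature extractor, then Q-values."""
    def __init__(self, num_actions=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.q_values = nn.Sequential(nn.Flatten(), nn.Linear(32 * 4 * 4, num_actions))

    def forward(self, x):
        return self.q_values(self.features(x))

net = QNetwork()
frame = torch.randint(0, 256, (210, 160, 3))   # a fake raw frame
q = net(preprocess(frame))                      # one Q-value per action

# Step 4: back-propagate a TD-style loss to improve the Q-value estimates
# (a zero target is used here purely as a stand-in).
target = torch.zeros_like(q)
loss = nn.functional.mse_loss(q, target)
loss.backward()
```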

      1.3.2 Policy-Based Method

      In the real world, the number of potential actions may be very high or unknown. For instance, a robot learning to move across open fields may have millions of potential actions within the space of a minute. In such conditions, estimating a Q-value for each action is not practicable. Policy-based approaches learn the policy function directly, without computing a value function for each action. An illustration of a policy-based algorithm is the Policy Gradient method (Figure 1.5).
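      A minimal sketch of a policy-gradient (REINFORCE-style) update in PyTorch is shown below; the state dimension, network size, and the stand-in returns are illustrative assumptions, not values from the text.

```python
import torch
import torch.nn as nn

# The policy network outputs action probabilities directly, so no Q-value
# needs to be estimated per action.
policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2), nn.Softmax(dim=-1))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Pretend trajectory: states with their sampled actions and observed returns.
states = torch.randn(5, 4)
probs = policy(states)
dist = torch.distributions.Categorical(probs)
actions = dist.sample()
returns = torch.tensor([1.0, 0.8, 0.6, 0.4, 0.2])   # stand-in discounted returns

# Policy gradient loss: maximize expected return, i.e. minimize -log pi(a|s) * G.
loss = -(dist.log_prob(actions) * returns).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```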
