Title: Customizable Computing
Author: Yu-Ting Chen
Publisher: Ingram
Genre: Software
Series: Synthesis Lectures on Computer Architecture
ISBN: 9781627059640
CHAPTER 3
Customization of Cores
3.1 INTRODUCTION
Because processing cores contribute greatly to energy consumption in modern processors, the conventional processing core is a good place to start looking for customizations to computation engines. Processing cores are pervasive, and their architecture and compilation flow are mature. Modifications made to processing cores therefore have the advantage that the existing hardware modules and infrastructure invested in building efficient, high-performance processors can be leveraged, without necessarily abandoning existing software stacks, as may be required when designing hardware from the ground up. Additionally, programmers can use their existing knowledge of programming conventional processing cores as a foundation for learning new techniques that build upon conventional cores, instead of having to adopt new programming paradigms or new languages.
In addition to benefiting from mature software stacks, any modifications made to a conventional processing core can also take advantage of many of the architectural components that have made cores so effective. Examples of these architectural components are caches, mechanisms for out-of-order scheduling and speculative execution, and software scheduling mechanisms. By integrating modifications directly into a processing core, new features can be designed to blend into these components. For example, adding a new instruction to the existing execution pipeline automatically enables this instruction to benefit from aggressive instruction scheduling already present in a conventional core.
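As a hypothetical illustration of this point, a new instruction is usually exposed to software as an intrinsic or a thin wrapper, and the core's existing out-of-order machinery then schedules it like any other ALU operation. The mnemonic and wrapper below are invented for this sketch; the portable C body stands in for the opcode a vendor toolchain would emit.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical custom instruction: a saturating 32-bit add. On a core
     * extended with such an instruction, this wrapper would compile to the
     * new opcode (via an intrinsic or inline asm supplied by the vendor
     * toolchain); the portable C body keeps the sketch runnable anywhere.
     * Once emitted, the instruction flows through fetch, rename, and
     * out-of-order issue like any other ALU operation. */
    static inline uint32_t satadd_u32(uint32_t a, uint32_t b)
    {
        uint32_t r = a + b;
        return (r < a) ? UINT32_MAX : r;   /* unsigned overflow -> saturate */
    }

    int main(void)
    {
        printf("%u\n", satadd_u32(4000000000u, 500000000u)); /* 4294967295 */
        return 0;
    }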
However, introducing new compute capability, such as new arithmetic units, into existing processing cores means being burdened by many of the design restrictions that these cores already exert on arithmetic unit design. For example, out-of-order processing benefits considerably from short-latency instructions, as long-latency instructions can cause pipeline stalls. Conventional cores are also fundamentally bound, both in terms of performance and efficiency, by the infrastructure necessary to execute instructions. As a result, a conventional core cannot be as efficient at performing a particular task as a hardware structure specialized for that purpose [26]. Figure 3.1 illustrates this point, showing that the energy cost of executing an instruction is much greater than the energy required to perform the arithmetic computation itself (e.g., energy devoted to integer and floating-point arithmetic). The rest of the energy is spent on the infrastructure internal to the processing core that performs tasks such as scheduling instructions, fetching and decoding, extracting instruction-level parallelism, and so on. Figure 3.1 compares only structures internal to the processing core itself, and excludes external components such as memory systems and networks. These burdens are ever present in conventional processing cores, and they represent the architectural cost of generality and programmability. This can be contrasted against the energy proportions shown in Figure 3.2, which shows the energy savings when the compute engine is customized for a particular application instead of remaining a general-purpose design. The difference in energy devoted to computation is primarily the result of relaxing the design requirements on functional units, so that they operate only at the precisions that are necessary, are designed to emphasize energy efficiency per computation, and potentially exhibit deeper pipelines and longer latencies than would be tolerable inside a conventional core.
Figure 3.1: Energy consumed by subcomponents of a conventional compute core as a proportion of the total energy consumed by the core. Subcomponents that are not computationally necessary (i.e., they are part of the architectural cost of extracting parallelism, fetching and decoding instructions, scheduling, dependency checking, etc.) are shown as slices without fill. Results are for a Nehalem-era 4-core Intel Xeon CPU. Memory includes L1 cache energy only. Taken from [26].
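As a back-of-the-envelope illustration of why this overhead matters (the numbers are assumptions for the example, not values read off Figure 3.1): suppose the energy of executing one instruction decomposes as

\[
E_{\text{instr}} = E_{\text{fetch/decode}} + E_{\text{schedule}} + E_{\text{compute}}, \qquad E_{\text{compute}} \approx 0.1\,E_{\text{instr}}.
\]

Then even a specialized engine that eliminated every overhead term, keeping only the compute energy, would cut per-operation energy by at most a factor of \(E_{\text{instr}}/E_{\text{compute}} \approx 10\); savings beyond that require cheapening the computation itself (reduced precision, deeper pipelines), which is exactly the relaxation Figure 3.2 reflects.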
This chapter will cover the following topics related to customization of processing cores:
• Dynamic Core Scaling and Defeaturing: A post-silicon method of selectively deactivating underutilized components with the goal of conserving energy.
• Core Fusion: Architectures that enable one “big” core to act as if it were really many “small” cores, and vice versa, to dynamically adapt to different amounts of thread-level or instruction-level parallelism.
• Customized Instruction Set Extensions: Augmenting processor cores with new workload-specific instructions.
Figure 3.2: Energy cost of subcomponents in a conventional compute core as a proportion of the total energy consumed by the core. This shows the energy savings attainable if computation is performed in an energy-optimal ASIC. Results are for a Nehalem-era 4-core Intel Xeon CPU. Memory includes L1 cache energy only. Taken from [26].
3.2 DYNAMIC CORE SCALING AND DEFEATURING
A general-purpose processor is designed with a wide range of potential workloads in mind. For any particular workload, many resources may not be fully utilized. As a result, these resources continue to consume power but do not contribute meaningfully to program performance. To improve energy efficiency, architectural features can be added that allow these components to be selectively turned off. While this obviously does not allow the chip area spent on deactivated components to be repurposed, it does allow for a meaningful improvement in energy efficiency.
Manufacturers of modern CPUs enable this type of selective defeaturing, though typically not for this purpose. This is done through model-specific registers (MSRs) that govern the activation of particular components. The purpose of these registers, from a manufacturer’s perspective, is to improve processor yield by allowing faulty components in an otherwise stable processor to be disabled. For this reason, the model-specific registers governing device activation are rarely documented.
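On Linux, such registers can be read and written from user space through the msr driver, which exposes each core’s registers as a seekable device file. The sketch below toggles a prefetcher-disable bit; the register address and bit position are assumptions based on values widely reported for Intel Core-family parts (they are model-dependent), so treat this as an illustration of the mechanism rather than a portable recipe.

    /* Minimal sketch: toggling a feature-control MSR from user space on
     * Linux via the msr driver (modprobe msr; requires root). Reads and
     * writes on /dev/cpu/N/msr are 8 bytes at an offset equal to the MSR
     * address. The address 0x1A4 is widely reported as the prefetch-control
     * register on many Intel Core parts, but this is model-dependent --
     * an assumption here; check your processor's documentation first. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define MSR_PREFETCH_CTRL 0x1A4   /* assumed address; model-dependent */

    int main(void)
    {
        int fd = open("/dev/cpu/0/msr", O_RDWR);
        if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

        uint64_t val;
        if (pread(fd, &val, sizeof val, MSR_PREFETCH_CTRL) != sizeof val) {
            perror("pread"); return 1;
        }

        val |= 1ULL << 0;             /* bit 0: disable the L2 HW prefetcher */
        if (pwrite(fd, &val, sizeof val, MSR_PREFETCH_CTRL) != sizeof val) {
            perror("pwrite"); return 1;
        }

        close(fd);
        return 0;
    }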
There has been extensive academic work on utilizing defeaturing to create dynamically heterogeneous systems. These works center around identifying when a program is entering a code region that systemically underutilizes some set of features that exist in a conventional core. For example, if it is possible to statically discover that a code region contains long sequences of dependencies between instructions, then it is clear that a processor with wide issue and fetch widths will not be able to find enough independent instructions to make effective use of those wide resources [4, 8, 19, 125]. In that case, powering off the components that enable wide fetch and issue, along with the architectural support for large instruction windows, can save energy without impacting performance. This academic work is contingent upon being able to discern the run-time behavior of code, either using run-time monitoring [4, 8] or static analysis [125].
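A minimal sketch of the kind of static estimate involved, over a toy three-operand representation (the IR and all names here are invented for illustration, not taken from the cited work): compute the longest dependence chain in a basic block; when that chain’s length approaches the block’s instruction count, there is little ILP for wide fetch and issue to exploit.

    /* Sketch: estimating available ILP in a basic block by computing the
     * longest dependence chain. Instructions are modeled as (dst, src1,
     * src2) register triples over a toy IR. If the chain length approaches
     * the block length, wide fetch/issue buys little for this region. */
    #include <stdio.h>

    #define NREGS 32

    struct insn { int dst, src1, src2; };   /* -1 = unused operand */

    static int critical_path(const struct insn *b, int n)
    {
        int depth[NREGS] = {0};   /* chain depth of the latest writer of each reg */
        int longest = 0;

        for (int i = 0; i < n; i++) {
            int d = 0;
            if (b[i].src1 >= 0 && depth[b[i].src1] > d) d = depth[b[i].src1];
            if (b[i].src2 >= 0 && depth[b[i].src2] > d) d = depth[b[i].src2];
            d += 1;                          /* this insn extends the chain */
            if (b[i].dst >= 0) depth[b[i].dst] = d;
            if (d > longest) longest = d;
        }
        return longest;
    }

    int main(void)
    {
        /* r1 = r0+r0; r2 = r1+r1; r3 = r2+r2;  -- a serial chain, ILP ~ 1 */
        struct insn chain[] = { {1,0,0}, {2,1,1}, {3,2,2} };
        int n = 3, cp = critical_path(chain, n);
        printf("block of %d insns, critical path %d, ILP ~ %.2f\n",
               n, cp, (double)n / cp);
        return 0;
    }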
An example of dynamic resource scaling from academia is CoolFetch [125]. CoolFetch relies on compiler support to statically estimate the execution rate of a code region, and then uses this information to dynamically grow and contract structures within a processor’s fetch and issue units. By constraining these structures in code regions with few opportunities to exploit instruction-level parallelism, CoolFetch saves energy with little impact on performance.
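The flavor of the decision logic can be pictured with a small sketch; the thresholds and the interface below are invented for illustration and do not reproduce CoolFetch’s actual policy (see [125]).

    /* Illustrative policy only: map a compiler-supplied execution-rate
     * estimate to a fetch-queue size. The thresholds and this interface
     * are invented for the sketch; the real mechanism differs in detail. */
    #include <stdio.h>

    static int fetch_queue_entries(double est_ipc)
    {
        if (est_ipc < 1.0) return 8;    /* serial region: shrink structures */
        if (est_ipc < 2.0) return 16;   /* moderate ILP */
        return 32;                      /* ILP-rich region: full width */
    }

    int main(void)
    {
        printf("est IPC 0.8 -> %d entries\n", fetch_queue_entries(0.8));
        printf("est IPC 2.5 -> %d entries\n", fetch_queue_entries(2.5));
        return 0;
    }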