Hardware

CPU
GPU
Terminology

CPU

Clock-cycles

Caching

CPUs have small pools of fast memory (caches) that store the information the CPU is most likely to need next. The goal of the caching system is to ensure that the next piece of data the CPU needs is already loaded into the cache by the time the CPU goes looking for it.
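
To make the idea concrete, here is a minimal sketch of a direct-mapped cache, written as a toy simulation rather than a model of any real CPU. The function name `hit_rate` and the sizes (64 lines of 64 bytes) are assumptions chosen for illustration. It only models reuse of data within an already-loaded cache line, not hardware prefetching.

```python
def hit_rate(addresses, num_lines=64, line_size=64):
    """Simulate a direct-mapped cache and return the fraction of hits.

    num_lines and line_size are illustrative values, not taken from a real CPU.
    """
    tags = [None] * num_lines  # which memory block each cache line currently holds
    hits = 0
    for addr in addresses:
        block = addr // line_size   # memory block that contains this byte
        index = block % num_lines   # cache line that block maps to
        if tags[index] == block:
            hits += 1               # data was already in the cache
        else:
            tags[index] = block     # miss: load the block, evicting the old one
    return hits / len(addresses)


# Sequential 4-byte reads: one miss per 64-byte line, then 15 hits on that line.
sequential = [i * 4 for i in range(4096)]
# Reads that jump a whole cache line each time: every access touches a new block.
strided = [i * 64 for i in range(4096)]

print(hit_rate(sequential))  # 0.9375
print(hit_rate(strided))     # 0.0
```

The sequential pattern hits because each miss brings in a whole line that the next 15 reads reuse; the strided pattern never reuses a line, so the cache does not help.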

Branch-prediction

A branch predictor is a digital circuit that tries to guess which way a branch (e.g. an if-then-else structure) will go before it is known for sure.

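The performance gain comes from keeping the pipeline busy: instead of stalling until the branch is resolved, the CPU fetches and decodes the instructions on the predicted path, which also gives it a head start on loading the data that path needs into the cache.

It also allows speculative execution of the predicted branch, with the results discarded if the prediction turns out to be wrong.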
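
One classic textbook scheme for making the guess is a 2-bit saturating counter per branch. The sketch below simulates a single such counter; it is an illustration of the idea, not how any particular CPU implements its predictor, and the name `prediction_accuracy` and the two example branch patterns are made up for this note.

```python
def prediction_accuracy(outcomes):
    """Run one 2-bit saturating counter over a list of branch outcomes.

    Counter states 0-1 predict "not taken", states 2-3 predict "taken".
    Returns the fraction of outcomes that were predicted correctly.
    """
    counter = 2  # start in "weakly taken" (an arbitrary choice)
    correct = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken == taken:
            correct += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / len(outcomes)


# A loop branch: taken 9 times, then falls through once, repeated 100 times.
loop_branch = ([True] * 9 + [False]) * 100
# A branch that flips direction on every execution.
alternating = [True, False] * 500

print(prediction_accuracy(loop_branch))  # 0.9
print(prediction_accuracy(alternating))  # 0.5
```

The loop branch is mispredicted only on each loop exit, while the alternating branch defeats this simple scheme half the time, which is the kind of pattern more advanced predictors use branch history to catch.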

Misprediction

The time that is wasted in case of a branch misprediction is equal to the number of stages in the pipeline from the fetch stage to the execute stage. Modern microprocessors tend to have quite long pipelines so that the misprediction delay is between 10 and 20 clock cycles. As a result, making a pipeline longer increases the need for a more advanced branch predictor.
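
A rough way to see why the penalty matters is the usual back-of-the-envelope cycles-per-instruction (CPI) estimate. The numbers below (base CPI of 1.0, 20% of instructions being branches, 5% of those mispredicted) are assumptions picked for illustration; only the 10 to 20 cycle penalty range comes from the text above.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Average cycles per instruction once misprediction stalls are included."""
    # Each instruction pays the penalty with probability
    # (it is a branch) * (the branch is mispredicted).
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Same workload on a shorter and a longer pipeline (hypothetical values).
short_pipeline = effective_cpi(1.0, 0.20, 0.05, penalty_cycles=10)
long_pipeline = effective_cpi(1.0, 0.20, 0.05, penalty_cycles=20)

print(round(short_pipeline, 2))  # 1.1
print(round(long_pipeline, 2))   # 1.2
```

Under these assumptions, doubling the penalty doubles the cycles lost to mispredictions, so the longer pipeline needs a lower misprediction rate just to break even.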

GPU

Architecture

At the GPU level, the basic building block is the streaming multiprocessor (SM).

Streaming multiprocessor (SM)

The part of the GPU that runs CUDA kernels.
Each SM contains:
- Thousands of registers that can be partitioned among threads of execution
- Caches:
  - Shared memory for fast data interchange between threads
  - Constant cache for fast broadcast of reads from constant memory
  - Texture cache to aggregate bandwidth from texture memory
  - L1 cache to reduce latency to local or global memory
- Warp schedulers that can quickly switch contexts between threads and issue instructions to warps that are ready to execute
- Execution cores for integer and floating-point operations:
  - Integer and single-precision floating point operations
  - Double-precision floating point
  - Special Function Units (SFUs) for single-precision floating-point transcendental functions
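
The register partitioning in the list above has a practical consequence: the more registers each thread uses, the fewer threads can be resident on an SM at once. The sketch below is a simplified estimate only. The defaults (65536 registers and at most 2048 threads per SM) are assumptions that match some NVIDIA GPUs but vary by architecture, the function name `resident_threads` is made up for this note, and real hardware allocates registers with a coarser granularity than modeled here.

```python
def resident_threads(regs_per_thread, regs_per_sm=65536,
                     max_threads_per_sm=2048, warp_size=32):
    """Upper bound on threads resident on one SM, limited by register use.

    regs_per_sm and max_threads_per_sm are assumed example values;
    check the specification of the actual GPU.
    """
    by_registers = regs_per_sm // regs_per_thread
    # Threads are scheduled in whole warps, so round down to a warp multiple.
    by_registers -= by_registers % warp_size
    return min(by_registers, max_threads_per_sm)


for regs in (32, 64, 128):
    print(regs, resident_threads(regs))
# 32 2048
# 64 1024
# 128 512
```

With fewer resident threads, the warp schedulers have fewer ready warps to switch to while others wait on memory, so heavy register use can reduce how well latency is hidden.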