Hello, this is Miruware.

Here is some material worth referencing for CUDA optimization. It shows the performance of each operation.

In the case of 64-bit integer multiplication, the overhead is quite large. Although the data is from August 2009, I think you will still find it useful.

http://ryan.bergstrom.ca/2009/08/cuda-instruction-performance.html

Toying around with CUDA some more, this time porting AES, since it parallelizes so nicely. On a regular CPU, you can optimize the Rijndael algorithm heavily. By default it's byte-oriented, but it works on a grid of 4x4 bytes (i.e., 4 ints on a 32-bit or wider system). A paper that Wikipedia led me to apparently describes a method to reduce a cipher round to 16 table lookups and 16 32-bit XOR operations, provided you have room for four 256-entry 32-bit tables (4 KB). Seeing as the paper isn't free, I decided to check whether it was even worth it for a GPU-based implementation.

Calculations vs Table Lookup

| Method                 | Time    |
|------------------------|---------|
| Parameter Table Lookup | 1293.92 |
| Local Table Lookup     | 1488.25 |
| Global Table Lookup    | 955.37  |
| Runtime Calculation    | 755.83  |

For reference, all the tests I did followed the same pattern: store clock(), run the operation 100,000 times, store clock(). I have no idea what unit clock() actually returns on the GPU, but you can still see the relative performance. In this case, the calculation was squaring a 32-bit integer (x * x), compared to looking it up in various tables. Yes, initializing a local table on every function call is stupid, but I figured I might as well be complete. Turns out I needn't have bothered, though, since it's the slowest by a fair margin. The global table was defined as __constant__, so it's put in the read-only segment and is fairly quick. So, as you can see, the table lookup is worth it if you can replace more than one operation.
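The pattern described above might look something like this as a CUDA kernel. This is my reconstruction under stated assumptions, not the author's actual harness; the kernel and variable names are illustrative:

```cuda
// Illustrative micro-benchmark kernel following the pattern described:
// read clock(), run the operation 100,000 times, read clock() again.
// On the device, clock() returns a per-multiprocessor cycle counter.
__global__ void bench_square(unsigned int *elapsed, unsigned int *sink)
{
    unsigned int x = threadIdx.x + 1;
    unsigned int acc = 0;

    clock_t start = clock();
    for (int i = 0; i < 100000; ++i)
        acc += x * x;              // the operation under test (32-bit square)
    clock_t stop = clock();

    *sink    = acc;                // write the result so the compiler cannot
                                   // eliminate the loop as dead code
    *elapsed = (unsigned int)(stop - start);
}
```

A real harness also has to be careful that the compiler does not hoist the constant operation out of the loop, e.g. by making the operand volatile or varying it per iteration.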

Next, to check whether the data type has any effect on access time.

Memory Read/Write

| Type           | Time   |
|----------------|--------|
| 8-bit Integer  | 640.74 |
| 16-bit Integer | 709.01 |
| 32-bit Integer | 658.02 |
| 64-bit Integer | 679.58 |
| 32-bit Float   | 659.31 |
| 64-bit Double  | 678.47 |

Apparently, yes. This test just read the element from the input array and wrote it to the output. The GTX260 uses 14x32-bit buses, so the fact that 32-bit accesses are fast isn't a surprise. The fact that 8-bit access is so fast, however, is.

The implementation of AES I'm using as a reference also uses the modulo operator, which according to the CUDA handbook is slow. You can replace modulo with a bitwise AND if the second operand is a power of two (which, for 128- and 256-bit AES, it always is).

Modulo Operator

| Method      | Time   |
|-------------|--------|
| Bitwise AND | 699.97 |
| % Operator  | 791.57 |

So, it's faster, but not by much. It also won't work for 192-bit AES. Next up: the handbook mentions the __umul24 intrinsic as being faster than integer multiplication for operands of 24 bits or less.
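The power-of-two trick itself is a one-liner; here is a minimal device-side sketch (the function name is mine, not from the post):

```cuda
// i % n == i & (n - 1) holds only when n is a power of two.
// AES-128 and AES-256 index their key material in power-of-two strides
// (4 and 8 words), so the trick applies; AES-192 works in groups of
// 6 words, which is why it doesn't work there.
__device__ __forceinline__ unsigned int mod_pow2(unsigned int i,
                                                 unsigned int n)
{
    return i & (n - 1u);   // e.g. i % 4 becomes i & 3
}
```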

Multiplication

| Type                      | Time   |
|---------------------------|--------|
| 8-bit Integer             | 671.05 |
| 16-bit Integer            | 737.53 |
| 32-bit Integer            | 751.1  |
| 32-bit Integer (__umul24) | 684.65 |
| 64-bit Integer            | 907.95 |
| 32-bit Float              | 688.49 |
| 64-bit Double             | 722.59 |

Not bad. I'm running a GTX260, which supports Compute 1.3. I understand that on older cards integer multiplication was MUCH slower, but still, this is a big boost if you can use it. Mind you, the intrinsic still takes two 32-bit integers as arguments and only uses 24 bits' worth, so if you pass it a byte or short you'll take a penalty for the type cast. As an aside, 8-bit multiplication seems to be effectively tied with 32-bit floats, which is unexpected.
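A sketch of how __umul24 is typically used, e.g. for thread index arithmetic where values comfortably stay under 2^24 (the kernel and names are illustrative, not from the post):

```cuda
// __umul24 multiplies the low 24 bits of each unsigned int operand,
// which was faster than a full 32-bit multiply on compute 1.x parts.
// Passing a uchar/ushort forces a widening conversion first.
__global__ void scale_indices(unsigned int *out, unsigned int n)
{
    // Grid and block dimensions are far below 2^24, so this is safe.
    unsigned int i = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
    if (i < n)
        out[i] = __umul24(i, 3u);   // valid only while i < 2^24
}
```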

There's also a handy __fdividef intrinsic for float divisions. Division is VERY slow relative to all the other arithmetic ops, so let's see how it does.

Division

| Type                      | Time    |
|---------------------------|---------|
| 8-bit Integer             | 3296.85 |
| 16-bit Integer            | 3352.55 |
| 32-bit Integer            | 3165.96 |
| 64-bit Integer            | 2375.31 |
| 32-bit Float              | 3342.31 |
| 32-bit Float (__fdividef) | 932.61  |
| 64-bit Double             | 3325.55 |

Um, wow. I think that speaks for itself. CUDA also provides intrinsics for the trig functions, which presumably use a hardware implementation on the card instead of a regular FPU.

Sine

| Variant               | Time    |
|-----------------------|---------|
| 32-bit Float (sinf)   | 1602.38 |
| 32-bit Float (__sinf) | 968.16  |
| 64-bit Double (sinf)  | 1741.53 |
| 64-bit Double (__sinf)| 1079.28 |

Cosine

| Variant               | Time    |
|-----------------------|---------|
| 32-bit Float (cosf)   | 1558.02 |
| 32-bit Float (__cosf) | 964.72  |
| 64-bit Double (cosf)  | 1716.69 |
| 64-bit Double (__cosf)| 1083.21 |

Logarithm

| Variant               | Time    |
|-----------------------|---------|
| 32-bit Float (logf)   | 1514.78 |
| 32-bit Float (__logf) | 968.03  |
| 64-bit Double (logf)  | 1636.57 |
| 64-bit Double (__logf)| 1084.61 |

As expected, much faster.
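A minimal sketch showing the intrinsic variants side by side with the standard calls (the kernel name and data layout are my own; note that the intrinsics trade accuracy for speed):

```cuda
// Fast, reduced-accuracy single-precision intrinsics, which map to the
// hardware special-function units, next to the standard library calls.
__global__ void fast_math_demo(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x = in[i];
    float s = __sinf(x);            // vs. sinf(x)
    float c = __cosf(x);            // vs. cosf(x)
    float l = __logf(x + 2.0f);     // vs. logf(); argument kept positive
    out[i]  = __fdividef(s + c, l); // vs. (s + c) / l
}
```

Compiling with nvcc's --use_fast_math flag maps the standard single-precision calls to these intrinsics automatically, at the same accuracy cost.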

Addition and subtraction are somewhat surprising, since it seems that 8- and 32-bit integer operations are faster than floats.

Addition

| Type           | Time   |
|----------------|--------|
| 8-bit Integer  | 669.16 |
| 16-bit Integer | 735.79 |
| 32-bit Integer | 686.73 |
| 64-bit Integer | 739.62 |
| 32-bit Float   | 691.78 |
| 64-bit Double  | 722.62 |

Subtraction

| Type           | Time   |
|----------------|--------|
| 8-bit Integer  | 700.6  |
| 16-bit Integer | 784.1  |
| 32-bit Integer | 703.18 |
| 64-bit Integer | 766.03 |
| 32-bit Float   | 736.78 |
| 64-bit Double  | 781.79 |

Last, the bitwise operators.

Bitwise AND

| Type           | Time   |
|----------------|--------|
| 8-bit Integer  | 859.35 |
| 16-bit Integer | 945.53 |
| 32-bit Integer | 753.48 |
| 64-bit Integer | 878.86 |

Bitwise NOT

| Type           | Time   |
|----------------|--------|
| 8-bit Integer  | 849.93 |
| 16-bit Integer | 930.9  |
| 32-bit Integer | 741.51 |
| 64-bit Integer | 886.59 |

Bitwise OR

| Type           | Time   |
|----------------|--------|
| 8-bit Integer  | 861.99 |
| 16-bit Integer | 942.57 |
| 32-bit Integer | 753.31 |
| 64-bit Integer | 863.66 |

Bitwise XOR

| Type           | Time   |
|----------------|--------|
| 8-bit Integer  | 859.55 |
| 16-bit Integer | 942.94 |
| 32-bit Integer | 753.64 |
| 64-bit Integer | 861.2  |

So, 32-bit bitwise operations are the fastest by a fair margin. Given that the majority of the operations in AES are XORs, combining four 8-bit XORs into one 32-bit XOR should be a big win.
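Word-at-a-time XOR is straightforward as long as the state and key are 4-byte aligned; a minimal sketch (function and names are illustrative, not from the reference implementation):

```cuda
// The AES state is 16 bytes = 4 words, so the AddRoundKey step can
// XOR word-at-a-time instead of byte-at-a-time. Both pointers must be
// 4-byte aligned for the reinterpret casts to be valid.
__device__ void add_round_key(unsigned char *state,
                              const unsigned char *round_key)
{
    unsigned int *s       = reinterpret_cast<unsigned int *>(state);
    const unsigned int *k = reinterpret_cast<const unsigned int *>(round_key);
    for (int i = 0; i < 4; ++i)
        s[i] ^= k[i];   // one 32-bit XOR replaces four 8-bit XORs
}
```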

So, I still haven't definitively answered the question that prompted all this, but at least I answered a few others I hadn't bothered asking yet.