FAQ-CUDA

A comparison of per-operation performance for CUDA optimization.

Miruware

2010-05-04


Hello, this is Miruware.

 

Here is some material worth consulting when optimizing CUDA code. It shows the performance of each operation.

64-bit integer multiplication in particular carries a very large overhead. The data is from August 2009, but we think you will still find it useful.

 

 http://ryan.bergstrom.ca/2009/08/cuda-instruction-performance.html

 

 

Toying around with CUDA some more, this time porting AES since it parallelizes so nicely. On a regular CPU, you can optimize the Rijndael algorithm heavily. By default it's byte-oriented, but the state is a 4x4 grid of bytes (i.e. 4 ints on a 32-bit or higher system). A paper that Wikipedia led me to apparently describes a method to reduce a cipher round to 16 table lookups and 16 32-bit XOR operations, provided you have room for 4 256-entry 32-bit tables (4 KB). Now, seeing as how the paper isn't free, I decided to see if it was even worth it for a GPU-based implementation.
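
For context, here is a minimal sketch of what that formulation looks like; this is the standard T-table form of a Rijndael encryption round, not necessarily the exact method in the paywalled paper. The tables T0 through T3, the state words s, and the round key rk are assumptions of the sketch; table contents and the key schedule are omitted.

    // Sketch of one T-table round (standard formulation): 4 lookups and
    // 4 XORs per output column, 16 of each across the whole state.
    __constant__ unsigned int T0[256], T1[256], T2[256], T3[256];

    __device__ void aes_round(const unsigned int s[4],  // input state columns
                              unsigned int t[4],        // output state columns
                              const unsigned int rk[4]) // this round's key words
    {
        // Byte selection follows the ShiftRows pattern.
        t[0] = T0[s[0] >> 24] ^ T1[(s[1] >> 16) & 0xff] ^ T2[(s[2] >> 8) & 0xff] ^ T3[s[3] & 0xff] ^ rk[0];
        t[1] = T0[s[1] >> 24] ^ T1[(s[2] >> 16) & 0xff] ^ T2[(s[3] >> 8) & 0xff] ^ T3[s[0] & 0xff] ^ rk[1];
        t[2] = T0[s[2] >> 24] ^ T1[(s[3] >> 16) & 0xff] ^ T2[(s[0] >> 8) & 0xff] ^ T3[s[1] & 0xff] ^ rk[2];
        t[3] = T0[s[3] >> 24] ^ T1[(s[0] >> 16) & 0xff] ^ T2[(s[1] >> 8) & 0xff] ^ T3[s[2] & 0xff] ^ rk[3];
    }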

 

Calculations vs Table Lookup (relative clock() counts; lower is better)

Parameter Table Lookup  1293.92
Local Table Lookup  1488.25
Global Table Lookup  955.37
Runtime Calculation  755.83

For reference, all the tests I did followed the same pattern: store clock(), run the operation 100,000 times, store clock() again. I have no idea what unit clock() actually returns on the GPU, but you can still see the relative performance. In this case, the calculation was squaring a 32-bit integer (x * x), compared to looking it up in various tables. Yes, initializing a local table on every function call is stupid, but I figured I might as well be complete. Turns out I needn't have bothered, since it's the slowest by a fair margin. The global table was defined as __constant__, so it's put in the read-only segment and is fairly quick. So, as you can see, the table lookup is worth it if you can replace more than one operation.
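
A minimal sketch of that timing pattern, reconstructed from the description above (the kernel name, the dependent-chain trick, and the output arrays are my assumptions, not the author's actual harness):

    // Bracket 100,000 iterations of the operation under test with clock().
    // Feeding each result back into the next iteration keeps the compiler
    // from hoisting or deleting the loop.
    __global__ void bench_square(const unsigned int *in, unsigned int *out,
                                 int *ticks)
    {
        unsigned int x = in[threadIdx.x];

        clock_t start = clock();
        for (int i = 0; i < 100000; ++i)
            x = x * x;                    // the "Runtime Calculation" case
        clock_t stop = clock();

        out[threadIdx.x] = x;             // keep the result live
        ticks[threadIdx.x] = (int)(stop - start);
    }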

Next, to check whether the data type has any effect on access time.

 

Memory Read/Write

8-bit Integer  640.74
16-bit Integer  709.01
32-bit Integer  658.02
64-bit Integer  679.58
32-bit Float  659.31
64-bit Double  678.47

Apparently, yes. This test just read the element from the input array and wrote it to the output. The GTX260 uses 14 32-bit memory buses, so the fact that 32-bit accesses are fast isn't a surprise. The fact that the 8-bit access is the fastest of all is, however.
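
The test itself is presumably a trivial copy kernel along these lines (a reconstruction; templating over the element type is my choice, not necessarily the author's):

    // One read and one write of sizeof(T) bytes per thread; the same body
    // covers all six rows of the table above.
    template <typename T>
    __global__ void bench_copy(const T *in, T *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];
    }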

The implementation of AES I'm using as a reference also uses the modulo operator, which according to the CUDA handbook is slow. You can replace modulo with a bitwise AND if the second operand is a power of two (which for 128- and 256-bit AES it always is, since the key schedule wraps at Nk = 4 or Nk = 8 words).
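
The rewrite is the usual identity x % 2^n == x & (2^n - 1) for unsigned x; a quick sketch (the function name is illustrative):

    // For unsigned x and a power-of-two divisor, modulo reduces to a mask.
    __device__ unsigned int wrap_index(unsigned int x)
    {
        return x & (8u - 1);    // same result as x % 8, e.g. Nk = 8 for 256-bit AES
        // Nk = 6 for 192-bit AES is not a power of two, so the trick fails there.
    }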

 

Modulo Operator

Bitwise AND  699.97
% Operator  791.57

So, it's faster, but not by much. It also won't work for 192-bit AES (Nk = 6 isn't a power of two). Next up? The handbook mentions the __umul24 intrinsic as being faster than integer multiplication for operands of 24 bits or less.

 

Multiplication

8-bit Integer  671.05
16-bit Integer  737.53
32-bit Integer  751.1
32-bit Integer (__umul24)  684.65
64-bit Integer  907.95
32-bit Float  688.49
64-bit Double  722.59

Not bad. I'm running a GTX260, which supports Compute 1.3. I understand that on older cards integer multiplication was MUCH slower, but still, this is a big boost if you can use it. Mind you, the intrinsic still takes two 32-bit integers as arguments and only uses the low 24 bits of each, so if you pass it a byte or a short, you'll take a penalty for the type cast. As an aside, 8-bit multiplication seems to be effectively tied with 32-bit floats, which is unexpected.
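
Usage is a straight drop-in when the operands are known to fit in 24 bits (the wrapper name here is illustrative):

    // __umul24 returns the low 32 bits of the product of the low 24 bits
    // of each operand, so it only matches a * b when both fit in 24 bits.
    __device__ unsigned int mul_small(unsigned int a, unsigned int b)
    {
        return __umul24(a, b);    // caller must guarantee a, b < 2^24
    }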

There's also a handy __fdividef intrinsic for float division. Division is VERY slow relative to all the other arithmetic ops, so let's see how it does.
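
A hedged sketch of the fast path; per the programming guide, __fdividef trades accuracy for speed and gives wrong results when the divisor's magnitude is very large (around 2^126 and up), so it's only a drop-in when the operands are known to be tame:

    // Fast, reduced-accuracy single-precision divide.
    __device__ float fast_ratio(float x, float y)
    {
        return __fdividef(x, y);    // vs. the plain x / y measured below
    }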

 

Division

8-bit Integer  3296.85
16-bit Integer  3352.55
32-bit Integer  3165.96
64-bit Integer  2375.31
32-bit Float  3342.31
32-bit Float (__fdividef)  932.61
64-bit Double  3325.55

Um, wow. I think that speaks for itself. CUDA also provides intrinsics for the trig functions, which presumably use a hardware implementation on the card instead of a regular FPU.

 

Sine

32-bit Float (sinf)  1602.38
32-bit Float (__sinf)  968.16
64-bit Double (sinf)  1741.53
64-bit Double (__sinf)  1079.28

 

 

Cosine

32-bit Float (cosf)  1558.02
32-bit Float (__cosf)  964.72
64-bit Double (cosf)  1716.69
64-bit Double (__cosf)  1083.21

 

Logarithm

32-bit Float (logf)  1514.78
32-bit Float (__logf)  968.03
64-bit Double (logf)  1636.57
64-bit Double (__logf)  1084.61

As expected, much faster.
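
One caveat worth noting: the double-underscore intrinsics are single-precision only, so the double rows above presumably go through a float conversion. A quick usage sketch (kernel name and expression are illustrative):

    // __sinf, __cosf, and __logf map to the fast hardware paths, with
    // reduced accuracy compared to sinf, cosf, and logf.
    __global__ void fast_trig(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = __logf(__sinf(in[i]) + __cosf(in[i]) + 2.0f);  // argument kept > 0 for the log
    }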

Addition and subtraction are somewhat surprising, since 8- and 32-bit integer operations come out slightly faster than floats.

 

Addition

8-bit Integer  669.16
16-bit Integer  735.79
32-bit Integer  686.73
64-bit Integer  739.62
32-bit Float  691.78
64-bit Double  722.62

 

Subtraction

8-bit Integer  700.6
16-bit Integer  784.1
32-bit Integer  703.18
64-bit Integer  766.03
32-bit Float  736.78
64-bit Double  781.79

 

Last, the bitwise operators.

 

Bitwise AND

8-bit Integer  859.35
16-bit Integer  945.53
32-bit Integer  753.48
64-bit Integer  878.86

 

Bitwise NOT

8-bit Integer  849.93
16-bit Integer  930.9
32-bit Integer  741.51
64-bit Integer  886.59

 

Bitwise OR

8-bit Integer  861.99
16-bit Integer  942.57
32-bit Integer  753.31
64-bit Integer  863.66

 

Bitwise XOR

8-bit Integer  859.55
16-bit Integer  942.94
32-bit Integer  753.64
64-bit Integer  861.2

 

So, 32-bit bitwise operations are the fastest by a fair margin. Given that the majority of the operations in AES are XORs, combining four 8-bit XORs into one 32-bit XOR should be a significant win.
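
The combination is free of correctness worries because XOR has no carries between bit positions; a sketch (assuming the four bytes are contiguous, as they are in a packed AES state column):

    // Four 8-bit XORs done as one 32-bit XOR. The packed result is
    // byte-for-byte identical to four separate byte XORs.
    __device__ void xor4(const unsigned char a[4], const unsigned char b[4],
                         unsigned char out[4])
    {
        unsigned int wa, wb, wo;
        memcpy(&wa, a, 4);      // reinterpret four bytes as one word
        memcpy(&wb, b, 4);
        wo = wa ^ wb;           // one 32-bit XOR instead of four 8-bit ones
        memcpy(out, &wo, 4);
    }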

So, I still haven't definitively answered the question that prompted all this, but at least I answered a few others I hadn't bothered asking yet.
