A simple question: which is faster, “saturate(x)” or “max(0, x)”?

It’s a small question, but the answer is worth knowing for anyone who writes shader programs.

For example, when writing a shader for diffuse lighting, should the lighting term be clamped with saturate(x), or with max(0, x)?

saturate(x) is equivalent to max(0, min(1, x)), so it is natural to assume that max(0, x) is faster. However, that is not necessarily true.
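Functionally, the two clamps are interchangeable for diffuse lighting, which is why only their speed matters here. The following is a minimal sketch in Python, used purely to illustrate the arithmetic; the `normal` and `lights` vectors are hypothetical and assumed to be unit length, and none of the names come from the original shader.

```python
def saturate(x: float) -> float:
    """Clamp x to [0, 1], i.e. max(0, min(1, x))."""
    return max(0.0, min(1.0, x))


def max0(x: float) -> float:
    """Clamp only the negative side, i.e. max(0, x)."""
    return max(0.0, x)


def dot(a, b) -> float:
    """Dot product of two equally sized vectors."""
    return sum(p * q for p, q in zip(a, b))


# Hypothetical unit-length surface normal and light directions.
normal = (0.0, 1.0, 0.0)
lights = [(0.0, 1.0, 0.0), (0.6, 0.8, 0.0), (1.0, 0.0, 0.0), (0.0, -1.0, 0.0)]

for light in lights:
    n_dot_l = dot(normal, light)
    # For unit vectors the dot product lies in [-1, 1],
    # so the upper clamp of saturate never triggers.
    assert saturate(n_dot_l) == max0(n_dot_l)

# The two functions differ only when the input can exceed 1.
print(saturate(1.5), max0(1.5))
```

If the input can exceed 1 (for example, an unnormalized vector or an HDR term), the two are no longer equivalent, and the choice is about correctness rather than speed.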

Microsoft’s shader assembly has a saturate modifier (_sat):
http://msdn.microsoft.com/en-us/library/windows/desktop/bb219849(v=vs.85).aspx
That is, “r0 = saturate(r1 + r2)” can be written as a single instruction, for example “add_sat r0, r1, r2”.

Of course, this doesn’t mean that all GPUs have this kind of modifier, but most GPUs that support DirectX probably follow this specification. So I had always used “saturate(x)” instead of “max(0, x)”.

However, I found that Unity’s built-in shaders use max(0, x) for diffuse lighting. Indeed, if a GPU doesn’t have a saturate modifier, max(0, x) must be faster than saturate(x). Hmm…, that might be the case on some mobile GPUs in particular. Let’s check!

Before I describe the details, I would like to summarize the conclusion.

Conclusion:

  • In most cases, saturate(x) is as fast as or faster than max(0, x), and it is effectively free: saturate(x) can run as fast as x alone.
  • PowerVR doesn’t have a saturate modifier for ‘float’ and ‘half’ variables; the modifier is available only for ‘fixed’ variables. This was the only case in which saturate(x) was slower than max(0, x). For ‘fixed’ variables, saturate(x) and x had the same performance.
  • Tegra 3 seems to have a saturate modifier, and in some specific cases it also seems to have a max(0, x) modifier. Tegra 3 might have a very complicated architecture, and its performance is unpredictable! However, saturate(x) was always as fast as or faster than max(0, x).
  • Adreno has a saturate modifier. saturate(x) is free.
  • Mali might have a ‘max(0, x)’ modifier as well as a saturate modifier. saturate(x), max(0, x) and x had the same performance.

Additional conclusion:

  • You should use ‘fixed’ variables on PowerVR. ‘half’ and ‘float’ are very slow there (and both run at the same speed).
  • The number of input variables of the fragment shader affects the performance on Adreno and Mali GPUs. On those GPUs, ‘half’ and ‘fixed’ have the same performance, and ‘float’ is a bit slower.
  • On Tegra 3, it is not obvious whether the precision of a variable affects the performance. Using fixed precision improved the performance in some cases.

Test 1:
I tested the following shader with Unity. It has 3 passes, which correspond to saturate(x), max(0, x), and x, respectively.

This is the test code for rendering. It measures the frame rate for each ‘m_pass’ (= 0, 1, 2). ‘m_quadCount’ is adjusted for each device.

Result 1:

| | iPod touch 4th gen (PowerVR SGX 535) | Galaxy Nexus (PowerVR SGX 540) | Nexus 7 (2012) (NVIDIA Tegra 3) | XPERIA M2 (Adreno 305) | Galaxy SII (Mali-400) |
|---|---|---|---|---|---|
| m_quadCount | 50 | 50 | 15 | 15 | 15 |
| saturate(x) | 47 FPS | 53 FPS | 26 FPS | 46 FPS | 38 FPS |
| max(0, x) | 47 FPS | 53 FPS | 18 FPS | 42 FPS | 38 FPS |
| x | 47 FPS | 53 FPS | 49 FPS | 46 FPS | 38 FPS |

This result shows that saturate(x) actually has a performance advantage over max(0, x) on Adreno and Tegra 3. However, on Tegra 3, saturate(x) was slower than x. Judging from the frame rates, it seems that saturate(x) takes 2 cycles and x takes 1 cycle. Maybe Tegra 3 can perform ‘x*(y-z)’ in 1 cycle.
To check this hypothesis, I tested multiplication instead of addition like this:

Then the frame rates on Tegra 3 changed completely: they were 26 FPS in all cases. Hmm…, so x takes 2 cycles now, but max(0, x) also takes only 2 cycles. Does Tegra 3 have a max(0, x) modifier for multiplication?
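The 1-cycle versus 2-cycle reading of the Tegra 3 numbers in Result 1 can be sanity-checked by converting frame rates to frame times. This is a rough sketch in Python that assumes the frame time is dominated by the fragment shader and ignores any fixed per-frame overhead; the helper name `frame_time_ms` is made up for illustration.

```python
def frame_time_ms(fps: float) -> float:
    """Convert a frame rate to the time spent per frame, in milliseconds."""
    return 1000.0 / fps


# Tegra 3 (Nexus 7) frame rates from Result 1.
tegra3_fps = {"saturate(x)": 26, "max(0, x)": 18, "x": 49}

# Cost of each variant relative to the plain 'x' pass.
baseline = frame_time_ms(tegra3_fps["x"])
relative_cost = {
    name: frame_time_ms(fps) / baseline for name, fps in tegra3_fps.items()
}

for name, cost in relative_cost.items():
    print(f"{name:12s} {frame_time_ms(tegra3_fps[name]):5.1f} ms  x{cost:.2f}")
```

Under these assumptions saturate(x) costs about 1.9 times as much as x and max(0, x) about 2.7 times as much, which is roughly in line with 2 and 3 units of work against 1.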

On the other GPUs, there was no performance difference. Maybe there was another bottleneck, such as pixel fill rate or the interpolator. I suspected the interpolator, so in the next test I reduced the number of input parameters passed from the vertex shader to the fragment shader.

Test 2:
I modified the shader from Test 1 as follows. The v2f struct now has only one parameter, ‘fixed4 col : TEXCOORD0;’.

Result 2:

| | iPod touch 4th gen (PowerVR SGX 535) | Galaxy Nexus (PowerVR SGX 540) | Nexus 7 (2012) (NVIDIA Tegra 3) | XPERIA M2 (Adreno 305) | Galaxy SII (Mali-400) |
|---|---|---|---|---|---|
| m_quadCount | 50 | 50 | 15 | 30 | 15 |
| saturate(x) | 47 FPS | 53 FPS | 26 FPS | 48 FPS | 49 FPS |
| max(0, x) | 47 FPS | 53 FPS | 18 FPS | 42 FPS | 49 FPS |
| x | 47 FPS | 53 FPS | 49 FPS | 48 FPS | 49 FPS |

There was a difference between Test 1 and Test 2 on Adreno and Mali. In particular, Adreno’s performance doubled (note that m_quadCount is 30 in Test 2). Does that mean the interpolator was the bottleneck? No: on Adreno there was still a performance difference between saturate and max, so the interpolator was not the bottleneck. Maybe the interpolation is performed inside the fragment shader, so the number of parameters passed from the vertex shader to the fragment shader affected the performance.
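Because m_quadCount differs between the two tests on Adreno, the frame rates alone hide the improvement. One way to compare them is quads drawn per second (m_quadCount × FPS). The sketch below, in Python, uses the Adreno 305 numbers from Result 1 and Result 2; the helper name `quads_per_second` is hypothetical.

```python
def quads_per_second(quad_count: int, fps: float) -> float:
    """Throughput: how many quads are rendered per second."""
    return quad_count * fps


# Adreno 305 (XPERIA M2) numbers from Result 1 and Result 2.
adreno = {
    "Test 1": {"quad_count": 15, "fps": {"saturate(x)": 46, "max(0, x)": 42, "x": 46}},
    "Test 2": {"quad_count": 30, "fps": {"saturate(x)": 48, "max(0, x)": 42, "x": 48}},
}

throughput = {
    test: {
        variant: quads_per_second(data["quad_count"], fps)
        for variant, fps in data["fps"].items()
    }
    for test, data in adreno.items()
}

# Speedup of Test 2 over Test 1 for each shader variant.
speedup = {
    variant: throughput["Test 2"][variant] / throughput["Test 1"][variant]
    for variant in throughput["Test 1"]
}

for variant, ratio in speedup.items():
    print(f"{variant:12s} x{ratio:.2f}")
```

All three variants come out at roughly 2x, while the gap between saturate(x) and max(0, x) remains in both tests.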

How about Mali? It is still possible that the interpolator is the bottleneck. However, there is only one parameter to interpolate, so it is more natural to think that Mali has a max(0, x) modifier as well as a saturate modifier. To confirm this, I tested max(0.1, x) instead of max(0, x). The performance changed: the frame rate dropped from 49 FPS to 38 FPS.

Regarding PowerVR, I made a mistake in the tests above. PowerVR has a unique tile-based rendering architecture that heavily optimizes opaque polygon rendering. As a result, only a single quad (or a few quads) was actually rendered on the screen on PowerVR. That is why m_quadCount could be as high as 50 on these devices.

So, I enabled alpha blending in the next test.

Test 3:
Enable alpha blending as follows:

Result 3:

| | iPod touch 4th gen (PowerVR SGX 535) | Galaxy Nexus (PowerVR SGX 540) | Nexus 7 (2012) (NVIDIA Tegra 3) | XPERIA M2 (Adreno 305) | Galaxy SII (Mali-400) |
|---|---|---|---|---|---|
| m_quadCount | 5 | 5 | 15 | 30 | 15 |
| saturate(x) | 20 FPS | 45 FPS | 29 FPS | 48 FPS | 49 FPS |
| max(0, x) | 16 FPS | 38 FPS | 29 FPS | 42 FPS | 49 FPS |
| x | 20 FPS | 45 FPS | 29 FPS | 48 FPS | 49 FPS |

Yes! The performance changed drastically on PowerVR: ‘m_quadCount’ is now only 5. There is also a performance difference between saturate(x) and max(0, x), so PowerVR also has a saturate modifier!

There was no difference between Test 2 and Test 3 on Adreno and Mali.

On Tegra 3, the frame rate was 29 FPS in all cases. The fragment shader used in Test 3 always returns zero alpha, so maybe the shader was optimized into the same code in all cases. So I changed the alpha blending like this:

Then the result on Tegra 3 was exactly the same as in Test 1 and Test 2.

At this point, I had found that all of the GPUs have a saturate modifier for ‘fixed’ variables. How about ‘float’ and ‘half’?

Test 4:
Replace ‘fixed4’ with ‘float4’, and use the following alpha blend:

Result 4:

| | iPod touch 4th gen (PowerVR SGX 535) | Galaxy Nexus (PowerVR SGX 540) | Nexus 7 (2012) (NVIDIA Tegra 3) | XPERIA M2 (Adreno 305) | Galaxy SII (Mali-400) |
|---|---|---|---|---|---|
| m_quadCount | 2 | 5 | 15 | 30 | 15 |
| saturate(x) | 16 FPS | 15 FPS | 26 FPS | 37 FPS | 43 FPS |
| max(0, x) | 21 FPS | 19 FPS | 18 FPS | 30 FPS | 43 FPS |
| x | 30 FPS | 23 FPS | 26 FPS | 37 FPS | 43 FPS |

Replacing ‘fixed4’ with ‘float4’ lowered the performance on every GPU except Tegra 3. The performance of PowerVR was especially bad (note that m_quadCount is 2 on the iPod touch).
Additionally, saturate(x) is slower than max(0, x) here! PowerVR doesn’t have a saturate modifier for ‘float’ variables.
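To see how much the clamp costs on PowerVR with ‘float4’, the Result 4 frame rates can be converted to frame times and compared against the plain x pass. This is only a rough sketch in Python, assuming the frame time is fragment-shader bound; the helper names are made up for illustration.

```python
def frame_time_ms(fps: float) -> float:
    """Convert a frame rate to the time spent per frame, in milliseconds."""
    return 1000.0 / fps


def extra_cost_ms(fps_by_variant: dict) -> dict:
    """Frame time added by each clamp on top of the plain 'x' pass."""
    base = frame_time_ms(fps_by_variant["x"])
    return {
        variant: frame_time_ms(fps) - base
        for variant, fps in fps_by_variant.items()
        if variant != "x"
    }


# PowerVR frame rates from Result 4 ('float4').
result4_fps = {
    "iPod touch 4th gen (SGX 535)": {"saturate(x)": 16, "max(0, x)": 21, "x": 30},
    "Galaxy Nexus (SGX 540)": {"saturate(x)": 15, "max(0, x)": 19, "x": 23},
}

extra = {device: extra_cost_ms(fps) for device, fps in result4_fps.items()}

for device, costs in extra.items():
    for variant, ms in costs.items():
        print(f"{device:30s} {variant:12s} +{ms:.1f} ms")
```

On both devices saturate(x) adds roughly two to two and a half times as much frame time as max(0, x), which would be consistent with saturate being executed as separate max and min operations for ‘float’, though the measurements alone cannot confirm that.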

Let’s test ‘half’ next.

Test 5:
Replace ‘float4’ with ‘half4’ in the Test 4 shader.

Result 5:

| | iPod touch 4th gen (PowerVR SGX 535) | Galaxy Nexus (PowerVR SGX 540) | Nexus 7 (2012) (NVIDIA Tegra 3) | XPERIA M2 (Adreno 305) | Galaxy SII (Mali-400) |
|---|---|---|---|---|---|
| m_quadCount | 2 | 5 | 15 | 30 | 15 |
| saturate(x) | 16 FPS | 15 FPS | 26 FPS | 48 FPS | 49 FPS |
| max(0, x) | 21 FPS | 19 FPS | 18 FPS | 42 FPS | 49 FPS |
| x | 30 FPS | 23 FPS | 26 FPS | 48 FPS | 49 FPS |

Result 4 and Result 5 show that on Tegra 3 there is no difference among ‘fixed’, ‘half’ and ‘float’. On Adreno and Mali, ‘float’ is somewhat slower than ‘half’, while ‘half’ and ‘fixed’ run at the same speed.
On PowerVR, ‘fixed’ is much faster than ‘half’ and ‘float’, and ‘half’ and ‘float’ run at the same speed. Also, PowerVR doesn’t have a saturate modifier for ‘half’ and ‘float’ variables.
