saturate(x) vs max(0, x)

A simple question: which is faster, “saturate(x)” or “max(0, x)”?

It’s a small detail, but one worth knowing if you write shader programs.

For example, when writing a shader for diffuse lighting, would you write code like this

fixed lit = max(0, dot(lightDir, normal));

or like this?

fixed lit = saturate(dot(lightDir, normal));

saturate(x) means max(0, min(1, x)), so it is natural to assume that max(0, x) is faster. However, that is not necessarily true.
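
To make the definition concrete, here is a minimal CPU-side sketch in Python (not shader code; the function names are made up for illustration). It also shows why the two forms are interchangeable for diffuse lighting: the dot product of two unit vectors never exceeds 1, so the upper clamp never triggers, apart from floating-point rounding.

```python
def saturate(x):
    """Clamp x to [0, 1], i.e. max(0, min(1, x))."""
    return max(0.0, min(1.0, x))


def max0(x):
    """Clamp only the lower bound, i.e. max(0, x)."""
    return max(0.0, x)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


# For inputs that never exceed 1, the two clamps agree...
for x in (-0.5, 0.0, 0.25, 1.0):
    assert saturate(x) == max0(x)

# ...and they differ only above 1.
print(saturate(1.5), max0(1.5))  # 1.0 1.5

# Unit-length light direction and normal (hypothetical values):
# the dot product lies in [-1, 1], so both clamps give the same result.
light_dir = (0.0, 0.6, 0.8)
normal = (0.0, 0.0, 1.0)
lit = dot(light_dir, normal)
print(saturate(lit) == max0(lit))  # True
```

So for a lighting term like this the choice between the two is purely a performance question, which is what the tests below measure.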

Microsoft’s shader assembly has a saturate instruction modifier:
http://msdn.microsoft.com/en-us/library/windows/desktop/bb219849(v=vs.85).aspx
That is, “r0 = saturate(r1 + r2)” can be written as a single instruction:

add_sat r0, r1, r2

Of course, this doesn’t mean that every GPU has such a modifier, but most GPUs that support DirectX probably follow this specification. So I had always used “saturate(x)” instead of “max(0, x)”.

However, I found that Unity’s built-in shaders use max(0, x) for diffuse lighting. Indeed, if a GPU doesn’t have a saturate modifier, max(0, x) should be faster than saturate(x). Hmm… that might well be the case on some mobile GPUs. Let’s check!

Before going into the details, here is a summary of the conclusions.

Conclusion:

In most cases, saturate(x) is as fast as or faster than max(0, x), and it is free: saturate(x) runs as fast as plain x.
PowerVR doesn’t have a saturate modifier for ‘float’ and ‘half’ variables; the modifier is available only for ‘fixed’ variables. This was the only case in which saturate(x) was slower than max(0, x). For ‘fixed’ variables, saturate(x) and x had the same performance.
Tegra 3 seems to have a saturate modifier and, in some specific cases, a max(0, x) modifier as well. Its architecture may be quite complicated, and its performance is hard to predict. However, saturate(x) was always as fast as or faster than max(0, x).
Adreno has a saturate modifier. saturate(x) is free.
Mali seems to have a ‘max(0, x)’ modifier as well as a saturate modifier. saturate(x), max(0, x) and x all had the same performance.

Additional conclusion:

You should use ‘fixed’ variables on PowerVR; ‘half’ and ‘float’ are very slow (and equally slow).
The number of fragment shader input variables affects performance on Adreno and Mali GPUs. On those GPUs, ‘half’ and ‘fixed’ have the same performance, and ‘float’ is a bit slower.
On Tegra 3, it is not clear whether variable precision affects performance. Using ‘fixed’ precision improved performance in some cases.

Test 1:
I tested the following shader in Unity. It has three passes, which correspond to saturate(x), max(0, x), and x, respectively.

Shader "Custom/saturate test" {
	Properties {
		_Color ("Main Color", Color) = (1,1,1,1)
	}
	CGINCLUDE
	#include "UnityCG.cginc"
	fixed4 _Color;
	struct appdata {
		float4 vertex : POSITION;
	};
	struct v2f {
		float4 pos : SV_POSITION;
		fixed4 col1  : TEXCOORD0;
		fixed4 col2  : TEXCOORD1;
	};
	v2f vert(appdata_img v)
	{
		v2f o;
		o.pos = v.vertex;
		o.col1 = 0.5f + 0.5f * v.vertex.xyxw;
		o.col2 = 0.5f + 0.5f * v.vertex.yxyw;
		return o;
	}
	fixed4 frag_saturate(v2f i) : COLOR {
		return _Color * saturate(i.col1 - i.col2);
	}
	fixed4 frag_max(v2f i) : COLOR {
		return _Color * max((fixed4)0, i.col1 - i.col2);
	}
	fixed4 frag_none(v2f i) : COLOR {
		return _Color * (i.col1 - i.col2);
	}
	ENDCG
	SubShader {
		Pass {
			ZTest Always Cull Off ZWrite Off Blend Off
			Fog { Mode off }      
			CGPROGRAM
			#pragma vertex vert
			#pragma fragment frag_saturate
			#pragma fragmentoption ARB_precision_hint_fastest
			ENDCG
		}
		Pass {
			ZTest Always Cull Off ZWrite Off Blend Off
			Fog { Mode off }      
			CGPROGRAM
			#pragma vertex vert
			#pragma fragment frag_max
			#pragma fragmentoption ARB_precision_hint_fastest
			ENDCG
		}
		Pass {
			ZTest Always Cull Off ZWrite Off Blend Off
			Fog { Mode off }      
			CGPROGRAM
			#pragma vertex vert
			#pragma fragment frag_none
			#pragma fragmentoption ARB_precision_hint_fastest
			ENDCG
		}
	}
}

This is the test code for rendering. I measured the frame rate for each value of ‘m_pass’ (0, 1, 2). ‘m_quadCount’ was adjusted for each device.

void OnPostRender()
{
	m_material.SetPass(m_pass);
	GL.Begin(GL.QUADS);
	for (int i = 0; i < m_quadCount; ++i) {
		GL.Vertex3(-1.0f, 1.0f, 0.0f);
		GL.Vertex3(-1.0f,-1.0f, 0.0f);
		GL.Vertex3( 1.0f,-1.0f, 0.0f);
		GL.Vertex3( 1.0f, 1.0f, 0.0f);
	}
	GL.End();
}

Result 1:

	iPod touch 4th gen (PowerVR SGX 535)	Galaxy Nexus (PowerVR SGX 540)	Nexus 7 (2012) (NVIDIA Tegra 3)	XPERIA M2 (Adreno 305)	Galaxy SII (Mali-400)
`m_quadCount`	50	50	15	15	15
`saturate(x)`	47 (FPS)	53 (FPS)	26 (FPS)	46 (FPS)	38 (FPS)
`max(0, x)`	47 (FPS)	53 (FPS)	18 (FPS)	42 (FPS)	38 (FPS)
`x`	47 (FPS)	53 (FPS)	49 (FPS)	46 (FPS)	38 (FPS)

This result shows that saturate(x) does have a performance advantage over max(0, x) on Adreno and Tegra 3. However, on Tegra 3, saturate(x) was slower than x. Judging from the frame rates, saturate(x) seems to take 2 cycles and x 1 cycle. Maybe Tegra 3 can perform ‘x*(y-z)’ in 1 cycle.
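
The cycle estimate can be checked with a little arithmetic. Assuming the test is purely fragment-bound, the time per frame is proportional to the per-pixel shader cost, so the cost of a variant relative to plain x is roughly fps(x) / fps(variant). The sketch below is in Python, the helper name is hypothetical, and it uses the Tegra 3 numbers from Result 1.

```python
def relative_cost(fps_variant, fps_baseline):
    """Per-frame time of a variant relative to the baseline.

    Frame time is 1 / FPS, so the ratio of frame times is
    fps_baseline / fps_variant. This only approximates shader cost
    if the GPU is fragment-bound and nothing else (vsync, fill rate)
    caps the frame rate.
    """
    return fps_baseline / fps_variant


# Tegra 3 frame rates from Result 1.
tegra3 = {"saturate(x)": 26, "max(0, x)": 18, "x": 49}

for name, fps in tegra3.items():
    print(f"{name}: {relative_cost(fps, tegra3['x']):.2f}x")
```

The ratios come out at about 1.9 for saturate(x) and 2.7 for max(0, x). That fits 2 cycles against 1, and would put max(0, x) at close to 3 under the same assumption.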
To check this hypothesis, I tested multiplication instead of subtraction, like this:

fixed4 frag_saturate(v2f i) : COLOR {
	return _Color * saturate(i.col1 * i.col2);
}

Then the frame rates on Tegra 3 changed completely: they were 26 FPS in all cases. Hmm… so x now takes 2 cycles, but max(0, x) also takes only 2 cycles. Does Tegra 3 have a max(0, x) modifier for multiplication?

On the other GPUs, there was no performance difference. Maybe there was another bottleneck, such as pixel fill rate or the interpolator. I suspected the interpolator, so in the next test I reduced the number of input parameters passed from the vertex shader to the fragment shader.

Test 2:
I modified the shader from Test 1 as follows. The v2f struct now has only one interpolated parameter, ‘fixed4 col : TEXCOORD0;’.

	CGINCLUDE
	#include "UnityCG.cginc"
	fixed4 _Color;
	struct appdata {
		float4 vertex : POSITION;
	};
	struct v2f {
		float4 pos : SV_POSITION;
		fixed4 col : TEXCOORD0;
	};
	v2f vert(appdata_img v)
	{
		v2f o;
		o.pos = v.vertex;
		o.col = 0.5f + 0.5f * v.vertex.xyxw;
		return o;
	}
	fixed4 frag_saturate(v2f i) : COLOR {
		return _Color * saturate(i.col.rgba * i.col.grga);
	}
	fixed4 frag_max(v2f i) : COLOR {
		return _Color * max((fixed4)0, i.col.rgba * i.col.grga);
	}
	fixed4 frag_none(v2f i) : COLOR {
		return _Color * (i.col.rgba * i.col.grga);
	}
	ENDCG

Result 2:

	iPod touch 4th gen (PowerVR SGX 535)	Galaxy Nexus (PowerVR SGX 540)	Nexus 7 (2012) (NVIDIA Tegra 3)	XPERIA M2 (Adreno 305)	Galaxy SII (Mali-400)
`m_quadCount`	50	50	15	30	15
`saturate(x)`	47 (FPS)	53 (FPS)	26 (FPS)	48 (FPS)	49 (FPS)
`max(0, x)`	47 (FPS)	53 (FPS)	18 (FPS)	42 (FPS)	49 (FPS)
`x`	47 (FPS)	53 (FPS)	49 (FPS)	48 (FPS)	49 (FPS)

There was a difference between Test 1 and Test 2 on Adreno and Mali. Adreno’s performance in particular doubled (note that m_quadCount is 30 in Test 2). Does this mean the interpolator was the bottleneck? No: on Adreno there was still a performance difference between saturate and max, so the interpolator was not the bottleneck. Perhaps interpolation is performed as part of the fragment shader, so the number of parameters passed from the vertex shader to the fragment shader affects its performance.
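
Because m_quadCount differs between the two tests, the frame rates are not directly comparable. A quick way to normalize is to multiply the frame rate by the number of full-screen quads drawn per frame. Here is a minimal sketch in Python (the helper name is an assumption for illustration), using the Adreno 305 saturate(x) numbers from Result 1 and Result 2:

```python
def quads_per_second(fps, quad_count):
    """Full-screen quads rendered per second."""
    return fps * quad_count


test1 = quads_per_second(fps=46, quad_count=15)  # Result 1, Adreno 305
test2 = quads_per_second(fps=48, quad_count=30)  # Result 2, Adreno 305

print(test1, test2, round(test2 / test1, 2))  # 690 1440 2.09
```

The throughput roughly doubles, which matches the claim above.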

How about Mali? It is still possible that the interpolator is the bottleneck. However, there is only one parameter to interpolate, so it is more natural to think that Mali has a max(0, x) modifier as well as a saturate modifier. To confirm this, I tested max(0.1, x) instead of max(0, x), and the performance changed: the frame rate dropped from 49 to 38 FPS.

Regarding PowerVR, I made a mistake in the tests above. PowerVR has a unique tile-based rendering architecture that can heavily optimize opaque polygon rendering. As a result, only a single quad (or a few quads) was actually rendered on PowerVR. That is why m_quadCount could be as high as 50 on these devices.

So, I enabled alpha blending in the next test.
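
As a rough mental model of why blending matters here, consider the following idealized sketch in Python. It is not a description of the actual hardware, and the function name is made up. With hidden surface removal, a stack of opaque full-screen quads costs about one shaded layer per pixel, whereas blended quads must all be shaded.

```python
def shaded_layers(quad_count, blended, hidden_surface_removal):
    """Idealized number of times each pixel is shaded per frame."""
    if hidden_surface_removal and not blended:
        # Only the topmost opaque quad contributes to the final pixel.
        return 1
    # Every quad is shaded, either because the GPU cannot skip hidden
    # layers or because blending needs every layer's colour.
    return quad_count


print(shaded_layers(50, blended=False, hidden_surface_removal=True))  # 1
print(shaded_layers(5, blended=True, hidden_surface_removal=True))    # 5
```

Under this model, the 50 opaque quads in Tests 1 and 2 cost PowerVR about as much as a single quad, so those frame rates say little about shader cost.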

Test 3:
Enable alpha blending as follows:

	SubShader {
		Pass {
			ZTest Always Cull Off ZWrite Off
			Blend SrcAlpha OneMinusSrcAlpha
			Fog { Mode off }      
			CGPROGRAM
			#pragma vertex vert
			#pragma fragment frag_saturate
			#pragma fragmentoption ARB_precision_hint_fastest
			ENDCG
		}
		Pass {
			ZTest Always Cull Off ZWrite Off
			Blend SrcAlpha OneMinusSrcAlpha
			Fog { Mode off }      
			CGPROGRAM
			#pragma vertex vert
			#pragma fragment frag_max
			#pragma fragmentoption ARB_precision_hint_fastest
			ENDCG
		}
		Pass {
			ZTest Always Cull Off ZWrite Off
			Blend SrcAlpha OneMinusSrcAlpha
			Fog { Mode off }      
			CGPROGRAM
			#pragma vertex vert
			#pragma fragment frag_none
			#pragma fragmentoption ARB_precision_hint_fastest
			ENDCG
		}
	}

Result 3:

	iPod touch 4th gen (PowerVR SGX 535)	Galaxy Nexus (PowerVR SGX 540)	Nexus 7 (2012) (NVIDIA Tegra 3)	XPERIA M2 (Adreno 305)	Galaxy SII (Mali-400)
`m_quadCount`	5	5	15	30	15
`saturate(x)`	20 (FPS)	45 (FPS)	29 (FPS)	48 (FPS)	49 (FPS)
`max(0, x)`	16 (FPS)	38 (FPS)	29 (FPS)	42 (FPS)	49 (FPS)
`x`	20 (FPS)	45 (FPS)	29 (FPS)	48 (FPS)	49 (FPS)

Yes! The performance changed drastically on PowerVR! ‘m_quadCount’ is now only 5. There is also a performance difference between saturate(x) and max(0, x), so PowerVR has a saturate modifier too!

On Adreno and Mali, there was no difference between Test 2 and Test 3.

On Tegra 3, the frame rate was 29 FPS in all cases. The fragment shader used in Test 3 always returns zero alpha, so perhaps the shader was optimized into the same code for all three cases. So, I changed the blend mode like this:

Blend One SrcAlpha

Then, the result on Tegra 3 was exactly the same as in Test 1 and Test 2.

So all of the GPUs have a saturate modifier for ‘fixed’ variables. How about ‘float’ and ‘half’?

Test 4:
Replace ‘fixed4’ with ‘float4’, and use the following alpha blend:

Blend One SrcAlpha

Result 4:

	iPod touch 4th gen (PowerVR SGX 535)	Galaxy Nexus (PowerVR SGX 540)	Nexus 7 (2012) (NVIDIA Tegra 3)	XPERIA M2 (Adreno 305)	Galaxy SII (Mali-400)
`m_quadCount`	2	5	15	30	15
`saturate(x)`	16 (FPS)	15 (FPS)	26 (FPS)	37 (FPS)	43 (FPS)
`max(0, x)`	21 (FPS)	19 (FPS)	18 (FPS)	30 (FPS)	43 (FPS)
`x`	30 (FPS)	23 (FPS)	26 (FPS)	37 (FPS)	43 (FPS)

Replacing ‘fixed4’ with ‘float4’ reduced performance on every GPU except Tegra 3. PowerVR’s performance in particular was very poor (note that m_quadCount is 2 on the iPod touch).
Additionally, saturate(x) is slower than max(0, x) there! PowerVR doesn’t have a saturate modifier for ‘float’ variables.

Let’s test ‘half’ next.

Test 5:
Replace ‘float4’ with ‘half4’ in the Test 4 shader.

Result 5:

	iPod touch 4th gen (PowerVR SGX 535)	Galaxy Nexus (PowerVR SGX 540)	Nexus 7 (2012) (NVIDIA Tegra 3)	XPERIA M2 (Adreno 305)	Galaxy SII (Mali-400)
`m_quadCount`	2	5	15	30	15
`saturate(x)`	16 (FPS)	15 (FPS)	26 (FPS)	48 (FPS)	49 (FPS)
`max(0, x)`	21 (FPS)	19 (FPS)	18 (FPS)	42 (FPS)	49 (FPS)
`x`	30 (FPS)	23 (FPS)	26 (FPS)	48 (FPS)	49 (FPS)

Results 4 and 5 show that Tegra 3 makes no distinction between ‘fixed’, ‘half’ and ‘float’. On Adreno and Mali, ‘float’ is somewhat slower than ‘half’, while ‘half’ and ‘fixed’ run at the same speed.
On PowerVR, ‘fixed’ is much faster than ‘half’ and ‘float’, which run at the same speed as each other. Also, PowerVR doesn’t have a saturate modifier for ‘half’ and ‘float’ variables.
