# Introduction

Frustum culling is process of discarding objects not visible on the screen. As we don’t see them – we don’t need to spend resources on computer to prepare it for rendering and rendering itself.

In this paper I will cover next themes:

- culling of: Bounding Spheres, Axis-Aligned Bounding Boxes (AABB), Oriented Bounding Boxes (OBB)
- culling of huge amount of objects
- using SSE
- multithreaded culling
- GPU culling
- comparison of approaches efficiency, working speed

What I will not cover:

- using hierarchical structures, trees. We can unite objects in groups according to world positions and first check visibility of whole group.
- optimizations of one object, like using last ‘successful’ culling plane
- visibility test taking into account scene depth buffer. Object can be inside frustum, but can be completely blocked by another, closer to viewer object. Hence we also might discard this object from rendering
- software culling. We may perform blocking of one objects by another on CPU side.
- Culling for shadows

## Simple culling

We have area of visibility, which is set by frustum of viewing pyramid. Objects that are not inside this area should be discarded from rendering process. Frustum one usually set with 6 planes.

We have objects in the scene. Each object might be aproxinated with simple geometry such as sphere or box. All object’s geometry lies inside this primitive.

Visibility test of such simple geometry performs very fast. Our aim is to understand if this object visible in frustum.

Consider the definition of visibility: spheres and boxes. There are different kinds of boxes: world axis aligned (AABB) and aligned according local object’s axis (OBB). One could clearly see that OBB beter aproximates object, but it’s visibility test performs harder than AABB.

## Sphere-frustum

Algorithm: for object’s center we find distance to each frustum plane. If point is behind any plane more than sphere radius, then sphere not in frustum. And found one is splitting plane.

__forceinline bool SphereInFrustum(vec3 &pos, float &radius, vec4 *frustum_planes)
{
bool res = true;
//test all 6 frustum planes
for (int i = 0; i < 6; i++)
{
//calculate distance from sphere center to plane.
//if distance larger then sphere radius - sphere is outside frustum
if (frustum_planes[i].x * pos.x + frustum_planes[i].y * pos.y + frustum_planes[i].z * pos.z + frustum_planes[i].w <= -radius)
res = false;
//return false; //with flag works faster
}
return res;
}

## AABB-frustum

Bounding sphere sometimes not great choise to approximate object. For more precise test one often use boxes. World Axis-Aligned or Oriented Bounding Boxes.

Basic idea to test the box visibility in frustum: if all 8 box points lie behind one of frustum planes then box is not in frustum. In next example implemented AABB-fustum test. But if we place OBB world-space points in equations – we get OBB-frustum test.

__forceinline bool RightParallelepipedInFrustum2(vec4 &Min, vec4 &Max, vec4 *frustum_planes)
{
//this is just example of basic idea - how BOX culling works, both AABB and OBB
//Min & Max are 2 world space box points. For AABB-drustum culling
//We may use transformed (by object matrix) to world space 8 box points. Replace Min & Max in equations and we get OBB-frustum.
//test all 6 frustum planes
for (int i = 0; i<6; i++)
{
//try to find such plane for which all 8 box points behind it
//test all 8 box points against frustum plane
//calculate distance from point to plane
//if point infront of the plane (dist > 0) - this is not separating plane
if (frustum_planes[i][0] * Min[0] + frustum_planes[i][1] * Max[1] + frustum_planes[i][2] * Min[2] + frustum_planes[i][3]>0)
continue;
if (frustum_planes[i][0] * Min[0] + frustum_planes[i][1] * Max[1] + frustum_planes[i][2] * Max[2] + frustum_planes[i][3]>0)
continue;
if (frustum_planes[i][0] * Max[0] + frustum_planes[i][1] * Max[1] + frustum_planes[i][2] * Max[2] + frustum_planes[i][3]>0)
continue;
if (frustum_planes[i][0] * Max[0] + frustum_planes[i][1] * Max[1] + frustum_planes[i][2] * Min[2] + frustum_planes[i][3]>0)
continue;
if (frustum_planes[i][0] * Max[0] + frustum_planes[i][1] * Min[1] + frustum_planes[i][2] * Min[2] + frustum_planes[i][3]>0)
continue;
if (frustum_planes[i][0] * Max[0] + frustum_planes[i][1] * Min[1] + frustum_planes[i][2] * Max[2] + frustum_planes[i][3]>0)
continue;
if (frustum_planes[i][0] * Min[0] + frustum_planes[i][1] * Min[1] + frustum_planes[i][2] * Max[2] + frustum_planes[i][3]>0)
continue;
if (frustum_planes[i][0] * Min[0] + frustum_planes[i][1] * Min[1] + frustum_planes[i][2] * Min[2] + frustum_planes[i][3]>0)
continue;
return false;
}
return true;
}

AABB-frustum test might be implemented more optimal.

Algorithm: from 8 point find closest one to the plane ant test if it is behind the plane. If yes – the object is not in frustum.

__forceinline bool RightParallelepipedInFrustum(vec4 &Min, vec4 &Max, vec4 *frustum_planes)
{
bool inside = true;
//test all 6 frustum planes
for (int i = 0; i<6; i++)
{
//pick closest point to plane and check if it behind the plane
//if yes - object outside frustum
float d = max(Min.x * frustum_planes[i].x, Max.x * frustum_planes[i].x) +
max(Min.y * frustum_planes[i].y, Max.y * frustum_planes[i].y) +
max(Min.z * frustum_planes[i].z, Max.z * frustum_planes[i].z) +
frustum_planes[i].w;
inside &= d > 0;
//return false; //with flag works faster
}
return inside;
}

## OBB-frustum

Algorithm: transform 8 box points from local space to clip-space. It is easy to test points against plane in such space as frustum becomes the unit cube [-1..1] (but we should take into account that for DirectX for Z axis we have another size [0..1]). If for one of axis all 8 vertexes <-1 or >1, then box outside the frustum.

__forceinline bool OBBInFrustum(const vec3 &Min, const vec3 &Max, mat4 &obj_transform_mat, mat4 &cam_modelview_proj_mat)
{
//transform all 8 box points to clip space
//clip space because we easily can test points outside required unit cube
//NOTE: for DirectX we should test z coordinate from 0 to w (-w..w - for OpenGL), look for transformations / clipping box differences
//matrix to transfrom points to clip space
mat4 to_clip_space_mat = cam_modelview_proj_mat * obj_transform_mat;
//transform all 8 box points to clip space
vec4 obb_points[8];
obb_points[0] = to_clip_space_mat * vec4(Min[0], Max[1], Min[2], 1.f);
obb_points[1] = to_clip_space_mat * vec4(Min[0], Max[1], Max[2], 1.f);
obb_points[2] = to_clip_space_mat * vec4(Max[0], Max[1], Max[2], 1.f);
obb_points[3] = to_clip_space_mat * vec4(Max[0], Max[1], Min[2], 1.f);
obb_points[4] = to_clip_space_mat * vec4(Max[0], Min[1], Min[2], 1.f);
obb_points[5] = to_clip_space_mat * vec4(Max[0], Min[1], Max[2], 1.f);
obb_points[6] = to_clip_space_mat * vec4(Min[0], Min[1], Max[2], 1.f);
obb_points[7] = to_clip_space_mat * vec4(Min[0], Min[1], Min[2], 1.f);
bool outside = false, outside_positive_plane, outside_negative_plane;
//we have 6 frustum planes, which in clip space is unit cube (for GL) with -1..1 range
for (int i = 0; i < 3; i++) //3 because we test positive & negative plane at once
{
//if all 8 points outside one of the plane
//actually it is vertex normalization xyz / w, then compare if all 8points coordinates < -1 or > 1
outside_positive_plane =
obb_points[0][i] > obb_points[0].w &&
obb_points[1][i] > obb_points[1].w &&
obb_points[2][i] > obb_points[2].w &&
obb_points[3][i] > obb_points[3].w &&
obb_points[4][i] > obb_points[4].w &&
obb_points[5][i] > obb_points[5].w &&
obb_points[6][i] > obb_points[6].w &&
obb_points[7][i] > obb_points[7].w;
outside_negative_plane =
obb_points[0][i] < -obb_points[0].w &&
obb_points[1][i] < -obb_points[1].w &&
obb_points[2][i] < -obb_points[2].w &&
obb_points[3][i] < -obb_points[3].w &&
obb_points[4][i] < -obb_points[4].w &&
obb_points[5][i] < -obb_points[5].w &&
obb_points[6][i] < -obb_points[6].w &&
obb_points[7][i] < -obb_points[7].w;
outside = outside || outside_positive_plane || outside_negative_plane;
//if (outside_positive_plane || outside_negative_plane)
//return false;
}
return !outside;
//return true;
}

Table 1: culing results of 100к objects. Intel Core I5. Single Thread.

Simple Culling |
Sphere |
AABB |
OBB |

Just cullung |
0,92 |
1,42 |
9,14 |

Whole frame |
1,94 |
2,5 |
10,3 |

The results are obvious. The harder calculations the slower it works. OBB test much slower than Spheres or AABB tests. But we get more precise culling with OBB.

May be, optimal solution is spiting objects into groups, For each group depending on distance to camera we use appropriate primitive. For closest groups use OBB, for middle one groups use ABB and Spheres for the rest.

Also should be notices than whole frame time is larger than just culling. 1 ms. in average. Because of transferring data about visible objects to gpu has cost, couple of dips and API commands. But it is necessary actions.

# SSE

SSE (Streaming SIMD Extensions) – with one instructions we perform calculations on group of operands. SSE includes in it’s architecture eight 128 bit registers and set of instructions to perform any operations on them.

Theoretically we might speedup code execution 4 times as we make operations with 4 operands simultaneously. Offcourse on practice perfomance win will be less because of SSE drawbacks.

- Not all algorithms could be easily rewrited in SSE
- data should be packed according to SSE requirements in registers to perform calculations
- SSE has some restrictions with vertical operations like dot products
- there are no conditions. One use so called static branching, when we execute bot 2 parts of condition an take just one interesting to us result.
- Loading data in registers and storing results back into memory
- don’t forget about sse data striding

Algorithm of SSE Spheres-frustum and SSE AABB-frustum culling almost identical to simple implementation. In exception of that we perform calculations on 4 objects simultaneously.

void sse_culling_spheres(BSphere *sphere_data, int num_objects, int *culling_res, vec4 *frustum_planes)
{
float *sphere_data_ptr = reinterpret_cast<float*>(&sphere_data[0]);
int *culling_res_sse = &culling_res[0];
//to optimize calculations we gather xyzw elements in separate vectors
__m128 zero_v = _mm_setzero_ps();
__m128 frustum_planes_x[6];
__m128 frustum_planes_y[6];
__m128 frustum_planes_z[6];
__m128 frustum_planes_d[6];
int i, j;
for (i = 0; i < 6; i++)
{
frustum_planes_x[i] = _mm_set1_ps(frustum_planes[i].x);
frustum_planes_y[i] = _mm_set1_ps(frustum_planes[i].y);
frustum_planes_z[i] = _mm_set1_ps(frustum_planes[i].z);
frustum_planes_d[i] = _mm_set1_ps(frustum_planes[i].w);
}
//we process 4 objects per step
for (i = 0; i < num_objects; i += 4)
{
//load bounding sphere data
__m128 spheres_pos_x = _mm_load_ps(sphere_data_ptr);
__m128 spheres_pos_y = _mm_load_ps(sphere_data_ptr + 4);
__m128 spheres_pos_z = _mm_load_ps(sphere_data_ptr + 8);
__m128 spheres_radius = _mm_load_ps(sphere_data_ptr + 12);
sphere_data_ptr += 16;
//but for our calculations we need transpose data, to collect x, y, z and w coordinates in separate vectors
_MM_TRANSPOSE4_PS(spheres_pos_x, spheres_pos_y, spheres_pos_z, spheres_radius);
__m128 spheres_neg_radius = _mm_sub_ps(zero_v, spheres_radius); // negate all elements
__m128 intersection_res = _mm_setzero_ps();
for (j = 0; j < 6; j++) //plane index
{
//1. calc distance to plane dot(sphere_pos.xyz, plane.xyz) + plane.w
//2. if distance < sphere radius, then sphere outside frustum
__m128 dot_x = _mm_mul_ps(spheres_pos_x, frustum_planes_x[j]);
__m128 dot_y = _mm_mul_ps(spheres_pos_y, frustum_planes_y[j]);
__m128 dot_z = _mm_mul_ps(spheres_pos_z, frustum_planes_z[j]);
__m128 sum_xy = _mm_add_ps(dot_x, dot_y);
__m128 sum_zw = _mm_add_ps(dot_z, frustum_planes_d[j]);
__m128 distance_to_plane = _mm_add_ps(sum_xy, sum_zw);
__m128 plane_res = _mm_cmple_ps(distance_to_plane, spheres_neg_radius); //dist < -sphere_r ?
intersection_res = _mm_or_ps(intersection_res, plane_res); //if yes - sphere behind the plane & outside frustum
}
//store result
__m128i intersection_res_i = _mm_cvtps_epi32(intersection_res);
_mm_store_si128((__m128i *)&culling_res_sse[i], intersection_res_i);
}
}

void sse_culling_aabb(AABB *aabb_data, int num_objects, int *culling_res, vec4 *frustum_planes)
{
float *aabb_data_ptr = reinterpret_cast<float*>(&aabb_data[0]);
int *culling_res_sse = &culling_res[0];
//to optimize calculations we gather xyzw elements in separate vectors
__m128 zero_v = _mm_setzero_ps();
__m128 frustum_planes_x[6];
__m128 frustum_planes_y[6];
__m128 frustum_planes_z[6];
__m128 frustum_planes_d[6];
int i, j;
for (i = 0; i < 6; i++)
{
frustum_planes_x[i] = _mm_set1_ps(frustum_planes[i].x);
frustum_planes_y[i] = _mm_set1_ps(frustum_planes[i].y);
frustum_planes_z[i] = _mm_set1_ps(frustum_planes[i].z);
frustum_planes_d[i] = _mm_set1_ps(frustum_planes[i].w);
}
__m128 zero = _mm_setzero_ps();
//we process 4 objects per step
for (i = 0; i < num_objects; i += 4)
{
//load objects data
//load aabb min
__m128 aabb_min_x = _mm_load_ps(aabb_data_ptr);
__m128 aabb_min_y = _mm_load_ps(aabb_data_ptr + 8);
__m128 aabb_min_z = _mm_load_ps(aabb_data_ptr + 16);
__m128 aabb_min_w = _mm_load_ps(aabb_data_ptr + 24);
//load aabb max
__m128 aabb_max_x = _mm_load_ps(aabb_data_ptr + 4);
__m128 aabb_max_y = _mm_load_ps(aabb_data_ptr + 12);
__m128 aabb_max_z = _mm_load_ps(aabb_data_ptr + 20);
__m128 aabb_max_w = _mm_load_ps(aabb_data_ptr + 28);
aabb_data_ptr += 32;
//for now we have points in vectors aabb_min_x..w, but for calculations we need to xxxx yyyy zzzz vectors representation - just transpose data
_MM_TRANSPOSE4_PS(aabb_min_x, aabb_min_y, aabb_min_z, aabb_min_w);
_MM_TRANSPOSE4_PS(aabb_max_x, aabb_max_y, aabb_max_z, aabb_max_w);
__m128 intersection_res = _mm_setzero_ps();
for (j = 0; j < 6; j++) //plane index
{
//this code is similar to what we make in simple culling
//pick closest point to plane and check if it begind the plane. if yes - object outside frustum
//dot product, separate for each coordinate, for min & max aabb points
__m128 aabbMin_frustumPlane_x = _mm_mul_ps(aabb_min_x, frustum_planes_x[j]);
__m128 aabbMin_frustumPlane_y = _mm_mul_ps(aabb_min_y, frustum_planes_y[j]);
__m128 aabbMin_frustumPlane_z = _mm_mul_ps(aabb_min_z, frustum_planes_z[j]);
__m128 aabbMax_frustumPlane_x = _mm_mul_ps(aabb_max_x, frustum_planes_x[j]);
__m128 aabbMax_frustumPlane_y = _mm_mul_ps(aabb_max_y, frustum_planes_y[j]);
__m128 aabbMax_frustumPlane_z = _mm_mul_ps(aabb_max_z, frustum_planes_z[j]);
//we have 8 box points, but we need pick closest point to plane. Just take max
__m128 res_x = _mm_max_ps(aabbMin_frustumPlane_x, aabbMax_frustumPlane_x);
__m128 res_y = _mm_max_ps(aabbMin_frustumPlane_y, aabbMax_frustumPlane_y);
__m128 res_z = _mm_max_ps(aabbMin_frustumPlane_z, aabbMax_frustumPlane_z);
//dist to plane = dot(aabb_point.xyz, plane.xyz) + plane.w
__m128 sum_xy = _mm_add_ps(res_x, res_y);
__m128 sum_zw = _mm_add_ps(res_z, frustum_planes_d[j]);
__m128 distance_to_plane = _mm_add_ps(sum_xy, sum_zw);
__m128 plane_res = _mm_cmple_ps(distance_to_plane, zero); //dist from closest point to plane < 0 ?
intersection_res = _mm_or_ps(intersection_res, plane_res); //if yes - aabb behind the plane & outside frustum
}
//store result
__m128i intersection_res_i = _mm_cvtps_epi32(intersection_res);
_mm_store_si128((__m128i *)&culling_res_sse[i], intersection_res_i);
}
}

OBB culling is a bit harder. We perform calculations on one object at once. But make calculations for three xyz axes simultaneously. It is not optimal but it reflects basic idea of algorithm. Besides, vector math (matrix multiplications and point transformations) with SSE perform faster.

void sse_culling_obb(int firs_processing_object, int num_objects, int *culling_res, mat4 &cam_modelview_proj_mat)
{
mat4_sse sse_camera_mat(cam_modelview_proj_mat);
mat4_sse sse_clip_space_mat;
//box points in local space
__m128 obb_points_sse[8];
obb_points_sse[0] = _mm_set_ps(1.f, box_min[2], box_max[1], box_min[0]);
obb_points_sse[1] = _mm_set_ps(1.f, box_max[2], box_max[1], box_min[0]);
obb_points_sse[2] = _mm_set_ps(1.f, box_max[2], box_max[1], box_max[0]);
obb_points_sse[3] = _mm_set_ps(1.f, box_min[2], box_max[1], box_max[0]);
obb_points_sse[4] = _mm_set_ps(1.f, box_min[2], box_min[1], box_max[0]);
obb_points_sse[5] = _mm_set_ps(1.f, box_max[2], box_min[1], box_max[0]);
obb_points_sse[6] = _mm_set_ps(1.f, box_max[2], box_min[1], box_min[0]);
obb_points_sse[7] = _mm_set_ps(1.f, box_min[2], box_min[1], box_min[0]);
ALIGN_SSE int obj_culling_res[4];
__m128 zero_v = _mm_setzero_ps();
int i, j;
//process one object per step
for (i = firs_processing_object; i < firs_processing_object+num_objects; i++)
{
//clip space matrix = camera_view_proj * obj_mat
sse_mat4_mul(sse_clip_space_mat, sse_camera_mat, sse_obj_mat[i]);
__m128 outside_positive_plane = _mm_set1_ps(0xffffffff);
__m128 outside_negative_plane = _mm_set1_ps(0xffffffff);
//for all 8 box points
for (j = 0; j < 8; j++)
{
//transform point to clip space
__m128 obb_transformed_point = sse_mat4_mul_vec4(sse_clip_space_mat, obb_points_sse[j]);
//gather w & -w
__m128 wwww = _mm_shuffle_ps(obb_transformed_point, obb_transformed_point, _MM_SHUFFLE(3, 3, 3, 3)); //get w
__m128 wwww_neg = _mm_sub_ps(zero_v, wwww); // negate all elements
//box_point.xyz > box_point.w || box_point.xyz < -box_point.w ?
//similar to point normalization: point.xyz /= point.w; And compare: point.xyz > 1 && point.xyz < -1
__m128 outside_pos_plane = _mm_cmpge_ps(obb_transformed_point, wwww);
__m128 outside_neg_plane = _mm_cmple_ps(obb_transformed_point, wwww_neg);
//if at least 1 of 8 points in front of the plane - we get 0 in outside_* flag
outside_positive_plane = _mm_and_ps(outside_positive_plane, outside_pos_plane);
outside_negative_plane = _mm_and_ps(outside_negative_plane, outside_neg_plane);
}
//all 8 points xyz < -1 or > 1 ?
__m128 outside = _mm_or_ps(outside_positive_plane, outside_negative_plane);
//store result
__m128i outside_res_i = _mm_cvtps_epi32(outside);
_mm_store_si128((__m128i *)&obj_culling_res[0], outside_res_i);
//for now we have separate result separately for each axis
//combine results. If outside any plane, then objects is outside frustum
culling_res[i] = (obj_culling_res[0] != 0 || obj_culling_res[1] != 0 || obj_culling_res[2] != 0) ? 1 : 0;
}
}

Table 2. SSE culling result of 100k objects. Intel Core I5. Single Thread. SSE.

SSE Culling |
Sphere |
AABB |
OBB |

Just culling |
0,26 |
0,46 |
3,48 |

Whole frame |
1,2 |
1,43 |
4,6 |

SSE implementation in average 3 times faster than simple one in C++.

## Multithreading

Nowadays processors has several cores. Calculations might be performed simultaneously on all cores.

Architecture on new games should be planned taking into account multithreading, i.e. split work on independent parts/tasks and solve them simultaneously, loading evenly all the processor cores. The design should be flexible. Too large amount of small tasks leads to overhead of synchronizing work and switching between tasks. Too small abound of big tasks leads to uneavenly loading of cores. Need a balance. In current games there might be from several hundreds to thousand tasks per frame.

In our case of frustum culling each object is independent from the rest. Thats why we easily could split work into equal groups and cull them simultaneously with different cores of processor. After running jobs execution we need to wait threads to do their job and gather results.

Off course we should not ask results right after execution start.

Worker::Worker() : first_processing_oject(0), num_processing_ojects(0)
{
//create 2 events: 1. to signal that we have a job 2.signal that we finished job
has_jobs_event = CreateEvent(NULL, false, false, NULL);
jobs_finished_event = CreateEvent(NULL, false, true, NULL);
}
void Worker::doJob()
{
//make our part of work
cull_objects(first_processing_oject, num_processing_ojects);
}
unsigned __stdcall thread_func(void* arguments)
{
printf("In thread...n");
Worker *worker = static_cast<worker*>(arguments);
//each worker has endless loop untill we signal to quit (stop_work flag)
while (true)
{
//wait for starting jobs
//if we have no job - just wait (has_jobs_event event). We do not wasting cpu work. Events designed for this.
WaitForSingleObject(worker->has_jobs_event, INFINITE);
//if we have signal to break - exit endless loop
if (worker->stop_work)
break;
//do job
worker->doJob();
//signal that we finished the job
SetEvent(worker->jobs_finished_event);
}
_endthreadex(0);
return 0;
}
void create_threads()
{
//create the threads
//split the work into parts between threads
int worker_num_processing_ojects = MAX_SCENE_OBJECTS / num_workers;
int first_processing_oject = 0;
int i;
for (i = 0; i < num_workers; i++)
{
//create threads
workers[i].thread_handle = (HANDLE)_beginthreadex(NULL, 0, &thread_func, &workers[i], CREATE_SUSPENDED, &workers[i].thread_id);
thread_handles[i] = workers[i].thread_handle;
//set threads parameters
workers[i].first_processing_oject = first_processing_oject;
workers[i].num_processing_ojects = worker_num_processing_ojects;
first_processing_oject += worker_num_processing_ojects;
}
//run workers to do their jobs
for (int i = 0; i < num_workers; i++)
ResumeThread(workers[i].thread_handle);
}
void process_multithreading_culling()
{
//signal workers that they have the job
for (int i = 0; i < num_workers; i++)
SetEvent(workers[i].has_jobs_event);
}
void wate_multithreading_culling_done()
{
//wait threads to do their jobs
HANDLE wait_events[num_workers];
for (int i = 0; i < num_workers; i++)
wait_events[i] = workers[i].jobs_finished_event;
WaitForMultipleObjects(num_workers, &wait_events[0], true, INFINITE);
}

Table 3. Culling results of 100k objects. Intel Core I5 (4 cores). In brackets – speedup relatively to simple c++ implementation.

Method |
Sphere |
AABB |
OBB |

Simple c++ |
0,92 (1) |
1,42 (1) |
9,14 (1) |

SSE |
0,26 (3,54) |
0,46 (3,08) |
3,48 (2,62) |

Simple c++, Mulithreaded |
0,25 (3,68) |
0,4 (3,55) |
2,5 (3,65) |

SSE, Multithreaded |
0,1 (9,2) |
0,18 (7,89) |
1 (9,14) |

Multithreaded version faster than single threaded in 3,6 times in average.

Using SSE gieves us 3 times speedup, relatively to simple c++ implementation.

Both using SSE and Multithreading gives us 8,7 times speedup!

I.e. we optimize our calculations by almost 9 times, depending on used culling primitive type.

# GPU culling

GPU designed to perform the same operation on huge amount of data. GPU has a lot more parallel threads (thousands) than in CPU (2-8 in most desktop cases). But culling on gpu not allwas comfortably:

- this assumes special graphics engine architecture
- there is unpleasant moment that we evaluate dip execution on cpu side. For this we need to know the amout of generated primitives by GPU (visible in frustum objects in our case). Thats why we need to ask feedback from GPU. There are special commands for this purpose.
- The problem is if we want get result in the same frame with culling and rendering we get GPU-stall, because we need to wait the result. This is bad for perfomance. If read result from previous frame – we get bugs. Full solution to this problem is using DrawIndirect commands and preparing information about dip on GPU side. This is available since DirectX11 and Opengl 4.

Implementation on gpu culling consist from next steps:

- Pack all instances data in vertex buffer. Assume that one vertex is one object for culling. Amount of atributes for vertex equal to amount of data per one object.
- Enable transform feedback. Send prepared vertex buffer on render. All results redirect to another vertex buffer with visible instances data.
- In vertex shader check visibility on the object
- In geometry shader discard object / kill the vertex if instance is not visible in frustum.
- Thus, we formed buffer with just visible instances data.
- But now we need to get information amout amount of visible objects from GPU to make the dip on CPU side. In this case we do this with transform feedback from previous frame (just for code simplicity).

void do_gpu_culling()
{
culling_shader.bind();
int cur_frame = frame_index % 2;
int prev_frame = (frame_index + 1) % 2;
//enable transform feedback & query
glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, dips_texture_buffer);
glBeginQuery(GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN,
num_visible_instances_query[cur_frame]);
glBeginTransformFeedback(GL_POINTS);
//render cloud of points which we interprent as objects data
glBindVertexArray(all_instances_data_vao);
glDrawArrays(GL_POINTS, 0, MAX_SCENE_OBJECTS);
glBindVertexArray(0);
//disable all
glEndTransformFeedback();
glEndQuery(GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN);
glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, 0);
glDisable(GL_RASTERIZER_DISCARD);
//get feedback from prev frame
num_visible_instances = 0;
glGetQueryObjectiv(num_visible_instances_query[prev_frame], GL_QUERY_RESULT, &num_visible_instances);
//next frame
frame_index++;
}

Vertex shader from 3rd step:

#version 330 core
in vec4 s_attribute_0;
in vec4 s_attribute_1;
out vec4 instance_data1;
out vec4 instance_data2;
out int visible;
uniform mat4 ModelViewProjectionMatrix;
uniform vec4 frustum_planes[6];
int InstanceCloudReduction()
{
//sphere - frustum test
bool inside = true;
for (int i = 0; i < 6; i++)
{
if (dot(frustum_planes[i].xyz, s_attribute_0.xyz) + frustum_planes[i].w <= -s_attribute_0.w)
inside = false;
}
return inside ? 1 : 0;
}
void main()
{
//read instance data
instance_data1 = s_attribute_0;
instance_data2 = s_attribute_1;
//visibility
visible = InstanceCloudReduction();
gl_Position = ModelViewProjectionMatrix * vec4(s_attribute_0.xyz,1);
}

Geometry shader from 4th step:

#version 400 core
layout (points) in;
layout (points, max_vertices = 1) out;
in vec4 instance_data1[];
in vec4 instance_data2[];
in int visible[];
out vec4 output_instance_data1;
out vec4 output_instance_data2;
void main( )
{
//if primitive is not visible - discard it !
//visible comes from vertex shader
if (visible[0] == 1)
{
//just transfer data
output_instance_data1 = instance_data1[0];
output_instance_data2 = instance_data2[0];
gl_Position = vec4(0.0, 0.0, 0.0, 1.0);
EmitVertex();
EndPrimitive();
}
}

Whole frame time with GPU culling is about 1,19 ms. Allmost the same as with the fastest CPU variant, SSE multithreaded sphere-frustum culling.

# Perfomance comparision of all methods

We compared culling speed CPU methods, but mention only culling time. For now we are going to measure whole frame time, so measurements will differs a bit as we need take into account data transferring and some other work.

Table 4: Perfomance comparition of all culling methods. Whole Frame time in ms.

Method |
Sphere |
AABB |
OBB |

Simple c++ |
1,94 |
2,46 |
10,3 |

SSE |
1,2 |
1,45 |
4,63 |

Simple c++, multithreading |
1,23 |
1,42 |
3,6 |

SSE, multithreading |
1,18 |
1,2 |
2,02 |

GPU |
1,19 |
– |
– |

There might be some inaccuracy in measurements (in 2nd sigh after coma). Even averaged time always differs from frame to frame.

# Conclusion

Using both SSE and multithreading gives us 9 times perfomance win in comparition with simple c++ implementation. If one can use Opengl 4 (map buffers with GL_MAP_PERSISTENT_BIT and GL_MAP_COHERENT_BIT flags), then data transfering to gpu becomes really fast. Plus we able to optimise culling using hierarchial structures and different trics like remembering last frustum plane which culled the object.

GPU culling works as fast as the most optimized CPU method. And has couple advantages:

- no need to transfer visible instances data to GPU. They are already there. If using DrawIndirect (DX11+, OpenGL 4+, new API: DX12, Vulkan, Metal) and form information about dip on CPU side, then we can get even better performance.
- Relatively easily able to make additional culling taking into account self blocking/hiding/inner visibility (Hierarchical-Z map based occlusion culling)

So – what to use CPU or GPU culling?

Depends on several things:

- project and amount of changes to make GPU culling
- Do we able to use DrawIndirect
- what is bottle neck: CPU or GPU?
- Do we want make additional culling according to depth complexity of the scene. On GPU side it will be much easier and faster.

If there is such opportunity – one should use GPU culling.

Source code of all examples.

Links:

- Collision Detection FAQ
- View frustum culling
- View frustum culling optimization
- Efficient View Frustum Culling
- Optimized View Frustum Culling Algorithms for Bounding Boxes
- Hierarchical-Z map based occlusion culling
- Beyond Porting. How Modern OpenGL can Radically Reduce Driver Overhead
- Parallelizing the Naughty Dog engine using fibers
- Practical Occlusion Culling in Killzone 3
- Software Occlusion Culling
- Streaming SIMD Extensions (SSE)

Gerlits Anatoliy.

February, 2017 year.