OpenGL API Overhead
25 Feb 2017

Introduction

In modern projects, the engine renders thousands of different objects to produce a good-looking picture: characters, buildings, landscape, vegetation, effects and so on. There are several ways to render geometry on the screen. In this article we consider how to do that efficiently, and we measure and compare the cost of rendering API calls. The article covers:

  • the cost of state changes (frame buffers, vertex buffers, shaders, constants, textures)
  • different types of geometry instancing and their relative performance
  • several practical recommendations on how to optimize geometry rendering in projects.

I will cover only the OpenGL API and will not describe the details, parameters and variations of each API call; there are reference books and manuals for that purpose. Test machine for all measurements: Intel Core i5-4460 3.2 GHz, Radeon R9 380. All times are in milliseconds (ms).

State changes

We want to see a rich picture on the screen: many unique objects with a lot of detail. For this purpose the engine takes all objects visible to the camera, sets their parameters (vertex buffers, shaders, material parameters, textures, etc.) and sends them to be rendered. All these actions are performed with API commands. Let’s look at them and run some tests to understand how to organize the rendering process optimally.

Let’s measure the cost of different OpenGL calls: dips (draw indexed primitive) and changes of shaders, vertex buffers, textures and shader parameters.

Dips

A dip (draw indexed primitive) is a command to the GPU to render a batch of geometry, most often triangles. Of course, we first need to say what geometry we want to show and with what shader, and set some options. But it is the dip that renders geometry; all other commands only describe the parameters of what we want to show. The price of a dip usually includes all related state changes, not just the one command, so it depends on the number of state changes. First, consider the simplest case: the cost of a thousand simple dips without state changes.

void simple_dips() 
{ 
    glBindVertexArray(ws_complex_geometry_vao_id); //what geometry to render 
    simple_geometry_shader.bind(); //with what shader
    //a lot of simple dips 
    for (int i = 0; i < CURRENT_NUM_INSTANCES; i++) 
        glDrawRangeElements(GL_TRIANGLES, i * BOX_NUM_VERTS, (i+1) * BOX_NUM_VERTS, BOX_NUM_INDICES, GL_UNSIGNED_INT, (GLvoid*)(i*BOX_NUM_INDICES*sizeof(int))); //simple dip 
} 

Table 1. Cost of simple dips (time in ms, depending on the number of dips)

Dips 2000 1000 500 100
Time, ms 0.4 0.21 0.107 0.0255

The time of the whole frame is slightly larger than the test time: on average, across all tests, by about 0.2 ms. Here and below, the numbers in the tables are the test time only. The cost of individual API calls is calculated later in the article (Table 8).

Frame buffer change

An FBO (frame buffer object) is an object that allows rendering an image not to the screen but to another surface, which can later be used as a texture in shaders. FBOs change less often than other state, but each change is quite expensive for the CPU.

void fbo_change_test() 
{ 
    //clear FBO 
    glViewport(0, 0, window_width, window_height); 
    glClearColor(0.0f / 255.0f, 0.0f / 255.0f, 0.0f / 255.0f, 0.0); 
    
    for (int i = 0; i < NUM_DIFFERENT_FBOS; i++) 
    {
        glBindFramebuffer(GL_FRAMEBUFFER, fbo_buffer[i % NUM_DIFFERENT_FBOS]);
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    }
    
    //prepare dip 
    glBindVertexArray(ws_complex_geometry_vao_id); 
    simple_geometry_shader.bind();
    
    //bind FBO, render one object... repeat N times 
    for (int i = 0; i < NUM_FBO_CHANGES; i++) 
    { 
        glBindFramebuffer(GL_FRAMEBUFFER, fbo_buffer[i % NUM_DIFFERENT_FBOS]); //bind fbo
        glDrawRangeElements(GL_TRIANGLES, i * BOX_NUM_VERTS, (i + 1) * BOX_NUM_VERTS, BOX_NUM_INDICES, GL_UNSIGNED_INT, (GLvoid*)(i*BOX_NUM_INDICES * sizeof(int))); //simple dip
    } 
    
    glBindFramebuffer(GL_FRAMEBUFFER, 0); //set rendering to the 'screen' 
}

Table 2. FBO change test (time in ms, depending on the number of FBO changes)

FBO changes 400 200 100 25
Time, ms 2.72 1.42 0.73 0.257

FBO changes are usually needed for post effects and for separate passes such as reflections, rendering into a cubemap, creating virtual textures, etc. Many things, like virtual textures, can be organized as atlases, so that the FBO is set only once and only the viewport changes. Rendering into a cubemap might be replaced with another technique, for example dual paraboloid rendering. The point, of course, is not only FBO changes but also the number of scene rendering passes, material changes, etc. In general, the fewer state changes, the better.
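
As a rough sketch of the atlas idea, assuming a square atlas split into equal square tiles: every pass keeps the same FBO bound and only moves the viewport, so each pass costs one glViewport call instead of one glBindFramebuffer call. The `Viewport` struct and the `atlas_tile_viewport` function are hypothetical names used for illustration; they are not part of the article's sources.

```cpp
#include <cassert>

// Hypothetical helper type: the rectangle to pass to glViewport(x, y, width, height).
struct Viewport { int x, y, width, height; };

// Returns the viewport of tile `index` in an atlas of tiles_per_row x tiles_per_row
// square tiles, each `tile_size` pixels wide. Tiles are numbered row by row.
inline Viewport atlas_tile_viewport(int index, int tiles_per_row, int tile_size)
{
    assert(tiles_per_row > 0 && index >= 0 && index < tiles_per_row * tiles_per_row);
    const int col = index % tiles_per_row;
    const int row = index / tiles_per_row;
    return Viewport{ col * tile_size, row * tile_size, tile_size, tile_size };
}
```

With a 4 x 4 atlas of 256-pixel tiles, tile 5 maps to the rectangle starting at (256, 256).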

Shader changes

Shaders usually describe one of the scene’s materials or an effect technique. The more materials and kinds of surfaces there are, the more shaders. Some materials differ only slightly; these should be combined into one shader, with the switch between them implemented as a condition in the shader. The number of materials directly influences the number of dips.

void shaders_change_test() 
{ 
    glBindVertexArray(ws_complex_geometry_vao_id);
    
    for (int i = 0; i < CURRENT_NUM_INSTANCES; i++) 
    {
        simple_color_shader[i%NUM_DIFFERENT_SIMPLE_SHADERS].bind(); //bind certain shader
        glDrawRangeElements(GL_TRIANGLES, i * BOX_NUM_VERTS, (i + 1) * BOX_NUM_VERTS, BOX_NUM_INDICES, GL_UNSIGNED_INT, (GLvoid*)(i*BOX_NUM_INDICES * sizeof(int))); //simple dip
    }
}

Table 3. Shader change test (time in ms, depending on the number of shader changes)

Shader changes 2000 1000 500 100
Time, ms 5.16 2.6 1.28 0.257

Changing the shader here also includes transferring the world-view-projection matrix as a parameter; otherwise we could not render anything. The cost of changing parameters is measured in the next step.
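
Since every shader bind is expensive compared with a dip, one common way to reduce the number of binds is to sort the draw list by shader before submitting it. The sketch below is only an illustration under that assumption: `DrawItem` and `sort_and_count_shader_binds` are hypothetical names, and a real renderer would usually sort by a composite key (shader, textures, vertex buffer).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical draw record: which shader and which mesh a dip uses.
struct DrawItem { int shader_id; int mesh_id; };

// Sorts draws so that equal shaders are adjacent and returns how many shader
// binds a renderer that binds only on change would then issue.
inline std::size_t sort_and_count_shader_binds(std::vector<DrawItem>& draws)
{
    // stable_sort keeps the original order of draws that share a shader
    std::stable_sort(draws.begin(), draws.end(),
        [](const DrawItem& a, const DrawItem& b) { return a.shader_id < b.shader_id; });

    std::size_t binds = 0;
    for (std::size_t i = 0; i < draws.size(); ++i)
        if (i == 0 || draws[i].shader_id != draws[i - 1].shader_id)
            ++binds;
    return binds;
}
```

For five draws that alternate between three shaders, the sorted list needs three binds instead of five.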

Shader parameter changes

Materials are often made universal, with many options, so that different kinds of materials can be derived from them. This is an easy way to get variety in the picture and make each character or object unique. These parameters have to be transferred to the shader somehow, which is done with the glUniform* API commands.

uniforms_changes_test_shader.bind(); 
glBindVertexArray(ws_complex_geometry_vao_id);

for (int i = 0; i < CURRENT_NUM_INSTANCES; i++) 
{
    //set uniforms for this dip 
    for (int j = 0; j < NUM_UNIFORM_CHANGES_PER_DIP; j++) 
        glUniform4fv(ColorShader_uniformLocation[j], 1, &randomColors[(i*NUM_UNIFORM_CHANGES_PER_DIP + j) % MAX_RANDOM_COLORS].x);
    
    glDrawRangeElements(GL_TRIANGLES, i * BOX_NUM_VERTS, (i + 1) * BOX_NUM_VERTS, BOX_NUM_INDICES, GL_UNSIGNED_INT, (GLvoid*)(i*BOX_NUM_INDICES * sizeof(int))); //simple dip
}

Setting parameters individually for each instance or object is not optimal. Usually all instance data can be packed into one large buffer and transferred to the GPU with one command. All that remains is to set, for each object, an offset that tells the shader where its data is located.

//copy data to ssbo buffer
glBindBuffer(GL_SHADER_STORAGE_BUFFER, instances_uniforms_ssbo);
float *gpu_data = (float*)glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0, CURRENT_NUM_INSTANCES * NUM_UNIFORM_CHANGES_PER_DIP * sizeof(vec4), GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
memcpy(gpu_data, &all_instances_uniform_data[0], CURRENT_NUM_INSTANCES * NUM_UNIFORM_CHANGES_PER_DIP * sizeof(vec4)); //copy instances data
glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);

//bind for shader to 0 'point' (shader will read data from this 'link point')
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, instances_uniforms_ssbo);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0);

//render 
uniforms_changes_ssbo_shader.bind();
glBindVertexArray(ws_complex_geometry_vao_id);
static int uniformsInstancing_data_varLocation = glGetUniformLocation(uniforms_changes_ssbo_shader.programm_id, "instance_data_location");

for (int i = 0; i < CURRENT_NUM_INSTANCES; i++) 
{
    //set parameter to shader - where object's data located
    glUniform1i(uniformsInstancing_data_varLocation, i*NUM_UNIFORM_CHANGES_PER_DIP);
    glDrawRangeElements(GL_TRIANGLES, i * BOX_NUM_VERTS, (i + 1) * BOX_NUM_VERTS, BOX_NUM_INDICES, GL_UNSIGNED_INT, (GLvoid*)(i*BOX_NUM_INDICES * sizeof(int))); //simple dip
}

Table 4. Shader parameter change tests (time in ms, depending on the number of dips)

Test type 2000 1000 500 100
UNIFORMS_SIMPLE_CHANGE_TEST 2.25 1.1 0.54 0.1145
UNIFORMS_SSBO_TEST 1.3 0.628 0.32 0.0725

Calling glMapBuffer(GL_SHADER_STORAGE_BUFFER, GL_WRITE_ONLY) causes CPU-GPU synchronization, which should be avoided. Use glMapBufferRange with the GL_MAP_UNSYNCHRONIZED_BIT flag to prevent synchronization. The programmer must then guarantee that the data being overwritten is not in use by the GPU at that moment; otherwise we get bugs, because we rewrite data that the GPU is currently reading. To resolve this problem completely, use triple buffering: while the current buffer is being written, the other two are used by the GPU. There is also a more efficient way to map a buffer, with the GL_MAP_PERSISTENT_BIT and GL_MAP_COHERENT_BIT flags.
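
The bookkeeping behind triple buffering can be sketched without any GL calls. The `BufferRing` class below is a hypothetical helper, not code from the article's sources: it only decides which of N equally sized regions of one large buffer the CPU writes in the current frame. The mapping itself would still be done with glMapBufferRange as described above.

```cpp
#include <cstddef>

// Hypothetical helper that picks which of N buffer regions the CPU may write
// in a given frame. With N = 3 the GPU can still read the two previous regions.
template <std::size_t N>
class BufferRing
{
public:
    // Index of the region the CPU writes during the current frame.
    std::size_t write_index() const { return frame_ % N; }

    // Byte offset of that region inside one buffer of N * region_size bytes.
    std::size_t write_offset(std::size_t region_size) const
    {
        return write_index() * region_size;
    }

    // Call once per frame, after the draw calls of the frame were submitted.
    void next_frame() { ++frame_; }

private:
    std::size_t frame_ = 0;
};
```

This assumes the GPU never falls more than two frames behind the CPU. If that cannot be guaranteed, a fence per region (glFenceSync and glClientWaitSync) is the usual way to make the reuse safe.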

Changing vertex buffers

A scene contains many objects with different geometry, which is usually placed in different vertex buffers. To render an object with different geometry, even with the same material, we need to change the vertex buffer. There are techniques that render different geometry with the same material efficiently in a single dip: MultiDrawIndirect and dynamic vertex pulling. They require the geometry to be placed in one buffer.

void vbo_change_test() 
{
    simple_geometry_shader.bind();
    
    for (int i = 0; i < CURRENT_NUM_INSTANCES; i++) 
    {
        glBindVertexArray(separate_geometry_vao_id[i % NUM_SIMPLE_VERTEX_BUFFERS]); //change vbo
        glDrawRangeElements(GL_TRIANGLES, i * BOX_NUM_VERTS, (i + 1) * BOX_NUM_VERTS, BOX_NUM_INDICES, GL_UNSIGNED_INT, (GLvoid*)(i*BOX_NUM_INDICES * sizeof(int))); //simple dip
    }
}

Table 5. VBO change test (time in ms, depending on the number of VBO changes)

VBO changes 2048 1024 512 128
Time, ms 1.6 0.785 0.396 0.086

Texture changes

Textures give surfaces a detailed look. You can get a great deal of variety in the picture simply by changing textures and blending different textures in the shader. Textures have to be changed frequently, but you can put them into a so-called texture array, bind it only once for many dips and access the individual textures through an index in the shader. The same geometry with different textures can be rendered using instancing.

void textures_change_test() 
{
    glBindVertexArray(ws_complex_geometry_vao_id);
    int counter = 0;
    
    //switch between tests 
    if (test_type == ARRAY_OF_TEXTURES_TEST) 
    {
        array_of_textures_shader.bind();
        
        for (int i = 0; i < CURRENT_NUM_INSTANCES; i++) 
        {
            //bind textures for this dip 
            for (int j = 0; j < NUM_TEXTURES_IN_COMPLEX_MATERIAL; j++) 
            {
                glActiveTexture(GL_TEXTURE0 + j);
                glBindTexture(GL_TEXTURE_2D, array_of_textures[counter % TEX_ARRAY_SIZE]);
                glBindSampler(j, Sampler_linear);
                counter++; 
            }
            
            glDrawRangeElements(GL_TRIANGLES, i * BOX_NUM_VERTS, (i + 1) * BOX_NUM_VERTS, BOX_NUM_INDICES, GL_UNSIGNED_INT, (GLvoid*)(i*BOX_NUM_INDICES * sizeof(int))); //simple dip
        }
    }
    else if (test_type == TEXTURES_ARRAY_TEST) 
    { 
        //bind texture array for all dips
        glActiveTexture(GL_TEXTURE0);
        glBindTexture(GL_TEXTURE_2D_ARRAY, texture_array_id);
        glBindSampler(0, Sampler_linear);
        
        //variable to tell shader - what textures uses this dip 
        static int textureArray_usedTex_varLocation = glGetUniformLocation(textureArray_shader.programm_id, "used_textures_i");
        textureArray_shader.bind();
        
        float used_textures_i[6];
        for (int i = 0; i < CURRENT_NUM_INSTANCES; i++) 
        {
            //fill data - what textures uses this dip 
            for (int j = 0; j < 6; j++) 
            {
                used_textures_i[j] = counter % TEX_ARRAY_SIZE;
                counter++; 
            }
            
            glUniform1fv(textureArray_usedTex_varLocation, 6, &used_textures_i[0]); //transfer to shader, tell what textures this material uses 
            glDrawRangeElements(GL_TRIANGLES, i * BOX_NUM_VERTS, (i + 1) * BOX_NUM_VERTS, BOX_NUM_INDICES, GL_UNSIGNED_INT, (GLvoid*)(i*BOX_NUM_INDICES * sizeof(int))); //simple dip
        }
    }
}

Table 6. Texture change tests (time in ms, depending on the number of dips)

Test type 2048 1024 512 128
ARRAY_OF_TEXTURES_TEST 6.2 3.12 1.577 0.315
TEXTURES_ARRAY_TEST 1.42 0.7 0.35 0.08

Comparative estimation of state changes

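Below is a table with the execution time of all tests performed. The column headers are nominal: the FBO test used 400, 200, 100 and 25 changes (Table 2), and Tables 1, 3 and 4 list counts of 2000, 1000, 500 and 100.

Table 7. State change tests (time in ms, depending on the number of dips)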

Test type 2048 1024 512 128
SIMPLE_DIPS_TEST 0.4 0.21 0.107 0.0255
FBO_CHANGE_TEST 2.72 1.42 0.73 0.257
SHADERS_CHANGE_TEST 5.16 2.6 1.28 0.257
UNIFORMS_SIMPLE_CHANGE_TEST 2.25 1.1 0.54 0.1145
UNIFORMS_SSBO_CHANGE_TEST 1.3 0.628 0.32 0.0725
VBO_CHANGE_TEST 1.6 0.785 0.396 0.086
ARRAY_OF_TEXTURES_TEST 6.2 3.12 1.577 0.315
TEXTURES_ARRAY_TEST 1.42 0.7 0.35 0.08

Using these results we can calculate the cost of individual API calls. The absolute cost is given per 1000 API calls; the relative cost is calculated in relation to a simple dip call (glDrawRangeElements).

Table 8. API call cost. Intel Core i5-4460 3.2 GHz. Time in ms per 1000 calls.

API call Absolute cost Relative cost %
glBindFramebuffer 7.1 3550%
glUseProgram 2.04 1020%
glBindVertexArray 0.765 382%
glBindTexture 0.584 292%
glDrawRangeElements 0.2 100%
glUniform4fv 0.09 45%

Of course, these measurements should be treated with caution, as they will change depending on the driver version and the hardware.
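
The relative column of Table 8 is the absolute cost divided by the cost of glDrawRangeElements, expressed in percent. The article does not state which table column or which subtractions were used to obtain each absolute cost, so the first function below is only the plain time-per-count scaling (it reproduces the 0.2 ms figure for simple dips from 0.4 ms per 2000 dips). Both function names are made up for this sketch.

```cpp
#include <cmath>

// Cost of 1000 calls in ms, given the measured time for `calls` calls.
inline double cost_per_1000_calls(double time_ms, int calls)
{
    return time_ms / calls * 1000.0;
}

// Cost relative to a baseline call, in percent (the last column of Table 8).
inline double relative_cost_percent(double cost_ms, double baseline_ms)
{
    return cost_ms / baseline_ms * 100.0;
}
```

For example, 7.1 ms against the 0.2 ms baseline gives 3550%, which is the glBindFramebuffer row.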

Instancing

Instancing was invented to quickly render the same geometry with different parameters. Each instance has a unique index, which is used in the shader to fetch the parameters for that object, vary some options, etc. The main advantage of instancing is that it greatly reduces the number of dips.

We can pack the parameters of all instances into a buffer, transfer them to the GPU and issue just one dip. Storing data in a buffer is a good optimization in itself: we save on not having to change shader parameters constantly. Also, if the instance data does not change (for example, we know for certain that the geometry is static), we do not need to transfer it to the GPU every frame, only once at program or level start. In general, for optimal rendering we should first pack all instance data into one buffer and transfer it to the GPU with one command. Then, for each dip (each type of geometry), we only set the offset at which the instance data for this dip is located. Using the instance index (gl_InstanceID in OpenGL) we can fetch the data for a particular instance.
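
The packing step can be sketched on the CPU side as follows. This is an illustration only: `Vec4`, `InstanceGroup` and `pack_instance_groups` are hypothetical names, and `vectors_per_instance` plays the role of PER_INSTANCE_DATA_VECTORS from the article's listings. The resulting `packed` array is what would be uploaded once, and `first_vector` is the per-dip offset passed to the shader.

```cpp
#include <cstddef>
#include <vector>

struct Vec4 { float x, y, z, w; };

// Hypothetical record for one geometry type: where its instances start in the
// shared buffer (in vec4 units) and how many instances to draw.
struct InstanceGroup { std::size_t first_vector; std::size_t instance_count; };

// Appends the per-instance vectors of every group to one array and records
// the offset of each group inside that array.
inline std::vector<InstanceGroup> pack_instance_groups(
    const std::vector<std::vector<Vec4>>& groups,
    std::size_t vectors_per_instance,
    std::vector<Vec4>& packed)
{
    std::vector<InstanceGroup> result;
    for (const std::vector<Vec4>& g : groups)
    {
        result.push_back({ packed.size(), g.size() / vectors_per_instance });
        packed.insert(packed.end(), g.begin(), g.end());
    }
    return result;
}
```

In the shader, the data of an instance would then be read at `first_vector + gl_InstanceID * vectors_per_instance`.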

There are many ways to store data in OpenGL: vertex buffers (VBO), uniform buffers (UBO), texture buffers (TBO), shader storage buffers (SSBO) and textures. Each has its own features; let’s consider them in turn.

Texture instancing

All data is stored in a texture. To update texture data efficiently, use a Pixel Buffer Object (PBO), which allows data to be transferred from the CPU to the GPU asynchronously: the CPU does not wait until the data has been transferred and continues to work.

Creation code:

    glGenBuffers(2, textureInstancingPBO);
    
    //GL_STREAM_DRAW_ARB means that we will change data every frame 
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, textureInstancingPBO[0]);
    glBufferData(GL_PIXEL_UNPACK_BUFFER, INSTANCES_DATA_SIZE, 0, GL_STREAM_DRAW_ARB);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, textureInstancingPBO[1]);
    glBufferData(GL_PIXEL_UNPACK_BUFFER, INSTANCES_DATA_SIZE, 0, GL_STREAM_DRAW_ARB);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
    
    //create texture where we will store instances data on gpu
    glGenTextures(1, &textureInstancingDataTex);
    glBindTexture(GL_TEXTURE_2D, textureInstancingDataTex);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_REPEAT);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_REPEAT);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_R, GL_REPEAT);
    
    //in each line we store NUM_INSTANCES_PER_LINE object's data. 128 in our case
    //for each object we store PER_INSTANCE_DATA_VECTORS data-vectors. 2 in our case 
    //GL_RGBA32F - we have float32 data
    //complex_mesh_instances_data - source data of instances, if we are not going to update data in the texture
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F, NUM_INSTANCES_PER_LINE * PER_INSTANCE_DATA_VECTORS, MAX_INSTANCES / NUM_INSTANCES_PER_LINE, 0, GL_RGBA, GL_FLOAT, &complex_mesh_instances_data[0]);
    glBindTexture(GL_TEXTURE_2D, 0);

Texture update:

glBindTexture(GL_TEXTURE_2D, textureInstancingDataTex);
glBindBufferARB(GL_PIXEL_UNPACK_BUFFER, textureInstancingPBO[current_frame_index]);

// copy pixels from PBO to texture object 
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, NUM_INSTANCES_PER_LINE * PER_INSTANCE_DATA_VECTORS, MAX_INSTANCES / NUM_INSTANCES_PER_LINE, GL_RGBA, GL_FLOAT, 0);

// bind PBO to update pixel values 
glBindBufferARB(GL_PIXEL_UNPACK_BUFFER, textureInstancingPBO[next_frame_index]);

//http://www.songho.ca/opengl/gl_pbo.html 
// Note that glMapBufferARB() causes sync issue. 
// If GPU is working with this buffer, glMapBufferARB() will wait(stall) 
// until GPU to finish its job. To avoid waiting (idle), you can call 
// first glBufferDataARB() with NULL pointer before glMapBufferARB(). 
// If you do that, the previous data in PBO will be discarded and 
// glMapBufferARB() returns a new allocated pointer immediately 
// even if GPU is still working with the previous data. 
glBufferData(GL_PIXEL_UNPACK_BUFFER, INSTANCES_DATA_SIZE, 0, GL_STREAM_DRAW_ARB);

gpu_data = (float*)glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY_ARB);
if (gpu_data) 
{
    memcpy(gpu_data, &complex_mesh_instances_data[0], INSTANCES_DATA_SIZE); // update data
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER); //release pointer to mapping buffer 
}

Rendering using texture instancing:

//bind texture with instances data 
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_2D, textureInstancingDataTex);
glBindSampler(0, Sampler_nearest);

glBindVertexArray(geometry_vao_id); //what geometry to render 
tex_instancing_shader.bind(); //with what shader

//tell shader which texture unit holds the instance data texture 
static GLint location = glGetUniformLocation(tex_instancing_shader.programm_id, "s_texture_0");
if (location >= 0) 
    glUniform1i(location, 0);

//render group of objects 
glDrawElementsInstanced(GL_TRIANGLES, BOX_NUM_INDICES, GL_UNSIGNED_INT, NULL, CURRENT_NUM_INSTANCES);

Vertex shader to access the data:

#version 150 core
in vec3 s_pos; 
in vec3 s_normal; 
in vec2 s_uv;
uniform mat4 ModelViewProjectionMatrix;

uniform sampler2D s_texture_0;

out vec2 uv; 
out vec3 instance_color;

void main() 
{ 
    const vec2 texel_size = vec2(1.0 / 256.0, 1.0 / 16.0);
    const int objects_per_row = 128;
    const vec2 half_texel = vec2(0.5, 0.5);
    
    //calc texture coordinates - where our instance data located 
    //gl_InstanceID % objects_per_row - index of the object within the line
    //multiply by 2 as each object has 2 vectors of data 
    //gl_InstanceID / objects_per_row - in which line our data is located 
    //multiplying by texel_size converts the integer texel id to 0..1 uv for sampling the texture 
    vec2 texel_uv = (vec2((gl_InstanceID % objects_per_row) * 2, floor(gl_InstanceID / objects_per_row)) + half_texel) * texel_size;
    vec4 instance_pos = textureLod(s_texture_0, texel_uv, 0);
    instance_color = textureLod(s_texture_0, texel_uv + vec2(texel_size.x, 0.0), 0).xyz;
    
    uv = s_uv; 
    gl_Position = ModelViewProjectionMatrix * vec4(s_pos + instance_pos.xyz, 1.0); 
}
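
To check the addressing in the shader above, the same arithmetic can be mirrored on the CPU. This is a sketch with hypothetical names (`Vec2`, `instance_texel_uv`); the texture size of 256 x 16 texels is inferred from the `texel_size` constant in the shader.

```cpp
#include <cmath>

struct Vec2 { float x, y; };

// CPU-side mirror of the texel_uv computation in the vertex shader.
// tex_width and tex_height are the data texture size in texels,
// vectors_per_instance is 2 in the article (position and color).
inline Vec2 instance_texel_uv(int instance_id, int objects_per_row,
                              int vectors_per_instance, int tex_width, int tex_height)
{
    const int column = (instance_id % objects_per_row) * vectors_per_instance;
    const int row = instance_id / objects_per_row;
    // +0.5 addresses the texel center, so nearest filtering reads the exact value
    return Vec2{ (column + 0.5f) / tex_width, (row + 0.5f) / tex_height };
}
```

Instance 129 with 128 objects per row lands in the second row, third texel: uv = (2.5 / 256, 1.5 / 16). The color vector of the same instance is one texel to the right.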

Instancing through vertex buffer

The idea is to keep instance data in a separate vertex buffer and access it in the shader through vertex attributes. The code that creates the buffer with the data is trivial. The main task is to modify the vertex description seen by the shader (vertex declaration, vdecl).

//...code of base vertex declaration creation 
//special attributes binding 
glBindBuffer(GL_ARRAY_BUFFER, all_instances_data_vbo);

//size of per instance data (PER_INSTANCE_DATA_VECTORS = 2 - so we have to create 2 additional attributes to transfer data) 
const int per_instance_data_size = sizeof(vec4) * PER_INSTANCE_DATA_VECTORS; 

//attribute 4: 4 floats, 0 data offset
glEnableVertexAttribArray(4);
glVertexAttribPointer((GLuint)4, 4, GL_FLOAT, GL_FALSE, per_instance_data_size, (GLvoid*)(0));
//tell that we will change this attribute per instance, not per vertex 
glVertexAttribDivisor(4, 1);

//attribute 5: 4 floats, sizeof(vec4) data offset
glEnableVertexAttribArray(5);
glVertexAttribPointer((GLuint)5, 4, GL_FLOAT, GL_FALSE, per_instance_data_size, (GLvoid*)(sizeof(vec4)));
//tell that we will change this attribute per instance, not per vertex 
glVertexAttribDivisor(5, 1); 
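
One way to keep the stride and offsets used for attributes 4 and 5 in sync with the data is to derive them from a struct that describes one instance. The sketch below assumes the per-instance record is exactly two vec4 values (position and color); `Vec4`, `InstanceData` and the constants are hypothetical names, not taken from the article's sources.

```cpp
#include <cstddef>

struct Vec4 { float x, y, z, w; };

// Per-instance record matching the two extra attributes:
// attribute 4 reads `position`, attribute 5 reads `color`.
struct InstanceData
{
    Vec4 position;
    Vec4 color;
};

// Values for the stride and pointer arguments of glVertexAttribPointer.
constexpr std::size_t kInstanceStride = sizeof(InstanceData);
constexpr std::size_t kPositionOffset = offsetof(InstanceData, position);
constexpr std::size_t kColorOffset    = offsetof(InstanceData, color);

// Fail at compile time if the compiler inserts padding we did not expect.
static_assert(kInstanceStride == 2 * sizeof(Vec4), "no padding expected");
static_assert(kColorOffset == sizeof(Vec4), "color follows position directly");
```

If a field is later added to `InstanceData`, the stride changes automatically and the static_assert lines point at the attribute setup that needs updating.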

Rendering code:

vbo_instancing_shader.bind();
//our vertex buffer with modified vertex declaration (vdecl) 
glBindVertexArray(geometry_vao_vbo_instancing_id); 
glDrawElementsInstanced(GL_TRIANGLES, BOX_NUM_INDICES, GL_UNSIGNED_INT, NULL, CURRENT_NUM_INSTANCES); 

Vertex shader to access data:

#version 150 core
in vec3 s_pos; 
in vec3 s_normal; 
in vec2 s_uv; 
in vec4 s_attribute_3; //some_data;
in vec4 s_attribute_4; //instance pos 
in vec4 s_attribute_5; //instance color
uniform mat4 ModelViewProjectionMatrix;
out vec3 instance_color;

void main() 
{
    instance_color = s_attribute_5.xyz; 
    gl_Position = ModelViewProjectionMatrix * vec4(s_pos + s_attribute_4.xyz, 1.0);
}

Uniform buffer instancing, Texture buffer instancing, Shader Storage buffer instancing

These three methods are very similar to each other; they differ mostly in the buffer type. A uniform buffer (UBO) is small, but in theory it should be faster than the others. A texture buffer (TBO) can be very large: we can store all scene instance data and skeletal transformations in it. A shader storage buffer (SSBO) has both properties: it is fast and large, and shaders can also write to it. The only drawback is that it is a newer feature that old hardware does not support.

Uniform buffer

Creation code:

    glGenBuffers(1, &dips_uniform_buffer); 
    glBindBuffer(GL_UNIFORM_BUFFER, dips_uniform_buffer); 
    glBufferData(GL_UNIFORM_BUFFER, INSTANCES_DATA_SIZE, &complex_mesh_instances_data[0], GL_STATIC_DRAW);
    
    //uniform_buffer_data 
    glBindBuffer(GL_UNIFORM_BUFFER, 0);
    
    //bind uniform buffer with instances data to shader 
    ubo_instancing_shader.bind(true); 
    GLint instanceData_location3 = glGetUniformLocation(ubo_instancing_shader.programm_id, "instance_data");
    
    //link to shader 
    glUniformBufferEXT(ubo_instancing_shader.programm_id, instanceData_location3, dips_uniform_buffer); //actually binding 

Instancing vertex shader with uniform buffer:

#version 150 core 
#extension GL_EXT_bindable_uniform : enable 
#extension GL_EXT_gpu_shader4 : enable

in vec3 s_pos; 
in vec3 s_normal; 
in vec2 s_uv;
uniform mat4 ModelViewProjectionMatrix; 
bindable uniform vec4 instance_data[4096]; //our uniform with instances data
out vec3 instance_color;
void main() 
{
    vec4 instance_pos = instance_data[gl_InstanceID*2];
    instance_color = instance_data[gl_InstanceID*2+1].xyz;
    gl_Position = ModelViewProjectionMatrix * vec4(s_pos + instance_pos.xyz, 1.0);
}

Texture Buffer

Binding and rendering code:

tbo_instancing_shader.bind();
//bind to shader as special texture 
glActiveTexture(GL_TEXTURE0); 
glBindTexture(GL_TEXTURE_BUFFER, dips_texture_buffer_tex); 
glTexBuffer(GL_TEXTURE_BUFFER, GL_RGBA32F, dips_texture_buffer);

glBindVertexArray(geometry_vao_id);
glDrawElementsInstanced(GL_TRIANGLES, BOX_NUM_INDICES, GL_UNSIGNED_INT, NULL, CURRENT_NUM_INSTANCES);

Vertex shader:

#version 150 core 
#extension GL_EXT_bindable_uniform : enable 
#extension GL_EXT_gpu_shader4 : enable
in vec3 s_pos; 
in vec3 s_normal; 
in vec2 s_uv;
uniform mat4 ModelViewProjectionMatrix; 
uniform samplerBuffer s_texture_0; //our TBO texture buffer
out vec3 instance_color;
void main() 
{ 
    //sample data from TBO 
    vec4 instance_pos = texelFetch(s_texture_0, gl_InstanceID*2);
    instance_color = texelFetch(s_texture_0, gl_InstanceID*2+1).xyz; 
    gl_Position = ModelViewProjectionMatrix * vec4(s_pos + instance_pos.xyz, 1.0); 
}

SSBO

Creation code:

    glGenBuffers(1, &ssbo); 
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, ssbo);
    glBufferData(GL_SHADER_STORAGE_BUFFER, INSTANCES_DATA_SIZE, &complex_mesh_instances_data[0], GL_STATIC_DRAW); 
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ssbo); 
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0); // unbind 

Render:

//bind ssbo_instances_data, link to shader at '0 binding point' 
glBindBuffer(GL_SHADER_STORAGE_BUFFER, ssbo_instances_data);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ssbo_instances_data);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0);
ssbo_instancing_shader.bind(); 
glBindVertexArray(geometry_vao_id);
glDrawElementsInstanced(GL_TRIANGLES, BOX_NUM_INDICES, GL_UNSIGNED_INT, NULL, CURRENT_NUM_INSTANCES);
glBindVertexArray(0); 

Vertex shader:

    #version 430
    #extension GL_ARB_shader_storage_buffer_object : require
    in vec3 s_pos; 
    in vec3 s_normal; 
    in vec2 s_uv;
    uniform mat4 ModelViewProjectionMatrix;
    //ssbo should be bound to binding point 0 
    layout(std430, binding = 0) buffer ssboData { vec4 instance_data[4096]; };
    out vec3 instance_color;
    void main() 
    {
        //gl_InstanceID is unique for each instance, so we are able to fetch per instance data 
        vec4 instance_pos = instance_data[gl_InstanceID*2]; 
        instance_color = instance_data[gl_InstanceID*2+1].xyz;
        gl_Position = ModelViewProjectionMatrix * vec4(s_pos + instance_pos.xyz, 1.0);
    }

Uniforms instancing

This is pretty simple. The glUniform* commands let us pass several vectors of data to the shader. The maximum number depends on the video card and can be queried by calling glGetIntegerv with the GL_MAX_VERTEX_UNIFORM_VECTORS parameter. For the R9 380 it returns 4096. The minimum value is 256.

uniforms_instancing_shader.bind();
glBindVertexArray(geometry_vao_id);
//variable - where in shader our array of uniforms located. We will write data to this array 
static int uniformsInstancing_data_varLocation = glGetUniformLocation(uniforms_instancing_shader.programm_id, "instance_data");
//instances data might be written with just one call if there are enough vectors. 
//Just for clarity, divide into groups, because usually there is much more data than available uniforms. 
for (int i = 0; i < UNIFORMS_INSTANCING_NUM_GROUPS; i++) 
{ 
    //write data to uniforms 
    glUniform4fv(uniformsInstancing_data_varLocation, UNIFORMS_INSTANCING_MAX_CONSTANTS_FOR_INSTANCING, &complex_mesh_instances_data[i*UNIFORMS_INSTANCING_MAX_CONSTANTS_FOR_INSTANCING].x);
    glDrawElementsInstanced(GL_TRIANGLES, BOX_NUM_INDICES, GL_UNSIGNED_INT, NULL, UNIFORMS_INSTANCING_OBJECTS_PER_DIP); 
}
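
How the group constants in the loop above relate to the uniform limit can be sketched as simple integer arithmetic. The function names are hypothetical. `max_vectors` is the number of vec4 uniforms reserved for instance data, which in practice should be somewhat below the value returned for GL_MAX_VERTEX_UNIFORM_VECTORS, because other uniforms such as matrices also count against that limit.

```cpp
// Largest number of instances one dip can carry when every instance needs
// `vectors_per_instance` vec4 uniforms and the shader array holds `max_vectors`.
inline int instances_per_dip(int max_vectors, int vectors_per_instance)
{
    return max_vectors / vectors_per_instance;
}

// Number of glUniform4fv + glDrawElementsInstanced pairs needed for all instances.
inline int num_groups(int total_instances, int per_dip)
{
    return (total_instances + per_dip - 1) / per_dip; // round up
}
```

With 256 vectors and 2 vectors per instance, one dip carries 128 instances, so 2000 instances need 16 groups; the last group is only partly filled and should draw fewer instances.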

Multi draw indirect

Let’s separately consider a command that issues a huge number of dips in one call. It is very useful: it renders a group of instances with different geometry, even thousands of different groups, with one command. As input it receives an array that describes the parameters of the dips: the number of indices, the offsets in the vertex buffers and the number of instances per group. The restriction is that all the geometry must be placed in one vertex buffer and rendered with one shader. An additional advantage is that the dip information for the MultiDraw command can be filled in on the GPU side, which is very useful for GPU frustum culling, for example.

//fill indirect buffer with dips information. Just simple array 
for (int i = 0; i < CURRENT_NUM_INSTANCES; i++) 
{
    multi_draw_indirect_buffer[i].vertexCount = BOX_NUM_INDICES; 
    multi_draw_indirect_buffer[i].instanceCount = 1; 
    multi_draw_indirect_buffer[i].firstVertex = i*BOX_NUM_INDICES; 
    multi_draw_indirect_buffer[i].baseVertex = 0; 
    multi_draw_indirect_buffer[i].baseInstance = 0; 
}
glBindVertexArray(ws_complex_geometry_vao_id); 
simple_geometry_shader.bind();
glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, (GLvoid*)&multi_draw_indirect_buffer[0], //our information about dips 
                            CURRENT_NUM_INSTANCES, //number of dips 
                            0); //stride, 0 means the commands are tightly packed

The glMultiDrawElementsIndirect command performs several glDrawElementsIndirect draws in one call. This command has one unpleasant feature: each such draw has an independent gl_InstanceID, i.e. it resets to 0 with each new Draw*, which makes it difficult to access the required per-instance data. The problem is solved by modifying the vertex declaration of each type of object sent to the renderer; see the article "Surviving without gl_DrawID" (link 14). Note that glMultiDrawElementsIndirect performs a huge number of dips with a single command, so it should not be compared directly with the other types of instancing.
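
The workaround is commonly implemented by storing the draw index in the `baseInstance` field of each command and reading it back through an instanced vertex attribute (divisor 1) that holds the values 0, 1, 2, ...: attribute fetching is offset by `baseInstance`, while gl_InstanceID is not. The sketch below shows only the CPU side and is my reading of that approach, not code from the article's sources. The struct layout follows the OpenGL specification; `build_commands` is a hypothetical helper.

```cpp
#include <cstdint>
#include <vector>

// Layout of one indexed indirect draw as defined by the OpenGL specification.
struct DrawElementsIndirectCommand
{
    std::uint32_t count;         // number of indices
    std::uint32_t instanceCount; // number of instances
    std::uint32_t firstIndex;    // offset into the index buffer, in indices
    std::int32_t  baseVertex;    // value added to each index
    std::uint32_t baseInstance;  // offsets instanced attributes, not gl_InstanceID
};
static_assert(sizeof(DrawElementsIndirectCommand) == 20, "must be tightly packed");

// One command per object; baseInstance carries the draw index.
inline std::vector<DrawElementsIndirectCommand> build_commands(
    std::uint32_t num_draws, std::uint32_t indices_per_draw)
{
    std::vector<DrawElementsIndirectCommand> cmds(num_draws);
    for (std::uint32_t i = 0; i < num_draws; ++i)
        cmds[i] = { indices_per_draw, 1u, i * indices_per_draw, 0, i };
    return cmds;
}
```

In the shader, the attribute value then serves as the index into the per-object data, in the same way gl_InstanceID is used in the instancing shaders.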

Performance comparison of different types of instancing

Table 9. Instancing test performance. Number of instances = 2000; columns give the number of test repetitions; time in ms.

Instancing type x1 x10 x100
UBO_INSTANCING 0.0067 0.02 0.15
TBO_INSTANCING 0.0245 0.06 0.49
SSBO_INSTANCING 0.009 0.0225 0.17
VBO_INSTANCING 0.01 0.0213 0.155
TEXTURE_INSTANCING 0.018 0.0262 0.183
UNIFORMS_INSTANCING 0.058 0.58 6.03
MULTI_DRAW_INDIRECT 0.136 1.33 13.53

As can be seen, UBO instancing is faster than TBO instancing and is the fastest method overall. TBO instancing allows a huge amount of information to be stored, but it is slow in comparison with UBO. If possible, you should use SSBO storage: it is fast, convenient and large.

Texture instancing is also a good alternative to UBO. It is supported by old hardware and can store a large amount of information, but it is somewhat inconvenient to update.

Transferring data each frame through glUniform* is, as expected, the slowest instancing method.

In these tests glMultiDrawElementsIndirect performed 2k, 20k and 200k dips! Note, however, that the test repeated the call; such a number of dips can be issued by just one call. The only caveat is that with so many dips the array describing them becomes quite large (it is better to fill it on the GPU).

Recommendations for optimization and conclusions

In this article we analyzed the cost of API calls and measured the performance of different types of instancing. In general, the fewer state switches, the better. Use the newest features of the latest API version: texture arrays, SSBO, Draw Indirect, and buffer mapping with the GL_MAP_PERSISTENT_BIT and GL_MAP_COHERENT_BIT flags for fast data transfer. Recommendations:

  • The fewer state changes, the better. Group objects by material.
  • Wrap state changes (textures, buffers, shaders and other states) and check whether the state has really changed before making the API call, because the call is much slower than a flag or index check.
  • Merge geometry into one buffer.
  • Use texture arrays.
  • Store data in large buffers and textures.
  • Use as few shaders as possible. But an overly complicated, universal shader with many branches will obviously be a problem, especially on older video cards, where branching is expensive.
  • Use instancing.
  • Use Draw Indirect where possible and generate the dip information on the GPU side.
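
The state wrapping recommended in the list can be sketched as a small cache in front of the real call. `CachedBinding` is a hypothetical class, not part of the article's sources. The actual API call is passed in as a callable, for example a lambda around glUseProgram or glBindVertexArray, so the sketch itself does not depend on an OpenGL context.

```cpp
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical wrapper: skips the API call when the requested id is already bound.
class CachedBinding
{
public:
    // `bind_fn` performs the real call, e.g. a lambda around glUseProgram.
    explicit CachedBinding(std::function<void(std::uint32_t)> bind_fn)
        : bind_fn_(std::move(bind_fn)) {}

    // Returns true if the underlying API call was actually issued.
    bool bind(std::uint32_t id)
    {
        if (valid_ && id == current_)
            return false; // cheap integer compare instead of a driver call
        bind_fn_(id);
        current_ = id;
        valid_ = true;
        return true;
    }

    // Call when code outside the wrapper may have changed the state.
    void invalidate() { valid_ = false; }

private:
    std::function<void(std::uint32_t)> bind_fn_;
    std::uint32_t current_ = 0;
    bool valid_ = false;
};
```

The cache is only correct if every change of that state goes through the wrapper; third-party code that binds objects directly requires a call to `invalidate()` afterwards.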

Some general advice:

  • Find the bottlenecks and optimize them first.
  • Know what limits performance, the CPU or the GPU, and optimize that.
  • Don’t do the same work twice. Reuse the results of different passes and of previous frames (reprojection techniques, sorting, tracing, anything).
  • Expensive calculations can be precomputed.
  • The best optimization is not doing the work at all.
  • Use parallel computation: split the work into parts and run them on parallel threads.

Source code of all examples: GL_API_overhead.rar (83.59KB)

Links:

  1. Beyond Porting
  2. OpenGL documentation
  3. Instancing in OpenGL
  4. OpenGL Pixel Buffer Object (PBO)
  5. Buffer Object
  6. Drawing with OpenGL
  7. Shader Storage Buffer Object
  8. Shader Storage Buffers Objects
  9. hardware caps, stats (for GL_MAX_VERTEX_UNIFORM_COMPONENTS)
  10. Array Texture
  11. Textures Array example
  12. MultiDrawIndirect example
  13. The Road to One Million Draws
  14. MultiDrawIndirect, Surviving without gl_DrawID
  15. Humus demo about Instancing

Anatoliy Gerlits February 2017
