Using the SYCL Kernel Fusion Extension - A Hands-On Introduction
21 June 2023
In a previous blog post, we introduced the SYCL extension for user-driven kernel fusion developed by Codeplay, and how it can improve SYCL application performance.
In this post, we will take a more hands-on approach to show how to set up DPC++ with kernel fusion support and compile and run a SYCL application. Using that knowledge will allow you to apply the kernel fusion extension to your own SYCL applications.
We will also get a glimpse into SYCL application performance analysis with Intel VTune, investigating the reason for performance improvement through kernel fusion.
So, let's get started!
Prerequisites
Throughout this tutorial, we will be assuming a Linux system with at least one SYCL device compatible with the kernel fusion extension. Concretely, this would need to be a device that is compatible with a SPIR-V compatible SYCL backend. This would for example be the OpenCL backend with an Intel CPU or GPU (also integrated), or the Intel LevelZero backend with an Intel GPU (also integrated).
The installation of the necessary drivers and device backends is out-of-scope for this blog post, consult the corresponding manuals for OpenCL and LevelZero for installation instructions.
For the section on VTune performance analysis, we will assume an OpenCL CPU, suited for the selection of metrics that we will investigate. Intel VTune supports performance analysis across a wide range of different platforms and devices, for example, SYCL application performance analysis via a graphical user interface or via the command-line.
The installation of VTune itself is out-of-scope for this blog post, see the installation manual for instructions.
As an example, the system used throughout this blog post uses an Intel i7 6700K CPU with the OpenCL backend, Ubuntu 20.04 LTS and Intel VTune 2023.1.0 preview.
Setup of DPC++ with kernel fusion support
If you were hoping for a long setup procedure now (who isn't ? 😉), I'll have to disappoint you: Current DPC++ daily releases already support kernel fusion by default.
So all we need to do is download a daily release, unpack it and setup up some environment variables.
First, go to some directory of your choosing (you should have write permissions) and then execute the following command to download a daily release:
wget https://github.com/intel/llvm/releases/download/sycl-nightly%2F20230408/dpcpp-compiler.tar.gzNote that newer daily releases found here might also work.
In the next step, we'll unpack the DPC++ release and eventually remove the TAR file we downloaded:
tar xfz dpcpp-compiler.tar.gz
rm dpcpp-compiler.tar.gzexport PATH=$(pwd)/dpcpp_compiler/bin:$PATH
export LD_LIBRARY_PATH=$(pwd)/dpcpp_compiler/lib:$LD_LIBRARY_PATHNote that this step needs to be repeated each time you re-open the terminal and assumes your current working directory matches the directory you choose for setup.
To verify that the setup worked, use the following commands:
which clang++
which sycl-lsIn both cases, the output should point to an executable in the dpcpp_compiler/bin subdirectory of the setup directory you chose.
Also, the following command should show at least one device for either the OpenCL or LevelZero backend:
sycl-lsAn example output would be:
[opencl:cpu:0] Intel(R) OpenCL, Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz OpenCL 3.0In case
    this command does not print a device, see the OpenCL or LevelZero installation manual for instructions.
Example application
For demonstration purposes, we will be using the example application from the original blog-post.
You can download the complete source for the example application here.
After completing the setup above, we can compile the example application, still without kernel fusion:
clang++ -fsycl no-fusion.cpp -o no-fusionAfter that, we can execute the application:
./no-fusionThe output should be similar to this, the exact runtime is of course dependent on
    your hardware configuration*:
Elapsed time in microseconds: 298080
Elapsed time in microseconds: 235596
Elapsed time in microseconds: 235771
Elapsed time in microseconds: 235865
Elapsed time in microseconds: 235926
Elapsed time in microseconds: 235780
Elapsed time in microseconds: 235566
Elapsed time in microseconds: 235890
Elapsed time in microseconds: 235718
Elapsed time in microseconds: 235786Fusion
In the next step, we can enable kernel fusion for the application, but still without dataflow internalization. To that end, a few modifications to the source-code are necessary; you can download the resulting source file here.
We can compare the previous source file with the new one:
git diff --no-index no-fusion.cpp fusion.cppFrom the output, we can see that only minimal
    code changes are necesary to enable kernel fusion in the application:
diff --git a/no-fusion.cpp b/fusion.cpp
index 08595a0..4cd0702 100644
--- a/no-fusion.cpp
+++ b/fusion.cpp
@@ -28,7 +28,7 @@ int main() {
   auto in4 = get_random_data();
   std::vector<float> out(dataSize, -1.0);
-  queue q{};
+  queue q{ext::codeplay::experimental::property::queue::enable_fusion{}};
   {
     buffer<float> bIn1{in1.data(), range{dataSize}};
@@ -40,10 +40,14 @@ int main() {
     buffer<float> bTmp2{range{dataSize}};
     buffer<float> bTmp3{range{dataSize}};
+    ext::codeplay::experimental::fusion_wrapper fw{q};
+
     for (size_t i = 0; i < 10; ++i) {
       auto start = std::chrono::high_resolution_clock::now();
+      fw.start_fusion();
+
       // tmp1 = in1 * in2
       q.submit([&](handler &cgh) {
         auto accIn1 = bIn1.get_access(cgh);
@@ -80,6 +84,8 @@ int main() {
             dataSize, [=](id<1> i) { accOut[i] = accTmp1[i] - accTmp3[i]; });
       });
+      fw.complete_fusion();
+
       q.wait();
       auto stop = std::chrono::high_resolution_clock::now();
clang++ -fsycl fusion.cpp -o fusionAs you can see, there are no additional compilation flags needed for kernel fusion.
There are also no additional flags needed for kernel fusion when executing the application:
./fusionThis should result in output similar to the following*:
Elapsed time in microseconds: 337596
Elapsed time in microseconds: 184143
Elapsed time in microseconds: 184681
Elapsed time in microseconds: 184205
Elapsed time in microseconds: 184065
Elapsed time in microseconds: 184574
Elapsed time in microseconds: 184253
Elapsed time in microseconds: 184524
Elapsed time in microseconds: 184426
Elapsed time in microseconds: 184191
On the other hand, the remaining iterations are faster than in the non-fused case*. In the later section on VTune analysis, we will explore the reason for that more.
Dataflow Internalization
As discussed in the first blog-post, internalization of dataflow in the fused kernel can be an important optimization technique.
We can apply dataflow internalization to the buffers bTmp1, bTmp2 and
    bTmp3 in our application, resulting in the modified source code that you can download from here.
Using the following command, we can make the necessary code changes for internalization visible:
git diff --no-index fusion.cpp internalization.cppThe output shows that we mainly need to
    pass an additional property to the buffer definition. As an additional optimization, we have disabled the insertion
    of extra work-group barriers by the fusion JIT compiler by passing a property to
    complete_fusion():
diff --git a/fusion.cpp b/internalization.cpp
index 4cd0702..430c020 100644
--- a/fusion.cpp
+++ b/internalization.cpp
@@ -36,9 +36,15 @@ int main() {
     buffer<float> bIn3{in3.data(), range{dataSize}};
     buffer<float> bIn4{in4.data(), range{dataSize}};
     buffer<float> bOut{out.data(), range{dataSize}};
-    buffer<float> bTmp1{range{dataSize}};
-    buffer<float> bTmp2{range{dataSize}};
-    buffer<float> bTmp3{range{dataSize}};
+    buffer<float> bTmp1{
+        range{dataSize},
+        {sycl::ext::codeplay::experimental::property::promote_private{}}};
+    buffer<float> bTmp2{
+        range{dataSize},
+        {sycl::ext::codeplay::experimental::property::promote_private{}}};
+    buffer<float> bTmp3{
+        range{dataSize},
+        {sycl::ext::codeplay::experimental::property::promote_private{}}};
     ext::codeplay::experimental::fusion_wrapper fw{q};
@@ -84,7 +90,7 @@ int main() {
             dataSize, [=](id<1> i) { accOut[i] = accTmp1[i] - accTmp3[i]; });
       });
-      fw.complete_fusion();
+      fw.complete_fusion(ext::codeplay::experimental::property::no_barriers{});
       q.wait();
clang++ -fsycl internalization.cpp -o internalization ./internalization
Again, no additional flags are needed for compilation or execution to enable fusion and dataflow internalization.
The output shows the performance improvement we can get from fusion with dataflow internalization*:
Elapsed time in microseconds: 266811
Elapsed time in microseconds: 89876
Elapsed time in microseconds: 89861
Elapsed time in microseconds: 89939
Elapsed time in microseconds: 89904
Elapsed time in microseconds: 89844
Elapsed time in microseconds: 90018
Elapsed time in microseconds: 89887
Elapsed time in microseconds: 89891
Elapsed time in microseconds: 89831
In the next section, we will use VTune to get a glimpse into why performance improves so significantly with fusion.
VTune Performance Analysis
The Intel VTune profiler can provide great insight into the performance of applications (not only for SYCL), allowing users to identify hotspots, analyze bottlenecks and guiding optimizations.
We will be using it, or, more specifically, its command-line interface, to get some insight into performance differences between the three different versions of our application.
Fusion Performance
First, in the attempt to determine the reason fusion, even without dataflow internalization, improves application performance, we will investigate the cache performance of both version.
To analyze applications, we first need to collect some metrics with VTune. For the non-fused version, this can be achieved with this command, limiting execution to the OpenCL CPU device:
ONEAPI_DEVICE_SELECTOR=opencl:cpu vtune -collect memory-access -r report-no-fusion ./no-fusionWe can do the same for the fused version with no dataflow internalization:
ONEAPI_DEVICE_SELECTOR=opencl:cpu vtune -collect memory-access -r report-fusion ./fusionNow that we have collected the metrics, we can generate reports for both versions, specifically focusing on the actual kernel functions. As the application performs only few arithmetic operations for each item of data loaded, i.e., it is memory-bound, we will investigate memory metrics in particular**. For this first step of fusion, the most relevant metric will be the cache metric, as fusion, even without internalization, can improve cache hit rate*.
For the version without fusion, this works with the following command:
vtune -report hw-events -r report-no-fusion --column="stalls_l1d_miss,stalls_l2_miss,stalls_l3_miss"\
  --filter function="main::{lambda(sycl::_V1::handler&)#1}::operator()(sycl::_V1::handler&) const::KernelOne"\
  --filter function="main::{lambda(sycl::_V1::handler&)#2}::operator()(sycl::_V1::handler&) const::KernelTwo"\
  --filter function="main::{lambda(sycl::_V1::handler&)#3}::operator()(sycl::_V1::handler&) const::KernelThree"\
  --filter function="main::{lambda(sycl::_V1::handler&)#4}::operator()(sycl::_V1::handler&) const::KernelFour"
Function                                                                               Hardware Event Count:CYCLE_ACTIVITY.STALLS_L1D_MISS (M)  Hardware Event Count:CYCLE_ACTIVITY.STALLS_L2_MISS (M)  Hardware Event Count:CYCLE_ACTIVITY.STALLS_L3_MISS (M)
-----------------------------------------------------------------------------------------  -------------------------------------------------------  ------------------------------------------------------ ---------------------------------------
main::{lambda(sycl::_V1::handler&)#1}::operator()(sycl::_V1::handler&) const::KernelOne                                                     14,352                                                  13,858                                13,286
main::{lambda(sycl::_V1::handler&)#3}::operator()(sycl::_V1::handler&) const::KernelThree                                                   14,196                                                  13,598                                13,208
main::{lambda(sycl::_V1::handler&)#2}::operator()(sycl::_V1::handler&) const::KernelTwo                                                     14,716                                                  13,832                                13,494
main::{lambda(sycl::_V1::handler&)#4}::operator()(sycl::_V1::handler&) const::KernelFour                                                    16,016                                                  15,262                                15,002fused_0
vtune -report hw-events -r report-fusion --column="stalls_l1d_miss,stalls_l2_miss,stalls_l3_miss" --filter function=fused_0In this case, the output should be similar to the following*:
Function  Hardware Event Count:CYCLE_ACTIVITY.STALLS_L1D_MISS (M)  Hardware Event Count:CYCLE_ACTIVITY.STALLS_L2_MISS (M)  Hardware Event Count:CYCLE_ACTIVITY.STALLS_L3_MISS (M)
--------  -------------------------------------------------------  ------------------------------------------------------  ------------------------------------------------------
fused_0                                                    40,586                                                  38,532                                                  37,232
Effect of dataflow internalization
When comparing the fused version with and without internalization above, we have seen that major performance improvements can result from dataflow internalization.
To analyze this improvement, we will investigate another VTune metric.
For the fused case, we can reuse the existing report. For the version with internalization, we can generate the report using the following command:
ONEAPI_DEVICE_SELECTOR=opencl:cpu vtune -collect memory-access -r report-internalization ./internalizationAfter that, we can compare the number of memory loads and stores performed by the two different versions of the fused
    kernel.
For the version without dataflow internalization:
vtune -report hw-events -r report-fusion --column=all_load,all_store --filter function=fused_0This yields an output similar to the following*:
Function  Hardware Event Count:MEM_INST_RETIRED.ALL_LOADS_PS (M)  Hardware Event Count:MEM_INST_RETIRED.ALL_STORES_PS (M)
--------  ------------------------------------------------------  -------------------------------------------------------
fused_0                                                      995                                                      501
vtune -report hw-events -r report-internalization --column=all_load,all_store --filterfunction=fused_0The resulting output will be similar to the following*:
Function  Hardware Event Count:MEM_INST_RETIRED.ALL_LOADS_PS (M)  Hardware Event Count:MEM_INST_RETIRED.ALL_STORES_PS (M)
--------  ------------------------------------------------------  -------------------------------------------------------
fused_0                                                      499                                                      124
Comparing the output, we can see that dataflow internalization reduces the number of loads by almost 500 million and the number of stores by close to 380 million, resulting in performance improvements*.
This analysis of course only scratches the surface of performance analysis and what Intel VTune supports. If you want to learn more about these topics, have a look at the Intel VTune cookbook (in particular the SYCL sections) or the getting started guide Intel VTune.Outlook
The SYCL extension for kernel fusion is currently an experimental feature, allowing users to experiment with the feature and us to gather early user feedback on the API and functionality.
At the same time, there is a larger effort to define a graph API for SYCL. SYCL graphs allows users to define a directed acyclic graph of dependent SYCL commands ahead of execution and is meant to open new optimization opportunities for SYCL applications, for example for workloads that repeatedly submit a similar sequence of kernels.
Fusing the kernels in a graph into a single kernel is one of those potential optimizations. Codeplay have therefore started work on a SYCL extension for graph fusion, building on top of the SYCL graph API. The sequence of kernels to fuse is defined through the graph API, which offers two different mechanisms: one recording mode very similar to the existing kernel fusion extension, and one API for explicitly constructing a graph from the kernel and dependencies. Using fusion on top of graphs provides a number of advantages, for example a more fine-grained control over when the JIT compilation for fusion actually takes place.
If you want to follow the development of the new extension, you can do so here.
Disclaimer*
Experiments performed on 10/04/2023 by Codeplay, with Intel Core i7-6700K, Ubuntu 20.04.5 LTS, Linux kernel 5.15, Intel VTune Profiler 2023.1.0 pre-release (build 625246), and OpenCL driver version 2022.14.10.0.20_160000.xmain-hotfix.
DPC++ nightly version 2023-04-08 (git commit 3d6917f) was
    used for measurements.
Performance varies by use, configuration and other factors. Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation. Intel, the Intel logo, Codeplay and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
Metrics**
Availability and naming of hardware events are CPU-specific. To obtain a
    full list of available metrics for a report, a command similar to the following can be used:
vtune -report hw-events --column="?" -r ./report-internalization