
Inductor C++ Wrapper Tutorial

Author: Chunyuan Wu, Bin Bao, Jiong Gong

Introduction

Python, as the primary interface of PyTorch, is easy to use and efficient for development and debugging. Inductor's default wrapper generates Python code to invoke generated kernels and external kernels. However, in deployments requiring high performance, Python, as an interpreted language, runs relatively slowly compared to compiled languages.

We implemented an Inductor C++ wrapper by leveraging the PyTorch C++ APIs to generate pure C++ code that combines the generated and external kernels. This allows for the execution of each captured Dynamo graph in pure C++, thereby reducing the Python overhead within the graph.

Enabling the API

This feature is still in the prototype stage. To activate it, add the following to your code:

import torch._inductor.config as config
config.cpp_wrapper = True

This will speed up your models by reducing the Python overhead of the Inductor wrapper.
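
The same setting can also be passed per compilation through the options argument of torch.compile, whose keys are forwarded to the Inductor config. Below is a minimal sketch; the "cpp_wrapper" option key is assumed to map to torch._inductor.config.cpp_wrapper as shown above:

import torch

def fn(x):
    return x * 2 + 1

# "cpp_wrapper" is assumed to be forwarded to torch._inductor.config.cpp_wrapper.
opt_fn = torch.compile(fn, options={"cpp_wrapper": True})
print(opt_fn(torch.randn(4)))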

Example code

We will use the following frontend code as an example:

import torch

def fn(x):
    return torch.tensor(list(range(2, 40, 2)), device=x.device) + x

x = torch.randn(1)
opt_fn = torch.compile()(fn)
y = opt_fn(x)
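
If you want to inspect the wrapper code that Inductor generates for this example (shown in the sections below), you can enable output-code logging. A minimal sketch, assuming torch._logging.set_logs (equivalent to running with the TORCH_LOGS="output_code" environment variable) is available in your PyTorch build:

import torch
import torch._logging

# Print the generated wrapper and kernel code to the log; the output reflects
# whichever wrapper (Python or C++) is currently configured.
torch._logging.set_logs(output_code=True)

def fn(x):
    return torch.tensor(list(range(2, 40, 2)), device=x.device) + x

opt_fn = torch.compile(fn)
print(opt_fn(torch.randn(1)))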

For CPU

The main part of Inductor-generated code with the default Python wrapper will look like this:

def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (1, ), (1, ))
    buf0 = empty_strided((19, ), (1, ), device='cpu', dtype=torch.float32)
    cpp_fused_add_lift_fresh_0(c_void_p(constant0.data_ptr()), c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()))
    del arg0_1
    return (buf0, )

Turning on the C++ wrapper replaces the call function with a C++ function, inductor_entry_cpp, in the compiled C++ extension module:

std::vector<at::Tensor> inductor_entry_cpp(const std::vector<at::Tensor>& args) {
    at::Tensor arg0_1 = args[0];
    at::Tensor constant0 = args[1];
    auto buf0 = at::empty_strided({19L, }, {1L, }, at::device(at::kCPU).dtype(at::kFloat));
    cpp_fused_add_lift_fresh_0((long*)(constant0.data_ptr()), (float*)(arg0_1.data_ptr()), (float*)(buf0.data_ptr()));
    arg0_1.reset();
    return {buf0};
}

module = CppWrapperCodeCache.load(cpp_wrapper_src, 'inductor_entry_cpp', 'c2buojsvlqbywxe3itb43hldieh4jqulk72iswa2awalwev7hjn2', False)

def _wrap_func(f):
    def g(args):
        args_tensor = [arg if isinstance(arg, torch.Tensor) else torch.tensor(arg) for arg in args]
        constants_tensor = [constant0]
        args_tensor.extend(constants_tensor)

        return f(args_tensor)
    return g
call = _wrap_func(module.inductor_entry_cpp)

For GPU

Based on the same example code, the generated code for GPU will look like this:

def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (1, ), (1, ))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0) # no-op to ensure context
        buf0 = empty_strided((19, ), (1, ), device='cuda', dtype=torch.float32)
        # Source Nodes: [add, tensor], Original ATen: [aten.add, aten.lift_fresh]
        stream0 = get_cuda_stream(0)
        triton_poi_fused_add_lift_fresh_0.run(constant0, arg0_1, buf0, 19, grid=grid(19), stream=stream0)
        run_intermediate_hooks('add', buf0)
        del arg0_1
        return (buf0, )

With the C++ wrapper turned on, the following equivalent C++ code will be generated:

std::vector<at::Tensor> inductor_entry_cpp(const std::vector<at::Tensor>& args) {
    at::Tensor arg0_1 = args[0];
    at::Tensor constant0 = args[1];

    at::cuda::CUDAGuard device_guard(0);
    auto buf0 = at::empty_strided({19L, }, {1L, }, at::TensorOptions(c10::Device(at::kCUDA, 0)).dtype(at::kFloat));
    // Source Nodes: [add, tensor], Original ATen: [aten.add, aten.lift_fresh]
    if (triton_poi_fused_add_lift_fresh_0 == nullptr) {
        triton_poi_fused_add_lift_fresh_0 = loadKernel("/tmp/torchinductor_user/mm/cmm6xjgijjffxjku4akv55eyzibirvw6bti6uqmfnruujm5cvvmw.cubin", "triton_poi_fused_add_lift_fresh_0_0d1d2d3");
    }
    CUdeviceptr var_0 = reinterpret_cast<CUdeviceptr>(constant0.data_ptr());
    CUdeviceptr var_1 = reinterpret_cast<CUdeviceptr>(arg0_1.data_ptr());
    CUdeviceptr var_2 = reinterpret_cast<CUdeviceptr>(buf0.data_ptr());
    auto var_3 = 19;
    void* kernel_args_var_0[] = {&var_0, &var_1, &var_2, &var_3};
    cudaStream_t stream0 = at::cuda::getCurrentCUDAStream(0);
    launchKernel(triton_poi_fused_add_lift_fresh_0, 1, 1, 1, 1, 0, kernel_args_var_0, stream0);
    arg0_1.reset();
    return {buf0};
}

module = CppWrapperCodeCache.load(cpp_wrapper_src, 'inductor_entry_cpp', 'czbpeilh4qqmbyejdgsbpdfuk2ss5jigl2qjb7xs4gearrjvuwem', True)

def _wrap_func(f):
    def g(args):
        args_tensor = [arg if isinstance(arg, torch.Tensor) else torch.tensor(arg) for arg in args]
        constants_tensor = [constant0]
        args_tensor.extend(constants_tensor)

        return f(args_tensor)
    return g
call = _wrap_func(module.inductor_entry_cpp)
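
To quantify how much wrapper overhead the C++ wrapper removes for your own model, you can compare both modes with a small micro-benchmark. This is a sketch only; it assumes that torch._dynamo.reset() forces recompilation after the config change, and the speedup is most visible for small graphs where Python overhead dominates:

import torch
import torch._dynamo
import torch._inductor.config as config
import torch.utils.benchmark as benchmark

def fn(x):
    return torch.tensor(list(range(2, 40, 2)), device=x.device) + x

x = torch.randn(1)

def bench(cpp_wrapper):
    # Start from a clean compile cache so the wrapper setting takes effect.
    torch._dynamo.reset()
    config.cpp_wrapper = cpp_wrapper
    opt_fn = torch.compile(fn)
    opt_fn(x)  # warm up and trigger compilation
    timer = benchmark.Timer(stmt="opt_fn(x)", globals={"opt_fn": opt_fn, "x": x})
    return timer.blocked_autorange().median

print(f"Python wrapper: {bench(False) * 1e6:.1f} us per call")
print(f"C++ wrapper:    {bench(True) * 1e6:.1f} us per call")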

Conclusion

In this tutorial, we introduced a new C++ wrapper in TorchInductor that speeds up your models with just a two-line code change. We explained the motivation for this new feature and walked through the easy-to-use API that activates this experimental feature. Furthermore, we demonstrated the Inductor-generated code using the default Python wrapper and the new C++ wrapper on both CPU and GPU to visually showcase the difference between these two wrappers.

This feature is still in the prototype stage. If you have any feature requests or run into any issues, please file a bug report on GitHub Issues.
