Comfy Kitchen is a high-performance kernel library designed for diffusion model
inference. It provides optimized implementations of critical operations,
including various quantization formats and Rotary Positional Embeddings (RoPE).
The library features a flexible dispatch system that automatically selects the
most efficient compute backend (CUDA, Triton, or eager PyTorch) based on the
available hardware and the constraints of each call's inputs.
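The backend-selection idea can be sketched in a few lines of plain Python. Note that all names here (`Backend`, `select_backend`, the availability and constraint checks) are illustrative assumptions, not Comfy Kitchen's actual API; the point is the priority-ordered walk with a graceful fallback to eager PyTorch:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Backend:
    # Hypothetical names; not Comfy Kitchen's real dispatch types.
    name: str
    available: Callable[[], bool]      # e.g. "is a CUDA device present?"
    supports: Callable[[dict], bool]   # per-call constraints (dtype, shape, ...)

def select_backend(backends, op_args):
    """Return the first backend that is both installed and accepts these inputs."""
    for b in backends:  # ordered fastest-first: CUDA, then Triton, then eager
        if b.available() and b.supports(op_args):
            return b
    raise RuntimeError("no backend can handle this call")

# Simulated environment with no GPU: CUDA and Triton report unavailable,
# so dispatch falls through to the always-available eager backend.
backends = [
    Backend("cuda",   lambda: False, lambda a: a["dtype"] == "fp8"),
    Backend("triton", lambda: False, lambda a: True),
    Backend("eager",  lambda: True,  lambda a: True),
]

chosen = select_backend(backends, {"dtype": "fp16"})
```

A real implementation would also cache the availability probes, since checking for CUDA or a Triton installation on every call would be wasteful.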

Key features include:
* Optimized kernels specifically tuned for diffusion inference workloads.
* Support for multiple compute backends (CUDA C, Triton JIT, and pure PyTorch).
* Transparent quantization via a QuantizedTensor subclass that intercepts
  PyTorch operations.
* Support for advanced quantization formats including FP8, NVFP4, and MXFP8.
* Automatic backend selection and constraint validation for hardware-specific
  optimizations.
* Implementation of performance-critical functions like RoPE and scaled matrix
  multiplication.
