Performance, Debugging, and Testing
===================================

Performance model
-----------------

Grilly performance depends on:

1. shader availability for your code path
2. memory movement volume (host to device and back)
3. tensor shapes and batch sizing
4. operation fusion opportunities

Design choices
--------------

Performance and correctness tooling in Grilly favors explicitness:

1. Keep kernel boundaries visible so bottlenecks are measurable.
2. Preserve CPU fallback paths for differential testing and debugging.
3. Use strict docs/test builds (``-W`` and targeted suites) to catch
   regressions early in CI and local workflows.

Profiling strategy
------------------

Use a layered profiling approach:

1. Measure end-to-end step time.
2. Isolate hotspot operators.
3. Verify whether the code path runs on the GPU or falls back to the CPU.
4. Reduce unnecessary downloads and host-side conversions.

Debugging checklist
-------------------

1. Confirm Vulkan backend initialization.
2. Check that tensors have the expected dtype (``float32``) and shape.
3. Verify that the required shader exists in the loaded shader map.
4. Reproduce the issue with the smallest possible tensor sizes.
5. Add finite checks (``np.isfinite``) at major boundaries.

Testing workflow
----------------

Useful commands:

.. code-block:: bash

   pytest -q
   pytest tests/experimental -q
   pytest tests/test_integration_vulkan.py -q

For docs:

.. code-block:: bash

   uv run --with-requirements docs/requirements.txt sphinx-build -b html docs docs/_build/html -W

Reproducibility tips
--------------------

- Use stable hash utilities for deterministic seed derivation.
- Save checkpoint artifacts for long ingestion/training flows.
- Keep environment variables and driver versions tracked in experiment logs.
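
The first reproducibility tip, deriving seeds from a stable hash, can be
sketched with the standard library alone. Python's built-in ``hash()`` is
salted per process for strings, so it is not suitable for this purpose. The
``derive_seed`` helper below is hypothetical and is not part of Grilly's API;
it only illustrates the idea under the assumption that a 32-bit integer seed
is what the downstream random number generator expects.

.. code-block:: python

   import hashlib


   def derive_seed(*parts: str, bits: int = 32) -> int:
       """Derive a deterministic integer seed from string parts (hypothetical helper)."""
       # Join with the ASCII unit separator so that ("ab", "c") and
       # ("a", "bc") produce different keys.
       key = "\x1f".join(parts).encode("utf-8")
       # SHA-256 yields the same digest across processes, machines,
       # and Python versions, unlike the salted built-in hash().
       digest = hashlib.sha256(key).digest()
       # Keep only the leading bytes to obtain a seed of the requested width.
       return int.from_bytes(digest[: bits // 8], "big")


   # The same inputs always give the same seed; changing any part changes it.
   seed_a = derive_seed("experiment-01", "shard-3")
   seed_b = derive_seed("experiment-01", "shard-3")
   seed_c = derive_seed("experiment-01", "shard-4")

Logging the input parts next to the derived seed in the experiment log makes
it possible to recompute the seed later without storing it separately.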