The New Standard for Program Synthesis: How Tree-Diffusion Outperforms CSGNet and REPL Flow

25 Sept 2025

Abstract and 1. Introduction

  2. Background & Related Work

  3. Method

    3.1 Sampling Small Mutations

    3.2 Policy

    3.3 Value Network & Search

    3.4 Architecture

  4. Experiments

    4.1 Environments

    4.2 Baselines

    4.3 Ablations

  5. Conclusion, Acknowledgments and Disclosure of Funding, and References

Appendix

A. Mutation Algorithm

B. Context-Free Grammars

C. Sketch Simulation

D. Complexity Filtering

E. Tree Path Algorithm

F. Implementation Details

4.2 Baselines

We use two prior works, Ellis et al. [11] and CSGNet [28], as baseline methods.

CSGNet Sharma et al. [28] employed a convolutional and recurrent neural network to generate program statements from an input image. For a fair comparison, we re-implemented CSGNet using the same vision-language transformer architecture as our method, so that it represents the modern autoregressive approach to code generation. We use rejection sampling, repeatedly generating complete programs until one renders to the target image.
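For concreteness, here is a minimal sketch of that rejection-sampling loop. The `model.sample_program` and `render` helpers are hypothetical stand-ins for the autoregressive transformer and the task compiler, not the paper's actual interfaces.

```python
import numpy as np

def rejection_sample(model, target_image, max_attempts=10_000):
    """Sample complete programs until one renders to the target image.
    `model.sample_program` and `render` are assumed helpers."""
    for attempt in range(1, max_attempts + 1):
        program = model.sample_program(target_image)  # autoregressive decode
        if np.array_equal(render(program), target_image):
            return program, attempt  # solution and number of compilations used
    return None, max_attempts
```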

Figure 4: Performance of our approach in comparison to baseline methods in the CSG2D and TinySVG languages. We give the methods n = 256 images from the test set and measure the number of nodes expanded to find a solution. The autoregressive baseline was queried with rejection sampling. Our policy outperforms previous methods, and combining it with search boosts performance further. Error bars show standard deviation across 5 random seeds.

REPL Flow Ellis et al. [11] proposed a method that builds programs one primitive at a time until all primitives have been placed. They also give the policy network access to a REPL, i.e., the ability to execute code and see its output. Notably, the current image here is rendered from the current partial program, so we require a custom partial compiler. This is straightforward for CSG2D, since it is a compositional language: we simply render the shapes placed so far. For TinySVG, it is not immediately obvious how such a partial compiler should be written, because rendering happens bottom-up: primitives get arranged, and those arrangements get arranged again (see Figure 1). We therefore only use this baseline method with CSG2D. Due to its similarities with Generative Flow Networks [3], we refer to our modified method as “REPL Flow”.
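As an illustration, a toy partial compiler for a CSG2D-like language might look as follows. The token encoding and operator set are simplified assumptions for the sketch, not the paper's grammar.

```python
import numpy as np

def render_partial(tokens, size=64):
    """Render only the primitives placed so far, applying boolean ops
    left to right (a sketch; the real CSG2D grammar differs).
    `tokens` mixes operator strings and primitive tuples."""
    canvas = np.zeros((size, size), dtype=bool)
    ys, xs = np.mgrid[0:size, 0:size]
    op = "+"  # pending boolean operator: union or subtraction
    for tok in tokens:
        if tok in ("+", "-"):                    # operator token
            op = tok
        else:                                    # ("circle", cx, cy, r)
            _, cx, cy, r = tok
            mask = (xs - cx) ** 2 + (ys - cy) ** 2 <= r ** 2
            canvas = canvas | mask if op == "+" else canvas & ~mask
    return canvas
```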

Test tasks For TinySVG we used a held-out test set of randomly generated expressions and their images. For the CSG2D task, we noticed that all methods were at ceiling performance on an in-distribution held-out test set. In Ellis et al. [11], the authors created a harder test set with more objects. However, simply adding more objects in an environment like CSG2D resulted in simpler final scenes, since sampling a large object that subtracts a large part of the scene becomes more likely. Instead, to generate a hard test set, we filtered for images at or above the 95th percentile of incompressibility under the LZ4 [7, 37] compression algorithm.
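The text does not spell out the exact incompressibility score, but one natural reading is the ratio of LZ4-compressed size to raw size, keeping the least-compressible 5% of images. A sketch under that assumption:

```python
import lz4.frame
import numpy as np

def incompressibility(image):
    """Ratio of compressed to raw byte size; higher = harder to compress
    (one plausible score, assumed here)."""
    raw = np.asarray(image, dtype=np.uint8).tobytes()
    return len(lz4.frame.compress(raw)) / len(raw)

def hard_test_set(images, percentile=95):
    """Keep images at or above the given incompressibility percentile."""
    scores = np.array([incompressibility(im) for im in images])
    cutoff = np.percentile(scores, percentile)
    return [im for im, s in zip(images, scores) if s >= cutoff]
```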

All methods were trained with supervised learning and were not fine-tuned with reinforcement learning. All methods used the grammar-based constrained decoding method described in Section 3.4, which ensured syntactically correct outputs. At test time, we measured performance as the number of compilations a method needed to complete the task.
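A minimal sketch of the logit-masking step behind such constrained decoding is below; how `valid_token_ids` is derived from the grammar's parser state is assumed here, and Section 3.4 governs the actual implementation.

```python
import torch

def constrained_next_token(logits, valid_token_ids):
    """Mask logits so only grammar-valid continuations can be sampled.
    `valid_token_ids` would come from a CFG parser state (assumed)."""
    mask = torch.full_like(logits, float("-inf"))
    mask[valid_token_ids] = 0.0  # leave valid tokens' logits unchanged
    return torch.distributions.Categorical(logits=logits + mask).sample()
```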

Figure 4 shows the performance of our method compared to the baseline methods. In both the CSG2D and TinySVG environments, our tree diffusion policy rollouts significantly outperform the policies of previous methods. Our policy combined with beam search further improves performance, solving problems with fewer calls to the renderer than all other methods. Figure 6 shows successful qualitative examples of our system alongside outputs of baseline methods; notably, our system can fix smaller issues that other methods miss. Figure 7 shows examples of programs recovered from sketches in the CSG2D-Sketch language, demonstrating that the observation model need not be a deterministic rendering; it can also consist of stochastic hand-drawn images.
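To make the search variant concrete, here is a hedged sketch of value-guided beam search over tree edits; `policy.propose_edits`, `apply_edit`, `matches`, `render`, and `sample_random_expression` are illustrative names, not the paper's API.

```python
def beam_search(target_image, policy, value_fn, beam_width=8, max_depth=20):
    """Keep the beam_width candidate programs the value network scores
    as most promising after each round of proposed edits (a sketch)."""
    beam = [sample_random_expression()]  # assumed grammar sampler
    for _ in range(max_depth):
        candidates = []
        for expr in beam:
            for edit in policy.propose_edits(expr, target_image):
                new_expr = apply_edit(expr, edit)   # assumed tree edit
                if matches(render(new_expr), target_image):
                    return new_expr
                candidates.append(new_expr)
        candidates.sort(key=lambda e: value_fn(e, target_image), reverse=True)
        beam = candidates[:beam_width]
    return None
```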

Figure 5: Effects of changing several design decisions of our system. We train smaller models on the Rainbow environment and give each model n = 256 test problems to solve. In (a), for No Reverse Path, we train the model without computing an explicit reverse path, using only the last step of the noising process as targets. For No Current Image, we train a model that does not see the compiled output image of the program it is editing. For No Noising, instead of using our noising process, we generate two random expressions and use the path between them as targets. In (b) we further examine the effect of the training mixture between forward diffusion (ρ = 0.0) and pure random initialization (ρ = 1.0). Error bars show standard deviation across 5 random seeds.
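To make the mixture in (b) concrete, here is a hedged sketch of how the corrupted starting expression for one training example might be chosen; `sample_random_expression`, `mutate`, and `first_edit_on_path` are illustrative names, and the mutation count is an arbitrary choice.

```python
import random

def sample_training_start(target_expr, rho=0.2):
    """With probability rho, start from a fresh random expression
    (pure random initialization); otherwise corrupt the target with a
    few small mutations (forward diffusion). Helper names are assumed."""
    if random.random() < rho:
        start = sample_random_expression()
    else:
        start = mutate(target_expr, n_steps=random.randint(1, 5))
    # Supervision target: the first edit on the tree path back to target_expr.
    return start, first_edit_on_path(start, target_expr)
```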

Authors:

(1) Shreyas Kapur, University of California, Berkeley ([email protected]);

(2) Erik Jenner, University of California, Berkeley ([email protected]);

(3) Stuart Russell, University of California, Berkeley ([email protected]).


This paper is available on arxiv under CC BY-SA 4.0 DEED license.