@anthropic
I've been struggling with this for a while. The closest I've come is overlaying a numbered grid on the image and then having an LLM (OpenAI's o1-preview currently does this best) define the scale and perspective of the text relative to the image by describing the four corners of the text box in terms of the corresponding grid numbers.
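In case it helps anyone, here's a rough sketch of what I mean by the step after the LLM answers: turning the four grid corners into pixel coordinates and then into a perspective transform. Everything here is an assumption for illustration, not a fixed recipe: the grid is evenly spaced with lines numbered from 0 at the top-left, the LLM returns corners as (x, y) grid coordinates (fractions allowed) in the order top-left, top-right, bottom-right, bottom-left, and the helper names (`grid_to_pixels`, `homography`, `apply_h`) and the example numbers are made up.

```python
def grid_to_pixels(corner, image_size, grid_size):
    """Convert an (x, y) grid coordinate into pixel coordinates."""
    gx, gy = corner
    width, height = image_size
    cols, rows = grid_size
    return (gx * width / cols, gy * height / rows)


def solve_linear(a, b):
    """Solve a @ x = b with Gauss-Jordan elimination and partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("corners are degenerate (e.g. collinear)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography(src, dst):
    """Return the 3x3 matrix mapping 4 src points onto 4 dst points."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # Two linear equations per point pair, with H[2][2] fixed to 1.
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_h(h, point):
    """Map a point from flat text-box space into image pixel space."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return (
        (h[0][0] * x + h[0][1] * y + h[0][2]) / w,
        (h[1][0] * x + h[1][1] * y + h[1][2]) / w,
    )


# Hypothetical example: a 1000x800 image with a 10x8 grid overlay.
image_size = (1000, 800)
grid_size = (10, 8)

# Corners as the LLM might report them: TL, TR, BR, BL in grid units.
llm_corners = [(2, 1), (6, 1.5), (6, 3.5), (2, 3)]
quad = [grid_to_pixels(c, image_size, grid_size) for c in llm_corners]

# The flat text rendered at 400x100 before warping.
text_box = [(0, 0), (400, 0), (400, 100), (0, 100)]
H = homography(text_box, quad)

for corner in text_box:
    print(corner, "->", apply_h(H, corner))
```

The matrix `H` should be the same kind of 3x3 perspective matrix that image libraries expect for a warp, so in practice you would probably hand the two corner lists to something like OpenCV's `cv2.getPerspectiveTransform` rather than solving it by hand. The pure-Python solver is only there to keep the sketch dependency-free.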