I’m no expert on the matter, but I’m pretty sure correct text rendering was solved by feeding bitmaps of rasterized fonts to the image generation models as supplemental context.
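
To make "bitmaps of rasterized fonts" a bit more concrete, here's a toy sketch of what such a conditioning input could look like. Everything in it is an assumption for illustration: the 3x5 `FONT` table and the `rasterize` helper are made up, a real pipeline would presumably use a proper font rasterizer, and I don't know how any particular model actually consumes the result.

```python
# Toy 3x5 bitmap font. Glyph shapes are invented for illustration only.
FONT = {
    "H": ["101", "101", "111", "101", "101"],
    "I": ["111", "010", "010", "010", "111"],
    " ": ["000", "000", "000", "000", "000"],
}


def rasterize(text, font=FONT, spacing=1):
    """Return a 2D list of 0/1 pixels with the glyphs of `text` laid out left to right."""
    height = 5
    rows = [[] for _ in range(height)]
    for i, ch in enumerate(text):
        glyph = font[ch]  # raises KeyError for characters the toy font lacks
        for y in range(height):
            if i > 0:
                # blank column(s) between neighbouring glyphs
                rows[y].extend([0] * spacing)
            rows[y].extend(int(c) for c in glyph[y])
    return rows


if __name__ == "__main__":
    # Print the bitmap as ASCII art so the layout is visible.
    for row in rasterize("HI"):
        print("".join("#" if p else "." for p in row))
```

The idea, as I understand it, is that a pixel grid like this would be handed to the model alongside the prompt so it has the exact letter shapes to work from instead of having to guess the spelling.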