@calm
I don't think you'd get a speed improvement by using folding directly. Recursion is always going to be slower than having it all in a single circuit, since every recursive step pays some extra overhead on top of the actual computation. Technically, folding would let you split the work across multiple separate machines in parallel, but that seems like overkill.
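
To show the shape of that tradeoff, here's a rough toy cost model. Everything in it is an assumption for illustration: the function names, the constants, the linear cost per step, and the binary-tree merge across machines are all made up and not measurements of any real proving system.

```python
import math


def single_circuit_time(steps, step_cost):
    """Hypothetical cost of proving all steps in one circuit (assumed linear)."""
    return steps * step_cost


def folding_time(steps, step_cost, fold_overhead, machines=1, merge_cost=0.0):
    """Hypothetical cost of proving the same steps with folding.

    Each step pays an extra per-step overhead (assumed constant). With more
    than one machine, the steps are split evenly and the per-machine results
    are combined in a binary tree, paying merge_cost per round (an assumption).
    """
    steps_per_machine = math.ceil(steps / machines)
    local = steps_per_machine * (step_cost + fold_overhead)
    merge_rounds = math.ceil(math.log2(machines)) if machines > 1 else 0
    return local + merge_rounds * merge_cost


# Made-up numbers, chosen only to illustrate the comparison.
steps, step_cost, fold_overhead, merge_cost = 1000, 1.0, 0.3, 5.0

mono = single_circuit_time(steps, step_cost)
seq = folding_time(steps, step_cost, fold_overhead)
par = folding_time(steps, step_cost, fold_overhead, machines=8, merge_cost=merge_cost)

print(f"single circuit:      {mono:.1f}")
print(f"folding, 1 machine:  {seq:.1f}")
print(f"folding, 8 machines: {par:.1f}")
```

Under these assumptions, folding on one machine is strictly slower than the single circuit, and it only wins on wall-clock time once you throw several machines at it.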