You only need a phrase of twelve words from a 2048-word dictionary to have 128 bits of entropy. Twelve words is up to "Thy kingdom" in the Lord's Prayer, so people can certainly memorize twelve-word phrases, or even 24-word phrases, without too much trouble.
And English has a lot more than 2048 words, so you could probably use a shorter phrase and still be fine.
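For anyone who wants to check the arithmetic, here is a rough Python sketch. It assumes every word is drawn independently and uniformly at random, which is the only case where the bit count holds. The wordlist is a made-up placeholder (`word0000` and so on), not a real dictionary, and `entropy_bits` and `random_phrase` are names invented for this example.

```python
import math
import secrets

def entropy_bits(n_words: int, dict_size: int) -> float:
    """Entropy of n_words drawn uniformly and independently from dict_size words."""
    return n_words * math.log2(dict_size)

# Hypothetical stand-in for a real 2048-word list (assumption, not a real dictionary).
WORDLIST = [f"word{i:04d}" for i in range(2048)]

def random_phrase(n_words: int, wordlist=WORDLIST) -> str:
    """Pick each word with a cryptographically secure RNG, uniformly at random."""
    return " ".join(secrets.choice(wordlist) for _ in range(n_words))

print(entropy_bits(12, 2048))  # 12 words * 11 bits each = 132.0
print(random_phrase(12))
```

Twelve words from 2048 actually gives 132 bits, a little over the 128 quoted. A phrase a human chose, rather than one drawn this way, does not get the same number.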
The thing about the Lord's Prayer doesn't really follow. If you use a grammatically correct and semantically commonplace 12 word sequence like that, you surely don't have 128 bits of entropy. But the ease of memorization comes almost entirely from those attributes!
To get 128 bits of entropy with words, you need to pick about thirteen out of a million words--which is on the order of all the words in the English language--and give all of them equal probability. The sequence needs to be fully random as well. What you end up with will surely be easier to memorize than a UUID, but substantially more difficult than the start of the Lord's Prayer.
EDIT: Math is wrong, I was thinking 10 bits per word for a million-word list instead of 20. So it's 6-7 words out of a million (whole language) or 13 words out of a thousand (very limited subset of the language). Point about random selection still stands, but it's certainly easier than 13 very uncommon words. Still much harder than a realistic sentence of that length, though.
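The corrected numbers can be reproduced with a short sketch like the one below. It assumes uniform random selection from the list, and `words_needed` is just a name made up for this example.

```python
import math

def words_needed(target_bits: float, dict_size: int) -> int:
    """Smallest number of uniformly random words that reaches target_bits."""
    bits_per_word = math.log2(dict_size)  # ~19.93 for a million words, ~9.97 for a thousand
    return math.ceil(target_bits / bits_per_word)

for size in (1_000, 2_048, 1_000_000):
    print(size, words_needed(128, size))
```

Six words from a million-word list fall just short of 128 bits (about 119.6), which is why the answer lands between 6 and 7.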
Probably much higher than you suspect. Making password haikus is an obvious idea which has been suggested many times before.
I'm sure that even with a great statistical model of password haikus (say, an LLM), the odds of guessing yours would still be about one in a billion. That sounds unlikely, but a cracking cluster can try billions of guesses per second.
In these cases it's very easy to have security that depends on the odds that a powerful attacker just hasn't gotten around to seriously trying the broad class of predictable generation schemes you've used.
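To put rough numbers on that, here is a back-of-the-envelope sketch. The guess rate of 1e9 per second is an illustrative assumption standing in for "billions per second", not a measured figure, and the function names are made up for this example.

```python
import math

def guess_space_bits(one_in: float) -> float:
    """Convert 'one in N' odds into an equivalent number of bits."""
    return math.log2(one_in)

def seconds_to_exhaust(bits: float, guesses_per_second: float) -> float:
    """Worst-case time to try every candidate in a 2**bits space."""
    return 2.0 ** bits / guesses_per_second

RATE = 1e9  # assumed guesses per second (illustrative only)

print(guess_space_bits(1e9))          # one in a billion is just under 30 bits
print(seconds_to_exhaust(30, RATE))   # roughly one second of work
print(seconds_to_exhaust(128, RATE))  # astronomically long by comparison
```

One in a billion is only about 30 bits, so at that assumed rate the whole space is covered in around a second. A truly random 128-bit phrase is a different regime entirely.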