Agreed, it almost feels like we have a visual processing unit with special “opcodes” for operations like depth matching and pattern repetition.
The generator first needs a depth map, and then derives the repeating pattern from that.
A normal RGB image would be far too noisy; the fine texture variations would break the repetition needed for the brain to fuse the patterns correctly.
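That pipeline is easy to sketch. This is just a minimal illustration of the depth-map-to-pattern idea, not the actual generator; the function name and parameters are made up for the example. Each row starts with a random seed pattern, and every subsequent pixel is copied from one "period" back, where the period shrinks for nearer points, creating the horizontal disparity the brain fuses:

```python
import random

def stereogram_row(depth_row, pattern_width=60, max_shift=12):
    """Generate one row of a random-dot autostereogram (illustrative sketch).

    depth_row: floats in [0, 1], where 1 is nearest to the viewer.
    Nearer points repeat at a shorter period, which reads as
    larger horizontal disparity once the eyes fuse the pattern.
    """
    row = []
    for x, depth in enumerate(depth_row):
        # Period shrinks with depth -> nearer surfaces pop out.
        period = pattern_width - int(depth * max_shift)
        if x < period:
            row.append(random.randint(0, 255))  # seed the repeating pattern
        else:
            row.append(row[x - period])  # copy from one period back
    return row
```

The key point is visible here: the copied pixels are exact, so the repetition stays perfectly stable. Deriving the pattern from raw RGB instead of a clean depth map would break that exactness, which matches the noise problem described above.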
That makes sense. Using a depth map first sounds almost inevitable for keeping the repetition stable enough for the visual system to lock onto it.
What I always find interesting with these images is how sensitive the brain is to those horizontal disparities. Even tiny shifts create a surprisingly strong sense of structure once the eyes fuse the patterns. It really highlights how much of “seeing” depth is reconstruction rather than direct perception.
Do you generate the depth maps manually, or are they derived procedurally from some model or scene description?
Haha, fair question. No, just a human who tends to write in complete paragraphs.
I've been experimenting with the generator as a side project and got curious about how these stereograms actually work under the hood.
simedw ~ $ claude -p "random number between 1 and 10"
7
simedw ~ $ claude -p "random number between 1 and 10"
7
simedw ~ $ claude -p "random number between 1 and 10"
7
simedw ~ $ claude -p "random number between 1 and 10"
7
Still having some issues that match my previous comment, I'll try to follow your blog and give more feedback as you work on it.
One thing I'll note: the shorter phrases (2-4 characters long) were generally accurate at normal speed, but the longer sentences still have issues.
Maybe focusing on accuracy for those smaller phrases first and then scaling up would be a good way to go, since they're already returning better results.
Again, really think this is a great initiative, want to see how it grows. :)
It’s fairly sensitive to background noise at the moment. I’m planning to train an improved version with stronger data augmentation, including background noise.
For accents, I’ve mostly tested with a few friends so far. I’m wondering whether region should be a parameter, because training on all dialects might make the system too lax.
I had a quick look at Farsi datasets, and there seem to be a few options. That said, written Farsi doesn’t include short vowels… so can you derive pronunciation from the text using rules?