A hidden gem in sound symmetry (soundshader.github.io)
201 points by ssgh on Nov 9, 2020 | hide | past | favorite | 53 comments


Hi HN, author here. A few comments on how I came up with this idea. I've been trying to find a "proper" connection between audible sound and visible shape, a connection that would not only preserve all the information, but would also properly visualize the "symmetry" in sound, so that messy sound would turn into messy images and harmonic sound into visually appealing images. The latter part is hard, as perception of "musical harmony" is vaguely defined and subjective. Nevertheless, after quite a few attempts, I came across a particularly simple FFT-based technique that produces impressive and unexpected results. Below is a summary of my findings.

Music is a temporal ornament. There are many types of ornaments, e.g. the 17 types of wallpaper tessellations, but few of them look like music. However, there is one particular type of ornament that resembles music a lot - I mean those “mandala” images. I don’t know how those are produced, but I noticed a connection between those images and music:

- The 1st obvious observation is that a mandala is drawn in polar coordinates and is 2PI periodic. Sound is periodic too, so I thought the two facts are related.

- The 2nd observation is that patterns on those images evolve along the radial axis. And so does music: it's a sequence of evolving sound patterns.

- The 3rd observation is that a 2PI periodic function trivially corresponds to a set of frequencies. We usually use FFT to extract the frequencies and another FFT to restore the 2PI periodic function. Thus, a single radial slice of a mandala could encode a set of frequencies. If this is correct, a mandala is effectively an old school vinyl disk.

Putting these observations together, we naturally arrive at the idea of using ACF. More details are in the linked GitHub project.
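For readers who want to try this at home, here's a minimal sketch (my own naming, not the project's code) of computing one ACF frame via the Wiener-Khinchin theorem, i.e. the inverse FFT of the squared magnitude spectrum. A radial slice of the "mandala" would then be one such ACF frame drawn over 0..2PI:

```python
import numpy as np

def acf_frame(samples):
    """ACF of one audio frame via Wiener-Khinchin:
    ACF = inverse FFT of the power spectrum |FFT|^2."""
    spectrum = np.fft.fft(samples)
    power = np.abs(spectrum) ** 2   # squaring drops the phase
    acf = np.fft.ifft(power).real   # real-valued for real input
    return acf / acf[0]             # normalize so lag 0 == 1

# A pure tone with period 128 samples: its ACF has the same period.
t = np.arange(1024)
tone = np.sin(2 * np.pi * 8 * t / 1024)
acf = acf_frame(tone)
```

For a pure sine the normalized circular ACF is just cos of the same frequency, so it peaks at multiples of the period, which is exactly the kind of symmetry the polar rendering makes visible.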


In your example saying that the ear doesn’t work as an FFT, I think you’re confusing physical processes (the ear) with perceptual processes (psychoacoustics and interaction between the brain and ear). The cochlea itself is a physical FT, based on resonances in the thickness at various points. That gets passed to the brain as basically sine-frequency data. The autocorrelation part comes in when the brain processes this and actually reinforces it by feeding back signals it thinks are peaking back into the inner ear.

(Edit) also this is a really fascinating and enlightening way of viewing it - didn’t mean to imply otherwise with this quibble :)


IIRC, that's probably not quite true. It's been about a decade since I studied hearing, but my understanding is that the idea of the cochlea being a physical FT (e.g. the place theory of hearing) doesn't explain some phenomena, which also probably aren't psychoacoustic.

https://en.m.wikipedia.org/wiki/Temporal_theory_(hearing)


I'll read more about it, but the thickness/stiffness of the cochlea resonates at frequencies with sensitivity that matches our ability to distinguish frequencies - it has been removed from the inner ear and resonance tested outside of the context of other processes. Interesting that it doesn't explain some phenomena, but are the temporal theory and place theory mutually exclusive?

(Edit) After reading a bit more, it seems to make more sense that it's a combination of both, as the resonance on the cochlea is likely not 100% accurate, and conversely the impulses from peaks would tend to be around the areas of resonance, so rather than being mutually exclusive it makes sense these two effects work in parallel.

  Modern research suggests that the perception of pitch depends on both the places and patterns of neuron firings. Place theory may be dominant for higher frequencies.[4] However, it is also suggested that place theory may be dominant for low, resolved frequency harmonics, and that temporal theory may be dominant for high, unresolved frequency harmonics.[5]


> I've been trying to find a "proper" connection between audible sound and visible shape, a connection that would not only preserve all the information, but would also properly visualize the "symmetry" in sound, so that messy sound would turn into messy images and harmonic sound would turn into visually appealing images.

It is very exciting to come across others who are also interested in this topic. I am also very interested in the shape of sound but I have spent less time on empirical observations and more on imagining an abstract logic of numbers which can be visualized and heard. Real sound visualizations are also interesting to me but I decided to focus on abstract ideals because I thought it would be appropriate for a video game.

Hope you don't mind me sending some emails.


Your emails are welcome! As for a physical sound visualization, one exists already: see "Numerical simulation of Faraday waves" by L. Tuckerman. As usual, the dry technical paper glosses over the visual aspect of Faraday waves. Basically, if you create such a standing wave in a cup of tea (e.g. by shaking the table), you'd see its shape via reflections, similar to how we can see the surface of ocean waves via reflections of the sun. However, if you could suspend the water surface, put an LED ring above it (one of those used for professional photos) and take an extremely high quality photo to see all the reflections and refractions created by that LED ring, you'd see a picture of remarkable complexity. It looks a lot like a 3D hologram. I've been trying to simulate this effect on GPU.


Surprisingly, I stumbled across a result that seemed similar yet weirdly different: https://twitter.com/theshawwn/status/1176070857468329984?s=2...

I take the FFT of the phase component, which is very similar to ACF; it’s the FFT of an FFT, but preserves phase. It even takes abs(), which might be mostly equivalent to your squaring operation.

Weird. I am really not trying to claim that I discovered ACF — quite the opposite. My result was shockingly different from what you found, even though the operations are so close to identical.

I think phase is extremely important in visualization. You can see why here: https://twitter.com/theshawwn/status/1176070853819342848?s=2...

You’ve come up with one of the most gorgeous visualizations I’ve ever seen for signals in general!

One way to incorporate phase: turn the angle into an x,y coordinate using atan2, then shade red and blue based on x and y. E.g. x of 1.0 is “full red”, x of -1.0 is no red; ditto for y, but with blue.
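A small sketch of that coloring rule as I read it (hypothetical helper names, not the actual code): take x = cos(phase), y = sin(phase), then shade red from x and blue from y, with -1 mapping to no color and +1 to full:

```python
import numpy as np

def phase_to_rb(phase):
    """Map a phase array (radians) to red/blue channel intensities:
    x = cos(phase), y = sin(phase); +1 -> full channel, -1 -> none."""
    x, y = np.cos(phase), np.sin(phase)
    red = (x + 1) / 2    # x = 1.0 is "full red", x = -1.0 is no red
    blue = (y + 1) / 2   # ditto for y, but with blue
    return red, blue

r, b = phase_to_rb(np.array([0.0, np.pi / 2, np.pi]))
```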

The other trick I used was to un-interleave the lines. Basically I noticed that every other line has a strong correlation; therefore drop every even numbered line to remove the aliasing artifacts. Then suddenly you get nice and smooth phase interpolations.


Interesting. Figuring out the phase problem is one of my biggest TODO items.

How did you compute the FFT of the phase? The thing is, phase is a discontinuous or multivalued function if we represent it as a real number. We could also represent phase as a complex number of unit magnitude: exp(i phi). That would be continuous, but complex-valued.

And phase is indeed important for hearing:

https://auditoryneuroscience.com/vocalizations-speech/speech...

I didn't quite get the trick with uninterleaving the lines.


You're in luck -- I managed to dig up my WIP notes from a year ago.

https://imgur.com/xLcvLIm

As you can see, the raw phase waveform is very "wavy", as might be expected. It oscillates rapidly, making it hard to see the patterns. But if you go to the tweets I linked above, you'll see the phase is much smoother in those images. How did I do it?

The key is to focus on every other line. Notice that if you simply pay attention to every odd row, it will be smooth.

I think I simply did "row 0, row 2, row 4, ... row n" followed by "row 1, row 3, row 5, ... row n + 1"
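If I follow, the un-interleaving could look like this in numpy (a sketch, not the original code):

```python
import numpy as np

def deinterleave_rows(img):
    """Split an image into its even and odd rows, as described above:
    'row 0, row 2, row 4, ...' followed by 'row 1, row 3, row 5, ...'."""
    return img[0::2], img[1::2]

img = np.arange(16).reshape(8, 2)   # 8 rows, 2 columns
even, odd = deinterleave_rows(img)
```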

As for the fft of the fft trick for phase, I'm rsync'ing all of my old code and demo images to here:

https://battle.shawwn.com/sdb/voicecloning/

You may be interested in the png images, in particular the ones with "phase" in the names. You can probably ignore all the code except repl2.py.

Those images were generated via unknown methods -- sadly my repl sessions weren't saved. But, I happened to write down in repl2.py how the tweet images were generated:

  import os
  import cv2
  import numpy as np

  def load(name):
      # read the png and rescale pixel values from [0, 255] to [-1, 1]
      return -1 + 2 * (1 / 255) * cv2.imread(os.path.expanduser("~/Downloads/" + name))

  src = load("mel-phase-spectrogram-phase.png")
  fft_abs = np.abs(np.fft.fft2(src))
  cv2.imwrite(os.path.expanduser("~/Downloads/mel-phase-spectrogram-phase-fft-abs.png"), fft_abs)
  cv2.imwrite(os.path.expanduser("~/Downloads/mel-phase-spectrogram-phase-fft-abs2.png"), -1 + 2.0 * fft_abs)

So, input: https://battle.shawwn.com/sdb/voicecloning/mel-phase-spectro...

Then, using the code above, the result: https://battle.shawwn.com/sdb/voicecloning/mel-phase-spectro...

I've verified that it still works. I think you can wget those images and copy-paste that code into a python repl.

So the only remaining question is, how was mel-phase-spectrogram-phase.png generated? Unfortunately that seems to be lost to the sands of time. But, as a hint, I think it was simply a matter of turning the phase component into x,y using atan2, then turning it into blue and red.

Also, completely unrelated, but I once made a super high resolution mel spectrogram that looked way cool and I can't resist showing it off: https://battle.shawwn.com/sdb/voicecloning/ultra-mel.png

I did all this when making 'Dr Kleiner sings "I Am the Very Model of a Modern Major General"' around a year ago.

https://www.youtube.com/watch?v=koU3L7WBz_s&ab_channel=Shawn...

Kinda funny that all of this visualization work was just to make memes, but the quest to meme turns out to be surprisingly motivating.

https://battle.shawwn.com/sdb/voicecloning/demo_output_101.w...

https://battle.shawwn.com/sdb/voicecloning/demo_output_75.wa...

Anyway, I think there's a lot left to discover in terms of audio visualization! I would definitely encourage you to play around with the phase component. The results can be pretty striking, as you can see from the "Result" image above (https://battle.shawwn.com/sdb/voicecloning/mel-phase-spectro...).

Sorry for the scattered explanation -- it's 4am here, but I wanted to give you some kind of writeup, even if it's rather disjointed. If you have more questions, be sure to ask! I can give better details tomorrow.


Thanks for the notes! Another idea I've been thinking about is to capture the phase by splitting ACF into multiple complementary parts at the spectral density step, i.e. when we draw the spectrogram, we use |r exp(i phi)|^2 of the FFT output, and that |...|^2 drops the phi. However, we could split this |...|^2 into a sum of a few terms that would add up to the same spectral density, but would separately capture the phase. In the simplest case, |x + iy|^2 = x^2 + y^2 can be interpreted as two half-spectrograms: one for x^2 and another for y^2. If colored the same, we'd get the original spectrogram, but if colored differently, we'd know the values of x^2 and y^2, and that's enough to know the phase up to a quadrant.
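A toy numpy sketch of that split (illustrative, not the visualizer's code): the two terms sum back to the usual spectral density, and together they pin down the phase up to a quadrant:

```python
import numpy as np

signal = np.sin(2 * np.pi * 5 * np.arange(256) / 256 + 0.7)
z = np.fft.fft(signal)

# Two "half-spectrograms" that add up to the spectral density.
sx = z.real ** 2
sy = z.imag ** 2
density = np.abs(z) ** 2

# Knowing sx and sy gives the phase angle folded into one quadrant.
phi_folded = np.arctan2(np.sqrt(sy), np.sqrt(sx))
```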



Yeah, that's where my journey started. I've explored those quite a bit, even wrote a GLSL solver (wave-glsl.github.io/web) to visualize height maps of the u_tt = u_xx + u_yy equation (there is a shallow water equation solver there as well). The problems were:

(1) Making a stable and fast solver is very difficult. A simple solver for the canonical wave equation is fast, but unstable, so the solution has to be periodically adjusted to avoid NaNs. A stable solver, even for the simplest equation, would be 20x slower. For complex cases, we'd have to involve the Floquet theory, but that would bring an already slow solver to a halt.

(2) A wave differential equation can barely visualize a single frequency, let alone a simple mix of frequencies or music. The thing is, such wave equations and their boundary shapes have a few select "resonance frequencies" that produce semi-stable patterns. Take even a tiny step from a stable frequency, e.g. 6.1 Hz vs 6 Hz, and the solution turns into a mix of unstable patterns morphing one into another, which is cool, but not visually appealing. Mixing multiple frequencies together often produces an unstable mess, and even if a pattern forms, you're never sure if it's the pattern for that frequency or just a transient shape; and if it's transient, you can't know whether it's due to numeric errors or the nature of the equation.

(3) Limited resolution. The rule of thumb is that on a 1000x1000 px screen, the densest Chladni pattern would make 500 full wave repetitions, one pixel per positive and negative sides of the wave. This means we can render only the 500 different frequencies, with 500 Hz slowly turning into a mess due to rounding errors (2 pixels per wave period isn't really enough). Increasing the internal solver buffer to say 4096px brings fps down to 3-4.

However, despite all this, Chladni patterns are hiding something very remarkable that seems to be glossed over in technical papers. If you imagine that a Chladni pattern is a water or glass surface, with reflective and refractive properties, and look at the reflection of a simple symmetric object, e.g. a ring, you'd see something resembling a 3D hologram: all these inter-reflections produce a "virtual image" of remarkable complexity. This picture can be taken by a hi-res camera, but visualizing it with GLSL is again very difficult: the raytracer needs to be outrageously precise.


I checked the live demo with some music. I was expecting something different and maybe you agree that it would be a better visualization.

Right now, the visual experience is like watching movement through a high-speed tunnel. I was expecting the "mandala" you mentioned in the sense that the end result is the accumulated visualization of all waves.

The sound representation would not disappear off the edges. The first sound recorded would be stored as a narrow outer ring right next to the circle's limit. The next sound would be stored as another narrow ring just inside the first one, and so on, continuously accumulating sound after sound. The final result would be like a tree's cross-section: a single image representing the whole song, not just sequential snapshots of the sounds in it, as it is now.

Anyway, congrats for the project! It is awesome and inspiring!


If you're on desktop, try moving your mouse vertically. "Up" seems to zoom in, and "Down" zooms out. Fully zoomed-out, I think it's closer to what you expected to see.


That's it! Thanks for the tip


I'm guessing that triple correlation could also be rendered into pretty pictures, did you try that?


Heh, you're reading my mind. Try the URL below, but don't increase fps/n params, as that will eat all your GPU cycles very quickly:

soundshader.github.io/?s=acf3&n=512&fps=1&acf.decay=0

It effectively computes the bispectrum as B(p, q) = F(p) F(q) F*(p+q), where F* is the complex conjugate, and runs the inverse 2D FFT to restore the triple autocorrelation. The results are interesting, but not impressive, and very GPU intensive (N x N x log(N) per frame is slow). In any case, I strongly believe the bispectrum is hiding something interesting and I just haven't figured out how to see it.
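For the curious, a small CPU-side sketch of that construction (illustrative naming; the actual project runs this on the GPU):

```python
import numpy as np

def bispectrum(samples):
    """Bispectrum B(p, q) = F(p) F(q) conj(F(p+q)), with p+q wrapped mod N."""
    F = np.fft.fft(samples)
    n = len(F)
    p, q = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return F[p] * F[q] * np.conj(F[(p + q) % n])

# A signal with a harmonic pair (bins 3 and 6) has bispectral content;
# a lone sine wave would not.
t = np.arange(64)
x = np.sin(2 * np.pi * 3 * t / 64) + np.sin(2 * np.pi * 6 * t / 64)
B = bispectrum(x)
tacf = np.fft.ifft2(B)  # triple autocorrelation (up to scaling conventions)
```

For a real input the bispectrum has conjugate symmetry, so the inverse 2D FFT is real up to rounding noise.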


Cool! Maybe it's because it contains phase information, which is not that relevant to hearing. FWIW, I remember a long time ago using it for image processing, and the trick was to look at sections of it, i.e. TC(p1, p2) where one of p1 or p2 was fixed.


Probably not related at all, but this reminded me of the Fourier-Mellin transformation, of which a nice overview can be found here[1], used for image registration.

Images are Fourier transformed, and the result is transformed to log-polar coordinates. This turns rotation and scaling in the source image into translations in the resulting log-polar data.

Anyway, fun stuff, thanks for the share!

[1]: https://sthoduka.github.io/imreg_fmt/ (follow link to the pipeline description)
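A rough numpy sketch of the log-polar step (nearest-neighbor sampling, purely illustrative): resample the centered FFT magnitude onto a log-polar grid, so rotation of the input becomes a shift along the theta axis and scaling becomes a shift along the log-r axis:

```python
import numpy as np

def logpolar_spectrum(img, n_r=64, n_theta=64):
    """Resample the centered FFT magnitude of `img` onto a log-polar grid
    (the Fourier-Mellin idea: rotation/scale become translations)."""
    mag = np.abs(np.fft.fftshift(np.fft.fft2(img)))
    cy, cx = np.array(mag.shape) / 2
    r_max = min(cy, cx) - 1
    rho = np.exp(np.linspace(0, np.log(r_max), n_r))          # log-spaced radii
    theta = np.linspace(0, 2 * np.pi, n_theta, endpoint=False)
    ys = (cy + rho[:, None] * np.sin(theta)).astype(int)
    xs = (cx + rho[:, None] * np.cos(theta)).astype(int)
    return mag[ys, xs]  # nearest-neighbor sampling, good enough for a sketch

img = np.zeros((64, 64))
img[20:44, 30:34] = 1.0   # a simple rectangle as test input
lp = logpolar_spectrum(img)
```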


I like the vaguely mathematical connotation of the word "ornament" here, I have never encountered that as a technical term before. It makes me realize that ornament is a good description for many of my favorite mathematical concepts/structures.

Your comment reminds me of the search for the mandelbulb fractal. It seems a fitting comparison, a bulb being a sort of ornament.

Anyway, interesting work.


An ornament is something that's almost invariant under some transforms. For example, these ACF images are almost invariant under some rotations. Sound waves are almost invariant under temporal shifts. I'd argue that what makes an ornament look good is the ease of recognizing those transforms, and what makes sound sound good is the ease of recognizing those temporal shifts.


This is gorgeous! I'm going to have to spend more time with this. Very interesting approach


Are you familiar with the "Circle of fifths"? That's the first thing that came to mind when I saw your circle patterns, and it directly relates to "harmony".


ACF composes all harmonics together and that's what likely makes the images visually appealing. However ACF doesn't give special treatment to harmonics that are exactly N octaves apart, e.g. A4 and A7 notes.


Would ACF be better if it did give special treatment to those harmonics? I understand they are an arbitrary distinction, but humans do seem to like them.


It would. The 12 notes are usually mapped to 12 colors, and ideally the sound image would reflect that. One "brute force" way to do that is to split the spectrum into 12 parts, draw 12 ACF images and then mix them. A less brute force approach is to tweak ACF to recognize that F-2F-4F-etc frequencies are specially related, even more specially than just F-2F-3F-4F-etc.
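One simple way to fold octave-related frequencies together is to map frequency to a pitch class, so F, 2F, 4F, ... all land on the same class and can share a color. A sketch (hypothetical helper, not the project's code):

```python
import numpy as np

def pitch_class(freq_hz, ref=440.0):
    """Fold a frequency onto one of 12 pitch classes (0 = A).
    Octave-related frequencies F, 2F, 4F, ... map to the same class."""
    semitones = 12 * np.log2(freq_hz / ref)
    return int(round(semitones)) % 12

a4 = pitch_class(440.0)    # A4
a7 = pitch_class(3520.0)   # A7, three octaves up: same class
e5 = pitch_class(660.0)    # a perfect fifth above A4
```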


I'm very interested in this, but I can't seem to get the demo page working in latest Firefox:

> AudioContext.createMediaStreamSource: Connecting AudioNodes from AudioContexts with different sample-rate is currently not supported.

Edge worked though. Didn't try Chrome.

Edit: This is beautiful!

I have checked out a variety of songs, and I feel the visualization is rather dominated by whatever frequency is loudest. (E.g. bass sound -> three- to five-fold symmetry and most detail obscured by it).

Have you considered applying something like the https://en.wikipedia.org/wiki/Equal-loudness_contour somewhere in the process to more evenly weight the frequency contributions according to human hearing perception? Not sure if it would have the intended effect, but I'd be curious what happens.


That's indeed the number 1 problem with the visualizer. ACF waves can greatly vary in magnitude and if there is a loud bass wave that's 1000x bigger than small wavelets from background music, those wavelets will be present, but barely visible. That's also why classical or otherwise "peaceful" music looks so good: all waves have about the same height.

I've in fact tried implementing the equal loudness contour - try adding ?acf.aweight=1 to the URL. However the result is mediocre. I've also tried applying a few bandpass filters for low, mid and high frequency ranges, rendering them separately with different colors and then mixing the images together. The result is, again, mediocre. I've been entertaining the idea that ACF waves ought to be rendered like ocean waves: via light reflections.
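For reference, a sketch of the standard A-weighting curve (the analog definition from IEC 61672), which is presumably what ?acf.aweight=1 approximates; this is my own transcription, not the project's code:

```python
import numpy as np

def a_weight_db(f):
    """A-weighting gain in dB (IEC 61672 analog definition),
    normalized so the gain at 1 kHz is ~0 dB."""
    f = np.asarray(f, dtype=float)
    f2 = f ** 2
    ra = (12194.0**2 * f2**2) / (
        (f2 + 20.6**2)
        * np.sqrt((f2 + 107.7**2) * (f2 + 737.9**2))
        * (f2 + 12194.0**2)
    )
    return 20 * np.log10(ra) + 2.0

gain_1k = a_weight_db(1000.0)   # ~0 dB by construction
gain_100 = a_weight_db(100.0)   # bass is strongly attenuated (~ -19 dB)
```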


It works on Firefox, but as the error says, you have to set the correct sample rate that matches the mp3 file. You can do that with ?sr=44.1 or ?sr=48. The default value is 44.1 and on Chrome you can use any sample rate.


Autocorrelation has been used in psychoacoustics at least since Licklider's work in the 1950s. But I'm not sure if I've seen this style of visualization before. It looks a bit like the output of a strobe tuner.


I've been playing some music on the live demo. I like it a lot. The harmony changes are easy to see as concentric slices slightly rotated. On heavy drum music the rhythm also leaves a trail of spaced circles that's easy to recognize. You might be onto something here. Colored waveforms, like those used in DJing software, might be better for locating a point in a song, but I've never been able to locate a chord change on those based on the waveform representation alone. I would totally love to try to "mix" some of those mandalas, or try an editor using this visualization where I can move around a virtual needle, copy-paste slices, etc. Not saying it will be better than using a waveform, but I would experiment with it given the chance.

Very nice job, congratulations!


Seems to work really well! Playing Gasoline by Audioslave, it really misses the high intensity peaks.

It interprets the trumpet in Miles Davis - It Never Entered my Mind as dark ripples, and it is beautiful in its own way.

It's a long way from the last time I used a visualizer on Winamp, nice!


One problem with the current color scheme is what I'd call the dynamic range of volumes. If the input is a mix of very loud and very quiet waves, as is often the case in club music, its ACF will look like a stormy ocean: huge slow waves with a lot of ripples on them. ACF doesn't lose this information, but visualizing it is difficult.

Edit. The generated images in fact contain the small ripples, but our eyes don't notice the 0.1% modulation of color. If they did, we'd see bright orange-blue waves with a fine pattern of wavelets colored with a slightly different shade of orange and blue. People who can see 100 shades of orange would see this pattern.


Is this somehow related or similar to oscilloscope music? https://youtu.be/qnL40CbuodU


Interesting maths. I'm enjoying the graphics; they are beautiful to watch unfold, but they bear little relation to the full range heard in the audio examples, neither in volume nor in varying pitch. This technique shows one aspect of the sound via graphics, but the two feel unrelated in any immediately meaningful way. The word 'periodic' is mentioned, but that isn't seen.

Still, great experiment, and interesting results!


I wish I could have seen an image of noise for comparison.


Would love to see this applied to financial price signals.


Very nice. Curious how colors are chosen?

Can’t try it on my dated iPhone - you need to vendor prefix the AudioContext with webkitAudioContext if AudioContext is undefined.


`z = acf(x, y) / (3 sigma); color_rgb = z < 0 ? z * vec3(4, 2, 1) : z * vec3(1, 2, 4);`

That's really it. The (4, 2, 1) is the oversaturated orange color, i.e. it would progress as (1, 0.5, 0.25) -> (1, 1, 0.5) -> (1, 1, 1). None of the tricky HSL/HSV schemes worked better than this.


Somehow I have memories of the 1990s and audio visualisations coming back... Fantastic. I'd love to be able to plug this into mpv for playing audio files. I wonder how hard this would be (or how many resources it would use...).

I really enjoyed your work: the write-up was clear and the demo page worked well.


For visualization, you'd want to take at least 20 sound samples per second, each sample about 0.1 sec long, so they overlap a bit to capture freq/amp modulations. With the typical 44,100 Hz sample rate, 0.1 sec would be an array of 4410 float32s. Round that down to 4096. Then you run FFT twice over this input: that's N log2(N), N=4096. So you have about K N log2(N) ops/sec; with K=20, N=4096 that's about 1 million flops. Updating pixels of the 1000x1000 image would be another 1 Mflops. Brightness adjustment to the 3 sigma range is another pass and another 1 Mflops. I know for a fact that this thing can draw 1024x1024 images at 60 fps in pure JavaScript, but if you don't like the noise of the CPU fan, you'll have to write it in GLSL.
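Spelling out that back-of-envelope estimate:

```python
import math

K = 20                       # analysis frames per second
N = 4096                     # samples per frame (~0.1 s at 44.1 kHz, rounded down)
fft_ops = N * math.log2(N)   # ~N log2(N) ops for one FFT pass; log2(4096) = 12
per_sec = K * fft_ops        # ops/sec for one FFT pass over every frame
# per_sec comes out to 983,040, i.e. roughly 1 Mflop/s per pass as estimated
```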


Images reminded me of cymatic patterns. I wonder if there is any relation there


Both share the same idea, but math is certainly different. Those cymatics patterns are solutions of the u_tt = u_xx + u_yy equation with t modulated by a sine wave and the images themselves are reflections of light in the u(x, y) surface.


Great work! I'm also interested to see these patterns "unrolled": not on a circle but on a timeline (like a spectrogram), where the x-axis is time and the y-axis is the unrolled pattern (0..2pi).


You can press "c" to switch to flat coordinates. Edit: or add ?acf.polar=0 to the URL.


It would be nice if the demo had more than one reference sound.


Oh man, I put "Superheroes" by "Daft Punk" on this and was just hypnotized by the visualization.


Beautiful. What's the demo soundtrack?


It's a file from freesound that captures the male "A" vowel, but slowed down like 20x.


Why the slowdown? (given that slowdown typically introduces artifacting, making the result as much a pattern of those artifacts as it is of the original signal)

(also note that your site does not currently work in Firefox, which would be nice to fix)


It's just a random sound sample I found on freesound.

The demo works on Firefox 78, Ubuntu. However you'd have to set correct sample rate with ?sr=44.1 to match the mp3's sample rate: Firefox won't do resampling.


Fun fact, I ran into the resampling problem myself back in April when I posted https://github.com/WebAudio/web-audio-api/issues/30. V2 of the spec will allow this natively, but it's also seemingly not making any headway towards an initial release, so who knows when it'll land...

What you can do though is look at the first few bytes of an .mp3 file (since it's a file drop/file load) to just directly read the sample rate from the MP3 block header[1], where you directly check the value encoded by [data[19], data[20]]: if it's [0,0] that means it's 44100, [0,1] means it's 48000, [1,0] means it's 32000 and that's it. There are no other sample rates allowed for MP3.

[1] http://mpgedit.org/mpgedit/mpeg_format/MP3Format.html (full block format)
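A hedged sketch of that check, based on the MPEG-1 frame header layout (the 2-bit sample-rate index sits in byte 2 of the 4-byte header); this scans for the frame sync rather than assuming a fixed offset, since files often start with an ID3 tag:

```python
def mp3_sample_rate(data: bytes):
    """Scan for an MPEG frame sync (0xFF followed by a byte with its top
    3 bits set) and read the sample-rate index from bits 2-3 of byte 2
    of the header. A sketch: assumes MPEG-1 and that the first sync
    found belongs to a real audio frame."""
    rates = {0: 44100, 1: 48000, 2: 32000}  # the only MPEG-1 rates
    for i in range(len(data) - 3):
        if data[i] == 0xFF and (data[i + 1] & 0xE0) == 0xE0:
            return rates.get((data[i + 2] >> 2) & 0x03)
    return None

# A synthetic MPEG-1 Layer III header with rate index 0 -> 44100 Hz.
rate = mp3_sample_rate(bytes([0xFF, 0xFB, 0x90, 0x00]))
```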


Thanks for the note! I've been wanting to auto-detect the sample rate.




