Both formats are DCT-based (except for lossless JPEG XL). JPEG 2000's use of the DWT was unusual; in general, still-image lossy compression research has spent the last 35 years iteratively improving on JPEG's design. This is partly for compatibility reasons, but it's also because the original design was very good.
Since JPEG, improvements have included better lossless compression (entropy coding) of the DCT coefficients; deblocking filters, which blur the image across block boundaries; predicting the contents of DCT blocks from their neighbours, especially prediction of sharp edges; variable DCT block sizes, rather than a fixed 8x8 grid; the ability to compress some DCT blocks more aggressively than others within the same image; encoding colour channels together, rather than splitting them into three completely separate images; and the option to synthesise fake noise in the decoder, since real noise can't be compressed.
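All of those refinements sit on top of the same transform-and-quantise core. As a toy illustration (a naive orthonormal 2D DCT-II in NumPy, with a made-up quantisation matrix, not the JPEG standard table):

```python
import numpy as np

def dct2(block):
    # Naive 2D DCT-II with orthonormal scaling, the transform JPEG
    # applies to each 8x8 block of pixels.
    n = block.shape[0]
    j = np.arange(n)
    m = np.cos(np.pi * (2 * j[None, :] + 1) * j[:, None] / (2 * n))
    m *= np.sqrt(2.0 / n)
    m[0] /= np.sqrt(2.0)
    return m @ block @ m.T

# Made-up quantisation matrix (NOT the standard JPEG table): coarser steps
# at higher spatial frequencies, where the eye is least sensitive.
u, v = np.meshgrid(np.arange(8), np.arange(8), indexing="ij")
Q = 8.0 + 4.0 * (u + v)

# A smooth gradient block: after quantisation almost every coefficient
# is zero, which is what makes the entropy-coding stage so effective.
block = np.outer(np.linspace(100.0, 140.0, 8), np.ones(8))
quantised = np.round(dct2(block) / Q)
print(np.count_nonzero(quantised))  # most of the 64 coefficients are zero
```

The later improvements listed above change the block size, the entropy coder, and the starting point for each block, but the quantised-transform skeleton is the same.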
You might be interested in this paper: https://arxiv.org/pdf/2506.05987. It's a very approachable summary of JPEG XL, which is roughly the state of the art in still-image compression.
Thanks, the paper is fascinating. I've only skimmed it so far, but it's full of interesting details, even beyond compression. They really tried hard to make the USB of image formats by supporting as many features and use cases as possible, even things like multiple layers and non-destructive cropping. I like the section where they discuss previous image formats, why many of them failed, and how they tried to learn from past mistakes.
Regarding algorithms: searching for "learned image compression" turns up a lot of research papers that use neural networks rather than analytic transforms like the DCT. The compression ratios already seem to outperform conventional codecs. I guess the bottleneck is slow decoding speed rather than compression ratio; at least that's the issue with neural video compression.
As I understand it, very small neural networks have already been incorporated into both VVC and AV2 for intra prediction. You're correct that this strategy is limited by decoding performance, especially when predicting large blocks.
In general, I'm pessimistic about prediction-and-residuals strategies for lossy compression. They tend to amplify noise; they create data dependencies, which interfere with parallel decoding; they require non-local optimisation in the encoder; really good prediction involves expensive analysis of a large number of decoded pixels; and it all feels theoretically unsound (because predictors usually produce just one value, rather than a probability distribution).
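The data-dependency point is easy to see even in a toy 1D example (predict each sample from its left neighbour, transmit residuals):

```python
import numpy as np

def encode(samples):
    # Predict each sample from its left neighbour; transmit the residual.
    prediction = np.concatenate(([0], samples[:-1]))
    return samples - prediction

def decode(residuals):
    # Decoding is inherently sequential: sample i needs decoded sample i-1.
    # (np.cumsum hides the loop, but the serial dependency is still there.)
    return np.cumsum(residuals)
```

Note also that a single corrupted residual shifts every later sample, which is the error/noise-propagation half of the complaint.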
I'm more optimistic about lossy image codecs based on explicitly-coded summary statistics, with very little prediction. That approach worked well for lossy JPEG XL.
Everything after JPEG is still fundamentally the same, but individual parts of the algorithm are supercharged.
JPEG has 8x8 blocks, modern codecs have variable-sized blocks from 4x4 to 128x128.
JPEG has RLE+Huffman, modern codecs have context-adaptive variations of arithmetic coding.
JPEG has a single quality scale for the whole image, modern codecs allow quality to be tweaked in different areas of the image.
JPEG applies block coefficients on top of a single flat colour per block (the DC coefficient); modern codecs instead start each block from a "prediction" made by smearing the previous couple of blocks into it, and code only the residual.
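That "smearing" can be sketched in a few lines. This is a hedged toy version, loosely modelled on the simplest H.264/AV1-style intra modes (the real codecs have many more directional modes and filtering):

```python
import numpy as np

def vertical_mode(above, size=8):
    # "Vertical" intra mode: smear the reconstructed row of pixels
    # above the block straight down through it.
    return np.tile(above, (size, 1))

def dc_mode(above, left, size=8):
    # "DC" intra mode: fill the whole block with the average of the
    # reconstructed pixels above and to the left.
    return np.full((size, size), (np.mean(above) + np.mean(left)) / 2.0)
```

The encoder picks whichever mode leaves the smallest residual, then transform-codes only the difference between the real block and the prediction.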