Hacker News | ajfriend's comments

Yes! A Gosper Island in H3 is just the outline of all the descendants of a cell at some resolution. The H3 cells at that resolution tile the sphere, and the Gosper Islands are just non-overlapping subsets of those cells, which means they tile the sphere.


Very cool! And the prefix queries you mention are what I was trying to get at in another comment, but you explained it better :)


I agree that the lack of congruency in H3 hexagons can cause weird overlaps and gaps if you plot mixed resolutions naively, but there are some workarounds that work pretty well in practice. For example, if you have mixed resolutions from compacted H3 cells but a single “logical” target resolution underneath, you can plot the coarser cells not with their native geometry, but using the outline of their children. When you do that, there are no gaps. (Totally unrelated but fun: that shape is a fractal sometimes called a "flowsnake" or a "Gosper Island" (https://en.wikipedia.org/wiki/Gosper_curve), which predates H3 by decades.)

That said, this feels like an issue with rendering geometry rather than with the index itself. I’m curious to hear more about why you think the lack of congruency affects H3’s performance for spatial joins. Under the hood, it’s still a parent–child hierarchy very similar to S2’s — H3 children are topological rather than geometric children (even though they still mostly overlap).


In a highly optimized system, spatial joins are often bandwidth bound.

Congruency allows for much more efficient join schedules and maximizes selectivity. This minimizes data motion, which is particularly important as data becomes large. Congruent shards also tend to be more computationally efficient generally, which does add up.

The other important aspect not raised here is that congruent DGGS have much more scalable performance when used to build online indexes during ingestion. This follows from them being much more concurrency-friendly.


I appreciate the reply! So, I might be wrong here, but I think we may be talking about two different layers. I’m also not very familiar with the literature, so I’d be interested if you could point me to relevant work or explain where my understanding is off.

To me, the big selling point of H3 is that once you’re "in the H3 system", many operations don’t need to worry about geometry at all. Everything is discrete. H3 cells are nodes in a tree with prefixes that can be exploited, and geometry or congruency never really enter the picture at this layer.
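To make that concrete, here's a toy sketch of the discrete layer. This is my own illustration using base-7 digit strings, not H3's actual packed 64-bit index format; the point is just that ancestry becomes a prefix test, with no geometry involved:

```python
# Toy model of a hierarchical cell index: each child appends one base-7
# digit to its parent's ID, so an ancestor is simply a string prefix.
# (Illustrative only -- real H3 indexes are packed 64-bit integers.)

def parent(cell: str, res: int) -> str:
    """Truncate a cell ID to its ancestor at a coarser resolution."""
    return cell[: res + 1]  # digit 0 plays the role of the base cell

def is_descendant(cell: str, ancestor: str) -> bool:
    """Ancestry check is a pure prefix test -- no geometry involved."""
    return cell.startswith(ancestor)

cells = ["40123", "40126", "41200"]
assert parent("40123", 1) == "40"
assert is_descendant("40126", "401")
assert not is_descendant("41200", "401")

# A "prefix query" for all descendants of '401' is just a filter,
# or a range lookup if the IDs are kept sorted.
print([c for c in cells if is_descendant(c, "401")])  # ['40123', '40126']
```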

Where geometry and congruency do come in is when you translate continuous data (points, polygons, and so on) into H3. In that scenario, I can totally see congruency being a useful property for speed, and that H3 is probably slower than systems that are optimized for that conversion step.

However, in most applications I’ve seen, the continuous-to-H3 conversion happens upstream, or at least isn’t the bottleneck. The primary task is usually operating on already "hexagonified" data, such as joins or other set operations on discrete cell IDs.

Am I understanding the bottleneck correctly?


A DGGS is a specialized spatial unit system that, depending on the design, allows you to elide expensive computation for a select set of operations with the tradeoff that other operations become much more expensive.

H3 is optimized for equal-area point aggregates. Congruency does not matter for these aggregates because there is only a single resolution. To your point, in H3 these are implemented as simple scalar counting aggregates -- little computational geometry required. Optimized implementations can generate these aggregates more or less at the speed of memory bandwidth. Ideal for building heat maps!

H3 works reasonably for sharding spatial joins if all of the cells are at the same resolution and are therefore disjoint. The number of records per cell can be highly variable, so this is still suboptimal; adjusting the cell size to get a better distribution just moves the suboptimality around. There is also added complexity if polygon data is involved.
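A minimal sketch of what same-resolution sharding buys you: once both sides are keyed by disjoint cell IDs, the join degenerates into a plain hash join on those IDs. The cell IDs and records below are made up for illustration:

```python
# Sketch of a cell-sharded spatial join: records on both sides are keyed
# by same-resolution (disjoint) cell IDs, so matching reduces to a hash
# join on the IDs. Illustrative data only.
from collections import defaultdict

left = [("a1", "cell_07"), ("a2", "cell_07"), ("a3", "cell_12")]
right = [("b1", "cell_07"), ("b2", "cell_99")]

buckets = defaultdict(list)
for rec, cell in left:
    buckets[cell].append(rec)

# Only cells present on both sides do any work. Note that record counts
# per cell can be badly skewed, which is the distribution problem above.
matches = [(l, r) for r, cell in right for l in buckets.get(cell, [])]
print(matches)  # [('a1', 'b1'), ('a2', 'b1')]
```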

The singular importance of congruence as a property is that it enables efficient and uniform sharding of spatial data for distributed indexes, regardless of data distribution or geometry size. The practical benefits follow from efficient and scalable computation over data stored in cells of different size, especially for non-point geometry.

Some DGGS optimized for equal-area point aggregates are congruent, such as HEALPix[0]. However, that congruency comes at high computational cost and unreasonably difficult technical implementation. Not recommended for geospatial use cases.

Congruence has an important challenge that most overlook: geometric relationships on a 2-spheroid can only be approximated on a discrete computer. If you are not careful, quantization during computation can effectively create tiny gaps between cells or tiny slivers of overlap. I've seen bugs in the wild from rare points landing in one of these slivers. Mitigating this can be costly.
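Here's a 1-D toy version of that quantization problem (my own illustration, not any particular DGGS implementation): bin edges accumulated in floating point drift away from the exact multiples, so a point exactly on a boundary can fall into a sliver that belongs to no cell.

```python
# 1-D analogue of quantization slivers: uniform bins of width 0.1,
# with edges computed by floating-point accumulation.
import math

w = 0.1
edges = []
acc = 0.0
for _ in range(10):
    acc += w
    edges.append(acc)

p = 1.0  # a point exactly on the last boundary

# Method 1: direct division assigns the point to bin 10.
print(math.floor(p / w))  # 10

# Method 2: scanning accumulated edges finds no bin at all, because the
# last edge fell short of 1.0 -- the point landed in a gap.
print(edges[-1] < p)  # True
print(edges[-1])      # 0.9999999999999999
```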

This is how we end up with DGGS that embed the 2-spheroid in a synthetic Euclidean 3-space. Quantization issues on the 2-spheroid become trivial in 3-space. People tend to hate two things about these DGGS designs though, neither of which is a technical critique.

First, these are not equal-area designs like H3; cell size does not indicate anything about the area on the 2-sphere. Since they are efficiently congruent, the resolution can be locally scaled as needed, so there are no technical ramifications. It just isn't intuitive like tiling a map or globe.

Second, if you do project the cell boundaries onto the 2-sphere and then project that geometry into something like Web Mercator for visualization, it looks like some kind of insane psychedelic hallucination. These cells are designed for analytic processing, not visualization; the data itself is usually WGS84 and can be displayed in exactly the same way you would if you were using PostGIS. The DGGS just doesn't act as a trivial built-in visualization framework.

Taking data stored in a 3-space embedding and converting it to H3-ified data or aggregates on demand is simple, efficient, and highly scalable. I often do things this way even when the data will only ever be visualized in H3 because it scales better.

[0] https://en.wikipedia.org/wiki/HEALPix


I'd like to hear more about the synthetic part of these 3-spaces, because S2 works exactly as you say, embedding the 2-sphere in three (Cartesian) dimensions. S2 points are always three-dimensional.


S2 projects a 2-sphere onto a topologically homeomorphic 2-surface designed to play nicely with discrete representations. It is not a 3-space any more than a 2-sphere is. Most software uses three coordinates organized by a 2-surface DGGS.

DGGS that use 3-space embeddings are topologically 3-dimensional, i.e. purely volumetric. They do not interpret the Earth as a 2-surface. In addition to polar coordinates, you must provide a volumetric model of the Earth to compute the DGGS cell. The shard distributions look very different between a 2-surface and a 3-space DGGS. The latter has significantly better properties for large analytical data models but requires more sophisticated storage architectures.

The synthetic 3-space is optimized for two things. First, you want a maximally efficient mapping function from typical WGS84 geometry into it. Tidy math, basically. Since it is purely internal, the user will never see it, and it doesn't map to anything real, you have latitude to design it to satisfy software engineering objectives as long as it works. Second, 3-space references are naturally less compact than 2-surface references at the same resolution, even though you'll end up with roughly the same number of shards. A lot of effort goes into schemes to compress out the sparseness so that the storage requirements are similar to a 2-surface DGGS, e.g. how often do you need to represent geometry 1000 km below the Earth's surface?

These DGGS also have the low-key advantage that they natively represent and understand 3-space, not just surface geometry, if you move beyond making flat maps.


I'd be curious to hear more about how you do the 2 -> 3 embedding there. S2 uses Cartesian 3-space, but points are constrained to be unit magnitude. This has advantages and disadvantages, obviously.


h3o-zip is really impressive! I've been wanting to play around with it more, and I've been meaning to ask you if you have any good references for that encoding approach. I understand how it works in h3o-zip, but I'd be interested to know more about where else that approach has been used.


I’m pretty sure the approach isn’t that novel, but I really rediscovered it on my own while exploring several compression approaches (generic compression with a tailored dictionary like zstd, integer packing/compression, compressed bitmaps, …; I probably have my notes about these somewhere).

As such, I don’t have any names/papers to give you, nor can I point you to similar applications. But I would also be interested ^^

But don’t hesitate to reach out if you work on something similar and want to discuss it.


I have a project that's still very much at the experimental stage, where I try to get something similar to this pipe syntax by allowing users to chain "SQL snippets" together. That is, you can use standalone statements like `where col1 > 10` because the `select * from ...` is implied. https://ajfriend.github.io/duckboat/

    import duckboat as uck

    csv = 'https://raw.githubusercontent.com/allisonhorst/palmerpenguins/main/inst/extdata/penguins.csv'

    uck.Table(csv).do(
        "where sex = 'female' ",
        'where year > 2008',
        'select *, cast(body_mass_g as double) as grams',
        'select species, island, avg(grams) as avg_grams group by 1,2',
        'select * replace (round(avg_grams, 1) as avg_grams)',
        'order by avg_grams',
    )
I still can't tell if it's too goofy, or if I really like it. :)

I write a lot of SQL anyway, so this approach is nice in that I almost never need to look up function syntax like I would with Pandas, since it's just DuckDB SQL under the hood, minus the need to write `select * from ...` repeatedly. And when you're ready to exit the data exploration phase, it's easy to gradually translate things back to "real SQL".

The whole project is pretty small, essentially just a light wrapper around DuckDB to do this expression chaining and lazy evaluation.
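The chaining idea itself fits in a few lines. This is a simplified sketch, not duckboat's actual implementation (which also handles snippets that start with `select`, as in the example above); it only covers trailing clauses like `where` and `order by`:

```python
# Simplified sketch of SQL snippet chaining: each fragment is wrapped
# around the previous query as a subquery, so the user never writes
# "select * from ..." themselves. (Not duckboat's real implementation;
# snippets starting with "select" would need the subquery spliced in
# after the column list, which this sketch skips.)

def chain(source: str, *snippets: str) -> str:
    query = source
    for s in snippets:
        query = f"select * from ({query}) {s}"
    return query

sql = chain(
    "read_csv('penguins.csv')",
    "where sex = 'female'",
    "where year > 2008",
    "order by body_mass_g",
)
print(sql)
```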


They also only have one type of neighbor. Square grids have 2 neighbor types. Triangular grids have 3.


Makes perfect sense. Thanks both


It depends on if you want to model a point or an area. lat/lng gives you a point, but you often want an area to, for example, count how many people are in that area. A spatial index like H3 provides a grid of area units.


But so do lat long ranges.


You can use those if they work for your application. One downside would be that you're storing 4 numbers compared to a single `int64` index with H3.

You also have to decide how you'll do that binning. Can bins overlap? What do you do at the poles? H3 provides some reasonable default choices for you, so you don't have to worry about that part of your solution design.
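For a rough sense of scale, assuming each lat/lng range is stored as four float64 bounds versus one packed 64-bit cell index:

```python
# Storage comparison: a lat/lng bounding box as four float64s versus a
# single 64-bit integer cell index. (Illustrative sizing only.)
import struct

bbox_bytes = struct.calcsize("4d")  # min_lat, max_lat, min_lng, max_lng
cell_bytes = struct.calcsize("q")   # one int64 index
print(bbox_bytes, cell_bytes)  # 32 8
```

The single integer also doubles as a natural group-by/join key, which the four-number range does not.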


...and use H3 instead! https://h3geo.org/


Very different use case -- ZIPs/ZCTAs have some semblance of population normalization


If you care about that and have a data source, you can add, for example, population density per H3 cell as part of your analysis. That has the additional benefit of denoting this quantity of interest explicitly, rather than relying on some implicitly assumed correlation which may not hold.


Hey AJ, this is almost on topic, do you know of a more up to date version of the dataset you used on the blog post release for H3 v4.0.0 [1]? They stopped updating in Oct 2023. Thanks! [1] https://data.humdata.org/dataset/kontur-population-dataset


I don't. And maybe I should have emphasized "and have a data source" more, since it's doing a lot of the heavy lifting in my statement :)


Not necessarily true. The population isn't balanced at all between many ZCTAs. Census units are.


Absolutely this. Use other Census areal units if you can and ZCTAs only if you have to.


What H3 do I belong to if my house is split between three different ones, pretty much equally? Any/all of them?


You take a smaller H3 :-) The maximum area of a resolution 15 H3 is 1 square meter, so unlikely to split a house in two.


What is the benefit of H3 over a rectangular grid?


That map does seem to be using H3 hexagons: https://h3geo.org/


Oh man, what an exciting opportunity. *clears throat* The Hacker News title seems to mistranslate the original em dash to an en dash.

