throwaway67's comments

... or they could have used BigQuery with a primary key on message ID.


BQ doesn't have primary keys. Perhaps you are thinking of the ID that can be supplied with a streaming insert? That comes with very loose guarantees on what gets de-duplicated (a window of ~5 minutes, IIRC).
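To illustrate why a time-windowed de-dupe is a loose guarantee, here is a toy model in plain Python. It is not BigQuery's implementation: the class name `WindowedDeduper`, the 300-second window, and the message IDs are all made up for the example.

```python
class WindowedDeduper:
    """Toy model of best-effort de-duplication keyed by an insert ID.

    Hypothetical illustration only; real services do not promise this behavior.
    """

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._last_accepted = {}  # insert_id -> time the row was last accepted
        self.rows = []

    def insert(self, insert_id, row, now):
        """Store the row unless the same ID was accepted within the window."""
        seen_at = self._last_accepted.get(insert_id)
        if seen_at is not None and now - seen_at < self.window_seconds:
            return False  # duplicate inside the window: dropped
        self._last_accepted[insert_id] = now
        self.rows.append(row)
        return True


dedup = WindowedDeduper(window_seconds=300)  # assumed ~5 minute window
results = [
    dedup.insert("msg-1", {"id": "msg-1"}, now=0),    # first delivery: stored
    dedup.insert("msg-1", {"id": "msg-1"}, now=60),   # quick retry: dropped
    dedup.insert("msg-1", {"id": "msg-1"}, now=600),  # late retry: stored again
]
print(results, len(dedup.rows))
```

The late retry lands outside the window and is stored a second time, which is the case a real primary key would have rejected.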


Yeah, I think within the context of BigQuery the most sensible thing would be to aggregate over the column that would be considered a primary key, for example [0]. That said, the Streaming API de-dupe window is very nice in practice.
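A minimal sketch of that aggregate-per-key idea, using SQLite from Python as a stand-in so it runs locally. The table `messages` and its columns are hypothetical, and this is not necessarily the query from [0]; the same `GROUP BY` shape should carry over to BigQuery SQL, but treat that as an assumption to verify.

```python
import sqlite3

# In-memory database standing in for the warehouse table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (message_id TEXT, body TEXT, received_at INTEGER)"
)
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [
        ("a", "hello", 1),
        ("a", "hello", 2),  # duplicate delivery of message "a"
        ("b", "world", 3),
    ],
)

# Collapse duplicates by grouping on the column treated as the primary key.
DEDUP_QUERY = """
SELECT message_id,
       MAX(body)        AS body,
       MAX(received_at) AS received_at
FROM messages
GROUP BY message_id
ORDER BY message_id
"""

deduped = conn.execute(DEDUP_QUERY).fetchall()
print(deduped)
```

One caveat: taking `MAX` per column can mix values from different rows if the duplicates are not identical, so this only fits cases where duplicates really are exact copies apart from the timestamp.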

As I mentioned elsewhere, on Google Cloud the most elegant way of doing this is with Google Cloud Dataflow [1].

(work at G)

[0] https://stackoverflow.com/questions/38446499/bigquery-dedupl...

[1] https://cloud.google.com/blog/big-data/2017/06/how-qubit-ded...

