
> parse the file using ffmpeg, ghostscript, libreoffice etc.

> Known illegal md5s

Yeah, no. Those lists (of known formats and of known illegal files) will quickly grow unmanageable.



You don't need to keep the lists, just keep a Bloom filter of sufficient size to keep false positives low.
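To make the suggestion concrete, here is a minimal sketch of such a filter. The class name, the SHA-256 double-hashing scheme, and the parameters are my own choices for illustration, not something proposed upthread; the sizing uses the standard Bloom filter formulas.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter; not tuned for production use."""

    def __init__(self, expected_items: int, false_positive_rate: float):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(1, math.ceil(
            -expected_items * math.log(false_positive_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the step is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Usage would be along the lines of `bf.add(md5_digest)` for every known entry and then `md5_digest in bf` for each incoming file. A hit still has to be confirmed against the real list if false positives matter.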


Compared to a naive list, a smarter encoding can save at best log2(n)-1.4 bits per entry. That's perhaps a factor of 2-3x, depending on list size and acceptable false positive rate. For example, if you have a list of a billion entries and accept a one in a million false positive rate, the naive list needs 30+20=50 bits per entry, while an ideal encoding needs 21.4 bits per entry, a 57% reduction.
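The arithmetic can be checked with a few lines. This is only a sketch: the function name is made up, the 1.4-bit constant is taken from the comment as given, and the Bloom filter figure is an addition of mine using the standard 1.44*log2(1/p) bits per entry at optimal settings.

```python
import math


def bits_per_entry(n: int, p: float, overhead_bits: float = 1.4):
    """Per-entry storage for n entries at false positive rate p."""
    # Naive list: each entry is a hash truncated to log2(n) + log2(1/p) bits.
    naive = math.log2(n) + math.log2(1 / p)
    # "Ideal" encoding as described in the comment: log2(1/p) plus about 1.4 bits.
    ideal = math.log2(1 / p) + overhead_bits
    # Bloom filter with optimal hash count: log2(1/p) / ln(2) bits.
    bloom = math.log2(1 / p) / math.log(2)
    return naive, ideal, bloom


naive, ideal, bloom = bits_per_entry(10**9, 1e-6)
reduction = 1 - ideal / naive
print(f"naive={naive:.1f} ideal={ideal:.1f} bloom={bloom:.1f} reduction={reduction:.0%}")
```

Under these assumptions a Bloom filter sits between the two at a bit under 29 bits per entry, so it is smaller than the naive list but not by an order of magnitude.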

So I don't think Bloom filters have a significant impact on the manageability of those lists. I also doubt storage size will be the main concern, compared to the effort of adding entries to the list.


Would SELECT COUNT(1) FROM records WHERE md5 = newMd5 not scale? (Or IF EXISTS (SELECT 1 ...) ...)


No. That’s only going to be performant if you can perform an indexed read. That index is not going to fit in memory.
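For reference, a rough sketch of what the indexed lookup looks like, using SQLite from Python's standard library purely for illustration. The `records` table and `md5` column come from the question above; the index name, the helper functions, and the choice of SQLite are assumptions. The last function only counts raw key bytes and ignores all B-tree overhead, so it is a lower bound on index size.

```python
import hashlib
import sqlite3


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (md5 BLOB NOT NULL)")
    # Without this index, every lookup is a full table scan.
    conn.execute("CREATE UNIQUE INDEX idx_records_md5 ON records (md5)")
    return conn


def md5_known(conn: sqlite3.Connection, new_md5: bytes) -> bool:
    # EXISTS stops at the first match, unlike COUNT.
    row = conn.execute(
        "SELECT EXISTS(SELECT 1 FROM records WHERE md5 = ?)", (new_md5,)
    ).fetchone()
    return bool(row[0])


def raw_index_key_bytes(num_entries: int, digest_bytes: int = 16) -> int:
    # Lower bound: key bytes only, no page or pointer overhead.
    return num_entries * digest_bytes


conn = open_store()
conn.execute("INSERT INTO records (md5) VALUES (?)", (hashlib.md5(b"known").digest(),))
print(md5_known(conn, hashlib.md5(b"known").digest()))
print(raw_index_key_bytes(10**9) / 1e9, "GB of raw MD5 keys for a billion entries")
```

At a billion entries the raw 16-byte keys alone come to 16 GB, which is the scale behind the worry about the index fitting in memory.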



