If you have some benchmark results, it'd be great to see how it compares to traditional data structures in practice, for different datasets and varying k-mer lengths
Thank you! The "Space Requirements" section in the README has a few examples, and your comment has made me realize our (micro-)benchmark link in the README is broken.
We'll get that fixed and maybe find the time to do a larger post with some benchmarks on both space/time tradeoffs and overall performance vs. other data structures.
Although not part of the language proper, C++'s STL has vector<bool> that packs boolean flags into a bitvector (a now regrettable optimisation, since it doesn't function the way a container should - a story for another time). There is also bitset, which is similar but static.
I use them whenever I need to. Often I don't need to, since a compiler should be able to optimise the very basic things. Optimising prematurely is a good way to stuff something up (see question 5 here: http://ridiculousfish.com/blog/posts/will-it-optimize.html)
And it hurts readability.
I would expect most good programmers to know at least some basic "tricks" though (and would hope they have the sense to not over use them).
Glad you mentioned this. The author linked to my blog post, where I mention fast rank/select structures. There are faster ways to do it without a supporting structure too of course :) but I do like this post.
I'm interested in what you use the WT for in your job... I'm just starting off my PhD in succinct data structures, so it's nice to know that people use them :P
Cool, WT and succinct data structures are definitely an interesting field these days.
I'm primarily using as a compact order preserving structure to support fast lookups for both (token -> position) and (position -> token). I'm using an updatable variant to support O(1) appends, and then use it for realtime indexing/compression. I probably should do a blog post at one point...
Would be interesting to hear about faster ways to do rank/select without using support structures... I'll keep an eye on your blog :-)
> I'm primarily using as a compact order preserving structure to support fast lookups for both (token -> position) and (position -> token). I'm using an updatable variant to support O(1) appends, and then use it for realtime indexing/compression. I probably should do a blog post at one point...
That's very interesting, are you using raw or compressed bitmaps (such as RLE or RRR)?
A limitation of (updatable) wavelet trees is that you need to know in advance the set of symbols that you will see (otherwise you would need to change the tree structure). Is it a problem for you?
I wrote a paper with a solution to this problem, I can't share it until mid march, let me know if you are interested.
My solution is to use a fixed tree with bloated leaf nodes (use more bits per element in the leaf node). A high number of nodes with a low number of elements causes a high overhead, so I simply rebuild the tree at a few given thresholds to keep the overhad and the cost of the bloated leaf nodes down. We can discuss further by email, if you want. I'd love to read your paper! :-)
For background: I'm 23, starting my PhD. Have had about a year of industry work (including a game company - actual interesting work). I have a few startup ideas (trying hard to execute), and I'm interviewing with Google (although I failed last year, and I might just wanna do my PhD). I wasn't passionate about programming until last year actually, after I started reading more.
People say that the smartest people are uncertain of their own abilities, because they see the bigger picture. They are the ones that can see just how deep this rabbit hole goes. Infinity can be a daunting thing. I guess it can either make you give up, or push harder.
I'd hazard a guess to say you are on track (although who am I to say) :) never become a senior developer, just move along the spectrum of a curious, 'young' programmer mind.
Thanks for reading the post, and Michael for posting it. I enjoyed all the comments. I only recognized Zed Shaw, Batman and a few others, so I've got some reading to do :)
If you have some benchmark results, it'd be great to see how it compares to traditional data structures in practice, for different datasets and varying k-mer lengths