Where is a high level description of the algorithm? How is it different from quick sort, it seems quite similar based on a quick observation of the code.
You're forgetting probably the most important optimization: block partitioning. This one alone makes it almost 2x faster (on random arrays) than typical introsort when sorting items by an integer key.