JProfiler, at least for my use cases, isn't really similar to VTune at all. I already know what my hot spots are: they're the inner bits of algorithms that run a few billion to a few trillion times. What I need is to understand, as granularly as possible, exactly which instructions are executing and how the various caches and memory are behaving. Convex and tree optimizers are generally limited by memory speed, and my goal is to have this code run at, e.g., 0.9+ of peak memory bandwidth.
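
To make the "0.9+ of memory bandwidth" target concrete, here is a rough sketch of how such a ratio could be computed. It is not my actual tooling: the function names are made up for illustration, the 50 GB/s peak is a placeholder you would replace with your machine's spec or a STREAM-style measurement, and a plain buffer copy only approximates what a real inner loop does.

```python
import time


def achieved_bandwidth(bytes_moved: int, seconds: float) -> float:
    """Bytes per second actually moved through memory."""
    return bytes_moved / seconds


def bandwidth_fraction(bytes_moved: int, seconds: float,
                       peak_bytes_per_sec: float) -> float:
    """Achieved bandwidth as a fraction of an assumed peak (target: 0.9+)."""
    return achieved_bandwidth(bytes_moved, seconds) / peak_bytes_per_sec


def time_copy(n_bytes: int, repeats: int = 5) -> float:
    """Best-of-N wall time for one bulk copy of n_bytes."""
    src = bytearray(n_bytes)
    dst = bytearray(n_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        # Same-size slice assignment between bytearrays; in CPython this
        # should be a single bulk copy with no per-byte interpreter work.
        dst[:] = src
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    N = 64 * 1024 * 1024   # 64 MiB per buffer, well beyond typical cache sizes
    ASSUMED_PEAK = 50e9    # hypothetical peak in bytes/s; NOT a measured value
    elapsed = time_copy(N)
    # Count one read of src plus one write of dst. Real traffic may be
    # higher if the hardware reads the destination lines before writing.
    moved = 2 * N
    print(f"achieved: {achieved_bandwidth(moved, elapsed) / 1e9:.1f} GB/s")
    print(f"fraction of assumed peak: "
          f"{bandwidth_fraction(moved, elapsed, ASSUMED_PEAK):.2f}")
```

This only gives the end-to-end number. It says nothing about which instructions or cache levels are responsible for a shortfall, which is the part a hardware-counter profiler has to answer.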