I run a reporting & analytics team at a ~500-person software company and I've never been paged to optimize a query because the database is on fire. That's like designing your sprinkler system while the building burns down. I've also never found a query that I could reduce from 10 hrs to 10 seconds. Maybe one that never finished to one that actually runs, but not the kind of magnitude improvement you reference.
My guess is that if you moved from reporting and analytics to a DBA role in web operations at a large-scale shop you'd see something very different. It starts with programmers who don't know SQL (they use some ORM instead), who don't know how to get or interpret a query plan, and who aren't aware of indexes, with the results you might imagine. These same developers tend not to know how to get decent performance from a NoSQL database either. And since the code is in an ORM, it can be difficult for DBAs to do a decent code review.
Also, there's a very big difference between the impact of performance requirements for reports which run once or even 20-30 times per day and something that runs in every transaction where that transaction runs 10s to 1000s of times per second. If you optimize a report to run 3% faster, that's unlikely to make a difference. If you optimize a hot-path transaction by 3% on a large website, that could mean a few fewer hosts/instances to provision, and that's a direct budget impact of $5K/yr or more.
Oh, and as an SRE I've found several queries where I've made two-orders-of-magnitude performance improvements; in one of those cases the process was beginning to fail regularly because the queries were so bad.
$5K/yr? I've saved $5K+/mo on some clusters by optimizing lots of 100-1000ms queries into the <5ms range!
It's really fun.
Also agree on the reporting vs transactional stuff. On one reporting cluster I run, I don't really look at queries that take 30 seconds. Optimizing them that much just doesn't matter compared to the queries run interactively.
For a dev's sake - and before I go ask my new AI overlord - what is a good rule of thumb re: indexing? My naive assumption has always just been that if it's a column you use to filter a lot of your queries, and you're unhappy with current perf, index it. Examples would be datetime columns (if sorting server-side) and columns like TenantId, UserEmail, etc.
I'd say it depends a bit on your access pattern, the query frequency and the time budget you have for each query, and beyond that on your insert-performance requirements. I'm mostly referring to PostgreSQL here since that's what we use at work. There, I have 2-3 rules of thumb by now:
One pattern worth singling out treats tables like hashmaps. You have a query, you expect 0 or 1 results, and you want those results as fast as possible, even more so than usual. This is usually accompanied by a low to very low write frequency relative to reads. Examples are - like you say - a login: give me the account for this email - generally, you'll have many more logins than registrations. Or: give me the current chat room for a user - they enter a chat room once every few hours and then look into it a lot. This usually warrants an index without much analysis. Foreign-key relationships often fall into this pattern too, but imo that would require evidence from query plans to look into further, and normally the primary key should handle these.
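To make the "table as hashmap" case concrete, here's a minimal Postgres sketch - the accounts table and its columns are made up for illustration:

    -- Hypothetical accounts(id, email, password_hash) table.
    -- One unique expression index turns the login lookup into a single index probe.
    CREATE UNIQUE INDEX accounts_email_lower_idx ON accounts (lower(email));

    -- Login path: expect 0 or 1 rows, many more reads than writes,
    -- so the index easily pays for its write overhead.
    SELECT id, password_hash
    FROM accounts
    WHERE lower(email) = lower('user@example.com');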
After that, it becomes a bit weird until you think about it, because you get into things like index size and selectivity. The interesting part there is: Postgres on a decently quick core can iterate linearly through a few tens of thousands to a few hundred thousand rows in a few micro- to milliseconds, per parallel worker. This is weighed against the cost of traversing a potentially large index - and an index cutting the dataset into too many very tiny pieces will be large. This can cause Postgres to stop using an index even if it matches conditions in the query, because just chugging through the dataset brute force in parallel is faster. And yes, I was apprehensive at first, but the optimizer tends to be right there.
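If you want to watch the optimizer make that call yourself, a rough sketch (table, column and values invented):

    -- Hypothetical orders table with a skewed status distribution.
    CREATE INDEX orders_status_idx ON orders (status);
    ANALYZE orders;  -- refresh the statistics the planner uses

    -- Predicate matching a large share of rows: the planner will usually
    -- pick a (Parallel) Seq Scan despite the matching index.
    EXPLAIN SELECT * FROM orders WHERE status = 'shipped';

    -- Predicate matching a rare value: an Index Scan becomes worthwhile.
    EXPLAIN SELECT * FROM orders WHERE status = 'refunded';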
So yeah, your approach sounds right there. Unless you really know an index is really good for the DB and you can make specific points about rows inspected per query, just don't bother. Once a query - or a class of queries - starts running slow, throw the whole bus of EXPLAIN ANALYZE and something like a query plan visualizer[1] at it to identify the slow part. This will generally point to a slow sequential scan or a slow parallel sequential scan, or a crazy-sized join. And then you can start thinking about creating an index for the conditions used in the query to chop the dataset into blocks of roughly c * 10k rows, with c usually somewhere around 5-25.
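For reference, the incantation looks roughly like this - the table and filter are invented, only the EXPLAIN options matter:

    -- BUFFERS shows how many pages each plan node touched, which is usually
    -- what exposes the slow sequential scan or the crazy-sized join.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT *
    FROM events
    WHERE tenant_id = 42
      AND created_at >= now() - interval '7 days';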
What that c should be is more about your performance requirements and the nature of these queries, as well as the write load. If this is a web app with loads of reads and few inserts, c should probably be on the smaller side. Tenant ID tends to be a good column, CreatedDate for a sequential data stream like an audit log or some tracking/reporting stuff, or the IDs that transactions tend to reference this table by. If you're looking at a write-heavy reporting setup, it might be better to be a bit coarser for quicker inserts, because who cares if a reporting query is chugging for 5 seconds? Though 5 seconds would be in that ugly uncanny valley - I can't in good faith just go and get coffee while the query runs, but it's too slow to be a good experience, lol.
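A sketch of the kind of index that usually falls out of that analysis, with invented names; CONCURRENTLY avoids locking the table while it builds:

    -- Leading tenant_id column narrows the scan to one tenant; created_at
    -- keeps that slice usable for range filters and ORDER BY.
    CREATE INDEX CONCURRENTLY events_tenant_created_idx
        ON events (tenant_id, created_at);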
And something to consider for that write-heavy case is also the general query patterns of the whole application against the database schema. This, again, indicates that delaying index creation until well into the maturity of a codebase is a good idea. I've had quite a few applications and reporting-like solutions which ended up with something like 6-8 different filter patterns throughout their queries. And sure, we could have created 6 different index structures to support each of those individually and perfectly, but then we'd have ended up with a crazy number of indexes, which in turn wouldn't be great for insert performance. Instead, since we waited for issues, we could identify 2 indexes that would support each individual query to only like 60-80% of the effect of a tailored index structure, but these 2 indexes would support all queries across the board and - as we later saw - would support 90%+ of the queries written further down the line to a similar degree.
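As a hedged sketch of what that consolidation can look like (names invented; the point is the shared leading column, not these exact indexes):

    -- Two broad indexes instead of six tailored ones. Most filter patterns
    -- start with tenant_id, so nearly every query can use one of the two
    -- reasonably well, even if not perfectly.
    CREATE INDEX docs_tenant_created_idx ON documents (tenant_id, created_at);
    CREATE INDEX docs_tenant_status_idx  ON documents (tenant_id, status, created_at);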
So, waiting until we had an understanding of the dataset and actual problems with actual data meant we could solve a lot of issues with a morning of analysis and a small touch of the system. And then performance issues were gone for a very long time.
I’ve had more than a few queries I’ve been able to reduce from multiple minutes down to multiple seconds. 4 minutes to 15 seconds, for example. In some cases, I’d written the original, and knew it wasn’t great, but didn’t understand the impact enough to realize how bad it would get.
Never seen 10 hours go to 10 seconds, but I don't doubt that it might have happened at some point. I did reduce a 25-hour ETL process down to 25 minutes, but that wasn't solely down to queries.
> That's like designing your sprinkler system while the building burns down
Aptly, that's how this application was built over several years. And it's in the best shape out of all the old flagships at work.
> I've also never found a query that I can reduce from 10 hrs to 10 seconds. Maybe one that never finishes to actually running, but not the kind of magnitude improvement you reference.
And you never should, if you understand a bit about efficient SQL. This doesn't stop other people from having one or two levels of dependent subqueries in reporting code which passed testing, because the test environment doesn't contain reporting-relevant data.
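The classic shape of those, as an invented example: a correlated subquery that re-runs per outer row, which is invisible on test data and blows up on production volumes. Rewriting it as an aggregate-then-join is usually where the orders of magnitude come from:

    -- Typically executed once per customer row.
    SELECT c.id,
           (SELECT sum(o.amount) FROM orders o WHERE o.customer_id = c.id) AS total
    FROM customers c;

    -- Aggregate once, then join: same result, one pass over orders.
    SELECT c.id, t.total
    FROM customers c
    LEFT JOIN (
        SELECT customer_id, sum(amount) AS total
        FROM orders
        GROUP BY customer_id
    ) t ON t.customer_id = c.id;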
And yes, from this point it takes one or two more questions or comments and we go from "your SQL query sucks" all the way to more or less bad development processes and how to do this better: CI, test data, test data management, reviews, reviews of database schemas, and so on.
Obviously on top of all the other tasks and duties. Hence: Don't know SQL well, at least officially.
My experience as the performance “expert” at the company I work at isn't necessarily that a single query gets sped up that much in absolute terms, but that much of a relative speedup is possible when dealing with a query that executes very frequently.
In my experience, if one of our core queries regresses from taking 20ms to taking 600ms (which happened recently, although only on certain customer databases with 100M+ rows in the table), the query will suddenly be taking 5+ seconds under load.
The problem compounds as the IO and compute resources of the database get consumed and everything starts getting even slower.
In my recent case, the same query was actually taking the same amount of time before and after when the DB wasn't “on fire”. The query had started using a lot more IO, which became a bottleneck, but you wouldn't know it from query time alone until suddenly we were IO-constrained under load.
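If that's Postgres (an assumption on my part), pg_stat_statements is handy for catching exactly that, since it tracks block reads per query rather than just latency:

    -- Requires the pg_stat_statements extension to be installed and loaded.
    SELECT query,
           calls,
           mean_exec_time,    -- ms, PostgreSQL 13+
           shared_blks_hit,   -- blocks served from shared buffers
           shared_blks_read   -- blocks that had to come from disk/OS cache
    FROM pg_stat_statements
    ORDER BY shared_blks_read DESC
    LIMIT 20;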