I'm generally confused by the hype around ML and 'data science'. It seems like CS has somehow regressed to the behaviourism era of psychology, or to economics before the Lucas critique.
The problem with all this data talk isn't just implementation or bad structure; the limitations of putting all your bets on inductive reasoning are systemic.
The insight that economists had in the 70s and 80s was that reasoning from aggregated quantities is extremely limited. Without understanding, at a structural level, the generators of your data, trying to create policy based on outputs is like trying to reason about the inhabitants of a city by looking at light pollution from the sky.
My guess as to why data science so rarely delivers what it promises is that you can't get any value from historical data if your circumstances change to the point where past data is irrelevant, which in the world of business happens pretty quickly. To have a competitive advantage, one needs to figure out what has not been seen yet.
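A toy sketch of this failure mode (entirely synthetic data, not from anything in this thread): a model with an excellent fit on historical data becomes useless the moment the data-generating process changes, because nothing in the data announces the change.

```python
import random

random.seed(0)

# Regime A: the outcome is driven by a feature x with slope 2.
train = [(x, 2 * x + random.gauss(0, 0.1)) for x in range(100)]

def ols_slope(data):
    """Ordinary least squares slope, closed form, no intercept drama."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    num = sum((x - mx) * (y - my) for x, y in data)
    den = sum((x - mx) ** 2 for x, _ in data)
    return num / den

def mse(slope, data):
    """Mean squared error of the fitted slope on a data set."""
    return sum((slope * x - y) ** 2 for x, y in data) / len(data)

slope = ols_slope(train)  # very close to 2: a near-perfect historical fit

# Regime B: a "policy change" flips the underlying relationship.
test = [(x, -1 * x + random.gauss(0, 0.1)) for x in range(100)]

print(mse(slope, train))  # tiny: the past is explained beautifully
print(mse(slope, test))   # enormous: the past is now irrelevant
```

The model isn't "wrong" about the history; the history just stopped being informative, and no amount of data from regime A could have signalled that.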
And trying to exploit signals suffers from the issue laid out above. There was a funny case of an AI hiring startup trying to predict good applicants, where the result was applicants putting "Oxford" in their applications in a font matching the background color.
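The gaming is trivial once the signal is known. A minimal sketch (a hypothetical keyword scorer, not the startup's actual system): the model reads text the human reviewer never sees, so stuffing invisible keywords inflates the score.

```python
# Hypothetical naive scorer: counts prestige keywords in the raw text.
KEYWORDS = {"oxford", "harvard"}

def keyword_score(resume_text):
    words = resume_text.lower().split()
    return sum(w.strip(".,") in KEYWORDS for w in words)

honest = "Ten years of relevant experience at Acme Corp."
# The appended words are rendered in the background colour, so a human
# reviewer sees nothing, but the text extractor still feeds them in.
gamed = honest + " Oxford Oxford Oxford"

print(keyword_score(honest))  # → 0
print(keyword_score(gamed))   # → 3
```

Once applicants optimize for the proxy, the proxy stops correlating with the thing it was meant to measure, which is exactly the structural-versus-aggregate problem again.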
There’s also the issue of data scientists just not having a seat at the table. Anyone can "validate" their point by finding data to support it, just like anyone can validate their opinion by doing a Google search.
I only see ML and data science as having real value when considered as a single component of a larger system, most of which will not consist of anything close to ML. Many real world environments are too entropic to see much accuracy from ML models except in very, very limited bands (facial recognition, for example).
As other commenters here have posted, without the integration of data science into both the business needs and the rest of the existing tech stack, it will remain a fun school course activity.
At a high level, the Lucas critique argued that basing predictions on historical data is problematic. The details of the argument are somewhat specific to economics, but the principle is more general. That's also why people recommending stocks say "past performance is no guarantee of future results."
One of the key issues is that circumstances change, and information about such changes will often be external to a data set.
In the Lucas critique, policy changes are an example of this. You can't predict future economic performance based on past economic performance if relevant policies have changed. But any complex situation has factors like this that are external to any data set one can easily collect about it.
In psychology there was a period from roughly 1900 to mid-century when behaviourism rose in prominence. Simplified, this was the paradigm that the internal processes of the mind are not really interesting, and that what matters is only the relationship between input and output, treating the mind as a black box of sorts (roughly analogous to ML models).
This came under heavy attack during what is called the cognitive revolution, which put the focus on understanding mental processes at a structural level (for the reasons outlined in the post above).
Economics went through a similar process. Up until the 70s Keynesianism was very dominant, which mostly focuses on using aggregate quantified economic data, i.e. output, unemployment, capital and so on, to make policy suggestions. This began to be attacked and supplemented with what are called 'micro-foundations', which aimed not just to look at quantified data top-down, but to model fundamental behaviour and interaction from the individual up, i.e. the actual entities that generate the aggregate data.
There was also a similar movement in linguistics, starting (mostly) with Chomsky at about the same time, applying the same criticism to how we model language.