6-12 months of self-directed learning & self-promotion (blog posts, talks at local user groups, open source contributions, etc.) can get you into that industry if you're already a semi-competent programmer.
My take on this: if you lack expertise in maths, you probably have functional expertise in one or several fields.
You can build tools, projects, and a company on the premise of "automating/improving field <X> through AI". What you lack in maths can be made up for by your intimate knowledge of the field.
You will gradually improve your knowledge of maths and the day will come when you'll realise you are finally able to apply this knowledge to a field you know much less about.
That's pretty much what happened to me and the team I assembled for my current startup. Not bad in IT, not bad in e-commerce, decent in maths (as far as a double degree in EE & computer science goes: not an expert, but able to read maths written by others).
We started with a reduced functional scope, teamed with hardcore mathematicians, expanded our maths knowledge, expanded the functional scope, started writing our own maths, etc...
Like every skillset, learning maths is gradual. Don't expect to write the next "AI breakthrough research paper" tomorrow, but read and try to understand the whys and wherefores of trending stuff (which you'll often hear about here on HN).
It is a wide field. There are people building tools (obviously needing strong maths) and people applying tools to real-world problems. Don't get me wrong, you will benefit from learning the math behind the concepts, but being a competent software developer is in itself a rare and valuable skill. Jump in, it's incredible what we can do. It's fun. :)
I work in this field. I can tell you that the maths involved is an easier (and more exciting) learning curve than you might expect. A lot of software engineering already involves the tools necessary for getting good at math: for example, SVG transformations and matrix algebra, or learning how to manipulate abstractions in operating systems.
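To make the SVG point concrete, here's a stdlib-only sketch: SVG's `transform="matrix(a b c d e f)"` is just a 3x3 homogeneous matrix, and composing transforms is plain matrix multiplication.

```python
# Sketch of the matrix algebra behind SVG's transform attribute.
# transform="matrix(a b c d e f)" is the homogeneous matrix
#   [a c e]
#   [b d f]
#   [0 0 1]
import math

def mat_mul(A, B):
    """Multiply two 3x3 matrices (lists of rows)."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def apply(M, x, y):
    """Apply a homogeneous transform to a 2D point."""
    return (M[0][0] * x + M[0][1] * y + M[0][2],
            M[1][0] * x + M[1][1] * y + M[1][2])

def rotate(deg):
    r = math.radians(deg)
    return [[math.cos(r), -math.sin(r), 0],
            [math.sin(r),  math.cos(r), 0],
            [0, 0, 1]]

def translate(tx, ty):
    return [[1, 0, tx], [0, 1, ty], [0, 0, 1]]

# Rotate 90 degrees about the pivot (1, 0): translate the pivot to the
# origin, rotate, translate back -- composition reads right to left.
M = mat_mul(translate(1, 0), mat_mul(rotate(90), translate(-1, 0)))
print(apply(M, 2, 0))  # the point one unit right of the pivot swings up to (1, 1)
```

Once that clicks, the jump to the linear algebra in ML courses is smaller than it looks.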
I'd suggest an approach where you use your (presumably good) software engineering skills to join an engineering gig in a data science team. Then, gradually pick up tools, slowly pivoting into fully fledged ML/AI. It is definitely possible as a career path.
This particularly worries me because if technical skills go out of vogue, then people skills would become correspondingly more important, and the thing I've optimized for all my life would become useless.
Natural language can be incredibly imprecise though. I expect the ideal UI for AI will meet somewhere in the middle with something that combines the best of programming languages with the specificity and ease of use of natural language -- think "strongly typed natural language". I'd rather my command type-error than the machine go off and do what I say, not what I meant.
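A toy sketch of what I mean by "strongly typed natural language" (Python; the names `SCHEMAS`, `dispatch`, and the commands are invented purely for illustration): a command schema type-checks the slots extracted from an utterance and refuses to act on a mismatch, instead of guessing.

```python
# Hypothetical sketch: commands carry typed slot schemas, and a dispatch
# step type-checks before anything executes.
SCHEMAS = {
    # command -> {slot_name: expected_type}
    "set_volume": {"level": int},
    "remind_me": {"when": str, "what": str},
}

class CommandTypeError(Exception):
    pass

def dispatch(command, **slots):
    """Validate slots against the command's schema; raise instead of guessing."""
    schema = SCHEMAS.get(command)
    if schema is None:
        raise CommandTypeError(f"unknown command: {command}")
    for name, expected in schema.items():
        if name not in slots:
            raise CommandTypeError(f"{command}: missing slot '{name}'")
        if not isinstance(slots[name], expected):
            raise CommandTypeError(
                f"{command}: slot '{name}' should be {expected.__name__}, "
                f"got {type(slots[name]).__name__}")
    return (command, slots)

print(dispatch("set_volume", level=7))       # well-typed, proceeds
try:
    dispatch("set_volume", level="loud")     # type-errors rather than acting
except CommandTypeError as e:
    print(e)
```

The real version would sit behind a language model doing the slot extraction; the point is only that the final step can fail loudly instead of doing what I say rather than what I meant.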
Step 1 was recognizing speech to text accurately. Step 2 is AI that is globally context- and subtext-aware.
I expect Step 3 to be AI that is personally situation-, context-, and subtext-aware (IoT & personal info storage-enabled). And I think that's drastically going to improve accuracy.
I can see some standardization of input (in the same way that GUIs evolved), but I think it will probably take the form of grammar structure & error behavior more than strongly typing questions.
That will solve the AI understanding you, but still won't make me more precise and clear in my instructions. Yes, the AI can hopefully infer the rest, but even then, I can currently type faster than I can speak. Will the AI be able to infer enough of what I want that I can say substantially less and have the AI still get it right? At that stage, why not go all the way and remove me from the equation completely?
On the typing angle, I'd argue the HN vs mass market point. Typing is still a skill we aren't born with. As much as it's now assumed in 1st world core demographics, there are a huge number of people for whom it isn't faster.
And if you've ever watched a grandparent try and use Windows, there are a lot of other metaphors they miss too.
I'm not saying NL recognition is going to be easy, but I am saying that getting there will be far more valuable than maybe the HN crowd assumes from self-reflection.
It depends on the use case. If it's a typical Google Home-style "turn on the lights", "look up this thing", "do my accounts", then I agree (although the AI may need to do a lot of prompting, since people (myself included) often don't know what we really want).
For less casual uses, I'm not sure. I guess there's a spectrum.
Yeah I think you're right that current systems don't have the complexity yet. It's either simple tasks (turn on the lights, play music), search (find X for me) or menus. As complexity of tasks increases, I think it'll become more important.
Natural languages (such as English) aren't the endgame because:
- They don't constrain what you can and can't say based on context. A graphical user interface can adapt to your context by disabling and hiding options that aren't relevant or allowed.
- They are linear and difficult to navigate. It's easier to navigate back and forward in time through a graph or tree than through a textual conversation.
- They are textual, and lack the richness and feedback of sliders, color pickers, maps, 3D models, etc.
While I'd agree with your points, I'd agree with them for me (and presumably for you).
I have a feeling we and HN in general are outliers in terms of "looking at a UI, mentally decomposing its elements, applying common UI knowledge to them, experimenting until we get the action we want, and doing all the above in our first 10s with a new UI."
As support, I would offer that, despite the points you've outlined above, normal people typically feel more comfortable performing an unfamiliar task via speech than via a GUI.
Here's why I put it there. Devices which feature functional natural language input can ditch virtual or physical keyboards and all the associated hardware.
Which is a fundamentally more cost-effective design. Which is going to enable "smart" devices at price points that traditional screen-and-input devices can't compete at.
And if only certain companies have strong enough AI and NL processing to enable this? That's going to reshape the market. (Yes, I'm looking at Apple)
Granted, but I'd expect subvocalization with predictive text or something else to infill where necessary. Stronger AI can substitute for a lot of "magic."
Can you expand on what you consider the AI UI problem and why you don't think anyone's working on it? If it's what I think it is, people are trying, but it's just hard and in some weird way not as flashy as "company X revolutionizes the world by making a 0.05% improvement in speech recognition task."
Uh, guys with the big bucks: AI/ML are at most a tiny fraction of the material, tools, power, and value of the pure/applied math on the shelves of the research libraries.
If you have a problem you want solved with AI/ML, 99 44/100% of the time you are better off going for what's on the shelves of the libraries, as taught in various high-end grad schools, than with anything currently and specifically AI/ML.
So, broadly, go for work in statistics, optimization, stochastic processes, and optimal control. For a specific application, may want some work that stands on that existing material and is also at least somewhat original.
E.g., the crucial technical core of my startup is some original applied math I derived based on pure/applied math that's long been on the shelves of the research libraries. For the valuable work of my startup, what's in AI/ML now is in comparison at best weak, nearly silly early grade school baby talk.
Really, guys, 99 44/100% of the good stuff is still where it's long been -- on the shelves of the best research libraries. And for the education for that work, it's definitely NOT in departments of computer science. Instead look at selected programs in pure/applied math in some of the best research universities.
This is spectacularly bad advice. Which "off-the-shelf" research libraries do you mean? OpenCV, LibSVM, Weka, or Matlab/Octave?? Most examples/implementations in OpenCV are outpaced by Deep Learning methods.
TensorFlow, Caffe, and PyTorch/Torch, which today implement most of the state-of-the-art methods, were all written, on average, only a year or two ago.
>> E.g., the crucial technical core of my startup is some original applied math I derived based on pure/applied math that's long been on the shelves of the research libraries. For the valuable work of my startup, what's in AI/ML now is in comparison at best weak, nearly silly early grade school baby talk.
Rather than vaguely mentioning libraries and startup, I suggest you offer concrete evidence behind your claims.
>> And for the education for that work, it's definitely NOT in departments of computer science. Instead look at selected programs in pure/applied math in some of the best research universities.
This is just truthiness ("Math feels more hardcore than CS, so the Math department must be the better source."). Having known several people at Google/Apple/Facebook-FAIR, I can say that CS departments are typically the major source of AI/ML researchers.
> ... which "off-the-shelves" research libraries that you mention? OpenCV, LibSVM, Weka or Matlab/Octave??
I think they meant "library" in the literal sense (the one with books), not software libraries.
But yes, while it sounds like common sense, it's empty of meaning. Like "it's better to invest in the core math and statistical fundamentals than in the specific applications". Right, but so what? Folks like DeepMind, Numenta, etc. are developing new core research material and applying it in practice.
> Most examples/implementations in OpenCV are outpaced by Deep Learning methods.
Right. Except that:
- In some cases there are very good (and fast) classical techniques for finding what you need quickly (like identifying faces)
- You may not want to retrain a network for your specific case (for which an existing model might not already be available)
- Traditional solutions might be good enough
- OpenCV is much more than just identifying images; it has several APIs for geometric transformations, color processing, filtering, etc.: http://docs.opencv.org/2.4/modules/refman.html
Plenty of industrial AI is using other methods as well, it's just not where most of the current hype is, so there's a kind of bait and switch where they lead with the DL and then if you look at the products and APIs they're actually deploying and selling, the workhorses are often some mixture of very classical stats methods (like logistic regression) and general-purpose non-NN ML algorithms (like gradient boosting). One common breakdown is that the general-purpose APIs use these general methods, and then there are separate NN-based APIs for a few specific kinds of problems like object recognition in images, where NNs give a big performance increase.
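As a concrete (toy, stdlib-only) example of one of those workhorses: logistic regression fit by plain gradient descent is a few lines, which is part of why it remains the default in production.

```python
# Minimal one-feature logistic regression fit by batch gradient descent
# on the average log-loss. Toy data, purely illustrative.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, steps=2000):
    """Return (weight, bias) after gradient descent on log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y   # gradient factor of the log-loss
            gw += err * x
            gb += err
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

# Toy data: label is 1 when x > 2
xs = [0.0, 1.0, 1.5, 2.5, 3.0, 4.0]
ys = [0,   0,   0,   1,   1,   1]
w, b = fit_logistic(xs, ys)
print(sigmoid(w * 0.5 + b) < 0.5, sigmoid(w * 3.5 + b) > 0.5)  # True True
```

In practice you'd reach for an off-the-shelf implementation with regularization; the point is just how small the core of the "classical" workhorse is compared to a deep net.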
Especially true for companies in the data-science niche, since DL rarely gives you much of a win for tasks like data-mining SQL databases, or when you have only modest sized data sets, but nonetheless you still need to at least offer some kind of DL solution to be perceived as a state-of-the-art AI offering. (Like with the "big data" hype wave, many companies that think they have gigantic data sets don't.) These other methods are better understood though and not in a ton of flux, so there's no need for an acquihire frenzy to get that expertise.
> This is a spectacularly bad advice, which "off-the-shelves" research libraries that you mention?
Harvard, Princeton, MIT, Stanford, Berkeley, U. Chicago, Johns Hopkins, Cal Tech, need I go on?
It's very good advice; sorry you don't understand it.
In short, the computer science people know how to write software, but for exploiting pure/applied math they don't know what software to write; that needs a mathematician, and the computer science profs rarely have enough background in math to do well with the subject.
The math is based on pure math and is heavily theorems and proofs, and about the only way to be good with that material is to have a BS, an MS, and hopefully a Ph.D. in pure/applied math.
The computer science AI/ML work has obtained some good results; just why the techniques work is too often still a mystery. But the emphasis on gigantic quantities of data makes the work a niche and real applications rare.
> The computer science AI/ML work has obtained some good results; just why the techniques work is too often still a mystery. But the emphasis on gigantic quantities of data make the work a niche and real applications rare.
From an industrial point of view, that 'niche' includes a large proportion of the consumer internet and the 'rare' applications include online advertising, which powers two of the four most valuable companies on earth.
What uselessly vague and implausible advice: "Hey people who've made incredible progress in a variety of fields the last several years using Approach X; Approach X is actually worthless, using Approach Y will blow it away! I, random anonymous internet poster, guarantee it! Exactly how would you use Approach Y, you ask? Well, just read and figure it out yourselves, duh!"
What I wrote is fine. You have a rewrite that is not fine. For a better rewrite, there is a toolbox. Somewhat new X is in the toolbox. But the box also has a huge collection of other tools.
If you have a real problem and some data and need a tool, then the odds that X is your best tool are low and, instead, one of the old tools is likely better.
What old tools? The shelves of the research libraries are awash in them. There are books and journals, rows and rows of them.
How to know which tools to apply? Need an education in the associated pure/applied math. Uh, the old tools are mostly not from and not taught in departments of computer science.
> it's definitely NOT in departments of computer science. Instead look at selected programs in pure/applied math in some of the best research universities.
Slight qualification: It's in pure/applied math in some of the best research universities, BUT, it needs a CS treatment, i.e., it needs to be utilized as numerical programming, not pen-and-paper math.
I would also claim that the pure/applied math guys (MS, PhDs) don't have much of an edge here unless they shed their pen-and-paper habits and embrace computer science (likewise, CS MS/PhDs don't have an edge as long as they keep calling their discrete-algorithms class, free of numerical analysis and scientific computing, "Algorithms!"; the exceptions are computer-graphics researchers, DSP people, numerical analysts, etc.).
And those who are good at both pure/applied-math as well as programming/software-engineering, (cough humblebrag cough) boy...!
Numerical analysis is an old field. Some of it was condition numbers of matrices and the resulting error analysis. Some of it was, say, double-precision inner product accumulation and, for solving linear systems, iterative improvement. There was a lot more for solving ordinary (e.g., stiff) and partial differential equations; Courant was good at that. Soon enough, e.g., when trying to solve the Navier-Stokes equations, numerical analysis gets quite deep into pure/applied math -- there's no royal road there through the CS departments, and no royal road at all.
E.g., a good prerequisite for work in numerical analysis is a high end course in linear algebra, e.g., matrix norms, Gershgorin approximations; not many people teach or take such a course.
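For a taste of that material, a Gershgorin bound takes only a few lines (stdlib-only sketch): every eigenvalue of a square matrix lies in at least one disc centered at a diagonal entry with radius equal to the sum of the absolute off-diagonal entries in that row.

```python
# Gershgorin circle theorem: each eigenvalue of A lies in some disc
# centered at A[i][i] with radius sum_{j != i} |A[i][j]|.
def gershgorin_discs(A):
    n = len(A)
    return [(A[i][i], sum(abs(A[i][j]) for j in range(n) if j != i))
            for i in range(n)]

A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 0.5],
     [0.0, 0.5, 2.0]]
for center, radius in gershgorin_discs(A):
    print(f"disc: center {center}, radius {radius}")
# Since this A is symmetric, its eigenvalues are real and lie in the
# union of the intervals [center - radius, center + radius], i.e. [1.5, 5.0].
```

That kind of cheap, rigorous bound is exactly the sort of thing a high-end linear algebra course gives you and a typical intro ML course does not.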
E.g., polynomial curve fitting can quickly lead people to solving a system of statistical normal equations with the notoriously ill-conditioned Hilbert matrix. But with some orthogonal polynomial approaches, you can get around the numerical problems right away. I did that once for some profs at Georgetown U.; I'd learned how from consulting in applied math and numerical analysis.
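A stdlib-only sketch of that ill-conditioning (this demonstrates the problem, not the orthogonal-polynomial fix): solve H x = b for the 8x8 Hilbert matrix in floating point and in exact rational arithmetic, with b chosen so the true solution is all ones.

```python
# Ill-conditioning of the Hilbert matrix: float vs exact rational solve.
from fractions import Fraction

def solve(A, b):
    """Plain Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

n = 8
H_exact = [[Fraction(1, i + j + 1) for j in range(n)] for i in range(n)]
H_float = [[float(v) for v in row] for row in H_exact]
b_exact = [sum(row) for row in H_exact]        # makes the true solution all ones
b_float = [float(v) for v in b_exact]

x_float = solve(H_float, b_float)
x_exact = solve(H_exact, b_exact)
print(max(abs(v - 1) for v in x_float))   # noticeable error, purely from conditioning
print(max(abs(v - 1) for v in x_exact))   # exactly 0 with rational arithmetic
```

The float error comes entirely from the conditioning of H, not from any bug; reformulating the fit in an orthogonal basis is what avoids forming such a system in the first place.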
For more, the days of ill-conditioned matrices led M. Newman to multiply the equations by large powers of 10 so that he had all whole numbers. Then, for each prime in a list of prime numbers, he solved the equations in the field of the integers modulo that prime. Right: for the multiplicative inverse, use the Euclidean greatest common divisor algorithm. Cute. Then use the Chinese remainder theorem to glue the answers together as quotients of multiple-precision whole numbers. Ah, numerical analysis!
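A toy sketch of that recipe (handling only integer solutions; Newman's real use recovered rational solutions, an extra step omitted here): solve the system modulo several primes, then glue the residues with the Chinese remainder theorem. Needs Python 3.8+ for `pow(x, -1, p)`.

```python
# Solve an integer linear system exactly: per-prime modular solves + CRT.
from math import prod

def solve_mod(A, b, p):
    """Gauss-Jordan elimination over Z/p (assumes a nonzero pivot exists mod p)."""
    n = len(A)
    M = [[v % p for v in row] + [bi % p] for row, bi in zip(A, b)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] % p)
        M[col], M[piv] = M[piv], M[col]
        inv = pow(M[col][col], -1, p)          # inverse via the extended Euclid
        M[col] = [v * inv % p for v in M[col]]
        for r in range(n):
            if r != col and M[r][col]:
                f = M[r][col]
                M[r] = [(vr - f * vc) % p for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

def crt(residues, primes):
    """Combine x = r_i (mod p_i) into one x modulo the product of the primes."""
    N = prod(primes)
    x = 0
    for r, p in zip(residues, primes):
        Ni = N // p
        x += r * Ni * pow(Ni, -1, p)
    return x % N

# System with integer solution x = (3, -2).
A = [[2, 1], [1, 3]]
b = [4, -3]
primes = [101, 103, 107]
per_prime = [solve_mod(A, b, p) for p in primes]
N = prod(primes)
sol = [crt([s[i] for s in per_prime], primes) for i in range(2)]
sol = [v if v <= N // 2 else v - N for v in sol]   # lift residues to signed integers
print(sol)  # [3, -2]
```

No rounding error anywhere: every arithmetic step is exact, which was the whole point of the trick.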
Guys, that's not computer science, no matter how many nVidia GPUs they keep busy.
I like your style. I am really glad that you have the resources to go it alone, against the zeitgeist.
I don't think any big VC will take you up on your advice though. VCs don't work in a vacuum; they need to exit at some point, and they need something that is sexy to sell at that point. AI/ML is sexy, and will be for at least the next 5-10 years.
BTW, my advice to you is to be careful that the zeitgeist does not get you. Even when you want to hire a dev-ops guy, you need to sell him a vision. And if the vision is "we don't do sexy ML, we do dull statistics", you will not be getting the best people.
Just noticed you are being downvoted. That is the zeitgeist getting to you, my friend. Think how many good employees you are ditching :)
We just left the AI Spring of Hope. Now we are in the Summer of Heat. Soon will be the Fall of Failure and, then, once again, maybe this time a real ice age, an AI Winter.
There is old history here: (A) Sure, if we count self-driving cars as AI, then that's new software, laser range finders, etc. Okay. (B) Otherwise, what business wants from AI/ML is solving essentially just the existing problems of their existing business. So, they have some business operations and want to do better.
Well, that situation goes way back to the field of operations research. From about 1950-1970, that was hot stuff, especially for the US DoD.
Well, there is some value there. But there was also a lot of hype. To get the value, you usually have to select problems carefully and then do good work. Also, back then, there were big bottlenecks in (i) understanding the relevant applied math, (ii) finding appropriate applications, (iii) gathering the input data, (iv) getting the software written and running, and (v) being able to afford the computer time. So, yes, there were some successes; later, I had some. But with the hype, which led to low-quality efforts, there was also a lot of failure. Net, soon the phrase went "operations research is dead". Similarly for applied statistics. Similarly, quite broadly, for applications of math to business operations. There were, and are, valuable, narrow applications, e.g., RSA encryption, but broadly the applied math flopped.
Statistics? Well, it's continued to get used where it's really important -- e.g., industrial quality control, bio-medical challenges, and experimental design, e.g., in research in agriculture.
Okay, now AI/ML are being hyped for applications to business. Uh, as above, we've been there and done that. Valuable? It can be. Easy? Much easier now than decades ago but still, usually not really easy.
Problems with AI/ML now, remembering that mostly we're still talking about solving business problems, e.g., much as in the 100-year-old Taylor time and motion studies in assembly lines, much as in logistics, transportation, inventory, facility location, ad targeting, marketing: (i) The successful, new tools of AI/ML have darned narrow successes and make up only the 100 minus 99 44/100% of the total of promising tools available. (ii) Since the 99 44/100% of tools still are not doing well enough to get excited over (right, there is some use of Linpack, CPLEX, SPSS, SAS, R, ODE solvers, statistical hypothesis tests, linear multivariate statistics, time series analysis, etc.), getting all excited by the AI/ML in the remaining 100 minus 99 44/100% is a wild overshoot.
The overshoot is also a waste.
Here I'm just telling business guys tempted to spend money on what are essentially applied math/computing projects to save/make money in their businesses that they would do well, actually much better, just to review their courses in linear programming, production and operations, applied math of marketing, and statistics they got in their BS/MBA B-school studies. That material was rock solid, nicely polished, well justified, nicely balanced, prudent, not awash in hype, and from a long track record of successful real projects that were quite valuable. There's a lot more than what was covered in the BS/MBA B-school programs, but for a consumer introduction that material was quite good -- it was intended to be by the degree accrediting groups.
> I don't think any big VC will take you up on your advice though.
My view of VCs is that they are puppets on the ends of strings held by the limited partners (LPs), who still want to think like commercial bankers advised by CPAs. So, they want to invest capital in assets. They believe that they have bent over backwards far enough to regard traction, significant and growing rapidly, as a necessary and sufficient asset, but anything before such traction and a dime won't cover a 10 cent cup of coffee. It doesn't matter what I say, explain, reference, etc.: the LPs won't let the VCs pay attention.
> VCs don't work in vacuum, they need to exit at one point, and they need something that is sexy to sell at that point. AI/ML is sexy, and will be for at least the next 5-10 years.
I believe your "5-10" is too long (is that the expected time to exit, or the jail term for fraud? sarcasm here!). Generally, once projects start to fail, the bad vibes spread very quickly.
For real business uses, in general (not just very narrow niches), AI/ML is in hype mode. The time to the big flop is based on the hype, not the technology of AI/ML. But we've seen hype rise and fall before. So, to estimate the time to flop, just look, first cut, at the history of hype.
The usual exit time for a VC investment and time to full liquidation of a VC fund is ballpark 8-15 years. Well, the hype will flop long before that. So, except for a fast flip, M&A, to someone with more money than brains, AI/ML, aging like butterflies or flowers in spring, will die too fast for the VCs.
For my startup, there's a crucial, technical core, based on my original applied math derivations (I've published in statistics, but the math for my startup is not statistics), but the rest just looks like a promising startup. The tech core provides crucial enabling of delivering the good results -- I can't think of any way to get the good results otherwise, and current, hot AI/ML is really just weak and/or the wrong stuff. E.g., what I derived just ain't convolutional neural networks trained with terabytes of data or anything like that. And the input data I'm using is nowhere near terabytes. If there are some good, practical, valuable, doable, real business applications for convolutional neural networks trained with terabytes of data, terrific. Then that's (heuristic, curve fitting) applied math application n + 1 for some huge n already on the shelves of the research libraries.
Or, look, the computer science departments can teach how to build at least the software side of computer systems, from the ROM BIOS for booting the thing up to complicated applications -- running a bank, a lot in medical records, maybe air traffic control. Fine.
But with AI/ML, the CS departments are reaching to solve the problems the applied math people have been working on for decades, maybe centuries. There the main issue is not how to program but WHAT to program. And to answer the WHAT for an applied math application, you'd darned well better be well trained in applied math. Well, nearly none of the CS profs are so trained, and that's a problem. Yes, the applied math applications will likely need a lot of computing, but it does not follow that the people to do the applied math are the CS people. Sorry 'bout that.
I'd like to hear the opinions of E. Cinlar at Princeton.
For my startup, I wrote the code. It's 24,000 programming language statements in Visual Basic .NET Framework 4 in 100,000 lines of typing. It all is like I originally envisioned and all runs, apparently correctly (although I want to do some more testing). The code is nicely documented and plenty efficient enough for production. Only a little of the code has the core applied math, and the rest is on the simple side of routine for a Web site that interacts with users.
So, really, so far, I don't need to recruit and hire software developers. And if they would need to know about the crucial technical core, then it's some original applied math complete with theorems and proofs with some advanced pure/applied math prerequisites and not AI/ML, or statistics.
For judging whether people will be good for my startup, I believe I can do that; if they are highly impressed with AI/ML, then that's a letter grade down!
For the status: last night I wrote two simple scripts to help me understand where my disk space is being used; my old scripts got sick on file sizes > 2^32. Then, with the output from those two scripts, I'll move around some data and do an incremental backup. Then I'll check about 10 times to be careful and then restore my main boot partition from an old copy made by NTBACKUP. I need to restore because I had a hardware problem that corrupted some of the code on my boot partition. That's the nature of the work at present -- e.g., system administration mud wrestling.
Then back to testing my software on the way to going live, ASAP. Then some more routine steps, and then going live.
It's been a long time: The parts of the work unique to me have all been fast, fun, and easy. But there were independent, exogenous interruptions -- maybe now I've swatted down nearly all of those.
I don't anticipate any equity funding. Since I'm a solo founder, by the time the VCs would write me a check, I'll have so much traction, and, thus, revenue, that I would have no good reason to accept their check -- no, 1/3rd of my business is not for sale yet! Net, for my startup, the LPs want their VCs to wait too long.
The math: enough to make it mathematical, complete with theorems and proofs. Examples are in mathematical and applied statistics, optimization, applied probability, stochastic processes, control theory, and more.
There is a lot in just statistical hypothesis testing -- e.g., Kolmogorov-Smirnov, with a really cute stochastic process foundation. Then there is a small ocean of non-parametric tests.
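For concreteness, the one-sample Kolmogorov-Smirnov statistic itself is a few lines (toy, stdlib-only sketch): the largest vertical gap between the empirical CDF and the hypothesized CDF.

```python
# One-sample KS statistic: D_n = sup_x |F_n(x) - F(x)|, computed by
# comparing the empirical CDF just before and just after each sample point.
def ks_statistic(sample, cdf):
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        d = max(d, abs((i + 1) / n - cdf(x)), abs(i / n - cdf(x)))
    return d

uniform_cdf = lambda x: min(max(x, 0.0), 1.0)   # CDF of Uniform(0, 1)
print(ks_statistic([0.1, 0.2, 0.3, 0.4, 0.5], uniform_cdf))  # 0.5 for this sample
```

The cute part is the theory behind it (the limiting distribution of D_n is distribution-free), not the computation.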
Or suppose we want a statistical hypothesis test that is both multi-dimensional and distribution-free and, via Ulam's notion of tightness, not trivial. Well, we'd like the most powerful such test -- how to do that? There's essentially nothing in current AI/ML that is new and would work for such statistics.
For a 50,000-foot view, pure and applied math are huge, old, deep fields also awash in applications. What's in AI/ML so far is tiny in comparison.
While I agree that it is vital for a practitioner to know the mathematical basis behind what they're doing in order to maximize their chances of success, I find your current argument lacks substance.
You argue that there is a vast amount of solved problems in tomes of "pure" mathematics. But this doesn't mean that any of those solutions are relevant and/or necessary to the problems people are working on.
Your "statistical test" argument is misleading. Machine learning techniques aren't designed to form such statistics, and judging them by their ability to do so willfully disregards the real world success many companies have found by using ML/AI to solve their problems.
Certainly, there are challenges that are waiting to have their solutions implemented from a rediscovered manuscript, but listing impressive-sounding statistical terminology doesn't demonstrate lack of value for machine learning/AI techniques. Furthermore, it doesn't even demonstrate value for the "pure" techniques you list. To clarify, I'm talking about real-world value, not just theoretical value (and again, I'm not arguing that such techniques don't have value, just that your argument isn't doing a good job communicating how such techniques have orders of magnitude more value than ML/AI).
AI/ML now is too often just heuristic curve fitting. Okay, with good training data and then testing data, you can get some utility that way. So, if that's the best you can do in the context, okay. An old secret of regression model building: that idea of cutting the data in half, fitting with one half, and testing with the other was common and appropriate.
But we'd like to do better. Some of what is in statistics and applied math more generally will let us do better.
But the idea of dividing the data into training and testing sets is old. There's much more in the literature on how to build and evaluate models. E.g., there's analysis of variance and categorical data analysis (e.g., log-linear).
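That cut-the-data-in-half practice, in a stdlib-only toy sketch: fit a least-squares line on one half and measure the error on the held-out half.

```python
# Holdout evaluation of a closed-form simple linear regression.
def fit_line(xs, ys):
    """Least-squares fit y = slope * x + intercept; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def mse(xs, ys, slope, intercept):
    return sum((slope * x + intercept - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)

# Toy data: a line plus a deterministic +/- 0.1 wiggle.
data = [(x, 2 * x + 1 + (0.1 if x % 2 else -0.1)) for x in range(10)]
train, test = data[::2], data[1::2]            # cut the data in half
slope, intercept = fit_line(*zip(*train))
print(round(slope, 2), round(intercept, 2))
print(mse(*zip(*test), slope, intercept))       # held-out error, not training error
```

The held-out error is the honest number; the training error on `train` here is zero by construction, which is exactly the optimism the split guards against.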
E.g., for my startup, a guy gave a talk and said it can't be done: you encounter four problems. He was right about the four problems. And if you just look in elementary texts, yup, you do see the four problems. But I'd already seen all four and worked up some theorems and proofs to get around all of them, in the context of my real problem. The foundation of the theorems and proofs was mostly some advanced pure math. Without new theorems and proofs, I'd have been stuck like he was.
In an important sense, the pure math guys are right: they are looking for the big, important structure and powerful properties -- getting all the furniture into the room and turning on the lights so that one can see it (A. Wiles) -- and that can be darned powerful stuff in practice.
> You argue that there is a vast amount of solved problems in tomes of "pure" mathematics. But this doesn't mean that any of those solutions are relevant and/or necessary to the problems people are working on.
Flour is a raw material. Some people can use it to make a fantastic Sacher Torte. Pure math can be regarded as a raw material. Some people can use it to get new, powerful, valuable results for real applications in practice.
> orders of magnitude more value than ML/AI
Those techniques are narrow and weak. The stuff long on the shelves of the libraries is much more broadly applicable and usually much more powerful. I've made a lot of applications of applied math, and in not one of them would the AI/ML discussed these days have been better; it would almost never even have been competitive.
You’ve told us vague things about wide areas of mathematics, but you haven’t told us what problem your start-up is solving using these branches of mathematics or how it is a better allocation of investor dollars than focusing on ML.
I’ve worked on a wide range of applied mathematics problems in machine learning, computer vision, signal / image processing, control theory, and applied statistics. Very few people outside of a mathematics department care at all about theorems, proofs, and "correctness" in the formal mathematical sense. I can count on one hand how many times I’ve needed to actually do a proof, and the number of people who have cared is smaller still. All anyone cares about is how well you can solve their problems. Sometimes advanced mathematics is the right tool for the job, but often it is not.
> You’ve told us vague things about wide areas of mathematics, but you haven’t told us what problem your start-up is solving using these branches of mathematics or how it is a better allocation of investor dollars than focusing on ML.
Easy: ML is quite narrow. More options yield the same or, usually, better investment results. Compared with the applied math on the shelves of the libraries, ML is nearly pinpoint narrow. In addition, too much of ML is heuristic curve fitting, short on mathematical guarantees. Also, where some real problems have features that justify some mathematical assumptions, ML is slow to exploit them.
In the workshop, kitchen, or lab, use the best tools you can. So don't limit yourself to just what you saw on a cooking show on TV.
My view is that mathematical proof is a fantastically powerful part of applying math. The core reason is that it lets you do some additional, maybe significantly original, derivations to get better solutions for real problems.
E.g., once I was in an AI group working on monitoring and managing server farms and networks. We got rivers of data on many variables at high data rates. There we wanted to detect problems ASAP in near real time.
Okay, necessarily, inescapably, we have two ways to be wrong -- false positives (claiming the system is sick when it's healthy) and false negatives (claiming the system is healthy when it's sick).
Okay, then, right away we are forced, like it or not, into the framework of statistical hypothesis tests continually applied. There, as is usual in hypothesis testing, we want to be able to select a false alarm rate and get that rate exactly. We'd like the most powerful test, as in the Neyman-Pearson lemma, but likely do not have enough data. Still, we want test power in some good sense: if we are willing to tolerate a higher false alarm rate, then we'd like a higher detection rate; when we get a detection, we'd like to know, as a measure of seriousness, the lowest false alarm rate at which the data would still yield a detection; at the least, we don't want a trivial test.
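To see the flavor of a distribution-free test with an exactly selectable false alarm rate, here is a toy sketch of my own (not the tests described in this thread): if a new reading is exchangeable with the healthy history, its rank among the historical readings is uniform, so thresholding the rank sets the false alarm rate exactly, with no distributional assumption at all.

```python
import random

# Toy distribution-free detector: flag a new reading as anomalous when its
# rank among the historical readings is extreme.  Under the null hypothesis
# (new reading exchangeable with continuous historical data), the rank is
# uniform, so the false alarm rate is set exactly -- with NO assumption
# about the distribution, which here is deliberately nasty (heavy-tailed).

def alarm(history, x_new, alpha):
    k = sum(1 for h in history if h >= x_new)     # values at or above x_new
    return (k + 1) / (len(history) + 1) <= alpha  # extreme rank => alarm

random.seed(0)
history = [random.lognormvariate(0, 2) for _ in range(999)]

# Feed the detector fresh *healthy* readings and measure the alarm rate.
trials = 10_000
alarms = sum(alarm(history, random.lognormvariate(0, 2), 0.05)
             for _ in range(trials))
rate = alarms / trials
print(rate)  # close to the chosen 0.05 despite the heavy tails
```

With 999 historical values, the threshold (k + 1)/1000 <= 0.05 fires for the top 50 of the 1000 possible ranks, so the false alarm rate is exactly 50/1000 = 0.05 under the null.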
For this data, no way can we know the probability distribution of the data under the null hypothesis (the system is healthy). This is because even the univariate data is commonly a total mess, far worse than was long common in statistics (U. Grenander explained this to me in his office at Brown). And, of course, with so much data on so many variables, we should be doing multi-variate testing.
So, net, we need tests that are both multi-variate and distribution-free. Go look for those in E. Lehmann, etc., and you won't find any.
So, I invented some. Then, how the heck to adjust the false alarm rate? That was a problem in applied probability that needed new theorems and proofs. So, I did those. For those I used measure theory, measure-preserving transformations from ergodic theory, group theory from abstract algebra, and more. Yup, theorems and proofs.
Otherwise I would not have known what the heck I had, what the properties were, what the false alarm rates were, etc. Then do my tests have any power? I didn't have data enough to apply Neyman-Pearson, but I found a way. Are my tests trivial? I used a result of S. Ulam and stated and proved a theorem -- not trivial.
The work was intended to be practical. Likely and apparently the work remains the cat's meow, best in the world, for detection of problems never seen before -- behavioral monitoring, zero day monitoring.
Gee, just read that the last security attack on Target cost them $300 million. They needed better monitoring!
New theorems and proofs were crucial, and the work was intended to be practical.
A guy had a fast opportunity to do some marketing but, with the short time interval, had limited resources. He had a 0-1 linear programming formulation, 40,000 constraints and 600,000 variables for a small test case. Sure, no linear programming or integer linear programming package would have a chance.
So, I got his formulation, saw some special structure to exploit, did some non-linear duality derivations for some Lagrangian relaxation, found how to get some bounds, wrote some code, and in 905 seconds on a slow computer found feasible solutions guaranteed to be within 0.025% of optimality. Well, the derivations were essentially theorems and proofs. He'd tried a computer science approach, simulated annealing. He ran for days and then stopped with the best he had and with no idea how close he was to optimality. My work, from theorems and proofs, was much, MUCH better.
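The pattern, on a deliberately tiny toy 0-1 problem (my own made-up numbers, nothing to do with the marketing formulation): relax the complicating constraint into the objective with a multiplier, the inner maximization separates and becomes trivial, and minimizing the resulting bound over the multiplier certifies how close any feasible solution is to optimal.

```python
# Toy Lagrangian relaxation: maximize c.x subject to a.x <= b, x in {0,1}^n.
# Moving the constraint into the objective with multiplier lam >= 0 makes the
# inner maximization separable: take item i exactly when c[i] - lam*a[i] > 0.
c = [10, 7, 5, 4, 3]
a = [6, 5, 4, 3, 2]
b = 10

def dual_bound(lam):
    # Every lam >= 0 gives a valid upper bound on the 0-1 optimum.
    return lam * b + sum(max(0.0, ci - lam * ai) for ci, ai in zip(c, a))

# A quick feasible solution: greedy by value-to-weight ratio.
load, feasible_value = 0, 0
for ci, ai in sorted(zip(c, a), key=lambda p: p[0] / p[1], reverse=True):
    if load + ai <= b:
        load += ai
        feasible_value += ci

# Minimize the bound over a grid of multipliers.
best_bound = min(dual_bound(i / 100) for i in range(301))

print(feasible_value, round(best_bound, 2))
```

Here the greedy solution of value 13 is certified to be within best_bound - 13 of the true 0-1 optimum (which is 15 by enumeration) without ever enumerating; the real work in a serious instance is the derivation that exploits the special structure.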
The crucial core applied math of my startup is some theorems and proofs; I wouldn't believe that the mathematical operations would yield what I want otherwise.
Sure, a working applied mathematician doesn't always have to stir up new theorems and proofs, but at times that's darned helpful and valuable.
> In addition, too much of ML is .... short on math guarantees.
is utterly false. As I keep telling you, you need to upgrade :)
In fact I will level the same exact criticism at standard stats. The NP lemma is fantastic math but of little practical importance because you rarely ever know the densities. Most of the time 'test of hypothesis' is a bad joke catering to a contrived scenario. Real problems are rarely about 'is it H0 or H1'. Estimation is typically more useful. Stats has long suffered from its fetish for parametric models, asymptotic normality, asymptotic guarantees and, what I feel is its greatest deficiency, the focus on parameter estimation rather than guarantees over prediction. Parameters and parametric models are a convenient fiction; nobody has seen them, or will ever see them. What's all the fuss about estimating a fictional object to great accuracy, under strong assumptions, in infinite time?
Made sense a century ago though. If their purported utility is in aiding prediction, then why not go after the real deal, that is, non-asymptotic guarantees on prediction in a distribution free / nonparametric setting. That is what ML does. Please don't characterize ML by bad examples and work-in-progress stuff. I do know nonparametric stats exist, but that's not the face of stats that people get to see, and even in those cases the focus is usually on accuracy of estimating a functional.
I was deliberately harsh on stats in my comment to offset some of the wrong things you mentioned about stats, but the relation between ML and stats is really not adversarial. I see ML as a course correction for stats: focus on prediction guarantees, plus some added muscle in two areas, (i) algorithms for scaling and (ii) optimization. In any case I know where both tribes hide their dirty laundry :)
That said, I strongly agree that ML is indeed applied math: a mix of probability, optimization, functional analysis, and algorithms/data structures.
But hypothesis testing remains important because in some contexts with too little data that's about the best you can do. For parametric tests, there are some cases where that's okay.
The asymptotic stuff is mostly okay: Can't figure out precisely what the darned thing will do for case n so let n go to infinity and maybe can see what happens. Then say, "for large n, this is about what happens" -- crude but better than nothing.
The central limit theorem and more says that the Gaussian is really important. Okay. Done. We know that. But, right, for too long too much of statistics made nearly a religion out of the Gaussian. Bummer.
Today happened to see the discussion panel at Stanford at
They are laughing at ML: ML discovered statistics like Columbus discovered America. Columbus didn't discover America, since when he landed millions of people were already living there. And when CS AI/ML discovered statistics, there was already a huge field there.
Moreover, the situation is reversed: Columbus brought better ship, etc. technology to the Americas than the Americas had, but CS AI/ML is bringing nearly all highly inferior technology to statistics, pure/applied math, etc.
CS AI/ML is nearly all new labels for adulterated old wine.
So, take some data on two variables, plot it on an X-Y graph, fit a straight line, get the coefficients of the line as in first year algebra, and, presto, bingo, "The AI revolution is here!!!!", CS has had a machine "learn" the coefficients, and they started with "training" data, that is, they trained the machine!!!!!! Where can I get one of those airline seat barf bags? Upchuck time!
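For the record, here is the entire "revolution" in question -- the first-year-algebra closed form for the least-squares line, on made-up numbers:

```python
# Ordinary least-squares line fit, closed form -- the "learning" in question.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx
print(round(slope, 3), round(intercept, 3))  # slope ~ 1.99, intercept ~ 0.05
```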
Uh, I have a pretty good professional library; much of my house is lined with bookshelves. The main topics are pure/applied math. Some of the books relevant to statistics include
(with TeX markup)
Alexander M.\ Mood, Franklin A.\ Graybill,
and Duane C.\ Boes, {\it Introduction to
the Theory of Statistics, Third
Edition,\/} McGraw-Hill, New York, 1974.\
\
N.\ R.\ Draper and H.\ Smith, {\it Applied
Regression Analysis,\/} John Wiley and
Sons, New York, 1968.\ \
C.\ Radhakrishna Rao, {\it Linear
Statistical Inference and Its
Applications:\ \ Second Edition,\/} ISBN
0-471-70823-2, John Wiley and Sons, New
York, 1967.\ \
Henry Scheff\'e, {\it Analysis of
Variance,\/} John Wiley and Sons, New
York, 1967.\ \
Yvonne M.\ M.\ Bishop, Stephen E.\
Fienberg, Paul W.\ Holland, {\it Discrete
Multivariate Analysis:\ \ Theory and
Practice,\/} ISBN 0-262-52040-0, MIT
Press, Cambridge, Massachusetts, 1979.\ \
Stephen E.\ Fienberg, {\it The Analysis of
Cross-Classified Data,\/} ISBN
0-262-06063-9, MIT Press, Cambridge,
Massachusetts, 1979.\ \
Leo Breiman, Jerome H.\ Friedman, Richard
A.\ Olshen, Charles J.\ Stone, {\it
Classification and Regression Trees,\/}
ISBN 0-534-98054-6, Wadsworth \&
Brooks/Cole, Pacific Grove, California,
1984.\ \
R.\ B.\ Blackman and J.\ W.\ Tukey, {\it
The Measurement of Power Spectra:\ \ From
the Point of View of Communications
Engineering,\/} Dover, New York, 1959.\ \
William W.\ Cooley and Paul R.\ Lohnes,
{\it Multivariate Data Analysis,\/} John
Wiley and Sons, New York, 1971.\ \
Maurice M.\ Tatsuoka, {\it Multivariate
Analysis: Techniques for Educational and
Psychological Research,\/} John Wiley and
Sons, 1971.\ \
E.\ L.\ Lehmann, {\it Testing Statistical
Hypotheses,\/} John Wiley, New York,
1959.\ \
E.\ L.\ Lehmann, {\it Nonparametrics:
Statistical Methods Based on Ranks,\/}
ISBN 0-8162-4994-6, Holden-Day, San
Francisco, 1975.\ \
Jaroslav H\'ajek and Zbyn\v ek \v Sid\'ak,
{\it Theory of Rank Tests,\/} Academia,
Prague, 1967.\ \
Sidney Siegel, {\it Nonparametric
Statistics for the Behavioral Sciences,\/}
McGraw-Hill, New York, 1956.\ \
Donald F.\ Morrison, {\it Multivariate
Statistical Methods: Second Edition,\/}
ISBN 0-07-043186-8, McGraw-Hill, New York,
1976.\ \
Harry H.\ Harman, {\it Modern Factor
Analysis: Second Edition, Revised,\/} The
University of Chicago Press, Chicago,
1967.\ \
George E.\ P.\ Box and Gwilym M.\ Jenkins,
{\it Time Series Analysis --- Forecasting
and Control: Revised Edition,\/} ISBN
0-8162-1104-3, Holden-Day, San Francisco,
1976.\ \
Jean-Ren\'e Barra, {\it Mathematical Basis
of Statistics}, ISBN 0-12-079240-0,
Academic Press, New York, 1981.\ \
So, what really significant material is recent ML adding to this quite old material? How much of the good stuff in that material has CS AI/ML even read and understood so far? Uh, what am I missing? I can think of a little but not much.
CS AI/ML going for applications of statistics is a lot like Stanley Tools, long in the hammer, screwdriver, saw, drill, and nail gun business, going into the residential construction business, building and selling new houses and claiming that they have something new! That is, because they make good nail guns they conclude that they are the best at building houses. They know nothing of excavation, masonry, framing carpentry, windows, doors, roofing, plumbing, electrical, HVAC, dry wall, painting, circular stairs, walnut paneling, landscaping, or even the building codes, but they've got some good nail guns!
The math I derived and published in multi-dimensional, distribution-free anomaly detection has next to nothing to do with any of those books above; instead, I did original work and drew from ergodic theory, abstract algebra, and a classic result of Ulam. For the math for my startup, it's also original and even farther from those books above. In both cases, the core results are presented as theorems with proofs. Such math is mostly not what I'm seeing in CS AI/ML.
I DID see some cute, recent work in analysis, maybe from ML, on the over fitting problem; I don't need that work for my startup, but I do intend to go back and read that work.
My main points are:
(1) If AI/ML have some applied math tools that are solid, good, and new, terrific. Then those cases will add to the many thousands of such tools already on the shelves of the research libraries.
(2) The most important paradigm for a solid future for such tools is new, solid, correct, and powerful theorems and proofs building on solid material in math. That is, we want better stuff than is already in the libraries.
Small correction to what I said:
srean> I was deliberately harsh on stats in my comment to offset some of the wrong things you mentioned about stats.
Replace 'stats' with ML in my comment above. You are spot on regarding the stats part; regarding ML, not so much. In fact I have great respect for you as an applied mathematician and frequently learn something from these exchanges of comments.
graycat> Moreover, the situation is reversed: Columbus brought better ship, etc. technology to the Americas than the Americas had, but CS AI/ML is bringing nearly all highly inferior technology to statistics, pure/applied math, etc.
This is unfair to the point of being absurd. Have you looked at stats software? Or compute infrastructure built by statisticians? I can say more, but that would be below the belt. CS/ML has infused new blood in this area, augmenting the capabilities of what can be done on a machine (or cloud) by 3 orders of magnitude or more. Part of it is advances in CS and software systems, but that is not all. One of the other parts is advances in algorithms. I am sure you would not scoff at ML-pedigreed algorithms that run way faster than CPLEX. Many would not have the context, but I am sure you do. Of course these algorithms are not for any LP, but for a class of huge LPs that show up in graphical models. These belief propagation based algorithms outperform not only vanilla simplex algorithms but also the state of the art ones on these problems. Then there is culture. ML algorithms are more likely to be open sourced so that people try them, criticize them, and that is how progress is made. In stats, locking it behind something proprietary is the norm. R is an exception, but as a piece of software I would keep it light years away from the money.
Let me move on.
graycat> The asymptotic stuff is mostly okay: Can't figure out precisely what the darned thing will do for case n so let n go to infinity and maybe can see what happens. Then say, "for large n, this is about what happens" -- crude but better than nothing.
Yeah, n is indeed very large in most data sets, but p, the dimension, is often larger. All hell breaks loose in this situation when you go by the classic theory of statistical inference.
This quandary applies to deep neural nets also. The number of parameters way exceeds the number of training data points, yet the model generalizes well to unseen data. WTF is going on!
Classic prob/stats have no answers.
Forget classic, even modern prob stats is struggling to explain this. A good fight is being fought to characterize their behavior. New mathematical arguments are being developed but this is clearly a work in progress, an open problem.
You have a chip on your shoulder of sorts regarding ML. That's fine, nothing wrong with that. What is wrong, however, is when you mischaracterize it because you are not reading the right material. You have to read those books and journal publications that are the moral equivalent of the prob/stats library you have.
It's not fair to compare two languages when you quote from corny porn in one but high literature from the other.
May I say Vapnik again :) And check out the main publications from COLT (computational learning theory), ICML, NIPS etc.
I am sure you have heard of the multi-armed bandit problem. Gittins index and all that. Now solve it for the distribution-free case. Even that leaves something to be desired, because the rewards may not be governed by a stochastic process. Give guarantees for arbitrary but bounded reward sequences. The stats literature had no answer for these cases. These are just a glimpse of the tools and results that have come from ML.
You observe some noisy entries from a huge matrix that is low-rank. Now fill the unseen entries and give guarantees on performance. ML has the tools here.
A biggie is the analysis of causality, thanks to Judea Pearl. Some still don't get that conditioning and causality are fundamentally different. Just regressing Y over X will not cut it. ML has the tools here.
Anyways...
Oh Jesus! Not ANOVA again. It's a cute toy, that's all. The assumptions it needs are rarely met in the datasets we want to analyze today. Visionaries like Tukey had figured that out 3 decades ago.
Why do statisticians to this day keep using parametric tools in situations where we are drowning in data?
I agree that it's not entirely the fault of statistics that people do the above. Stats really has to shed a lot of baggage that made sense a century ago but does not make sense now.
Given your experience I am sure you know that things are rarely ever Gaussian. The central limit theorem often does not apply in its popular form because of distributions with ginormous variance. Why use tools based on that?
Classic parameter estimation guarantees are all fine in the toy 'spherical cow' world, but where in stats are the non-asymptotic guarantees on prediction performance in a distribution free setting? This is precisely the raison d'être for ML. Not that there are no statisticians who have gone this route (Dawid would be one who needs mention), but way too few.
ML does indeed have a lot to offer. To be cute I would leave it at what I said before:
> non-asymptotic guarantees on prediction in a distribution free / nonparametric setting. That is what ML does.
The newer developments aim at situations where the number of parameters in the model far exceeds the data. What guarantees can you give in such situations? Surprisingly quite a bit; contrary to what textbooks have said over the century, you can give guarantees. The algorithms work magically well, though of course you need some (realistic) assumptions. ML is an applied science, so it borrows from and (crucially) adds to mathematics all the time. This particular line of work borrows a lot from the geometry of high dimensional random convex bodies, and Banach spaces.
The most exciting part I see is the melding of the good parts of ML and stats and a good deal of communication across the Stats and ML community lines. They collaborate, they publish in the same conferences, and that's just freaking fantastic. This is what we need, not stupid fights over which is better or whether it is applied science or not.
Parting word: don't get your idea of ML from ML porn and spread untruths like,
With good stuff in ML you are describing a world I've seen no solid evidence of first hand.
Maybe that's to be expected: Long I did well learning math in part because I picked some really good books; due to some lucky parts of my background, that was easy to do. Then I tried this in statistics and picked a junk book -- eventually figured that out.
I didn't have much respect for or progress in statistics before I had a high end background in probability from a star student of E. Cinlar at Princeton from Neveu's book, Neveu a star student of Loeve at Berkeley. Then in stochastic processes I got some more respect from work of J. Doob (prof of Halmos who then became an assistant to von Neumann, etc.) and the material on probability in the back of the Halmos, Measure Theory. The Halmos work on sufficient statistics and unbiased estimation also deserved respect.
I've seen no such quality from ML. You suggest that there is such quality out there, but what I saw, e.g., Ng's on-line lectures on ML, the idea of convolutional neural networks for image recognition, doesn't look nearly as solid as what I did in probability or nearly as interesting as some of what you point to.
I also found a Web site of three profs in bio-stat, read some of the material of one of them, Jeff Leek, looked at the TOC and some of the lectures in his Coursera on-line course, some of the research problems he's working on, some of what he said about ML, what he said about cross-validation and bootstrapping in model building, etc., and concluded that with ML still I was not missing much.
I did look at
Trevor Hastie, Robert Tibshirani, and Jerome Friedman, {\it The Elements of Statistical Learning:\ \ Data Mining, Inference, and Prediction, Second Edition,\/} Springer, 2008.\ \
I found the interesting looking work on over fitting and otherwise was not impressed at all.
Yes, you have more references I have yet to pursue, but what I did yesterday and other times was fast and easy and seemed to touch on some of the most respected work. But, yes, it took me some real digging and a lot of luck to find the really good stuff in probability, stochastic processes, optimization, etc., so maybe I'd need a lot of digging to find the good stuff in ML. Since ML is a more recent field, likely it has yet to receive the polish of Tukey, Halmos, Loeve, Neveu, Doob, Dynkin, Shreve, even back to Kolmogorov, etc.
I do wonder, then, what the heck the ML people are driving at? Really it looks like they are growing around the edges of old regression analysis model building, trying to fit to data (training data) and then use the fit to make predictions.
Okay. But I'm not going to take 5 TB of pictures and try to recognize cats, dogs, men, women, etc. with some version of curve fitting with many thousands of variables -- that's just a long way from anything I'm doing, and more generally I'm reluctant to believe that any technique that needs that much data will be more than a niche. E.g., for intelligence, kittens, puppies, and toddlers do really well with much less data. E.g., a kitten who jumps up on the kitchen counter, walks over to the stove, and gets a hot foot learns well from just that one data point. Moreover he starts to find causality and generalize.
Really, my view is that for getting more intelligence out of data, we should concentrate on cases where there likely is some version of causality and have our applied math look for and then use that. I didn't see ML doing that. My brother and my wife got their Ph.D. degrees in the social sciences, and there they were quite careful about causality.
For more parameters than data points, okay: Sure, in some non-linear cases, that might happen. And even in linear cases, if you just open your eyes a little, you can see that and, thus, get around over fitting in some cases. E.g., regression is really a perpendicular projection onto a linear subspace spanned by the vectors of the input data. Okay. Now just enter the input data twice. So, then the data still spans the same linear subspace and we get the same perpendicular projections. If we write out the regression normal equations, then the variance-covariance matrix will be singular and won't have an inverse. So, big heart burn. But heart burn is premature: The normal equations still have a solution, just infinitely many. So, the regression coefficients aren't unique. BUT, and apparently not many people have noticed this, all the predicted values from the given data still are unique, that is, are just the same projection. That is, take any of the infinitely many solutions you want from the normal equations and get the same predicted values from the given data (maybe not new data). That observation solves some of over fitting but not all. But for your remark, that IS a case of working with too many parameters.
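A quick numerical check of that observation (made-up data): enter the same predictor twice, pick two different solutions of the now-singular normal equations, and the fitted values come out identical.

```python
# Enter the same predictor twice: model y ~ b0 + b1*x + b2*x.  The normal
# equations are singular and (b1, b2) is not unique, but every solution has
# the same b1 + b2 and projects y onto the same column space, so the
# fitted values are identical.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 2.0, 5.0]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
b0 = my - slope * mx

def fitted(b1, b2):
    return [b0 + b1 * x + b2 * x for x in xs]

sol_a = fitted(slope, 0.0)                # one solution of the normal equations
sol_b = fitted(slope / 3, 2 * slope / 3)  # another solution; same b1 + b2
same = all(abs(u - v) < 1e-12 for u, v in zip(sol_a, sol_b))
print(same)  # True: predicted values unique even though coefficients are not
```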
For more, looking at Jeff Leek on regression model building, he quickly goes into cross validation and bootstrap techniques. Okay, cross validation looks like intuitive, heuristic, Medieval stirring of pots of rat tails and frog eyes to me, and bootstrap needs some theorems (I believe I have derived some -- I know, look at Bradley Efron's work, which I've done a little, but so far I prefer my derivations) to justify and explain what is going on, but roughly Leek is correct, and I didn't see even that much care and concern in the ML materials.
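For concreteness, the pot-stirring in question is mechanically simple -- a toy sketch of k-fold cross-validation on my own made-up data, choosing between a constant model and a line by held-out squared error:

```python
# Minimal k-fold cross-validation: score two competing models by squared
# error on held-out folds and keep whichever predicts unseen points better.
xs = list(range(12))
ys = [2.0 * x + 1.0 + (-1) ** x * 0.5 for x in xs]  # a line plus small wiggle

def fit_line(px, py):
    n = len(px)
    mx, my = sum(px) / n, sum(py) / n
    s = sum((x - mx) * (y - my) for x, y in zip(px, py)) \
        / sum((x - mx) ** 2 for x in px)
    return lambda x, a=my - s * mx, b=s: a + b * x

def fit_const(px, py):
    m = sum(py) / len(py)            # best constant is the training mean
    return lambda x, m=m: m

def cv_error(fit, k=4):
    err = 0.0
    for fold in range(k):            # hold out every k-th point in turn
        train = [(x, y) for x, y in zip(xs, ys) if x % k != fold]
        held = [(x, y) for x, y in zip(xs, ys) if x % k == fold]
        model = fit([x for x, _ in train], [y for _, y in train])
        err += sum((model(x) - y) ** 2 for x, y in held)
    return err / len(xs)

print(cv_error(fit_line) < cv_error(fit_const))  # line wins on held-out data
```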
If ML is still trying to milk some more utility out of the edges of simple, old regression, that is, fitting to data and using the fits to predict, then I'm not much interested: Often regression works really well; where it doesn't, and IMHO over half the time it doesn't, then I'm not going to strain to squeeze a little more out of that curve fitting paradigm, at least not now.
To me, curve fitting to build predictive models is nothing like all that can be done with statistics and only a tiny drop of what can be done with applied math. E.g., where is the good science from simple, linear equations? From Newton, Maxwell, Einstein? Nope, none of those. Where did they make progress with fitting linear equations? Gee, Kepler used ellipses and Newton drew from those.
For fitting, the big success was Ptolemy's epi-cycles, and that still was just fitting and not causality or real science or physics.
Moreover, now I'm concentrating on my startup -- I've long since derived the math, and I see nothing in ML to even hint that their work could improve on mine. Sure, when my startup is done, I'll take on new interests. But curve fitting? I doubt it! Real AI? Maybe. More likely physics!
Sure, maybe some day there will be a Kolmogorov, Doob, etc. to do terrific work in ML or new frontiers of statistics and/or applied math and a Loeve/Neveu to do a polished job writing up the work. Maybe.
Then there's a credibility problem: From what I've seen, nearly none of the computer science profs working in AI/ML have what Jeff Leek called "math chops" that promise they can do really good research to move statistics ahead.
I'd have a sign over the door reading "None enter without a proof that there are no countably infinite sigma algebras."! That will select the people with good "math chops".
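For anyone stopped at that door, a sketch of the standard argument, from memory (the one nontrivial step, extracting the disjoint sequence, takes a little work): Suppose a sigma algebra is infinite. Then it contains an infinite sequence B_1, B_2, ... of pairwise disjoint, nonempty sets. For each set S of positive integers, the union of the B_n for n in S is in the sigma algebra, and distinct S give distinct unions. So the sigma algebra has at least the cardinality of the set of all subsets of the positive integers, that is, it is uncountable. So a sigma algebra is finite or uncountable, never countably infinite.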
From your claims, maybe there actually is some good work in ML but hard to find; as I said at the start, hard to find would not be too surprising, since early in my career finding the good stuff in probability and optimization was really hard. Even in the statistics community, the good people are hard to find. I can respect E. Dynkin, E. Cinlar, J. Tukey, L. Breiman, D. Brillinger, and a few more, but for a beginner in statistics those people will be difficult to find, understand, appreciate, respect, etc. Maybe something similar is the case in ML now.
In statistics taken broadly, there is a guy from France I know and respect; he was a student of Choquet, right, a member of Bourbaki!
But, again, for ML to have my respect is tough if they are just still trying to do curve fitting.
Really do appreciate your long responses, so thank you and keep doing that.
More parameters than data points shows up a lot these days, and not because of non-linearity. Now it's common to measure everything you can about an object and worry later about what is useful. So the situation is few objects but gobs and gobs of measurements for each of them.
Of course now you have a rank-deficient situation with an entire affine space of solutions. But that's no good: the application wants the sparsest point in that affine space in the given basis. ML has tricks up its sleeve to do that in poly time. It's quite unbelievable that this is even possible.
BTW, a little surprised that you don't talk about Lehmann much.
And you are searching for quality in the wrong place. Go for the proceedings of the COLT conference and ICML, and read some Vapnik to start.
Beating linear programming as in CPLEX on networks? In some cases, that's easy: On a lot of networks, the simplex algorithm specializes in an amazing way: A basic solution corresponds to a spanning tree. To consider adding a variable to the basis, just add an arc to the tree, get a circuit, run one unit of flow around the circuit, and see if you can save money; if so, run the flow around until you hit a flow capacity constraint on one of the arcs in the circuit, remove that arc from the basis, and, presto, you have a simplex pivot. If you use the W. Cunningham (one of my profs) modification of strongly feasible bases, then you are guaranteed not to cycle. It's easy to program and just blindingly fast -- e.g., the arithmetic is all just addition and subtraction. It's still the simplex algorithm, so CPLEX could have been used, but for a big network problem this will likely run ballpark thousands of times faster than CPLEX.
For more particular cases, maybe more can be done.
Bertsekas has a polynomial algorithm for that problem. It may get applied to that wacko in Ping Pong Yang.
Network flows is an old, big, deep field, e.g.,
Ravindra K.\ Ahuja, Thomas L.\ Magnanti,
James B.\ Orlin {\it Network Flows:
Theory, Algorithms, and Applications,\/}
ISBN 0-13-617549-X, Prentice Hall, New
Jersey, 1993.\ \
Mokhtar S.\ Bazaraa and John J.\ Jarvis,
{\it Linear Programming and Network
Flows,\/} ISBN 0-471-06015-1, John Wiley
and Sons, New York, 1977.\ \
For analysis of variance, there is, e.g.,
George W.\ Snedecor and William G.\
Cochran, {\it Statistical Methods, Sixth
Edition,\/} ISBN 0-8138-1560-6, The Iowa
State University Press, Ames, Iowa, 1971.\
\
No joke, it's from Iowa, since the subject has long been important in agriculture research and testing and in experimental design and analysis much more generally. E.g., people who propose new ways to do K-12 education commonly use these techniques to good advantage.
Sure, first cut, it's regression analysis again, but with its special discreteness really it's quite different. It's powerful stuff.
For the Gaussian assumptions in regression, don't get too concerned: The normal equations remain and don't need a Gaussian assumption -- there's still a projection and you still get the Pythagorean theorem, total sum of squares = regression sum of squares + error sum of squares. The Gaussian assumption yields the F ratio and t-tests, and, ballpark, in practice those remain robust, mostly ignoring the Gaussian assumption. In some wild, edge cases you might want to check the robustness claim, but otherwise you do get some good information about the fit.
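That decomposition is easy to check numerically; a toy least-squares fit on made-up data, with no Gaussian assumption anywhere:

```python
# Verify total SS = regression SS + error SS for a least-squares line fit.
# This is just the Pythagorean theorem for the perpendicular projection;
# no distributional assumption is used anywhere.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.7, 4.1, 5.4]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
b0 = my - slope * mx
fit = [b0 + slope * x for x in xs]

total_ss = sum((y - my) ** 2 for y in ys)
reg_ss = sum((f - my) ** 2 for f in fit)
err_ss = sum((y - f) ** 2 for y, f in zip(ys, fit))

print(abs(total_ss - (reg_ss + err_ss)) < 1e-9)  # True: the identity holds
```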
Totally agreed. The algorithm I was talking about is a rather curious one. It does neither vertex hopping like simplex nor interior point iterations. It works with a factorized representation of the Lagrangian and then updates parts of the factorization in a fixed point iteration that, if it converges, satisfies the KKT conditions. What is strange is that the iteration is not monotonic in the dual cost (nor in the primal).
As long as there is no more than one loop in the original problem, convergence is guaranteed. Otherwise no guarantee holds, but it often does converge anyway.
Yeah, sure, it's a projection, but so what? Why and when should an L2-norm projection be useful? It is when the tails of the residuals are Gaussian-like. Otherwise things will, and do, go haywire. The L2 loss is too sensitive at the tails. This observation is nothing new though.
What is new, however, are the near magical guarantees you can get (literally rabbit out of a hat) when you use an L1 projection instead. It comes at a computational cost but solves a problem that cannot be efficiently solved otherwise: Xw = y + noise.
X is a random matrix, woefully rank deficient, with exponentially more columns than rows. w \in R^n but has very few non-zeros. Now find that w. This needs rather deep tools from asymptotic geometry and Banach spaces. It's an area that's quite active in stats / ML / signal processing right now.
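On the tail-sensitivity point: fitting a constant, the L2-optimal fit is the mean and the L1-optimal fit is the median, and a single wild observation drags only the former. A toy illustration with made-up numbers:

```python
import statistics

# True level is 5.0 observed with small noise, plus one wild outlier --
# a crude stand-in for heavy tails.
data = [4.9, 5.1, 5.0, 4.8, 5.2, 5.05, 4.95, 100.0]

l2_fit = statistics.mean(data)    # minimizes the sum of squared residuals
l1_fit = statistics.median(data)  # minimizes the sum of absolute residuals

print(round(l2_fit, 3), round(l1_fit, 3))  # mean dragged to ~16.9, median ~5
```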
> Yeah, sure, it's a projection, but so what? Why and when should an L2-norm projection be useful?
We want the answer and, more generally, a predictive model. We use the data we have to define a vector subspace. We use that subspace to creep up on the answer we want. Finally, to get the answer, we project onto the subspace we have settled on. We CAN do the projection even though we don't have the answer. We take that projection onto the subspace, that point in that subspace, as our approximation to the answer. We use L^2 because that projection, with the Pythagorean theorem, is the closest in L^2.
Being close in L^2 is a good thing: L^2 is a Hilbert space, being Cauchy convergent means being convergent in L^2, and then there is a subsequence that converges almost surely, with probability 1, exactly, etc., and in practice it's exactly. A sequence that converges in L^2 but not exactly is pathological.
Then as we use more variables, the vector subspace gets closer to the answer and we get a better projection, although maybe not a better fitting model.
To me, tails of distributions have next to nothing to do with it or with the use of L^2. In particular, the basic regression math, the normal equations, the Pythagorean theorem
total sum of squares = regression sum of squares + error sum of squares
makes no assumptions at all about distributions (one might argue the variables need to be in L^2, that is, for each random variable X have E[X^2] finite, or maybe at least E[|X|] finite, that is, L^1) and certainly does not assume a Gaussian.
Moreover, the matrix in the normal equations has variances and covariances, but these are close to just the inner product in the Hilbert space L^2 and do NOT assume a Gaussian or anything else about a distribution except merely that the expectations exist, which in practice is a meager assumption, essentially always the case. Just because we are working with variances and covariances does NOT mean that we are assuming a Gaussian.
Also notice that nowhere do we need to assume that our data is not discrete; discrete data is still fine.
If in addition we do have an appropriate Gaussian assumption, then at the end of the usual normal-equations work we can get an F ratio and some t-tests that usually help evaluate the quality of the fit and the importance of the coefficients. And for the model, we can get confidence intervals on the predicted values -- sure, we will need a Gaussian assumption here, but there might be other approaches, and otherwise we can hope or look for robust estimators.
L^2's fine with me. More can be done, but the above is the first-cut reason for using subspaces and L^2 projections, although this explanation is rarely taught to students.
When ordinary regression doesn't work very well, I'm reluctant to strain with much more. I might be willing in some circumstances to use what L. Breiman (I respect Breiman, a Loeve student) has in his Classification and Regression Trees.
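The two facts above -- residual orthogonal to the subspace, and the sum-of-squares decomposition -- can be checked numerically in a few lines (my sketch, on made-up data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])  # intercept + 2 regressors
y = X @ np.array([1.0, 2.0, -0.5]) + rng.standard_normal(n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # solves the normal equations X'X b = X'y
yhat = X @ beta
resid = y - yhat

# Normal equations <=> residual orthogonal to the column space (the subspace)
print(np.abs(X.T @ resid).max())

# Pythagorean decomposition about the mean (holds because X contains the intercept)
tss = np.sum((y - y.mean()) ** 2)
rss = np.sum((yhat - y.mean()) ** 2)
ess = np.sum(resid ** 2)
print(tss - (rss + ess))
```

Both printed quantities are zero up to floating-point error, and no distributional assumption was used anywhere: the decomposition is pure Hilbert-space geometry.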
No no no, you are thinking about this one a little wrong.
Yes, the Pythagorean identity, or Pythagorean decomposition, whatever you want to call it, the orthogonality properties, etc. make L2 super convenient to apply and reason about, but that does not mean it is a suitable performance metric to use. This is again the phenomenon of searching for the answer where it's well lit versus where the answer actually lies.
The problem really lies in the tails of the errors, although it might not be immediately apparent. You say variance, covariance, etc., but it does not take much for RVs not to possess them. Think about it: Chebyshev's inequality will make it clear what tail decay you need for those to exist.
Yes, you are right that the choice of L2 by itself makes no assumptions. But once you start analyzing the performance of the estimator it will show up, particularly when the Fisher information is no longer strongly convex; the Cramer-Rao bound becomes vacuous in this case. The other problem is that L2 balls become infinitely larger than, say, the L1 ball as dimensionality increases, so L2 error stops being very good at localizing a vector.
The problem is that L2 is not a good metric to measure inaccuracy with; it is super convenient to use, though.
Let me demonstrate one of its well-known problems. Say the error residual has tails that do not decay as fast as an exponential (the MGF does not exist around 0). You can then show that the accuracy can be arbitrarily bad. No need for super-ugly densities: something as benign as a mixture of two Gaussians with the same expectation will break it.
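The Gaussian-mixture example can be simulated directly (my sketch, with made-up contamination parameters): both the sample mean (the L2 estimate of location) and the sample median target the same value, but a small high-variance component wrecks the mean while barely touching the median.

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 200, 500
eps, sigma = 0.10, 50.0        # 10% contamination, huge spread, SAME mean (0)

means, medians = [], []
for _ in range(trials):
    x = rng.standard_normal(n)
    mask = rng.random(n) < eps
    x[mask] *= sigma           # mixture of two zero-mean Gaussians
    means.append(np.mean(x))
    medians.append(np.median(x))

# Both estimate the same location (0), but the L2 estimate (mean) is far noisier
print(np.var(means), np.var(medians))
```

This is exactly Huber's contamination model: the mixture is perfectly benign (all moments finite), yet the variance of the L2 estimator is orders of magnitude worse than that of the L1-flavored one.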
I am just restating Huber's robustness argument here in a different way.
Gauss himself was well aware of the problem and chose L2 for convenience. Even his peers were well aware of situations where L1 made more sense than L2. Unfortunately, optimization theory was not well enough developed at that time to support L1-based methods. But now we can use them.
BTW, you might find it interesting that the Pythagorean property is not limited to squared L2. The set of all 'divergences' for which it holds is the Bregman divergence family; it's an if-and-only-if condition. This is a result that came from the ML community, though one side of the implication was already known. Bregman divergences are again the likelihood ratios of exponential-family densities (also called the Darmois-Koopman family in older literature), hence they have strong connections with the NP lemma as well. What I find mind-boggling is that the Bregman divergence had its origin in the convex optimization literature! Its origins had absolutely nothing, nothing to do with probability and stats. It's amazing when two separate fields of math make contact in these ways.
BTW, if you don't mind sharing your suitably anonymized email: I am srean.list at gmail. It will be a pleasure to discuss math with you. I find it very helpful when I can stress-test an idea against a human oracle. HN might not be a forum well suited for this.
Yes, now we can also do best L^1 approximations. IIRC -- I'm in a hurry this morning -- it's a linear programming problem or some such, but I haven't thought about that in decades.
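It is indeed a linear program. A minimal sketch (my own, with made-up data) of least-absolute-deviations regression via `scipy.optimize.linprog`: introduce one slack t_i per observation bounding |x_i'b - y_i| and minimize their sum.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
n, p = 60, 2
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = X @ np.array([1.0, 3.0]) + rng.laplace(size=n)

# Least absolute deviations as an LP: min sum(t)  s.t.  -t <= X b - y <= t
# Decision variables: b (p coefficients, free), then t (n slacks, >= 0)
c = np.concatenate([np.zeros(p), np.ones(n)])
A_ub = np.block([[X, -np.eye(n)], [-X, -np.eye(n)]])
b_ub = np.concatenate([y, -y])
bounds = [(None, None)] * p + [(0, None)] * n
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print(res.x[:p])    # the L1 fit; should land near the true coefficients [1, 3]
```

The same trick with a single scalar slack bounding all residuals gives the L^infinity (Chebyshev) fit as an LP, too.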
You bring up a lot of stuff I've never reviewed.
> but that does not mean it is a suitable performance metric to use.
If the plane we do the L^2 projection onto is really close, then the error is really small, and what's not to like? Really close is good enough for me.
Right, there are three biggie choices: L^1 (minimize the sum of absolute values of the errors), L^2 (minimize the sum of squared errors), and L^infinity (minimize the worst error). L^2's fine with me, good enough for government work, an okay first cut, day in and day out. If in some particular case L^2 has problems, then maybe something like regression is not the right tool for the problem.
Gee, we do a lot of L^2: We get orthogonal components so that we can get L^2 cafeteria style, just picking the components we want. E.g., in filtering stochastic processes, we take a Fourier transform -- that is, finding the coefficients of the sample path on the sine-cosine orthogonal components. Or we do a convolution, which is the same thing in the end. The approximation we get is an L^2 approximation. So, with JPG -- sure, it's L^2. Right, JPG does funny stuff near lines. Life's not perfect!
For some things, sure, we want L^infinity: E.g., if we want to use a quotient of polynomials to approximate the usual special functions, then we want to minimize the worst error and do what is called Chebyshev approximation, but this is very specialized.
Indeed it is. Let's catch up more sometime. Here's my email again: srean.list on gmail.
Totally agreed there is a lot to like about L2, but there are plenty of situations where it is terrible (in fact some of them are on one topic of your interest: monitoring server farms). In some of those situations L1, or a combination of L1 and L2, helps a lot.
Unfortunately it's not true in many of the cases I have seen, essentially because the tail of the error does not fall fast enough. Technically, for all bounded RVs all moments are bounded, but in some of these situations the variance is so high that it's infinite for practical purposes.
All you need is a tail that falls slower than a quadratic.
In my experience there is a huge gap between research at universities, the applied R&D done at industrial labs and the applications of research done at companies.
All are valuable in their own way, but research maths has a big gap before it is useful for a company.
(Source: I run a multi-university R&D project, including an applied math department.)
> All are valuable in their own way, but research maths has a big gap before it is useful for a company.
Counterexample: Consider the simple, first-order, ordinary differential equation initial value problem
y'(t) = k y(t) (b - y(t))
I was feasting on picnic pork-shoulder BBQ in Memphis when I saw that, and with freshman calculus (never took it; taught it to myself, started on sophomore calculus) I got the closed-form solution. You don't even need a course in differential equations, variation of parameters, or any of that!
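For the record (my reconstruction, not in the original comment), separating variables and using partial fractions gives, for y(0) = y_0 with 0 < y_0 < b,
y(t) = b y_0 / (y_0 + (b - y_0) exp(-k b t))
which starts at y_0 and, for k, b > 0, rises along the classic S-curve to the limit b.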
Well, one Saturday morning my solution kept the two representatives of FedEx board member General Dynamics from walking out; they cancelled their plane reservations back to Texas and stayed, and FedEx was saved from going out of business.
So a little bit of math work helped a major business!
Yup, gaps, that's the usual situation.
But, early in my education, I didn't know those facts of life about gaps, etc. and believed that the stuff I was studying in math and physics would be useful, to make money, for a house, wife, kids, etc.
But, naive or not, in the end I got a good education in pure/applied math.
Now I'm less naive.
But, now I'm an entrepreneur: I can pick a good real problem -- as we know, good problem selection is important for success. And there's no law against my drawing on all I learned in pure/applied math to derive some new math to get the first good solution to an important problem.
I did that for zero day monitoring of server farms and networks, but for success, the $ kind, I was in effect depending on others to make the practical application. I had some high quality flour but was not starting Domino's Pizza.
So, being too naive too often in the past, I decided to go to where the money was, to have a Web site that would get a lot of traffic, run ads from ad networks, get lots of clicks, and get paid by the click -- be 100% owner.
How to do that? First, pick a problem. Really, pick a problem and a solution as a pair, but to keep it simple, first pick a problem. Right, if in the second step you can't find a good solution, then loop back and pick another problem.
So, there's the Internet and computing, both not nearly fully exploited. And there's ballpark 5 billion people on the Internet. Maybe in the more developed countries that can support good ad rates, there are maybe 400 million good Internet users.
So, then, find a problem these 400 million or 5 billion people would very much like to have solved, want a new solution to on average once a week, and will be willing on average to look at screens from my Web site for 30 minutes a week and, there, see maybe 50 ads with decent ad targeting. Multiply that out, and it's a successful business. Get optimistic and excited multiplying that out, and you get an estimate that it's the first $1 T business.
Heck, get on average, 24 x 7, one user a second, and it's a nice lifestyle business -- plenty for a house, wife, kids, etc. (Uh, I've never had money enough to buy a house. I'd like to be able to buy a house; I'm not joking about that, not even a little bit. If I'd had money enough to buy a house, my wife might still be alive. I want money enough to buy a house, and I know with 100% certainty I will never have it unless I start, own, and run my own successful business -- period.)
So, I found such a problem. It's not solved worth a darn -- the best solutions totally suck. The solutions are so bad no one even suspects that there could be a solution and, thus, fails even to notice the problem.
Well, a solution is not trivial: Routine applications programming won't work. That AI/ML would work is a LOL joke. The applied math for a solution is not on the shelves of the libraries. So, I found some new math for a new solution. The work was not too difficult -- I've done high-quality, new math before.
I never had any trouble: get good at understanding the theorems and proofs and working the exercises in books by Halmos, Rudin, Fleming, Coddington, Royden, Loeve, Neveu, Breiman, Bertsekas, Hadley, Herstein, Hildebrand, Cinlar, Nemhauser, Zangwill, Luenberger, Simmons, Kelley, Suppes, Tukey, etc., and then maybe you will be able to do publishable work in math.
Then I wrote the software. It appears to run fine. I wrote the software in Microsoft's Visual Basic .NET -- looked, still looks, like a fine approach to me. So, no LISP, no functional programming, no particular concern about object oriented architecture, ignored JavaScript, nearly ignored CSS, never used an HTML div element, no attention to languages intended to make high levels of parallelism easy and automatic, ignored the model, view, controller Web software architecture, just typed into my favorite text editor and never used an interactive development environment, had no problems with debugging (it was enough just to write little trace messages to the Web site log file), just kept it simple. Seems fine to me.
I'm on the way to going live and getting revenue.
I'm a sole, solo founder, entrepreneur. I can do math and write code; there's no law against it; I don't need permission; and maybe my work will make money -- I believe there is a huge chance.
If people like my work, then I'll be rich.
Yes, on the gaps: one of my Ph.D. dissertation advisers told me with high concern that I should consider how long it takes to get original research into practice. Immediately I told him, "I'll put it into practice right away." That's what I'm trying to do. Uh, he assumed that I wanted to be a college prof; nope, I wanted to make money, the green kind, in business, the green money-making kind.
Gap between academics and applications? Not for me, now. Sure, I suffered from that gap, but now I look at the flip side and see a big opportunity.
It's just in its gold rush phase. It happens with everything. It happened with the PC, the web, mobile, even Facebook apps, and it is happening to ML and VR now. Everyone is piling on to dominate before the winners are solidified and it's too late.
These are acquihires: teams that have gelled and accomplished things. You generally don't find them in universities. And why do you assume they're not filled with a lot of stats PhDs?