
The math, enough to make it mathematical, complete with theorems and proofs. Examples are in mathematical and applied statistics, optimization, applied probability, stochastic processes, control theory, and more.

There is a lot in just statistical hypothesis testing -- Kolmogorov-Smirnov with a really cute stochastic process foundation. Then there is a small ocean of non-parametric tests.
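
For a concrete feel, here is a tiny hedged sketch (synthetic data, SciPy's two-sample test) of the distribution-free flavor of Kolmogorov-Smirnov:

    # Two-sample Kolmogorov-Smirnov test; no distributional assumptions on x, y.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    x = rng.normal(0.0, 1.0, 500)
    y = rng.normal(0.3, 1.0, 500)        # shifted sample
    stat, p = stats.ks_2samp(x, y)
    print(f"KS statistic {stat:.3f}, p-value {p:.3g}")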

Or, suppose we want a statistical hypothesis test that is both multi-dimensional and distribution-free and, via Ulam's 'tightness', provably not trivial. Well, we'd like the most powerful test -- how to do that? There's essentially nothing in current AI/ML that is new and would work for such statistics.

For a 50,000 foot view, pure and applied math are huge, old, deep fields also awash in applications. What's in AI/ML so far now is tiny in comparison.



While I agree that it is vital for a practitioner to know the mathematical basis behind what they're doing in order to maximize their chances of success, I find your current argument lacks substance.

You argue that there is a vast amount of solved problems in tomes of "pure" mathematics. But this doesn't mean that any of those solutions are relevant and/or necessary to the problems people are working on.

Your "statistical test" argument is misleading. Machine learning techniques aren't designed to form such statistics, and judging them by their ability to do so willfully disregards the real world success many companies have found by using ML/AI to solve their problems.

Certainly, there are challenges that are waiting to have their solutions implemented from a rediscovered manuscript, but listing impressive-sounding statistical terminology doesn't demonstrate lack of value for machine learning/AI techniques. Furthermore, it doesn't even demonstrate value for the "pure" techniques you list. To clarify, I'm talking about real-world value, not just theoretical value (and again, I'm not arguing that such techniques don't have value, just that your argument isn't doing a good job communicating how such techniques have orders of magnitude more value than ML/AI).


AI/ML now is too often just heuristic curve fitting. Okay, with good training data and then testing data, you can get some utility that way. So, if that's the best you can do in the context, okay. A secret in old applications of regression model building: that idea of cutting the data in half, fitting with one half and testing with the other, was common and appropriate.
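
A minimal sketch of that cut-the-data-in-half idea, on made-up data with plain numpy least squares:

    # Fit on one half, check error on the held-out half (synthetic data).
    import numpy as np

    rng = np.random.default_rng(2)
    n, p = 400, 5
    X = rng.normal(size=(n, p))
    beta_true = rng.normal(size=p)
    y = X @ beta_true + rng.normal(scale=0.5, size=n)

    half = n // 2
    X_fit, y_fit = X[:half], y[:half]          # fitting half
    X_test, y_test = X[half:], y[half:]        # testing half

    beta_hat, *_ = np.linalg.lstsq(X_fit, y_fit, rcond=None)
    test_rmse = np.sqrt(np.mean((X_test @ beta_hat - y_test) ** 2))
    print(f"held-out RMSE: {test_rmse:.3f}")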

But we'd like to do better. Some of what is in statistics and applied math more generally will let us do better.

But the idea of dividing the data into training and testing is old. There's much more in the literature on how to build and evaluate models. E.g., there's analysis of variance and categorical data analysis (e.g., log linear).

E.g., for my startup, a guy gave a talk and said we couldn't do that, that we'd encounter four problems. He was right about the four problems. And if you just look in elementary texts, yup, you do see the four problems. But I'd already seen all four and worked up some theorems and proofs to get around all four, in the context of my real problem. The foundation of the theorems and proofs was mostly some advanced pure math. Without new theorems and proofs, I'd have been stuck like he was.

In an important sense, the pure math guys are right: They are looking for the big, important structure and powerful properties, where all the furniture is in the room, turning on the lights so that one can see all the furniture (A. Wiles), and that can be darned powerful stuff in practice.

> You argue that there is a vast amount of solved problems in tomes of "pure" mathematics. But this doesn't mean that any of those solutions are relevant and/or necessary to the problems people are working on.

Flour is a raw material. Some people can use it to make a fantastic Sacher Torte. Pure math can be regarded as a raw material. Some people can use it to get new, powerful, valuable results for real applications in practice.

> orders of magnitude more value than ML/AI

Those techniques are narrow and weak. The stuff long on the shelves of the libraries is much more broadly applicable and usually much more powerful. I've made a lot of applications of applied math, and in not one of them would the AI/ML discussed these days have been better, and it would rarely even have been competitive.


What 4 problems?


You’ve told us vague things about wide areas of mathematics, but you haven’t told us what problem your start-up is solving using these branches of mathematics or how it is a better allocation of investor dollars than focusing on ML.

I’ve worked on a wide range of applied mathematics problems in machine learning, computer vision, signal / image processing, control theory, and applied statistics. Very few people outside of a mathematics department care at all about theorems, proofs, and "correctness" in the formal mathematical sense. I can count on one hand how many times I’ve needed to actually do a proof, and the number of people who have cared is smaller still. All anyone cares about is how well you can solve their problems. Sometimes advanced mathematics is the right tool for the job, but often it is not.


> You’ve told us vague things about wide areas of mathematics, but you haven’t told us what problem your start-up is solving using these branches of mathematics or how it is a better allocation of investor dollars than focusing on ML.

Easy: ML is quite narrow. More options yield results at least as good and usually better. Compared with the applied math on the shelves of the libraries, ML is nearly pinpoint narrow. In addition, too much of ML is heuristic curve fitting short on math guarantees. Also, where some real problems have some features that justify some math assumptions, ML is slow to exploit those.

In the workshop, kitchen, or lab, use the best tools you can. So don't limit yourself to just what saw on a cooking show on TV.

My view is that mathematical proof is a fantastically powerful part of applying math. The core reason is that it lets you do some additional, maybe significantly original, derivations to get better solutions for real problems.

E.g., once I was in an AI group working on monitoring and managing server farms and networks. We got rivers of data on many variables at high data rates. There we wanted to detect problems ASAP in near real time.

Okay, necessarily, inescapably, we have two ways to be wrong -- false positives (claiming the system is sick when it's healthy) and false negatives (claiming the system is healthy when it's sick).

Okay, then, right away we are forced, like it or not, into the framework of some statistical hypothesis tests continually applied. There, as is usual in hypothesis testing, we want to be able to select a false alarm rate and get that rate exactly. We'd like the most powerful test as in the lemma of Neyman-Pearson but likely do not have enough data. Still we want test power in some good sense; if we are willing to tolerate a higher false alarm rate, then we'd like a higher detection rate; when we get a detection, we'd like to know, as a measure of seriousness, the lowest false alarm rate at which the data would still be a detection; at least we don't want a trivial test.

For this data, no way can we know the probability distribution of the data under the null hypothesis (the system is healthy). This is because even the univariate data is commonly a total mess, far worse than was long common in statistics (U. Grenander explained this to me in his office at Brown). And, of course, with so much data on so many variables, we should be doing multi-variate testing.
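
As a hedged, deliberately simplified illustration only (univariate, and nothing like the multi-variate tests discussed here): with no known null distribution, an empirical quantile of held-out "healthy" data gives a threshold whose false alarm rate is approximately the chosen level:

    # Distribution-free threshold calibration from healthy data (toy sketch).
    import numpy as np

    rng = np.random.default_rng(3)
    healthy = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)   # messy, non-Gaussian
    alpha = 0.01                                                # target false alarm rate
    threshold = np.quantile(healthy, 1.0 - alpha)

    new_healthy = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)
    false_alarm_rate = np.mean(new_healthy > threshold)
    print(f"target {alpha:.3f}, observed false alarm rate {false_alarm_rate:.3f}")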

So, net, we need tests that are both multi-variate and distribution-free. Go look for those in E. Lehmann, etc., and you won't find any.

So, I invented some. Then, how the heck to adjust false alarm rate? That was a problem in applied probability that needed new theorems and proofs. So, I did those. For those I used measure theory, measure preserving from ergodic theory, group theory from abstract algebra, and more. Yup, theorems and proofs.

Otherwise I would not have known what the heck I had, what the properties were, what the false alarm rates were, etc. Then do my tests have any power? I didn't have data enough to apply Neyman-Pearson, but I found a way. Are my tests trivial? I used a result of S. Ulam and stated and proved a theorem -- not trivial.

The work was intended to be practical. Likely and apparently the work remains the cat's meow, best in the world, for detection of problems never seen before -- behavioral monitoring, zero day monitoring.

Gee, just read that the last security attack on Target cost them $300 million. They needed better monitoring!

New theorems and proofs were crucial, and the work was intended to be practical.

A guy had a fast opportunity to do some marketing but, with the short time interval, had limited resources. He had a 0-1 linear programming formulation, 40,000 constraints and 600,000 variables for a small test case. Sure, no linear programming or integer linear programming package would have a chance.

So, I got his formulation, saw some special structure to exploit, did some non-linear duality derivations for some Lagrangian relaxation, found how to get some bounds, wrote some code, and in 905 seconds on a slow computer found feasible solutions guaranteed to be within 0.025% of optimality. Well, the derivations were essentially theorems and proofs. He'd tried a computer science approach, simulated annealing. He ran for days and then stopped with the best he had and with no idea how close he was to optimality. My work, from theorems and proofs, was much, MUCH better.
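
For flavor only, here is a hedged toy sketch of Lagrangian relaxation with a subgradient method on a made-up 0-1 knapsack instance (nothing like the 40,000 x 600,000 formulation above): the relaxation gives an upper bound, a cheap feasible solution gives a lower bound, and together they give a guaranteed optimality gap.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    c = rng.uniform(1, 10, n)          # profits
    a = rng.uniform(1, 5, n)           # weights
    b = 0.3 * a.sum()                  # knapsack capacity

    def relaxed_max(lam):
        """Solve max_x c.x - lam*(a.x - b) over x in {0,1}^n (separable)."""
        x = (c - lam * a > 0).astype(float)
        return lam * b + np.maximum(c - lam * a, 0).sum(), x

    def greedy_feasible():
        """Cheap feasible solution: fill by profit/weight ratio (lower bound)."""
        order = np.argsort(-(c / a))
        x, used = np.zeros(n), 0.0
        for i in order:
            if used + a[i] <= b:
                x[i], used = 1.0, used + a[i]
        return x

    lam, step = 0.0, 1.0
    best_ub = np.inf
    for k in range(1, 201):
        ub, x_relaxed = relaxed_max(lam)
        best_ub = min(best_ub, ub)
        # Subgradient of the dual function at lam is b - a.x_relaxed.
        lam = max(0.0, lam - (step / k) * (b - a @ x_relaxed))

    x_feas = greedy_feasible()
    lb = c @ x_feas
    print(f"lower bound {lb:.2f}, upper bound {best_ub:.2f}, "
          f"gap {(best_ub - lb) / lb:.2%}")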

The crucial core applied math of my startup is some theorems and proofs; I wouldn't believe that the mathematical operations would yield what I want otherwise.

Sure, usually a working applied mathematician doesn't always have to stir up new theorems and proofs, but at times that's darned helpful and valuable.


Hey sorry but this

> In addition, too much of ML is .... short on math guarantees.

is utterly false. As I keep telling you, you need to upgrade :)

In fact I will level the same exact criticism at standard stats. The NP lemma is fantastic math but of little practical importance because you rarely ever know the densities. Most of the time 'test of hypothesis' is a bad joke catering to a contrived scenario. Real problems are rarely about 'is it H0 or H1'. Estimation is typically more useful. Stats has long suffered from its fetish for parametric models, asymptotic normality, asymptotic guarantees, and what I feel is its greatest deficiency: the focus on parameter estimation rather than guarantees over prediction. Parameters and parametric models are a convenient fiction; nobody has seen them, nobody ever will. What's all the fuss about estimating a fictional object to great accuracy, under strong assumptions, in infinite time?

Made sense a century ago though. If their purported utility is in aiding prediction, then why not go after the real deal, that is, non-asymptotic guarantees on prediction in a distribution free / nonparametric setting. That is what ML does. Please don't characterize ML by bad examples and work-in-progress stuff. I do know nonparametric stats exist, but that's not the face of stats that people get to see, and even in those cases the focus is usually on accuracy of estimating a functional.

I was deliberately harsh on stats in my comment to offset some of the wrong things you mentioned about stats, but the relation between ML and stats is really not adversarial. I see ML as a course correction for stats: focus on prediction guarantees and adding some muscle in two areas (i) algorithms for scaling (ii) optimization. In any case I know where both tribes hide their dirty laundry :)

That said, I strongly agree that ML is indeed applied math, a mix of probability, optimization, functional analysis, and algorithms/data structures.


I agree with a lot of that.

But hypothesis testing remains important because in some contexts with too little data that's about the best we can do. For parametric tests, there are some cases when that's okay.

The asymptotic stuff is mostly okay: Can't figure out precisely what the darned thing will do for case n so let n go to infinity and maybe can see what happens. Then say, "for large n, this is about what happens" -- crude but better than nothing.

The central limit theorem and more says that the Gaussian is really important. Okay. Done. We know that. But, right, for too long too much of statistics made nearly a religion out of the Gaussian. Bummer.

Today happened to see the discussion panel at Stanford at

https://www.youtube.com/watch?v=hxXIJnjC_HI

It's fun stuff, with some big names.

They are laughing at ML: ML discovered statistics like Columbus discovered America. Columbus didn't discover America since when he landed millions of people were already living there. And when CS AI/ML discovered statistics, there was already a huge field there.

Moreover, the situation is reversed: Columbus brought better ship, etc., technology to the Americas than the Americas had, but CS AI/ML is bringing nearly all highly inferior technology to statistics, pure/applied math, etc.

CS AI/ML is nearly all new labels for adulterated old wine.

So, take some data on two variables, plot on an X-Y graph, fit a straight line, get the coefficients of the line as in first year algebra, and, presto, bingo, "The AI revolution is here!!!!", CS has had a machine "learn" the coefficients, and they started with "training" data, that is, they trained the machine!!!!!! Where can I get one of those airline seat barf bags? Upchuck time!

Uh, I have a pretty good professional library; much of my house is lined with bookshelves. The main topics are pure/applied math. Some of the books relevant to statistics include (with TeX markup)

Alexander M.\ Mood, Franklin A.\ Graybill, and Duane C.\ Boes, {\it Introduction to the Theory of Statistics, Third Edition,\/} McGraw-Hill, New York, 1974.\ \

N.\ R.\ Draper and H.\ Smith, {\it Applied Regression Analysis,\/} John Wiley and Sons, New York, 1968.\ \

C.\ Radhakrishna Rao, {\it Linear Statistical Inference and Its Applications:\ \ Second Edition,\/} ISBN 0-471-70823-2, John Wiley and Sons, New York, 1967.\ \

Henry Scheff\'e, {\it Analysis of Variance,\/} John Wiley and Sons, New York, 1967.\ \

Yvonne M.\ M.\ Bishop, Stephen E.\ Fienberg, Paul W.\ Holland, {\it Discrete Multivariate Analysis:\ \ Theory and Practice,\/} ISBN 0-262-52040-0, MIT Press, Cambridge, Massachusetts, 1979.\ \

Stephen E.\ Fienberg, {\it The Analysis of Cross-Classified Data,\/} ISBN 0-262-06063-9, MIT Press, Cambridge, Massachusetts, 1979.\ \

Leo Breiman, Jerome H.\ Friedman, Richard A.\ Olshen, Charles J.\ Stone, {\it Classification and Regression Trees,\/} ISBN 0-534-98054-6, Wadsworth \& Brooks/Cole, Pacific Grove, California, 1984.\ \

R.\ B.\ Blackman and J.\ W.\ Tukey, {\it The Measurement of Power Spectra:\ \ From the Point of View of Communications Engineering,\/} Dover, New York, 1959.\ \

William W.\ Cooley and Paul R.\ Lohnes, {\it Multivariate Data Analysis,\/} John Wiley and Sons, New York, 1971.\ \

Maurice M.\ Tatsuoka, {\it Multivariate Analysis: Techniques for Educational and Psychological Research,\/} John Wiley and Sons, 1971.\ \

E.\ L.\ Lehmann, {\it Testing Statistical Hypotheses,\/} John Wiley, New York, 1959.\ \

E.\ L.\ Lehmann, {\it Nonparametrics: Statistical Methods Based on Ranks,\/} ISBN 0-8162-4994-6, Holden-Day, San Francisco, 1975.\ \

Jaroslav H\'ajek and Zbyn\v ek \v Sid\'ak, {\it Theory of Rank Tests,\/} Academia, Prague, 1967.\ \

Sidney Siegel, {\it Nonparametric Statistics for the Behavioral Sciences,\/} McGraw-Hill, New York, 1956.\ \

Donald F.\ Morrison, {\it Multivariate Statistical Methods: Second Edition,\/} ISBN 0-07-043186-8, McGraw-Hill, New York, 1976.\ \

Harry H.\ Harman, {\it Modern Factor Analysis: Second Edition, Revised,\/} The University of Chicago Press, Chicago, 1967.\ \

George E.\ P.\ Box and Gwilym M.\ Jenkins, {\it Time Series Analysis --- Forecasting and Control: Revised Edition,\/} ISBN 0-8162-1104-3, Holden-Day, San Francisco, 1976.\ \

Jean-Ren\'e Barra, {\it Mathematical Basis of Statistics}, ISBN 0-12-079240-0, Academic Press, New York, 1981.\ \

So, what of real significance is recent ML adding to this quite old material? How much of the good stuff in that material has CS AI/ML even read and understood so far? Uh, what am I missing? I can think of a little but not much.

CS AI/ML going for applications of statistics is a lot like Stanley Tools, long in the hammer, screwdriver, saw, drill, and nail gun business, going into the residential construction business building, selling new houses and claiming that they have something new! That is, because they make good nail guns they conclude that they are the best at building houses. They know nothing of excavation, masonry, framing carpentry, windows, doors, roofing, plumbing, electrical, HVAC, dry wall, painting, circular stairs, walnut paneling, landscaping, or even the building codes, but they've got some good nail guns!

The math I derived and published in multi-dimensional, distribution-free anomaly detection has next to nothing to do with any of those books above; instead, I did original work and drew from ergodic theory, abstract algebra, and a classic result of Ulam. For the math for my startup, it's also original and even farther from those books above. In both cases, the core results are presented as theorems with proofs. Such math is mostly not what I'm seeing in CS AI/ML.

I DID see some cute, recent work in analysis, maybe from ML, of the over fitting problem; I don't need that work for my startup, but I do intend to go back and read that work.

My main points are:

(1) If AI/ML have some applied math tools that are solid, good, and new, terrific. Then those cases will add to the many thousands of such tools already on the shelves of the research libraries.

(2) The most important paradigm for a solid future for such tools is new, solid, correct, and powerful theorems and proofs building on solid material in math. That is, we want better stuff than is already in the libraries.


Agree with the gist but with some reservations.

Small correction to what I said: srean> I was deliberately harsh on stats in my comment to offset some of the wrong things you mentioned about stats.

Replace 'stats' with ML in my comment above. You are spot on regarding the stats part, regarding ML not so much. In fact I have great respect for you as an applied mathematician and frequently learn something from these exchanges of comments.

graycat> Moreover, the situation is reversed: Columbus brought better ship, etc., technology to the Americas than the Americas had, but CS AI/ML is bringing nearly all highly inferior technology to statistics, pure/applied math, etc.

This is unfair to the point of being absurd. Have you looked at stats software? Or compute infrastructure built by statisticians? I can say more, but that would be below the belt. CS/ML has infused new blood in this area, augmenting the capabilities of what can be done on a machine (or cloud) by 3 orders of magnitude or more. Part of it is advances in CS and software systems, but that is not all. One of the other parts is advances in algorithms. I am sure you would not scoff at ML-pedigreed algorithms that run way faster than CPLEX. Many would not have the context, but I am sure you do. Of course these algorithms are not for any LP, but for a class of huge LPs that show up in graphical models. These belief propagation based algorithms outperform not only vanilla simplex algorithms but also the state of the art ones on these problems. Then there is culture. ML algorithms are more likely to be open sourced so that people try it, criticize it, and that is how progress is made. In stats locking it behind some proprietary stuff is the norm. R is an exception, but as a piece of software I would keep it light years away from the money.

Let me move on.

graycat> The asymptotic stuff is mostly okay: Can't figure out precisely what the darned thing will do for case n so let n go to infinity and maybe can see what happens. Then say, "for large n, this is about what happens" -- crude but better than nothing.

Yeah, n is indeed very large in most data sets, but p, the dimension, is often larger. All hell breaks loose in this situation when you go by classic theory of statistical inference.

This quandary applies to deep neural nets also. The number of parameters way exceeds the number of training data points, yet it generalizes well to unseen data. WTF is going on!

Classic prob/stats have no answers. Forget classic, even modern prob stats is struggling to explain this. A good fight is being fought to characterize their behavior. New mathematical arguments are being developed but this is clearly a work in progress, an open problem.

You have a chip on your shoulder of sorts regarding ML. That's fine, nothing wrong with that. What is wrong, however, is when you mischaracterize it because you are not reading the right material. You have to read those books and journal publications that are the moral equivalent of the prob/stats library you have.

It's not fair to compare two languages when you quote from corny porn in one but high literature from the other.

May I say Vapnik again :) And check out the main publications from COLT (computational learning theory), ICML, NIPS, etc.

I am sure you have heard of the multi-armed bandit problem. Gittins index and all that. Now solve it for the distribution free case. Even that leaves something to be desired because rewards may not be governed by a stochastic process. Give guarantees for arbitrary but bounded reward sequences. Stats literature had no answer for these cases. These are just a glimpse of the tools and results that have come from ML.
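
For readers who want to see that adversarial-bandit flavor concretely, here is a hedged sketch of EXP3, the classic algorithm with regret guarantees for arbitrary bounded reward sequences; the reward sequence below is synthetic, just to exercise the code:

    import numpy as np

    rng = np.random.default_rng(4)
    K, T, gamma = 5, 10_000, 0.05
    weights = np.ones(K)
    total_reward = 0.0

    for t in range(T):
        probs = (1 - gamma) * weights / weights.sum() + gamma / K
        arm = rng.choice(K, p=probs)
        # Arbitrary bounded rewards in [0, 1]; here a drifting synthetic sequence.
        reward = 0.5 + 0.4 * np.sin(0.001 * t + arm)
        total_reward += reward
        # Importance-weighted reward estimate, then exponential weight update.
        est = reward / probs[arm]
        weights[arm] *= np.exp(gamma * est / K)
        weights /= weights.max()   # rescaling does not change the probabilities

    print(f"average reward: {total_reward / T:.3f}")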

You observe some noisy entries from a huge matrix that is low-rank. Now fill the unseen entries and give guarantees on performance. ML has the tools here.
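
A hedged toy sketch of that matrix completion setup, using a simple SoftImpute-style iteration (SVD soft-thresholding) on synthetic low-rank data; untuned, just to show the flavor:

    import numpy as np

    rng = np.random.default_rng(5)
    m, n, r = 100, 80, 3
    M = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))      # true low-rank matrix
    mask = rng.random((m, n)) < 0.3                             # observe 30% of entries
    observed = np.where(mask, M + 0.01 * rng.normal(size=(m, n)), 0.0)

    X, tau = observed.copy(), 2.0
    for _ in range(200):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X = (U * np.maximum(s - tau, 0.0)) @ Vt      # soft-threshold singular values
        X[mask] = observed[mask]                     # keep observed entries fixed

    rel_err = np.linalg.norm((X - M)[~mask]) / np.linalg.norm(M[~mask])
    print(f"relative error on unobserved entries: {rel_err:.3f}")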

A biggie is analysis of causality thanks to Judea Pearl. Some still don't get it that conditioning and causality are fundamentally different. Just regressing Y over X will not cut it. ML has the tools here.

Anyways...

Oh Jesus! Not ANOVA again. It's a cute toy, that's all. The assumptions it needs are rarely met in the datasets we want to analyze today. Visionaries like Tukey had figured that out 3 decades ago.

Why do statisticians to this day keep using parametric tools in situations where we are drowning in data?

I agree that it's not entirely the fault of statistics that people do the above. Stats really has to shed a lot of baggage that made sense a century ago but does not make sense now.

Given your experience I am sure you know that things are rarely ever Gaussian. The central limit theorem often does not apply in its popular form because of distributions with ginormous variance. Why use tools based on that?

Classic parameter estimation guarantees are all fine in the toy 'spherical cow' world but where in stats are the non-asymptotic guarantees on prediction performance in a distribution free setting. This is precisely the raison d'être for ML. Not that there are no statisticians who have gone this route (Dawid would be one who needs mention), but way too few.

ML does indeed have a lot to offer. To be cute I would leave it at what I said before:

> non-asymptotic guarantees on prediction in a distribution free / nonparametric setting. That is what ML does.

The newer developments aim at situations where the number of parameters in the model far exceeds the data. What guarantees can you give in such situations? Surprisingly quite a bit, and contrary to what textbooks have said over the century, you can give guarantees. The algorithms work magically well, but of course you would need some (realistic) assumptions. ML is an applied science, so it borrows from and (crucially) adds to mathematics all the time. This particular line of work borrows a lot from the geometry of high dimensional random convex bodies and Banach spaces.

The most exciting part I see is the melding of the good parts of ML and stats and a good deal of communication across the stats and ML community lines. They collaborate, they publish in the same conferences, and that's just freaking fantastic. This is what we need, not some stupid fights over which is better or whether it is applied science or not.

Parting word: don't get an idea of ML from ML porn and spread untruths like

graycat> ML is .... short on math guarantees.

Instead, build a library just like you did for stats.


Part II

With the good stuff in ML, you are describing a world I've seen no solid evidence of first hand.

Maybe that's to be expected: For a long time I did well learning math in part because I picked some really good books; due to some lucky parts of my background, that was easy to do. Then I tried this in statistics and picked a junk book -- eventually figured that out.

I didn't have much respect for or progress in statistics before I had a high end background in probability from a star student of E. Cinlar at Princeton from Neveu's book, Neveu a star student of Loeve at Berkeley. Then in stochastic processes I got some more respect from work of J. Doob (prof of Halmos who then became an assistant to von Neumann, etc.) and the material on probability in the back of the Halmos, Measure Theory. The Halmos work on sufficient statistics and unbiased estimation also deserved respect.

I've seen no such quality from ML. You suggest that there is such quality out there, but what I saw, e.g., Ng's on-line lectures on ML, the idea of convolutional neural networks for image recognition, doesn't look nearly as solid as what I did in probability or nearly as interesting as some of what you point to.

E.g., just yesterday I went to

https://simplystatistics.org/

a Web site of three profs in bio-stat, and read some of the material of one of them, Jeff Leek, looked at the TOC and some of the lectures in his Coursera on-line course, some of the research problems he's working on, some of what he said about ML, what he said about cross-validation and bootstrapping in model building, etc., and concluded that with ML still I was not missing much.

I did look at

Trevor Hastie, Robert Tibshirani, and Jerome Friedman, {\it The Elements of Statistical Learning:\ \ Data Mining, Inference, and Prediction, Second Edition,\/} Springer, 2008.\ \

found the interesting looking work on over fitting and otherwise was not impressed at all.

Yes, you have more references I have yet to pursue, but what I did yesterday and other times was fast and easy and seemed to touch on some of the most respected work. But, yes, it took me some real digging and a lot of luck to find the really good stuff in probability, stochastic processes, optimization, etc., so maybe I'd need a lot of digging to find the good stuff in ML. Since ML is a more recent field, likely it has yet to receive the polish of Tukey, Halmos, Loeve, Neveu, Doob, Dynkin, Shreve, even back to Kolmogorov, etc.

I do wonder, then, what the heck the ML people are driving at? Really it looks like they are growing around the edges of old regression analysis model building, trying to fit to data (training data) and then use the fit to make predictions. Okay. But I'm not going to take 5 TB of pictures and try to recognize cats, dogs, men, women, etc. with some version of curve fitting with many thousands of variables -- that's just a long way from anything I'm doing, and more generally I'm reluctant to believe that any technique that needs that much data will be more than a niche. E.g., for intelligence, kittens, puppies, and toddlers do really well with much less data. E.g., a kitten who jumps up on the kitchen counter, walks over to the stove, and gets a hot foot learns well from just that one data point. Moreover he starts to find causality and generalize.

Really, my view is that for more in getting intelligence out of data, we should concentrate on cases where there likely is some version of causality and have our applied math look for and then use that. I didn't see ML doing that. My brother and my wife got their Ph.D. degrees in the social sciences, and there they were quite careful about causality.

For more parameters than data points, okay: Sure, in some non-linear cases, that might happen. And even in linear cases, if we just open our eyes a little, we can see that and, thus, get around over fitting in some cases. E.g., regression is really a perpendicular projection onto a linear subspace spanned by the vectors of the input data. Okay. Now just enter the input data twice. So, then the data still spans the same linear subspace and we get the same perpendicular projections. If we write out the regression normal equations, then the variance-covariance matrix will be singular and won't have an inverse. So, big heart burn. But heart burn is premature: The normal equations still have a solution, just infinitely many. So, the regression coefficients aren't unique. BUT, and apparently not many people have noticed this, all the predicted values from the given data still are unique, that is, are just the same projection. That is, take any of the infinitely many solutions you want from the normal equations and get the same predicted values from the given data (maybe not new data). That observation solves some of over fitting but not all. But for your remark, that IS a case of working with too many parameters.
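
A quick hedged numerical check of that claim with numpy on synthetic data: duplicate a column, the normal-equations matrix becomes singular and the coefficients are no longer unique, yet the fitted values (the projection) are unchanged:

    import numpy as np

    rng = np.random.default_rng(6)
    n = 50
    X = rng.normal(size=(n, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

    X_dup = np.hstack([X, X[:, [0]]])                 # enter one input column twice

    beta1, *_ = np.linalg.lstsq(X, y, rcond=None)
    beta2, *_ = np.linalg.lstsq(X_dup, y, rcond=None) # a minimum-norm solution is picked
    print(np.allclose(X @ beta1, X_dup @ beta2))      # same fitted values: True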

For more, looking at Jeff Leek on regression model building, he quickly goes into cross validation and bootstrap techniques. Okay, cross validation looks like intuitive, heuristic, Medieval stirring of pots of rat tails and frog eyes to me, and bootstrap needs some theorems (I believe I have derived some -- I know, look at Bradley Efron's work, which I've done a little, but so far I prefer my derivations) to justify and explain what is going on, but roughly Leek is correct, and I didn't see even that much care and concern in the ML materials.

If ML is still trying to milk some more utility out of the edges of simple, old regression, that is, fitting to data and using the fits to predict, then I'm not much interested: Often regression works really well; where it doesn't, and IMHO over half the time it doesn't, then I'm not going to strain to squeeze a little more out of that curve fitting paradigm, at least not now.

To me, curve fitting to build predictive models is nothing like all that can be done with statistics and only a tiny drop of what can be done with applied math.

E.g., where is the good science from simple, linear equations? From Newton, Maxwell, Einstein? Nope, none of those. Where did they make progress with fitting linear equations? Gee, Kepler used ellipses and Newton drew from those.

For fitting, the big success was Ptolemy's epi-cycles, and that still was just fitting and not causality or real science or physics.

Moreover, now I'm concentrating on my startup -- I've long since derived the math, and I see nothing in ML to even hint that their work could improve on mine.

Sure, when my startup is done, I'll take on new interests. But curve fitting? I doubt it! Real AI? Maybe. More likely physics!

Sure, maybe some day there will be a Kolmogorov, Doob, etc. to do terrific work in ML or new frontiers of statistics and/or applied math and a Loeve/Neveu to do a polished job writing up the work. Maybe.

Then there's a credibility problem: From what I've seen, nearly none of the computer science profs working in AI/ML have what Jeff Leek called "math chops" that promise they can do really good research to move statistics ahead.

I'd have a sign over the door reading "None enter without a proof that there are no countably infinite sigma algebras."!

That will select the people with good "math chops".

From your claims, maybe there actually is some good work in ML but hard to find; as I said, hard to find would not be too surprising since early in my career finding the good stuff in probability and optimization was really hard. Even in the statistics community, the good people are hard to find. I can respect E. Dynkin, E. Cinlar, J. Tukey, L. Breiman, D. Brillinger, and a few more, but for a beginner in statistics those people will be difficult to find, understand, appreciate, respect, etc. Maybe something similar is the case in ML now.

In statistics taken broadly, there is a guy from France I know and respect; he was a student of Choquet, right, a member of Bourbaki!

But, again, for ML to have my respect, tough if they are just still trying to do curve fitting.


Really do appreciate your long responses, so thank you and keep doing that.

More parameters than data points shows up a lot these days and not because of non-linearity. Now it's common to measure everything you can about an object and then worry later about what is useful. So the situation is few objects but gobs and gobs of measurements for each of them.

Of course now you have a rank deficient situation with an entire affine space for a solution. But that's no good; the application wants to find the sparsest point in that affine space in the given basis. ML has tricks up its sleeve to do that in poly time. It's quite unbelievable that this is even possible.

BTW, a little surprised that you don't talk about Lehmann much.

And you are searching for quality in the wrong place. Go for the proceedings of the COLT conference, ICML and read some Vapnik to start.


Part I

Beating linear programming as in CPLEX on networks? In some cases, that's easy: On a lot of networks, the simplex algorithm specializes in an amazing way: A basic solution corresponds to a spanning tree. To consider adding a variable to the basis, just add an arc to the tree, get a circuit, run one unit of flow around the circuit and see if that saves money, and if so run the flow around until hitting a flow capacity constraint on one of the arcs in the circuit, remove that arc from the basis, and, presto, have a simplex pivot. If you use the W. Cunningham (one of my profs) modification of strongly feasible bases, then you are guaranteed not to cycle. It's easy to program and just blindingly fast -- e.g., the arithmetic is all just addition and subtraction. It's still the simplex algorithm so that CPLEX could have been used, but for a big network problem the specialized code will likely run ballpark thousands of times faster than CPLEX.
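
Not the hand-rolled spanning-tree code described above, but a hedged toy min-cost flow instance solved with a network simplex implementation (networkx here); the instance is made up, just to show the setup:

    import networkx as nx

    G = nx.DiGraph()
    G.add_node("s", demand=-4)                 # supply 4 units
    G.add_node("t", demand=4)                  # demand 4 units
    G.add_node("a", demand=0)
    G.add_node("b", demand=0)
    G.add_edge("s", "a", capacity=3, weight=1)
    G.add_edge("s", "b", capacity=3, weight=2)
    G.add_edge("a", "t", capacity=3, weight=1)
    G.add_edge("b", "t", capacity=3, weight=1)
    G.add_edge("a", "b", capacity=2, weight=1)

    cost, flow = nx.network_simplex(G)
    print(cost)   # minimum total cost
    print(flow)   # flow on each arc as nested dicts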

For more particular cases, maybe more can be done.

Bertsekas has a polynomial algorithm for that problem. It may get applied to that wacko in Ping Pong Yang.

Network flows is an old, big, deep field, e.g.,

Ravindra K.\ Ahuja, Thomas L.\ Magnanti, James B.\ Orlin, {\it Network Flows: Theory, Algorithms, and Applications,\/} ISBN 0-13-617549-X, Prentice Hall, New Jersey, 1993.\ \

Mokhtar S.\ Bazaraa and John J.\ Jarvis, {\it Linear Programming and Network Flows,\/} ISBN 0-471-06015-1, John Wiley and Sons, New York, 1977.\ \

For analysis of variance, there is, e.g.,

George W.\ Snedecor and William G.\ Cochran, {\it Statistical Methods, Sixth Edition,\/} ISBN 0-8138-1560-6, The Iowa State University Press, Ames, Iowa, 1971.\ \

No joke it's from Iowa since it's long been important in agriculture research and testing and in experimental design and analysis much more generally. E.g., people who propose new ways to do K-12 education commonly use these techniques to good advantage.

Sure, first cut, it's regression analysis again, but with its special discreteness really it's quite different.

It's powerful stuff.

For the Gaussian assumptions in regression, don't get too concerned: The normal equations remain and don't need a Gaussian assumption -- there's still a projection and we still get the Pythagorean theorem --

total sum of squares = regression sum of squares + error sum of squares.

The Gaussian assumption yields F ratio and t-tests, and ballpark in practice those remain robust mostly ignoring the Gaussian assumption. In some wild, edge cases, might want to check the robust claim, but otherwise do get some good information about the fit.


Totally agreed. The algorithm that I was talking about is a rather curious one. It neither does vertex hopping like simplex, nor interior point iterations. It works with a factorized representation of the Lagrangian and then updates parts of it in a fixed point iteration that, if it converges, satisfies the KKT conditions. What is strange is that this is not monotonic in the dual cost (nor the primal).

As long as there is no more than 1 loop in the original problem the convergence is guaranteed. Otherwise no guarantee holds but it often does converge.


Ok did not agree :) on the Gaussian part.

Yeah, sure, it's a projection, but so what? Why and when should an L2 norm projection be useful? It would be when the tails of the residuals are Gaussian-like. Otherwise things will and do go haywire. L2 loss is too sensitive at the tails. This observation is nothing new though.

What is new however are the near magical guarantees you can get (literally rabbit out of a hat) when you use L1 projection instead. It comes at a computational cost but solves a problem that cannot be efficiently solved otherwise: Xw = y + noise.

X is a random matrix, woefully rank deficient. Exponentially more columns than rows. w \in R^n but has very few nonzeros. Now find that w. This needs rather deep tools from asymptotic geometry and Banach spaces. It's an area that's quite active in stats / ML / signal processing right now.
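
A hedged sketch of that setup on synthetic data: far more columns than rows, a sparse true w, and recovery by L1-penalized least squares (Lasso here; basis pursuit would be the pure LP version):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(10)
    n, p, k = 60, 400, 5                      # far more columns than rows
    X = rng.normal(size=(n, p)) / np.sqrt(n)  # columns roughly unit norm
    w_true = np.zeros(p)
    w_true[rng.choice(p, k, replace=False)] = rng.normal(0, 3, k)
    y = X @ w_true + 0.01 * rng.normal(size=n)

    w_hat = Lasso(alpha=0.001, fit_intercept=False, max_iter=50_000).fit(X, y).coef_
    # Relative recovery error; should be small under these (favorable) assumptions.
    print(np.linalg.norm(w_hat - w_true) / np.linalg.norm(w_true))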


> Yeah, sure, it's a projection, but so what? Why and when should an L2 norm projection be useful?

We want the answer, and more generally a predictive model.

We use the data we have to define a vector subspace. We use that subspace to creep up on the answer we want. Finally to get the answer, we project onto the subspace we have settled on. We CAN do the projection even though we don't have the answer. We take that projection on to the subspace, that point in that subspace, as our approximation to the answer.

We use L^2 because that projection, with the Pythagorean theorem, is the closest in L^2.

Being close in L^2 is a good thing: L^2 is a Hilbert space, and being Cauchy convergent means being convergent in L^2, and then there is a subsequence that converges almost surely, with probability 1, exactly, etc., and in practice it's exactly. A sequence that converges in L^2 but not exactly is pathological.

Then as we use more variables, the vector subspace gets closer to the answer and we get a better projection although maybe not a better fitting model.

To me, tails of distributions have next to nothing to do with it or the use of L^2.

In particular, the basic regression math, the normal equations, the Pythagorean theorem

total sum of squares = regression sum of squares + error sum of squares

makes no assumptions at all about distributions (one might argue that we need to be in L^2, that is, for each random variable X have E[X^2] finite, or maybe at least E[|X|] finite, that is, L^1) and certainly does not assume a Gaussian.
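
A hedged numerical check of the uncentered form of that identity, ||y||^2 = ||y_hat||^2 + ||e||^2, with deliberately non-Gaussian synthetic data:

    import numpy as np

    rng = np.random.default_rng(7)
    n, p = 200, 4
    X = rng.exponential(size=(n, p))                  # deliberately non-Gaussian
    y = X @ rng.normal(size=p) + rng.exponential(size=n)

    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta
    e = y - y_hat
    print(np.allclose(y @ y, y_hat @ y_hat + e @ e))  # Pythagorean split holds: True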

Moreover the matrix in the normal equations has variances and covariances, but these are close to just the inner product in the Hilbert space L^2 and do NOT assume a Gaussian or anything about a distribution except merely that the expectations exist, which in practice is a meager assumption, essentially always the case.

Just because we are working with variances and covariances does NOT mean that we are assuming a Gaussian.

Also notice that nowhere do we need to assume that our data is not discrete. Discrete data is still fine.

If in addition do have an appropriate Gaussian assumption, then at the end of the usual normal equations stuff, can get an F ratio and some t-tests that usually can help evaluate the quality of the fitting and the importance of the coefficients.

And for the model, we can get confidence intervals on the predicted values -- sure, will need a Gaussian assumption here, but there might be other approaches and otherwise we can hope or look for robust estimators.

L^2's fine with me.

More can be done, but the above is the first cut reason for using subspaces and L^2 projections although this explanation is rarely taught to students.

When ordinary regression doesn't work very well, then I'm reluctant to strain with much more. I might be willing in some circumstances to use what L. Breiman (I respect Breiman, Loeve student) has in his Classification and Regression Trees.


No no no you are thinking this one a little wrong.

Yes, the Pythagorean identity, or Pythagorean decomposition, whatever you want to call it, the orthogonality properties, etc., make L2 super convenient to apply and reason about, but that does not mean it is a suitable performance metric to use. This is again the phenomenon of searching for the answer where it's well lit vs. where the answer actually lies.

The problem really lies in the tails of the errors, although it might not be immediately apparent. You say variance covariance etc etc, but it does not take much for RVs not to possess them. Think about it, Chebyshev's inequality will make it clear what decay you need for those to exist.

Yes, you are right, the choice of L_2 by itself makes no assumptions. But once you start analyzing the performance of the estimator it will show up, particularly when the Fisher information is no longer strongly convex. Cramér-Rao becomes a vacuous bound in this case. The other problem is that L2 balls become infinitely larger than, say, the L1 ball as dimensionality increases. So L2 error stops being very good at localizing a vector.

The problem is that L_2 is not a good metric to measure inaccuracy with; it is super convenient to use though.

Let me demonstrate one of its well known problems. Say the error residual has tails that do not decay as fast as an exponential (the MGF does not exist around 0). You can then show that the accuracy can be arbitrarily bad. No need for super ugly densities; something as benign as just a mixture of two Gaussians with the same expectation will break it. I am just restating Huber's robustness argument here in a different way.
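
A hedged simulation of that point: contaminate zero-mean Gaussian noise with a small fraction from a zero-mean Gaussian of huge variance, and the L2-optimal location estimate (the mean) becomes far noisier than the L1-optimal one (the median); numbers are made up.

    import numpy as np

    rng = np.random.default_rng(8)

    def sample(n):
        wide = rng.random(n) < 0.05                     # 5% contamination
        return np.where(wide, rng.normal(0, 50, n), rng.normal(0, 1, n))

    means   = [np.mean(sample(200))   for _ in range(2000)]
    medians = [np.median(sample(200)) for _ in range(2000)]
    print(f"std of mean:   {np.std(means):.3f}")
    print(f"std of median: {np.std(medians):.3f}")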

Gauss himself was well aware of the problem and chose it for convenience. Even his peers were well aware of situations when L1 made more sense than L2. Unfortunately optimization theory was not well developed at that time to use L1 based methods. But now we can.

BTW you might find it interesting that the Pythagorean property is not limited to L2 squared. The set of all 'divergences' for which it holds is the Bregman divergence family. It's an if and only if condition. This is a result that came from the ML community. One side of the implication was known though. Bregman divergences are again the likelihood ratios of exponential family densities (also called the Darmois-Koopman family in older literature), hence they have strong connections with the NP lemma as well. What I find mind boggling is that Bregman divergence had its origin in the convex optimization literature! Its origins had absolutely nothing, nothing to do with probability and stats. It's amazing when two separate fields of math make contact in these ways.

BTW, if you don't mind sharing your suitably anonymized mail, I am srean.list at gmail. It will be a pleasure to discuss math with you. I find it very helpful when I can stress test an idea against a human oracle. HN might not be a forum well suited for this.


Yes, now we can also do best L^1 approximations. IIRC -- I'm in a hurry this morning -- it's a linear programming problem or some such but I haven't thought about that in decades.
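
For the record, here is a hedged sketch of best L^1 fitting posed as that linear program (coefficients w plus auxiliary variables t bounding the absolute residuals), using scipy on synthetic data:

    # Minimize sum_i t_i subject to -t <= Xw - y <= t.
    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(9)
    n, p = 100, 3
    X = rng.normal(size=(n, p))
    y = X @ np.array([2.0, -1.0, 0.5]) + rng.standard_cauchy(n) * 0.1  # heavy-tailed noise

    # Variables: [w (p, free), t (n, >= 0)]; objective is sum of t.
    c = np.concatenate([np.zeros(p), np.ones(n)])
    A_ub = np.block([[ X, -np.eye(n)],
                     [-X, -np.eye(n)]])
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * p + [(0, None)] * n

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    print(res.x[:p])    # coefficients of the L^1 (least absolute deviations) fit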

You bring up a lot of stuff I've never reviewed.

> but that does not mean it is a suitable performance metric to use.

If the plane we do the L^2 projection on is really close, then the error is really small, and what's not to like? Really close is good enough for me.

Right, there are three biggie choices, L^1 (minimize the sum of absolute values of the errors), L^2 (minimize the sum of squared errors), and L^infinity (minimize the worst error).

L^2's fine with me, good enough for government work, okay first cut, day in and day out.

If in some particular case L^2 has some problems, then maybe something like regression is not the right tool for the problem.

Gee, we do a lot of L^2: We get orthogonal components so that we can get L^2 cafeteria style, just pick the components we want. E.g., in filtering stochastic processes, we take a Fourier transform -- that is finding the coefficients of the sample path on the sine-cosine orthogonal components. Or we do a convolution, which is the same thing in the end. The approximation we get is an L^2 approximation.
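
A small hedged example of that cafeteria-style idea with numpy: keep only a few Fourier components of a synthetic signal, i.e., project onto the subspace they span:

    import numpy as np

    t = np.linspace(0, 1, 1024, endpoint=False)
    signal = np.sin(2 * np.pi * 5 * t) + 0.3 * np.sin(2 * np.pi * 120 * t)

    coeffs = np.fft.rfft(signal)
    coeffs[20:] = 0.0                     # keep only the low-frequency components
    smooth = np.fft.irfft(coeffs, n=signal.size)
    print(np.max(np.abs(smooth - np.sin(2 * np.pi * 5 * t))))   # small residual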

So, with JPG -- sure, it's L^2. Right, JPG does funny stuff near lines. Life's not perfect!

For some things, sure we want L^infinity: E.g., if we want to use a quotient of polynomials to approximate the usual special functions, then we want to minimize the worst error and do what is called Chebyshev approximation, but this is very specialized.


> it's a linear programming problem

Indeed it is. Let's catch up more sometime. Here's my email again: srean.list on gmail

Totally agreed there is a lot to like about L2, but there are plenty of situations where it is terrible (in fact some of them are in one topic of your interest: monitoring server farms). In some of those situations L1 or a combination of L1 and L2 helps a lot.


> You say variance covariance etc etc, but it does not take much for RVs not to possess them

In practice, essentially always, for real random variables X, Y, all of the following exist and are finite:

E[X], E[|X|], E[X^2], E[XY]

Var(X) = E[(X - E[X])^2]

Std(X) = Var(X)^(1/2)

Cov(X,Y) = E[(X - E[X]) (Y - E[Y])]

Cor(X,Y) = Cov(X,Y)/(Std(X) Std(Y))

E[Y|X]

E[X] and Std(X) are useful quite broadly, and the fact that we find/estimate them does not mean that we are working with a Gaussian distribution.

If the above is not true, then there is something bizarre and/or pathological, the ultimate edge case, and we need to review what we are doing.


Unfortunately it's not true in many cases I have seen, essentially because the tail of the error does not fall fast enough. Technically, for all bounded RVs all moments are bounded, but in some of these situations the variance is so high that it's infinite for practical purposes.

All you need is the tail to fall slower than a quadratic.



