Generally, MLOps helps reduce engineering headaches. During our user interviews and customer calls, we realized early on that customization is key for ML model monitoring, since every model is different. So we built the framework to ease the engineering burden while staying customizable (think PyTorch). Would love to know your thoughts on this.
Thanks for the very relevant comment :) We give users the option to attach their training data from CSV/JSON files (we're working to support loading from cloud storage providers and data lakes). We have illustrated this in some of our examples, such as the human orientation classification: https://github.com/uptrain-ai/uptrain/blob/main/examples/hum...
Additionally, refinement is a key focus of ours. Figuring out the best data points to retrain the model upon has twin benefits:
1) It provides automated issue resolution and saves data scientists the effort of debugging and fixing their models.
2) It allows us to reduce false positives in alerting: we send alerts only when we see a dip in model performance or when retraining could improve model accuracy.
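To make the idea concrete, here's a minimal sketch of one common way to pick "the best data points to retrain upon" -- selecting the examples the model is least confident about. This is an illustrative stand-in, not UpTrain's actual API; the function name and threshold are assumptions.

```python
# Hypothetical sketch: flag low-confidence predictions as retraining candidates.
def select_retraining_points(predictions, threshold=0.6):
    """predictions: list of (point_id, confidence) pairs.
    Returns the ids whose confidence falls below the threshold."""
    return [pid for pid, conf in predictions if conf < threshold]

preds = [("a", 0.95), ("b", 0.40), ("c", 0.72), ("d", 0.55)]
print(select_retraining_points(preds))  # ['b', 'd']
```

In practice the selection signal could also be drift scores or disagreement with a reference model; the shape of the filter stays the same.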
Awesome! Big fan of OSS -- Arize is powerful yet expensive, so I think there's a big market there. Alerting is super tough to get right, and false positives are often worse than no alerting at all. In ML it's even harder because "data looks weird" is like 90% of the bugs.
Anyway, congrats! Excited to see where you go with this.
Yeah, so we used it and built some custom solutions at Stitch Fix. Reach out to my co-founder Stefan (also in YC '23) -- he'll have some insight for you.
Thanks for the suggestion and links. Completely agree: ML production data management can be painful, and to support model refinement for users operating at scale, an abstraction at the data layer would be a useful feature.
When training over multiple GPUs, it's hard not to think about Ray (https://docs.ray.io/en/latest/train/train.html). Ray, as an open-source project, has exploded over the last few years and helps with the memory bottleneck by separating memory and compute across workers.
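The memory point is easiest to see with a toy data-parallel sketch: each worker holds only its shard of the data and computes a local gradient, and the gradients are averaged, so no single process needs the full dataset. This is plain Python illustrating the idea, not Ray code.

```python
# Data-parallel training sketch: fit y = w * x by averaging per-shard gradients.
def local_gradient(w, shard):
    # Gradient of mean squared error on this worker's shard only.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # true w = 2
shards = [data[:2], data[2:]]            # each "GPU" sees half the data
w = 0.0
for _ in range(100):
    grads = [local_gradient(w, s) for s in shards]  # runs in parallel in Ray
    w -= 0.05 * sum(grads) / len(grads)             # the all-reduce / average
print(round(w, 2))  # → 2.0
```

With Ray, each `local_gradient` call would be a remote task on its own worker, which is exactly how the memory gets segregated.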
FYI, I am not affiliated with Ray. However, I did write the following paper on scaling data-parallel training for large ML models ;)
https://openreview.net/pdf?id=rygFWAEFwS
1. There is a lot more we want to do to make Ray better for working with large language models and for making training, serving, and batch inference work well out of the box.
2. The original post is about training, but we actually see even more interest in fine-tuning and serving with LLMs, in part because there are good pre-trained models.
3. For LLMs, we see a lot of interest in Ray + Jax or Ray + TPUs relative to what we see in other use cases.
Yes, but this will largely come down to whether the deep learning framework you're using (PyTorch, TensorFlow, Jax, etc.) works well in that setting. Ray is pretty framework- and hardware-agnostic and can be used to schedule and scale different ML frameworks on different types of devices (CPUs, GPUs, TPUs, etc.), but the actual logic for running code on the accelerators lives in the deep learning framework.
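That separation of concerns can be sketched in plain Python: the scheduler's job (Ray's role) is just to farm tasks out to workers, while the task body is where the framework-specific accelerator code would live. This is a hypothetical stand-in using a thread pool, not Ray's API.

```python
# Scheduler vs. framework split: the pool assigns work to workers; the task
# body is opaque to it and could be PyTorch, TF, or Jax code on any device.
from concurrent.futures import ThreadPoolExecutor

def framework_compute(batch):
    # In practice, this body would call into the deep learning framework,
    # which owns all the accelerator-specific logic.
    return sum(x * x for x in batch)

batches = [[1, 2], [3, 4], [5, 6]]
with ThreadPoolExecutor(max_workers=3) as pool:  # stand-in for Ray workers
    results = list(pool.map(framework_compute, batches))
print(results)  # → [5, 25, 61]
```

In Ray proper, the analogous move is decorating the task with resource requirements (e.g. a GPU per task) and letting the scheduler place it; the framework code inside the task never changes.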