Hacker News | vvipgupta's comments

Thanks for letting us know. We did some docs restructuring before the launch, and missed fixing this link. It is now available here: https://docs.uptrain.ai/docs/uptrain-examples/quickstart-tut...


Generally, MLOps helps reduce engineering headaches. During our user interviews and customer calls, we realized very early that customization is key for ML model monitoring, since all models are different. Thus, we have built the framework to lessen the engineering headache while allowing customizability (think PyTorch). Would love to know your thoughts on this.
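As a sketch of what "customizable like PyTorch" can mean in practice for monitoring — users defining their own checks as plain code rather than fixed config. The class and method names below are hypothetical for illustration, not UpTrain's actual API:

```python
class CustomCheck:
    """Hypothetical user-defined monitoring check: flags a batch when a
    feature's mean drifts past a threshold from the training-time baseline."""

    def __init__(self, baseline_mean, threshold=0.5):
        self.baseline_mean = baseline_mean
        self.threshold = threshold

    def check(self, batch):
        # Compare the live batch's mean against the training baseline.
        drift = abs(sum(batch) / len(batch) - self.baseline_mean)
        return drift > self.threshold  # True => raise an alert

check = CustomCheck(baseline_mean=0.0, threshold=0.5)
ok = check.check([0.1, -0.2, 0.05])      # small drift -> False
drifted = check.check([1.2, 0.9, 1.1])   # large drift -> True
```

The point of the design is that the check is just Python: anything computable over a batch can become a monitoring signal.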


Just curious, what kind of use cases do you have in mind?


Thanks for the very relevant comment :) We give users the option to attach their training data from CSV/JSON (and we are working to support loading from cloud storage providers and data lakes). We have illustrated this in some of our examples, such as the human orientation classification: https://github.com/uptrain-ai/uptrain/blob/main/examples/hum...
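A minimal sketch of what attaching training data from JSON can look like (the schema and field names here are made up for illustration): load the reference rows, then compute a baseline statistic that live traffic can be compared against.

```python
import json
import statistics

# Hypothetical training-data export: a list of {"feature": ..., "label": ...} rows.
raw = '[{"feature": 0.9, "label": 1}, {"feature": 1.1, "label": 0}, {"feature": 1.0, "label": 1}]'

# "Attach" the training data: parse it and derive a monitoring baseline.
reference = json.loads(raw)
baseline_mean = statistics.mean(row["feature"] for row in reference)
# baseline_mean -> 1.0; production batches get compared against this value
```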


Additionally, refinement is a key focus of ours. Figuring out the best data points to retrain the model upon has twin benefits:

1) It provides automated issue resolution and saves data scientists' effort in debugging and fixing their models.

2) It lets us reduce false positives in alerting: we send alerts only when we see a dip in model performance or when retraining could improve model accuracy.
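A toy sketch of both ideas (the function names, thresholds, and selection heuristic are illustrative, not the actual implementation): pick the least-confident predictions as retraining candidates, and alert only when accuracy dips below the baseline by more than a tolerance.

```python
def select_retraining_points(confidences, k):
    """Return indices of the k least-confident predictions --
    the data points most likely to help the model when relabeled and
    added to the retraining set."""
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return order[:k]

def should_alert(current_accuracy, baseline_accuracy, tolerance=0.02):
    """Alert only on a genuine dip, to cut false positives."""
    return current_accuracy < baseline_accuracy - tolerance

picks = select_retraining_points([0.99, 0.55, 0.80, 0.51], k=2)
# picks -> [3, 1]: the two least-confident predictions
alert = should_alert(current_accuracy=0.88, baseline_accuracy=0.93)
# alert -> True: 0.88 < 0.93 - 0.02
```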


Awesome! Big fan of OSS -- Arize is powerful yet expensive, so I think there's a big market there. Alerting is super tough to get right, and false positives are often worse than no alerting at all. In ML it's even harder because "data looks weird" accounts for something like 90% of the bugs.

Anyway, congrats! Excited to see where you go with this.


Thanks! Also, wondering how you heard about Arize? Have you dealt with the pain of ML model monitoring in the past?


Yeah, so we used it and built some custom solutions at Stitch Fix. Reach out to my co-founder Stefan (also in YC '23) -- he'll have some insight for you.


Thanks! Reaching out to Stefan


Thanks for the suggestion and links. Completely agree: ML production data management can be painful, and to support model refinement for users operating at scale, an abstraction at the data layer would be a useful feature.


The same is true for data (i.e., gradient) consistency while training large ML models. Asynchronous SGD is as good as (and maybe even faster than) synchronous SGD: https://papers.nips.cc/paper/2011/file/218a0aefd1d1a4be65601...
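A toy illustration of why the asynchrony is tolerable — Hogwild-style lock-free updates on a simple quadratic objective. This is a sketch of the general idea, not the cited paper's setup:

```python
import threading
import numpy as np

# Toy objective f(w) = ||w - target||^2, minimized at w = target.
target = np.array([3.0, -2.0])
w = np.zeros(2)  # shared parameters, updated by all workers without locks

def worker(steps=500, lr=0.01):
    for _ in range(steps):
        grad = 2.0 * (w - target)  # gradient computed on a possibly stale w
        w[:] -= lr * grad          # in-place update; no synchronization

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Despite stale gradients and unsynchronized writes, w converges
# close to the optimum, because each update still contracts the error.
```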


You might also want to check out https://github.com/lucidrains/PaLM-rlhf-pytorch


When training over multiple GPUs, it's hard not to think about Ray (https://docs.ray.io/en/latest/train/train.html). Ray, as an open-source project, has exploded over the last few years and helps with the memory bottleneck by separating memory and compute.

FYI, I am not affiliated with Ray. However, I did write the following paper on scaling data-parallel training for large ML models ;) https://openreview.net/pdf?id=rygFWAEFwS

Also, another of my papers addresses reducing the communication bottleneck in distributed training: https://dl.acm.org/doi/pdf/10.1145/3447548.3467080
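For flavor, top-k gradient sparsification is one standard way to cut communication in data-parallel training. This is a generic sketch of the technique, not the specific method from the paper above:

```python
import numpy as np

def topk_sparsify(grad, k):
    """Keep only the k largest-magnitude entries of a gradient vector.
    Workers then exchange just k (index, value) pairs instead of the full
    dense vector, shrinking per-step communication from O(d) to O(k)."""
    idx = np.argsort(np.abs(grad))[-k:]   # indices of the k largest |grad| entries
    sparse = np.zeros_like(grad)
    sparse[idx] = grad[idx]
    return sparse

g = np.array([0.1, -5.0, 0.2, 3.0])
sparse_g = topk_sparsify(g, k=2)
# sparse_g -> [0., -5., 0., 3.]: only the two largest-magnitude entries survive
```

In practice the dropped residual is usually accumulated locally and added back into later gradients, so no signal is permanently lost.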


I'm one of the Ray developers, thanks for the shoutout :)

If you're curious about how Ray is used for LLMs, here are some interesting examples of LLM projects using Ray!

- Alpa does training and serving with 175B parameter models https://github.com/alpa-projects/alpa

- GPT-J https://github.com/kingoflolz/mesh-transformer-jax

- Another HN thread on training LLMs with Ray (on TPUs in this case) https://news.ycombinator.com/item?id=27731168

- OpenAI fireside chat on the evolution of their infrastructure and usage of Ray for training https://www.youtube.com/watch?v=CqiL5QQnN64

- Cohere on their architecture for training LLMs https://www.youtube.com/watch?v=For8yLkZP5w&t=3s

Some other thoughts:

1. There is a lot more we want to do to make Ray better for working with large language models and for making training, serving, and batch inference work well out of the box.

2. The original post is about training, but we actually see even more interest in fine-tuning and serving with LLMs, in part because there are good pre-trained models.

3. For LLMs, we see a lot of interest in Ray + Jax or Ray + TPUs relative to what we see in other use cases.


Do you see any convergence on wire (Arrow?) and storage (pandas?) formats?


And we can make Ray more efficient by optimizing GPU hardware utilization https://centml.ai/


Will it work with a PC that has 7 AMD Vega GPUs?


Yes, but this will largely come down to whether the deep learning framework that you're using (PyTorch, TensorFlow, Jax, etc.) works well in that setting. Ray is pretty framework- and hardware-agnostic and can be used to schedule/scale different ML frameworks on different types of devices (CPUs, GPUs, TPUs, etc.), but the actual logic for running code on the accelerators lives in the deep learning framework.


This would be a good application of ChatGPT: testing whether a test checks for rote learning or for fundamentals.

