Hacker News | vvipgupta's comments

Thanks for letting us know. We did some docs restructuring before the launch, and missed fixing this link. It is now available here: https://docs.uptrain.ai/docs/uptrain-examples/quickstart-tut...


Generally, MLOps helps reduce engineering headaches. During our user interviews and customer calls, we realized very early that customization is key for ML model monitoring, since all models are different. Thus, we have built the framework to lessen the engineering headache while allowing customizability (think PyTorch). Would love to know your thoughts on this.
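As a sketch of what "customizable like PyTorch" can mean in practice for monitoring — users defining their own checks as plain code rather than fixed config. The class and method names below are hypothetical for illustration, not UpTrain's actual API:

```python
class CustomCheck:
    """Hypothetical user-defined monitoring check: flags a batch when a
    feature's mean drifts past a threshold from the training-time baseline."""

    def __init__(self, baseline_mean, threshold=0.5):
        self.baseline_mean = baseline_mean
        self.threshold = threshold

    def check(self, batch):
        # Compare the live batch's mean against the training baseline.
        drift = abs(sum(batch) / len(batch) - self.baseline_mean)
        return drift > self.threshold  # True => raise an alert

check = CustomCheck(baseline_mean=0.0, threshold=0.5)
ok = check.check([0.1, -0.2, 0.05])      # small drift -> False
drifted = check.check([1.2, 0.9, 1.1])   # large drift -> True
```

The point of the design is that the check is just Python: anything computable over a batch can become a monitoring signal.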


Just curious, what kind of use cases do you have in mind?


Thanks for the very relevant comment :) We give users the option to attach their training data from CSV/JSON (and we are working to support loading from cloud storage providers and data lakes). We have illustrated this in some of our examples, such as the human orientation classification: https://github.com/uptrain-ai/uptrain/blob/main/examples/hum...
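A minimal sketch of what attaching training data from JSON can look like (the schema and field names here are made up for illustration): load the reference rows, then compute a baseline statistic that live traffic can be compared against.

```python
import json
import statistics

# Hypothetical training-data export: a list of {"feature": ..., "label": ...} rows.
raw = '[{"feature": 0.9, "label": 1}, {"feature": 1.1, "label": 0}, {"feature": 1.0, "label": 1}]'

# "Attach" the training data: parse it and derive a monitoring baseline.
reference = json.loads(raw)
baseline_mean = statistics.mean(row["feature"] for row in reference)
# baseline_mean -> 1.0; production batches get compared against this value
```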


Additionally, refinement is a key focus of ours. Figuring out the best data points to retrain the model upon has twin benefits:

1) It provides automated issue resolution and saves data scientists' effort in debugging and fixing their models.

2) It lets us reduce false positives in alerting: we send alerts only when we see a dip in model performance or when retraining could improve model accuracy.
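A toy sketch of both ideas (the function names, thresholds, and selection heuristic are illustrative, not the actual implementation): pick the least-confident predictions as retraining candidates, and alert only when accuracy dips below the baseline by more than a tolerance.

```python
def select_retraining_points(confidences, k):
    """Return indices of the k least-confident predictions --
    the data points most likely to help the model when relabeled and
    added to the retraining set."""
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return order[:k]

def should_alert(current_accuracy, baseline_accuracy, tolerance=0.02):
    """Alert only on a genuine dip, to cut false positives."""
    return current_accuracy < baseline_accuracy - tolerance

picks = select_retraining_points([0.99, 0.55, 0.80, 0.51], k=2)
# picks -> [3, 1]: the two least-confident predictions
alert = should_alert(current_accuracy=0.88, baseline_accuracy=0.93)
# alert -> True: 0.88 < 0.93 - 0.02
```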


Awesome! Big fan of OSS -- Arize is powerful yet expensive, so I think there's a big market there. Alerting is super tough to get right, and false positives are often worse than no alerting at all. In ML it's even harder because "data looks weird" accounts for something like 90% of the bugs.

Anyway, congrats! Excited to see where you go with this.


Thanks! Also, wondering how you heard about Arize? Have you dealt with the pain of ML model monitoring in the past?


Yeah, so we used it and built some custom solutions at Stitch Fix. Reach out to my co-founder Stefan (also in YC '23) -- he'll have some insight for you.


Thanks! Reaching out to Stefan


Thanks for the suggestion and links. Completely agree: ML production data management can be painful, and to support model refinement for users operating at scale, an abstraction at the data layer would be a useful feature.


The same is true for data (i.e., gradient) consistency while training large ML models. Asynchronous SGD is as good as (and maybe even faster than) synchronous SGD: https://papers.nips.cc/paper/2011/file/218a0aefd1d1a4be65601...
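A toy illustration of why the asynchrony is tolerable — Hogwild-style lock-free updates on a simple quadratic objective. This is a sketch of the general idea, not the cited paper's setup:

```python
import threading
import numpy as np

# Toy objective f(w) = ||w - target||^2, minimized at w = target.
target = np.array([3.0, -2.0])
w = np.zeros(2)  # shared parameters, updated by all workers without locks

def worker(steps=500, lr=0.01):
    for _ in range(steps):
        grad = 2.0 * (w - target)  # gradient computed on a possibly stale w
        w[:] -= lr * grad          # in-place update; no synchronization

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Despite stale gradients and unsynchronized writes, w converges
# close to the optimum, because each update still contracts the error.
```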


You might also want to check out https://github.com/lucidrains/PaLM-rlhf-pytorch


When training over multiple GPUs, it's hard not to think about Ray (https://docs.ray.io/en/latest/train/train.html). Ray, as an open-source project, has exploded over the last few years and helps with the memory bottleneck by separating memory and compute.

FYI, I am not affiliated with Ray. However, I did write the following paper on scaling data-parallel training for large ML models ;) https://openreview.net/pdf?id=rygFWAEFwS

Also, another of my papers addresses reducing the communication bottleneck in distributed training: https://dl.acm.org/doi/pdf/10.1145/3447548.3467080
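For flavor, top-k gradient sparsification is one standard way to cut communication in data-parallel training. This is a generic sketch of the technique, not the specific method from the paper above:

```python
import numpy as np

def topk_sparsify(grad, k):
    """Keep only the k largest-magnitude entries of a gradient vector.
    Workers then exchange just k (index, value) pairs instead of the full
    dense vector, shrinking per-step communication from O(d) to O(k)."""
    idx = np.argsort(np.abs(grad))[-k:]   # indices of the k largest |grad| entries
    sparse = np.zeros_like(grad)
    sparse[idx] = grad[idx]
    return sparse

g = np.array([0.1, -5.0, 0.2, 3.0])
sparse_g = topk_sparsify(g, k=2)
# sparse_g -> [0., -5., 0., 3.]: only the two largest-magnitude entries survive
```

In practice the dropped residual is usually accumulated locally and added back into later gradients, so no signal is permanently lost.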


I'm one of the Ray developers, thanks for the shoutout :)

If you're curious about how Ray is used for LLMs, here are some interesting examples of LLM projects using Ray!

- Alpa does training and serving with 175B parameter models https://github.com/alpa-projects/alpa

- GPT-J https://github.com/kingoflolz/mesh-transformer-jax

- Another HN thread on training LLMs with Ray (on TPUs in this case) https://news.ycombinator.com/item?id=27731168

- OpenAI fireside chat on the evolution of their infrastructure and usage of Ray for training https://www.youtube.com/watch?v=CqiL5QQnN64

- Cohere on their architecture for training LLMs https://www.youtube.com/watch?v=For8yLkZP5w&t=3s

Some other thoughts:

1. There is a lot more we want to do to make Ray better for working with large language models and for making training, serving, and batch inference work well out of the box.

2. The original post is about training, but we actually see even more interest in fine-tuning and serving with LLMs, in part because there are good pre-trained models.

3. For LLMs, we see a lot of interest in Ray + Jax or Ray + TPUs relative to what we see in other use cases.


Do you see any convergence on wire (Arrow?) and storage (pandas?) formats?


And we can make Ray more efficient by optimizing GPU hardware utilization https://centml.ai/


Will it work with a PC that has 7 AMD Vega GPUs?


Yes, but this will largely come down to whether the deep learning framework that you're using (PyTorch, TensorFlow, Jax, etc.) works well in that setting. Ray is pretty framework- and hardware-agnostic and can be used to schedule/scale different ML frameworks on different types of devices (CPUs, GPUs, TPUs, etc.), but the actual logic for running code on the accelerators lives in the deep learning framework.


This would be a good application of ChatGPT: testing whether a test checks for rote learning or for fundamentals.

