# dev-metaflow
Hi team, any ideas on how users can monitor their gang-scheduled cluster when using the @parallel decorator? There is the @gpu_profile decorator, which is a good start, but it is not real-time; it profiles usage after training. For training LLMs, you typically want to monitor your cluster during training. By way of comparison, Ray's UI lets users monitor their clusters in real time (see screenshot below).
If you use a service such as DataDog, you can add an agent inside the K8s cluster to handle this and view real-time metrics in DataDog. That's what I use for all my API endpoints deployed on K8s, and I'll use it for Metaflow as well once I migrate my setup over to K8s.
@acoustic-van-30942 we are working on making cards real-time, which should provide a similar view
That's awesome @ancient-application-36103! Thanks so much. True, @ambitious-bird-15073. We haven't migrated to K8s yet, though, and you're right that there are ways to monitor the metrics through DataDog or NewRelic. As it stands, we are locked into AWS Batch.
@acoustic-van-30942 you can still do it with AWS Batch and DataDog via AWS ECS: https://www.datadoghq.com/blog/monitoring-ecs-with-datadog/
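For anyone who wants something in the meantime, here is a rough sketch (not a Metaflow feature, and not what the real-time cards work will look like) of polling GPU stats from inside a training step while it runs. It assumes `nvidia-smi` is on the PATH in the task container; the helper names `parse_gpu_csv`, `poll_gpus`, and `start_gpu_monitor` are made up for illustration, and the `emit` callback is just `print` by default.

```python
import subprocess
import threading

# nvidia-smi query flags; the selected fields are one reasonable choice, not a requirement.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse nvidia-smi CSV output into a list of dicts, one per GPU."""
    rows = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 4:
            continue  # skip lines that do not match the queried fields
        try:
            rows.append({
                "gpu": int(parts[0]),
                "util_pct": float(parts[1]),
                "mem_used_mib": float(parts[2]),
                "mem_total_mib": float(parts[3]),
            })
        except ValueError:
            continue  # skip rows with non-numeric values such as "[N/A]"
    return rows


def poll_gpus(stop_event, interval_s=10.0, emit=print):
    """Emit one record per GPU every interval_s seconds until stop_event is set."""
    while not stop_event.is_set():
        try:
            out = subprocess.run(
                QUERY, capture_output=True, text=True, check=True, timeout=30
            ).stdout
            for row in parse_gpu_csv(out):
                emit(row)
        except (OSError, subprocess.SubprocessError) as exc:
            emit({"error": str(exc)})  # e.g. nvidia-smi missing or timed out
        stop_event.wait(interval_s)


def start_gpu_monitor(interval_s=10.0, emit=print):
    """Start the poller in a daemon thread; returns (stop_event, thread)."""
    stop_event = threading.Event()
    thread = threading.Thread(
        target=poll_gpus, args=(stop_event, interval_s, emit), daemon=True
    )
    thread.start()
    return stop_event, thread
```

The idea would be to call `start_gpu_monitor()` at the top of each @parallel task and `stop_event.set()` once training finishes. With the default `emit=print` the records land in the task's stdout, so they show up in the task logs while the step is still running; swapping `emit` for a function that forwards to DataDog or another metrics backend is left as an assumption here.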