important-london-94970 (05/20/2025, 6:41 PM)
victorious-lawyer-58417 (05/21/2025, 2:54 AM)
hallowed-glass-14538 (05/21/2025, 5:37 PM):
self.llama = AutoModel.from_pretrained("meta/llama-405b")
for huggingface models. And even in the future, when we have first-class artifact serialization, AutoModel.from_pretrained("meta/llama-405b") will always end up downloading the model, since huggingface controls the logic of what to download and what to load from the local cache (and huggingface's local cache won't be available remotely). Hence, persisting/caching the actual files belonging to the model is the longer-lasting approach. You can do the following (I am assuming that you are running everything remotely):
1. You can use the pattern you suggested (i.e. cache the model in a previous step using @huggingface_hub, like here). Just ensure that the first time you run this, you give that step enough resources so that it can download/cache the model faster. We resolve the reference to the model in a step before the foreach so that we don't re-download the model from huggingface multiple times during the foreach.
2. Once you have cached the model, you can load it via @model (just as you said), like here. Just ensure you provide enough disk/memory to the task so that it can load the model quickly enough. A combined sketch of both steps follows this list.
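Here is a minimal sketch of that two-step pattern, assuming the @huggingface_hub / @model decorators from the Metaflow checkpoint extensions. The accessor names (current.huggingface_hub.snapshot_download, current.model.loaded), the resource numbers, and the foreach split are my assumptions for illustration; check the linked examples for the exact signatures.

```python
from metaflow import FlowSpec, step, current, resources, huggingface_hub, model


class LlamaForeachFlow(FlowSpec):

    @huggingface_hub  # caches the downloaded snapshot in the datastore
    @resources(memory=32000, disk=500000)  # hypothetical sizing; the first run needs room to download
    @step
    def start(self):
        # Resolve the model reference once, before the foreach, so the
        # snapshot is pulled from huggingface at most one time.
        self.llama = current.huggingface_hub.snapshot_download(
            repo_id="meta/llama-405b",
        )
        self.shards = ["shard-0", "shard-1", "shard-2"]  # hypothetical foreach split
        self.next(self.process, foreach="shards")

    @model(load=["llama"])  # pulls the cached files onto this task's local disk
    @resources(memory=32000, disk=500000)
    @step
    def process(self):
        from transformers import AutoModel

        # current.model.loaded maps the artifact name to a local path, so
        # from_pretrained reads from disk instead of re-downloading.
        local_path = current.model.loaded["llama"]
        llama = AutoModel.from_pretrained(local_path)
        # ... use llama for this shard's work (self.input is the shard) ...
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    LlamaForeachFlow()
```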
A few things to note here:
1. The huggingface_hub decorator caches based on the namespace of the execution, so if user A and user B both run the pipeline, the model may get cached twice.
a. One of the reasons for the namespacing is to avoid concurrent writes created by multiple flow executions.
b. If you want all executions to share a namespace for models/checkpoints etc., you can set a tag with the checkpoint: prefix (example: python myflow.py run --tag checkpoint:foo).
2. Ensure the bucket and the compute are in the same region (to avoid any excess network-related charges and for faster downloads).
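On note 1b, a tiny usage sketch (the tag value shared-llama is just an example): because both runs carry the same checkpoint: tag, they resolve to the same namespace and reuse one cached copy of the model.

```
# user A's first run downloads and caches the model
python myflow.py run --tag checkpoint:shared-llama

# user B's run shares the checkpoint: namespace, so it reuses the cached model
python myflow.py run --tag checkpoint:shared-llama
```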