important-london-94970 (05/20/2025, 6:41 PM)
victorious-lawyer-58417 (05/21/2025, 2:54 AM)
hallowed-glass-14538 (05/21/2025, 5:37 PM):
self.llama = AutoModel.from_pretrained("meta/llama-405b")
for huggingface models. And even in the future, when we have first-class artifact serialization, AutoModel.from_pretrained("meta/llama-405b") will always end up downloading the model, since huggingface controls the logic of what to download and what to load from the local cache (and huggingface's local cache won't be available remotely). Hence, persisting/caching the actual files belonging to the model is the longer-lasting approach. You can do the following (I am assuming that you are running everything remotely):
1. You can use the pattern you suggested (i.e. cache the model in a previous step using @huggingface_hub, like here). Just ensure that the first time you run this, you give that step enough resources so that it can download/cache the model faster. We resolve the reference to the model in a step before the foreach so that we don't re-download the model from huggingface multiple times during the foreach.
2. Once you have cached the model, you can load it via @model (just as you said), like here. Just ensure you provide enough disk/memory to the task so that it can load the model quickly enough. A combined sketch of both steps follows this list.
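Here is a minimal sketch of that two-step pattern, assuming the @huggingface_hub / @model decorators from the Metaflow checkpoint extensions. The accessor names (current.huggingface_hub.snapshot_download, current.model.loaded), the resource numbers, and the foreach split are my assumptions for illustration; check the linked examples for the exact signatures.

```python
from metaflow import FlowSpec, step, current, resources, huggingface_hub, model


class LlamaForeachFlow(FlowSpec):

    @huggingface_hub  # caches the downloaded snapshot in the datastore
    @resources(memory=32000, disk=500000)  # hypothetical sizing; the first run needs room to download
    @step
    def start(self):
        # Resolve the model reference once, before the foreach, so the
        # snapshot is pulled from huggingface at most one time.
        self.llama = current.huggingface_hub.snapshot_download(
            repo_id="meta/llama-405b",
        )
        self.shards = ["shard-0", "shard-1", "shard-2"]  # hypothetical foreach split
        self.next(self.process, foreach="shards")

    @model(load=["llama"])  # pulls the cached files onto this task's local disk
    @resources(memory=32000, disk=500000)
    @step
    def process(self):
        from transformers import AutoModel

        # current.model.loaded maps the artifact name to a local path, so
        # from_pretrained reads from disk instead of re-downloading.
        local_path = current.model.loaded["llama"]
        llama = AutoModel.from_pretrained(local_path)
        # ... use llama for this shard's work (self.input is the shard) ...
        self.next(self.join)

    @step
    def join(self, inputs):
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    LlamaForeachFlow()
```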
A few things to note here:
1. The huggingface_hub decorator caches based on the namespace of the execution, so if user A and user B both run the pipeline, the model may get cached twice.
a. One of the reasons for the namespacing is to avoid concurrent writes created by multiple flow executions.
b. If you want all executions to share a namespace for models/checkpoints etc., you can set a tag with the checkpoint: prefix (example: python myflow.py run --tag checkpoint:foo).
2. Ensure the bucket and the compute are in the same region (to avoid any excess network-related charges and for faster downloads).
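On note 1b, a tiny usage sketch (the tag value shared-llama is just an example): because both runs carry the same checkpoint: tag, they resolve to the same namespace and reuse one cached copy of the model.

```
# user A's first run downloads and caches the model
python myflow.py run --tag checkpoint:shared-llama

# user B's run shares the checkpoint: namespace, so it reuses the cached model
python myflow.py run --tag checkpoint:shared-llama
```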