# dev-metaflow
c
Hey guys - I asked a while ago if metaflow has anything on the cards to handle the ever-expanding storage in S3 for models. I was told to rename the flows and then delete all the old stuff, which we have been doing every few months. I just thought it was time to check in again and see if a better pattern has emerged, as this can be quite a task as the number of flows increases. Thanks
d
hey. We are in the process of redesigning the metadata service in part to make this easier. There is no ETA yet but we are paving the way to do this (or more generally be able to better handle artifact lifecycle).
🙌 2
c
Thanks for the update @dry-beach-38304 - is this going to be something that is easy enough to switch over to from a `legacy`-type deployment of your cfn template (with some changes we have made)?
I look forward to that release in the future!!
d
we are very early in the process but yes, migration is definitely a top-of-mind concern (as well as no downtime).
🙌 1
👀 1
c
Hey @dry-beach-38304 @average-beach-28850, I am just revisiting this as we are having to rename flows every few months to avoid these costs. We run our flows every day and no longer need the artefacts after 24 - 28 hours. So I was going to just add a `LifecycleConfiguration` rule to the bucket.
Just wanted to check in with you guys and share here for anyone else that may need this: the rule will transition everything in the bucket to GLACIER storage after 3 days and delete it after 14 days. Is this the kind of pattern you guys would consider? Thanks in advance
```yaml
LifecycleConfiguration:
  Rules:
    - Id: TransitionDataToGlacier
      Status: Enabled
      Prefix: metaflow
      ExpirationInDays: 14
      Transitions:
        - StorageClass: GLACIER
          TransitionInDays: 3
```
ooh or will this break stuff because there are other files stored in there that are accessed?
I can just filter for the `metaflow/` folder, cfn updated.
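For anyone else doing this, a quick way to double-check what else lives at the top level of the bucket before scoping the rule - a rough boto3 sketch, the bucket name is a placeholder:
```python
import boto3

# List the top-level "folders" (common prefixes) in the datastore bucket so the
# lifecycle rule can be scoped to metaflow/ without touching anything else.
s3 = boto3.client("s3")
resp = s3.list_objects_v2(
    Bucket="my-metaflow-bucket",  # placeholder - use your actual bucket name
    Delimiter="/",
)
for prefix in resp.get("CommonPrefixes", []):
    print(prefix["Prefix"])
```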
d
the main issue that this may cause is that since we use a content-addressed store kind of thing, if a flow today produces the same artifact as a flow from 5 days ago, it may get deleted even though it was “produced” today (but not written again, since it already existed).
that’s the current big difficulty. The “renaming” trick works because the content-addressed store is per flow.
c
“if a flow today produces the same artifact as a flow from 5 days ago”
- can you explain this? I do not understand how this can possibly happen.
Well, for our use case all flows are essentially reading from a db -> doing some modelling -> writing to a db (so can I assume that no artefacts would somehow be shared between runs a week apart?)
d
if you never use the metaflow client to access artifacts (things stored in self.xyz) and don’t use the UI, you should be fine for the most part. There are always “hidden” artifacts (like the name of the flow, the graph for the flow, etc.). What I mean is that, basically, per flow, all artifacts are stored in a content-addressable way (i.e. the name of the file is a hash of its content). That works great for deduplication, so runs that produce the same data don’t end up incurring additional storage space, but it is not so great when you want to determine when to delete something based on creation time.
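to make that concrete, here is a toy sketch of the idea - not Metaflow’s actual code, just the general shape of a content-addressed store and why creation-time expiry clashes with it:
```python
import hashlib

# Toy content-addressed store: the key is a hash of the value, so identical
# artifacts produced by different runs map to the same object.
store = {}

def save_artifact(value: bytes) -> str:
    key = hashlib.sha256(value).hexdigest()
    if key not in store:  # same content -> deduplicated, nothing is rewritten
        store[key] = value
    return key

# A run from 5 days ago wrote the artifact; today's run "produces" the same
# bytes but never rewrites the object, so its creation time stays 5 days old
# and a creation-time lifecycle rule could expire data today's run still needs.
old_key = save_artifact(b"model weights")
new_key = save_artifact(b"model weights")
assert old_key == new_key
```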
c
Right, yeah we don’t use the UI. But we do use the client (only if a flow fails) to resume or debug etc. Hmm. Our S3 costs are just ballooning so we really want to come up with a better way to handle it. As our models increase it’s not sustainable to rename and delete every so often. 😔
d
we were literally talking about this this morning. I don’t have a good solution for you right now, except maybe to use intelligent tiering, which I think is based on access and should mitigate some of the issues of transitioning based on creation time.
that should lower costs a bit.
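something along these lines, if you go that route - a rough sketch with boto3 rather than cfn, and the bucket name is a placeholder:
```python
import boto3

# Rough sketch: move objects under metaflow/ into S3 Intelligent-Tiering, which
# shifts objects between access tiers based on how recently they were accessed
# rather than when they were created.
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-metaflow-bucket",  # placeholder - use your datastore bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "MetaflowIntelligentTiering",
                "Status": "Enabled",
                "Filter": {"Prefix": "metaflow/"},
                "Transitions": [
                    {"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            }
        ]
    },
)
```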
c
Ok thanks Romain