# dev-metaflow
t
Hi. I am struggling quite a bit with applying Metaflow to our repository, which currently consists of many projects that all share a lot of common utility code. It looks something like this:
```
/prod_utils
/projects
   /shared_utils/*
   /project_a/project_a_flow.py
   /project_a/projects/shared_utils -> ../projects/shared_utils (symlink)
   /project_a/prod_utils -> ../../prod_utils (symlink)
   /project_a/*
   /project_b/project_b_flow.py
   /project_b/projects/shared_utils -> ../projects/shared_utils (symlink)
   /project_b/prod_utils -> ../../prod_utils (symlink)
   /project_b/*
   ... (+10 projects)
```
As you can see, we currently use the recommended approach of sprinkling symlinks around everywhere there is a flow file. This works, but it has a lot of downstream side-effects, like:
1. It is a bit of a mess. We end up with a lot of symlink files everywhere. As you can see, we also have folders containing symlinks. This is to avoid recursion that would otherwise result in HUGE code packages. If I had used just a symlink without the extra folder:
```
/project_a/projects -> ../projects (symlink)
```
then it seems like this kind of recursion cycle, `/projects/project_a/projects/project_a/projects/project_a/`, is detected and somehow stopped, but this kind of cycle, `/projects/project_b/projects/project_a/projects/project_b/`, is not detected.
2. Huge code packages in the experiment tracker. Our experiment tracker makes a snapshot of the code, and this becomes huge as well, since every project will contain its own copy of all the shared code.
3. It messes with code editors. E.g. when I try to jump to a function definition I am not sent to the correct file path; instead a new tab opens a mirrored version of that file located in a deeply nested directory.
4. Working with `git` becomes quite complex.
5. Using simple tools like `tar` and `rsync` becomes more error-prone.
For these many reasons I would love to get rid of all the symlinks and just specify the root of the repository in order to get the whole repo packaged instead. Fortunately, someone has already proposed a solution and made a pull request (although it seems to work only for Kubernetes). I would love this PR (or some alternative) to move forward. Are there any plans for this? I would also be willing to put in some effort to make it work for Argo, even though I am quite new to both Metaflow and Argo, so I might need a few pointers.
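(For context, here is a minimal sketch of what a flow file in this layout presumably looks like; the module and function names are made up, not from the actual repo. The steps import shared code that lives outside the project folder, so that code must end up inside the code package, which is exactly what the symlinks are there for.)
```python
# Illustrative sketch of projects/project_a/project_a_flow.py (names are
# hypothetical): step code imports shared utilities that live outside
# project_a/, so those directories must be packaged next to the flow file.
from metaflow import FlowSpec, step


class ProjectAFlow(FlowSpec):

    @step
    def start(self):
        # Resolves only if projects/shared_utils ends up inside the code
        # package next to the flow file -- the role of the shared_utils
        # symlink under project_a/projects/.
        from projects.shared_utils import preprocessing  # hypothetical module
        self.data = preprocessing.load()
        self.next(self.end)

    @step
    def end(self):
        # Same story for the prod_utils symlink at the project root.
        from prod_utils import publishing  # hypothetical module
        publishing.report(self.data)


if __name__ == "__main__":
    ProjectAFlow()
```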
đź‘€ 1
a
If you have a lot of common utilities, why not create a library for them and then either: 1. Create Docker images with said library packaged in them, or 2. Use `@pip` to install it, making sure to pass in a `GITHUB_TOKEN` since it's a private repo.
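(A rough sketch of option 1, assuming a hypothetical image that already has the shared library installed; the image name, library, and functions are illustrative. Option 2 would instead install the library at step startup via the pip/pypi decorator rather than baking it into the image.)
```python
# Sketch of option 1: shared_utils is built into a Docker image ahead of time
# (image name and library are hypothetical) and every remote step runs on that
# image, so only the flow file itself needs to be code-packaged.
from metaflow import FlowSpec, kubernetes, step


class ProjectAFlow(FlowSpec):

    # Pin an explicit tag rather than :latest so the deployed flow keeps
    # running against a known version of the shared code.
    @kubernetes(image="registry.example.com/shared-utils:1.4.2")
    @step
    def start(self):
        from shared_utils import preprocessing  # installed inside the image
        self.data = preprocessing.load()
        self.next(self.end)

    @step
    def end(self):
        print("done")


if __name__ == "__main__":
    ProjectAFlow()
```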
t
@ambitious-bird-15073 1. That would only really work well if you don't update your library very often, right? Since it requires you to build a new Docker image each time you want to run your updated code. Or am I getting it wrong? 2. As far as I understand, this would lead to code synchronisation issues? You would have a repo that references old versions of itself. I cannot see how this would be a viable solution, but I am also not sure I fully understand.
a
Right, but regardless, every time there is a change to the base utilities you will need to redeploy the Flows for those changes to be reflected, whether you package them as a library or not.
t
@ambitious-bird-15073 Also with symlinks? Why do you think so?
a
Yes, deployments capture the state of the code at a particular point in time. They don't get updated on the fly behind the scenes the way a Docker image referenced in the pipeline by its `latest` tag would if you keep pushing updates to that tag. That approach is pretty high risk, because you can introduce broken images and wouldn't be able to identify which changes caused any issues.
d
While I will strongly agree with @ambitious-bird-15073 about the dangers of a morphing environment, which can break already deployed and executed flows, it is technically possible to package all your code, make it accessible as an environment, update it, and have it be picked up by current flows the next time they run. You'll need an extension (see here, and also on PyPI to install: https://github.com/Netflix/metaflow-nflx-extensions). You can look at the docs about using named environments and fetch_at_exec (or some combination of those) to do what you want (hopefully). It may not work in all cases and there is no guarantee that this will be supported in the mainline OSS, but we use this at Netflix and there are definitely use cases of people building a dynamic environment in one step just prior to using it in another step. You typically have to use the CLI to build the env. Again, a word of caution: this is definitely more advanced use and there may be simpler solutions for your use case. In general, I would agree that we need a better story for "packaging code". Another option, by the way, is packaging your code in a Metaflow extension so it will be available, but that has the same issue of not auto-updating. Anyways, hopefully this gives you a few tools/ideas. It's hard for me to recommend a specific solution because I am not 100% sure what your full use case and usage pattern is. Feel free to ask more questions if this wasn't what you were looking for.
🙌🏽 1
t
@dry-beach-38304 I think we are slightly off topic now. What I want is to make sure my flow definition and the logic running inside the steps are in sync, while the logic inside the steps imports code from parent folders (shared between flows, but that does not really matter). E.g. I don't want the current version of my flow to import previous versions of the logic inside the steps.
• One way to achieve this is symlinks, which I really don't like.
• Another way is to move my flow file to the repository root at run time. Also a terrible solution, but much better than symlinks (what I currently do).
• The last possible way I can think of is to build an extension and use @add_to_package (what I will have to do).
• An even better option would be to get this basic functionality into Metaflow itself. (Since there is no response on the pull request, I find this unlikely.)
@ambitious-bird-15073's suggestion regarding `pip` is already slightly off topic because it currently breaks the synchronisation and forces me to run different versions of the code. However, the idea of taking my shared code, decoupling it more, and putting it into a separate package might actually be a good one, and some people would probably argue that it is the "right" solution. However, there is a pretty tight coupling between the projects and the shared-between-projects code, so it would require a bit of work, and I am not sure decoupling will come without a lot of version bumping to constantly update the shared code version. It would also require me to push changes to the shared code first, and sometimes you just want to change things in the projects and in the shared code at the same time. The other solution suggested by @ambitious-bird-15073 is to package all the code into a fresh Docker image every time I run a flow. But I would then have to reference this image, e.g. generate a random string in the start step, build an image with that as the tag, and then use decorators with `image=<random string>`. But this is also pretty hacky in my opinion.
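(For reference, a very rough sketch of the @add_to_package direction mentioned above. It assumes that Metaflow's StepDecorator exposes an add_to_package() hook returning (file_path, archive_name) pairs, and it omits the metaflow_extensions registration needed to make the decorator usable; both points should be verified against the Metaflow version and extension docs in use.)
```python
# Rough sketch of a custom step decorator that walks the repository root and
# adds every .py file to the code package, so shared code outside the flow's
# directory gets shipped without symlinks.
#
# Assumptions (verify before relying on this):
#   - StepDecorator.add_to_package() is called while the code package is built
#     and returns an iterable of (absolute_file_path, archive_name) tuples.
#   - The decorator still has to be registered through a metaflow_extensions
#     package (e.g. listed among the extension's step decorators); that wiring
#     is not shown here.
import os

from metaflow.decorators import StepDecorator


class RepoRootPackageDecorator(StepDecorator):
    """Hypothetical @repo_root_package decorator that ships the whole repo."""

    name = "repo_root_package"
    defaults = {"root": None}  # path to the repository root

    def add_to_package(self):
        root = os.path.abspath(self.attributes["root"] or ".")
        files = []
        for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
            # Skip common junk so the package stays small.
            dirnames[:] = [d for d in dirnames if d not in (".git", "__pycache__")]
            for fname in filenames:
                if not fname.endswith(".py"):
                    continue
                path = os.path.join(dirpath, fname)
                # Store files relative to the repo root so imports like
                # `from projects.shared_utils import ...` resolve at runtime.
                arcname = os.path.relpath(path, root)
                files.append((path, arcname))
        return files
```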