thankful-father-61351
09/14/2023, 7:21 PM
/prod_utils
/projects
    /shared_utils/*
    /project_a/project_a_flow.py
    /project_a/projects/shared_utils -> ../projects/shared_utils (symlink)
    /project_a/prod_utils -> ../../prod_utils (symlink)
    /project_a/*
    /project_b/project_b_flow.py
    /project_b/projects/shared_utils -> ../projects/shared_utils (symlink)
    /project_b/prod_utils -> ../../prod_utils (symlink)
    /project_b/*
    ... (+10 projects)
As you can see, we currently use the recommended approach of sprinkling symlinks around everywhere there is a flow file. This works, but it has a lot of downstream side effects, like:
1. It is a bit of a mess
We end up with a lot of symlink files everywhere. As you can see, we also have folders containing symlinks. This is to avoid recursion that would otherwise result in HUGE code packages.
If I had used just a symlink without the extra folder:
/project_a/projects -> ../projects (symlink)
then it seems that a self-referencing cycle like /projects/project_a/projects/project_a/projects/project_a/ is detected and somehow stopped, but a cross-project cycle like /projects/project_b/projects/project_a/projects/project_b/ is not detected (see the sketch after this list).
2. Huge code packages in the experiment tracker
Our experiment tracker makes a snapshot of the code, and this becomes huge as well, since every project will contain its own copy of all the shared code.
3. It messes with code editors
E.g. when I try to jump to a function definition, I am not sent to the correct file path; instead a new tab opens a mirrored version of that file, located in a deeply nested directory.
4. Working with git becomes quite complex.
5. Using simple tools like tar and rsync becomes more error-prone.
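To make the recursion issue in point 1 concrete, here is a minimal Python sketch of walking a tree while refusing to re-enter any directory already on the current path. This is not Metaflow's actual packaging logic (which is not shown in this thread), just an illustration of the kind of cycle detection that would catch both cases, using the hypothetical layout above:

import os

def walk_without_cycles(root):
    """Yield file paths under root, following symlinked directories but
    skipping any directory whose resolved path is already an ancestor."""
    def _walk(path, chain):
        real = os.path.realpath(path)
        if real in chain:
            # The symlink points back at a directory we are already inside.
            # This catches self-cycles (/projects/project_a/projects/project_a/...)
            # as well as cross-project cycles
            # (/projects/project_b/projects/project_a/projects/project_b/...).
            return
        chain = chain | {real}
        for entry in sorted(os.listdir(path)):
            full = os.path.join(path, entry)
            if os.path.isdir(full):
                yield from _walk(full, chain)
            else:
                yield full

    yield from _walk(root, frozenset())

# E.g., list everything that would end up in project_a's code package:
for path in walk_without_cycles("projects/project_a"):
    print(path)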
For these many reasons I would love to get rid of all the symlinks and instead just specify the root of the repository in order to get the whole repo packaged. Fortunately, someone has already proposed a solution and made a pull request (although it seems to work only for Kubernetes).
I would love this PR (or some alternative) to move forward. Are there any plans for this? I would also be willing to put in some effort to make it work for Argo, even though I am quite new to both Metaflow and Argo, so I might need a few pointers.
ambitious-bird-15073
09/15/2023, 9:55 AM
@pip install, making sure to pass in a GITHUB_TOKEN since it's a private repo.
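For reference, a rough sketch of the kind of token-authenticated install being suggested here; the repo name your-org/shared-utils, the version tag, and the environment-variable handling are placeholders, not details from the thread:

import os
import subprocess
import sys

def install_shared_utils(version_tag="v1.0.0"):
    """pip-install the shared code from a private GitHub repo, passing
    the token in the URL so pip can authenticate."""
    token = os.environ["GITHUB_TOKEN"]  # assumed to be set in the runtime env
    url = f"git+https://{token}@github.com/your-org/shared-utils.git@{version_tag}"
    subprocess.check_call([sys.executable, "-m", "pip", "install", url])

if __name__ == "__main__":
    install_shared_utils()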
thankful-father-61351
09/19/2023, 6:32 AM
ambitious-bird-15073
09/19/2023, 7:36 AM
thankful-father-61351
09/19/2023, 2:19 PM
ambitious-bird-15073
09/19/2023, 2:47 PM
latest tag and then performing updates to said image with the latest tag would. It's pretty high risk because you can introduce broken images and wouldn't be able to identify which changes caused any issues.
dry-beach-38304
09/23/2023, 9:08 PM
dry-beach-38304
09/24/2023, 3:00 AM
thankful-father-61351
09/28/2023, 11:08 AM
The pip-based solutions are already slightly off topic for me, because they currently break the synchronisation and force me to run different versions of the code. However, the idea of taking my shared code, decoupling it more and putting it into a separate package might actually be a good one, and some people would probably argue that it is the "right" solution. However, there is a pretty tight coupling between the projects and the shared-between-projects code, so it would require a bit of work. I am also not sure decoupling would come without a lot of version bumping to constantly update the shared-code version, and it would require me to push changes to the shared code first, when sometimes you just want to change things in the projects and in the shared code at the same time.
The other solution suggested by @ambitious-bird-15073 is to package all the code into a fresh Docker image every time I run a flow. But I would then have to reference this image: maybe generate a random string in the start step, build an image with that as its tag, and then use decorators with image=<random string>. But this is also pretty hacky in my opinion.
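For what it's worth, the image-per-run idea being described would look roughly like the sketch below; the registry name is a placeholder and nothing here comes from the thread itself (it is exactly the part the user calls hacky):

import subprocess
import uuid

# Placeholder registry/repository; substitute your own.
IMAGE_REPO = "registry.example.com/my-team/flows"

# The "random string" tag generated per run.
tag = uuid.uuid4().hex[:12]
image = f"{IMAGE_REPO}:{tag}"

# Bake the entire repo (symlink-free) into a fresh image and push it.
subprocess.check_call(["docker", "build", "-t", image, "."])
subprocess.check_call(["docker", "push", image])

# Then point the flow's Kubernetes decorator at the freshly built image, e.g.:
#   python projects/project_a/project_a_flow.py run --with kubernetes:image=<image>
print(f"built and pushed {image}")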