# ask-metaflow
a
hey Metaflow team, I wanted to see whether it's feasible to do something like version and track the packages built for each flow (not the runs, but rather the code backing the different runs). Basically, I have an org where there are people who write code around statistics and other math functions but are not engineering savvy. With this in mind, I wanted to see whether I could do something like the following:

1. I define a somewhat abstract flow that has static inputs and outputs
2. Someone else just gives me a function that adheres to that static interface (a simple statistical function that takes some well-defined data inputs and produces some well-defined outputs) — a sketch of such an interface follows after this message
3. I'm able to run the same higher-level flow, which handles the data input handling and the function output handling in a consistent way
4. I build some index of the packages built & run for the different custom functions, so that I can:
   a. Re-run a specific package with different data inputs
   b. Produce some index of the packages and runs

The main goal here is to close the gap between the more engineer-y interface of Metaflow and the technical capabilities of people who are more math/statistics focused... but I think there are a few things that are hard:

1. Dynamically creating a `FlowSpec` subclass such that one of the steps calls whatever random function gets thrown into the mix
2. Managing the underlying packages backing the different runs

I might be way over-complicating this, so I would appreciate any thoughts or pointers! I'm happy to dig into the code and work through some of the internal APIs for this. I do realize there are security concerns with executing arbitrary functions in this manner, but I think that is manageable in the environment we work in
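To make the "static interface" idea in point 2 concrete, here is a minimal sketch of what a contributed function could look like; the name `summarize`, its signature, and the returned fields are purely illustrative and not anything Metaflow prescribes:

```python
from typing import Dict, List


def summarize(values: List[float]) -> Dict[str, float]:
    """Hypothetical example of a function adhering to the agreed static
    interface: well-defined data in, well-defined statistics out."""
    n = len(values)
    mean = sum(values) / n if n else 0.0
    variance = sum((v - mean) ** 2 for v in values) / n if n else 0.0
    return {"count": float(n), "mean": mean, "variance": variance}
```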
h
If I understand correctly, you're looking to have some dynamically updated registry of functions that adhere to some API, and then parameterize the flow to invoke some combination of those functions? Do those functions have their own dependencies that are different from those of the flow?
a
I think we would be able to get by with having one set of static dependencies (maybe similar to managing a single docker image or something like that), but I imagine it might be nice as a secondary goal to enable new dependencies
and yes, I think you got what I was trying to (poorly) explain 🙂 haha. And I think starting out, even just having a single function abstracted would suffice
h
We are working on something called "relocatable functions" that would address this use case, but it's not released in open source yet. The idea is that we bundle up the code, environment, and artifacts of a function such that it can be invoked in different environments without having to worry about the dependencies.

As a current workaround, here is one idea: have one global library/package that everyone commits their functions to, and use named environments in the flow such that the environment is updated to the latest version of the package whenever a function is added. This basically allows you to dynamically update the environment the flow is using without having to re-deploy it, and you can parameterize the flow to take the fully qualified module name / path of the function to invoke (a sketch follows below). You will need this extension to get named environments.

The limitation of this approach is that all the functions will share the same environment, but from your messages above that didn't seem like a blocker
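A minimal sketch of the "parameterize the flow to take the fully qualified module name" part, using only plain Metaflow (`Parameter` plus `importlib`); the flow name, parameter value, and toy input are hypothetical, and the named-environment decorator from the extension is deliberately omitted here:

```python
from importlib import import_module

from metaflow import FlowSpec, Parameter, step


class GenericStatsFlow(FlowSpec):
    # Fully qualified path to the contributed function,
    # e.g. "mylibrary.stats.summarize" (illustrative name).
    function_path = Parameter("function_path", type=str, required=True)

    @step
    def start(self):
        # Resolve "package.module.func" into a callable at run time.
        module_name, _, func_name = self.function_path.rpartition(".")
        func = getattr(import_module(module_name), func_name)
        # The real flow would do its consistent data-input handling here;
        # a toy input stands in for it.
        self.result = func([1.0, 2.0, 3.0])
        self.next(self.end)

    @step
    def end(self):
        print(self.result)


if __name__ == "__main__":
    GenericStatsFlow()
```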
a
oh awesome! thanks a bunch for the help. This documentation is pretty great: https://github.com/Netflix/metaflow-nflx-extensions/blob/5df045dbf3da1fed221d2f4ff11797becb3e9739/docs/conda.md#named-environments I think this works!

One complication I'm trying to work through: I want to set it up so that the team(s) I'm supporting can just do the following:

1. pip install some tool I give them
2. create their own directory for their team in any random location, with just a simple `requirements.txt` and their simple Python function file(s)
   a. The goal here is to create the absolute dumbest interface / code layout for easy adoption
3. run a command with a pointer to the function they want to test (validate that it gets the right result in the Metaflow environment) — see the sketch after this message
4. run a command with a pointer to the function they want to mark as validated (i.e. tested as "good" now)

At a basic level, I'm trying to decouple the package management from the flow execution, and then keep my own metadata associated with the different package versions. I wasn't sure exactly the best way to get the packaging of the functions correct. I guess locally I could just create a temp directory, copy the generic `FlowSpec` subclass there, and add a symlink to their simple function directory, but that doesn't quite solve for the remote case where we might be using Argo. I'm definitely a bit naive around general Python packaging best practices / methods, so would you have any ideas about the options there? i.e. what would be the most minimal way to make a local Python function available in a remote environment?
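For what the "run a command with a pointer to the function" in step 3 could look like locally, here is a rough sketch of a validator that such a pip-installed tool might expose; the CLI shape, the toy input, and the expectation that functions return a dict all follow the hypothetical interface sketched earlier, not anything Metaflow-specific:

```python
# Hypothetical helper a pip-installed tool could expose: import a function
# by its dotted path and sanity-check it against the static interface
# before it is ever run inside a flow.
import argparse
from importlib import import_module


def load_function(path: str):
    module_name, _, func_name = path.rpartition(".")
    return getattr(import_module(module_name), func_name)


def main() -> None:
    parser = argparse.ArgumentParser(description="Validate a contributed function")
    parser.add_argument("function_path", help='e.g. "mylibrary.stats.summarize"')
    args = parser.parse_args()

    func = load_function(args.function_path)
    result = func([1.0, 2.0, 3.0])  # toy input matching the assumed interface
    assert isinstance(result, dict), "function must return a dict of statistics"
    print(f"{args.function_path} looks valid: {result}")


if __name__ == "__main__":
    main()
```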
also, the relocatable functions sound awesome - would you have any details about the timeline you could share? If there is any way I could help out, I would be interested 🙂
hey @hundreds-rainbow-67050, just following up on the above!
h
You can point the pypi decorator to a local dir like the following. This way, users can test their code without needing to publish a package:
```python
@pypi(
    python="3.11.5",
    packages={
        "/my/directory/path/mylibrary": "",
    },
)
```
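For context, a sketch of how that decorator might sit on a step in a flow; the flow name and the `mylibrary` import are illustrative, and the empty string presumably just leaves the version unpinned since the local path identifies the source:

```python
from metaflow import FlowSpec, pypi, step


class ValidateFunctionFlow(FlowSpec):

    @pypi(
        python="3.11.5",
        packages={
            # local directory containing the team's library, per the reply above
            "/my/directory/path/mylibrary": "",
        },
    )
    @step
    def start(self):
        # Importable because the local package is installed into this step's env.
        import mylibrary
        print("loaded", mylibrary.__name__)
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    ValidateFunctionFlow()
```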