# ask-metaflow
h
Hi team, we're at a loss right now... the deployment method we've been using for over a year suddenly stopped working last week. As far as we have been able to tell, it has something to do with running the flow with argo-workflows create, which we didn't love in the first place, but it had been working thus far. After a lot of troubleshooting, we were able to pinpoint the problem to the GCP OAuth2 call made while the flow is instantiating; ultimately it just times out with an "internal error" saying the token has expired.
# trying /opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/lib-dynload/termios.cpython-310-x86_64-linux-gnu.so
# extension module 'termios' loaded from '/opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/lib-dynload/termios.cpython-310-x86_64-linux-gnu.so'
# extension module 'termios' executed from '/opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/lib-dynload/termios.cpython-310-x86_64-linux-gnu.so'
import 'termios' # <_frozen_importlib_external.ExtensionFileLoader object at 0x7f3ae2a206d0>
import 'getpass' # <_frozen_importlib_external.SourceFileLoader object at 0x7f3ae2a20a90>
# trying /opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/site-packages/google/oauth2/webauthn_handler_factory.cpython-310-x86_64-linux-gnu.so
# trying /opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/site-packages/google/oauth2/webauthn_handler_factory.abi3.so
# trying /opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/site-packages/google/oauth2/webauthn_handler_factory.so
# trying /opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/site-packages/google/oauth2/webauthn_handler_factory.py
# /opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/site-packages/google/oauth2/__pycache__/webauthn_handler_factory.cpython-310.pyc matches /opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/site-packages/google/oauth2/webauthn_handler_factory.py
# code object from '/opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/site-packages/google/oauth2/__pycache__/webauthn_handler_factory.cpython-310.pyc'
# trying /opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/site-packages/google/oauth2/webauthn_handler.cpython-310-x86_64-linux-gnu.so
# trying /opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/site-packages/google/oauth2/webauthn_handler.abi3.so
# trying /opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/site-packages/google/oauth2/webauthn_handler.so
# trying /opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/site-packages/google/oauth2/webauthn_handler.py
# /opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/site-packages/google/oauth2/__pycache__/webauthn_handler.cpython-310.pyc matches /opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/site-packages/google/oauth2/webauthn_handler.py
# code object from '/opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/site-packages/google/oauth2/__pycache__/webauthn_handler.cpython-310.pyc'
# trying /opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/site-packages/google/oauth2/webauthn_types.cpython-310-x86_64-linux-gnu.so
# trying /opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/site-packages/google/oauth2/webauthn_types.abi3.so
# trying /opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/site-packages/google/oauth2/webauthn_types.so
# trying /opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/site-packages/google/oauth2/webauthn_types.py
# /opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/site-packages/google/oauth2/__pycache__/webauthn_types.cpython-310.pyc matches /opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/site-packages/google/oauth2/webauthn_types.py
# code object from '/opt/hostedtoolcache/Python/3.10.16/x64/lib/python3.10/site-packages/google/oauth2/__pycache__/webauthn_types.cpython-310.pyc'
import 'google.oauth2.webauthn_types' # <_frozen_importlib_external.SourceFileLoader object at 0x7f3ae298fb80>
import 'google.oauth2.webauthn_handler' # <_frozen_importlib_external.SourceFileLoader object at 0x7f3ae2a20430>
import 'google.oauth2.webauthn_handler_factory' # <_frozen_importlib_external.SourceFileLoader object at 0x7f3ae2a20790>
import 'google.oauth2.challenges' # <_frozen_importlib_external.SourceFileLoader object at 0x7f3ae2a20df0>
import 'google.oauth2.reauth' # <_frozen_importlib_external.SourceFileLoader object at 0x7f3ae2a21ba0>
import 'google.oauth2.credentials' # <_frozen_importlib_external.SourceFileLoader object at 0x7f3ae2a22e00>
    Internal error:
    ('Unable to acquire impersonated credentials', '{
      "error": {
        "code": 401,
        "message": "Request had invalid authentication credentials. Expected OAuth 2 access token, login cookie or other valid authentication credential. See https://developers.google.com/identity/sign-in/web/devconsole-project.",
        "status": "UNAUTHENTICATED",
        "details": [
          {
            "@type": "type.googleapis.com/google.rpc.ErrorInfo",
            "reason": "ACCESS_TOKEN_EXPIRED",
            "domain": "googleapis.com",
            "metadata": {
              "method": "google.iam.credentials.v1.IAMCredentials.GenerateAccessToken",
              "service": "iamcredentials.googleapis.com"
            }
          }
        ]
      }
    }')
We are making plenty of other calls to the Google APIs in the same GitHub workflow, and we have validated that the runner is able to reach the cluster, the Metaflow API, and the Google APIs for secrets etc. At this point, any help would be welcome!
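One way to make this failure less opaque is to parse the JSON payload inside that error tuple and pull out the machine-readable reason. A minimal stdlib sketch (the helper name is mine, and the payload below is an abbreviated copy of the error above):

```python
import json

def classify_google_api_error(payload: str) -> tuple:
    """Extract (status, reason) from a Google API error JSON payload."""
    err = json.loads(payload)["error"]
    details = err.get("details", [])
    reason = details[0].get("reason", "UNKNOWN") if details else "UNKNOWN"
    return err["status"], reason

# Abbreviated copy of the payload from the error above.
payload = json.dumps({
    "error": {
        "code": 401,
        "message": "Request had invalid authentication credentials.",
        "status": "UNAUTHENTICATED",
        "details": [{
            "@type": "type.googleapis.com/google.rpc.ErrorInfo",
            "reason": "ACCESS_TOKEN_EXPIRED",
            "domain": "googleapis.com",
        }],
    }
})

print(classify_google_api_error(payload))  # → ('UNAUTHENTICATED', 'ACCESS_TOKEN_EXPIRED')
```

Here the reason is ACCESS_TOKEN_EXPIRED on the GenerateAccessToken call, which points at the impersonation hop rather than the source credentials themselves.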
s
when you say, when the flow is instantiating, where exactly in the flow lifecycle are you?
h
The namespace of this production flow is 

 production:hello-flow-0-imhn 

To analyze results of this production flow add this line in your notebooks: 

 namespace("production:hello-flow-0-imhn") 

If you want to authorize other people to deploy new versions of this flow to Argo Workflows, they need to call 

 argo-workflows create --authorize hello-flow-0-imhn 

when deploying this flow to Argo Workflows for the first time. 

See "Organizing Results" at https://docs.metaflow.org/ for more information about production tokens. 

2025-03-24 18:52:12.712 Bootstrapping virtual environment(s) ...
it seems to be as part of the conda bootstrap
s
what version of metaflow are you on?
h
metaflow==2.12.34
(we also tried with the latest version at one point)
s
can you try with a newer version?
ack
is this the last log line before the stack trace -
2025-03-24 18:52:12.712 Bootstrapping virtual environment(s)
[orthogonal, for laters] re: not liking argo-workflows create - what's the issue there?
h
(not liking it as it is not in line with the rest of our gitops / argocd driven deployments, I would rather we deploy a manifest than hit the k8s api directly)
s
i see - yeah argo-workflows create does a bunch more stuff than just spitting out a yaml
h
yeah
totally get the why, just not in love with the concept from an infra/security perspective
s
are the service account keys valid?
h
we are using workload identity, but yes, the auth is valid for everything else we do during the run
s
do you happen to know after how much time does the error state happen?
h
we even tested running a python script that only prints the flows in the environment successfully
from metaflow import Metaflow, namespace

# Clear the default user namespace so flows from all users are visible.
namespace(None)
print("Printing the flows")
print(Metaflow().flows)
s
this is only going against the metadata service and not the gcs bucket ^
so it wouldn't necessarily execute similar code paths
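A sketch of a probe that exercises the GCS code path directly, with an explicit timeout so it fails fast instead of hanging. The bucket name is a placeholder and the helper is duck-typed (anything with a list_blobs method works), so this is an illustration rather than the exact bootstrap code path:

```python
def probe_datastore(client, bucket_name, timeout=30.0):
    """Return the name of the first blob in the bucket, or None if it is empty.

    The explicit timeout makes a connectivity problem surface as an error
    within seconds instead of an indefinite hang.
    """
    for blob in client.list_blobs(bucket_name, max_results=1, timeout=timeout):
        return blob.name
    return None

if __name__ == "__main__":
    # Requires google-cloud-storage and ambient credentials on the runner.
    from google.cloud import storage
    # "my-metaflow-datastore" is a placeholder for your GCS datastore bucket.
    print(probe_datastore(storage.Client(), "my-metaflow-datastore"))
```

If this hangs or raises on the runner while working locally, the problem is in the runner's path to GCS rather than in Metaflow itself.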
h
and we have successfully deployed the hello-flow with the argo-workflows create method from a local computer by impersonating that same service account
makes sense, thank you!
s
does deploying the same flow via impersonation work?
h
from local, yes
our issue really is when we're running it from the github actions runner
we tried multiple versions of the runner without success
a
ah i see. i wouldn't be surprised if it has to do with some config on github runner. after how many minutes does the github runner spit out this error?
h
usually the full hour (assuming it's the actual token that expired when it fails)
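That "full hour" lines up with the default 3600-second lifetime of GCP access tokens: a call that hangs past the token's lifetime would then fail with ACCESS_TOKEN_EXPIRED. A toy helper (names and timestamps are illustrative, taken from the bootstrap log time above) for reasoning about when a token minted at bootstrap time lapses:

```python
from datetime import datetime, timedelta, timezone

DEFAULT_TOKEN_LIFETIME = timedelta(seconds=3600)  # GCP access-token default

def remaining_lifetime(issued_at, now=None):
    """Time left on a default one-hour token; negative means it has expired."""
    now = now or datetime.now(timezone.utc)
    return issued_at + DEFAULT_TOKEN_LIFETIME - now

issued = datetime(2025, 3, 24, 18, 52, tzinfo=timezone.utc)  # bootstrap start from the logs
later = issued + timedelta(minutes=61)
print(remaining_lifetime(issued, later))  # negative: the token has expired
```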
but yes, would love to be able to pinpoint what the error actually is or at least how to debug further
a
it's stuck at
bootstrapping...
for an hour?
h
we have the equivalent of days wasted on this with multiple people on our side
that line: import 'google.oauth2.credentials' # <_frozen_importlib_external.SourceFileLoader object at 0x7f3ae2a22e00> just stays there until we kill it, or sometimes it just ends with the error talking about the access token having expired
I just enabled the conda debug env var, I will have more info soon I think
a
ah are you using the nflx-extensions?
h
yes
(I have also read about the fast-bakery thing, is this only for the hosted platform or can we use this too?)
a
i see. does it work with the native @conda/@pypi functionality?
fast bakery is only on outerbounds
h
gotcha ty
actually, how do I validate if this is using the netflix-exts? this deployment is pretty vanilla at this point
a
what are the full logs?
usually there is something like -
Metaflow 2.15.6.post1-gitfd23626-dirty executing LinearFlow for user:savin
Validating your flow...
    The graph looks good!
the version number helps in figuring out if nflx-extensions or any other extensions are in play
h
ah right ok will look on next run
Deploying Hello Flow with haus-staging 

Metaflow 2.12.34 executing HelloFlow for user:runner 

2025-03-25 02:48:11.777 Creating local datastore in current directory (/home/runner/work/planning-metaflow/planning-metaflow/.metaflow) 

Validating your flow... 

 The graph looks good! 

Running pylint... 

 Pylint not found, so extra checks are disabled. 

Deploying hello-flow to Argo Workflows... 

The namespace of this production flow is 

 production:hello-flow-0-imhn 

To analyze results of this production flow add this line in your notebooks: 

 namespace("production:hello-flow-0-imhn") 

If you want to authorize other people to deploy new versions of this flow to Argo Workflows, they need to call 

 argo-workflows create --authorize hello-flow-0-imhn 

when deploying this flow to Argo Workflows for the first time. 

See "Organizing Results" at https://docs.metaflow.org/ for more information about production tokens. 


2025-03-25 02:48:12.469 Bootstrapping virtual environment(s) ...
a
yeah no extensions are being used here
it would be great if you can verify that you are able to deploy a simple hello flow without @conda etc. successfully through the runner
h
yep, this guy will just wait forever at this point
so just remove the --environment part?
a
you would have to remove the @conda/@pypi decorators too
h
that flow doesn't have it, it's the helloflow from the tutorials
a
then you can just remove the --environment parts
h
from __future__ import annotations

from metaflow import FlowSpec, step


class HelloFlow(FlowSpec):
    """
    A flow where Metaflow prints 'Hi'.

    Run this flow to validate that Metaflow is installed correctly.

    """

    @step
    def start(self):
        """
        This is the 'start' step. All flows must have a step named 'start' that
        is the first step in the flow.

        """
        print("HelloFlow is starting.")
        self.next(self.hello)

    @step
    def hello(self):
        """
        A step for metaflow to introduce itself.

        """
        print("Metaflow says: Hello!")
        self.next(self.end)

    @step
    def end(self):
        """
        This is the 'end' step. All flows must have an 'end' step, which is the
        last step in the flow.

        """
        print("HelloFlow is all done.")


if __name__ == "__main__":
    HelloFlow()
just sent without conda
a
also, are you able to list your gcs buckets from within the runner?
h
(and re-upgraded to latest)
the previous run seems to be hanging again (the one without conda); sent another one that also lists the metaflow bucket
s
i see - it seems that the issue is that the runner is unable to connect to gcs - very likely some configuration issue
are you able to log in to the runner?
h
that would be fantastic if it's as easy as that... no, it's a GitHub-hosted runner
s
not super safe, but if your runner doesn't have access to any secrets, this might be a mechanism to access your runner
h
lol wow, love how easy it is to hack anything these days!
that's super helpful though, for debugging sessions, we would have saved so many hours today!!!
welp, you are right, the runner doesn't seem to be able to list the bucket, although that service account has the appropriate perms
a
bingo!
h
we're able to get secrets from secret manager, we're able to get the cluster creds from gcloud, we're able to do pretty much everything else but that though 😕
ok thanks @square-wire-39606, I'll try to pin an older version of the gcloud cli and try to find a combination that works, as always your help is much appreciated 🙂
now, if we wanted to avoid giving the runner direct access to the cluster entirely, what would be an inventive way to do some form of gitops deployment?
(back to not being in love with argo-workflows create directly from ci/cd 🙂 )
a
you could give it scoped access to just submit workflow templates?
h
is there a way we could generate the workflow templates and use argocd to deploy them?
a
not really - since it isn't just the workflow template - we read and manipulate workflow templates, cron workflows and sensors; besides writing to gcs.
is it possible to simply execute
argo-workflows create
within argo-cd?
h
that would have to be something like a cronjob that we deploy which would run the command 🤔
i mean just a k8s job that tells argocd to run a pod which would be doing the
argo-workflows create
... i kinda like that
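Sketched as a manifest that Argo CD could sync, that idea might look roughly like this; the image, names, and service account are all hypothetical, and the pod would still need GCS credentials plus RBAC on workflow templates, cron workflows, and sensors:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: deploy-hello-flow            # hypothetical name
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      serviceAccountName: metaflow-deployer   # hypothetical SA with the needed RBAC
      containers:
        - name: deploy
          # Hypothetical image containing metaflow, the flow code, and its deps.
          image: ghcr.io/example/metaflow-deployer:latest
          command: ["python", "helloflow.py", "argo-workflows", "create"]
```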
thank you, I'll keep thinking about it 🙂 meanwhile, have to figure out that gcloud issue! tomorrow though 😄
a
good luck!