hello, I'm trying to deploy Metaflow to Azure for ...
# ask-metaflow
h
hello, I'm trying to deploy Metaflow to Azure for the first time. I'm following these instructions, which point to these permissions to be granted for the deployment. After granting all of the permissions and running the terraform apply ... command, I get a lot of errors like the following:
│ Cannot register provider Microsoft.KeyVault with Azure Resource Manager: resources.ProvidersClient#Register: Failure responding to request: StatusCode=403 -- Original Error: autorest/azure: Service returned an error. Status=403 Code="AuthorizationFailed" Message="The client '<redacted>' with object id '<redacted>' does not have authorization to perform action 'Microsoft.KeyVault/register/action' over scope '<redacted>' or the scope is invalid. If access was recently granted, please refresh your credentials.".
I get this error for a lot of Microsoft.*/register/action actions, and I'm not quite sure how to go about fixing this. Would anyone be able to help me figure out what permissions I'm missing for this specifically? (I'm new to Azure here so this might be pretty basic)
👀 1
I was able to avoid this issue by running through the whole setup using the account owner, but it might be useful to figure out the right permissions for deployment for others...
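For anyone hitting the same 403s: each error names a resource provider that the deploying identity is not allowed to register. A minimal sketch, assuming you have the `terraform apply` output as text, that pulls out the provider namespaces and prints the matching `az provider register` commands for someone with sufficient rights (for example the account owner mentioned above) to run once. The helper names are made up for illustration, and the commands are only printed, not executed.

```python
import re

# Matches the action name quoted in an Azure "AuthorizationFailed" message,
# e.g. 'Microsoft.KeyVault/register/action'.
REGISTER_ACTION = re.compile(r"'(Microsoft\.[A-Za-z.]+)/register/action'")


def providers_needing_registration(terraform_output: str) -> list[str]:
    """Return the sorted, de-duplicated provider namespaces named in the errors."""
    return sorted(set(REGISTER_ACTION.findall(terraform_output)))


def registration_commands(namespaces: list[str]) -> list[str]:
    """Build one `az provider register` command per namespace (not executed here)."""
    return [f"az provider register --namespace {ns}" for ns in namespaces]


# Sample text copied from the error message in this thread.
sample = (
    "does not have authorization to perform action "
    "'Microsoft.KeyVault/register/action' over scope '<redacted>'"
)
for command in registration_commands(providers_needing_registration(sample)):
    print(command)  # az provider register --namespace Microsoft.KeyVault
```

Whether pre-registering the providers is enough for the rest of the deployment is untested here; it only addresses the `register/action` errors.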
I'm trying to debug a related issue now. I'm trying to update one of my flows to run on Kubernetes/Azure, and I'm seeing a lot of these errors in the command line output:
2024-07-02 21:02:20.456 [6/start/12 (pid 63739)] INFO:azure.core.pipeline.policies.http_logging_policy:Request URL: 'https://stgenerablemfdefault.blob.core.windows.net/metaflow-storage-container/tf-full-stack-sysroot/IntegratedExperimentFlow/6/start/12/0.task_stderr.log'
2024-07-02 21:02:20.457 [6/start/12 (pid 63739)] Request method: 'GET'                                                                                                              
2024-07-02 21:02:20.457 [6/start/12 (pid 63739)] Request headers:                                                                                                                   
2024-07-02 21:02:20.457 [6/start/12 (pid 63739)] 'x-ms-range': 'REDACTED'                                                                                                           
2024-07-02 21:02:20.457 [6/start/12 (pid 63739)] 'x-ms-version': 'REDACTED'                                                                                                         
2024-07-02 21:02:20.457 [6/start/12 (pid 63739)] 'Accept': 'application/xml'                                                                                                        
2024-07-02 21:02:20.457 [6/start/12 (pid 63739)] 'User-Agent': 'azsdk-python-storage-blob/12.20.0 Python/3.9.18 (Linux-6.5.0-41-generic-x86_64-with-glibc2.38)'                     
2024-07-02 21:02:20.457 [6/start/12 (pid 63739)] 'x-ms-date': 'REDACTED'                                                                                                            
2024-07-02 21:02:20.458 [6/start/12 (pid 63739)] 'x-ms-client-request-id': '9b8f098f-38a5-11ef-9230-c42360da3004'                                                                   
2024-07-02 21:02:20.458 [6/start/12 (pid 63739)] 'Authorization': 'REDACTED'                                                                                                        
2024-07-02 21:02:20.458 [6/start/12 (pid 63739)] No body was attached to the request                                                                                                
2024-07-02 21:02:20.458 [6/start/12 (pid 63739)] INFO:azure.core.pipeline.policies.http_logging_policy:Response status: 404
when I go to the metaflow UI, I can see the task/run failed, but I can't see any logs about the errors or the stdout. would anyone be able to provide any info about how to debug this? (@square-wire-39606 / @ancient-application-36103?)
a
Hi Ruben, sorry for the delay here
Can you share the full terminal log?
h
yep! let me re-run it to produce all of the output
πŸ‘πŸΌ 1
okay here you go! thank you for the help!
note that I'm able to see some folders and one single file being created:
a
do you set any env vars locally for azure sdk's to get a verbose output?
h
i haven't set anything like that! just the normal credential setup... although I think I put the logging.basicConfig level to info a long time ago. i'm not sure why it's getting 404s and not logging the errors leading to the run failure though...
i guess the logs being spammed would be fine if i could actually go debug a run failure haha
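On the log spam: the request/response lines come from the logger named `azure.core.pipeline.policies.http_logging_policy`, which writes at INFO level, so a root logger set to INFO will show them. A small sketch, assuming the `logging.basicConfig` call lives in the flow file, that keeps INFO for everything else and quiets only that logger. This does not explain the 404 or the missing UI logs; it only reduces the noise.

```python
import logging

# Keep INFO-level messages for the application's own loggers.
logging.basicConfig(level=logging.INFO)

# Raise the threshold for the Azure SDK logger that prints every HTTP
# request and response (logger name taken from the output pasted above).
azure_http_logger = logging.getLogger("azure.core.pipeline.policies.http_logging_policy")
azure_http_logger.setLevel(logging.WARNING)
```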
a
also, it seems that the error is with launching the kubernetes job. do you have a valid kubeconfig locally that allows you to submit jobs to kubernetes?
h
this one i might not know how to check 🙂
👍🏼 1
a
have you been able to get any flow running, say without --with kubernetes?
h
only locally, it's my first time trying out the cloud deployment
s
what is the output that you get when you do python flow.py run?
does that succeed and do you see all the logs correctly in the UI?
h
it succeeds but i don't see any logs in the UI
i have to run for 1-2 hours! maybe I can tag you when i'm available later today or tomorrow?
a
sure - i will try to be around, but my responses might be delayed. what i would recommend is getting a simple flow working (such as one here) without kubernetes first.
h
ah okay! sounds good, thank you! I appreciate it!
it seems like i just need to resolve some dependencies after adding the pypi_base() decorator
although it doesn't quite answer why i was able to run locally completely (without the pypi_base() decorator) and see the run in the metaflow UI as successful (along with the produced cards), but still not see any of the logs
anyways, i'll message back here once i have the dependencies resolved
πŸ‘πŸΌ 1
a
it is likely that it is an issue with how the deployment is wired. were the logs visible without the pypi_base decorator - say when running a very simple flow with a bunch of print statements?
h
hm... interesting! okay, I have a really simple flow with this result:
(metaflow) ruben@ruben-A6:~/work/test-metaflow/model_flows/virtual-patient$ METAFLOW_PROFILE=azure python test.py run --with kubernetes
Metaflow 2.10.8+netflix-ext(1.1.1) executing TestFlow for user:ruben
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint not found, so extra checks are disabled.
2024-07-03 00:35:08.550 Workflow starting (run-id 12):
2024-07-03 00:35:10.397 [12/start/24 (pid 69362)] Task is starting.
2024-07-03 00:35:12.953 [12/start/24 (pid 69362)] [pod t-47ed747d-pkdsf-pp4cg] Task is starting (Pod is running, Container is running)...
2024-07-03 00:36:25.506 [12/start/24 (pid 69362)] Kubernetes error:
2024-07-03 00:36:25.592 [12/start/24 (pid 69362)] Error (exit code 1). This could be a transient error. Use @retry to retry.
2024-07-03 00:36:25.593 [12/start/24 (pid 69362)] 
2024-07-03 00:36:25.707 [12/start/24 (pid 69362)] Task failed.
2024-07-03 00:36:25.824 Workflow failed.
2024-07-03 00:36:25.824 Terminating 0 active tasks...
2024-07-03 00:36:25.824 Flushing logs...
    Step failure:
    Step start (task-id 24) failed.
I can run it locally:
(metaflow) ruben@ruben-A6:~/work/test-metaflow/model_flows/virtual-patient$ python test.py run
Metaflow 2.10.8+netflix-ext(1.1.1) executing TestFlow for user:ruben
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint not found, so extra checks are disabled.
2024-07-03 00:38:23.560 Workflow starting (run-id 1719959903550385):
2024-07-03 00:38:23.563 [1719959903550385/start/1 (pid 69802)] Task is starting.
2024-07-03 00:38:23.768 [1719959903550385/start/1 (pid 69802)] Task finished successfully.
2024-07-03 00:38:23.771 [1719959903550385/main/2 (pid 69805)] Task is starting.
2024-07-03 00:38:23.938 [1719959903550385/main/2 (pid 69805)] This is the main function
2024-07-03 00:38:23.973 [1719959903550385/main/2 (pid 69805)] This is the main function
2024-07-03 00:38:23.973 [1719959903550385/main/2 (pid 69805)] This is the main function
2024-07-03 00:38:23.973 [1719959903550385/main/2 (pid 69805)] This is the main function
2024-07-03 00:38:23.973 [1719959903550385/main/2 (pid 69805)] This is the main function
2024-07-03 00:38:23.973 [1719959903550385/main/2 (pid 69805)] This is the main function
2024-07-03 00:38:23.973 [1719959903550385/main/2 (pid 69805)] This is the main function
2024-07-03 00:38:23.974 [1719959903550385/main/2 (pid 69805)] This is the main function
2024-07-03 00:38:23.974 [1719959903550385/main/2 (pid 69805)] This is the main function
2024-07-03 00:38:23.974 [1719959903550385/main/2 (pid 69805)] This is the main function
2024-07-03 00:38:23.974 [1719959903550385/main/2 (pid 69805)] Task finished successfully.
2024-07-03 00:38:23.978 [1719959903550385/end/3 (pid 69808)] Task is starting.
2024-07-03 00:38:24.178 [1719959903550385/end/3 (pid 69808)] Task finished successfully.
2024-07-03 00:38:24.179 Done!
with this code:
from metaflow import FlowSpec, step


class TestFlow(FlowSpec):
    @step
    def start(self) -> None:
        self.next(self.main)

    @step
    def main(self) -> None:
        for _ in range(10):
            print("This is the main function")

        self.next(self.end)

    @step
    def end(self) -> None:
        pass


if __name__ == "__main__":
    TestFlow()
(I tried rerunning 3-4 times and got the same error above)
s
if you dump this into a job.yaml -
apiVersion: batch/v1
kind: Job
metadata:
  name: test-job
spec:
  template:
    spec:
      containers:
      - name: test
        image: busybox
        command: ["echo", "Hello, Kubernetes!"]
      restartPolicy: Never
and then execute kubectl apply -f job.yaml followed by kubectl get jobs - what is the output?
h
(metaflow) ruben@ruben-A6:~/work$ kubectl apply -f job.yaml job.batch/test-job created
(metaflow) ruben@ruben-A6:~/work$ kubectl get jobs 
NAME               COMPLETIONS   DURATION   AGE
t-14365adb-shg28   0/1           11h        11h
t-47ed747d-pkdsf   0/1           7h15m      7h15m
t-51796fab-pwjsm   0/1           7h11m      7h11m
t-b19d1a41-g47z8   0/1           7h21m      7h21m
t-bc2fb6c3-shdcz   0/1           10h        10h
t-ca8d42f2-26p8d   0/1           10h        10h
t-e6478428-4rnq8   0/1           10h        10h
t-fa2a45d6-5qjbb   0/1           7h18m      7h18m
t-ff0eb83d-7wbpb   0/1           10h        10h
test-job           1/1           5s         5s
(i changed my terminal size after running so i think there are some collapsed lines)
should i be worried about these long running unfinished jobs? haha
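In that listing, the COMPLETIONS column is done/desired, so `test-job` (1/1) finished while every `t-` job is still at 0/1 after hours. A short sketch, with a hypothetical helper name, that picks the unfinished jobs out of pasted `kubectl get jobs` output so they can be inspected, or cleaned up with `kubectl delete job <name>` once they are no longer needed.

```python
def incomplete_jobs(kubectl_get_jobs_output: str) -> list[str]:
    """Return names of jobs whose COMPLETIONS column shows fewer done than desired."""
    names = []
    for line in kubectl_get_jobs_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 2 or "/" not in fields[1]:
            continue  # ignore wrapped or malformed rows
        done, desired = fields[1].split("/", 1)
        if done != desired:
            names.append(fields[0])
    return names


# Two rows copied from the listing in this thread.
sample = """\
NAME               COMPLETIONS   DURATION   AGE
t-47ed747d-pkdsf   0/1           7h15m      7h15m
test-job           1/1           5s         5s
"""
print(incomplete_jobs(sample))  # ['t-47ed747d-pkdsf']
```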
a
Yeah it seems that something is off with your setup - metaflow is able to submit jobs to Kubernetes but then those jobs don't make much progress
h
hmm... any suggestions on how to debug that setup issue?
a
Can you check the pod logs for any of the jobs that start with t-?
h
yeah, I was just checking that! I can find the jobs here:
however, I don't see anything meaningful when I click through:
a
The pod might have expired. Can you try executing the flow once again and checking the pod logs and events for the newest job?
💯 1
h
ah okay! maybe this is the underlying error?
a
The container seems to have started
Can you check the pod logs?
h
it seems like i might need to enable this service to check the logs?
a
You can also use kubectl from your terminal to fetch the pod logs
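As a sketch of what that could look like: the Job controller labels the pods it creates with `job-name=<job>`, so the pods, their events, and their logs can be looked up from the job name. The helper below only builds and prints the commands; its name and the `default` namespace are assumptions for illustration.

```python
def pod_debug_commands(job_name: str, namespace: str = "default") -> list[list[str]]:
    """Build kubectl invocations for inspecting the pods of one Kubernetes Job."""
    selector = f"job-name={job_name}"  # label the Job controller adds to its pods
    ns = ["--namespace", namespace]
    return [
        ["kubectl", "get", "pods", "--selector", selector, *ns],       # pod names and status
        ["kubectl", "describe", "pods", "--selector", selector, *ns],  # includes events
        ["kubectl", "logs", f"job/{job_name}", "--all-containers", *ns],
    ]


# Job name taken from the `kubectl get jobs` listing earlier in the thread.
for command in pod_debug_commands("t-47ed747d-pkdsf"):
    print(" ".join(command))
```

If the pod has already been cleaned up, `kubectl logs` will have nothing to return, so it is worth running these right after a fresh failure.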
h
ah let me try that
hm, i'm actually a little confused because there is no pod available for the job i just ran when I run this command:
$ kubectl get pods --no-headers -o custom-columns=":metadata.name"
but the Azure web portal shows a pod for the logs.... although it seems like that pod failed in the job info page?
need to take a break! I can rerun all of the above steps and try to find the logs when i resume 🙂
okay, I guess it looks like the Kubernetes pod/job can't access the Azure storage service... need to poke around to see how to fix that
🙌 1
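One possible way to confirm the storage-access theory is to run a tiny check from inside a pod (or locally, for comparison) using the same Azure SDK that appears in the logs above. This is only a sketch under assumptions: it presumes the `azure-identity` and `azure-storage-blob` packages are installed and that the environment authenticates through `DefaultAzureCredential`; the function names are made up, and the account and container names in the usage note are read off the request URL in this thread.

```python
def account_url(storage_account: str) -> str:
    """Blob endpoint for a storage account in the public Azure cloud."""
    return f"https://{storage_account}.blob.core.windows.net"


def container_reachable(storage_account: str, container: str) -> bool:
    """Return True if the container exists and the current identity can see it.

    Authentication or authorization failures raise an SDK exception rather
    than returning False, and that exception is the useful signal here.
    """
    # Imported lazily so the helper above works without the Azure packages.
    from azure.identity import DefaultAzureCredential
    from azure.storage.blob import BlobServiceClient

    client = BlobServiceClient(account_url(storage_account), credential=DefaultAzureCredential())
    return client.get_container_client(container).exists()
```

For this deployment the call would be something like `container_reachable("stgenerablemfdefault", "metaflow-storage-container")`. If it succeeds locally but raises inside the pod, that points at the identity the pod runs under rather than at the flow code.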