# ask-metaflow
f
Hi all, we're running Metaflow in AWS. We're intermittently getting a 500 error from the Metaflow service saying
{"message": "Internal server error"}
whenever we try to run a flow. We're not sure what's causing it, but forcing a redeployment of the Metadata Service on ECS (roughly the command sketched below) seems to resolve it. I believe the task definition is not updated by this redeployment; the service is just restarted. Is there a way to fix this, or to enable logging so we can figure out what's going on? We'd like to run this in a production environment and need to know exactly what's causing it to fail intermittently. Thanks
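For reference, this is roughly how we force the redeployment (the cluster and service names are placeholders for ours):
# restart the Metadata Service tasks without changing the task definition
aws ecs update-service --cluster <ecs-cluster> --service <metadata-service> --force-new-deployment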
h
Can you please share the logs of the metadata-service while this issue is happening?
f
Hi @hundreds-zebra-57629, sorry for the late reply. Here are the logs from the Metadata Service and the API Gateway. The flows we had previously deployed as step functions still worked, as did the health check. The error occurred when trying to deploy a new step function or run the flow
--with batch
. We're using this image:
public.ecr.aws/outerbounds/metaflow_metadata_service:v2.4.13
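For context, these are roughly the commands that hit the error (the flow file name deployment_flow.py is a stand-in for ours):
# deploying the flow to AWS Step Functions returned the 500
python deployment_flow.py step-functions create
# running the flow with all steps on AWS Batch also returned the 500
python deployment_flow.py run --with batch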
Metadata Service Logs:
May 16, 2025 at 11:17 (UTC-4:00)	INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:17:01 +0000] "POST /flows/DeploymentFlow/runs/sfn-74d70788-001b-aa24-12ba-c42af4e771ef_cfa0fd88-2f20-9c96-a445-ab3f4b4b2c35/steps/end/tasks/adeccfba-da01-4b69-8dd0-e4baf76b108c/heartbeat HTTP/1.1" 404 227 "-" "python-requests/2.32.3"	875ec23b43834830b4dc28dca2f55437	prod-service-metaflow
May 16, 2025 at 11:17 (UTC-4:00)	INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:17:00 +0000] "GET /healthcheck HTTP/1.1" 200 220 "-" "ELB-HealthChecker/2.0"	875ec23b43834830b4dc28dca2f55437	prod-service-metaflow
May 16, 2025 at 11:16 (UTC-4:00)	INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:16:59 +0000] "POST /flows/DeploymentFlow/runs/sfn-74d70788-001b-aa24-12ba-c42af4e771ef_cfa0fd88-2f20-9c96-a445-ab3f4b4b2c35/steps/end/tasks/adeccfba-da01-4b69-8dd0-e4baf76b108c/heartbeat HTTP/1.1" 404 227 "-" "python-requests/2.32.3"	875ec23b43834830b4dc28dca2f55437	prod-service-metaflow
May 16, 2025 at 11:16 (UTC-4:00)	INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:16:58 +0000] "POST /flows/DeploymentFlow/runs/sfn-74d70788-001b-aa24-12ba-c42af4e771ef_cfa0fd88-2f20-9c96-a445-ab3f4b4b2c35/steps/end/tasks/adeccfba-da01-4b69-8dd0-e4baf76b108c/heartbeat HTTP/1.1" 404 227 "-" "python-requests/2.32.3"	875ec23b43834830b4dc28dca2f55437	prod-service-metaflow
May 16, 2025 at 11:16 (UTC-4:00)	INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:16:57 +0000] "POST /flows/DeploymentFlow/runs/sfn-74d70788-001b-aa24-12ba-c42af4e771ef_cfa0fd88-2f20-9c96-a445-ab3f4b4b2c35/steps/end/tasks/adeccfba-da01-4b69-8dd0-e4baf76b108c/metadata HTTP/1.1" 200 177 "-" "python-requests/2.32.3"	875ec23b43834830b4dc28dca2f55437	prod-service-metaflow
May 16, 2025 at 11:16 (UTC-4:00)	INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:16:57 +0000] "POST /flows/DeploymentFlow/runs/sfn-74d70788-001b-aa24-12ba-c42af4e771ef_cfa0fd88-2f20-9c96-a445-ab3f4b4b2c35/steps/end/tasks/adeccfba-da01-4b69-8dd0-e4baf76b108c/metadata HTTP/1.1" 200 177 "-" "python-requests/2.32.3"	875ec23b43834830b4dc28dca2f55437	prod-service-metaflow
May 16, 2025 at 11:16 (UTC-4:00)	INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:16:57 +0000] "POST /flows/DeploymentFlow/runs/sfn-74d70788-001b-aa24-12ba-c42af4e771ef_cfa0fd88-2f20-9c96-a445-ab3f4b4b2c35/steps/end/tasks/adeccfba-da01-4b69-8dd0-e4baf76b108c/metadata HTTP/1.1" 200 177 "-" "python-requests/2.32.3"	875ec23b43834830b4dc28dca2f55437	prod-service-metaflow
May 16, 2025 at 11:16 (UTC-4:00)	INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:16:57 +0000] "GET /ping HTTP/1.1" 200 191 "-" "python-requests/2.32.3"	875ec23b43834830b4dc28dca2f55437	prod-service-metaflow
May 16, 2025 at 11:16 (UTC-4:00)	INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:16:57 +0000] "POST /flows/DeploymentFlow/runs/sfn-74d70788-001b-aa24-12ba-c42af4e771ef_cfa0fd88-2f20-9c96-a445-ab3f4b4b2c35/steps/end/tasks/adeccfba-da01-4b69-8dd0-e4baf76b108c/metadata HTTP/1.1" 200 177 "-" "python-requests/2.32.3"	875ec23b43834830b4dc28dca2f55437	prod-service-metaflow
May 16, 2025 at 11:16 (UTC-4:00)	INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:16:57 +0000] "GET /healthcheck HTTP/1.1" 200 220 "-" "ELB-HealthChecker/2.0"	875ec23b43834830b4dc28dca2f55437	prod-service-metaflow
May 16, 2025 at 11:16 (UTC-4:00)	INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:16:57 +0000] "POST /flows/DeploymentFlow/runs/sfn-74d70788-001b-aa24-12ba-c42af4e771ef_cfa0fd88-2f20-9c96-a445-ab3f4b4b2c35/steps/end/tasks/adeccfba-da01-4b69-8dd0-e4baf76b108c/metadata HTTP/1.1" 200 177 "-" "python-requests/2.32.3"	875ec23b43834830b4dc28dca2f55437	prod-service-metaflow
May 16, 2025 at 11:16 (UTC-4:00)	INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:16:57 +0000] "POST /flows/DeploymentFlow/runs/sfn-74d70788-001b-aa24-12ba-c42af4e771ef_cfa0fd88-2f20-9c96-a445-ab3f4b4b2c35/steps/end/task HTTP/1.1" 200 647 "-" "python-requests/2.32.3"	875ec23b43834830b4dc28dca2f55437	prod-service-metaflow
May 16, 2025 at 11:16 (UTC-4:00)	INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:16:57 +0000] "GET /flows/DeploymentFlow/runs/sfn-74d70788-001b-aa24-12ba-c42af4e771ef_cfa0fd88-2f20-9c96-a445-ab3f4b4b2c35/steps/end/tasks/adeccfba-da01-4b69-8dd0-e4baf76b108c HTTP/1.1" 404 196 "-" "python-requests/2.32.3"	875ec23b43834830b4dc28dca2f55437	prod-service-metaflow
May 16, 2025 at 11:16 (UTC-4:00)	INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:16:57 +0000] "POST /flows/DeploymentFlow/runs/sfn-74d70788-001b-aa24-12ba-c42af4e771ef_cfa0fd88-2f20-9c96-a445-ab3f4b4b2c35/steps/end/step HTTP/1.1" 200 563 "-" "python-requests/2.32.3"	875ec23b43834830b4dc28dca2f55437	prod-service-metaflow
May 16, 2025 at 11:16 (UTC-4:00)	INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:16:57 +0000] "GET /flows/DeploymentFlow/runs/sfn-74d70788-001b-aa24-12ba-c42af4e771ef_cfa0fd88-2f20-9c96-a445-ab3f4b4b2c35/steps/end HTTP/1.1" 404 196 "-" "python-requests/2.32.3"	875ec23b43834830b4dc28dca2f55437	prod-service-metaflow
May 16, 2025 at 11:16 (UTC-4:00)	INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:16:57 +0000] "GET /flows/DeploymentFlow/runs/sfn-74d70788-001b-aa24-12ba-c42af4e771ef_cfa0fd88-2f20-9c96-a445-ab3f4b4b2c35 HTTP/1.1" 200 576 "-" "python-requests/2.32.3"	875ec23b43834830b4dc28dca2f55437	prod-service-metaflow
May 16, 2025 at 11:16 (UTC-4:00)	INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:16:57 +0000] "GET /flows/DeploymentFlow HTTP/1.1" 200 312 "-" "python-requests/2.32.3"	875ec23b43834830b4dc28dca2f55437	prod-service-metaflow
May 16, 2025 at 11:16 (UTC-4:00)	INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:16:54 +0000] "GET /healthcheck HTTP/1.1" 200 220 "-" "ELB-HealthChecker/2.0"	875ec23b43834830b4dc28dca2f55437	prod-service-metaflow
API Gateway Logs:
{"account_id":"-","api_id":"mbajwxgydb","api_key":"**********************************uOGW65","authenticate.error":"-","authenticate.status":"-","authorize.error":"-","authorize.status":"-","authorizer.error":"-","authorizer.integrationStatus":"-","authorizer.requestId":"-","caller":"-","endpointType":"EDGE","error.messageString":" "Internal server error"","error.responseType":"INTEGRATION_FAILURE","error.validationErrorString":"-","extendedRequestId":"Kqp-nF_nCYcEl5g=","http_method":"GET","identity.accountId":"-","identity.apiKey":"**********************************uOGW65","identity.apiKeyId":"c9ek4yahsa","identity.caller":"-","identity.user":"-","identity.userAgent":"python-requests/2.32.3","identity.userArn":"-","identity.vpcId":"-","identity.vpceId":"-","integration.error":"There was an internal error while executing your request","integration.latency":"11132","integration.requestId":"-","integration.status":"-","integrationStatus":"-","requestId":"13a7048d-4218-4821-a2f8-2a1e10fce0d7","request_id":"13a7048d-4218-4821-a2f8-2a1e10fce0d7","resource_id":"iw3ngo","resource_path":"/{proxy+}","source_ip":"66.183.225.111","stage":"api","user":"-","user-agent":"python-requests/2.32.3","user_arn":"-","waf.error":"-","waf.latency":"-"}
h
I was expecting to see some clues in the metadata-service logs. Usually, an internal error happens when the metadata-service is unable to communicate with the database or is in a failing state. Can you try curling the metadata-service on the
/ping
path, first via the API Gateway and then directly on the EC2 instance, and see if you get back a 200 status? Something like:
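A minimal sketch, assuming the metadata service is listening on its default port 8080; the region, API key, and task IP are placeholders (the api_id mbajwxgydb and stage "api" come from your gateway logs):
# via the API Gateway; the auth key goes in the x-api-key header
curl -H "x-api-key: <METAFLOW_SERVICE_AUTH_KEY>" https://mbajwxgydb.execute-api.<region>.amazonaws.com/api/ping
# directly against the ECS task on the EC2 instance
curl http://<task-private-ip>:8080/ping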