famous-potato-4109
05/16/2025, 4:47 PM
{"message": "Internal server error"} whenever we try to run a flow. We're not exactly sure what's causing it, but forcing a redeployment of the Metadata Service on ECS seems to work. I believe the task definition is not updated with this redeployment, but the service is restarted. Is there a way to fix this or to enable logging to figure out what's going on? We'd like to run this in a production environment and need to know what exactly is causing it to fail intermittently. Thanks
hundreds-zebra-57629
05/18/2025, 10:09 PM
famous-potato-4109
05/20/2025, 1:37 PM
--with batch. We're using this image: public.ecr.aws/outerbounds/metaflow_metadata_service:v2.4.13
Metadata Service Logs:
May 16, 2025 at 11:17 (UTC-4:00) INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:17:01 +0000] "POST /flows/DeploymentFlow/runs/sfn-74d70788-001b-aa24-12ba-c42af4e771ef_cfa0fd88-2f20-9c96-a445-ab3f4b4b2c35/steps/end/tasks/adeccfba-da01-4b69-8dd0-e4baf76b108c/heartbeat HTTP/1.1" 404 227 "-" "python-requests/2.32.3" 875ec23b43834830b4dc28dca2f55437 prod-service-metaflow
May 16, 2025 at 11:17 (UTC-4:00) INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:17:00 +0000] "GET /healthcheck HTTP/1.1" 200 220 "-" "ELB-HealthChecker/2.0" 875ec23b43834830b4dc28dca2f55437 prod-service-metaflow
May 16, 2025 at 11:16 (UTC-4:00) INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:16:59 +0000] "POST /flows/DeploymentFlow/runs/sfn-74d70788-001b-aa24-12ba-c42af4e771ef_cfa0fd88-2f20-9c96-a445-ab3f4b4b2c35/steps/end/tasks/adeccfba-da01-4b69-8dd0-e4baf76b108c/heartbeat HTTP/1.1" 404 227 "-" "python-requests/2.32.3" 875ec23b43834830b4dc28dca2f55437 prod-service-metaflow
May 16, 2025 at 11:16 (UTC-4:00) INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:16:58 +0000] "POST /flows/DeploymentFlow/runs/sfn-74d70788-001b-aa24-12ba-c42af4e771ef_cfa0fd88-2f20-9c96-a445-ab3f4b4b2c35/steps/end/tasks/adeccfba-da01-4b69-8dd0-e4baf76b108c/heartbeat HTTP/1.1" 404 227 "-" "python-requests/2.32.3" 875ec23b43834830b4dc28dca2f55437 prod-service-metaflow
May 16, 2025 at 11:16 (UTC-4:00) INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:16:57 +0000] "POST /flows/DeploymentFlow/runs/sfn-74d70788-001b-aa24-12ba-c42af4e771ef_cfa0fd88-2f20-9c96-a445-ab3f4b4b2c35/steps/end/tasks/adeccfba-da01-4b69-8dd0-e4baf76b108c/metadata HTTP/1.1" 200 177 "-" "python-requests/2.32.3" 875ec23b43834830b4dc28dca2f55437 prod-service-metaflow
May 16, 2025 at 11:16 (UTC-4:00) INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:16:57 +0000] "POST /flows/DeploymentFlow/runs/sfn-74d70788-001b-aa24-12ba-c42af4e771ef_cfa0fd88-2f20-9c96-a445-ab3f4b4b2c35/steps/end/tasks/adeccfba-da01-4b69-8dd0-e4baf76b108c/metadata HTTP/1.1" 200 177 "-" "python-requests/2.32.3" 875ec23b43834830b4dc28dca2f55437 prod-service-metaflow
May 16, 2025 at 11:16 (UTC-4:00) INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:16:57 +0000] "POST /flows/DeploymentFlow/runs/sfn-74d70788-001b-aa24-12ba-c42af4e771ef_cfa0fd88-2f20-9c96-a445-ab3f4b4b2c35/steps/end/tasks/adeccfba-da01-4b69-8dd0-e4baf76b108c/metadata HTTP/1.1" 200 177 "-" "python-requests/2.32.3" 875ec23b43834830b4dc28dca2f55437 prod-service-metaflow
May 16, 2025 at 11:16 (UTC-4:00) INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:16:57 +0000] "GET /ping HTTP/1.1" 200 191 "-" "python-requests/2.32.3" 875ec23b43834830b4dc28dca2f55437 prod-service-metaflow
May 16, 2025 at 11:16 (UTC-4:00) INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:16:57 +0000] "POST /flows/DeploymentFlow/runs/sfn-74d70788-001b-aa24-12ba-c42af4e771ef_cfa0fd88-2f20-9c96-a445-ab3f4b4b2c35/steps/end/tasks/adeccfba-da01-4b69-8dd0-e4baf76b108c/metadata HTTP/1.1" 200 177 "-" "python-requests/2.32.3" 875ec23b43834830b4dc28dca2f55437 prod-service-metaflow
May 16, 2025 at 11:16 (UTC-4:00) INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:16:57 +0000] "GET /healthcheck HTTP/1.1" 200 220 "-" "ELB-HealthChecker/2.0" 875ec23b43834830b4dc28dca2f55437 prod-service-metaflow
May 16, 2025 at 11:16 (UTC-4:00) INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:16:57 +0000] "POST /flows/DeploymentFlow/runs/sfn-74d70788-001b-aa24-12ba-c42af4e771ef_cfa0fd88-2f20-9c96-a445-ab3f4b4b2c35/steps/end/tasks/adeccfba-da01-4b69-8dd0-e4baf76b108c/metadata HTTP/1.1" 200 177 "-" "python-requests/2.32.3" 875ec23b43834830b4dc28dca2f55437 prod-service-metaflow
May 16, 2025 at 11:16 (UTC-4:00) INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:16:57 +0000] "POST /flows/DeploymentFlow/runs/sfn-74d70788-001b-aa24-12ba-c42af4e771ef_cfa0fd88-2f20-9c96-a445-ab3f4b4b2c35/steps/end/task HTTP/1.1" 200 647 "-" "python-requests/2.32.3" 875ec23b43834830b4dc28dca2f55437 prod-service-metaflow
May 16, 2025 at 11:16 (UTC-4:00) INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:16:57 +0000] "GET /flows/DeploymentFlow/runs/sfn-74d70788-001b-aa24-12ba-c42af4e771ef_cfa0fd88-2f20-9c96-a445-ab3f4b4b2c35/steps/end/tasks/adeccfba-da01-4b69-8dd0-e4baf76b108c HTTP/1.1" 404 196 "-" "python-requests/2.32.3" 875ec23b43834830b4dc28dca2f55437 prod-service-metaflow
May 16, 2025 at 11:16 (UTC-4:00) INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:16:57 +0000] "POST /flows/DeploymentFlow/runs/sfn-74d70788-001b-aa24-12ba-c42af4e771ef_cfa0fd88-2f20-9c96-a445-ab3f4b4b2c35/steps/end/step HTTP/1.1" 200 563 "-" "python-requests/2.32.3" 875ec23b43834830b4dc28dca2f55437 prod-service-metaflow
May 16, 2025 at 11:16 (UTC-4:00) INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:16:57 +0000] "GET /flows/DeploymentFlow/runs/sfn-74d70788-001b-aa24-12ba-c42af4e771ef_cfa0fd88-2f20-9c96-a445-ab3f4b4b2c35/steps/end HTTP/1.1" 404 196 "-" "python-requests/2.32.3" 875ec23b43834830b4dc28dca2f55437 prod-service-metaflow
May 16, 2025 at 11:16 (UTC-4:00) INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:16:57 +0000] "GET /flows/DeploymentFlow/runs/sfn-74d70788-001b-aa24-12ba-c42af4e771ef_cfa0fd88-2f20-9c96-a445-ab3f4b4b2c35 HTTP/1.1" 200 576 "-" "python-requests/2.32.3" 875ec23b43834830b4dc28dca2f55437 prod-service-metaflow
May 16, 2025 at 11:16 (UTC-4:00) INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:16:57 +0000] "GET /flows/DeploymentFlow HTTP/1.1" 200 312 "-" "python-requests/2.32.3" 875ec23b43834830b4dc28dca2f55437 prod-service-metaflow
May 16, 2025 at 11:16 (UTC-4:00) INFO:aiohttp.access:10.0.2.5 [16/May/2025:15:16:54 +0000] "GET /healthcheck HTTP/1.1" 200 220 "-" "ELB-HealthChecker/2.0" 875ec23b43834830b4dc28dca2f55437 prod-service-metaflow
API Gateway Logs:
{"account_id":"-","api_id":"mbajwxgydb","api_key":"**********************************uOGW65","authenticate.error":"-","authenticate.status":"-","authorize.error":"-","authorize.status":"-","authorizer.error":"-","authorizer.integrationStatus":"-","authorizer.requestId":"-","caller":"-","endpointType":"EDGE","error.messageString":" "Internal server error"","error.responseType":"INTEGRATION_FAILURE","error.validationErrorString":"-","extendedRequestId":"Kqp-nF_nCYcEl5g=","http_method":"GET","identity.accountId":"-","identity.apiKey":"**********************************uOGW65","identity.apiKeyId":"c9ek4yahsa","identity.caller":"-","identity.user":"-","identity.userAgent":"python-requests/2.32.3","identity.userArn":"-","identity.vpcId":"-","identity.vpceId":"-","integration.error":"There was an internal error while executing your request","integration.latency":"11132","integration.requestId":"-","integration.status":"-","integrationStatus":"-","requestId":"13a7048d-4218-4821-a2f8-2a1e10fce0d7","request_id":"13a7048d-4218-4821-a2f8-2a1e10fce0d7","resource_id":"iw3ngo","resource_path":"/{proxy+}","source_ip":"66.183.225.111","stage":"api","user":"-","user-agent":"python-requests/2.32.3","user_arn":"-","waf.error":"-","waf.latency":"-"}
hundreds-zebra-57629
05/22/2025, 3:31 PM
/ping path first via the API server and then directly on the EC2 instance and see if you get back a 200 status?
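A rough sketch of that check in Python, in case it helps. It requests /ping through the API Gateway URL and again against the service directly, then reports both status codes. The URLs are placeholders you would pass in yourself, and sending the key in an `x-api-key` header is an assumption based on the API key showing up in the gateway log above. `compare`, `ping_status`, and the injectable `fetch` argument are made-up names for this example, not part of Metaflow.

```python
import sys
import urllib.error
import urllib.request


def _http_get_status(url, timeout, headers):
    """Issue a GET and return the HTTP status code, including 4xx/5xx."""
    request = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses raise HTTPError; the status code is still useful.
        return err.code


def ping_status(base_url, fetch=None, timeout=10, headers=None):
    """Return the status code of GET <base_url>/ping, or None if unreachable."""
    url = base_url.rstrip("/") + "/ping"
    if fetch is None:
        fetch = _http_get_status
    try:
        return fetch(url, timeout, headers or {})
    except OSError:
        # Covers connection refused, DNS failures, and timeouts.
        return None


def compare(gateway_url, direct_url, api_key=None, fetch=None):
    """Hit /ping via the gateway and directly, returning both status codes."""
    # Assumption: the gateway expects the key in an x-api-key header.
    gateway_headers = {"x-api-key": api_key} if api_key else {}
    return {
        "gateway": ping_status(gateway_url, fetch=fetch, headers=gateway_headers),
        "direct": ping_status(direct_url, fetch=fetch),
    }


if __name__ == "__main__":
    # Usage: python ping_check.py <gateway_url> <direct_url> [api_key]
    if len(sys.argv) >= 3:
        key = sys.argv[3] if len(sys.argv) > 3 else None
        print(compare(sys.argv[1], sys.argv[2], api_key=key))
```

If the direct call returns 200 while the gateway call returns 500 or None, that would point at the path between API Gateway and the service rather than at the service itself. The direct call has to run from somewhere that can reach the service's private address, for example from inside the VPC.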