# ask-metaflow
r
Hey folks, I am using Metaflow deployed in AWS through the CloudFormation template. I am using UI version v1.3.13 and service 2.4.11-
Everything works fine until the UI service at some point stops loading cards. If I restart the service, it works fine again for some time. I am getting 504 errors and a message that something is not valid JSON when I inspect it in the browser console. In the CloudWatch logs I see this:
INFO:CacheAsyncClient:cache_data/log:Message: DEBUG:botocore.hooks:Event needs-retry.s3.HeadObject: calling handler <bound method S3RegionRedirectorv2.redirect_from_error of <botocore.utils.S3RegionRedirectorv2 object at 0x7f3e4316fe90>>
The strange thing, though, is that, as I said, after restarting the UI in AWS I can see the cards again.
Some more info; I would really appreciate it if someone could give me some pointers at least. CloudWatch trail for the UI service:
INFO:CacheAsyncClient:cache_data/log:Message: DEBUG:botocore.hooks:Event request-created.s3.HeadObject: calling handler <function add_retry_headers at 0x7fa3c17a2e80>
INFO:CacheAsyncClient:cache_data/log:Message: DEBUG:botocore.endpoint:Sending http request: <AWSPreparedRequest stream_output=False, method=HEAD, url=<https://metaflow-metaflows3bucket-3oikszn7tnso.s3.eu-west-1.amazonaws.com/metaflow/AuxiliaryTransformer/425/train/3047/2.attempt.json>, headers={'User-Agent': b'Boto3/1.34.125 md/Botocore#1.34.125 ua/2.0 os/linux#5.10.220-209.869.amzn2.x86_64 md/arch#x86_64 lang/python#3.11.6 md/pyimpl#CPython exec-env/AWS_ECS_FARGATE cfg/retry-mode#legacy Botocore/1.34.125', 'X-Amz-Date': b'20240813T075532Z', 'X-Amz-Security-Token': b'****', 'X-Amz-Content-SHA256': b'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855', 'Authorization': b'****', 'amz-sdk-invocation-id': b'49bd4a2b-b0f3-404c-b0d5-8cca0922cc95', 'amz-sdk-request': b'attempt=1'}>
INFO:CacheAsyncClient:cache_data/log:Message: DEBUG:botocore.httpsession:Certificate path: /opt/latest/lib/python3.11/site-packages/certifi/cacert.pem
INFO:CacheAsyncClient:cache_data/log:Message: DEBUG:urllib3.connectionpool:<https://metaflow-metaflows3bucket-3oikszn7tnso.s3.eu-west-1.amazonaws.com:443> "HEAD /metaflow/AuxiliaryTransformer/425/train/3047/2.attempt.json HTTP/1.1" 404 0
INFO:CacheAsyncClient:cache_data/log:Message: DEBUG:botocore.parsers:Response headers: {'x-amz-request-id': '36XNZKB260CVZJCR', 'x-amz-id-2': 'ii2UQnA/rujSADAG6CH+cWyNxBbfqXZ69H5PU97g648Yl4EvrAn+Flp70BQcsK+fksvnJFKcRhMj0AAZQ/xUSw==', 'Content-Type': 'application/xml', 'Date': 'Tue, 13 Aug 2024 07:55:32 GMT', 'Server': 'AmazonS3'}
INFO:CacheAsyncClient:cache_data/log:Message: DEBUG:botocore.parsers:Response body:
INFO:CacheAsyncClient:cache_data/log:Message: b''
INFO:CacheAsyncClient:cache_data/log:Message: DEBUG:botocore.hooks:Event needs-retry.s3.HeadObject: calling handler <botocore.retryhandler.RetryHandler object at 0x7fa3c05e8950>
INFO:CacheAsyncClient:cache_data/log:Message: DEBUG:botocore.retryhandler:No retry needed.
INFO:CacheAsyncClient:cache_data/log:Message: DEBUG:botocore.hooks:Event needs-retry.s3.HeadObject: calling handler <bound method S3RegionRedirectorv2.redirect_from_error of <botocore.utils.S3RegionRedirectorv2 object at 0x7fa3c05c1dd0>>
INFO:CacheAsyncClient:cache_data/log:Message: DEBUG:botocore.hooks:Event before-parameter-build.s3.HeadObject: calling handler <function sse_md5 at 0x7fa3c17a0e00>
INFO:CacheAsyncClient:cache_data/log:Message: DEBUG:botocore.hooks:Event before-parameter-build.s3.HeadObject: calling handler <function validate_bucket_name at 0x7fa3c17a0d60>
INFO:CacheAsyncClient:cache_data/log:Message: DEBUG:botocore.hooks:Event before-parameter-build.s3.HeadObject: calling handler <function remove_bucket_from_url_paths_from_model at 0x7fa3c17a2f20>
INFO:CacheAsyncClient:cache_data/log:Message: DEBUG:botocore.hooks:Event before-parameter-build.s3.HeadObject: calling handler <bound method S3RegionRedirectorv2.annotate_request_context of <botocore.utils.S3RegionRedirectorv2 object at 0x7fa3c05c1dd0>>
INFO:CacheAsyncClient:cache_data/log:Message: DEBUG:botocore.hooks:Event before-parameter-build.s3.HeadObject: calling handler <bound method ClientCreator._inject_s3_input_parameters of <botocore.client.ClientCreator object at 0x7fa3c089efd0>>
INFO:CacheAsyncClient:cache_data/log:Message: DEBUG:botocore.hooks:Event before-parameter-build.s3.HeadObject: calling handler <function generate_idempotent_uuid at 0x7fa3c17a0b80>
INFO:CacheAsyncClient:cache_data/log:Message: DEBUG:botocore.hooks:Event before-endpoint-resolution.s3: calling handler <function customize_endpoint_resolver_builtins at 0x7fa3c17a3100>
INFO:CacheAsyncClient:cache_data/log:Message: DEBUG:botocore.hooks:Event before-endpoint-resolution.s3: calling handler <bound method S3RegionRedirectorv2.redirect_from_cache of <botocore.utils.S3RegionRedirectorv2 object at 0x7fa3c05c1dd0>>
INFO:CacheAsyncClient:cache_data/log:Message: DEBUG:botocore.regions:Calling endpoint provider with parameters: {'Bucket': 'metaflow-metaflows3bucket-3oikszn7tnso', 'Region': 'eu-west-1', 'UseFIPS': False, 'UseDualStack': False, 'ForcePathStyle': False, 'Accelerate': False, 'UseGlobalEndpoint': False, 'Key': 'metaflow/AuxiliaryTransformer/425/train/3047/3.attempt.json', 'DisableMultiRegionAccessPoints': False, 'UseArnRegion': True}
INFO:CacheAsyncClient:cache_data/log:Message: DEBUG:botocore.regions:Endpoint provider result: <https://metaflow-metaflows3bucket-3oikszn7tnso.s3.eu-west-1.amazonaws.com>
INFO:CacheAsyncClient:cache_data/log:Message: DEBUG:botocore.regions:Selecting from endpoint provider's list of auth schemes: "sigv4". User selected auth scheme is: "None"
INFO:CacheAsyncClient:cache_data/log:Message: DEBUG:botocore.regions:Selected auth type "v4" as "v4" with signing context params: {'region': 'eu-west-1', 'signing_name': 's3', 'disableDoubleEncoding': True}
INFO:CacheAsyncClient:cache_data/log:Message: DEBUG:botocore.hooks:Event before-call.s3.HeadObject: calling handler <function add_expect_header at 0x7fa3c17a1120>
INFO:CacheAsyncClient:cache_data/log:Message: DEBUG:botocore.hooks:Event before-call.s3.HeadObject: calling handler <bound method S3ExpressIdentityResolver.apply_signing_cache_key of <botocore.utils.S3ExpressIdentityResolver object at 0x7fa3c0f9ded0>>
INFO:CacheAsyncClient:cache_data/log:Message: DEBUG:botocore.hooks:Event before-call.s3.HeadObject: calling handler <function add_recursion_detection_header at 0x7fa3c1983380>
INFO:CacheAsyncClient:cache_data/log:Message: DEBUG:botocore.hooks:Event before-call.s3.HeadObject: calling handler <function inject_api_version_header_if_needed at 0x7fa3c17a2660>
INFO:CacheAsyncClient:cache_data/log:Message: DEBUG:botocore.endpoint:Making request for OperationModel(name=HeadObject) with params: {'url_path': '/metaflow/AuxiliaryTransformer/425/train/3047/3.attempt.json', 'query_string': {}, 'method': 'HEAD', 'headers': {'User-Agent': 'Boto3/1.34.125 md/Botocore#1.34.125 ua/2.0 os/linux#5.10.220-209.869.amzn2.x86_64 md/arch#x86_64 lang/python#3.11.6 md/pyimpl#CPython exec-env/AWS_ECS_FARGATE cfg/retry-mode#legacy Botocore/1.34.125'}, 'body': b'', 'auth_path': '/metaflow-metaflows3bucket-3oikszn7tnso/metaflow/AuxiliaryTransformer/425/train/3047/3.attempt.json', 'url': '<https://metaflow-metaflows3bucket-3oikszn7tnso.s3.eu-west-1.amazonaws.com/metaflow/AuxiliaryTransformer/425/train/3047/3.attempt.json>', 'context': {'client_region': 'eu-west-1', 'client_config': <botocore.config.Config object at 0x7fa3c05c1d50>, 'has_streaming_input': False, 'auth_type': 'v4', 's3_redirect': {'redirected': False, 'bucket': 'metaflow-metaflows3bucket-3oikszn7tnso', 'params': {'Bucket': 'metaflow-metaflows3bucket-3oikszn7tnso', 'Key': 'metaflow/AuxiliaryTransformer/425/train/3047/3.attempt.json'}}, 'input_params': {'Bucket': 'metaflow-metaflows3bucket-3oikszn7tnso', 'Key': 'metaflow/AuxiliaryTransformer/425/train/3047/3.attempt.json'}, 'signing': {'region': 'eu-west-1', 'signing_name': 's3', 'disableDoubleEncoding': True}, 'endpoint_properties': {'authSchemes': [{'disableDoubleEncoding': True, 'name': 'sigv4', 'signingName': 's3', 'signingRegion': 'eu-west-1'}]}}}
INFO:CacheAsyncClient:cache_data/log:Message: DEBUG:botocore.hooks:Event request-created.s3.HeadObject: calling handler <bound method RequestSigner.handler of <botocore.signers.RequestSigner object at 0x7fa3c05c1d10>>
INFO:CacheAsyncClient:cache_data/log:Message: DEBUG:botocore.hooks:Event choose-signer.s3.HeadObject: calling handler <bound method ClientCreator._default_s3_presign_to_sigv2 of <botocore.client.ClientCreator object at 0x7fa3c089efd0>>
INFO:CacheAsyncClient:cache_data/log:Message: DEBUG:botocore.hooks:Event choose-signer.s3.HeadObject: calling handler <function set_operation_specific_signer at 0x7fa3c17a0a40>
INFO:CacheAsyncClient:cache_data/log:Message: DEBUG:botocore.hooks:Event before-sign.s3.HeadObject: calling handler <function remove_arn_from_signing_path at 0x7fa3c17a3060>
INFO:CacheAsyncClient:cache_data/log:Message: DEBUG:botocore.hooks:Event before-sign.s3.HeadObject: calling handler <bound method S3ExpressIdentityResolver.resolve_s3express_identity of <botocore.utils.S3ExpressIdentityResolver object at 0x7fa3c0f9ded0>>
INFO:CacheAsyncClient:cache_data/log:Message: DEBUG:botocore.auth:Calculating signature using v4 auth.
INFO:CacheAsyncClient:cache_data/log:Message: DEBUG:botocore.auth:CanonicalRequest:
INFO:CacheAsyncClient:cache_data/log:Message: HEAD
INFO:CacheAsyncClient:cache_data/log:Message: /metaflow/AuxiliaryTransformer/425/train/3047/3.attempt.json
a
@ripe-oyster-50903 re:
The strange thing though is as i said after restarting the UI in Aws i can see the cards again.
does the behavior at some point revert to 504s?
i wonder if there is a malformed card in your card store that is resulting in this issue
r
Yes, it does revert after some time.
So you mean a malformed card blocks even the working ones?
And that's why restarting helps?
a
it could be - just a conjecture right now 🙂. let me pull in our card expert @hallowed-glass-14538 into this thread
r
For Context:
@secrets(sources=['dev_metaflow_access_mlflow'])
@gpu_profile(interval=5)
@batch(image=image, queue=queue, shared_memory=16000)
@resources(memory=24000, cpu=4, gpu=num_gpus)
@card
@step
I have only used the default card and @gpu_profile so far. At first, when I started with Metaflow, I also saved a big dataset under self, which caused the default card to not load. That's the only "special"/strange thing I have done with cards so far, from what I can remember.
h
hey! I think I know what is happening. There might be a bug on the card server side (in the UI) that I have a PR to fix properly. In the meantime, can you set the CARD_CACHE_DISK_CLEANUP_INTERVAL env variable to something like 604800 (7 days)?
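For anyone reading along, here is a minimal sketch of where the 604800 figure comes from. It assumes the value is read as a whole number of seconds, which is what the "7 days" gloss suggests. The helper names are hypothetical and are not part of the Metaflow UI service; the resulting name/value pair would still have to be added to the UI service's container environment by whatever means your deployment uses.

```python
from datetime import timedelta


def cleanup_interval_seconds(days: int) -> int:
    """Convert a number of days to whole seconds (assumed unit of the setting)."""
    return int(timedelta(days=days).total_seconds())


def card_cache_env(days: int = 7) -> dict[str, str]:
    """Hypothetical helper: build the env entry suggested in this thread."""
    return {"CARD_CACHE_DISK_CLEANUP_INTERVAL": str(cleanup_interval_seconds(days))}


if __name__ == "__main__":
    # Environment variables are strings, so the value is rendered as text.
    print(card_cache_env())
```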
r
@little-apartment-49355 I just restarted the service with CARD_CACHE_DISK_CLEANUP_INTERVAL set. Will check later if it still breaks 👍
@little-apartment-49355 @ancient-application-36103 thanks! Setting CARD_CACHE_DISK_CLEANUP_INTERVAL has fixed the issue for me :)