fast-pizza-24629
06/12/2023, 2:51 PM
metaflow service. I can run Metaflow jobs natively in EKS and everything works well (see the job example), and I can see in the logs (CloudWatch) for the service that everything runs.
apiVersion: batch/v1
kind: Job
metadata:
  name: veo-ai-metaflow-native-jobs
  namespace: metaflow
spec:
  ttlSecondsAfterFinished: 100
  template:
    spec:
      serviceAccountName: metaflow
      containers:
        - name: <NAME>
          image: <IMAGE>
          command: ["bash", "-c"]
          args:
            - pip install kubernetes boto3 --ignore-installed &&
              python path/to/flow.py run --with kubernetes --max-num-splits 3000;
          env:
            - name: AWS_DEFAULT_REGION
              value: <REGION>
            - name: METAFLOW_SERVICE_URL
              value: <METAFLOW_SERVICE_URL>
            - name: METAFLOW_SERVICE_AUTH_KEY
              value: <METAFLOW_SERVICE_AUTH_KEY>
            - name: METAFLOW_DEFAULT_METADATA
              value: service
            - name: METAFLOW_ECS_S3_ACCESS_IAM_ROLE
              value: <METAFLOW_ECS_S3_ACCESS_IAM_ROLE>
            - name: METAFLOW_DATASTORE_SYSROOT_S3
              value: <METAFLOW_DATASTORE_SYSROOT_S3>
            - name: METAFLOW_DATATOOLS_SYSROOT_S3
              value: <METAFLOW_DATATOOLS_SYSROOT_S3>
            - name: METAFLOW_DEFAULT_DATASTORE
              value: s3
            - name: USERNAME
              value: alex
            - name: METAFLOW_RUNTIME_IN_CLUSTER
              value: 'yes' # Important environment variable.
            - name: METAFLOW_KUBERNETES_NAMESPACE
              value: metaflow
            - name: METAFLOW_KUBERNETES_SERVICE_ACCOUNT
              value: metaflow
      restartPolicy: Never
  backoffLimit: 4
However, if I try with python path/to/flow.py run --with kubernetes ...
I get this:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='metaflow-nlb-pn0ancjz-460897aea31b08ac.elb.eu-west-1.amazonaws.com', port=80): Max retries exceeded with url: /flows/SportBasicFlow/runs/287/steps/start/tasks/39602/metadata (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fa11f6bec70>: Failed to establish a new connection: [Errno 110] Connection timed out'))
The job is created in EKS and I can see that the steps for setting up the environment, downloading the code package, etc. are completed, but then it fails at "task is starting". I am probably missing a port configuration or something. If you have any ideas on how to solve this, it would be highly appreciated. Thanks a lot for the great work 💪
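For reference, the timeout above can be narrowed down by testing plain TCP reachability of the metadata service from the place where the task actually runs (for example, inside the pod). This is only a minimal sketch: the helper names `split_host_port` and `can_connect` are made up for illustration, and it assumes the service address is available in the `METAFLOW_SERVICE_URL` environment variable. It does not talk to the Metaflow API; it only tries to open a socket.

```python
import os
import socket
from urllib.parse import urlparse


def split_host_port(url: str) -> tuple[str, int]:
    """Extract (host, port) from a URL, defaulting to 80 for http and 443 for https."""
    parsed = urlparse(url)
    default_port = 443 if parsed.scheme == "https" else 80
    return parsed.hostname or "", parsed.port or default_port


def can_connect(host: str, port: int = 80, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS resolution failures.
        return False


if __name__ == "__main__":
    # Assumption: the service URL is set in the environment, as in the Job spec.
    url = os.environ.get("METAFLOW_SERVICE_URL", "")
    host, port = split_host_port(url)
    if not host:
        print("METAFLOW_SERVICE_URL is not set or has no host")
    else:
        status = "reachable" if can_connect(host, port) else "NOT reachable"
        print(f"{host}:{port} is {status}")
```

If this reports the host as not reachable from inside the pod but reachable from the machine that launches the flow, the problem is more likely network routing or security group rules between the cluster and the load balancer than Metaflow itself.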