I am seeing widespread issues trying to create instances using the Python SDK/API. The API returns an APIException “server_error”. In some cases, the dashboard shows “Error” as the instance status. In other cases the instances are created despite the API error, which requires manual cleanup in the DataCrunch dashboard. As you can see from my forum posts, we are having a lot of issues with the API/SDK. It is no longer limited to FIN-03/B200s; I’m seeing it with other instance types at other locations, such as A100 (FIN-01) and H200 (FIN-02/ICE-01). This is a NEW issue. I have been using the exact same scripts to spin up instances without issue for several months, using the same volumes, etc.
Hi,
Sorry to hear that you’re having trouble. Please let us know your User ID and Project ID so we can investigate the problem ASAP.
I replied in a private conversation; I prefer to keep the User ID and Project ID confidential.
FYI, I have now tried spinning up instances using direct HTTP API calls (https://api.datacrunch.io/v1/instances) and am getting the same “server_error” response. So it’s not a Python SDK issue.
Here’s what’s happening and what DataCrunch needs to address:
- We create instances through the documented flow: POST /v1/instances with instance_type, hostname, etc. That call returns HTTP 202 with the new instance ID, and the instance really does come up (we can see it running later).
- The problem is the very next step: calling GET /v1/instances/{id}, either directly over HTTPS or via the official Python SDK’s InstancesService.get_by_id, almost immediately after the POST. That GET frequently comes back with HTTP 500 and the JSON error {"code":"server_error","message":null}. Because the SDK just wraps that response in APIException, every client that does “create → fetch details” fails even though the instance exists.
- To prove it’s a transient backend issue, we added a trivial workaround: after the POST returns the new ID, we wait a couple of seconds (and retry up to a few times). Once we delay, GET /v1/instances/{id} succeeds and returns the full instance payload. That’s why we’re confident the instance is being created correctly—the API just needs a breather before the single-instance endpoint stops returning 500.
So the ask for DataCrunch:
- Restore GET /v1/instances/{id} so it’s stable immediately after a successful create, or explicitly document/handle the consistency delay in the SDK.
- Update the SDK (instances.get_by_id) to incorporate a retry/backoff if the backend keeps requiring a startup delay, so downstream clients don’t all have to implement their own hacks.
Until then we’re sleeping a couple of seconds before the lookup, but it’s a stopgap.
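For anyone hitting the same thing, here is a minimal sketch of the stopgap described above: wait a couple of seconds after the create, then retry the lookup a few times with a growing delay. The helper name get_after_create, the TransientServerError class, and the default attempt count and delays are all hypothetical choices for illustration, not part of the DataCrunch SDK. The lookup itself is passed in as a callable, so the same helper works whether you call the SDK or the HTTP endpoint directly.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientServerError(Exception):
    """Hypothetical stand-in for an API error whose code is 'server_error'."""


def get_after_create(
    fetch: Callable[[], T],
    is_transient: Callable[[Exception], bool],
    attempts: int = 4,            # assumed value; "a few times" in the post
    delay: float = 2.0,           # assumed value; "a couple of seconds"
    factor: float = 2.0,          # assumed backoff multiplier
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Wait, then call fetch(); retry with backoff while the error is transient."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        # Sleep before every attempt, including the first, so the backend
        # gets its "breather" right after the create call.
        sleep(delay)
        try:
            return fetch()
        except Exception as exc:
            # Give up on the last attempt or on errors that are not transient.
            if attempt == attempts or not is_transient(exc):
                raise
        delay *= factor
    raise AssertionError("unreachable")


# Assumed usage with the SDK (not executed here; adapt to your client setup):
#   instance = get_after_create(
#       lambda: client.instances.get_by_id(instance_id),
#       is_transient=lambda e: getattr(e, "code", None) == "server_error",
#   )
```

The sleep function is injectable only so the helper can be exercised without real waiting. Errors that the is_transient check rejects are raised immediately, so a genuinely bad instance ID still fails fast.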
Hello proteindesigner,
Thank you very much for all the details you provided. We deployed a fix yesterday that resolves this issue.
We’ve added credit to your DataCrunch account since your input helped us track this down.
Alexey
Thanks so much. I’m just glad to see everything running super smoothly at my favorite neocloud provider.