If we fail to create a session on the "best" agent (because it is unresponsive or for some other reason), we try the next-best agent. If that also fails, we report back to the client that we could not find any working agents. If there are more agents in the cluster, this might not be the best approach, as we might succeed if we continued trying.
The main problem with trying all agents is that it can take time. We give each agent 40 seconds before timing out and moving on, so even with the existing approach it can take up to 80 seconds before the user gets any feedback. However, we might also detect that an agent is broken very quickly, in which case we probably have time to test more agents.
So doing this would probably need to be paired with work on user feedback: either keeping an eye on the elapsed time, or making the client more responsive (e.g. bug 1197).
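One way to "keep an eye on the elapsed time" is to cap the whole search with a total budget rather than a fixed number of attempts. The following is only a rough sketch of that idea, not the existing implementation: `create_session_with_fallback`, `try_create` and `TOTAL_BUDGET` are hypothetical names, and `try_create(agent, timeout)` is assumed to return a session on success or `None` on failure or timeout. Only the 40-second per-agent timeout and the 80-second worst case come from the description above.

```python
import time

PER_AGENT_TIMEOUT = 40.0  # seconds per agent, as described above
TOTAL_BUDGET = 80.0       # assumption: keep today's worst case as the overall cap


def create_session_with_fallback(agents, try_create,
                                 total_budget=TOTAL_BUDGET,
                                 per_agent_timeout=PER_AGENT_TIMEOUT,
                                 clock=time.monotonic):
    """Try agents in priority order until one works or the budget runs out.

    `agents` is assumed to be ordered best-first already.
    `try_create(agent, timeout)` is a hypothetical callback that returns a
    session object, or None if the agent failed or timed out.
    """
    deadline = clock() + total_budget
    for agent in agents:
        remaining = deadline - clock()
        if remaining <= 0:
            break  # out of time: stop and report failure to the client
        # Never give a single agent more time than is left in the budget.
        session = try_create(agent, min(per_agent_timeout, remaining))
        if session is not None:
            return session
    return None  # caller reports "no working agents" to the client
```

With this shape, agents that fail quickly cost almost nothing, so more of them get tried, while two agents that each hang for the full timeout still end the search at roughly the same point as today.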
Hopefully this scenario happens very rarely in practice. If we've already detected that an agent is unresponsive then we'll ignore it (although see bug 4243). And if an agent fails to create sessions, it should get penalty points so that other agents get prioritized.
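To illustrate how skipping unresponsive agents and applying penalty points could interact, here is a small sketch. It is not the actual scheduler: the `Agent` fields, `candidate_order` and `record_failure` are hypothetical, and ranking by `load` is an assumption standing in for whatever currently decides which agent is "best".

```python
from dataclasses import dataclass


@dataclass
class Agent:
    name: str
    load: float                 # assumption: lower load means a "better" agent
    penalty: int = 0            # accumulated penalty points
    unresponsive: bool = False  # set once the agent is known to be unresponsive


def candidate_order(agents):
    """Return usable agents best-first: fewest penalty points, then lowest load."""
    usable = [a for a in agents if not a.unresponsive]
    return sorted(usable, key=lambda a: (a.penalty, a.load))


def record_failure(agent, points=1):
    """Penalize an agent that failed to create a session."""
    agent.penalty += points
```

After a failed attempt, calling `record_failure` on the agent pushes it behind every agent with fewer penalty points the next time `candidate_order` is computed, so repeated failures on the same agent should become rare.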