According to “Based on the hardware and system configuration, we recommend participants to begin with 7B models. According to our experiments, 7B models like Vicuna-7B and Mistral can perform inference smoothly on 2 NVIDIA T4 GPUs”, does this mean Vicuna-7B, and even Mistral-7B (which is slower according to other discussions), won't get timeouts if everything is set up correctly?
And I wonder how we could accelerate inference by parallelizing across the two T4 GPUs, given that the caller only invokes the function "predict", which takes in one sample at a time.
According to our tests, Vicuna-7B never gets timeouts. However, Mistral will (“smoothly” refers to GPU memory rather than the time limit).
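On the parallelization question: with two 16 GB T4s, the second GPU is typically used to hold the other half of the model's weights rather than to speed anything up, so inference still runs one sample at a time. Below is a minimal sketch of this setup, assuming Hugging Face Transformers with `device_map="auto"` (the checkpoint name and the `predict` signature are illustrative assumptions, not the official harness):

```python
# Sketch: shard a 7B model across two 16 GB T4s with device_map="auto".
# The two GPUs mainly add memory capacity, not throughput.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lmsys/vicuna-7b-v1.5"  # hypothetical checkpoint choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # fp16 weights: ~14 GB, split over the GPUs
    device_map="auto",          # requires `accelerate`; layers spread over cuda:0/cuda:1
)

def predict(prompt: str) -> str:
    # One sample at a time, mirroring the challenge's single-sample API.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```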
At present, we do not support batched inference. One reason is that there is not much room to batch when running 7B models with 32 GB of GPU memory (a single sample takes ~18 GB, leaving too little headroom for another). We will implement that in Phase 2.
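If you want to verify the per-sample memory claim on your own setup, a rough way is to reset PyTorch's peak-memory counters, run one prediction, and read back the high-water mark on each GPU. This is a sketch reusing the hypothetical `predict` from above; it is not part of the evaluation harness:

```python
# Sketch: measure peak GPU memory for a single predict() call.
import torch

def peak_memory_per_gpu(fn, *args):
    for i in range(torch.cuda.device_count()):
        torch.cuda.reset_peak_memory_stats(i)
    result = fn(*args)
    peaks = [torch.cuda.max_memory_allocated(i) / 2**30
             for i in range(torch.cuda.device_count())]
    return result, peaks  # peak usage in GiB per device

_, peaks = peak_memory_per_gpu(predict, "Explain KV caching in one sentence.")
print(peaks)  # with the model sharded, the ~18 GB total is split across the two T4s
```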