LLMs
Benchmarking Goliath 120B
Mar 21, 2024
3 min read
We set out to evaluate the performance of Goliath 120B for one of our customers, who is a fan of this model, and to determine how best to serve it for approximately 1,000 daily active users (DAU).
We wanted to know how well it could handle various loads and configurations, so we prepared two types of tests: a surge load test and a static load test.
Surge load test: We measured the number of concurrent requests the server could handle at any given time. We tested 8, 12, and 20 concurrent requests. Maintaining good latencies under surge load is very difficult, but if surges happen only a few times a day, they are easy to handle with acceptable latencies.
Static load test: We simulated real-world usage, where users come and go after a few conversations, maintaining a continuous load on the system. We ran each configuration for 10 minutes. In practice, this is the test in which latencies need to stay stable.
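To make the static load test concrete, here is a minimal sketch of the arithmetic behind a constant-rate run. The `StaticLoadPlan` class is a hypothetical helper, not part of any tool used here; the example values (18 requests per minute for 10 minutes) are taken from the Vegeta command in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StaticLoadPlan:
    """Hypothetical helper describing a constant-rate load test."""

    requests_per_minute: float
    duration_minutes: float

    @property
    def total_requests(self) -> int:
        # A constant-rate test sends rate * duration requests in total.
        return round(self.requests_per_minute * self.duration_minutes)

    @property
    def seconds_between_requests(self) -> float:
        # Gap between consecutive requests at a constant rate.
        return 60.0 / self.requests_per_minute


plan = StaticLoadPlan(requests_per_minute=18, duration_minutes=10)
print(plan.total_requests)                      # 180
print(round(plan.seconds_between_requests, 2))  # 3.33
```

At 18 req/min a new request arrives roughly every 3.3 seconds, so any configuration whose requests take much longer than that will have several requests in flight at once.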
All tests were conducted with the following configurations:
TGI:
--max-input-length=3072 --max-total-tokens=4096 --num-shard=4 --port 8000 --model-id=alpindale/goliath-120b
Vegeta:
echo "POST http://<URL>/generate" | ./vegeta attack -body payload_tgi.json -connections=100 -duration=10m -header 'Content-Type: application/json' -output=results_fp16_4xA100_18req_speculative.bin -timeout=300s -rate=18/1m
Payload: 2,000 input tokens, 100 max tokens
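As a rough illustration of how a payload like this relates to the server limits, the sketch below builds a request body for TGI's `/generate` endpoint and checks it against the two token limits from the TGI flags. The helper names are hypothetical, the placeholder prompt only approximates 2,000 tokens (the real count depends on the model's tokenizer), and mapping "100 max tokens" to `max_new_tokens` is our assumption.

```python
import json

MAX_INPUT_LENGTH = 3072  # mirrors --max-input-length
MAX_TOTAL_TOKENS = 4096  # mirrors --max-total-tokens


def fits_token_budget(input_tokens: int, max_new_tokens: int) -> bool:
    """Return True if a request fits within both server limits above."""
    return (
        input_tokens <= MAX_INPUT_LENGTH
        and input_tokens + max_new_tokens <= MAX_TOTAL_TOKENS
    )


def build_payload(prompt: str, max_new_tokens: int) -> str:
    """Serialize a request body for TGI's /generate endpoint."""
    return json.dumps(
        {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    )


# Placeholder prompt: repeating a short word is only an approximation
# of a 2,000-token input, not the prompt used in the benchmark.
prompt = " ".join(["hello"] * 2000)
body = build_payload(prompt, max_new_tokens=100)

print(fits_token_budget(2000, 100))  # True
```

Writing `body` to a file such as `payload_tgi.json` would give Vegeta something to send via `-body`. With 2,000 input tokens and 100 generated tokens, each request uses 2,100 of the 4,096 total tokens allowed.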
Surge load testing
We put Goliath 120B to the test, pushing it with 8, 12, and 20 concurrent requests. Each time, we observed the average, lowest, and highest latencies.
We found that the server could handle up to 20 concurrent requests without running into out-of-memory (OOM) issues.
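A surge test of this shape can be sketched in a few lines. This is not the harness we used; it is a minimal illustration that fires a batch of requests at once and reports the average, lowest, and highest latency. The `send` callable and `fake_send` are hypothetical stand-ins for a real HTTP POST to the `/generate` endpoint.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable, Dict


def timed_call(send: Callable[[], None]) -> float:
    """Run one request and return its latency in seconds."""
    start = time.perf_counter()
    send()
    return time.perf_counter() - start


def surge_test(send: Callable[[], None], concurrency: int) -> Dict[str, float]:
    """Fire `concurrency` requests at once and summarize their latencies."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(timed_call, send) for _ in range(concurrency)]
        latencies = [f.result() for f in futures]
    return {"avg": mean(latencies), "min": min(latencies), "max": max(latencies)}


def fake_send() -> None:
    # Hypothetical stand-in for an HTTP POST to the /generate endpoint.
    time.sleep(0.01)


for n in (8, 12, 20):
    print(n, surge_test(fake_send, n))
```

Swapping `fake_send` for a function that posts the payload to the server would turn this into a real surge test at 8, 12, and 20 concurrent requests.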
Static load testing
We then moved on to a more real-world scenario, the static load test. This test simulated continuous user interaction over a 10-minute period. We tried various configurations to see how Goliath 120B would fare.
So here’s what we found
After putting Goliath 120B through its paces, we found that FP16 delivered the fastest performance, with EETQ only slightly behind.
We also found that higher latency causes requests to spend longer in the queue, making the server less stable.
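The link between latency and queue time can be illustrated with a deliberately simplified model: a single worker serving requests in arrival order. TGI batches requests, so this is not how the server actually schedules work; the function and the numbers below are illustrative assumptions, not measurements from the benchmark.

```python
from typing import List


def queue_waits(
    arrival_interval: float, service_time: float, n_requests: int
) -> List[float]:
    """Time each request spends queued before a single worker picks it up."""
    waits = []
    worker_free_at = 0.0
    for i in range(n_requests):
        arrival = i * arrival_interval
        # A request starts when it has arrived and the worker is free.
        start = max(arrival, worker_free_at)
        waits.append(start - arrival)
        worker_free_at = start + service_time
    return waits


# Illustrative numbers only, not measurements from the benchmark.
print(queue_waits(arrival_interval=2.0, service_time=1.5, n_requests=4))
# [0.0, 0.0, 0.0, 0.0]
print(queue_waits(arrival_interval=2.0, service_time=3.0, n_requests=4))
# [0.0, 1.0, 2.0, 3.0]
```

Once each request takes longer to serve than the gap between arrivals, the wait grows with every new request instead of settling, which is the instability described above.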
But here's the kicker: 8-bit EETQ was the best quantization option, performing similarly to FP16!
Our Grand Conclusion
So, what's the fastest way to deploy Goliath 120B? Our recommendation:
Use FP16 on A100. It scales well beyond 30 req/min in static load tests and handles 20 concurrent requests during surge load.
But wait, there's more! With new research coming out, we figured, why not see how much speculative decoding helps 🙂 (especially since TGI added support for it in v1.3.0).
Here is what we found:
As expected, it brought down the latency by 3 seconds!
It also brought down latencies at 30 req/min by 4 seconds!
Written by
Rohan Pooniwala
CTO
Enterprise GenAI Stack.
LLMs on your cloud & data.
© 2024 NimbleBox, Inc.