As far as I know, LLMs are pretty GPU-hungry, so I'd go with a dedicated GPU server for the best results, though those can get expensive. Still, it's worth checking them out; there may be some affordable options for your team.
Agreed, profiling is really needed to see where the bottleneck is. My guess is the LLM itself (some of them are really heavy), but it could just as well be a misconfigured server. And yes, running the whole neural network inside the Flask app is a bad idea, not only for performance but also for stability: if something crashes in the model, there's a high risk the Flask app crashes with it and the whole application gets stuck. The safer pattern is to run the model as a separate process or service and have Flask just forward requests to it, as in the sketch below.
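Here's a minimal sketch of that separation, assuming the model runs in its own inference server behind a hypothetical HTTP endpoint at localhost:8000/generate (the URL, port, and payload shape are made up for the example; adjust them to whatever your model server actually exposes):

```python
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical address of a separately running inference server;
# replace with your actual model service endpoint.
MODEL_SERVER_URL = "http://localhost:8000/generate"

@app.route("/ask", methods=["POST"])
def ask():
    prompt = request.get_json().get("prompt", "")
    try:
        # Forward the prompt to the model process. The timeout keeps
        # a Flask worker from hanging forever if the model stalls.
        resp = requests.post(
            MODEL_SERVER_URL, json={"prompt": prompt}, timeout=30
        )
        resp.raise_for_status()
        return jsonify(resp.json())
    except requests.RequestException:
        # If the model server dies, Flask stays up and returns a
        # clean error instead of crashing along with it.
        return jsonify({"error": "model backend unavailable"}), 503

if __name__ == "__main__":
    app.run(port=5000)
```

With this split, a model crash just means 503 responses until the inference process restarts, and you can profile (or scale) the model server independently of the web app.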