Hey all! Firstly a huge thanks in advance to anyone who spends time responding to this.

So I’m working on my MVP, which I’m about to launch (in its simplest form it’s an AI-based news aggregator).

To date my server setup has been:

  1. Data storage, scraping, and the app’s API calls all go through my DigitalOcean server: 2 GB memory, 1 AMD vCPU, 50 GB disk, running a LAMP stack on Ubuntu 20.04.

  2. All my LLM work (preprocessing and cleaning text, running LLMs from Hugging Face locally) is done on a Scaleway PLAY2-PICO instance.

A few issues I’m facing:

  1. The API calls to the DigitalOcean server are incredibly slow. It takes 5 seconds to load posts, and I’m the only one using the app.

  2. The LLM processes on the Scaleway server just get killed, I assume due to memory limits (OOM) or something similar.

So now to the question: what server architecture / providers do you use? It needs to handle large MySQL tables quickly and also run large LLM models (the two don’t need to be the same setup).

Much appreciated!

  • victrolla@alien.top · 1 year ago

    Needs more profiling. I doubt the API call itself is the bulk of the processing time. I’m completely blind to your app architecture, but I’m wondering if you’ve looked into something like Google Vertex AI. You’d benefit from hardware acceleration and horizontal scaling of your model. The API component is best run on something like Cloud Run.
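    For example, before moving anything, it’s worth wrapping the two likely suspects in timers so you know whether the 5 seconds goes to the MySQL query or to the model step. A rough sketch in Python (the driver, query, and function names below are placeholders, not your actual code):

        import time
        import pymysql  # or whatever MySQL driver the API actually uses

        def timed(label, fn, *args, **kwargs):
            # Crude profiler: print how long each stage of the request takes.
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            print(f"{label}: {time.perf_counter() - start:.3f}s")
            return result

        conn = pymysql.connect(host="localhost", user="app", password="***", database="news")

        def fetch_posts():
            with conn.cursor() as cur:
                # Placeholder query; substitute the real one the app runs.
                cur.execute("SELECT id, title, summary FROM posts ORDER BY published_at DESC LIMIT 20")
                return cur.fetchall()

        posts = timed("mysql fetch", fetch_posts)
        # summaries = timed("llm call", summarize_posts, posts)  # time the model step the same way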

    Usually what I see from folks in your position is a Python web app like Flask running the model directly. This is a really bad plan. You need to decompose: push the model into Vertex and make an API call to it from the backend.
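    Roughly what the decomposed version looks like from the Flask side, assuming the model is already deployed to a Vertex AI endpoint (the project, region, and endpoint ID below are placeholders):

        from flask import Flask, jsonify, request
        from google.cloud import aiplatform

        app = Flask(__name__)

        # One-time client setup; project and region are placeholders.
        aiplatform.init(project="my-project", location="us-central1")
        endpoint = aiplatform.Endpoint(
            "projects/my-project/locations/us-central1/endpoints/1234567890"
        )

        @app.route("/summarize", methods=["POST"])
        def summarize():
            text = request.json["text"]
            # The heavy lifting happens on the Vertex endpoint, not in this process.
            prediction = endpoint.predict(instances=[{"text": text}])
            return jsonify(prediction.predictions)

    The Flask process then stays small and stateless, which is also what makes it a good fit for Cloud Run.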

    There are a lot of possible optimizations, but there’s not enough detail to know specifics.

    • EveryThingPlay@alien.top · 1 year ago

      Agreed, profiling is really needed to see where the bottleneck is. My guess is the LLM itself (some of them are really heavy), but it could also just be wrong server settings. And yes, running the whole neural network inside the Flask app is a bad idea, not only for performance but for stability: if something crashes in the model, there’s a high risk the Flask app crashes too and the whole application gets stuck.
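      In practice that just means the Flask app talks to the model over HTTP with a timeout and fails gracefully instead of going down with it. A minimal sketch (the service URL and route are placeholders):

          import requests
          from flask import Flask, jsonify

          app = Flask(__name__)
          INFERENCE_URL = "http://inference-service:8080/summarize"  # placeholder address

          @app.route("/posts/<int:post_id>/summary")
          def post_summary(post_id):
              try:
                  resp = requests.post(INFERENCE_URL, json={"post_id": post_id}, timeout=10)
                  resp.raise_for_status()
                  return jsonify(resp.json())
              except requests.RequestException:
                  # Model service is down or slow: fail this one request, not the whole app.
                  return jsonify({"error": "summary temporarily unavailable"}), 503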