Hey all! Firstly a huge thanks in advance to anyone who spends time responding to this.

So I’m working on my MVP, which I’m about to launch (in its simplest form it’s an AI-based news aggregator).

To date my server setup has been:

  1. Data storage, scraping, and the app’s API calls all go through my DigitalOcean server: 2 GB memory, 1 AMD vCPU, 50 GB disk, running a LAMP stack on Ubuntu 20.04.

  2. All my LLM work (preprocessing and cleaning text, running LLMs from Hugging Face locally) is done on a Scaleway PLAY2-PICO instance.

A few issues I’m facing:

  1. The API calls to the DigitalOcean server are incredibly slow. It takes 5 seconds to load posts, and I’m the only one using the app.

  2. The LLM processes on the Scaleway server just get killed, I assume due to memory limits (OOM) or something similar.

So now to the question: what server architecture / providers do you use? It needs to handle large MySQL tables quickly and also run large LLM models (the two don’t need to be the same setup).

Much appreciated!

  • victrolla@alien.top · 1 year ago

    Needs more profiling. I doubt the API call itself is the bulk of the processing time. I’m completely blind to your app architecture, but I’m wondering if you’ve looked into something like Google Vertex AI. You’d benefit from hardware acceleration and horizontal scaling of your model. The API component is best run on something like Cloud Run.
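    For example, before moving anything, it’s worth wrapping the two likely suspects in timers so you know whether the 5 seconds goes to the MySQL query or to the model step. A rough sketch in Python (the driver, query, and function names below are placeholders, not your actual code):

        import time
        import pymysql  # or whatever MySQL driver the API actually uses

        def timed(label, fn, *args, **kwargs):
            # Crude profiler: print how long each stage of the request takes.
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            print(f"{label}: {time.perf_counter() - start:.3f}s")
            return result

        conn = pymysql.connect(host="localhost", user="app", password="***", database="news")

        def fetch_posts():
            with conn.cursor() as cur:
                # Placeholder query; substitute the real one the app runs.
                cur.execute("SELECT id, title, summary FROM posts ORDER BY published_at DESC LIMIT 20")
                return cur.fetchall()

        posts = timed("mysql fetch", fetch_posts)
        # summaries = timed("llm call", summarize_posts, posts)  # time the model step the same way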

    Usually what I see from folks in your position is a Python web app like Flask running the model directly. This is a really bad plan. You need to decompose: push the model into Vertex and make an API call to it from the backend.
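    Roughly what the decomposed version looks like from the Flask side, assuming the model is already deployed to a Vertex AI endpoint (the project, region, and endpoint ID below are placeholders):

        from flask import Flask, jsonify, request
        from google.cloud import aiplatform

        app = Flask(__name__)

        # One-time client setup; project and region are placeholders.
        aiplatform.init(project="my-project", location="us-central1")
        endpoint = aiplatform.Endpoint(
            "projects/my-project/locations/us-central1/endpoints/1234567890"
        )

        @app.route("/summarize", methods=["POST"])
        def summarize():
            text = request.json["text"]
            # The heavy lifting happens on the Vertex endpoint, not in this process.
            prediction = endpoint.predict(instances=[{"text": text}])
            return jsonify(prediction.predictions)

    The Flask process then stays small and stateless, which is also what makes it a good fit for Cloud Run.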

    There are a lot of possible optimizations, but there’s not enough detail to know specifics.

    • EveryThingPlay@alien.top · 1 year ago

      Agreed, profiling is really needed to see where the bottleneck is. My guess is the LLM itself (some of them are really heavy), but it could also just be wrong server settings. And yes, running the whole neural network inside the Flask app is a bad idea, not only for performance but for stability: if something crashes in the model, there’s a high risk the Flask app crashes too and the whole application gets stuck.
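      In practice that just means the Flask app talks to the model over HTTP with a timeout and fails gracefully instead of going down with it. A minimal sketch (the service URL and route are placeholders):

          import requests
          from flask import Flask, jsonify

          app = Flask(__name__)
          INFERENCE_URL = "http://inference-service:8080/summarize"  # placeholder address

          @app.route("/posts/<int:post_id>/summary")
          def post_summary(post_id):
              try:
                  resp = requests.post(INFERENCE_URL, json={"post_id": post_id}, timeout=10)
                  resp.raise_for_status()
                  return jsonify(resp.json())
              except requests.RequestException:
                  # Model service is down or slow: fail this one request, not the whole app.
                  return jsonify({"error": "summary temporarily unavailable"}), 503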