I don't know why anyone would think a meh-performing iGPU would encourage local LLM adoption at all. A 7B local model already won't match frontier models for many use cases - if you don't care about running locally (no privacy or network concerns), then I'd argue you should probably just use an API. If you do care about using a capable local LLM comfortably, then you should get as powerful a dGPU as your power/dollar budget allows. Your best bang/buck atm will probably be Nvidia consumer Ada GPUs (or used Ampere models).
However, for anyone who is looking to use a local model on a chip with the Radeon 890M:
- look into implementing (or waiting for) NPU support - XDNA2's 50 TOPS should provide more raw compute than the 890M for tensor math (w/ Block FP16)
- use a smaller, more appropriate model for your use case - 3B or smaller models can handle most simple requests and will of course be faster
- avoid long conversations - a fresh conversation starts with zero context, so there's no prefill to wait on; the longer the context grows, the longer each turn takes to process
- use `cache_prompt` - for bs=1 interactive use it lets you keep your input/generations in the KV cache so shared context isn't reprocessed on every turn (see the sketch after this list)
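For the `cache_prompt` point, here's a minimal sketch assuming a llama.cpp `llama-server` running locally on its default port (e.g. launched with a small GGUF model and `-ngl 99` to offload layers to the iGPU). The `/completion` endpoint and the `cache_prompt` field are llama.cpp server API options; the URL, prompts, and helper function are just illustrative:

```python
# Minimal sketch: bs=1 interactive use against a local llama.cpp server.
# "cache_prompt": true asks the server to keep the processed prompt in its
# KV cache, so a shared prefix isn't re-evaluated on the next request.
import requests

LLAMA_SERVER = "http://localhost:8080"  # assumption: default llama-server port


def complete(prompt: str, n_predict: int = 128) -> str:
    resp = requests.post(
        f"{LLAMA_SERVER}/completion",
        json={
            "prompt": prompt,
            "n_predict": n_predict,
            "cache_prompt": True,  # reuse cached KV for the common prefix
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["content"]


if __name__ == "__main__":
    system = "You are a terse assistant.\n"
    # The second call shares the system prefix; with cache_prompt the server
    # only has to prefill the new suffix, cutting time-to-first-token.
    print(complete(system + "User: What is 2+2?\nAssistant:"))
    print(complete(system + "User: Name one prime > 10.\nAssistant:"))
```

With `cache_prompt` on, the second request only prefills the part of the prompt after the shared prefix, which is where most of the interactive latency savings come from at bs=1 on a compute-limited iGPU.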