Local LLMs, Hardware, UI and Models

I would find that very doubtful as things stand. I think it would be quite some time before any but the smallest of LLMs would run on a phone. And by run I mean usable as an app alongside the other essential functions of your phone. Hell, we need to establish if we can run a node effectively on a phone yet. I’m reasonably confident that once we get some plumbing sorted, it will be feasible resource-wise - though as always battery life will be a concern.

This will require a new way of thinking for app devs and may be a great leveller, because even the most experienced devs will be “starting from scratch” to a great extent. Or am I too rigid in my thinking?

3 Likes

If the last 24 hours have taught me anything, it is that I have a huge amount to learn on the subject.

The question on my mind this morning is: does quantization make models less reliable? Is it a trade-off that makes running on smaller devices more feasible?
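
To put rough numbers on the size side of that trade-off: weight memory is roughly parameters × bits per weight ÷ 8, so halving the bit width roughly halves the footprint, at some cost in output quality. A minimal sketch, with the bit widths and overhead only approximate:

```python
# Rough rule of thumb for how quantization shrinks a model's memory footprint.
# Ignores KV cache and runtime overhead; bit widths are approximate.

def approx_weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB: parameters * bits / 8."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for label, bits in [("FP16", 16), ("8-bit (Q8)", 8), ("4-bit (Q4_K)", 4.5)]:
    gb = approx_weight_size_gb(7, bits)   # a 7B model such as Mistral 7B
    print(f"7B model at {label}: ~{gb:.1f} GB of weights")

# Prints roughly: 14.0 GB (FP16), 7.0 GB (Q8), 3.9 GB (Q4_K)
```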

I think David has spent a lot of time on this and I am going to have to assume he knows what is possible and has it mostly figured out.
But with the caveat that he likely has that optimism required of a founder :slight_smile:

6 Likes

Not to mention an absolute beast of a laptop :slight_smile:

4 Likes

For smaller machines, stick with 7B models or smaller, though.
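
For anyone who wants to try that, here is a minimal sketch of running a quantized 7B model locally with llama-cpp-python; the GGUF filename is illustrative, and any 4-bit 7B model from Hugging Face should behave similarly:

```python
# Minimal sketch of running a quantized 7B model with llama-cpp-python
# (pip install llama-cpp-python). The model file path is illustrative.
from llama_cpp import Llama

llm = Llama(
    model_path="./mistral-7b-instruct-v0.2.Q4_K_M.gguf",  # ~4 GB of quantized weights
    n_ctx=4096,        # context window; larger contexts need more RAM
    n_gpu_layers=-1,   # offload every layer to the GPU (Metal on a Mac)
)

out = llm("Q: What is quantization in one sentence? A:", max_tokens=64)
print(out["choices"][0]["text"])
```

With `n_gpu_layers=-1` every layer goes to the GPU if one is available; leave it at the default to run purely on the CPU.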

4 Likes

So just to be clear, you can run DBRX on that in a usable way? One might see 96 GB of memory and be green with envy, but then when I read the DBRX docs they say it requires 256 GB and potentially multiple GPUs.

The more I look at these local models, the more I am itching to go waste a bunch of money on hardware. I am a bit like a kid in a candy store right now :slight_smile: Obviously going over 128 GB requires a whole new build and a lot more than just RAM.

Privacy issues put to the side, it is $20 a month vs thousands in hardware. I guess the value proposition depends on the user’s needs, but I am interested in your take on that.

Does a user who spends a bunch of money on hardware get a significantly better outcome than using GPT-4, significant enough to justify the price?
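
Just on the raw money side (leaving privacy and capability out of it), the break-even arithmetic is simple; the hardware figure below is purely illustrative:

```python
# Naive break-even: one-off hardware spend vs a $20/month subscription.
# Ignores electricity, resale value, capability differences and privacy.
hardware_cost = 4000            # illustrative, e.g. a high-memory Mac
subscription_per_month = 20

months = hardware_cost / subscription_per_month
print(f"Break-even after ~{months:.0f} months (~{months / 12:.1f} years)")
# -> Break-even after ~200 months (~16.7 years)
```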

4 Likes

… I just found out today that Macs share their RAM between GPU and CPU, and it’s more or less the cheapest way to get a lot of GPU memory…

Because offloading to RAM on Windows/Linux is way slower than a Mac just utilising huge amounts of RAM with the GPU directly.

So even with an older MacBook with 32 GB RAM you should be able to run a quantized Mixtral (which is pretty powerful) at decent speeds… while a graphics card with comparable memory for Windows/Linux wouldn’t fit into a laptop and would cost at least 2k…
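
The arithmetic behind that claim (parameter count and quantization overhead are approximate):

```python
# Why a 4-bit Mixtral fits in 32 GB of unified memory but not in a consumer GPU.
# Mixtral 8x7B has roughly 47B total parameters (approximate figure).
total_params = 47e9
bits_per_weight = 4.5            # typical 4-bit GGUF quantization incl. overhead

weights_gb = total_params * bits_per_weight / 8 / 1e9
print(f"~{weights_gb:.0f} GB of weights")
# ~26 GB: fits alongside the OS in 32 GB of unified RAM,
# but not in the 16-24 GB of VRAM on high-end consumer cards.
```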

4 Likes

I can run it easily on this machine. The M2 chips are really good.

My take is the h/w requirement puts me 6 months ahead of the rest. Raspi will catch up. My belief is this is the way we are headed, and I want to feel it first and create great things.

8 Likes

The trick is that @dirvine neither uses the 256 GB model nor runs it on the CPU; he really runs it on his graphics cores.

You can run quantized versions of it that are way smaller, faster, and nearly as good in output quality as the 263 GB original version…

He belongs to a small group of privileged people using Macs, which can utilise their RAM for the graphics card as well… I ran models of up to 45 GB on my company laptop on the CPU, but that gets really slow with a larger context length… You can watch single characters appear and it’s far from usable… (and you need a larger context for it to be useful, because the context contains the data that needs to be processed plus the task description)

So any model not fitting fully into your graphics card is more or less a pipe dream on Windows/Linux for now (shared RAM seems to be way, way slower than the Mac way of sharing)… At least that’s my opinion from what I have seen in my tests so far.

I would think the Linux world is lagging behind much more than that… and Windows again worse…
… That separation between graphics and CPU makes it an entirely different playing field…
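
For what it’s worth, the usual PC workaround is partial offload: push as many layers as fit into VRAM and leave the rest on the CPU. A sketch with llama-cpp-python (filename and layer count are illustrative); it works, but the CPU-resident layers are exactly what makes it crawl:

```python
# Partial offload on a PC whose GPU is too small for the whole model:
# some layers live in VRAM, the rest run on the CPU from system RAM,
# which is why generation is so much slower than on a Mac where the GPU
# addresses all of the unified memory directly.
from llama_cpp import Llama

llm = Llama(
    model_path="./mixtral-8x7b-instruct.Q4_K_M.gguf",  # illustrative filename
    n_ctx=8192,        # a bigger context also costs RAM for the KV cache
    n_gpu_layers=20,   # however many layers fit in VRAM; the rest stay on CPU
)
```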

4 Likes

Oh sorry, I missed your earlier post, so I wasn’t telling you anything new… Except maybe that quantization is the reason David can run mighty models on his insanely powerful machine… And this alone won’t bring LLMs anywhere near phones or regular users’ hardware…

2 Likes

Looking at Macs now (just browsing), the newer model is ~4000, which is significantly cheaper than the tinybox linked above, portable, and in my opinion probably way more valuable in terms of usefulness.

It is hard for me; I have always given Apple a wide berth. I don’t like the walled garden.

5 Likes

I was too, but I am massively impressed with the h/w and the package. It really works smoothly.

2 Likes

Yea I missed that. Good catch @riddim

1 Like

Same here… It’s the first time I am really considering saving money to get an Apple…
For 1k you can get a used Mac with 64 GB RAM on eBay… That’s close to the capacity of a 20k graphics card…

2 Likes

What about the latest AMD APUs? Those share CPU RAM with the GPU as well, and they also have AI accelerators on-chip.

Anyone using one with a large amount of RAM?

1 Like

To me what makes most sense is having a reasonably powerful machine, no screen needed, connected to Autonomi, and connecting to it from lighter devices like phones.
Call it NAAi (Network Attached Ai) :slightly_smiling_face:
All the advantages of secured data through the Autonomi network, and only one powerful machine needed for a personal AI accessible anywhere.
And as shown above, that doesn’t need to be as powerful as

Although that looks sweet…
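
As a sketch of what that could look like today (the hostname and model name are made up, and the Autonomi storage side isn’t shown): the headless box runs the model behind a small HTTP API, for example Ollama’s, and the phone or laptop just posts prompts to it.

```python
# "NAAi" sketch: a headless machine on the network runs the model via Ollama,
# and a lighter device sends prompts over HTTP. Hostname/model are illustrative;
# syncing data through Autonomi is not shown here.
import requests

resp = requests.post(
    "http://naai-box.local:11434/api/generate",   # the headless LLM box
    json={"model": "mixtral", "prompt": "Summarise my notes on hardware.", "stream": False},
    timeout=120,
)
print(resp.json()["response"])
```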

3 Likes

So I did some looking around to see if I could answer my own question …

Found this in a Reddit thread:

It’s memory bandwidth bound, not compute bound. Besides, most modern non-bargain basement CPUs do have hardware support for accelerated vector/matrix functions, just not as much as GPUs.

The reason Apple Silicon Macs are attractive for LLMs isn’t just the large shared memory, it’s that their memory bandwidth is a lot higher than a typical high end PC, and on par or better than many PC workstations/servers.

Apple’s LLM performance hasn’t improved much over the first 3 generations of Apple Silicon, but it still seems doubtful that APUs are going to catch up any time soon.

https://www.reddit.com/r/LocalLLaMA/comments/185oa2q/new_apus_close_to_gpu_processing_but_with/

But also found this:

Not going to run a big model, but for little $$$ you can use an AMD APU to get by cheaply with a smaller one.
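
That bandwidth-bound point can be made concrete: each generated token has to stream essentially all of the weights through memory once, so a rough ceiling on speed is bandwidth ÷ model size. The bandwidth figures below are approximate:

```python
# Rough ceiling on generation speed: tokens/s <= memory bandwidth / model size,
# since every token reads (roughly) all the weights once. Figures approximate.
model_gb = 26   # e.g. a 4-bit Mixtral-sized model

for name, bw_gb_s in [("dual-channel DDR5 PC", 90), ("M2 Max unified memory", 400)]:
    print(f"{name}: at most ~{bw_gb_s / model_gb:.0f} tokens/s")
# -> dual-channel DDR5 PC: at most ~3 tokens/s
# -> M2 Max unified memory: at most ~15 tokens/s
```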

4 Likes

Mac Studio looks like a worthy contender too; you might get a bit more bang for your buck than with the MacBook.

I know not everyone will agree, but to me, although it is a decent chunk of change, it is much like hiring a few employees, and then it becomes cheap as chips. And the capability is only going to go parabolic.

2 Likes

Just thinking out loud here with regards to Autonomi and personal AI.

Maybe running on phones and small devices is a way down the line, and perhaps early adopters need somewhat specialized equipment to fully utilize the technology, but there is still an opportunity and a market here.

Not that it answers @happybeing’s questions on:

Another interesting aspect here to me is groups: why not chip in, get the necessary hardware, and run specialized “local” LLMs among each other?

5 Likes

Is the upcoming Qualcomm Snapdragon X Elite likely to provide similar benefits to the Mac Arm devices in terms of running LLMs efficiently?

It’s apparently ‘built for AI’.

If so, it might be a cheaper way of getting bags of RAM that can be used for these models effectively.

Edit: I see it only supports up to 64 GB RAM unfortunately, and its memory bandwidth is less than the M2 Pro or Max.

2 Likes

From your link:

Snapdragon X Elite is capable of running generative AI LLM models over 13B parameters on-device with blazing-fast speeds.

‘over 13B’ is a bit vague… I’d guess that’s about its limit. Perhaps it has wider memory channels than previous generations, as that seems to be the big bottleneck for AI.

2 Likes