• 0 Posts
  • 289 Comments
Joined 1 year ago
Cake day: June 22nd, 2023



  • My experience is that AMD’s virtual memory system for VRAM is buggy, and those bugs cause kernel crashes. A few tips:

    1. If running both cards is overstressing your PSU, you might be suffering voltage drops when the GPUs draw maximum power. I was able to run games absolutely fine on my previous PSU, but running diffusion models caused it to collapse. Try just a single card to see if it helps stability.

    2. Make sure your kernel is as recent as possible. There have been a number of fixes in the 6.x series, and I have seen stability go up as a result. Remember: Docker images still use your host OS’s kernel.

    3. If you can, disable the desktop (e.g. systemctl isolate multi-user.target) and run the web GUI over the network from another machine; if you’re running ComfyUI, that means adding --listen to the command line options (rough commands at the end of this comment). It’s normally the desktop environment that triggers the crash, when it tries to access something in VRAM that has been swapped out to normal RAM to make room for your models. Giving the whole GPU to the one task boosts stability massively. It’s not the desktop environment’s fault; the GPU driver should handle the situation.

    4. When you get a crash, it’s often just the GPU that has crashed and not the whole machine (this won’t be true of a power supply issue). SSHing in and shutting down cleanly can save your filesystems the trauma of a hard reboot. If you don’t have another machine, grab an SSH client for your phone, like JuiceSSH on Android (not affiliated, it just works for me).

    5. Using rocm-smi to reset the card after a crash can bring things back, but not always. Obviously you have to do this over the network, as your display is gone.

    6. Be aware of your VRAM usage (amdgpu_top) and try to avoid overcommitting it. It sucks, but if you can avoid swapping VRAM, everything goes better. Low-memory modes in the tools can help; ComfyUI has --lowvram, for example, which more aggressively removes things from VRAM when it has finished using them. It slows generations down a bit, but that’s better than crashing.

    With this I’ve been running SDXL on an 8GB RX 7600 pretty successfully (~1s per iteration). I’ve been thinking about upgrading, but I think I’ll wait for the RX 8000 series now. It’s possible the underlying problem is something in the GPU hardware, since AMD are clearly improving things with software changes but not solving it once and for all. I’m also hopeful that they’ll upgrade the VRAM across the range. The 16GB 7600 XT says to me that they know <16GB isn’t practical anymore, so the high end has to go up too, right?
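    For reference, putting tips 3–5 together, this is roughly my headless launch and post-crash recovery routine. It’s only a sketch: the ComfyUI path, user/hostname, and GPU index are examples from my setup, and the exact rocm-smi reset flag can vary between ROCm versions, so check rocm-smi --help first.

    ```bash
    # On the GPU box: drop the desktop and run ComfyUI headless on the network
    sudo systemctl isolate multi-user.target   # stop the graphical session (systemd hosts)
    cd ~/ComfyUI                               # example install path, adjust to yours
    python main.py --listen --lowvram          # web UI now reachable from other machines

    # From another machine (or a phone SSH client) once the GPU hangs:
    ssh user@gpu-box                           # user/hostname are placeholders
    amdgpu_top                                 # is the GPU still responding at all?
    sudo rocm-smi --gpureset -d 0              # try resetting GPU 0; doesn't always work
    sudo shutdown -r now                       # clean reboot if it doesn't come back
    ```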



  • With batteries like these, which would run on a multi-day cycle, you’re going to be trying to flatten out the demand curve (and the supply curve, but the two are related).

    The US generates about 4.2 PWh a year, which works out to an average consumption rate of roughly 480GW. So in an ideal system we’d only need that much generation capacity; when demand ran higher the batteries would drain, and when it ran lower they’d charge, smoothing it all out.

    I’m going to take your 560GW figure as representative of demand when it’s above the 480GW average, which puts it 80GW over. I’ll say half of every day is 80GW above average (when we’d be draining the batteries) and half is 80GW below (when we’d be charging them). The real curves are much more nuanced, but we’re just establishing scale. 80GW for 12 hours is 960GWh, so let’s call it 1TWh of battery capacity needed to smooth out a day for the whole USA.

    That’s 117 of these installations, a number I frankly find amazingly low.
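    Spelling the arithmetic out (same rough numbers as above, treating a year as 8,760 hours):

    $$
    \frac{4.2\ \text{PWh}}{8760\ \text{h}} \approx 0.48\ \text{TW} = 480\ \text{GW},
    \qquad 80\ \text{GW} \times 12\ \text{h} = 960\ \text{GWh} \approx 1\ \text{TWh}
    $$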