Corsair Community

64 GB Vengeance instability mystery


The Ghost


Greetings!

 

I have had a problem for several years that I can't figure out. I can't even deduce which component is faulty.

 

I have three absolutely identical computers, two of them bought at the same time in the same place:

 

i7-3820
GA-X79-UP4 (updated to the latest BIOS)
64 GB Corsair Vengeance CMZ64GX3M8A1600C9

 

They all have the same symptoms: segfaults and hangs under load (and sometimes even without load). I've tried this without any overclocking or XMP (at 1333 MHz).

 

Installing only half the RAM (32 GB) in 4-channel mode seems to help, but in 2-channel mode it does not. (!!! and not the other way around!)

 

 

Recently, one of those computers died, so now I have two absolutely identical computers bought at the same time in the same place, and one that's a bit different:

i7-4930K
GA-X79-UP5S-WIFI (updated to the latest BIOS)
and the same old 64 GB Corsair Vengeance CMZ64GX3M8A1600C9.

 

On this one, those symptoms have stopped.

 

"OK, so it's definitely not RAM", I thought, and tried the new CPU in one of the old motherboards. Nope, did not help. So then it's a faulty motherboard model? I've tried the old CPU AND RAM in the new motherboard - didn't help.

 

Then I pulled one stick out, and it helped. So, with 56 GB it works (in the UP5S, but not in the UP4!), but not with 64 GB.

 

Now you would think it's the motherboard, but:

- I've tested four of them, one of which was a different model on a different chipset (C606)

- That one, equipped with any CPU, works with one 64 GB kit, but does not work with another one.

 

I've tried raising the IMC voltage, disabling XMP... it makes no difference. The only thing I have not tried is manually changing the timings, but I don't think that would help if running the whole kit at 1333 MHz did not.

 

Any ideas what this could be and what I can try?

 

I thought about buying an Asus motherboard, but those are kind of expensive, and now I basically have proof that it is NOT the motherboard since the problem reproduces even on another chipset...

 

(Before you ask, yes, there are a lot of uses for this much RAM. Two identical computers are our "new" ERP servers when I figure out this problem, and the third one is my workstation - and I like to be able to run several virtual machines with their HDD images stored in RAM...)


  • 2 weeks later...
  • Corsair Employees

Hello The Ghost,

 

Sorry to hear that you've been having this issue. It seems like a compatibility issue, considering that you have three separate systems that all have the same issue, especially if each kit is working correctly in a separate setup.

 

Are you positive that each system has its own kit and that some modules aren't part of one of the three kits intermixed? That could cause stability issues.


Indeed it seems like a compatibility issue, however:

- With 32 Gb in 4 channels, everything works!

- The memory controller is in the CPU and not the chipset; surely i7-3820 can not be incompatible with this memory?..

- I've read that the Gigabyte "memory support list" lists only one 64 GB RAM kit model. However, this does not necessarily mean that everything else is unsupported...

 

I've done more tests with different combinations of CPUs, motherboards with different chipsets, and RAM kits. CPUs or motherboards make no difference, however RAM kits do! Which has led me to believe that maybe I've mixed up the modules from two kits. How do I tell which module belongs to which kit?


The first batch of tests has revealed that:

- One of the two identical X79-based motherboards behaves the same (i.e. bad), whether it has an i7-3820 or an i7-4930K;

- My C606-based motherboard works well with its own memory kit, but has the same issues with one of the other kits (regardless of the CPU installed);

- The problem goes away if I remove one stick from the 4th slot, thus disabling one of the channels completely, but it is not completely gone if I only remove the stick from the 8th slot, thus disabling half a channel.

 

Now I've prepared for the second batch of tests. I guess the first thing I need is to make sure the kits are not mixed together. I'm relatively sure they are not, especially considering that in the old days, I had a similar problem with the third motherboard with a definitely correct kit. For some reason, replacing that motherboard helped with that kit, but not with other kits... so maybe "bad motherboard model" is not the only problem that I have. I could replace the motherboards and swear profusely at Gigabyte if I could make sure it's their fault, but for now I don't even know who is to blame for all this. %)


The tests seem to show that the failure rate depends most on the ordering of the sticks. In some configurations it fails sooner, and in others it takes longer to fail. This raises two questions:

 

- What is the best way to test? I'm testing with an 8-thread kernel build, which causes enough load to eventually fail, but does not test all memory.

- What should I look for in stick ordering? If I have several consecutive-numbered or almost consecutive-numbered sticks, should they go in one channel, or should they be different channels?
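
On the first question, a minimal sketch of a userspace fill-and-verify pass might look like the following. It only shows the idea of writing a known pattern over a large amount of memory and reading it back. The function names `allocate_blocks` and `verify_blocks` are made up for illustration, and since the operating system decides which physical addresses back each block, this cannot guarantee that every stick is touched and is not a replacement for a dedicated memory tester.

```python
def allocate_blocks(block_size, block_count):
    """Allocate block_count buffers, each filled with a byte derived from its index."""
    # Hypothetical helper: block i is filled with the byte value i % 256.
    return [bytearray([i % 256]) * block_size for i in range(block_count)]


def verify_blocks(blocks):
    """Return the indices of blocks whose contents no longer match their pattern."""
    bad = []
    for i, block in enumerate(blocks):
        # Every byte in block i should still equal i % 256.
        if block.count(i % 256) != len(block):
            bad.append(i)
    return bad


if __name__ == "__main__":
    # Small sizes for demonstration; a real run would use much larger blocks.
    blocks = allocate_blocks(block_size=1024 * 1024, block_count=16)
    print("blocks that failed verification:", verify_blocks(blocks))
```

Raising `block_size` and `block_count` until most of the free RAM is in use would cover more memory than a kernel build does, although still without control over which module is exercised.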

 

Here are the numbers.

The second number on both kits is 130502368 , ver3.24, CMZ64GX3M8A1600C9.

 

Kit 1:

236346

236506

236507

236508

236351

236352

236353

236354

 

Kit 2:

236225

236226

236227

236262

236288

236289

236291

236305
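
To make the gaps in these numbers easier to see, here is a small sketch that groups the serials listed above into runs of consecutive values. The function name `consecutive_runs` is made up for illustration, and whether consecutive numbers actually mean the sticks were packed together is an assumption, not something the numbers prove.

```python
def consecutive_runs(serials):
    """Group serial numbers into sorted runs of consecutive values."""
    runs = []
    for serial in sorted(serials):
        if runs and serial == runs[-1][-1] + 1:
            runs[-1].append(serial)   # extends the current run
        else:
            runs.append([serial])     # starts a new run
    return runs


# Serial numbers as posted above.
KIT_1 = [236346, 236506, 236507, 236508, 236351, 236352, 236353, 236354]
KIT_2 = [236225, 236226, 236227, 236262, 236288, 236289, 236291, 236305]

if __name__ == "__main__":
    for name, kit in (("Kit 1", KIT_1), ("Kit 2", KIT_2)):
        print(name, consecutive_runs(kit))
```

For the numbers posted here, kit 1 falls into three runs and kit 2 into five.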

 

Also, it does not matter if they are running at 1333 MHz or 1600 MHz. By now, I wouldn't mind if they were running at a slower speed, if I could get them working at all. Are there any options I could try lowering to make them more stable?..

 

EDIT: The last stick number was incorrect, corrected

Edited by The Ghost

OK, so finally I've tried one of the two motherboards with a "good" RAM kit placed sequentially, and it ran perfectly fine. I even tested it with OCCT with a 90% RAM Linpack test, and it ran well for long enough. So now I know it's not a problem with the motherboard or the memory controller - it's a problem with those two kits of RAM.

 

I've tried ordering them differently with varying results, but I couldn't find a working combination. And if you look at those numbers... I'm pretty sure that's how they always were - meaning that the kits are not mixed.


  • Corsair Employees

Hello The Ghost,

 

From everything you've tested, it does seem like it's more than likely the memory. I find it very odd that it's only a specific channel that has the issues (4 and 8). Is that replicated regardless of how you configure the 8 modules of each kit? The order of the modules shouldn't matter, just that they are part of the same kit.

 

Have you tested the possible bad kits on the known working system?


I did not test other channels - I just meant that with less memory, it's working better. I'm sure it will be the same on other channels, but it did not seem worth trying.

 

I believe my system is "known working" since it works fine with another memory kit. Also, I've tried this kit on my third system which is now different, and the problem was there, so yes, now we can be certain that it's a memory problem.

 

It seems like turning "interleaving" on makes it crash much faster (apparently, since it uses all modules).


So what do we have now:

- It's definitely a memory problem.

- It's better to use interleaving since that tests all modules even with minimal RAM usage (the kernel build uses only a few gigabytes of memory). Is that true or am I mistaken?

- I never had another 64Gb kit of memory, so the sticks are all there. They might have been mixed up. It seems unlikely, but there is no other explanation - no way Corsair would ship two bad 64Gb kits. What can I do about it?

- Is there anything to tweak - frequency or timings - to increase stability in this case?
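
To illustrate the interleaving question, here is a toy model only. It assumes that with interleaving enabled, consecutive 64-byte cache lines rotate across channels, and that without it each channel covers one contiguous address range. Real Intel memory controllers use a more complex mapping that is not described in this thread, and the name `channels_touched` is made up for illustration.

```python
CACHE_LINE = 64  # bytes; assumed interleave granularity for this toy model


def channels_touched(start, length, n_channels, interleaved, channel_size):
    """Return the set of channel indices a contiguous buffer would touch."""
    touched = set()
    for addr in range(start, start + length, CACHE_LINE):
        if interleaved:
            # Toy model: cache lines rotate round-robin across channels.
            touched.add((addr // CACHE_LINE) % n_channels)
        else:
            # Toy model: each channel owns one contiguous address range.
            touched.add(addr // channel_size)
    return touched


if __name__ == "__main__":
    sixteen_gib = 16 * 2**30
    print(channels_touched(0, 4096, 4, True, sixteen_gib))
    print(channels_touched(0, 4096, 4, False, sixteen_gib))
```

Under these assumptions, even a small buffer is spread over all four channels when interleaving is on and stays within one channel when it is off, which would be consistent with crashes appearing faster with interleaving enabled.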


I have confirmed that it runs fine with 6 sticks, seemingly in any configuration, but fails if I add a seventh stick, and fails very quickly if I add an eighth stick.

 

Should I maybe try raising the voltage on RAM or on the IMC?.. What exactly is the problem with having many sticks?..


It seems to work better when I actually mixed the kits - four sticks from one and four sticks from the other!

 

In order to hunt for the best stick combinations, maybe I could try overclocking the RAM to the point where it barely works with four sticks, and then see which fifth stick it will accept?..


  • Corsair Employees

Hello The Ghost,

 

I'd suggest narrowing it down by trying the same testing method with 6 modules installed and adding more. Does it only happen with certain modules or does that happen with any of the modules when you add a 7th and/or 8th stick?

 

If it's only happening, regardless of configuration of modules, when adding a 7th and/or 8th module, adding more DRAM voltage and voltage for the IMC should help.

 

If one kit is still failing on the known working system, it may be faulty though.


It seems to happen regardless of which channel I am removing or which sticks I'm adding.

 

I have never dabbled in any overclocking other than setting higher Turbo Boost frequencies and raising CPU voltage within reasonable limits. With DRAM, should I only raise the general "DRAM voltage" (which is 1.5 V) or all the others too (which are around 0.75 V)? And what are the reasonable limits for IMC voltage?

 

Also, if raising the voltage is supposed to help, then wouldn't lowering the clock speed also help?..


I have accidentally discovered a combination of sticks and ordering that works with all 64 Gb! Of course, that also produced another set, which is utterly unusable even with two sticks missing.

 

Then I removed one stick and began trying the other ones. This way, I've discovered some sticks that work, some sticks that don't work, and some sticks that work separately but fail if installed together with specific other sticks.

 

225 - fails

508 - fails

226 and 227 - each stick works with other "good" ones, but they fail when installed together in one channel

 

That's already good accidental progress. I guess I can live with a little less memory in one of the servers, though it would be nice to find a working combination for them as well.

 

The result actually depends mostly on the order of the sticks. For example, 226 and 227 don't work when installed in one channel, but they seem to work in different channels.
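
The observations so far can be written down as a small placement check, so that a candidate ordering can be screened before a long test run. This is only a sketch: the slot-to-channel mapping in `channel_of` (slots n and n+4 sharing a channel) is an assumption that may not match this board, and all names are made up for illustration. The sticks are labelled by the last three digits of their numbers, as above.

```python
KNOWN_BAD = {225, 508}              # sticks that failed in the tests above
BAD_PAIRS = [frozenset({226, 227})]  # sticks that failed when sharing a channel


def channel_of(slot):
    """Assumed mapping: slots 1-8, with slots n and n+4 on the same channel."""
    return (slot - 1) % 4


def placement_problems(placement):
    """placement maps slot number -> stick label; returns a list of warnings."""
    problems = []
    by_channel = {}
    for slot, stick in sorted(placement.items()):
        if stick in KNOWN_BAD:
            problems.append(f"stick {stick} in slot {slot} failed earlier")
        by_channel.setdefault(channel_of(slot), set()).add(stick)
    for channel, sticks in sorted(by_channel.items()):
        for pair in BAD_PAIRS:
            if pair <= sticks:
                names = " and ".join(str(s) for s in sorted(pair))
                problems.append(f"sticks {names} share channel {channel}")
    return problems


if __name__ == "__main__":
    print(placement_problems({1: 226, 5: 227, 2: 288}))
    print(placement_problems({1: 226, 2: 227, 3: 288}))
```

An empty result does not mean the placement is stable, only that it does not repeat a combination already seen to fail.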

 

I still don't have a way to reliably test all memory. Memtest does not cause enough load and takes almost a day for one run; the kernel build only uses about 2 GB, and there may be errors in upper areas; and I have no idea how I can check whether interleaving is actually enabled - the UEFI setup does not warn or explain how it is supposed to interleave 7 sticks (it's "Enabled", and that's all).


OK, I've managed to get one completely working kit and one kit that works without one bad stick. I guess I can live with that.

 

If somebody knows, could you tell me how channel/rank interleaving will work in this case? Will it not work at all, or will it work partially? Gigabyte UEFI is not very verbose about this topic, and apparently there are no tools to check that.


  • 1 year later...

Turns out those systems were not stable. Instead, one of them was stable with 7 sticks, and the other one was not even with 6 sticks. This was certainly not enough, so I decided to try again. (And I did not write down the working 7-stick combination, being dead set on getting better results this time... %) )

 

I've found sequential warranty stickers on each module (put there by the shop), so now I can be sure the kits are not mixed, unless they deliberately mixed the modules before putting the stickers there, which is unlikely. And their numbering makes sense when compared to the serial numbers on the modules.

 

Then I've found this: https://forum.corsair.com/forums/showthread.php?p=643727

Apparently, this specific mainboard has a grudge against this specific kit model. The "known good" kit I've tested it with was CL10, if I'm not mistaken.

 

Also, the "known good" system that was showing better stability was with another CPU and another mainboard. I've tried an i7-4820 in this mainboard, hoping for a better IMC, but it made no difference at all. So maybe it really is the mainboard...

 

I've tested the sticks individually, and found at least one stick in each kit that does not work even alone. But that does not help, because even without one stick, the kits still do not work. Also, it might be that this happens only on this mainboard model. And I don't think Corsair has a lifetime warranty on the RAM, do they?..

 

Since it worked with a CL10 kit, maybe it would be possible to relax the timings? I've tried changing the timings to 10-10-10-30, but it made no difference, even at 1333 MHz without XMP. Since the Gigabyte BIOS seems to be unable to set voltages correctly, I guess it also can not set all the other timings correctly - there are many more timings than four. Could this be a possible way?..

 

Raising voltage on DRAM or IMC also crashes almost immediately - apparently, because the BIOS can not set all the other voltages correctly. I remember there being some complex rules about setting voltages on DRAM and IMC , so I think I just can't do it properly.

 

My backup option is to give up on those two and buy a new system with 64 GB or even 128 GB of DDR4. But DDR4 has this same problem with density too, doesn't it?..

