Corsair Community

sporadic memtest+ errors with CM72SD1024RLP-3200


gpel


Hi,

I have a problem with some of the newer (manufactured mid-2005) CM72SD1024RLP-3200 modules used in server systems, but the errors are sporadic and only reproducible after a variable run time in memtest (sometimes straight after booting, sometimes after 20 hours).

 

I was able to reproduce errors in one particular module, always at around 512.4 MB or 512.9 MB. Some errors can be corrected by ECC, others cannot. I usually test with ECC off in the BIOS, as this gives more stable results.

Now the weird part: the reproducibility of the error is very erratic. Some runs collect a great number of errors, while other times memtest won't find a single error in 100 passes (with the test range set to 500 MB - 520 MB). Whenever I got an error, it was in the 512 MB range, or somewhere around 1500 MB when the module was plugged in as the second module, so the maths worked out.
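
The address arithmetic behind "the maths worked out" can be sketched as follows. This is only a rough illustration: it assumes 1024 MB modules (as the part number suggests) that the chipset maps back to back, with no interleaving or memory hole, which may not hold on every board. The helper names `locate` and `system_address` are made up for this example.

```python
# Hypothetical helpers, not part of memtest. Assumption: equally sized
# modules are mapped one after another into the system address space.
MODULE_SIZE_MB = 1024  # assumed size of one CM72SD1024RLP-3200 module


def locate(address_mb, module_size_mb=MODULE_SIZE_MB):
    """Map a system address (in MB) to (slot index, offset within module)."""
    slot_index = int(address_mb // module_size_mb)
    offset_mb = address_mb - slot_index * module_size_mb
    return slot_index, offset_mb


def system_address(slot_index, offset_mb, module_size_mb=MODULE_SIZE_MB):
    """Inverse mapping: where an in-module offset shows up in memtest."""
    return slot_index * module_size_mb + offset_mb


if __name__ == "__main__":
    # The same weak cell at ~512.4 MB, seen in the first and second position.
    for slot in (0, 1):
        print(slot, round(system_address(slot, 512.4), 1))
```

Under these assumptions, a fault at 512.4 MB inside the module shows up at about 1536.4 MB when the module sits in the second position, which is consistent with the errors seen "around 1500 MB".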

 

When it failed, it failed in test 4, but also in tests 6 and 7, never in any of the other tests.

 

It's not a thermal problem: other modules do 40 full passes overnight without a problem. It doesn't appear to be a problem with the memory slot on the motherboard either, as the error was reproducible in two different slots with the same module.

 

It sounds like a clear case for RMA, except that I currently can't reproduce the error anymore, having been able to do so this morning and yesterday. What could be up? In one of the threads the memory guy writes that errors can be triggered in random tests. It would be interesting if someone could be more specific on that.

 

RAM Guy: It would be great if you could explain the difference between SMI (System Management Interrupt), NMI (Non-Maskable Interrupt) and SCI (System Control Interrupt) when ECC mode is enabled.

My experience with Tyan boards (S2735 and S5350) is as follows:

* ECC enabled and set to SMI - machine will power off immediately when it finds an error in memtest / memtest+

* ECC enabled and set to NMI - machine shows "unknown interrupt" and halts. memtest stops running; the control menu can still be called, but no further tests can be initiated.

* ECC enabled and set to SCI - machine performs as expected - memtest+ shows and counts both ECC and non-ECC errors and shows whether or not errors could be corrected by ECC.


  • Corsair Employees

Please follow the link in my signature “I think I have a bad part!” and we will be happy to replace them or it!

 

And I am sorry, but you would need to ask the MB maker about that! In addition, ECC should be enabled when running these modules, or you may get random errors.


  • Corsair Employees
Random errors like that can be caused by many things. You purchased modules that are registered ECC, and that is how they are tested: with ECC enabled. In other words, the layout of the MB may be generating these errors, and if you have a failing module you will still get errors in memtest, usually on test 4 at the same address. Or I would try and test them in another system and see if you get the same errors, to be sure!

Thanks for the info about random errors with ECC off.

 

If I get errors with ECC turned on in memtest and memtest reports that one ECC error was detected and corrected - is that acceptable? To me it sounds like a failing module, as ECC errors shouldn't occur unless the module has an issue. It's RAM from a machine that's unstable as a Linux server system (uptime 60 days, then 10 days) and I don't get ECC errors with other modules.

 

Just as info for others: the only "bad" thing about using memtest with ECC turned on is that it doesn't give pointers to where exactly it corrected the error. For the Tyan Tiger with the i7320 chipset it reports corrected errors in the 32 000 MB or 24 000 MB range, whereas with ECC turned off the errors recur at the same, valid address (however, as you said, they may be random errors [but it is still strange that they happen at the same address]).
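
On the running Linux server, corrected and uncorrected ECC counts can also be read from the kernel's EDAC sysfs tree, if an EDAC driver is loaded for the chipset. Whether one exists for this board and kernel is an assumption, not something verified here. A minimal sketch, with `read_edac_counts` being a made-up helper name:

```python
import glob
import os


def read_edac_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": int|None, "ue_count": int|None}}.

    ce_count = corrected errors, ue_count = uncorrected errors.
    Returns an empty dict if no EDAC memory controller is registered.
    """
    counts = {}
    for mc_dir in sorted(glob.glob(os.path.join(base, "mc[0-9]*"))):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                with open(os.path.join(mc_dir, name)) as fh:
                    entry[name] = int(fh.read().strip())
            except (OSError, ValueError):
                entry[name] = None  # counter missing or unreadable
        counts[os.path.basename(mc_dir)] = entry
    return counts


if __name__ == "__main__":
    for controller, entry in read_edac_counts().items():
        print(controller, entry)
```

A corrected-error count that keeps climbing with one module installed, and stays at zero with the others, would point at the module rather than the board.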


  • Corsair Employees
I really need to know what the error is and on what test, to try and determine if it's a memory error or not. So please try it and then let me know! I have no problem replacing the modules, but if you are testing them with ECC turned off you might go through 10 modules before you find one that passes, without any of the modules being bad.

Hi!

 

With ECC enabled it repeatedly fails at test #8, usually in the 0th or 1st pass. In one case it took ~40 passes overnight to fail in test #8. I've read that #8 points to timing issues with the machine. Could it be that the module can't always handle the timings? On the Tyan server boards there is no way to manually set the timings, and shouldn't all RLP modules conform to JEDEC standards?


Archived

This topic is now archived and is closed to further replies.
