gpel Posted November 9, 2005

Hi,

I have a problem with some of the newer (manufactured mid-2005) CM72SD1024RLP-3200 modules used in server systems. The problems are intermittent and only reproducible after a variable run time in memtest (sometimes straight after booting, sometimes after 20 hours).

I was able to reproduce errors in one particular module - always at around 512.4 MB or 512.9 MB. Some errors can be corrected by ECC, others cannot. I usually test with ECC off in the BIOS, as this gives more stable results.

Now the weird part: the error reproduces very erratically. On some runs I collect a great number of errors; at other times it won't find an error in 100 passes (with the test range set to 500 MB - 520 MB). Whenever I got an error, it was in the 512 MB range, or somewhere around 1500 MB when the module was plugged in as the second module - the maths worked out. When it failed, it failed in test 4, but also in tests 6 and 7, never in any of the other tests.

It's not a thermal problem: other modules do 40 full passes overnight without a problem. It doesn't appear to be a problem with the memory slot on the motherboard either, as the error was reproducible in two different slots with the same module.

It sounds like a clear case for RMA - except that I currently can't reproduce the error anymore, although I was able to this morning and yesterday. What could be up? In one of the threads the memory guy writes that errors can be triggered in random tests. It would be interesting if someone could be more specific on that.

RAM Guy: It would be great if you could explain the difference between SMI (System Management Interrupt), NMI (Non-Maskable Interrupt) and SCI (System Control Interrupt) when ECC mode is enabled.
My experience with Tyan boards (S2735 and S5350) is as follows:

* ECC enabled and set to SMI - the machine powers off immediately when it finds an error in memtest / memtest+.
* ECC enabled and set to NMI - the machine shows "unknown interrupt" and halts. memtest stops running; the control menu can still be called up, but no further tests can be started.
* ECC enabled and set to SCI - the machine performs as expected: memtest+ shows and counts both ECC and non-ECC errors and shows whether or not each error could be corrected by ECC.
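As a rough illustration of the "maths worked out" remark in the post above: if the board maps its modules one after another with no interleaving, a fault at a fixed offset inside a module shows up shifted by the capacity of the modules in front of it. The sketch below assumes exactly that linear layout and a module size of 1024 MB (as the part number suggests); neither is confirmed in the thread, and `system_address_mb` is a hypothetical helper, not something memtest provides.

```python
# Hypothetical helper, not part of memtest. It assumes a linear,
# non-interleaved memory map, which the thread does not confirm.
MODULE_SIZE_MB = 1024  # assumed capacity of one CM72SD1024RLP-3200 module


def system_address_mb(offset_in_module_mb, slot_index,
                      module_size_mb=MODULE_SIZE_MB):
    """Map a fault offset inside one module to a system address in MB."""
    if not 0 <= offset_in_module_mb < module_size_mb:
        raise ValueError("offset must lie inside the module")
    # Every module in front of this one shifts the address by its full size.
    return slot_index * module_size_mb + offset_in_module_mb


if __name__ == "__main__":
    for slot in (0, 1):
        address = system_address_mb(512.4, slot)
        print(f"slot {slot}: fault expected near {address:.1f} MB")
```

Under these assumptions, a fault at 512.4 MB in the first slot would land near 1536 MB in the second, which is in line with the "somewhere around 1500 MB" reported above.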
RAM GUY (Corsair Employees) Posted November 9, 2005

Please follow the link in my signature “I think I have a bad part!” and we will be happy to replace the module or modules! I am sorry, but you would need to ask the motherboard maker about the interrupt modes. In addition, ECC should be enabled when running these modules, or you may get random errors.
gpel (Author) Posted November 10, 2005

Thank you for your quick reply. Can you please elaborate on why ECC needs to be enabled when testing these modules? Surely errors shouldn't be present even with ECC checking disabled in the BIOS?
RAM GUY (Corsair Employees) Posted November 10, 2005

Random errors like that can be caused by many things. You purchased registered ECC modules, and they are tested with ECC enabled. In other words, the layout of the motherboard may be generating these errors; if you have a failing module, you will still get errors in memtest, usually on test 4 at the same address. I would also try testing them in another system to see whether you get the same errors, to be sure!
gpel (Author) Posted November 14, 2005

Thanks for the info about random errors with ECC off. If I get errors in memtest with ECC turned on, and memtest reports that one ECC error was detected and corrected - is that acceptable? To me it sounds like a failing module, as ECC errors shouldn't occur unless the module has an issue. It's RAM from a machine that's unstable as a Linux server system (uptime 60 days, then 10 days), and I don't get ECC errors with other modules.

Just as information for others: the only "bad" thing about using memtest with ECC turned on is that it doesn't tell you where exactly it corrected the error. For the Tyan Tiger with the i7320 chipset it reports corrected errors in the 32 000 MB or 24 000 MB range, whereas with ECC turned off the errors recur at the same, valid address (however, as you said, they may be random errors, although it is still strange that they happen at the same address).
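For readers wondering how an error can be "detected and corrected" while others cannot be corrected at all: ECC modules store extra check bits alongside the data (a 72-bit word carrying 64 data bits), typically allowing any single-bit error to be corrected and double-bit errors to be detected. The toy sketch below shows the principle on a 4-bit value with an extended Hamming code. It is only an illustration under that simplification, not the code any particular chipset uses, and the function names are made up for this example.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits into 8 bits: index 0 is overall parity,
    indices 1..7 are Hamming(7,4) positions (parity at 1, 2 and 4)."""
    code = [0] * 8
    data_positions = (3, 5, 6, 7)
    for bit, pos in enumerate(data_positions):
        code[pos] = (nibble >> bit) & 1
    for p in (1, 2, 4):
        # Parity bit p covers every position whose index has bit p set.
        for i in range(1, 8):
            if i & p and i != p:
                code[p] ^= code[i]
    code[0] = sum(code[1:]) % 2  # overall parity enables double-error detection
    return code


def decode(code: List[int]) -> Tuple[Optional[int], str]:
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i  # XOR of set positions is 0 for a valid word
    parity_broken = sum(code) % 2 == 1
    fixed = list(code)
    if syndrome == 0 and not parity_broken:
        status = "ok"
    elif parity_broken:
        # Odd number of flips: treat as a single-bit error at `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with intact overall parity: two bits flipped.
        return None, "uncorrectable"
    data = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return data, status


if __name__ == "__main__":
    word = encode(0b1011)
    one_flip = list(word)
    one_flip[5] ^= 1
    two_flips = list(one_flip)
    two_flips[6] ^= 1
    for label, w in (("clean", word), ("1 flip", one_flip), ("2 flips", two_flips)):
        print(label, decode(w))
```

In this picture, a "corrected" report corresponds to the single-flip case and an uncorrectable one to the double-flip case; a module that produces either with any regularity is still misbehaving, even when the data survives.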
RAM GUY (Corsair Employees) Posted November 14, 2005

I really need to know what the error is and on which test it occurs to try to determine whether it's a memory error or not. So please try it and then let me know! I have no problem replacing the modules, but if you are testing them with ECC turned off, you might go through 10 modules before you find one that passes, without any of the modules actually being bad.
gpel (Author) Posted November 15, 2005

Hi! With ECC enabled it repeatedly fails at test #8 - usually in pass 0 or 1. In one case it took ~40 passes overnight to fail in test #8. I've read that test #8 points to timing issues with the machine. Could it be that the module can't always handle the timings? On the Tyan server boards there is no way to set the timings manually - and shouldn't all RLP modules conform to JEDEC standards?
RAM GUY (Corsair Employees) Posted November 17, 2005

Let's get them both replaced. Please follow the link in my signature “I think I have a bad part!” and we will be happy to replace the module or modules!
This topic is now archived and is closed to further replies.