Corsair Community

currently fault-finding but need advice


b-w-d.net


I have a pair of CMV4GX3M1A1333C9 in an MSI 760GM-E51. Unfortunately the system has recently become unstable, and I am attempting to isolate the reason for this. At present I am keeping a log of my tests on the MSI forum, but have recently found myself in an argument with a member there over Memtest and my RAM. Could you please clarify the situation regarding instability and memory?

 

The claim being put forth is that my RAM can't possibly be the reason my machine sometimes locks up, because if there were any fault involving the RAM I would experience constant BSODs. This does not match my own experience of instability with other systems, and seems an unreasonable generalisation.

 

I have run Memtest and found errors, which would seem to suggest a problem involving memory. Surely it makes no sense to attempt a fault-finding exercise on other components while using a piece of hardware (the RAM) which is returning incorrect values?

 

In particular the contentious post claims the following:

 

You really do not know how memory works do you?

The memory that is being used in desktops is unbuffered and non-ECC; as such, it takes 1 unstable cell to cause a BSOD instantly.

Of course it depends a bit on the instruction, but be sure you will see BSODs every time you hit that cell.

 

It's not a massive over-generalisation, as these are not servers that have ECC (Error-Correction-Code), buffering or registering of memory.

 

Bad memory in desktops is a hit -> wrong data -> BSOD thing

Server memory is different: hit -> ECC-check -> bad cell -> rewrite due to ECC corrective code -> placed in new cell -> deliver correct data.

Your system will NOT run for hours without a BSOD if your memory is bad.

 

Memory is NOT your problem, 100% sure of that.

 

Surely this is exaggerated? Minor instabilities don't cause constant BSODs, do they? Why should a BSOD occur just because a single address returns the wrong value?

 

If I am incorrect in my assertions, then I really need to start looking elsewhere to find the source of my problems. My counter-argument is as follows, and I would appreciate your opinions:

 

This makes no sense. I have already explained that Memtest failed; by definition this means that an address in memory returned an incorrect value. According to your reasoning I should already be experiencing BSODs constantly, which is clearly not what is happening. Having said that, memory returning the wrong value is clearly incorrect behaviour, and even if it were not causing the lock-ups, it is evidence of the system being unstable. I'm not sure why you are arguing against this so forcefully.

...

There is no reason why a BSOD should occur when a single address comes back with the wrong value. That is utter nonsense, and it is also the reason you don't see BSODs whenever a buffer overflow occurs. Corrupting an area of memory by screwing up a pointer in C has basically the same effect as a single error from unstable memory, and that is highly unlikely to cause a BSOD. If you start overwriting data in kernel mode there is plenty of reason to believe the OS will crash, but in user mode? It is unlikely to have a major effect at all. It is only because failing RAM tends to error at such a high rate that these kinds of crashes occur (equivalent to eventually hitting something important and the usual error handling failing). We are not talking about chips which consistently return bad data; we are talking about instability, and the rate of that instability has a massive impact on its outcome.
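To illustrate the user-mode point, here is a minimal Python sketch that flips one randomly chosen bit in an ordinary data buffer. The 4 KiB buffer, the fixed seed and the helper name `flip_random_bit` are all assumptions made for the illustration; it only models a single bad read of application data, not a real hardware fault.

```python
import random


def flip_random_bit(buf: bytearray, rng: random.Random) -> int:
    """Flip one randomly chosen bit in buf and return its bit offset."""
    bit = rng.randrange(len(buf) * 8)
    buf[bit // 8] ^= 1 << (bit % 8)
    return bit


rng = random.Random(0)                 # fixed seed so the run is repeatable
original = bytes(range(256)) * 16      # 4 KiB standing in for application data
damaged = bytearray(original)
offset = flip_random_bit(damaged, rng)

# Count how many bits now differ between the two copies.
differing = sum(bin(a ^ b).count("1") for a, b in zip(original, damaged))
print(f"flipped bit offset {offset}; differing bits: {differing}; still running")
```

The buffer ends up silently wrong by one bit and nothing crashes. Whether a real flip matters depends on what happens to live at that address, which is the argument being made above.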


To clarify, I am fully aware that the Memtest failure may be caused by the CPU or motherboard, but I would like to confirm whether I am right in thinking that this memory issue is of greatest importance before I start testing anything else.

On one hand, a bad bit somewhere can blow it all up to hell. But on the other hand, if that bit is in a part of memory that's not being used by certain parts of the OS, then the program using that memory may error correct around it, or just give you corrupt data or something.

 

Since I'm not sure how thorough you were with memtest, it's advised to test one stick at a time in each slot using Memtest86+.


That is exactly what I thought, though the "error correct around it" may very well be an app acting buggy or throwing exceptions: not quite a BSOD, and not necessarily showing any visible problem either. Thank you for clarifying; for a moment I thought I was going crazy.

 

It was an awkward mixed-memory setup. I have since tested the two sets individually; one set (the Corsair) failed Memtest86+, so I am re-testing individual sticks as we speak. I realise this makes me look like a moron; I'm not sure what I was thinking at the time.


The full setup errored after approximately 14 hours. I have since tested all of the sticks individually and found a single Corsair stick which errored within ~2.5hrs. I'm leaving Memtest running to see whether the failing address is consistent, in case this helps.

 

It is at stock settings, with everything set to Auto and running at 1333. I will take a photo of the screen once I have a couple more errors and transcribe the details here. Someone has suggested I try locking it to 1066. What do you think?


Ok, small problem. I have two photos of memtest finding an error in this stick, but on both occasions memtest only failed the test once out of countless passes. From the first attempt, the stick errored with the following:

 

test 6, pass 2, failing address 000d1eba184 - 3358.7MB, good ffdfffff, bad: ffffffff, err-bits 00200000, count 1

 

and when I stopped the test 8 passes had completed in less than 9hrs, with only that one single error. I tried limiting the memory range around this address and restricting it to test 6 but could not get an error out of it.

 

I then decided to restart, change the IGP memory (taking 32MB from above/below), and go back to the full test suite, leaving it on much longer to see if I could get a consistent hit at that address. From the second attempt, the stick errored with the following:

 

test 6, pass 14, failing address 000d1eba184 - 3358.7MB, good ffdfffff, bad: ffffffff, err-bits 00200000, count 1

 

and when I stopped the test 36 passes had completed in just over 40hrs, with only that one single error. :bigeyes:

 

So the address, test, and error bits are consistent, but it is otherwise quite hard to repeat. Seems very odd indeed. It is as if something (motherboard/CPU/RAM) is teetering on the edge of instability and some external factor, a tiny fluctuation of some kind, pushes it over the edge in this one rather over-specific way. :confused: Doesn't it seem strange that the address is identical if the IGP has actually reserved memory from a different location? I don't really understand how the memory is mapped across the DIMMs/channels and how the IGP shared memory is reserved.
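For anyone reading the error lines above, the err-bits field is simply the XOR of the good and bad values, so it shows which bit came back wrong. A small Python sketch using the values transcribed above (the helper name `flipped_bits` is made up for this example):

```python
def flipped_bits(good: int, bad: int) -> list[int]:
    """Return the positions (0 = least significant) of the bits that differ."""
    diff = good ^ bad
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]


good, bad = 0xFFDFFFFF, 0xFFFFFFFF      # values from the two failures above
err_bits = good ^ bad
positions = flipped_bits(good, bad)
print(f"err-bits {err_bits:08x}, differing bit positions {positions}")
```

With these values the XOR is 00200000, i.e. a single bit (bit 21) that was expected to read 0 and came back as 1. The same single bit at the same address on both runs is what makes the failure look repeatable despite being rare.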


Tried 1066, still failing memtest, this time the address, test, and error bits changed:

 

test 8, pass 7, failing address 00092397304 - 2339.5MB, good 55cd05fb, bad: 55c505fb, err-bits 00080000, count 1

 

Total time running was 16:50, with only that 1 error.

 

I have already checked for power and the usual CPU issues with Prime95 (done properly, with separate instances for each core, each with its own affinity set). These tests ran fine for >48hrs with no errors. I will swap the other Corsair stick back in and double-check that it is fine, in which case the fault is probably not CPU (i.e. memory controller) or motherboard related, and is just a single stick behaving badly.

 

Comments welcome, even if just to say I am going along the right lines.


Will start with a run-down of my BIOS settings followed by specifics from CPU-Z. Most of this is set to the BIOS default, though I have tried changing the UMA Location, and the FSB/DRAM ratio.

 

CPU Feature

SVM Support: Enabled

C1E Support: Disabled

Chipset Feature

HPET: Enabled

On Chip VGA: UMA

VGA Share Memory: 32M

UMA Location: Below

Green Power

CPU Phase Control: Auto

Cell Menu

Cool'n'Quiet: Auto

Adjust CPU FSB Frequency (MHz): 200

Adjust CPU Ratio: Auto

Adjust CPU-NB Ratio: Auto

Unlock CPU Core: Disabled

Advanced Clock Calibration: Disabled

CPU Core Control: Auto

Auto OverClock Technology: Disabled

MultiStep OC Booster: Disabled

FSB/DRAM Ratio: Auto (although I did change this to test at 1066)

HT Link Speed: Auto

Adjust PCI-E Frequency (MHz): 100

Auto Disable DRAM/PCI Frequency: Enabled

CPU VDD, CPU-NB VDD, CPU Voltage, CPU-NB Voltage, DRAM Voltage, NB Voltage, HT Link Voltage and SB Voltage: Auto

Advanced DRAM Configuration

DRAM Timing Mode: Auto

DRAM Drive Strength: Auto

DRAM Advance Control: Auto

1T/2T Unganged Mode: Auto

DCT Unganged Mode: Enabled

Bank Interleaving: Auto

Power Down Enable: Disabled

MemClk Tristate C3/ATLVID: Disabled

 

The following was collected from CPU-Z.

CPU: Phenom II X4 945

Core voltage 1.256V

Core speed 3000.0MHz

Multiplier 15.0

Bus Speed 200.0MHz

Memory at 1333: 1 stick of CMV4GX3M1A1333C9

NB Frequency 2000.1MHz

DRAM Frequency 666.7MHz

FSB:DRAM 3:10

CL 9.0

tRCD 9

tRP 9

tRAS 24

tRC 34

CR 1T

Memory at 1066: 1 stick of CMV4GX3M1A1333C9

NB Frequency 2000.1MHz

DRAM Frequency 533.4MHz

FSB:DRAM 3:8

CL 8.0

tRCD 8

tRP 8

tRAS 20

tRC 27

CR 1T
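As a cross-check on the CPU-Z figures, the DRAM clock follows from the bus speed and the FSB:DRAM ratio, and a timing given in clock cycles can be converted to nanoseconds. A rough Python sketch, assuming a nominal 200 MHz bus (CPU-Z shows it fractionally higher) and using made-up helper names:

```python
def dram_clock_mhz(bus_mhz: float, fsb: int, dram: int) -> float:
    """DRAM clock implied by the bus speed and an FSB:DRAM ratio such as 3:10."""
    return bus_mhz * dram / fsb


def cycles_to_ns(cycles: int, clock_mhz: float) -> float:
    """Convert a timing expressed in DRAM clock cycles to nanoseconds."""
    return cycles * 1000.0 / clock_mhz


# (label, FSB:DRAM ratio, CAS latency in cycles) taken from the CPU-Z readings
for label, (fsb, dram), cl in (("1333", (3, 10), 9), ("1066", (3, 8), 8)):
    clock = dram_clock_mhz(200.0, fsb, dram)
    print(f"DDR3-{label}: clock {clock:.1f} MHz, data rate {2 * clock:.0f} MT/s, "
          f"CL{cl} = {cycles_to_ns(cl, clock):.1f} ns")
```

This gives roughly 666.7 MHz and 533.3 MHz, matching CPU-Z within rounding. It also shows that on Auto the 1066 setting used CL8, about 15 ns in absolute terms versus about 13.5 ns for CL9 at 1333, so the slower run relaxed both the clock and the absolute CAS latency.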

 

P.S. I've removed my massively out-of-date forum signature.


I would set the Command Rate to 2T and set the memory voltage to 1.7 volts and see if that will solve the problem.

 

Tried this, left memtest and just checked the results. Wall time 63hr, passes 56, errors 3

 

test 8, pass 36, failing address 0010a2bc3a4 - 4258.7MB, good 17413fc1, bad: 17433fc1, err-bits 00020000, count 1
test 8, pass 50, failing address 000a48bcf0c - 2632.7MB, good b0ffeefd, bad: b0f7eefd, err-bits 00080000, count 2
test ?, pass 56, failing address 000d1eba184 - 3358.7MB, good ffdfffff, bad: ffffffff, err-bits 00200000, count 3

 

I really don't like the way it's reporting an error beyond the 4GB installed in the system. Updated with another error which occurred while I was writing my post, just before I ended the test. Unfortunately my photo cut off the test number, but I'm sure you recognise the pattern.
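On the address above 4GB: the failing address is a physical address, and the MB figure next to it is just that address divided by 2^20. One possible explanation, offered as an assumption rather than anything confirmed in this thread, is that boards with 4GB installed commonly remap part of the RAM above the 4 GiB boundary to leave room for memory-mapped devices, so an address in that range need not be bogus. The sketch below only does the conversion and the comparison; `describe` is a made-up helper name.

```python
GIB = 1 << 30
MIB = 1 << 20


def describe(address_hex: str, installed_gib: int = 4) -> str:
    """Convert a hex physical address to MB and flag it if past installed RAM."""
    addr = int(address_hex, 16)
    above = addr >= installed_gib * GIB
    return f"{address_hex}: {addr / MIB:.1f} MB, above {installed_gib} GiB: {above}"


# The three failing addresses transcribed above
for address in ("0010a2bc3a4", "000a48bcf0c", "000d1eba184"):
    print(describe(address))
```

Only the first address lands above the 4 GiB mark; the other two are well inside it. The converted values agree with the MB figures Memtest printed, so the transcription itself looks consistent.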


  • Corsair Employees
These errors would suggest it may not be the memory itself. You should set the Command Rate manually to 2T, and if you have a setting for the memory controller voltage you might try increasing that by 0.2 volts.

These errors would suggest it may not be the memory itself. You should set the Command Rate manually to 2T, and if you have a setting for the memory controller voltage you might try increasing that by 0.2 volts.

 

My previous post shows the results of setting 2T and 1.7V. Sorry for the confusion.


Embarrassing confession: I was using Memtest86+ v4.00, and have noticed the v4.20 changelog included bugfixes.

 

Re-ran the test with v4.20, wall time >60hrs with 2T and 1.7V: no errors. To check whether the "errors" seen before are just from issues with the older memtest, I've set the memory back to 1T and 1.5V and will report back with the results.


Set back to 1T 1.5V, failed memtest86+ v4.20 at a wall time of 33hrs with:

test 6, pass 2, failing address 00083d74f14 - 2109.4MB, good ffbfffff, bad: ffffffff, err-bits 00400000, count 1

Changed back to 2T 1.7V, and this time even that failed, wall time of 5hrs with:

test 6, pass 1, failing address 00083d74f14 - 2109.4MB, good ffbfffff, bad: ffffffff, err-bits 00400000, count 1

Which is unfortunate.. so.. tried again by simply restarting the machine without changing anything. Failed again at 2T 1.7V:

test 6, pass 14, failing address 00083d74f14 - 2109.4MB, good ffbfffff, bad: ffffffff, err-bits 00400000, count 1

 

So, massively consistent this time. What next? I can't test the RAM in any other machines right now, as I'm waiting for a new graphics card (my other rig doesn't have onboard graphics).
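To compare runs without eyeballing photos, the error lines can be parsed and grouped by address. A minimal Python sketch, assuming the line layout used in the transcriptions in this thread (it is not an official Memtest86+ log format, and the regex is written for these lines only):

```python
import re
from collections import Counter

# Pattern for the error lines as transcribed in this thread.
LINE = re.compile(
    r"test (?P<test>[^,]+), pass (?P<pass_no>\d+), failing address (?P<addr>[0-9a-f]+)"
    r" - (?P<mb>[\d.]+)MB, good (?P<good>[0-9a-f]{8}), bad: (?P<bad>[0-9a-f]{8}),"
    r" err-bits (?P<err>[0-9a-f]{8}), count (?P<count>\d+)"
)

results = """\
test 6, pass 2, failing address 00083d74f14 - 2109.4MB, good ffbfffff, bad: ffffffff, err-bits 00400000, count 1
test 6, pass 1, failing address 00083d74f14 - 2109.4MB, good ffbfffff, bad: ffffffff, err-bits 00400000, count 1
test 6, pass 14, failing address 00083d74f14 - 2109.4MB, good ffbfffff, bad: ffffffff, err-bits 00400000, count 1
"""

hits = Counter()
for line in results.splitlines():
    m = LINE.match(line)
    if m is None:
        continue  # skip anything that is not an error line
    # Sanity-check the transcription: err-bits must equal good XOR bad.
    if int(m["good"], 16) ^ int(m["bad"], 16) != int(m["err"], 16):
        raise ValueError(f"err-bits do not match good/bad in: {line}")
    hits[(m["addr"], m["test"], m["err"])] += 1

for (addr, test, err), n in hits.most_common():
    print(f"address {addr}, test {test}, err-bits {err}: seen {n} time(s)")
```

All three runs collapse into a single (address, test, err-bits) group, which is what "massively consistent" looks like in numbers. Earlier runs could be pasted into `results` the same way to see whether any address repeats across settings.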


  • Corsair Employees
Do you have the latest MB BIOS installed? If not, I would try that and then test the modules one at a time to see if you can isolate it to one of the modules. If you cannot find a stable setting then I would suggest we try replacing them. You can use the link on the left to request an RMA!

Had the latest BIOS, and had isolated the individual module causing the errors. I have subsequently tested the same stick in another machine and received errors at exactly the same address, so I intend to RMA.

 

Thanks very much for the moral support & guidance :)


Archived

This topic is now archived and is closed to further replies.
