Hello, my English may not be the best, so I hope this is clear. The situation is as follows: I have inherited a custom embedded device which uses a Renesas H8S MCU to store data on an SD card. It does this directly, with no filesystem, implementing SPI at a low level in firmware written in assembler. The original (MMC, pre-SPI) version worked for many years without a problem on 2GB SD cards.
A while back we started installing 8GB Kingston cards into the devices, as our traditional 2GB card choice was becoming scarce, so we had to have some modifications made to the firmware read/write routines to support the "new" SPI modes and timings. The developer got this working and thought it was all fine.
Now we have found that some of our data is being intermittently overwritten with blocks of 0x55, i.e. like 55555555555555555… (this is 85 decimal, or interestingly 01010101 in binary).
During normal operation, our firmware buffers some incoming data in RAM and NVRAM until it reaches a full 512-byte block, then it writes that single block to the SD card, keeping track of the next address. It keeps writing to subsequent 512-byte blocks, never backtracking, as it's appending to a log.
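For reference, here is a sketch (in C, not our original H8S assembler) of how each sequential single-block write is framed in SPI mode; the names `crc7` and `sd_build_cmd` are mine, not from our firmware. Every SPI command is 6 bytes: 0x40 | index, a 32-bit big-endian argument, and a CRC7 with the end bit. After CMD24's R1 response the host sends the 0xFE data token, 512 bytes and a 16-bit CRC, then must keep clocking until the card releases DO from its low "busy" state before deasserting CS or sending the next command; cutting that busy wait short is a classic cause of corrupted raw writes.

```c
#include <stdint.h>

/* CRC7 over n bytes, polynomial x^7 + x^3 + 1 (0x89), MSB first.
 * In SPI mode the card only checks the CRC on CMD0/CMD8 by default,
 * but sending a correct one for every command never hurts. */
static uint8_t crc7(const uint8_t *p, int n)
{
    uint8_t crc = 0;                         /* 7-bit remainder in low bits */
    for (int j = 0; j < n; j++)
        for (int i = 7; i >= 0; i--) {
            crc = (uint8_t)((crc << 1) | ((p[j] >> i) & 1));
            if (crc & 0x80) crc ^= 0x89;     /* reduce by the polynomial */
        }
    for (int i = 0; i < 7; i++) {            /* flush with 7 zero bits */
        crc = (uint8_t)(crc << 1);
        if (crc & 0x80) crc ^= 0x89;
    }
    return crc & 0x7F;
}

/* Build the 6-byte frame for any SPI-mode command,
 * e.g. CMD24 (WRITE_BLOCK) with the target address as argument. */
static void sd_build_cmd(uint8_t cmd, uint32_t arg, uint8_t frame[6])
{
    frame[0] = (uint8_t)(0x40 | (cmd & 0x3F));
    frame[1] = (uint8_t)(arg >> 24);
    frame[2] = (uint8_t)(arg >> 16);
    frame[3] = (uint8_t)(arg >> 8);
    frame[4] = (uint8_t)arg;
    frame[5] = (uint8_t)((crc7(frame, 5) << 1) | 0x01); /* CRC7 + end bit */
}
```

The CRC here can be sanity-checked against the two well-known values: CMD0 with a zero argument frames to a final byte of 0x95, and CMD8 with argument 0x1AA frames to 0x87.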
However, when we read the data back, sometimes the older data has been overwritten with 0x55 in chunks of some currently unknown size (at least 128 bytes, so probably 512). It seems these 0x55s are being written "below" our data and sometimes overlap it.
We can debug the firmware (with HEW and the USB hardware emulator) but cannot see a problem in our code. I'm a bit blind to what's going on inside the SD card, as the only way I know to inspect it is to remove it and image it (using dd on my Mac), which is very slow. (In fact I've had trouble seeing anything sensible in the 8GB SD images.)
I'm pretty sure the data is corrupted on the card during or shortly after the write, and is not being read back wrongly. We have two fairly different read routines and they both return the same data. Debugging shows that the 0x55 data is coming off the card, not being corrupted later in our processing.
So I can only guess that it’s some race-condition, some internal caching or buffering in the SDC, or possibly the erase-block-size which might be overwriting larger areas than the 512 blocks we are writing.
The card is used for some other purposes in other, lower reserved areas, with various duty cycles in our firmware and also interrupts, so it's possible there's a conflict there, but I would have thought that would have shown up years ago.
It only happens with these 8GB cards. We haven't managed to try any other card variants/brands yet but do plan to.
So my questions are:
- are there known risks and common problems like this when writing directly to the card without a filesystem (e.g. block alignment)?
- do the larger cards internally buffer or cache data?
- is 0x55 a known "filler" value (as opposed to the normal 0xFF erased state)?
- Can anyone think of what is causing the intermittent overwrites?
Today I tested as many cards as I could find, and found that most of them fail in similar, but sometimes different, ways.
| Result | Card |
|--------|------|
| FAIL | 8GB Kingston Ultra microSDHC Class 4 (CO8G Taiwan), production |
| PASS | 4GB Verbatim microSDHC Class 4, brand new |
| FAIL | 16GB SanDisk Ultra microSDHC 80MB/s 533x, new |
| FAIL | 8GB SanDisk microSDHC I Class 4 "BI Made in China", reused |
| FAIL | 8GB SanDisk Ultra microSDHC I, UHS Class 1, production |
| FAIL | 4GB SanDisk microSDHC Class 4, new |
| FAIL* | 4GB ADATA microSDHC Class 4, new |
| PASS | 2GB Transcend microSD (SC?) |
| ON ORDER | 8GB Transcend microSD |
| PASS | 8GB Verbatim microSDHC Class 10 |
| PASS | 8GB Kingston microSDHC Class 10 / UHS-I |
| FAIL | Un-branded Taiwan 8GB SDHC Class 10 |
Of these cards, the "worst" was the 16GB SanDisk, which seemed unpredictable and even moved data around after it was written: subsequent views of the data were different. The ADATA card (*) was also odd: it needed two reads to return the true data. On the first read the data was only partially visible, but on the second read it looked correct.
Most of them behaved in the same way as described above, with old blocks looking like they were overwritten when subsequent writes moved into new blocks (probably). Some of them failed almost immediately, after only one or two 512-byte block writes, much worse than our primary Kingston 8GB cards.
After watching the data move around strangely on these cards, I currently suspect the larger cards are "doing clever stuff" with the data (like buffering or caching) which we are only experiencing because we are writing/reading raw addresses directly rather than going through a filesystem layer (even if there isn't a specific bug in our code).
Today I focused on two things: the block-size suggestion, and examining the READ routine, prompted by thinking more about the fact that the data seems to be written correctly but read back inconsistently.
Firstly, I tried sending CMD16 to set the block size to 512, but:
- a) it didn't have any effect on the failures (assuming it worked);
- b) I'm not 100% sure I got it right in assembler, although while debugging I did get a "parameter error" bit set in the response, which I resolved and then got a 0x00 response, so I think I'm making the command call correctly;
- c) on reading the simplified SPI spec, it clearly says the block size is fixed at 512 for SDHC and this command does not affect reads/writes. So why do so many people say you should set the block size? Is that for MMC cards, or is the spec wrong? The spec says:
> In the case of a Standard Capacity SD Memory Card, this command sets the block length (in bytes) for all following block commands (read, write, lock). Default block length is fixed to 512 Bytes. Set length is valid for memory access commands only if partial block read operation are allowed in CSD. In the case of SDHC and SDXC Cards, block length set by CMD16 command does not affect memory read and write commands. Always 512 Bytes fixed block length is used. This command is effective for LOCK_UNLOCK command. In both cases, if block length is set larger than 512Bytes, the card sets the BLOCK_LEN_ERROR bit. In DDR50 mode, data is sampled on both edges of the clock.
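For anyone following along: the "parameter error" flag I saw is bit 6 of the SPI-mode R1 response (and a valid R1 always has bit 7 clear, so a raw 0xFF means "no response yet"). A small decoder, entirely my own sketch rather than anything from our firmware, of the documented R1 bits:

```c
#include <stdint.h>

/* SPI-mode R1 response bits (bit 7 of a valid R1 is always 0). */
#define R1_IDLE        0x01  /* card is in idle state (normal during init) */
#define R1_ERASE_RESET 0x02
#define R1_ILLEGAL_CMD 0x04
#define R1_CRC_ERROR   0x08
#define R1_ERASE_SEQ   0x10
#define R1_ADDR_ERROR  0x20
#define R1_PARAM_ERROR 0x40  /* e.g. a bad CMD16 block length argument */

/* 0xFF means the card has not answered yet; keep clocking and re-read. */
static int r1_is_valid(uint8_t r1)
{
    return (r1 & 0x80) == 0;
}

/* Idle alone is not an error; any other set bit is. */
static int r1_has_error(uint8_t r1)
{
    return r1_is_valid(r1) && (r1 & (uint8_t)~R1_IDLE) != 0;
}
```

So an R1 of 0x40 after CMD16 is exactly the parameter error I hit, and 0x00 afterwards means the command was accepted, which matches what I saw while debugging.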
Secondly I examined the read routine and as before, it seems generally correct. I previously tried adding more “dummy clocks” with no effect.
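For completeness, here is the shape of the read data phase I'm checking our routine against, with the SPI transfer abstracted as a callback so the token logic can be tested off-target; all names here are hypothetical, not from our firmware. After CMD17's R1 response the card sends 0xFF until the data is ready, then a 0xFE start token, 512 data bytes and a 16-bit CRC. Too small a wait loop here is one plausible source of intermittent reads on slower cards:

```c
#include <stdint.h>
#include <stddef.h>

/* Byte-receive function: clocks out 0xFF and returns the byte read. */
typedef uint8_t (*spi_rx_fn)(void *ctx);

/* Data phase of a single-block read (after CMD17 and its R1 response):
 * wait up to max_wait byte times for the 0xFE start token, then read
 * len data bytes and discard the trailing 16-bit CRC.
 * Returns 0 on success, -1 on timeout or an error token. */
static int sd_read_data_block(spi_rx_fn rx, void *ctx,
                              uint8_t *buf, size_t len, int max_wait)
{
    uint8_t b = 0xFF;
    for (int i = 0; i < max_wait; i++) {
        b = rx(ctx);
        if (b != 0xFF)             /* either 0xFE or an error token */
            break;
    }
    if (b != 0xFE)
        return -1;                 /* timed out, or card reported an error */
    for (size_t j = 0; j < len; j++)
        buf[j] = rx(ctx);
    (void)rx(ctx);                 /* CRC16 high byte (discarded here) */
    (void)rx(ctx);                 /* CRC16 low byte */
    return 0;
}
```

Checking the received CRC16 instead of discarding it would be a cheap way to confirm whether the 0x55 blocks really are what the card has stored, rather than a corrupted transfer.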
Sorry for the long novel, I'm a little dizzy from all this. Thank you for taking the time to read it!