FPGA Hacking

Postcards from Paul



Memory access prediction and dma_ok generation

23rd October 2002
> I'm certainly going to need your help with understanding the MOVX
> predictor, that is a beauty
 It's simpler than it looks.  The left side of the schematic
 produces just a single signal that gives a clock sync'd
 pulse when the 8051 is fetching one of the six opcodes that
 can be MOVX.
 The right side is a 5 bit counter that starts at 0 and
 counts up to 31 and stops when it gets to 31.  When it's
 at 31, the DMA_OK signal is asserted and it is ok for the
 controller to begin DMA and refresh operations.  The pulse
 from the left side zeros the counter, so that no DMA
 operations can begin in the next 31 cycles.  There's also
 a couple "done" signals that force it to 31 so that DMA
 can begin again immediately after we've serviced the MOVX
 (this probably isn't necessary... I doubt it makes any
 real difference).
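 If it helps, here's a little Python sketch of how I think of
 that counter (the class name is my own shorthand, and the
 opcode set is the six MOVX encodings; neither comes from the
 schematic itself):

```python
# The six MOVX encodings: MOVX A,@DPTR / A,@Ri and MOVX @DPTR,A / @Ri,A.
MOVX_OPCODES = {0xE0, 0xE2, 0xE3, 0xF0, 0xF2, 0xF3}

class DmaOkCounter:
    """5-bit counter: a MOVX fetch zeros it, it counts up and holds at 31,
    and DMA_OK is asserted only while it sits at 31."""
    def __init__(self):
        self.count = 0              # starts at 0, so DMA waits after reset

    def clock(self, movx_fetched, movx_done=False):
        if movx_fetched:
            self.count = 0          # predictor pulse zeros the counter
        elif movx_done:
            self.count = 31         # "done" forces DMA_OK back on right away
        elif self.count < 31:
            self.count += 1         # otherwise count up and stop at 31
        return self.count == 31     # DMA_OK
```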
 On the left side, there's a several cycle delay on PSEN
 which ultimately enables those flip-flops to (hopefully)
 catch the opcode in the middle of the time it's
 available on the bus (this timing has never really been
 verified well).  A bit of combinatorial logic detects if
 the opcode was one of the 6 we need, and that same PSEN
 pulse is delayed another couple cycles and AND'd with the
 opcode detection so that the counter only gets its reset
 for one cycle, and only when PSEN makes its low-to-high transition.
 The timing of all this stuff has never really been
 verified, and this might indeed be the thing that can
 cause instability.  Does the circuit really capture the
 opcode when it's nice and stable?  Does the 31 cycle
 delay begin too late, and a DMA operation can begin and
 last long enough into the MOVX fetch that the controller
 doesn't service the MOVX in time?  Is 31 cycles really
 long enough to wait for the worst case time from when
 we capture the 8051's opcode fetch to when it will assert
 RD or WR (and we see it, +/- 1 cycle since we're not on
 the same clock as the 8051)?
 But I think this circuit really does work, because when
 I've built the chip with bad timing constraints and ended
 up with a highly unstable FPGA, I saw read/write errors
 to DRAM when no DMA was running.  Hmm... the refresh does
 always run, so maybe it could be here?
16th December 2002
> I'm thinking that I will use ALE to determine the start of the next cycle
> rather than the 32 byte counter.  This is just a different approach, however

The reason I went to the 32 cycle count was seeing the ALE signal stop
pulsing when the 8051 executes from internal code memory (such as when
we call those routines to write into the flash).

I originally had a very simple scheme based on ALE, but it resulted in
the refresh not getting to access the dram when ALE stopped pulsing.
It sounds like you won't have this problem with your approach, but I
just wanted to bring up the issue where the 8051 turns off ALE pulses
while it's running from internal code.
17th December 2002
>Other questions, does the timer need to start immediately or can we start it
>at the start of the DRAM access?

Well, there's some flexibility.  It sounds like you've done quite a bit
more looking at the 8051 bus than I ever did.  I never connected a logic
analyzer to a real 8051 and I never simulated anything.  I just dreamed
the whole thing up based on the datasheet's timing diagrams.  I sketched
up some waveforms, made some little scribbles on bits of paper, and then
I wired up gates on the schematic and downloaded bit files into the FPGA
until it worked.

So you can probably make changes and it might even work better.  The 8051
bus is pretty slow, so there's some flexibility in how to do things.  The
main reason for 7.3728 MHz on the 8051 was to allow plenty of time in the
RD and WR pulses to access the DRAM.

Now, the way I originally intended it was for the timer to begin counting
before the end of the PSEN pulse, and it would not reach zero before the
RD or WR pulse begins.  It does not matter if the counter reaches zero once
the state machine enters the first state that services the request.  All
that matters is that the DMA_OK signal is de-asserted soon enough for
whatever operation that might be in progress to completely finish, so the
state machine will be waiting in the idle state and can respond immediately
to the 8051's bus.

While I'm thinking of this, I should mention a pitfall that you are
probably already aware of.  The MOVX prediction is quite simple... any
code memory fetch of one of the six MOVX opcodes causes the timer to
reset.  We can't tell the difference between opcodes and operands, so
any operand with one of those 6 bytes causes the DMA to stall for 31
cycles.  This probably isn't a big deal, but then again, I've never
really investigated how much time the DMA is being suspended needlessly.

But I do know that my original choices for the inter-bank calling code
were not so great.  The code to call bank1 is at 0x0FE0, and the code
to return to bank0 is at 0x0FF0.  These locations are fixed in the 87C52
and we can't change them because code that jumps to different places
would not be compatible with existing boards.  The 2 important MOVXs
are opcodes 0xE0 and 0xF0, so all jumps between the two memory banks
will stall the DMA.  Hindsight....  I suspect this doesn't actually
slow things down much, but I really don't know what the impact is.  I
thought a few times about changing the inter-bank calling to just use
some code that exists in both banks, or to take a closer look at the
calling conventions that Roger and Marco discussed at length some
time ago.  Anyway, this is (probably) a minor slowdown and it doesn't
have any harmful effects.  I just thought I'd mention it while I was
thinking about the MOVX prediction circuitry.

>It looks like DMA_OK off can freeze a
>cycle (hold the state machine in the middle of a cycle).  Is this really
>what we want?

That would definitely be bad, but I can't see how it could happen.  The
DMA_OK signal goes into a bunch of gates that use the request lines to
assert exactly one of the DO_xxx signals.  When there are no requests,
either DO_WAIT or DO_NOTHING is asserted (they both have the same
effect...  it would be interesting to see if the xilinx translate step
removes the redundant logic).

The way I had intended it was that all the DO_xxx signals only affect
the state transition at the idle state.  Once the state machine enters
a sequence of states like S_RD_DRAM_x, there is no way the DMA_OK signal
or any of the request bits can alter the flow until it returns to the
idle state.  That is why I wanted the DMA_OK signal to begin as soon as
possible.  The hope is that there are more cycles between the initial
de-assertion of DMA_OK and the 8051's RD or WR pulse than there are in
the longest operation (S_IDEXFER_x), so that in the worst case where the
state machine begins an operation in the same cycle where DMA_OK is
de-asserted, that operation will return to the idle state before REQ_RD
or REQ_WR are asserted due to the MOVX.
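 A toy Python model of that "requests are only examined at
 idle" rule (the state names are made up to resemble the
 S_RD_DRAM_x sequence, and the whole thing is a sketch, not
 the real state machine):

```python
def controller_trace(inputs):
    """Clock a simplified controller: (dma_ok, req_rd) pairs are sampled
    only in the idle state; once a sequence starts, it runs to completion
    untouched by DMA_OK or the request bits."""
    state, seq, trace = "S_IDLE", [], []
    for dma_ok, req_rd in inputs:
        if seq:                               # mid-operation: nothing can
            state = seq.pop(0)                # alter the flow
        elif state != "S_IDLE":
            state = "S_IDLE"                  # last state returns to idle
        elif dma_ok and req_rd:               # sampled only at idle
            state, seq = "S_RD_DRAM_1", ["S_RD_DRAM_2", "S_RD_DRAM_3"]
        trace.append(state)
    return trace
```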

This approach does have the drawback that none of these atomic operations
can be really long, because they'd cause the controller to respond too
late to the 8051's RD or WR pulse.  It would have been really good to use
a CPU with a wait state input.  The other thing I had considered, was
using the FPGA to clock the 87C52.  It's supposed to be fully static, so
at least in theory, the FPGA could suspend the 8051's clock until it is
ready to respond to the MOVX.  This could also allow the 8051 to clock
quite a bit faster.  The main reason I did not pursue that was the
difficulty of transitioning between the initial clock (before the FPGA
is configured) and a clock from the FPGA.

CPU Bus driver Logic

2nd November 2002
>I'm a little confused about the polarity of the RD_RAW signal driving the
>data bus output (DOUT => DBUS) enable.  From the schematic it looks to be
>active high, but I can't see how this could possibly work.  Can you help me

It is active low.  Almost every signal is active high, but in this case
Xilinx's OFDTX8 symbol requires an active low enable.

State machine and memory basics

23rd December 2002
> 1. The write timing you've used seems to use a second CAS
> cycle to clock the data in.  This is different from the
> fast mode cycles described in the micron data sheets and
> seems to be different from what you described (early
> on) on the simm page.  Is the method you have used
> described anywhere

I've never documented it (until now).  Question 2 is related.

> 2. I'm really confused about 8 bit write cycles.  I can't see
> how these are possible since we always write 16 bits (I think).
> If this is the case the how do variables in the xdata space
> ever work?

The S_WR_DRAM_x states implement a read-modify-write operation
to the DRAM.  That is why you see two CAS pulses.  The first
one reads all 16 bits into the register in the FPGA (almost
exactly the same as S_RD_DRAM_x).  Then the 8051's 8 bits are
written into whichever half of the register A0 specifies, and
the second CAS pulse writes the modified 16 bits back out to
the DRAM.
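In Python terms the read-modify-write goes roughly like this
(whether A0=1 selects the high or the low byte is my guess
here, and the function name is just for the sketch):

```python
def wr_dram_8bit(dram, word_addr, a0, byte_val):
    """Sketch of S_WR_DRAM_x: first CAS reads the full 16-bit word, the
    8051's byte replaces the half A0 selects, second CAS writes it back."""
    word = dram[word_addr]                        # first CAS: read 16 bits
    if a0:
        word = (word & 0x00FF) | (byte_val << 8)  # replace the high byte
    else:
        word = (word & 0xFF00) | byte_val         # replace the low byte
    dram[word_addr] = word                        # second CAS: write back
```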

> DRAM was a black art.

I used to think that too before this project.  It is a bit of
a pain compared to normal SRAM and peripherals, but it's not
really that bad.

> 3. Finally I take it the DRAM refresh address counter is internal
> to the chips, all we have to do is to keep asking it to refresh

Yes.  All modern DRAM chips have a row address counter inside
so all you have to do is assert CAS before RAS and it refreshes
the next row.

The main thing to remember is to allow for the "precharge time"
after any operation (de-asserting RAS).  DRAM reads are always
destructive, since the tiny charge on the little capacitors
makes a little change on the column lines that is picked up by
the sense amplifiers and written to the row buffer.  The time
after RAS is needed for the chip to write the entire row back
to that row of the memory array.  According to the datasheets,
60 ns ought to be enough, so in theory one 68 ns cycle should do
it.  But my experience was that some simms were problematic
until I allowed 2 cycles for precharge time.  (I have never even
attempted to overlap the precharge time with the IDEXFER for
faster DMA... but it ought to be possible).  Anyway, until it's
working well, just leave an extra couple cycles for the
precharge time to be cautious.
15th October 2002
>If you have some kind of description of the states in the transfer machines
>that would be great - I'm of course quite mystified by the RAS, CAS and
>column select logic too.  At this stage I've just coded it blind from your
>schematics.  My plan is to work towards getting a DRAM interface going
>leaving out the IDE and using its pins for debug.

I have some hand-drawn diagrams of the expected waveforms.  Maybe
that would help?  I'll describe a bit....

The basic idea is that the DRAM requires the address provided in
two parts (row and then column).  The full address is formed from
12 bits from the 8051, and from the SRAM block that holds the page
mapping.  The 8051's A0 never makes it to the SIMM, since SIMM
access is always 16 bits wide.

The address mux logic takes the 24 bit SIMM address (16 meg of
16 bit wide locations), where 11 bits are from the 8051/DMA
address, and the upper 13 bits are from the page mapping registers.

11 bits of 16-bit data = 4K byte block

The low 20 bits get muxed to the low 10 bits of the DRAM.  The
next two above that select which of the 4 RAS lines will be
asserted, and the top two go to A11 (16 and 32 meg simms).  A
couple signals force certain bits to 0 or 1 when accessing the
IDE drive.
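As a Python sketch (the bit split follows what I wrote above,
but which 10-bit half serves as row vs. column is my
assumption, as is the function name):

```python
def split_simm_addr(addr24):
    """Split a 24-bit SIMM word address: the low 20 bits feed the 10 DRAM
    address pins in two halves, the next 2 bits pick one of the 4 RAS
    strobes, and the top 2 bits drive A11 (16 and 32 meg SIMMs)."""
    col = addr24 & 0x3FF                # low 10 bits, column phase
    row = (addr24 >> 10) & 0x3FF        # next 10 bits, row phase
    ras_select = (addr24 >> 20) & 0x3   # which of the 4 RAS lines
    a11 = (addr24 >> 22) & 0x3          # top 2 bits, one per address phase
    return row, col, ras_select, a11
```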

I wrote quite a bit about how the SIMM access works some time
ago, and it's archived here:


I should really dig up those waveform sketches.  The basic idea
is that the control signals are asserted at certain times in
relation to other control signals, so that both halves of the
address are output, and then after 1 idle state, the data is
captured into the latch (for reading) or data is transmitted
to the simm (for writing).


9th November 2002
>I need some info on what constraints you've been using.  I can see there are
>some on the main schematic, but it's difficult to tell what they apply to.

Well, I've tried many different things in the constraints, and I've
never really been happy with it.  Maybe I didn't consider something,
but the results were often so strange that I think the schematic
"flow" has a lot of bugs.  It could also be some sort of strange
async thing between the clocked logic and the unsync'd 8051 signals
(but I tried to sync all of them up right at the inputs).

Anyway, the main constraint I used was the period on the clock.  In
the end I believe I also used two other constraints on the inputs and
outputs of the chip relative to the clock.

All over the design are TNM= attributes that assign registers to various
timing groups.  They are all unused.  I originally had a bunch of
constraints for things like control registers to data registers, etc.
This produced very erratic results.  Version 3.1i added the "offset"
timing constraint relative to the clock.  So there's only three
constraints... the speed of the clock, the time we are willing to
accept from the clock to when the outputs change, and the time before
the clock that inputs must be stable.  At least that's roughly how
I remember it.

Many times I would fiddle for hours with the constraints and ultimately
get a lot of unreliable compiles, and then just switch off timing
constraints altogether and get a quick compile that ran pretty well.

When I revisit the FPGA, the next major thing I want to try (other than
learning the simulator and setting up some good simulations), is the
floorplanner and more relative location specs.  The automatic placement
makes a giant mess and why it places things the way it does is a total
mystery.  There's a reason for the floorplanner.

IDE and DRAM address pins

9th December 2002
>It took me a while to fight my way through the dram/ide address selection
>logic - it was much clearer once I checked the schematic to determine where
>the various ide signals were connected (must have suited the PCB since it
>makes the fpga look a little weird).

Yep, I did the board layout first, so it could fit into 2 layers.  I also
routed the whole board in 15 mil traces with 10 mil clearance, so it would
be possible to etch with a hobby etching kit.  In hindsight, that was a
lot of extra routing work that I probably wouldn't bother to repeat.  But
it did put a lot of limits on which pins could route where.  The address
and data connections to the flash rom are also all scrambled to make the
signals route (and the monitor has constants to adjust for them).

If I could go back in time knowing what I know now, I'd definitely use the
5 strobes for two RAS, two CAS and one extra address bit (instead of 4 RAS
and 1 CAS).  That would allow a lot more SIMMs to be used at full capacity
that aren't wired like the Micron datasheet.  Oh well.

>I don't quite understand why the
>generation of A5 is ORed with ide_addr_z and others use an inverted input
>AND.  I need to understand this well since I will be changing it to add the
>chip select for the CS8900.

A5 goes to pin 7 on the FPGA, which connects to pin 38 (CS1) on the IDE
connector.  When the IDE DMA reads from the drive, it needs to access
the IDE data register, which is A0=0, A1=0, A2=0, CS0=0 and CS1=1.  So the
four AND gates force A0-A2 and CS0 low, and the OR gate forces CS1 high.
Nothing is done to the other 6 lines, since they do not connect to the
IDE bus.
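In other words (the line names follow the text above, but
representing the bus as a dict and the function name are just
for this sketch):

```python
def force_ide_data_reg(addr_lines, ide_dma_active):
    """When the IDE DMA owns the bus, force the address to the IDE data
    register: A0-A2 and CS0 low (the four AND gates), CS1 high (the OR
    gate).  Any other lines pass through unchanged."""
    out = dict(addr_lines)
    if ide_dma_active:
        for name in ("A0", "A1", "A2", "CS0"):
            out[name] = 0
        out["CS1"] = 1      # A5 on the FPGA -> pin 38 (CS1) on the connector
    return out
```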

You'll almost certainly have to do similar things if you want to make the
FPGA circuits access certain registers in the CS8900 without CPU bus
cycles providing the address.  As this function gets more complex, a nice
way to minimize the logic might be to use initialized RAM blocks to hold
the DMA addresses, and then follow each with a mux for the CPU addresses.
That ought to allow a single CLB (1/2 for the RAM, 1/2 for the mux, and
the internal H mux to choose between them) to generate each address bit.

STA013 data and control

31st December 2002
> 1. The weird bit ordering in the shift register.  At first I
> thought this was just an endian thing, but it seems to also
> reverse the order of the bits too.

Referring to this schematic:

I suppose that depends on what you consider "reverse".  The
STA013 wants to see the MSB first.  FAT32 is little endian,
so bits 0-7 are the first byte, and 8-15 are the second byte.
That is why the shift register is connected in that order.
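So for a 16-bit word pulled from a little-endian file, the
bit order on the wire comes out like this (a Python sketch of
my understanding; the function name is made up):

```python
def sta013_bit_order(word16):
    """The low byte (the file's first byte) goes out first, and each byte
    is sent MSB-first, which is what the STA013 expects."""
    bits = []
    for byte in (word16 & 0xFF, word16 >> 8):   # first byte, then second
        for i in range(7, -1, -1):              # MSB first within the byte
            bits.append((byte >> i) & 1)
    return bits
```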

> 2. The operation of the bitcount block.  It is really hard
> to read when you're not used to those logic blocks.

Referring to this schematic (is anyone else actually trying
to follow these FPGA conversations??)

First, ignore those FMAP symbols.  They have absolutely no
logic functionality.  Their only purpose is to serve as a
placeholder for RLOC constraints to the xilinx compiler.
Once you get past that, it's just a simple 5-bit down
counter with sync preload of 16.  Yes, the carry logic is
a bit strange and the symbols do a very poor job of
conveying their function (I always print out the relevant
pages from libguide.pdf regarding those CY4_xx symbols).

> I'm guessing it is a 4 bit counter
> with a clock enable and a parallel load (begin) with a
> non-zero output. However there seems to be more logic than
> that in the block.

Hmm... it's been quite a long time since I designed that....
I'm looking at it now.......

Ok, it is a bit tricky and difficult to figure out by looking.
That counter works together with a little state machine
which is drawn on the main schematic.  The state machine has
two flip-flops, so ignore that IFD flip-flop on the MP3_REQ
line, since all it does is sync the STA013's request signal
to the FPGA's clock.

That state machine has 3 valid states: 0/0, 0/1, 1/0.  Both
flip-flops should never be high together (if they ever were, it would
immediately go back to 0/0).  In this little informal
notation for this message, the first number refers to the
top flip-flop.  Here's a conceptual way to think of the 3 states:

0/0 = Waiting for STA013 to be ready or for parallel load
1/0 = First half of clocking to STA013
0/1 = Second half of clocking to STA013
1/1 = (illegal state)

The state sequence when transferring data is:

0/0 -> 1/0 -> 0/1 -> 1/0 -> 0/1

If the STA013 de-asserts its data request signal, or if the
counter reaches zero (starts at 16 and counts down) then
the 0/0 state is entered and it remains at 0/0 until the
STA013 is ready AND the counter is non-zero.

Of course, when the counter reaches zero, the "nonzero" output
is inverted to become "DECODE_READY" on the main schematic,
and that gets AND'd with the DMA request so the state machine
receives a request to load another 16 bits whenever DECODE_READY
is asserted AND we're still servicing a DMA request.

The SR_LOAD signal (from the control state machine) is
asserted when the 16 bits are transferred from DRAM to fill
the shift register, and it sets the counter back to 16.
On the next clock cycle, the DECODE_READY causes that little
state machine to begin clocking the bits out (of course, if
the STA013 is also requesting data... if not it stays in the
0/0 waiting state and the counter remains at 16).  It needs
to be a 5 bit counter because there is one state for each
bit, and 00000 is used to represent "empty".
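Putting the counter and the little state machine together in
Python (exactly when the decrement happens relative to the
two half-states is my guess, and the class name is made up):

```python
class BitClocker:
    """5-bit down counter (preloaded to 16, 00000 = empty) driving the
    two-flip-flop state machine: 0/0 waits, while 1/0 and 0/1 are the
    two halves of clocking one bit out to the STA013."""
    def __init__(self):
        self.count = 0
        self.state = (0, 0)

    def clock(self, sr_load, mp3_req):
        if sr_load:
            self.count = 16                   # shift register refilled
        if self.state == (0, 0):
            if mp3_req and self.count:        # STA013 ready AND bits left
                self.state = (1, 0)
        elif self.state == (1, 0):
            self.state = (0, 1)
        elif self.state == (0, 1):
            self.count -= 1                   # one bit fully clocked out
            self.state = (1, 0) if (mp3_req and self.count) else (0, 0)
        return self.state, self.count
```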

> You also seem to have used RLOC constraints within the
> counter, can you remember why?

Not for any great reason.  Mostly my low opinion of the
xilinx compiler's placer.  It's also a habit I got into
with XACT 5.0 (before "foundation"), where the placer was
not able to use carry logic with RLOCs.  They seem to have
fixed this sometime in the last several years... old
habits die hard, I guess.  But I didn't put lots of RLOCs
inside the MOVX prediction which was designed later on.