Driving the WS2811 at 800 kHz with an 8 MHz AVR

From Just in Time

This page describes how to drive a WS2811 from an 8Mhz or 9.6Mhz AVR like an Atmel ATmega88, ATtiny2313 or ATtiny13 without added components such as an external oscillator. With only 30 instructions for the driver loop (110 bytes for the complete function), the driver code presented here is likely to be the shortest code that delivers the exact timing (i.e. exactly 10 clock cycles for 1 bit, exactly 60 clock cycles for 6 bits, etc.). With even less code a "sparse driver" for 9.6Mhz MCUs can drive long led strings with only a few bytes of buffer in RAM.

Of course, if you're creating a hardware project that controls more than 1 LED, you're going to have to demonstrate it with a Knight Rider LED sequence (which, I just learned, is actually called a Larson scanner)... The sources for all the demonstrations in these videos can be found on github.

Knight Rider on Steroids ([1])
"Water Torture" or "Lava Drops" demo (source code, details)
Special sparse driver allows an attiny13 to drive arbitrarily large LED strings from 64 bytes of memory
Flares demo on an attiny13 (source code)
Fire without flies

Download source code

The library is header-only. Using it is a matter of downloading the source code and including it in your AVR code. Example source code can be found here. The code comes as an avr-eclipse project consisting for a large part of C++ demonstration code and the main driver function in assembly, in files ws2811_8.h and ws2811_96.h (for the 9.6Mhz version).

I don't recommend trying to understand the assembly code by reading these sources. How the code functions is described below. Usage information is in the next section. The rest of this page describes the 8Mhz version. The 9.6Mhz code was added later, but is created in the same way.

New: If you'd like to see the assembly code in action, take a look at the online AVR emulator!

Usage


You'll need a C++ compiler for this to work (turning ws2811.h into "pure C" is left as an exercise to the reader). I am told that this works just as well on an Arduino, but I haven't tested this myself. Remember that this code was written and optimized for 8Mhz and 9.6Mhz; it would run too fast on a 16Mhz Arduino. From the sources, you'll need files ws2811.h, ws2811_8.h, ws2811_96.h and rgb.h, though you only include "ws2811.h". A simple example of how to use this code:

#include <avr/io.h> // for _BV()

#define WS2811_PORT PORTD // ** your port here **
#include "ws2811.h" // this will auto-select the 8Mhz or 9.6Mhz version

using ws2811::rgb;

namespace {
  const int output_pin = 3;
  rgb buffer[] = { rgb(255,255,255), rgb(0,0,255)};
}

int main()
{
  // don't forget to configure your output pin,
  // the library doesn't do that for you.
  // in this example DDRD, because we're using PORTD.
  DDRD = _BV( output_pin);

  // send the RGB-values in buffer via pin 3
  // you can control up to 8 led strips from one AVR with this code, as long as they
  // are connected to pins of the same port. Just
  // provide the pin number that you want to send the values to here.
  send( buffer, output_pin);

  // alternatively, if you don't statically know the size of the buffer
  // or you have a pointer-to-rgb instead of an array-of-rgb.
  send( buffer, sizeof buffer/ sizeof buffer[0], output_pin);
}


WS2811 LED controllers are hot. Projects using WS2811 (or WS2812, WS2812B or NeoPixel) LED strips have been featured on HackaDay several times in the last few months. One feature showed how an AVR clocked at 16Mhz could send data at the required high rates. Inspired by this, I ordered an LED strip and 16Mhz oscillators from ebay. The LED strip arrived quickly, only the oscillators took weeks to arrive, which gave me plenty of time to think about the possibility of driving these led strips from an 8Mhz atmega88 without an external oscillator. With only 10 clock ticks per bit, this was going to be a challenge.

Normally I'd go straight to the datasheet and start working from there, but in this particular case the datasheets are not so very informative. Luckily, the HackaDay links provide some excellent discussions. This one by Alan Burlison is especially helpful. That article not only explains in great detail why a library like FastSPI isn't guaranteed to work, but it comes with working code for a 16Mhz AVR that appears rock solid in its timing.

Small problem: I didn't have any 16Mhz crystals in stock, so I ordered a few, on ebay again, and sat back for the 25 day shipping time to pass. 25 days is a long time. The led strip had arrived and was sitting on my desk. 25 days is a really long time. Maybe it could work off an AVR on its internal 8Mhz oscillator? It would be a lot of work. But 25 days is a very, very, long time.

So, that is how I got to sit down and write my 8Mhz version of a WS2811@800Khz bit banger. The challenge is of course that I have 10 clock cycles for every bit, no more no less, and 80 cycles for every byte, no more no less. I wanted the timing to be as rock-steady as Alan's, give-or-take the imprecise nature of the AVR internal oscillator. The part about it being steady was important to me. People have argued that the code can be made a lot easier if you're willing to have a few extra clock cycles in between bytes or triplets and that such code works for them. I agree that such code is a lot easier to create or read. It's trivial, in fact. However, the WS2811's datasheets are ambiguous at best with regards to the maximum allowed delay between bytes (or bits) and anyway, I liked the challenge of trying to have zero clock ticks delay between bytes or triplets.

The challenge

For a full description of the required protocol to communicate with a WS2811, please refer to either Alan's page or the datasheet. In summary, the microcontroller should send a serial signal containing 3 bytes for every LED in the chain, in GRB-order. The bits of this signal are encoded in a special way. See the figure below.

illustration of a WS2811 waveform

This image shows a sequence of a "0" followed by a "1". Every bit starts with a rising edge. For zeros, the signal drops back to low "quickly" while for ones the signal stays high and drops nearer the end of the bit. I've chosen the following timing, in line with Alan's observations and recommendations:

  • Zero: 250ns up, 1000ns down
  • One: 1000ns up, 250ns down

Giving a total duration of 1250ns for every bit, or 10μs per byte. These timings do not fall in the ranges permitted by the data sheet, but Alan describes clearly why that should not be a problem. 1250ns means 10 clock ticks per bit. That is not a lot. A typical, naive implementation would need to do the following things at every bit:

  1. determine whether the next bit is a 1 or a 0
  2. decrease a bit counter and determine if the end of a byte has been reached, if at the end:
    1. determine if we're at the end of the total sequence
    2. load a new byte in the data register
    3. decrement the byte counter
    4. reset the bit counter
  3. jump back to the first step

Oh yes, and that is of course in addition to actually switching the output levels.
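
The cycle budget described above can be checked at compile time. This is a minimal sketch, assuming an 8 MHz clock and the 250ns/1000ns pulse lengths chosen in the text; the names cpu_hz, bit_hz and cycles_for_ns are made up for this illustration and do not appear in the library.

```cpp
#include <cassert>
#include <cstdint>

// Assumed values taken from the text: 8 MHz internal oscillator, 800 kbit/s data rate.
constexpr std::uint32_t cpu_hz = 8000000;
constexpr std::uint32_t bit_hz = 800000;

constexpr std::uint32_t cycles_per_bit = cpu_hz / bit_hz;       // clock ticks available per bit
constexpr std::uint32_t ns_per_cycle   = 1000000000u / cpu_hz;  // duration of one clock tick

// Convert a pulse length in nanoseconds to whole clock cycles.
constexpr std::uint32_t cycles_for_ns(std::uint32_t ns)
{
    return ns / ns_per_cycle;
}

static_assert(cpu_hz % bit_hz == 0, "a bit must take a whole number of cycles");
static_assert(cycles_per_bit == 10, "10 ticks per bit at 8 MHz");
static_assert(cycles_for_ns(250) == 2, "short pulse is 2 ticks");
static_assert(cycles_for_ns(1000) == 8, "long pulse is 8 ticks");
static_assert(cycles_for_ns(250) + cycles_for_ns(1000) == cycles_per_bit,
              "short plus long pulse fill exactly one bit");
static_assert(8 * cycles_per_bit == 80, "80 ticks per byte");
```

Everything in the list above, plus the two output writes, has to fit in those 10 ticks, which is why the naive loop does not work.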

All of that does not fit into a single 10-clock time frame. Luckily, it doesn't have to. My first version of this driver partially unrolled the bit loop into a 2-bit loop. This allowed all those actions described above to fit within the loop, but it also required 4 versions of the loop (one for every 2-bit combination). The code would jump from one version of the loop to the other as appropriate.

When writing code for the 9.6 Mhz version and the version for sparse LED strings (strings where most LEDs were off), I figured out a way where I could basically have one small loop for each bit but where the code for the last two bits would be unrolled, giving enough time to fetch the next byte and reset the bit counter. This resulted in the much smaller driver code that I have now.

Defining the puzzle

Inventing a notation

Juggling with many states, jumping from one piece of code to the other without introducing phase errors turns out to be interesting. I spent a couple of lonely lunch breaks and several pages in my little (paper!) notebook before I even figured out how to describe the problem. When a notation became clear, however, the going was easy enough and this exercise turned into one of the nicer kinds of pastimes.

Ws2811 driver code.png

The image above shows the full code for the driver in a spreadsheet with pseudo assembly code in the yellow blocks. To the left of each yellow block is a graphic representing the wave form being generated. Tilt your head to the right to see the more conventional waveform graphic. The blue blocks show where the signal could be high or low, depending on the current bit value being sent. Each horizontal row in the yellow blocks represents a clock tick, not necessarily an instruction word. To the left of each waveform graphic there are numbers from 00 to 19 that represent the "phase" at the corresponding clock tick. Phases 00-09 are those of the first 7 bits, phases 10-19 are those of the last bit.

What makes this notation so convenient is the fact that I can now easily determine the waveform phase at each point in the code and can also check whether a jump lands in the correct phase. Each jump at phase n (0 <= n < 09) should land at a label which is placed at phase n + 2 (modulo 10), because jumps take 2 clock cycles. Put differently: each jump should be to a label that is two lines down from the jump location (or 8 or 18 lines up).

The drawn waveforms make it easy to verify that when I jump from the middle of a wave, the code lands in a place where that same wave form is continued. It also shows clearly where the 'up' and 'down' statements that do the actual signal levels need to go.

Wherever there is a "^^^" in the table, it means that the previous instruction takes 2 clock cycles, so that particular clock cycle still belongs to the previous instruction.
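
The landing rule for jumps can be written down as a small check. This is only a sketch of the bookkeeping, not part of the driver; it assumes that label names such as cont06 encode the phase at which the label sits, and the function names are hypothetical.

```cpp
#include <cassert>

constexpr int phases_per_bit = 10; // 10 clock ticks per bit at 8 MHz
constexpr int jump_cycles    = 2;  // a jump takes 2 clock cycles

// Phase at which execution resumes after a jump issued at phase `jump_phase`.
// Phases 10-19 (the last bit) wrap onto 0-9 because the waveform repeats every 10 ticks.
constexpr int landing_phase(int jump_phase)
{
    return (jump_phase + jump_cycles) % phases_per_bit;
}

// True if a label placed at `label_phase` continues the waveform without a phase error.
constexpr bool jump_keeps_phase(int jump_phase, int label_phase)
{
    return landing_phase(jump_phase) == label_phase % phases_per_bit;
}

static_assert(jump_keeps_phase(4, 6), "a jump at phase 04 may land on a label at phase 06");
static_assert(!jump_keeps_phase(4, 7), "landing one tick late would stretch the bit");
```

For example, jump_keeps_phase(15, 7) holds as well: a jump from phase 15 in the last-bit block lands correctly on a label at phase 07 in the first block.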

How the code works

In summary, the code works as follows: The start of a bit waveform occurs at label s00. At this point the value of the bit to be sent is assumed to be in the carry flag. The line is pulled high and if the current bit (carry flag) is a zero bit, it is pulled low two clock cycles later. Then a bit counter is decreased and if we're not in the second-to-last bit, we continue the second half of the waveform by jumping to label cont06, which is above s00. From cont06 the code just waits a while, then brings the line down (which has no effect if the line was already brought down) and shifts the next bit from the data byte into the carry flag. From here the code falls back into label s00, ready to transmit the next bit.

If we were in the second-to-last bit, the code continues downward after label skip03. We need to free up the data register for the next byte, so we quickly test the last bit of the current byte and then branch into one of two essentially equivalent pieces of code. The code on the left hand side generates a "1"-waveform, while the code on the right generates a "0" for the last bit of the byte. In between the OUT-instructions we find some free cycles to reset the bit counter (to 7), to load the next byte and to decrease the 16-bit byte counter. If indeed there is a next byte to send, we jump up to either label cont07 or cont09 where the rest of the bit waveform is generated before we continue with the bits of the next byte.
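
To see what the loop is supposed to produce, the signal can be modelled on a PC with one character per clock tick. This is a host-side sketch, not the assembly driver. It assumes 10 ticks per bit (2 high and 8 low for a zero, 8 high and 2 low for a one) and that bits go out most significant bit first, as the WS2811 protocol expects; the function names are invented for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// One character per clock tick: 'H' = line high, 'L' = line low.
inline std::string bit_waveform(bool bit)
{
    const int high_ticks = bit ? 8 : 2; // 1000ns or 250ns at 125ns per tick
    return std::string(high_ticks, 'H') + std::string(10 - high_ticks, 'L');
}

// Emits the 8 bits of `value`, most significant bit first.
inline std::string byte_waveform(std::uint8_t value)
{
    std::string wave;
    for (int i = 0; i < 8; ++i) {
        const bool carry = (value & 0x80) != 0;        // bit a left shift would move into carry
        value = static_cast<std::uint8_t>(value << 1); // expose the next bit
        wave += bit_waveform(carry);
    }
    return wave;
}
```

A byte always yields exactly 80 ticks here, with no extra ticks between bytes. That is the property the real driver has to preserve while it is also fetching the next byte.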

Combining the code

The latest version of the code is pretty small (32 instructions/64 bytes), but earlier versions were bigger, requiring jumps over longer address distances. This posed a problem, because a jump from the end of the code right to the beginning would be too long for the branch instructions of the AVR.

Note how all conditional jumps are in the form of branch instructions ("BRCC", "BREQ", etc). There is one important limit to these relative branches: they can only jump to a range of [PC - 63, PC + 64] (with PC the address of the jump instruction)! Any instruction more than 64 instructions away from the branch cannot be reached.
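
The range limit translates into a one-line test. A minimal sketch, assuming addresses are counted in instruction words and measured from the branch instruction itself; branch_reachable is a name made up for this illustration.

```cpp
#include <cassert>

// True if a conditional branch at `branch_address` can reach `target_address`,
// i.e. the target lies within [PC - 63, PC + 64] of the branch.
constexpr bool branch_reachable(int branch_address, int target_address)
{
    return target_address - branch_address >= -63 &&
           target_address - branch_address <= 64;
}

static_assert(branch_reachable(100, 164), "64 words forward is still in range");
static_assert(!branch_reachable(100, 165), "65 words forward is out of range");
```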

At first I tried to piece the code together manually in a spreadsheet that would calculate the maximum jump distance for me. After a few failed attempts I gave up and decided that computers are better at this. In the end, I just wrote a dedicated program in C++ that uses some common sense heuristics to shuffle the blocks of code around until it finds a sequence in which all jumps are within range.

After this, it became a matter of just pasting the code blocks into one sequence and changing some of the pseudo instructions into real instructions.


The main point of this text is not that I can show 4 (four!) Larson scanners in one led strip. Actually there are two different points I am trying to make:

First of all, it is possible to control WS2811 led strips from an AVR without an external 16 Mhz oscillator, with clock-tick-exact timing, and I want to tell the world.

Secondly, during this exercise I discovered that this kind of extremely time-critical coding problem can be solved with a number of techniques:

  • Unroll loops. That is not a new technique, but in this case it not only saves on the number of test-and-jump-to-the-starts (the normal reason to unroll a loop), but also decreases the number of other tests and allows me to sweep a few precious left-over clock cycles into contiguous blocks.
  • Write a conditional jump in such a way that the jump is made in case there is not much left to do, saving a precious clock cycle for the busy case. Ignore the reflex to jump in "exceptional cases", trying to minimize the total number of times a jump is made.
  • When code is "phase critical", abandon the idea of a list-of-instructions and organize the code in "phase aligned" side-by-side blocks, where a jump is most often a jump "to the right" or "left".
  • Use software to optimize code layout in memory. I am not aware of any assembler that will automatically do this when jump labels are out of reach, but I know I have wished for it more than once.
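
The second point in the list rests on the fact that an AVR conditional branch costs 1 clock cycle when it is not taken and 2 when it is. A small sketch of that arithmetic, with hypothetical function names:

```cpp
#include <cassert>

// AVR conditional branches (BRCC, BREQ, ...) take 1 cycle when not taken, 2 when taken.
constexpr int branch_cycles(bool taken)
{
    return taken ? 2 : 1;
}

// Cycles left for other work in a stretch of code with a fixed cycle budget
// that contains one conditional branch.
constexpr int cycles_left(int budget, bool branch_taken)
{
    return budget - branch_cycles(branch_taken);
}

static_assert(cycles_left(10, false) == 9, "fall-through path keeps 9 of 10 ticks");
static_assert(cycles_left(10, true) == 8, "taken path keeps only 8 of 10 ticks");
```

So the path that still has the most work to do should be the fall-through path, even if that means the branch is taken far more often than not.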

Comments? Questions?

16 February 2013 08:25:36
Great stuff, just ordered some led strips...

could you tweak this code to run at 16mhz? i got a small atmega32u2 board with 16mhz crystal id like to play with..

16 February 2013 12:12:40
Any reason you couldn't use Alan Burlison's version? It does the same thing at 16Mhz. My version would take up more program space if I added the NOPs to bring it down to half the speed.

I do have a 9.6Mhz version forthcoming, so that I can run a few of these LEDs with a $0.66 attiny13.
2 April 2013 20:42:20
Hi Danny,

I'm trying to get this to run on an 8mhz arduino, and it doesn't appear to be working (ATMega328p with 8mhz crystal).
I've got it wired up properly (I wrote simple code that is just PORTD=0x08, 8x NOP, PORTD=0x0, 2x NOP), and sent it 24 times. And I get a white pixel from pin 3 as expected.
However, your code doesn't appear to do anything.
My code is simply:
#define WS2811_PORT PORTD
#include "WS2811.h"
#define NUM_LEDS 2
rgb buffer[NUM_LEDS];
void setup() {
   buffer[0] = rgb(255,255,255);
   buffer[1] = rgb(0,0,255);
void loop() {

Am I doing anything obviously wrong? Would you expect your code to work on an arduino?
2 April 2013 20:46:05
Erm. I didn't set my pinMode. Sorry! Not sure how to delete a comment... It's working great now!
2 April 2013 21:56:55
Cool, thanks for the feedback. I'm glad it works for you too now. If you don't mind I'll leave your comment here and also add a remark in the instructions for use, so that others are reminded to set their pin modes as well...

I expect this type of thing happens quite often. I know that I've spent an impressive amount of time debugging projects where I had forgotten to set the correct pin modes.
28 May 2013 05:35:31
if you could manage to use that to feed input from a PC/RasPI/... directly to the LEDs you could get some highlevel control using
(e.g. to combine various multiple LED chains) We're always happy about testers/contributors :)
12 July 2013 20:36:22
I've been working on a browser-based JavaScript simulation of LED Strips, and I've been converting your color driver code. So far I've done Flares, Chasers, and Water Torture. Some of the code is still a bit sloppy, and the Water Torture code is currently the best encapsulation. I'm getting ready to refactor the others, then finish up with the Color Cycle code.

I might abstract some bits a little more before I'm done. It's basically a framework so that I can test my own color changing algorithms before going to hardware.

I'll put it up on Github soon, too.
12 July 2013 23:02:16
That's impressive! This could be very useful to test new patterns. On linux, I had to switch from firefox to chromium to make it work, but it's definitely promising. I can imagine that this is especially interesting for those who want to create different Christmas lighting patterns...
30 August 2013 20:19:32
What compiler are you using?  I'm trying to build this in Atmel Studio 6.1 (AVR-GCC/AVR-G++), and ws2811_8.h gives a compiler error

"inconsistent operand constraints in an 'asm'"

at line 177.
30 August 2013 22:10:31
I'm running this with AVR-GCC on linux from within eclipse. The avr-g++ version I have is 4.7.2 ("avr-g++ -v"). For which MCU are you compiling this? Could it be that your MCU doesn't have a C-port? In that case you can change the port by defining WS2811_PORT before including ws2811_8.h
Running avrdude from eclipse under linux
1 September 2013 07:02:17
I was trying to compile for the ATtiny10, and I did change the port to PORTB.  Tried targeting the Atmega88 and it worked fine, so it must just be the chip.  I was hoping the ATtiny would have worked.  Unlike some of the others like the ATtiny13, it actually does have an 8MHz internal oscillator, and the SOT-23 package would be pretty nice...
1 September 2013 07:11:26
Looks like the ATtiny45/85 will work.  Or at least it compiles.
1 September 2013 10:38:25
I'd like to reproduce the issue, but my version of avr-gcc won't even compile for attiny10, it says "avr-g++: error: unrecognized argument in option ‘-mmcu=attiny10’".
1 September 2013 11:11:38
Btw., given the methods described on this page, it should be possible to write an attiny13/9.6Mhz version. Single bits should then take 12 cycles where the short pulse takes 3 and the long pulse takes 9 cycles. That will render a shape that fits the specs for the ws2811. I guess that would be worthwhile since the tiny13 seems to be the cheapest version around (at around $5 for 10pcs).
I may spend some time on this, but don't let that stop you from trying it yourself, it's fun :-)
1 September 2013 23:27:21
I finally got around to refactoring my code some, and it's up on Github:

I was having problems converting your color cycle code to work right in my JavaScript framework. I might try to go back and revisit it sometime. The ColorWave driver is my own from-scratch replacement. My JS version uses native Math.sin(), but in Arduino land people might want to use integer-based approximations (there are plenty of code samples around).

The Chasers code was the first pattern driver I converted, and I was a little looser in translating from your code. With Water Torture, I started with a much more direct conversion, basically just copying your code and modifying it "in-place" into JavaScript. However, I know it has some issues, and I put a couple of "cheats" in.

But basically, I'm pretty happy with how the actual ws2812.js and ledstrip.js code came out.
1 September 2013 23:33:38
Oh, and I had Flares working with my original code, but I still need to refactor it to work with API updates in my ledstrip code since then. In its current form, it's pretty married to the assumption that certain global variables are available, and I want to make it more of a self-contained black-box. I'll probably end up changing it significantly from your original code.
3 September 2013 07:03:56
Yeah, apparently the ATtiny4/5/9/10 chips are pretty notoriously difficult to work with, but the $0.70 price tag and SOT-23 package are pretty sweet, especially for a project like this.
3 September 2013 22:21:11
OK. The GIT-repository now has updated code for 9.6Mhz MCUs like the attiny13. Be aware that most of the demos for my code won't fit in an attiny13 (both flash and RAM).
5 September 2013 22:10:31
They're still looking good. Will you be moving some of your versions of the animations to silicon as well?
14 October 2013 21:37:12
This is great.

One quick question.  I am trying to run it on hardware that has output pins on both PORTB and PORTD.  I can get the LEDS to work on one or the other, but not both.

Any thoughts?
19 October 2013 09:50:26
This code can send on one port at a time only and can't switch ports at run-time. If you're willing to spend twice the amount of code space, you could adapt ws2811_8.h so that you can include it twice, re-defining WS2811_PORT each time. You would need to remove the include guards, of course. It's not trivial, but it can be done:

One reasonably easy way to do that is to wrap the send function in a namespace. You can't use the token "WS2811_PORT" to generate that namespace name (all kinds of trouble because that token resolves to something like 'PORTB', which in turn gets preprocessed into other tokens) , so it would be easiest to just create a second define WS2811_NAMESPACE.

In summary, you could wrap the send(...) function (the three-argument one) in a namespace: "namespace WS2811_NAMESPACE { ..." and then, each time you include ws2811.h, you define both WS2811_PORT and WS2811_NAMESPACE. Finally, you can send to the right port using a statement like ws2811::namespace_for_port_x::send( my_leds, led_count, pin). Typically in a function, you would state "using ws2811::namespace_for_port_x" and then just use the send()-function.
18 January 2014 19:40:12
Hello Danny!
I appreciate your effort for writing this code, it's great.
Managed to compile your code and upload to an attiny2313, but some strange thing happens:
when connecting the attiny's output pin to the ledstip's DIN, some random, changing colors appears on the strip.
If the DIN of the ledstrip is not connected to attiny's pin, and I power cycle (turn down and then up) the dedicated power source of the strip, all leds are off.
Analyzed the signal generated by the program using a digital analyzer.
I would like to have a confirmation from you, if this is the desired output signal:
- 0 bit: 250us at 5v, 1000us at 0v
- 1 bit: 1000us at 5v, 250us at 0v

My code looks like:
 #define WS2811_PORT PORTB
 DDRB = _BV(channel);
 ws2811::rgb colors[1];
 colors[0].red = 0xff;
 colors[0].green = 0x00;
 colors[0].blue = 0xff;

I would appreciate a hint for solving this problem.

18 January 2014 19:58:35
0 bit: 0.25us at 5V and 1.0us at 0V
1 bit: 1.0us at 5V and 0.25us at 0V
18 January 2014 20:57:48
Hi Adam,
The timing you're seeing is indeed as intended. You should see the pin at 5V in its normal state, then drop low for at least 50us, then 24 bits of grb and finally the line should be at 5V again. There's nothing obviously wrong with the code you're showing either. Are you defining the WS2811_PORT symbol before including ws2811.h? And how is channel defined?
18 January 2014 22:20:00
Hei Danny,

Thanks for your quick answer!
static const uint8_t channel = 5;
#define F_CPU 8000000UL
#define WS2811_PORT PORTB
#include "ws2811.h"

Meanwhile I found out, that my ledstrip probably is not using ws2811 but ws2812 (led is integrated with control chip)
The chip on the strip has 6 pins (like in ws2812 spec); ws2811 has 8 pins.
In 2812 datasheet the timings are other that in the 2811 datasheet.
Probably this is the problem.

Thanks, again, for your answer
Best regards,
19 January 2014 18:08:24
Hey Danny,

Managed to find the problem, it seems that's the power source used (a PC power source).
Cut one led off the strip, and powered using the power from usb port, it's working.

Thanks again, for your feedback,
19 January 2014 21:45:10
You're welcome Adam. Good luck with your project!
21 January 2014 14:57:50
Hello Danny!
The problem was caused by using separate ground for microcontroller and ledstrip...
If both run on the same power source, it's working!

Best regards,
14 February 2014 21:16:39
Has anyone got this running in C? If so, could you please post the main.c code, because I can't manage it myself.
15 February 2014 10:03:43
All the demos are pretty much tied to C++. The basic driver code however should be relatively easy to adapt to a plain C compiler: (1) in ws2811_8.h and ws2811_96.h remove the namespace declaration, (2) in ws2811.h remove the whole part that's within the namespace declaration (plus the declaration itself). And finally in rgb.h remove the constructors.

You have to remember then, that the structs must be initialized in GRB-order, so a string of yellow leds must be initialized as:

struct rgb yellow[] = { {255, 255, 0}, { 255, 255, 0} /*etc...*/};
17 February 2014 20:07:07
Unfortunately, I don't think I can manage to turn this code into pure C.
17 February 2014 20:33:17
Which libraries are necessary for the water effect?
17 February 2014 21:08:21
Maybe someone could help me run water_torture in pure C in Eclipse? bigjack at
18 March 2014 18:57:26
Apart from the WS2811 driver code described here, there is no need for a library. I've added hyperlinks to the source code under each of the videos.
8 May 2014 08:06:24
Though I like the way you found the solution very much, your solution is over complicated.

Here is my code for 9.6MHz (attiny13A) : 

 ldi r16, 8        ; bits counter
 sbi PORTB, PORTB3 ; L6-L8-H0 L3-L5-H0
 lsl r20 ; H0-H1 H0-H1
 brcs WS2801_red_send_1 ; H1-H2 H1-H3
 cbi PORTB, PORTB3 ; H2-H4-L0 -----
 nop ; L0-L1 H3-H4
 cbi PORTB, PORTB3 ; L1-L3 H4-H6-L0
 dec r16 ; L3-L4 L0-L1
 brne WS2812_red_bit ; L4-L6 L1-L3

Repeat 3 times for GRB.

On a slower AVR, you can save 2 cycles by using OUTs instead of SBI/CBI, then you'll need to lose one extra cycle IIRC.

On both 9.6 and 8MHz mcu, you can be fully compliant with description of the protocol given in the datasheet, no need to make assumption about what the driver really expects.
8 May 2014 23:14:33

First of all: welcome fellow clock tick counter.

The code above indeed fits in 12 cycles and if you want to drive a single WS2812 LED it would indeed suffice to repeat the code three times (allocating one register each for the G, R and B value). Note however that the code I describe in the article drives an array of LEDs, not just one. Also, multiplying the above code three times takes 27 instruction words, while the equivalent part of the simple (non-sparse) array driver for the attiny13 takes 22 instruction words.

Additionally, the code in my article uses a 16-bit register for the byte counter (bytes = LEDs X 3), which is admittedly overkill for the attiny13 with barely enough ram for about 13 RGB-leds, but certainly useful for 8Mhz atmegas and the sparse version of the attiny13 driver, if you want to be able to drive more than 85 LEDs (the most that an 8-bit byte counter can cover, since 255 / 3 = 85).

In summary: yes, for a single LED the code you show is sufficient and simple enough if you don't mind spending a few bytes more on code. For a string of LEDs however, you'd need something more complex than the code you're showing. I don't know if my code is the simplest for that job, so I'd encourage you to give it a try yourself. I have to warn you though, it's highly addictive...

10 May 2014 22:24:25
mmm, actually the code I show can be put inside a loop, because the datasheet tells that if you want to start a new series of values, you have to wait more than 50µs.

Technically, it does not mean you can wait 49µs before sending the next value, it's shorter than that, but on a tiny13 at 9.6MHz, I have enough time to do some interpolation on the rgb values and loop before I need to send the next value. I did not try to send more than 64 values (because I have only 64 WS2811 soldered), but 64 works fine :)
11 May 2014 01:09:12
The datasheet mentions a minimum delay. So we know that the chain will be reset by some inter-LED delay between 0 and 50µs. Where the threshold lies in the population of ws2812s is probably determined by some bell-curve and I don't know how wide that curve is. There was quite some discussion about what an acceptable delay would be in several fora and many people share your opinion and have reported successes with libraries like FastSPI, which, as I understand it, also introduces some inter-LED delay.

I won't repeat the discussion here, suffice it to say that I chose the design criterion that the timing should be (quoting from my own page) "as rock-steady as Alan's", which means a zero inter-LED delay.