Driving the WS2811 at 800 kHz with an 8 MHz AVR

From Just in Time

WS2811 LED controllers are hot. Projects using WS2811 (or WS2812, WS2812B or NeoPixel) LED strips have been featured on HackaDay several times in the last few months. One feature showed how an AVR clocked at 16 MHz could send data at the required high rates. Inspired by this, I ordered an LED strip and 16 MHz oscillators from eBay. The LED strip arrived quickly, but the oscillators took weeks to arrive, which gave me plenty of time to think about the possibility of driving these LED strips from an 8 MHz ATmega88 without an external oscillator. With only 10 clock ticks per bit, this was going to be a challenge.

It turns out that this is doable, with the same output timing as the 16 MHz version...

On this page I'll describe how to drive a WS2811 from an 8 MHz or 9.6 MHz AVR like an Atmel ATmega88, ATtiny2313 or ATtiny13 without added components such as an external oscillator.

In order to write the first version of this time-critical code:

  • I used a spreadsheet (or in fact: some tabular layout) instead of an IDE so that I could keep an eye on the timing of each instruction I write;
  • I combined assembly code fragments using a dedicated C++ program to minimize jump distances.

In the latest version of the driver, the C++ program wasn't needed anymore, because the current driver is so small (32 instructions) that it falls completely within the range of an AVR conditional jump instruction.

And of course, if you're creating a hardware project that controls more than 1 LED, you're going to have to demonstrate it with a Knight Rider LED sequence (which, I just learned, is actually called a Larson scanner)... The sources for all the demonstrations in these videos can be found on github.

Knight Rider on Steroids (source code)
"Water Torture" or "Lava Drops" demo (source code, details)
Special sparse driver allows an attiny13 to drive arbitrarily large LED strings from 64 bytes of memory
Flares demo on an attiny13 (source code)

Download source code

For the impatient: example source code can be found here. The code comes as an avr-eclipse project consisting largely of C++ demonstration code, with the main driver function in assembly in files ws2811_8.h and ws2811_96.h (for the 9.6 MHz version). I don't recommend trying to understand the assembly code by reading these sources. How the code functions is described below. Usage information can be found after the videos. The rest of this page describes the 8 MHz version. The 9.6 MHz code was added later, but was created in the same way.


You'll need the C++ compiler for this to work (turning ws2811.h into "pure C" is left as an exercise for the reader). I am told that this works just as well for an Arduino, but I haven't tested this myself. From the sources, you'll need files ws2811.h, ws2811_8.h, ws2811_96.h and rgb.h. A simple example of how to use this code:

#include <avr/io.h> // for _BV()
#define WS2811_PORT PORTD // ** your port here **
#include "ws2811.h" // this will auto-select the 8 MHz or 9.6 MHz version
using ws2811::rgb;
namespace {
  const int output_pin = 3;
  rgb buffer[] = { rgb(255,255,255), rgb(0,0,255)};
}
int main()
{
  // don't forget to configure your output pin,
  // the library doesn't do that for you.
  // in this example DDRD, because we're using PORTD.
  DDRD = _BV( output_pin);
  // send the RGB-values in buffer via pin 3
  // you can control up to 8 led strips from one AVR with this code, as long as they
  // are connected to pins of the same port. Just
  // provide the pin number that you want to send the values to here.
  send( buffer, output_pin);
  // alternatively, if you don't statically know the size of the buffer
  // or you have a pointer-to-rgb instead of an array-of-rgb.
  send( buffer, sizeof buffer/ sizeof buffer[0], output_pin);
}


Normally I'd go straight to the datasheet and start working from there, but in this particular case the datasheets are not very informative. Luckily, the HackaDay links provide some excellent discussions. This one by Alan Burlison is especially helpful. That article not only explains in great detail why a library like FastSPI isn't guaranteed to work, but it also comes with working code for a 16 MHz AVR that appears rock solid in its timing.

Small problem: I didn't have any 16 MHz crystals in stock, so I ordered a few, on eBay again, and sat back for the 25 day shipping time to pass. 25 days is a long time. The LED strip had arrived and was sitting on my desk. 25 days is a really long time. Maybe it could work off an AVR on its internal 8 MHz oscillator? It would be a lot of work. But 25 days is a very, very long time.

So, that is how I got to sit down and write my 8 MHz version of a WS2811@800 kHz bit banger. The challenge is of course that I have 10 clock cycles for every bit, no more no less, and 80 cycles for every byte, no more no less. I wanted the timing to be as rock-steady as Alan's, give-or-take the imprecise nature of the AVR internal oscillator. The part about it being steady was important to me. People have argued that the code can be made a lot easier if you're willing to have a few extra clock cycles in between bytes or triplets and that such code works for them. I agree that such code is a lot easier to create or read. It's trivial, in fact. However, the WS2811's datasheets are ambiguous at best with regard to the maximum allowed delay between bytes (or bits) and anyway, I liked the challenge of trying to have zero clock ticks delay between bytes or triplets.

The challenge

For a full description of the required protocol to communicate with a WS2811, please refer to either Alan's page or the datasheet. In summary, the microcontroller should send a serial signal containing 3 bytes for every LED in the chain, in GRB-order. The bits of this signal are encoded in a special way. See the figure below.

illustration of a WS2811 waveform

This image shows a sequence of a "0" followed by a "1". Every bit starts with a rising edge. For zeros, the signal drops back to low "quickly", while for ones the signal stays high and drops nearer the end of the bit. I've chosen the following timing, in line with Alan's observations and recommendations:

  • Zero: 250ns up, 1000ns down
  • One: 1000ns up, 250ns down

This gives a total duration of 1250ns for every bit, or 10μs per byte. These timings do not fall in the ranges permitted by the data sheet, but Alan describes clearly why that should not be a problem. At 8 MHz, 1250ns means 10 clock ticks per bit. That is not a lot. A typical, naive implementation would need to do the following things at every bit:

  1. determine whether the next bit is a 1 or a 0
  2. decrease a bit counter and determine if the end of a byte has been reached, if at the end:
    1. determine if we're at the end of the total sequence
    2. load a new byte in the data register
    3. decrement the byte counter
    4. reset the bit counter
  3. jump back to the first step

Oh yes, and that is of course in addition to actually switching the output levels.

All of that does not fit into a single 10-clock time frame. Luckily, it doesn't have to. My first version of this driver partially unrolled the bit loop into a 2-bit loop. This allowed all those actions described above to fit within the loop, but it also required 4 versions of the loop (one for every 2-bit combination). The code would jump from one version of the loop to the other as appropriate.

When writing code for the 9.6 MHz version and the version for sparse LED strings (strings where most LEDs are off), I figured out a way to have one small loop for each bit, with the code for the last two bits unrolled, giving enough time to fetch the next byte and reset the bit counter. This resulted in the much smaller driver code that I have now.
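
The timing budget discussed in this section can be checked with a little arithmetic. What follows is a minimal host-side C++ sketch, not part of the driver sources: the names ticks and bit_waveform are made up for this illustration, and the pulse lengths are simply the ones chosen above.

```cpp
#include <cstdint>
#include <string>

// Timings chosen in the text (nanoseconds).
constexpr std::uint32_t bit_period_ns  = 1250;
constexpr std::uint32_t short_pulse_ns = 250;   // high time of a "0"
constexpr std::uint32_t long_pulse_ns  = 1000;  // high time of a "1"

// Number of whole clock ticks that fit in a duration (hypothetical helper).
constexpr std::uint32_t ticks(std::uint32_t clock_hz, std::uint32_t ns)
{
    // 64-bit intermediate so that clock_hz * ns cannot overflow
    return static_cast<std::uint32_t>(
        (static_cast<std::uint64_t>(clock_hz) * ns) / 1000000000ULL);
}

// One character per clock tick: 'H' while the line is high, 'L' while low.
std::string bit_waveform(bool bit, std::uint32_t clock_hz = 8000000)
{
    const std::uint32_t total = ticks(clock_hz, bit_period_ns);
    const std::uint32_t high  = ticks(clock_hz, bit ? long_pulse_ns : short_pulse_ns);
    return std::string(high, 'H') + std::string(total - high, 'L');
}
```

At 8 MHz a bit takes 10 ticks: a "0" is 2 ticks high and 8 ticks low, a "1" is 8 ticks high and 2 ticks low. The same period is 20 ticks at 16 MHz, which is why the 16 MHz code has so much more room.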

Defining the puzzle

Inventing a notation

Juggling with many states, jumping from one piece of code to the other without introducing phase errors turns out to be interesting. I spent a couple of lonely lunch breaks and several pages in my little (paper!) notebook before I even figured out how to describe the problem. When a notation became clear, however, the going was easy enough and this exercise turned into one of the nicer kinds of pastimes.

Ws2811 driver code.png

The image above shows the full code for the driver in a spreadsheet with pseudo assembly code in the yellow blocks. To the left of each yellow block is a graphic representing the wave form being generated. Tilt your head to the right to see the more conventional waveform graphic. The blue blocks show where the signal could be high or low, depending on the current bit value being sent. Each horizontal row in the yellow blocks represents a clock tick, not necessarily an instruction word. To the left of each waveform graphic there are numbers from 00 to 19 that represent the "phase" at the corresponding clock tick. Phases 00-09 are those of the first 7 bits, phases 10-19 are those of the last bit.

What makes this notation so convenient is the fact that I can now easily determine the waveform phase at each point in the code and can also check whether a jump lands in the correct phase. Each jump at phase n (0 <= n <= 9) should land at a label which is placed at phase n + 2 (modulo 10), because jumps take 2 clock cycles. Put differently: each jump should be to a label that is two lines down from the jump location (or 8 or 18 lines up).

The drawn waveforms make it easy to verify that when I jump from the middle of a wave, the code lands in a place where that same waveform is continued. They also show clearly where the 'up' and 'down' statements that set the actual signal levels need to go.

Wherever there is a "^^^" in the table, it means that the previous instruction takes 2 clock cycles, so that particular clock cycle still belongs to the previous instruction.
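
The phase rule for jumps can be written down as a small check. This is only a sketch of the rule as described above, not something taken from the spreadsheet or the driver; the function names are hypothetical, and it assumes that the position within a bit waveform is the phase number modulo 10.

```cpp
// Phases 00-09 belong to the first seven bits, phases 10-19 to the last bit.
// In both cases the position within the bit waveform is phase % 10.
constexpr int ticks_per_bit = 10;
constexpr int jump_cycles   = 2;  // a taken jump costs 2 clock cycles

constexpr int waveform_position(int phase)
{
    return phase % ticks_per_bit;
}

// True if a jump executed at jump_phase continues the waveform without a
// phase error when it lands on a label placed at label_phase.
constexpr bool jump_keeps_phase(int jump_phase, int label_phase)
{
    return waveform_position(label_phase)
        == waveform_position(jump_phase + jump_cycles);
}
```

For example, a jump at phase 04 may land on a label at phase 06, and a jump at phase 15 (in the last bit) may land on a label at phase 07, but a jump at phase 04 to a label at phase 07 would shift the waveform by one clock tick.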

How the code works

In summary, the code works as follows: The start of a bit waveform occurs at label s00. At this point the value of the bit to be sent is assumed to be in the carry flag. The line is pulled high and if the current bit (carry flag) is a zero bit, it is pulled low two clock cycles later. Then a bit counter is decreased and if we're not in the second-to-last bit, we continue the second half of the waveform by jumping to label cont06, which is above s00. From cont06 the code just waits a while, then brings the line down (which has no effect if the line was already brought down) and shifts the next bit from the data byte into the carry flag. From here the code falls back into label s00, ready to transmit the next bit.

If we were in the second-to-last bit, the code continues downward after label skip03. We need to free up the data register for the next byte, so we quickly test the last bit of the current byte and then branch into one of two essentially equivalent pieces of code. The code on the left hand side generates a "1"-waveform, while the code on the right generates a "0" for the last bit of the byte. In between the OUT-instructions we find some free cycles to reset the bit counter (to 7), to load the next byte and to decrease the 16-bit byte counter. If indeed there is a next byte to send, we jump up to either label cont07 or cont09 where the rest of the bit waveform is generated before we continue with the bits of the next byte.
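
The control flow described above can be modelled on a PC. The sketch below is a host-side C++ model of the structure only (a short loop for the first seven bits, then a separate step for the last bit during which the next byte is fetched); it is not the assembly driver and says nothing about exact instruction placement. The function name simulate_send is made up, and the model assumes bits are sent most significant bit first.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Returns the line level per clock tick ('H' or 'L') for the given bytes,
// using the 8 MHz timing: "1" = 8 ticks high + 2 low, "0" = 2 high + 8 low.
std::string simulate_send(const std::uint8_t* data, std::size_t count)
{
    const std::string one  = "HHHHHHHHLL";
    const std::string zero = "HHLLLLLLLL";
    std::string line;
    if (count == 0) return line;

    std::uint8_t current = *data++;
    for (;;) {
        // First seven bits: shift the top bit out (the "carry") and send it.
        for (int bits_left = 7; bits_left > 0; --bits_left) {
            const bool carry = (current & 0x80) != 0;
            current = static_cast<std::uint8_t>(current << 1);
            line += carry ? one : zero;
        }
        // Last bit: test it first, so the data register is free for the
        // next byte while this bit's waveform is still being generated.
        const bool last = (current & 0x80) != 0;
        --count;                           // decrease the byte counter
        if (count != 0) current = *data++; // load the next byte
        line += last ? one : zero;
        if (count == 0) break;
    }
    return line;
}
```

Every byte contributes exactly 80 ticks to the result, with no extra ticks between bytes, which is the property the real driver is built around.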

Combining the code

The latest version of the code is pretty small (32 instructions/64 bytes), but earlier versions were bigger, requiring jumps over longer address distances. This posed a problem, because a jump from the end of the code right to the beginning would be too long for the branch instructions of the AVR.

Note how all conditional jumps are in the form of branch instructions ("BRCC", "BREQ", etc). There is one important limit to these relative branches: they can only jump to a range of [PC - 63, PC + 64] (with PC the address of the jump instruction)! Any instruction more than 64 instructions away from the branch cannot be reached.
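
As a small illustration, the range limit can be expressed as a predicate. The function name is hypothetical; addresses are counted in instruction words, and the bounds are the ones given above.

```cpp
// True if a conditional branch located at branch_address (in instruction
// words) can reach target_address, using the range [PC - 63, PC + 64].
constexpr bool branch_reaches(int branch_address, int target_address)
{
    return target_address - branch_address >= -63
        && target_address - branch_address <= 64;
}
```

With a branch at address 100, targets 37 through 164 are reachable; 36 and 165 are not.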

At first I tried to piece the code together manually in a spreadsheet that would calculate the maximum jump distance for me. After a few failed attempts I gave up and decided that computers are better at this. In the end, I just wrote a dedicated program in C++ that uses some common sense heuristics to shuffle the blocks of code around until it finds a sequence in which all jumps are within range.
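
To give an idea of what such a program has to do, here is a much simplified sketch. It is not the original program: it replaces the heuristics with a plain brute-force search over all block orderings (only practical for a handful of blocks), it assumes every jump targets the first instruction of a block, and all names (jump, layout_ok, find_layout) are invented for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// A branch located from_offset words into from_block, jumping to the
// first instruction of to_block (a simplifying assumption).
struct jump {
    std::size_t from_block;
    int         from_offset;
    std::size_t to_block;
};

// Checks whether all jumps are within branch range when the blocks are
// placed in memory in the given order. sizes[i] is block i's length in words.
bool layout_ok(const std::vector<std::size_t>& order,
               const std::vector<int>& sizes,
               const std::vector<jump>& jumps)
{
    std::vector<int> start(sizes.size());
    int address = 0;
    for (std::size_t block : order) {
        start[block] = address;
        address += sizes[block];
    }
    for (const jump& j : jumps) {
        const int distance =
            start[j.to_block] - (start[j.from_block] + j.from_offset);
        if (distance < -63 || distance > 64) return false;
    }
    return true;
}

// Tries every ordering; returns the first one that works, or an empty
// vector if no ordering keeps all jumps in range.
std::vector<std::size_t> find_layout(const std::vector<int>& sizes,
                                     const std::vector<jump>& jumps)
{
    std::vector<std::size_t> order(sizes.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    do {
        if (layout_ok(order, sizes, jumps)) return order;
    } while (std::next_permutation(order.begin(), order.end()));
    return {};
}
```

A heuristic search becomes necessary as soon as the number of blocks grows, because the number of orderings grows factorially.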

After this, it became a matter of just pasting the code blocks into one sequence and changing some of the pseudo instructions into real instructions.


The main point of this text is not that I can show 4 (four!) Larson scanners in one LED strip. Actually there are two different points I am trying to make:

First of all, it is possible to control WS2811 LED strips from an AVR without an external 16 MHz oscillator and I want to tell the world.

Secondly, during this exercise I discovered that this kind of extremely time-critical problem can be solved with a number of techniques:

  • Unroll loops. That is not a new technique, but in this case it not only saves on the number of test-and-jump-to-the-starts (the normal reason to unroll a loop), but also decreases the number of other tests and allows me to sweep a few precious left-over clock cycles into contiguous blocks.
  • Write a conditional jump in such a way that the jump is made in case there is not much left to do, saving a precious clock cycle for the busy case. Ignore the reflex to jump in "exceptional cases", trying to minimize the total number of times a jump is made.
  • When code is "phase critical", abandon the idea of a list-of-instructions and organize the code in "phase aligned" side-by-side blocks, where a jump is most often a jump "to the right" or "left".
  • Use software to optimize code layout in memory. I am not aware of any assembler that will automatically do this when jump labels are out of reach, but I know I have wished for it more than once.
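
The second technique in this list comes down to simple cycle counting: an AVR conditional branch costs 2 clock cycles when it is taken and 1 when it is not. The sketch below uses made-up names and made-up example numbers to compare the two ways of arranging a branch between a "busy" path and an "idle" path.

```cpp
constexpr int branch_taken_cycles     = 2;
constexpr int branch_not_taken_cycles = 1;

constexpr int larger(int a, int b) { return a > b ? a : b; }

// Worst-case cycles after a conditional branch, given the work (in cycles)
// on the busy path and on the idle path. If jump_on_idle is true, the
// branch is taken in the idle case and the busy case falls through.
constexpr int worst_case_cycles(int busy_work, int idle_work, bool jump_on_idle)
{
    return jump_on_idle
        ? larger(busy_work + branch_not_taken_cycles,
                 idle_work + branch_taken_cycles)
        : larger(busy_work + branch_taken_cycles,
                 idle_work + branch_not_taken_cycles);
}
```

With 8 cycles of work on the busy path and 3 on the idle path, jumping in the idle case gives a worst case of 9 cycles, while jumping in the busy case gives 10. When the whole budget is 10 cycles, that single cycle matters.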

Comments? Questions?