Integer Systolic Array Implementation in Verilog - Dot Product, Elementwise, Memory Manipulation, Branching with custom ISA

Eshaancoding/Astra

Project Astra

This project attempts to create a hybrid between a CPU and a controlled systolic array, primarily for machine learning applications, though it could be extended to other scientific uses. It specializes in summation, multiplication, and other linear algebra operations.

It is not like a graphics processing unit, which computes shaders and the like; it is more of a "linear algebra" processing unit that handles the heavier mathematical operations.

Design

Control Unit:

Systolic Array:

Pay special attention to the actual systolic array and the selectors. The control unit, RAM, etc. are detailed in the picture above.

Detailed Control Flow:

Each instruction goes through 4 clock cycles (each instruction is 4 cycles). The diagram also showcases the pipelining (a new instruction every 2 clock cycles; it can't be every cycle, as that would conflict with the RAM).
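As a rough model of the schedule above, a toy calculation: with 4 stages per instruction and a new instruction issuing every 2 cycles, instruction `i` finishes around cycle `2*i + 4`. This is an illustrative assumption for intuition, not the actual RTL timing.

```rust
// Toy model of the 4-stage, issue-every-2-cycles schedule described
// above (an assumption for illustration, not the actual RTL timing).

fn finish_cycle(instr_idx: u64) -> u64 {
    // instruction `instr_idx` issues at cycle 2 * instr_idx and
    // then occupies 4 stages, one per cycle
    2 * instr_idx + 4
}

fn main() {
    // without pipelining, n instructions take 4n cycles;
    // with 2-cycle issue they take roughly 2n + 2 cycles
    let n = 100u64;
    let unpipelined = 4 * n;
    let pipelined = finish_cycle(n - 1);
    assert_eq!(unpipelined, 400);
    assert_eq!(pipelined, 202);
    println!("{unpipelined} vs {pipelined} cycles");
}
```

The asymptotic win is the issue rate: throughput roughly doubles once the pipeline is full.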

ISA

Refer to testbench/src/lldriver/decl.rs

Hardware Todo/Done

note: switch between cdylib and rlib in Cargo.toml under testbench, as needed

Todo

---------- this entire thing should take you like 2 days if you lock in

  1. Implement idea 5 <-- to be tested

  2. Go for 6 by 6 systolic array.

  3. remove W and WT instructions and accumulator (you don't really need it).

    • remove OD
  4. CC, RC, and full constant across the entire weight + run @ 1 clock cycle.

----------- back to software implementation

  1. Finish rest of the implementation from Astra IR --> Chip IR. (1 day)

    • dot prod, elw add, extra memory stuff, specialize functions, etc.
  2. IR, IR, IR optimizations (a week easily...)

    • First do value tracking (1s and 0s)
    • Then do equivalent operation tracking (e.g., applying ac = 1/ac twice is just ac again). Find other mathematical identities as well
    • Prove that you need minimal caching + smart register movements.
    • only then... do you prove the validity of this entire project in and of itself; after 1y...
  3. Implement more popular architectures -- Transformers, GRU, etc. (2 days?)

    • Then go back to IR optimization and see if there's anything you can find
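The "value tracking (1s and 0s)" pass above could look something like this sketch: a tiny expression IR where multiplies by 1 and adds of 0 fold away. The `Op` enum and all names here are hypothetical, not the real Astra IR.

```rust
// Hypothetical sketch of the value-tracking pass: fold x*1, x+0, x*0
// and constants out of a tiny expression IR. Not the real Astra IR.

#[derive(Debug, Clone, PartialEq)]
enum Op {
    Const(i64),
    Reg(String),
    Add(Box<Op>, Box<Op>),
    Mul(Box<Op>, Box<Op>),
}

fn simplify(op: Op) -> Op {
    match op {
        Op::Add(a, b) => {
            let (a, b) = (simplify(*a), simplify(*b));
            match (a, b) {
                (Op::Const(0), x) | (x, Op::Const(0)) => x, // x + 0 = x
                (Op::Const(m), Op::Const(n)) => Op::Const(m + n), // constant fold
                (a, b) => Op::Add(Box::new(a), Box::new(b)),
            }
        }
        Op::Mul(a, b) => {
            let (a, b) = (simplify(*a), simplify(*b));
            match (a, b) {
                (Op::Const(1), x) | (x, Op::Const(1)) => x, // x * 1 = x
                (Op::Const(0), _) | (_, Op::Const(0)) => Op::Const(0), // x * 0 = 0
                (Op::Const(m), Op::Const(n)) => Op::Const(m * n),
                (a, b) => Op::Mul(Box::new(a), Box::new(b)),
            }
        }
        other => other,
    }
}

fn main() {
    // (x * 1) + 0 simplifies to just x
    let expr = Op::Add(
        Box::new(Op::Mul(Box::new(Op::Reg("x".into())), Box::new(Op::Const(1)))),
        Box::new(Op::Const(0)),
    );
    assert_eq!(simplify(expr), Op::Reg("x".into()));
    println!("ok");
}
```

Equivalent-operation tracking would sit on top of this same recursion, pattern-matching longer chains instead of single nodes.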

---------------- TWO weeks and I'm already here

  1. Implement more features for autograd, basically (2-3 days)

    • Data Feeder class (can memory be allocated at the same time as kernels run, or do they have to be sequential?)
    • Data Loader
    • More optimizations
    • More neural net arch
  2. CUDA OPTIMIZATION! (this needs to be split up, and honestly more research, before I can figure out the timing for this)

    • can we have such advanced kernel fusion and optimization that we come up with something like flash attention?
    • focus on memory aware optimization

------------- extra chip work that I keep putting to the side

  1. Clean up Chip.sv

    • have RCSelect.enable as part of ID
    • PC.set_pc, ram_data_in, etc.
    • ID out should have as few output ports as possible.
    • Already done for RAM.
  2. Implement floating point!

    • figure out and fine-tune cos/sin/inv and the other specialized functions as well.
    • High-end FPGAs should be able to support basic FP using their DSPs.
    • not synthesizable
    • LUTs shared across multiple ALUs
    • experiment with int8 opt + mixed precision
  3. Make sure the design is synthesizable.

    • Look at common synthesis tools
    • find the size/needs of LUTs, etc.

IDEAS

  1. Do integer math... but the weights, inputs, and results will have completely different 10^x scales

    • x is... to be determined; idk if we want it dynamic or fixed.
    • systolic array will be majority INTEGER, but I can also add a few FP32 compute units in there as well.
    • Need to do some analysis on common Transformer dot products and sums.
    • figure out the dot product * 10^x scaling and signs. write code for that
    • NOT NEW: This is called fixed point. Read more of the literature on this.
  2. If you have a dual channel RAM, then enable instructions for those

    • This really depends on the FPGA specifications, which I am asking about rn
    • Memory specs are pretty demanding for an optimal chip
      • separate memory for instructions (could act as a "cache", but that kinda moves away from that notion entirely)
        • could be BRAM in an FPGA
      • dual port DDR4/DDR5 RAM.
        • I could read and write at the same time;
        • For W or WT, read twice (double bandwidth)
        • For X or GX (future instruction for reading and setting to output)
          • one port read; one port write.
        • sigh... please please I want a ram like that plz
    • for now, I won't support anything like that.
  3. Purely clock-less systolic array

    • Multiplication must be purely combinational
    • Addition must be purely combinational
    • The x/weight registers can be "clock-less" by using (column && row_enable[0]) as the clock signal itself
      • Still, you will face propagation delays, so be warned :/
    • Sadly, you can't really test this on an FPGA; you need actual dedicated hardware to do this.
      • how to do dedicated hardware? good question
    • This should be possible
      • data available before the next clock pulse
      • select happens after some delay;
        • might need to do some negative clock pulse weird thing
        • or shifting some idk
    • Could be more efficient than existing research designs
  4. Have the FPUs have their own "registers," so to speak.

    • Basically, it's controlled all by a single control unit, but the result of multiplication OR ADDITION can be stored in register B
    • we can of course, still set register W
    • The entire purpose of this is to ensure that elementwise multiplication and addition won't be the limiting factor of our chip
      • Amdahl's law
      • NVIDIA gpus might fare better because they focused more on this (and then abstracted from dot product)
      • I am doing the reverse: abstracting from dot product to elementwise multiplication
    • This is listed as one of my worries for this project.
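The fixed-point scheme in idea 1 can be sketched in software: keep everything as integers, with each value carrying a power-of-ten scale. The `Fixed` type and names here are illustrative, not part of the chip.

```rust
// Sketch of the fixed-point idea: integer payloads with an explicit
// power-of-ten scale. `Fixed` and its API are hypothetical names.

#[derive(Debug, Clone, Copy)]
struct Fixed {
    value: i64, // integer payload; represents value * 10^-scale
    scale: u32, // number of decimal places
}

impl Fixed {
    fn from_f64(x: f64, scale: u32) -> Self {
        Fixed { value: (x * 10f64.powi(scale as i32)).round() as i64, scale }
    }
    fn to_f64(self) -> f64 {
        self.value as f64 / 10f64.powi(self.scale as i32)
    }
    // Multiplying adds the scales; the chip would rescale afterwards
    // to keep values in range.
    fn mul(self, other: Fixed) -> Fixed {
        Fixed { value: self.value * other.value, scale: self.scale + other.scale }
    }
}

// A dot product stays purely integer; only the final scale differs.
fn dot(xs: &[Fixed], ws: &[Fixed]) -> Fixed {
    let mut acc = xs[0].mul(ws[0]);
    for i in 1..xs.len() {
        let term = xs[i].mul(ws[i]);
        assert_eq!(term.scale, acc.scale); // same scale => plain integer add
        acc.value += term.value;
    }
    acc
}

fn main() {
    let x = [Fixed::from_f64(1.5, 2), Fixed::from_f64(-0.25, 2)];
    let w = [Fixed::from_f64(2.0, 1), Fixed::from_f64(4.0, 1)];
    let y = dot(&x, &w);
    // 1.5*2.0 + (-0.25)*4.0 = 2.0
    assert!((y.to_f64() - 2.0).abs() < 1e-9);
    println!("{}", y.to_f64());
}
```

Whether the scale is fixed per tensor or chosen dynamically is exactly the open question in the idea above; this sketch just shows why the array itself never needs to see a floating-point number.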

Done

  • Update special functions similar to tinygrad.
    • Functions:
      • exp2, log2, sqrt, recip, sin (done), neg (need opt)
  • Add incr setting
    • pretty simple
    • But add it to SW and SR (incr address reg) and repeat instructions (incr row/col)
  • Create accumulator as part of the conditioner
    • only set accumulator on next_pc.
    • Helpful at dot product accumulation
      1. CS 7 <-- set accumulator mode
      2. C <-- Fill intermediate dp 1
      3. C <-- Fill intermediate dp 2 (next_pc; sets accumulator to previous output)
      4. C <-- Fill intermediate dp 3
      5. G <-- Get instruction
      • Note that you can have multiple C by doing IR.
        • Might be helpful to have two different instructions --> One for incr column last and one for not.
    • have the reset instruction also reset the accumulator, OR reset it when CS is set to any other instruction.
  • More streamlined
    • W and G will always start AND end with col_out 0001 and row_out 00010 (avoiding x).
    • Have the R instruction incr the column at the end;
    • Shift G_INSTR into idx 3 somehow.
    • Have it interact with the grid with data as much as possible
  • Add transpose feature
    • WT transpose feature
    • C fill column (have to rename)
    • Since we are adding more instructions:
      • use only two instructions to control row/col selection
        • Select, select data
        • Change the driver to support this well
          • do comparator select asw
        • More streamlined
    • Then...
      • SW instructions if the WT, C, etc. transpose operations can't be done.
        • From addr to addr + data from MOV
      • SR set instruction to the RAM just for debugging
        • Sets addr and data in RAM.
  • Add a BR instruction that also puts PC as returning address.
    • Also, create a separate address just for the return address
    • Possibly remove "Data" registers; you have to point the CS into something different.
    • Figure something out in terms of control; don't add so many instructions for this either.
    • "Address" and "Data" register are still too broad for anything IMO.
    • Furthermore, deal with address offset
      • a 10 bit address won't be enough for BR instructions or stuff like X (other repeat instructions), IR, etc.
  • Add all the combinations of systolic array you need.
    • ADD:

      • Comparator
      • OR output to row instruction
        • get the current output and then sends to the current row (starts from 0)
      • OD output to diag instruction
        • get the current output and then sends across the diag
      • ICD with data enabled
        • similar to IC, but inputs data
      • IRD with data enabled
        • Similar to IR, but inputs data
      • Have R also reset the col and row select.
      • D (a diag instruction with size 4 across the systolic array).
        • Probably need to perform reset before any diag
      • IR, IC, XR, LR and 0C just for good measure (without data).
      • R a fill row instruction. Start with instruction and then fill the row (size 4)
      • C a fill row instruction of a constant. Similar to R but you don't increment address
    • REMOVE

      • All systolic array manipulation except
        • EX, NA, RG
      • Remove GD
    • Create testbenches for

      • Elw (X, D, RG) & constant (IC, R) multiply
      • Elw (IC, R) & constant (IC, C, R) add
      • comparator select (test conditioner) CS
    • ~~Make sure all the other existing testbenches work~~

      • may need to change some instructions
  • Implement more general systolic array manipulation
    • instead of IR for just one single instruction, create a general W instruction that fills the entire matrix, with stride too.
    • Removes the need for an intermediate IR
      • also may be faster since it doesn't have to instruction-fetch every time.
      • also find out whether you can change the row and feed data at the same time
        • you probably already do this?
      • This also saves program size
      • Our program size will be limited anyway (ex: Block RAM on an FPGA) since most of the memory will be actual data.
    • Not only this, but even if we do not have a separate memory for instructions (idea 2.), we save clock cycles.
    • Will be known as REPEAT instructions
  • Add an operation to go from the resultant vector back to the systolic array according to the specific row idx. :)
    • we store "intermediary operations" within the systolic array this way
    • do not go to RAM
    • The GG instruction
      • This allows interesting ways to represent matmuls, where you can always store the intermediate state as the first column of the systolic array (x-value is 1), and load the weights into the 2nd to last columns. Not sure how feasible this will be
      • This will provide the building blocks of the actual ML compiler.
  • Add a GD instruction. Get from result and then feed towards the data register
  • Use BR and BRE instructions
    • Conditionals. Pretty self-explanatory
  • Add a special ALU on the bottom of every output in the grid (conditioner)
    • Support less than 0, more than 0, equal to 0, shifts, etc.
      • This will be our layer wise compute basically.
    • Want to look more into special functions before I do this...
    • And/Or conditions will be sort of interesting to represent
      • Or will be represented as +. If each condition is a 1 or 0, then an or will sum up all the conditions. Then we can use the more than 0 instruction to get the final or
      • And can be represented as a multiplication. If any of conditions are 0, then the final multiplication output is 0. Then, you can use more than 0
    • However, I added And/Ors in case you want to use them
  • General register manipulation support
    • RAM <--> Register
  • For GI, increment the row 4x (or however big the systolic array is) instead of just 1x
    • Pretty complicated: it has to be integrated with pipelining + the stop-mod counter in the Control unit.
  • Add branching
    • Make the out bus the output of the branching pipeline.
    • Like a real CPU!
  • Add reset instruction:
    • set all the x-reg to 1, set all the w's to 0s :)
    • useful for clearing from matmul to not matmul (linear algebra addition, etc.)
  • Add an instruction for increment both column and row.
  • Add pipelining towards the CPU
    • Currently, CPU Stages are:
      1. fetch instruction
      2. decode instruction (and setting ram addr for data)
      3. change column/row selector (the data will be received from ram by this moment)
      4. change grid values (the data will be set to the registers)
    • We could make it so that during stages 3 and 4 we start the fetch/decode of the next instruction.
      1. at stage 3, we will be setting the address and pulling write enable low AT THE SAME TIME as the data arrives from stage 2 of the original pipeline.
      2. note that at stage 4, the data must be cached so that it still goes to the grid, while the new data coming from the RAM is the next instruction itself.
    • Remove pipelining whenever we have GI (increment row)
      • GI uses the ram at the 3rd/4th stage. Could be better if we have two channel RAM? Idk how that's organized.
  • Store PC instruction from PC --> RAM (easy instruction; just alter the gi enabler a little to add pc instead of data_out)
    • also add a RAM --> PC (custom instruction you may need to create from scratch)
    • Maybe consider having a store value function onto RAM? could act as a parameter into something!!
      • set upper addr/data
      • set lower addr/data
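The And/Or encoding described in the conditioner notes above (OR as a sum, AND as a product, each finished with a "more than 0" comparison) can be sketched directly; the function names are mine, for illustration:

```rust
// Sketch of the conditioner's And/Or encoding: conditions are 0/1
// integers; OR = sum then >0, AND = product then >0. Names are
// illustrative, not actual chip instructions.

fn gt_zero(x: i64) -> i64 {
    if x > 0 { 1 } else { 0 }
}

// OR: if any condition is 1, the sum is > 0.
fn or_all(conds: &[i64]) -> i64 {
    gt_zero(conds.iter().sum())
}

// AND: if any condition is 0, the product is 0.
fn and_all(conds: &[i64]) -> i64 {
    gt_zero(conds.iter().product())
}

fn main() {
    assert_eq!(or_all(&[0, 0, 1]), 1);
    assert_eq!(or_all(&[0, 0, 0]), 0);
    assert_eq!(and_all(&[1, 1, 1]), 1);
    assert_eq!(and_all(&[1, 0, 1]), 0);
    println!("ok");
}
```

Both reductions are exactly the operations the systolic array is already good at, which is why this encoding fits the conditioner so cheaply.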

Worries

  1. Elementwise operations might be the limiting factors rather than dot product

    • Amdahl's law:
      • even if dot product accounts for 90% of the computation and I speed it up by 20x, the overall speedup is only about 6.9x (and capped at 10x no matter what)
      • Why? It is waiting for the rest of the computation to speed up.
    • In other words, you need dot product and elementwise computation to be equally fast.
    • If I need to add a 128 x 512 by a 1 x 512, can I do that more efficiently than what NVIDIA can?
      • 128 x 512 --> 65536; assuming a 128 x 128 systolic array, that's still 65536 / 128 (as I can only use one row) --> 512 clock cycles
      • can I increase this to be even faster?
    • This is what leads to idea 5
      • Doing idea 5 requires more space, but in turn helps with elementwise computation
      • more space --> less processing units --> more cycles for dot product --> etc.
      • Again, it's a tradeoff. A balance that you must find
        • Ideally, create an algorithm that could find a right balance between both
        • Linear optimization :O
  2. Can't necessarily compare mine with any top-of-the-line chip yet unless I get it taped out; also memory problems

    • Still not a bad investment. Tapeout could be ~1k - 3k. I just have to earn money to do that
    • Oh well.
    • That, and get the memory chips that could make it comparable?
      • HBMs, etc.
    • FPGA simulations usually run at a lower clock speed
  3. Google's TPU is a systolic array, but wired a little bit differently...

    • weight stationary
    • It's still a good question on whether this idea would be better than Google's TPU
      • they still have large cache memory blocks
      • meanwhile, the compiler tries to not do that
    • My systolic array is still... slightly different?
      • mine is... easier to implement?
      • I don't know for certain. I did some back-of-the-envelope computations in my head, and it seems mine fares better
    • Also, Google's TPU software sucks, meanwhile mine doesn't hehe.
  4. Life

    • yeah idk man
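As a quick numeric check on worry 1, here is a back-of-the-envelope calculation (the `amdahl` helper is my own name): Amdahl's law gives about 6.9x overall for a 90% fraction sped up 20x, with a hard ceiling of 10x, and the 128 x 512 elementwise add works out to 512 cycles as stated.

```rust
// Worked numbers for worry 1: Amdahl's law caps the overall speedup,
// and the single-row elementwise add costs 512 cycles as estimated.

fn amdahl(parallel_fraction: f64, speedup: f64) -> f64 {
    // classic Amdahl's law: overall = 1 / ((1 - p) + p / s)
    1.0 / ((1.0 - parallel_fraction) + parallel_fraction / speedup)
}

fn main() {
    // 90% of time in dot product, sped up 20x -> only ~6.9x overall;
    // the limit as speedup -> infinity is 1 / 0.1 = 10x.
    let overall = amdahl(0.9, 20.0);
    assert!((overall - 1.0 / 0.145).abs() < 1e-6);

    // Elementwise add of a 128 x 512 by a 1 x 512 on a 128 x 128
    // array using only one row: 65536 / 128 = 512 cycles.
    let cycles = (128 * 512) / 128;
    assert_eq!(cycles, 512);
    println!("{overall:.2}x overall, {cycles} cycles");
}
```

This is why balancing elementwise throughput against dot-product throughput (idea 5) matters: the slower path sets the ceiling.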

VCD Setup

  • clock, pc
  • address (ram), data_out (ram), write_enable (ram)
  • instruction (instruction_dec), data (instruction_dec)
  • enable (rc_select_comp), col_setting, row_setting, col_out, row_out (all on chip, but from rc_select_comp)
  • idx (rics), is_ri (rics), ri_addr (chip, but from rics), c_instr (rics), pc_next (rics), is_grid_en (rics)
  • reset_grid (chip), ram_data_in (chip), out_grid (chip), ri_get_instr (chip, but from rics)

Software

Explained in the autodiff folder.

Technologies Used

I used SystemVerilog to design the chip, and Rust with rstb as the interface between the testbench software and the simulator. I use Icarus Verilog for the actual simulation (rstb is just an "interface").

Future

My hope is to integrate this with my own cloud platform software, which takes a graphical approach rather than a line-by-line code approach to machine learning. I have seen that much machine learning development is just pipelines (although there are a few examples where this is not the case; see Stanford's HazyResearch, for example), and I aim to simplify this process further. Also, it would be cool to have a frontend designer interface as well.

I could also write my own Python wrapper and interface it there. Furthermore, a hope would be to fully integrate with an existing learning library like PyTorch, but I want to experiment with integrating into tinygrad first, as it only requires me to write ~26 or so primitive operations and then everything else is optimized for me. However, I need to check whether everything is compiled, not just model prediction (training loop, getting data, etc.). Honestly, I might just write my own autodiff/linear algebra library. Shouldn't be too bad.

The thing is, I could make this a coprocessor for a CPU (or even a GPU), then I could just create a library to call the chip whenever needed (like writing a GPU driver). But not sure how to approach that just of yet. PCIe's specification is hell, to say the least.

There are many interfaces I could build off of this; hell I could make my own programming language if I wanted to, but that seems really unnecessary when I just want to interface mainly with Rust and Python.

Why Rust? Memory safety is a huge plus for me, but personally I really like coding in Rust. It feels almost... pythonic? (`for i in 0..3`, `mp.iter().map(|v| v * 2)`) but with similar (if not better; there's no garbage collection) speed to C++. Perfect for high-performance applications, which I will need if I scale this up across multiple servers. Very readable syntax (at least for me). Plus, the cargo system and rust-analyzer are to die for.

Furthermore, the "graphical cloud platform" application that I mentioned earlier has its backend written in Rust as well (at least, it's transitioning; I wrote it in C++ initially but am rewriting the entire thing yet again). Since I am building the preliminary stages of the low-level driver for Project Astra, I could easily connect the cloud platform backend with this driver in the far future. I need the full hardware for the chip to be complete first, which you can track under "Hardware Todo/Done".
