SoC Design Course - This article is part of a series.
Part 6: This Article

Introduction

In [SoC-04] and [SoC-05], we studied ISA concepts, addressing modes, and the RISC-V philosophy. Now it’s time to see RISC-V in action — we will take real C code and trace exactly how it becomes assembly instructions and ultimately machine code.

This is where theory meets practice. By the end of this post, you will be able to read RISC-V assembly, understand compiler output, and reason about how your C code executes on hardware.


1. RISC-V Instruction Reference

Let’s first consolidate the key RV32I instructions we will use:

1.1 R-Type Instructions (Register-Register)

31       25 24    20 19    15 14  12 11     7 6      0
┌──────────┬────────┬────────┬──────┬────────┬────────┐
│  funct7  │   rs2  │   rs1  │funct3│   rd   │ opcode │
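└──────────┴────────┴────────┴──────┴────────┴────────┘

| Instruction | funct7  | funct3 | Operation                     |
|-------------|---------|--------|-------------------------------|
| add         | 0000000 | 000    | rd = rs1 + rs2                |
| sub         | 0100000 | 000    | rd = rs1 - rs2                |
| and         | 0000000 | 111    | rd = rs1 & rs2                |
| or          | 0000000 | 110    | rd = rs1 \| rs2               |
| xor         | 0000000 | 100    | rd = rs1 ^ rs2                |
| sll         | 0000000 | 001    | rd = rs1 << rs2               |
| srl         | 0000000 | 101    | rd = rs1 >> rs2 (logical)     |
| sra         | 0100000 | 101    | rd = rs1 >> rs2 (arithmetic)  |
| slt         | 0000000 | 010    | rd = (rs1 < rs2) ? 1 : 0      |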

1.2 I-Type Instructions (Immediate)

31            20 19    15 14  12 11     7 6      0
┌────────────────┬────────┬──────┬────────┬────────┐
│   imm[11:0]    │   rs1  │funct3│   rd   │ opcode │
└────────────────┴────────┴──────┴────────┴────────┘

| Instruction | funct3 | Operation                                 |
|-------------|--------|-------------------------------------------|
| addi        | 000    | rd = rs1 + imm                            |
| andi        | 111    | rd = rs1 & imm                            |
| ori         | 110    | rd = rs1 \| imm                           |
| xori        | 100    | rd = rs1 ^ imm                            |
| slti        | 010    | rd = (rs1 < imm) ? 1 : 0                  |
| lw          | 010    | rd = Memory[rs1 + imm]                    |
| lh          | 001    | rd = sign_ext(Memory[rs1 + imm]) (16-bit) |
| lb          | 000    | rd = sign_ext(Memory[rs1 + imm]) (8-bit)  |
| lbu         | 100    | rd = zero_ext(Memory[rs1 + imm]) (8-bit)  |
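
To make the I-type layout concrete, here is a minimal C sketch that packs the fields from the diagram above into a 32-bit word and recovers the sign-extended immediate. The helper names encode_i and decode_i_imm are made up for illustration, and the opcode 0x13 (0010011, the register-immediate arithmetic group) in the usage note is taken from the RV32I opcode map rather than from the table above.

```c
#include <stdint.h>

/* Pack the five I-type fields into one instruction word.
   Hypothetical helper; imm is expected to fit in a signed 12-bit range. */
uint32_t encode_i(int32_t imm, uint32_t rs1, uint32_t funct3,
                  uint32_t rd, uint32_t opcode)
{
    return (((uint32_t)imm & 0xFFFu) << 20) |  /* imm[11:0] -> bits 31:20 */
           ((rs1    & 0x1Fu) << 15) |          /* rs1       -> bits 19:15 */
           ((funct3 & 0x07u) << 12) |          /* funct3    -> bits 14:12 */
           ((rd     & 0x1Fu) << 7)  |          /* rd        -> bits 11:7  */
           (opcode  & 0x7Fu);                  /* opcode    -> bits 6:0   */
}

/* Recover the sign-extended 12-bit immediate from an I-type word. */
int32_t decode_i_imm(uint32_t insn)
{
    uint32_t imm12 = insn >> 20;               /* bits 31:20, zero-extended */
    return (int32_t)(imm12 ^ 0x800u) - 0x800;  /* sign-extend from bit 11 */
}
```

For example, encode_i(5, 0, 0, 10, 0x13) builds addi x10, x0, 5 as 0x00500513, and decoding the word for addi x10, x10, -1 (0xFFF50513) yields -1.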

1.3 S-Type Instructions (Store)

| Instruction | funct3 | Operation                       |
|-------------|--------|---------------------------------|
| sw          | 010    | Memory[rs1 + imm] = rs2 (32-bit) |
| sh          | 001    | Memory[rs1 + imm] = rs2 (16-bit) |
| sb          | 000    | Memory[rs1 + imm] = rs2 (8-bit)  |
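
Stores have no rd field, so in the RV32I encoding the 12-bit immediate is split: imm[11:5] sits in bits 31:25 and imm[4:0] sits in bits 11:7. The sketch below shows that split in C. The function name encode_s is hypothetical, and the store opcode 0x23 (0100011) in the usage note comes from the RV32I opcode map.

```c
#include <stdint.h>

/* Pack an S-type (store) instruction word.
   Hypothetical helper; imm is expected to fit in a signed 12-bit range. */
uint32_t encode_s(int32_t imm, uint32_t rs2, uint32_t rs1,
                  uint32_t funct3, uint32_t opcode)
{
    uint32_t u = (uint32_t)imm & 0xFFFu;     /* keep the low 12 bits */
    return ((u >> 5)         << 25) |        /* imm[11:5] -> bits 31:25 */
           ((rs2    & 0x1Fu) << 20) |        /* rs2       -> bits 24:20 */
           ((rs1    & 0x1Fu) << 15) |        /* rs1       -> bits 19:15 */
           ((funct3 & 0x07u) << 12) |        /* funct3    -> bits 14:12 */
           ((u      & 0x1Fu) << 7)  |        /* imm[4:0]  -> bits 11:7  */
           (opcode  & 0x7Fu);                /* opcode    -> bits 6:0   */
}
```

As a check, encode_s(40, 10, 13, 2, 0x23) gives 0x02A6A423, the word for sw x10, 40(x13).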

1.4 B-Type Instructions (Branch)

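| Instruction | funct3 | Condition                        |
|-------------|--------|----------------------------------|
| beq         | 000    | Branch if rs1 == rs2             |
| bne         | 001    | Branch if rs1 != rs2             |
| blt         | 100    | Branch if rs1 < rs2 (signed)     |
| bge         | 101    | Branch if rs1 >= rs2 (signed)    |
| bltu        | 110    | Branch if rs1 < rs2 (unsigned)   |
| bgeu        | 111    | Branch if rs1 >= rs2 (unsigned)  |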
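
The signed and unsigned variants look at exactly the same register bits and differ only in how they interpret them. A small C sketch of that difference follows; the function names are illustrative, and the cast to int32_t assumes the usual two's-complement behavior.

```c
#include <stdbool.h>
#include <stdint.h>

/* blt: compare the register contents as two's-complement signed values. */
bool blt_taken(uint32_t rs1, uint32_t rs2)
{
    return (int32_t)rs1 < (int32_t)rs2;
}

/* bltu: compare the very same bits as unsigned values. */
bool bltu_taken(uint32_t rs1, uint32_t rs2)
{
    return rs1 < rs2;
}
```

With rs1 = 0xFFFFFFFF and rs2 = 1, blt is taken (the bits read as -1 < 1) while bltu is not (4294967295 is not less than 1).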

2. C to Assembly: Simple Expressions

2.1 Variable Assignment

int a = 5;
int b = 3;
int c = a + b;

Assembly (assuming a→x10, b→x11, c→x12):

addi x10, x0, 5      # a = 5
addi x11, x0, 3      # b = 3
add  x12, x10, x11   # c = a + b = 8

2.2 Complex Expressions

int f = (a + b) - (c + d);

Assembly (a→x10, b→x11, c→x12, d→x13, f→x14):

add  x5, x10, x11    # temp1 = a + b
add  x6, x12, x13    # temp2 = c + d
sub  x14, x5, x6     # f = temp1 - temp2

Notice how the compiler uses temporary registers (x5, x6) for intermediate results.

2.3 Bitwise Operations

int mask = value & 0xFF;        // Extract lowest byte
int shifted = value << 4;       // Multiply by 16
int toggled = flags ^ 0x01;     // Toggle bit 0

Assembly (value→x10, mask→x11, shifted→x12, toggled→x13, flags→x14):

andi  x11, x10, 0xFF     # mask = value & 0xFF
slli  x12, x10, 4        # shifted = value << 4  (= value × 16)
xori  x13, x14, 0x01     # toggled = flags ^ 0x01

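Key insight: Shifting left by $n$ bits multiplies by $2^n$ (as long as the result does not overflow). Compilers use this to replace multiplication by a power of 2 with a single shift. This matters on RV32I because the base ISA has no multiply instruction at all (mul belongs to the M extension), and even where a hardware multiplier exists, a shift is typically at least as cheap.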
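
The shift-for-multiply equivalence is easy to check in C. This is a minimal sketch with a hypothetical helper name; it uses unsigned arithmetic so that overflow wraps modulo $2^{32}$, which is what slli does to the register.

```c
#include <stdint.h>

/* Multiply by 2^n using a shift. Valid for 0 <= n < 32.
   Unsigned arithmetic wraps modulo 2^32, matching slli on RV32. */
uint32_t mul_pow2(uint32_t value, unsigned n)
{
    return value << n;
}
```

For instance, mul_pow2(3, 4) returns 48, the same as 3 × 16.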


3. C to Assembly: Conditional Statements

3.1 Simple If-Else

if (a == b) {
    c = a + b;
} else {
    c = a - b;
}

Assembly (a→x10, b→x11, c→x12):

      bne  x10, x11, else   # if (a != b) goto else
      add  x12, x10, x11    # c = a + b  (if branch)
      jal  x0, end           # goto end (skip else)
else: sub  x12, x10, x11    # c = a - b  (else branch)
end:  ...                    # continue

Pattern: The compiler typically inverts the condition and branches to the else block. The jal x0, end at the end of the if block is an unconditional jump (using x0 discards the return address since we don’t need it).

3.2 Comparison Operators

Different C comparisons map to different branch instructions:

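| C Condition | RISC-V Branch    | Notes          |
|-------------|------------------|----------------|
| a == b      | beq x10, x11, L  |                |
| a != b      | bne x10, x11, L  |                |
| a < b       | blt x10, x11, L  | Signed         |
| a >= b      | bge x10, x11, L  | Signed         |
| a > b       | blt x11, x10, L  | Swap operands! |
| a <= b      | bge x11, x10, L  | Swap operands! |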
Notice that RISC-V has no bgt or ble hardware instructions: the compiler swaps the operands and uses blt and bge instead (assemblers accept bgt and ble only as pseudo-instructions that expand this way). This keeps the ISA minimal: fewer instruction types and a simpler decoder.

3.3 Multi-Way Conditional (Switch)

switch (x) {
    case 0: result = a; break;
    case 1: result = b; break;
    case 2: result = c; break;
    default: result = d;
}

Method 1: Chain of branches (for small switch):

      beq  x10, x0, case0     # if x == 0
      addi x5, x0, 1
      beq  x10, x5, case1     # if x == 1
      addi x5, x0, 2
      beq  x10, x5, case2     # if x == 2
      jal  x0, default         # else: default
case0: add  x14, x11, x0      # result = a
      jal  x0, end
case1: add  x14, x12, x0      # result = b
      jal  x0, end
case2: add  x14, x13, x0      # result = c
      jal  x0, end
default: add x14, x15, x0     # result = d
end:  ...

Method 2: Jump table (for large, dense switch — more efficient):

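      # x10 = switch variable, x20 = base of jump table
      addi  x6, x0, 3         # x6 = number of table entries (cases 0..2)
      bgeu  x10, x6, default  # out of range (including negative x) -> default
      slli  x5, x10, 2        # x5 = x * 4 (each table entry is 4 bytes)
      add   x5, x20, x5       # x5 = &jump_table[x]
      lw    x5, 0(x5)         # x5 = jump_table[x] (target address)
      jalr  x0, 0(x5)         # jump to target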
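
The same idea can be sketched in C with an array of function pointers standing in for jump_table[]. This is only an analogy under stated assumptions: the handler functions and select_result are hypothetical names, the C version performs an indirect call where the assembly performs an indirect jump, and the range check shown here is something any real jump table needs so that out-of-range values reach the default case.

```c
#include <stddef.h>

/* One handler per case; each returns the value the switch would assign. */
static int case0(int a, int b, int c) { (void)b; (void)c; return a; }
static int case1(int a, int b, int c) { (void)a; (void)c; return b; }
static int case2(int a, int b, int c) { (void)a; (void)b; return c; }

typedef int (*handler_fn)(int, int, int);

/* Plays the role of jump_table[]: entry i is the target for case i. */
static const handler_fn jump_table[] = { case0, case1, case2 };

int select_result(int x, int a, int b, int c, int d)
{
    size_t n = sizeof jump_table / sizeof jump_table[0];

    /* Casting to unsigned folds "x < 0" and "x >= n" into one comparison. */
    if ((unsigned)x >= n)
        return d;                       /* default case */

    return jump_table[x](a, b, c);      /* indexed load + indirect transfer */
}
```

The cost is the same for every case (one comparison, one load, one indirect transfer), which is why the table wins over a branch chain once the switch is large and dense.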

4. C to Assembly: Loops

4.1 While Loop

int sum = 0;
int i = 0;
while (i < 10) {
    sum += i;
    i++;
}

Assembly (sum→x10, i→x11):

      addi x10, x0, 0         # sum = 0
      addi x11, x0, 0         # i = 0
      addi x12, x0, 10        # limit = 10
loop: bge  x11, x12, done     # if (i >= 10) exit loop
      add  x10, x10, x11      # sum += i
      addi x11, x11, 1        # i++
      jal  x0, loop            # goto loop
done: ...                      # sum is in x10 (= 45)

4.2 For Loop

for (int i = 0; i < n; i++) {
    a[i] = a[i] * 2;
}

Assembly (i→x11):

      addi x11, x0, 0         # i = 0
      # x12 = n, x13 = base address of a[]
loop: bge  x11, x12, done     # if (i >= n) exit
      slli x5, x11, 2         # x5 = i * 4 (word offset)
      add  x5, x13, x5        # x5 = &a[i]
      lw   x6, 0(x5)          # x6 = a[i]
      slli x6, x6, 1          # x6 = a[i] * 2 (shift left = ×2)
      sw   x6, 0(x5)          # a[i] = a[i] * 2
      addi x11, x11, 1        # i++
      jal  x0, loop            # goto loop
done: ...

4.3 Do-While Loop

do {
    x = x >> 1;    // divide by 2
    count++;
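} while (x != 0);

Assembly:

      # x10 = x, x11 = count
      # x is assumed unsigned here, so the logical shift srli applies;
      # for a signed int the compiler would emit srai instead
loop: srli x10, x10, 1        # x = x >> 1
      addi x11, x11, 1        # count++
      bne  x10, x0, loop       # if (x != 0) continue
      # loop done; count is in x11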

The do-while loop places the condition check at the bottom — the body always executes at least once.
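
To see the "at least once" behavior, here is the loop wrapped in a small C function. The function name is hypothetical, and x is declared unsigned so that >> is a logical shift.

```c
/* Returns how many times the do-while body runs for a given x. */
unsigned count_shifts(unsigned x)
{
    unsigned count = 0;
    do {
        x >>= 1;      /* logical shift right by one bit */
        count++;
    } while (x != 0);
    return count;
}
```

Even count_shifts(0) returns 1, because the body runs before the condition is ever tested; count_shifts(8) returns 4 (8 → 4 → 2 → 1 → 0).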


5. C to Assembly: Arrays and Memory

5.1 Array Access

int a[100];
int x = a[5];       // Load
a[10] = x + 1;      // Store

Assembly (x→x10):

# x13 = base address of a[]
lw   x10, 20(x13)     # x = a[5]  (5 × 4 = 20 byte offset)
addi x10, x10, 1      # x + 1
sw   x10, 40(x13)     # a[10] = x + 1  (10 × 4 = 40 byte offset)
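
The byte offsets 20 and 40 can be reproduced in C by doing the address arithmetic by hand. This is a sketch under the assumption of 4-byte elements (hence int32_t); load_word, store_word, and array_demo are hypothetical helpers that mimic lw and sw with a base pointer plus a byte offset.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Like "lw rd, off(base)": read 4 bytes at base + off. */
int32_t load_word(const void *base, size_t off)
{
    int32_t value;
    memcpy(&value, (const char *)base + off, sizeof value);
    return value;
}

/* Like "sw rs2, off(base)": write 4 bytes at base + off. */
void store_word(void *base, size_t off, int32_t value)
{
    memcpy((char *)base + off, &value, sizeof value);
}

/* Mirrors the C snippet above: x = a[5]; a[10] = x + 1; */
int32_t array_demo(void)
{
    int32_t a[100] = {0};
    a[5] = 41;
    int32_t x = load_word(a, 5 * sizeof(int32_t));    /* offset 20 */
    store_word(a, 10 * sizeof(int32_t), x + 1);       /* offset 40 */
    return a[10];
}
```

Calling array_demo() returns 42, confirming that byte offset 40 lands on a[10].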

5.2 Array Traversal (Sum)

int sum = 0;
for (int i = 0; i < n; i++) {
    sum += a[i];
}

Approach 1: Index-based (compute address each iteration)

      addi x10, x0, 0        # sum = 0
      addi x11, x0, 0        # i = 0
loop: bge  x11, x12, done    # if (i >= n) exit
      slli x5, x11, 2        # offset = i * 4
      add  x5, x13, x5       # addr = base + offset
      lw   x6, 0(x5)         # load a[i]
      add  x10, x10, x6      # sum += a[i]
      addi x11, x11, 1       # i++
      jal  x0, loop
done: ...

Approach 2: Pointer-based (more efficient — increment pointer)

      addi x10, x0, 0        # sum = 0
      slli x5, x12, 2        # x5 = n * 4
      add  x5, x13, x5       # x5 = &a[n] (end pointer)
      add  x6, x13, x0       # x6 = &a[0] (current pointer)
loop: bgeu x6, x5, done      # if (ptr >= end) exit (addresses compare as unsigned)
      lw   x7, 0(x6)         # load *ptr
      add  x10, x10, x7      # sum += *ptr
      addi x6, x6, 4         # ptr++ (advance by 4 bytes)
      jal  x0, loop
done: ...

The pointer-based approach replaces the slli + add address calculation and the separate i++ with a single pointer increment, so each iteration executes five instructions instead of seven. Optimizing compilers often perform this transformation automatically.
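
At the C level the two approaches look like this. It is a minimal sketch with hypothetical function names; an optimizing compiler may well turn the first form into the second on its own.

```c
#include <stddef.h>

/* Approach 1: recompute the element address from the index each time. */
int sum_index(const int *a, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];              /* address = a + i * sizeof(int) */
    return sum;
}

/* Approach 2: walk a pointer from &a[0] up to the end pointer &a[n]. */
int sum_pointer(const int *a, size_t n)
{
    int sum = 0;
    const int *end = a + n;       /* computed once, before the loop */
    for (const int *p = a; p < end; p++)
        sum += *p;                /* p advances by sizeof(int) bytes */
    return sum;
}
```

Both functions return the same result for any input; only the per-iteration address work differs.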

5.3 Strings (Character Arrays)

int strlen(char *s) {
    int len = 0;
    while (s[len] != '\0') {
        len++;
    }
    return len;
}

Assembly (s→x10, len→x11):

strlen:
      addi x11, x0, 0        # len = 0
loop: add  x5, x10, x11      # addr = s + len
      lb   x6, 0(x5)         # load s[len] (byte)
      beq  x6, x0, done      # if (s[len] == '\0') exit
      addi x11, x11, 1       # len++
      jal  x0, loop
done: add  x10, x11, x0      # return value in a0 (x10)
      jalr x0, 0(x1)         # return to caller

6. C to Assembly: Functions

6.1 Function Call Convention

RISC-V defines a calling convention that specifies how functions communicate:

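| Register | ABI Name | Role                     | Saved By |
|----------|----------|--------------------------|----------|
| x1       | ra       | Return address           | Caller   |
| x2       | sp       | Stack pointer            | Callee   |
| x5–x7    | t0–t2    | Temporaries              | Caller   |
| x8–x9    | s0–s1    | Saved                    | Callee   |
| x10–x11  | a0–a1    | Arguments / Return value | Caller   |
| x12–x17  | a2–a7    | Arguments                | Caller   |
| x18–x27  | s2–s11   | Saved                    | Callee   |
| x28–x31  | t3–t6    | Temporaries              | Caller   |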

Caller-saved registers may be overwritten by the called function — if the caller needs them after the call, it must save them to the stack first.

Callee-saved registers must be preserved by the called function — if it uses them, it must save the old values to the stack and restore them before returning.

6.2 Simple Function Call

int add(int a, int b) {
    return a + b;
}

int main() {
    int result = add(3, 4);
}

Assembly:

# --- main ---
main:
      addi x10, x0, 3      # a0 = 3 (first argument)
      addi x11, x0, 4      # a1 = 4 (second argument)
      jal  x1, add          # call add; ra = return address
      # x10 now contains 7 (return value)
      ...

# --- add ---
add:
      add  x10, x10, x11   # a0 = a0 + a1 (result in a0)
      jalr x0, 0(x1)       # return to caller (jump to ra)

This is a leaf function (doesn’t call other functions) — no need to save anything on the stack.

6.3 Nested Function Calls (Stack Usage)

int multiply(int a, int b) {
    return a * b;   // assume M extension
}

int compute(int x, int y) {
    int temp = multiply(x, y);
    return temp + 1;
}

Assembly:

compute:
      # Prologue: save registers to stack
      addi sp, sp, -16      # allocate 16 bytes (the standard ABI keeps sp 16-byte aligned)
      sw   x1, 12(sp)       # save return address (ra)
      sw   x8, 8(sp)        # save s0
      sw   x9, 4(sp)        # save s1

      add  x8, x10, x0      # s0 = x (save argument)
      add  x9, x11, x0      # s1 = y (save argument)

      # Arguments already in a0, a1 for multiply
      jal  x1, multiply      # call multiply(x, y)
      # x10 = result of multiply

      addi x10, x10, 1      # return temp + 1

      # Epilogue: restore registers from stack
      lw   x1, 12(sp)       # restore ra
      lw   x8, 8(sp)        # restore s0
      lw   x9, 4(sp)        # restore s1
      addi sp, sp, 16       # deallocate stack space

      jalr x0, 0(x1)        # return

Saving s0 and s1 here only illustrates the callee-saved pattern: x and y are not needed after the call, so a compiler would save just ra.

The stack frame for this function:

High Address
┌──────────────┐ ← sp (on entry)
│   ra (x1)    │  sp + 12
├──────────────┤
│   s0 (x8)    │  sp + 8
├──────────────┤
│   s1 (x9)    │  sp + 4
├──────────────┤
│  (padding)   │  sp + 0
└──────────────┘ ← sp (after prologue)
Low Address

6.4 Recursive Function

int factorial(int n) {
    if (n <= 1) return 1;
    return n * factorial(n - 1);
}

Assembly (n→x10, result returned in x10):

factorial:
      # Base case check
      addi x5, x0, 1
      bge  x5, x10, base     # if (1 >= n) goto base

      # Recursive case: save state
      addi sp, sp, -16       # allocate stack space (sp stays 16-byte aligned)
      sw   x1, 12(sp)        # save return address
      sw   x10, 8(sp)        # save n

      addi x10, x10, -1      # a0 = n - 1
      jal  x1, factorial      # call factorial(n-1)
      # x10 = factorial(n-1)

      lw   x5, 8(sp)         # restore n
      lw   x1, 12(sp)        # restore return address
      addi sp, sp, 16        # deallocate stack

      mul  x10, x5, x10      # return n * factorial(n-1) (requires the M extension)
      jalr x0, 0(x1)         # return

base:
      addi x10, x0, 1        # return 1
      jalr x0, 0(x1)         # return

Stack evolution for factorial(4):

Call factorial(4): save ra, n=4          Stack: [ra4, 4]
  Call factorial(3): save ra, n=3        Stack: [ra4, 4] [ra3, 3]
    Call factorial(2): save ra, n=2      Stack: [ra4, 4] [ra3, 3] [ra2, 2]
      Call factorial(1): base case → return 1
    Return: 2 × 1 = 2
  Return: 3 × 2 = 6
Return: 4 × 6 = 24
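
The push-then-unwind pattern in the trace above can be imitated in C with an explicit array acting as the stack. This is only an illustrative sketch: the function name and MAX_DEPTH are hypothetical, only n is pushed (the return address that the real code also saves is not modeled), and n is limited to 12 so the result fits in a 32-bit int.

```c
#include <assert.h>

#define MAX_DEPTH 16

/* Iterative factorial that mimics the recursive stack behaviour:
   push n on the way down, pop and multiply on the way back up. */
int factorial_explicit_stack(int n)
{
    int stack[MAX_DEPTH];
    int top = 0;

    assert(n <= 12);              /* 13! no longer fits in 32 bits */

    while (n > 1) {               /* "call" phase: save n, recurse on n - 1 */
        assert(top < MAX_DEPTH);
        stack[top++] = n;
        n--;
    }

    int result = 1;               /* base case returns 1 */
    while (top > 0)               /* "return" phase: unwind the stack */
        result = stack[--top] * result;
    return result;
}
```

For n = 4 the array holds 4, 3, 2 at its deepest point, and the unwind computes 2 × 1, then 3 × 2, then 4 × 6 = 24, matching the trace.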

7. Encoding a Complete Instruction

Let’s encode a real instruction from start to finish.

Instruction: add x9, x20, x21

Step 1: Identify the format → R-type

Step 2: Look up the fields:

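| Field  | Value   | Binary  |
|--------|---------|---------|
| funct7 | 0000000 | 0000000 |
| rs2    | x21     | 10101   |
| rs1    | x20     | 10100   |
| funct3 | 000     | 000     |
| rd     | x9      | 01001   |
| opcode | 0110011 | 0110011 |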
Step 3: Assemble:

0000000 | 10101 | 10100 | 000 | 01001 | 0110011
funct7    rs2     rs1    f3    rd      opcode

Binary: 00000001010110100000010010110011

Hex: 0x015A04B3

Step 4: Verify — this 32-bit value is what gets stored in instruction memory and what the CPU fetches and decodes.
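
The four steps can also be checked mechanically. Below is a minimal C sketch of an R-type encoder; encode_r is a hypothetical helper, not part of any toolchain, and it simply shifts each field into the bit positions from the R-type diagram in section 1.1.

```c
#include <stdint.h>

/* Pack the six R-type fields into one 32-bit instruction word. */
uint32_t encode_r(uint32_t funct7, uint32_t rs2, uint32_t rs1,
                  uint32_t funct3, uint32_t rd, uint32_t opcode)
{
    return ((funct7 & 0x7Fu) << 25) |   /* bits 31:25 */
           ((rs2    & 0x1Fu) << 20) |   /* bits 24:20 */
           ((rs1    & 0x1Fu) << 15) |   /* bits 19:15 */
           ((funct3 & 0x07u) << 12) |   /* bits 14:12 */
           ((rd     & 0x1Fu) << 7)  |   /* bits 11:7  */
           (opcode  & 0x7Fu);           /* bits 6:0   */
}
```

Calling encode_r(0x00, 21, 20, 0x0, 9, 0x33) reproduces 0x015A04B3 for add x9, x20, x21. Changing only funct7 to 0x20 (0100000) gives 0x415A04B3, the encoding of sub x9, x20, x21.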


8. Pseudo-Instructions

RISC-V assembly provides pseudo-instructions — convenient shorthand that the assembler expands into real instructions:

| Pseudo-instruction | Actual Instruction(s)                | Meaning              |
|--------------------|--------------------------------------|----------------------|
| mv x5, x6          | addi x5, x6, 0                       | Copy register        |
| li x5, 42          | addi x5, x0, 42                      | Load immediate       |
| li x5, 0x12345678  | lui x5, 0x12345; addi x5, x5, 0x678  | Load large constant  |
| nop                | addi x0, x0, 0                       | No operation         |
| j label            | jal x0, label                        | Unconditional jump   |
| ret                | jalr x0, 0(x1)                       | Return from function |
| call func          | auipc x1, ...; jalr x1, ...          | Far function call    |
| not x5, x6         | xori x5, x6, -1                      | Bitwise NOT          |
| neg x5, x6         | sub x5, x0, x6                       | Negate               |
| beqz x5, L         | beq x5, x0, L                        | Branch if zero       |
| bnez x5, L         | bne x5, x0, L                        | Branch if not zero   |

These make assembly code more readable without adding hardware complexity.
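
The li expansion for large constants hides one subtlety: addi sign-extends its 12-bit immediate, so when bit 11 of the constant is set, the lui part has to be rounded up by one to compensate. The C sketch below shows one way an assembler might compute the two immediates; the li_parts type and split_li function are hypothetical names for illustration.

```c
#include <stdint.h>

typedef struct {
    uint32_t hi20;   /* immediate for lui  (upper 20 bits) */
    int32_t  lo12;   /* immediate for addi (signed 12-bit) */
} li_parts;

/* Split a 32-bit constant so that (hi20 << 12) + lo12 == value (mod 2^32). */
li_parts split_li(uint32_t value)
{
    li_parts p;
    /* Adding 0x800 rounds the upper part up whenever bit 11 is set. */
    p.hi20 = ((value + 0x800u) >> 12) & 0xFFFFFu;
    /* Sign-extend the low 12 bits, exactly as addi will. */
    p.lo12 = (int32_t)((value & 0xFFFu) ^ 0x800u) - 0x800;
    return p;
}
```

For 0x12345678 this gives hi20 = 0x12345 and lo12 = 0x678, matching the table. For 0x12345FFF it gives hi20 = 0x12346 and lo12 = -1, because 0x12346000 - 1 = 0x12345FFF.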


9. Complete Example: Bubble Sort

Let’s bring everything together with a real algorithm:

void bubble_sort(int *arr, int n) {
    for (int i = 0; i < n - 1; i++) {
        for (int j = 0; j < n - 1 - i; j++) {
            if (arr[j] > arr[j + 1]) {
                // swap
                int temp = arr[j];
                arr[j] = arr[j + 1];
                arr[j + 1] = temp;
            }
        }
    }
}

Assembly:

# x10 = arr (base address), x11 = n
bubble_sort:
      # Prologue: s2-s5 are callee-saved, so preserve them before use
      addi sp, sp, -16
      sw   x18, 12(sp)
      sw   x19, 8(sp)
      sw   x20, 4(sp)
      sw   x21, 0(sp)

      addi x18, x11, -1      # s2 = n - 1 (outer limit)
      addi x19, x0, 0        # s3 = i = 0 (outer counter)

outer:
      bge  x19, x18, done    # if (i >= n-1) exit

      sub  x20, x18, x19     # s4 = (n-1) - i (inner limit)
      addi x21, x0, 0        # s5 = j = 0 (inner counter)

inner:
      bge  x21, x20, next_i  # if (j >= n-1-i) next outer iteration

      slli x5, x21, 2        # x5 = j * 4
      add  x5, x10, x5       # x5 = &arr[j]
      lw   x6, 0(x5)         # x6 = arr[j]
      lw   x7, 4(x5)         # x7 = arr[j+1]

      bge  x7, x6, no_swap   # if (arr[j+1] >= arr[j]) skip swap

      # Swap: arr[j] and arr[j+1]
      sw   x7, 0(x5)         # arr[j] = arr[j+1]
      sw   x6, 4(x5)         # arr[j+1] = arr[j]

no_swap:
      addi x21, x21, 1       # j++
      jal  x0, inner          # continue inner loop

next_i:
      addi x19, x19, 1       # i++
      jal  x0, outer          # continue outer loop

done:
      # Epilogue: restore callee-saved registers
      lw   x18, 12(sp)
      lw   x19, 8(sp)
      lw   x20, 4(sp)
      lw   x21, 0(sp)
      addi sp, sp, 16
      jalr x0, 0(x1)         # return

This example shows every concept we’ve learned:

  • Loops (nested for loops with branch instructions)
  • Array access (slli + add for index calculation, lw/sw for load/store)
  • Conditionals (bge for comparison, branch to skip swap)
  • Register usage (saved registers for loop counters, temporaries for addresses/values)

10. Summary

| Topic               | Key Takeaway                                                                        |
|---------------------|-------------------------------------------------------------------------------------|
| R/I/S/B formats     | Each instruction type has a specific encoding; register positions are consistent    |
| Expressions         | Map directly to add, sub, and/or/xor, shift instructions                            |
| Conditionals        | Compiler inverts condition and branches to else block                               |
| Loops               | Condition check at top (while/for) or bottom (do-while) with backward branch        |
| Arrays              | Index × element_size for byte offset; pointer-based traversal is more efficient     |
| Functions           | Caller/callee-saved registers; stack for saving state; jal/jalr for call/return     |
| Recursion           | Each call pushes state onto stack; stack unwinds on return                          |
| Pseudo-instructions | Convenient shorthand (mv, li, ret, nop) expanded by assembler                       |

In the next post ([SoC-07]), we will start building the actual hardware that executes these instructions — beginning with the building blocks of a single-cycle RISC-V processor.


This post is part of the SoC Design Course series. Navigate to the next post to continue your learning journey.
