Learning x86-64 assembly basics

Translations: Português (pt-BR)
Publication date: Feb 25, 2022
Tags: assembly, x86-64

Recently I decided to learn assembly. I already had a reasonable understanding of how it worked thanks to some university classes that touched on the subject, but I never had the opportunity to actually write assembly code.

Since my everyday computer is an x86-64 machine, it made the most sense to learn assembly for this architecture, so I could avoid the need for a VM. I started with only the desire to get my hands dirty with assembly code, without any particular objective or project in mind.

At first I alternated between trying things out and researching on the web, just to understand enough to get a bare-minimum assembly file and the commands to assemble and run it. Eventually I stumbled upon the book that would guide me: x86-64 Assembly Language Programming with Ubuntu.

This book is free, recent, and had the perfect scope for me: it's aimed at people who already have a good grasp of programming but are new to x86-64 assembly. It covers some theory and concepts, but there are also plenty of exercises to learn through practice.

It was pretty fun to work through that book, and it worked well for me to create some familiarity with x86-64 assembly. I'm sure there are still a bunch of things to learn on the subject, since the book only gives a basis, but it was enough to teach me some interesting things.

Signedness and two's complement

The biggest lesson for me was a better understanding of signedness. I'm used to seeing int and unsigned int in C, and to watching out for the wrong signedness, but it wasn't as clear to me how that worked at the assembly level.

The first thing to keep in mind is that the concept of types present in higher-level languages like C (such as whether a number is signed or not) is completely absent in assembly. The computer memory stores only 0s and 1s, and it's up to you, the programmer, to interpret what they mean: is 01011000 the number 88, the character X, the POP AX instruction? With only that single byte, you can't even be sure of the size: maybe those are really 8 boolean flags in a single byte, or part of a 4-byte signed number. Without context it's impossible to tell.
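To make that concrete, here is a small C sketch that reads the same byte in a few of those ways. The function names are made up for illustration, and it only covers the data interpretations (number, character, flags, signed number), not the instruction one.

```c
#include <stdint.h>
#include <string.h>

/* The byte from the text is 01011000 in binary, 0x58 in hex. */

/* Interpreted as an unsigned number. */
unsigned as_number(uint8_t b) { return b; }

/* Interpreted as an ASCII character. */
char as_character(uint8_t b) { return (char)b; }

/* Interpreted as 8 boolean flags: returns flag i, where 0 is the
   least significant bit. */
int as_flag(uint8_t b, int i) { return (b >> i) & 1; }

/* Interpreted as a signed 8-bit number (two's complement).
   memcpy copies the bits unchanged, so only the reading differs. */
int8_t as_signed(uint8_t b)
{
    int8_t s;
    memcpy(&s, &b, sizeof s);
    return s;
}
```

With 0x58 all the numeric readings agree on 88, but a byte like 0xFF is 255 when read as unsigned and -1 when read as signed, even though the bits in memory are identical.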

If the same representation can mean either a signed or an unsigned number depending on the context, then when operating on those numbers, you as the programmer have to use the right variant of the instruction to give that context to the computer.

While going through the book, the following arithmetic instructions were presented for unsigned numbers:

  • add adds two numbers
  • sub subtracts two numbers
  • mul multiplies two numbers
  • div divides two numbers

And the following instructions were shown for comparisons between unsigned numbers (each one is used right after a cmp instruction, which does the actual comparison and sets the flags):

  • ja jumps if the first number is above the second
  • jb jumps if the first number is below the second

And sure enough, shortly after, the signed variants of those instructions were also shown:

  • imul is mul's signed variant
  • idiv is div's signed variant
  • jg is ja's signed variant
  • jl is jb's signed variant

But wait, what about iadd and isub? That's the thing: x86-64 represents negative numbers using the two's complement system, which has the useful property of allowing addition and subtraction to be done in exactly the same manner for both signed and unsigned values.

This means that there's only one way to add, regardless of signedness, and that's using add. There's no iadd. Likewise for subtraction.
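This is easy to check from C. The sketch below (function names are mine, just for illustration) computes a sum and a difference twice, once treating the operands as signed and once as unsigned, and returns the resulting 32-bit pattern in both cases. The signed versions go through a 64-bit intermediate so that the C code itself never overflows.

```c
#include <stdint.h>

/* Bit pattern obtained by adding a and b as signed numbers.
   The sum is computed exactly in 64 bits, then truncated to the
   low 32 bits. */
uint32_t signed_add_bits(int32_t a, int32_t b)
{
    int64_t exact = (int64_t)a + (int64_t)b;
    return (uint32_t)exact;
}

/* Bit pattern obtained by adding the same bits as unsigned numbers.
   Unsigned arithmetic wraps around modulo 2^32. */
uint32_t unsigned_add_bits(int32_t a, int32_t b)
{
    return (uint32_t)a + (uint32_t)b;
}

/* Same idea for subtraction. */
uint32_t signed_sub_bits(int32_t a, int32_t b)
{
    int64_t exact = (int64_t)a - (int64_t)b;
    return (uint32_t)exact;
}

uint32_t unsigned_sub_bits(int32_t a, int32_t b)
{
    return (uint32_t)a - (uint32_t)b;
}
```

For example, -5 + 3 gives the pattern 0xFFFFFFFE either way: that's -2 when read as signed, and 4294967294 when read as unsigned.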

So the interesting conclusion is that for addition and subtraction, the generated instruction is the same whether you use unsigned int or int for the variables in C. (The C language itself still treats overflow differently for the two types: unsigned arithmetic wraps around, while signed overflow is undefined behavior.) The unsigned keyword is there for you to tell the compiler to use the right variant of the instruction in the generated assembly, which is required when you're comparing numbers (ja vs jg, jb vs jl), multiplying (mul vs imul) or dividing (div vs idiv). But thanks to two's complement, in addition and subtraction there's no wrong instruction to pick 🙂.
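Comparison and division, on the other hand, really do depend on the interpretation. Here is a minimal C sketch of that, with made-up function names: the same two bit patterns are compared and divided once as signed and once as unsigned. Which exact instructions a compiler emits for each function is up to the compiler, so take the comments as what typically happens on x86-64 rather than a guarantee.

```c
#include <stdint.h>

/* Compared as signed numbers (typically a signed condition such as
   jl/jg on x86-64). */
int less_signed(int32_t a, int32_t b) { return a < b; }

/* The same bits compared as unsigned numbers (typically an unsigned
   condition such as jb/ja on x86-64). */
int below_unsigned(int32_t a, int32_t b)
{
    return (uint32_t)a < (uint32_t)b;
}

/* Bit pattern of the signed quotient. The caller must avoid b == 0
   and INT32_MIN / -1, which are undefined in C. */
uint32_t signed_div_bits(int32_t a, int32_t b)
{
    return (uint32_t)(a / b);
}

/* Bit pattern of the unsigned quotient of the same bits.
   The caller must avoid b == 0. */
uint32_t unsigned_div_bits(int32_t a, int32_t b)
{
    return (uint32_t)a / (uint32_t)b;
}
```

With a = -1 and b = 1, the signed comparison says a is less than b, while the unsigned one sees 0xFFFFFFFF, which is far above 1. Likewise, -6 / 2 is 0xFFFFFFFD (-3) as signed, but 0x7FFFFFFD as unsigned.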

Side note: interestingly, while writing this post, I read on the Wikipedia page that two's complement also works the same for multiplication, but only if both operands are first sign-extended to the width of the result. Put another way, the low half of the product comes out the same for signed and unsigned operands, and only the high half differs. That seems to be why separate mul and imul instructions exist: in their one-operand form they produce a result twice as wide as the operands, so the signedness matters for the upper half.
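To get a feel for this side note, here is a small C sketch (names are hypothetical) that multiplies two 32-bit values into a 64-bit result, once extending the operands with zeros (the unsigned reading) and once extending them with the sign bit (the signed reading).

```c
#include <stdint.h>

/* Full 64-bit product with both operands zero-extended,
   i.e. the operands are read as unsigned. */
uint64_t mul_wide_unsigned(int32_t a, int32_t b)
{
    return (uint64_t)(uint32_t)a * (uint64_t)(uint32_t)b;
}

/* Full 64-bit product with both operands sign-extended,
   i.e. the operands are read as signed. The product of two
   32-bit values always fits in 64 bits, so this cannot overflow. */
uint64_t mul_wide_signed(int32_t a, int32_t b)
{
    return (uint64_t)((int64_t)a * (int64_t)b);
}
```

For -1 times 2, the unsigned reading gives 0x00000001FFFFFFFE and the signed one gives 0xFFFFFFFFFFFFFFFE. The low 32 bits (0xFFFFFFFE) match, and only the upper 32 bits depend on how the operands were extended.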

Other interesting lessons

The other thing that interested me the most was realizing that local variables are nothing more than extra space reserved on the stack, and that this is done simply by subtracting the total number of bytes needed for the variables from the stack pointer register rsp at the start of a subroutine.

Also interesting was to learn how there are calling conventions to standardize on:

  • which registers are used to pass arguments to subroutines and in which order;
  • which registers can be overwritten by a subroutine and which must be left unchanged. To use one of the latter, its current value should first be pushed on the stack so that it can be restored before returning.

And what about the magic main() function that the C compiler expects in every C program? A standalone assembly program doesn't go through the C compiler or its startup code, so there's no need for it, but it turns out the linker expects a different magic label as the entry point: _start.

Some other things that were interesting to do in assembly:

  • Making syscalls
  • Exploiting a stack buffer overflow
  • Calling C code from assembly, and vice versa

Lack of a good GUI

One thing I missed was a good GUI application when debugging the assembly programs. It would have been really helpful to have one that showed the values of expressions in tooltips when hovering, that was able to follow labels when clicking, and so on.

The book recommends using DDD, which is a GUI, but it felt clunky and really outdated. I went with GDB together with the peda plugin instead, and that worked reasonably well, but since it's a CLI, every inspection required divining the correct command, so it took more time to get oriented.

Conclusion

This was a great experience and I hope to get back to it and further my knowledge past the "basic" level for x86-64 sometime in the future. Seeing what's happening at the assembly level really helps to better understand the higher-level languages, and to value the way they hide the complexities below!

I've uploaded the code I wrote for all the book's exercises to this repository. I don't expect it to be useful to anyone since it's simple stuff, but it's there either way.

The only exercise that I couldn't actually finish was the last one. There's very little information in the book about how to do it, and while researching the topic online I eventually got demotivated and started learning about other subjects instead. Maybe one day I'll give it another try. If you do know how to do it, get in touch! 🙂

And even though I couldn't finish that last exercise, while researching it I ended up learning how to use GCC's asm syntax to embed assembly in a C file, through this guide. I also learned about the Compiler Explorer, which seems like a great way to learn about assembly and C by just seeing what assembly is generated from a given piece of C code. So I'm calling this a win!
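For the curious, embedding assembly in a C file with that syntax looks roughly like the sketch below. It assumes GCC or Clang targeting x86-64 with the default AT&T assembler syntax, and falls back to plain C anywhere else. The function name asm_add is made up for this example and isn't taken from the guide or the book.

```c
#include <stdint.h>

/* Adds b to a using the x86-64 add instruction via GCC's extended
   asm syntax. On other compilers or architectures it falls back to
   a plain C addition. */
uint64_t asm_add(uint64_t a, uint64_t b)
{
#if defined(__GNUC__) && defined(__x86_64__)
    __asm__("addq %1, %0"  /* AT&T order: source first, destination last */
            : "+r"(a)      /* %0: read and written, any general register */
            : "r"(b)       /* %1: read-only input, any general register  */
            : "cc");       /* add modifies the flags register            */
    return a;
#else
    return a + b;
#endif
}
```

Calling asm_add(2, 3) returns 5, and pasting the function into the Compiler Explorer is a nice way to see the add instruction show up in the generated output.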