07-28-2018, 01:24 AM
So, it's been quite some time since I updated my
I've told you in the past how various things don't exist in programming, such as if-blocks, for loops, pointers, and data types. In each of these, I've explained how they actually do work. Well now is that anticipated moment when I combine them all together and show you something more abstract. Yes, I'm talking about the switch statement.
Take a look at this C code. How do you think it's implemented?
Did you say "a bunch of comparisons and branches"?
Well, if you did, you'd be wrong. That's terrible for efficiency. You should only need 1 comparison to implement this (or any sized) switch statement (with one exception). Only one comparison!? How on Earth can you manage that!?
It's actually pretty simple. Switch statements are one of those special types of abstract control structure. We know the following information from looking at the C code
Based on this knowledge, we can construct a table of sorts. To accomplish this feat, let's first start with the assumption that any number is valid. What we'll do is make a one-column table, something like this:
This looks pretty useless, but bear with me. As an array, we can use these as pointers to the correct label of something. We'd just need to do the following:
Make sense? Well, our table actually is exactly like that, but without all the complicated mess. We can just use some basic algebra to figure it out.
We know that ARM addresses memory on a 32-bit bus, meaning that every location (and thus label address) will take up 4 bytes. Because of that, we can figure out the address stored in table[foo] with the following formula:
But wait you say. ARM doesn't have an efficient multiply instruction. Don't you worry, there's a shortcut. Notice how 4 is one of those numbers that sticks in your head? Seems familiar huh? Like maybe from the RAM cards themselves and that they're always in multiples of 2?
Well, they actually aren't multiples of 2, they're powers of 2. This is due to addressing and your computers logic being in base 2. It wouldn't be able to access bits outside of that. Because this is base 2, we can multiply by 2 just by shifting every bit to the left 1 location, and inserting a 0 in the empty space. Here's an example.
So, we can use that pretty reliably. In fact, so reliably we get to cut an entire cycle off our code. ARM has this neat little way of doing logic shifts and additions in the same operation. So we can now get our address with this:
So, that's neat and all, but what does it mean? That means we only have 4 bytes to hold our code? Seems pretty useless.
Well, normally, yes. Jump tables make this a lot easier. You can use that 4 bytes to hold a branch instruction to the label you need it to be.
Ok, now that I've confused you enough, here's the actual assembly.
Now, let's go through this pretty briefly, but enough that someone with ARM knowledge should know what's going on.
Those first 3 lines are all loads, they should be pretty obvious. First we get the value of foo and store it in R0, then we store the address of our jump table in R2. We don't want to get any values from that yet, we need to do some math first.
Next up, we bitwise and with 0xfffffffc. What that is doing is removing the last 3 bits (the ones we care about). We store that value in R3. We wanted to remove the bits we care about to see if there are any bits we don't care about, since any number higher than 3 will be in the default case, we want to see first if that is the case. This is also our only comparison, if you notice the S following the AND instruction. This instructs the processor to compare that result with 0 after it's finished and set our condition bits. ARM has this amazing feature that does it as part of a math operation for us if we choose, eliminating the need for us to waste another clock cycle and do the comparison ourselves. Since this comparison is all we really care about, we don't actually need the value in R3, and we're free to reuse it again (even though I did not to make it easy to read).
Next up is where all of the magic happens. You should see that the right side of this instruction looks pretty similar to the one we made before. It actually does the same thing in this case. I mentioned that ARM was powerful enough to do a shift and an addition in one cycle, well it's also powerful enough to use that result in a memory fetch operation. All of this happens in 1 single clock cycle. What this actually comes out to is
Remember, PC is our program counter, which tells the processor where to execute memory from. Setting the PC to something is equivalent to a branch instruction, which is exactly what's going on here. We're branching to the base of the table + 4 times the value of foo.
Now, about the EQ at the end of the LDR instrucion. This instructs the processor to only perform this operation if our previous comparison was equal to 0, meaning that the number was between 0 and 3 inclusively. Because of this, we don't need to and it with 3 to mask off the extra bits, because this will only get executed if there are no extra bits.
Now, if you look at the case table, you can see that I just define 4 words in succession. These 4 words just hold the address of a label, and so when we jump to the address held in them, we get to our 4 different cases. One label for each of the 4 valid cases (0-3). But what about the default case?
Well, the default case just happens to be the same as the 0 case, so the C programmer left off the 0 case and just used the default one. If he had created both a 0 case and a default case that were the same, the compiler would have optimized it to the same C code I wrote. Back to that EQ suffix on the LDR instruction. If there were extra bits (meaning a number not in the range of 0-3), then that EQ suffix prevents the jump from happening, and it continues on to label a, which happens to be the same as our default (and zero) case.
The same is true for the fallthrough case (2). It is the only one without a branch to the case end label, meaning that after it executes, it will fall through to label d, or case for 3.
Because this is assembly, and it's been optimized, we use R4 for temporary storage and write the value into var at the end of the case. This allows us to have fewer load and store operations in the code, and provides a central place to debug an issue from.
Well, that's about it, let me know what you think, and if you have any questions.
[To see links please register here]
, and I figured I'd give it another go.I've told you in the past how various things don't exist in programming, such as if-blocks, for loops, pointers, and data types. In each of these, I've explained how they actually do work. Well now is that anticipated moment when I combine them all together and show you something more abstract. Yes, I'm talking about the switch statement.
Take a look at this C code. How do you think it's implemented?
Hidden Content
Did you say "a bunch of comparisons and branches"?
Well, if you did, you'd be wrong. That's terrible for efficiency. You should only need 1 comparison to implement this (or any sized) switch statement (with one exception). Only one comparison!? How on Earth can you manage that!?
It's actually pretty simple. Switch statements are one of those special types of abstract control structure. We know the following information from looking at the C code
- There are exactly 4 cases we can fall into
- All 3 specified cases are linear (no missing numbers)
- We know the start and ending numbers (1 and 3)
Based on this knowledge, we can construct a table of sorts. To accomplish this feat, let's first start with the assumption that any number is valid. What we'll do is make a one-column table, something like this:
- case0
- case1
- case2
- case3
This looks pretty useless, but bear with me. As an array, we can use these as pointers to the correct label of something. We'd just need to do the following:
Hidden Content
Make sense? Well, our table actually is exactly like that, but without all the complicated mess. We can just use some basic algebra to figure it out.
We know that ARM addresses memory on a 32-bit bus, meaning that every location (and thus label address) will take up 4 bytes. Because of that, we can figure out the address stored in table[foo] with the following formula:
Hidden Content
But wait you say. ARM doesn't have an efficient multiply instruction. Don't you worry, there's a shortcut. Notice how 4 is one of those numbers that sticks in your head? Seems familiar huh? Like maybe from the RAM cards themselves and that they're always in multiples of 2?
Well, they actually aren't multiples of 2, they're powers of 2. This is due to addressing and your computers logic being in base 2. It wouldn't be able to access bits outside of that. Because this is base 2, we can multiply by 2 just by shifting every bit to the left 1 location, and inserting a 0 in the empty space. Here's an example.
Hidden Content
So, we can use that pretty reliably. In fact, so reliably we get to cut an entire cycle off our code. ARM has this neat little way of doing logic shifts and additions in the same operation. So we can now get our address with this:
Hidden Content
So, that's neat and all, but what does it mean? That means we only have 4 bytes to hold our code? Seems pretty useless.
Well, normally, yes. Jump tables make this a lot easier. You can use that 4 bytes to hold a branch instruction to the label you need it to be.
Ok, now that I've confused you enough, here's the actual assembly.
Hidden Content
Now, let's go through this pretty briefly, but enough that someone with ARM knowledge should know what's going on.
Those first 3 lines are all loads, they should be pretty obvious. First we get the value of foo and store it in R0, then we store the address of our jump table in R2. We don't want to get any values from that yet, we need to do some math first.
Next up, we bitwise and with 0xfffffffc. What that is doing is removing the last 3 bits (the ones we care about). We store that value in R3. We wanted to remove the bits we care about to see if there are any bits we don't care about, since any number higher than 3 will be in the default case, we want to see first if that is the case. This is also our only comparison, if you notice the S following the AND instruction. This instructs the processor to compare that result with 0 after it's finished and set our condition bits. ARM has this amazing feature that does it as part of a math operation for us if we choose, eliminating the need for us to waste another clock cycle and do the comparison ourselves. Since this comparison is all we really care about, we don't actually need the value in R3, and we're free to reuse it again (even though I did not to make it easy to read).
Next up is where all of the magic happens. You should see that the right side of this instruction looks pretty similar to the one we made before. It actually does the same thing in this case. I mentioned that ARM was powerful enough to do a shift and an addition in one cycle, well it's also powerful enough to use that result in a memory fetch operation. All of this happens in 1 single clock cycle. What this actually comes out to is
Hidden Content
Remember, PC is our program counter, which tells the processor where to execute memory from. Setting the PC to something is equivalent to a branch instruction, which is exactly what's going on here. We're branching to the base of the table + 4 times the value of foo.
Now, about the EQ at the end of the LDR instrucion. This instructs the processor to only perform this operation if our previous comparison was equal to 0, meaning that the number was between 0 and 3 inclusively. Because of this, we don't need to and it with 3 to mask off the extra bits, because this will only get executed if there are no extra bits.
Now, if you look at the case table, you can see that I just define 4 words in succession. These 4 words just hold the address of a label, and so when we jump to the address held in them, we get to our 4 different cases. One label for each of the 4 valid cases (0-3). But what about the default case?
Well, the default case just happens to be the same as the 0 case, so the C programmer left off the 0 case and just used the default one. If he had created both a 0 case and a default case that were the same, the compiler would have optimized it to the same C code I wrote. Back to that EQ suffix on the LDR instruction. If there were extra bits (meaning a number not in the range of 0-3), then that EQ suffix prevents the jump from happening, and it continues on to label a, which happens to be the same as our default (and zero) case.
The same is true for the fallthrough case (2). It is the only one without a branch to the case end label, meaning that after it executes, it will fall through to label d, or case for 3.
Because this is assembly, and it's been optimized, we use R4 for temporary storage and write the value into var at the end of the case. This allows us to have fewer load and store operations in the code, and provides a central place to debug an issue from.
Well, that's about it, let me know what you think, and if you have any questions.