07-24-2023, 03:28 AM
I am experimenting with writing highly optimized, reusable functions for a library of mine. For instance, I implement the "is power of 2" test as follows:
template<class IntType>
inline bool is_power_of_two( const IntType x )
{
    return (x != 0) && ((x & (x - 1)) == 0);
}
This is a portable, low-maintenance implementation as an inline C++ template. VC++ 2008 compiles it to the following x64 code, which contains branches:
is_power_of_two PROC
    test rcx, rcx
    je SHORT $LN3@is_power_o
    lea rax, QWORD PTR [rcx-1]
    test rax, rcx
    jne SHORT $LN3@is_power_o
    mov al, 1
    ret 0
$LN3@is_power_o:
    xor al, al
    ret 0
is_power_of_two ENDP
I also found the implementation from ["The bit twiddler"][1], which could be coded in x64 assembly as follows:
is_power_of_two_fast PROC
    test rcx, rcx
    je SHORT NotAPowerOfTwo
    lea rax, [rcx-1]
    and rax, rcx
    neg rax
    sbb rax, rax
    inc rax
    ret
NotAPowerOfTwo:
    xor rax, rax
    ret
is_power_of_two_fast ENDP
I tested both subroutines in a separate assembly module (.asm file) called from C++, and the second one runs about 20% faster!
Yet the overhead of the function call is considerable: when I compare the second assembly implementation, "is_power_of_two_fast", with the inlined version of the template function, the latter is faster despite its branches!
Unfortunately, VC++ does not support inline assembly when targeting x64. One is supposed to use "intrinsic functions" instead.
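For illustration, here is a minimal sketch of what an intrinsic-based check could look like. Note that it swaps in a different technique, a population count, rather than the bit trick above. The names `is_power_of_two_popcnt` and `check_is_power_of_two_popcnt` are hypothetical; `__popcnt64` is the MSVC x64 intrinsic from `<intrin.h>` (it assumes a CPU that has the POPCNT instruction), and `__builtin_popcountll` is the GCC/Clang counterpart. I have not measured whether this is any faster.

```cpp
#include <cassert>
#if defined(_MSC_VER)
#include <intrin.h>   // declares __popcnt64 (x64 targets only)
#endif

// Hypothetical helper, not part of the library discussed above.
// A power of two has exactly one bit set, so the population count is 1.
// Zero has no bits set and is therefore rejected without a separate test.
inline bool is_power_of_two_popcnt( const unsigned long long x )
{
#if defined(_MSC_VER)
    // Assumption: the target CPU supports the POPCNT instruction.
    return __popcnt64( x ) == 1;
#else
    return __builtin_popcountll( x ) == 1;
#endif
}

// Small self-check covering zero, powers of two, and a non-power.
inline void check_is_power_of_two_popcnt()
{
    assert( !is_power_of_two_popcnt( 0 ) );
    assert( is_power_of_two_popcnt( 1 ) );
    assert( is_power_of_two_popcnt( 1ull << 63 ) );
    assert( !is_power_of_two_popcnt( 12 ) );
}
```

Unlike the template, this sketch is fixed to a 64-bit unsigned argument, so it gives up some of the portability of the original.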
Now the question: can I implement the faster version, "is_power_of_two_fast", as a custom intrinsic function or something similar, so that it can be used inline? Alternatively, is it possible to force the compiler to produce the low-branch version of the function?
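As a minimal sketch of the second option, one thing that might be tried is replacing the short-circuit `&&` with a bitwise `&`, so that both comparisons are evaluated unconditionally. The names `is_power_of_two_nobranch` and `check_is_power_of_two_nobranch` are hypothetical, and whether the compiler really emits branch-free code for this is an assumption that would have to be verified in the generated listing.

```cpp
#include <cassert>

// Hypothetical variant of the template above (the name is illustrative).
// Bitwise & has no short-circuit semantics, so the compiler is not obliged
// to jump past the second comparison when x == 0. The result is the same as
// with &&, because both operands are bool values (0 or 1).
// Assumption: this only makes branch-free code possible; it does not
// guarantee it for any particular compiler or optimization level.
template<class IntType>
inline bool is_power_of_two_nobranch( const IntType x )
{
    return ( x != 0 ) & ( ( x & ( x - 1 ) ) == 0 );
}

// Small self-check covering zero, powers of two, and non-powers.
inline void check_is_power_of_two_nobranch()
{
    assert( !is_power_of_two_nobranch( 0u ) );
    assert( is_power_of_two_nobranch( 1u ) );
    assert( is_power_of_two_nobranch( 64u ) );
    assert( !is_power_of_two_nobranch( 6u ) );
}
```

Because it stays a plain inline template, this form keeps the portability of the original and avoids the function-call overhead of the separate .asm module.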
[1]: