Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly?


#2
C++ programs are translated to assembly during the generation of machine code from the source, so it would be wrong to say that assembly is inherently slower than C++. Moreover, the binary code generated differs from compiler to compiler, so a smart C++ compiler *may* produce binary code that is more optimal and efficient than a dumb assembly programmer's code.

However, I believe your profiling methodology has certain flaws. The following are general guidelines for profiling:

1. Make sure your system is in its normal/idle state. Stop all running processes (applications) that you started or that use the CPU intensively (or poll over the network).
2. Your data set must be sufficiently large.
3. Your test must run for more than 5-10 seconds.
4. Do not rely on just one sample. Perform your test N times, collect the results, and calculate the mean or median.

#3
From comments:

> But, this code never stops (because of integer overflow)!?! – Yves Daoust

For many numbers it will *not* overflow.

If it *does* overflow, for one of those unlucky initial seeds, the overflowed number will very likely converge toward 1 without another overflow.

Still, this poses an interesting question: is there some seed number that is cyclic under overflow?

The final converging stretch of any series starts at a power-of-two value (obvious enough?).

2^64 will overflow to zero, which is an undefined state for the algorithm (it is defined to end only at 1), giving an infinite loop. However, the most optimal solution in the answer will still finish, because `shr rax` produces ZF=1 when the value reaches zero.

Can we produce 2^64? If the starting number is `0x5555555555555555`, it's an odd number, so the next number is 3n+1, which is `0xFFFFFFFFFFFFFFFF + 1` = `0`. Theoretically the algorithm is then in an undefined state, but the optimized answer of johnfound will recover by exiting on ZF=1, while the `cmp rax,1` of Peter Cordes **will end in an infinite loop** (QED variant 1, the "cheapo" route through the undefined `0` value).

How about some more complex number, which would create a cycle without `0`?
Frankly, I'm not sure; my math theory is too hazy to get any serious idea of how to deal with it in a serious way. But intuitively I would say the series converges to 1 for every number 0 < n, as the 3n+1 formula will sooner or later turn every non-2 prime factor of the original number (or of an intermediate one) into some power of 2. So we don't need to worry about an infinite loop for the original series; only overflow can hamper us.

So I just put a few numbers into a spreadsheet and took a look at 8-bit truncated numbers.

There are three values that overflow to `0`: `227`, `170` and `85` (`85` goes directly to `0`; the other two progress toward `85`).

But there's no value that creates a cyclic overflow seed.

Funnily enough, I checked which is the first number to suffer from 8-bit truncation, and already `27` is affected! It reaches the value `9232` in the proper non-truncated series (the first truncated value is `322`, in the 12th step), and the maximum value reached for any of the 2-255 input numbers in the non-truncated series is `13120` (for `255` itself); the maximum number of steps to converge to `1` is about `128` (+-2, not sure whether "1" is to be counted, etc.).

Interestingly enough (for me), the number `9232` is the maximum for many other source numbers; what's so special about it? :-O `9232` = `0x2410` ... hmmm... no idea.

Unfortunately I can't get any deep grasp of this series (why it converges, and what the implications of truncating it to *k* bits are), but with the `cmp number,1` terminating condition it's certainly possible to put the algorithm into an infinite loop with a particular input value that ends as `0` after truncation.

But the value `27` overflowing in the 8-bit case is sort of alarming: it looks like if you count the number of steps to reach the value `1`, you will get a wrong result for the majority of numbers from the total set of k-bit integers. For 8-bit integers, 146 of the 256 numbers have their series affected by truncation (some of them may still hit the correct number of steps by accident, maybe; I'm too lazy to check).

#4
You did not post the code generated by the compiler, so there's some guesswork here, but even without having seen it, one can say that this:

test rax, 1
jpe even

... has a 50% chance of mispredicting the branch, and that will be expensive.

The compiler almost certainly does both computations (which costs negligibly more, since the div/mod has quite a long latency, so the multiply-add is "free") and follows up with a CMOV. Which, of course, has a _zero_ percent chance of being mispredicted.


#6
As a generic answer, not specifically directed at this task: in many cases, you can significantly speed up any program by making improvements at a high level, like calculating data once instead of multiple times, avoiding unnecessary work completely, using caches in the best way, and so on. These things are much easier to do in a high-level language.

Writing assembly code, it is _possible_ to improve on what an optimising compiler does, but it is hard work. And once it's done, your code is much harder to modify, so it is much more difficult to add algorithmic improvements. Sometimes the processor has functionality that you cannot use from a high-level language; inline assembly is often useful in these cases while still letting you use a high-level language.

In the Project Euler problems, most of the time you succeed by building something, finding out why it is slow, building something better, finding out why that is slow, and so on. That is very, very hard to do in assembly. A better algorithm at half the possible speed will usually beat a worse algorithm at full speed, and getting the full speed in assembly isn't trivial.
