Swift performance: map() and reduce() vs for loops

#1
I'm writing some performance-critical code in Swift. After implementing all the optimizations I could think of, and profiling the application in Instruments, I came to realize that the vast majority of CPU cycles are spent performing `map()` and `reduce()` operations on arrays of Floats. So, just to see what would happen, I replaced all instances of `map` and `reduce` with good old-fashioned `for` loops. And to my amazement... the `for` loops were much, much faster!

A bit puzzled by this, I decided to perform some rough benchmarks. In one test, I had `map` return an array of Floats after performing some simple arithmetic like so:

import Foundation

// Populate array with 1,000,000,000 random numbers
var array = [Float](count: 1_000_000_000, repeatedValue: 0)
for i in 0..<array.count {
    array[i] = Float(random())
}
let start = NSDate()
// Construct a new array, with each element from the original multiplied by 5
let output = array.map({ (element) -> Float in
    return element * 5
})
// Log the elapsed time
let elapsed = NSDate().timeIntervalSinceDate(start)
print(elapsed)

And the equivalent `for` loop implementation:

var output = [Float]()
for element in array {
    output.append(element * 5)
}

Average execution time for `map`: 20.1 seconds. Average execution time for the `for` loop: 11.2 seconds. Results were similar when using `Int` instead of `Float`.

I created a similar benchmark to test the performance of Swift's `reduce`. This time, `reduce` and `for` loops achieved nearly the same performance when summing the elements of one large array. But when I loop the test 100,000 times like this:

// Populate array with 1,000,000 random numbers
var array = [Float](count: 1_000_000, repeatedValue: 0)
for i in 0..<array.count {
    array[i] = Float(random())
}
let start = NSDate()
// Perform operation 100,000 times
for _ in 0..<100_000 {
    let sum = array.reduce(0, combine: {$0 + $1})
}
// Log the elapsed time
let elapsed = NSDate().timeIntervalSinceDate(start)
print(elapsed)

vs:

for _ in 0..<100_000 {
    var sum: Float = 0
    for element in array {
        sum += element
    }
}

The `reduce` method takes 29 seconds while the `for` loop takes (apparently) 0.000003 seconds.

Naturally I'm ready to disregard that last test as the result of a compiler optimization, but I think it may give some insight into how the compiler optimizes differently for `for` loops vs Swift's built-in array methods. Note that all tests were performed with -Os optimization on a 2.5 GHz i7 MacBook Pro. Results varied depending on array size and number of iterations, but `for` loops always outperformed the other methods by at least 1.5x, sometimes up to 10x.

I'm a bit perplexed about Swift's performance here. Shouldn't the built-in Array methods be faster than the naive approach for performing such operations? Maybe somebody with more low-level knowledge than I can shed some light on the situation.

#2
> Shouldn't the built-in Array methods be faster than the naive approach
> for performing such operations? Maybe somebody with more low-level knowledge than I can shed some light on the situation.

I just want to attempt to address this part of the question at a conceptual level, with a "not necessarily". My answer comes more from a background in compiler design and computer architecture than from any deep knowledge of the nature of Swift's optimizer.

**Calling Overhead**

With functions like `map` and `reduce` accepting functions as inputs, the optimizer is, to put it one way, under greater strain. Short of some very aggressive optimization, the natural temptation in such a case is to constantly branch back and forth between the implementation of, say, `map` and the closure you provided, and likewise to transmit data between these disparate branches of code (through registers and the stack, typically).

That kind of branching/calling overhead is very difficult for the optimizer to eliminate, especially given the flexibility of Swift's closures (not impossible, but conceptually quite difficult). C++ optimizers can inline function object calls, but with far more restrictions and code generation techniques required to do it: the compiler effectively has to generate a whole new set of instructions for `map` for each type of function object you pass in, and with the explicit aid of the programmer, in the form of a function template used for the code generation.

So it shouldn't be a great surprise to find that your hand-rolled loops can perform faster -- they put a great deal less strain on the optimizer. I have seen some people claim that these higher-order functions should be able to go faster because the vendor can do things like parallelize the loop, but effectively parallelizing the loop would first require the kind of information that would typically allow the optimizer to inline the nested function calls to the point where they become as cheap as hand-rolled loops. Otherwise, the function/closure implementation you pass in will be effectively opaque to functions like `map`/`reduce`: they can only call it and pay the overhead of doing so, and they cannot parallelize it, since they cannot assume anything about its side effects and thread safety.

Of course this is all conceptual -- Swift may be able to optimize these cases in the future, or it may already be able to do so now (see `-Ofast`, a commonly-cited way to make Swift go faster at the cost of some safety). But at the very least, using these kinds of functions places a heavier strain on the optimizer than hand-rolled loops do, and the time differences you're seeing in the first benchmark seem to reflect the kind of differences one might expect with this additional calling overhead. The best way to find out is to look at the assembly and try various optimization flags.
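
To make that concrete, here is a contrived sketch in current Swift syntax (the name `applyOp` and the numbers are illustrative, not from the benchmarks above): the same multiply-by-5 arithmetic routed through an opaque function parameter, versus written directly in the loop body. Comparing the assembly generated for the two under different optimization flags is one way to see whether the indirect call survives.

// A contrived sketch of the calling-overhead idea (illustrative names).
func applyOp(_ values: [Float], _ op: (Float) -> Float) -> [Float] {
    var result: [Float] = []
    result.reserveCapacity(values.count)
    for value in values {
        result.append(op(value))   // indirect call per element unless inlined
    }
    return result
}

let input: [Float] = (0..<1_000).map { Float($0) }

let viaClosure = applyOp(input) { $0 * 5 }   // overhead if `op` is not inlined

var direct: [Float] = []
direct.reserveCapacity(input.count)
for value in input {
    direct.append(value * 5)                 // arithmetic visible to the optimizer
}

print(viaClosure[0], direct[0])              // use the results so neither loop is dead code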

**Standard Functions**

That's not to discourage the use of such functions. They express intent more concisely, they can boost productivity, and relying on them could allow your codebase to get faster in future versions of Swift without any involvement on your part. But they aren't necessarily always going to be faster. It's a good general rule that a higher-level library function which more directly expresses what you want to do will be faster, but there are always exceptions to the rule -- exceptions best discovered in hindsight with a profiler in hand, since it's far better here to err on the side of trust than distrust.

**Artificial Benchmarks**

As for your second benchmark, it is almost certainly the result of the compiler optimizing away code that has no side effects affecting user output. Artificial benchmarks are notoriously misleading because of what optimizers do to eliminate irrelevant side effects (essentially, side effects that don't affect user output). So when you construct a benchmark whose times seem too good to be true, be careful that they aren't the result of the optimizer merely skipping all the work you actually wanted to measure. At the very least, your tests should output some final result gathered from the computation.
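
As a minimal sketch of that last point, in current Swift syntax (the array size and iteration count here are arbitrary): folding each iteration's result into a checksum that is printed at the end gives the optimizer an observable side effect it cannot discard.

import Foundation

// Minimal sketch: keep an observable result so the optimizer cannot
// discard the benchmarked work as dead code.
let values = (0..<1_000_000).map { _ in Float.random(in: 0..<1) }

var checksum: Float = 0
let start = Date()
for _ in 0..<100 {
    checksum += values.reduce(0, +)   // each iteration feeds the checksum
}
print("elapsed:", Date().timeIntervalSince(start))
print("checksum:", checksum)          // printing makes the work observable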

#3
I did a quick set of performance tests measuring repeated transformations on an Array of Strings, and they showed `map` to be much more performant than a `for` loop, by a factor of about 10x.

The results in the screenshot below show that chained transformations in a single `map` block outperform multiple `map`s with a single transformation in each, and that any use of `map` outperforms `for` loops.

[Screenshot: demonstration of map vs for loop performance]


Code I used in a Playground:

import Foundation
import XCTest

class MapPerfTests: XCTestCase {
    // 85 copies of the same string, collapsed from the original
    // hand-written literal
    var array = Array(repeating: "MyString", count: 85)

    func testForLoopAllInOnePerf() {
        measure {
            var newArray: [String] = []
            for item in array {
                newArray.append(item.uppercased().lowercased().uppercased().lowercased())
            }
        }
    }

    func testForLoopMultipleStagesPerf() {
        measure {
            var newArray: [String] = []
            for item in array {
                // Each stage feeds the next, matching the chained variants
                let t1 = item.uppercased()
                let t2 = t1.lowercased()
                let t3 = t2.uppercased()
                let t4 = t3.lowercased()
                newArray.append(t4)
            }
        }
    }

    func testMultipleMapPerf() {
        measure {
            let newArray = array
                .map( { $0.uppercased() } )
                .map( { $0.lowercased() } )
                .map( { $0.uppercased() } )
                .map( { $0.lowercased() } )
        }
    }

    func testSingleMapPerf() {
        measure {
            let newArray = array
                .map( { $0.uppercased().lowercased().uppercased().lowercased() } )
        }
    }
}

MapPerfTests.defaultTestSuite.run()




#4
I cannot say much about your first test (`map()` vs `append()` in a loop),
but I can confirm your results. The append loop becomes even faster if
you add

output.reserveCapacity(array.count)

after the array creation. It seems that Apple can improve things here,
and you might want to file a bug report.
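
For illustration, here is the question's append loop with that one line added (a sketch assuming the same `array` of Floats from the question):

var output = [Float]()
output.reserveCapacity(array.count)   // preallocate once instead of growing repeatedly
for element in array {
    output.append(element * 5)
}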

In

for _ in 0..<100_000 {
    var sum: Float = 0
    for element in array {
        sum += element
    }
}

the compiler (probably) removes the entire loop
because the computed results are not used at all.
I can only speculate why a similar optimization does not happen in

for _ in 0..<100_000 {
    let sum = array.reduce(0, combine: {$0 + $1})
}

but it would be more difficult to decide whether calling `reduce()` with the closure has any side effects or not.

If the test code is changed slightly to calculate *and print* a total sum

do {
    var total = Float(0.0)
    let start = NSDate()
    for _ in 0..<100_000 {
        total += array.reduce(0, combine: {$0 + $1})
    }
    let elapsed = NSDate().timeIntervalSinceDate(start)
    print("sum with reduce:", elapsed)
    print(total)
}

do {
    var total = Float(0.0)
    let start = NSDate()
    for _ in 0..<100_000 {
        var sum = Float(0.0)
        for element in array {
            sum += element
        }
        total += sum
    }
    let elapsed = NSDate().timeIntervalSinceDate(start)
    print("sum with loop:", elapsed)
    print(total)
}

then both variants take about 10 seconds in my test.


