Memory bandwidth for many-channel x86 systems

#1
I'm testing the memory bandwidth on a desktop and a server.

Skylake desktop: 4 cores / 8 hardware threads
Skylake server: Xeon 8168, dual socket, 48 cores (24 per socket) / 96 hardware threads

The peak bandwidth of each system is

Peak bandwidth desktop = 2 channels * 8 bytes/transfer * 2400 MT/s = 38.4 GB/s
Peak bandwidth server  = 2 sockets * 6 channels * 8 bytes/transfer * 2666 MT/s = 255.94 GB/s
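
(Sanity-checking that arithmetic: the 8 is bytes per 64-bit channel transfer, and MT/s times bytes gives MB/s. A minimal sketch of the calculation; the helper name is my own, not from the post.)

#include <stdio.h>

/* Peak DRAM bandwidth in GB/s = sockets * channels
 * * 8 bytes per 64-bit transfer * rate in MT/s * 1E-3 (MB/s -> GB/s). */
static double peak_bw_gbs(int sockets, int channels, double mt_per_s) {
    return sockets * channels * 8 * mt_per_s * 1E-3;
}

int main(void) {
    printf("desktop: %.2f GB/s\n", peak_bw_gbs(1, 2, 2400)); /* 38.40  */
    printf("server:  %.2f GB/s\n", peak_bw_gbs(2, 6, 2666)); /* 255.94 */
}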

I'm using my own version of the triad function from STREAM to measure the bandwidth (full code later):

void triad(double *a, double *b, double *c, double scalar, size_t n) {
    #pragma omp parallel for
    for(size_t i=0; i<n; i++) a[i] = b[i] + scalar*c[i];
}
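
(If you want to confirm what the compiler does with this loop, GCC's -fopt-info-vec flag prints a note for every loop it vectorizes; this check is my suggestion, not something from the original post.)

gcc -O3 -march=skylake-avx512 -mprefer-vector-width=512 -fopenmp -fopt-info-vec -c triad.c
objdump -d triad.o | grep -m5 zmm    # spot-check for 512-bit registers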

Here are the results I get:

Bandwidth (GB/s)

threads   Desktop   Server
1           28        16
2 (24)      29       146
4 (48)      25       177
8 (96)      24       189

(thread counts in parentheses are the ones used on the server)

For one thread I don't understand why the desktop is so much faster than the server. According to this answer, SSE is sufficient to get the full bandwidth of a dual-channel system. That's what I observe on the desktop: two threads only help slightly, and 4 and 8 threads give worse results. But on the server the single-threaded bandwidth is much lower. **Why is this?**

On the server I get the best results using 96 threads. I would have thought the bandwidth would saturate with far fewer threads. **Why are so many threads necessary to saturate the bandwidth on the server?** There is a large margin of error in my results and I don't include an error estimate; I took the best result of several runs.

The code

//gcc -O3 -march=native triad.c -fopenmp
//gcc -O3 -march=skylake-avx512 -mprefer-vector-width=512 triad.c -fopenmp
#include <stdio.h>
#include <omp.h>
#include <x86intrin.h>

void triad_init(double *a, double *b, double *c, double k, size_t n) {
    #pragma omp parallel for
    for(size_t i=0; i<n; i++) a[i] = k, b[i] = k, c[i] = k;
}

void triad(double *a, double *b, double *c, double scalar, size_t n) {
    #pragma omp parallel for
    for(size_t i=0; i<n; i++) a[i] = b[i] + scalar*c[i];
}

// Same kernel with non-temporal (streaming) stores, which bypass the cache
// and skip the read-for-ownership of the destination array.
void triad_stream(double *a, double *b, double *c, double scalar, size_t n) {
#if defined(__AVX512F__) || defined(__AVX512__)
    __m512d scalarv = _mm512_set1_pd(scalar);
    #pragma omp parallel for
    for(size_t i=0; i<n/8; i++) {
        __m512d bv = _mm512_load_pd(&b[8*i]), cv = _mm512_load_pd(&c[8*i]);
        _mm512_stream_pd(&a[8*i], _mm512_add_pd(bv, _mm512_mul_pd(scalarv, cv)));
    }
#else
    __m256d scalarv = _mm256_set1_pd(scalar);
    #pragma omp parallel for
    for(size_t i=0; i<n/4; i++) {
        __m256d bv = _mm256_load_pd(&b[4*i]), cv = _mm256_load_pd(&c[4*i]);
        _mm256_stream_pd(&a[4*i], _mm256_add_pd(bv, _mm256_mul_pd(scalarv, cv)));
    }
#endif
}

int main(void) {
    size_t n = 1LL << 31LL;
    double *a = _mm_malloc(sizeof *a * n, 64);
    double *b = _mm_malloc(sizeof *b * n, 64);
    double *c = _mm_malloc(sizeof *c * n, 64);
    //double peak_bw = 2*8*2400*1E-3;  // 2 channels * 8 bytes/transfer * 2400 MT/s
    double peak_bw = 2*6*8*2666*1E-3;  // 2 sockets * 6 channels * 8 bytes/transfer * 2666 MT/s
    double dtime, mem, bw;
    printf("peak bandwidth %.2f GB/s\n", peak_bw);

    triad_init(a, b, c, 3.14159, n);
    dtime = -omp_get_wtime();
    triad(a, b, c, 3.14159, n);
    dtime += omp_get_wtime();
    mem = 4*sizeof(double)*n*1E-9, bw = mem/dtime;  // 4 streams: read b, read c, read + write a
    printf("triad: %3.2f GB, %3.2f s, %8.2f GB/s, bw/peak_bw %8.2f %%\n", mem, dtime, bw, 100*bw/peak_bw);

    triad_init(a, b, c, 3.14159, n);
    dtime = -omp_get_wtime();
    triad_stream(a, b, c, 3.14159, n);
    dtime += omp_get_wtime();
    mem = 3*sizeof(double)*n*1E-9, bw = mem/dtime;  // 3 streams: streaming stores skip the read of a
    printf("triads: %3.2f GB, %3.2f s, %8.2f GB/s, bw/peak_bw %8.2f %%\n", mem, dtime, bw, 100*bw/peak_bw);
}
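
For reference, here is one way to build and run it to reproduce the table above. OMP_PLACES/OMP_PROC_BIND and numactl are standard tools, but the exact pinning and interleave policy is my assumption, not something stated in the post:

gcc -O3 -march=skylake-avx512 -mprefer-vector-width=512 triad.c -fopenmp
# single thread:
OMP_NUM_THREADS=1 ./a.out
# all 96 hardware threads, spread over both sockets, pages interleaved across NUMA nodes:
OMP_NUM_THREADS=96 OMP_PLACES=threads OMP_PROC_BIND=spread numactl --interleave=all ./a.out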


#2
The hardware prefetcher is tuned differently on server vs workstation CPUs. Servers are expected to handle many threads, so the prefetcher will request smaller chunks from RAM. Here is a paper that goes into detail about the issue you're experiencing, but from the other side of the coin:

Hardware Prefetcher Aggressiveness Controllers: Do We Need Them All the Time?
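
If you want to test this directly, Intel documents per-core prefetcher controls in MSR 0x1A4, and the msr-tools package can toggle them. A sketch of the experiment (my suggestion, not part of the paper or this answer):

# MSR 0x1A4 bits (set bit = disable): bit 0 = L2 HW prefetcher,
# bit 1 = L2 adjacent-line prefetcher, bit 2 = DCU streaming prefetcher,
# bit 3 = DCU IP prefetcher. Requires root.
sudo modprobe msr
sudo rdmsr -p 0 0x1a4        # read the current setting on core 0
sudo wrmsr -a 0x1a4 0xf      # disable all four prefetchers on every core
./a.out                      # re-run the triad benchmark
sudo wrmsr -a 0x1a4 0x0      # restore the defaults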




