07-27-2023, 11:08 AM
I had to replace all the null bytes in a file with another character (I arbitrarily chose `@`), and was pretty surprised that `tr '\00' '@'` was about 1/4 the speed of `gzip`:
$ pv < lawl | gzip > /dev/null
^C13MiB 0:00:04 [28.5MiB/s] [====> ] 17% ETA 0:00:18
$ pv < lawl | tr '\00' '@' > /dev/null
^C58MiB 0:00:08 [7.28MiB/s] [==> ] 9% ETA 0:01:20
My real data file is 3GB gzipped and took 50 minutes to `tr`, and I'll actually need to do this on many such files, so it's not a completely academic problem. Note that reading from disk (a reasonably fast SSD here), or `pv`, isn't the bottleneck in either case; both `gzip` and `tr` are using 100% CPU, and `cat` is much faster:
$ pv < lawl | cat > /dev/null
642MiB 0:00:00 [1.01GiB/s] [================================>] 100%
This code:
#include <stdio.h>
int main() {
int ch;
while ((ch = getchar()) != EOF) {
if (ch == '\00') {
putchar('@');
} else {
putchar(ch);
}
}
}
compiled with `clang -O3` is somewhat faster:
$ pv < lawl | ./stupidtr > /dev/null
^C52MiB 0:00:06 [ 8.5MiB/s] [=> ] 8% ETA 0:01:0
Compiling with `gcc -O4 -mtune=native -march=native` (4.8.4) is comparable, maybe very slightly faster. Adding `-march=native` to clang (`Apple LLVM version 6.1.0 (clang-602.0.53) (based on LLVM 3.6.0svn)`) produces an identical binary.
This is presumably just because the generic processing code for replacements in `tr` is replaced with constants and the checks can be compiled down. The LLVM IR (`clang -S -O3 stupidtr.c`) looks pretty good.
I guess `gzip` must be faster because it's doing something SIMD instructions or something. Is it possible to get this up to `gzip` speeds?
Some specifications, if they're relevant:
- The file is a CSV; the null byte can only occur in a certain field, but some of the other fields are variable-length, so you can't just seek around arbitrarily. Most lines have a null byte in that field. I suppose this means you could do a [Boyer-Moore](
- A typical file is about 20 GiB uncompressed, but are bz2 compressed on disk, if that's relevant.
- You can parallelize if you want, though `gzip` does this with one so it shouldn't be necessary. I'll be running this either on a quad-core i7 running OSX or a two-vCPU cloud server running Linux.
- Both machines I might run on have 16GB of RAM.
$ pv < lawl | gzip > /dev/null
^C13MiB 0:00:04 [28.5MiB/s] [====> ] 17% ETA 0:00:18
$ pv < lawl | tr '\00' '@' > /dev/null
^C58MiB 0:00:08 [7.28MiB/s] [==> ] 9% ETA 0:01:20
My real data file is 3GB gzipped and took 50 minutes to `tr`, and I'll actually need to do this on many such files, so it's not a completely academic problem. Note that reading from disk (a reasonably fast SSD here), or `pv`, isn't the bottleneck in either case; both `gzip` and `tr` are using 100% CPU, and `cat` is much faster:
$ pv < lawl | cat > /dev/null
642MiB 0:00:00 [1.01GiB/s] [================================>] 100%
This code:
#include <stdio.h>
int main() {
int ch;
while ((ch = getchar()) != EOF) {
if (ch == '\00') {
putchar('@');
} else {
putchar(ch);
}
}
}
compiled with `clang -O3` is somewhat faster:
$ pv < lawl | ./stupidtr > /dev/null
^C52MiB 0:00:06 [ 8.5MiB/s] [=> ] 8% ETA 0:01:0
Compiling with `gcc -O4 -mtune=native -march=native` (4.8.4) is comparable, maybe very slightly faster. Adding `-march=native` to clang (`Apple LLVM version 6.1.0 (clang-602.0.53) (based on LLVM 3.6.0svn)`) produces an identical binary.
This is presumably just because the generic processing code for replacements in `tr` is replaced with constants and the checks can be compiled down. The LLVM IR (`clang -S -O3 stupidtr.c`) looks pretty good.
I guess `gzip` must be faster because it's doing something SIMD instructions or something. Is it possible to get this up to `gzip` speeds?
Some specifications, if they're relevant:
- The file is a CSV; the null byte can only occur in a certain field, but some of the other fields are variable-length, so you can't just seek around arbitrarily. Most lines have a null byte in that field. I suppose this means you could do a [Boyer-Moore](
[To see links please register here]
) search for `,\00,`, if that'd help. Once you've found a null byte, it's also guaranteed that there can't be another one for a hundred bytes or so.- A typical file is about 20 GiB uncompressed, but are bz2 compressed on disk, if that's relevant.
- You can parallelize if you want, though `gzip` does this with one so it shouldn't be necessary. I'll be running this either on a quad-core i7 running OSX or a two-vCPU cloud server running Linux.
- Both machines I might run on have 16GB of RAM.