Login

I had to replace all the null bytes in a file with another character (I arbitrarily chose `@`), and was pretty surprised that `tr '\00' '@'` was about 1/4 the speed of `gzip`:

$ pv < lawl | gzip > /dev/null
^C13MiB 0:00:04 [28.5MiB/s] [====> ] 17% ETA 0:00:18
$ pv < lawl | tr '\00' '@' > /dev/null
^C58MiB 0:00:08 [7.28MiB/s] [==> ] 9% ETA 0:01:20

My real data file is 3GB gzipped and took 50 minutes to `tr`, and I'll actually need to do this on many such files, so it's not a completely academic problem. Note that reading from disk (a reasonably fast SSD here), or `pv`, isn't the bottleneck in either case; both `gzip` and `tr` are using 100% CPU, and `cat` is much faster:

$ pv < lawl | cat > /dev/null
642MiB 0:00:00 [1.01GiB/s] [================================>] 100%

This code:

#include <stdio.h>

int main() {
int ch;
while ((ch = getchar()) != EOF) {
if (ch == '\00') {
putchar('@');
} else {
putchar(ch);
}
}
}

compiled with `clang -O3` is somewhat faster:

$ pv < lawl | ./stupidtr > /dev/null
^C52MiB 0:00:06 [ 8.5MiB/s] [=> ] 8% ETA 0:01:0

Compiling with `gcc -O4 -mtune=native -march=native` (4.8.4) is comparable, maybe very slightly faster. Adding `-march=native` to clang (`Apple LLVM version 6.1.0 (clang-602.0.53) (based on LLVM 3.6.0svn)`) produces an identical binary.

This is presumably just because the generic processing code for replacements in `tr` is replaced with constants and the checks can be compiled down. The LLVM IR (`clang -S -O3 stupidtr.c`) looks pretty good.

I guess `gzip` must be faster because it's doing something SIMD instructions or something. Is it possible to get this up to `gzip` speeds?

Some specifications, if they're relevant:

- The file is a CSV; the null byte can only occur in a certain field, but some of the other fields are variable-length, so you can't just seek around arbitrarily. Most lines have a null byte in that field. I suppose this means you could do a [Boyer-Moore](

[To see links please register here]

) search for `,\00,`, if that'd help. Once you've found a null byte, it's also guaranteed that there can't be another one for a hundred bytes or so.

- A typical file is about 20 GiB uncompressed, but are bz2 compressed on disk, if that's relevant.

- You can parallelize if you want, though `gzip` does this with one so it shouldn't be necessary. I'll be running this either on a quad-core i7 running OSX or a two-vCPU cloud server running Linux.

- Both machines I might run on have 16GB of RAM.

Your code is incorrect because you do not test for end of file at the right spot. It is a **very** common mistake in `do {} while` loops. I recommend to avoid this construct completely (except in macros to convert sequences of statements into a single statement).

Also try and tell the glibc to perform fewer checks on the stream:

#include <stdio.h>

int main() {
int c;
while ((c = getchar_unlocked()) != EOF) {
if (c == '\0')
c = '@':
putchar_unlocked©;
}
}

You can also play with different buffer sizes, for example try these before the `while()` loop:

setvbuf(stdin, NULL, _IOFBF, 1024 * 1024);
setvbuf(stdout, NULL, _IOFBF, 1024 * 1024);

It should not have much impact if you use the utility as a filter with pipes, but it may be more efficient if you use files.

If you use files, you can also `mmap` the file and use `memchr` to find the `'\0'` bytes, or even `strchr` that might be faster and you can ensure there is a `'\0`` at the end of the file (putting it there is a good way).

You need to use block reads and writes for speed. (Even with a buffered I/O library like stdio.h, the cost of managing the buffer can be significant.) Something like:

#include <unistd.h>
int main( void )
{
char buffer[16384];
int size, i;
while ((size = read(0, buffer, sizeof buffer)) > 0) {
for( i = 0; i < size; ++i ) {
if (buffer[i] == '\0') {
buffer[i] = '@';
// optionally, i += 64; since
// "Once you've found a null byte, it's also guaranteed that there can't
// be another one for a hundred bytes or so"
}
}
write(1, buffer, size);
}
}

Naturally, compile with optimizations so that the compiler can transform indexing into pointer arithmetic if helpful.

This version also lends itself well to SIMD optimizations if you still aren't meeting your speed targets (or a smart enough compiler may vectorize the `for` loop automatically).

Furthermore, this code lacks robust error handling. As @chqrlie mentions in a comment, you should retry when you get `-EINTR`, and you should handle partial writes.

First, as others have noted, don't use `getchar()/putchar()`, or even any of the FILE-based methods such as `fopen()/fread()/fwrite()`. Use `open()/read()/write()` instead.

If the file is already uncompressed on disk, do not use pipes. If it is compressed, then you DO want to use a pipe in order remove an entire read/write cycle. If you uncompress from disk back to disk, then replace the NUL characters, the data path is disk->memory/cpu->disk->memory/cpu->disk. If you use a pipe, the path is disk->memory/cpu->disk. If you're disk-limited, that extra read/write cycle will just about double the time it takes to process your gigabytes (or more) of data.

One more thing - given your IO pattern and the amount of data your moving - read an entire multi-GB file, write the entire file - the page cache is only getting in your way. Use direct IO, thusly in C on Linux (headers and robust error checking left off for clarity):

#define CHUNK_SIZE ( 1024UL * 1024UL * 4UL )
#define NEW_CHAR '@'
int main( int argc, char **argv )
{
/* page-aligned buffer */
char *buf = valloc( CHUNK_SIZE );
/* just set "in = 0" to read a stdin pipe */
int in = open( argv[ 1 ], O_RDONLY | O_DIRECT );
int out = open( argv[ 2 ], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644 );

for ( ;; )
{
ssize_t bytes = read( in, buf, CHUNK_SIZE );
if ( bytes < 0 )
{
if ( errno == EINTR )
{
continue;
}
break;
}
else if ( bytes == 0 )
{
break;
}

for ( int ii = 0; ii < bytes; ii++ )
{
if ( !buf[ ii ] )
{
buf[ ii ] = NEW_CHAR;
}
}
write( out, buf, bytes );
}

close( in );
close( out );

return( 0 );
}

Crank up the compiler optimization as high as it goes. To use this code on real data, you do need to check the results of the `write()` call - direct IO on Linux is a real finicky beast. I've had to close a file opened with O_DIRECT and reopen a it without direct IO set in order to write the last bytes of a file on Linux when the last bits weren't a multiple of a full page.

If you want to go even faster, you can multithread the process - one thread reads, one thread translates chars, and another thread writes. Use as many buffers, passing them from thread to thread, as necessary to keep the slowest part of the process busy at all times.

If you're really interested in seeing how fast you can move data, multithread the reads/writes, too. And if your filesystem support it, use asynchronous read/write.

Combining ideas from the various answers with some extra bithacks, here is an optimized version:

#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BUFFER_SIZE 16384
#define REPLACE_CHAR '@'

int main(void) {
/* define buffer as uint64_t to force alignment */
/* make it one slot longer to allow for loop guard */
uint64_t buffer[BUFFER_SIZE/8 + 1];
ssize_t size, chunk;
uint64_t *p, *p_end;
uint64_t rep8 = (uint8_t)REPLACE_CHAR * 0x0101010101010101ULL;

while ((size = read(0, buffer, BUFFER_SIZE)) != 0) {
if (size < 0) {
if (errno == EINTR) continue;
fprintf(stderr, "read error: %s\n", strerror(errno));
return 1;
}
p = buffer;
p_end = p + ((size + 7) >> 3);
*p_end = 0ULL; /* force a 0 at the end */
for (;; p++) {
#define LOWBITS 0x0101010101010101ULL
#define HIGHBITS 0x8080808080808080ULL
uint64_t m = ((*p - LOWBITS) & ~*p & HIGHBITS);
if (m != 0) {
if (p >= p_end) break;
m |= m >> 1;
m |= m >> 2;
m |= m >> 4;
*p |= m & rep8;
}
}
for (unsigned char *pc = (unsigned char *)buffer;
(chunk = write(1, pc, (size_t)size)) != size;
pc += chunk, size -= chunk) {
if (chunk < 0) {
if (errno == EINTR) continue;
fprintf(stderr, "write error: %s\n", strerror(errno));
return 2;
}
}
}
return 0;
}

eagre178908

stylidiumsdpprnnreq

fierce217

audrieaudris954

spryest174204