100 million rows in PHP challenge
with a bit of D
As I finally got my hands on the 1 billion rows challenge in D, a new challenge of similar nature appeared from Brendt, the PHP guy: the 100 million rows challenge in PHP.
The premise is deceptively simple: as an input we get a log of requests with two parameters, route and date, one line for each record, 100 millions in total. The output is a map of routes with its numbers of requests per day. The main condition is to do it all in PHP, without any external tools and with pretty much basic modules enabled, which one could find on any other server with PHP installed.
The original challenge started on the 25th of February and ended on the 15th of March, with leaderboards and prizes from JetBrains for the best performers.
This time I even was able to jump onboard, securing a place close to 10-15th in the parallelism leaderboard at the time of submission, but quickly falling down, finishing on the 35th place in parallelism (1.5s difference from the top-1 solution) and 14th in single thread (3.6s difference). (even saw Danny van Kooten in the leaderboard, hehe)
I found this challenge rather interesting and wanted to write a breakdown of
what I did and which tricks were more effective to speed things up
what the top-1 solutions did differently
and how solution in D would perform against PHP
So let’s start from the beginning.
Baseline
My progress (along with the timings on my machine, same one as in the D blog post) can be found in my fork for the challenge.
The baseline naive solution is pretty simple:
final class Parser
{
public function parse(string $inputPath, string $outputPath): void
{
$out = [];
$handle = fopen($inputPath, 'r');
if ($handle === false) {
throw new Exception("Failed to open input file: {$inputPath}");
}
try {
while (($line = fgets($handle)) !== false) {
$line = rtrim($line, "\r\n");
if ($line !== '') {
[$path, $date] = $this->parseLine($line);
if (!isset($out[$path])) $out[$path] = [];
$out[$path][$date] = ($out[$path][$date] ?? 0) + 1;
}
}
} finally {
fclose($handle);
}
foreach ($out as $path => &$dates) {
ksort($dates);
}
file_put_contents($outputPath, json_encode($out, JSON_PRETTY_PRINT));
}
private function parseLine(string $line): array
{
// skipping https://stitcher.io/, , it's all the same domain anyway
// also skipping 15ch of datetime
$line = substr($line, 19, -15);
return explode(',', $line);
}
}We read the file line by line, parse the lines and write a json_encode output to the file.
This version ended up running for 93s.
Optimizations
Round 1: Chunk reading and custom parsers
The first few optimizations came from the lessons learned during 1brc, along with the basic understanding of PHP’s shortcomings:
Chunk reads. Instead of one syscall per line, read huge blocks at a time with
fread. Track the last newline in the chunk and seek back for any partial line at the end.Custom date parser. Pre-compute a
$dateIdsmap and look up the 8-character date substring directly.Custom jsonizer. Build the output JSON string manually with
fwriteand a write buffer instead ofjson_encode+file_put_contents.Basic custom line parser. Skip known fixed-width regions (URL prefix, timestamp suffix) before looking for the separator.
Turning off garbage collector.
Including the code snippets would take too much space, but you can see the code diff here.
These changes reduced the result to 73s.
Round 2: global namespace, garbage collector and code structure
These changes were relatively small, but gave surprisingly impactful results.
Global namespace function calls. Prefixing built-in calls with
\(e.g.\strpos,\substr) tells PHP to skip the local namespace lookup and call the global function directly. Saves a small amount per call, but at 100M lines it adds up.Embedding chunk logic into the main method. Removing the
processChunkmethod call eliminates per-chunk function call overhead and allows PHP to inline the variables.
public function parse(string $inputPath, string $outputPath): void
{
\gc_disable();
$handle = \fopen($inputPath, 'rb');
\stream_set_read_buffer($handle, 0);
$remaining = \filesize($inputPath);
while ($remaining > 0) {
$toRead = $remaining > self::READ_CHUNK ? self::READ_CHUNK : $remaining;
$chunk = \fread($handle, $toRead);
$chunkLen = \strlen($chunk);
$remaining -= $chunkLen;
$lastNl = \strrpos($chunk, "\n");
$tail = $chunkLen - $lastNl - 1;
if ($tail > 0) {
\fseek($handle, -$tail, SEEK_CUR);
$remaining += $tail;
}
$p = self::PREFIX_LEN;
while ($p < $lastNl) {
$sep = \strpos($chunk, ',', $p);
// ...
$p = $sep + 52;
}
}
}
Result: 31.45s
Round 3: Lookup tables and 1D storage
The two biggest single-threaded wins came from eliminating excessive runtime computations:
Pre-built path and date maps. Scan the first 2 MB of the file upfront to discover all unique slugs and assign each a numeric ID. And also pre-compute all valid dates. This saves us some time on unnecessary string manipulations.
Pre-multiplied base offsets. Basically store
$pathId * $dateCountin the map directly.1D flat array instead of nested associative array. This is probably the most important thing. One lookup instead of two, and PHP integer-keyed arrays are dramatically faster than string-keyed ones.
// as a rough example of what's happening in the actual code
$raw = \fread($handle, self::PATH_SCAN_SIZE);
$pos = 0;
while ($pos < $lastNl) {
$nlPos = \strpos($raw, "\n", $pos + 52);
$slug = \substr($raw, $pos + self::PREFIX_LEN, $nlPos - $pos - 51);
if (!isset($pathIds[$slug])) {
$pathIds[$slug] = $pathCount;
$pathBaseMap[$slug] = $pathCount * $dateCount;
$pathCount++;
}
$pos = $nlPos + 1;
}
//...
$sep = \strpos($chunk, ',', $p);
$idx = $pathBaseMap[\substr($chunk, $p, $sep - $p)] + $dateIds[\substr($chunk, $sep + 3, 8)];
$counts[$idx]++;
$p = $sep + 52;
Result: 15.9s
Bonus: Unwrapping the loop
One of the neat tricks is manually unrolling the loop into 8-12 identical code chunks. This reduces the number of condition checks and might give a performance boost.
Result: 14.3s
Parallelism
This was my first attempt at working with parallelism in PHP, but apparently it’s pretty similar to C in this regard. Basically we do pcntl_fork with one worker per core.
The core idea is to assign a byte range of the file to a worker and let it do its thing, then write the result to a temporary file. Once all workers finish, the main thread takes all the results and sums them.
// rough outline of the idea
$numProcs = \max(1, (int) \shell_exec('nproc') ?: 4);
$ranges = $this->splitRanges($inputPath, $fileSize, $numProcs);
for ($i = 0; $i < $numProcs; $i++) {
$tmp = \tempnam(\sys_get_temp_dir(), 'mrc_');
$pid = \pcntl_fork();
if ($pid === 0) {
$output = $this->processRange(...);
\file_put_contents($tmp, $output);
exit(0);
}
$pidToTmp[$pid] = $tmp;
}
// ...
while ($pending > 0) {
$finished = \pcntl_waitpid(-1, $status);
$proc = \array_values(\unpack('C*', \file_get_contents($pidToTmp[$finished])));
for ($j = 0; $j < $outputSize; $j++) {
$counts[$j] += $proc[$j];
}
$pending--;
}
Result: 3.6s
And the last attempt
As my last change, I added logic with nextchar map (shamelessly looked it up from xHeaven’s solution). Counter stored as raw bytes in a string ($output[$idx] = $next[$output[$idx]]).
$next = [];
for ($i = 0; $i < 255; $i++) {
$next[\chr($i)] = \chr($i + 1);
}
$output = \str_repeat("\0", $outputSize);
$output[$idx] = $next[$output[$idx]];
This gave me 2.79 without file cache and 1.6 with. On the reference hardware of the challenge it took 3.7s.
Learning from the best
While there was plenty of time to try and do some more tweaking, I thought it would be unfair, as at this point all I was doing was just looking at the solutions of top-3 guys. So instead I made an outline of the most important things they did differently, which led to them having even more performance.
I was looking at the top-3 submissions: xHeaven, AcidBurn86 and l0gicnz, with all three being at a very similar level of 2.028s.
The main differences, that gave the biggest performance boosts:
more workers: increasing to 10-12 gives a noticeable boost
using sodium_add for SIMD merge: kind of cheating, offloading all combining to a C function
UNIX sockets instead of temporary files for sync: avoids any unnecessary disk I/O, which gives a very significant boost
getting rid of strpos: it’s famous for bringing performance down, especially when it’s called a huge number of times, so writing a specialized implementation does wonders here
backward traversal with suffix key: going backwards in strings, with a known date and length of the beginning of the URLs, save a lot of time on unnecessary lookups
packed stride+index in a single map value: using byte offsets to get everything in a single value, which saves time on lookups
In addition, some solutions also used dynamic load balancing between the threads, but it didn’t seem to bring any significant improvements.
Bonus: D language implementation
As a point of comparison, I wrote the same solution in D, which runs in 0.4-0.7s: roughly 2–4 times faster than the best PHP solution on the same machine.
The D version is functionally the same, but has a couple of important differences:
it uses std.parallelism, making it easier to work with
instead of built-in associative arrays I used hashmap, like in d1brc
MmFile for loading the file
auto mm = new MmFile(argv[1]);
auto data = cast(const(char)[]) mm[];
foreach (tid; parallel(iota(0, numThreads))) {
processChunk(data, bounds[tid], bounds[tid + 1], &pathMap, localCounts[tid], next);
}
foreach (lc; localCounts)
foreach (i, v; lc)
counts[i] += v;And, of course, MmFile gave the most noticeable speed boost. Hashmap came in second. Not to mention being actually compiled and optimized before the runtime.
Closing thoughts
Well, this was a fun challenge, an interesting experience, and pretty educational.
Seeing the runtime reduced from 93s to 2-3s is already exciting, but exporing the actual tricks behind it is even more fun. And by the time I finish this blog post, I already used a lot of these tricks to rewrite and speed up my beloved LRG2.
The D version was also a fun reality check: while, yes, it’s faster than PHP, and version in something like Go would likely be roughtly at the same level, if not slightly slower, it’s important to keep in mind, that at this point the difference in speed was not that big. Which is surprising in a way: I did not expect PHP to be able to pull this off.
So here’s that for ya’ll. Keep doing fun programming challenges.


