A (somewhat) recent bug fix required a pretty deep dive through our stack, and I think it would be interesting to write up how I discovered the root cause. Nothing here is earth-shattering, but I learned some new stuff along the way that’s worth recording.

Our background processing is mostly handled by cron jobs running every minute on various hosts in our data center. These jobs are designed to run for a short amount of time, and there are controls in place to detect long-running jobs. Each job creates a lockfile, and we won’t start a new job if the number of running jobs (measured by counting lockfiles) is over a certain threshold. If we find a lockfile that is over an hour old, an alert is sent out to the team.

The lockfile creation/deletion process has been in use long enough that we can confidently say long-lived lockfiles are tied to hung processes, not processes that failed to remove their lockfile when they died. When an alert was received, someone would log into the box, find the lockfile, kill the associated process, and remove the file. (Lockfiles contain the PID of the process that created them, to make our lives easier.)
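
To make the scheme concrete, here’s a minimal sketch of the general shape of the lockfile handling (the path, threshold, and helper functions are made up for illustration, not lifted from our codebase):

<?php
// Minimal sketch of the lockfile scheme; the path, the threshold, and the
// helper functions are illustrative, not our actual code.
$lockDir  = '/usr/local/htdocs/locks';
$lockFile = $lockDir . '/' . basename(__FILE__, '.php') . '.' . getmypid() . '.lock';

$existing = glob($lockDir . '/*.lock');
foreach ($existing as $file) {
    if (time() - filemtime($file) > 3600) {   // lockfile is over an hour old
        alert_team("stale lockfile: $file");  // hypothetical alerting helper
    }
}
if (count($existing) >= 10) {                 // too many jobs already running
    exit(0);
}

file_put_contents($lockFile, getmypid());     // the lockfile records our PID

run_job();                                    // hypothetical job body

unlink($lockFile);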

These alerts were received sporadically, and never often enough to dedicate any amount of time to resolving the root cause. We’d just clean up the box, everything would go back to normal and we’d chalk it up to gremlins. Code audits never raised any suspicions and the difficulty in reproducing the hang killed any motivation for digging into the problem.

During a recent team walkthrough of the background processing ecosystem, I noticed there was a hung process without a lockfile and decided it was finally time to run this bug down.

Our cron jobs are just PHP scripts in the controllers/crons directory. The output of ps showed that one job (pid 26489) had been “running” for nearly a month.

$ date
Tue Nov 24 22:36:11 UTC 2015
$ ps aux | grep [p]hp
httpd    26489  0.0  0.0 183008 25088 ?        Ss   Oct27   0:00 /usr/local/bin/php /usr/local/htdocs/application/controllers/crons/cron.php --run=/crons/mturk_cron/main_cron_control
httpd    38398  0.5  0.0 186196 27900 ?        Ss   22:33   0:00 /usr/local/bin/php /usr/local/htdocs/application/controllers/crons/cron.php --run=/crons/mturk_cron/main_cron_control

The first question to answer was: Since this process had been running for nearly a month, why hadn’t we been receiving alerts? After talking things through for a little while, we realized that the lack of alerts was caused by an oversight in our deploy process. The deploy script checks out a specific SHA into a directory named after that SHA, then symlinks that directory to /usr/local/htdocs. Because cron jobs check for lockfiles in the /usr/local/htdocs hierarchy, lockfiles present at the start of a deploy get lost when the symlink changes.

We made two changes to address that oversight:

  • Lockfiles are now written to a location outside of the symlinked directory, making them persistent across deploys

  • Warnings are now issued when a running process tries to remove its lockfile and fails to find it (both changes are sketched below)
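
In code, both changes are small. Roughly (again with illustrative paths and names, not our actual helpers):

<?php
// Illustrative sketch of the two changes above; the paths and names are made up.
// 1. Lockfiles now live outside the symlinked release tree, so flipping the
//    /usr/local/htdocs symlink during a deploy can no longer orphan them.
$jobName  = 'mturk_cron';
$lockFile = '/var/run/cron-locks/' . $jobName . '.lock';

// 2. Complain loudly if the lockfile has vanished by the time we try to remove it.
if (!file_exists($lockFile)) {
    error_log("lockfile missing at cleanup: $lockFile");
} else {
    unlink($lockFile);
}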

After figuring out why we hadn’t received any alerts, I dug into what that process was doing. My gut feeling was that the process was waiting on a resource, not stuck in an infinite loop, but I had no idea how to prove that. Luckily, I found a thing on the internet that explained how to use strace to inspect a process:

$ sudo strace -e trace=open,read -p 26489 -s 80
Process 26489 attached - interrupt to quit
read(13, ^C <unfinished ...>
Process 26489 detached

Hmmm. The process was blocked trying to read from file descriptor 13. What could it have been trying to read? I knew that cron job talked to Amazon’s Mechanical Turk and SQS, but it also read and wrote lockfiles. All of a process’s file descriptors are listed under /proc/<PID>/fd:

$ sudo ls -l /proc/26489/fd
total 0
<...snip...>
lrwx------ 1 httpd httpd 64 Nov 24 17:31 13 -> socket:[202494966]
<...snip...>

Great, the process was trying to read from a socket. (Not too surprising, given that there’s barely any file IO involved in lockfile creation/deletion.) At this point, I felt totally boned. I had no idea how to figure out what socket:[202494966] was connected to. The Internet told me that I could find more information in /proc/net/tcp:

$ head -n1 < /proc/net/tcp ; grep -a 202494966 /proc/net/tcp
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
 107: D3E1160A:E948 32C21548:01BB 01 00000000:00000000 00:00000000 00000000   500        0 202494966 1 ffff881f1d8476c0 27 3 11 10 -1

That told me we were connected to something on port 0x1BB, which meant it was a TLS connection, since 0x1BB is 443 in decimal. But what exactly were we connected to? An IPv4 address is just four octets, and translating the remote address from hex to decimal (and inserting separators) gives an address of 50.194.21.72. WHOIS told me that’s a random IP address owned by Comcast, which felt wrong. It was far more likely that we weren’t actually connected to some random Comcast IP and that I had just screwed up the translation. I then tried translating the hex for our local address, since I had a better idea of what that should have been. The same hex-to-decimal translation gives 211.225.22.10, but it should have given 10.22.225.211. The hand-wavy argument to explain what was going on is “something endianness something something hton something”; the slightly less hand-wavy version is that /proc/net/tcp prints each address as a host-byte-order integer, and on a little-endian x86 box that means the octets come out reversed. The takeaway is that I should have read the octets back to front, and the remote address actually was 72.21.194.50. That’s an IP address owned by Amazon and makes a lot more sense.
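
Here’s a quick PHP sketch of that translation (the function is mine, written for illustration):

<?php
// Quick sketch of the /proc/net/tcp address translation; on a little-endian
// x86 box the octets have to be read back to front.
function hex_to_ip($hex)
{
    $octets = array_reverse(str_split($hex, 2)); // "32C21548" -> ["48", "15", "C2", "32"]
    return implode('.', array_map('hexdec', $octets));
}

echo hex_to_ip('32C21548') . "\n"; // 72.21.194.50  -- the remote end (Amazon)
echo hex_to_ip('D3E1160A') . "\n"; // 10.22.225.211 -- our local address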

At this point, I knew we were connected to Amazon, waiting to read some information from the socket. The particular job I was looking at makes calls to both Mechanical Turk and SQS and those calls are made by two different libraries. So which call was hanging? I got a little lucky here, because it just so happened that a different cron job on a different machine was also hung in exactly the same way:

$ ps aux | grep [p]hp
httpd    31967  0.0  0.0   9196  1184 ?        Ss   Oct21   0:00 /bin/sh -c /usr/local/bin/php /usr/local/htdocs/application/controllers/crons/cron.php --run=/crons/submission_queue/pull_from_queue > /dev/null 2>&1
httpd    31969  0.0  0.0 177904 19020 ?        S    Oct21   0:00 /usr/local/bin/php /usr/local/htdocs/application/controllers/crons/cron.php --run=/crons/submission_queue/pull_from_queue
$ sudo strace -e trace=open,read -p 31969 -s 80
Process 31969 attached - interrupt to quit
read(13, ^C <unfinished ...>
Process 31969 detached
$ sudo ls -l /proc/31969/fd
total 0
<...snip...>
lrwx------ 1 httpd httpd 64 Nov 24 19:22 13 -> socket:[86082021]
<...snip...>
$ head -n1 < /proc/net/tcp ; grep -a 86082021 /proc/net/tcp
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
  51: A0E1160A:E204 54D61548:01BB 01 00000000:00000000 00:00000000 00000000   500        0 86082021 1 ffff882066ad2300 48 3 13 3 7

That particular job doesn’t talk to SQS; it only talks to MTurk, so I was able to focus on the library that communicates with MTurk.

This is where I went off the rails a little bit. Our hand-rolled MTurk library builds the required URL for a given API endpoint and then loads it using PHP’s simplexml_load_file. I couldn’t find much information about how that function works, so I cloned the PHP git repo and started reading the source. It turns out that simplexml_load_file actually calls out to libxml2. After downloading the libxml2 source tarball and digging through that code for about thirty minutes, I realized that libxml2 doesn’t handle HTTPS endpoints itself. Instead, PHP registers callbacks with libxml2 to read from TLS streams. Back in the PHP repo, I found where PHP streams are registered as read handlers for libxml2, in ext/libxml/libxml.c:

static xmlParserInputBufferPtr
php_libxml_input_buffer_create_filename(const char *URI, xmlCharEncoding enc)
{
	xmlParserInputBufferPtr ret;
	void *context = NULL;
	TSRMLS_FETCH();

	if (LIBXML(entity_loader_disabled)) {
		return NULL;
	}

	if (URI == NULL)
		return(NULL);

	context = php_libxml_streams_IO_open_read_wrapper(URI);

	if (context == NULL) {
		return(NULL);
	}
	/* Allocate the Input buffer front-end. */
	ret = xmlAllocParserInputBuffer(enc);
	if (ret != NULL) {
		ret->context = context;
		ret->readcallback = php_libxml_streams_IO_read;
		ret->closecallback = php_libxml_streams_IO_close;
	} else
		php_libxml_streams_IO_close(context);
	return(ret);
}

and php_libxml_streams_IO_read just calls out to php_stream_read:

static int php_libxml_streams_IO_read(void *context, char *buffer, int len)
{
	TSRMLS_FETCH();
	return php_stream_read((php_stream*)context, buffer, len);
}
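
So from the application’s point of view, the whole chain boils down to something like this (the URL is illustrative):

<?php
// Roughly the application-level view; the URL is illustrative, not a real request.
// simplexml_load_file() opens the URL as a PHP TLS stream, libxml2 pulls data
// through php_libxml_streams_IO_read() -> php_stream_read(), and that read is
// exactly where our hung processes were stuck.
$url = 'https://mechanicalturk.amazonaws.com/?Operation=SomeOperation';
$xml = simplexml_load_file($url);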

After all this, I finally knew what to Google for. I immediately found this bug related to php_stream_read and SSL/TLS: https://bugs.php.net/bug.php?id=41631

The underlying problem was (quoting from the ticket) “the SSL_read() function in the relevant code section would block indefinitely waiting for data on a blocking socket connection.” The bug was fixed in PHP 5.4.33 and our boxen were running 5.4.13. (Incidentally, this bug hadn’t cropped up earlier because we ran a different point release of PHP in AWS and had to downgrade when we moved inside of Groupon’s data center.)

While upgrading our version of PHP would have fixed the issue, various institutional reasons made that difficult. The fix we settled on was to replace our usage of simplexml_load_file with curl. (This had the added benefit of giving us better error reporting when an API call failed.)
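
The replacement looks roughly like this (a simplified sketch rather than our production code; the timeout values and function name are illustrative):

<?php
// Simplified sketch of the simplexml_load_file() replacement; the timeouts and
// the function name are illustrative.
function fetch_xml($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // hand back the body instead of printing it
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);   // give up if we can't connect
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // give up if the transfer stalls

    $body = curl_exec($ch);
    if ($body === false) {
        $error = curl_error($ch);                   // real error reporting, finally
        curl_close($ch);
        throw new RuntimeException("API call failed: $error");
    }
    curl_close($ch);

    return simplexml_load_string($body);            // parse the XML we actually received
}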