Modwest Mages versus Lazy Linking Sorcery
This post from Modwest Mage (aka Systems Engineer) Tom:
Once upon a time, we wrote a yarn of PHP Versions and I/O woe. Mired in a battle of attrition, we enlisted a tough, gritty, mysterious stranger to bolster the ranks of our demoralized filesystem army. However, even a legendary hero could not rescue us from inevitable performance peril, and our wizards have been furtively developing new majicks to provide salvation for us and our citizens. As one of these wizards, I have my own chronicle to relate, although it lacks the glory of front-line battle.
My most recent task was to freshen our shared system webservers with the newest release of their operating system. By doing so, I would be opening the portal that would, in the near future, allow our webserver software to communicate with our new filesystem architecture. Unexpected foes riddled the veins of the project, and the task became a trial that spanned many days. The challenges tore at my sanity; my therapy is to share the hurdles with you.
The most daunting problem was rooted in Linux's "Lazy Dynamic Linking." While I am not sure I even completely understand -- it is magic, after all -- I will do my best to paraphrase the situation and solution.
First, for perspective, let me describe how our webservers handle CGI scripts. A request for a CGI triggers our modified sbox. Sbox chroots the script inside the shared system environment, and then runs the script as the owner of the script, not as the webserver user. We call this process "jailing".
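For readers who prefer runes to prose, here is a minimal sketch of the jailing order described above, written in Python rather than the C of the real sbox. The names `script_owner` and `run_jailed`, the jail layout, and the refusal to run root-owned scripts are all assumptions made for illustration; this is not our modified sbox, only the general shape of "chroot first, then become the script's owner, then run it".

```python
import os


def script_owner(script_path):
    """Return the (uid, gid) that own the script file."""
    st = os.stat(script_path)
    return st.st_uid, st.st_gid


def run_jailed(jail_root, script_path):
    """Chroot into jail_root, drop to the script owner's identity, exec the script.

    Hypothetical helper: must be started as root, and script_path is the
    path as seen from outside the jail. On success this never returns,
    because the current process is replaced by the script.
    """
    uid, gid = script_owner(script_path)
    if uid == 0:
        raise PermissionError("refusing to run a root-owned script")

    # Work out where the script lives once jail_root has become "/".
    relative = os.path.relpath(script_path, jail_root)
    if relative.startswith(".."):
        raise ValueError("script is not inside the jail")
    inside_path = "/" + relative

    os.chroot(jail_root)   # from here on, every path resolves inside the jail
    os.chdir("/")          # do not keep a working directory outside the jail
    os.setgroups([])       # drop supplementary groups while still root
    os.setgid(gid)         # the group must change before the user does
    os.setuid(uid)         # after this call, root privileges are gone
    os.execv(inside_path, [inside_path])
```

The order matters: the chroot and the group changes need root privileges, so the switch to the script owner has to come last.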
My upgrade plan included no changes to jailing. The only thing that was to change was the underlying operating system. So, after rebuilding a webserver from scratch, I was surprised to see this error when attempting to run any CGI script:
/usr/lib/apache/suexec: relocation error: /lib/libnss_compat.so.2: symbol _nss_files_parse_grent, version GLIBC_2.0 not defined in file libc.so.6 with link time reference.
While the error was clue enough to hint at C, I am not much of a C programmer. I was beginning to get a bad feeling about this. Like any good apprentice, hoping it was common lore, I asked the arcane compendium seer for knowledge. Alas, information was sparse, and relevant tomes did not exist. I was on my own.
After days entombed in my computing cave, I had gathered shockingly little data that helped isolate the problem.
But then, a flicker of hope sparkled from a configuration in nscd. With passwd caching off, CGI scripts began to run just fine, without explanation!
However, as any Linux hedge wizard knows, nscd yields massive performance gains on name lookups, and webservers like ours do these lookups constantly. Without solving this problem, the webservers would be too lethargic to work as a foundation for our project. Growing weak from monitor radiation and frustration, I found the situation grim.
I was certain I had failed king and country. Desperate, we summoned a magician from another fiefdom. One afternoon later, filled with gdb, strace, and ltrace, his expertise finally revealed the true nature of our error:
warning: .dynamic section for "/lib/libnss_compat.so.2" is not at the expected address
Error while mapping shared library sections:
/lib/i686/mmx/libnsl.so.1: No such file or directory.
Error while reading shared library symbols:
/lib/i686/mmx/libnsl.so.1: No such file or directory.
Error while reading shared library symbols:
/lib/i686/mmx/libnsl.so.1: No such file or directory.
He got there with a clever technique: adding a sleep statement to the C code gave him time to attach the debugging programs to the process. With that, he finally received the output above and instantly knew the problem.
Lazy Linking normally records library functions in the Procedure Linkage Table, but doesn't resolve them when the program starts. Each function is bound to its real address only when it is first called. This is fine for normal environments, but in this instance the libraries had been mapped outside of the jail, and ld.so was then trying to resolve the functions while inside the jail, where the same paths lead to different files. In other words, after deciding that we would use the host system's libraries, we then later tried to use the jail libraries instead. Or, as the computer would say, 'dynamic section for "/lib/libnss_compat.so.2" is not at the expected address.'
The key to modifying this behavior lies in a simple environment variable.
Running with LD_BIND_NOW=1 solved all of the library problems I had seen. The variable tells ld.so to resolve every function when the program starts, rather than waiting until each one is first called. We lose a little bit of efficiency at startup, because the linker may be resolving functions the program never actually uses, but it guarantees we use the libraries we think we are using.
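As a small illustration of the spell, the sketch below launches a child process with LD_BIND_NOW set, again in Python for brevity. The helper names `eager_binding_env`, `run_with_eager_binding`, and `child_sees_flag` are made up for this example, and it says nothing about how the variable is actually set for our webservers. It only shows that the variable travels through the environment, so it must be in place before the program starts, because that is when ld.so reads it.

```python
import os
import subprocess
import sys


def eager_binding_env(base_env=None):
    """Return a copy of the environment with LD_BIND_NOW switched on."""
    env = dict(os.environ if base_env is None else base_env)
    # Any non-empty value asks ld.so to resolve all symbols at startup.
    env["LD_BIND_NOW"] = "1"
    return env


def run_with_eager_binding(argv):
    """Run argv as a child process whose dynamic linker binds eagerly."""
    return subprocess.run(
        argv,
        env=eager_binding_env(),
        capture_output=True,
        text=True,
        check=True,
    )


def child_sees_flag():
    """Ask a child Python interpreter what it sees in LD_BIND_NOW."""
    result = run_with_eager_binding(
        [sys.executable, "-c", "import os; print(os.environ['LD_BIND_NOW'])"]
    )
    return result.stdout.strip()
```

From a shell, the equivalent is to prefix a single command with LD_BIND_NOW=1, which sets the variable for that command only.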
Another ritual complete, but does it mean we're any closer to smashing the stalemate?
Websites on our shared hosting cluster aren't affected by this change, aside from any miscast spells on my part. The change does mean that we can, within a few weeks, deploy our new Apache stack that talks to the new filesystem. Assuming that deployment goes well, the beta testing that we've hinted about will start to become a reality a few weeks later.
Soon after, I hope we may close the chapter on this epic battle and enter an age of renaissance.
-TRC
Dude. Can you write that again in _English_? I assume it affects my hosting but I don't really know what you're talking about.
Posted by: Daffyd | November 18, 2008 at 05:52 PM
Hmmmm, in English, I'd summarize:
* We are working hard to improve our shared hosting system,
* We had a roadblock for a while,
* We overcame the roadblock!
We hope the techie explanation might help someone else out there with a similar issue.
Keep watching the blog for further updates about our progress improving services for everyone. And if you have ideas and feedback to share, please join the conversation at http://feedback.modwest.com
-JM
Posted by: John Masterson | November 18, 2008 at 06:47 PM
It may not be English...but some things can be lost in translation. I actually appreciate the pure, honest, explanation with details that help the greater good.
Maybe next time there can be a link for the non-technical explanation...a version of both. Everyone then happy.
Posted by: eferrini | November 21, 2008 at 08:46 AM
I appreciate the work that goes on behind the scenes to improve services for everyone and welcome technical explanations in any writing style that's aimed at helping others.
What I find strange is the repeated references to the post not being in English. Granted the English may not be plain and simple, but if the level required is that of the subsequent summary then that would make the blog less appealing to even more people and less useful too.
So, I don't think it's necessary to keep everyone happy with links to multiple explanations in the same language, not when the comments section is clearly sufficient for the odd occasion that it's necessary.
Posted by: Squiggle | January 07, 2009 at 05:57 PM