Nodes Not Setting DNS When Using /etc/hosts in Perceus

While moving a cluster from assigning node IPs via DHCP to static assignment, I ran into problems with the nodes not properly setting the DNS server in /etc/resolv.conf when using /etc/hosts in Perceus to assign IP addresses. The nodes were defaulting to the values already present in the vnfs capsule instead of updating them. This caused very slow SSH login times, because the DNS server listed in the vnfs capsule's /etc/resolv.conf did not exist.

I wasn't able to figure out how to fix this issue properly, so for now I just updated the resolv.conf file in the vnfs capsule.
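In practice the workaround is just editing /etc/resolv.conf inside the capsule's file system. On a head node you would expose the capsule first with `perceus vnfs mount <name>` and fold the change back in with `perceus vnfs umount <name>` (the capsule name below is a made-up example). In this sketch a scratch directory stands in for the mounted capsule root, and the nameserver and search domain are example values:

```shell
# On a real head node:
#   perceus vnfs mount centos-5.stateless.x86_64
# then edit <mountpoint>/etc/resolv.conf. Here a scratch directory
# stands in for the mounted capsule root.
CAPSULE_ROOT=$(mktemp -d)
mkdir -p "$CAPSULE_ROOT/etc"

# Point the capsule at a DNS server that actually exists
# (192.168.1.1 and cluster.local are example values):
cat > "$CAPSULE_ROOT/etc/resolv.conf" <<'EOF'
search cluster.local
nameserver 192.168.1.1
EOF

cat "$CAPSULE_ROOT/etc/resolv.conf"
# Afterwards, unmount so the change lands back in the capsule:
#   perceus vnfs umount centos-5.stateless.x86_64
```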

For some background on what led to this problem, please see this old post: [intlink id="87" type="post"]old post[/intlink].

Nodes Not Setting Hostname When Using /etc/hosts for Static IPs in Perceus

Recently I've been working on moving a cluster from assigning node IPs with DHCP to statically defined IPs, in order to work around Torque/Moab not starting when it is unable to resolve the name of every node.

To do this, I entered all of the relevant information into the /etc/hosts file. But after doing this and rebooting, the nodes were no longer automatically setting their hostnames, which had previously been retrieved from the DHCP server. Instead, the prompt would look like the following after logging in:

[root@localhost ~]#

This can be solved by enabling the Perceus hostname module.

perceus module activate hostname

After enabling this module, the prompt should look like the following when logging into the nodes:

[root@n0000 ~]#
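For reference, the static entries added to /etc/hosts on the Perceus master look something like this (the addresses and node names here are made-up examples); the hostname module then sets each node's hostname from them:

```
10.0.0.100   n0000
10.0.0.101   n0001
10.0.0.102   n0002
```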

Problem with Perceus dhcpd import script

After a couple of days of banging my head against the wall trying to figure out why the import script kept giving this cryptic error, we finally submitted a question to the Perceus mailing list.

Undefined subroutine &main::add_node called at ./ line 46,  line 8.

I’m sure if I knew Perl that error wouldn’t have been so confusing.

Here’s a diff for anyone who’s interested.

<    if ( $_ =~ /^\s*host\s+([^\s]+)\s*{\s*$/ ) {
>    if ( $_ =~ /^\s*host\s+([^\s]+)\s*{?\s*$/ ) {
<       &add_node($1, $hostname);
> print "Adding: $1, $hostname\n";
>       &node_add($1, $hostname);
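The functional part of the fix is the `{?` in the regex: the stock script only matched `host` lines whose opening brace sat on the same line. The difference can be seen with grep's extended regexes against a made-up dhcpd.conf fragment (MACs and IPs below are examples):

```shell
# Sample dhcpd.conf fragment; note n0001 puts its opening brace on the
# next line, which the stock pattern missed.
conf=$(mktemp)
cat > "$conf" <<'EOF'
host n0000 {
	hardware ethernet 00:11:22:33:44:55;
	fixed-address 10.0.0.100;
}
host n0001
{
	hardware ethernet 00:11:22:33:44:66;
	fixed-address 10.0.0.101;
}
EOF

# Stock pattern: brace required on the same line as "host" -- counts 1.
grep -cE '^[[:space:]]*host[[:space:]]+[^[:space:]]+[[:space:]]*\{[[:space:]]*$' "$conf"

# Patched pattern: "{?" makes the brace optional -- counts both hosts.
grep -cE '^[[:space:]]*host[[:space:]]+[^[:space:]]+[[:space:]]*\{?[[:space:]]*$' "$conf"
```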

New Nodes failing to get DHCP IP after booting in Perceus

As of Perceus 1.4, nodes will no longer automatically get an IP via DHCP. You must first enable and configure the ipaddr module in Perceus.

perceus module activate ipaddr

Then edit the ipaddr config file at /etc/perceus/modules/ipaddr. Uncommenting the last line seems to be sufficient for most configurations. If your machines do not have their second Ethernet card plugged in, it is worth removing the eth1 portion, as this significantly reduces boot times.

* eth0:[default]/[default] eth1:[default]/[default]/[default]
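The same file also accepts per-node entries. Extrapolating from the field layout of the default line above (so treat the exact syntax as an assumption and double-check it against the comments shipped in the config file), static assignments look roughly like this, with made-up addresses:

```
n0000 eth0:10.0.0.100/255.255.255.0/10.0.0.1
n0001 eth0:10.0.0.101/255.255.255.0/10.0.0.1
```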

Reload the Perceus service, then restart the nodes and they should automatically get a new IP address.

Perceus “ERROR No such host: binsh”

I've been working a lot with Perceus at work, and I figured I would put up some posts about problems I have encountered and their possible solutions.

Today I was attempting to boot a node with Perceus 1.3.8 installed. The node would download and run the first kernel, but when it attempted to begin provisioning with provisiond it would exit with this error:

ERROR No such host: binsh

The node would then loop endlessly through the following while loop, printing the error once a second immediately after running provisiond:

# Excerpt from:

while [ ! -f "/next" ]; do
	# If this works we won't even get a chance to say goodbye!
	# If it errors out, we need to touch /next to
	# iterate to next count and/or interface.
	if [ $INIT_DEBUG -eq 0 ]; then
	   provisiond -s /bin/sh $MASTERIP init || touch /next
	elif [ $INIT_DEBUG -eq 1 ]; then
	   provisiond -v -s /bin/sh $MASTERIP init || touch /next
	else
	   provisiond -d -s /bin/sh $MASTERIP init || touch /next
	fi
	sleep 1
done

I was able to find a reference to the error message in the source code for provisiond. Initially I thought the node was passing "/bin/sh" to the server instead of the master's IP address, but after trying various command-line parameters I decided to look elsewhere.

Eventually I noticed that provisiond was running as a service on the head node, even though provisiond should only run on provisioned nodes. Uninstalling provisiond from the head node seemed to fix the problem. Unfortunately, I tried a couple of other ideas at the same time, so I cannot be absolutely sure that provisiond was the cause.
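If you hit the same symptom, a quick first check is whether provisiond is running on the head node at all. The check below is generic; how you stop and remove the daemon depends on how it was installed, so that part is left as a comment:

```shell
# provisiond should run on the nodes being provisioned, not on the
# head node. pgrep -x matches the exact process name.
if pgrep -x provisiond > /dev/null; then
	echo "provisiond is running on this machine"
else
	echo "provisiond is not running on this machine"
fi
# If it is running on the head node, stop the service and remove the
# package (service and package names depend on your installation).
```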

If I get a chance I will do a more thorough test to make sure that I am correct.

edit: Never got a chance to test if this worked correctly. If anyone was able to test this situation I would be interested in hearing about it.