
Converting Linux man pages into html

One of the things I am hoping to provide on my Penguin Tutor website is online access to the current Linux man pages. For those not familiar with the concept of the manual page, you enter man followed by the name of a command and it displays information about that command, with a bit of formatting.

e.g. man ls
will give you information on the ls command.

There’s a bit more to it, but that’s enough for now. If you want to know more then from a linux command line enter
man man

The problem is that man pages are not stored in HTML, but in a macro format (roff) which is parsed by the groff program. Searching for commands to convert them brings up a few different options, but I had limited success with them. groff itself should be able to convert the files, but it didn’t work on my system.

groff -man -Thtml <filename>
should convert the page to HTML, but instead gave the error:
groff: can't find `DESC' file
groff:fatal error: invalid device `html'

which suggests that the DESC file describing the html output device is missing. I couldn’t find a copy of the file, so I looked at other methods.
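One way to confirm that diagnosis is to look for the html device directory directly. The sketch below assumes groff keeps its device descriptions under /usr/share/groff or /usr/local/share/groff (these paths are an assumption and vary between systems), and find_groff_html_desc is a made-up helper name, not part of groff.

```shell
# find_groff_html_desc is a hypothetical helper, not part of groff.
# It prints the path of any devhtml/DESC file found under the given roots.
find_groff_html_desc() {
    for root in "$@"; do
        if [ -d "$root" ]; then
            find "$root" -type f -path '*/devhtml/DESC'
        fi
    done
}

# Assumed install locations; no output means no html device was found there.
find_groff_html_desc /usr/share/groff /usr/local/share/groff
```

If nothing is printed, the html device is probably not installed. On Debian-based systems it appears to be shipped in the full groff package rather than in groff-base, so installing that package may be enough, although I have not verified this for every release.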

Normally at this point I would probably have written my own converter, as the format does not appear to be too difficult to parse. However, I want to launch the site soon, so I didn’t really want to spend a lot of time writing yet another program.

One program I did look at was called man2html, but the version I downloaded didn’t appear to do much more than add html tags at the top and bottom, still leaving the rest of the file incorrectly formatted. I later found a different version that was available via apt-get on Ubuntu. This was installed using
apt-get install man2html

This version also handles formatting within the page. It’s designed to run as a cgi script, so I wrote a little wrapper to run the command against all the man pages in certain directories and create their output. This is a quick hack and relies on the output directories being created before it is run.

The code is:


#!/usr/bin/perl
use strict;
use warnings;

#### Quick hack - convert all man pages to html
#### Do not rely on this

# Source and dest folders (top level)
# Must end with /
my $manfolder = "/tmp/man/man-pages-2.31/";
my $htmlfolder = "/tmp/man/html/";

my $convertcmd = "/usr/bin/man2html";

my @subfolders = ("man0p", "man1", "man1p", "man2", "man3", "man3p", "man4", "man5", "man6", "man7", "man8", "man9");

foreach my $thisfolder (@subfolders)
{
	# The matching directory under $htmlfolder must already exist
	opendir (my $indir, "$manfolder$thisfolder") or die "unable to ls $manfolder$thisfolder: $!";
	while (defined (my $filename = readdir $indir))
	{
		# Skip ., .. and hidden files
		if ($filename =~ /^\./) {next;}
		my $source = "$manfolder$thisfolder/$filename";
		my $dest = "$htmlfolder$thisfolder/$filename.html";
		print ("$convertcmd $source to $dest\n");
		# Let the shell redirect the converter's stdout into the html file
		system ("$convertcmd '$source' > '$dest'") == 0 or warn "$convertcmd failed for $source\n";
	}

	closedir $indir;
}
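
Since the script relies on the output directories already existing, they need to be created first. A minimal sketch of that step is below; make_html_dirs is a hypothetical helper name, and the section list simply mirrors the one in the script.

```shell
# make_html_dirs is a hypothetical helper: it creates one output directory
# per man section under the destination root given as its first argument.
make_html_dirs() {
    dest=$1
    for section in man0p man1 man1p man2 man3 man3p man4 man5 man6 man7 man8 man9; do
        # -p creates missing parents and does not complain if the directory exists
        mkdir -p "$dest/$section"
    done
}

# Example: make_html_dirs /tmp/man/html
```

Running the example line before the Perl script should leave a matching html directory for every section the script reads.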

The output files have an HTTP content header at the top, because man2html is designed to run as a CGI script. This is not a big deal for me as I need to strip the headers anyway. A slightly harder problem is that the output is HTML 4, and not XHTML. It looks like it may need some significant changes to convert the files to XHTML, so for now I’ll probably just have to serve the man pages as HTML 4.0.
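
Stripping the header could be done with a small filter along these lines. This is only a sketch: it assumes the header is a single block starting with a Content-type line and ending at the first blank line, and strip_cgi_header is a made-up name rather than anything provided by man2html.

```shell
# strip_cgi_header is a hypothetical filter: it reads a page on stdin and
# drops a leading header block (a first line starting with "Content-type:"
# up to and including the first blank line). Input with no such header is
# passed through unchanged.
strip_cgi_header() {
    awk '
        NR == 1 && /^[Cc]ontent-[Tt]ype:/ { in_header = 1 }
        in_header { if ($0 ~ /^\r?$/) in_header = 0; next }
        { print }
    '
}

# Example: strip_cgi_header < ls.1.html > ls.1.clean.html
```

Only a header on the very first line is removed, so blank lines later in the page are left alone.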