LaTeX to HTML/MathML conversion

With accessibility legislation requiring videos to be captioned by September 23, 2020, and with our teaching being online, our document provision needs to be accessible.

While LaTeX itself is quite good for accessibility, the process of converting it to PDF removes almost all of the mark up structure. There are some LaTeX packages that help, but really the dvips, pdflatex and ps2pdf programs all need to be rewritten to put accessibility issues at the heart of the conversion, and to retain all the mark up information.

In the absence of any such initiative, it seems that a web-based output is probably optimal, since accessibility is more strongly built in, and the mark up structure is even more deeply embedded.

For maths, we have the usual problem of equations and symbols. There is a special mark up format for maths, MathML, which is supposed to be the recommended format for screen readers. It is, however, very verbose, and not convenient for typing by hand.

Luckily, MathJax is pretty good at accessibility, and has access to MathML built in, in some sense (I don't quite understand this).

Simplest method: using htlatex

There are various ways to convert LaTeX to HTML (see Matthew Towers' blog posts for a comparison).

From Chris Hughes' TALMO talk, I learned of one way to do this conversion. I haven't been able to get his script working, but I did get something similar (which I discovered by following up links from his talk) to work quite well.

We download a "configuration file" process.cfg (879 bytes), and a batch file process.bat (218 bytes).

Put these in the same directory as the file course.tex to be converted, and type

process course

from a command prompt in that directory to get an output. It takes a little time, but course.tex is converted to course.pdf and also course.html. The former file is identical to what you would get if you ran LaTeX as normal; the latter file is an HTML output, with MathJax/MathML enabled (at the time of writing, Firefox is the browser that best renders MathML).

I ran this for the notes of an entire 130-page lecture course; the PDF output was the same as I already had, and the HTML version (1.3MB) was also pretty good, as far as it went. However, not all of the images were found (and these would need alt-text to be added manually anyway), the MathML didn't work completely, and only the first 95 pages or so displayed in the browser. Presumably a little further editing would sort that out and get things working fully.

In the short term, this seems to me to be the easiest way to go. The output is less nice than that of the next solution, however, and I will recommend the RMarkdown/bookdown approach for the longer term, for various reasons.

Better longer-term method: RMarkdown/Bookdown

To me, this looks more promising as a longer-term solution.

You will need to have R and RStudio installed. Open RStudio, and type


in the left-hand window.

If you are converting old LaTeX files, you should also install pandoc, which converts documents from one format to another.

For conversion of an existing LaTeX course, course.tex, the idea is fairly simple:

  1. Convert the LaTeX file to a Markdown file course.txt using pandoc. This is done simply by typing

pandoc -s course.tex -o course.txt

from the command line. Rename course.txt as course.Rmd, the RMarkdown format. (But see the issues below.)

  2. Make this into an RMarkdown project. This involves putting it into a directory with a couple of other files: an index.Rmd (1MB) file, and a course.Rproj (295 bytes) file. The latter file is an "R project file", and can be ignored; the former file should be edited with information about the course (course title, lecturer etc.).

  3. In fact, we are going to be using bookdown format, since it works well with lecture notes, or any other long document. I broke up course.Rmd into a file for each chapter, and then edited those. RStudio will recognise this, since that's part of the R project file. So I have a master file course.Rproj, and a file index.Rmd with just the meta information, 01-introduction.Rmd for Chapter 1 Introduction, 02-notation.Rmd for Chapter 2 Notation, and so on.

  4. Double click on course.Rproj to open it in RStudio. A directory listing should appear in the bottom-right pane (under the Files tab); the top-right pane contains a tab "Build", and when you click on that, you will see a heading "Build Book". It is this that runs everything.

  5. So click on "Build Book", and the "knitting" process begins. From the RMarkdown files, we get outputs _main.tex and _main.pdf, and a collection of HTML files such as index.html and notation.html, all in the _book directory. (If you get that far. A lot of editing may be needed to get everything processing as you and RStudio would like.) Given that I already have a nice PDF I am happy with, I may simply rename it _main.pdf and replace the output of the process.

  6. The resulting HTML file, index.html, is very nicely presented, with maths rendered with MathJax (so MathML is accessible, at least with Firefox). There is an option to download the PDF in the menu at the top of the page.

  7. The _book directory contains all the output needed; just copy it to the web, possibly renaming things more suitably. In particular, we get an HTML version of the lecture notes, using MathJax (which is accessible), and there is an option for the reader to download a PDF or an EPUB.
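The chapter split described in the steps above (one .Rmd file per chapter, named like 01-introduction.Rmd) could probably be automated rather than done by hand. What follows is only a minimal sketch in Python, assuming that pandoc has written each chapter as a level-1 heading on a line beginning with "# ". The function names split_chapters and write_chapters are hypothetical, and anything before the first chapter heading is simply dropped.

```python
import re
from pathlib import Path


def split_chapters(text: str) -> list[tuple[str, str]]:
    """Return (filename, content) pairs, one per level-1 heading."""
    # Split just before every line that starts with "# " (assumed chapter heading).
    parts = re.split(r"(?m)^(?=# )", text)
    chapters = []
    for part in parts:
        if not part.startswith("# "):
            continue  # skip anything before the first chapter heading
        title = part.splitlines()[0][2:].strip()
        # Lower-case the title and collapse everything else into hyphens.
        slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
        chapters.append((f"{len(chapters) + 1:02d}-{slug}.Rmd", part))
    return chapters


def write_chapters(text: str, out_dir: str) -> None:
    """Write each chapter to its own numbered .Rmd file in out_dir."""
    for name, content in split_chapters(text):
        Path(out_dir, name).write_text(content, encoding="utf-8")
```

A line inside a code block that happens to start with "# " would be mistaken for a chapter heading, so the result still needs checking by eye.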

Overall, I recommend this bookdown solution as a longer term way forward. I don't know the RMarkdown syntax well enough yet, and it took me a couple of days of solid editing to get the outputs looking as I would wish in the first attempt (I expect this to go down to a few hours for future courses). But it's pretty decent. Best of all is the notion of a "code chunk", where you add in code, and the output appears in the HTML and PDF documents. R code is written in the document using

```{r} ... ```

and the output documents contain the outputs (calculations or plots). There are many other engines, including for Python, and just about any other language we might use.


The main issue is that RMarkdown is a superset of Markdown. Pandoc's conversion from LaTeX to Markdown is excellent as far as it goes, but some details are lost. In particular, RMarkdown (via bookdown) has support for theorem environments, but plain Markdown does not. The conversion process silently drops LaTeX commands it doesn't understand, and therefore strips the \begin{theorem} and \end{theorem} commands, leaving what looks like a plain text paragraph. You should go back and edit the file to add ```{theorem} where the \begin{theorem} was, and ``` where the \end{theorem} was.

Alternatively, it is not hard to write some AWK (exe, 967MB) scripts: one script before.awk (953 bytes) which should be run before the pandoc conversion to replace \begin{theorem} with a text fragment like BEGthm which will be retained by the conversion, and another script after.awk (822 bytes) to replace BEGthm with ```{theorem} when run after the conversion (and similarly for the \end{theorem} commands, and all other similar environments). A batch script, conv.bat (110 bytes) allows you to type

conv course

to do the conversion in a single command.
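For readers who would rather not use AWK, the same before/after idea can be sketched in Python. This is not the content of before.awk or after.awk, just an illustration of the technique. The marker BEGthm is from the text above; ENDthm, the lemma entry, the function names, and the choice to put each marker in its own paragraph are all assumptions made for the example.

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep the listing readable

# Hypothetical map from LaTeX environment name to marker suffix.
ENVIRONMENTS = {"theorem": "thm", "lemma": "lem"}


def before(latex: str) -> str:
    """Run before pandoc: swap \\begin{env} and \\end{env} for plain-text markers."""
    for env, tag in ENVIRONMENTS.items():
        # Blank lines keep each marker in a paragraph of its own (an assumption).
        latex = latex.replace(f"\\begin{{{env}}}", f"\n\nBEG{tag}\n\n")
        latex = latex.replace(f"\\end{{{env}}}", f"\n\nEND{tag}\n\n")
    return latex


def after(markdown: str) -> str:
    """Run after pandoc: swap the markers for bookdown chunk fences."""
    for env, tag in ENVIRONMENTS.items():
        markdown = re.sub(rf"^BEG{tag}[ \t]*$", f"{FENCE}{{{env}}}",
                          markdown, flags=re.M)
        markdown = re.sub(rf"^END{tag}[ \t]*$", FENCE, markdown, flags=re.M)
    return markdown
```

The pandoc run goes between the two calls, so the whole thing is before, then pandoc, then after, which is what conv.bat chains together.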

For me, perhaps the main problem has been tikz environments for drawing pictures. Bookdown can run tikz, with a tikz engine, ```{tikz}, but pandoc will delete all the contents of the environment. In order to avoid this, one can enclose it in a \begin{verbatim} environment for the purposes of the conversion: so we add to the before.awk script something to convert \begin{tikzpicture} to, say, BEGtikz\begin{verbatim}, and \end{tikzpicture} to \end{verbatim}ENDtikz. The conversion will delete the \begin{verbatim} commands, but will now preserve the contents (but indented in the file, so that they will appear as a verbatim environment in the output). Then we need to replace BEGtikz with ```{tikz}\n\begin{tikzpicture} in after.awk, and do something about the indentations and blank lines. Only partly implemented yet.
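The tikz round trip just described might look something like the following in Python. It is a sketch under assumptions, not the partly implemented AWK version: it assumes pandoc leaves BEGtikz and ENDtikz on lines of their own around an indented block, and the function names wrap_tikz and unwrap_tikz are made up for the example.

```python
import re
import textwrap

FENCE = "`" * 3  # three backticks, built this way to keep the listing readable


def wrap_tikz(latex: str) -> str:
    """Run before pandoc: hide each tikzpicture body inside a verbatim block."""
    latex = latex.replace("\\begin{tikzpicture}", "BEGtikz\\begin{verbatim}")
    return latex.replace("\\end{tikzpicture}", "\\end{verbatim}ENDtikz")


def unwrap_tikz(markdown: str) -> str:
    """Run after pandoc: turn each marked, indented block into a tikz chunk."""
    pattern = re.compile(r"^BEGtikz[ \t]*\n(.*?)^ENDtikz[ \t]*$", re.S | re.M)

    def to_chunk(match: re.Match) -> str:
        # Remove the verbatim indentation and the surrounding blank lines.
        body = textwrap.dedent(match.group(1)).strip("\n")
        return (f"{FENCE}{{tikz}}\n\\begin{{tikzpicture}}\n"
                f"{body}\n\\end{{tikzpicture}}\n{FENCE}")

    return pattern.sub(to_chunk, markdown)
```

The dedent-and-strip step is one way of dealing with the indentation and blank lines mentioned above; blank lines inside the picture itself are kept.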

\label and \ref need dealing with too. RMarkdown is quite restrictive about what is permitted here. I haven't implemented anything for this in the scripts yet, though I may do so soon, so for now manual editing will be needed afterwards.
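As a starting point for the \ref half of this, here is a tentative Python sketch. It relies on bookdown's \@ref(key) cross-reference syntax, and on my understanding that bookdown labels may contain only letters, digits, ":", "-" and "/". Replacing every other character with a hyphen is a choice made for the example, and both function names are hypothetical. The matching \label commands are not handled at all, and since pandoc may rewrite \ref itself, where this step belongs in the pipeline is left open.

```python
import re


def to_bookdown_label(key: str) -> str:
    """Replace characters outside letters, digits, ':', '-' and '/' with '-'."""
    return re.sub(r"[^A-Za-z0-9:/-]", "-", key)


def rewrite_refs(text: str) -> str:
    """Turn each \\ref{key} into bookdown's \\@ref(key) form."""
    return re.sub(
        r"\\ref\{([^}]*)\}",
        lambda m: f"\\@ref({to_bookdown_label(m.group(1))})",
        text,
    )
```

For example, a reference to thm:main_result becomes \@ref(thm:main-result), so the label on the theorem itself has to be renamed to match.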

Images need treatment to get them sized appropriately; this isn't automatic; what looks good in the HTML file may not look good in the PDF and vice versa, so some experimentation might be needed (this is largely why I may replace the default PDF file with what I already had).

The images in the PDF are largely done with the figure environment. This can float the figures far from where they were meant to be; this might not be consistent with how the original LaTeX source was set out. I will probably redefine these as caption environments or something. Not yet implemented.

Again, one needs to add alt-text for images to the HTML file manually. I haven't yet worked out whether there is a way to do this in the RMarkdown document; if there isn't, then every time the files are built, one would have to make all the alt-text additions again. Are the captions really enough?

I've done this for only one course so far, but I think the output certainly justifies looking at this approach in more detail, and I expect to use it for all my courses from now on, given the accessibility legislation. I need to try it on shorter documents as well, such as example sheets, and I hope to write more on this later.

Other markdown solutions

PreTeXt looks like a good alternative for authoring courses.

I haven't had experience of using this; it clearly has the advantage over the bookdown approach of (apparently) allowing arbitrary LaTeX theorem numbering. Bookdown is a bit more restrictive in what it allows: there are about 10 preset theorem names allowed, and numberings can't be merged. But development of RMarkdown seems more active than development of PreTeXt, as far as I can tell, and this might change in the future. (I have seen a GitHub page that does this already, apparently.)

I get the impression that it may be harder to do the conversion process in PreTeXt than in RMarkdown, but I'm not sure.

Further remarks

An HTML file can be more dynamic, of course; one could insert links to videos, or embed them; or use forms to have some quizzing (there are R packages to help with this), so in principle, a single web page might suffice for the whole course.