
Josh Stillman


Embrace Your Laziness: Automatically Convert Word Documents into Terms & Conditions Pages


Modern Single Page Applications (SPAs) often embed terms and conditions pages into the app itself for a slick and modern feel. While this makes for a great user experience, it can be tedious and time-consuming for developers to convert long Microsoft Word documents of legal copy into HTML/JSX that can be embedded into a terms and conditions component or modal. But fret not, fellow developer! With some macOS and shell utilities, you can let the computer handle the drudgery, so you can focus on something more important.

Simply using some built-in command line programs in macOS will do the trick, converting the Word document into clean HTML you can paste into your terms and conditions component! Let's see how—and why.

Laziness Is a Virtue

Larry Wall, the creator of the Perl programming language, argued that laziness is one of the primary virtues of a good programmer. It's what makes coders "write labor-saving programs that other people will find useful." And indeed, a good engineer will "go to great effort to reduce overall energy expenditure" by finding opportunities to make processes more efficient.

Along these lines, I'd say that a primary coding virtue is the ability to identify which tasks are rote, repetitive, and best delegated to a computer, and which tasks instead require human creativity, problem solving, and ingenuity. We'll only ever have time and energy for the latter category if we find a way to let the computers handle the boring, repetitive stuff.

The Inevitable Terms & Conditions Ticket

It's inevitable. When developing a new SPA, there will come a day that you or your teammate will be assigned a ticket to create an embedded terms and conditions page or modal. (That's our litigious modern society. Sigh...) Typically, a developer is handed a Word document from the legal department and some fancy designs, and left to figure out the rest.

The most painstaking approach would be to manually copy each paragraph, add any bold and italic formatting, and wrap it in appropriate HTML tags. This can take a while if it's a long Word document! And it won't be pleasant. Our laziness instincts should be kicking in about now.

We can make this process a little less manual through this nifty VS Code extension. It will let us wrap each paragraph or sentence of text in the appropriate <p>, <b>/<strong>, or <i>/<em> tags. But it's still a pretty manual process of copying, pasting, and formatting. How can we fully automate this?

There's a CLI for That

Good news! macOS ships with a command line tool called textutil that excels at converting documents into different formats. It can convert a Word document into HTML in a single terminal command: textutil -convert html -strip terms.docx. This will take your Word document, leave out its metadata (that's what -strip does), and convert it into basic HTML markup, written by default to terms.html next to the original file. Paragraphs will be wrapped in <p> tags, and bold and italic formatting tags will be added as well. No more need to go through the document paragraph by paragraph yourself to create the markup. And it even works on other document formats, such as .txt and .rtf files. Joy!
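As a rough sketch of that first step, the snippet below runs the conversion only when textutil and the input file are actually present, so it is safe to paste on any machine. The file name terms.docx is the article's example; the src and expected variables are names made up for this illustration.

```shell
# Assumes terms.docx (the article's example file name) sits in the current directory.
src="terms.docx"

# By default textutil writes next to the source file, swapping the extension for .html.
expected="${src%.*}.html"

if command -v textutil >/dev/null 2>&1 && [ -f "$src" ]; then
  # -convert html: target format; -strip: do not copy metadata into the output.
  textutil -convert html -strip "$src"
  ls -l "$expected"
else
  echo "skipping: this step needs macOS textutil and a file named $src" >&2
fi
```

If you would rather not create a file at all, textutil's -stdout option sends the HTML to standard output instead, which is what the final pipeline relies on.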

Much Too Classy


One problem! textutil creates some basic CSS styles for you based on the source Word document and attaches very generic class names such as p2 and Apple-converted-space to seemingly every tag it creates. But you probably don't want these generated class names polluting your markup. Not only do they make the markup ugly and hard to read, but such generic class names could also clash with other classes in your app, leading to unintended consequences.

Sadly, textutil lacks any built-in option to suppress these class names. Sure, we could manually remove all the classes from the generated markup, but we don't want to do that either.

Right Sed Fred

Fear not: we can clean up the HTML that textutil gives us using sed, a standard Unix text-manipulation utility that ships with macOS and can be run from any shell, including Bash and Zsh. We'll pipe the HTML that textutil generates into sed, strip out all the class names, and save the result to a file.

The sed command we'll use to delete the class names is sed 's/ class="[^"]*"//g'. Let's break that down. The leading s in the argument means we'll substitute text matching the pattern between the first and second / characters with the text between the second and third /'s. The regex pattern we'll match is a space followed by class="[^"]*" (explained below); including the leading space means no stray space is left behind inside the tag. Then, we'll replace the text matching that pattern with the text between the last two slashes, which here is an empty string. And we'll do it for every occurrence on each line, not just the first, thanks to the trailing g (global) flag. Since sed processes its input line by line, we'll simply delete the text matching the pattern throughout the document.
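To see the substitution in isolation, here is a minimal sketch that runs it against a single hand-written line of markup. The sample line is invented for illustration and only imitates the kind of class attributes described above; the pattern includes the leading space used in the final command.

```shell
# Hypothetical sample line imitating textutil-style markup (not real textutil output).
sample='<p class="p1">These <b class="s1">Terms</b> apply.</p>'

# s/PATTERN/REPLACEMENT/g with an empty replacement deletes every match on the line.
cleaned=$(printf '%s\n' "$sample" | sed 's/ class="[^"]*"//g')

printf '%s\n' "$cleaned"
# prints: <p>These <b>Terms</b> apply.</p>
```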

About that funky-looking regex... sed doesn't have the same regex capabilities you're familiar with in modern languages such as JavaScript. It doesn't have lazy matching, meaning that if you try to match class=".*", sed will greedily match far more text than you intended: everything up to the last double quote on the line, well beyond the end of the HTML tag.

Instead, we can emulate lazy matching in sed with this technique: we match the opening ", followed by any number of characters that are not a ", then the closing ". So /class="[^"]*"/ will get us the lazy matching we need, effectively /class=".*?"/ in JavaScript's regex dialect. Lazy matching for lazy programmers!
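The difference is easiest to see side by side. This is a small sketch on a made-up line containing two class attributes; the variable names are arbitrary.

```shell
# Hypothetical line with two class attributes (not real textutil output).
line='<p class="p1">Some <b class="s1">bold</b> text</p>'

# Greedy: .* runs to the LAST double quote on the line, swallowing real content.
greedy=$(printf '%s\n' "$line" | sed 's/ class=".*"//g')

# Negated bracket expression: [^"]* stops at the first closing quote.
bounded=$(printf '%s\n' "$line" | sed 's/ class="[^"]*"//g')

printf '%s\n' "$greedy"
# prints: <p>bold</b> text</p>
printf '%s\n' "$bounded"
# prints: <p>Some <b>bold</b> text</p>
```

The greedy version deletes the word "Some" and the opening <b> tag along with the attributes, which is exactly the over-matching described above.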

After running textutil's output through this sed command, we'll have nice, clean markup without all the random class names.


Building on this technique, we could even take it a step further and strip out unnecessary <span> tags, and anything else we wanted to get rid of from textutil's output.
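One way this could look, as a sketch rather than a tested recipe for every document: chain extra sed expressions with -e to drop opening and closing <span> tags while keeping the text inside them. The sample line is again invented, and the span patterns are an assumption about what you might want removed; regex-based cleanup like this is only reasonable because the generated markup is simple and predictable.

```shell
# Hypothetical sample line (not real textutil output).
sample='<p class="p1"><span class="s1">Welcome</span> to the <b>Service</b>.</p>'

# Each -e expression runs in order on every line:
#   1. delete class attributes
#   2. delete opening <span ...> tags
#   3. delete closing </span> tags
cleaned=$(printf '%s\n' "$sample" | sed \
  -e 's/ class="[^"]*"//g' \
  -e 's/<span[^>]*>//g' \
  -e 's/<\/span>//g')

printf '%s\n' "$cleaned"
# prints: <p>Welcome to the <b>Service</b>.</p>
```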

Putting It All Together

Last, we'll save the cleaned HTML to a file. The final command line script is textutil -convert html -strip -stdout terms.docx | sed 's/ class="[^"]*"//g' > output.html, which (1) converts the Word document to HTML with textutil, (2) strips out the class names that textutil adds to each tag with sed, and (3) saves the cleaned HTML to a file. From there, we can simply paste the HTML into our terms and conditions component in our SPA, style it, and call it a day.
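If you expect to do this more than once, the one-liner can be wrapped in a couple of shell functions. This is only a sketch: clean_html and convert_terms are hypothetical names chosen for this example, and the textutil check is there because the tool exists only on macOS.

```shell
# Remove class attributes from HTML read on standard input.
clean_html() {
  sed 's/ class="[^"]*"//g'
}

# Convert the document given as $1 to HTML with textutil, then clean it.
# Writes the result to standard output; returns 1 if textutil is unavailable.
convert_terms() {
  if ! command -v textutil >/dev/null 2>&1; then
    echo "textutil not found; this step needs macOS" >&2
    return 1
  fi
  textutil -convert html -strip -stdout "$1" | clean_html
}
```

On macOS, running convert_terms terms.docx > output.html should then be equivalent to the single command above, and clean_html can be reused on its own for HTML from other sources.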


Conclusion

If a development task is manual, repetitive, time-consuming, and boring, pay attention to that feeling. As developers, we should hone a keen awareness of it, because it is usually a clear sign that it's time to automate the task and move on to more creative, higher-value problem solving. It's a unique privilege of being software engineers that we can (and should!) automate these annoying parts of our jobs. So, embrace your laziness, fellow devs! It's the virtuous thing to do.

TL;DR

Convert your Word document to clean HTML on macOS by running this command in your shell: textutil -convert html -strip -stdout terms.docx | sed 's/ class="[^"]*"//g' > output.html
