Hello Coders,
In this article, I will present a simple HTML Parser that I use to integrate HTML themes into legacy apps, coded in different technologies, much faster. When a customer requests a new UI for an app, the manual processing can take some time, so I decided to automate part of the flow. Using the tool, I'm able to update the design of a simple website with two or three pages in less than 2h.
Thanks for reading! - Content provided by App Generator.
Main Feature
The tool converts flat HTML to production-ready components for different engines: PUG, Jinja2, Blade, Mustache, and core PHP.
Note: the tool is not open-source, but I will consider releasing a light version as an open-source project in the future. In the HTML Parser public repository, I will publish processed HTML themes converted to PUG, Jinja, and Blade to be used by anyone.
Technologies
The HTML Parser tool is developed in Python 3 with the BeautifulSoup library, as an interactive console. I was able to use the tool for real projects after three months of R&D work.
HTML Parser features
- Normalize the HTML file to load the assets from standard directories ( /assets/ [ img, js, css ] ), making the integration with webpack-related tools much easier
- Edit / traverse the HTML tree
- Edit attributes like anchor HREF, span texts, remove elements, edit class names
- Extract components for production use for various engines like PUG, Jinja2, Blade
- Migrate legacy Bootstrap layouts to Bulma and Tailwind CSS frameworks
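The tree-editing features listed above map closely onto the public BeautifulSoup API. Here is a minimal sketch of those edits; the HTML snippet and the class names (`old-style`, `new-style`, `obsolete`) are made up for illustration and do not come from the tool itself.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only to illustrate the edits
html = """
<div class="card old-style">
  <a href="index.html">Home</a>
  <span>Old label</span>
  <p class="obsolete">Remove me</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Edit an anchor HREF
soup.a["href"] = "/home"

# Edit a span text
soup.span.string = "New label"

# Remove an element from the tree
soup.find("p", class_="obsolete").decompose()

# Edit class names: "class" is a multi-valued attribute, stored as a list
card = soup.find("div")
card["class"].remove("old-style")
card["class"].append("new-style")

print(soup)
```

The same list-based handling of `class` is what a Bootstrap to Bulma migration would likely build on, although the mapping between the two frameworks is not covered here.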
HTML Parser Implementation
To process the HTML and traverse its tree, we first need to load the whole file. BeautifulSoup has a simple constructor that accepts the string to parse and the desired parser, and loads the resulting tree into memory.
Load HTML in memory
from bs4 import BeautifulSoup as bs

def read_file(path):
    # Return the file content as a string
    with open(path, encoding='utf-8') as f:
        return f.read()

html_content = read_file('index.html')
soup = bs(html_content, 'html.parser')
# At this point, we can interact with the HTML elements stored in memory
# using all the helpers offered by the BS library
The BeautifulSoup library supports more than one parser (e.g. lxml, xml, html5lib). The differences between them become clear on HTML documents that are not well-formed, because each parser repairs the markup in its own way. For instance, lxml will add missing closing tags for all elements. For more information, please see the dedicated section in the documentation regarding this topic.
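A quick way to see those differences is to feed the same broken markup to each parser and compare the results. The sketch below assumes that `lxml` and `html5lib` may not be installed (they are optional third-party packages), so it falls back gracefully; the sample markup is made up.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately malformed: unclosed <li> and <p> tags
broken_html = "<ul><li>First<li>Second</ul><p>Unclosed paragraph"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken_html, parser))
    except FeatureNotFound:
        # Raised by BeautifulSoup when the requested parser is not installed
        results[parser] = None

for parser, output in results.items():
    shown = output if output is not None else "not installed"
    print(f"{parser:12} -> {shown}")
```

Comparing the printed trees side by side shows how each parser chose to repair the document.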
Parse Head section
To select the whole HEAD node and interact with all its elements, we need to write just a few lines of code:
header = soup.find('head')
# If we want to change the title
header.title.string.replace_with('My new title')
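The snippet above assumes the page already has a non-empty `<title>`. For themes where the tag is missing or empty, a slightly more defensive version may help. This is a sketch only; `set_title` is a hypothetical helper name, not part of BeautifulSoup or of the tool.

```python
from bs4 import BeautifulSoup

def set_title(soup, new_title):
    """Set the page title, creating the <title> tag when it is missing."""
    head = soup.find("head")
    if head is None:
        raise ValueError("document has no <head> section")
    if head.title is None:
        head.append(soup.new_tag("title"))
    # Assigning .string replaces whatever the tag contained before
    head.title.string = new_title
    return head

# A page without a <title> tag
page = BeautifulSoup("<html><head></head><body></body></html>", "html.parser")
set_title(page, "My new title")
print(page.head)
```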
Parse HTML for JS Scripts
JavaScript files are referenced in the HTML through script nodes:
...
<script type='text/javascript' src='js/bootstrap.js'></script>
<script type='text/javascript' src='js/custom.js'></script>
...
To scan the HTML soup for script tags, we can use the find_all helper:
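# recursive=False limits the search to direct children of <body>;
# src=True skips inline scripts that have no src attribute
for script in soup.body.find_all('script', src=True, recursive=False):
    # Print the path
    print(' JS source = ' + script['src'])
    # Update (normalize) the path
    js_path = script['src']
    js_file = js_path.split('/')[-1]  # select the last segment
    script['src'] = '/assets/js/' + js_file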
Parse HTML for Images
Using the same technique as for JS files, we can normalize the images to be loaded from a standard directory.
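# src=True skips <img> tags that have no src attribute
for img in soup.body.find_all('img', src=True):
    # Print the path
    print(' IMG src = ' + img['src'])
    # Update (normalize) the path
    img_path = img['src']
    img_file = img_path.split('/')[-1]  # select the last segment
    img['src'] = '/assets/img/' + img_file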
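The two loops share the same logic, so they can be folded into one helper that also covers stylesheets and leaves external URLs (CDN links, data URIs) untouched. This is a sketch under assumptions: `ASSET_RULES`, `is_local`, and `normalize_assets` are hypothetical names, and the `/assets/css/` target simply follows the directory layout mentioned in the features list.

```python
from urllib.parse import urlparse
from bs4 import BeautifulSoup

# (tag name, attribute, target directory); hypothetical rule table
ASSET_RULES = [
    ("script", "src", "/assets/js/"),
    ("img", "src", "/assets/img/"),
    ("link", "href", "/assets/css/"),
]

def is_local(path):
    """True for relative paths, False for URLs with a scheme or a host."""
    parsed = urlparse(path)
    return not parsed.scheme and not parsed.netloc

def normalize_assets(soup):
    """Rewrite local asset paths in place and return the (old, new) pairs."""
    changed = []
    for tag_name, attr, target_dir in ASSET_RULES:
        for tag in soup.find_all(tag_name, attrs={attr: True}):
            # Only stylesheet links are assets; skip icons, canonical links, etc.
            if tag_name == "link" and "stylesheet" not in (tag.get("rel") or []):
                continue
            old_path = tag[attr]
            if not is_local(old_path):
                continue
            new_path = target_dir + old_path.split("/")[-1]
            tag[attr] = new_path
            changed.append((old_path, new_path))
    return changed

html = (
    '<html><head><link rel="stylesheet" href="css/main.css">'
    '<link rel="icon" href="favicon.ico"></head>'
    '<body><img src="images/pic.png">'
    '<script src="https://cdn.example.com/lib.js"></script>'
    '<script src="js/custom.js"></script></body></html>'
)
soup = BeautifulSoup(html, "html.parser")
changes = normalize_assets(soup)
for old, new in changes:
    print(old, "->", new)
```

Returning the list of changes makes it easy to copy the matching files into the new directories afterwards.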
Save the HTML
All our changes are made in memory. To make these changes permanent we need to extract the string representation of our processed HTML from BS, and dump it into a file for later usage:
processed_html = soup.prettify(formatter="html")
# The with block closes the file automatically, even if the write fails
with open('index2.html', 'w', encoding='utf-8') as f:
    f.write(processed_html)
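One detail worth knowing when saving: `prettify()` adds newlines and indentation, which can change how inline elements render, while `str(soup)` keeps the whitespace as parsed. Below is a small sketch that supports both modes; `save_html` is a hypothetical helper name, and the temporary directory is used only to keep the example self-contained.

```python
import tempfile
from pathlib import Path
from bs4 import BeautifulSoup

def save_html(soup, path, pretty=True):
    """Write the tree to disk, either pretty-printed or as parsed."""
    html = soup.prettify(formatter="html") if pretty else str(soup)
    Path(path).write_text(html, encoding="utf-8")
    return html

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
out_dir = Path(tempfile.mkdtemp())

compact = save_html(soup, out_dir / "compact.html", pretty=False)
pretty = save_html(soup, out_dir / "pretty.html")

print(compact)
print(pretty)
```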
Real life sample
The sample is a simple navigation bar taken from the Stellar HTML5Up theme, extracted from this file
- Index file: original version and normalized version
- JSON descriptor: generated by the HTML parser tool, it encapsulates the assets and resources used by the HTML files
- Navigation component
Pug version
nav#nav
  ul
    li
      a.active.newclass(href='https://appseed.us/html-parser').
        Introduction
    li
      a(href='#first').
        First Section
    li
      a(href='#second').
        Second Section
    li
      a(href='#cta').
        Get Started
PHP version
<nav id="nav">
  <ul>
    <li>
      <a class="active newclass" href="https://appseed.us/html-parser">
        <?php echo $var_1?>
      </a>
    </li>
    <li>
      <a href="#first">
        <?php echo $var_2?>
      </a>
    </li>
    <li>
      <a href="#second">
        <?php echo $var_3?>
      </a>
    </li>
    <li>
      <a href="#cta">
        <?php echo $var_4?>
      </a>
    </li>
  </ul>
</nav>
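The tool itself is not open-source, so the following is only a guess at the general idea behind component extraction, not its actual implementation: pick a node, replace each link text with a numbered placeholder, and keep the original texts as variables. The sketch emits Jinja2-style placeholders; `extract_component` is a hypothetical name, and the `var_N` naming simply mirrors the PHP sample above.

```python
from bs4 import BeautifulSoup

html = """
<nav id="nav">
  <ul>
    <li><a class="active" href="#intro">Introduction</a></li>
    <li><a href="#first">First Section</a></li>
  </ul>
</nav>
"""

def extract_component(soup, tag_name, element_id):
    """Return (template, variables) for the element with the given id.

    Note: the anchors are edited in place, so the soup is modified.
    """
    node = soup.find(tag_name, id=element_id)
    variables = {}
    for index, anchor in enumerate(node.find_all("a"), start=1):
        name = f"var_{index}"
        variables[name] = anchor.get_text(strip=True)
        # Jinja2-style placeholder instead of the hard-coded text
        anchor.string = "{{ " + name + " }}"
    return str(node), variables

soup = BeautifulSoup(html, "html.parser")
template, variables = extract_component(soup, "nav", "nav")
print(template)
print(variables)
```

Rendering the template with the collected variables should give back the original navigation bar, which makes for a convenient sanity check.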
Projects built with this tool
All are open-source, each with a live demo.
- JAMstack Fractal - HTML5Up design coded in JAMstack pattern
- JAMstack BigPicture - HTML5Up design coded in JAMstack pattern
- JAMstack Landed - HTML5Up Landed design coded in JAMstack pattern
- Flask Dashboard Material Design - Admin Dashboard with Material Design
- Flask Dashboard NowUI - Admin Dashboard with NowUI Design
- Flask Dashboard Black - Open-Source Admin Panel
- Flask Dashboard Argon - Open-Source Admin Panel
- Flask Dashboard Light - Open-Source Admin Panel
Resources
- HTML Parser - How to use Python BS4 to work less
- Developer Tools - Open-Source HTML Parser - related article
- HTML Parser - used by the AppSeed App Generator to parse flat HTML
- BeautifulSoup Html Parser documentation
- HTML Parser sources - the official public repository
- HTML Parser provided by AppSeed
- HTML Parser - Convert HTML to Jinja2 and Php components - related blog article
- Video presentation HTML parsing and components extraction
Thank you!
Top comments (5)
Hey Sm0ke,
Thank you for sharing this article with the community, but is there any chance you can share a little more? I know that the proprietary licensing prevents you from releasing the code, but could you discuss some of the algorithms that were used or perhaps use pseudocode to demonstrate how certain sections work? Otherwise, I am afraid your article might violate section 11 of the community's Terms of Use:
Hello @ssimontis ,
I will add more information regarding tool architecture & use.
As I mentioned the tool will provide some free assets to developers:
Until then, I will add more information regarding the tool algorithms, to make it more useful to the audience.
Thank you!
Thank you, that's awesome!
Hello @ssimontis ,
As promised, I've added more information regarding the HTML parser internals. Let me know if you find the updates useful, or suggest more topics to be added.
Happy parsing!
Thank you so much for doing that! I found it very useful and I appreciate the time you put into it! Actually hoping I can play with parsing later today, one coding assessment left to complete for interviews...