
Waylon Walker

Posted on • Originally published at waylonwalker.com

Find all Headings with BeautifulSoup

BeautifulSoup is a DOM-like library for Python. It's quite useful for manipulating HTML. Here is an example that uses find_all to find every HTML heading. I stole the regex from Stack Overflow, but who doesn't?

Make an example

sample.html

Let's make a sample.html file with the following contents. It mainly has some headings, <h1> and <h2> tags, that I want to be able to find.

<!DOCTYPE html>
<html lang="en">
  <body>
    <h1>hello</h1>
    <p>this is a paragraph</p>
    <h2>second heading</h2>
    <p>this is also a paragraph</p>
    <h2>third heading</h2>
    <p>this is the last paragraph</p>

  </body>
</html>

Get the headings with BeautifulSoup

Let's import our packages, read in our sample.html using pathlib, and find all headings using BeautifulSoup.

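import re
from pathlib import Path

from bs4 import BeautifulSoup

# Parse the sample file; the "lxml" parser needs the lxml package installed.
soup = BeautifulSoup(Path("sample.html").read_text(), features="lxml")
# The regex matches the tag names h1 through h6.
headings = soup.find_all(re.compile("^h[1-6]$"))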

And what we get is a list of bs4.element.Tag objects.

>>> print(headings)
[<h1>hello</h1>, <h2>second heading</h2>, <h2>third heading</h2>]
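
Since each result is a Tag, its name and text are available directly. Here is a minimal sketch that turns the headings into an indented outline. It assumes the built-in "html.parser" backend and an inline HTML string so it runs without lxml or a file on disk; the `outline` name is only for illustration.

```python
import re

from bs4 import BeautifulSoup

# Inline copy of the sample document (assumption: same content as sample.html).
html = """
<html lang="en">
  <body>
    <h1>hello</h1>
    <p>this is a paragraph</p>
    <h2>second heading</h2>
    <p>this is also a paragraph</p>
    <h2>third heading</h2>
  </body>
</html>
"""

# "html.parser" ships with Python, so no extra parser package is needed.
soup = BeautifulSoup(html, features="html.parser")
headings = soup.find_all(re.compile("^h[1-6]$"))

# tag.name is a string like "h2"; its second character is the heading level.
outline = [(int(tag.name[1]), tag.get_text(strip=True)) for tag in headings]

# Indent each heading by two spaces per level below h1.
for level, text in outline:
    print("  " * (level - 1) + text)
```

The same pairs of level and text could feed a table of contents.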

I recently added a heading_link plugin to markata. You might notice the
🔗's next to each heading on this page; that is powered by this exact
technique.
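
To give a rough idea of how heading links could be built on top of find_all, here is a small sketch. It is not markata's actual implementation: `slugify` and `add_heading_links` are hypothetical helpers made up for this example, and it assumes the built-in "html.parser" backend.

```python
import re

from bs4 import BeautifulSoup


def slugify(text):
    """Hypothetical helper: lowercase the text and hyphenate it for use as an id."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def add_heading_links(html):
    """Give every heading an id and append a link that points back at it."""
    soup = BeautifulSoup(html, features="html.parser")
    for heading in soup.find_all(re.compile("^h[1-6]$")):
        slug = slugify(heading.get_text())
        heading["id"] = slug
        # new_tag creates an element owned by this soup; href is set as an attribute.
        link = soup.new_tag("a", href=f"#{slug}")
        link.string = "🔗"
        heading.append(link)
    return str(soup)


linked = add_heading_links("<h1>hello</h1><h2>second heading</h2>")
```

After this runs, `linked` holds the same two headings, each with an `id` attribute and a trailing `<a>` element that links to it. A real plugin would also need to handle duplicate heading text, which this sketch ignores.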
