VBA – Remove HTML Tags from a String

I needed to clean some webpage texts for a private project I was working on and created the following function. Today, in a forum question, the subject resurfaced so I dug up my function and am posting it here should it might serve someone else.

So here is a simple function utilizing Regular Expressions to remove/sanitize/extract HTML Tags from the passed string. So you get returned the text and nothing else.

'---------------------------------------------------------------------------------------
' Procedure : RemoveHTML
' Author    : Daniel Pineault, CARDA Consultants Inc.
' Website   : http://www.cardaconsultants.com
' Purpose   : Remove any HTML tags and/or comments from a string
' Copyright : The following may be altered and reused as you wish so long as the
'             copyright notice is left unchanged (including Author, Website and
'             Copyright).  It may not be sold/resold or reposted on other sites (links
'             back to this site are allowed).
'
' Input Variables:
' ~~~~~~~~~~~~~~~~
' sString   : String to sanitize (remove HTML tags from)
'
' Usage:
' ~~~~~~
' RemoveHTML("And 

then
some

") ' Returns: And then some ' ' Revision History: ' Rev Date(yyyy/mm/dd) Description ' ************************************************************************************** ' 1 2009-Aug-26 Initial Release ' 2 2009-Oct-30 Changed pattern to include comments '--------------------------------------------------------------------------------------- Function RemoveHTML(sString As String) As String On Error GoTo Error_Handler Dim oRegEx As Object Set oRegEx = CreateObject("vbscript.regexp") With oRegEx 'Patterns see: http://regexlib.com/Search.aspx?k=html%20tags '.Pattern = "<[^>]+>" 'basic html pattern .Pattern = "]*>" 'html tags and comments .Global = True .IgnoreCase = True .MultiLine = True End With RemoveHTML = oRegEx.Replace(sString, "") Error_Handler_Exit: On Error Resume Next Set oRegEx = Nothing Exit Function Error_Handler: MsgBox "The following error has occurred." & vbCrLf & vbCrLf & _ "Error Number: " & Err.Number & vbCrLf & _ "Error Source: RemoveHTML" & vbCrLf & _ "Error Description: " & Err.Description, _ vbCritical, "An Error has Occurred!" Resume Error_Handler_Exit End Function

Update 2016-08-26

Just learnt about a function, thank to fellow MVP theDBguy, that has existed since Access 2007 that can remove Rich Text formatting from a string. So you may wish to check out the PlainText Method. Its use is very straightforward:

? PlainText("And 

then
some

)"

which returns:

And
then
some)

In my brief testing, for better or worse depending on what you are doing, this method seems to retain line feeds.

6 responses on “VBA – Remove HTML Tags from a String

  1. Ben Sacherich

    I have been using the PlainText() function but have found a bug with it when processing tags.

    PlainText("tblCustomer JOIN tblOrder")

    Will return:

    tblCustomerJOIN tblOrder

    Instead of:

    tblCustomer JOIN tblOrder

    My current workaround is to replace “” with ” and in turn the function does not return a space at the beginning of the second line.

    1. Daniel Pineault Post author

      I am on Access 2013 (haven’t tested on other builds) and it returns correctly -> tblCustomer JOIN tblOrder
      What build are you seeing the behavior you describe?

  2. John Litchfield

    Typo at the end. It should be:

    ? PlainText(“And then some”)
    And
    then
    some

    Hope that helps. Thanks for the info! This is an amazing site!

    John.

  3. John Litchfield

    In my last comment, the site changed my text and my code, so it does not make any sense (I guess the site textbox tried to interpret the code or something). Sorry that that happened. I’ll try to describe it.

    Test the code at the end. The is a ) and a ” that need to be switched.
    There is an extra ) in the output.