Tag Archives: Regular Expressions

VBA – Remove HTML Tags from a String

I needed to clean some webpage texts for a private project I was working on and created the following function. Today, in a forum question, the subject resurfaced so I dug up my function and am posting it here should it might serve someone else.

So here is a simple function utilizing Regular Expressions to remove/sanitize/extract HTML Tags from the passed string. So you get returned the text and nothing else.

'---------------------------------------------------------------------------------------
' Procedure : RemoveHTML
' Author    : Daniel Pineault, CARDA Consultants Inc.
' Website   : http://www.cardaconsultants.com
' Purpose   : Remove any HTML tags and/or comments from a string
' Copyright : The following may be altered and reused as you wish so long as the
'             copyright notice is left unchanged (including Author, Website and
'             Copyright).  It may not be sold/resold or reposted on other sites (links
'             back to this site are allowed).
'
' Input Variables:
' ~~~~~~~~~~~~~~~~
' sString   : String to sanitize (remove HTML tags from)
'
' Usage:
' ~~~~~~
' RemoveHTML("And 

then
some

") ' Returns: And then some ' ' Revision History: ' Rev Date(yyyy/mm/dd) Description ' ************************************************************************************** ' 1 2009-Aug-26 Initial Release ' 2 2009-Oct-30 Changed pattern to include comments '--------------------------------------------------------------------------------------- Function RemoveHTML(sString As String) As String On Error GoTo Error_Handler Dim oRegEx As Object Set oRegEx = CreateObject("vbscript.regexp") With oRegEx 'Patterns see: http://regexlib.com/Search.aspx?k=html%20tags '.Pattern = "<[^>]+>" 'basic html pattern .Pattern = "]*>" 'html tags and comments .Global = True .IgnoreCase = True .MultiLine = True End With RemoveHTML = oRegEx.Replace(sString, "") Error_Handler_Exit: On Error Resume Next Set oRegEx = Nothing Exit Function Error_Handler: MsgBox "The following error has occurred." & vbCrLf & vbCrLf & _ "Error Number: " & Err.Number & vbCrLf & _ "Error Source: RemoveHTML" & vbCrLf & _ "Error Description: " & Err.Description, _ vbCritical, "An Error has Occurred!" Resume Error_Handler_Exit End Function

Update 2016-08-26

Just learnt about a function, thank to fellow MVP theDBguy, that has existed since Access 2007 that can remove Rich Text formatting from a string. So you may wish to check out the PlainText Method. Its use is very straightforward:

? PlainText("And 

then
some

)"

which returns:

And
then
some)

In my brief testing, for better or worse depending on what you are doing, this method seems to retain line feeds.