I needed to clean some webpage texts for a private project I was working on and created the following function. Today, in a forum question, the subject resurfaced so I dug up my function and am posting it here should it might serve someone else.
So here is a simple function utilizing Regular Expressions to remove/sanitize/extract HTML Tags from the passed string. So you get returned the text and nothing else.
'---------------------------------------------------------------------------------------
' Procedure : RemoveHTML
' Author : Daniel Pineault, CARDA Consultants Inc.
' Website : http://www.cardaconsultants.com
' Purpose : Remove any HTML tags and/or comments from a string
' Copyright : The following may be altered and reused as you wish so long as the
' copyright notice is left unchanged (including Author, Website and
' Copyright). It may not be sold/resold or reposted on other sites (links
' back to this site are allowed).
'
' Input Variables:
' ~~~~~~~~~~~~~~~~
' sString : String to sanitize (remove HTML tags from)
'
' Usage:
' ~~~~~~
' RemoveHTML("And then
some
")
' Returns: And then some
'
' Revision History:
' Rev Date(yyyy/mm/dd) Description
' **************************************************************************************
' 1 2009-Aug-26 Initial Release
' 2 2009-Oct-30 Changed pattern to include comments
'---------------------------------------------------------------------------------------
Function RemoveHTML(sString As String) As String
On Error GoTo Error_Handler
Dim oRegEx As Object
Set oRegEx = CreateObject("vbscript.regexp")
With oRegEx
'Patterns see: http://regexlib.com/Search.aspx?k=html%20tags
'.Pattern = "<[^>]+>" 'basic html pattern
.Pattern = "]*>" 'html tags and comments
.Global = True
.IgnoreCase = True
.MultiLine = True
End With
RemoveHTML = oRegEx.Replace(sString, "")
Error_Handler_Exit:
On Error Resume Next
Set oRegEx = Nothing
Exit Function
Error_Handler:
MsgBox "The following error has occurred." & vbCrLf & vbCrLf & _
"Error Number: " & Err.Number & vbCrLf & _
"Error Source: RemoveHTML" & vbCrLf & _
"Error Description: " & Err.Description, _
vbCritical, "An Error has Occurred!"
Resume Error_Handler_Exit
End Function
Update 2016-08-26
Just learnt about a function, thank to fellow MVP theDBguy, that has existed since Access 2007 that can remove Rich Text formatting from a string. So you may wish to check out the PlainText Method. Its use is very straightforward:
? PlainText("And then
some
)"
which returns:
And then some)
In my brief testing, for better or worse depending on what you are doing, this method seems to retain line feeds.