I needed to clean some webpage texts for a private project I was working on and created the following function. Today, in a forum question, the subject resurfaced so I dug up my function and am posting it here should it might serve someone else.
So here is a simple function utilizing Regular Expressions to remove/sanitize/extract HTML Tags from the passed string. So you get returned the text and nothing else.
'---------------------------------------------------------------------------------------
' Procedure : RemoveHTML
' Author : Daniel Pineault, CARDA Consultants Inc.
' Website : http://www.cardaconsultants.com
' Purpose : Remove any HTML tags and/or comments from a string
' Copyright : The following may be altered and reused as you wish so long as the
' copyright notice is left unchanged (including Author, Website and
' Copyright). It may not be sold/resold or reposted on other sites (links
' back to this site are allowed).
'
' Input Variables:
' ~~~~~~~~~~~~~~~~
' sString : String to sanitize (remove HTML tags from)
'
' Usage:
' ~~~~~~
' RemoveHTML("And then
some
")
' Returns: And then some
'
' Revision History:
' Rev Date(yyyy/mm/dd) Description
' **************************************************************************************
' 1 2009-Aug-26 Initial Release
' 2 2009-Oct-30 Changed pattern to include comments
'---------------------------------------------------------------------------------------
Function RemoveHTML(sString As String) As String
On Error GoTo Error_Handler
Dim oRegEx As Object
Set oRegEx = CreateObject("vbscript.regexp")
With oRegEx
'Patterns see: http://regexlib.com/Search.aspx?k=html%20tags
'.Pattern = "<[^>]+>" 'basic html pattern
.Pattern = "]*>" 'html tags and comments
.Global = True
.IgnoreCase = True
.MultiLine = True
End With
RemoveHTML = oRegEx.Replace(sString, "")
Error_Handler_Exit:
On Error Resume Next
Set oRegEx = Nothing
Exit Function
Error_Handler:
MsgBox "The following error has occurred." & vbCrLf & vbCrLf & _
"Error Number: " & Err.Number & vbCrLf & _
"Error Source: RemoveHTML" & vbCrLf & _
"Error Description: " & Err.Description, _
vbCritical, "An Error has Occurred!"
Resume Error_Handler_Exit
End Function
Update 2016-08-26
Just learnt about a function, thank to fellow MVP theDBguy, that has existed since Access 2007 that can remove Rich Text formatting from a string. So you may wish to check out the PlainText Method. Its use is very straightforward:
? PlainText("And then
some
)"
which returns:
And then some)
In my brief testing, for better or worse depending on what you are doing, this method seems to retain line feeds.
Thanks. Very useful.
Awesome, works great, thanks!!
I have been using the PlainText() function but have found a bug with it when processing tags.
PlainText("tblCustomer JOIN tblOrder")Will return:
tblCustomerJOIN tblOrder
Instead of:
tblCustomer JOIN tblOrder
My current workaround is to replace “” with ” and in turn the function does not return a space at the beginning of the second line.
I am on Access 2013 (haven’t tested on other builds) and it returns correctly -> tblCustomer JOIN tblOrder
What build are you seeing the behavior you describe?
Typo at the end. It should be:
? PlainText(“And then some”)
And
then
some
Hope that helps. Thanks for the info! This is an amazing site!
John.
In my last comment, the site changed my text and my code, so it does not make any sense (I guess the site textbox tried to interpret the code or something). Sorry that that happened. I’ll try to describe it.
Test the code at the end. The is a ) and a ” that need to be switched.
There is an extra ) in the output.