How to remove all HTML tags, but not all

JamesBowen

19+ years progress programming and still learning.
I am developing a blog for my website and I need a way to strip most HTML tags from the content while keeping a few of them. I want to keep the following tags: <a> <em> <strong> <cite> <code> <ul> <ol> <li> <dl> <dt> <dd> <img> (including <img> with attributes such as src).

I have googled for an answer, but all the solutions I have found use regular expressions, which OE does not support natively.

Does anybody know of a way for WebSpeed/OpenEdge to handle HTML content?
 
An HTML tag must begin with <somename.

Once you know the position in the line where <somename starts

assign
lpos = index ( sourceLine, '<somename' )
.

you enter strip mode

stripMode = true

and search for the closing '>' which must be the first following '>' you encounter.

rpos = index ( sourceLine, '>', lpos )
.

If rpos gt 0 then

assign
substring ( sourceLine, lpos, rpos - lpos + 1 ) = ""
.

If rpos is 0, you have to search the following line for the closing '>' and keep searching until you find it.

Doing that is not fun; to avoid the hard work I recommend putting the complete text into one variable and processing it completely as a single source line.
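
The steps above can be put together in a few lines. This is only a minimal sketch, assuming the whole text fits in one CHARACTER value; the function name stripFirstTag and its parameter tagStart are made up for illustration. It removes the first tag that begins with the given text, from its '<' through the first following '>'.

```abl
/* Hypothetical helper: remove the first tag starting with tagStart. */
function stripFirstTag returns character
  ( input sourceLine as character, input tagStart as character ):

  define variable lpos as integer no-undo initial 0.
  define variable rpos as integer no-undo initial 0.

  /* Where does the tag start? */
  lpos = index ( sourceLine, tagStart ).

  /* The closing '>' is the first one at or after lpos. */
  if lpos gt 0 then
    rpos = index ( sourceLine, '>', lpos ).

  /* Cut out everything from '<' through '>'. */
  if rpos gt 0 then
    substring ( sourceLine, lpos, rpos - lpos + 1 ) = "".

  return sourceLine.
end function.

/* Shows: Hello world</span>! */
message stripFirstTag ( 'Hello <span class="x">world</span>!', '<span' )
  view-as alert-box.
```

Note that only the opening tag is removed here; the closing </span> needs a second pass with '</span' as the search text.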

Furthermore, I recommend you make a list of all the tags you want to replace and organize them by their first letter. The processing then reduces to

define variable starting as integer no-undo init 1.
define variable strlen as integer no-undo init 0.
assign
  strlen = length ( textString, "CHARACTER" )
  starting = index ( textString, '<', starting )
.
do while strlen gt starting and starting gt 0 :
  run processString ( input-output textString, starting ).
  /* If nothing was removed, the same '<' is still at this position:
     step past it so the next search does not find it again. */
  if length ( textString, "CHARACTER" ) eq strlen then
    starting = starting + 1.
  assign
    strlen = length ( textString, "CHARACTER" )
    starting = index ( textString, '<', starting )
  .
end.

procedure processString:
  define input-output parameter theString as longchar no-undo.
  define input parameter ipStart as integer no-undo.

  /* The first letter of the tag name decides which handler runs. */
  define variable myClue as character no-undo.
  assign
    myClue = substring ( theString, ipStart + 1, 1 )
  .
  case myClue:
    when "a" then run caseA ( input-output theString, ipStart ).
    ...
    otherwise do:
      /* nothing */
    end.
  end case /* myClue */.
end procedure /* processString */.
 
Thanks ABLsaurusRex for your help. Based on your code example I have created the following function. It's not glamorous but it works. I may not have accounted for every situation, but I think it will do the job for now.

Please feel free to copy/edit/improve the following ABL function:

Code:
FUNCTION StripUnwantedHTML RETURNS CHARACTER
  (INPUT pcContext AS CHARACTER):

  /* Removes/strips HTML tags from the passed string context, except for the HTML 
     elements which are defined in the xcAllowedHTMLElements preprocessor.....*/

  DEFINE VARIABLE cElement        AS CHARACTER   NO-UNDO INITIAL ''.
  DEFINE VARIABLE iEndPos         AS INTEGER     NO-UNDO INITIAL 0.
  DEFINE VARIABLE iStartPos       AS INTEGER     NO-UNDO INITIAL 0.
  DEFINE VARIABLE iElementLength  AS INTEGER     NO-UNDO INITIAL 0.
  
  &SCOPED-DEFINE xcAllowedHTMLElements 'a,em,strong,cite,code,ul,ol,li,dl,dt,dd,img'              
  &SCOPED-DEFINE xcHTMLPrefix '<'
  &SCOPED-DEFINE xcHTMLSuffix '>'
  &SCOPED-DEFINE xcHTMLClosing '~/'
  &SCOPED-DEFINE xcNullChar ''
  &SCOPED-DEFINE xcSpaceChar ' '
  
  ASSIGN
    iStartPos = INDEX(pcContext, {&xcHTMLPrefix}). /* Try and find the first opening HTML Tag*/
  
  DO WHILE iStartPos   GT 0 AND 
     LENGTH(pcContext) GT 0:
  
    ASSIGN
      iEndPos = INDEX( pcContext, {&xcHTMLSuffix}, iStartPos ). /* Try and find the subsequent suffix tag element..*/
  
    /* Continue the search and replace process 
       if the HTML tag suffix is found... */

    IF iEndPos GT 0 THEN
    DO:
      
      ASSIGN
        iElementLength = iEndPos - iStartPos                                /* Calculate the HTML tag's Length..*/
        cElement       = SUBSTRING(pcContext,iStartPos,iElementLength + 1)  /* Extract the HTML element from the content..*/
        cElement       = ENTRY(1,cElement, {&xcSpaceChar})                  /* Just isolate the element tag, ignoring any element attributes.*/
        cElement       = REPLACE(cElement, {&xcHTMLPrefix},{&xcNULLChar})   /* Remove the HTML tag's prefix..*/
        cElement       = REPLACE(cElement, {&xcHTMLSuffix},{&xcNULLChar})   /* Remove the HTML tag's suffix..*/
        cElement       = REPLACE(cElement, {&xcHTMLClosing},{&xcNULLChar}). /* Remove the HTML tag's closing marker..*/
          
      /* If the isolated HTML element can not be found in the 
         "Allowed HTML Elements List", then strip the HTML tag 
         out.. Otherwise continue on with the search.. */

      IF NOT CAN-DO({&xcAllowedHTMLElements},cElement) THEN
      DO:
        SUBSTRING(pcContext,iStartPos,iElementLength + 1) = ''.
        ASSIGN
          iStartPos = INDEX(pcContext, {&xcHTMLPrefix}, iStartPos). /* Try and find the start of the next HTML element 
                                                                       starting from the beginning of the last known element..*/ 
      END.  /* END OF IF NOT CAN-DO({&xcAllowedHTMLElements},cElement) BLOCK..*/
      ELSE
        ASSIGN
          iStartPos = INDEX(pcContext, {&xcHTMLPrefix}, iEndPos). /* Try and find the start of the next HTML element 
                                                                     starting from the end of the last known element..*/
      
      ASSIGN
        iEndPos = 0.  /*  Reset last known element position, ending.*/
  
    END.  /* END OF IF iEndPos GT 0 BLOCK*/
    ELSE
      ASSIGN
        iStartPos = 0.  /*  Reset last known element position, beginning */
  
  END.  /* END OF DO WHILE iStartPos GT 0 AND LENGTH(pcContext) GT 0 BLOCK.*/
  
  RETURN pcContext.
END FUNCTION.

/* Test the StripUnwantedHTML function */

MESSAGE 
  StripUnwantedHTML(INPUT '<!DOCTYPE html><html><head><title>Hello HTML </title></head><body><p>Hello World! Click <a href="http://www.ekkoguardian.com">here</a> for a great service.</p> </body></html>')
  VIEW-AS ALERT-BOX INFO.
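
One situation the function above does not cover is a tag whose name is separated from its attributes by a tab or a line break instead of a space, because the name is isolated with ENTRY on a space only. Below is a rough sketch of a helper that isolates the tag name in that case too. The name GetTagName is hypothetical and not part of the function above, and it assumes it is handed one complete tag, from '<' through '>'.

```abl
/* Hypothetical helper: return the bare element name of one complete tag. */
FUNCTION GetTagName RETURNS CHARACTER
  (INPUT pcTag AS CHARACTER):

  DEFINE VARIABLE cName AS CHARACTER NO-UNDO INITIAL ''.

  ASSIGN
    cName = REPLACE(pcTag, CHR(9),  ' ')  /* Treat a tab as a space..*/
    cName = REPLACE(cName, CHR(10), ' ')  /* Treat a line feed as a space..*/
    cName = REPLACE(cName, CHR(13), ' ')  /* Treat a carriage return as a space..*/
    cName = TRIM(cName, '<>/ ').          /* Drop <, >, / and spaces from both ends..*/

  /* Keep only the element name, ignoring any attributes..*/
  IF cName NE '' THEN
    ASSIGN
      cName = ENTRY(1, cName, ' ').

  RETURN cName.
END FUNCTION.

/* Each of these calls returns the bare element name: a, li and img */
MESSAGE
  GetTagName('<a' + CHR(10) + 'href="x">') SKIP
  GetTagName('</li>') SKIP
  GetTagName('<img src="a.png" />')
  VIEW-AS ALERT-BOX INFO.
```

The returned name could then be checked against the allowed list in the same way cElement is checked in the function above.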
 