HTML to Plain Text convert

Status
Not open for further replies.

JamesBowen

19+ years progress programming and still learning.
Hi all.

When we send automated emails from our system, each email contains two parts: HTML-formatted and plain text.

In our database we store the content of each email in two versions, HTML and text. The logistical problem is that we have to maintain two versions of identical content, differing only in that the HTML version contains formatting.

Has anyone developed a process that can take HTML and convert it on the fly into plain text?

I have found a command-line tool called HTML2text which does exactly what it says: converts HTML into text. However, it is not cross-platform compatible. I need a pure ABL solution.

I was thinking about using a SAX/DOM parser to strip out the HTML tags, but I cannot guarantee that the HTML document conforms to the XHTML standard.

Additional Information:

OE 10.1B
Linux RH4 SE.

Many Thanks.
 
Ok here goes:

This is my first attempt at converting HTML into plain text. I don't think I will use it, because I don't think I could code for every possible HTML scenario. Please feel free to try it out and see where improvements could be made.

Code:
DEFINE TEMP-TABLE gttTagPosition NO-UNDO
  FIELD tagID       AS INTEGER
  FIELD posStart    AS INTEGER                                          
  FIELD posEnd      AS INTEGER
  FIELD tagLength   AS INTEGER
  FIELD tagType     AS CHARACTER
  INDEX idxTagID AS PRIMARY
    tagID DESCENDING .


FUNCTION html2text RETURNS CHARACTER 
  (INPUT pcHTMLString AS CHARACTER):

  DEFINE VARIABLE cTagReplace   AS CHARACTER   NO-UNDO INITIAL ''.
  DEFINE VARIABLE iLoop         AS INTEGER     NO-UNDO INITIAL 0.
  DEFINE VARIABLE iNumTabs      AS INTEGER     NO-UNDO INITIAL 0.
  DEFINE VARIABLE iPosEnd       AS INTEGER     NO-UNDO INITIAL 0.
  DEFINE VARIABLE iPosStart     AS INTEGER     NO-UNDO INITIAL 0.
  DEFINE VARIABLE iStringLength AS INTEGER     NO-UNDO INITIAL 0.
  DEFINE VARIABLE cEOL          AS CHARACTER   NO-UNDO INITIAL '~n' .
  
  CASE OPSYS:
    WHEN 'UNIX' THEN
      cEOL = '~n':U.
    WHEN 'WIN32' THEN
      cEOL = '~r~n':U.
  END CASE.

  EMPTY TEMP-TABLE gttTagPosition.

  iStringLength = LENGTH(pcHTMLString).
  
  DO iLoop = 1 TO iStringLength:

    /*  Get the position of the start of the next tag, searching from iLoop  */
    iPosStart = INDEX(pcHTMLString, '<', iLoop).

    IF iPosStart > 0 THEN
    DO:
      CREATE gttTagPosition.

      ASSIGN
        gttTagPosition.tagID    = iLoop
        gttTagPosition.posStart = iPosStart.

      /*  Get the position of the end of the tag  */
      iPosEnd = INDEX(pcHTMLString, '>', iPosStart).

      IF iPosEnd > 0 THEN
      DO:
        ASSIGN
          gttTagPosition.posEnd    = iPosEnd
          gttTagPosition.tagLength = (iPosEnd - iPosStart) + 1.

        gttTagPosition.tagType = SUBSTRING(pcHTMLString, iPosStart + 1, iPosEnd - iPosStart - 1).
        gttTagPosition.tagType = ENTRY(1, gttTagPosition.tagType, ' ').    /*  strip out attributes  */
      END.

      /*  Resume the search just past this '<' (the DO loop adds 1)  */
      iLoop = iPosStart.
    END.
    ELSE
      LEAVE.    /*  no more tags in the string  */
  END.
  
    
  /*  Remove the tags in reverse order, so earlier positions stay valid  */
  
  FOR EACH gttTagPosition
    BY gttTagPosition.tagID DESCENDING:
      
    CASE gttTagPosition.tagType:
      WHEN 'br' OR WHEN 'br/' THEN
        cTagReplace = cEOL.
      WHEN 'tr' THEN
        cTagReplace = '':U.
      WHEN '/tr' THEN
        cTagReplace = '|' + cEOL.
      WHEN 'td':U THEN
        cTagReplace = '|':U.
      WHEN '/td' THEN
        cTagReplace = '~t':U.
      WHEN 'p' THEN
        cTagReplace = cEOL.
      WHEN '/p' THEN
        cTagReplace = cEOL.
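      /*  Tags are processed last-to-first, so '/ul' is seen before its 'ul':
          indent on the closing tag and outdent on the opening tag.  */
      WHEN 'ul' THEN
        ASSIGN
          iNumTabs    = iNumTabs - 1
          cTagReplace = cEOL.
      WHEN '/ul' THEN
        ASSIGN
          iNumTabs    = iNumTabs + 1
          cTagReplace = cEOL.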
      WHEN 'li' THEN
        cTagReplace =  FILL('~t',iNumTabs) + '* ':U.
      WHEN '/li' THEN
        cTagReplace = cEOL.
      WHEN 'b' OR WHEN '/b' THEN
        cTagReplace = '*':U.
      
      OTHERWISE
        cTagReplace = '':U.
    END CASE.
    
    SUBSTRING (pcHTMLString,gttTagPosition.posStart,gttTagPosition.tagLength) = cTagReplace.
      
  END.

  RETURN pcHTMLString.

END FUNCTION.
                                                                                          

DEFINE STREAM sOutput.

OUTPUT STREAM sOutput TO 'html2text.txt'.

PUT STREAM sOutput UNFORMATTED 
  html2text(INPUT '<TABLE width="100%"><tbody><tr><td>Hello</td><td><b>World</b></td></tr><tr><td>Hello</td><td><b>World</b></td><td>Again</td></tr></tbody></table><br/>Line 0<br/>Line 1<br> <p>Paragrah <b>BOLD</b><br/>second line pargrah</p><ul><li>Item 1</li><li><ul><li>Item 3</li><li>Item 4</li></ul></li></ul>').

OUTPUT STREAM sOutput CLOSE.
Here is the final output:

Code:
|Hello  |*World*        |
|Hello  |*World*        |Again  |

Line 0
Line 1

Paragrah *BOLD*
second line pargrah

        * Item 1
        *
                * Item 3
                * Item 4
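
One scenario the function above does not cover is character entities: text such as &amp; or &nbsp; passes through unchanged. A minimal sketch of a follow-up step is shown below. The function name decodeEntities is hypothetical and not part of the code above, and only a handful of common entities are handled. Double-quoted literals are used for the entity strings as a precaution.

```abl
/* Hypothetical helper, not part of the original function: decodes a few
   common HTML entities once the tags have been stripped. */
FUNCTION decodeEntities RETURNS CHARACTER
  (INPUT pcText AS CHARACTER):

  pcText = REPLACE(pcText, "&nbsp;", " ").
  pcText = REPLACE(pcText, "&lt;",   "<").
  pcText = REPLACE(pcText, "&gt;",   ">").
  pcText = REPLACE(pcText, "&quot;", '"').

  /* Decode "&amp;" last, so that "&amp;lt;" ends up as the literal text
     "&lt;" instead of being decoded a second time into "<". */
  pcText = REPLACE(pcText, "&amp;", "&").

  RETURN pcText.

END FUNCTION.

/* Usage: strip the tags first, then decode what is left. */
MESSAGE decodeEntities(html2text("<p>Fish &amp; chips</p>"))
  VIEW-AS ALERT-BOX.
```

Decoding after the tags are removed matters: if &lt; were turned into < first, html2text would mistake the decoded text for the start of a tag.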
 