Bruce Bruen
bbruen at bigpond.com
Sun Jun 8 01:35:27 CDT 2003
Hi List,

Does anyone know of a library that will help me parse an email which is in BADLY formed HTML? I need to find a table in the message, pull the text from each cell and add the information to a table.

I have tried using linked Outlook (and Exchange) tables, but the message field only contains the plain text. One of the info pieces we are looking for is a hyperlink, which M$ conveniently removes.

I have tried using the MAPI and CDO libraries. And if I ever meet in a dark alley the M$ decision maker who put that security misconception together......

I have tried using Redemption, which lets me get at either the plain text or the HTML body of the message fine, but... now that I've got that far, I am having extreme difficulty with the parsing.

The "ideal" solution would be to have a template per sender identity that would store the layout of the table in the message (and its position), and a set of routines that would parse the message, find the table beast, dig it out and populate the recordset row based on the template.

Sounds simple eh? Here's the complexity:

1. The tables are in different positions in the message, depending on how much useless advertising the vendor is sprouting today.
2. At least the tables are in consistent formats!
3. Depending on the vendor, the HTML of the mail is either fair, poor or atrocious. The most common occurrence is unmatched closing tags, for example "<TR><TD>blah blah<TD>blah1 blah1<TD>blah3</Table>" - fine for web browser companies with 2.3Gigadevelopers to hack it around, but I'm only one underpaid .....
4. The cells contain more than one attribute. This bit I'm OK with; I can dig out and validate the part# v. description etc. with a bit of work.
5. In some cases we need to dig out the tag attributes, e.g. a hyperlink.
So, I'm looking for something that I could call that could either "correct" the HTML so I can parse it, or something I could call that would parse the HTML, bad as it is, and return the info for processing, somewhat like the XML parser.

Any ideas?

Bruce
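
To illustrate the second option (parsing the HTML, bad as it is), here is a minimal sketch in Python, assuming the HTML body has already been pulled out of the message as a string. It uses the standard library's html.parser.HTMLParser, which reports tags as events and does not require them to be balanced. The class name TableExtractor, the helper extract_tables and the sample message are hypothetical, made up for this sketch. It also simplifies: a nested table is treated as the start of a new table, and rows that appear outside any <table> tag are ignored.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect table cells (text plus link targets) from sloppy HTML.

    A new <td>, <tr> or <table> implicitly closes whatever cell, row or
    table is still open, so missing closing tags do not matter.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tables = []    # finished tables; each is a list of rows
        self._rows = None   # rows of the table being read, or None
        self._row = None    # cells of the row being read, or None
        self._cell = None   # cell being read, or None

    def _close_cell(self):
        if self._cell is not None:
            # Collapse runs of whitespace in the collected text.
            text = " ".join("".join(self._cell["text"]).split())
            self._row.append({"text": text, "links": self._cell["links"]})
            self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row:
            self._rows.append(self._row)
        self._row = None

    def _close_table(self):
        if self._rows is not None:
            self._close_row()
            self.tables.append(self._rows)
            self._rows = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "table":
            self._close_table()
            self._rows = []
        elif self._rows is None:
            return  # not inside a table: skip the advertising
        elif tag == "tr":
            self._close_row()
            self._row = []
        elif tag in ("td", "th"):
            self._close_cell()
            if self._row is None:  # cell without a preceding <tr>
                self._row = []
            self._cell = {"text": [], "links": []}
        elif tag == "a" and self._cell is not None:
            href = dict(attrs).get("href")
            if href:
                self._cell["links"].append(href)

    def handle_endtag(self, tag):
        if tag == "table":
            self._close_table()
        elif self._rows is not None and tag == "tr":
            self._close_row()
        elif self._rows is not None and tag in ("td", "th"):
            self._close_cell()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell["text"].append(data)

    def close(self):
        super().close()
        self._close_table()  # table never closed before end of message


def extract_tables(html):
    """Return every table found in html as rows of cell dicts."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


# Made-up message body in the style of the example in point 3.
sample = ('<p>Special offers!</p><TABLE><TR><TD>blah blah<TD>blah1 blah1'
          '<TD><A HREF="http://example.com/p1">blah3</A></Table>')

for row in extract_tables(sample)[0]:
    for cell in row:
        print(cell["text"], cell["links"])
```

Each cell comes back as its text plus any hyperlink targets found inside it, so a per-sender template would only need to say which table and which cell positions map to which recordset fields.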