[AccessD] *&^*%%$ Outlook, &^%HTML, (*&@Computers!!

Shamil Salakhetdinov shamil at smsconsulting.spb.ru
Sun Jun 8 07:16:13 CDT 2003


Bruce,

If your code can extract the HTML document from the e-mail, then I think you
can use Microsoft Internet Controls to parse it, even if the HTML is
partially broken. Here is some sample code for starters:

http://smsconsulting.spb.ru/shamil_s/topics/tableparser.htm
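
To illustrate the general idea of lenient parsing (this is not the code
behind the link above, and it does not use Microsoft Internet Controls), here
is a minimal sketch in Python using only the standard library html.parser
module. It assumes that a new <TD> or <TR> implicitly closes the previous
one, which covers the unmatched-tag case from the original question, and it
collects the href of any link inside a cell. The names LenientTableParser and
parse_tables are made up for this example, and nested tables are not handled.

```python
from html.parser import HTMLParser


class LenientTableParser(HTMLParser):
    """Collect cell text and link targets, tolerating missing </td> and </tr>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tables = []   # finished tables: list of rows, each row a list of cells
        self._rows = None  # rows of the table being read, None outside a table
        self._row = None   # cells of the current row
        self._cell = None  # current cell: {"text": [...], "links": [...]}

    def _close_cell(self):
        if self._cell is not None:
            # Join the text fragments and collapse runs of whitespace.
            text = " ".join("".join(self._cell["text"]).split())
            self._row.append({"text": text, "links": self._cell["links"]})
            self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row:
            self._rows.append(self._row)
        self._row = None

    def _close_table(self):
        if self._rows is not None:
            self._close_row()
            self.tables.append(self._rows)
            self._rows = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag == "table":
            self._close_table()  # simplification: a nested table ends the outer one
            self._rows = []
        elif self._rows is None:
            return               # ignore everything outside a table
        elif tag == "tr":
            self._close_row()    # a new row implicitly closes the previous one
            self._row = []
        elif tag in ("td", "th"):
            if self._row is None:
                self._row = []   # cell without a preceding <tr>
            self._close_cell()   # a new cell implicitly closes the previous one
            self._cell = {"text": [], "links": []}
        elif tag == "a" and self._cell is not None:
            href = dict(attrs).get("href")
            if href:
                self._cell["links"].append(href)

    def handle_endtag(self, tag):
        if self._rows is None:
            return
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr":
            self._close_row()
        elif tag == "table":
            self._close_table()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell["text"].append(data)

    def close(self):
        super().close()
        self._close_table()      # flush a table that was never closed


def parse_tables(html):
    """Return every table found in html as a list of rows of cell dicts."""
    parser = LenientTableParser()
    parser.feed(html)
    parser.close()
    return parser.tables


if __name__ == "__main__":
    sample = ('<p>advertising</p><TABLE><TR><TD>blah blah<TD>blah1 blah1'
              '<TD><A HREF="http://example.com/p1">blah3</A></Table>')
    for row in parse_tables(sample)[0]:
        print([(cell["text"], cell["links"]) for cell in row])
```

Because the parser ignores everything outside <TABLE>, the position of the
table in the message does not matter. Each returned row is a list of cells
that can then be mapped onto recordset fields.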

HTH,
Shamil

----- Original Message -----
From: "Bruce Bruen" <bbruen at bigpond.com>
To: <accessd at databaseadvisors.com>
Sent: Sunday, June 08, 2003 10:35 AM
Subject: [AccessD] *&^*%%$ Outlook, &^%HTML, (*&@Computers!!


> Hi List,
>
> Does anyone know of a library that will help me parse an email which is
> in BADLY formed HTML?  I need to find a table in the message, pull the
> text from each cell and add the information to a table.
>
> I have tried using linked Outlook (and Exchange) tables, but the message
> field only contains the plain text.  One of the info pieces we are
> looking for is a hyperlink, which M$ conveniently removes.
>
> I have tried using MAPI and CDO libraries. And if I ever meet in a dark
> alley the M$ decision maker who put that security misconception
> together......
>
> I have tried using Redemption, which lets me get at either the plaintext
> or the HTML body of the message fine, but...
>
> Now I've got that far, I am having extreme difficulty with the parsing.
>
> The "ideal" solution would be to have a template per sender identity
> that would store the layout of the table in the message (and its
> position) and a set of routines that would parse the message, find the
> table beast, dig it out and populate the recordset row based on the
> template.
>
> Sounds simple eh?  Here's the complexity:
> 1.  The tables are in different positions in the message, depending on
> how much useless advertising the vendor is sprouting today.
> 2.  At least the tables are in constant formats!
> 3.  Depending on the vendor, the HTML of the mail is either fair, poor
> or atrocious.  The most common occurrence is unmatched closing tags, for
> example "<TR><TD>blah blah<TD>blah1 blah1<TD>blah3</Table>" - fine for
> web browser companies with 2.3Gigadevelopers to hack it around but I'm
> only one underpaid .....
> 4. The cells contain more than one attribute.  This bit I'm OK with: I
> can dig out and validate the part# v. description etc. with a bit of work.
> 5. In some cases we need to dig out the tag attributes e.g. a hyperlink.
>
> So, I'm looking for something that I could call that could either
> "correct" the html, so I can parse it, or something I could call that
> would parse the html bad as it is and return the info for processing
> somewhat like the XML parser.
>
> Any ideas?
> Bruce


