[AccessD] &^%%$ Outlook, &^%HTML, (*&@Computers!!

Sun Jun 8 08:11:07 CDT 2003

Shamil,

Many, many thanks. This looks like what I have been biting my nails
over, and yes it doesn't seem to mind the rotten bits of html I have
given it so far.

Now if I can come up with a good way to encode the templates for parsing
the cell contents I'll be there!
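
One possible way to encode such templates, sketched in Python purely as an illustration: keep one entry per sender, with one regular expression per table column whose named groups become the field names. The sender address, the field names (part_no, description, quantity, unit_price) and the cell formats are all made up for the example; nothing in this thread specifies them.

```python
import re

# Hypothetical templates: one entry per sender, one pattern per table column.
# The named groups in each pattern become the field names of the record.
TEMPLATES = {
    "orders@vendor-a.example": [
        re.compile(r"(?P<part_no>[A-Z0-9-]+)\s+(?P<description>.+)"),
        re.compile(r"(?P<quantity>\d+)"),
        re.compile(r"\$?(?P<unit_price>\d+(?:\.\d{2})?)"),
    ],
}


def apply_template(sender, cells):
    """Map one table row (a list of cell strings) to a dict of named fields.

    Raises KeyError for an unknown sender and ValueError when a cell does
    not match its column pattern, so bad rows are reported rather than
    silently stored.
    """
    patterns = TEMPLATES[sender]
    if len(cells) != len(patterns):
        raise ValueError(f"expected {len(patterns)} cells, got {len(cells)}")
    record = {}
    for index, (pattern, cell) in enumerate(zip(patterns, cells)):
        match = pattern.fullmatch(cell.strip())
        if match is None:
            raise ValueError(f"cell {index} does not match template: {cell!r}")
        record.update(match.groupdict())
    return record


row = ["AB-123 Left-handed widget", "4", "$19.95"]
print(apply_template("orders@vendor-a.example", row))
```

Keeping the patterns as data rather than code means a new vendor only needs a new entry, and a cell that stops matching shows up as an error instead of a silently wrong row.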

Thanks again
Bruce

-----Original Message-----
From: accessd-bounces at databaseadvisors.com
[mailto:accessd-bounces at databaseadvisors.com] On Behalf Of Shamil
Salakhetdinov
Sent: Sunday, June 08, 2003 10:16 PM
To: accessd at databaseadvisors.com
Subject: Re: [AccessD] *&^*%%$ Outlook, &^%HTML, (*&@Computers!!

Bruce,

If your code can extract the HTML doc from the e-mail, then I think you
can use Microsoft Internet Controls to parse it, even if the HTML doc is
partially broken - here is some sample code for starters:

http://smsconsulting.spb.ru/shamil_s/topics/tableparser.htm
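
The general idea of tolerant parsing can also be sketched without the browser control. The following is not the linked sample and does not use Microsoft Internet Controls; it is a minimal Python illustration using the standard library's html.parser, which reports tags as it meets them and does not require them to be closed. The class name TableCellParser and the example.com link are invented for the example, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect cell text and link targets, tolerating unclosed TD/TR tags."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows; each row is a list of cell dicts
        self._row = None    # row currently being built
        self._cell = None   # cell currently being built

    def _close_cell(self):
        if self._cell is not None:
            # Collapse runs of whitespace, including line breaks inside a cell.
            self._cell["text"] = " ".join(self._cell["text"].split())
            self._row.append(self._cell)
            self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased.
        if tag == "tr":
            self._close_row()       # a new row implicitly ends the previous one
            self._row = []
        elif tag in ("td", "th"):
            if self._row is None:   # cell without an opening <TR>
                self._row = []
            self._close_cell()      # a new cell implicitly ends the previous one
            self._cell = {"text": "", "links": []}
        elif tag == "a" and self._cell is not None:
            href = dict(attrs).get("href")
            if href:
                self._cell["links"].append(href)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell["text"] += data

    def close(self):
        super().close()
        self._close_row()           # flush a row left open at end of input


parser = TableCellParser()
parser.feed('<TR><TD>blah\nblah<TD>blah1 blah1'
            '<TD><A HREF="http://example.com/item">blah3</A></Table>')
parser.close()
for row in parser.rows:
    print([(cell["text"], cell["links"]) for cell in row])
```

The input above is the broken fragment from the original question with a made-up hyperlink added to the last cell, to show that tag attributes survive alongside the cell text.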

HTH,
Shamil

----- Original Message -----
From: "Bruce Bruen" <bbruen at bigpond.com>
To: <accessd at databaseadvisors.com>
Sent: Sunday, June 08, 2003 10:35 AM
Subject: [AccessD] *&^*%%$ Outlook, &^%HTML, (*&@Computers!!

> Hi List,
>
> Does anyone know of a library that will help me parse an email which
> is in BADLY formed HTML?  I need to find a table in the message, pull
> the text from each cell and add the information to a table.
>
> I have tried using linked Outlook (and Exchange) tables, but the
> message field only contains the plain text.  One of the info pieces we
> are looking for is a hyperlink which M$ conveniently removes.
>
> I have tried using MAPI and CDO libraries. And if I ever meet in a 
> dark alley the M$ decision maker who put that security misconception 
> together......
>
> I have tried using Redemption, which lets me get at either the 
> plaintext or the HTML body of the message fine, but...
>
> Now I've got that far, I am having extreme difficulty with the parsing.
>
> The "ideal" solution would be to have a template per sender identity 
> that would store the layout of the table in the message (and its
> position) and a set of routines that would parse the message, find the
> table beast, dig it out and populate the recordset row based on the
> template.
>
> Sounds simple eh?  Here's the complexity:
> 1.  The tables are in different positions in the message, depending on
>     how much useless advertising the vendor is sprouting today.
> 2.  At least the tables are in constant formats!
> 3.  Depending on the vendor, the HTML of the mail is either fair, poor
>     or atrocious.  The most common occurrence is unmatched closing
>     tags, for example
>     "<TR><TD>blah blah<TD>blah1 blah1<TD>blah3</Table>" - fine for web
>     browser companies with 2.3Gigadevelopers to hack it around but I'm
>     only one underpaid .....
> 4.  The cells contain more than one attribute.  This bit, I'm OK with,
>     I can dig out, validate the part# v. description etc with a bit of
>     work.
> 5.  In some cases we need to dig out the tag attributes e.g. a
>     hyperlink.
>
> So, I'm looking for something that I could call that could either 
> "correct" the html, so I can parse it, or something I could call that 
> would parse the html bad as it is and return the info for processing 
> somewhat like the XML parser.
>
> Any ideas?
> Bruce
>
> _______________________________________________
> AccessD mailing list
> AccessD at databaseadvisors.com 
> http://databaseadvisors.com/mailman/listinfo/accessd
> Website: http://www.databaseadvisors.com
