Bruce Bruen
bbruen at bigpond.com
Sun Jun 8 08:11:07 CDT 2003
Shamil,

Many, many thanks. This looks like what I have been biting my nails over, and yes, it doesn't seem to mind the rotten bits of HTML I have given it so far.

Now if I can come up with a good way to encode the templates for parsing the cell contents, I'll be there!

Thanks again,
Bruce

-----Original Message-----
From: accessd-bounces at databaseadvisors.com [mailto:accessd-bounces at databaseadvisors.com] On Behalf Of Shamil Salakhetdinov
Sent: Sunday, June 08, 2003 10:16 PM
To: accessd at databaseadvisors.com
Subject: Re: [AccessD] *&^*%%$ Outlook, &^%HTML, (*&@Computers!!

Bruce,

If you can extract the HTML doc from the e-mail with your code, then I think you can use Microsoft Internet Controls to parse it even if the HTML doc is partially broken. Here is sample code for starters:

http://smsconsulting.spb.ru/shamil_s/topics/tableparser.htm

HTH,
Shamil

----- Original Message -----
From: "Bruce Bruen" <bbruen at bigpond.com>
To: <accessd at databaseadvisors.com>
Sent: Sunday, June 08, 2003 10:35 AM
Subject: [AccessD] *&^*%%$ Outlook, &^%HTML, (*&@Computers!!

> Hi List,
>
> Does anyone know of a library that will help me parse an email which
> is in BADLY formed HTML? I need to find a table in the message, pull
> the text from each cell and add the information to a table.
>
> I have tried using linked Outlook (and Exchange) tables, but the
> message field only contains the plain text. One of the info pieces we
> are looking for is a hyperlink, which M$ conveniently removes.
>
> I have tried using MAPI and CDO libraries. And if I ever meet in a
> dark alley the M$ decision maker who put that security misconception
> together......
>
> I have tried using Redemption, which lets me get at either the
> plaintext or the HTML body of the message fine, but...
>
> Now I've got that far, I am having extreme difficulty with the parsing.
>
> The "ideal" solution would be to have a template per sender identity
> that would store the layout of the table in the message (and its
> position) and a set of routines that would parse the message, find the
> table beast, dig it out and populate the recordset row based on the
> template.
>
> Sounds simple eh? Here's the complexity:
> 1. The tables are in different positions in the message, depending on
>    how much useless advertising the vendor is sprouting today.
> 2. At least the tables are in constant formats!
> 3. Depending on the vendor, the HTML of the mail is either fair, poor
>    or atrocious. The most common occurrence is unmatched closing tags,
>    for example "<TR><TD>blah blah<TD>blah1 blah1<TD>blah3</Table>" -
>    fine for web browser companies with 2.3Gigadevelopers to hack it
>    around but I'm only one underpaid .....
> 4. The cells contain more than one attribute. This bit, I'm OK with, I
>    can dig out, validate the part# v. description etc with a bit of
>    work.
> 5. In some cases we need to dig out the tag attributes e.g. a
>    hyperlink.
>
> So, I'm looking for something that I could call that could either
> "correct" the html, so I can parse it, or something I could call that
> would parse the html bad as it is and return the info for processing
> somewhat like the XML parser.
>
> Any ideas?
> Bruce
>
> _______________________________________________
> AccessD mailing list
> AccessD at databaseadvisors.com
> http://databaseadvisors.com/mailman/listinfo/accessd
> Website: http://www.databaseadvisors.com
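
For anyone reading this thread later: the malformed-row case from the original question (cells with no closing tags, plus a hyperlink attribute to recover) can be sketched with a tolerant, event-driven parser. The example below is only an illustration and is not the Microsoft Internet Controls approach Shamil linked to; it uses Python's standard-library html.parser instead. The names TableCellParser and extract_rows are made up for this sketch, and it assumes a single, non-nested table per message.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect table rows and cells, tolerating missing </td> and </tr>.

    Hypothetical helper for illustration; assumes tables are not nested.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []     # finished rows; each row is a list of cell dicts
        self._row = None   # row being built, or None when outside a row
        self._cell = None  # cell being built, or None when outside a cell

    def _close_cell(self):
        # Finish the open cell (if any) and normalise its whitespace.
        if self._cell is not None:
            text = "".join(self._cell["text"])
            self._cell["text"] = " ".join(text.split())
            self._row.append(self._cell)
            self._cell = None

    def _close_row(self):
        # Finish the open row (if any); empty rows are dropped.
        self._close_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased, so <TD> and <td> look the same here.
        if tag == "tr":
            self._close_row()       # a new <tr> implicitly ends the old row
            self._row = []
        elif tag in ("td", "th"):
            if self._row is None:   # <td> that was never preceded by <tr>
                self._row = []
            self._close_cell()      # a new <td> implicitly ends the old cell
            self._cell = {"text": [], "links": []}
        elif tag == "a" and self._cell is not None:
            href = dict(attrs).get("href")
            if href:
                self._cell["links"].append(href)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell["text"].append(data)

    def close(self):
        super().close()             # flush any buffered text first
        self._close_row()           # then close whatever is still open


def extract_rows(html):
    """Return a list of rows; each cell is {'text': str, 'links': [str]}."""
    parser = TableCellParser()
    parser.feed(html)
    parser.close()
    return parser.rows


if __name__ == "__main__":
    sample = "<TR><TD>blah blah<TD>blah1 blah1<TD>blah3</Table>"
    for row in extract_rows(sample):
        print([cell["text"] for cell in row])
```

The design choice is to treat every opening <tr> or <td> as an implicit close of the previous one, which is how the unmatched-tag example from the question ends up as three separate cells. A per-sender template could then be applied to the returned rows rather than to the raw HTML.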