Shamil Salakhetdinov
shamil at smsconsulting.spb.ru
Sun Jun 8 07:16:13 CDT 2003
Bruce,

If you can extract the HTML doc from the e-mail with your code, then I think you can use Microsoft Internet Controls to parse it, even if the HTML doc is partially broken. Here is some sample code for starters:

http://smsconsulting.spb.ru/shamil_s/topics/tableparser.htm

HTH,
Shamil

----- Original Message -----
From: "Bruce Bruen" <bbruen at bigpond.com>
To: <accessd at databaseadvisors.com>
Sent: Sunday, June 08, 2003 10:35 AM
Subject: [AccessD] *&^*%%$ Outlook, &^%HTML, (*&@Computers!!

> Hi List,
>
> Does anyone know of a library that will help me parse an email, which is
> in BADLY formed html. I need to find a table in the message, pull the
> text from each cell and add the information to a table.
>
> I have tried using linked outlook (and exchange) tables, the message
> field only contains the plain text. One of the info pieces we are
> looking for is a hyperlink which M$ conveniently removes.
>
> I have tried using MAPI and CDO libraries. And if I ever meet in a dark
> alley the M$ decision maker who put that security misconception
> together......
>
> I have tried using Redemption, which lets me get at either the plaintext
> or the HTML body of the message fine, but...
>
> Now I've got that far, I am having extreme difficulty with the parsing.
>
> The "ideal" solution would be to have a template per sender identity
> that would store the layout of the table in the message (and its
> position) and a set of routines that would parse the message, find the
> table beast, dig it out and populate the recordset row based on the
> template.
>
> Sounds simple eh? Here's the complexity:
> 1. The tables are in different positions in the message, depending on
> how much useless advertising the vendor is sprouting today.
> 2. At least the tables are in constant formats!
> 3. Depending on the vendor, the HTML of the mail is either fair, poor
> or atrocious. The most common occurrence is unmatched closing tags, for
> example "<TR><TD>blah blah<TD>blah1 blah1<TD>blah3</Table>" - fine for
> web browser companies with 2.3Gigadevelopers to hack it around but I'm
> only one underpaid .....
> 4. The cells contain more than one attribute. This bit, I'm OK with, I
> can dig out, validate the part# v. description etc with a bit of work.
> 5. In some cases we need to dig out the tag attributes e.g. a hyperlink.
>
> So, I'm looking for something that I could call that could either
> "correct" the html, so I can parse it, or something I could call that
> would parse the html bad as it is and return the info for processing
> somewhat like the XML parser.
>
> Any ideas?
> Bruce
>
> _______________________________________________
> AccessD mailing list
> AccessD at databaseadvisors.com
> http://databaseadvisors.com/mailman/listinfo/accessd
> Website: http://www.databaseadvisors.com
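
For anyone who wants to see the idea of a tolerant table parser spelled out, here is a minimal sketch. It is not the Microsoft Internet Controls approach from the sample linked in the reply; it swaps in Python's standard html.parser module instead, purely as an illustration. The names LooseTableParser and parse_tables are made up for this example. The assumption is that a new <TD> or <TR> implicitly closes the previous cell or row, which is roughly how browsers cope with the unmatched tags in Bruce's example. Nested tables are not handled.

```python
from html.parser import HTMLParser


class LooseTableParser(HTMLParser):
    """Hypothetical helper: collect table cells from HTML whose
    <tr>/<td> tags may never be closed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tables = []   # finished tables: list of rows, each a list of cells
        self._rows = None  # rows of the table being read (None outside a table)
        self._row = None   # cells of the row being read
        self._cell = None  # cell being read: {"text": [...], "links": [...]}

    def _close_cell(self):
        if self._cell is not None:
            # Join the text fragments and collapse runs of whitespace.
            text = " ".join("".join(self._cell["text"]).split())
            self._row.append({"text": text, "links": self._cell["links"]})
            self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row:
            self._rows.append(self._row)
        self._row = None

    def _close_table(self):
        if self._rows is not None:
            self._close_row()
            self.tables.append(self._rows)
            self._rows = None

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag == "table":
            self._close_table()  # assumption: no nested tables
            self._rows = []
        elif tag == "tr":
            if self._rows is None:
                self._rows = []  # row seen without an opening <table>
            self._close_row()    # a new row implicitly ends the previous one
            self._row = []
        elif tag in ("td", "th"):
            if self._rows is None:
                self._rows = []
            if self._row is None:
                self._row = []
            self._close_cell()   # a new cell implicitly ends the previous one
            self._cell = {"text": [], "links": []}
        elif tag == "a" and self._cell is not None:
            href = dict(attrs).get("href")
            if href:
                self._cell["links"].append(href)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr":
            self._close_row()
        elif tag == "table":
            self._close_table()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell["text"].append(data)

    def close(self):
        super().close()
        self._close_table()  # flush a table that was never closed


def parse_tables(html):
    """Return every table found in html as rows of {"text", "links"} cells."""
    parser = LooseTableParser()
    parser.feed(html)
    parser.close()
    return parser.tables


if __name__ == "__main__":
    # The malformed fragment quoted in the original message.
    broken = "<TR><TD>blah blah<TD>blah1 blah1<TD>blah3</Table>"
    for row in parse_tables(broken)[0]:
        print([cell["text"] for cell in row])
```

Each cell comes back with its text and any href values found inside it, so the hyperlink from point 5 survives. The per-sender template idea would then sit on top of this, mapping cell positions to recordset fields.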