[AccessD] Hyperlink screen scraping

Tue May 27 10:55:28 CDT 2003

Screenscraping has its problems. If someone changes the web page, your 
parser of the html or xml string returned has to adapt, There are also 
problems if the site uses some form of password, authentication and/or 
cookies. There is no standard. This has messed me up in the past if you 
have
no contact with the end site, to find out their particular standard. 
Some sites put their encoded site password in HTML headers. Note Amazon 
sends their authentication code in clear with the URL and no direct https.

 Here is some VBA sample code to search Amazon with their web services 
via xml
Might give you some ideas. Amazon used basic XML URL posts rather than SOAP.
One routine searches computer equipment; other searches books, third 
routine uses dom to parse the xml returned , but you would be working 
with html probably.

Get your own developer ID from Amazon and change in code.

http://www.amazon.com/gp/aws/landing.html/103-9253290-7901428

 I ran this from Access 97 WinXP, so I would have MS XML 4.0 installed
Use the test Amazon XML scratchpad to set and change your URL xml POST 
string code
without going through all the documentation.

There should be some sample XMLHttp code in the archives.

'
' Amazon Query Examples
'
'http://www.amazon.com/gp/aws/landing.html/103-9253290-7901428

Sub AmazonQueryBooks()
  Dim SelectedText As String
  Dim MSXML As Object
  Dim XMLURL As String
  Dim Loaded As Boolean
  Dim AssociateID As String
  Dim DeveloperToken As String

  ' Get text selection
  SelectedText = "Greg Bear" 'SciFi Author

  ' Make sure that some text is selected
  If ((Len(SelectedText) = 0) Or (SelectedText < " ")) Then
    MsgBox "Please select some text and try again."
    Exit Sub
  End If

  ' Set Associate ID and Developer Token
  AssociateID = "webservices-20"
  DeveloperToken = "D2ED5GRZA6RZ7Y"

  ' Create MSXML Object
  Set MSXML = CreateObject("MSXML.DOMDocument")

  ' Set MSXML Options
  MSXML.Async = False
  MSXML.preserveWhiteSpace = False
  MSXML.validateOnParse = True
  MSXML.resolveExternals = False

  ' Form an XML URL
  XMLURL = "http://xml.amazon.com/onca/xml2" + _
           "?t=" + AssociateID + _
           "&dev-t=" + DeveloperToken + _
           "&page=1" + _
           "&f=xml" + _
           "&mode=books" + _
           "&type=lite" + _
           "&KeywordSearch=" + SelectedText

  ' Issue the request and wait for it to be honored
  Loaded = MSXML.Load(XMLURL)

  If (Loaded) Then
   ' ProcessResults MSXML
   'show as long string

   Debug.Print MSXML.xml
  Else
    MsgBox "The service is not available."
  End If

End Sub

Sub AmazonQueryComputers()
  Dim SelectedText As String
  Dim MSXML As Object
  Dim XMLURL As String
  Dim Loaded As Boolean
  Dim AssociateID As String
  Dim DeveloperToken As String

  ' Get text selection
  SelectedText = "Toshiba"

  ' Make sure that some text is selected
  If ((Len(SelectedText) = 0) Or (SelectedText < " ")) Then
    MsgBox "Please select some text and try again."
    Exit Sub
  End If

  ' Set Associate ID and Developer Token
  AssociateID = "webservices-20"
  DeveloperToken = "D2EF5GZR7A6RZ7Y"

  ' Create MSXML Object
  Set MSXML = CreateObject("MSXML.DOMDocument")

  ' Set MSXML Options
  MSXML.Async = False
  MSXML.preserveWhiteSpace = False
  MSXML.validateOnParse = True
  MSXML.resolveExternals = False

  ' Form an XML URL

  XMLURL = "http://xml.amazon.com/onca/xml3" & _
            "?t=" & AssociateToken  & _
            "&dev-t="  &  DeveloperToken  & _
            "&KeywordSearch=" & SelectedText & _
            "&mode=pc-hardware" & _
            "&sort=+pricrerank" & _
            "&offer=All" & _
            "&type=lite" & _
            "&page=2" & _
            "&f=xml"
  ' Issue the request and wait for it to be honored
  Loaded = MSXML.Load(XMLURL)

  If (Loaded) Then
   ' ProcessResults MSXML
   Debug.Print MSXML.xml
  Else
    MsgBox "The service is not available."
  End If

End Sub
'
' ProcessResults -
'
'   Iterate through the given XML object and display each item.
'   Allow the user to choose some items and insert them into
'   the document.
'
Sub ProcessResults(MSXML As Object)
  Dim XML_Root As Object
  Dim XML_Child As Object
  Dim XML_Title As Object
  Dim XML_ListPrice As Object
  Dim XML_OurPrice As Object
  Dim XML_Mfr As Object
  Dim XML_Authors As Object
  Dim XML_Author As Object
  Dim XML_ReleaseDate As Object
  Dim TotalResults As Long
  Dim Title As String
  Dim ListPrice As String
  Dim OurPrice As String
  Dim Mfr As String
  Dim URL As String
  Dim Author As String
  Dim ReleaseDate As String
  Dim Updates As String
  Dim Footnote As String
  Dim i As Long

  ' Get root of document
  Set XML_Root = MSXML.documentElement

  ' Reset list of items in document
 ' AmazonDetailsForm.ClearItems

  ' Process each top-level element
  For i = 0 To XML_Root.childNodes.Length - 1
    Set XML_Child = XML_Root.childNodes.Item(i)

    Select Case XML_Child.nodeName
      Case "TotalResults"
      TotalResults = CLng(XML_Child.Text)

      Case "Details"
        Set XML_Title = Nothing
        Set XML_ListPrice = Nothing
        Set XML_OurPrice = Nothing
        Set XML_Mfr = Nothing
        Set XML_Author = Nothing
        Set XML_ReleaseDate = Nothing
        Title = ""
        ListPrice = ""
        OurPrice = ""
        Mfr = ""
        Author = ""
        ReleaseDate = ""

        ' Locate the child nodes
        Set XML_Title = XML_Child.selectSingleNode("ProductName")
        Set XML_ListPrice = XML_Child.selectSingleNode("ListPrice")
        Set XML_OurPrice = XML_Child.selectSingleNode("OurPrice")
        Set XML_Mfr = XML_Child.selectSingleNode("Manufacturer")
        Set XML_Authors = XML_Child.selectSingleNode("Authors")
        Set XML_ReleaseDate = XML_Child.selectSingleNode("ReleaseDate")

        ' Extract the text from the child nodes
        If (Not (XML_Title Is Nothing)) Then Title = XML_Title.Text
        If (Not (XML_ListPrice Is Nothing)) Then ListPrice = 
XML_ListPrice.Text
        If (Not (XML_OurPrice Is Nothing)) Then OurPrice = XML_OurPrice.Text
        If (Not (XML_Mfr Is Nothing)) Then Mfr = XML_Mfr.Text
        If (Not (XML_ReleaseDate Is Nothing)) Then ReleaseDate = 
XML_ReleaseDate.Text
        URL = XML_Child.getAttribute("url")

        If (Not (XML_Authors Is Nothing)) Then
          Set XML_Author = XML_Authors.selectSingleNode("Author")
          If (Not (XML_Author Is Nothing)) Then Author = XML_Author.Text
        End If

        ' Add the text to the form
    '    AmazonDetailsForm.AddItem Title, ListPrice, OurPrice, Mfr, URL, 
Author, ReleaseDate
    End Select
  Next i

  ' Show the form
 ' AmazonDetailsForm.Show vbModal

  '
  ' Update the document:
  ' 1 - Insert document title inline
  ' 2 - Insert footnote (if requested) as footnote
  '
  'Updates = AmazonDetailsForm.GetSelectedDetails

  'If (Updates <> "") Then
  '  Selection.MoveRight 1
  '  Selection.Text = " [" + Updates + "] "
  'End If

  ' Footnote = AmazonDetailsForm.GetSelectedFootnote
  ' If (Footnote <> "") Then
  '  Selection.Footnotes.Add Range:=Selection.Range, Text:=Footnote
  ' End If

End Sub

Bruce Bruen wrote:

>On further reflection XMLHttp looks like the way to go....just stepping
>into the MSXML documentation now... If I'm not back in 3 weeks could
>someone please feed the kids.... :-)
>
>Will report back!
>Bruce
>
>-----Original Message-----
>From: accessd-bounces at databaseadvisors.com
>[mailto:accessd-bounces at databaseadvisors.com] On Behalf Of Bruce Bruen
>Sent: Tuesday, May 27, 2003 8:40 PM
>To: accessd at databaseadvisors.com
>Subject: RE: [AccessD] Hyperlink screen scraping
>
>
>Thanks Marty.....Urrgh to much information
>
>The pages we are tryinng to retrieve are "pure" html ( with some
>unwanted pictures ). Primarily, they are vendor price lists we are
>trying to monitor against benchmarks for PC componentry.  One user, say
>"Fred" is after motherboard prices. He builds a list of vendor sites and
>runs a daily check on the supply price for the model(s) he is looking
>for.  Most of the time is spent clicking on a "favourites", waiting for
>the page to load and copying the days prices into his pricing models.
>What we would like to do is kick off the "download" in the morning, grab
>all the pages while he's getting a coffee and then show them one at a
>time and Fred can highlight any lines he wants and then we'll scrape
>that bit off the page and paste it in his models.
>
>What choice do you suggest.  I note that the XML route would be the
>technopolitic right way - but the vendors aren't necessarily agreed
>suppliers - in fact in general its just the opposite!, we are trying to
>keep the agreed suppliers on their toes. (Also note that my Polish isn't
>up to scratch re Inet API - or did I get the wrong site baby?)  Finally,
>if WinHttp is the go, do you know of a site that's a bit more to the
>point of our problem rather than the MSDN pages(and pages and pages and
>......
>
>Yet again
>tia
>Bruce
>
>-----Original Message-----
>From: accessd-bounces at databaseadvisors.com
>[mailto:accessd-bounces at databaseadvisors.com] On Behalf Of MartyConnelly
>Sent: Tuesday, May 27, 2003 1:29 
>To: accessd at databaseadvisors.com
>Subject: Re: [AccessD] Hyperlink screen scraping
>
>
>There are at least 3 ways to do this WinHttp, Inet Api's and XMLHttp. 
>These all may require an install or a specific version of IE. It depends
>
>on what type of document you want to download a jpeg, a word Doc, text 
>file or html. It also depends on what OS you want to deploy on. I would 
>probably put the download into an OLE field and only use text into a 
>memo field.
>
>Bruce Bruen wrote:
>
>  
>
>>Dear List,
>>
>>I am sure we have covered this before (Seth?) but I cannot find
>>anything in the archives (probably cant set a good keyword).
>>
>>The database has a text field containing a hyperlink.  When the user
>>clicks a command button on a form showing the address I want to 
>>download the hyperlink document and save the contents of the page in a 
>>memo field for later scraping and processing.
>>
>>How is this done? Follow and FollowHyperlink just open IE.  Is there
>>something I am missing (apart from that!)
>>
>>
>>Tia
>>Bruce
>>
>>
>> 
>>
>>    
>>
>
>  
>