MartyConnelly
martyconnelly at shaw.ca
Tue May 27 10:55:28 CDT 2003
Screenscraping has its problems. If someone changes the web page, your parser of the html or xml string returned has to adapt, There are also problems if the site uses some form of password, authentication and/or cookies. There is no standard. This has messed me up in the past if you have no contact with the end site, to find out their particular standard. Some sites put their encoded site password in HTML headers. Note Amazon sends their authentication code in clear with the URL and no direct https. Here is some VBA sample code to search Amazon with their web services via xml Might give you some ideas. Amazon used basic XML URL posts rather than SOAP. One routine searches computer equipment; other searches books, third routine uses dom to parse the xml returned , but you would be working with html probably. Get your own developer ID from Amazon and change in code. http://www.amazon.com/gp/aws/landing.html/103-9253290-7901428 I ran this from Access 97 WinXP, so I would have MS XML 4.0 installed Use the test Amazon XML scratchpad to set and change your URL xml POST string code without going through all the documentation. There should be some sample XMLHttp code in the archives. ' ' Amazon Query Examples ' 'http://www.amazon.com/gp/aws/landing.html/103-9253290-7901428 Sub AmazonQueryBooks() Dim SelectedText As String Dim MSXML As Object Dim XMLURL As String Dim Loaded As Boolean Dim AssociateID As String Dim DeveloperToken As String ' Get text selection SelectedText = "Greg Bear" 'SciFi Author ' Make sure that some text is selected If ((Len(SelectedText) = 0) Or (SelectedText < " ")) Then MsgBox "Please select some text and try again." Exit Sub End If ' Set Associate ID and Developer Token AssociateID = "webservices-20" DeveloperToken = "D2ED5GRZA6RZ7Y" ' Create MSXML Object Set MSXML = CreateObject("MSXML.DOMDocument") ' Set MSXML Options MSXML.Async = False MSXML.preserveWhiteSpace = False MSXML.validateOnParse = True MSXML.resolveExternals = False ' Form an XML URL XMLURL = "http://xml.amazon.com/onca/xml2" + _ "?t=" + AssociateID + _ "&dev-t=" + DeveloperToken + _ "&page=1" + _ "&f=xml" + _ "&mode=books" + _ "&type=lite" + _ "&KeywordSearch=" + SelectedText ' Issue the request and wait for it to be honored Loaded = MSXML.Load(XMLURL) If (Loaded) Then ' ProcessResults MSXML 'show as long string Debug.Print MSXML.xml Else MsgBox "The service is not available." End If End Sub Sub AmazonQueryComputers() Dim SelectedText As String Dim MSXML As Object Dim XMLURL As String Dim Loaded As Boolean Dim AssociateID As String Dim DeveloperToken As String ' Get text selection SelectedText = "Toshiba" ' Make sure that some text is selected If ((Len(SelectedText) = 0) Or (SelectedText < " ")) Then MsgBox "Please select some text and try again." Exit Sub End If ' Set Associate ID and Developer Token AssociateID = "webservices-20" DeveloperToken = "D2EF5GZR7A6RZ7Y" ' Create MSXML Object Set MSXML = CreateObject("MSXML.DOMDocument") ' Set MSXML Options MSXML.Async = False MSXML.preserveWhiteSpace = False MSXML.validateOnParse = True MSXML.resolveExternals = False ' Form an XML URL XMLURL = "http://xml.amazon.com/onca/xml3" & _ "?t=" & AssociateToken & _ "&dev-t=" & DeveloperToken & _ "&KeywordSearch=" & SelectedText & _ "&mode=pc-hardware" & _ "&sort=+pricrerank" & _ "&offer=All" & _ "&type=lite" & _ "&page=2" & _ "&f=xml" ' Issue the request and wait for it to be honored Loaded = MSXML.Load(XMLURL) If (Loaded) Then ' ProcessResults MSXML Debug.Print MSXML.xml Else MsgBox "The service is not available." End If End Sub ' ' ProcessResults - ' ' Iterate through the given XML object and display each item. ' Allow the user to choose some items and insert them into ' the document. ' Sub ProcessResults(MSXML As Object) Dim XML_Root As Object Dim XML_Child As Object Dim XML_Title As Object Dim XML_ListPrice As Object Dim XML_OurPrice As Object Dim XML_Mfr As Object Dim XML_Authors As Object Dim XML_Author As Object Dim XML_ReleaseDate As Object Dim TotalResults As Long Dim Title As String Dim ListPrice As String Dim OurPrice As String Dim Mfr As String Dim URL As String Dim Author As String Dim ReleaseDate As String Dim Updates As String Dim Footnote As String Dim i As Long ' Get root of document Set XML_Root = MSXML.documentElement ' Reset list of items in document ' AmazonDetailsForm.ClearItems ' Process each top-level element For i = 0 To XML_Root.childNodes.Length - 1 Set XML_Child = XML_Root.childNodes.Item(i) Select Case XML_Child.nodeName Case "TotalResults" TotalResults = CLng(XML_Child.Text) Case "Details" Set XML_Title = Nothing Set XML_ListPrice = Nothing Set XML_OurPrice = Nothing Set XML_Mfr = Nothing Set XML_Author = Nothing Set XML_ReleaseDate = Nothing Title = "" ListPrice = "" OurPrice = "" Mfr = "" Author = "" ReleaseDate = "" ' Locate the child nodes Set XML_Title = XML_Child.selectSingleNode("ProductName") Set XML_ListPrice = XML_Child.selectSingleNode("ListPrice") Set XML_OurPrice = XML_Child.selectSingleNode("OurPrice") Set XML_Mfr = XML_Child.selectSingleNode("Manufacturer") Set XML_Authors = XML_Child.selectSingleNode("Authors") Set XML_ReleaseDate = XML_Child.selectSingleNode("ReleaseDate") ' Extract the text from the child nodes If (Not (XML_Title Is Nothing)) Then Title = XML_Title.Text If (Not (XML_ListPrice Is Nothing)) Then ListPrice = XML_ListPrice.Text If (Not (XML_OurPrice Is Nothing)) Then OurPrice = XML_OurPrice.Text If (Not (XML_Mfr Is Nothing)) Then Mfr = XML_Mfr.Text If (Not (XML_ReleaseDate Is Nothing)) Then ReleaseDate = XML_ReleaseDate.Text URL = XML_Child.getAttribute("url") If (Not (XML_Authors Is Nothing)) Then Set XML_Author = XML_Authors.selectSingleNode("Author") If (Not (XML_Author Is Nothing)) Then Author = XML_Author.Text End If ' Add the text to the form ' AmazonDetailsForm.AddItem Title, ListPrice, OurPrice, Mfr, URL, Author, ReleaseDate End Select Next i ' Show the form ' AmazonDetailsForm.Show vbModal ' ' Update the document: ' 1 - Insert document title inline ' 2 - Insert footnote (if requested) as footnote ' 'Updates = AmazonDetailsForm.GetSelectedDetails 'If (Updates <> "") Then ' Selection.MoveRight 1 ' Selection.Text = " [" + Updates + "] " 'End If ' Footnote = AmazonDetailsForm.GetSelectedFootnote ' If (Footnote <> "") Then ' Selection.Footnotes.Add Range:=Selection.Range, Text:=Footnote ' End If End Sub Bruce Bruen wrote: >On further reflection XMLHttp looks like the way to go....just stepping >into the MSXML documentation now... If I'm not back in 3 weeks could >someone please feed the kids.... :-) > >Will report back! >Bruce > >-----Original Message----- >From: accessd-bounces at databaseadvisors.com >[mailto:accessd-bounces at databaseadvisors.com] On Behalf Of Bruce Bruen >Sent: Tuesday, May 27, 2003 8:40 PM >To: accessd at databaseadvisors.com >Subject: RE: [AccessD] Hyperlink screen scraping > > >Thanks Marty.....Urrgh to much information > >The pages we are tryinng to retrieve are "pure" html ( with some >unwanted pictures ). Primarily, they are vendor price lists we are >trying to monitor against benchmarks for PC componentry. One user, say >"Fred" is after motherboard prices. He builds a list of vendor sites and >runs a daily check on the supply price for the model(s) he is looking >for. Most of the time is spent clicking on a "favourites", waiting for >the page to load and copying the days prices into his pricing models. >What we would like to do is kick off the "download" in the morning, grab >all the pages while he's getting a coffee and then show them one at a >time and Fred can highlight any lines he wants and then we'll scrape >that bit off the page and paste it in his models. > >What choice do you suggest. I note that the XML route would be the >technopolitic right way - but the vendors aren't necessarily agreed >suppliers - in fact in general its just the opposite!, we are trying to >keep the agreed suppliers on their toes. (Also note that my Polish isn't >up to scratch re Inet API - or did I get the wrong site baby?) Finally, >if WinHttp is the go, do you know of a site that's a bit more to the >point of our problem rather than the MSDN pages(and pages and pages and >...... > >Yet again >tia >Bruce > >-----Original Message----- >From: accessd-bounces at databaseadvisors.com >[mailto:accessd-bounces at databaseadvisors.com] On Behalf Of MartyConnelly >Sent: Tuesday, May 27, 2003 1:29 >To: accessd at databaseadvisors.com >Subject: Re: [AccessD] Hyperlink screen scraping > > >There are at least 3 ways to do this WinHttp, Inet Api's and XMLHttp. >These all may require an install or a specific version of IE. It depends > >on what type of document you want to download a jpeg, a word Doc, text >file or html. It also depends on what OS you want to deploy on. I would >probably put the download into an OLE field and only use text into a >memo field. > >Bruce Bruen wrote: > > > >>Dear List, >> >>I am sure we have covered this before (Seth?) but I cannot find >>anything in the archives (probably cant set a good keyword). >> >>The database has a text field containing a hyperlink. When the user >>clicks a command button on a form showing the address I want to >>download the hyperlink document and save the contents of the page in a >>memo field for later scraping and processing. >> >>How is this done? Follow and FollowHyperlink just open IE. Is there >>something I am missing (apart from that!) >> >> >>Tia >>Bruce >> >> >> >> >> >> > > >