One of the long-standing requests (or complaints) is for WebCopy to support JavaScript-enabled websites, e.g. modern SPAs where JavaScript is used to build the page. Traditionally this is something I have always put onto the furthest of back burners, as supporting it natively would essentially mean writing half a browser, which would be a full-time job and a half and not something I'm interested in doing. Other solutions did exist, but I never really looked into them.
It recently occurred to me, however, that I'd already put in place all the building blocks I needed to have WebCopy support JavaScript execution (in a limited fashion, more on this later) using Internet Explorer. And it was easy; in fact, the hardest part was sorting out threading issues. Although WebCopy currently only crawls on a single thread, that thread is separate from the UI thread so that the UI doesn't freeze, and this is something COM can have a problem with.
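To illustrate the threading point in general terms, here is a minimal Python sketch of one common way to deal with an object that must only ever be touched from a single thread: give it a dedicated thread and have everything else post work to that thread through a queue. This is not WebCopy's actual code (WebCopy is not written in Python), and the `DedicatedThread` class and its methods are hypothetical names made up for this example.

```python
import queue
import threading
from concurrent.futures import Future


class DedicatedThread:
    """Runs every submitted callable on one long-lived thread.

    Hypothetical helper: a stand-in for "the thread that owns the
    embedded browser", so other threads never touch it directly.
    """

    def __init__(self):
        self._jobs = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # Process jobs one at a time until the None sentinel arrives.
        while True:
            job = self._jobs.get()
            if job is None:
                break
            future, func, args = job
            try:
                future.set_result(func(*args))
            except Exception as exc:
                # Hand the failure back to whichever thread is waiting.
                future.set_exception(exc)

    def submit(self, func, *args):
        """Queue func(*args) for the owning thread and return a Future."""
        future = Future()
        self._jobs.put((future, func, args))
        return future

    def close(self):
        """Stop the owning thread once all queued jobs have run."""
        self._jobs.put(None)
        self._thread.join()
```

A crawler thread would then call something like `worker.submit(load_page, url).result()`, where `load_page` is whatever function drives the browser. The call blocks the crawler until the page has been processed, but the browser object itself is only ever used on the one thread that owns it.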
The end result? A new Use Web Browser option can be found in the Project Properties dialog. When set, WebCopy will do its own downloading and remapping of content, but it will use an embedded Internet Explorer session to do the crawling.
The screenshot above shows a scan of the WebCopy demonstration site. The page dom.php has a few lines of JavaScript to build a list of links. As seen above, previous versions of WebCopy are completely oblivious to these extra links.
The image above is the same website scanned using WebCopy 1.8 with the new option enabled. You can see how it has detected the additional links, because JavaScript was allowed to execute. If you peer hard enough you will also see that the scan was significantly slower as a result of using this technique.
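To show why a crawler that only reads the downloaded HTML misses such links, here is a small Python sketch using the standard library's `html.parser`. The sample page is invented for illustration and is not the real dom.php, and `LinkCollector` and `extract_static_links` are hypothetical names; the sketch only demonstrates static link extraction, not how WebCopy itself parses documents.

```python
from html.parser import HTMLParser

# Invented sample page: one link in the markup, two more created by script.
PAGE = """<html><body>
<a href="index.php">Home</a>
<ul id="list"></ul>
<script>
var list = document.getElementById('list');
['one.php', 'two.php'].forEach(function (name) {
  var item = document.createElement('li');
  var link = document.createElement('a');
  link.href = name;
  link.textContent = name;
  item.appendChild(link);
  list.appendChild(item);
});
</script>
</body></html>"""


class LinkCollector(HTMLParser):
    """Collects href values from anchor tags present in the markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_static_links(html):
    """Return the links found without executing any script."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


print(extract_static_links(PAGE))
```

The parser only ever sees `index.php`. The links to `one.php` and `two.php` don't exist until a browser runs the script and adds them to the DOM, which is the gap the new option is meant to close.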
Listing the cons
Although I'm pleased to be able to finally offer this functionality, there are a few caveats.
This functionality is very new, and very experimental. It is by no means certain that I have ironed out all the potential issues. Caveat Emptor!
- Crawling may be substantially slower. HTML documents will be downloaded twice, and the headless web browsing will also add significant overhead
- JavaScript is being executed. This can lead to your sessions being fingerprinted or tracked, malicious content being downloaded, or any number of other things
- This functionality currently uses the latest version of Internet Explorer that is installed on your computer. Not all websites play nicely with IE
- Keeping with the Internet Explorer theme, it will share and use global cookies
- Some options won't apply - for example the user agent. If a website is particularly unfriendly, it may serve different content to WebCopy than it does to the hosted Internet Explorer session
- WebCopy will remap only the original document it downloads, not the JavaScript executed version. I don't plan on changing this behaviour
- This system only supports non-interactive scripts, e.g. JavaScript that executes when the page loads. I have no intention of supporting scripts that normally require user interaction to run, e.g. clicking a button or scrolling a window
- It occurs to me as I write this post that I have no idea what will happen if the scripts try to open a popup window. Probably nothing good!
- Potentially more issues. Experimental code!
I don't want to use Internet Explorer, can't I use Chrome or Firefox?
Neither do I. Microsoft have dropped the ball so many times with web browsers that I'm amazed they are still in the game. Although I wish they'd just decoupled Edge from the OS and updated it more frequently rather than giving in to Google and adopting Chromium. But I've probably stated this before, plus, as usual, I digress.
To get back to the point, I expect future versions of WebCopy will support both Firefox and Chromium. However, as these browsers are several times larger than WebCopy, they won't be included by default. So I also need to have a nice system so that you can easily add extra browser engines to WebCopy from within the application and without needing to install anything.
I'm also considering supporting Edge, as Microsoft appear to be adding support for this to .NET, as long as you're on the latest Windows 10. However, given that it's probably "old" Edge, this may not happen: supporting two obsolete browsers, one of which is only available to a fraction of users, would be a waste of time I simply don't have.
I'll have more to write about this in future I'm sure!
All content Copyright (c) by Cyotek Ltd or its respective writers. Permission to reproduce news and web log entries and other RSS feed content in unmodified form without notice is granted provided they are not used to endorse or promote any products or opinions (other than what was expressed by the author) and without taking them out of context. Written permission from the copyright owner must be obtained for everything else.
Original URL of this content is https://www.cyotek.com/blog/products/webcopy-1-8-javascript-support?source=rss.