To passively scrape a webpage one uses browser-automation tools such as #Selenium or #Puppeteer, ideally run headless. Of course one can use any tool that is typically used for #e2e testing in the #browser.
The biggest obstacle to passive scraping is dealing with either #captcha or #cloudflare.
Captcha farms will solve captchas for a small monetary fee, and Cloudflare can often be overcome by rotating IPs (IP hopping).
In general, passive scraping only works on websites whose protections are poorly configured.
I hate Google, but #GoogleColab can be handy.
We had some teaching examples of #webscraping with #selenium on colab because installing the webdriver locally can be challenging for some students.
Google broke selenium scraping on colab. :((
Any #MyBinder suggestions? Something else..?
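For context, the environment-setup step many Colab selenium notebooks relied on was roughly the following (a sketch; these are the Ubuntu package names as commonly used, and changes to exactly this setup are part of what broke):

```shell
# Setup commonly used in Colab selenium demos (may no longer work there)
apt-get update
apt-get install -y chromium-chromedriver
pip install selenium
```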
There's a website which has a download button.
The button dynamically generates a "blob" which is then downloaded.
I want to use #Selenium's Chrome Webdriver to "click" that button and save the file to a specific location.
Is that possible without lots of funky JS injection?
(I am using a headless Linux box, so I can't launch Chrome normally.)
I've spent the evening learning #Selenium - quite good fun being able to manipulate the web and take screenshots using #Python.
But what's really cheesing me off is non-semantic class names. Seriously - what genius came up with a framework which dumps random strings into nice orderly HTML attributes?
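One workaround for those mangled names: match on the stable part of the class with a CSS attribute-substring selector instead of the full string. A sketch; the class names here are invented examples, and the Selenium call is shown as a comment:

```python
# With Selenium you'd target the stable part of a generated class name:
#   driver.find_element(By.CSS_SELECTOR, '[class*="Button__primary"]')
# The helper below mimics what that [class*=...] substring selector matches.

def matches_stable_part(class_attr: str, stable: str) -> bool:
    """True if the stable substring appears in the (mangled) class attribute."""
    return stable in class_attr

print(matches_stable_part("Button__primary-sc-1x2y3z", "Button__primary"))  # → True
```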
But the limitations of using #Selenium are big ones: being forced to work in Java, forced to use the resource hog of a modern GUI browser, forced to reveal more browser-fingerprint info, being browser-dependent, etc. Selenium is my last choice, for when desperation is sufficiently high.
Automated testing is crucial for the continued quality of #KDE software. The #Selenium WebDriver automates such tests in the browser. Selenium-AT-SPI does the same but for #Qt programs.
Google engineers want to make ad-blocking (near) impossible (stackdiary.com)