Scraping dynamic content from website in near-realtime - javascript

I’m trying to implement a web scraper scraping dynamically updated content from a website in near-realtime.
Let’s take https://www.timeanddate.com/worldclock/ as an example and assume I want to continuously get the current time at my home location.
My solution right now is as follows: Get the rendered page content every second and extract the time using bs4. Working Code:
import asyncio

import bs4
import pyppeteer

def get_current_time(content):
    soup = bs4.BeautifulSoup(content, features="lxml")
    clock = soup.find(class_="my-city__digitalClock")
    hour_minutes = clock.contents[3].next_element
    seconds = clock.contents[5].next_element
    return hour_minutes + ":" + seconds

async def main():
    browser = await pyppeteer.launch()
    page = await browser.newPage()
    await page.goto("https://www.timeanddate.com/worldclock/")
    for _ in range(30):
        content = await page.content()
        print(get_current_time(content))
        await asyncio.sleep(1)
    await browser.close()

asyncio.run(main())
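As an aside, the extraction can be made less dependent on child positions by selecting on class names instead of `contents[3]` / `contents[5]`. A sketch against a simplified, hypothetical snippet of the clock markup (the `my-city__hourMinute` class is my own invention; the real page's structure may differ):

```python
import bs4

# Hypothetical, simplified markup; the real timeanddate.com structure may differ.
SAMPLE = """
<div class="my-city__digitalClock">
  <span class="my-city__hourMinute">12:34</span>
  <span class="my-city__seconds">56</span>
</div>
"""

def get_current_time(content: str) -> str:
    # html.parser avoids the extra lxml dependency
    soup = bs4.BeautifulSoup(content, features="html.parser")
    clock = soup.find(class_="my-city__digitalClock")
    hour_minutes = clock.find(class_="my-city__hourMinute").get_text(strip=True)
    seconds = clock.find(class_="my-city__seconds").get_text(strip=True)
    return f"{hour_minutes}:{seconds}"

print(get_current_time(SAMPLE))  # 12:34:56
```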
What I would like to do instead is react only when the time is updated on the page. Reasons: faster reaction and less computational load, especially when monitoring multiple pages that may update at irregular intervals, smaller or much larger than a second.
I have the following three ideas for solving this, but I don't know how to continue. There might also be a much simpler / more elegant approach:
1) Intercepting network responses using pyppeteer
This does not seem to work, since there is no further network activity after the page initially loads (except for advertising), as I can see in the Network tab of Chrome Dev Tools.
2) Reacting to custom events on the page
Using the “Event Listener Breakpoints” in the “Sources” tab in Chrome Dev Tools, I can stop the JavaScript code execution on various events (e.g. the “Set innerHTML” event).
Is it possible to do something like this using pyppeteer, and to get some context information about the event (e.g. which element is updated with which new text)?
It seems to be possible using JavaScript and puppeteer (see https://github.com/puppeteer/puppeteer/blob/main/examples/custom-event.js), but I think pyppeteer does not provide this functionality (I could not find it in the API Reference).
3) Overriding a function in the JavaScript code of the page
Override a relevant function and intercept the relevant data (which are provided to that function as a parameter).
This idea is inspired by this blogpost: https://antoinevastel.com/javascript/2019/06/10/monitor-js-execution.html
Entire code for the blogpost: https://github.com/antoinevastel/blog-post-monitor-js/blob/master/monitorExecution.js
I tried around a bit, but my JavaScript knowledge seems too limited to even override a function in one of the scripts used by the page.
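One way to get the event-driven behavior of ideas 2/3 without intercepting the page's own functions is to expose a Python callback to the page with pyppeteer's page.exposeFunction() and have a MutationObserver invoke it whenever the clock element changes. A rough, untested sketch (the callback name `onClockChange` and the observer options are my own assumptions; only the class name comes from the page):

```python
import asyncio

# JavaScript to install in the page: watch the clock element and push its text
# to the exposed Python callback on every DOM mutation.
OBSERVER_JS = """
() => {
    const clock = document.querySelector('.my-city__digitalClock');
    new MutationObserver(() => {
        window.onClockChange(clock.textContent);  // calls back into Python
    }).observe(clock, {childList: true, characterData: true, subtree: true});
}
"""

async def main():
    import pyppeteer  # imported lazily so the sketch can be read without it installed
    browser = await pyppeteer.launch()
    page = await browser.newPage()
    await page.goto("https://www.timeanddate.com/worldclock/")
    # Expose a Python callable to the page as window.onClockChange.
    await page.exposeFunction("onClockChange", lambda text: print("update:", text))
    await page.evaluate(OBSERVER_JS)
    await asyncio.sleep(30)  # let updates stream in for a while
    await browser.close()

# Run with: asyncio.run(main())
```

This reacts only when the DOM actually changes, so pages that update slower or faster than once per second cost nothing extra.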

You could achieve this with Selenium. I am using the Chrome webdriver via webdriver-manager but you can modify this to use whatever you prefer.
First, all of our imports
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.remote.webelement import WebElement
from webdriver_manager.chrome import ChromeDriverManager
Create our driver object with the headless parameter so that the browser window doesn't open.
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
Define a function that accepts a WebElement to extract the clock time.
def getTimeString(myClock: WebElement) -> str:
    hourMinute = myClock.find_element(By.XPATH, "span[position()=2]").text
    seconds = myClock.find_element(By.CLASS_NAME, "my-city__seconds").text
    return f"{hourMinute}:{seconds}"
Get the page and extract the clock WebElement
driver.get("https://www.timeanddate.com/worldclock/")
myClock = driver.find_element(By.CLASS_NAME, "my-city__digitalClock")
Finally, implement our loop
last = None
while True:
    now = getTimeString(myClock)
    if now == last:
        continue
    print(now)
    last = now
Before your logic concludes, be sure to run driver.quit() to clean up.
Output
05:27:56
05:27:57
05:27:58
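One caveat: the while True loop above spins the CPU at full speed between updates. A small, generic polling helper (my own addition, not part of the answer) throttles the checks and works with any zero-argument getter, Selenium-based or otherwise:

```python
import time

def poll_changes(get_value, handle, interval=0.05, max_iterations=None):
    """Call get_value() every `interval` seconds; pass each *new* value to handle()."""
    last = None
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        now = get_value()
        if now != last:
            handle(now)
            last = now
        iterations += 1
        time.sleep(interval)

# Self-contained demo with a fake clock; with the answer's objects you would
# call something like: poll_changes(lambda: getTimeString(myClock), print)
values = iter(["05:27:56", "05:27:56", "05:27:57"])
seen = []
poll_changes(lambda: next(values), seen.append, interval=0, max_iterations=3)
print(seen)  # ['05:27:56', '05:27:57']
```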

Related

Get variable inside DOM before DOM changes when clicking redirecting button

I have been trying for a long time to find a way to persist variables between page refreshes and across different pages within one browser session opened from Selenium in Python.
Unfortunately, storing the variable in localStorage, sessionStorage, or window.name did not work, despite much testing and research.
So I have resorted to a Python script that continuously calls driver.execute_script('return variable') and keeps gathering data while surfing.
The data to collect is the value of the element that gets clicked, which is caught by a click eventListener and inserted into a local variable I have added to the page.
This all works fine, except when the element that gets clicked is a button containing a link that redirects the page and changes the DOM.
My best guess is that the click, the JavaScript that stores the variable, the JavaScript that retrieves the variable, and the page redirect all happen at almost the same moment, and that the change of the DOM happens before the variable is retrieved, thus wiping out any chance of getting that data.
This is the code:
from selenium.common import TimeoutException, WebDriverException
from selenium.webdriver.support.ui import WebDriverWait
from selenium import webdriver

class Main:
    def __init__(self, page_url):
        self.__driver = webdriver.Chrome()
        self.__element_list = []
        self.__page_url = page_url

    def start(self):
        program_return = []
        self.__driver.get(self.__page_url)
        event_js = '''
        var array_events = []
        var registerOuterHtml = (e) => {
            array_events.push(e.target.outerHTML)
            window.array_events = array_events
        }
        var registerUrl = (e) => {
            array_events.push(document.documentElement.outerHTML)
        }
        getElementHtml = document.addEventListener("click", registerOuterHtml, true)
        getDOMHtml = document.addEventListener("click", registerUrl, true)
        '''
        return_js = '''return window.array_events'''
        self.__driver.set_script_timeout(10000)
        self.__driver.execute_script(event_js)
        try:
            for _ in range(800):
                if array_events := self.__driver.execute_script(return_js):
                    if array_events[-2:] not in program_return:
                        program_return.append(array_events[-2:])
                else:
                    try:
                        WebDriverWait(self.__driver, 0.1).until(
                            lambda driver: self.__driver.current_url != self.__page_url)
                    except TimeoutException:
                        pass
                    else:
                        self.__page_url = self.__driver.current_url
                        self.__driver.execute_script(event_js)
        except WebDriverException:
            pass
        finally:
            print(len(program_return))  # should print total number of clicks made
To test it out, call it like this:
Main('any url you wish').start()
After clicking around (you should click at least one button that changes the page), you can close the window manually and check the results.
Any idea or ideally a solution to this problem would be greatly appreciated.
Overall question: taking for granted that persisting variables between different pages is not possible, how can I get the value of the variable that is set at the time of the click, before the page changes, from that same click action? (Maybe delay the whole page...?)
Theoretically you can get some global data before a navigation like:
data = driver.execute_async_script("""
    let [resolve] = arguments
    window.onunload = () => resolve(window.some_global_data)
""")
but it's likely to time out... Puppeteer / Playwright are better suited to things like this. There are Python ports of them you might try.
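For example, with Playwright's Python port, click data can be pushed into Python the moment the click happens, and an init script re-installs the listener after every navigation, which is exactly where the polling approach loses data. A rough, untested sketch (the outerHTML payload mirrors the question's listener; the function and binding names are my own):

```python
# JavaScript re-installed into every new document via add_init_script, so the
# click listener survives the page navigations that break the polling approach.
INIT_JS = """
document.addEventListener('click',
    e => window.reportClick(e.target.outerHTML), true);
"""

def collect_clicks(start_url, seconds=30.0):
    from playwright.sync_api import sync_playwright  # imported lazily

    clicks = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        # expose_binding makes window.reportClick call straight into Python,
        # at the moment of the click, before any navigation can wipe the DOM.
        page.expose_binding("reportClick", lambda source, html: clicks.append(html))
        page.add_init_script(INIT_JS)
        page.goto(start_url)
        page.wait_for_timeout(seconds * 1000)  # click around in the window
        browser.close()
    return clicks

# Usage: clicks = collect_clicks("any url you wish")
```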

How to load infinite scroll pages faster?

https://immutascan.io/address/0xac98d8d1bb27a94e79fbf49198210240688bb1ed
This URL has 100k+ rows that I'm trying to scrape. They go back about a month (to 1/10/2022, I believe) but load in batches of 7-8.
Right now I have a macro slowly scrolling down the page, which is working, but takes about 8-10 hours per day's worth of rows.
As of now, when new rows load there are 2-3 items that load immediately and then a few that load over time. I don't need the parts that load slowly and would like them to load faster or not at all.
Is there a way that I can prevent elements from loading to speed up the loading of additional rows?
I'm using an autohotkey script that scrolls down with the mouse-wheel and that's been working best.
I've also tried a Chrome extension but that was slower.
I found a python script at one point but it wasn't any faster than autohotkey.
Answer: Immutable X has an API so I'm using that instead of this site that does the same thing. Here's the working code:
import time

import pandas as pd
import requests

URL = "https://api.x.immutable.com/v1/orders"
bg_output = []
params = {'status': 'filled',
          'sell_token_address': '0xac98d8d1bb27a94e79fbf49198210240688bb1ed'}

with requests.Session() as session:
    while True:
        (r := session.get(URL, params=params)).raise_for_status()
        data = r.json()
        for value in data['result']:
            orderID = value['order_id']
            info = value["sell"]["data"]["properties"]["name"]
            wei = value["buy"]["data"]["quantity"]
            decimals = value["buy"]["data"]["decimals"]
            spacer = "."
            eth = float(wei[decimals:] + spacer + wei[:decimals])
            print(f'Count={len(bg_output)}, Order ID={orderID}, Info={info}, Eth={eth}')
            bg_output.append(f'Count={len(bg_output)}, Order ID={orderID}, Info={info}, Eth={eth}')
        timestr = time.strftime("%Y%m%d")
        pd.DataFrame(bg_output).to_csv('bg_output' + timestr + '.csv')  # checkpoint after each page
        time.sleep(1)
        if (cursor := data.get('cursor')):
            params['cursor'] = cursor
        else:
            print(bg_output)
            break

print(bg_output)
print("END")
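As an aside, the string-slicing conversion of the wei quantity is easy to get wrong when the quantity has fewer digits than `decimals`; `Decimal` handles both cases exactly (a suggested alternative, not part of the original code):

```python
from decimal import Decimal

def to_eth(quantity: str, decimals: int) -> Decimal:
    """Scale an integer token-quantity string by 10**-decimals."""
    return Decimal(quantity) / (Decimal(10) ** decimals)

print(to_eth("1500000000000000000", 18))  # 1.5
print(to_eth("5000", 18))                 # equal to Decimal("0.000000000000005")
```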
Have you considered using their API directly? When you scroll the page, have a look at your browser’s dev tools “network” tab. There you can see the actual call to their API. Look at all POST requests to the URL
https://3vkyshzozjep5ciwsh2fvgdxwy.appsync-api.us-west-2.amazonaws.com/graphql
Try adapting these API calls so that you can get the data right via their GraphQL-API and without having to scroll the actual page.
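A minimal sketch of calling such a GraphQL endpoint with only the standard library. The query body below is a placeholder; you would copy the real query, variables, and any required headers (e.g. an API key) from the request visible in the Network tab:

```python
import json
import urllib.request

GRAPHQL_URL = ("https://3vkyshzozjep5ciwsh2fvgdxwy"
               ".appsync-api.us-west-2.amazonaws.com/graphql")

def build_graphql_request(query: str, variables: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a GraphQL payload."""
    payload = json.dumps({"query": query, "variables": variables}).encode()
    return urllib.request.Request(
        GRAPHQL_URL, data=payload,
        headers={"Content-Type": "application/json"}, method="POST")

# Placeholder query: replace with the one copied from the browser's dev tools.
req = build_graphql_request(
    "query { ... }",
    {"address": "0xac98d8d1bb27a94e79fbf49198210240688bb1ed"})
# urllib.request.urlopen(req) would send it; parse the JSON response as usual.
```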

Selenium is returning an old state of PageSource and is not updating after Javascript Execution

I have a console program in C# with Selenium controlling a Chrome Browser Instance and I want to get all Links from a page.
But after the page has loaded in Selenium, the PageSource from Selenium is different from the HTML of the website I have navigated to. The content of the page is loaded asynchronously by JavaScript, and the HTML is changed.
Even if I load the HTML of the website as follows, it is still different from the one inside the Selenium-controlled browser window:
var html = ((IJavaScriptExecutor)driver).ExecuteScript("return document.getElementsByTagName('html')[0].outerHTML").ToString();
But why is the PageSource or the HTML returned by my JS still the same as it was when Selenium loaded the page?
EDIT:
As @BinaryBob pointed out, I have now implemented a wait function that waits for a desired element's specific attribute to become non-empty. The code looks like this:
private static void AttributeIsNotEmpty(IWebDriver driver, By locator, string attribute, int secondsToWait = 60)
{
    new WebDriverWait(driver, new TimeSpan(0, 0, secondsToWait)).Until(d => IsAttributeEmpty(d, locator, attribute));
}

private static bool IsAttributeEmpty(IWebDriver driver, By locator, string attribute)
{
    Console.WriteLine("Output: " + driver.FindElement(locator).GetAttribute(attribute));
    return !string.IsNullOrEmpty(driver.FindElement(locator).GetAttribute(attribute));
}
And the function call looks like this:
AttributeIsNotEmpty(driver, By.XPath("/html/body/div[2]/c-wiz/div[4]/div[1]/div/div/div/div/div[1]/div[1]/div[1]/a[1]"), "href");
But the condition is never met and a timeout exception is thrown. Yet inside the Chrome browser (which is controlled by Selenium), the condition is met and the element has a populated href attribute.
I'm taking a stab at this. Are you calling wait.Until(ExpectedConditions...) somewhere in your code? If not, that might be the issue. Just because a FindElement method has returned does not mean the page has finished rendering.
For a quick example, this code comes from the Selenium docs site. Take note of the creation of a WebDriverWait object (line 1), and the use of it in the firstResult assignment (line 4)
WebDriverWait wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
driver.Navigate().GoToUrl("https://www.google.com/ncr");
driver.FindElement(By.Name("q")).SendKeys("cheese" + Keys.Enter);
IWebElement firstResult = wait.Until(ExpectedConditions.ElementExists(By.CssSelector("h3>div")));
Console.WriteLine(firstResult.GetAttribute("textContent"));
If this is indeed the problem, you may need to read up on the various ways to use ExpectedConditions. I'd start here: Selenium Documentation: WebDriver Waits

Apify web scraper task not stable. Getting different results between runs minutes apart

I'm building a very simple scraper to get the 'now playing' info from an online radio station I like to listen to.
It's stored in a simple p element on their site:
Now, using the standard apify/web-scraper, I run into a strange issue: the scraping sometimes works, but sometimes doesn't, with this code:
async function pageFunction(context) {
    const { request, log, jQuery } = context;
    const $ = jQuery;
    const nowPlaying = $('p.js-playing-now').text();
    return {
        nowPlaying
    };
}
If the scraper works I get this result:
[{"nowPlaying": "Hangover Hotline - hosted by Lamebrane"}]
But if it doesn't I get this:
[{"nowPlaying": ""}]
And there is only a 5 minute difference between the two scrapes. The website doesn't change, the data is always presented in the same way. I tried checking all the boxes to circumvent security and different mixes of options (Use Chrome, Use Stealth, Ignore SSL errors, Ignore CORS and CSP) but that doesn't seem to fix it unfortunately.
Any suggestions on how I can get this scraping task to constantly return the data I need?
It would be great if you could attach the URL; it would help me find the problem.
With the information you provided, I guess that the data you want is loaded asynchronously. You can use the context.waitFor() function.
async function pageFunction(context) {
    const { request, log, jQuery } = context;
    const $ = jQuery;
    await context.waitFor(() => !!$('p.js-playing-now').text());
    const nowPlaying = $('p.js-playing-now').text();
    return {
        nowPlaying
    };
}
You can pass a function to wait for, and it will wait until the result of that function is true. You can check the docs.

IPython Notebook Javascript: retrieve content from JavaScript variables

Is there a way for a function (called by an IPython Notebook cell) to retrieve the content of a JavaScript variable (for example IPython.notebook.notebook_path which contains the path of the current notebook)?
The following works well when written directly within a cell (for example, based on this question and its comments):
from IPython.display import display,Javascript
Javascript('IPython.notebook.kernel.execute("mypath = " + "\'"+IPython.notebook.notebook_path+"\'");')
But that falls apart if I try to put it in a function:
# this doesn't work
from IPython.display import display, Javascript

def getname():
    my_js = """
    IPython.notebook.kernel.execute("mypath = " + "\'"+IPython.notebook.notebook_path+"\'");
    """
    Javascript(my_js)
    return mypath
(And yes, I've tried making the mypath variable global, both from within the my_js script and from within the function. Also note: don't be fooled by leftover values in variables from previous commands; to make sure, use mypath = None; del mypath to reset the variable before calling the function, or restart the kernel.)
Another way to formulate the question is: "what's the scope (time and place) of a variable set by IPython.notebook.kernel.execute()"?
I think it isn't an innocuous question, and is probably related to the mechanism that IPython uses to control its kernels and their variables and that I don't know much about. The following experiment illustrate some aspect of that mechanism. The following works when done in two separate cells, but doesn't work if the two cells are merged:
Cell [1]:
my_out = None
del my_out
my_js = """
IPython.notebook.kernel.execute("my_out = 'hello world'");
"""
Javascript(my_js)
Cell [2]:
print(my_out)
This works and produces the expected hello world. But if you merge the two cells, it doesn't work (NameError: name 'my_out' is not defined).
I think the problem is related to JavaScript being asynchronous while Python is not. Normally you would expect the Javascript(""" python cmd """) command to be executed, and then your print statement to work as expected. However, the Javascript command is fired but not executed; most probably it is executed only after the execution of cell 1 has fully completed.
I tried your example with a sleep function. It did not help.
The async problem can easily be seen by adding an alert statement within my_js, before the kernel.execute line. The alert should fire even before any Python command execution is attempted.
But in the presence of the print(my_out) statement within cell 1, you will again get the same error without any alerts. If you take the print line out, you will see the alert popping up within cell 1. But the variable my_out is set only afterwards.
my_out = None
del my_out
my_js = """
alert("about to execute python command");
IPython.notebook.kernel.execute("my_out = 'hello world'");
"""
Javascript(my_js)
There are other JavaScript utilities within the notebook, like IPython.display.display_xxx, which vary from displaying video to text objects, but even the text-object option does not work.
Funnily enough, I tested this with my WebGL canvas application, which displays objects on an HTML5 canvas; display.display_javascript(javascript object) works fine (and that is a looong HTML5 document), while two measly words of output do not show up?! Maybe I should embed the output into the canvas application somewhere, so it is displayed on the canvas :)
I wrote a related question (Cannot get Jupyter notebook to access javascript variables) and came up with a hack that does the job. It uses the fact that the input(prompt) command in Python does block the execution loop and waits for user input. So I looked how this is processed on the Javascript side and inserted interception code there.
The interception code is:
import json
from IPython.display import display, Javascript

display(Javascript("""
const CodeCell = window.IPython.CodeCell;
CodeCell.prototype.native_handle_input_request = CodeCell.prototype.native_handle_input_request || CodeCell.prototype._handle_input_request
CodeCell.prototype._handle_input_request = function(msg) {
    try {
        // only apply the hack if the command is valid JSON
        console.log(msg.content.prompt)
        const command = JSON.parse(msg.content.prompt);
        const kernel = IPython.notebook.kernel;
        // evaluate the requested expression in the Javascript domain and
        // send the result back to Python as the input reply
        kernel.send_input_reply(eval(command["eval"]))
    } catch(err) {
        console.log('Not a command', msg, err);
        this.native_handle_input_request(msg);
    }
}
"""))
The interception code checks whether the input prompt is valid JSON, and in that case executes an action depending on the command. Here, it evaluates the command["eval"] JavaScript expression and returns the result.
After running this cell, you can use:
notebook_path = input(json.dumps({"eval":"IPython.notebook.notebook_path"}))
Quite a hack, I must admit.
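With the hack in place, the prompt string is just JSON, so a small helper (my own convenience wrapper, not part of the original answer) keeps the command construction in one place:

```python
import json

def js_eval_command(expression: str) -> str:
    """Encode a JavaScript expression as the JSON 'prompt' the hack intercepts."""
    return json.dumps({"eval": expression})

print(js_eval_command("IPython.notebook.notebook_path"))
# {"eval": "IPython.notebook.notebook_path"}

# In a notebook cell you would then run:
#   notebook_path = input(js_eval_command("IPython.notebook.notebook_path"))
```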
Okay, I found a way around the problem: call a Python function from JavaScript and have it do everything I need, rather than returning the name to "above" and working with that name there.
For context: my colleagues and I have many experimental notebooks; we experiment for a while and try various things (in a machine learning context). At the end of each variation/run, I want to save the notebook, copy it under a name that reflects the time, upload it to S3, strip it from its output and push it to git, log the filename, comments, and result scores into a DB, etc. In short, I want to automatically keep track of all of our experiments.
This is what I have so far. At the bottom of my notebooks, I put:
In [127]: import mymodule.utils.lognote as lognote
lognote.snap()
In [128]: # not to be run in the same shot as above
lognote.last
Out[128]: {'file': '/data/notebook-snapshots/2015/06/18/20150618-004408-save-note-exp.ipynb',
'time': datetime.datetime(2015, 6, 18, 0, 44, 8, 419907)}
And in a separate file, e.g. mymodule/utils/lognote.py:
# (...)
from datetime import datetime
from subprocess import call
from os.path import basename, join
from IPython.display import display, Javascript

# TODO: find out where the topdir really is instead of hardcoding it
_notebook_dir = '/data/notebook'
_snapshot_dir = '/data/notebook-snapshots'

def jss():
    return """
    IPython.notebook.save_notebook();
    IPython.notebook.kernel.execute("import mymodule.utils.lognote as lognote");
    IPython.notebook.kernel.execute("lognote._snap('" + IPython.notebook.notebook_path + "')");
    """

def js():
    return Javascript(jss())

def _snap(x):
    global last
    snaptime = datetime.now()
    src = join(_notebook_dir, x)
    dstdir = join(_snapshot_dir, '{}'.format(snaptime.strftime("%Y/%m/%d")))
    dstfile = join(dstdir, '{}-{}'.format(snaptime.strftime("%Y%m%d-%H%M%S"), basename(x)))
    call(["mkdir", "-p", dstdir])
    call(["cp", src, dstfile])
    last = {
        'time': snaptime,
        'file': dstfile
    }

def snap():
    display(js())
To add to the other great answers: there is a nuance where browsers attempt to run the Jupyter notebook JavaScript magic when the notebook is loaded.
To demonstrate: create and run the following cell:
%%javascript
IPython.notebook.kernel.execute('1')
Now save the notebook, close it and then re-open it. When you do that, under that cell suddenly you will see an error in red:
Javascript error adding output!
TypeError: Cannot read property 'execute' of null
See your browser Javascript console for more details.
That means the browser has parsed some JS code and tried to run it. This is the error in Chrome; it will probably be different in another browser.
I have no idea why this Jupyter JavaScript magic cell is run on load, or why Jupyter notebook is not properly escaping things, but the browser sees some JS code, runs it, and it fails, because the notebook kernel doesn't exist yet!
So you must add a check that the object exists:
%%javascript
if (IPython.notebook.kernel) {
    IPython.notebook.kernel.execute('1')
}
and now there is no problem on load.
In my case, I needed to save the notebook and run an external script on it, so I ended up using this code:
from IPython.display import display, Javascript

def nb_auto_export():
    display(Javascript("if (IPython.notebook) { IPython.notebook.save_notebook() }; if (IPython.notebook.kernel) { IPython.notebook.kernel.execute('!./notebook2script.py ' + IPython.notebook.notebook_name )}"))
and in the last cell of the notebook:
nb_auto_export()
