How to load infinite scroll pages faster? - javascript

https://immutascan.io/address/0xac98d8d1bb27a94e79fbf49198210240688bb1ed
This URL has 100k+ rows that I'm trying to scrape. They go back about a month (to 1/10/2022, I believe) but load in batches of 7-8.
Right now I have a macro slowly scrolling down the page, which is working, but takes about 8-10 hours per day's worth of rows.
As of now, when new rows load there are 2-3 items that load immediately and then a few that load over time. I don't need the parts that load slowly and would like them to load faster or not at all.
Is there a way that I can prevent elements from loading to speed up the loading of additional rows?
I'm using an autohotkey script that scrolls down with the mouse-wheel and that's been working best.
I've also tried a Chrome extension but that was slower.
I found a python script at one point but it wasn't any faster than autohotkey.
Answer: Immutable X has an API that exposes the same data, so I'm using that instead of scraping the site. Here's the working code:
import requests
import time
import pandas as pd

URL = "https://api.x.immutable.com/v1/orders"
bg_output = []
params = {'status': 'filled',
          'sell_token_address': '0xac98d8d1bb27a94e79fbf49198210240688bb1ed'}

with requests.Session() as session:
    while True:
        (r := session.get(URL, params=params)).raise_for_status()
        data = r.json()
        for value in data['result']:
            orderID = value['order_id']
            info = value["sell"]["data"]["properties"]["name"]
            wei = value["buy"]["data"]["quantity"]
            decimals = value["buy"]["data"]["decimals"]
            # convert the base-unit quantity to ETH using the token's decimals
            eth = int(wei) / 10 ** decimals
            print(f'Count={len(bg_output)}, Order ID={orderID}, Info={info}, Eth={eth}')
            bg_output.append(f'Count={len(bg_output)}, Order ID={orderID}, Info={info}, Eth={eth}')
        # write progress to disk after every page of results
        timestr = time.strftime("%Y%m%d")
        pd.DataFrame(bg_output).to_csv('bg_output' + timestr + '.csv')
        time.sleep(1)  # be gentle with the API
        # follow the pagination cursor until there are no more pages
        if (cursor := data.get('cursor')):
            params['cursor'] = cursor
        else:
            break

print(bg_output)
print("END")

Have you considered using their API directly? When you scroll the page, have a look at your browser's dev tools "Network" tab. There you can see the actual calls to their API. Look at all the POST requests to the URL
https://3vkyshzozjep5ciwsh2fvgdxwy.appsync-api.us-west-2.amazonaws.com/graphql
Try adapting these API calls so that you can get the data directly via their GraphQL API, without having to scroll the actual page.
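For example, here is a minimal sketch of replaying such a call with requests; the query string, variables, and headers below are placeholders, not the real ones — copy the actual payload and any auth headers from the POST request shown in the Network tab:

import requests

GRAPHQL_URL = ("https://3vkyshzozjep5ciwsh2fvgdxwy.appsync-api"
               ".us-west-2.amazonaws.com/graphql")

# Placeholder payload: paste the real "query" and "variables" (and any auth
# headers the site sends, e.g. an API key) from the Network tab; they are
# not publicly documented and may change.
payload = {
    "query": "<paste the GraphQL query from the Network tab here>",
    "variables": {"address": "0xac98d8d1bb27a94e79fbf49198210240688bb1ed"},
}
headers = {"Content-Type": "application/json"}

response = requests.post(GRAPHQL_URL, json=payload, headers=headers)
response.raise_for_status()
print(response.json())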

Related

Google App Script randomly stop executing

I have a script that basically takes info from a website for multiple users and puts that info into a Google spreadsheet, with one sheet per user.
I have a function that removes the values of the first line, resizes every column, and then puts the values back with setValues:
function adjustColumnsAndIgnoreFirstLine(sheet) {
  Logger.log('--- Adjust columns ---');
  const range = sheet.getRange("1:1");
  // save the title line
  const datas = range.getValues();
  // clear it
  range.clearContent();
  // format without the title line
  var lastColumn = sheet.getLastColumn();
  sheet.autoResizeColumns(1, lastColumn);
  // set width to a minimum
  for (var i = 1; i < 37; i++) { // fixed number of columns
    if (sheet.getColumnWidth(i) < 30) {
      sheet.setColumnWidth(i, 30);
    }
  }
  // put back the titles
  range.setValues(datas);
}
My problem is that the script stops executing in the middle of the function. I still get the "execution, please wait" popup, but in the logs the script stopped as if there were no error (execution finished).
One thing to note is that the problem doesn't come from the script itself, as I do not encounter it on any of my machines, but my client does. My client ran the script in different browsers (Chrome and Edge) and had the same problem, but it stops at different users (sometimes at the second-to-last user, sometimes at the third-to-last...).
So I'm kind of lost on this problem...
The problem is actually a timeout: Google Apps Script limits the execution time of a script to about 6 minutes.
There are existing issues about this.

Scraping dynamic content from website in near-realtime

I’m trying to implement a web scraper scraping dynamically updated content from a website in near-realtime.
Let’s take https://www.timeanddate.com/worldclock/ as an example and assume I want to continuously get the current time at my home location.
My solution right now is as follows: Get the rendered page content every second and extract the time using bs4. Working Code:
import asyncio
import bs4
import pyppeteer

def get_current_time(content):
    soup = bs4.BeautifulSoup(content, features="lxml")
    clock = soup.find(class_="my-city__digitalClock")
    hour_minutes = clock.contents[3].next_element
    seconds = clock.contents[5].next_element
    return hour_minutes + ":" + seconds

async def main():
    browser = await pyppeteer.launch()
    page = await browser.newPage()
    await page.goto("https://www.timeanddate.com/worldclock/")
    for _ in range(30):
        content = await page.content()
        print(get_current_time(content))
        await asyncio.sleep(1)
    await browser.close()

asyncio.run(main())
What I would like to do instead is react only when the time is updated on the page. Reasons: faster reaction and less computational load (especially when monitoring multiple pages that may update at irregular intervals, smaller or much larger than a second).
I came up with / tried the following three ideas for solving this, but I don't know how to continue. There might also be a much simpler / more elegant approach:
1) Intercepting network responses using pyppeteer
This does not seem to work, since there is no more network activity after the page initially loads (apart from advertising), as I can see in the Network tab in Chrome Dev Tools.
2) Reacting to custom events on the page
Using the “Event Listener Breakpoints” in the “Sources” tab in Chrome Dev Tools, I can stop the JavaScript code execution on various events (e.g. the “Set innerHTML” event).
Is it possible to do something like this using pyppeteer and have it provide some context information about the event (e.g. which element was updated with which new text)?
It seems to be possible using JavaScript and puppeteer (see https://github.com/puppeteer/puppeteer/blob/main/examples/custom-event.js), but I think pyppeteer does not provide this functionality (I could not find it in the API Reference).
3) Overriding a function in the JavaScript code of the page
Override a relevant function and intercept the relevant data (which are provided to that function as a parameter).
This idea is inspired by this blogpost: https://antoinevastel.com/javascript/2019/06/10/monitor-js-execution.html
Entire code for the blogpost: https://github.com/antoinevastel/blog-post-monitor-js/blob/master/monitorExecution.js
I tried around a bit, but my JavaScript knowledge seems too limited to even override a function in one of the scripts used by the page.
You could achieve this with Selenium. I am using the Chrome webdriver via webdriver-manager but you can modify this to use whatever you prefer.
First, all of our imports
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.remote.webelement import WebElement
from webdriver_manager.chrome import ChromeDriverManager
Create our driver object with the headless parameter so that the browser window doesn't open.
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)
Define a function that accepts a WebElement to extract the clock time.
def getTimeString(myClock: WebElement) -> str:
    hourMinute = myClock.find_element(By.XPATH, "span[position()=2]").text
    seconds = myClock.find_element(By.CLASS_NAME, "my-city__seconds").text
    return f"{hourMinute}:{seconds}"
Get the page and extract the clock WebElement
driver.get("https://www.timeanddate.com/worldclock/")
myClock = driver.find_element(By.CLASS_NAME, "my-city__digitalClock")
Finally, implement our loop
last = None
while True:
    now = getTimeString(myClock)
    if now == last:
        continue
    print(now)
    last = now
Before your logic concludes, be sure to run driver.quit() to clean up.
Output
05:27:56
05:27:57
05:27:58
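If you would rather stay with pyppeteer, idea 2 from the question may also be workable: pyppeteer mirrors puppeteer's page.exposeFunction, so you can attach a MutationObserver to the clock element and have the page call back into Python only when the text actually changes. A rough, untested sketch (the selector and observer options are assumptions based on the question's code, not verified against the live page):

import asyncio
import pyppeteer

async def main():
    browser = await pyppeteer.launch()
    page = await browser.newPage()
    await page.goto("https://www.timeanddate.com/worldclock/")

    # Called from inside the page whenever the observed element changes.
    def on_clock_change(text):
        print(text)

    await page.exposeFunction("onClockChange", on_clock_change)

    # Attach a MutationObserver to the clock element and forward each
    # change to the exposed Python callback.
    await page.evaluate("""() => {
        const clock = document.querySelector('.my-city__digitalClock');
        const observer = new MutationObserver(() => {
            window.onClockChange(clock.textContent);
        });
        observer.observe(clock, {childList: true, characterData: true, subtree: true});
    }""")

    await asyncio.sleep(30)  # keep the browser open while updates arrive
    await browser.close()

asyncio.run(main())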

How to scrape Instagram using Python using Selenium after Instagram changed their API process? I'm unable to find all the entries, can only find 12

I'm trying to scrape Instagram using Python and Selenium. Goal is to get the url of all the posts, number of comments, number of likes, etc.
I was able to scrape some data, but for some reason the page doesn't show more than the 12 latest entries, and I'm unable to figure out a way to show the rest. I've even tried scrolling down and then reading the page, but it only gives 12. I checked the source and am unable to find how to get the remaining entries. It looks like the 12 entries are embedded in a script tag and I don't see them anywhere else.
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import json
import time

driver = webdriver.Chrome('chromedriver.exe')
driver.get('https://www.instagram.com/fazeapparel/?hl=en')
source = driver.page_source
data = bs(source, 'html.parser')
body = data.find('body')
script = body.find('script', text=lambda t: t.startswith('window._sharedData'))
page_json = script.text.split(' = ', 1)[1].rstrip(';')
data = json.loads(page_json)
Using the data retrieved, I was able to find the information and collect it:
for each in data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges']:
    link = 'https://www.instagram.com' + '/p/' + each['node']['shortcode'] + '/'
    posttext = each['node']['edge_media_to_caption']['edges'][0]['node']['text'].replace('\n', '')
    comments = each['node']['edge_media_to_comment']['count']
    likes = each['node']['edge_liked_by']['count']
    postimage = each['node']['thumbnail_src']
    isvideo = each['node']['is_video']
    postdate = time.strftime('%Y %b %d %H:%M:%S', time.localtime(each['node']['taken_at_timestamp']))
    links.append([link, posttext, comments, likes, postimage, isvideo, postdate])
I've even created a scroll function to scroll the window and then scrape the data, but it still only returns 12.
Is there any way I can get more than 12 entries? This account has 46 entries and I'm unable to find them anywhere in the code. Please help!
Edit: I think the data is embedded within React so it's not showing all the posts
Have you added using OpenQA.Selenium.Support.UI? It has a WebDriverWait and you can wait for the element to be visible. Sorry for doing this in C#.
boxes should return all the posts.
Again, I know it isn't Python, but I hope it helps.
IWebDriver driver = new ChromeDriver("C:\\Users\\admin\\downloads", options);
WebDriverWait wait = new WebDriverWait(driver, time);
driver.Navigate().GoToUrl("https://www.instagram.com/cnn");
IWebElement mainDocument = wait.Until(SeleniumExtras.WaitHelpers.ExpectedConditions.ElementExists(By.TagName("body")));
IWebElement element = mainDocument.FindElement(By.CssSelector("#react-root > section > main > div > div._2z6nI > article > div > div"));
IList<IWebElement> boxes = element.FindElements(By.TagName("div"));
foreach (var posts in boxes)
{
    // do stuff here
}
EDIT:
The page makes an AJAX call in the background to load the next posts when you scroll. One approach might be to run a script that scrolls down and to call that script from Selenium. I would add logic with a timer that waits while the script runs and checks whether it returns "STOP". Any kind of thread sleep blocks the thread, so I would use some sort of timer to call the method that runs the script.
function scrollDown() {
    // once this bottom element disappears we have found all the posts
    var bottom = document.querySelector('._4emnV');
    if (bottom != null) {
        window.scroll(0, 999999);
    } else {
        return "STOP";
    }
}
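From the Python side (to match the question), that script could be driven with Selenium's execute_script in a loop, roughly as sketched below. The driver object comes from the question's code, the ._4emnV selector comes from the answer above and may well have changed since, and the 2-second pause is an arbitrary guess:

import time

SCROLL_SCRIPT = """
var bottom = document.querySelector('._4emnV');
if (bottom != null) {
    window.scroll(0, 999999);
    return "CONTINUE";
}
return "STOP";
"""

# Keep scrolling until the loading element disappears, then parse the page.
while True:
    if driver.execute_script(SCROLL_SCRIPT) == "STOP":
        break
    time.sleep(2)  # give the AJAX call time to append the next batch of posts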

casperjs: trouble scraping an item viewed number from a page

The page I want to scrape is http://v.qq.com/page/k/9/2/k0188qdxy92.html. It is hosted in China, so it takes some time to load. The data I want from this page is just the played count of the video, located to the lower right of the player; its selector is .played_count em, as used in the code below.
When you open the page, you will notice that this number displays much later than the other parts of the page.
var time1 = Date.now();
var time2;
var casper = require('casper').create();
var url = 'http://v.qq.com/page/k/9/2/k0188qdxy92.html';

casper.start(url, function() {
    time2 = Date.now();
    console.log((time2 - time1) / 1000);
    this.echo(this.fetchText('.played_count em'));
});

casper.run();
This is what I tried at first. Yesterday it worked, but today it just prints a blank line every time and returns to the shell. I think it is probably because the number is requested asynchronously and the network is slow, so I added a wait to the script:
var time1 = Date.now();
var time2;
var casper = require('casper').create();
var url = 'http://v.qq.com/page/k/9/2/k0188qdxy92.html';

casper.start(url);

casper.wait('6000', function() {
    time2 = Date.now();
    console.log((time2 - time1) / 1000);
    this.echo(this.fetchText('.played_count em'));
});

casper.run();
Although the page is slow to open, 60 s is absolutely enough. However, only 1 out of 4 attempts gets the right number, which means my code is working but something else is preventing me from always getting the right data. What is it? Could it be the network, or some script on the page?
I also tried using waitForSelector and waitFor, but every time I get an error message like "waitTimeout expired, exiting", even though I set the waitTimeout option to 30000 or 60000. I am really stuck here. Although I am new to CasperJS, I've successfully scraped similar data from other video sites' pages, so what is so special about this one?

External Ajax: All at Once, As Needed, or Something Else?

I created a Magic: The Gathering site for my friends and me to use. On this site, we upload our decks of cards, and on the page where you can view all the cards in a deck, each card is a link to that card on http://gatherer.wizards.com/. For ease of use, though, I made it so that when you hover over any of the card names, the card image gets Ajax'd in from Gatherer, letting you see the card without having to click the link.
The question is: should I load all of the ~40 or so card images all at once when the page loads, or should I continuously load the images as they are hovered over, or is there some other way I should be doing it?
As it stands, I load each card as it is hovered over. My concern is that, as people mouse up and down the list, that is a LOT of requests to Gatherer. It would probably save requests to load them all up at the start, but I'm not sure if Gatherer would be upset with me for a sudden flurry of requests every time someone loads one of the decks on my site.
A solution I thought of was to load cards as they are hovered over, but save the image in a hidden container and just reload it when they mouse over it AGAIN. Thus if they load the page and don't look at anything, no needless requests were sent, but if they stay on the page for 30 minutes looking at every card over and over again, we don't inundate Gatherer with requests.
I just don't know if the method I'm using is wasteful - from a bandwidth standpoint for me or for gatherer, or from any other standpoint that I'm not familiar with. Are there any golden rules of external Ajax that I should know, for instance?
Here is the method I'm currently using; I assume it is probably the worst implementation possible, but it was a proof of concept:
$(document).ready(function(){
    var container = $('#cardImageHolder');

    $('.bumpin a').mouseenter(function(){
        doAjax($(this).attr('href'));
        return false;
    });

    function doAjax(url){
        // if it is an external URI
        if(url.match('^http')){
            // call YQL
            $.getJSON("http://query.yahooapis.com/v1/public/yql?"+
                "q=select%20*%20from%20html%20where%20url%3D%22"+
                encodeURIComponent(url)+
                "%22&format=xml'&callback=?",
                // this function gets the data from the successful
                // JSON-P call
                function(data){
                    // if there is data, filter it and render it out
                    if(data.results[0]){
                        var data = filterData(data.results[0]);
                        var src = $(data).find('.leftCol img').first().attr('src');
                        var fixedImageSrc = src.replace("../../", "http://gatherer.wizards.com/");
                        var image = $(data).find('.leftCol img').first().attr('src', fixedImageSrc);
                        container.html(image);
                    // otherwise tell the world that something went wrong
                    } else {
                        var errormsg = "<p>Error: can't load the page.</p>";
                        container.html(errormsg);
                    }
                }
            );
        // if it is not an external URI, use Ajax load()
        } else {
            $('#target').load(url);
        }
    }

    // filter out some nasties
    function filterData(data){
        data = data.replace(/<?\/body[^>]*>/g,'');
        data = data.replace(/[\r|\n]+/g,'');
        data = data.replace(/<--[\S\s]*?-->/g,'');
        data = data.replace(/<noscript[^>]*>[\S\s]*?<\/noscript>/g,'');
        data = data.replace(/<script[^>]*>[\S\s]*?<\/script>/g,'');
        data = data.replace(/<script.*\/>/,'');
        return data;
    }
});
No, there are no Golden Rules of Ajax. Loading 40 images up front would minimize load time upon hover, but would greatly increase how much bandwidth is used when the page is first loaded.
You will always have these types of balance questions. It's up to you to decide what is best, and tweak it based on empirical data.
"A solution I thought of was to load cards as they are hovered over,
but save the image in a hidden container and just reload it when they
mouse over it AGAIN. Thus if they load the page and don't look at
anything, no needless requests were sent, but if they stay on the page
for 30 minutes looking at every card over and over again, we don't
inundate Gatherer with requests."
This sounds reasonable.
If I were you, though, I would load every picture when the user first loads the page. Let the browser cache the images and you don't have to worry about it. Plus, this is likely the easiest method. Don't overcomplicate things when you don't have to :)
