How do I scrape the dynamic URL of a page? - javascript

I am trying to do some website testing through Selenium and Python. I filled in the search form on http://www.flightcentre.co.nz/ and submitted it. The search results now take me to a new page with the URL https://secure.flightcentre.co.nz/eyWD/results . How will my web driver handle this? I am doing this for the first time. Could anyone help me with an example or point me to the right kind of tutorial for this?
Thanks.

Ok, since I tried to answer your other question I'll give this one a go as well, although you are not explaining exactly what you want.
One thing to remember is that Selenium is driving a real browser, not a traditional web scraper. That means a URL change is not a big deal; the only time you have to change your approach in Selenium is when you get a popup.
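For example, something like this minimal sketch (using Firefox here; the form-filling steps are left out, and the printed URL is just what you would expect to see, not something taken from the site):

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://www.flightcentre.co.nz/")
# ... fill in the origin/destination fields and submit the form ...
# after the submission the browser simply moves on to the results page,
# and the same driver object keeps working against it
print(driver.current_url)  # e.g. https://secure.flightcentre.co.nz/eyWD/results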
One thing you can do, carrying over from your other code, is set an implicit wait when looking for a flight:
driver.implicitly_wait(40)  # 40 is the number of seconds
This tells the driver to wait up to 40 seconds for elements to appear before raising an exception, so whatever you do next only runs once the page has finished loading and the element you need is present in the DOM.
Now, if you are trying to scrape all of the flight data that comes up, that will be a bit trickier. You could loop over every matching element on the page and write it to a CSV file:
import csv

class_for_departure_flight = driver.find_elements_by_xpath('//div[@class="iata"]')
for flights in class_for_departure_flight:
    try:
        with open('my_flights.csv', 'a', newline='') as flights_book:
            csv_writer = csv.writer(flights_book, delimiter=',')
            csv_writer.writerow([flights.text])  # writerow expects a sequence
    except Exception:
        print("Missed a flight")
The thing to notice in this second part is that I am using Python's csv library to write rows of data. Note that you can bundle several values together and write them as one row, like:
data = (flights, dates, times)
csv_writer.writerow(data)
and it will write all of those different things on the same row of the spreadsheet.
The other two big things that are easily missed are:
class_for_departure_flight = driver.find_elements_by_xpath('//div[@class="iata"]')
That is driver.find_elements_by_xpath; notice that elements is plural, which means it looks for every element matching that XPath and returns them in a list, so you can iterate over them in a for loop.
The next part is csv_writer.writerow([flights.text]): when you iterate over your flights, you need each element's text, which you get with flights.text. If you only needed a single element, you could do something like this instead:
class_for_departure_flight = driver.find_element_by_xpath('//div[@class="iata"]').text
Hopefully this helps!

This is a good place to start: http://selenium-python.readthedocs.org/getting-started.html
Here are some things about Selenium I've learned the hard way:
1) When the DOM refreshes, you lose your references to page objects (i.e. after something like element = driver.find_element_by_id("passwd-id"), element becomes stale once the DOM changes).
2) Test shallow: each test case should do only one assert/validation of page state, maybe two. This lets you take screenshots when there's a failure and saves you from asking "is it a failure in the test, or in the app?"
3) There is a constant race condition between any JavaScript on the page and Selenium. Use Explicit Waits to block Selenium while the JavaScript works to refresh the DOM; a short sketch follows below.
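For example, a minimal explicit-wait sketch (the element ID "results" is just a placeholder, not taken from any particular site):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# block for up to 10 seconds until the (hypothetical) results container is in the DOM
results = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "results"))
)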
To be clear, this is my own experience with Selenium, so it won't be universal.
Good luck! Hope this is helpful.

Related

How to load HTML document including js.download elements with Python

I am trying to collect data from a web page displaying search results about cars on sale.
The structure of the online document is not too complex, and I was able to single out the interesting bits of information based on a data-testid attribute that every returned car record possesses.
I can find the different bits of information, like price, registration year, mileage and so on, based on substrings of this attribute.
I use BeautifulSoup to parse the HTML and requests to load the HTML document from the web.
Now, here's the issue. In a way that I cannot predict or find a logic to, the HTML returned by requests.get() is somehow incomplete. On a page of 100 results, which I can see when I inspect the page online (and where I can count 100 data-testid fields with the specific substring for price, 100 for mileage, and so on), the HTML returned by requests.get(), just like the one I obtain with a 'save as' operation on the page itself, only contains a portion of these fields.
Their number is also unpredictable.
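For what it's worth, this is roughly how I count the records in the raw HTML that requests receives (the URL and the "price" substring here are placeholders for the real values):

import requests
from bs4 import BeautifulSoup

URL = "https://example.com/car-search"  # placeholder for the real search URL
soup = BeautifulSoup(requests.get(URL).text, "html.parser")

# count elements whose data-testid contains the substring used for prices
prices = soup.find_all(attrs={"data-testid": lambda v: v and "price" in v})
print(len(prices))  # far fewer than the 100 I can see in the browser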
I started by simply asking why there is this discrepancy between the online and the saved HTML.
So far no full answer, but the hint in the comments was that the page loads content dynamically through JavaScript.
I was happy to find that saving the page to disk, with all its files, somehow produced the full HTML, which I could then parse without further issues.
However, my joy only lasted for that specific search. When I tried a new one, I was suddenly back to square one.
With further investigation I came to my current understanding, which is at the origin of this question: I noticed that when I save the online page as 'Webpage, Complete' (which creates an .html file plus a folder), this combination does contain ALL the records. I can say that because if I go offline and double-click the newly saved .html file, I can see all the records that were online (100 in this case).
However, the HTML file itself only contains a few of them!
My deduction is, therefore, that the rest of the records must be 'hidden' in the folder created at saving time, and I would tend to say they are embedded in those (many) *.js.download files.
My questions are:
Is my assumption correct? Are the other records stored in those files?
If yes, how can I make them 'explicit' when parsing the HTML with BeautifulSoup?
UPDATE 07/05
I've tried to install and use requests_html as suggested in the comments and in this answer.
Its render() method looked promising; however, I'm probably not really understanding the mechanism explained in the requests_html documentation here (the "render JS" part), because even after the following operations (pseudo-code):
from requests_html import HTMLSession
session = HTMLSession()
r = session.get(URL)  # URL is the search results page
r.html.render()
At this point, I was hoping to have 'forced' the site to 'spit out' ALL of the HTML, including those bits which remain somehow hidden and only show up in the live page.
However, dumping r.html.html into a file afterwards still gives back the same old 5 records (why 5 now, when other searches returned 12, or even 60, is a complete mystery to me).
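One thing I still want to try (so treat this as an untested sketch) is passing render() its sleep and scrolldown arguments, in case the missing records are only injected after a delay or after the page has been scrolled:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get(URL)  # same search results URL as above
# give the page time to run its scripts and scroll a few times so that
# lazily loaded records get a chance to be added to the DOM
r.html.render(sleep=2, scrolldown=5)
with open("rendered.html", "w", encoding="utf-8") as f:
    f.write(r.html.html)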

Iterate through an array generated from a Google Sheet | Google Apps Script

I'm new to Google Apps Script and I'm learning at work. What I am trying to do is a script that, based on a Google Sheet, checks whether the subject of an email contains the string present in one of the rows and, if so, saves the attachment to Google Drive. The ID of the destination folder is also on the same sheet.
So the structure would be:
Column A   Column B   Column C
x          EXE2928    folderID29382938
x          EXE823     folderID29383994
x          EX3948     folderID55154988
The script I am already using to save the attachments works, and if I read the information from the sheet row by row, I can send each attachment to the right folder. But this is not optimal, since there are a lot of rows.
What I have tried so far is the following code to get the rows as an array and then iterate over it:
var dataspreadsheet = SpreadsheetApp.openById("SpreadsheetID");
var sheet = dataspreadsheet.getSheetByName("Sheet1");
// Just looking at column B
var data = sheet.getRange(1, 2, sheet.getLastRow()).getValues();
var header = data[0];
data.shift(); // drop the header row
// ...
for (var l = 0; l < data.length; l++) {
  if (filename.indexOf(data[l][0]) !== -1) {
    // Here I still need to get the folder ID from the same google sheet
    folderName = folder1;
  } else {
    folderName = folder2;
  }
}
Could you give me some support or ideas on how to go on with this script?
Thank you in advance!
This answer assumes that you are using GmailApp to retrieve the emails. Also, it can only be considered a rough guideline, since I have not spent the time to test and measure the code.
To keep in mind
Tooling
To optimize code, the first thing you need is tooling.
The most important tool is a way to measure the time that some part of the code takes to run. You can do so with console.time and console.timeEnd (see the console timers reference). These methods let you measure the time that elapses between the two calls.
For parts of the code that don't require Apps Script classes or methods, you can test them locally and measure their performance with any other tool. Just be aware that the results may not translate perfectly to Google Apps Script.
Know what you want to achieve
What is fast enough? This is something you need to know before starting, and it's usually determined by the reason you are optimizing in the first place.
Best practices
Read the official Apps Script best practices guide. It has a lot of advice that holds true almost always.
Know your system
A lot of the time there are constraints on the system that you haven't even considered.
For example: if the string to search for is always at the start of the subject, you may be able to write more specific code that performs better.
Another example: do these threads only ever contain a single email, or several? This can change the code a lot.
Measure everything
Don't assume. Measure. Sometimes things that seem like they should be slower are faster, and this becomes more true the more you optimize. It's highly recommended to get a baseline of the current time and work from there.
Test the simple first
Don't get carried away trying complex code. Sometimes simple code is faster. Sometimes it's not faster, but it's fast enough.
Weird can be good
Try to think outside the box. Sometimes weird solutions are the fastest, even if they reduce readability.
An example would be to generate a regular expression containing all the values and use it to detect whether the subject contains one of them, and which one. This could be slower or faster, but it's worth trying.
const r = /(EXE2928|EXE823|EX3948)/ // generate this dynamically from column B
const m = string.match(r) // string is the email subject being checked
if (m != null) {
  const key = m[1] // key is the value included in the subject
}
Some ideas
Get the minimum data and only once
You only need the mapping between column B (the text to find) and column C (the folder to send to), so get just that data. You will have to iterate over it anyway, so there's no need to transform it, and you can skip the header row:
var mapping = sheet.getRange(`B2:C`).getValues()
Also try to limit the number of Gmail threads that you read.
Organize the emails
I'd try moving the emails into a data structure that is easier to iterate over.
Change when it's executed
I don't know when this code is executed, but changing that could change the execution time.
Changing platform
Google Apps Script may not be the best platform for this. Calling the APIs directly from a client library (there is a Python one) may work better.
References
console (MDN)
Best practices (Google Apps Script guides)

Detect the referral/s of Url/s using JavaScript or PHP from inside a Bookmarklet

Let's think out of the box!
Without any programming skills, how can you tell whether you are on a web page that lists products rather than on the page that shows the details of one specific product?
The bookmarklet is inserted using JavaScript right after the body tag of a website (eBay, Bloomingdale's, Macy's, Toys'R'Us ...).
Now, my story is (programming skills needed now):
I have a bookmarklet, and my main problem is detecting whether I am on a page that lists products or on the page that shows a single product's details.
The best way I could think of to detect whether I am on a product detail page is to detect the referrer(s) of the current URL (maybe all the referrers, the entire click history).
Possible problem: a user adds the URL as a favorite, does not use my bookmarklet, and closes the browser; later the user opens the browser again, clicks the favorite link and uses my bookmarklet. I think I can't detect the referrer in this case; that's OK, not all cases can be covered.
Can I detect the referrer of this link using the cache in that case? (I know there are many browser cache systems involved here.)
how can you say/detect if you are on a web page that lists products, and not on the page that prints specific details of a product
I'd set up Brain.js (a neural net implemented in JavaScript), train it on a (necessarily broad and varied) sample set of DOMs, and then pick a threshold listing-to-details ratio to 'detect' (as nearly as possible) what type of page I'm on.
This will require some trial and error, but it is the best approach I can think of (neural nets can get to "good enough" results pretty quickly; try it, you'll be surprised at the results).
No. You can't check history with a bookmarklet, or with any normal client-side JavaScript. You are correct: the referrer will be empty if the page was loaded from a bookmark.
The bookmarklet can, however, store the referrer in a cookie or in localStorage the first time it is used; the next time it is used, if the referrer is empty, it can check the cookie or localStorage.
That said, your entire approach to this problem seems really odd to me, but I don't have enough details to know whether it is genius or insanity.
If I were trying to determine whether the current page is a list page or a details page, I'd either inspect the URL for common patterns or inspect the content of the page for common patterns.
Example of common URL patterns: many list pages are search results, so the query string will have keys like "search=", "q=", "keywords=", etc.
Example of page content patterns: a product page will have only one "buy" or "add to cart" button (or whatever it's called), while a list page will either have no such button or have many.
Why don't you use the URL? Then you can do something like http://www.le.url.com?pageid=10&type=DS and the code would be something like this:
<?php
if (isset($_GET['type']) && $_GET['type'] == 'DS') {
    // Do stuff related to Details Show
} else {
    // Show all the products
}
?>
And you can make the URL look like this with an .htaccess file:
http://www.le.url.com/10/DS
I would say your goal should first be to make it work for some websites, then for many websites, and eventually for all websites.
A) Try hand-coding the main sites like Amazon, eBay, etc. Have a target in mind.
B) Something more creative might be to keep a list of all currency symbols and then detect whether a page has, say, 10 of them scattered around. For instance, the $ symbol is found all over Amazon, but only when there are, say, 20 per page can you really say it is a product listing (this is a bad example; Amazon's pages are fairly crazy). Perhaps currency symbols alone won't work, but I think you can generalize something similar: lots of currency symbols plus detection of a "grid" layout with things lined up in rows. You'll get lots of garbage, so you'll need good filtering, and some data analysis once you have something working algorithmically.
C) I think after B) you'll realize that your system might be better with parts of A). In other words, you are going to want to customize heavily for certain popular websites (or more niche ones, for that matter). This should help fill the gap for sites that don't follow any known model.
As far as tracking where the user came from, why not use a tracking-cookie type of concept? You could use IndexedDB or localStorage or whatever; in other words, always keep a reference to the last page by saving it on the current page. You could also keep a stack and push the URL onto it on every page. If you want to persist it for some reason, send that data back to your server.
Detecting favorite clicks could involve detecting all AJAX traffic and analyzing it (although this might be hard). You should first do a survey to see what those calls typically look like; I'd imagine something like amazon.com/favorite/product_id would be fairly common. You could also try to detect the selector for the "favorite" button on the page and add an onclick handler to detect when it is clicked.
I tried to address each problem you mentioned, though I don't think I understand exactly what you are trying to do.

How to test performance of website parts?

I've been given a job testing a website that is struggling with its performance. In detail, I should pick out different parts of the document and check their waiting -> load -> finished states. Since I'm familiar with Firebug, I've tested many sites as a whole. But now I need to know when the rendering of a particular DIV starts, when it finishes, and how long it waited beforehand. The goal is to find out which parts of the website take how long until they are painted.
I doubt you'll be able to measure individual parts of a page the way you want. I would approach this by removing parts of the page, measuring the subsetted page, and inferring from those measurements which parts are slowest.
Keep in mind that this sort of logic may not be correct. For example, you may have a page with two parts. You can measure the two parts independently by creating subsetted pages, but the times of the two parts added together will not equal the time for the whole page, and one part seeming slower than the other doesn't mean that, when combined, the "slow" part is responsible for the bulk of the time. Browsers are very complicated machines, and they don't always operate the way you imagine.
AFAIK, the speed of painting a single div is not something you should worry about. If there is a server-side language involved, I would suggest assigning the current time to a variable before a portion starts and comparing it to the time right after the portion ends; subtracting the two gives you the time it took to work that portion out.
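For example, if the server side happened to be Python (just an assumption for this sketch), the idea would look roughly like this:

import time

start = time.perf_counter()
# ... build or render the portion of the page you want to time ...
elapsed = time.perf_counter() - start
print("portion took %.3f seconds" % elapsed)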
If there is JavaScript involved, I would suggest Chrome dev tools' Timeline panel. It shows everything, from CSS recalculation and painting of the style/div to AJAX and (if you use them) DB queries.
As you are familiar with Firebug, you can use the HttpWatch tool to record the exact request and response times of all the specific HTTP requests made by your browser.
So when the rendering of a particular DIV starts, this tool will capture the request and response times for it.
http://www.httpwatch.com/
All the best!

How to change user A's HTML page when user B does something

Say we have users A and B who visit the same URL containing a button. When A clicks the button, I want something on B's page to change immediately while B is on it, e.g. a text to be added. I want this to happen with a delay of less than 150 ms.
Is this realistic? Could you give me hints as to what I should search for, or toy examples which illustrate this? Thanks.
I think you should take a look at a Push/Comet server. A very popular one right now is Nginx's push module: http://pushmodule.slact.net/
This is how you create a chat room, for example; at least, that is what it sounds like you are describing.
Update:
As for your latency question, I don't think 150 ms is realistic: that's at least a full round trip plus a DB read and write. Polling will not give the user a very snappy experience, because your JS might send its request right before the other user completes the action, and then you would have to wait until your JS polls again for user B to see the update. That could be a long time, maybe 10 seconds. I wouldn't use polling; it's very wasteful and makes caching pretty tough as well.
I'd go with push. Unfortunately Apache doesn't have a push module as reliable as Nginx's.
There are 2 main approaches to this:
You can poll: make AJAX requests asking whether the state has changed every, say, 5 seconds.
HTTP Streaming
This article lists 2 more approaches: http://www.infoq.com/news/2007/07/pushvspull
You can do this easily with PHP and MySQL or some other kind of database. Is there something preventing you from using a database? If so, you can write to files using PHP, which would let you store these values for users A and B.
