How to load HTML document including js.download elements with Python - javascript

I am trying to collect data from a web page displaying search results about cars on sale.
The structure of the online document is not too complex, and I was able to single out the interesting bits of information based on a certain attribute, data-testid, which every returned car record possesses.
I can find the different bits of information, like price, registration year, mileage and so on, based on substrings of this attribute.
I use beautifulsoup to parse the HTML and requests to initially load the HTML document from the web.
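For context, here is a minimal sketch of that baseline approach; the URL and the exact data-testid substring below are hypothetical placeholders, not the real site's values:
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/cars?page=1"  # placeholder for the real search URL
html = requests.get(URL).text
soup = BeautifulSoup(html, "html.parser")

# Select every element whose data-testid contains a given substring, e.g. "price"
prices = soup.select('[data-testid*="price"]')
print(len(prices))  # online inspection shows 100, but this often comes back smaller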
Now, here's the issue. In a way that I cannot predict or find any logic to, the HTML returned by requests.get() is somehow incomplete. For a page of 100 results, which I can see when I inspect the page online (where I can count 100 data-testid fields with the price substring, 100 with the mileage substring, and so on), the HTML returned by requests.get() (just like the one I obtain with a 'save-as' operation on the page itself) only contains a portion of those fields.
Their number is also unpredictable.
I started by asking simply why there is this discrepancy between the online and saved HTML.
So far no full answer, but the hint in the comments was that the page is probably loaded dynamically through JavaScript.
I was happy to find that saving the page to disk, with all the files, somehow produced the full HTML I could then parse without further issues.
However, my joy only lasted for that specific search. When I needed to try a new one, I was suddenly back to square one.
With further investigation, I came to my current understanding, which is at the origin of this question: when I save the online page as 'Webpage, Complete' (which creates an .html file plus a folder), this combination definitely contains ALL records. I can say that because, if I go offline and double-click the newly saved .html file, I can see all the records that were online (100 in this case).
However, the HTML file itself only contains a few of them!
My deduction is therefore that the rest of the records must be 'hidden' in the folder created at saving time, and I would tend to say they are embedded in those (many) *.js.download files.
My questions are:
Is my assumption correct? Are the other records stored in those files?
If yes, how can I make them 'explicit' when parsing the HTML with BeautifulSoup?
UPDATE 07/05
I've tried to install and use requests_html as suggested in the comments and in this answer.
Its render() method looked promising; however, I'm probably not fully understanding the mechanism explained in the requests_html documentation (the JavaScript rendering part), because even after the following operations (pseudo-code)
from requests_html import HTMLSession
session = HTMLSession()
r = session.get(URL)
r.html.render()
At this point, I was hoping to have 'forced' the site to 'spit out' ALL the HTML, including those bits which remain somehow hidden and only show up in the live page.
However, subsequently dumping r.html.html to a file still gives back the same old 5 records (why 5 now, when other searches returned 12 or even 60, is a complete mystery to me).
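One variation I can try, since render() also accepts sleep and scrolldown arguments (which are meant to help when content only appears after the page settles or is scrolled), is something like the following, though I have no guarantee it recovers all 100 records for this particular site:
from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
r = session.get(URL)  # same URL as in the snippet above
# wait a few seconds and scroll several times before grabbing the DOM
r.html.render(sleep=3, scrolldown=10, timeout=30)

soup = BeautifulSoup(r.html.html, "html.parser")
print(len(soup.select('[data-testid*="price"]')))  # ideally this approaches the 100 seen online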

Related

How do I scrape dynamic URL of a page?

I am trying to do some website testing through Selenium and Python. I filled in the form on the page http://www.flightcentre.co.nz/ and submitted it. But now the search results take me to a new page with the URL https://secure.flightcentre.co.nz/eyWD/results . How will my web driver handle this? I am doing this for the first time. Could anyone help me by providing an example or pointing me to the right kind of tutorial?
Thanks.
OK, since I tried to answer your other question I'll give this one a go too, although you are not exactly explaining what you want.
One thing to remember is that Selenium is running your browser, not a traditional web scraper. That means a changing URL is not a big deal; the only time you have to change how you approach scraping in Selenium is when you get a popup.
One thing you can do, based on your other code, is the following when looking for a flight:
driver.implicitly_wait(40)  # 40 is the number of seconds
This tells the driver to wait up to 40 seconds for elements to appear before raising an error, so your next step only runs once the page has finished loading and whatever you want to interact with is present in the DOM.
Now if you are trying to scrape all of the flight data that comes up, that'll be fairly tricky. You could do a for loop and grab every element on a page and write it to a csv file.
import csv

class_for_departure_flight = driver.find_elements_by_xpath('//div[@class="iata"]')

for flights in class_for_departure_flight:
    try:
        with open('my_flights.csv', 'a', newline='') as flights_book:
            csv_writer = csv.writer(flights_book, delimiter=',')
            csv_writer.writerow([flights.text])  # writerow expects a sequence, so wrap the text in a list
    except Exception:
        print("Missed a flight")
One thing to notice in this second part is that I am using Python's csv library to write rows of data. Note that you can also bundle several values together and write them as one row, like:
data = (flights, dates, times)
csv_writer.writerow(data)
and it will write all of those different things on the same row in the spreadsheet.
The other two big things that are easily missed are:
class_for_departure_flight = driver.find_elements_by_xpath('//div[@class="iata"]')
Notice that this is driver.find_elements_by_xpath with elements in the plural: it looks for every object matching the XPath and returns them in a list, so you can iterate over them in a for loop.
The next part is csv_writer.writerow([flights.text]): when you iterate over your flights, you need to grab the text, and you do that with flights.text. If you only needed the first matching element, you could use the singular form and grab its text directly:
flight_text = driver.find_element_by_xpath('//div[@class="iata"]').text
Hopefully this helps!
This is a good place to start: http://selenium-python.readthedocs.org/getting-started.html
Here are some things about Selenium I've learned the hard way:
1) When the DOM refreshes, you lose your references to page objects (e.g., the element returned from element = driver.find_element_by_id("passwd-id") becomes stale).
2) Test shallow; each test case should do only one assert/validation of page state, maybe two. This lets you take screenshots when there's a failure and saves you from asking "is it a failure in the test, or in the app?"
3) It's a big race condition between any JavaScript on the page, and Selenium. Use Explicit Waits to block Selenium while the JavaScript works to refresh the DOM.
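For example, here is a minimal explicit-wait sketch; it assumes the same //div[@class="iata"] locator used in the earlier answer, which you would swap for whatever your page actually renders:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("https://secure.flightcentre.co.nz/eyWD/results")  # results URL from the question

# Block for up to 40 seconds until at least one matching element is present in the DOM
wait = WebDriverWait(driver, 40)
results = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//div[@class="iata"]')))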
To be clear, this is my experience with using Selenium; it may not match everyone else's.
Good luck! Hope this is helpful.

Save Value State for Public Sharing (Add to URL)

http://liveweave.com/xfOKga
I'm trying to figure out how to save code similar to Liveweave.
Basically, whatever you code, you click the save button and it generates a hash after the URL. When you go to this URL you can see the saved code. (I've been trying to learn this, but I keep having trouble finding the right sources; my searches turn up references completely unrelated to what I'm looking for.)
I spent the past two days researching this and have gotten nowhere.
Can anyone can help direct me to a tutorial or article that explains this type of save event thoroughly?
To understand the functionality, it is best to try and identify everything that is happening. Dissect this feature according to the technology that would typically be used for each distinguishable component. That dissected overview will then make it easier to see how the underlying technologies work together. I suspect you may lack the experience or nomenclature to see at a glance how a site like liveweave works or how to search for the individual pieces, so I will break it down for you. It will be up to you to research the individual components that I will name. Knowing this, here are the keys you need to research:
Note that without being the actual developer of liveweave, knowing all the backend technology is not possible, but intelligent guesses will suffice. The practice is all the same. This is a cursory breakdown.
1) A marked up page, with HTML, CSS, and JavaScript. This is the user-facing part of the application, where content can be typed, and how the user interacts with the application.
2) JavaScript to asynchronously (AJAX) submit the page's form to the backend for processing.
3) A backend programming/scripting language to process the incoming form. In the case of liveweave, the form is POSTed. It is also using PHP to process the form.
4) A database table with a column for each language (liveweave has HTML, CSS, and JavaScript). The data from each textarea, submitted in the form and processed by PHP, is inserted as a new row. Each row also gets a newly generated hash, stored alongside the data just inserted. A popular database is MySQL.
5) When the database insert is complete, the scripting language takes over again and sends its response back to the marked-up page (1). That page is waiting for a response from the backend, and JavaScript handles it. In the case of liveweave, the response is the new hash to be used in the URL.
6) The URL magic happens with JavaScript. You want to look up JavaScript's latest History API, where methods like pushState will be used to update the URL in the browser without actually refreshing the page.
When a URL with a given hash is navigated to, the scripting language processes the request, grabs the hash, searches for the hash in the database table, finds a matching row, and populates the page's textareas with the data just found.
Throughout all this, there should be checks to avoid duplication and a multitude of exploits. This is also up to you to research.
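To make steps 3 to 5 concrete, here is a minimal sketch using Python/Flask with an in-memory dict standing in for PHP/MySQL; every name in it (the routes, save_snippet, SNIPPETS) is an illustrative assumption, not how liveweave is actually built:
import secrets
from flask import Flask, jsonify, request

app = Flask(__name__)
SNIPPETS = {}  # stand-in for the database table, keyed by hash

@app.route("/save", methods=["POST"])
def save_snippet():
    # Store the submitted HTML/CSS/JS under a newly generated hash
    code = {k: request.form.get(k, "") for k in ("html", "css", "js")}
    slug = secrets.token_urlsafe(6)
    SNIPPETS[slug] = code
    return jsonify({"hash": slug})  # the page's JavaScript puts this in the URL via pushState

@app.route("/<slug>")
def load_snippet(slug):
    # Look the hash up and return the stored code so the editor can repopulate its textareas
    return jsonify(SNIPPETS.get(slug, {}))

The real thing would persist to a database, validate input, and guard against duplicate or malicious submissions, as noted above.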
It should be noted that currently there are two comments for your question. Darren's link will indeed allow the URL to change, but it is a redirect, and not what you want. ksealey's answer is not wrong; that is one way of doing it, but it is not the most robust or scalable, and would not be the recommended approach for solving this.

Run Database Stored RegEx against DOM

I have a question about how to approach a certain scenario before I get halfway through it and figure out it was not the best option.
I work for a large company that has a team creating tools for teammates to use that aren't official enterprise tools. We have no direct access to the database, just access to an internal server where we store our files to run, and we can access the main site with JavaScript etc. (same domain).
What I am working on is a tool with a ton of options that let you select what I will call "data points" on a page.
These are things like account status, balance, name, phone number, email and so on, and the tool saves them to an Excel sheet.
So you input account numbers, choose what you need, and then, using IE objects, it navigates to the page and scrapes the data you request.
My question is as follows..
I want to make the scraping part pretty dynamic in the way it works. I want to be able to add new data points on the fly.
My goal or idea is to store, alongside each "data point" option in the table, the regular expression needed to get that specific piece of data.
If I choose "Name", it knows the expression for name in the database to run against the DOM.
What would be the best way to go about creating that type of function in JavaScript/jQuery?
I need to pass a Regex to a function, have it run against the DOM and then return the result.
I have a feeling that there will be things that require more than 1 step to get the information etc.
I am just trying to think of the best way to approach it without having to hardcode 200+ expressions into the file as the page may get updated and need to be changed.
Any ideas?
IRobotSoft scraper may be the tool you are looking for. Check this forum and see if questions are similar to what you are doing: http://irobotsoft.org/bb/YaBB.pl?board=newcomer. It is free.
What it uses is not regular expression but a language called HTQL, which may be more suitable for extracting web pages. It also supports regular expression, but not as the main language.
It organizes all your actions well with a visual interface, so you can dynamically compose actions or tasks for changing needs.

Detect the referral/s of Url/s using JavaScript or PHP from inside a Bookmarklet

Let's think out of the box!
Without any programming skills, how can you say/detect if you are on a web page that lists products, and not on the page that prints specific details of a product?
The bookmarklet is inserted using JavaScript right after the body tag of a website (eBay, Bloomingdale's, Macy's, Toys'R'Us...).
Now, my story is: (programming skills needed now)
I have a bookmarklet, and my main problem is how to detect whether I am on a page that lists products or on the page that prints the product detail.
The best way I could think of to detect whether I am on the detail page of a product is to detect the referral(s) of the current URL (maybe all the referrals, the entire click history).
Possible problem: a user adds the URL as a favorite, does not use my bookmarklet, and closes the browser; later the user opens the browser again, clicks the favorite link and uses my bookmarklet. I think I can't detect the referral in that case. That's OK; not all cases can be covered.
Can I detect the referral of this link using the cache in that case? (Many browser cache systems are involved here, I know.)
how can you say/detect if you are on a web page that lists products, and not on the page that prints specific details of a product
I'd set up Brain.js (a neural net implemented in JavaScript) and train it on a (necessarily broad and varied) sample set of DOMs, then pick a threshold product:details ratio to 'detect' (as nearly as possible) what type of page I'm on.
This will require some trial and error, but is the best approach I can think of (neural nets can get to "good enough" results pretty quickly - try it, you'll be surprised at the results).
No. You can't check history with a bookmarklet, or with any normal client side JavaScript. You are correct, the referrer will be empty if loaded from a bookmark.
The bookmarklet can however store the referrer the first time it is used in a cookie or in localStorage and then the next time it is used, if referrer is empty, check the cookie or localStorage.
That said, your entire approach to this problem seems really odd to me, but I don't have enough details to know if it is genius or insanity.
If I was trying to determine if the current page was a list or a details page, I'd either inspect the url for common patterns or inspect the content of the page for common patterns.
Example of common url patterns: Many 'list pages' are search results, so query string will have words like "search=", "q=", "keywords=", etc.
Example of page content patterns: A product page will have only 1 "buy" button or "add to cart", whatever. A list page will have either no such button or have many.
Why don't you use the URL? Then you can do something like this: http://www.le.url.com?pageid=10&type=DS and the code would be something like this:
<?php
if (isset($_GET['type']) && $_GET['type'] == 'DS') {
    // Do stuff related to Details Show
} else {
    // Show all the products
}
?>
And you can make the URL look like this with an .htaccess file:
http://www.le.url.com/10/DS
I would say your goal should first be for it to work for some websites. Then many websites and then eventually all websites.
A) Try hand coding the main sites like Amazon, eBay etc... Have a target in mind.
B) Something more creative might be to keep a list of all currency symbols and then detect whether a page has, say, 10 of them scattered around. For instance, the $ symbol is found all over Amazon, but only when there are, say, 20 per page can you really say it is a product listing (this is a bad example; Amazon's pages are fairly crazy). Perhaps the currency symbols won't work; however, I think you can generalize something similar, perhaps lots of currency symbols plus detection of a "grid"-type layout with things lined up in rows. You'll get lots of garbage, so you'll need good filtering. Data analysis is needed once you have something working algorithmically like this.
C) I think after B) you'll realize that your system might be better with parts of A). In other words you are going to want to customize the hell out of certain popular websites (or more niche ones for that matter). This should help fill the gap for sites that don't follow any known models.
Now, as far as tracking where the user came from, why not use a tracking-cookie-type concept? You could of course use IndexedDB or localStorage or whatever. In other words, always keep a reference to the last page by saving it on the current page. You could also keep a stack and push URLs onto it on every page. If you want to save it for some reason, just send that data back to your server.
Detecting favorite clicks could involve detecting all AJAX traffic and analyzing it (although this might be hard...). You should first do a survey to see what those calls typically look like; I'd imagine something like amazon.com/favorite/product_id would be fairly common. Also, you could try to detect the selector for the "favorite" button on the page and then add an onclick handler to detect when it is clicked.
I tried to solve each problem you mentioned. I don't think I understand exactly what you are trying to do.

How much JSON is too much JSON?

I am developing a bookmarking site like Delicious. In order to provide a better and faster user experience, I grab all the bookmarks from the db table and form a JSON object with all the bookmark information in it. E.g., for each bookmark I have an id, title, url, description, tags, etc. The JSON object is formed on the first page load; I then take the output JSON and use jQuery.each to style and inject the relevant HTML on the fly.
Right now I have no way to test it, so here comes my question: imagining there is no limit on the number of bookmarks a user can save, what would be the effect of this structure on the browser (or what other problems might arise) if a user has, say, 2000 bookmarks, considering that paging is not an option for this particular project?
Probably controversial but anyway. How can paging not be an option? When is it ever relevant to show 2k bookmarks at a time? I'd say never.
When you're returning that much data (it depends on how much text, of course), you're wide open to DDoS attacks. Imagine an attacker who gets hold of a URL returning several megabytes of JSON; it would not be that hard to sink your servers.
It would be nice with some more information on your UI so we can analyze what data you really need.
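As a rough back-of-the-envelope check (not a benchmark), you can build 2000 fake bookmarks and measure how large the JSON payload actually gets; the field names and sizes below are assumptions, not your real schema:
import json

bookmark = {
    "id": 1,
    "title": "Example bookmark title",
    "url": "https://example.com/some/fairly/long/path",
    "description": "A couple of sentences of description text for the bookmark. " * 2,
    "tags": ["python", "json", "bookmarks"],
}
payload = json.dumps([dict(bookmark, id=i) for i in range(2000)])
print(f"{len(payload) / 1024:.0f} KiB")  # on the order of a few hundred KiB for this schema

Whether a few hundred kilobytes per request is acceptable depends on your users' connections and on how quickly jQuery.each can inject that many rows into the DOM.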
