Injecting HTML into a page using Mechanize - javascript

I am writing a web-scraping program to get my grades from a website. I used Mechanize to log into the page and navigate to the area I'm scraping. Unfortunately, the page uses JavaScript to encrypt its content (possibly to stop me from scraping). I found the decryption script and ported it to Python. It works: I used it to convert the encrypted string extracted from the page, and the result is an HTML table.
So, to get to my point: is there any way to inject the HTML back into the page so that Mechanize can follow the links in that table and get my grades?
Thanks for the help!
EDIT: I also have Beautiful Soup, if that is any help.

I ended up just using this:
response = br.open("http://www.linknotonpagethatiwanttogoto.com")
page = response.read()
I found out that you can store the result of .open() on a URL as a response, instead of using .follow_link(). The browser also keeps using the same cookie jar, so the session cookies are preserved. So after parsing the HTML, I popped the links into .open() and got the new page.
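Here is a minimal sketch of that flow, combining mechanize with Beautiful Soup (the URL is a placeholder and decrypt() stands for the decryption script I ported to Python):

import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
response = br.open("http://www.example.com/grades")  # placeholder; log in and navigate first
encrypted = response.read()
table_html = decrypt(encrypted)  # hypothetical: the ported decryption routine

# parse the decrypted table and feed its links straight back into the same Browser
soup = BeautifulSoup(table_html, "html.parser")
for a in soup.find_all("a", href=True):
    # same Browser, same cookie jar, so the session is preserved
    # (relative links would need to be joined against the page URL first)
    page = br.open(a["href"]).read()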

Related

How to do an HTTP request to get the whole page source when part of the HTML is loaded by JavaScript?

I want to get the HTML of the web page https://www.collinsdictionary.com/dictionary/english/supremacy, but part of the HTML is loaded by JavaScript. When I use HTTP.jl to get the page with HTTP.request(), I only get the part of the HTML that is present before the JavaScript has run, so the page I get is different from the page I get in Chrome. How can I get the same page that Chrome gets? Do I have to use WebDriver.jl, which is a wrapper around Selenium WebDriver's Python bindings?
part of my source:
function get_page(w::word)::Bool
    response = nothing
    try
        response = HTTP.request("GET", "https://www.collinsdictionary.com/dictionary/$(dictionary)/$(w.org_word)",
            connect_timeout=connect_timeout, readtimeout=readtimeout, retries=retries, redirect=true, proxy=proxy)
    catch e
        push!(w.err_log, [get_page_http_err, string(e)])
        return false
    end
    open("./assets/org_page.html", "w") do f
        write(f, String(response.body))
    end
    return true
end
dictionary and w.org_word are both Strings; the function is in a module.
What you want is impossible to achieve with just HTTP.jl. Running the JavaScript part of the page is fundamentally different -- you need a JavaScript engine to do so, which is no simple thing.
And this is not a unique weakness of Julia's HTTP:
Python requests.get(url) returning javascript code instead of the page html
(more recently, the third-party requests-html library in Python seems to have added JavaScript rendering ability)
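On the Python side, a minimal sketch with that third-party requests-html library (assuming it fits your case; render() downloads a headless Chromium on first use):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://www.collinsdictionary.com/dictionary/english/supremacy")
r.html.render()     # runs the page's JavaScript in headless Chromium
print(r.html.html)  # the post-render HTML, much closer to what Chrome shows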

Get element from website with python without opening a browser

I'm trying to write a Python script which parses one element from a website and simply prints it.
I couldn't figure out how to achieve this without Selenium's webdriver, which opens a browser that runs the scripts needed to properly display the website.
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('http://groceries.asda.com/asda-webstore/pages/landing/home.shtml#!product/910000800509')
content = browser.page_source
print(content[42000:43000])
browser.close()
This is just a rough draft which will print the contents, including the element of interest <span class="prod-price-inner">£13.00</span>.
How could I get the element of interest without the browser opening, or even without a browser at all?
Edit: I've previously tried urllib, and wget in bash, both of which lack the required JavaScript interpretation.
As other answers mentioned, this webpage requires JavaScript to render its content, so you can't simply get and process the page with lxml, Beautiful Soup, or a similar library. But there's a much simpler way to get the information you want.
I noticed that the page you provided fetches its data from an internal API in a structured fashion. Based on the URL, the product number appears to be 910000800509. If you look at the Network tab in Chrome dev tools (or your browser's equivalent), you'll see that a GET request is made to the following URL: http://groceries.asda.com/api/items/view?itemid=910000800509.
You can make the request with nothing more than the requests module (its .json() method decodes the response, so a separate json import isn't needed):
import requests

url = 'http://groceries.asda.com/api/items/view?itemid=910000800509'
r = requests.get(url)
price = r.json()['items'][0]['price']
print(price)
which prints:
£13.00
This also gives you access to lots of other information about the product, since the request returns some JSON with product details.
How could I get the element of interest without the browser opening,
or even without a browser at all?
After inspecting the page you're trying to parse:
http://groceries.asda.com/asda-webstore/pages/landing/home.shtml#!product/910000800509
I realized that it only displays the content if JavaScript is enabled; based on that, you need to use a real browser.
Conclusion:
The way to go, if you need to automate, is:
selenium
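If the objection is only to a visible window, the same Selenium flow can run headless. A sketch using the Selenium 4 API with Firefox (assumes geckodriver is on your PATH):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("-headless")  # run Firefox without opening a window
browser = webdriver.Firefox(options=options)
browser.get('http://groceries.asda.com/asda-webstore/pages/landing/home.shtml#!product/910000800509')
price = browser.find_element(By.CSS_SELECTOR, "span.prod-price-inner").text
print(price)  # e.g. £13.00
browser.quit()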

Crawl a webpage and grab all dynamic Javascript links

I am using C# to crawl a website. All works fine except that it can't detect dynamic JS links. For example, a page with over 100 products may span several pages, and the "Next Page" and "Prev Page" links may be dynamic JavaScript URLs generated on click. Typical JS code is below:
<a href="javascript:PageURL('cf-233--televisions.aspx','?',2);">></a>
Is there any way of getting the actual link of the above href while collecting URLs on the page?
I am using Html Agility Pack but am open to any other technology. I have googled this many times, but there seems to be no solution yet.
Thanks.
Have you tried evaluating the JavaScript to get the actual hrefs? This might be helpful: Parsing HTML to get script variable value
Or maybe you should check what the PageURL function does (just open the website in a browser, type PageURL without parentheses at its console, and it will show you the code of the function) and rewrite it in C#.
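A sketch of that idea, in Python for brevity (the C# translation with Regex is mechanical). How PageURL assembles its arguments is an assumption here, so check the real function body in the console first:

import re

href = "javascript:PageURL('cf-233--televisions.aspx','?',2);"
m = re.match(r"javascript:PageURL\('([^']*)','([^']*)',(\d+)\);", href)
if m:
    path, sep, page = m.groups()
    # assumed reconstruction; adjust to whatever PageURL actually builds
    print(path + sep + "page=" + page)  # cf-233--televisions.aspx?page=2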
AbotX allows you to render the JavaScript on the page. It's a powerful web crawler with advanced features.

Load an external webpage, add a custom header, and use the data from the webpage

I want to load an external webpage on my own server and add my own header. I also need to use data from the external website, such as the URL and content (I need to search for specific data, check whether I already have that data in my system, and show my data in the header). The external webpage needs to keep working (e.g. its buttons for opening other pages, with no new windows).
I know I can play with .NET to create software, but I want to create a website that will do the trick. Can this be done? PHP + iframe is too simple, I think; that won't give me the data from the external website, and my server won't see changes in the external URL (which I need).
If it's supposed to be client-side, then you can acquire the necessary data with an Ajax request, parse it in JavaScript, and then insert it into an element. However, you have to take into account that if the host doesn't support cross-origin resource sharing (CORS), you won't be able to do it this way.
Ajax page source request: get full html source code of page through ajax request through javascript
Parsing elements from the source: http://ajaxian.com/archives/html-parser-in-javascript (not sure if useful)
Changing the element body:
// data --> the content you want to display in your element
document.getElementById('yourElement').innerHTML = data;
The other approach (server-side, though) is to "act" like a browser by faking your user-agent as some browser's, and then using cURL, for example, to get the source. But you don't want to fake it, because that's not nice and you would feel bad.
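For illustration, a sketch of that server-side approach in Python (requests instead of cURL; the user-agent string and URL are just examples):

import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}  # pretend to be a browser
r = requests.get("http://www.example.com/", headers=headers)
print(r.text[:500])  # first 500 characters of the fetched source

Note that this only helps when the server varies its response by user-agent; it still won't execute any JavaScript.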
Hope it gets you started!

Python urllib2 returns noscript-content

I am trying to get the HTML content of several pages with Python 2.7.3 and urllib2.
For most pages it works fine, but some pages, like http://www.bbc.co.uk/news/entertainment-arts-22441507#sa-ns_mchannel=rss&ns_source=PublicRSS20-sa, return this content:
This page is best viewed in an up-to-date web browser with style sheets (CSS) enabled. While you will be able to view the content of this page in your current browser, you will not be able to get the full visual experience. Please consider upgrading your browser software or enabling style sheets (CSS) if you are able to do so.
This problem also occurs on pages where JavaScript is required; only the content inside the noscript tag is returned.
Here is how I get the content:
import cookielib
import urllib2

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
response = opener.open(url).read().decode("utf-8")
Are there additional headers needed?
Sounds like you're fetching the original HTML page, before the JavaScript/Ajax has had a go at it. Try using WebKit to get the page with the JavaScript applied. See here for an answer with links.
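A commonly cited sketch of that WebKit approach uses PyQt4 (matching the Python 2 era of the question; assumes PyQt4 with QtWebKit is installed):

import sys
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
    # loads a URL, lets WebKit run the JavaScript, then exposes the result
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._load_finished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()  # block until loadFinished fires

    def _load_finished(self, result):
        self.html = self.mainFrame().toHtml()  # HTML after scripts have run
        self.app.quit()

page = Render("http://www.bbc.co.uk/news/entertainment-arts-22441507")
print unicode(page.html).encode("utf-8")  # no noscript fallback this time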
