Python scraping a web page causes a JavaScript issue

I am using urllib2 to scrape a webpage. It works well on a lot of sites, but in some, I get:
<html><title>You are being redirected...</title>
<noscript>Javascript is required. Please enable javascript before you are allowed to see this page.</noscript>
Here's my code:
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}
req = urllib2.Request(url=link, headers=hdr)
try:
    f = urllib2.urlopen(req)
except urllib2.HTTPError, e:
    return None
except urllib2.URLError, e:
    return None
mycontent = f.read()
Any ideas?

This is not a JavaScript issue; pay attention to the first line:
<html><title>You are being redirected...</title>
This means you have to follow the redirection until you reach the destination server.
urllib2.urlopen does not do that automatically; you need to call .geturl() on the result, or use other functions or libraries that resolve redirects.
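As a minimal sketch of that idea, staying with urllib2 as in the question: check where the response actually came from with .geturl(), and, if the page redirects via an HTML meta refresh, follow it yourself (the regex below is a rough illustration, not a robust parser):
# Minimal sketch, not the asker's full code: inspect where urlopen landed and
# follow a meta-refresh redirect by hand if one is present.
import re
import urllib2

f = urllib2.urlopen(req)
print(f.geturl())  # the URL the response was actually served from

html = f.read()
# Some sites redirect with <meta http-equiv="refresh" content="0; url=...">
# instead of an HTTP 3xx status; this rough check is illustrative only.
m = re.search(r'http-equiv="refresh"[^>]*url=([^">]+)', html, re.IGNORECASE)
if m:
    html = urllib2.urlopen(m.group(1)).read()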

Related

How to find all the JavaScript requests made from my browser when I'm accessing a site

I want to scrape the contents of LinkedIn using requests and bs4, but I'm facing a problem with the JavaScript that loads the page after I sign in (I don't get the home page directly). I don't want to use Selenium.
Here is my code:
import requests
from bs4 import BeautifulSoup

class Linkedin():
    def __init__(self, url):
        self.url = url
        self.header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) "
                                     "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"}

    def saveRsulteToHtmlFile(self, nameOfFile=None):
        if nameOfFile == None:
            nameOfFile = "Linkedin_page"
        with open(nameOfFile + ".html", "wb") as file:
            file.write(self.response.content)

    def getSingInPage(self):
        self.sess = requests.Session()
        self.response = self.sess.get(self.url, headers=self.header)
        soup = BeautifulSoup(self.response.content, "html.parser")
        self.csrf = soup.find(attrs={"name": "loginCsrfParam"})["value"]

    def connecteToMyLinkdin(self):
        self.form_data = {"session_key": "myemail#mail.com",
                          "loginCsrfParam": self.csrf,
                          "session_password": "mypassword"}
        self.url = "https://www.linkedin.com/uas/login-submit"
        self.response = self.sess.post(self.url, headers=self.header, data=self.form_data)

    def getAnyPage(self, url):
        self.response = self.sess.get(url, headers=self.header)

url = "https://www.linkedin.com/"
likedin_page = Linkedin(url)
likedin_page.getSingInPage()
likedin_page.connecteToMyLinkdin()  # I'm connected but JavaScript is still loading
likedin_page.getAnyPage("https://www.linkedin.com/jobs/")
likedin_page.saveRsulteToHtmlFile()
I want help getting past the JavaScript loading without using Selenium...
Although it's technically possible to simulate all the calls from Python, for a dynamic page like LinkedIn I think it will be quite tedious and brittle.
Anyway, you'd open the "developer tools" in your browser before you open LinkedIn and see what the traffic looks like. You can filter for the requests made by JavaScript (in Firefox, the filter is called XHR).
You would then simulate the necessary/interesting requests in your code. The benefit is that the servers usually return structured data to JavaScript, such as JSON, so you won't need to do as much HTML parsing.
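A rough sketch of that pattern, assuming you have already logged in with the session from the question; the endpoint URL, headers, and parameters below are made-up placeholders standing in for whatever you actually find in the network panel:
import requests

sess = requests.Session()
# ... log in with sess first, as in the question ...

# Hypothetical XHR endpoint copied from the browser's network panel;
# the URL and parameters here are placeholders, not a real API.
resp = sess.get(
    "https://www.example.com/api/jobs",
    headers={"Accept": "application/json"},
    params={"start": 0, "count": 25},
)
data = resp.json()  # structured JSON instead of HTML that needs parsing
print(data)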
If you find you're not progressing very much this way (it really depends on the particular site), then you will probably have to use Selenium or some alternative such as:
https://robotframework.org/
https://miyakogi.github.io/pyppeteer/ (port of Puppeteer to Python)
You should send all the XHR and JS requests manually [in the same session which you created during login]. Also, pass all the fields in the request headers (copy them from the network tools).
self.header_static = {
    'authority': 'static-exp2.licdn.com',
    'method': 'GET',
    'path': '/sc/h/c356usw7zystbud7v7l42pz0s',
    'scheme': 'https',
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-GB,en;q=0.9,en-US;q=0.8,hi;q=0.7,la;q=0.6',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'referer': 'https://www.linkedin.com/jobs/',
    'sec-fetch-mode': 'no-cors',
    'sec-fetch-site': 'cross-site',
    'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Mobile Safari/537.36'
}

def postConnectionRequests(self):
    urls = [
        "https://static-exp2.licdn.com/sc/h/62mb7ab7wm02esbh500ajmfuz",
        "https://static-exp2.licdn.com/sc/h/mpxhij2j03tw91bpplja3u9b",
        "https://static-exp2.licdn.com/sc/h/3nq91cp2wacq39jch2hz5p64y",
        "https://static-exp2.licdn.com/sc/h/emyc3b18e3q2ntnbncaha2qtp",
        "https://static-exp2.licdn.com/sc/h/9b0v30pbbvyf3rt7sbtiasuto",
        "https://static-exp2.licdn.com/sc/h/4ntg5zu4sqpdyaz1he02c441c",
        "https://static-exp2.licdn.com/sc/h/94cc69wyd1gxdiytujk4d5zm6",
        "https://static-exp2.licdn.com/sc/h/ck48xrmh3ctwna0w2y1hos0ln",
        "https://static-exp2.licdn.com/sc/h/c356usw7zystbud7v7l42pz0s",
    ]
    for url in urls:
        self.sess.get(url, headers=self.header_static)
        print("REQUEST SENT TO " + url)
I called the postConnectionRequests() function after connecting and before saving the HTML content, and received the complete page.
Hope this helps.
XHR requests are sent by JavaScript, and Python will not run JavaScript code when it gets the page using requests and BeautifulSoup. Tools like Selenium load the page and run its JavaScript. You can also use headless browsers.
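For completeness, a minimal sketch of that headless-browser route (assuming Selenium and a matching ChromeDriver are installed; this is an illustration, not a drop-in fix for the login flow):
# Minimal headless-browser sketch; assumes selenium and ChromeDriver are installed.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # run Chrome without a visible window

driver = webdriver.Chrome(options=options)
driver.get("https://www.linkedin.com/jobs/")

html = driver.page_source  # HTML after the page's JavaScript has run
driver.quit()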

Outlook login script in python3: Javascript not enabled - cannot login

I am currently trying to log into an Outlook account using only requests in Python. I did the same thing with Selenium before, but because of better performance and the ability to use proxies more easily, I would like to use requests now. The problem is that whenever I send a POST request to the Outlook post URL, it returns a page saying that it will not function without JavaScript.
I read on here that I would have to make the requests that JavaScript would make, using requests, so I used the network analysis tool in Firefox and made requests to all the URLs that the browser made requests to. It still returns the same page, saying that JS is not enabled and it will not function without JS.
s = requests.Session()
s.headers.update({
    "Accept": "application/json",
    "Accept-Language": "en-US",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Host": "www.login.live.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
})

url = "https://login.live.com/"
r = s.get(url, allow_redirects=True)

# Simulate JS
sim0 = s.get("https://logincdn.msauth.net/16.000.28299.9
# ...requests 1-8 here
sim9 = s.get("https://logincdn.msauth.net/16.000.28299.9/images/Backgrounds/0.jpg?x=a5dbd4393ff6a725c7e62b61df7e72f0", allow_redirects=True)

post_url = "https://login.live.com/ppsecure/post.srf?contextid=EFE1326315AF30F4&bk=1567009199&uaid=880f460b700f4da9b692953f54786e1c&pid=0"
payload = {"username": username}
sleep(4)

s.cookies.update({"logonLatency": "LGN01=637025992217600244"})
s.cookies.update({"MSCC": "1567001943"})
s.cookies.update({"CkTst": "G1567004031095"})

print(s.cookies.get_dict())
print("---------------------------")
print(json.dumps(payload))
print("---------------------------")

rp = s.post(post_url, data=json.dumps(payload), allow_redirects=True)
rg = s.get("https://logincdn.msauth.net/16.000.28299.9/images/arrow_left.svg?x=a9cc2824ef3517b6c4160dcf8ff7d410")
print(rp.text)
print(rp.status_code)
If anyone could point me in the right direction on how to fix this, I would highly appreciate it! Thanks in advance.

How to programmatically replicate a request found in Chrome Developer Tools?

I'm looking at my balance on Venmo.com but they only show you 3 months at a time and I'd like to get my entire transaction history.
Looking at the Chrome Developer Tools, under the network tab, I can see the request to https://api.venmo.com/v1/transaction-history?start_date=2017-01-01&end_date=2017-01-31 which returns JSON.
I'd like to programmatically iterate through time, make several requests, and aggregate all of the transactions. However, I keep getting 401 Unauthorized.
My initial approach was just using Node.js. I looked at the cookie in the request and copied it into a secret.txt file and then sent the request:
import fetch from 'node-fetch'
import fs from 'fs-promise'

async function main() {
  try {
    const cookie = await fs.readFile('secret.txt')
    const options = {
      headers: {
        'Cookie': cookie,
      },
    }
    try {
      const response = await fetch('https://api.venmo.com/v1/transaction-history?start_date=2016-11-08&end_date=2017-02-08', options)
      console.log(response)
    } catch(e) {
      console.error(e)
    }
  } catch(e) {
    console.error('please put your cookie in a file called `secret.txt`')
    return
  }
}
That didn't work, so I tried copying all of the headers over:
const cookie = await fs.readFile('secret.txt')
const options = {
  headers: {
    'Accept-Encoding': 'gzip, deflate, sdch, br',
    'Accept-Language': 'en-US,en;q=0.8',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Cookie': cookie,
    'Host': 'api.venmo.com',
    'Origin': 'https://venmo.com',
    'Pragma': 'no-cache',
    'Referer': 'https://venmo.com/account/settings/balance/statement?end=02-08-2017&start=11-08-2016',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36',
  },
}
try {
  const response = await fetch('https://api.venmo.com/v1/transaction-history?start_date=2016-11-08&end_date=2017-02-08', options)
  console.log(response)
} catch(e) {
  console.error(e)
}
This also did not work.
I even tried making the request from the console of the website and got a 401:
fetch('https://api.venmo.com/v1/transaction-history?start_date=2016-11-08&end_date=2017-02-08', {credentials: 'same-origin'}).then(console.log)
So my question here is this: I see a network request in Chrome Developer Tools. How can I make that same request programmatically? Preferably in Node.js or Python so I can write an automated script.
In the Network tab of the Chrome Developer Tools, right click the request and click "Copy" > "Copy as cURL (bash)". You can then either write a script using the curl command directly, or use https://curlconverter.com/ to convert the cURL command to Python, JavaScript, PHP, R, Go, Rust, Elixir, Java, MATLAB, Dart or JSON.
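For illustration, the Python output from curlconverter for a request like the one above would look roughly like this; the cookie name and value are placeholders here, since the converter fills them in from whatever your logged-in browser actually sent:
import requests

# Placeholder cookie: copy the real name/value pairs from the devtools request.
cookies = {
    "session_cookie_from_devtools": "PASTE_VALUE_HERE",
}
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36",
    "Referer": "https://venmo.com/",
}
params = {
    "start_date": "2017-01-01",
    "end_date": "2017-01-31",
}

response = requests.get(
    "https://api.venmo.com/v1/transaction-history",
    params=params,
    cookies=cookies,
    headers=headers,
)
print(response.status_code)
print(response.json())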

OPTIONS (failed) only on Chrome and Firefox

I make a POST request and the request just sits, pending until it eventually fails. I've monitored the nginx logs and the node server logs and the request doesn't even register. This works for anyone else that I've had test it except one other colleague. If I use the edge browser or a different computer it works fine.
I have attempted to make POST requests to other (custom) servers and it hangs on options there as well. I have also made the POST request with jQuery and it fails the same way.
It's maybe worth noting that I am using the withCredentials flag.
Headers:
Provisional headers are shown
Access-Control-Request-Headers:content-type
Access-Control-Request-Method:GET
Origin:http://localhost:8080
Referer:http://localhost:8080/<path>
User-Agent:Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.84 Safari/537.36
The request:
public login(user) {
  const endpoint = `http://<url>`;
  let headers = new Headers();
  headers.append('Content-type', 'application/json');
  return this.http
    .post(endpoint, JSON.stringify(user), {
      headers: headers,
    });
}
I subscribe to the call in my component:
this._accountService.login(this.user)
  .subscribe(res => {
    console.log("logged in!");
    if (res.json().status === "success") {
      window.location.href = `/home/${this.org}/${this.product}`;
    }
    else {
      // What other options are there?
      console.log("Do something else maybe?");
    }
  },
  err => {
    this.invalidLogin = true;
    console.log("Ye shall not pass!");
  });
Successful user's headers
Accept:*/*
Accept-Encoding:gzip, deflate, sdch
Accept-Language:en-US,en;q=0.8
Access-Control-Request-Headers:content-type
Access-Control-Request-Method:POST
Connection:keep-alive
Host:<url>
Origin:<url>
Referer:<url>
User-Agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.33 Safari/537.36
From chrome://net-internals/#events
t=61869793 [st= 0] +REQUEST_ALIVE [dt=60162]
--> has_upload = false
--> is_pending = true
--> load_flags = 34624 (DO_NOT_SAVE_COOKIES | DO_NOT_SEND_AUTH_DATA | DO_NOT_SEND_COOKIES | MAYBE_USER_GESTURE | VERIFY_EV_CERT)
--> load_state = 14 (WAITING_FOR_RESPONSE)
--> method = "OPTIONS"
--> net_error = -1 (ERR_IO_PENDING)
--> status = "IO_PENDING"
--> url = "<url>"
t=61929955 [st=60162] -HTTP_STREAM_PARSER_READ_HEADERS
--> net_error = -324 (ERR_EMPTY_RESPONSE)
t=61929955 [st=60162] -HTTP_TRANSACTION_READ_HEADERS
--> net_error = -324 (ERR_EMPTY_RESPONSE)
t=61929955 [st=60162] -URL_REQUEST_START_JOB
--> net_error = -324 (ERR_EMPTY_RESPONSE)
t=61929955 [st=60162] URL_REQUEST_DELEGATE [dt=0]
t=61929955 [st=60162] -REQUEST_ALIVE
--> net_error = -324 (ERR_EMPTY_RESPONSE)
I'm really guessing this is related to something that is cached in my browser(s) but I really cannot find what. I've cleared all cookies and anything that could be stored. Where else can I check to clear things? This is clearly something local to my computer/browser (and one other unfortunate person).
Please try to subscribe() to the observable.
return this.http
  .post(endpoint, JSON.stringify(user), {
    headers: headers,
  }).subscribe(() => console.log("POST done!"));
Have you tried setting 'Cache-Control' in your headers? I think in jQuery you can simply set
$.ajax({
  cache: false
});
or adding a header with a regular ajax request
request.setRequestHeader("Cache-Control", "no-cache");
Why don't you just prevent getting into the OPTIONS request loop? It really drives you crazy at times. Other browsers do not trigger an OPTIONS request, but Chrome and Firefox do, to ensure CORS. I have successfully used a library from GitHub named xdomain, and it really works! Its GitHub introduction page presents xdomain as a CORS alternative. Most importantly, I used it with jQuery, but it also supports Angular's http service. Have a look at it; it may help you for good :). Here's the link to the library: Xdomain CORS Alternative
There are issues with CORS and using localhost as the domain (which you have listed in the Origin headers). Typically CORS/OPTIONS requests don't work properly when localhost is involved, for certain security reasons, but hanging isn't normally what happens, so this might not be the correct answer, but it's worth a shot!
Try adding a new host to your local machine and removing localhost from the equation. Just throwing this idea out there and hope that it might help you out!
As per the comment below:
Your server appears to allow the connection, but it does not appear to send a response. Are you able to post the headers from a successful OPTIONS request, to prove that the server is actually able to handle these requests?

Phantomjs + Google analytics

I'm using PhantomJS on my webpage. Right now, I'm just testing things (like screenshots, clicks, custom resolutions, custom IPs, etc.) and I'm trying to add my website to Google Analytics.
In the PhantomJS script, I'm using a function to change the proxy automatically from a text file with some private proxies I have. Changing the IP works perfectly (confirmed through screenshots), but in Analytics I'm not receiving any visits.
Can someone help me and give some tips on sending visits to Google Analytics with PhantomJS?
Thanks in advance!
Edited:
I added the default Google Analytics script they give me in the head of my webpage.
And this is the PhantomJS code:
var system = require("system");
var page = require("webpage").create();

page.customHeaders = {
    "Referer": "http://www.google.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36",
    "Connection": 'keep-alive',
    "Accept": 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    "Accept-Enconding": 'gzip, deflate, sdch',
    "Accept-Language": 'en;q=0.8',
    "Request-protocol": 'HTTP/1.1',
    "Request-URI": '/'
};

page.settings.loadImages = false;

page.open("http://exampleurl.com/", function (status) {
    page.render('test.png');
    console.log('Done!');
    phantom.exit();
});
I think you should set "page.settings.loadImages" to true, as GA uses a GIF pixel to collect data.
