PhantomJS + Google Analytics - JavaScript

I'm using PhantomJS on my webpage. Right now I'm just testing things (screenshots, clicks, custom resolutions, custom IPs, etc.) and I'm trying to add my website to Google Analytics.
In the PhantomJS script, I'm using a function to change the proxy automatically from a text file with some private proxies I have. Changing the IP works perfectly - confirmed through screenshots - but in Analytics I'm not receiving any visits.
Can someone give me some tips on sending visits to Google Analytics with PhantomJS?
Thanks in advance!
Edited:
I added the default Google Analytics script they give you to the head of my webpage.
And this is the PhantomJS code:
var system = require("system");
var page = require("webpage").create();

page.customHeaders = {
    "Referer": "http://www.google.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36",
    "Connection": 'keep-alive',
    "Accept": 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    "Accept-Encoding": 'gzip, deflate, sdch',
    "Accept-Language": 'en;q=0.8',
    "Request-protocol": 'HTTP/1.1',
    "Request-URI": '/'
};
page.settings.loadImages = false;

page.open("http://exampleurl.com/", function (status) {
    page.render('test.png');
    console.log('Done!');
    phantom.exit();
});

I think you should set page.settings.loadImages to true, since GA collects data via a GIF tracking pixel; with image loading disabled, that request is never made. You may also want to delay phantom.exit() briefly so the tracking request has time to fire.

Related

Python 3 - grabbing data from a dynamically refreshed table in a webpage

I am pulling some data from this web page using the back-end call below in Python. I have managed to get all the channels on the page returned by playing around with the &items parameter; however, I cannot seem to grab the request in the console, or see any relevant params that will change the times of day that EPG data is populated for:
import requests

session = requests.Session()
url = 'http://tvmds.tvpassport.com/snippet/white_label/php/grid.php?subid=postmedia&lang=en&lu=36625D&tz=America%2FToronto&items=3100&st=&wd=940&nc=0&div=tvmds_frames&si=0'
headers = {
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}

r = session.get(url, headers=headers)
status_code2 = r.status_code
data2 = r.content.decode('utf-8', errors='ignore')

filename = 'C:\\Users\\myuser\\mypath\\test_file.txt'
with open(filename, "a", encoding='utf-8', errors="ignore") as text_file:
    text_file.write(data2)  # the with block closes the file automatically
...all I get is an error in Google Chrome Dev Tools saying [Violation] Forced reflow while executing JavaScript.
Can anyone assist? I need to be able to grab program data from a full 24 hour period and across different days...
Thanks

How to find all the JavaScript requests made from my browser when I'm accessing a site

I want to scrape the contents of LinkedIn using requests and bs4, but I'm facing a problem with the JavaScript that loads the page after I sign in (I don't get the home page directly). I don't want to use Selenium.
Here is my code:
import requests
from bs4 import BeautifulSoup

class Linkedin():
    def __init__(self, url):
        self.url = url
        self.header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) "
                                     "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"}

    def saveRsulteToHtmlFile(self, nameOfFile=None):
        if nameOfFile is None:
            nameOfFile = "Linkedin_page"
        with open(nameOfFile + ".html", "wb") as file:
            file.write(self.response.content)

    def getSingInPage(self):
        self.sess = requests.Session()
        self.response = self.sess.get(self.url, headers=self.header)
        soup = BeautifulSoup(self.response.content, "html.parser")
        self.csrf = soup.find(attrs={"name": "loginCsrfParam"})["value"]

    def connecteToMyLinkdin(self):
        self.form_data = {"session_key": "myemail#mail.com",
                          "loginCsrfParam": self.csrf,
                          "session_password": "mypassword"}
        self.url = "https://www.linkedin.com/uas/login-submit"
        self.response = self.sess.post(self.url, headers=self.header, data=self.form_data)

    def getAnyPage(self, url):
        self.response = self.sess.get(url, headers=self.header)

url = "https://www.linkedin.com/"
likedin_page = Linkedin(url)
likedin_page.getSingInPage()
likedin_page.connecteToMyLinkdin()  # I'm connected but JavaScript is still loading
likedin_page.getAnyPage("https://www.linkedin.com/jobs/")
likedin_page.saveRsulteToHtmlFile()
I want help getting past the JavaScript loading without using Selenium...
Although it's technically possible to simulate all the calls from Python, with a dynamic page like LinkedIn I think it will be quite tedious and brittle.
Anyway, you'd open "developer tools" in your browser before you open LinkedIn and see what the traffic looks like. You can filter for the requests made from JavaScript (in Firefox, the filter is called XHR).
You would then simulate the necessary/interesting requests in your code. The benefit is that the servers usually return structured data to JavaScript, such as JSON, so you won't need to do as much HTML parsing (see the sketch after the links below).
If you find you're not progressing very much this way (it really depends on the particular site), then you will probably have to use Selenium or some alternative such as:
https://robotframework.org/
https://miyakogi.github.io/pyppeteer/ (port of Puppeteer to Python)
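For example, a minimal sketch of simulating one such request with requests (the endpoint, parameters, and headers here are hypothetical placeholders for whatever you actually see in the network tab):
import requests

session = requests.Session()

# Hypothetical XHR endpoint copied from the browser's network tab.
xhr_url = "https://www.example.com/api/jobs"
params = {"start": 0, "count": 25}        # hypothetical query parameters
headers = {"Accept": "application/json"}  # copy the real headers from the browser

response = session.get(xhr_url, params=params, headers=headers)
data = response.json()                    # structured JSON instead of HTML to parse
print(data)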
You should send all the XHR and JS requests manually [in the same session which you created during login]. Also, pass all the fields in request headers (copy from the network tools).
self.header_static = {
    'authority': 'static-exp2.licdn.com',
    'method': 'GET',
    'path': '/sc/h/c356usw7zystbud7v7l42pz0s',
    'scheme': 'https',
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-GB,en;q=0.9,en-US;q=0.8,hi;q=0.7,la;q=0.6',
    'cache-control': 'no-cache',
    'dnt': '1',
    'pragma': 'no-cache',
    'referer': 'https://www.linkedin.com/jobs/',
    'sec-fetch-mode': 'no-cors',
    'sec-fetch-site': 'cross-site',
    'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Mobile Safari/537.36'
}

def postConnectionRequests(self):
    urls = [
        "https://static-exp2.licdn.com/sc/h/62mb7ab7wm02esbh500ajmfuz",
        "https://static-exp2.licdn.com/sc/h/mpxhij2j03tw91bpplja3u9b",
        "https://static-exp2.licdn.com/sc/h/3nq91cp2wacq39jch2hz5p64y",
        "https://static-exp2.licdn.com/sc/h/emyc3b18e3q2ntnbncaha2qtp",
        "https://static-exp2.licdn.com/sc/h/9b0v30pbbvyf3rt7sbtiasuto",
        "https://static-exp2.licdn.com/sc/h/4ntg5zu4sqpdyaz1he02c441c",
        "https://static-exp2.licdn.com/sc/h/94cc69wyd1gxdiytujk4d5zm6",
        "https://static-exp2.licdn.com/sc/h/ck48xrmh3ctwna0w2y1hos0ln",
        "https://static-exp2.licdn.com/sc/h/c356usw7zystbud7v7l42pz0s",
    ]
    for url in urls:
        self.sess.get(url, headers=self.header_static)
        print("REQUEST SENT TO " + url)
I called the postConnectionRequests() function just before saving the HTML content, and received the complete page.
Hope this helps.
XHR requests are sent by JavaScript, and Python will not run JavaScript code when it fetches a page using requests and BeautifulSoup. Tools like Selenium load the page and run its JavaScript. You can also use headless browsers.
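For instance, a minimal sketch of the headless browser route with Selenium (assuming Chrome and a matching chromedriver are installed; the LinkedIn login steps themselves are omitted):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")           # run Chrome without opening a window
driver = webdriver.Chrome(options=options)   # assumes chromedriver is on your PATH

driver.get("https://www.linkedin.com/jobs/")
html = driver.page_source                    # the HTML after JavaScript has run
driver.quit()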

Outlook login script in Python 3: JavaScript not enabled - cannot log in

I am currently trying to log into an Outlook account using requests only, in Python. I did the same thing with Selenium before, but because of better performance and the ability to use proxies more easily, I would like to use requests now. The problem is that whenever I send a POST request to the Outlook post URL, it returns a page saying that it will not function without JavaScript.
I read on here that I would have to make the requests that the JavaScript would make, so I used the network analysis tool in Firefox and made requests to all the URLs that the browser made requests to. It still returns the same page, saying that JS is not enabled and it will not function without JS.
import json
import requests
from time import sleep

s = requests.Session()
s.headers.update({
    "Accept": "application/json",
    "Accept-Language": "en-US",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Host": "www.login.live.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
})
url = "https://login.live.com/"
r = s.get(url, allow_redirects=True)
#Simulate JS
sim0 = s.get("https://logincdn.msauth.net/16.000.28299.9
#...requests 1-8 here
sim9 = s.get("https://logincdn.msauth.net/16.000.28299.9/images/Backgrounds/0.jpg?x=a5dbd4393ff6a725c7e62b61df7e72f0", allow_redirects=True)
post_url = "https://login.live.com/ppsecure/post.srf?contextid=EFE1326315AF30F4&bk=1567009199&uaid=880f460b700f4da9b692953f54786e1c&pid=0"
payload = {"username": username}
sleep(4)
s.cookies.update({"logonLatency": "LGN01=637025992217600244"})
s.cookies.update({"MSCC": "1567001943"})
s.cookies.update({"CkTst": "G1567004031095"})
print(s.cookies.get_dict())
print ("---------------------------")
print(json.dumps(payload))
print ("---------------------------")
rp = s.post(post_url, data=json.dumps(payload), allow_redirects=True)
rg = s.get("https://logincdn.msauth.net/16.000.28299.9/images/arrow_left.svg?x=a9cc2824ef3517b6c4160dcf8ff7d410")
print (rp.text)
print (rp.status_code)
If anyone could point me in the right direction on how to fix this, I would highly appreciate it! Thanks in advance.

Python scraping a web page causes a JavaScript issue

I am using urllib2 to scrape a webpage. It works well on a lot of sites, but in some, I get:
<html><title>You are being redirected...</title>
<noscript>Javascript is required. Please enable javascript before you are allowed to see this page.</noscript>
Here's my code:
import urllib2

hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}

req = urllib2.Request(url=link, headers=hdr)
try:
    f = urllib2.urlopen(req)
except urllib2.HTTPError, e:
    return None
except urllib2.URLError, e:
    return None
mycontent = f.read()
Any ideas?
This is not a JavaScript issue; pay attention to the first line:
<html><title>You are being redirected...</title>
This means you have to follow the redirection until you reach the destination server.
urllib2.urlopen does not do that automatically; you need to call .geturl() on the result to see where you actually ended up, or use other functions or libraries that resolve redirects.
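A minimal sketch of that check, reusing link and hdr from the question's code (this only tells you where the request landed; a JavaScript-based redirect still won't be followed):
import urllib2

req = urllib2.Request(url=link, headers=hdr)
f = urllib2.urlopen(req)

# If the final URL differs from the one you requested,
# the server redirected you and you may need to fetch the new location.
print(link)
print(f.geturl())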

How to make two WinJS.xhr calls, one to retrieve the captcha from a website and another to submit data back

I'm making a Windows app using HTML5/JavaScript. What I actually want is to retrieve a captcha from a website and then submit the captcha, along with the other form fields, back to the website (.aspx) and get the response back. I think I have to handle cookies for this, but I do not know how.
A hundred salutes to the person who shows interest in this.
Here is what I did.
// Retrieving the captcha.
WinJS.xhr({
    url: "http://example.in/main.aspx",
    type: "get",
    responseType: "document",
    headers: {
        "CONTENT-TYPE": "application/x-www-form-urlencoded",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64; Trident/7.0; MSAppHost/2.0; rv:11.0) like Gecko",
        "CONNECTION": "keep-alive",
    },
}).then(
    function complete(xhr) {
        var image = document.createElement("img");
        image.src = xhr.response.querySelector("img[alt='Captcha']").src;
        document.getElementById("captcha").appendChild(image);
    });

// Submit the data.
document.getElementById("submit").onclick = function () {
    WinJS.xhr({
        url: "http://example.in/main.aspx/",
        type: "post",
        responseType: "document",
        headers: {
            "CONTENT-TYPE": "application/x-www-form-urlencoded",
            "User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64; Trident/7.0; MSAppHost/2.0; rv:11.0) like Gecko",
            "CONNECTION": "keep-alive",
        },
        data: alldata // alldata contains the username, password and captcha (entered by the user)
    }).then(
        function complete(xhr) {
            document.getElementById("response").innerHTML = toStaticHTML(xhr.response);
        });
};
I know that I have to handle cookies here, but I really have no idea how to do that.
The WinJS.xhr API is just a wrapper for the HTML XMLHttpRequest API, which unfortunately doesn't let you get to cookies and such.
You'll thus need to use the Windows.Web.Http.HttpClient API instead, which does allow you access to cookies and everything else.
Start with the Windows.Web.Http landing page for an overview, followed by How to connect to an HTTP server for the next level of details. Refer also to the HttpClient sample, specifically scenarios 8, 9, and 10, which deal with cookies.
I also cover the details of this API (and making HTTP requests in general) in Chapter 4 of my free ebook, Programming Windows Store Apps with HTML, CSS, and JavaScript, 2nd Edition.
That should get you going.
