Pycurl javascript - javascript

I created a Python 3 script that runs a search on a search engine (DuckDuckGo), gets the HTML source code and writes it to a text file.
import pycurl
from io import BytesIO
buffer = BytesIO()
c = pycurl.Curl()
c.setopt(c.URL, 'https://duckduckgo.com/?q=test')
c.setopt(c.WRITEDATA, buffer)
c.setopt(c.FOLLOWLOCATION, True)
c.perform()
c.close()
body = buffer.getvalue()
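with open("output.htm", "w", encoding="iso-8859-1") as text_file:
    # Decode the bytes first; str(body) would write the b'...' repr instead of the HTML
    text_file.write(body.decode('iso-8859-1'))
print(body.decode('iso-8859-1'))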
That part of the code is working properly. However, when I try to open the output.htm file containing the HTML source code of the search engine, I don't get anything (I get an input with my search topic written inside). I would like to have the same HTML source code that I would get by running curl https://duckduckgo.com/?q=test on my terminal.

DuckDuckGo's HTML pages use JavaScript to load the search results into the markup, so curl or PycURL cannot return the same HTML content you would see in a browser: curl/PycURL merely fetches internet resources and does not run any JavaScript.
Instead of scraping, use https://duckduckgo.com/api to query their servers for results.
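As a rough sketch of the API route: DuckDuckGo's Instant Answer API is served from api.duckduckgo.com and returns JSON when format=json is passed. It returns instant answers (abstracts, related topics) rather than the full list of web results, so it may not replace the search page for every query. The helper names below are made up for illustration, and the AbstractText field mentioned afterwards is assumed to still be part of the response.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_ENDPOINT = "https://api.duckduckgo.com/"


def build_query_url(term):
    """Return the Instant Answer API URL for a search term (hypothetical helper)."""
    return API_ENDPOINT + "?" + urlencode({"q": term, "format": "json"})


def fetch_instant_answer(term, timeout=10):
    """Fetch and decode the JSON answer for a term; needs network access."""
    with urlopen(build_query_url(term), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling fetch_instant_answer("test").get("AbstractText", "") would then give the abstract text if there is one, and an empty string otherwise.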

Related

How to get all of a website’s js files and their urls [duplicate]

I want to scan some websites and get the names and contents of all their JavaScript files. I tried Python requests with BeautifulSoup but wasn't able to get the script details and contents. Am I missing something?
I have been trying a lot of methods, but I feel like I'm stumbling in the dark.
This is the code I am trying:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.marunadanmalayali.com/")
soup = BeautifulSoup(r.content, "html.parser")
You can get the URLs of all the linked JavaScript files with the code below:
l = [i.get('src') for i in soup.find_all('script') if i.get('src')]
soup.find_all('script') returns a list of all the <script> tags in the page.
A list comprehension is used here to loop over all the elements in the list returned by soup.find_all('script').
i is a dict-like object; .get('src') checks whether it has a src attribute. If not, the tag is ignored. Otherwise its src is put into a list (called l in the example).
The output, in this case, looks like this:
['http://adserver.adtech.de/addyn/3.0/1602/5506153/0/6490/ADTECH;loc=700;target=_blank;grp=[group]',
'http://tags.expo9.exponential.com/tags/MarunadanMalayalicom/ROS/tags.js',
'http://tags.expo9.exponential.com/tags/MarunadanMalayalicom/ROS/tags.js',
'http://js.genieessp.com/t/057/794/a1057794.js',
'http://ib.adnxs.com/ttj?id=5620689&cb=[CACHEBUSTER]&pubclick=[INSERT_CLICK_TAG]',
'http://ib.adnxs.com/ttj?id=5531763',
'http://advs.adgorithms.com/ttj?id=3279193&cb=[CACHEBUSTER]&pubclick=[INSERT_CLICK_TAG]',
'http://xp2.zedo.com/jsc/xp2/fo.js',
'http://www.marunadanmalayali.com/js/mnmads.js',
'http://www.marunadanmalayali.com/js/jquery-2.1.0.min.js',
'http://www.marunadanmalayali.com/js/jquery.hoverIntent.minified.js',
'http://www.marunadanmalayali.com/js/jquery.dcmegamenu.1.3.3.js',
'http://www.marunadanmalayali.com/js/jquery.cookie.js',
'http://www.marunadanmalayali.com/js/swanalekha-ml.js',
'http://www.marunadanmalayali.com/js/marunadan.js?r=1875',
'http://www.marunadanmalayali.com/js/taboola_home.js',
'http://d8.zedo.com/jsc/d8/fo.js']
My code missed some links because they are not actually in the HTML source: you can see them in the browser console, but not in the page source.
Usually, that's because these links were generated by JavaScript. The requests module doesn't run any JavaScript in the page like a real browser does; it only sends a request to get the HTML source.
If you also need them, you have to use another module to run the JavaScript in that page, and then you can see these links. For that, I'd suggest selenium, which drives a real browser and so can run the JavaScript in the page.
For example (make sure that you have already installed selenium and a web driver for it):
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome() # use Chrome driver for example
driver.get('http://www.marunadanmalayali.com/')
soup = BeautifulSoup(driver.page_source, "html.parser")
l = [i.get('src') for i in soup.find_all('script') if i.get('src')]
__import__('pprint').pprint(l)
You can use select with script[src], which only finds script tags that have a src, so you don't need to call .get multiple times:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.marunadanmalayali.com/")
soup = BeautifulSoup(r.content, "html.parser")
src = [sc["src"] for sc in soup.select("script[src]")]
You can also specify src=True with find_all to do the same:
src = [sc["src"] for sc in soup.find_all("script",src=True)]
Both give you the same output:
['http://tags.expo9.exponential.com/tags/MarunadanMalayalicom/ROS/tags.js', 'http://tags.expo9.exponential.com/tags/MarunadanMalayalicom/ROS/tags.js', 'http://js.genieessp.com/t/052/954/a1052954.js', '//s3-ap-northeast-1.amazonaws.com/tms-t/marunadanmalayali-7219.js', 'http://advs.adgorithms.com/ttj?id=3279193&cb=[CACHEBUSTER]&pubclick=[INSERT_CLICK_TAG]', 'http://www.marunadanmalayali.com/js/mnmcombined1.min.js', 'http://www.marunadanmalayali.com/js/mnmcombined2.min.js']
Also, if you use selenium, you can use it with PhantomJS for headless browsing. You don't need BeautifulSoup at all in that case; you can use the same CSS selector directly in selenium:
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get('http://www.marunadanmalayali.com/')
src = [sc.get_attribute("src") for sc in driver.find_elements_by_css_selector("script[src]")]
print(src)
Which gives you all the links:
[
u'https://pixel.yabidos.com/fltiu.js?qid=836373f5137373f5131353&cid=511&p=165&s=http%3a%2f%2fwww.marunadanmalayali.com%2f&x=admeta&nci=&adtg=96331&nai=', u'http://gum.criteo.com/sync?c=72&r=2&j=TRC.getRTUS', u'http://b.scorecardresearch.com/beacon.js', u'http://cdn.taboola.com/libtrc/impl.201-1-RELEASE.js', u'http://p165.atemda.com/JSAdservingMP.ashx?pc=1&pbId=165&clk=&exm=&jsv=1.84&tsv=2.26&cts=1459160775430&arp=0&fl=0&vitp=0&vit=&jscb=&url=&fp=0;400;300;20&oid=&exr=&mraid=&apid=&apbndl=&mpp=0&uid=&cb=54613943&pId0=64056124&rank0=1&gid0=64056124:1c59ac&pp0=&clk0=[External%20click-tracking%20goes%20here%20(NOT%20URL-encoded)]&rpos0=0&ecpm0=&ntv0=&ntl0=&adsid0=', u'http://cdn.taboola.com/libtrc/marunadanaalayali-network/loader.js', u'http://s.atemda.com/Admeta.js', u'http://www.google-analytics.com/analytics.js', u'http://tags.expo9.exponential.com/tags/MarunadanMalayalicom/ROS/tags.js', u'http://tags.expo9.exponential.com/tags/MarunadanMalayalicom/ROS/tags.js', u'http://js.genieessp.com/t/052/954/a1052954.js', u'http://s3-ap-northeast-1.amazonaws.com/tms-t/marunadanmalayali-7219.js', u'http://d8.zedo.com/jsc/d8/fo.js', u'http://z1.zedo.com/asw/fm/1185/7219/9/fm.js?c=7219&a=0&f=&n=1185&r=1&d=9&adm=&q=&$=&s=1936&l=%5BINSERT_CLICK_TRACKER_MACRO%5D&ct=&z=0.025054786819964647&tt=0&tz=0&pu=http%3A%2F%2Fwww.marunadanmalayali.com%2F&ru=&pi=1459160768626&ce=UTF-8&zpu=www.marunadanmalayali.com____1_&tpu=', u'http://cas.criteo.com/delivery/ajs.php?zoneid=308686&nodis=1&cb=38688817829&exclude=undefined&charset=UTF-8&loc=http%3A//www.marunadanmalayali.com/', u'http://ads.pubmatic.com/AdServer/js/showad.js', 
u'http://showads.pubmatic.com/AdServer/AdServerServlet?pubId=135167&siteId=135548&adId=600924&kadwidth=300&kadheight=250&SAVersion=2&js=1&kdntuid=1&pageURL=http%3A%2F%2Fwww.marunadanmalayali.com%2F&inIframe=0&kadpageurl=marunadanmalayali.com&operId=3&kltstamp=2016-3-28%2011%3A26%3A13&timezone=1&screenResolution=1024x768&ranreq=0.8869257988408208&pmUniAdId=0&adVisibility=2&adPosition=999x664', u'http://d8.zedo.com/jsc/d8/fo.js', u'http://z1.zedo.com/asw/fm/1185/7213/9/fm.js?c=7213&a=0&f=&n=1185&r=1&d=9&adm=&q=&$=&s=1948&l=%5BINSERT_CLICK_TRACKER_MACRO%5D&ct=&z=0.08655649935826659&tt=0&tz=0&pu=http%3A%2F%2Fwww.marunadanmalayali.com%2F&ru=&pi=1459160768626&ce=UTF-8&zpu=www.marunadanmalayali.com____1_&tpu=', u'http://advs.adgorithms.com/ttj?id=3279193&cb=[CACHEBUSTER]&pubclick=[INSERT_CLICK_TAG]', u'http://ib.adnxs.com/ttj?ttjb=1&bdc=1459160761&bdh=ZllBLkzcj2dGDVPeS0Sw_OTWjgQ.&tpuids=eyJ0cHVpZHMiOlt7InByb3ZpZGVyIjoiY3JpdGVvIiwidXNlcl9pZCI6Il9KRC1PUmhLX3hLczd1cUJhbjlwLU1KQ2VZbDQ2VVUxIn1dfQ==&view_iv=0&view_pos=664,2096&view_ws=400,300&view_vs=3&bdref=http%3A%2F%2Fwww.marunadanmalayali.com%2F&bdtop=true&bdifs=0&bstk=http%3A%2F%2Fwww.marunadanmalayali.com%2F&&id=3279193&cb=[CACHEBUSTER]&pubclick=[INSERT_CLICK_TAG]', u'http://www.marunadanmalayali.com/js/mnmcombined1.min.js', u'http://www.marunadanmalayali.com/js/mnmcombined2.min.js', u'http://pixel.yabidos.com/iftfl.js?ver=1.4.2&qid=836373f5137373f5131353&cid=511&p=165&s=http%3a%2f%2fwww.marunadanmalayali.com%2f&x=admeta&adtg=96331&nci=&nai=&nsi=&cstm1=&cstm2=&cstm3=&kqt=&xc=&test=&od1=&od2=&co=0&tps=34&rnd=3m17uji8ftbf']
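The question also asks for the content of each script, while the answers stop at the src values. A minimal sketch of the remaining step, assuming the src list has already been collected as above: values taken from the HTML source may be relative or scheme-relative (such as the '//s3-ap-northeast-1...' entry), so they are resolved against the page URL first. The function names are hypothetical, and decoding every script as UTF-8 is an assumption.

```python
from urllib.parse import urljoin
from urllib.request import urlopen


def absolute_script_urls(page_url, srcs):
    """Resolve each src against the page URL, dropping duplicates but keeping order."""
    seen = []
    for src in srcs:
        full = urljoin(page_url, src)
        if full not in seen:
            seen.append(full)
    return seen


def fetch_scripts(urls, timeout=10):
    """Download each script and return {url: text}; needs network access."""
    contents = {}
    for url in urls:
        with urlopen(url, timeout=timeout) as response:
            # Assumption: scripts are UTF-8; undecodable bytes are replaced.
            contents[url] = response.read().decode("utf-8", errors="replace")
    return contents
```

For example, fetch_scripts(absolute_script_urls("http://www.marunadanmalayali.com/", src)) would map each script URL to its text. Inline scripts without a src are not covered by this sketch.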

Accessing a local file using HTML/javascript

I am trying to access local files. The method works with Firefox (and, to my surprise, Edge) but not Chrome.
The files in question are 2 HTML files, each containing a huge table that is used as a database. The tables are basic tables (table, a tbody for each group, tr's, and td's with data).
The method I am using is to load the HTML files into 2 hidden iframes and then access the tables inside. HTML file 1 is a master spell list and HTML file 2 is a similar file for a pencil-and-paper RPG. This works beautifully in Firefox: the tables are read into memory, the selects/options are all loaded up, and the popups and page modifications work (showing the results of what you selected, modifying the in-memory versions of the tables as needed). The generated customization function works too: if that file exists at load time it automatically updates the in-memory versions of the tables, and if the tables are modified the user is shown the function and can copy it and save it to the local file system with a text editor.
But Chrome is a different matter. I can load the files in the iframes, but can't access the tables within. It throws an error about cross-server access even though all the files are in the same directory (the master HTML file, the functions.js file, the 2 table files and, if generated and saved by the user, customization.js).
So my question is: is there a way to load/import/access a second or third html file in the main html that will work in FF, Chrome, Edge, and most other modern browsers without changing any security settings?
I would love something as simple as the way js and iframe files can be loaded and accessed. Can XMLHttpRequest work on local files (I could load and render the tables)?
I would like to share the files with the other players, but I can't assume browser choices or security settings, and some may not be technically minded enough to make, or want to make, such changes.
PS: I am not looking to write any files back to the file system; the user is the only one with those options.
OK, the other methods (using new tag attributes) failed, so I am looking into a way to hijack the tag and use JSON.
Another user here posted this code (I have cleaned it up so it is easier to read, and added the part that was suggested but not included: initializing rowIx and its incrementer):
function getTable() {
    var obj = {};
    var jsonObj = {};
    var rowIx = 0;
    // this assumes only one table in the whole page, and the table has column headers
    var headers = document.getElementsByTagName('th');
    var cells = document.getElementsByTagName('td');
    var thNum = headers.length;
    for (var i = 0; i < cells.length; i++) {
        if (i % thNum === 0) {
            // first cell of a row: start a new row object and register it
            obj = {};
            jsonObj[rowIx++] = obj;
        }
        // key each cell by the text of its column header
        obj[headers[i % thNum].innerHTML] = cells[i].innerHTML;
    }
    return JSON.stringify({"Values": jsonObj});
}
The caller then displays the returned value (in a p tag using .innerText, since .innerHTML tries to render the data and there are p and br tags in some of the table cells) so it can be copied, pasted and saved in a separate .js file.
Testing the JSON.parse function in the original HTML (the file containing the table I want to import elsewhere later) works just fine, although access is not like the original: array.Values[x].property instead of array.rows[x].cells[y].innerHTML. I can work with that.
format:
{"Values":{"numeric index":{7 key/value pairings},{pattern repeated 122 more times}}}
But when the data is placed in a separate js file, it won't parse back to the original data (the error shows up when the developer options/web console is activated, see below).
source HTML file (has the table database, generates the JSON data for copy/paste/save)
large Table (style="display:none;" which hides it, 123 rows by 7 cells each)
the above function getTable
var test1 = getTable()
update p tag using .innerText for copying with test1 data
var schematics = JSON.parse(test1)
alert(schematics.Values[0].Name)
(all of that works)
js File contents (schematics.json.js)
var schematics = JSON.parse( copy/pasted data goes here );
html file
<script language="javascript" src="schematics.json.js"></script>
<script language="javascript">
alert(schematics.Values[0].name); //data restored test
function rebuildTable(){
    //use schematics data to rebuild hidden table
}
</script>
<script language="javascript" src="_functions.js"></script>
all other code is in the last script tag
Web Console, reported error
unexpected character at line 1 column 2 of the JSON data
So what am I doing wrong with the JSON containing js file or secondary HTML page?
This is a difference in the security models and choices of Firefox (and Edge) vs the more strict Chrome. You could argue for the utility vs security of either approach.
To make this work with Chrome the way the other two browsers do, you'll need to disable that security measure with a command line flag when you start Chrome:
> chrome --allow-file-access-from-files
The other alternative is to run a local webserver (e.g. WAMP or XAMPP) and load your files via http://localhost/.
Ok, found a way that works.
In the two webpages that function as my databases, I added code to read the tables into 2-dimensional arrays (rows by cells).
These are then passed through JSON.stringify, and "var variableName = " is tacked onto the beginning of the returned strings. All of this is then added to a p tag (using .innerText, since there is also HTML code in the JSON data and rendering it is not desired).
The presented data is then copied and saved using a plain text editor in a JSON.variableName.js file (the JSON in the name is to remind me what's in the file). Loading it is as easy as loading javascript code using a script tag with src="".
Also, now everything works in Firefox, Edge, and Chrome. I don't have Safari or other browsers. Bonus for me, it works in Android Firefox as well.
The two database webpages can be easily updated and they will generate the new JSON data output.
All in all, there are 6 base files: the main webpage, functions.js, two JSON variable js files, and the 2 database/JSON generator webpages. All local, and no additional webserver needed.
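The same file format can also be produced outside the browser. A minimal sketch in Python, assuming the table rows are already available as a list of lists; the helper names and the sample variable name are hypothetical and not from the post. The point it illustrates is that the file holds a plain JavaScript assignment, so a script tag with src can load it without JSON.parse and without any cross-file access from the page.

```python
import json


def to_js_variable(name, rows):
    """Return JavaScript source that assigns the rows to a global variable."""
    # JSON text is valid as a JavaScript literal, so no JSON.parse call is needed.
    return "var " + name + " = " + json.dumps(rows) + ";"


def write_js_data_file(path, name, rows):
    """Write the assignment to a .js file that a script tag can load."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(to_js_variable(name, rows) + "\n")
```

For instance, write_js_data_file("JSON.spells.js", "spells", rows) would create a file that defines spells once the page includes it.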

How to read and write to a file (Javascript) in ui automation?

I want to identify a few properties during my run and build a JSON object from them, which I would like to write to a ".json" file and save on disk.
var target = UIATarget.localTarget();
var properties = new Object();
var jsonObjectToRecord = {"properties":properties}
jsonObjectToRecord.properties.name = "My App"
UIALogger.logMessage("Pretty Print TEST Log"+jsonObjectToRecord.properties.name);
var str = JSON.stringify(jsonObjectToRecord)
UIALogger.logMessage(str);
// -- CODE TO WRITE THIS JSON TO A FILE AND SAVE ON THE DISK --
I tried :
// Sample code to see if it is possible to write data
// onto some file from my automation script
function WriteToFile()
{
set fso = CreateObject("Scripting.FileSystemObject");
set s = fso.CreateTextFile("/Volumes/DEV/test.txt", True);
s.writeline("HI");
s.writeline("Bye");
s.writeline("-----------------------------");
s.Close();
}
AND
function WriteFile()
{
// Create an instance of StreamWriter to write text to a file.
sw = new StreamWriter("TestFile.txt");
// Add some text to the file.
sw.Write("This is the ");
sw.WriteLine("header for the file.");
sw.WriteLine("-------------------");
// Arbitrary objects can also be written to the file.
sw.Write("The date is: ");
sw.WriteLine(DateTime.Now);
sw.Close();
}
But I am still unable to read and write data to a file from UI Automation in Instruments.
Possible workaround?
Redirect to stdout, if a terminal command can be executed from my UI Automation script. So can we execute a terminal command from the script?
Haven't tried:
1. Assuming we can include the library that has those methods, give that a try.
Your assumptions are good, but the Xcode UI Automation script is not full JavaScript.
I don't think you can simply write normal browser-based JavaScript in an Xcode UI Automation script.
set fso = CreateObject("Scripting.FileSystemObject");
is not JavaScript; it is VBScript, which only works on Microsoft platforms and in testing tools like QTP.
Scripting.FileSystemObject
is an ActiveX object which only exists on Microsoft Windows.
Only a few JavaScript features, such as basic Math, Array, etc., are provided by the Apple JavaScript library, so you are limited to the classes provided here: https://developer.apple.com/library/ios/documentation/DeveloperTools/Reference/UIAutomationRef/
If you want to do more scripting, try the Selenium iOS driver: http://ios-driver.github.io/ios-driver/
Hey, so this is something that I was looking into for a project but never fully got around to implementing, so this answer will be more of a guide to what to do than a step-by-step copy and paste.
First you're going to need to create a bash script that writes to a file. This can be as simple as
#!/bin/bash
echo "$1" >> filename.json
Then you call this from inside your Xcode Instruments UIAutomation tool with
var target = UIATarget.localTarget();
var host = target.host();
var result = host.performTaskWithPathArgumentsTimeout("your/script/path", ["Object description in JSON format"], 5);
Then after your automation ends you can load up the file path on your computer to look at the results.
EDIT: This will let you write to a file line by line, but the actual JSON formatting will be up to you. Looking at some examples, I don't think it would be difficult to implement, but obviously you'll need to give it some thought first.
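If the helper script is written in Python rather than bash, it can also check the JSON before writing it. The following is only a sketch under assumptions: the function name is hypothetical, and it expects both an output path and the JSON text, so the arguments array passed from the automation script would need two entries instead of one.

```python
import json


def append_record(path, text):
    """Validate text as JSON and append it to path as a single line."""
    record = json.loads(text)  # raises ValueError if text is not valid JSON
    with open(path, "a", encoding="utf-8") as handle:
        # One JSON document per line keeps the file easy to read back later.
        handle.write(json.dumps(record) + "\n")
```

Saved as an executable script that calls append_record(sys.argv[1], sys.argv[2]), it would take the place of the bash script above; each run adds one line, and the file can be read back with json.loads line by line.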

Python: Is there a way to get HTML that was dynamically created by Javascript?

As far as I can tell, this is the case for LyricWikia. The lyrics (example) can be accessed from the browser, but can't be found in the source code (which can be opened with CTRL + U in most browsers) or by reading the contents of the site with Python:
from urllib.request import urlopen
URL = 'http://lyrics.wikia.com/Billy_Joel:Piano_Man'
r = urlopen(URL).read().decode('utf-8')
And the test:
>>> 'Now John at the bar is a friend of mine' in r
False
>>> 'John' in r
False
But when you select and look at the source code of the box in which the lyrics are displayed, you can see that there is: <div class="lyricbox">[...]</div>
Is there a way to get the contents of that div-element with Python?
You can try Ghost.py, which is essentially Phantom.js for Python. It embeds WebKit and is thus able to execute the JavaScript on the page as if you had navigated to the page manually. It then gives you access to the DOM structure.
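Once a tool that runs the page's JavaScript has produced the rendered HTML, the div itself can be pulled out with the standard library. This is a minimal sketch, assuming rendered_html is that rendered markup (for example selenium's driver.page_source); the class and function names are hypothetical. Run against the raw urlopen result from the question it would simply return an empty string, because the div is not there.

```python
from html.parser import HTMLParser


class LyricBoxParser(HTMLParser):
    """Collect the text inside the first <div class="lyricbox"> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth of <div> tags inside the lyric box
        self.done = False   # set once the lyric box has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            if tag == "div":
                self.depth += 1
            elif tag == "br":
                self.parts.append("\n")  # line breaks separate the lyric lines
        elif tag == "div" and "lyricbox" in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_lyrics(rendered_html):
    """Return the text of the lyric box, or an empty string if it is absent."""
    parser = LyricBoxParser()
    parser.feed(rendered_html)
    return "".join(parser.parts).strip()
```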

Is there a way to AutoFormat (Javascript) code in TestComplete?

So, similar to Alt-Shift-F in NetBeans, is there a way to do this right in the IDE in TestComplete? I'm not sure if this is possible, or if anyone can think of a workaround to auto-format without leaving the TestComplete window.
I'm trying to get the below solution to work with http://jsbeautifier.org/ for JavaScript / JScript code in TestComplete.
Thanks
Great question!
There is no built-in function for that. So, we should not expect any solution to be 100% convenient - it is just not a simple task to modify the current script editor contents (if at all possible). So, whatever you do, it will still be some kind of compromise.
In general, the task is three-fold:
1. Get the current unit code.
2. Format the code.
3. Put the code back to the unit.
According to my understanding, items 1 and 3 can be accomplished only by creating a TestComplete plug-in - accessing editors for project nodes is not an easy thing.
UPDATE: silly me! There is a way to access the script editor code - I've updated the below part.
What will help us avoid switching to a different app, are the Script Extensions:
We create a custom Checkpoint in the form of a Script Extension, and install it to TestComplete. As a result, we get a button on the toolbar that we can click to invoke our code.
In the design time action, we call some code that reads the editor contents, then uses external code formatting functionality, and replaces the editor contents with the formatted code.
It would be extremely interesting to see the implementations other TestComplete users can suggest! As a start, I am posting a solution that uses an external web site to format VBScript code (http://www.vbindent.com/). I know that the starter of the post is probably using JScript, but I have not found a JScript formatter yet.
My solution is a simple Script Extension. I can't post a file here, so I will post the code of the two Script Extension files:
Description file:
<?xml version="1.0" encoding="UTF-8"?>
<!-- Description.xml -->
<ScriptExtensionGroup>
    <Category Name="Checkpoints">
        <ScriptExtension Name="VBScript Code Indent" Author="SmartBear Software" Version="0.1" HomePage="smartbear.com">
            <Script Name="VBIndent.js">
                <DesignTimeAction Name="Indent Current VBScript Unit" Routine="DesignTimeExecute"/>
            </Script>
            <Description>
                Indents VBScript code in the currently active unit.
            </Description>
        </ScriptExtension>
    </Category>
</ScriptExtensionGroup>
Code file:
// VBIndent.js
function DesignTimeExecute()
{
    if (CodeEditor.IsEditorActive)
    {
        var newCode = IndentVBSCode_Through_VBIndent(CodeEditor.Text);
        if (null == newCode)
            return;
        CodeEditor.Text = newCode;
    }
}

function IndentVBSCode_Through_VBIndent(codeToIndent)
{
    var URL_VBIndent = "http://www.vbindent.com/?indent";
    var httpObj = Sys.OleObject("MSXML2.XMLHTTP");
    // Synchronous POST of the unit's code as form data
    httpObj.open("POST", URL_VBIndent, false);
    httpObj.setRequestHeader("Content-Type", "application/x-www-form-urlencoded");
    // encodeURIComponent also encodes "+", which escape() leaves as-is and a form decoder reads as a space
    httpObj.send("thecode=" + encodeURIComponent(codeToIndent));
    var responseText = httpObj.responseText;
    // Extract the indented code from the response
    var rx = /<textarea name=\"thecode\".*?>((.*\n)*?)<\/textarea>/;
    var matches = rx.exec(responseText);
    if (null == matches)
    {
        return null;
    }
    return matches[1];
}
After you create these files, put them in something like "\Bin\Extensions\ScriptExtensions\VBIndent", and click "File | Install Script Extensions | Reload", you will see a new "Indent Current VBScript Unit" item in the custom checkpoints drop-down button on the Tools toolbar. Clicking the item will format the VBScript code in the currently active editor.
So, this is to give a clear idea of what a solution can look like. Better suggestions are welcome! Share your thoughts!
FYI
I've done it, based on your posts.
JSFormat.tcx
https://drive.google.com/uc?export=download&id=0B1x_73bHRc2Jcm8wbTJ2dUpZQTQ
To install the extension, copy the attached file JSFormat.tcx to C:\Program Files (x86)\SmartBear\TestComplete 10\Bin\Extensions\ScriptExtensions
To use it, see the following image:
https://drive.google.com/uc?export=download&id=0B1x_73bHRc2Jc3RuLXFpTnlCSnc
Regards
