Parsing img url with casperjs - javascript

I'm having an issue getting the image URL from a website that I am trying to scrape.
I am able to get all the text no problem with a snippet of code like this:
var cost = casper.fetchText('span.large');
However, when I attempt to get the image URL, I receive "undefined" in the console.
var img = casper.getHTML('.search-product-image').src;
.search-product-image is the image's class, and I simply want to get the image URL. Thank you.

Use this:
casper.getElementAttribute('.search-product-image','src');

You can include jQuery and take advantage of the syntactic sugar below, or do it the vanilla way. FYI: if getLinks contains an error and you call casper.evaluate(getLinks), it will return null, without indicating the error or the line it occurred on.
var casper = require('casper').create({
verbose: true,
logLevel: 'debug',
clientScripts: ["vendor/jquery.min.js", "vendor/lodash.js"]
});
...
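function getLinks() {
    // Scraping images: collect the src of every matching <img>
    var tempImagesArr = [];
    $("img.ExImg.ExResult-img").each(function() {
        tempImagesArr.push(this.src);
    });
    // Return the array so that evaluate() can hand it back to the Casper script
    return tempImagesArr;
}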
casper.run(function() {
var workouts = this.evaluate(getLinks);
this.saveJSON(workouts);
this.exit();
});
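
For the vanilla route mentioned above, a minimal sketch might look like the following. It assumes the same img.ExImg.ExResult-img selector; the doc parameter and the { error: ... } return shape are illustrative additions, not part of CasperJS. Wrapping the body in try/catch is one way to surface a failure that would otherwise come back as a bare null.

```javascript
// Vanilla version of getLinks, written in ES5 so it can run under PhantomJS.
// `doc` is a hypothetical parameter added so the function can be exercised
// outside a browser; when omitted, the page's global `document` is used.
function getLinks(doc) {
  try {
    doc = doc || document;
    var imgs = doc.querySelectorAll('img.ExImg.ExResult-img');
    // A NodeList is array-like, so borrow Array.prototype.map.
    return Array.prototype.map.call(imgs, function (img) {
      return img.src;
    });
  } catch (e) {
    // Return the message so the caller can see what went wrong.
    return { error: String(e) };
  }
}
```

Passed as this.evaluate(getLinks), this should hand back a plain array of URL strings, or an object with an error field if something threw inside the page.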

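I'm not familiar with how CasperJS works internally, but the docs are worth reading:
http://casperjs.readthedocs.org/en/latest/modules/casper.html#gethtml
getHTML returns the HTML of the matched element as a string, and a string has no src property, which is why you get undefined. You would have to extract the src value from that string yourself. Alternatively, you can use querySelector in the page context (for example, inside casper.evaluate).
Try this code:
var img = document.querySelector('.search-product-image').src;
querySelector returns only the first match. If several elements in your document match and you need a specific one, index into document.querySelectorAll('.search-product-image'), e.g. [0] for the first.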
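
To illustrate why .src on an HTML string is undefined, and what extracting it from the string would involve, here is a small sketch. extractSrc is a hypothetical helper and the sample markup is made up. Regex parsing of HTML is brittle (it would also match attributes such as data-src), so getElementAttribute or querySelector remain the safer options.

```javascript
// Hypothetical helper: pull the first src attribute value out of an HTML string.
// Returns null when no quoted src attribute is found.
function extractSrc(html) {
  var match = /\bsrc\s*=\s*["']([^"']*)["']/i.exec(html);
  return match ? match[1] : null;
}

// Made-up markup standing in for what an outer-HTML lookup might return.
var html = '<img class="search-product-image" src="http://example.com/a.png">';

console.log(typeof html.src);  // a plain string has no src property
console.log(extractSrc(html)); // the URL captured from the attribute
```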

Related

html2canvas on shopify error: Uncaught ReferenceError: onLoadStylesheet is not defined at HTMLLinkElement.onload

I am trying to use html2canvas on a Shopify product page to convert a div to an imgURL to set as a form value. Whenever I use html2canvas, I get the error:
Uncaught ReferenceError: onLoadStylesheet is not defined at HTMLLinkElement.onload
When I try to pinpoint the error, it just directs me to <!doctype html> highlighted with the message
Each dictionary in the list "icons" should contain a non-empty UTF8 string field "type".
This also prevents the form from posting for some reason.
html2canvas(document.querySelector("#container"), {useCORS: true, logging: false}).then(canvas => {
document.getElementById("imgURL").value = canvas.toDataURL();
});
How do I fix this?
I found the solution. If you are using ScriptTags and loading JavaScript from your own domain, you will get this error. The fix is to set cache=true when creating the ScriptTag; the JavaScript is then loaded from Shopify's CDN and html2canvas works.
Here is the code snippet in C# to create a script tag with caching:
dynamic scriptTagBody = new ExpandoObject();
scriptTagBody.script_tag = new ShopifyScriptTag()
{
Event = "onload",
Src = "https://example.com/script.js",
DisplayScope = "all",
Cache = true
};
HttpContent content = new JsonContent(scriptTagBody);
content.Headers.Add("X-Shopify-Access-Token", "MyAccessToken");
string url = "https://yourshop.myshopify.com/admin/api/2021-04/script_tags.json";
await _httpClient.PostAsync(url, content);
ShopifyScriptTag in the above snippet is a simple POCO (a plain C# class) with those properties.

How to scrape embedded JSON using PhantomJS

I need to get a particular piece of data from a JSON string encoded within a script tag within a returned HTML document using phantomjs. The HTML looks basically like this:
... [preamble html tags etc.]
....
<script id="ine-data" type="application/json">
{"userData": {"account_owner": "Grib"},
"skey":"b207ff1f8d5a394c2f7af1681ad3470c",
"location": "EU"
</script>
<script id="notification-data" type="application/json">
... [other stuff including html body]
What I need to get to is the value for skey within the JSON. I am unable to use the selectors to even get to the script. For instance,
page.open('https://www.site1.com/dash', function(status) {
var ine_data = document.querySelectorAll('script').item(0);
console.log(ine_data); phantom.exit();
});
This returns null. Can anyone point me in the right direction please?
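The PhantomJS function you're looking for is called page.evaluate (documentation). It lets you run JavaScript sandboxed within the JavaScript environment of the page itself. Code written directly in the page.open callback runs in the PhantomJS script context instead, so document there does not refer to the loaded page.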
So following your example:
page.open('https://www.site1.com/dash', function(status) {
var ske = page.evaluate(function() {
var json_text = document.querySelector("#ine-data").innerHTML,
json_values = JSON.parse(json_text);
return json_values.skey;
});
console.log(ske)
phantom.exit();
});
Though I'd note that the JSON in your example is invalid (missing a trailing }), so my example won't work without fixing that first!
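
As a rough sketch of how the parsing step could guard against that kind of malformed JSON, the function below wraps JSON.parse in try/catch. extractSkey is a hypothetical name, and the sample strings are based on the question's snippet with the closing brace added for the valid case.

```javascript
// Hypothetical helper: returns the skey value from a JSON string,
// or null if the text cannot be parsed as JSON.
function extractSkey(jsonText) {
  try {
    var data = JSON.parse(jsonText);
    return data.skey;
  } catch (e) {
    return null;
  }
}

// The question's JSON with the missing closing brace added.
var valid = '{"userData": {"account_owner": "Grib"}, ' +
            '"skey": "b207ff1f8d5a394c2f7af1681ad3470c", "location": "EU"}';
// The same text without the final brace, as in the question.
var truncated = valid.slice(0, -1);

console.log(extractSkey(valid));
console.log(extractSkey(truncated));
```

Inside page.evaluate, the same try/catch could wrap the JSON.parse call so that a parse failure yields a recognizable value rather than an unexplained null.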

Convert attribute into string

I know this is really basic JavaScript, but I'm not very familiar with it.
What I'm trying to do here is add prettyPhoto arguments where I want them. First I get the href attribute from the link, then I convert it to a string, then I take the last 4 characters to check whether it links to an image or to an HTML page. The code works fine, but Firebug still reports an error:
TypeError: $hrefy is undefined
txt = $hrefy.toString();
How can the script work if $hrefy is not defined, and how do I define it properly? This error only blocks the JavaScript code that filters my portfolio; the other JS works fine.
$(document).ready(function(){
$("a[data-rel^='prettyPhoto']").prettyPhoto();
$hrefy = $("article a").has('img').attr("href");
txt = $hrefy.toString();
var lastChar = txt.substr(txt.length - 4);
if (lastChar=='.jpg') {
$('article a').has('img').attr('data-rel', 'prettyPhoto');
}
$('a img').click(function () {
var desc = $(this).attr('title');
$('a').has('img').attr('title', desc);
});
});
After looking at the source of the page you linked, I noticed that there is no <article> element declared anywhere. So your jQuery selector matches nothing, and attr('href') returns undefined.
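
One way to avoid the TypeError on pages where the selector matches nothing is to check the value before calling string methods on it. The sketch below uses a hypothetical helper, isJpgLink, that keeps the question's exact '.jpg' comparison.

```javascript
// Hypothetical helper: true only when href is a string ending in ".jpg".
// attr('href') returns undefined when the selector matches no element,
// so the typeof check has to come before any string method call.
function isJpgLink(href) {
  return typeof href === 'string' && href.slice(-4) === '.jpg';
}

console.log(isJpgLink(undefined));         // no matching element
console.log(isJpgLink('images/work.jpg')); // link to an image
console.log(isJpgLink('about.html'));      // link to an HTML page
```

In the original handler, the result of $("article a").has('img').attr("href") could be passed to this check in place of the toString and substr calls.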

jquery.get(url) synchronization

OS X 10.6.8, Chrome 15.0.874.121
I'm experiencing an issue with JavaScript/jQuery: I want to download a header file from the base URL, add some more text to it, and then send it to the client. I'm using the following code:
var bb = new window.WebKitBlobBuilder;
$.get('js/header.txt', function(data) {
bb.append(data);
console.log("finished reading file");
});
console.log("just before getting the blob");
var blob = bb.getBlob('text/plain');
// append some more
saveAs(blob,"name.dxf");
But that fails because the file only finishes loading well after saveAs(blob) is executed. I know I can fix it with:
var bb = new window.WebKitBlobBuilder;
$.get('js/header.txt', function(data) {
bb.append(data);
//append some more
var blob = bb.getBlob('text/plain');
saveAs(blob,"name.dxf");
});
But that does not look very attractive: I only want to use the get call to append the header to the blob, and if I also want to read a footer from the file system, I have to nest a get inside a get and output the blob in the inner one.
Are there alternative ways to withhold the code after the get statement from executing until the whole file has been successfully loaded?
No.*
But, if you want it to look more attractive, try to describe semantically what you are trying to achieve and then write functions accordingly. Maybe:
function loadBlob (loadHeader, loadBody) {
loadHeader(loadBody);
}
loadBlob(function (oncomplete) {
$.get("js/header.txt", function(data) {
bb.append(data);
oncomplete();
});
}, function () {
var blob = bb.getBlob('text/plain');
// append some more
saveAs(blob,"name.dxf");
});
I don't know, is that more attractive? Personally, I find the original just fine, so maybe mine isn't any better, but the point is to use semantics.
* You could use setTimeout to poll and see if the response has been received. That's technically an alternative, but certainly not more attractive, is it?
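
For the header-plus-footer case raised in the question, the same callback idea can be generalized so the nesting does not grow with each file. This is only a sketch: loadInOrder and the fake loaders are hypothetical, and in real code each loader would wrap a $.get call and pass its data to the callback.

```javascript
// Hypothetical helper: run callback-style loaders one after another,
// collect what each one delivers, then call done(parts) once at the end.
function loadInOrder(loaders, done) {
  var parts = [];
  function next(i) {
    if (i === loaders.length) {
      done(parts);
      return;
    }
    loaders[i](function (data) {
      parts.push(data);
      next(i + 1); // start the next loader only after this one finished
    });
  }
  next(0);
}

// Fake loaders standing in for $.get('js/header.txt', ...) and a footer file.
function loadHeader(cb) { setTimeout(function () { cb('HEADER\n'); }, 0); }
function loadFooter(cb) { setTimeout(function () { cb('FOOTER\n'); }, 0); }

loadInOrder([loadHeader, loadFooter], function (parts) {
  // This is where the blob would be built and saved.
  console.log(parts.join(''));
});
```

The code after the loads still lives in a callback, so this does not change the answer above; it only keeps the structure flat as more files are added.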

Get relative path of the page url using javascript

In javascript, how can I get the relative path of the current url?
for example http://www.example.com/test/this?page=2
I want just the /test/this?page=2
Try
window.location.pathname+window.location.search
location.href
holds the url of the page your script is running in.
The quickest, most complete way:
location.href.replace(/(.+\w\/)(.+)/,"/$2");
location.href.replace(location.origin,'');
Only weird case:
http://foo.com/ >> "/"
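
As a point of comparison, the sketch below does the same job with the standard URL constructor, assuming an environment that provides it (modern browsers, Node.js). relativeUrl is a hypothetical helper. Note that the greedy (.+\w\/) group in the regex variant consumes everything up to the last slash, so for a multi-segment path such as /test/this?page=2 it keeps only the final segment.

```javascript
// Hypothetical helper: path, query string, and fragment of an absolute URL.
function relativeUrl(href) {
  var u = new URL(href);
  return u.pathname + u.search + u.hash;
}

var sample = 'http://www.example.com/test/this?page=2';

// Path plus query string, as asked for in the question.
console.log(relativeUrl(sample));

// The regex variant applied to the same URL, for comparison.
console.log(sample.replace(/(.+\w\/)(.+)/, '/$2'));
```

In a browser, relativeUrl(location.href) should give the same result as location.pathname + location.search + location.hash.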
You can use the below snippet to get the absolute url of any page.
var getAbsoluteUrl = (function() {
var a;
return function(url) {
if(!a) a = document.createElement('a');
a.href = url;
return a.href;
}
})();
// Sample Result based on the input.
getAbsoluteUrl('/'); //Returns http://stackoverflow.com/
Check out "get absolute URL using Javascript" for more details and multiple ways to achieve the same functionality.
I use this:
var absURL = document.URL;
alert(absURL);
Reference: http://www.w3schools.com/jsref/prop_doc_url.asp
To retrieve the complete URL of the page, including anything that follows the page name, you can use jQuery:
$(location).attr('href');
So a path like this can be retrieved too:
www.google.com/results#tab=2
