I'm trying to extract train connections from the Deutsche Bahn website www.reiseauskunft.de and display them on an info display (which just shows a simple HTML page with some JavaScript).
So I want to put this info (the next available connections) into my HTML page.
Deutsche Bahn provides a "kind of" API, or at least something that looks like an API:
www.reiseauskunft.bahn.de/bin/query.exe/dn?S=MainzHbf&Z=Frankfurt(Main)Hbf&timeSel=depart&start=1
This link works and delivers a full web page with the next three connections from the (S)tart station to the (Z) target station, using the current time as the departure time (the start=1 parameter just executes the request).
You can find more info about the parameters here (German only):
www.geiervally.lechtal.at/sixcms/media.php/1405/Parametrisierte%20%DCbergabe%20Bahnauskunft(V%205.12-R4.30c,%20f%FCr.pdf
Because the YQL html table seems to be no longer supported, I found the advice to use htmlstring instead (YQL: html table is no longer supported).
I changed the example to my needs:
var site = "http://www.reiseauskunft.bahn.de/bin/query.exe/dn?S=MainzHbf&Z=Frankfurt(Main)Hbf&timeSel=depart&start=1";
var yql = "select * from htmlstring where url='" + site;
var resturl = "http://query.yahooapis.com/v1/public/yql?q=" + encodeURIComponent(yql) + "&format=json&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys";
but got this error in the browser:
"Query syntax error(s) [line 1:140 mismatched character ' ' expecting ''']"
(At that position I can't find a space, and I wouldn't put a ' there either.)
In the YQL console I ran the following:
select * from htmlstring where url='http://www.reiseauskunft.bahn.de/bin/query.exe/dn?S=MainzHbf&Z=Frankfurt(Main)Hbf&timeSel=depart&start=1'
and there I got: Exception: Redirected to a robots.txt restricted URL.
Do both messages mean the same thing? Or can I bypass the robots.txt restriction (does the YQL function act like a robot toward reiseauskunft.de)?
Is there any chance to retrieve the train connections with YQL?
Thanks in advance
Edit: it seems my approach with YQL will not work, so I will try another approach. Question closed?!
I have a link here: https://fantasy.espn.com/football/players/add?leagueId=1589782588 and I've been wanting to pull data from it. In the developer console I typed:
let players = document.getElementsByClassName("AnchorLink link clr-link pointer")
players[0].text
and it works perfectly. How can I get this to work in my IDE?
Disclaimer: the following is for teaching purpose only and should not be abused.
Use a public API if provided by the website owner.
Investigate what happens to the request by using the Network tab.
You'll notice that a request is made to a URI https://site.api.espn.com/apis/..... which ends in something like ....ffl/news/players?days=30&playerId=2576623.
If you click that link you'll go directly to a page that serves an API response as JSON.
Inspect the entire website again (Ctrl + Shift + F) and look for that player ID, 2576623. You'll notice it is stored inside each player's image URI. So let's collect all those IDs.
Open the Dev Tools Console and run:
var _i = document.querySelectorAll("tbody .player__column img[src*='full/']");
console.log(_i);
Now that you have your image elements it's time to collect all the IDs:
var _ids = [..._i].map(el => el.src.match(/(?<=full\/)[^\.]+(?=\.)/)[0]);
console.log(_ids)
From this point on, you can use any server-side script (or even JS, if there's no cross-origin limitation) to fetch that JSON data.
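Once the IDs are collected, the per-player request URLs can be built with a template string. A minimal sketch; the base path is partly elided in the network trace above, so `API_PATH` below is a placeholder you'd replace with the exact path seen in your own Network tab:

```javascript
// Stand-ins for the _ids array collected above.
const ids = ["2576623", "15847"];

// Placeholder base: copy the exact ".../apis/..." path from your own Network
// tab; only the observed suffix (ffl/news/players) and the query parameters
// (days, playerId) are taken from the request inspected above.
const API_PATH = "https://site.api.espn.com/apis/<path-from-network-tab>/ffl/news/players";

// One request URL per player ID, mirroring the observed query string.
const urls = ids.map(id => `${API_PATH}?days=30&playerId=${id}`);

console.log(urls);
```

Each of these URLs can then be fetched server-side (or client-side, cross-origin policy permitting) to get the JSON response.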
I'm using XML to create a framework for a website so that the content can be easily modified by another party without heavy HTML work. Right now I'm trying to align my h3 headers with their respective paragraph elements, but each header ends up being assigned all of the elements, when the first header should get the first element and the second header the second element.
My XML is something like this
<steps>
<step number="1">
<title>Domain Name</title>
<description>Before getting the server setup, we have to make sure that we have a name visible to the public. It would be pretty outrageous to ask users to remember our site based on the IP address (E.g. 52.25.195.213). IP addresses are hard to remember and mean virtually nothing to humans - which is why we need a nice plain text name to be remembered by.
</description>
<subtitle>Getting Logged Into Amazon Web Services</subtitle>
<subdescription>As I mentioned in the overview to this guide, we will be using Amazon Web Services for the first 5 steps and because of that, you will need to create an account on their site. Again if you have a RHEL server already, please proceed to step 5 or 6 depending on whether or not you know how to connect to your server. If you are new to Amazon Web Services, hosting a micro server will cost you virtually nothing, but you will still need to provide your credit card information to prove that AWS can bill you should you incur charges (For the purpose of this guide, we will only be setting up one micro server, and as such you will be billed the minimal amount even if your server runs 24/7). After your first year of operation you will not be covered under the free tier, so please refer to this page to get the most up-to-date pricing information. Now that we've covered the not so exciting topic of cost, let's log in to AWS and get started!
</subdescription>
<subtitle>Picking Your Server Farm</subtitle>
<subdescription>Upon logging in you should see your dashboard which contains lots of icons. If your layout is somehow different because you are a new user, don't panic. This just means that you will have to do a bit of searching to find the services I mention. To start, let's designate a desired server farm. This decision is not particularly important to average web users, as they will access your site from all over the world. However, for you the admin, the server farm should be as close to you as possible to avoid unnecessary latency. To select a server farm, go to the drop-down next to your name, and pick a location (for me North Virginia is the best pick because I live closer to the east coast).
</subdescription>
</step>
</steps>
And the for-loops, which are inside a script tag in the HTML file, look like this:
html += "<h2>" + title + "</h2>";
html += "<hr/>";
html += "<p>" + description + "</p>";
for (var j = 0; j < subtitles.length; j++) {
for (var k = 0; k < subdescriptions.length; k++) {
html += "<h3>" + subtitles[j].innerHTML + "</h3>";
html += "<p>" + subdescriptions[k].innerHTML + "</p>";
}
}
In the resulting page, subtitles[0] is displayed with every subdescription in the array.
EDIT:
@MikeMcCaughan I didn't initially include this part in my question, but subtitles and subdescriptions are assigned through these lines:
var allSteps = xml.querySelectorAll('step');
for(var i = 0; i < allSteps.length; i++){
var step = allSteps[i];
var subtitles = step.querySelectorAll('subtitle');
var subdescriptions = step.querySelectorAll('subdescription');
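For what it's worth, the nesting itself is what pairs every subtitle with every subdescription: the inner loop runs in full for each pass of the outer one. A single loop that walks both lists with one shared index gives the one-to-one pairing. A minimal sketch, using plain arrays to stand in for the DOM NodeLists:

```javascript
// Plain-array stand-ins for subtitles[j].innerHTML / subdescriptions[k].innerHTML.
const subtitles = ["First subtitle", "Second subtitle"];
const subdescriptions = ["First description", "Second description"];

let html = "";
// One shared index pairs subtitle i with subdescription i.
const count = Math.min(subtitles.length, subdescriptions.length);
for (let i = 0; i < count; i++) {
  html += "<h3>" + subtitles[i] + "</h3>";
  html += "<p>" + subdescriptions[i] + "</p>";
}

console.log(html);
```

With the real NodeLists, the same single loop over `Math.min(subtitles.length, subdescriptions.length)` replaces the two nested loops.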
I'm trying to grab the highest-quality image for each item that a GetSellerList request returns. The HQ images can be viewed manually by clicking the image on a product page (so I know they exist).
Unfortunately, the request only returns medium-sized images. I've googled and googled, only to find lots of mentions of SelectorOutput, which can only be used in the Finding API and is completely irrelevant to what I'm trying to do.
Here's my XML input (note that my auth is taken care of by a JS library I'm using):
var xml = '<?xml version="1.0" encoding="UTF-8"?>' +
'<GetSellerListRequest xmlns="urn:ebay:apis:eBLBaseComponents">' +
'<RequesterCredentials>' +
'<eBayAuthToken><!-- my eBayAuthToken --></eBayAuthToken>' +
'</RequesterCredentials>' +
'<UserID>samurai-gardens</UserID>' +
'<StartTimeFrom>2016-01-01T23:35:27.000Z</StartTimeFrom>' +
'<StartTimeTo>2016-02-01T23:35:27.000Z</StartTimeTo>' +
'<DetailLevel>ItemReturnDescription</DetailLevel>' +
'<Pagination ComplexType="PaginationType">' +
'<EntriesPerPage>10</EntriesPerPage>' +
'<PageNumber>1</PageNumber>' +
'</Pagination>' +
'</GetSellerListRequest>';
I am getting the correct output; I just don't see how I can pull the large images with this. Thanks, eBay, for a super frustrating API!
Just to clarify the comments posted on this question:
To obtain a high-resolution image associated with a product listing, perform the following.
Use a GetSellerList request, formatted as below, to obtain the picture URL from the item details:
var xml = '<?xml version="1.0" encoding="UTF-8"?>' +
'<GetSellerListRequest xmlns="urn:ebay:apis:eBLBaseComponents">' +
'<RequesterCredentials>' +
'<eBayAuthToken><!-- my eBayAuthToken --></eBayAuthToken>' +
'</RequesterCredentials>' +
'<UserID>samurai-gardens</UserID>' +
'<StartTimeFrom>2016-01-01T23:35:27.000Z</StartTimeFrom>' +
'<StartTimeTo>2016-02-01T23:35:27.000Z</StartTimeTo>' +
'<DetailLevel>ItemReturnDescription</DetailLevel>' +
'<Pagination ComplexType="PaginationType">' +
'<EntriesPerPage>10</EntriesPerPage>' +
'<PageNumber>1</PageNumber>' +
'</Pagination>' +
'</GetSellerListRequest>';
This should produce a URL such as the following:
i.ebayimg.com/00/s/ODAwWDYyOQ==/z/3eEAAOSwSdZWdJRL/$_1.JPG?set_id=880000500F
Once this URL is obtained, it needs to be modified to point to the high-resolution image option. Through trial and error, this appears to be either .JPG image 3 or 57 (and possibly others); each image has a different alignment, which is why there are multiple 'high resolution' options. Modify the returned URL using standard string manipulation techniques to obtain the following:
i.ebayimg.com/00/s/ODAwWDYyOQ==/z/3eEAAOSwSdZWdJRL/$_3.JPG
This could be done as follows (in C#). The snippet below was not tested, so make sure there isn't an off-by-one bug in the Substring call. There are myriad ways of doing this; this just happens to be what I thought of.
string url = "i.ebayimg.com/00/s/ODAwWDYyOQ==/z/3eEAAOSwSdZWdJRL/$_1.JPG?set_id=880000500F";
int index = url.LastIndexOf('/');
url = url.Substring(0, index + 1);
url = url + "$_3.JPG";
If you are using C# (I realize the original post was for JavaScript), you can use the information in the following thread to obtain an image stream from the URL: URL to Stream
Here is a post on displaying an image from a URL using JavaScript: URL To Display Javascript
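Since the original post was for JavaScript, here is the same URL rewrite as a JS sketch (the `$_3` variant number is the trial-and-error guess from this answer, not a documented eBay parameter):

```javascript
// The medium-size URL as returned by GetSellerList (example value from above).
const url = "i.ebayimg.com/00/s/ODAwWDYyOQ==/z/3eEAAOSwSdZWdJRL/$_1.JPG?set_id=880000500F";

// Keep everything up to and including the last "/", then append the HQ file name.
const hqUrl = url.slice(0, url.lastIndexOf("/") + 1) + "$_3.JPG";

console.log(hqUrl); // i.ebayimg.com/00/s/ODAwWDYyOQ==/z/3eEAAOSwSdZWdJRL/$_3.JPG
```

This drops the `set_id` query string along with the `$_1.JPG` file name, matching the target URL shown above.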
I have a problem.
I have a list of SKU numbers (hundreds) that I'm trying to match with the titles of the products they belong to. I've thought of a few ways to accomplish this, but I feel like I'm missing something. I'm hoping someone here has a quick and efficient idea to help me get this done.
The products come from Aidan Gray.
Attempt #1 (Batch Program Method) - FAIL:
After searching for a SKU in Aidan Gray, the website returns a URL that looks like below:
http://www.aidangrayhome.com/catalogsearch/result/?q=SKUNUMBER
... with "SKUNUMBER" obviously being a SKU.
The first result of the webpage is almost always the product.
To click the first result (through the address bar), the following can be entered (if JavaScript is enabled in the address bar):
javascript:{document.getElementsByClassName("product-image")[0].click();}
I wanted to create a .bat file through Command Prompt and execute the following command:
firefox http://www.aidangrayhome.com/catalogsearch/result/?q=SKUNUMBER javascript:{document.getElementsByClassName("product-image")[0].click();}
... but Firefox doesn't seem to allow these two commands to execute in the same tab.
If that worked, I was going to go to http://tools.buzzstream.com/meta-tag-extractor, paste the resulting links to get the titles of the pages, and export the data to CSV format, and copy over the data I wanted.
Unfortunately, I am unable to open both the webpage and the Javascript in the same tab through a batch program.
Attempt #2 (I'm Feeling Lucky Method):
I was going to use Google's &btnI URL suffix to automatically redirect to the first result.
http://www.google.com/search?btnI&q=site:aidangrayhome.com+SKUNUMBER
After opening all the links in tabs, I was going to use a Firefox add-on called "Send Tab URLs" to copy the names of the tabs (which contain the product names) to the clipboard.
The problem is that most of the results were simply not lucky enough...
If anybody has an idea or tip to get this accomplished, I'd be very grateful.
I recommend using JScript for this. It's easy to include as hybrid code in a batch script, its structure and syntax are familiar to anyone comfortable with JavaScript, and you can use it to fetch web pages via XMLHTTPRequest (a.k.a. Ajax by the less-informed) and build a DOM object from the .responseText using an htmlfile COM object.
Anyway, challenge: accepted. Save this with a .bat extension. It'll look for a text file containing SKUs, one per line, and fetch and scrape the search page for each, writing info from the first anchor element with a .className of "product-image" to a CSV file.
@if (@CodeSection == @Batch) @then
@echo off
setlocal
set "skufile=sku.txt"
set "outfile=output.csv"
set "URL=http://www.aidangrayhome.com/catalogsearch/result/?q="
rem // invoke JScript portion
cscript /nologo /e:jscript "%~f0" "%skufile%" "%outfile%" "%URL%"
echo Done.
rem // end main runtime
goto :EOF
@end // end batch / begin JScript chimera
var fso = WSH.CreateObject('scripting.filesystemobject'),
skufile = fso.OpenTextFile(WSH.Arguments(0), 1),
skus = skufile.ReadAll().split(/\r?\n/),
outfile = fso.CreateTextFile(WSH.Arguments(1), true),
URL = WSH.Arguments(2);
skufile.Close();
String.prototype.trim = function() { return this.replace(/^\s+|\s+$/g, ''); }
// returns a DOM root object
function fetch(url) {
var XHR = WSH.CreateObject("Microsoft.XMLHTTP"),
DOM = WSH.CreateObject('htmlfile');
WSH.StdErr.Write('fetching ' + url);
XHR.open("GET",url,true);
XHR.setRequestHeader('User-Agent','XMLHTTP/1.0');
XHR.send('');
while (XHR.readyState!=4) {WSH.Sleep(25)};
DOM.write(XHR.responseText);
return DOM;
}
function out(what) {
WSH.StdErr.Write(new Array(79).join(String.fromCharCode(8)));
WSH.Echo(what);
outfile.WriteLine(what);
}
WSH.Echo('Writing to ' + WSH.Arguments(1) + '...')
out('sku,product,URL');
for (var i=0; i<skus.length; i++) {
if (!skus[i]) continue;
var DOM = fetch(URL + skus[i]),
anchors = DOM.getElementsByTagName('a');
for (var j=0; j<anchors.length; j++) {
if (/\bproduct-image\b/i.test(anchors[j].className)) {
out(skus[i]+',"' + anchors[j].title.trim() + '","' + anchors[j].href + '"');
break;
}
}
}
outfile.Close();
Too bad the htmlfile COM object doesn't support getElementsByClassName. :/ But this seems to work well enough in my testing.
I have an RSS feed which I've created in Yahoo Pipes. You can view it here.
When viewing it through the Google Feeds API, however, the pubDate comes up as undefined (for the avoidance of doubt, I've also tried formatting it as PubDate).
Here's the code I've used:
<div class="clear" id="feed">
</div>
<script type="text/javascript">
var feedcontainer=document.getElementById("feed")
var feedurl="http://pipes.yahoo.com/pipes/pipe.run?_id=f0eb054e3a4f8acff6d4fc28eda5ae32&_render=rss"
var feedlimit=5
var rssoutput="<h3>Business and Tax News</h3><ul>"
function rssfeedsetup(){
var feedpointer=new google.feeds.Feed(feedurl)
feedpointer.setNumEntries(feedlimit)
feedpointer.load(displayfeed)
}
function displayfeed(result){
if (!result.error){
var thefeeds=result.feed.entries
for (var i=0; i<thefeeds.length; i++)
rssoutput+="<li><a href='" + thefeeds[i].link + "'>" + thefeeds[i].title + " (" + thefeeds[i].pubDate +")</a></li>"
rssoutput+="</ul>"
feedcontainer.innerHTML=rssoutput
}
else
alert("Error fetching feeds!")
}
window.onload=function(){
rssfeedsetup()
}
</script>
...and here it is on an example page.
I've done some Googling on this and discovered that there appears to be a little-documented problem with the way Yahoo Pipes outputs pubDate. I've tried following the instructions in the question Can't get pubDate to output in Yahoo! Pipes? (the resulting pipe is here), but it doesn't seem to make any difference.
How can I output a proper pubDate in Google Feed from a Yahoo Pipes RSS feed? Is this even possible?
Simply change:
thefeeds[i].pubDate
to:
thefeeds[i].publishedDate
I tested this on Google Code Playground:
https://code.google.com/apis/ajax/playground/#load_feed
In OnLoad, change the URL to your Yahoo Pipes link
In the main loop in feedLoaded, edit the middle part to:
div.appendChild(document.createTextNode(entry.title));
div.appendChild(document.createTextNode(entry.publishedDate));
console.log(entry);
Specifically in the JavaScript console you can see the entry object has a publishedDate property instead of pubDate.
If it works on the playground, it should work on your site too, I hope.
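To illustrate the property swap, here's a minimal sketch of the display loop run against a mock result object (the entry structure is assumed from the snippets above; the feed data is made up purely for illustration):

```javascript
// Mock of the result object Google Feeds passes to the load callback;
// the link/title/date values here are invented example data.
const result = {
  error: null,
  feed: {
    entries: [
      { link: "http://example.com/a", title: "Post A", publishedDate: "Mon, 01 Apr 2013 10:00:00 -0000" }
    ]
  }
};

let rssoutput = "<ul>";
for (const entry of result.feed.entries) {
  // publishedDate is the property that actually exists on the entry object;
  // entry.pubDate would come back as undefined.
  rssoutput += "<li><a href='" + entry.link + "'>" + entry.title + " (" + entry.publishedDate + ")</a></li>";
}
rssoutput += "</ul>";

console.log(rssoutput);
```

The list items now carry the date instead of the literal string "undefined".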