I am trying to create a "database" of sorts in Google Sheets/Fusion Tables using XML files found online (giving information on bus routes and bus stops), so that I can eventually work with it in a JavaScript program. The bus stops are identified by a stop_id and the routes by a route tag. I would like to connect these two in the database so that, given a stop_id, I will know the route tag of the route the stop is on. Each stop contains the stop_id and latitude/longitude information. The routes contain lists of the stops (stop_id) on them.

To create this database, I would like a Google spreadsheet with the route tag as the column head and the list of stop IDs filling the corresponding cells below. In total there are 30 routes. The list of route tags is found at http://webservices.nextbus.com/service/publicXMLFeed?command=routeList&a=chapel-hill and the route information for route A (for example) is found at http://webservices.nextbus.com/service/publicXMLFeed?command=routeConfig&a=chapel-hill&r=A.

I have tried using the importxml("routesURL", "/body/route//stop[stopId]") command in Google Sheets, but it returns a parse error or "Imported content is empty" when I do so.

To reiterate my goal: I would like to know, for a specific stop_id, what the route tag is. Any tips? Is my strategy all wrong? Thank you very much!
It's not typical XML, i.e. there are no real values between the opening and closing tags; the data lives in attributes. Here is a typical XML file, from which importxml can extract Tove, Jani, Reminder, etc. Just try:
=importxml("https://www.w3schools.com/xml/note.xml", "*")
But in the case of your first file, there are no text values to extract.

I searched and found that they also provide a JSON format. Just replace XML with JSON in the URL, like this:
http://webservices.nextbus.com/service/publicJSONFeed?command=routeConfig&a=chapel-hill&r=A
Google Sheets doesn't provide a built-in function for JSON, but you can import JSON via Google Apps Script (GAS).
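For example, here is a minimal Apps Script sketch that pulls the stop IDs for route A into a column, with the route tag as the header. The JSON paths (data.route.stop, stop.stopId) are assumptions based on the XML feed's attribute names, so inspect the actual JSON response in your browser and adjust them:

// Google Apps Script (Extensions > Apps Script), then run importRoute() once.
// NOTE: the JSON paths (data.route.stop, stopId) are assumptions; check the
// real response from the publicJSONFeed URL and adjust as needed.
function importRoute() {
  var url = 'http://webservices.nextbus.com/service/publicJSONFeed' +
            '?command=routeConfig&a=chapel-hill&r=A';
  var data = JSON.parse(UrlFetchApp.fetch(url).getContentText());
  var stops = data.route.stop; // assumed: array of stop objects for route A
  var sheet = SpreadsheetApp.getActiveSheet();
  sheet.getRange(1, 1).setValue('A'); // route tag as the column head
  for (var i = 0; i < stops.length; i++) {
    sheet.getRange(i + 2, 1).setValue(stops[i].stopId); // one stop_id per row
  }
}

Repeating this for each route tag from the routeList feed gives you the spreadsheet layout you described, which you can then invert to look up a route tag by stop_id.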
I have created a BaseX database and a mytest.xq file containing an XQuery for that database. When I enter the following in my browser:
localhost:8984/rest?run=mytest.xq
I get the desired results in XML form. However, I want to use HTML to display those results on a website. Is this possible? If it is, can the XML results be visualized better, for example as a table?

I have looked through all the documentation regarding BaseX HTTP and have not found a way.
You can add &method=html to your url like so:
localhost:8984/rest?run=mytest.xq&method=html
As long as you are returning HTML from your query, it will render with the method argument. You don't need RESTXQ for your simple needs.

The main BaseX page has an example file in the webapp folder called restxq.xqm, where you can see how the basic home page is set up. You don't need RESTXQ, but you can use the header information from that file in your test query and render your page with that in mind.

Also, there is an entire app in the webapp/dba folder that is written entirely in RESTXQ.
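If you would rather keep the query returning plain XML and build the table on the client side instead, here is a minimal JavaScript sketch. The element names (record and its children) are assumptions; adjust them to whatever your query actually returns:

// Fetch the XML from the BaseX REST endpoint and render it as an HTML table.
// The element name 'record' is an assumption; change it to match your data.
fetch('http://localhost:8984/rest?run=mytest.xq')
  .then(function (response) { return response.text(); })
  .then(function (text) {
    var xml = new DOMParser().parseFromString(text, 'application/xml');
    var table = document.createElement('table');
    xml.querySelectorAll('record').forEach(function (record) {
      var row = table.insertRow(); // one table row per XML record
      Array.prototype.forEach.call(record.children, function (field) {
        row.insertCell().textContent = field.textContent; // one cell per child element
      });
    });
    document.body.appendChild(table);
  });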
What I need
I need to retrieve data from this source. Let's assume I must use only Power BI for this.
What I did so far
If I use the basic web source option, then the query is basically just HTML parsing, with which I can easily get the data found in the HTML scope of the page. Example:
Source:
The steps I'm following through the Web source option:
Query:
(to simplify the example, assume we don't need the dates)
You can download that example .pbix file here.
The problem
The problem is that I need more data, which can't be accessed through the HTML preview. For example, let's imagine I need to retrieve the data from January 2010 to April 2020. Those kinds of queries can only be done via this button on the webpage (which exports the requested data to an Excel workbook):

The idea is to automate this process, so going to the source and exporting the Excel file every time is not an option.

Inspecting the element, I realized that what it does is execute a JavaScript function:
The question
As a Power BI/Power Query noob I wonder: is there any way I can get that data directly with Power BI (maybe by calling the JS function somehow)? If so, how?
Thank you in advance.
The solution in my case was to use URL parameters to retrieve the data without parsing the HTML table.
❌Original URL I was using:
https://gee.bccr.fi.cr/indicadoreseconomicos/Cuadros/frmVerCatCuadro.aspx?idioma=1&CodCuadro=%20400
✔️New URL for the query, adding some parameters:
https://gee.bccr.fi.cr/indicadoreseconomicos/Cuadros/frmVerCatCuadro.aspx?idioma=1&CodCuadro=%20400&Idioma=1&FecInicial=2010/01/01&FecFinal=2040/01/01&Filtro=0&Exportar=True
This procedure only works in this case, because obviously the parameters will not be the same on other web pages.

However, I post this answer to keep the main idea for those who are in a similar situation: first try the appropriate URL parameters to get the data in a different format. Of course, you must first find out which parameters are available, which is a limitation.
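As a quick sanity check that the parameterized URL really returns the export before wiring it into Power BI, here is a small Node.js sketch (v18+ for the global fetch; the parameter values are just the ones from the URL above, and the output filename is only illustrative):

// Download the export produced by the parameterized URL and save it locally,
// to confirm the endpoint behaves before pointing Power BI at it.
const fs = require('fs');

const url = 'https://gee.bccr.fi.cr/indicadoreseconomicos/Cuadros/frmVerCatCuadro.aspx' +
  '?idioma=1&CodCuadro=%20400&Idioma=1' +
  '&FecInicial=2010/01/01&FecFinal=2040/01/01&Filtro=0&Exportar=True';

fetch(url)
  .then(function (res) { return res.arrayBuffer(); })
  .then(function (buf) {
    fs.writeFileSync('cuadro400.xls', Buffer.from(buf)); // assumed Excel output
    console.log('saved cuadro400.xls');
  });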
I am currently working on a project to find empty classrooms in our school in real time. For that purpose, I need to extract the substitutions published on our school page (https://ssnovohradska.edupage.org/substitution/?), since there might be additional changes.

But when I try to extract the HTML source code and parse it with bs4, it cannot find the divs (class: "section print-nobreak") that contain the substitution text. When I took a look at the page source code (Ctrl+U), I found that there is only JavaScript that renders it all directly.

Is there any way to extract the HTML after the JavaScript output has already been rendered?

Thanks for the help!
Parsing HTML is unfortunately necessary to solve your problem. But I will explain how to find ways to avoid that in your future projects (not based on this website).
You've correctly noticed that the text is created by JavaScript code running on the page. This could also indicate that the data is either loaded from another resource (XHR/fetch call getting a response from an API) or is stored as a JSON/JS inside of the website's code. (Or is generated from an algorithm, but this is unlikely to be the case in such websites.)
The website actually uses both methods (the initial render gets data stored inside the website's code, but when you switch dates on the calendar it makes AJAX requests). You can see this by searching for ReactDOM.render(React.createElement( in the code. They're providing an HTML string to the createElement call, so I would suggest looking into the AJAX way of doing things.

Now, to check where the resource is located, all you need to do is open Developer Tools in your favorite browser (usually Control+Shift+I) and navigate to the Network tab. With the Network tab open, you need to cause the website to load external data, for example by pressing a date on the "calendar bar".
Here you will notice many external requests, but we're actually looking only for XHR calls. Click on the XHR button next to the "Filter" text field. That should result in only one request being shown:
Unfortunately for us, the response only contains HTML. Also, the API calls are protected: they require a PHP session ID and some sort of token (__gsh), or they fail. So, going back to step 1, it seems our only solution is to use regular expressions to find the text between "report_html":"<div class and </div></div></div> in the source code, if you're interested in today's date only. If you want to get the contents for tomorrow or any other date, you will need to either fetch the page, save the cookies, find the token to supply to the request and then make that request, or use something like puppeteer or pyppeteer (since you've mentioned BS4) and load the webpage in that. If you aren't fetching the data that often, you should be fine overall.
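If it helps, here is a rough Node.js sketch of the regex idea for today's date (the same approach translates directly to Python; the marker name and the escaping of the embedded HTML are assumptions based on the source I saw, so adjust the pattern if it doesn't match):

// Fetch the page and pull the JSON-escaped HTML out of "report_html":"...".
// The regex matches a complete JSON string value; JSON.parse then undoes
// the escaping. The marker name comes from the page source and may change.
fetch('https://ssnovohradska.edupage.org/substitution/')
  .then(function (res) { return res.text(); })
  .then(function (page) {
    var match = page.match(/"report_html":"((?:\\.|[^"\\])*)"/);
    if (!match) throw new Error('report_html not found');
    var html = JSON.parse('"' + match[1] + '"'); // unescape the JSON string
    console.log(html); // the <div class="section print-nobreak"> markup
  });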
I am new to the Contentful API, but so far getting content from the API has been pretty straightforward. I have created a new space using their "blog" template, and I see that in the "body" field there is an "insert media" button. I don't think I get how this is supposed to be used. When I insert an image into the "body" field, it generates code that doesn't get rendered when I pull the content from the API. I am using a Markdown parser to render the text. If you create an entry with images, these images will be available as assets. Do I need to make a separate API call for every asset I want rendered with my entry?
When you use the Insert Media button, it should generate something such as:
![Lewis Carroll](//images.contentful.com/zz2okzf5k4px/2ReMHJhXoAcy4AyamgsgwQ/ec4998388330a939288c04558c57477a/lewis-carroll-1.jpg)
That URL points to the image directly, so you don't need to make any extra calls to get the asset. The asset is an entity that contains metadata as well as the URL of the asset file itself, but in this case you already have that URL.
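One thing worth checking: the generated URL is protocol-relative (it starts with //), which breaks outside an http(s) context, for example when previewing a local file. Here is a minimal sketch, assuming the marked package as the Markdown parser (any parser works the same way), that prefixes https: before rendering:

// Render a Contentful "body" field to HTML. The choice of the "marked"
// package is an assumption; swap in whatever Markdown parser you use.
const { marked } = require('marked');

function renderBody(markdown) {
  // ![alt](//images... -> ![alt](https://images... so local previews work too
  const fixed = markdown.replace(/\]\(\/\//g, '](https://');
  return marked.parse(fixed); // HTML string with <img> tags included
}

console.log(renderBody('![Lewis Carroll](//images.contentful.com/zz2okzf5k4px/2ReMHJhXoAcy4AyamgsgwQ/ec4998388330a939288c04558c57477a/lewis-carroll-1.jpg)'));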
You said you are rendering the Markdown; maybe there's a problem in the code that gets generated? Could you post it?
There are about 200 product numbers and associated product URLs. I have to extract the meta tags for the title and keywords of each of these products using JavaScript code and output them to a file on my computer. How?

Example: product no. D2650 has the product URL: http://www.sigmaaldrich.com/catalog/product/sigma/D2650?lang=en&region=US

The same applies to the rest of the 199 products. I need to extract the meta "keywords" and meta "title" for all these pages.

Help with the JS code would be appreciated.
Depending on the set of pages you need to get metadata about, this existing API might do a good job extracting the info you need.
https://opengraph.io/
It's a simple REST API:
GET https://opengraph.io/api/1.0/site/<URL encoded site URL>
It works well for pages that use Open Graph tags. For other pages, it can sometimes fall back on grabbing other metadata tag info. You can test what info it can find on a particular page with the test tool here:
https://opengraph.io/app/#!/debugtool
It's working well for me in a project, and saved me the extra time of hooking up YQL or making other server-side changes. [NOTE: I have no relationship to this product or its creators. I found it through online research and used it in a project.]
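For what it's worth, calling it from JavaScript is a short fetch. A hedged sketch (the service may require an app_id/API key parameter, so check their docs, and the response field names are assumptions to verify against a real response):

// Ask opengraph.io for the metadata of one product page and log the title.
// The response path (data.openGraph.title) is an assumption; inspect a real
// response first. An API key query parameter may also be required.
var target = 'http://www.sigmaaldrich.com/catalog/product/sigma/D2650?lang=en&region=US';

fetch('https://opengraph.io/api/1.0/site/' + encodeURIComponent(target))
  .then(function (res) { return res.json(); })
  .then(function (data) {
    console.log(data.openGraph && data.openGraph.title);
  });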
If you're using pure JavaScript, you could do something like this:
var metas = document.getElementsByTagName('meta'); // get all the meta tag elements

// iterate through them and log the ones we need
for (var i = 0; i < metas.length; i++) {
    if (metas[i].getAttribute("name") == "keywords") {
        console.log(metas[i].getAttribute("content"));
    }
    else if (metas[i].getAttribute("name") == "description") {
        console.log(metas[i].getAttribute("content"));
    }
}
The above code could be simpler if you are using jQuery:
var keywords = $('meta[name=keywords]').attr("content");
var description = $('meta[name=description]').attr("content");
I have written the code snippet based on the source of the URL you shared. You can modify it to suit your needs. Hope it gets you started in the right direction.
EDIT
I understand you're a beginner, but I am going to refrain from posting the entire code from beginning to end, simply because there are several ways of doing it and it's something you can and should learn on your own if you try. It is not that tough.

The starting point for your problem should be accessing the HTML of a remote source in JavaScript. We generally do this using a POST or GET request, but cross-origin network requests are generally disallowed in browsers. Check this SO answer, which elaborates on the issue.

Now, a simple workaround for this would be to look for APIs that allow you to scrape HTML from online resources. YQL (Yahoo Query Language) is one such tool that lets you 'query' HTML from remote sources. They have a very friendly YQL Console as well, which generates a URL where you can directly make a request and query the HTML. It is very well documented and should be easy to get started with. Try the following YQL query in the console:
select * from html where url='http://www.sigmaaldrich.com/catalog/product/sigma/D2650?lang=en&region=US' and xpath='/html/head/meta'
Just look at the result: you will have only the meta tags, returned in XML or JSON, plus a custom URL generated for your YQL query. So it's just a matter of making a GET/POST request to that URL and then using the code I posted earlier, provided the data is returned as properly formatted XML. If it is returned as JSON, you simply have to parse the JSON, which should also be quite simple.

All this probably sounds really complicated right now, but if you just get down to it and take it one step at a time, you'll be able to solve the problem yourself. Start by learning to use the YQL Console and making network requests in JavaScript, then put it all together. It should be a fun exercise.
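To make that concrete, here is a rough sketch of the whole flow, assuming the YQL endpoint returns JSON when you add format=json. The response shape below (query.results.meta as an array of {name, content} objects) is an assumption; verify it against the sample response in the YQL Console first:

// Query YQL for the page's meta tags and log the keywords and title.
// The response path (data.query.results.meta) is assumed from typical
// YQL output; check the console's sample response and adjust.
var yql = 'https://query.yahooapis.com/v1/public/yql?format=json&q=' +
  encodeURIComponent(
    "select * from html where " +
    "url='http://www.sigmaaldrich.com/catalog/product/sigma/D2650?lang=en&region=US' " +
    "and xpath='/html/head/meta'");

fetch(yql)
  .then(function (res) { return res.json(); })
  .then(function (data) {
    var metas = data.query.results.meta || [];
    metas.forEach(function (m) {
      if (m.name === 'keywords' || m.name === 'title') {
        console.log(m.name + ': ' + m.content);
      }
    });
  });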