Scraping a webpage that is using a firebase database - javascript

DISCLAIMER: I'm just learning by doing, I have no bad intentions
So, I would like to fetch the list of applications on this website: http://roaringapps.com/apps
I've done similar things in the past, but with simpler websites; this time I'm having problems getting my hands on the data behind this webpage.
The scrolling from page to page is blazing fast so, to understand how the webpage works, I've fired up a packet sniffer and analyzed the traffic. I've noticed that, after the initial loading, no traffic is exchanged between the server and my client, even if I scroll over 2500 records in the browser. How is that possible?
Anyhow. My understanding is that the website is loading the data from a stream of some sort and rendering it via JavaScript. Am I correct?
So, I've fired up Chromium's devtools, looked at the "Network" tab, and saw that a WebSocket request is made to the following address: wss://s-usc1c-nss-123.firebaseio.com
At this point, after googling a bit, I tried to query the very same server, using the "v=5&ns=roaringapps" query I saw in the devtools window:
import json
from websocket import create_connection

ws = create_connection('wss://s-usc1c-nss-123.firebaseio.com')
ws.send('v=5&ns=roaringapps')
print(json.loads(ws.recv()))
And got this reply:
{u't': u'c', u'd': {u't': u'h', u'd': {u'h': u's-usc1c-nss-123.firebaseio.com', u's': u'JUL5t1nC2SXfGaIjwecB6G13j1OsmMVv', u'ts': 1476799051047L, u'v': u'5'}}}
I was expecting to see a JSON response with the raw data about applications and so on. What am I doing wrong?
Thanks a lot!
UPDATE
Actually, I just found out that the website is using JSON to load its data. I was not seeing it in repeated requests, probably because of caching - disabling the cache in Chromium did the trick.

While the Firebase Database lets you read and write JSON data, its SDKs don't simply transfer that raw JSON: they do many tricks on top of it to ensure an efficient and smooth experience. What you're getting there is Firebase's wire protocol. The protocol is not publicly documented, and (if you're new to it) trying to unravel it is going to give you an unpleasant time.
To retrieve the actual JSON at a location, it's easiest to use Firebase's REST API. You can get it by simply appending .json to the URL and firing an HTTP GET request at it.
So if the initial data is being loaded from:
https://mynamespace.firebaseio.com/path/to/data
You'd get the raw JSON by firing an HTTP GET against:
https://mynamespace.firebaseio.com/path/to/data.json
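For example, a minimal sketch of that REST call from Node or the browser; the "roaringapps" namespace comes from the query string in the question, but the /apps path is only a guess for illustration:

// Fetch the raw JSON behind a Firebase location via the REST API.
fetch('https://roaringapps.firebaseio.com/apps.json')
    .then(function (response) { return response.json(); })
    .then(function (data) { console.log(data); })
    .catch(function (err) { console.error(err); });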

Related

Node JS + Express JS: refresh page from other location

I have the following problem: I want to change one variable on a page. The input comes from another page. I'm using Node.js, Express.js and EJS for this task, with the following setup:
Server - storing the values
Index page - Control page with input fields and send button
Display page - Shows the variable
I'm sending the variable to the server with a fetch POST. On the server I change the variable to the request body value, and when I reload the display page manually I see the new value. The problem is that the display page needs to update without a manual refresh or anything similar, because that won't be possible in practice.
There is the option of calling location.reload() every X seconds, but that's not the approach I want; I really just want to refresh when the variable changes. Is there a function (from Express.js, for example) I can use for this?
Edit: I should mention that this project will only be used on our network and is not open to other users - an in-house company dashboard, kind of.
So a "quick and dirty" solution could work too, but I want to learn something and would like to do it the right way.
This is a very common scenario that has several solutions:
Polling - The display page runs ajax calls in a loop every N seconds, asking the server for the latest version of the variable. This is simple to implement, very common, and perfectly acceptable. However, it is a little outdated, and there are more modern and efficient methods. I suggest you try this first and move on to the others only as needed.
WebSockets - WebSockets maintain a connection between the client and the server. This allows the server to send messages to the client application if and when needed. They are a little more complex to set up than plain ajax calls, but if you have a lot of messages going back and forth they are much more efficient.
WebRTC - This is taking it to another level, and is certainly overkill for your use case. WebRTC allows direct messaging between clients. It is more complicated to configure than WebSockets and is primarily intended for streaming audio or video between clients. It can however send simple text messages as well. Technically, if you want to persist the message on the server, then this is not suitable at all, but it's worth a mention to give a complete picture of what's available.
The simplest solution that came to mind is to have the server return the updated post in the body, then use that to update the page.
You can also read about long/short polling and Websockets.
One possible solution would be to add page reload code after a successful post-operation with fetch.
fetch(url, {
    method: 'post',
    body: body
}).then(function (response) {
    return response.json();
}).then((data) => {
    // refresh page here
    window.location.replace(url);
});
Proper solution (WebSockets):
Add a WebSocket server as part of your Node.js app.
Implement subscriptions for the WebSocket and implement a 'state changed' event.
Subscribe to the 'state changed' event from your client browser app.
Call the WebSocket server from your Express app to update the clients when your variable changes.
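A minimal sketch of that setup, using the ws package on the server (the package choice, route and message names are illustrative, not prescribed):

// server.js - Express app plus a WebSocket server (using the "ws" package)
const express = require('express');
const http = require('http');
const WebSocket = require('ws');

const app = express();
const server = http.createServer(app);
const wss = new WebSocket.Server({ server });

let value = 'initial';

app.use(express.json());

// The control page posts the new value here...
app.post('/value', (req, res) => {
    value = req.body.value;
    // ...and every connected display page is notified that the state changed.
    wss.clients.forEach((client) => {
        if (client.readyState === WebSocket.OPEN) {
            client.send(JSON.stringify({ type: 'state-changed', value: value }));
        }
    });
    res.sendStatus(204);
});

server.listen(3000);

// In the display page: subscribe to 'state changed' messages.
const socket = new WebSocket('ws://localhost:3000');
socket.onmessage = (event) => {
    const msg = JSON.parse(event.data);
    if (msg.type === 'state-changed') {
        document.querySelector('#value').textContent = msg.value;
    }
};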
Outdated (Polling):
Add an Express endpoint route, e.g. 'variable-state'.
Call the server from your client every n ms and check whether the variable state has changed.
Refresh the page if the variable has changed.
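A rough sketch of that polling approach (the route name, port and interval are arbitrary):

// server.js (Express): expose the current variable state
const express = require('express');
const app = express();

let value = 'initial';

app.get('/variable-state', (req, res) => {
    res.json({ value: value });
});

app.listen(3000);

// Display page: poll every 2 seconds and reload when the value changes
let lastValue = null;

setInterval(async () => {
    const data = await (await fetch('/variable-state')).json();
    if (lastValue !== null && data.value !== lastValue) {
        window.location.reload();
    }
    lastValue = data.value;
}, 2000);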

PHP: Request 50MB-100MB json - browser crash / do not display any result

Huge json server requests: around 50MB - 100MB for example.
From what I know, the browser might crash when loading huge amounts of data into a table (I usually use DataTables): memory usage reaches almost 8 GB and the browser crashes. Chrome might not return a result; Firefox will usually ask if I want to wait or kill the process.
I'm going to start working on a project which will send requests for huge JSONs, all compressed (done server-side in PHP). The purpose of my report is to fetch data and display it all in a table that is easy to filter and order, so I can't see how lazy loading would help in this specific case.
I might use a vue-js datatable library this time (not sure which specifically).
What exactly is using so much of my memory? I know for sure that the JSON result is received. Is it the rendering/parsing of the JSON into the DOM? (I'm referring to the DataTables example for now: https://datatables.net/examples/data_sources/ajax)
What are the best practices in this kind of situation?
I started researching this issue and noticed that there are posts from 2010 that seem like they're not relevant at all.
There is no limit on the size of an HTTP response. There is a limit on other things, such as:
local storage
session storage
cache
cookies
query string length
memory (per your CPU limitations or browser allocation)
Instead, the problem is most likely with the implementation of your datatable. You can't just insert 100,000 nodes into the DOM and not expect some kind of performance impact. Furthermore, if the datatable is running logic against each of those records as they come in and processing them before node insertion, that's also going to be a big no-no.
What you've done here is essentially push the legwork of pagination from the server onto the client, with dire consequences.
If you must return a response that big, consider using one of the storage options that browsers provide (a few mentioned above). Then paginate off of the stored JSON response.
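As an illustration of that last point, a rough client-side sketch; the endpoint, page size and field names are made up, and the data is simply kept in a variable (sessionStorage or IndexedDB would also work if the response fits the storage quota):

// Parse the big response once, keep it off the DOM, and render one page at a time.
const PAGE_SIZE = 100;
let rows = [];

async function load() {
    const response = await fetch('/report.json');
    rows = await response.json();
    renderPage(0);
}

function renderPage(pageIndex) {
    const start = pageIndex * PAGE_SIZE;
    const html = rows
        .slice(start, start + PAGE_SIZE)
        .map((r) => '<tr><td>' + r.id + '</td><td>' + r.name + '</td></tr>')
        .join('');
    document.querySelector('#report tbody').innerHTML = html;
}

load();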

Unable to parse bulk JSON POST request to Kinvey back-end

When I try to parse this JSON:
[
    {"name":"name1","id":12},
    {"name":"name2","id":11},
    {"name":"name3","id":111},
    {"name":"name4","id":1115}
]
in a POST request to Kinvey's BaaS, I get the error:
{
    "error": "Unable to parse the JSON in the request"
}
Here is a screenshot of my back-end (Kinvey).
Here is a screenshot of my request (Postman).
When I send the single entity {"name":"name1","id":12} it doesn't throw an error and places it in the back-end as it should. Picture here: Kinvey worked
As a security measure, some frameworks won't parse top-level arrays as JSON. Doing so enabled exploits in some older browsers.
The exploit goes something like this:
Write some Javascript that replaces Array with a function that stores its contents to some other variable.
In your malicious site, include a request to some privileged (JSON Array) resource on another server using a <script> tag.
Trick a user with privileges on that server into visiting your site.
The requested resource will be pulled from the benign server, loaded in the user's browser as a script, and evaluated, but the array gets handled by your malicious substitute function, which you can use however you like. A form of cross-site request forgery.
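A sketch of what that looked like in practice; this only worked in very old browsers, where evaluating an array literal invoked the (overridable) global Array constructor, and it is shown purely to illustrate why top-level arrays were treated as risky:

// On the attacker's page, before the privileged resource is included:
var stolen = [];
Array = function () {
    // In affected browsers, every array literal evaluated after this point
    // was constructed through this function, so its contents could be read.
    stolen.push(this);
};
// The page then includes the victim's JSON-array endpoint as a script, e.g.
//   <script src="https://victim.example.com/private.json"></script>
// The victim's cookies are sent with that request, the array is evaluated
// as JavaScript, and its contents leak into `stolen`.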
Update
Regarding the question, "how do I upload multiple entities to a Kinvey collection?", the answer is in the Kinvey documentation:
"For bulk upload, see the CSV/JSON import feature on the Kinvey console (navigate to the collection, click Settings, then click Import Data)."
You can only POST one entity at a time with the POST function in Kinvey. So this is not a JSON parsing error.
Also, you should look into calling Kinvey through the official Kinvey SDK for the mobile platform you're developing for, rather than using the REST API. That way, you can take advantage of many other features such as caching, offline sync, implicit authentication, etc.
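If the console import isn't an option, a hypothetical workaround consistent with the one-entity-at-a-time limit is to loop over the array client-side and POST each object separately, reusing whatever collection URL and auth header already worked for the single-entity request (both are placeholders below):

// Sketch only: COLLECTION_URL and AUTH_HEADER must be replaced with the
// values that worked for the single-entity POST in Postman.
const COLLECTION_URL = 'https://example.invalid/your-collection';
const AUTH_HEADER = 'Basic <credentials>';

const entities = [
    { name: 'name1', id: 12 },
    { name: 'name2', id: 11 },
    { name: 'name3', id: 111 },
    { name: 'name4', id: 1115 }
];

async function uploadAll() {
    for (const entity of entities) {
        await fetch(COLLECTION_URL, {
            method: 'POST',
            headers: { 'Content-Type': 'application/json', 'Authorization': AUTH_HEADER },
            body: JSON.stringify(entity)
        });
    }
}

uploadAll();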

Is fetching remote data server-side and processing it on server faster than passing data to client to handle?

I am developing a web app which functions in a similar way to a search engine (except it's very specific and on a much smaller scale). When the user gives a query, I parse that query, and depending on what it is, proceed to carry out one of the following:
Grab data from an XML file located on another domain (ie: from www.example.com/rss/) which is essentially an RSS feed
Grab the HTML from an external web page, and proceed to parse it to locate text found in a certain div on that page
All the data is plain text, save for a couple of specific queries which will return images. This data will be displayed without requiring a page refresh/redirect.
I understand that the same-origin policy prevents me from using JavaScript/Ajax to grab this data directly. An option is to use PHP for this, but my main concern is the server load.
So my concerns are:
Are there any workarounds to obtain this data client-side instead of server-side?
If there are none, is the optimum solution in my case to: obtain the data via my server, pass it on to the client for parsing (with Javascript/Ajax) and then proceed to display it in the appropriate form?
If the above is my solution, all my server is doing with PHP is obtaining the data from the external domains. In the worst (best?) case scenario, with say a thousand or so requests being executed per minute, is it efficient for my web server to handle all those requests?
Once I have a clear idea of the flow of events it's much easier to begin.
Thanks.
I just finished a project doing the same kind of request as yours.
My suggestion is:
Use two files: [1] the frontend, which makes the ajax call and sends the URL to the backend; [2] the backend, which receives the ajax call, fetches the file content from the URL and then parses the XML/HTML.
That way you avoid your PHP script hanging in some situations.
For the PHP side, look into the DOMDocument class for parsing XML/HTML; you will also need DOMXPath.
Please read: http://www.php.net/manual/en/class.domdocument.php
No matter what you do, I suggest you always archive the data on your local server.
So the process becomes: search your local copy first; if it doesn't exist, grab it from the remote source and archive it for 24 hrs.
BTW, as for your client-side parsing idea, I suggest you do exactly that. jQuery can handle both HTML and XML; for HTML you just need to filter out all the JS code before parsing it.
So the idea becomes:
ajax call to the local service
local PHP grabs the XML/HTML (but does no parsing)
archive it locally
send the filtered HTML/XML to the frontend and let jQuery parse it
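A rough sketch of the client side of that flow, assuming a local proxy.php endpoint that returns the fetched markup as plain text (the endpoint name, target URL and selector are all illustrative):

// Ask the local PHP proxy for the remote document, then parse it with jQuery.
$.get('proxy.php', { url: 'http://www.example.com/page.html' }, function (markup) {
    // $.parseHTML drops <script> tags by default (keepScripts = false),
    // which covers the "filter out all the JS code" step.
    var nodes = $.parseHTML(markup);
    // Wrap the parsed nodes so .find() can reach the target element
    // wherever it sits in the fragment.
    var text = $('<div>').append(nodes).find('div.target').text();
    $('#result').text(text);
});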
HTML is similar to XML. I would suggest grabbing the page as HTML and traversing through it with an XML reader as XML.

Force existing client web pages to reload - using only JSON (no eval)

I'm a consultant working on a web app that's basically a single page app. All it does is constantly retrieve new json data behind the scenes (like once a minute), and then display it on screen.
Our clients load this app, and leave it running 24/7, for weeks on end. If errors happen when retrieving new json data, the app ignores it and keeps running.
We're rolling out an update, and want the existing clients to either become invalidated, or reload themselves without any user interaction. This feature wasn't "built in" by anyone, and we're trying to do this after the fact.
Is there some way to make the existing clients reload without telling our end users to just reload the page?
The following conditions define the app a bit more:
The app uses jQuery 1.9.0
Runs exclusively in Chrome
Retrieves new json data frequently using jquery
Throws away any errors it finds in json responses and uses old data.
EDIT:
I've had it suggested that we could try the following:
send invalid data through the JSON responses to crash chrome (like 500 megs of data, for example)
send window.location.reload through the JSON response (which supposedly won't work due to jQuery protecting against this type of thing)
send "script" data in the JSON response; if it gets passed to $.html(....) at some point, it may run the script as well.
and am open to any suggestions on getting this to reload or kill chrome, so the client is forced to reload the page.
If you're using $.ajax to request your data, and not specifically setting your content type, then you may be able to do the following on the server:
set the content type header to "text/javascript"
respond with javascript, e.g. window.location = "http://www.yoursite.com"
jQuery may eval that, and simply run your javascript.
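A sketch of what that could look like on the server; Express and the route name are assumptions here, and it relies on jQuery's documented behaviour of inferring the "script" dataType from the response content type when none is set:

// Server side: respond to the endpoint the old clients already poll,
// but with JavaScript instead of JSON.
const express = require('express');
const app = express();

app.get('/data', (req, res) => {
    // With no explicit dataType on the client's $.ajax call, jQuery may
    // treat this content type as a script and evaluate the body.
    res.type('text/javascript');
    res.send('window.location.reload();');
});

app.listen(3000);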
No, it is not possible. As far as I can tell you do not execute code from the JSON response (which is a very good thing). Thus you have no way of altering your current clients' behaviour. According to your own statement:
"Throws away any errors it finds in JSON responses and uses old data"
You will not be able to crash the user's browser by sending invalid JSON data as the errors will be suppressed.
You can build automatic redeployment into future versions by sending an application version number and testing for changes, or by using WebSockets (which the application seems better suited to anyway, as you can ensure your clients only update when the JSON has actually changed).
If I understand correctly: create a version reference page and make the client check it every couple of seconds; when you update the file, the client will reload itself with this script.
var buildNo = "1.2.0.1";
var cV = setInterval(checkVersion, 5 * 1000); // every 5 sec.

function checkVersion() {
    $.ajax({
        url: "checkVersion.php?v=" + buildNo,
        dataType: "json",
        success: function (d) {
            if (d.version != buildNo) { // if the version is different
                window.location.reload();
                //chrome.runtime.reload(); // for Chrome extensions
            }
        }
    });
}
If you can't add an extra page, you can just add an extra version variable to the end of your existing JSON data.
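A minimal sketch of that variant, assuming the regular payload gains a hypothetical version field:

// In the existing data-refresh success handler:
var buildNo = "1.2.0.1";

function onData(d) {
    if (d.version && d.version !== buildNo) { // "version" is a hypothetical field
        window.location.reload();
        return;
    }
    // ...render d as usual...
}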
