Is there any tutorial or document explaining how to create a Hotjar-like application? Can anyone provide any insights into this?
I want to create an application which analyses user behaviour, for example:
what percentage of the page the user has scrolled
which part of the DOM the user has clicked on
and then create a report using heatmap.js for just one of my static sites/pages.
I've already made the reports using static data and heatmaps.
Now I just want to track user activities such as scroll points and mouse hover/click points, which can differ across screen sizes and devices.
Are there any APIs or JS frameworks to help with this?
You won't find a single tutorial on this, just as you won't find a "build your own web search engine" tutorial. Website user tracking is a very complex topic. Developing such a solution will require huge effort and investment. It would require expensive infrastructure as well (servers to collect data from the users and process it).
Additionally, there are some risks and problems to it as well. User privacy is a hot topic now because of the questionable morality of user tracking. Web users' awareness grows all the time, and a lot of users choose to opt out of web tracking. The industry follows and expands users' possibilities to do that (both technical and legal).
If you still want to proceed, learn as much as you can about web user tracking. Search for "user tracking methods", "user tracking techniques" and "web analytics" buzz-phrases.
Then when it comes to implementation:
In the browser (client-side)
Implement individual users identification in order to classify their actions across website/multiple websites (fingerprinting). In fact this might require some work on the server too.
Record as much of the user's interactions as possible - clicking, scrolling, dragging, keypresses, navigation between pages, typing in inputs (watch out for sensitive data - passwords, addresses), etc. This data will be the base for user behaviour analysis on the server later on. Also, combined with the captured state of the DOM (initial and later mutations), this allows us to create "recordings" of user browsing sessions. I've put the word "recordings" in quotes because, contrary to what many think, these videos are usually not created by recording the user's screen. That would be far too complicated and bandwidth-heavy. Instead they are composed from the collected pieces of data I mentioned earlier. Read more about this (IMHO the most interesting) topic in this answer to "How does HotJar generate their recordings?".
Implement sending the above information SECURELY to the server for analysis.
Again: make your solution SECURE. The data you will be collecting is highly sensitive and shouldn't fall into the wrong hands for your and site users' sake. You'll probably have to fund a Bug Bounty program (I warned you it'd be expensive!) like HotJar did.
Figure out a way to inject your tracking application into the website's code (in all popular browsers - don't forget the legacy ones either!). In most cases it requires the site owner to put a small piece of JS on every page of their website (a generic sketch of such a loader follows below). Read Erik Näslund's (HotJar architect) answer to "Why do websites like Hotjar and Google Analytics use complex tracking code instead of just a tag?" to find out more about what this script looks like and why.
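For illustration, the typical pattern is an asynchronous loader snippet roughly like the one below. This is a generic sketch, not HotJar's actual snippet; the URL, global name and site id are placeholders:

    // Generic async tracker loader (placeholder URL and site id; not HotJar's real snippet).
    (function (w, d) {
      // Queue calls made before the full script has loaded.
      w._trk = w._trk || function () { (w._trk.q = w._trk.q || []).push(arguments); };
      w._trkSiteId = 'YOUR_SITE_ID';
      var s = d.createElement('script');
      s.async = true;
      s.src = 'https://cdn.example-tracker.com/tracker.js';
      var first = d.getElementsByTagName('script')[0];
      first.parentNode.insertBefore(s, first);
    })(window, document);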
On the server (the relatively easy part)
Implement data processing and produce reports - heatmaps, session recordings, clickstreams.
I did a very simple POC covering some of the above-mentioned client-side stuff some time ago. It's a simple script that watches DOM changes and some user events and logs them to the console when injected into a web page. In a full-blown solution this would be sent to the server for processing (instead of being written to the console).
These recorded events (DOM changes along with timestamps and user input/events) could be used to reliably reproduce what the users see in the browser window and how they interact with it.
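To make that concrete, a stripped-down sketch of that kind of script, assuming modern browser APIs (MutationObserver and standard DOM events), might look like the following. In a real product the console.log calls would be replaced by batched, secure uploads to a collection endpoint:

    // Minimal sketch of a client-side recorder: log DOM mutations and a few user events.
    // In a real tracker these records would be buffered and sent to a server, not logged.
    (function () {
      const log = (type, detail) =>
        console.log(JSON.stringify({ type, detail, t: Date.now() }));

      // Observe DOM changes (added/removed nodes, attribute and text changes).
      new MutationObserver((mutations) => {
        mutations.forEach((m) => log('mutation', {
          kind: m.type,
          target: m.target.nodeName,
          added: m.addedNodes.length,
          removed: m.removedNodes.length,
        }));
      }).observe(document.documentElement, {
        childList: true, subtree: true, attributes: true, characterData: true,
      });

      // Observe a few user events.
      document.addEventListener('click', (e) => log('click', { x: e.pageX, y: e.pageY }));
      window.addEventListener('scroll', () =>
        log('scroll', { y: window.scrollY, max: document.body.scrollHeight }), { passive: true });
      document.addEventListener('input', (e) => {
        // Never record values of sensitive fields (passwords, etc.).
        if (e.target.type !== 'password') log('input', { name: e.target.name });
      });
    })();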
The company I work for has a requirement to protect an area where articles are rendered. I've implemented some procedures to protect against web scraping, but the problem remains for manual scraping.
The anti-web-scraping bot protection mechanism seems to be working well so far, but I can see clients attempting manual scraping.
What I have tried to protect the article contents:
Set a copy event handler on the article's wrapper element to prevent copying.
-> Clients can use userscripts (Greasemonkey, etc.) to easily bypass this by removing the event handler, or simply write scripts that copy the contents and save them to a file
Developer console protection -> useless
Redirect if F12 pressed -> useless
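For reference, the copy handler from the first point amounts to something like the sketch below (the selector is illustrative), and a userscript can simply remove the listener or read the DOM directly:

    // Block the copy event inside the article wrapper (trivially bypassed by userscripts).
    const wrapper = document.querySelector('.article-wrapper'); // selector is illustrative
    wrapper.addEventListener('copy', (e) => {
      e.preventDefault(); // suppress the default copy behaviour
      e.clipboardData.setData('text/plain', ''); // and clear the clipboard payload
    });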
It seems that protecting HTML is not doable (unless someone tells me otherwise), so I'd like to know other ways to display text while rendering it totally UNABLE to be copied.
Things I've thought:
JS detection mechanisms to diagnose whether the user has any sort of userscript running, in other words, whether any malicious JS code is being injected and executed to extract the text
Transforming the article's HTML into a PDF and displaying it inline with some sort of anti text-select/copy mechanism (if such a thing even exists).
Transforming the article's HTML into chunks of base64 images, which renders the text completely impossible to select and copy
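As one rough client-side variant of the image idea, the article text could be drawn onto a canvas so it is no longer selectable as text (it can of course still be screenshotted or OCR'd). The element and font choices in the sketch below are illustrative:

    // Render a paragraph onto a canvas so it is not selectable as text.
    // Note: this does not stop screenshots or OCR.
    function renderAsCanvas(articleEl) {
      const text = articleEl.textContent;
      const canvas = document.createElement('canvas');
      canvas.width = articleEl.clientWidth;
      canvas.height = articleEl.clientHeight;
      const ctx = canvas.getContext('2d');
      ctx.font = '16px serif';
      // Naive line wrapping: split into words and wrap at the canvas width.
      const words = text.split(/\s+/);
      let line = '', y = 20;
      for (const word of words) {
        const test = line ? line + ' ' + word : word;
        if (ctx.measureText(test).width > canvas.width - 20) {
          ctx.fillText(line, 10, y);
          line = word;
          y += 22;
        } else {
          line = test;
        }
      }
      ctx.fillText(line, 10, y);
      articleEl.replaceWith(canvas); // swap the DOM text for the rendered image
    }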
Are there any good ways to prevent my content from being stolen while not interfering much with the user experience? Unfortunately Flash applets are not supported anymore; they worked like a charm for this back in that era.
EDIT: Come on folks, I just need ideas to at least make the end user's efforts a bit harder, e.g. you can't select text if it's displayed as images, you can only select the images themselves.
Thanks!
As soon as you ship HTML out of your machine, whoever gets it can mangle it at leisure. You can make it harder, but you can't make it impossible.
Rethink your approach. "Give information out" and "forbid its use" somewhat clash...
No, You Can't
Once the browser has loaded your page, you can't protect the content from being copied or downloaded.
Whether it is text, images or videos, you can protect it from unauthorised access, but you can't protect it from being scraped by an authorised person.
You can, however, make it harder using the steps that you mentioned in your question and by enforcing copyright law.
This issue still exists on many sites, especially e-learning platforms such as Udemy. On those sites, the premium courses still get copied/leaked by people who bought them.
From Udemy FAQ
For a motivated Pirate, however, any content that appears on a computer screen is vulnerable to theft. This is unavoidable and a problem across the industry. Giants like Netflix, Youtube, Amazon, etc. all have the same issue, and as an industry, we continue to work on new technology solutions to limit Piracy.
Because pirating techniques currently outpace protection, we hired a company who is specifically dedicated to enforcing the DMCA laws on your behalf and target violating individuals, hosting sites, and DNS servers in an attempt to get any unauthorized content removed.
I'm looking for a free web-based solution for small ETL/mashup tasks.
An example could be:
connect to an API
filter the response
use the data as input to another API
It's something similar to the now-defunct Yahoo Pipes, and for me it is important to have an interface for designers with little coding ability (mostly JavaScript).
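To make the shape of the task concrete, this is the kind of three-step flow I mean, as a plain JavaScript sketch with made-up endpoints:

    // Sketch of the flow: fetch from one API, filter the response,
    // then feed the result into a second API. Endpoints are hypothetical.
    async function runPipeline() {
      const res = await fetch('https://api.example.com/items');         // connect to an API
      const items = await res.json();
      const filtered = items.filter((item) => item.status === 'open');  // filter the response
      await fetch('https://other-api.example.com/report', {             // post to another API
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(filtered),
      });
    }
    runPipeline();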
Note: I've found this paper with a lot of ideas in this field and some comparisons between existing products.
Fair warning: this is not a free solution. I did a lot of work around this about a year or so ago, and the free stuff at the time just would not do what I needed.
In the end I used Dell Boomi. Now I know what you are thinking: Dell? That sounds horrendous, the manufacturer of crap laptops, you say! Why yes…
Boomi came from a bunch of dudes who basically had to solve (what I am assuming to be your problem) connecting a bunch of stuff together, in the cloud, without having to worry about how it all works behind the scenes. It has a fantastic user interface (all web based) and is completely cloud hosted (although you can run the endpoint on your own server/computer if you so desire). And if it all goes tits up with their inbuilt tooling (i.e. you can't quite do what you need to), you can run inline Groovy (Java) code within whatever ETL process you are having trouble with. I think this fits the bill for the user-friendly designer stuff!
Boomi's pedigree was, and is, connecting web services / REST APIs in a quick and easy way, but it also supports all the traditional stuff if you need it (IBM MQ, blah blah).
The big downside is that it is not free; in fact it is quite expensive if this is not for a paid project.
There is a 30-day free trial that I recommend you check out. I really did, and do, have a great time with Boomi for mashing endpoints together.
Now, at the time I also looked at Talend. If I remember correctly, this does not have a web interface; it's all based in Eclipse. The problems with Talend when I looked at it were:
You need to host the endpoint somewhere (this is usually true of all ETL, of course)
The UI was horrible at the time
Ultimately, finding free 'ETL' is nearly impossible; perhaps that's why Pipes went down.
Sorry I can’t be of more help :(
Ballerina is a programming language custom-built for integration that includes a mature graphical syntax. It can easily be used to glue interfaces together. Since your requirement is to have such a mashup interface in the cloud, you can use the WSO2 Integration Cloud free trial program to see if it's right for you.
I've written a post here that demonstrates how easy it is to use Ballerina for scraping data from interfaces; you can create a service with similar logic and host it in the cloud. You can find information on WSO2 Integration Cloud usage here, and information on serving a Ballerina service from the cloud here.
Some more details would be helpful, such as which API you would like to connect to and how many requests you'd be making. Here's one way you might approach this with free tools:
Extract: an IFTTT integration plus their "Maker Channel" (will post info from one of their 270+ integrations to an API)
Transform: Sheetsu, which turns a Google Spreadsheet into a RESTful API that you can post to. Transform the data and output it to another sheet.
Load: you can also make GET requests via Sheetsu, or just use the Google Spreadsheets API.
I have a web app that displays productivity figures for the vessels currently working in our container terminal. The data is somewhat sensitive but not really top secret, since it is just moves pending, total moves, productivity in moves per hour, the time of the first move and an estimate of when the vessel will finish operations. However, if made public, the shipping lines would be able to see how their competitors' vessels are being attended to, which might cause issues with our marketing department.
I created the webapp for our smartphone users, so they can have the real-time productivity board at hand. That helps them assess the operation as it is happening and take corrective actions to speed up the lifting equipment or fix any issues on the fly.
The app runs on an internal web server, and users must log into the VPN to view the app data; it is not accessible from the outside. Some of our customers have recently asked to have the data available to them, but segregated so that each sees only their own vessels. That is no problem; I can do that easily. The issue is that I don't want to give VPN access to each and every customer that wants to use the app.
The app works this way:
a) A Pentaho ETL job queries our databases and produces an XML file, which is saved in the Apache web root.
b) The XML file is read by the webapp, which is written in HTML5, JS and jQuery, and also uses bootstrap.js, datatables.js, realgauge.js and some other frameworks.
My idea is to copy the app resource files to the public web server and have a cron job FTP over the XML files, which are updated every minute, since the public server is accessible from the LAN. That way our smartphone users will no longer have to log into the VPN to access the app.
But there are security concerns, since the HTML, JS and XML files will be exposed to the public. The app will not be publicised, but I'm afraid that an attacker, just by browsing the web root directory, might pinpoint the files and extract the data.
So my question is which path I should take:
I've been doing some research into XML encryption, but I would need to provide some kind of token to be used as a seed for the encryption algorithm, and I'm not quite sure how secure that can be.
Implement user/password authentication in the app, but it might be complicated to maintain a database of users and passwords for everyone who will access the app; I'm worried about the administrative overhead of lost passwords and the like. I haven't researched the subject fully yet, though. I looked into hello.js and it seems promising; I would like to hear your opinions on it.
We use Joomla 3 as the CMS for our website, so maybe there is something we can use on the Joomla side, such as its user/password authentication system, to control access to the app.
Any other option that you think I should research.
Our main goal is to have the app available to our mobile or other external users, while not exposing the plain XML file with all the data.
Many thanks to all for the help.
UPDATE
I've been researching a Joomla template called "Blank". It turns out there is even a Bootstrap version, so if I can fit my code into the template, I can do access control within Joomla to publish my content to logged-in users and apply the customised template. With this I'll be fixing two issues:
I can publish customized customer data
I can also publish our mobile site to every one of our own mobile users, and I'll be saving tons of $ on VPN licenses.
Thanks all for your help.
I'm assuming goals of: (1) pretty good security, and (2) minimal development work necessary.
Then I prefer your approach #2. I would guess from the situation you describe that there isn't a huge need to change passwords, so you can just generate user/password combinations yourself and share them with clients. You could update them once a year if necessary. Then it's straightforward to either secure access to your app using a user/password login, or encrypt each XML file for a client using that client's password.
If you found there really was a major need for clients to change their passwords, the question would be how to store and update the passwords instead of just having the app read a flat file.
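If you go the per-client encryption route, a minimal Node.js sketch of the idea might look like the one below, deriving a key from the client's password and encrypting the XML with AES-256-GCM. The file names, salt handling and password storage here are purely illustrative, not production-ready:

    // Encrypt a client's XML file with a key derived from that client's password.
    const crypto = require('crypto');
    const fs = require('fs');

    function encryptXml(xmlPath, password, outPath) {
      const salt = crypto.randomBytes(16);
      const key = crypto.scryptSync(password, salt, 32); // derive a 256-bit key
      const iv = crypto.randomBytes(12);
      const cipher = crypto.createCipheriv('aes-256-gcm', key, iv);
      const ciphertext = Buffer.concat([cipher.update(fs.readFileSync(xmlPath)), cipher.final()]);
      // Store salt, IV and auth tag alongside the ciphertext so the client app can decrypt.
      fs.writeFileSync(outPath, Buffer.concat([salt, iv, cipher.getAuthTag(), ciphertext]));
    }

    encryptXml('vessels-clientA.xml', 'client-a-password', 'vessels-clientA.xml.enc');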
Aloha, Stackoverflow.
I frequently come across web applications, and wonder to myself, "How could I write a script/application which would interface with that?" (purely academic, not for spamming purposes!).
For example, the website Omegle: people have written Python scripts to interface with the website and run a chat without opening the browser... how? I will admit that web programming is not my strongest area, but I would really like to know how one could extract the protocol being used by such applications, and use this knowledge to create custom apps and tinker with the service.
So basically, how can I figure out the inner workings of a web app (e.g. imeetzu.com) such that I can write code to interface with it from my desktop?
Thank you in advance!
You'll need a set of tools to start with:
A browser with a debugging window (Chrome is particularly good for this). This will allow you in particular to access the network calls that your browser directly makes (there's a caveat coming), and to see:
their content
their parameters
their target
A network packet sniffer to trace down anything that goes through Flash (or WebSockets). I'm quite fond of Ethereal (now called Wireshark), though if you're in the US, you could be breaking the law by using it (depends on the use you make of it). This will allow you to see every TCP frame that enters and leaves your network interface.
The knowledge you will need:
Ability to identify and isolate a network stream. This comes through practice
Knowledge of the language the app you are trying to reverse-engineer is written in. If JavaScript isn't your cup of tea, avoid JS-based stuff
Maths and cryptography. Data may very well be encrypted, obfuscated or hidden with steganography from time to time. Be aware and look out for it.
In this particular case, looks like you might have to deal with Flash. There are additional resources to help on this, although all of them are non-free. There is one particularly good Flash decompiler called SoThink SWF decompiler, which allows you to turn a SWF into a FLA or a collection of AS sources.
That's all for the tools. The method is easy: look at what data comes in and out and figure out by elimination what is what. If it's encrypted, you'll need IVs and samples to have any hope of breaking it (or just decompile the code and find out how the key/handshake is done). This is a very, very extensive field and I haven't even touched the tip of the iceberg with this - feel free to ask for more info.
(How do I know all this? I was a contributor to the eAthena project, which reverse-engineered a game protocol)
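As a small illustration of the replay step: once you have isolated a call in the debugger or the packet capture, reproducing it from your own code is usually straightforward. The sketch below assumes Node 18+ (for the built-in fetch) and a completely hypothetical JSON endpoint and payload:

    // Replay an HTTP call observed in the browser's network tab.
    // The endpoint and payload fields are made up for illustration.
    (async () => {
      const response = await fetch('https://example-app.com/api/start_chat', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'User-Agent': 'my-experiment/0.1', // identify yourself honestly
        },
        body: JSON.stringify({ topic: 'test' }),
      });
      console.log(await response.json());
    })();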
I was wondering what would be the most ethical way to consume a few bytes (386, precisely) of content from a given Site A, with an application (e.g. on Google App Engine) on some Site B, but doing it right. No scraping intended; I really just need to check the status of a public service, and they currently don't provide any API. The markup on Site A has a JavaScript array with the info I need, and being able to access that, let's say, once every five minutes would suffice.
Any advice will be much appreciated.
UPDATE:
First of all, thanks very much for the feedback. Site A is basically the website of the company that currently runs our public subway network, so I'm planning to develop a tiny free Android app for anyone to have not only a map of the whole network and its stations but also updated information about the availability of the service (and those are the bytes I will eventually be consuming), and so on.
There will be some very different points of view, but hopefully here is some food for thought:
Ask the site owner first, if they know ahead of time they are less likely to be annoyed.
Is the content on Site A accessible on a public part of the site, e.g. without the need to log in?
If the answer to #2 is that it is public content, then I wouldn't see an issue, as scraping the site for that information is really no different from pointing your browser at the site and reading it for yourself.
Of course, the answer to #3 depends on how the site is monetised. If Site A displays advertising to generate revenue for the site, then it might not be a good idea to start scraping content, as you would be bypassing how the site makes money.
I think the most important thing to do, is talk to the site owner first, and determine straight from them if:
Is it OK for me to scrape content from their site?
Do they have an API in the pipeline? (Simply highlighting the demand may prompt them to consider it.)
Just my point of view...
Update (4 years later): The question specifically embraces the ethical side of the problem. That's why this old answer is written in this way.
Typically in such situation you contact them.
If they don't like it, then ethically you can't do it (legally is another story, depending on whether the site grants a licence or not, what login/anonymity or other restrictions they have for access, whether you have to use test/fake data, etc.).
If they allow it, they may provide an API (which might involve costs; it will be up to you to determine how much the feature is worth to your app), or promise some sort of expected behaviour for you, which might itself be scraping, or whatever other option they decide.
If they allow it but aren't ready to help make it easier, then scraping (with its other downsides still applicable) will be all right, at least "ethically".
I would not touch it save for emailing the site admin, then getting their written permission.
That being said, if you're consuming the content yet not extracting value beyond the value a single user gets when observing the data you need from them, it's arguable that any TOU they have wouldn't find you in violation. If, however, you get noteworthy value beyond what a single user would get from the data you need from their site (let's say you use the data and your results end up providing value to 100x as many of your own site's users), I'd say you need express permission to do that, to sleep well at night.
All that's off, however, if the info is already in the public domain (and you can prove it), or the data you need from them is under some type of open licence such as one from GNU.
Then again, the web is nothing without links to others' content. We all capture and then re-post stuff on various forums, say: we read an article on CNN, then comment on it in an online forum, maybe quote the article, and provide a link back to it. It just depends, I guess, on how flexible and open-minded the site's admin and owner are. But really, to avoid being sued (if push comes to shove) I'd get permission.
Use a user-agent header which identifies your service.
Check their robots.txt (and re-check it at regular intervals, e.g. daily).
Respect any Disallow in a record that matches your user agent (be liberal in interpreting the name). If there is no record for your user-agent, use the record for User-agent: *.
Respect the (non-standard) Crawl-delay, which tells you how many seconds you should wait before requesting a resource from that host again.
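A rough sketch of those two points, assuming Node 18+ (built-in fetch) and a deliberately naive robots.txt parser; the user-agent string and URL handling are illustrative:

    // Identify the client via User-Agent and honour Crawl-delay from robots.txt.
    const USER_AGENT = 'subway-status-bot/1.0 (contact@example.com)'; // identifies your service

    async function politeFetch(url) {
      const origin = new URL(url).origin;
      const robots = await (await fetch(origin + '/robots.txt')).text();
      // Very naive parsing: take the first Crawl-delay found. A real client would match
      // its own user-agent record and honour the Disallow rules as well.
      const match = robots.match(/Crawl-delay:\s*(\d+)/i);
      const delaySeconds = match ? Number(match[1]) : 0;
      await new Promise((resolve) => setTimeout(resolve, delaySeconds * 1000));
      return fetch(url, { headers: { 'User-Agent': USER_AGENT } });
    }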
"no scraping intended" - You are intending to scrape. =)
The only reasonable ethics-based reasons not to take it from their website are:
They may wish to display advertisements or important security notices to users
This may make their statistics inaccurate
In terms of hammering their site, it is probably not an issue. But if it is:
You probably wish to scrape the minimal amount necessary (e.g. make the minimal number of HTTP requests), and not hammer the server too often.
You probably do not wish to have all your apps query the website; you could have your own website query them via a cronjob. This will allow you better control in case they change their formatting, or let you throw "service currently unavailable" errors to your users, just by changing your website; it introduces another point of failure, but it's probably worth it. This way if there's a bug, people don't need to update their apps.
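A sketch of that middle layer, assuming Node 18+ and entirely hypothetical URLs and markup: poll the upstream page every five minutes, cache the extracted status, and serve the cached value to your own apps:

    // Small caching middle layer: poll the upstream page every 5 minutes,
    // extract the status, and serve the cached value to your own clients.
    // The URL and the extraction step are hypothetical.
    const http = require('http');

    let cachedStatus = { available: null, fetchedAt: null };

    async function refresh() {
      try {
        const html = await (await fetch('https://subway-operator.example.com/status')).text();
        // Extraction depends entirely on the upstream markup; this test is a placeholder.
        cachedStatus = { available: /service normal/i.test(html), fetchedAt: new Date() };
      } catch (err) {
        console.error('upstream fetch failed:', err.message); // keep serving the last good value
      }
    }

    refresh();
    setInterval(refresh, 5 * 60 * 1000);

    http.createServer((req, res) => {
      res.setHeader('Content-Type', 'application/json');
      res.end(JSON.stringify(cachedStatus));
    }).listen(8080);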
But the best thing you can do is to talk to the website, asking them what is best. They may have a hidden API they would allow you to use, and perhaps have allowed others to use as well.