Closed. This question is opinion-based. It is not currently accepting answers.
The company I work for has a requirement to protect an area where articles are rendered. I've implemented some measures against automated web scraping, but the problem remains for manual scraping.
The anti-scraping bot protection seems to be working well so far, but I can see clients attempting manual scraping.
What I have tried to protect the article contents:
Set a copy event handler on the article's wrapper element to prevent copying.
-> Clients can use userscripts (Greasemonkey, etc.) to bypass this easily by removing the event handler, or simply write scripts that copy the content and save it to a file (see the sketch after this list)
Developer console protection -> useless
Redirect if F12 pressed -> useless
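For reference, the copy-blocking attempt above boils down to something like the following, and a userscript can sidestep it entirely; this is a simplified sketch (the wrapper selector is hypothetical), not my exact code:

```javascript
// What the "block copy" attempt amounts to (simplified sketch).
const article = document.querySelector('.article-wrapper'); // hypothetical wrapper selector

function blockCopy(event) {
  event.preventDefault(); // stop the selected text from reaching the clipboard
}

article.addEventListener('copy', blockCopy);

// What a Greasemonkey/Tampermonkey userscript can do to defeat it:
// read the text straight out of the DOM, never triggering the copy event at all.
const stolenText = article.innerText;
console.log(stolenText); // or POST it anywhere, save it to a file, etc.
```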
It seems like protecting HTML is not feasible (unless someone tells me otherwise), so I'd like to know other ways to display text and render it completely impossible to copy.
Things I've thought:
JS detection mechanisms to check whether the user has any sort of userscript running, in other words, whether malicious JS code is being injected and executed to extract the text
Transforming the article's HTML into a PDF and displaying it inline with some sort of text-selection/copy protection (if such a thing even exists).
Transforming the article's HTML into chunks of base64-encoded images, which makes the text impossible to select and copy (see the sketch below)
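As a rough illustration of the image idea, here is a minimal client-side sketch that paints a paragraph onto a canvas and swaps it in, so no selectable text node is left; the class name is hypothetical and a real implementation would need line wrapping, fonts, and accessibility handling:

```javascript
// Minimal sketch: replace a text element with a canvas rendering of its text.
// Assumes a single short line of text; real articles need proper line wrapping.
function replaceWithCanvas(el) {
  const text = el.innerText;
  const style = getComputedStyle(el);
  const canvas = document.createElement('canvas');
  canvas.width = el.offsetWidth;
  canvas.height = el.offsetHeight;
  const ctx = canvas.getContext('2d');
  ctx.font = `${style.fontSize} ${style.fontFamily}`;
  ctx.fillStyle = style.color;
  ctx.textBaseline = 'top';
  ctx.fillText(text, 0, 0);
  el.replaceWith(canvas); // no text node left to select or copy
}

document.querySelectorAll('.article-paragraph').forEach(replaceWithCanvas); // hypothetical class
```

Note that the text still travels to the browser before being painted, so a userscript can grab it before the swap (and OCR can read the image); this only raises the effort bar.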
Are there any good ways to prevent my content from being stolen without interfering much with the user experience? Unfortunately Flash applets are no longer supported; that approach used to work like a charm back in the day.
EDIT: Come on folks, I just need ideas to at least make the end user's efforts a bit harder, e.g. you can't select text if it's displayed as images; you can only select the images themselves.
Thanks!
As soon as you ship HTML out of your machine, whoever gets it can mangle it at leisure. You can make it harder, but not impossible.
Rethink your approach. "Give information out" and "forbid its use" somewhat clash...
No, You Can't
Once the browser has loaded your page, you can't protect the content from being copied or downloaded.
Whether it's text, images, or video, you can protect it from unauthorized access, but you can't prevent it from being scraped by an authorized person.
But you can make it harder using the steps you mentioned in your question and by relying on copyright law.
This issue still exists on many sites, especially e-learning platforms such as Udemy. On those sites, premium courses still get copied/leaked by the people who bought them.
From Udemy FAQ
For a motivated Pirate, however, any content that appears on a computer screen is vulnerable to theft. This is unavoidable and a problem across the industry. Giants like Netflix, Youtube, Amazon, etc. all have the same issue, and as an industry, we continue to work on new technology solutions to limit Piracy.
Because pirating techniques currently outpace protection, we hired a company who is specifically dedicated to enforcing the DMCA laws on your behalf and target violating individuals, hosting sites, and DNS servers in an attempt to get any unauthorized content removed.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Is there any tutorial or document explaining how to create a Hotjar-like application? Can anyone provide any insights into this?
I want to create an application that analyses user behaviour, for example:
what percentage of the page the user has scrolled
which part of the DOM the user has clicked on
and create a report using heatmap.js for just one of my static sites/pages
I've made the reports using static data and heatmaps.
Now I just want to track user activities like scroll points and mouse hover/click points, which can differ across screen sizes and devices.
Are there any APIs or JS frameworks that can help?
You won't find a single tutorial on this, just as you won't find a "build a web search engine yourself" tutorial. Website user tracking is a very complex topic. Developing such a solution will require huge effort and investment. It would require expensive infrastructure as well (servers to collect data from users and process it).
Additionally, there are some risks and problems to it as well. User privacy is a hot topic now because of the questionable morality of user tracking. Web users' awareness grows all the time, and a lot of users choose to opt out of web tracking. The industry follows and keeps expanding users' possibilities to do that (both technical and legal).
If you still want to proceed, learn as much as you can about web user tracking. Search for buzz-phrases like "user tracking methods", "user tracking techniques", and "web analytics".
Then when it comes to implementation:
In the browser (client-side)
Implement individual user identification in order to correlate their actions across a website / multiple websites (fingerprinting). In fact, this might require some work on the server too.
Record as much of the user's interactions as possible - clicking, scrolling, dragging, keypresses, navigation between pages, typing in inputs (watch out for sensitive data - passwords, addresses), etc. This data will be the basis for user behavior analysis on the server later on. Also, combined with the captured state of the DOM (initial and later mutations), it allows you to create "recordings" of user browsing sessions. I've put the word "recordings" in quotes because, contrary to what many think, these videos are usually not created by recording the user's screen. That would be far too complicated and bandwidth-heavy. Instead, they are composed from the collected pieces of data mentioned earlier. Read more about this (IMHO the most interesting) topic in the answer to "How does HotJar generate their recordings?".
Implement sending the above information SECURELY to the server for analysis.
Again: make your solution SECURE. The data you will be collecting is highly sensitive and shouldn't fall into the wrong hands, for your own sake and that of the site's users. You'll probably have to fund a Bug Bounty program (I warned you it'd be expensive!) like HotJar did.
Figure out a way to inject your tracking application into the website's code (in all popular browsers - don't forget the legacy ones!). In most cases this requires the site owner to put a small piece of JS on every page of their website (a typical loader snippet is sketched below). Read Erik Näslund's (HotJar architect) answer to "Why do websites like Hotjar and Google Analytics use complex tracking code instead of just a <script> tag?" to find out more about what this script looks like and why.
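For context, the "small piece of JS" that site owners paste is usually just an asynchronous loader along these lines; the tracker name and CDN URL here are made up for illustration:

```javascript
// Typical async loader snippet (hypothetical tracker name and URL).
(function (w, d) {
  w._myTrackerQueue = w._myTrackerQueue || [];              // queue calls made before the script loads
  w.myTracker = function () { w._myTrackerQueue.push(arguments); };
  var s = d.createElement('script');
  s.async = true;                                           // don't block page rendering
  s.src = 'https://cdn.example-tracker.com/tracker.js';     // hypothetical CDN URL
  var first = d.getElementsByTagName('script')[0];
  first.parentNode.insertBefore(s, first);
})(window, document);
```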
On the server (the relatively easy part)
Implement data processing and produce reports - heatmaps, session recordings, clickstreams.
I did a very simple POC covering some of the above-mentioned client-side stuff some time ago. It's a simple script that watches DOM changes and (some) user events and logs them to the console when injected into a web page. In a full-blown solution this would be sent to the server for processing instead of being written to the console.
These recorded events (DOM changes along with timestamps and user input/events) could be used to reliably reproduce what the users see in the browser window and how they interact with it. A minimal sketch of such a script follows.
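Here is what that kind of script might look like, logging to the console only; a real tool would batch these records and send them to a server:

```javascript
// Minimal sketch: watch DOM mutations and a few user events, log them with timestamps.
const log = (type, detail) => console.log({ t: Date.now(), type, detail });

// Record DOM changes so the session can later be "replayed".
const observer = new MutationObserver((mutations) => {
  mutations.forEach((m) => log('mutation', { target: m.target.nodeName, kind: m.type }));
});
observer.observe(document.documentElement, {
  childList: true, attributes: true, characterData: true, subtree: true,
});

// Record a few user interactions (coordinates are undefined for non-mouse events).
['click', 'scroll', 'mousemove', 'keydown'].forEach((type) => {
  document.addEventListener(type, (e) => {
    log(type, { x: e.clientX, y: e.clientY, scrollY: window.scrollY });
  }, { passive: true, capture: true });
});
```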
Closed. This question needs to be more focused. It is not currently accepting answers.
I want to send a client a link to some work I am doing for them, but they are fairly well informed about IT, which leads me to think they know how to copy and paste some HTML and CSS. How would I go about stopping them from seeing the HTML, CSS, and JS of the page I want to send them?
Unfortunately, this cannot be done effectively. While it is true that HTML and CSS can be minified, there are a large number of free utilities designed to reverse the minification, or "beautify", whatever you have minified.
There is a range of other methods that are used on occasion, but they don't really do anything to protect the source from anyone except those who wouldn't care about the source anyway.
Source Code Padding
This is one of the oldest tricks in the book. It involves adding a ton of white space before the start of the source so that when the user opens the view-source window it appears blank. However, almost everyone these days will notice the scroll bars and scroll down the page until they hit the source. It also has the extra negative effect of degrading your site's performance, since it substantially increases the amount of data sent to the client.
No Right Click Scripts
These sorts of scripts prevent the user from right-clicking on the page and opening the context menu. However, they are notoriously hard to get working across browsers, they annoy users who don't like the native functionality and usability of their browser being altered without permission, and they make no real difference, since the source code window can still be opened from the top menu.
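For completeness, this is essentially all such a script does, which is why it is trivial to ignore via the browser menu or by disabling JS:

```javascript
// Typical "no right click" script: suppress the context menu.
document.addEventListener('contextmenu', function (event) {
  event.preventDefault(); // the page source is still one keyboard shortcut away
});
```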
Javascript "Encryption"
This is a very popular method to supposedly "protect" the source code of the page. It involves taking the code and using a custom-made function to encrypt the script before pasting it into the HTML file, and then embedding JavaScript in that file to decrypt the markup at run time. This only works if the end user has JavaScript enabled, and it is not effective in protecting the page from other designers and coders, because you need to embed the decryption JavaScript in your page, which the user can use to reverse your markup and see the plain-text markup.
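A typical example of this pattern, here using base64 rather than real encryption; because the decoding call sits right in the page, anyone can paste the same string into the console and read the markup:

```javascript
// "Protected" markup: the base64 string below decodes to <p>Hello</p>.
document.write(atob('PHA+SGVsbG88L3A+'));
// Anyone can run atob('PHA+SGVsbG88L3A+') themselves and see the plain markup.
```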
HTML Protection Software
There is software out there that is marketed as protecting HTML and CSS; however, these protection packages generally use one of the above methods and charge you for the privilege of a false belief that your code is actually safe.
Unfortunately, given the way the internet is designed to work with HTML and CSS, this is not possible, and it won't be without such a drastic shift in the web landscape and the way websites are designed that I personally don't see it ever occurring.
Information sourced from http://www.htmlgoodies.com/beyond/article.php/3875651/Web-Developer-Class-How-to-Hide-your-Source-Code.htm
If your concern is that they'll steal your work, then maybe you shouldn't be working with them.
Also, if you have a good contract in place that specifies who owns the work at which stage of the process, this won't be an issue. In other words, if it's clear that you own the work until it's paid in full, you could sue them if they steal it.
Although it won't stop people from stealing your code, you can make it harder to do so using minification - this will remove whitespace and translate variables to harder-to-read names, among other things. It will also reduce the footprint of your code, increasing page load speed.
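As a quick illustration, minification only makes the code less pleasant to read; a beautifier restores the formatting (though not the original names). A minifier might turn the first function below into something shaped like the second:

```javascript
// Before minification: readable names and whitespace.
function calculateTotalPrice(unitPrice, quantity, taxRate) {
  const subtotal = unitPrice * quantity;
  return subtotal + subtotal * taxRate;
}

// After minification: same behaviour, harder to read (typical output shape).
function c(a,b,d){var e=a*b;return e+e*d}
```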
You can't do anything except obfuscate and minify it. There are several utilities to do this if you google it.
Closed. This question is opinion-based. It is not currently accepting answers.
There are a lot of features and abilities of JavaScript that I am unaware of. I have developed a custom closed-source CMS, and I was thinking about adding a feature that allows custom JavaScript to be included on each page of a client's site (but not in the backend system itself). I'm curious about the risks involved in doing this. The CMS is built using PHP, and there is JavaScript within the backend system of the CMS, but that's pretty much it.
If I allow custom JavaScript, can it be manipulated to retrieve all the PHP code, or to cause issues on the server itself?
I own the servers, so I can make any adjustments necessary to safeguard them.
Again, this is purely for information and I appreciate any advice people can give me.
The JavaScript will be stored in a file and included on the page itself using PHP. I do have code that blocks anything inside PHP tags, to prevent the use of PHP within the uploaded code itself.
Can they steal my closed-source PHP code with JavaScript?
To answer your first question, no, your closed-source PHP code cannot be stolen by a user of your CMS software simply by uploading a JavaScript snippet.
This is because JavaScript runs on the client-side (the web browser).
If JavaScript were able to access your PHP code from the client side, then it would be accessible without JavaScript as well. That would mean you've configured something wrong on the web server side, like setting permissions on your files so that anyone can view them.
Is allowing JavaScript to be uploaded by a CMS user a good idea?
You'll get some folks who will scream ABSOLUTELY NOT UNDER ANY CIRCUMSTANCE. These are the same people who say things like:
Using eval() is always evil. It's not always evil, but it's almost always unnecessary.
Using global or $_GLOBALS in PHP is evil. Again, it's only evil if you don't know what you are doing. And again, it's almost always unnecessary.
You should read that as a WARNING. Don't treat this issue lightly, if you are careful, you can do it, but if you are not, it can really bite you in the a**. That's reason enough for most people to stay away from it.
Before you decide for sure if you should or shouldn't allow users of your CMS solution to upload JavaScript snippets, you should ask yourself the following question:
Who will be allowed to upload JavaScript snippets?
If the only people who have access to this feature of uploading JavaScript modules are trusted system administrators, then you should consider it "safe". I put that in quotes because it's not really safe, but it does, at that point, fall on these trusted users to ensure that they don't upload something malicious.
Maybe you get Mary Neophyte, webmaster (amateur) extraordinaire, who decides she wants a cool scriptlet on her CMS front page that displays the current weather in Anchorage, Alaska. She goes to Google, types in "JavaScript weather script", and arrives at Weather Channel. She decides their implementation is just too hard to install. She keeps looking. She arrives at Boris' Weather Script at http://motherrussia.ru/ilovehackingidiots/weatherscript.html.
This isn't your fault when her CMS starts compromising her end users. She was the trusted administrator who uploaded a malicious script purposefully (though ignorantly). You shouldn't be held responsible for this type of behavior.
Long story short, you should be able to trust the trusted users of your CMS to be responsible enough to know what they are uploading. If they shoot themselves in the foot, that's not on you.
Allowing non-trusted users to upload JavaScript
This absolutely, positively, without a doubt is never something that you should do. It is impossible for you to screen every possible obfuscation that someone could upload.
I'm not even going to get into this further. Don't do it. Period.
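To illustrate why screening is hopeless, here is a hypothetical snippet that never contains the string "eval" or an obvious script URL, so it would slip past any naive keyword filter, yet it executes whatever the attacker encoded:

```javascript
// Hypothetical obfuscated payload: defeats naive "search for eval/script" filters.
// The decoded string here is just alert('gotcha'), but it could be anything.
var run = window['ev' + 'al'];          // builds the function name at runtime
run(atob('YWxlcnQoJ2dvdGNoYScp'));      // base64 for: alert('gotcha')
```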
Regarding HTML/CSS
Don't assume that malicious code can't make it onto your website via HTML/CSS. While HTML is much easier to sanitize than JavaScript, it can still be exploited to deliver undesired JavaScript to a page.
If you are only allowing trusted users to upload HTML/CSS, then don't worry too much about it. I stress again, it is Mary Neophyte's fault if she uploads Boris' Weather Script to her site. However, don't let Boris himself come to your website and start uploading anything that will get displayed on a web page to anyone but ol' Boris himself.
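If you do accept HTML from less-trusted users, run it through a well-vetted sanitizer rather than your own regexes. A minimal sketch using the DOMPurify library (assumed to be loaded on the page; its default profile already strips script tags and event-handler attributes, and the whitelist options shown narrow it further):

```javascript
// Minimal sketch, assuming DOMPurify (https://github.com/cure53/DOMPurify) is loaded on the page.
const dirty = '<p onclick="stealCookies()">Hi</p><img src=x onerror=stealCookies()>';
const clean = DOMPurify.sanitize(dirty, {
  ALLOWED_TAGS: ['p', 'a', 'em', 'strong', 'img'], // a whitelist, not a blacklist
  ALLOWED_ATTR: ['href', 'src', 'alt'],
});
// clean keeps the <p> and <img> but drops the onclick/onerror handlers.
```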
TL;DR
I'll summarize everything into two rules:
Don't allow untrusted users to upload anything that will be displayed to anyone other than themselves.
Don't let anyone upload anything at all that gets executed server-side.
Allowing custom JavaScript would probably be a very bad idea. That would make your site vulnerable to cross-site scripting attacks and allow it to be a vector for cross-site request forgery attacks against other sites.
Closed. This question is off-topic. It is not currently accepting answers.
Aloha, Stackoverflow.
I frequently come across web applications, and wonder to myself, "How could I write a script/application which would interface with that?" (purely academic, not for spamming purposes!).
For example, take the website Omegle: people have written Python scripts to interface with it and run a chat without opening the browser... how? I will admit that web programming is not my strongest area, but I would really like to know how one could work out the protocol used by such applications and use that knowledge to create custom apps and tinker with the service.
So basically, how can I figure out the inner workings of a web app (e.g. imeetzu.com) so that I can write code to interface with it from my desktop?
Thank you in advance!
You'll need a set of tools to start with:
A browser with a debugging window (Chrome is particularly good for this). This will allow you in particular to access the network calls that your browser directly makes (there's a caveat coming), and to see:
their content
their parameters
their target
A network packet sniffer to trace down anything that goes through Flash (or WebSockets). I'm quite fond of Ethereal (now called Wireshark), though if you're in the US, you could be breaking the law by using it (depends on the use you make of it). This will allow you to see every TCP frame that enters and leaves your network interface.
The knowledge you will need:
Ability to identify and isolate a network stream. This comes through practice
Knowledge of the language the app you are trying to reverse-engineer is written in. If JavaScript isn't your cup of tea, avoid JS-based stuff
Maths and cryptography. Data may very well be encrypted/obfuscated/stegg-ed from time to time. Be aware and look out for it.
In this particular case, looks like you might have to deal with Flash. There are additional resources to help on this, although all of them are non-free. There is one particularly good Flash decompiler called SoThink SWF decompiler, which allows you to turn a SWF into a FLA or a collection of AS sources.
That's all for the tools. The method is easy - look at what data comes in and out and figure out by elimination what is what. If it's encrypted, you'll need IVs and samples to have any hope of breaking it (or just decompile the code and find out how the key/handshake is done). This is a very, very extensive field and I haven't even touched the tip of the iceberg here - feel free to ask for more info.
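Once you've isolated an interesting request in the debugger or the sniffer, the usual next step is to replay it from your own code and compare the response; here is a sketch in Node.js (18+, for the global fetch), where the endpoint and parameters are completely made up:

```javascript
// Sketch: replay a request observed in the browser's network tab (hypothetical endpoint).
const params = new URLSearchParams({ roomId: '12345', action: 'poll' }); // guessed parameters

fetch('https://chat.example.com/api/events?' + params, {   // made-up URL for illustration
  headers: {
    'User-Agent': 'my-protocol-experiment/0.1',
    'Accept': 'application/json',
  },
})
  .then((res) => res.json())
  .then(console.log)     // compare this with what the real client receives
  .catch(console.error);
```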
(How do I know all this? I was a contributor to the eAthena project, which reverse-engineered a game protocol)
Closed. This question is opinion-based. It is not currently accepting answers.
I was wondering what would be the most ethical way to consume some bytes (386, to be precise) of content from a given Site A with an application (e.g. on Google App Engine) on some Site B, while doing it right. No scraping intended: I really just need to check the status of a public service, and they currently don't provide any API. The markup on Site A contains a JavaScript array with the info I need, and being able to access it, say, once every five minutes would suffice.
Any advice will be much appreciated.
UPDATE:
First of all, thanks very much for the feedback. Site A is basically the website of the company that currently runs our public subway network, so I'm planning to develop a tiny free Android app that gives anyone not only a map of the whole network and its stations but also up-to-date information about the availability of the service (those are the bytes I will eventually be consuming), et cetera.
There will be some very different points of view, but hopefully here is some food for thought:
1. Ask the site owner first; if they know ahead of time, they are less likely to be annoyed.
2. Is the content on Site A accessible on a public part of the site, i.e. without the need to log in?
3. If the answer to #2 is that it is public content, then I wouldn't see an issue, as scraping the site for that information is really no different than pointing your browser at the site and reading it for yourself.
4. Of course, the answer to #3 depends on how the site is monetised. If Site A shows advertising to generate revenue for the site, then it might not be a good idea to start scraping content, as you would be bypassing how the site makes money.
I think the most important thing to do is talk to the site owner first, and determine straight from them:
Is it OK for me to be scraping content from their site?
Do they have an API in the pipeline (simply highlighting the desire may prompt them to consider it)?
Just my point of view...
Update (4 years later): The question specifically addresses the ethical side of the problem. That's why this old answer is written the way it is.
Typically in such situation you contact them.
If they don't like it, then ethically you can't do it (legally it's another story, depending on whether the site provides a license or not, what login/anonymity or other restrictions they place on access, whether you have to use test/fake data, etc.).
If they allow it, they may provide an API (which might involve costs - it will be up to you to determine how much the feature is worth to your app), or promise some sort of expected behavior for you, which might itself be scraping, or whatever other option they decide on.
If they allow it but aren't ready to help make it easier, then scraping (with its other downsides still applicable) will be all right, at least "ethically".
I would not touch it save for emailing the site admin, then getting their written permission.
That being said -- if you're consuming the content yet not extracting value beyond the value a single user gets when observing the data you need from them, it's arguable that any TOU they have wouldn't find you in violation. If, however, you get noteworthy value beyond what a single user would get from the data you need from their site -- i.e., let's say you use the data and your results end up providing value to 100x as many users on your own site -- I'd say you need express permission to do that, to sleep well at night.
All that's off, however, if the info is already in the public domain (and you can prove it), or the data you need from them is under some type of 'open license' such as from GNU.
Then again, the web is nothing without links to others' content. We all capture and then re-post stuff on various forums, say -- we read an article on CNN, then comment on it in an online forum, maybe quote the article, and provide a link back to it. It just depends, I guess, on how flexible and open-minded the site's admin and owner are. But really, to avoid being sued (if push comes to shove) I'd get permission.
Use a user-agent header which identifies your service.
Check their robots.txt (and re-check it at regular intervals, e.g. daily).
Respect any Disallow in a record that matches your user agent (be liberal in interpreting the name). If there is no record for your user-agent, use the record for User-agent: *.
Respect the (non-standard) Crawl-delay directive, which tells you how many seconds you should wait before requesting a resource from that host again (a minimal sketch of these steps follows below).
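Here is a minimal Node.js (18+, for the global fetch) sketch of that checklist; the bot name and URLs are made up, and the robots.txt parsing is deliberately naive (it only looks at Disallow lines, across all records):

```javascript
// Naive sketch: identify yourself, honour robots.txt Disallow rules before fetching.
// 'StatusCheckBot' and the URLs are hypothetical; a real client should also group
// rules per User-agent record and respect Crawl-delay.
const UA = 'StatusCheckBot/1.0 (+https://example.org/bot-info)';

async function fetchIfAllowed(url) {
  const origin = new URL(url).origin;
  const robotsTxt = await (await fetch(origin + '/robots.txt')).text();

  // Very naive check: refuse if any Disallow rule prefixes our path.
  const path = new URL(url).pathname;
  const disallowed = robotsTxt
    .split('\n')
    .filter((line) => line.toLowerCase().startsWith('disallow:'))
    .map((line) => line.split(':')[1].trim())
    .some((rule) => rule && path.startsWith(rule));
  if (disallowed) throw new Error('robots.txt disallows ' + path);

  return (await fetch(url, { headers: { 'User-Agent': UA } })).text();
}

fetchIfAllowed('https://subway.example.com/status').then(console.log).catch(console.error);
```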
"no scraping intended" - You are intending to scrape. =)
The only reasonable ethics-based reasons one should not take it from their website are:
They may wish to display advertisements or important security notices to users
This may make their statistics inaccurate
In terms of hammering their site, it is probably not an issue. But if it is:
You probably wish to scrape the minimal amount necessary (e.g. make the minimal number of HTTP requests), and not hammer the server too often.
You probably do not wish to have all your apps query the website directly; instead, you could have your own website query it via a cronjob. This gives you better control in case they change their formatting, and lets you show "service currently unavailable" errors to your users just by changing your website; it introduces another point of failure, but it's probably worth it. That way, if there's a bug, people don't need to update their apps (a sketch of such a proxy is shown after this list).
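A sketch of that middle layer, using Express and a simple in-memory cache; the target URL, the refresh interval, and the pattern used to pull the JavaScript array out of the markup are all assumptions:

```javascript
// Sketch: poll Site A every five minutes, serve the cached result to all app clients.
// Assumes Node 18+ (global fetch) and Express; the URL and parsing are hypothetical.
const express = require('express');

let cache = { updatedAt: null, status: null };

async function refresh() {
  try {
    const html = await (await fetch('https://subway.example.com/status')).text();
    // Pull the JavaScript array mentioned in the question out of the markup (made-up pattern).
    const match = html.match(/var serviceStatus = (\[.*?\]);/s);
    cache = { updatedAt: new Date().toISOString(), status: match ? JSON.parse(match[1]) : null };
  } catch (err) {
    console.error('refresh failed, keeping last good value:', err.message);
  }
}

refresh();
setInterval(refresh, 5 * 60 * 1000); // once every five minutes, as in the question

const app = express();
app.get('/status', (req, res) => {
  if (!cache.status) return res.status(503).json({ error: 'service currently unavailable' });
  res.json(cache);
});
app.listen(3000);
```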
But the best thing you can do is to talk to the website, asking them what is best. They may have a hidden API they would allow you to use, and perhaps have allowed others to use as well.