I'm new to GWT and am trying to build a web scraping app.
I have a custom URL, say www.amazon.com.
I want to be able to open this URL, scrape information from its source (preferably by storing the HTML content as a Document in GWT), and print the scraped info on the console.
I've tried creating an iframe in the current page and setting its src to the custom URL, but that didn't work out.
Do tell me if you need me to elaborate on or clarify any aspect of the question.
Thanks!
Scrape the URL on the server and write the output through a servlet, then open that servlet's URL in a dialog box in GWT.
Alternatively, if the content of the scraped page can be stored in another format, such as a HashMap, you can use RPC to fetch the data object from the server and display it in an HTML panel on the client side.
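In case it helps to see the shape of the second approach, here is a minimal sketch in plain JavaScript (in GWT you would use RequestBuilder or GWT-RPC instead); the /scrape servlet path and the JSON response format are assumptions for illustration only:

    // Ask your own server to scrape the target URL; the browser's
    // same-origin policy stops the client from fetching www.amazon.com directly.
    fetch('/scrape?url=' + encodeURIComponent('https://www.amazon.com'))
      .then(function (response) { return response.json(); })
      .then(function (data) {
        // Whatever the hypothetical servlet extracted, e.g. a map of product fields.
        console.log(data);
      });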
Thanks,
Sreehari.
I am developing a small labeling tool that, given a URL, should display the document hosted at that URL and allow a user to choose a label for it.
I want to display the contents of the URL for this purpose. As far as I know, I can either fetch the URL's content, parse it, and display it myself, or use an iframe.
Without using a parser
Iframes are not enabled for the target URL whose contents I want to display. Is there any other way to do this using JavaScript, without using a parser?
Using a parser
I can crawl the contents of the URL, get everything between the <body> tags, and dump it into the page area.
I'm new to JavaScript and front-end development, so I am not sure whether these are the only options.
Are there other options to do this?
If the parser is the only option, can I dump the HTML that I get from the remote URL? I understand that images and other media within the remote URL won't be displayed. Is there any other caveat to this method? More importantly, is this the best way to do this?
Most sites do this via an iframe, as you mentioned; CodePen is one example.
Also, you can use Puppeteer (a Node library that drives headless Chrome) to do this sort of thing: get the contents by scraping, take a screenshot, or print a PDF. It's a pretty nifty library; a short sketch follows the quoted feature list below.
Most things that you can do manually in the browser can be done using Puppeteer! Here are a few examples to get you started:
- Generate screenshots and PDFs of pages.
- Crawl a SPA (Single-Page Application) and generate pre-rendered content (i.e. "SSR" (Server-Side Rendering)).
- Automate form submission, UI testing, keyboard input, etc.
- Create an up-to-date, automated testing environment. Run your tests directly in the latest version of Chrome using the latest JavaScript and browser features.
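A minimal Puppeteer sketch that grabs a page's rendered HTML and a screenshot; the target URL and output filename are placeholders:

    // Node.js script using the puppeteer package (npm install puppeteer).
    const puppeteer = require('puppeteer');

    (async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();

      // Placeholder URL; point this at the document you want to label.
      await page.goto('https://example.com', { waitUntil: 'networkidle2' });

      // Fully rendered HTML, including content added by JavaScript.
      const html = await page.content();
      console.log(html);

      // Or capture the page visually instead of embedding it.
      await page.screenshot({ path: 'page.png', fullPage: true });

      await browser.close();
    })();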
Hope this helps!
So my ultimate problem is that I have a SharePoint list where each list item may have multiple image attachments. I am looking to scrape the list using PowerShell so that I can back up all the images.
I am able to access each item's page in the list because of similarities in the URLs, but I am unable to extract the attachments, because the filenames are not predictable. Unfortunately, I don't seem to be able to parse the info with Invoke-WebRequest, because it brings back the HTML of the page, which does not list the file attachments.
Instead, the file attachments can be viewed when you use the 'Inspect page source' button, which I believe is because they are inside a JavaScript function.
So, my question is: can I get each file in a page's attachments from the JavaScript function so that I can scrape the page? Also, am I interpreting this problem correctly, and are there any other ways to solve it?
Please note: I don't have access to the SharePoint server DLLs, including Microsoft.Sharepoint.dll, so I can't use classes from that DLL (unless they can be easily imported without having to install the whole library).
Here is a photo of where the source changes; I believe this is where the HTML ends and the JavaScript begins:
And the highlighted lines in this file show the information that I am looking to parse from the page's source so that I can form the URLs to download the image attachments:
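If your SharePoint version exposes the REST API, a different route is to skip the HTML scraping entirely and ask the list for an item's attachments directly; the same endpoint can be called from PowerShell with Invoke-RestMethod. A minimal browser-console sketch, assuming a list named 'MyList' and SharePoint 2013 or later (both are assumptions you'd need to confirm):

    // Hypothetical list title and item ID; adjust to your site.
    const listTitle = 'MyList';
    const itemId = 1;

    // SharePoint 2013+ REST endpoint that returns an item's attachments.
    fetch(`/_api/web/lists/getbytitle('${listTitle}')/items(${itemId})/AttachmentFiles`, {
      headers: { Accept: 'application/json;odata=verbose' },
      credentials: 'include' // reuse the browser's SharePoint session
    })
      .then(response => response.json())
      .then(data => {
        // Each result carries FileName and ServerRelativeUrl, which is
        // enough to build the download URL for each image attachment.
        data.d.results.forEach(file => {
          console.log(file.FileName, file.ServerRelativeUrl);
        });
      });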
I am creating an add-on for Mozilla Firefox using the Mozilla Add-on SDK. For that, I need to parse the HTML page that I get as the response when I request a web page, so that after parsing the whole page I can run a segmentation process on it and then redisplay it, editing it as much as required. So please give me a way to store or parse the HTML page so that I can edit it dynamically and redisplay it. How do I retrieve only the HTML page from the response?
If by "response" you mean the response of XMLHttpRequest, then you get the 'responseText' and use DOMParser to covert it into DOM.
Then you can make the changes and display.
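A minimal sketch of that approach; the fetched URL is just a placeholder and the paragraph highlighting stands in for whatever your segmentation step does:

    // Fetch the page, then parse the raw HTML string into a Document.
    var xhr = new XMLHttpRequest();
    xhr.open('GET', 'https://example.com/some-page.html'); // placeholder URL
    xhr.onload = function () {
      var parser = new DOMParser();
      var doc = parser.parseFromString(xhr.responseText, 'text/html');

      // The parsed Document can be edited like any other DOM tree...
      var paragraphs = doc.getElementsByTagName('p');
      for (var i = 0; i < paragraphs.length; i++) {
        paragraphs[i].style.backgroundColor = 'yellow';
      }

      // ...and then redisplayed, e.g. by swapping in the edited markup.
      document.body.innerHTML = doc.body.innerHTML;
    };
    xhr.send();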
If by "response" you mean when a page is loading, then you can run the code of the addon, before, as soon as, or after DOM is loaded and make the changes to the display as required.
More information is required for a more comprehensive reply.
Update on new info
You can run the script in the add-on based on the URL by using PageMod:
- PageMod
- Modifying Web Pages Based on URL
- util/match-pattern
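A minimal PageMod sketch; the match pattern and the content script filename are examples, and the content script is assumed to live in the add-on's data directory:

    // main.js of the Add-on SDK add-on
    var pageMod = require("sdk/page-mod");

    pageMod.PageMod({
      // Example match pattern; see util/match-pattern for the full syntax.
      include: "*.example.com",
      // Run the content script once the page's DOM is ready.
      contentScriptWhen: "ready",
      // Hypothetical content script that does the segmentation and redisplay.
      contentScriptFile: "./segment.js"
    });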
I am developing a plugin for displaying a list of images on other people's websites. I intend to provide them only one URL (probably a JS link) which they need to embed in their site so that they will see those images. The list of images would come from my database. Can you please tell me if such functionality is achievable using JavaScript?
Thanks,
Sachin
JavaScript is client-side only, and for security reasons you cannot (and should not) connect to a database directly from JS. What you can do is use a web page on your server to produce those images and use your script to inject that page into their sites. You can do this via an iframe or an object tag.
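A minimal sketch of such an embed script, assuming a hypothetical https://example.com/gallery page on your server that queries the database and renders the image list:

    // widget.js - the single script customers embed with a <script> tag.
    (function () {
      // Hypothetical server-side page that renders the image list as HTML.
      var GALLERY_URL = 'https://example.com/gallery';

      // Inject the gallery into the host page via an iframe.
      var frame = document.createElement('iframe');
      frame.src = GALLERY_URL;
      frame.style.border = 'none';
      frame.width = '100%';
      frame.height = '400';

      // Insert the iframe right where the <script> tag was placed.
      var script = document.currentScript;
      script.parentNode.insertBefore(frame, script);
    })();

The site owner would then only need something like <script src="https://example.com/widget.js"></script> on their page.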
Let's say, for example, I have a Google search results page opened in a window or a tab in Firefox. Is there a way I can retrieve the HTML code of that tab or window using JavaScript?
I suppose that the web page's HTML is saved temporarily somewhere in memory.
Can I load the web page from that saved location in memory?
Is there a way for JavaScript to read HTML files saved in the same folder as the original? For example, I have saved the web page in a folder on my computer. If I create an HTML file inside the same folder, does JavaScript consider the saved web page to be on the same domain?
No, you most certainly can't do that unless you control both pages. This would be a huge security hole.
There is a Custom Search API which may help if you specifically want to do Google searches. It appears to have a JSONP implementation, which should let you make a cross-domain request, but I haven't tried it out, so I'm not sure how well it works; a rough sketch is below.
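A minimal JSONP sketch against the Custom Search JSON API; the KEY and CX placeholders, the query, and the callback-parameter support are assumptions you would need to verify against the current documentation:

    // Callback the API response will be wrapped in.
    function handleResults(data) {
      (data.items || []).forEach(function (item) {
        console.log(item.title, item.link);
      });
    }

    // Load the request as a <script> tag so the same-origin policy
    // doesn't block it. KEY and CX stand in for your API key and
    // custom-search-engine ID.
    var s = document.createElement('script');
    s.src = 'https://www.googleapis.com/customsearch/v1' +
            '?key=KEY&cx=CX&q=test+query&callback=handleResults';
    document.head.appendChild(s);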