Scrape / extract data from hidden divs in scrapy - javascript

Hi, I am trying to scrape a website that has a text input. Whenever I click on the input, a dropdown of suggested values appears. It is not a select tag.
The suggested values are inside div elements. There are almost 200 of these divs/suggestions.
I tried to scrape them with Scrapy using XPath / CSS selectors, but I found that these 200 divs are missing when I view the code with "View page source"; they only show up in "Inspect element".
Please help. Thank you.

These elements are generated on the fly by some dropdown library, so you have to investigate the website source code and/or the HTTP requests it's making. All the data you are looking for should be there (most likely in JSON format), not in the HTML itself.
For example, if you are using Chrome:
Press F12 to open devtools while you are on the website
Press F5 to reload the page
Navigate to the Network or Sources tab
Try to locate the data (Ctrl+F is really helpful here)
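Once the data is located, pulling it out is usually a small parsing step. A minimal sketch, assuming (hypothetically) that the page source embeds the suggestions as an array literal such as `var suggestions = [...];` that also happens to be valid JSON. The variable name and the sample markup are invented for illustration; a similar regular expression could be applied to the response body inside a Scrapy callback.

```javascript
// Hypothetical: the raw page source contains  var suggestions = ["...", "..."];
// The array literal must be valid JSON (double-quoted strings, no trailing commas).
function extractEmbeddedArray(pageSource, variableName) {
  // Capture the shortest [...] that follows "<variableName> =" and ends with ";".
  const pattern = new RegExp(variableName + "\\s*=\\s*(\\[[\\s\\S]*?\\])\\s*;");
  const match = pattern.exec(pageSource);
  if (!match) {
    return null; // the data is not embedded under this name
  }
  return JSON.parse(match[1]);
}

// Invented sample standing in for the real page source.
const samplePage = '<script>var suggestions = ["Alpha", "Beta", "Gamma"];</script>';
console.log(extractEmbeddedArray(samplePage, "suggestions"));
```

If the data arrives through a separate HTTP request instead, the response shown in the Network tab is typically already JSON and can be parsed directly.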

Related

How does Google Calendar update the content of an email AFTER it is sent?

Google Calendar invite emails will update after they are sent if the original event has been changed... how does Google achieve this? Is there a general technique for anyone to do this? Or is this only possible because Google owns both Gmail and Google Calendar and the two systems are integrated behind the scenes outside of SMTP?
My first guess was that they used an iframe or an image that was loaded when the email was opened, but inspecting the source of the Gmail page doesn't show any signs of that.
Here's a screenshot of the updated text:
And here's the HTML for that section of the page when reading the email within gMail:
Note:
Inspect Element only shows you the markup of the page as it is currently rendered, after all dynamic operations, including Ajax.
To check the actual source, visit view-source:url.
Now, to the question:
That information is updated automatically at run time by JavaScript code.
In the image, you used Inspect Element, which shows the code of the live view, so you saw the updated content.
It is done by JavaScript DOM and text manipulation.
To verify this:
Click on the address bar.
Add view-source: before the URL, so that it looks like view-source:https://url
Then press Ctrl+F (or the corresponding key) to open Find.
Search for <div id=":8hg", which will show 0 results.
view-source loads the source of the file without any Ajax or JavaScript manipulation.
The div is not present in the source, so we can conclude that it is created dynamically.
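The same check can be scripted rather than done by hand. This is only a sketch: `classifyMarkup` is a hypothetical helper name, and the commented console usage assumes the page can be re-fetched from its own URL.

```javascript
// Classify where a snippet of markup comes from, given the raw source
// (what view-source: shows) and the rendered DOM serialized to a string.
function classifyMarkup(rawSource, renderedHtml, needle) {
  if (rawSource.includes(needle)) {
    return "static";   // already present in the HTML the server sent
  }
  if (renderedHtml.includes(needle)) {
    return "dynamic";  // only exists after scripts modified the DOM
  }
  return "absent";
}

// In a browser console on the page being inspected, the two inputs could be
// obtained like this (the fetch re-requests the page, so the result may
// differ slightly from what was originally loaded):
//   const raw = await (await fetch(location.href)).text();
//   const rendered = document.documentElement.outerHTML;
//   classifyMarkup(raw, rendered, '<div id=":8hg"');
```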
Looking in more detail,
in the source we can see a link, https://www.google.com/calendar/event?action\u003dVIEW\u0026eid\u003db....., which is stored in an array.
The content is taken from this link.
(I blacked out some text for privacy).
Based on what the URL returns, the content of the mail is updated.
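The \u003d and \u0026 sequences in the link quoted above are just escaped forms of = and &. A small sketch of decoding such a string; the URL below is a made-up placeholder in the same style, not the real link:

```javascript
// Replace \uXXXX escape sequences (as they appear literally in page source)
// with the characters they stand for.
function decodeUnicodeEscapes(text) {
  return text.replace(/\\u([0-9a-fA-F]{4})/g, (whole, hex) =>
    String.fromCharCode(parseInt(hex, 16))
  );
}

// Hypothetical URL in the same escaped style; "EXAMPLE" is a placeholder id.
const escaped = "https://example.com/event?action\\u003dVIEW\\u0026eid\\u003dEXAMPLE";
console.log(decodeUnicodeEscapes(escaped));
// prints: https://example.com/event?action=VIEW&eid=EXAMPLE
```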
To verify this:
In the mail, you can see the text This invitation is out of date.
But if you search for This invitation is out of date in the view-source: page, it returns 0 results.
So we can conclude that the calendar details are fetched by Gmail through an API call to Google Calendar.
I wonder if, on sending the email, they create an image at some URL and then remove it if the event changes. Then in the email they would have something like
<div id="updated"></div>
<img src="asdfawe" onerror="document.getElementById('updated').innerHTML='some text'"/>
Although I'm not sure they can use the onerror attribute (because email + JS = bad idea). The only other way is to use the alt attribute and some CSS trickery, but I don't see how that could result in the inspected code.

Display inspect element in a div

Has anyone out there displayed the rendered HTML of a page in a div?
I recently developed a simple CMS for page meta tags (it dynamically adds meta tags according to a DB record). All went okay until the SEO team wanted proof that it was 'really' rendering the metas. I can prove it to them using the developer tools, but they do not want to manually press F12 and check whether the meta was rendered. They want it displayed directly on screen, e.g. in a div.
And I have no idea where to start. Setting my situation aside, is it possible to grab the data shown in developer tools and display it in a div or iframe? Or the view source, maybe?
I have been searching for a possible solution, but unluckily I can't find one using JavaScript, jQuery, or PHP.
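You could propose making bookmarklets that your SEO team can run, which would show the meta tags in JS alerts. (Meta elements have no inner content, so the alert should show their attributes, such as name and content, rather than innerHTML.)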
Otherwise as one comment says, they should just press Ctrl+U, Ctrl+F, type "meta", press enter, and get over it.
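The bookmarklet idea above could be sketched roughly as follows. `summarizeMetaTags` is a hypothetical helper name; it reads attributes because meta elements carry their data in attributes such as name and content. The same string could just as well be written into a div on the page instead of an alert.

```javascript
// Build a plain-text summary of every <meta> tag in the given document.
function summarizeMetaTags(doc) {
  const lines = [];
  for (const meta of doc.querySelectorAll("meta")) {
    // Meta tags identify themselves through different attributes.
    const key =
      meta.getAttribute("name") ||
      meta.getAttribute("property") ||
      meta.getAttribute("http-equiv") ||
      (meta.getAttribute("charset") !== null ? "charset" : "(unnamed)");
    const value = meta.getAttribute("content") || meta.getAttribute("charset") || "";
    lines.push(key + ": " + value);
  }
  return lines.join("\n");
}

// When run in a browser (for example from a bookmarklet), show the result.
if (typeof document !== "undefined") {
  alert(summarizeMetaTags(document));
}
```

To turn this into a bookmarklet, the code would be minified onto one line and prefixed with javascript: in the bookmark's URL field.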

Scraping a dynamically generated webpage with HTML5 <input> field

I want to collect data from this page. I have keywords I want to input in the search box, which is defined as an HTML5 <input> with an event listener that dynamically changes the page based on the query.
For example, I want a script that inputs the term "hello world" in the search field and then scrapes the dynamically generated content, say the name of the collections that appear. Because of the Same Origin Policy I can't use JavaScript and I've spent the last 3 hours looking into Python but couldn't find anything there.
I can't tell if this is so obvious no one writes/asks about it, or it's a clever way to not let scripts scrape from your site.
Open the page in Chrome's Debugger or Firebug in Firefox, look at the Network tab, and find the AJAX requests the JavaScript makes when you enter text into the input field(s).
Then write a webscraper using any of:
https://pypi.python.org/pypi/requests
https://pypi.python.org/pypi/spyda
https://pypi.python.org/pypi/scrapy
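The libraries above are Python, but the replay idea itself is language-independent. As an illustration only, here is the same approach sketched in JavaScript for Node.js, where the browser's Same Origin Policy is not enforced. The endpoint, the q parameter, and the response shape are all assumptions; the real values have to be read off the Network tab.

```javascript
// Hypothetical endpoint and parameter name; substitute whatever the
// Network tab shows for the real site.
const SEARCH_ENDPOINT = "https://example.com/api/search";

// Build the request URL the page's own JavaScript would send for a term.
function buildSearchUrl(endpoint, term) {
  const url = new URL(endpoint);
  url.searchParams.set("q", term);
  return url.toString();
}

// Pull collection names out of an assumed payload shape:
//   { "results": [ { "name": "..." }, ... ] }
function extractNames(payload) {
  return (payload.results || []).map((item) => item.name);
}

// Replay the request outside the browser (Node.js 18+ has a global fetch).
async function searchCollections(term) {
  const response = await fetch(buildSearchUrl(SEARCH_ENDPOINT, term));
  if (!response.ok) {
    throw new Error("Request failed with status " + response.status);
  }
  return extractNames(await response.json());
}
```

Calling `searchCollections("hello world")` would then resolve to the list of names, provided the real endpoint behaves like the assumed one. Some sites also expect particular headers or cookies, which can be copied from the same Network tab entry.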

Hide WSS 3.0 Webpart Using JavaScript

I am using WSS 3.0 in my application. I am displaying a List as a DataView Webpart. My objective here is to make this webpart visible to a selected group of individuals. As there is no option for Target Audience in WSS 3.0, I went to edit Permissions for List and gave Read permissions only to selected users. This doesn't hide the web part from the page, rather shows an Access Denied message to other users.
Access denied. You do not have permission to perform this action or access this resource.
As I said, I want to hide this webpart, as in make it invisible on the web page for users who do not have permission to view it. As this message will be displayed only to those users who do not have permissions, my approach is to search for the above message in the HTML, identify the parent node, and hide it, thereby hiding the webpart.
I am not quite sure how to do this. Any ideas? Thanks in advance!
I'm going to assume you're in a situation where you can add additional web parts to the page and not trying to add JavaScript to the DataView Web Part directly. My suggestion won't work on a separate page if a Designer adds another view of this list.
Upload a blank .js file to your Site Assets. Add a Content Editor Web Part to your page and point it at that file. Add jQuery from a provider or host it yourself, adding the reference in your file. From there, you have 3 directions in which to work.
First, explore the web part with Internet Explorer's F12 Developer Tools, keeping a particular eye on divs and tables with good unique ids, names, or classes that would solve your problem if hidden. Also keep an eye on the id of the div or table or cell or whatever that contains your access denied text.
Second (assuming you're new to jQuery), do some jQuery tutorials and then start playing with selecting the above items and, say, changing their background color. Once you have both of those, you're 90% there: (try to) select the object that would contain the access denied text, and if the innerHTML is present and equals that string, then set display:none for the div or tables to hide your web part.
The third tool you have is editing the page directly with SharePoint Designer: you can toss a div with an id of your choosing around any xsl:template, which might help in your jQuery selecting.
I'm sorry I can't give you the specific code, since I'm not in a position to test it. If that changes, I'll try and give a more detailed response.
Old, misdirected answer: Do either of the answers here work for you? Alternatively, this answer has some great resources to solve your problem. Just change the message to an empty string.
Thanks Aron :D
I found the id for the webpart and hard coded it. It provided the solution, but I was hoping to programmatically fetch the id instead by searching the innerHTML, as I have more than one web part that has to be hidden.
I found a partial solution here:
Hide SharePoint web part using javascript onclick method
I put a CEWP on the page and added the following script in it:
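<script>
// Hide the web part if it is showing the "Access denied" message.
function hide()
{
    var part = document.getElementById("webpartID");
    if (!part) { return; }
    // indexOf matches the text literally; search() would treat it as a regular expression.
    var n = part.innerHTML.indexOf("Access denied. You do not have permission to perform this action or access this resource");
    if (n != -1)
    {
        part.style.display = "none";
    }
}
// Run hide() once SharePoint has finished loading the page body.
// The array name is case-sensitive.
_spBodyOnLoadFunctionNames.push("hide");
</script>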
In my case, I picked up the webpart id from the aspx page or view source for the page.
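For pages with more than one web part to hide, the same check can be looped over a list of ids. This is a sketch under assumptions: `hideDeniedWebParts` is a hypothetical helper, and the ids in the usage comment are placeholders that would still have to be collected from the page source.

```javascript
var DENIED_TEXT = "Access denied. You do not have permission to perform this action or access this resource";

// Hide every listed web part whose markup contains the access-denied message.
// doc is normally the global document; ids is an array of web part element ids.
// Returns the ids that were actually hidden.
function hideDeniedWebParts(doc, ids) {
  var hidden = [];
  for (var i = 0; i < ids.length; i++) {
    var part = doc.getElementById(ids[i]);
    // Skip ids that are not on this page; indexOf matches the text literally.
    if (part && part.innerHTML.indexOf(DENIED_TEXT) !== -1) {
      part.style.display = "none";
      hidden.push(ids[i]);
    }
  }
  return hidden;
}

// Hypothetical usage inside the CEWP script (ids are placeholders):
//   function hideAll() { hideDeniedWebParts(document, ["webpartID1", "webpartID2"]); }
//   _spBodyOnLoadFunctionNames.push("hideAll");
```

The function sticks to older JavaScript syntax (var, plain for loops) so that it can run in the Internet Explorer versions typically used with WSS 3.0.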

altering displayed web page

I am attempting to write a Firefox add-on that will analyze the displayed page and turn some of the displayed text into hyperlinks (according to some algorithm).
I am trying to figure out how I can parse the HTML document tree to retrieve the text in order to make it a link.
So I need not only the text but also its position in the document.
For example, if I had some kind of parser that gave me only the text nodes, I could then replace their content.
Is there such a thing at all?
You can insert javascript into every page so you have everything that javascript can do. A good place to start learning about Firefox addon development is the MDN https://developer.mozilla.org/en/Building_an_Extension
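The DOM already exposes text as separate nodes, so no extra parser is needed. A minimal sketch of collecting them with a recursive walk; `collectTextNodes` is a hypothetical helper, and the browser's built-in document.createTreeWalker with NodeFilter.SHOW_TEXT is an alternative way to get the same nodes.

```javascript
var TEXT_NODE = 3; // value of Node.TEXT_NODE in the DOM

// Collect all non-empty text nodes under root, in document order,
// skipping the contents of <script> and <style> elements.
function collectTextNodes(root, found) {
  found = found || [];
  var children = root.childNodes || [];
  for (var i = 0; i < children.length; i++) {
    var child = children[i];
    if (child.nodeType === TEXT_NODE) {
      if (child.nodeValue.trim() !== "") {
        found.push(child);
      }
    } else if (child.nodeName !== "SCRIPT" && child.nodeName !== "STYLE") {
      collectTextNodes(child, found);
    }
  }
  return found;
}

// In a page or content script, each returned node keeps its position in the
// tree, so it can be replaced in place, for example:
//   var link = document.createElement("a");
//   link.href = "https://example.com/";          // hypothetical target
//   link.textContent = node.nodeValue;
//   node.parentNode.replaceChild(link, node);
```

Collecting the nodes into an array first, and replacing them afterwards, avoids modifying the tree while it is still being walked.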
