I need to scrape a webpage, and I normally use Scrapy. I need to follow some links that are only reachable through JavaScript; they are nested inside <ul> and <li> elements.
For example:
<ul class="level1">
  <li class="closed">   <---- this becomes "expanded" when opened
    <a href="javascript:etc...
    <ul class="level2">
      <li class="closed">
        <ul class="level3">
          <li class="track">
            <a href="this_is_the_url_that_I_want">
Do I need something other than Scrapy (I see that Selenium is suggested), or can I use an XmlLinkExtractor? Or can I somehow use the markup itself to extract the URL inside "level3"?
Thanks
EDIT: I'm trying to use Selenium, but I get:
  File "/usr/lib/pymodules/python2.7/scrapy/spiderloader.py", line 40, in load
    raise KeyError("Spider not found: {}".format(spider_name))
KeyError: 'Spider not found: '
I have given the spider a name, so I don't understand what I've done wrong.
import scrapy
from selenium import webdriver


class audioSpider(scrapy.Spider):
    name = "audio"
    allowed_domains = ["http://audio.sample"]
    start_urls = ["http://audio.sample/archive-project"]

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        el1 = self.driver.find_element_by_xpath('//ul[@class="level1"]/li[@class]/href')
        el1.click()
        el2 = self.driver.find_element_by_xpath('//id[@class="subNavContainer loaded"/ul[@class="level2"]/li[@class]/href')
        el2.click()
        el3 = self.driver.find_element_by_xpath('//id[@class="subNavContainer loaded"/ul[@class="level3"]/li[@class="track"]/href')
        print el3
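On the question of whether the "level3" URLs can be read straight from the markup: if the nested lists are already present in the HTML the server returns (an assumption; if they are injected by JavaScript only after a click, this finds nothing and a browser driver such as Selenium is needed), the hrefs can be collected without clicking anything. Below is a minimal sketch using only the standard library. The sample markup and the class name Level3LinkParser are made up for illustration.

```python
from html.parser import HTMLParser


class Level3LinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside <ul class="level3">."""

    def __init__(self):
        super().__init__()
        self.ul_stack = []  # class attribute of every <ul> that is currently open
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "ul":
            self.ul_stack.append(attrs.get("class") or "")
        elif tag == "a" and "level3" in self.ul_stack:
            href = attrs.get("href")
            # Skip the javascript: toggles; keep only real URLs.
            if href and not href.startswith("javascript:"):
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "ul" and self.ul_stack:
            self.ul_stack.pop()


# Hypothetical, well-formed version of the structure shown in the question.
SAMPLE_HTML = """
<ul class="level1">
  <li class="closed">
    <a href="javascript:void(0)">Section</a>
    <ul class="level2">
      <li class="closed">
        <ul class="level3">
          <li class="track"><a href="/tracks/1">Track 1</a></li>
          <li class="track"><a href="/tracks/2">Track 2</a></li>
        </ul>
      </li>
    </ul>
  </li>
</ul>
"""

parser = Level3LinkParser()
parser.feed(SAMPLE_HTML)
print(parser.links)
```

Inside a Scrapy spider the same idea would be an XPath on the response, along the lines of //ul[@class="level3"]/li[@class="track"]/a/@href.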
I want to get data from a page using BeautifulSoup, but to do that I need to access the parent node (the <tr>) of the element that has class 'st2', and I'm finding that really hard to do.
This is the code:
import requests
from bs4 import BeautifulSoup

# raw = requests.get("https://www.daum.net")
# raw = requests.get("http://127.0.0.1:5000/emp")
response = requests.get("https://vip.mk.co.kr/newSt/rate/item_all.php?koskok=KOSPI&orderBy=upjong")
response.raise_for_status()
response.encoding = 'EUC-KR'
html = response.text
bs = BeautifulSoup(html, 'html.parser')
result = bs.select("tr .st2")
Here is an example of the HTML:
<tr>
  <td width='92' class='st2'><a href="javascript:goPrice('000020&MSid=&msPortfolioID=')" title='000020'>somethinbg</a></td>
  <td width='60' align='right'>15,100</td>
  <td width='40' align='right'><span class='t_12_blue'>▼300</span></td>
</tr>
How can I get the data from the parent <tr> of the element with class 'st2'?
You can access the parent element in BeautifulSoup via the element's parent attribute. Because select() returns a list of elements, you need to do this for each element in a loop (here, a list comprehension).
I'm assuming here that you want to extract each row in the table.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://vip.mk.co.kr/newSt/rate/item_all.php?koskok=KOSPI&orderBy=upjong")
response.raise_for_status()
response.encoding = 'EUC-KR'
html = response.text
bs = BeautifulSoup(html, 'html.parser')
# For each .st2 cell, go up to its parent <tr> and read the text of its direct <td> children.
result = [
    [child.text for child in elem.parent.find_all('td', recursive=False)]
    for elem in bs.select('tr .st2')
]
The result is:
[
['동화약품', '15,100', '▼300'],
['유한양행', '61,100', '▼400'],
['유한양행우', '58,900', '▲300'],
...
]
The following two JavaScript functions toggle a button's disabled class. I want the disabled state to depend on the global variable filelength in my Python code, but I cannot think of a simple way to do so. The only way I can think of is to have two identical but separate templates, one with the button disabled and one with it enabled.
<script type="text/javascript" language="JavaScript">
function enableButton(button){
    document.getElementById(button).removeAttribute('class');
    document.getElementById(button).setAttribute("class", "button");
}
function disableButton(button){
    document.getElementById(button).setAttribute("class", "disabled");
}
</script>
I intended to use the functions for the following index.html template element.
<button id="Test" class="button disabled" >
Test
</button>
The intended toggling would produce the following alt.html template element which elides the "disabled".
<button id="Test" class="button" >
Test
</button>
It seems silly to require two separate templates (index.html and alt.html) to accomplish this toggle, but I cannot think of an alternative that lets me alter just index.html. Initially I thought jinja2 would provide the functionality needed, but that does not seem to be the case.
How can I accomplish this without a second template, using Python and GAE?
For completeness, the relevant part of my Python application is shown below.
import logging
import os
import urllib

import jinja2
import webapp2

# BaseHandler and JINJA_ENVIRONMENT are defined elsewhere in the application (not shown).
filelength = 0  # module-level state shared by both handlers


class MainPage(BaseHandler):
    def get(self):
        global filelength
        logging.info("text length in Main get: %s " % filelength)
        template_values = {'filelength': filelength}
        template = JINJA_ENVIRONMENT.get_template('index.html')
        self.response.out.write(template.render(template_values))

    def post(self):
        global filelength
        url = self.request.get('URL', None)
        text = urllib.urlopen(url).read()
        logging.info("text length in Main post: %s " % len(text))
        filelength = len(text)
        if filelength > 0:
            return webapp2.redirect('/alt')
        else:
            return webapp2.redirect('/')


class AltMainPage(BaseHandler):
    def get(self):
        global filelength
        logging.info("text length in Alt get: %s " % filelength)
        template_values = {'filelength': filelength}
        template = JINJA_ENVIRONMENT.get_template('alt.html')
        self.response.out.write(template.render(template_values))

    def post(self):
        global filelength
        url = self.request.get('URL', None)
        text = urllib.urlopen(url).read()
        logging.info("text length in Alt post: %s " % len(text))
        if filelength > 0:
            return webapp2.redirect('/alt')
        else:
            return webapp2.redirect('/')
        return webapp2.redirect('/')


app = webapp2.WSGIApplication([
    ('/', MainPage),
    ('/alt', AltMainPage),
],
debug=True)
In the template index.html, simply use jinja2 to define the class attribute as shown below. The value of buttonclass is set in Python to either "button" or "button disabled" using an if ... else construct.
<button id="Test" class="{{ buttonclass }} " >
Test
</button>
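The Python side of this can be sketched as follows. The helper names button_class and build_template_values are made up for illustration, and the rule "enabled once filelength > 0" is inferred from the redirect logic in the question, so adjust it if the intent differs.

```python
def button_class(filelength):
    """Return the CSS class string for the Test button.

    Assumption: the button is enabled only once some text has been
    fetched (filelength > 0), mirroring the redirect to /alt in the question.
    """
    return "button" if filelength > 0 else "button disabled"


def build_template_values(filelength):
    # Hypothetical helper: builds the dict that would be passed to template.render().
    return {
        "filelength": filelength,
        "buttonclass": button_class(filelength),
    }


print(build_template_values(0))
print(build_template_values(1234))
```

With something like this, MainPage.get() could render index.html for both states, so alt.html and the /alt handler should no longer be needed.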
The following page gives access to product details by executing a Javascript request:
http://www.ooshop.com/ContentNavigation.aspx?TO_NOEUD_IDMO=N000000013143&FROM_NOEUD_IDMO=N000000013131&TO_NOEUD_IDFO=81080&NOEUD_NIVEAU=2&UNIVERS_INDEX=3
Each product has the following element:
<a id="ctl00_cphC_pn3T1_ctl01_rp_ctl00_ctl00_lbVisu" class="prodimg" href="javascript:__doPostBack('ctl00$cphC$pn3T1$ctl01$rp$ctl00$ctl00$lbVisu','')"><img id="ctl00_cphC_pn3T1_ctl01_rp_ctl00_ctl00_iVisu" title="Visualiser la fiche détail" class="image" onerror="this.src='/Media/images/null.gif';" src="Media/ProdImages/Produit/Vignettes/3270190199359.gif" alt="Dés de jambon" style="height:70px;width:70px;border-width:0px;margin-top:15px"></a>
I tried to use FormRequest from the Scrapy library to crawl these pages, but it does not seem to work:
import scrapy
from scrapy.http import FormRequest
from JStest.items import JstestItem


class ooshoptest2(scrapy.Spider):
    name = "ooshoptest2"
    allowed_domains = ["ooshop.com"]
    start_urls = ["http://www.ooshop.com/courses-en-ligne/ContentNavigation.aspx?TO_NOEUD_IDMO=N000000013143&FROM_NOEUD_IDMO=N000000013131&TO_NOEUD_IDFO=81080&NOEUD_NIVEAU=2&UNIVERS_INDEX=3"]

    def parse(self, response):
        URL = response.url
        path = '//div[@class="blockInside"]//ul/li/a'
        for balise in response.xpath(path):
            jsrequest = response.urljoin(balise.xpath('@href').extract()[0])
            # Strip "javascript:__doPostBack('" and "','')" to keep the event target
            js = "'" + jsrequest[25:-5] + "'"
            data = {'__EVENTTARGET': js, '__EVENTARGUMENT': ''}
            yield FormRequest(url=URL,
                              method='POST',
                              callback=self.parse_level1,
                              formdata=data,
                              dont_filter=True)

    def parse_level1(self, response):
        path = '//div[@class="popContent"]'
        test = response.xpath(path)[0].extract()
        print test
        item = JstestItem()
        yield item
Does anyone know how to make this work?
Many thanks!
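One fragile spot in the spider above is the fixed slice jsrequest[25:-5], which only works while the href has exactly this shape and an empty argument. The target is also wrapped in extra quote characters, which the server may not expect. Below is a minimal sketch that pulls the two __doPostBack arguments out of the href with a regular expression instead. The names parse_postback and POSTBACK_RE are made up here, and whether the server accepts the resulting POST is untested. ASP.NET pages typically also expect hidden fields such as __VIEWSTATE to be posted back; Scrapy's FormRequest.from_response(response, formdata=...) copies the form's existing fields and may help with that.

```python
import re

# Matches __doPostBack('<target>','<argument>') as seen in the href above.
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)'\s*,\s*'([^']*)'\)")


def parse_postback(href):
    """Return (event_target, event_argument) from a __doPostBack href, or None."""
    match = POSTBACK_RE.search(href)
    if match is None:
        return None
    return match.group(1), match.group(2)


href = "javascript:__doPostBack('ctl00$cphC$pn3T1$ctl01$rp$ctl00$ctl00$lbVisu','')"
target, argument = parse_postback(href)

# Form fields for the POST; the target is sent without surrounding quotes.
formdata = {"__EVENTTARGET": target, "__EVENTARGUMENT": argument}
print(formdata)
```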
My first stack overflow question....
I'm trying to chain all() calls in Protractor, but I'm getting this error:
TypeError: Object [object Object] has no method 'all'
I'm looking at the API code on the following page
http://angular.github.io/protractor/#/api?view=ElementArrayFinder.prototype.all
which indicates that you can use element.all(locator).all(locator).
It gives this as an example:
var foo = element.all(by.css('.parent')).all(by.css('.foo'))
My code seems to be very similar, and I'm confused about why I'm getting this error. I've tried structuring the code exactly as in the API example. I've also tried element.all(locator).element.all(locator).
My goal here is to take an ng-repeat of anchor (<a>) elements, find the one whose text equals r_string (a string generated earlier and added to the page), expect that element to exist, and click it.
Some Attempts:
var parent = element.all(by.repeater('labgroup in LabGroupService.allLabGroups'));
var child = parent.all(by.xpath('//option[text() = \'' + r_string + '\']'));
expect(child.count()).toBe('1');
and
var elem = element.all(by.repeater('labgroup in LabGroupService.allLabGroups')).all(by.xpath('//option[text() = \'' + r_string + '\']'));
expect(elem.count()).toBe('1');
Finally, here is a snippet of the HTML I'm working with.
<a ui-sref="root.user-management.labgroup({labgroupID: labgroup.id})" class="ng-binding" href="#/management/labgroup/43">1kvub4wgCvY9QfA</a>
</dd><!-- end ngRepeat: labgroup in LabGroupService.allLabGroups --><dd ng-repeat="labgroup in LabGroupService.allLabGroups" class="ng-scope">
<a ui-sref="root.user-management.labgroup({labgroupID: labgroup.id})" class="ng-binding" href="#/management/labgroup/47">3PNsny8lUMlMwBw</a>
</dd><!-- end ngRepeat: labgroup in LabGroupService.allLabGroups --><dd ng-repeat="labgroup in LabGroupService.allLabGroups" class="ng-scope">
<a ui-sref="root.user-management.labgroup({labgroupID: labgroup.id})" class="ng-binding" href="#/management/labgroup/42">c3NOI7Z3933ui3a</a>
</dd><!-- end ngRepeat: labgroup in LabGroupService.allLabGroups --><dd ng-repeat="labgroup in LabGroupService.allLabGroups" class="ng-scope">
Edit----------------------------------------------------------------------------------------
I'm starting to wonder if this is a version error or maybe a protractor error. In an attempt to debug I've literally included the source code from the API page.
<div id='id1' class="parent">
  <ul>
    <li class="foo">1a</li>
    <li class="baz">1b</li>
  </ul>
</div>
<div id='id2' class="parent">
  <ul>
    <li class="foo">2a</li>
    <li class="bar">2b</li>
  </ul>
</div>
and the example from the source page.
var foo = element.all(by.css('.parent')).all(by.css('.foo'))
expect(foo.getText()).toEqual(['1a', '2a'])
I'm still getting that same error.
TypeError: Object [object Object] has no method 'all'
Edit 2-------------------------------------------------------------------------------
I managed to solve this issue by adding data-class="labgroup-link" to the actual HTML and using this Protractor code:
element.all(by.css('[data-class="labgroup-link"]')).filter(function(elem, index) {
    return elem.getText().then(function(text) {
        return text === r_string;
    });
}).then(function(filteredElements) {
    expect(filteredElements[0].isPresent()).toBe(true);
    filteredElements[0].click();
    ptor.sleep(100);
});
Solution ----------------------------------------
Had to upgrade protractor to get the latest API.
Protractor >= 1.3.0
Should work given: https://github.com/angular/protractor/blob/f7c3c370a239218f6143a/lib/protractor.js#L177
var foo = element.all(by.css('.parent')).all(by.css('.foo'));
Protractor < 1.3.0
ElementArrayFinder doesn't have an all method: https://github.com/angular/protractor/blob/master/docs/api.md#api-elementarrayfinder-prototype-get therefore:
TypeError: Object [object Object] has no method 'all'
Perhaps you want this:
var foo = element(by.css('.parent')).all(by.css('.foo'));
// or shorter
var foo = $('.parent').$$('.foo');
instead of:
var foo = element.all(by.css('.parent')).all(by.css('.foo'));
Is it possible to add the following HTML to the original source shown below? I have limited scripting access to the source page, so I'm looking for a way to change the page using jQuery or plain JS.
Also, the department IDs will be completely random and each group will have a different number of links, so the solution needs to be dynamic. I've tried appending, but I'm having trouble because I can't insert only an opening or only a closing tag, so I'm not sure how to go about this. Thanks in advance for any help offered.
The additions I need in the code are marked with **.
Original source:
<ul class="menu">
<a id="group-car" href="#">Car</a>
<li><a id="department-2" href="link">Black</a></li>
<li><a id="department-4" href="link">Blue</a></li>
<a id="group-bike" href="#">Bike</a>
<li><a id="department-1" href="link">BMX</a></li>
<li><a id="department-6" href="link">Racing</a></li>
<li><a id="department-12" href="link">Mountain</a></li>
</ul>
What I need to end up with:
<ul class="menu">
**<li>**
<a id="group-car" href="#">CAR</a>
**<ul class="acitem">**
<li><a id="department-2" href="link">Black</a></li>
<li><a id="department-4" href="link">Blue</a></li>
**</ul>**
**</li>**
**<li>**
<a id="group-bike" href="#">BIKE</a>
**<ul class="acitem">**
<li><a id="department-1" href="link">BMX</a></li>
<li><a id="department-6" href="link">Racing</a></li>
<li><a id="department-12" href="link">Mountain</a></li>
**</ul>**
**</li>**
</ul>
jQuery(".menu").children("a").each(function()
{
    jQuery(this).nextUntil("a").add(this).wrapAll("<li></li>");
    jQuery(this).nextUntil("a").wrapAll("<ul></ul>");
});
jsfiddle
Does this need some explanation?
EDIT oops! I didn't see the classes on them:
jQuery(".menu").children("a").each(function()
{
    jQuery(this).nextUntil("a").add(this).wrapAll("<li></li>");
    var jUL = jQuery("<ul></ul>").addClass("acitem");
    jQuery(this).nextUntil("a").wrapAll(jUL);
});
jsFiddle
What a beautiful challenge!!
Here you go. Tested in FF 3.6 and it works!
function fixMarkup(){
    var liFamilies = [];
    var iFamily = 0;
    $(".menu li").each(function(){
        if($(this).prev().is("a"))
            liFamilies[iFamily] = [this]; //Start a family
        else
            liFamilies[iFamily].push(this); //Append to family
        if($(this).next().is("a")) iFamily++; //A new family begins
    });
    //console.log(liFamilies);
    for(var i = 0; i < liFamilies.length; i++){
        var family = liFamilies[i];
        $(family).wrapAll('<ul class="acitem" />');
        var ulNew = $(family[0]).parent()[0];
        var aElem = $(ulNew).prev()[0];
        $([aElem, ulNew]).wrapAll("<li/>");
    }
}
$(document).ready(function(){
    fixMarkup();
});