I want to download files from a web page that offers public downloads.
I have little experience with the world of web pages, so I initially wrote a simple script using Selenium that solves the problem; however, I find that way of getting the data too convoluted. I have therefore tried to download the file via POST requests, and after investigating the "Network" panel of the developer tools a little, I managed to find the request that apparently triggers the download.
The button highlighted in the image launches the three requests that appear on the right, except that the first time the file is downloaded one additional request appears (it can be seen above the highlighted requests).
However, the response to this request, far from being the content that would allow me to write the desired file, is a fragment of HTML code corresponding to what is displayed when I double-click on "downloadDir" in the Network panel:
What could I be missing? Does anyone know how to solve this, or is it not possible this way?
You should use the requests package instead of executing curl or other external processes:
import requests

# POST the request payload and read the raw response body
response = requests.post('https://localhost:4000/bananas/12345.png', data='[ 1, 2, 3, 4 ]')
data = response.content
After that, data contains the downloaded content, which you can then write to disk, for instance:
with open(path, 'wb') as s:
    s.write(data)
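For large files it can be worth streaming the response in chunks instead of loading it all into memory at once. A minimal sketch, reusing the placeholder URL and payload from above:

import requests

# stream the download so the whole file never has to fit in memory
url = 'https://localhost:4000/bananas/12345.png'
with requests.post(url, data='[ 1, 2, 3, 4 ]', stream=True) as response:
    response.raise_for_status()
    with open('12345.png', 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)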
I can't seem to find a way to display the size of a file in JavaScript in a terminal simulator. (I'm very new to JavaScript)
I've tried these:
https://bitexperts.com/Question/Detail/3316/determine-file-size-in-javascript-without-downloading-a-file
Ajax - Get size of file before downloading
I expected to get the byte size, but nothing happens.
I'm not able to show any error messages (if there were any) as I am on my school laptop and they blocked Inspect Element.
The output needs to be displayed on the "terminal" itself and it must be strictly JavaScript.
Thanks!
Edit 1:
Here are the "terminal" files, which is easier than posting snippets of the whole source. The commands are located in js/terminal.html. The main area we need to pay attention to is line 144.
I would post it in snippets, but that would make this question 20x its current size. It's based on Andrew Barfield's HTML5 Terminal.
If the server supports HEAD, you can try to use that. However, there's no guarantee that the Content-Length header is returned even if HEAD requests are supported!
Run the code below in a console on stackoverflow.com and you'll see the size of the HTML for their home page without downloading the full page. (Note that Stack Overflow no longer provides a Content-Length header.)
fetch('/', {method: 'HEAD'}).then((result) => {
  console.log(result.headers.get("content-length"))
})
https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods/HEAD
The HTTP HEAD method requests the headers that are returned if the specified resource would be requested with an HTTP GET method.
Hi developers, I have had problems with blob videos for a long time. I think I finally understood the hierarchy of a blob video: something reads a part of the video from the server and sends it to the client. But today I encountered another problem. First I searched for where this blob data comes into the player from, and some articles gave me useful information: look at the links for the blob in the Network section of Chrome's inspect tools.
That way I saw that the blob video downloads every few seconds from different URLs that differ only slightly. The site was: https://www2.1movies.is/movie/xxxxxxxxxxxxx.html
The blob requests hit URLs that change over a short time only by a number, like this:
s7--p.ex/ample--{seq-1}-xxxx.xxxxxxx
In this example only the seq-1 number changes, increasing frequently.
So I used Python to download this video part by part in a plain way (without asyncio or other sophisticated approaches). I didn't know how many parts the video consists of, so I checked manually until it gave a 404 error. I didn't save the code since I ran it from the command line, so I will give example code.
import urllib.request

def downloader(url_template):
    """Imagine we call downloader with a seed URL template,
    for example site.com/seq-{}-signature=xxxxxx; the program
    fills in the number, producing site.com/seq-1-signature=xxxxxx,
    site.com/seq-2-signature=xxxxxx and so on."""
    for i in range(1, 2000):
        try:
            # each part is saved under its sequence number: 1.ts, 2.ts, ...
            with open("{}.ts".format(i), 'wb') as ff:
                # read the part from its URL and write it to the file
                ff.write(urllib.request.urlopen(url_template.format(i)).read())
        except Exception:
            continue

downloader('examp.ple/seq-{}-ddadadaad')
The program worked, but when I attempted to combine all the parts into one video, I saw that not all parts had downloaded fully; for example, the Network section shows 300 kB for the 215th part, but the downloaded file is 298 kB. As a result the video plays with interruptions.
So maybe there are other ways to download this kind of video, but if the site works this way, why didn't I succeed?
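One way to diagnose the truncation, as a sketch rather than a confirmed fix: compare each part's Content-Length header (when the server sends one) against the bytes actually received, and retry on a mismatch. The URL below is the same placeholder as above:

import requests

def download_part(url, filename, retries=3):
    # fetch one segment, verifying the byte count against Content-Length
    for _ in range(retries):
        response = requests.get(url)
        response.raise_for_status()
        expected = response.headers.get('Content-Length')
        if expected is None or len(response.content) == int(expected):
            with open(filename, 'wb') as f:
                f.write(response.content)
            return True
    return False

download_part('examp.ple/seq-215-ddadadaad', '215.ts')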
Python novice here.
I am trying to scrape company information from the Dutch Transparency Benchmark website for a number of different companies, but I'm at a loss as to how to make it work. I've tried
pd.read_html("https://www.transparantiebenchmark.nl/en/scores-0#/survey/4/company/793")
and
requests.get("https://www.transparantiebenchmark.nl/en/scores-0#/survey/4/company/793")
and then working from there. However, it seems like the data is dynamically generated/queried, and thus not actually contained in the HTML source code these methods retrieve.
If I go to my browser's developer tools and copy the "final" HTML as shown there in the "Elements" tab, all the information is in there. But as I'd like to repeat the process for several of the companies, is there any way to automate it?
Alternatively, if there's no direct way to obtain the info from the HTML, there might be a second possibility. The site allows you to download the information as an Excel file for each individual company. Is it possible to somehow automatically "click" the download button and save the file somewhere? Then I might be able to loop over all the companies I need.
Please excuse me if this question is poorly worded, and thank you very much in advance.
Many thanks!
Edit: I have also tried it using BeautifulSoup, as @pmkroeker suggested. But I'm not really sure how to make it work so that it first runs all the JavaScript and the site actually contains the data.
I think you will want to either use a library to render the page or find the site's underlying API calls. This answer seems to apply to Python. I will also copy the code from that answer for completeness.
You can pip install selenium from a command line, and then run something like:
from selenium import webdriver
from urllib.request import urlopen

url = 'http://www.google.com'
file_name = 'C:/Users/Desktop/test.txt'

# fetch the raw HTML and save it to a local file
conn = urlopen(url)
data = conn.read()
conn.close()

with open(file_name, 'wb') as file:
    file.write(data)

# open the saved file in a real browser so its JavaScript runs,
# then grab the rendered page source
browser = webdriver.Firefox()
browser.get('file:///' + file_name)
html = browser.page_source
browser.quit()
I think you could probably skip the file write and just pass the URL directly to that browser.get call, but I'll leave that to you to find out.
The other thing you can do is look for the AJAX calls in a browser developer tool, i.e. when using Chrome: the three dots -> More tools -> Developer tools, or press F12. Then look at the Network tab. There will be various requests. Click one, click the Preview tab, and go through each until you find a response that looks like JSON data. You are effectively looking for the API calls the site used to get the data it renders. Once you find one, click the Headers tab and you will see a Request URL.
For example, this https://sa-tb.nl/api/widget/chart/survey/4/sector/38 returns lots of data.
The problem here is that it may or may not be repeatable (the API may change, IDs may change). You may have a similar problem with plain HTML scraping, though, as the HTML could change just as easily.
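As a rough sketch of that approach (the endpoint is the one found above, but the survey/sector IDs to loop over and the shape of the JSON are assumptions you would confirm in the dev tools):

import requests

# hypothetical sketch: the IDs come from the dev-tools request above
API_URL = 'https://sa-tb.nl/api/widget/chart/survey/{survey}/sector/{sector}'

def fetch_sector(survey_id, sector_id):
    response = requests.get(API_URL.format(survey=survey_id, sector=sector_id))
    response.raise_for_status()
    return response.json()

data = fetch_sector(4, 38)
print(data)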
I'm trying to write a Python script which parses one element from a website and simply prints it.
I couldn't figure out how to achieve this without Selenium's webdriver, which opens a browser that handles the scripts needed to properly display the website.
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('http://groceries.asda.com/asda-webstore/pages/landing/home.shtml#!product/910000800509')
content = browser.page_source
print(content[42000:43000])
browser.close()
This is just a rough draft which will print the contents, including the element of interest <span class="prod-price-inner">£13.00</span>.
How could I get the element of interest without the browser opening, or even without a browser at all?
Edit: I've previously tried urllib in Python and wget in bash, both of which lack the required JavaScript interpretation.
As other answers mentioned, this webpage requires JavaScript to render its content, so you can't simply fetch and process the page with lxml, Beautiful Soup, or a similar library. But there's a much simpler way to get the information you want.
I noticed that the link you provided fetches data from an internal API in a structured fashion. Based on the URL, the product number appears to be 910000800509. If you look at the Network tab in Chrome dev tools (or your browser's equivalent), you'll see that a GET request is being made to the following URL: http://groceries.asda.com/api/items/view?itemid=910000800509.
You can make the request with just the requests module:
import requests

url = 'http://groceries.asda.com/api/items/view?itemid=910000800509'
r = requests.get(url)
# the response is JSON with the product details
price = r.json()['items'][0]['price']
print(price)
# £13.00
This also gives you access to lots of other information about the product, since the request returns some JSON with product details.
How could I get the element of interest without the browser opening,
or even without a browser at all?
After inspecting the page you're trying to parse:
http://groceries.asda.com/asda-webstore/pages/landing/home.shtml#!product/910000800509
I realized that it only displays the content if JavaScript is enabled; based on that, you need to use a real browser.
Conclusion:
The way to go, if you need to automate this, is:
selenium
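To address the "without the browser opening" part: Selenium can run Firefox headless, so no window appears. A sketch, assuming a reasonably recent Selenium (the class name comes from the element quoted in the question):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

# run Firefox headless so no browser window opens
options = Options()
options.add_argument('-headless')
browser = webdriver.Firefox(options=options)

browser.get('http://groceries.asda.com/asda-webstore/pages/landing/home.shtml#!product/910000800509')
# locate the price element directly instead of slicing page_source
price = browser.find_element(By.CLASS_NAME, 'prod-price-inner').text
print(price)
browser.quit()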