I am trying to scrape data from a site that provides students' grades so that I can analyze it.
I tried this code:
from selenium import webdriver
# set chromedriver.exe path
driver = webdriver.Chrome(executable_path="C:\\chromedriver.exe")
#set page load timeout
#launch URL
driver.get("https://amatti.education.gov.dz/")
The first thing that happens when I run this code is that the site opens:
[the site opens normally][1]
After the site opens, the browser is redirected to this page:
[after opening, it goes to this page][2]
I noticed this code in the site's HTML.
It means that if the browser does not support JavaScript, it is redirected to google.com:
<noscript>
<meta http-equiv="refresh" content="0; url=http://www.google.com/" />
</noscript>
Is there any way to automate this site?
[1]: https://i.stack.imgur.com/ay7QJ.png
[2]: https://i.stack.imgur.com/NWvEa.png
I found the solution.
The problem comes from WebDriver:
the site detects that a bot is scraping the data,
so I added this argument:
options.add_argument("--disable-blink-features=AutomationControlled")
and it works fine.
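The answer shows only the single `add_argument` call, so here is a minimal sketch of how it might fit into the script from the question. It assumes Selenium 4, where the driver path is passed through a `Service` object; `build_chrome_arguments` and `open_site` are hypothetical helper names, not part of Selenium.

```python
# Sketch only: assumes Selenium 4 and the chromedriver path used in the question.
AUTOMATION_FLAG = "--disable-blink-features=AutomationControlled"


def build_chrome_arguments():
    """Return the Chrome command-line arguments used by this sketch."""
    return [AUTOMATION_FLAG]


def open_site(url, driver_path="C:\\chromedriver.exe"):
    """Start Chrome with the flag from the answer and load `url`."""
    # Imported here so the helper above can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    for argument in build_chrome_arguments():
        options.add_argument(argument)

    # Selenium 4 takes the driver path through a Service object;
    # older releases used executable_path= as in the question.
    driver = webdriver.Chrome(service=Service(driver_path), options=options)
    driver.get(url)
    return driver
```

Calling `open_site("https://amatti.education.gov.dz/")` would then replace the last two lines of the original script. Whether the flag keeps working depends on how the site detects automation.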
Related
I made a Tauri hello-world app using react-ts. It contains clickable logos for Tauri, Vite, and React, each wrapped in an HTML a tag such as <a href="https://vitejs.dev" target="_blank">. Clicking one opens a new tab in my default browser and loads that URL.
So naturally, I wanted to test whether a Tauri app would open that link (or any other remote URL) inside the app's webview. I changed the tag to <a href="https://vitejs.dev">, and it did exactly that.
What I want to know is: how do I configure a Tauri app so that it does not open or load any URL unless I specifically allow it?
What I tried already:
I tried setting the CSP option in the tauri.conf.json file to 'none' so that no remote scripts or other remote resources are allowed:
"security": {
"csp": {
"default-src": ["'none'"]
}
},
I also tried searching for some kind of allowed-navigation option that someone mentioned.
I also started looking into a before-navigate hook in the main.rs file, but I don't know how to implement it.
I would really appreciate it if you could explain how to reach my objective, and I would be even more indebted to you if you could suggest some better options, or ones more appropriate for a production-ready app.
Regards,
zk.
I have a script at the end of some test data that launches Chrome. Unfortunately, I can't find a switch to open a URL with a POST request or to include POST data.
I know you can use javascript: in Chrome's URL bar to execute JavaScript, and I could probably simulate a POST request that way. However, when I try a test script, Chrome just launches a blank page. I think it has something to do with a security feature. Is there a Chrome switch that disables this security feature for that one page launch?
test script (found it here: https://productforums.google.com/d/msg/chrome/CLDQ5KhXfFk/x0r7PGY1CooJ, kind of nice)
javascript:W7=open('','A','width=320,height=240,resizable');W7.focus();with(W7.document){write('<title>Javascript Tester</title><center><form><textarea name=X rows=10 cols=34 wrap>javascript:</textarea><p><input type=button value=Run onclick=opener.location=X.value>');void(close())}
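The post does not identify a Chrome switch for POST. One workaround, which is not from the post, is to write a small local HTML page containing a form that submits itself on load, and to open that page instead. A minimal sketch, where `build_post_page` and `launch_post` are hypothetical names and the target URL and fields are placeholders:

```python
import html
import pathlib
import tempfile
import webbrowser


def build_post_page(action_url, fields):
    """Return HTML for a form that submits itself via POST when loaded."""
    inputs = "\n".join(
        '<input type="hidden" name="{}" value="{}">'.format(
            html.escape(name, quote=True), html.escape(value, quote=True)
        )
        for name, value in fields.items()
    )
    return (
        "<!DOCTYPE html>\n"
        '<html><body onload="document.forms[0].submit()">\n'
        '<form method="post" action="{}">\n{}\n</form>\n'
        "</body></html>\n"
    ).format(html.escape(action_url, quote=True), inputs)


def launch_post(action_url, fields):
    """Write the page to a temporary file and open it in the default browser."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".html", delete=False, encoding="utf-8"
    ) as handle:
        handle.write(build_post_page(action_url, fields))
        path = handle.name
    # as_uri() yields a file:// URL that is also valid on Windows paths.
    webbrowser.open(pathlib.Path(path).as_uri())
```

The generated file could equally be passed to Chrome on the command line by the existing launch script; `webbrowser.open` is used here only to keep the sketch short.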
The scenario I'm going to describe is about Excel, but you can spot the same problem in all Office tools.
Scenario:
In my default browser (NOT Internet Explorer) I'm logged in to my own website, let's call it www.mypersonalwebsite.com
I have an Excel workbook whose A1 cell contains a URL pointing to http://www.mypersonalwebsite.com/url/visible/only/to/loggedin/users
When I click on the URL in A1 cell:
my default browser is trying to open this URL
the website is refusing to serve the page because the request is coming from a non logged in user
So that's the problem: why is the browser complaining about the user session when I'm already logged in? And how can I solve it?
I found many similar questions about this problem on stackoverflow and I think I composed a portable and "definitive" solution to this problem.
First of all: why is the browser complaining about the user session?
The answer is "Microsoft Office Protocol Discovery". In a few words: it's something that works only if you are using Microsoft Windows and your default browser is Internet Explorer.
Basically, if you are not using Microsoft Windows OR your default browser is not Internet Explorer, when you click on a URL the request sent to the browser will always carry an empty cookie. This means that, although the default browser holds a valid cookie that could authenticate the user, the request coming from Excel will never use it. But if you reload the page (and the webserver has not redirected to a different error page), the browser will send the domain cookie and you'll see the correct page.
Second question: how can I solve this problem?
I think I found a very good solution, consisting of an HTML part and a webserver part.
HTML part
Starting from the fact that you need to reload the page to use the cookie, I created a simple static page containing a little javascript code and some html. This is just an example. The main part of this code is here.
<!DOCTYPE HTML>
<html lang="en-US">
<head>
<script type='text/javascript'>
function getParameterByName(name) {
var match = RegExp('[?&]' + name + '=([^&]*)').exec(window.location.search);
return match && decodeURIComponent(match[1].replace(/\+/g, ' '));
}
</script>
<meta charset="UTF-8">
<script type="text/javascript">
window.location.href = getParameterByName('newUrl');
</script>
<title>Page Redirection</title>
</head>
<body>
<!-- Note: don't tell people to `click` the link, just tell them that it is a link. -->
If you are not redirected automatically, follow the <a href='<?php echo $newUrl; ?>'>link</a>
</body>
</html>
You can access the query string via JavaScript in many ways; you can find a very interesting thread here.
This static page, let's call it redirect.html, will only do one thing: it will redirect the browser to the page specified in the newUrl parameter. Now if I put in the A1 cell something like:
http://www.mypersonalwebsite.com/redirect.html?newUrl=http://www.mypersonalwebsite.com/url/visible/only/to/loggedin/users
and if I click on this URL:
Excel will go to this URL using the default browser
The browser will open the redirect.html page with an empty cookie
The browser will reload the page using the domain cookie
The user will see the correct page as an authenticated user
The pros of this trick are: it works on all platforms and on all browsers supporting JavaScript. The con is that we need to modify all URLs in all our Excel workbooks.
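One detail to watch when rewriting the URLs: `getParameterByName` stops reading at the first `&`, so a target URL that has its own query string needs to be percent-encoded before it is placed in `newUrl`. A minimal sketch of generating such links, where `build_redirect_link` and `read_target` are hypothetical helper names:

```python
from urllib.parse import parse_qs, urlencode, urlsplit

REDIRECT_PAGE = "http://www.mypersonalwebsite.com/redirect.html"


def build_redirect_link(target_url, redirect_page=REDIRECT_PAGE):
    """Return a redirect.html link whose newUrl parameter is percent-encoded."""
    return redirect_page + "?" + urlencode({"newUrl": target_url})


def read_target(link):
    """Recover the newUrl value, mirroring what getParameterByName does."""
    return parse_qs(urlsplit(link).query)["newUrl"][0]
```

`urlencode` writes spaces as `+`, which matches the `replace(/\+/g, ' ')` step in the page's JavaScript before `decodeURIComponent` is applied.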
The webserver part
To hide this redirection from end users, and to save us from modifying all our Office documents, we can use another trick. In this example I will use nginx:
if ($http_user_agent ~* "(Excel|PowerPoint|Microsoft Office)") {
rewrite ^/(.*)$ /redirect.html?url=$1 break;
}
The meaning of this little if block is: if the incoming request is from a user agent like Excel, Powerpoint and so on, nginx will do an internal redirection to the redirect.html page, that will again do the browser redirection explained above.
This nginx redirect will completely hide the redirect trick, so we can use the original URLs and the users will always see the correct page.
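To check which clients the `if` condition would catch before deploying it, the same pattern can be tried offline. This is a small sketch; `is_office_request` is a hypothetical helper, and real Office user-agent strings should be taken from your own access logs rather than assumed.

```python
import re

# Same pattern as the nginx block; ~* in nginx means case-insensitive matching.
OFFICE_AGENT = re.compile(r"(Excel|PowerPoint|Microsoft Office)", re.IGNORECASE)


def is_office_request(user_agent):
    """Return True when the User-Agent header matches the Office pattern."""
    return OFFICE_AGENT.search(user_agent or "") is not None
```

Note that the pattern as written does not mention Word by name, so a Word request is matched only if its user agent contains "Microsoft Office".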
I'm sure all this can be improved, and I would like to learn how to do it.
I hope this will help someone in finding a complete solution to this Office problem.
I have created an iPad application.
I want to launch it from Safari. With a URL scheme, this works.
From my application, I want to send a link that opens my app when clicked.
The mail I send contains the following:
CLICK HERE TO LAUNCH APP
This is an anchor tag whose href = "MyApp://someString".
When I send this as mail, the link works fine in the mail client configured on the iPad, but it does not work in browsers. I then learned that Yahoo and Gmail deactivate links that do not start with http://
Now I want to open my app with the URL scheme MyApp:// from an HTML onload handler, similar to how iTunes opens on a PC when itunes.apple.com is opened.
With window.open('MyApp://') in the onload() function, my app still does not launch.
How can I do that?
How can I launch my app when the HTML page loads?
Make a PHP page like this:
<?php
header("Location: MyApp://somestring");
?>
<html>
<head>
<meta http-equiv="Refresh" content="0; MyApp://somestring" />
<title>Opening App...</title>
<script>
function openApp() {
window.location.href = "MyApp://somestring";
}
</script>
</head>
<body onload="openApp();">
Click here if app doesn't open...
</body>
</html>
I doubt any online email client would let you run javascript in the email. It would be extremely insecure. If they refuse to handle any other URL schema than HTTP, it is probably because of the same security concerns.
I would work around the problem by having a link like
CLICK HERE TO LAUNCH APP
Then the page on your server would just print out
<script>
window.location.href="<?= $_GET['schema'] ?>://";
</script>
(Example in PHP)
Just make sure to scrub the schema variable before you print it!
You could use a regex to make sure it only has a-z, or something like that. Otherwise you get the same security problems Yahoo and Gmail are avoiding.
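The scrubbing step can be sketched as follows. The letters-only check follows the regex suggestion above; the allowlist is an extra assumption added here, and `safe_scheme` and `ALLOWED_SCHEMES` are hypothetical names. The logic is shown in Python for brevity and would translate directly to the PHP page.

```python
import re

SCHEME_PATTERN = re.compile(r"[A-Za-z]+")
ALLOWED_SCHEMES = {"myapp"}  # hypothetical allowlist of app schemes


def safe_scheme(raw):
    """Return the scheme if it is purely alphabetic and allowlisted, else None."""
    if raw and SCHEME_PATTERN.fullmatch(raw) and raw.lower() in ALLOWED_SCHEMES:
        return raw
    return None
```

A letters-only check alone would still let `javascript` through, which is why the sketch also compares against a fixed list of known schemes.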
I'm using Python to parse an auction site.
If I open this site in a browser, it goes to a loading page and then jumps to the search result page automatically.
If I use urllib2 to open the page, the read() method only returns the loading page.
Is there a Python package that can wait until all the content is loaded, so that read() returns the full results?
Thanks.
How does the search page work? If it loads anything using Ajax, you could do some basic reverse engineering and find the URLs involved using Firebug's Net panel or Wireshark and then use urllib2 to load those.
If it's more complicated than that, you could simulate the actions JS performs manually without loading and interpreting JavaScript. It all depends on how the search page works.
Lastly, I know there are ways to run scripting on pages without a browser, since that's what some functional testing suites do, but my guess is that this could be the most complicated approach.
After tracing through the auction site's source code, I found that it uses .php to create the loading page and redirect to the result page. Reverse engineering to find the true URLs does not work because the result page has the same URL as the loading page.
And @Manoj Govindan, I've tried Mechanize, but even if I add
br.set_handle_refresh(True)
br.set_handle_redirect(True)
it still reads the loading page.
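A refresh handler can only help if the loading page actually contains a meta refresh tag, which the post does not establish for this site. A minimal sketch for checking a downloaded page by hand, written for Python 3 with the standard library only; `MetaRefreshParser` and `find_meta_refresh` are hypothetical names:

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaRefreshParser(HTMLParser):
    """Remember the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() != "refresh":
            return
        content = attributes.get("content") or ""
        match = re.search(r"url\s*=\s*['\"]?([^'\";]+)", content, re.IGNORECASE)
        if match:
            self.target = match.group(1).strip()


def find_meta_refresh(page_html, base_url):
    """Return the absolute refresh target, or None if the page has none."""
    parser = MetaRefreshParser()
    parser.feed(page_html)
    if parser.target is None:
        return None
    return urljoin(base_url, parser.target)
```

If this returns None for the loading page, the jump to the results is probably driven by JavaScript or cookies, and a refresh handler would have nothing to follow.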
After hours of searching on www, I found a possible solution : using pywin32
import win32com.client
import time
url = 'http://search.ruten.com.tw/search/s000.php?searchfrom=headbar&k=halo+reach'
ie = win32com.client.Dispatch("InternetExplorer.Application")
ie.Visible = 0
ie.Navigate(url)
while 1:
    # ReadyState 4 is READYSTATE_COMPLETE: the page has finished loading
    state = ie.ReadyState
    if state == 4:
        break
    time.sleep(1)
print(ie.Document.body.innerHTML)
However, this only works on the win32 platform; I'm looking for a cross-platform solution.
If anyone knows how to deal with this, please tell me.
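The polling loop in the pywin32 snippet is itself platform-independent; only the browser object is tied to Windows. As a sketch, the same wait can be written as a reusable helper with a timeout, so it cannot spin forever if the page never finishes loading. `wait_until` is a hypothetical name, and the browser it polls is left to whatever automation tool is available.

```python
import time


def wait_until(is_ready, timeout=30.0, interval=1.0):
    """Poll is_ready() until it returns a true value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

With the Internet Explorer object from the snippet, the call would be `wait_until(lambda: ie.ReadyState == 4)`; with another browser driver, the lambda would test that driver's own notion of "loaded".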