Apify crawler with more than 2 clickable elements - javascript

I am trying to create an Apify crawler which has multiple clickable elements. The first click paginates, the second click visits each result, and the third visits a section of each result to extract more information.
function pageFunction(context) {
    var $ = context.jQuery;
    if (context.request.label === 'category' || context.request.label === 'detail') {
        context.skipLinks();
        var result = {
            item_name: $('name').text(),
            categories: $('.categories').text(),
            email: $('email').text(),
            kvk: $('kvk').text()
        };
        return result;
    } else {
        context.skipOutput();
    }
}
The first two clicks are happening: it paginates, visits the results, and extracts the first three values (item_name, categories, and email).
The fourth value, kvk, is not returned. I think either the third click is not happening or the code I used has some errors. Can anyone please help me fix this?

One of the problems may be context.skipLinks(), a function that prevents any new pages from being enqueued. Also, did you check all the selectors in the developer console? For debugging, I would advise you to log the content of the page so you know it loaded. First, you need to find the source of the problem.
One side note: I would advise you to start developing with our modern Web Scraper. The Crawler platform is no longer maintained and may perform worse in some cases.
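For instance, here is a minimal sketch of that idea, assuming the legacy Crawler pageFunction API from your snippet and that kvk only exists on the final detail pages: only the last click's pages call skipLinks(), so the intermediate pages keep enqueueing links for the next click.

function pageFunction(context) {
    var $ = context.jQuery;
    // Only the final 'detail' pages should stop following links;
    // calling skipLinks() on 'category' pages blocks the third click.
    if (context.request.label === 'detail') {
        context.skipLinks();
        return {
            item_name: $('name').text(),
            categories: $('.categories').text(),
            email: $('email').text(),
            kvk: $('kvk').text()
        };
    }
    // Log intermediate pages so you can confirm they actually loaded.
    console.log('visited ' + context.request.label + ': ' + context.request.url);
    context.skipOutput();
}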

Related

Can I use ActionCable to refresh the page?

I've recently been trying to create a live-scoring system for squash matches. I've managed to use ActionCable with Rails 5 to auto-update the score on the page, but I'd like to know if it's possible to tell Rails to refresh the page if a certain condition is met.
For example, if the game has finished, a different page is shown to say that the players are having a break between games. I need the page to refresh completely for this to happen.
In my database the boolean 'break' is marked as true when a game ends, and then the view uses a conditional if/else statement to decide what to show.
The code I use to update the score is attached below. I was thinking something along the lines of: if data.break == true, then the page automatically refreshes.
// match_channel.js (app/assets/javascripts/channels/match_channel.js)
$(function() {
  $('[data-channel-subscribe="match"]').each(function(index, element) {
    var $element = $(element),
        match_id = $element.data('match-id'),
        messageTemplate = $('[data-role="message-template"]');

    App.cable.subscriptions.create(
      {
        channel: "MatchChannel",
        match: match_id
      },
      {
        received: function(data) {
          var content = messageTemplate.children().clone(true, true);
          content.find('[data-role="player_score"]').text(data.player_score);
          content.find('[data-role="opponent_score"]').text(data.opponent_score);
          content.find('[data-role="server_name"]').text(data.server_name);
          content.find('[data-role="side"]').text(data.side);
          $element.append(content);
        }
      }
    );
  });
});
I don't know if this sort of thing is possible, and I'm not much good at anything JavaScript-related, so I'd appreciate any help on this.
Thanks.
Reloading the current page is relatively straightforward. If you are using Turbolinks, you can use Turbolinks.visit(location.toString()) to trigger a revisit to the current page. If you aren't using Turbolinks, use location.reload(). So, your received function might look like:
received: function(data) {
  if (data.break) {
    return location.reload();
    // or...
    // return Turbolinks.visit(location.toString());
  }
  // your DOM updates
}
Either way is equivalent to the user hitting the reload button, so it will trigger another GET request, which calls your controller and re-renders the view.
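Folded into the subscription from your question, the handler might look like this (a sketch; it assumes your broadcast also includes the break flag):

received: function(data) {
  // When the game has ended, reload so the server renders the break view.
  if (data.break) {
    return location.reload();
  }
  // Otherwise keep appending live score updates as before.
  var content = messageTemplate.children().clone(true, true);
  content.find('[data-role="player_score"]').text(data.player_score);
  content.find('[data-role="opponent_score"]').text(data.opponent_score);
  content.find('[data-role="server_name"]').text(data.server_name);
  content.find('[data-role="side"]').text(data.side);
  $element.append(content);
}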

Recursive Facebook Page Webscraper with Selenium & Node.js

What I'm trying to do is loop through an array of Facebook page IDs and return the source of each event page. Unfortunately, I only get the source of the last page ID in the array, but as many times as there are elements in the array. E.g. with 3 IDs in the array, I get the source of the last page ID 3 times.
I already experimented with async/await, but I had no success.
The expected outcome would be the source of each page.
Thank you for any help and examples.
// Looping through pages
pages.forEach(
  function(page) {
    // Creating URL
    let url = "https://mbasic.facebook.com/" + page + "?v=events";
    // Getting URL
    driver.get(url).then(
      function() {
        // Page loaded
        driver.getPageSource().then(function(result) {
          console.log(result);
        });
      }
    );
  }
);
You are facing the same issue I did when I created a scraper using Python and Selenium. Facebook has countermeasures against manual URL changes; you cannot just swap the URL. I received the same data again and again even though it was automated. To get a good result you need access to Facebook's Graph API, which provides a complete object for a Facebook page along with its pagination URL.
The second way I got it right was to use Selenium's click automation to scroll down to the next page. It won't work the way you are typing it; I prefer using the Graph API.
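For what it's worth, the loop in the question has a second problem that is independent of Facebook's countermeasures: forEach fires every driver.get() immediately, so each getPageSource() resolves against whichever page happened to load last. A sketch of serializing the visits with async/await (assuming the selenium-webdriver package and the pages array from the question):

// Visit the pages one at a time instead of firing all requests at once.
async function scrapeEventPages(driver, pages) {
    for (const page of pages) {
        const url = "https://mbasic.facebook.com/" + page + "?v=events";
        await driver.get(url);                        // wait for this page to load
        const source = await driver.getPageSource();  // source of *this* page
        console.log(source);
    }
}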

python requests popup generating a key

I have just started my journey with Python and am just amazed at how much one can do in less than 50 lines of code.
I got stuck, however, writing an app which is to:
1. connect with a web site
2. log in using my credentials
3. fill in a form
4. choose a link from the results returned by the form
5. use the link to confirm an appointment
6. use the link to book an appointment
I went all the way through but got stuck at point 6 above. Maybe this is due to something going wrong at point 5, so please let me start with what happens in step 4 :-).
I enter step 4 with the form site generating a list of available appointments. Those are links attached to buttons, like:
<a class="button strong" href="/your_account/confirm?CityId=500&VisitId=42204&HasReferral=False" target="popup"> click to book </a>
NOTE: the button opens a pop-up.
I use BeautifulSoup to pick one of the links, then I convert its parameters to a dictionary:
entry = {'CityId': '500',
         'VisitId': '42204',
         'HasReferral': 'False'}
In step 5 I continue a session which I had created with Python's requests module and POST:
a = my_session.post('https://the_website.com/your_account/confirm', data=entry)
What is returned in the a object is the pop-up. In the pop-up's code there is a function and a button, like the ones below:
accept = function () {
    if (canRedirect()) {
        var url = '/your_account/reserve?key=55924c2b-9b30-4714-ad6c-8f47c72893cd';
        $.post(url, function(html) {
            $("#dynamicReservationDivCntainer").html(html);
        });
    }
};
<button onclick="accept()" id="okButton" class="button strong right reserveConfirmButton">Click here to hit the deal</button>
So here is step 6 and my problem. When I extract the site address and key from a.text, as in step 5, and POST it:
key_dict = {'key': '55924c2b-9b30-4714-ad6c-8f47c72893cd'}
b = my_session.post('https://the_website.com/your_account/reserve', data=key_dict)
It turns out that b contains a page saying something like: "Failure. The key already exists in the database. Please enter another key." and nothing gets booked.
Ah, please let me know if I missed something important in the story. I tried to extract what I believe is the crux of the matter and am afraid I may have oversimplified.

How do I pass a value from an HTML form submission to a Google Sheet and back to HTML in a Google Apps Script Web App

I'm trying to create a basic time clock web app.
So far, I'm using this script to create this web app which takes the input values and puts them in this spreadsheet for the time stamping part.
I need it to use one of the values from the form and perform a lookup in this sheet (take the longId and find me the name) and return the (name) value to the html page as a verification for the end user that they were identified correctly. Unfortunately, I don't know enough to grasp what I'm doing wrong. Let me know if I need to provide more info.
Edit 1
I'm thinking that I wasn't clear enough. I don't need the user info from entry, I need the user from a lookup. The user will be entering their ID anonymously, I need to match the ID to their info, and bring the info back for them to verify.
Edit 2
Using the link provided by Br. Sayan, I've created this script, using this spreadsheet as above, to test one piece of it. The web app here spits out: undefined. It should spit out "Student 3". Still not sure what I'm doing wrong.
One way for the next button to grab the student input field:
<input type="submit" onclick="studentName(document.getElementById('student').value)" value="Next..."/>
That sends the value to this function in Javascript.html:
function studentName(value) {
  google.script.run
      .withSuccessHandler(findSuccess)
      .findStudent(value);
}
Which sends it to findStudent(value) in Code.gs.
You do the lookup there, and the return value goes back to findSuccess(result) in Javascript.html. Handle the result from there.
Also consider keeping the stock preventDefault() code that comes with the Web App template in the Help > Welcome Screen.
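A minimal sketch of the two remaining pieces, assuming the lookup sheet from the question (the spreadsheet ID, sheet name, column layout, and element ID are all placeholders to adjust):

// Code.gs -- look the submitted ID up and return the matching name.
function findStudent(id) {
  // Assumed layout: IDs in column D, names in column A of "Volatile Data".
  var data = SpreadsheetApp.openById('YOUR_SPREADSHEET_ID')
      .getSheetByName('Volatile Data')
      .getDataRange()
      .getValues();
  for (var i = 0; i < data.length; i++) {
    if (String(data[i][3]) === String(id)) {
      return data[i][0]; // the matching name
    }
  }
  return null; // no match found
}

// Javascript.html -- receives whatever findStudent returned.
function findSuccess(result) {
  // 'verification' is a hypothetical element where the name is shown.
  document.getElementById('verification').textContent =
      result ? 'You are: ' + result : 'ID not found';
}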
Please try the approach described at technokarak.com.
Also please have a look at:
Retrieve rows from spreadsheet data using GAS
EDIT:
Please make these changes in your function and let us know.
function findValue() {
  // Read the whole sheet into a 2D array.
  var data = SpreadsheetApp.openById("15DRZRQ2Hcd7MNnAsu_lnZ6n4kiHeXW_OMPP3squbTLE")
      .getSheetByName("Volatile Data")
      .getDataRange()
      .getValues();
  for (var i in data) {
    // Column D (index 3) holds the ID we are matching against.
    if (data[i][3] == 100000003) {
      Logger.log("yes");
      Logger.log(data[i][0]);
      // Column A (index 0) holds the student's name.
      var student = [];
      student.push(data[i][0]);
      return student;
    }
  }
}
It is a complicated answer, but I have had a lot of success with:
function process(object) {
  // Identify the signed-in user and pull the Key field from the form object.
  var user = Session.getActiveUser().getEmail();
  var key = object.Key;
  send(key);
}

function send(k) {
  var ss = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
  // Write the key into column A of the last row with content.
  var lastR = ss.getLastRow();
  ss.getRange(lastR, 1).setValue(k);
}
On your HTML button you will need to have, inside the tag:

onClick="google.script.run
    .withSuccessHandler(Success)
    .process(this.parentNode);"
In order for this to work, you will obviously need to have your fields named accordingly.
Edit: The only thing I did not include in the code was a Success handler, which will live in the HTML of the GAS script. This should point you in a direction that can resolve that.
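For example, a bare-bones handler in the HTML file might look like this (a sketch; the status element is a hypothetical placeholder):

// Javascript.html -- runs after process() finishes on the server.
function Success() {
  // process() returns nothing, so just confirm the round trip to the user.
  document.getElementById('status').textContent = 'Recorded.';
}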
Hope this helps.

Filling log in form with zombie in node.js

Evening! I'm trying to log in to a website with zombie.js, but I don't seem to be able to make it work.
Oh, and the website is in Finnish, but it's not very hard to understand: two text fields and a button. The first is for the username, the second for the password, and the button is the log-in button.
At the moment my log in code is as follows:
var Browser = require("zombie");
browser = new Browser();
browser.visit("https://www.nordnet.fi/mux/login/startFI.html?cmpi=start-loggain",
  function () {
    // Here I check the title of the page I'm on.
    console.log(browser.text("title"));
    // Here I fill in the needed information.
    browser.document.getElementById("input1").value = "MYUSERNAME";
    browser.document.getElementById("pContent").value = "MYPASSWORD";
    // And here it fails. I try to submit the form in question.
    browser.document.getElementById("loginForm").submit();
    setTimeout(function () {
      // This is here to check that we've submitted the info and have been
      // redirected to a new website.
      console.log(browser.text("title"));
    }, 2000);
  });
Now, I know that maybe I should have used zombie's own fill method, but I tried that with no luck, so I tried something new.
All I get from this is an error:
Y:\IMC\Development\Web\node_modules\zombie\lib\zombie\forms.js:72
return history._submit(_this.getAttribute("action"), _this.getAttribute(
^
TypeError: Cannot call method '_submit' of undefined
Now, if I log browser.document.getElementById("loginForm"), it clearly does find the form, but alas, it doesn't like it for some reason.
I also tried the "conventional" method with zombie, which is using the log-in button on the web page and pressing it. The problem is that it's not actually a button, just an image with a link attached to it, and it's all inside a <span>. And I have no idea how I can "click" that button.
It has no ID on it, so I can't use that. Then I tried to use the text on it, but because it has umlauts I can't get it to work. Escaping the ä with /344 only gave an error:
throw new Error("No BUTTON '" + selector + "'");
^
Error: No BUTTON 'Kirjaudu sisään'
So yeah, that didn't work, though I have no idea why it doesn't recognize the escaped umlaut correctly.
This is my first question; the second one is a minor one, but I thought, why not ask it here too now that I've written all this.
If I get all this to work, can I somehow copy the cookie that this log-in gives me and use it in my YQL for screen scraping? Basically, I'm trying to scrape stock market values, but without the log-in the values are deferred by 15 minutes, which isn't too bad, but I'd like them to be live anyhow.
After a couple of tests using zombie, I came to the conclusion that it's still too early to use it for serious testing. Nevertheless, I came up with a working example of a form submit (using the regular .submit() method).
var Browser = require("zombie");
var assert = require("assert");

browser = new Browser();
browser.visit("http://duckduckgo.com/", function () {
  // fill the search query field with the value "mouse"
  browser.fill('input[name=q]', 'mouse');
  // **how** you find a form element is irrelevant - you can use id, selector, anything you want
  // in this case it was easiest to just use the built-in forms collection - fire submit on the element found
  browser.document.forms[0].submit();
  // wait for the new page to be loaded, then fire the callback function
  browser.wait().then(function() {
    // just dump some debug data to see if we're on the right page
    console.log(browser.dump());
  });
});
As you can see, the key is to use the construct browser.wait().then(...) after submitting the form; otherwise the browser object will still refer to the initial page (the one passed as an argument to the visit method). Note: the history object will contain the address of the page you submitted your form to even if you don't wait for the page to load. That confused me for a bit, as I was sure I should already be seeing the new page.
Edit:
For your site, zombie seems to be working OK (I could submit the form and get a "wrong login or password" alert). There are some JS errors, but zombie isn't concerned with them (you should debug those, however, to see if the scripts are working OK for regular users). Anyhow, here's the script I used:
var Browser = require("zombie");
var assert = require("assert");

browser = new Browser();
browser.visit("https://www.nordnet.fi/mux/login/startFI.html?cmpi=start-loggain", function () {
  // fill in the login field
  browser.fill('#input1', 'zombie');
  // fill in the password field
  browser.fill('#pContent', 'commingyourway');
  // submit the form
  browser.document.forms[0].submit();
  // wait for the new page to be loaded, then fire the callback function
  browser.wait().then(function() {
    console.log('Form submitted ok!');
    // the resulting page will be displayed in your default browser
    browser.viewInBrowser();
  });
});
As a side note: while I was trying to come up with a working example, I tried to use the following pages (all failed for different reasons):
google.com - even though I filled the query box with a string and submitted the form, I didn't get search results. Reason? Probably Google took some measures to prevent automatic tools (such as zombie) from browsing through search results.
bing.com - same as Google: after submitting the form I didn't get search results. Reason? Probably the same as for Google.
paulirish.com - after filling in the search query box and submitting the form, zombie encountered script errors that prevented it from completing the page (something about missing ActiveX from a charts script).
perfectionkills.com - surprisingly, I encountered the same problems as with Paul Irish's site: the page with search results couldn't be loaded due to JavaScript errors.
Conclusion: it's not so easy to force zombie into doing your work after all... :)
