I've made a website that uses Bootstrap's tab navigation. For my blog, I have an iframe embedded in one of these tabs. I'm trying to make it so that if I give someone a link with the query ?blogPost=a_post, the src of the iframe automatically changes to /blog/a_post.html. I started by setting up a script to read the query parameters based on this article: https://www.sitepoint.com/get-url-parameters-with-javascript/. Then I adapted the code to look like this:
function onload() {
  let queryString = window.location.search;
  let urlParams = new URLSearchParams(queryString);
  if (urlParams.has('blogPost')) {
    let blogPost = urlParams.get('blogPost');
    document.getElementById('pills-home-tab').classList.remove('active');
    document.getElementById('pills-blog-tab').classList.add('active');
    document.getElementById("child-iframe").src = `./blog/${blogPost}.html`;
  } else {
    return;
  }
}
window.onload = onload;
However, when I load the page https://website.com/?blogPost=welcome_to_my_blog, none of the classList properties change, and the iframe stays at /blog.html.
I'm not sure where I'm going wrong, but I'm certain what I'm trying to do is possible, so I assume the error is in my code. No errors show up in my browser's console, either.
You declared a global function called onload; this overwrites the native window.onload and stops it from working. You can change the name of your function, or just assign the handler directly like this:
window.onload = () => {
  let queryString = window.location.search;
  let urlParams = new URLSearchParams(queryString);
  if (urlParams.has("blogPost")) {
    let blogPost = urlParams.get("blogPost");
    document.getElementById("pills-home-tab").classList.remove("active");
    document.getElementById("pills-blog-tab").classList.add("active");
    document.getElementById("child-iframe").src = `./blog/${blogPost}.html`;
  }
};
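For example, the "rename the function" option could look something like this (the handler name here is just an illustration; the element IDs come from your own markup):

function showBlogPostFromQuery() {
  const urlParams = new URLSearchParams(window.location.search);
  if (!urlParams.has('blogPost')) return;
  const blogPost = urlParams.get('blogPost');
  // switch the active tab and point the iframe at the requested post
  document.getElementById('pills-home-tab').classList.remove('active');
  document.getElementById('pills-blog-tab').classList.add('active');
  document.getElementById('child-iframe').src = `./blog/${blogPost}.html`;
}
window.addEventListener('load', showBlogPostFromQuery);

Using addEventListener also means the handler won't be clobbered if something else assigns to window.onload later.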
I want to create an app using Node.js & Gulp.js that opens a specific URL and then scrolls the page down to the end. Is that possible?
Here is my code inside gulpfile.js
const {series, src, dest} = require('gulp');
var fs = require('fs');
var path = require('path');
var open = require('open');
var https = require('https');

async function getLinks(params) {
  var pageLink = 'https://youtube.com';
  var links = [];
  open(pageLink, {app: 'chrome'});
  https.get(pageLink, (res) => {
    let rawHtml = '';
    res.on('data', (chunk) => { rawHtml += chunk; });
    res.on('end', () => {
      try {
        console.log(rawHtml);
      } catch (e) {
        console.error(e.message);
      }
    });
  });
}
exports.default = getLinks;
I would be thankful for some help!
I do not believe that exactly what you are looking for is currently possible without the use of extensions or some external program, but there are some alternatives that may help you accomplish part of what you are aiming to do.
IDs
If the page you are linking to has an ID on the element you want to link to, you can append that to the end of the URL. For example, if you wanted to link to an element with the id pricing, your link would look like this:
https://example.com#pricing
Obviously this is only useful on some pages, and only with elements that have IDs.
Text Fragments
These are slightly closer to what you may be looking for, in that they allow you to link to anywhere on a page, regardless of whether the element has an ID or not. Here is an example of how you would link to the More information... text on example.com:
http://example.com/#:~:text=more%20information...
Unfortunately this still has some cons, chiefly in the browser-support arena. According to caniuse, only Chromium browsers currently support the feature. (That is about 71% of users, though.)
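If you want to build such a link programmatically, a minimal sketch (the phrase here is just an example) could look like this:

// build a text-fragment link to a phrase on the current page
var phrase = 'More information...';
var link = location.origin + location.pathname + '#:~:text=' + encodeURIComponent(phrase);
console.log(link); // e.g. https://example.com/#:~:text=More%20information...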
When I fetch a website using fetch, I need to figure out a way to get an element which only loads after 3-4 seconds. How should I attempt to do this? My code currently is:
const body = await fetch('websiteurl')
let html = await body.text();
const parser = new DOMParser();
html = parser.parseFromString(html, 'text/html');
html.getElementById('pde-ff'); // undefined
I can assure you this element exists: if I go to the website and run the last line with document instead of html, it works. But I need to wait for the website to load. Any ideas?
Pretty much, your stuff isn't loading at the correct time, so you need to wait for the HTML to load. There are a lot of ways to do this:
Use jQuery: $(function() { alert("It's loaded!"); });
Use vanilla JS: window.addEventListener('load', function () { alert("It's loaded!"); });
My HTML block is as follows:
<html>
  <head>
    <title>Example</title>
  </head>
  <body>
    <h2>Profile Photo</h2>
    <div id="photo-container">Photo will load here</div>
    <script type='text/javascript' src='http://example.com/js/coverphoto.js?name=johndoe'></script>
  </body>
</html>
and I have saved this file as test.html. In the JavaScript source the name will be dynamic.
I want to collect the name in coverphoto.js. I tried this in coverphoto.js:
window.location.href.slice(window.location.href.indexOf('?') + 1).split('&')
but that only gets the HTML page's URL (test.html). How can I retrieve the name key from http://example.com/js/coverphoto.js?name=johndoe inside coverphoto.js?
To get the URL of the current JavaScript file you can use the fact that it will be the last <script> element currently on the page.
var scripts = document.getElementsByTagName('script');
var script = scripts[scripts.length - 1];
var scriptURL = script.src;
Please note that this code will only work if it executes directly within the JS file, i.e. not inside a document-ready callback or anything else that's called asynchronously. Then you can use any "get querystring parameter" JS (but make sure to replace any location.search references in there) to extract the argument.
I'd suggest you put the value in a data-name attribute instead though - that way you can simply use e.g. script.getAttribute('data-name') to get the value.
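Putting that together for your coverphoto.js case, a minimal sketch (assuming a browser with URL/URLSearchParams support, i.e. not IE) could be:

// inside coverphoto.js, at the top level (not in a callback)
var scripts = document.getElementsByTagName('script');
var currentScript = scripts[scripts.length - 1];
var name = new URL(currentScript.src).searchParams.get('name');
console.log(name); // "johndoe" for coverphoto.js?name=johndoe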
You can use the stack trace technique.
This technique will detect the source of the JS file the script is running from, and it doesn't depend on the way you injected the script: it can be dynamically injected (Ajax) or included by whatever method you can think of.
Just use the following code in your JS file:
const STACK_TRACE_SPLIT_PATTERN = /(?:Error)?\n(?:\s*at\s+)?/;
const STACK_TRACE_ROW_PATTERN1 = /^.+?\s\((.+?):\d+:\d+\)$/;
const STACK_TRACE_ROW_PATTERN2 = /^(?:.*?#)?(.*?):\d+(?::\d+)?$/;
const getFileParams = () => {
  const stack = new Error().stack;
  const row = stack.split(STACK_TRACE_SPLIT_PATTERN, 2)[1];
  const [, url] = row.match(STACK_TRACE_ROW_PATTERN1) || row.match(STACK_TRACE_ROW_PATTERN2) || [];
  if (!url) {
    console.warn("Something went wrong. This probably means that the browser you are using is non-modern. You should debug it!");
    return;
  }
  try {
    const urlObj = new URL(url);
    return urlObj.searchParams;
  } catch (e) {
    console.warn(`The URL '${url}' is not valid.`);
  }
};

const params = getFileParams();
if (params) {
  console.log(params.get('name'));
}
Note:
The URL searchParams property will not work in the IE browser; instead you can fall back to parsing url.search yourself. But, for the sake of your nerves, don't. Whoever is still using IE in 2020, just let them suffer.
I want to crawl a page, check for the hyperlinks on that page, follow those hyperlinks, and capture data from the resulting pages.
Generally, browser JavaScript can only crawl within the domain of its origin, because fetching pages would be done via Ajax, which is restricted by the Same-Origin Policy.
If the page running the crawler script is on www.example.com, then that script can crawl all the pages on www.example.com, but not the pages of any other origin (unless some edge case applies, e.g., the Access-Control-Allow-Origin header is set for pages on the other server).
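As a rough illustration of that same-origin case, a minimal sketch of a crawl using fetch and DOMParser (no error handling, and it assumes every link points to an HTML page) might look like this:

// crawl pages on the current origin, collecting the set of visited URLs
async function crawl(url, seen = new Set()) {
  if (seen.has(url)) return seen;
  seen.add(url);
  const html = await (await fetch(url)).text();
  const doc = new DOMParser().parseFromString(html, 'text/html');
  for (const a of doc.querySelectorAll('a[href]')) {
    const next = new URL(a.getAttribute('href'), url);
    if (next.origin === location.origin) {
      await crawl(next.href, seen); // stays within the same origin
    }
  }
  return seen;
}

crawl(location.href).then(urls => console.log([...urls]));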
If you really want to write a fully-featured crawler in browser JS, you could write a browser extension: for example, Chrome extensions are packaged Web applications that run with special permissions, including cross-origin Ajax. The difficulty with this approach is that you'll have to write multiple versions of the crawler if you want to support multiple browsers. (If the crawler is just for personal use, that's probably not an issue.)
If you use server-side JavaScript, it is possible.
You should take a look at node.js.
An example of a crawler can be found in the link below:
http://www.colourcoding.net/blog/archive/2010/11/20/a-node.js-web-spider.aspx
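For a rough idea of what such a Node crawler starts from (the URL and the naive href regex are just placeholders), a minimal sketch could be:

const https = require('https');

// fetch one page and print the URLs found in its href attributes
function crawl(pageUrl) {
  https.get(pageUrl, (res) => {
    let html = '';
    res.on('data', (chunk) => { html += chunk; });
    res.on('end', () => {
      const links = [...html.matchAll(/href="([^"]+)"/g)].map((m) => m[1]);
      console.log(links);
      // a real spider would normalise these URLs, deduplicate them, and call crawl() on each
    });
  }).on('error', console.error);
}

crawl('https://example.com/');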
Google's Chrome team released Puppeteer in August 2017, a Node library which provides a high-level API for both headless and non-headless Chrome (headless Chrome has been available since version 59).
It uses an embedded version of Chromium, so it is guaranteed to work out of the box. If you want to use a specific Chrome version, you can do so by launching Puppeteer with an executable path as a parameter, such as:
const browser = await puppeteer.launch({executablePath: '/path/to/Chrome'});
An example of navigating to a webpage and taking a screenshot of it shows how simple it is (taken from the GitHub page):
const puppeteer = require('puppeteer');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({path: 'example.png'});
  await browser.close();
})();
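For crawling specifically, a similar sketch that collects the hyperlinks on a page instead of a screenshot might look like this:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  // evaluate a function in the page to pull out every href
  const links = await page.$$eval('a', (anchors) => anchors.map((a) => a.href));
  console.log(links);
  await browser.close();
})();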
We can crawl pages using JavaScript from the server side with the help of headless WebKit. For crawling, we have a few libraries like PhantomJS and CasperJS; there is also a newer wrapper around PhantomJS called Nightmare JS which makes the work easier.
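As a rough sketch of the Nightmare JS style (the URL is just a placeholder), collecting the links on a page could look something like this:

const Nightmare = require('nightmare');
const nightmare = Nightmare({ show: false });

nightmare
  .goto('https://example.com')
  // runs in the page context, so the DOM is available here
  .evaluate(() => Array.from(document.querySelectorAll('a')).map((a) => a.href))
  .end()
  .then((links) => console.log(links))
  .catch((err) => console.error(err));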
There are ways to circumvent the same-origin policy with JS. I wrote a crawler for Facebook that gathered information from the profiles of my friends and my friends' friends and allowed filtering the results by gender, current location, age, marital status (you catch my drift). It was simple. I just ran it from the console. That way your script gets the privilege to make requests on the current domain. You can also make a bookmarklet to run the script from your bookmarks.
Another way is to provide a PHP proxy. Your script will access the proxy on the current domain and request files from another with PHP. Just be careful with those: they might get hijacked and used as a public proxy by a 3rd party if you are not careful.
Good luck, maybe you'll make a friend or two in the process, like I did :-)
My typical setup is to use a browser extension with cross-origin privileges granted, which injects both the crawler code and jQuery.
Another take on JavaScript crawlers is to use a headless browser like PhantomJS or CasperJS (which boosts Phantom's powers).
This is what you need: http://zugravu.com/products/web-crawler-spider-scraping-javascript-regular-expression-nodejs-mongodb
They use NodeJS, MongoDB, and ExtJS as the GUI.
Yes, it is possible.
Use Node.js (it's server-side JS).
Node.js has npm (a package manager that handles third-party modules).
Use PhantomJS in Node.js (PhantomJS is a third-party module that can crawl through websites); a rough sketch follows below.
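A minimal sketch of that combination, assuming the phantom bridge module from npm (the exact API differs between versions, so treat this as an outline), could be:

const phantom = require('phantom');

(async () => {
  const instance = await phantom.create();        // spawn a PhantomJS process
  const page = await instance.createPage();
  await page.open('https://example.com');         // load the page, scripts included
  const content = await page.property('content'); // the rendered HTML
  console.log(content);
  await instance.exit();
})();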
There is a client-side approach for this, using the Firefox Greasemonkey extension. With Greasemonkey you can create scripts to be executed each time you open specified URLs.
Here is an example:
If you have URLs like these:
http://www.example.com/products/pages/1
http://www.example.com/products/pages/2
then you can use something like this to open all the pages containing the product list (execute this manually):
var j = 0;
for (var i = 1; i < 5; i++) {
  setTimeout(function(){
    j = j + 1;
    window.open('http://www.example.com/products/pages/' + j, '_blank');
  }, 15000 * i);
}
Then you can create a script to open all products in new windows for each product-list page, and include this URL pattern in Greasemonkey for that:
http://www.example.com/products/pages/*
Then write a script for each product page to extract the data, call a webservice passing that data, close the window, and so on. A rough sketch of that last script follows below.
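Such a product-page userscript might look like this (the selectors and webservice URL are purely hypothetical):

// ==UserScript==
// @name     Extract product data (sketch)
// @include  http://www.example.com/products/*
// @grant    none
// ==/UserScript==

var product = {
  title: document.querySelector('h1') && document.querySelector('h1').textContent,
  price: document.querySelector('.price') && document.querySelector('.price').textContent
};

// send the extracted data to your own webservice, then close the tab
fetch('https://your-webservice.example/collect', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify(product)
}).then(function () {
  window.close(); // allowed because the window was opened by the script above
});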
I made an example JavaScript crawler on GitHub.
It's event-driven and uses an in-memory queue to store all the resources (i.e. URLs).
How to use it in your Node environment:
var Crawler = require('../lib/crawler')
var crawler = new Crawler('http://www.someUrl.com');
// crawler.maxDepth = 4;
// crawler.crawlInterval = 10;
// crawler.maxListenerCurrency = 10;
// crawler.redisQueue = true;
crawler.start();
Here I'm just showing you 2 core methods of a JavaScript crawler.
Crawler.prototype.run = function() {
  var crawler = this;
  process.nextTick(() => {
    //the run loop
    crawler.crawlerIntervalId = setInterval(() => {
      crawler.crawl();
    }, crawler.crawlInterval);
    //kick off the first one
    crawler.crawl();
  });
  crawler.running = true;
  crawler.emit('start');
}

Crawler.prototype.crawl = function() {
  var crawler = this;
  if (crawler._openRequests >= crawler.maxListenerCurrency) return;
  //go get the item
  crawler.queue.oldestUnfetchedItem((err, queueItem, index) => {
    if (queueItem) {
      //got the item, start the fetch
      crawler.fetchQueueItem(queueItem, index);
    } else if (crawler._openRequests === 0) {
      crawler.queue.complete((err, completeCount) => {
        if (err) throw err;
        crawler.queue.getLength((err, length) => {
          if (err) throw err;
          if (length === completeCount) {
            //no open requests, no unfetched items: stop the crawler
            crawler.emit("complete", completeCount);
            clearInterval(crawler.crawlerIntervalId);
            crawler.running = false;
          }
        });
      });
    }
  });
};
Here is the GitHub link: https://github.com/bfwg/node-tinycrawler.
It is a JavaScript web crawler written in under 1,000 lines of code.
This should put you on the right track.
You can make a web crawler driven from a remote JSON file that opens all links from a page in new tabs as soon as each tab loads, except ones that have already been opened. If you set this up with a browser extension running on a basic machine (nothing running except the web browser and an internet-configuration program) and had it shipped and installed somewhere with good internet, you could make a database of webpages with an old computer. It would just need to retrieve the content of each tab. You could do that for about $2,000, contrary to most estimates for search-engine costs. You'd just need to make your algorithm rank pages based on how often a term appears in the innerText property of the page, in the keywords, and in the description. You could also set up another PC to recrawl old pages from the one-time database and add more. I'd estimate it would take about 3 months and $20,000, maximum.
Axios + Cheerio
You can do this with axios and cheerio. Check the axios docs for the response format.
const cheerio = require('cheerio');
const axios = require('axios');

//crawl
//get url
var url = 'http://amazon.com';

axios.get(url)
  .then((res) => {
    //response format
    var body = res.data;
    var statusCode = res.status;
    var statusText = res.statusText;
    var headers = res.headers;
    var request = res.request;
    var config = res.config;
    //jquery
    let $ = cheerio.load(body);
    //example
    //meta tags
    var title = $('meta[name=title]').attr('content');
    if (title == undefined || title == 'undefined') {
      title = $('title').text();
    }
    var description = $('meta[name=description]').attr('content');
    var keywords = $('meta[name=keywords]').attr('content');
    var author = $('meta[name=author]').attr('content');
    var type = $('meta[http-equiv=content-type]').attr('content');
    var favicon = $('link[rel="shortcut icon"]').attr('href');
  }).catch(function (e) {
    console.log(e);
  });
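If you also want the hyperlinks mentioned in the question, the same $ instance can collect them inside the .then callback, for example:

// inside the .then callback above, after cheerio.load(body)
var links = [];
$('a[href]').each(function () {
  links.push($(this).attr('href'));
});
console.log(links);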
Node-Fetch + Cheerio
You can do the same thing with node-fetch and cheerio.
//node-fetch has to be required in Node; cheerio and url are the same as in the axios example above
const fetch = require('node-fetch');

fetch(url, {
  method: "GET",
}).then(function(response){
  //response body as text (a promise, resolved by the next .then)
  return response.text();
})
.then(function(res) {
  //response html
  var html = res;
  //jquery
  let $ = cheerio.load(html);
  //meta tags
  var title = $('meta[name=title]').attr('content');
  if (title == undefined || title == 'undefined') {
    title = $('title').text();
  }
  var description = $('meta[name=description]').attr('content');
  var keywords = $('meta[name=keywords]').attr('content');
  var author = $('meta[name=author]').attr('content');
  var type = $('meta[http-equiv=content-type]').attr('content');
  var favicon = $('link[rel="shortcut icon"]').attr('href');
})
.catch((error) => {
  console.error('Error:', error);
});