Extracting img urls from webpage using Google Apps Script

This is an Apps Script that goes through a webpage and collects img urls that are inside some div of a special class.
function getIMGs(url) {
  var result = UrlFetchApp.fetch(url);
  if (result.getResponseCode() == 200) {
    var doc = Xml.parse(result, true);
    var bodyHtml = doc.html.body.toXmlString();
    doc = XmlService.parse(bodyHtml);
    var html = doc.getRootElement();
    // getElementsByClassName and getElementsByTagName are auxiliary functions (not shown)
    var thumbs = getElementsByClassName(html, 'thumb');
    var sheet = SpreadsheetApp.getActiveSheet();
    for (var i in thumbs) {
      var output = '';
      var linksInMenu = getElementsByTagName(thumbs[i], 'img');
      for (var j in linksInMenu) {
        output += XmlService.getRawFormat().format(linksInMenu[j]);
      }
      var linkRegExp = /data-src="(.*?)"/;
      var dataSrc = linkRegExp.exec(output);
      // dataSrc is null when the img has no data-src attribute; this line then throws
      sheet.appendRow([dataSrc[1]]);
    }
  }
}
So first the code gets the html, and uses an auxiliary function to get certain elements, which look like this:
<div class="thumb"><div class="loader"><span class="icon-uniE611"></span></div><img src="//xxx" data-src="https://xxx/8491a83b1cacc2401907997b5b93e433c03c91f.JPG" data-target="#image-slider" data-slide-to="0"></div>
Then the code gets the img elements, and finally extracts the data-src address via RegExp.
While this kind of works, I have a problem:
1) After 9 loops it crashes on the appendRow line, because the last 4 thumb elements don't have data-src, so what I'm trying to write into the spreadsheet is null.
Any solution for this? I have fixed it for the moment by running only 9 iterations of the for loop, but this is far from optimal, as it's not automated and requires me to go through the page to count the elements with data-src.
Also, any suggestion of a more elegant solution will be appreciated! I will be really grateful for any helping hand!
Cheers
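One way to avoid the crash is to check the match result before writing it. A minimal sketch in plain JavaScript, assuming the markup of each thumb is available as a string (as output is in the script above); extractDataSrc and collectDataSrcs are hypothetical helper names, not part of Apps Script:

```javascript
// Hypothetical helper: returns the data-src value found in an HTML fragment,
// or null when the fragment has no data-src attribute.
function extractDataSrc(fragment) {
  var match = /data-src="(.*?)"/.exec(fragment);
  return match ? match[1] : null;
}

// Hypothetical helper: keeps only the fragments that actually carry a data-src.
function collectDataSrcs(fragments) {
  var urls = [];
  for (var i = 0; i < fragments.length; i++) {
    var src = extractDataSrc(fragments[i]);
    if (src !== null) {
      // In the script above, this is where sheet.appendRow([src]) would go.
      urls.push(src);
    }
  }
  return urls;
}
```

With a guard like this, the loop can run over every thumb element instead of a hard-coded count of 9, and elements without data-src are simply skipped.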

Related

Replacing Text with replacetext() and defining said replacement as heading

I have a Google spreadsheet and a Google document. The document is a report that gets filled from the spreadsheet. The spreadsheet also defines what goes into the report. Therefore I have a script which gathers a bunch of placeholders depending on values in the document.
After all the placeholders have been inserted in the document (there are a couple of pages before that) it looks kind of like this:
{{header1.1}}
{{text1.1}}//this is already a couple lines of text
{{table1.1}}
{{table.dir}}
{{blob1.1}}
{{blob.dir}}
I already have a script which inserts all the text parts, and I have set up a script which should be capable of writing the tables at the correct position. So far I can replace the {{header1.1}}, and defining it as a heading works, but everything after header1.1 also becomes a heading.
I've been at this problem for quite a while without getting anywhere; it's always one step forward, one step back. Also, this is my first question after a couple of years of just reading on Stack Overflow. I'd appreciate it if someone could help.
function myUeberschriftenboi() {
  var doc = DocumentApp.openById('someID');
  console.log(doc.getName());
  var body = doc.getBody();
  // formats
  const plain3style = {};
  plain3style[DocumentApp.Attribute.HEADING] = DocumentApp.ParagraphHeading.HEADING3;
  var lvl2array = ["{{header1.1}}", "{{header1.2}}"];
  var fill2array = ["Energy", "Energyflow"];
  var lvl2count = 1;
  for (var j = 0; j < lvl2array.length; j++) {
    var seek = body.findText(lvl2array[j]);
    if (seek != null) {
      body.replaceText(lvl2array[j], "1.1." + lvl2count + " " + fill2array[j] + "\n");
      var seek2 = body.findText("1." + lvl2count + " " + fill2array[j]);
      seek2.getElement().getParent().getChild().setAttributes(plain3style);
      lvl2count++;
    }
  }
}

Grab data from website HTML table and transfer to Google Sheets using App-Script

OK, I know there are questions similar to mine out there, but so far I have yet to find any answers that work for me. What I am trying to do is gather data from an entire HTML table on the web (https://www.sports-reference.com/cbb/schools/indiana/2022-gamelogs.html) and then parse it and transfer it to a range in my Google Sheet. The code below is probably the closest thing I've found so far because at least it doesn't error out, but it only finds one string or value, not the whole table. I've found other answers that use XmlService.parse, but that doesn't work for me, I believe because the HTML has formatting issues that it can't parse. Does anyone have an idea of how to edit what I have below, or a whole new idea that may work for this website?
function SAMPLE() {
const url="http://www.sports-reference.com/cbb/schools/indiana/2022-gamelogs.html#sgl-basic?"
// Get all the static HTML text of the website
const res = UrlFetchApp.fetch(url, {muteHttpExceptions: true}).getContentText();
// Find the index of the string of the parameter we are searching for
index = res.search("td class");
// create a substring to only get the right number values ignoring all the HTML tags and classes
sub = res.substring(index+92,index+102);
Logger.log(sub);
return sub;
}
I understand that I can use IMPORTHTML natively in a Google Sheet, and that's what I'm currently doing. However, I am doing this for over 350 webpage tables, iterating through each one to load it and then copy the value to another sheet. Apps Script bogs down quite a bit when it is repeatedly waiting on Sheets to load an IMPORTHTML, grab some data, and do it all over again on another URL. I apologize for any formatting issues in this post or things I've done wrong; this is my first time posting here.
Edit: OK, I've found a method that works, but it's still much slower than I would like, because it uses the Drive API to create a document from the HTML data and then parses it and builds an array from there. The Drive.Files.insert line is the most time-consuming part. Does anyone have an idea of how to make this quicker? It may not seem that slow right now, but when I need to do this 350 times, it adds up.
function parseTablesFromHTML() {
var html = UrlFetchApp.fetch("https://www.sports-reference.com/cbb/schools/indiana/2022-gamelogs.html");
var docId = Drive.Files.insert(
{ title: "temporalDocument", mimeType: MimeType.GOOGLE_DOCS },
html.getBlob()
).id;
var tables = DocumentApp.openById(docId)
.getBody()
.getTables();
var res = tables.map(function(table) {
var values = [];
for (var row = 0; row < table.getNumRows(); row++) {
var temp = [];
var cols = table.getRow(row);
for (var col = 0; col < cols.getNumCells(); col++) {
temp.push(cols.getCell(col).getText());
}
values.push(temp);
}
return values;
});
Drive.Files.remove(docId);
var range=SpreadsheetApp.getActive().getSheetByName("Test").getRange(3,6,res[0].length,res[0][0].length);
range.setValues(res[0]);
SpreadsheetApp.flush();
}

Solution by formula
Try
=importhtml(url,"table",1)
Other solution by script
function importTableHTML() {
var url = 'https://www.sports-reference.com/cbb/schools/indiana/2022-gamelogs.html'
var html = '<table' + UrlFetchApp.fetch(url, {muteHttpExceptions: true}).getContentText().replace(/(\r\n|\n|\r|\t| )/gm,"").match(/(?<=\<table).*(?=\<\/table)/g) + '</table>';
var trs = [...html.matchAll(/<tr[\s\S\w]+?<\/tr>/g)];
var data = [];
for (var i=0;i<trs.length;i++){
var tds = [...trs[i][0].matchAll(/<(td|th)[\s\S\w]+?<\/(td|th)>/g)];
var prov = [];
for (var j=0;j<tds.length;j++){
donnee=tds[j][0].match(/(?<=\>).*(?=\<\/)/g)[0];
prov.push(stripTags(donnee));
}
data.push(prov);
}
return(data);
}
function stripTags(body) {
var regex = /(<([^>]+)>)/ig;
return body.replace(regex,"");
}
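If the returned array is written to a sheet from a script, note that Range.setValues expects a rectangular array, while rows parsed from HTML may differ in length, for example when a header row uses colspan. A minimal sketch of padding the rows first; padRows is a hypothetical helper, and the commented usage assumes a non-empty result and a sheet named "Test" as in the question:

```javascript
// Hypothetical helper: pads every row with empty strings so that all rows
// have the same number of columns as the longest row.
function padRows(data) {
  var width = 0;
  for (var i = 0; i < data.length; i++) {
    width = Math.max(width, data[i].length);
  }
  return data.map(function (row) {
    var padded = row.slice(); // copy, so the input rows are not modified
    while (padded.length < width) {
      padded.push("");
    }
    return padded;
  });
}

// Possible usage inside Apps Script (assumes data is not empty):
// var data = padRows(importTableHTML());
// SpreadsheetApp.getActive().getSheetByName("Test")
//   .getRange(1, 1, data.length, data[0].length).setValues(data);
```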

Retrieve the hyperlink on a specific text string within Google Doc using Apps Script

I'm trying to use Google Apps Script to get the hyperlink from a specific string found in this Google Doc.
The string is ||stock||
The hyperlink is https://www.cnbc.com/quotes/?symbol=aapl&qsearchterm=aapl
Any help is greatly appreciated.
The code I'm currently using
function docReport() {
var doc = DocumentApp.openByUrl('https://docs.google.com/document/d/1XNiqgJ_hM2SWjoR-OTsq1w-ZFKvTIERDIs_NOWJpckY/edit');
var body = doc.getBody();
Logger.log(body.getParagraphs().length);//get the number of paragraphs
//https://www.udemy.com/apps-script-course/learn/v4/t/lecture/10208226?start=0
for (var x=0;x<body.getNumChildren();x++) {
var el = body.getChild(x);
Logger.log(el.getText());
}
var bodyText = body.getText();
var words = bodyText.match(/\S+/g); // get word count for body - https://stackoverflow.com/questions/33338667/function-for-word-count-in-google-docs-apps-script
Logger.log(words.length); // returns # of words
var paragraphAll = body.getParagraphs(); // gets all paragraph objects in a document
Logger.log(paragraphAll);
var paragraphText = paragraphAll[1].getText().match(/\S+/g);
Logger.log(paragraphText.length); // returns # of words in a paragraph
}

You want to retrieve the hyperlink of the text ||stock||.
If my understanding is correct, how about this sample script? In your situation, the text that carries the link is already known, and the sample script relies on that.
By the way, from your question I'm not sure whether ||stock|| appears several times in the document, so this sample script assumes that it can.
I think there are several possible answers for your situation, so please think of this as one of them.
Sample script:
var searchValue = "\\|\\|stock\\|\\|"; // Search value
var body = DocumentApp.openByUrl('https://docs.google.com/document/d/1XNiqgJ_hM2SWjoR-OTsq1w-ZFKvTIERDIs_NOWJpckY/edit').getBody();
var searchedText = body.findText(searchValue);
var urls = [];
while (searchedText) {
var url = searchedText.getElement().asText().getLinkUrl(searchedText.getStartOffset());
urls.push(url);
searchedText = body.findText(searchValue, searchedText);
}
Logger.log(urls) // Results
Note:
If there is only one search value in the document, you can also use the following script.
var searchValue = "\\|\\|stock\\|\\|";
var body = DocumentApp.openByUrl('https://docs.google.com/document/d/1XNiqgJ_hM2SWjoR-OTsq1w-ZFKvTIERDIs_NOWJpckY/edit').getBody();
var searchedText = body.findText(searchValue);
var url = searchedText.getElement().asText().getLinkUrl(searchedText.getStartOffset());
Logger.log(url)
References:
findText()
getLinkUrl()
If I misunderstand your question, please tell me. I would like to modify it.

How to change the URL of an image tag on every click on the image using JavaScript?

I have an element displaying an image on an HTML page. This element's source is one of many different images in a JavaScript array.
I already have a script for looping through the images, creating a slideshow effect, but now I want to manually flick through the images with buttons.
This is my code so far, but I get no response when clicking the button.
function nextup()
{
imgs = [];
imgs[0] = "/snakelane/assets/images/thumb/_1.jpg"; imgs[10] = "/snakelane/assets/images/thumb/_19.jpg";
imgs[1] = "/snakelane/assets/images/thumb/_2.jpg"; imgs[11] = "/snakelane/assets/images/thumb/_20.jpg";
imgs[2] = "/snakelane/assets/images/thumb/_3.jpg"; imgs[12] = "/snakelane/assets/images/thumb/_21.jpg";
imgs[3] = "/snakelane/assets/images/thumb/_4.jpg"; imgs[13] = "/snakelane/assets/images/thumb/_22.jpg";
imgs[4] = "/snakelane/assets/images/thumb/_5.jpg"; imgs[14] = "/snakelane/assets/images/thumb/_23.jpg";
imgs[5] = "/snakelane/assets/images/thumb/_6.jpg"; imgs[15] = "/snakelane/assets/images/thumb/_24.jpg";
imgs[6] = "/snakelane/assets/images/thumb/_7.jpg"; imgs[16] = "/snakelane/assets/images/thumb/_25.jpg";
imgs[7] = "/snakelane/assets/images/thumb/_8.jpg"; imgs[17] = "/snakelane/assets/images/thumb/_26.jpg";
imgs[8] = "/snakelane/assets/images/thumb/_9.jpg"; imgs[18] = "/snakelane/assets/images/thumb/_27.jpg";
imgs[9] = "/snakelane/assets/images/thumb/_32.jpg"; imgs[19] = "/snakelane/assets/images/thumb/_28.jpg";
var pic = document.getElementById("picbox");
for(i =0; i < imgs.length; i++) {
var current = indexOf(pic.src);
var next = Math.round(current + 1);
pic.src = imgs[next];
}
}
Can anyone tell me what's wrong with my code or suggest a better way?

Multiple problems in the approach you had used. Have a look at the modified function below. Let me know if you need explanation with anything.
The following code will use an array containing image URLs and later assign in a sequential manner to an img tag on click. Enjoy!
Here you can try to see the output.
function nextup(){
//Initialized img array with 10 images, you can do it any way you want to.
var imgs = [];
for(i=0;i<10;i++){
imgs[i] = "http://lorempixel.com/output/cats-q-c-100-100-"+(i+1)+".jpg";
}
//Fetch the pic DOM element by ID
var pic = document.getElementById("picbox");
//Know what is position of currently assigned image in array.
var current = imgs.indexOf(pic.src);
var next = 0;
//Handle case if no image is present, the initial case.
if(current!=-1){
next = (current + 1)%(imgs.length);
}
//Assign the next src
pic.src = imgs[next];
}
//Scoped outside to call the function first time on load.
nextup();
I found the following problems in your code:
You tried to use indexOf without specifying the array in which the search has to be performed. Imagine a school principal asking someone to go and check whether John is present, without saying which classroom to look in.
You incremented a next variable to step through the array, which would be fine if the array were endless. But since we are limited to 10 or 20 images, we have to handle the case where the currently selected image is the last one: next would then point past the end of the array (index 20 for 20 images, whose indices run from 0 to 19), which is out of bounds.
Hence I've used the mod operator %.
For reference, in JavaScript 5 % 10 returns 5, 15 % 10 returns 5, and so on.
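The wrap-around described above can be sketched as a small standalone function; nextIndex is a hypothetical name used only for illustration:

```javascript
// Hypothetical helper: returns the index of the image to show next.
// current is the index of the image currently shown, or -1 if none matches.
function nextIndex(current, length) {
  if (current === -1) {
    return 0; // nothing shown yet: start with the first image
  }
  return (current + 1) % length; // after the last index, wrap back to 0
}
```

With 20 images, nextIndex(18, 20) gives 19 and nextIndex(19, 20) gives 0, so the slideshow restarts instead of reading past the end of the array.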

Javascript for google image ripping broke with update

I grabbed a few small scripts and threw them together to take google's new image layout and turn back into the old one, then take the images and replace them with the full size versions. Worked great until about last week. Not sure what changed on the server side.
(function() {
// Get list of all anchor tags that have an href attribute containing the start and stop key strings.
var fullImgUrls = selectNodes(document, document.body, "//a[contains(@href,'/imgres?imgurl\x3d')][contains(@href,'\x26imgrefurl=')]");
//clear existing markup
var imgContent = document.getElementById('ImgContent');
imgContent.innerHTML = "";
for(var x=1; x<=fullImgUrls.length; x++) {
//reverse X to show images in correct order using .insertBefore imgContent.nextSibling
var reversedX = (fullImgUrls.length) - x;
// get url using regexp
var fullUrl = fullImgUrls[reversedX].href.match( /\/imgres\?imgurl\=(.*?)\&imgrefurl\=(.*?)\&usg/ );
// if url was fetched, create img with fullUrl src
if(fullUrl) {
newLink = document.createElement('a');
imgContent.parentNode.insertBefore(newLink , imgContent.nextSibling);
newLink.href = unescape(fullUrl[2]);
newElement = document.createElement('img');
newLink.appendChild(newElement);
newElement.src = decodeURI(fullUrl[1]);
newElement.border = 0;
newElement.title = fullUrl[2];
}
}
function selectNodes(document, context, xpath) {
var nodes = document.evaluate(xpath, context, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
var result = [];
for (var x=0; x<nodes.snapshotLength; x++) {
result.push(nodes.snapshotItem(x));
}
return result;
}
})();

Google changed the 'ImgContent' id for the image table holder to something slightly more obscure. A quick change had everything working again. I made a simple problem complicated by looking past the easy stuff. Thanks to darvids0n for the enabling, he ultimately pointed out what I was missing.

The script is not going to work, as bobby said.
Try this Greasemonkey script from the userscripts repository:
Rip Google image search: http://userscripts.org/scripts/show/111342
