My script converts the selected range into an image. It first creates a public PDF URL and then converts the PDF to PNG.
It works well for small ranges (10-20 rows) and creates a shot including images, charts, sparklines, and formatting.
The problem is with big ranges (100-1000 rows). They contain a border of unknown size and I cannot calculate it.
Heavy borders make rows higher so the image does not fit.
If we have no borders or thin borders, the real image size appears a bit smaller than calculated. This creates an empty space below the image.
My code sample for getting the range size in pixels:
// get row height in pixels
var h = 0;
var size = 0; // height of the most recently measured row
for (var i = rownum; i <= rownum2; i++) {
  // measure only the first rows; later rows reuse the last measured height
  if (i <= options.measure_limit) {
    size = sheet.getRowHeight(i);
  }
  h += size;
  /** manual correction */
  if (size === 2) {
    h -= 1;
  } else {
    // h -= 0.42; /** TODO → test the range to make it fit any range */
  }
  if ((i % 50) === 0 && i <= options.measure_limit) {
    file.toast(
      'Done ' + i + ' rows of ' + rownum2,
      '↕📐Measuring height...');
  }
}
if (i > options.measure_limit) {
  file.toast(
    'Estimation: all other rows are the same size',
    '↕📐Measuring height...');
}
As you see, I have to loop over all rows which is extremely inefficient. I'd be glad to hear your ideas for code optimization. Now it loops the first 150 rows and next it assumes all other rows have the same height.
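The "measure the first rows, then extrapolate" idea can be isolated in a small pure function, which makes it easier to test away from the sheet. This is only a sketch: `estimateRangeHeight` and its parameters are hypothetical names, `getRowHeight` stands in for `sheet.getRowHeight(row)`, and the manual border correction from the loop above is left out.

```javascript
// Sketch: sum row heights, measuring only rows up to `measureLimit` and
// reusing the last measured height for every row after that.
// `getRowHeight` is a stand-in for sheet.getRowHeight(row).
function estimateRangeHeight(getRowHeight, firstRow, lastRow, measureLimit) {
  if (lastRow < firstRow) {
    return 0; // empty range
  }
  // Always measure at least the first row so there is a height to reuse.
  var lastMeasured = Math.max(firstRow, Math.min(lastRow, measureLimit));
  var total = 0;
  var size = 0;
  for (var row = firstRow; row <= lastMeasured; row++) {
    size = getRowHeight(row);
    total += size;
  }
  // Rows past the limit are assumed to share the last measured height.
  total += (lastRow - lastMeasured) * size;
  return total;
}

// Usage with a fake sheet where every row is 21 px high.
var estimate = estimateRangeHeight(function () { return 21; }, 1, 1000, 150);
console.log(estimate);
```

Only the measured rows cost a call to the sheet, so the number of calls is bounded by the limit regardless of how large the range is. The estimate is of course only as good as the assumption that the remaining rows match the last measured one.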
Sample Situations
"Small" ranges are those you can see on screen. "Big" ranges have 100+ rows, so they do not fit on a normal screen. As I create screenshots, I tested all possible range sizes.
Case 1 - no borders or thin borders
If I select a big range, I get the image and see that it has white space at the bottom. This means the real size of the image is slightly smaller than the one I get from the script by calling sheet.getRowHeight(i).
Case 2 - heavy borders
If I select a big range, I get the image and see that not all the rows I selected are in it. Some rows at the bottom of the range are missing. This means that when I add heavy borders, the real size of the rows is bigger than the one I get from the script by calling sheet.getRowHeight(i).
Conclusion
I'd be glad to hear any ideas including JavaScript hacks to remove empty space below the image. If it is currently not possible, please also answer with links to docs.
I believe your goal is as follows.
You want to export the range as an image using Google Apps Script and Javascript.
In order to achieve this, in this question, you want to calculate the row height of the selected cell range.
Issue and workaround:
As discussed in the comments, there are currently several problems with obtaining the correct row height of the cell range:
When the border is used for the cells, it seems that the row height + the border size is different from the exported result. Ref
Pixel size might not be changed linearly with the value of row height and border size. Ref
When I tested the cell size including the borders, I thought that the tendency of change of size might be different between height and width. Ref
When the row height is the default (21 from getRowHeight) and the text font size in the cell is increased, the value retrieved by getRowHeight is not changed from 21. Ref
There is also an issue with wrapped text inside a cell, which in my experience also causes errors in the pixel size of the cell. Ref
From your question, when the selected cell range is large, the exported PDF has more than one page. In this case, the pages cannot be correctly merged into one image.
Given the above, I'm worried that obtaining the correct size of the selected cells might be difficult. So I proposed treating this as an image processing task. Ref I thought that image processing might avoid the issues above.
Unfortunately, Google Apps Script has no built-in method for image processing. Fortunately, in your situation, it seems that Javascript can be used in a dialog. So I created a Javascript library that performs this process as image processing. Ref
When this Javascript library is used, the sample demonstration is as follows.
Usage:
1. Prepare a Spreadsheet.
Please create a new Spreadsheet and put several values to the cells.
2. Sample script.
Please copy and paste the following script to the script editor of Spreadsheet.
Google Apps Script side: Code.gs
function getActiveRange_(ss, borderColor) {
const space = 5;
const sheet = ss.getActiveSheet();
const range = sheet.getActiveRange();
const obj = { startRow: range.getRow(), startCol: range.getColumn(), endRow: range.getLastRow(), endCol: range.getLastColumn() };
const temp = sheet.copyTo(ss);
const r = temp.getDataRange();
r.copyTo(r, { contentsOnly: true });
temp.insertRowAfter(obj.endRow).insertRowBefore(obj.startRow).insertColumnAfter(obj.endCol).insertColumnBefore(obj.startCol);
obj.startRow += 1;
obj.endRow += 1;
obj.startCol += 1;
obj.endCol += 1;
temp.setRowHeight(obj.startRow - 1, space).setColumnWidth(obj.startCol - 1, space).setRowHeight(obj.endRow + 1, space).setColumnWidth(obj.endCol + 1, space);
const maxRow = temp.getMaxRows();
const maxCol = temp.getMaxColumns();
// Delete trailing rows/columns only when some exist after the padding row/column.
if (obj.endRow + 1 < maxRow) {
temp.deleteRows(obj.endRow + 2, maxRow - (obj.endRow + 1));
}
if (obj.endCol + 1 < maxCol) {
temp.deleteColumns(obj.endCol + 2, maxCol - (obj.endCol + 1));
}
if (obj.startRow - 1 > 1) {
temp.deleteRows(1, obj.startRow - 2);
}
if (obj.startCol - 1 > 1) {
temp.deleteColumns(1, obj.startCol - 2);
}
const mRow = temp.getMaxRows();
const mCol = temp.getMaxColumns();
const clearRanges = [[1, 1, mRow], [1, obj.endCol, mRow], [1, 1, 1, mCol], [obj.endRow, 1, 1, mCol]];
temp.getRangeList(clearRanges.map(r => temp.getRange(...r).getA1Notation())).clear();
temp.getRange(1, 1, 1, mCol).setBorder(true, null, null, null, null, null, borderColor, SpreadsheetApp.BorderStyle.SOLID);
temp.getRange(mRow, 1, 1, mCol).setBorder(null, null, true, null, null, null, borderColor, SpreadsheetApp.BorderStyle.SOLID);
SpreadsheetApp.flush();
return temp;
}
function getPDF_(ss, temp) {
const url = ss.getUrl().replace(/\/edit.*$/, '')
+ '/export?exportFormat=pdf&format=pdf'
// + '&size=20x20' // If you want to increase the size of one page, please use this. But, when the page size is increased, the process time becomes long. Please be careful about this.
+ '&scale=2'
+ '&top_margin=0.05'
+ '&bottom_margin=0'
+ '&left_margin=0.05'
+ '&right_margin=0'
+ '&sheetnames=false'
+ '&printtitle=false'
+ '&pagenum=UNDEFINED'
+ '&horizontal_alignment=LEFT'
+ '&gridlines=false'
+ "&fmcmd=12"
+ '&fzr=FALSE'
+ '&gid=' + temp.getSheetId();
const res = UrlFetchApp.fetch(url, { headers: { authorization: "Bearer " + ScriptApp.getOAuthToken() } });
return "data:application/pdf;base64," + Utilities.base64Encode(res.getContent());
}
// Please run this function.
function main() {
const ss = SpreadsheetApp.getActiveSpreadsheet();
const temp = getActiveRange_(ss, "#000000");
const base64 = getPDF_(ss, temp);
let htmltext = HtmlService.createTemplateFromFile('index').evaluate().getContent(); // `let` because it is reassigned on the next line
htmltext = htmltext.replace(/IMPORT_PDF_URL/m, base64);
const html = HtmlService.createTemplate(htmltext).evaluate().setSandboxMode(HtmlService.SandboxMode.NATIVE);
SpreadsheetApp.getUi().showModalDialog(html, 'sample');
ss.deleteSheet(temp);
}
function saveFile(data) {
const blob = Utilities.newBlob(Utilities.base64Decode(data), MimeType.PNG, "sample.png");
return DriveApp.createFile(blob).getId();
}
HTML & Javascript side: index.html
Here, I used a Javascript library of CropImageByBorder_js for processing this as the image processing.
<script src="//mozilla.github.io/pdf.js/build/pdf.js"></script>
<script src="https://cdn.jsdelivr.net/gh/tanaikech/CropImageByBorder_js#latest/cropImageByBorder_js.min.js"></script>
<canvas id="canvas"></canvas>
<script>
var pdfjsLib = window['pdfjs-dist/build/pdf'];
pdfjsLib.GlobalWorkerOptions.workerSrc = '//mozilla.github.io/pdf.js/build/pdf.worker.js';
const base64 = 'IMPORT_PDF_URL'; // Placeholder replaced with the PDF data URL by main()
const cvs = document.getElementById("canvas");
pdfjsLib.getDocument(base64).promise.then(pdf => {
const {numPages} = pdf;
if (numPages > 1) {
throw new Error("Sorry. In the current stage, this sample script can be used for one page of PDF data. So, please change the selected range to smaller.")
}
pdf.getPage(1).then(page => {
const viewport = page.getViewport({scale: 2});
cvs.height = viewport.height;
cvs.width = viewport.width;
const ctx = cvs.getContext('2d');
const renderContext = { canvasContext: ctx, viewport: viewport };
page.render(renderContext).promise.then(async function() {
const obj = { borderColor: "#000000", base64Data: cvs.toDataURL() };
const base64 = await CropImageByBorder.getInnerImage(obj).catch(err => console.log(err));
const img = new Image();
img.src = base64;
img.onload = function () {
cvs.width = img.naturalWidth;
cvs.height = img.naturalHeight;
ctx.drawImage(img, 0, 0);
}
google.script.run.withSuccessHandler(id => console.log(id)).saveFile(base64.split(",").pop());
});
});
});
</script>
3. Testing
To test this script, select the cells and run main(). The selected cells are then exported as a PNG image to the root folder of Google Drive.
4. Flow.
In this sample script, the following flow is used.
Manually select the cells, and run the script of main().
The script copies the selected cells, enclosed by one extra row and column on each side, to a temporary sheet.
Export the temporary sheet as PDF data in base64 and send it to the Javascript side.
Convert the first page of the PDF data to an image using PDF.js.
Crop the selected cells using CropImageByBorder_js, and return the resulting image to the Google Apps Script side.
Save the image as a file to Google Drive.
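The export step in this flow depends on a hand-concatenated query string in getPDF_(), where a single missing "&" silently merges two parameters. As a sketch only, the URL could instead be built from an options object. `buildExportUrl` is a hypothetical helper, the spreadsheet ID `ABC` is made up, and the parameter names are simply the ones used in the sample above.

```javascript
// Sketch: build the export URL from an options object so every
// parameter is joined with "&" consistently.
function buildExportUrl(spreadsheetUrl, sheetId, options) {
  // Same base as in getPDF_(): strip "/edit..." and append "/export".
  var base = spreadsheetUrl.replace(/\/edit.*$/, '') + '/export';
  var params = Object.assign({ exportFormat: 'pdf', format: 'pdf' }, options, { gid: sheetId });
  var query = Object.keys(params).map(function (key) {
    return encodeURIComponent(key) + '=' + encodeURIComponent(params[key]);
  }).join('&');
  return base + '?' + query;
}

// Usage with a made-up spreadsheet URL and sheet ID.
var exportUrl = buildExportUrl(
  'https://docs.google.com/spreadsheets/d/ABC/edit#gid=0',
  123,
  { scale: 2, gridlines: false }
);
console.log(exportUrl);
```

Keeping the parameters in one object also makes it easy to toggle optional ones, such as the commented-out size parameter, without editing string fragments.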
LIMITATION:
This sample script assumes that the selected range fits on one PDF page. When a large range is selected and the exported PDF has more than one page, this script cannot be used. Please be careful about this.
Also, the Javascript runs in a dialog. So, to use this sample script, you must open the Spreadsheet, select the cells, and run the script.
Note:
In the script you showed, the Spreadsheet must be publicly shared so that PDF.js can load the created PDF. However, PDF.js can also load a data URL directly. So this sample script passes the created PDF as a data URL (base64), and the Spreadsheet does not need to be publicly shared.
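For readers unfamiliar with data URLs, the round trip used in the sample (prefix plus base64 on the way in, `split(",").pop()` on the way out) can be sketched as follows. This version runs under Node.js and uses `Buffer`, which is an assumption of this sketch; in Apps Script the sample uses Utilities.base64Encode and Utilities.base64Decode instead. `toDataUrl` and `payloadOf` are hypothetical names.

```javascript
// Sketch (Node.js): wrap raw bytes in a data URL and extract the payload again.
function toDataUrl(bytes, mimeType) {
  return 'data:' + mimeType + ';base64,' + Buffer.from(bytes).toString('base64');
}

// Everything after the last comma is the base64 payload,
// which is what the sample passes to saveFile().
function payloadOf(dataUrl) {
  return dataUrl.split(',').pop();
}

// Usage: the four bytes of the "%PDF" file signature.
var pdfHeader = [0x25, 0x50, 0x44, 0x46];
var dataUrl = toDataUrl(pdfHeader, 'application/pdf');
console.log(dataUrl);            // data:application/pdf;base64,JVBERg==
console.log(payloadOf(dataUrl)); // JVBERg==
```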
References:
PDF.js
CropImageByBorder_js
I have an SVG object that looks like this:
Each of the inner <g> elements have <path>s in them.
I want to export this SVG to PDF so the groups translate to layers (OCGs), like this:
<ThroughCut>
  <Path>
  <Path>
  <Path>
  <Path>
<Graphics>
  <Path>
  <Path>
  <Path>
  <Path>
Yet any tool I have tried for this puts all objects in the same layer, and basically throws away information about groups.
Solutions in JavaScript or Python are preferred, but anything that executes from the command line on a UNIX machine will do.
I solved my problem as stated here, by following this PyMuPDF issue on Github.
Since I have control over the input SVG, I managed to solve the problem by parsing two SVGs to PDFs and combining them in separate layers of a new document. This is what I'm doing:
import fitz
from svglib.svglib import svg2rlg
from reportlab.graphics.renderPDF import drawToString
def svg_to_doc(path):
    """Using this function rather than `fitz`' `convertToPDF` because the latter
    fills every shape with black for some reason.
    """
    drawing = svg2rlg(path)
    pdfbytes = drawToString(drawing)
    return fitz.open("pdf", pdfbytes)
# Create a new blank document
doc = fitz.open()
page = doc.new_page()
# Create "Layer1" and "Layer2" OCGs and get their `xref`s
xref_1 = doc.add_ocg('Layer1', on=True)
xref_2 = doc.add_ocg('Layer2', on=True)
# Load "layer_1" and "layer_2" svgs and convert to pdf
doc_1 = svg_to_doc("my_layer_1.svg")
doc_2 = svg_to_doc("my_layer_2.svg")
# Set the `page` dimensions. Note: for me it makes sense to set the bounding
# box of the output to the same as `doc_1`, because I know `doc_1` contains
# `doc_2`. If that were not the case, I would set `bb` to be a new
# `fitz.Rect` object that contained both `doc_1` and `doc_2`.
bb = doc_1[0].rect
# `set_mediabox` matches the snake_case names used elsewhere in this script;
# older PyMuPDF releases call it `setMediaBox`.
page.set_mediabox(bb)
# Put the docs in their respective OCGs
page.show_pdf_page(bb, doc_1, 0, oc=xref_1)
page.show_pdf_page(bb, doc_2, 0, oc=xref_2)
# Save
doc.save("output.pdf")
If I load "output.pdf" in Adobe Acrobat the layers show. Curiously, the same is not the case for Adobe Illustrator (here they are simply "Clip Groups"). Regardless, I believe this solves the problem as stated above.
my_layer_1.svg
my_layer_2.svg
My other solution, although correct, does not produce a PDF that is compatible with Adobe standards (for example, Illustrator will not see the OCGs as legitimate layers—although strangely, Acrobat will).
In case one needs to produce a PDF that is compatible with Adobe standards, and will be loaded correctly in Illustrator, another option is to use the Illustrator scripting API.
Here's a script that one can use to convert a loaded SVG file into a PDF with the desired layer structure.
/** Convert an open file in Illustrator to PDF, after removing the first layer.
* Useful for converting SVGs into PDFs, where it is desired that the first level
* `<g>` elements are converted to layers/OCGs in the exported PDF.
*/
// Select export destination
const destFolder = Folder.selectDialog( 'Select folder for PDF files.', '~' );
// Get the PDF options to be used
const options = getOptions();
// The SVG should have a single `Layer1` top layer...
const doc = app.activeDocument;
if (doc.layers.length == 1) {
// ... remove it
removeFirstLayer(doc)
// Create a file pointer for export...
var targetFile = getTargetFile(doc.name, '.pdf', destFolder);
// ... and save `doc` in the file pointer.
doc.saveAs(targetFile, options);
}
/* --------- */
/* Utilities */
/* --------- */
function getOptions() {
// Create PDFSaveOptions object
var pdfSaveOpts = new PDFSaveOptions();
// Set PDFSaveOptions properties (toggle these comment/uncomment)
pdfSaveOpts.acrobatLayers = true;
pdfSaveOpts.colorBars = true;
pdfSaveOpts.colorCompression = CompressionQuality.AUTOMATICJPEGHIGH;
pdfSaveOpts.compressArt = true; //default
pdfSaveOpts.embedICCProfile = true;
pdfSaveOpts.enablePlainText = true;
pdfSaveOpts.generateThumbnails = true; // default
pdfSaveOpts.optimization = true;
pdfSaveOpts.pageInformation = true;
// pdfSaveOpts.viewAfterSaving = true;
return pdfSaveOpts;
}
function removeFirstLayer(doc) {
// Get the layer to be removed
var firstLayer = doc.layers[0];
// Convert groups into new layers
for (var i=firstLayer.groupItems.length-1; i>=0; i--) {
var group = firstLayer.groupItems[i];
var newLayer = firstLayer.layers.add();
newLayer.name = group.name;
for (var j=group.pageItems.length-1; j>=0; j--)
group.pageItems[j].move(newLayer, ElementPlacement.PLACEATBEGINNING);
}
// Move new layers to the document and remove `firstLayer`
for (var i=firstLayer.layers.length-1; i>=0; i--)
firstLayer.layers[i].move(firstLayer.parent, ElementPlacement.PLACEATBEGINNING);
firstLayer.remove();
}
function getTargetFile(docName, ext, destFolder) {
var newName = "";
// Add extension if none exists
if (docName.indexOf('.') < 0)
newName = docName + ext;
else
newName += docName.substring(0, docName.lastIndexOf('.')) + ext;
// Create file pointer
var myFile = new File(destFolder + '/' + newName);
// Check that file permissions are granted
if (myFile.open("w"))
myFile.close();
else
throw new Error('Access is denied');
return myFile;
}
Put the script in a file that ends with .jsx, and place it inside your Scripts folder in Illustrator. Mine is located at /Applications/Adobe Illustrator 2021/Presets/en_US/Scripts/svgToPDF.jsx.
Restart Illustrator
Execute the script from the File > Scripts > svgToPDF menu item.
Drawbacks
Illustrator costs money.
You have to manually execute the SVG -> PDF conversion. Could be automated with AppleScript, but it's really not something you want to have running on a server.
I'm trying to overlay one image on top of another but can't seem to work it out. My code throws no errors but doesn't output the requested image modification. Can someone point me in the right direction?
Here is my code
const user = message.mentions.users.first();
if (args[0] === undefined) {
message.channel.send("You can't jail yourself, dummy!")
} else {
var images = [user.avatarURL({ format: 'png', dynamic: true, size: 256 }), 'https://i.pinimg.com/originals/7b/51/9a/7b519a3422f940011d34d1f9aa75f683.png']
var jimps = []
//turns the images into readable variables for jimp, then pushes them into a new array
for (var i = 0; i < images.length; i++){
jimps.push(jimp.read(images[i]))
}
//creates a promise to handle the jimps
await Promise.all(jimps).then(function(data) {
return Promise.all(jimps)
}).then(async function(data){
// --- THIS IS WHERE YOU MODIFY THE IMAGES --- \\
data[0].composite(data[1], 0, 0) //adds the second specified image (the jail bars) on top of the first specified image (the avatar). "0, 0" define where the second image is placed, originating from the top left corner
//you CAN resize the second image to fit the first one like this, if necessary. The "100, 100" is the new size in pixels.
data[1].resize(100,100)
//this saves our modified image
data[0].write(`\Users\jmoor\Pictures\JIMP Test\test.png`)
})
message.channel.send(`${user.username} has been jailed!`, {file: `\Users\jmoor\Pictures\JIMP Test\test.png`})
}
I have defined jimp above and also am using a command handler I made.
Use this:
const user = message.mentions.users.first() //get The first user mentioned
if (!user) return message.reply("Who do you wanna send to jail?")//return if no user was mentioned
var bars = "https://i.pinimg.com/originals/7b/51/9a/7b519a3422f940011d34d1f9aa75f683.png"
var pfp = user.avatarURL({ format: 'png', dynamic: true, size: 128 }) //get link of profile picture
var image = await Jimp.read(pfp)//read the profile picture, returns a Jimp object
//Composite resized bars on profile picture
image.composite((await Jimp.read(bars)).resize(128, 128), 0, 0)
//create an attachment from the edited picture's buffer and send it
var attachment = new Discord.MessageAttachment(await image.getBufferAsync(Jimp.MIME_PNG))
message.reply(attachment)
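It may help to see what compositing does per pixel, because it explains why the bars image needs transparent gaps: an opaque overlay pixel completely hides the avatar pixel beneath it. The sketch below implements the standard "source over" blend for a single RGBA pixel. It assumes that composite() is used with its default blend mode and is not Jimp's actual code; `sourceOver` is a hypothetical name.

```javascript
// Sketch: blend one RGBA pixel (`src`, the bars) over another (`dst`, the avatar).
// Channels are 0-255; alpha is converted to 0-1 for the arithmetic.
function sourceOver(dst, src) {
  var sa = src[3] / 255;
  var da = dst[3] / 255;
  var outA = sa + da * (1 - sa);
  if (outA === 0) {
    return [0, 0, 0, 0]; // both pixels fully transparent
  }
  var out = [];
  for (var c = 0; c < 3; c++) {
    out[c] = Math.round((src[c] * sa + dst[c] * da * (1 - sa)) / outA);
  }
  out[3] = Math.round(outA * 255);
  return out;
}

// An opaque bar pixel replaces the avatar pixel ...
console.log(sourceOver([10, 20, 30, 255], [200, 100, 50, 255])); // [ 200, 100, 50, 255 ]
// ... while a fully transparent gap leaves the avatar pixel unchanged.
console.log(sourceOver([10, 20, 30, 255], [200, 100, 50, 0]));   // [ 10, 20, 30, 255 ]
```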
I'm trying to resize and crop several images from a folder. First of all, let's see some parts of the script:
// DOCUMENT SETTINGS
app.preferences.rulerUnits = Units.MM;
app.displayDialogs = DialogModes.NO;
// FUNCTIONS
function processing_f_alta(folder, files, w, h) {
var f_alta = new Folder(folder + "/ALTA");
if ( ! f_alta.exists ) { f_alta.create(); }
for ( var cont = 0; cont < files.length; cont++ ) {
files[cont].copy(decodeURI(f_alta) + "/" + files[cont].displayName);
}
var files = f_alta.getFiles("*.tif");
for ( var cont = 0; cont < files.length; cont++ ) {
var img_file = app.open(files[cont]);
img_file.resizeImage(UnitValue(w, "cm"), UnitValue(h, "cm"), null, ResampleMethod.BICUBIC);
img_file.resizeCanvas(UnitValue(w, "cm"), UnitValue(h, "cm"), AnchorPosition.MIDDLECENTER);
img_file.close(SaveOptions.SAVECHANGES);
}
}
var w =prompt("Width (cm)","","Introduzca valor");
var h =prompt("Height (cm)","","Introduzca valor");
var f_origin_folder = Folder.selectDialog("Select the containing folder of the images");
var folder = new Folder(f_origin_folder);
var files = folder.getFiles("*.tif");
processing_f_alta(folder, files, w/2, h/2);
The script has much more code, but it's irrelevant.
The idea is to get a hypothetical width ("w") and height ("h") from the keyboard and the folder where the images are ("folder"). The script then gets all the ".tif" files in this folder and saves them in the variable "files".
The function processing_f_alta() is called with several params (folder, files, w/2, h/2). Why the last params are divided by 2 is irrelevant.
Inside the function, the script creates a new folder called "ALTA" inside "folder", and all the ".tif" files are copied into it. Then the script gets these copied ".tif" files and resizes them to the new values of width (w/2) and height (h/2).
EVERYTHING IS OK UNTIL HERE.
Now comes the problem. I want to crop the file with no distortion, but I don't know how to do it.
Let's see a real example (the example that I'm testing).
I've got an image of 40x40cm in a folder called "test". I execute the script with these values: w=30, h=15, folder="test".
When I run the script I get a new folder inside "test" called "ALTA" with an image resized to 15x7.5 cm. THAT'S CORRECT. But when I open the file, it has not been cropped; it has been deformed vertically. What I want is an image of this size that is cropped vertically, not deformed.
I've tried the crop() and resizeCanvas() functions, but I am not able to get the result I'm expecting.
Could you help me solve my problem?
Thanks in advance for your time.
Now that works:
// resizeImage([width] [, height] [, resolution] [, sampleMethod] [, amount]);
img_file.resizeImage(UnitValue(w, "cm"), null, null, ResampleMethod.BICUBIC);
// crop(bounds [, angle] [, width] [, height]);
// bounds = array[left, top, right, bottom]
var bounds = [
0,
0,
w*10,
h*10
];
img_file.crop(bounds);
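The next step will be cropping from the center of the image (for now it crops from the top-left corner).
P.S.: I use w*10 and h*10 because w and h are entered in centimetres, while the bounds are presumably interpreted in the ruler units, which the script sets to millimetres (Units.MM). Without the factor, an original w=30 and h=30, for example, crops the image to w->3 and h->3.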
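For the "crop from the center" step, the bounds array only needs a different offset. A minimal sketch of the arithmetic, assuming all values share one unit (for example millimetres, as above); `centeredCropBounds` is a hypothetical helper and it does not call any Photoshop API itself.

```javascript
// Sketch: returns [left, top, right, bottom] for a crop of cropW x cropH
// centred inside an image of imgW x imgH. All values share one unit.
function centeredCropBounds(imgW, imgH, cropW, cropH) {
  // Never ask for more than the image actually has.
  var w = Math.min(cropW, imgW);
  var h = Math.min(cropH, imgH);
  var left = (imgW - w) / 2;
  var top = (imgH - h) / 2;
  return [left, top, left + w, top + h];
}

// Usage: a 150 x 150 mm image cropped to 150 x 75 mm around its centre.
console.log(centeredCropBounds(150, 150, 150, 75)); // [ 0, 37.5, 150, 112.5 ]
```

The resulting array has the same [left, top, right, bottom] layout as the bounds passed to crop() in the snippet above.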
Try:
img_file.resizeImage(null, UnitValue(h, "cm"), null, ResampleMethod.BICUBIC);
if(img_file.width < w){
img_file.resizeImage(UnitValue(w, "cm"), null, null, ResampleMethod.BICUBIC);
}
(also: I don't have Photoshop on this machine so this is untested code, but basically you need to adjust the height first and then enlarge the width if it is smaller than what you want, before cropping the image)
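The "adjust one side, then enlarge the other if it is too small" logic is equivalent to scaling by the larger of the two ratios, so the resized image always covers the target before it is cropped. A sketch of that calculation, with the hypothetical name `coverSize` and no Photoshop calls:

```javascript
// Sketch: size of the source after uniform scaling so that it fully
// covers a targetW x targetH box (any overflow is cropped afterwards).
function coverSize(srcW, srcH, targetW, targetH) {
  var scale = Math.max(targetW / srcW, targetH / srcH);
  return { width: srcW * scale, height: srcH * scale, scale: scale };
}

// Usage with the example from the question: a 40 x 40 cm image, 15 x 7.5 cm target.
console.log(coverSize(40, 40, 15, 7.5)); // { width: 15, height: 15, scale: 0.375 }
```

In that example the image becomes 15 x 15 cm, so only the height has to be cropped down to 7.5 cm and nothing is stretched.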
I have a simple pdf file, containing the words "Hello world", each in a different colour.
I'm loading the PDF, like this:
PDFJS.getDocument('test.pdf').then( onPDF );
function onPDF( pdf )
{
pdf.getPage( 1 ).then( onPage );
}
function onPage( page )
{
page.getTextContent().then( onText );
}
function onText( text )
{
console.log( JSON.stringify( text ) );
}
And I get a JSON output like this:
{
"items" : [{
"str" : "Hello ",
"dir" : "ltr",
"width" : 29.592,
"height" : 12,
"transform" : [12, 0, 0, 12, 56.8, 774.1],
"fontName" : "g_font_1"
}, {
"str" : "world",
"dir" : "ltr",
"width" : 27.983999999999998,
"height" : 12,
"transform" : [12, 0, 0, 12, 86.5, 774.1],
"fontName" : "g_font_1"
}
],
"styles" : {
"g_font_1" : {
"fontFamily" : "serif",
"ascent" : 0.891,
"descent" : 0.216
}
}
}
However, I've not been able to find a way to determine the colour of each word. When I render it, it renders properly, so I know the information is in there somewhere. Is there somewhere I can access this?
As Respawned alluded to, there is no easy answer that will work in all cases. That being said, here are two approaches which seem to work fairly well. Both have upsides and downsides.
Approach 1
Internally, the getTextContent method uses what's called an EvaluatorPreprocessor to parse the PDF operators and maintain the graphic state. So what we can do is implement a custom EvaluatorPreprocessor, override the preprocessCommand method, and use it to add the current text color to the graphic state. Once this is in place, any time a new text chunk is created, we can add a color attribute and set it to the current color state.
The downsides to this approach are:
Requires modifying the PDFJS source code. It also depends heavily on the current implementation of PDFJS, and could break if this is changed.
It will fail in cases where the text is used as a path to be filled with an image. Some PDF creators (such as Photoshop) create colored text by first creating a clipping path from all the given text characters, and then painting a solid image over the path. So the only way to deduce the fill color is by reading the pixel values from the image, which would require painting it to a canvas. Even hooking into paintChar won't be of much help here, since the fill color will only emerge at a later time.
The upside is, it's fairly robust and works irrespective of the page background. It also does not require rendering anything to canvas, so it can be done entirely in the background thread.
Code
All the modifications are made in the core/evaluator.js file.
First you must define the custom evaluator, after the EvaluatorPreprocessor definition.
var CustomEvaluatorPreprocessor = (function() {
function CustomEvaluatorPreprocessor(stream, xref, stateManager, resources) {
EvaluatorPreprocessor.call(this, stream, xref, stateManager);
this.resources = resources;
this.xref = xref;
// set initial color state
var state = this.stateManager.state;
state.textRenderingMode = TextRenderingMode.FILL;
state.fillColorSpace = ColorSpace.singletons.gray;
state.fillColor = [0,0,0];
}
CustomEvaluatorPreprocessor.prototype = Object.create(EvaluatorPreprocessor.prototype);
CustomEvaluatorPreprocessor.prototype.preprocessCommand = function(fn, args) {
EvaluatorPreprocessor.prototype.preprocessCommand.call(this, fn, args);
var state = this.stateManager.state;
switch(fn) {
case OPS.setFillColorSpace:
state.fillColorSpace = ColorSpace.parse(args[0], this.xref, this.resources);
break;
case OPS.setFillColor:
var cs = state.fillColorSpace;
state.fillColor = cs.getRgb(args, 0);
break;
case OPS.setFillGray:
state.fillColorSpace = ColorSpace.singletons.gray;
state.fillColor = ColorSpace.singletons.gray.getRgb(args, 0);
break;
case OPS.setFillCMYKColor:
state.fillColorSpace = ColorSpace.singletons.cmyk;
state.fillColor = ColorSpace.singletons.cmyk.getRgb(args, 0);
break;
case OPS.setFillRGBColor:
state.fillColorSpace = ColorSpace.singletons.rgb;
state.fillColor = ColorSpace.singletons.rgb.getRgb(args, 0);
break;
}
};
return CustomEvaluatorPreprocessor;
})();
Next, you need to modify the getTextContent method to use the new evaluator:
var preprocessor = new CustomEvaluatorPreprocessor(stream, xref, stateManager, resources);
And lastly, in the newTextChunk method, add a color attribute:
color: stateManager.state.fillColor
Approach 2
Another approach would be to extract the text bounding boxes via getTextContent, render the page, and for each text, get the pixel values which reside within its bounds, and take that to be the fill color.
The downsides to this approach are:
The computed text bounding boxes are not always correct, and in some cases may even be off completely (eg: rotated text). If the bounding box does not cover at least partially the actual text on canvas, then this method will fail. We can recover from complete failures, by checking that the text pixels have a color variance greater than a threshold. The rationale being, if bounding box is completely background, it will have little variance, in which case we can fallback to a default text color (or maybe even the color of k nearest-neighbors).
The method assumes the text is darker than the background. Otherwise, the background could be mistaken as the fill color. This won't be a problem in most cases, as most docs have white backgrounds.
The upside is, it's simple, and does not require messing with the PDFJS source code. Also, it will work in cases where the text is used as a clipping path and filled with an image. Though this can become hazy when you have complex image fills, in which case the choice of text color becomes ambiguous.
Demo
http://jsfiddle.net/x2rajt5g/
Sample PDFs to test:
https://www.dropbox.com/s/0t5vtu6qqsdm1d4/color-test.pdf?dl=1
https://www.dropbox.com/s/cq0067u80o79o7x/testTextColour.pdf?dl=1
Code
function parseColors(canvasImgData, texts) {
var data = canvasImgData.data,
width = canvasImgData.width,
height = canvasImgData.height,
defaultColor = [0, 0, 0],
minVariance = 20;
texts.forEach(function (t) {
var left = Math.floor(t.transform[4]),
w = Math.round(t.width),
h = Math.round(t.height),
bottom = Math.round(height - t.transform[5]),
top = bottom - h,
start = (left + (top * width)) * 4,
color = [],
best = Infinity,
stat = new ImageStats();
for (var i, v, row = 0; row < h; row++) {
i = start + (row * width * 4);
for (var col = 0; col < w; col++) {
if ((v = data[i] + data[i + 1] + data[i + 2]) < best) { // the darker the "better"
best = v;
color[0] = data[i];
color[1] = data[i + 1];
color[2] = data[i + 2];
}
stat.addPixel(data[i], data[i+1], data[i+2]);
i += 4;
}
}
var stdDev = stat.getStdDev();
t.color = stdDev < minVariance ? defaultColor : color;
});
}
function ImageStats() {
this.pixelCount = 0;
this.pixels = [];
this.rgb = [];
this.mean = 0;
this.stdDev = 0;
}
ImageStats.prototype = {
addPixel: function (r, g, b) {
if (!this.rgb.length) {
this.rgb[0] = r;
this.rgb[1] = g;
this.rgb[2] = b;
} else {
this.rgb[0] += r;
this.rgb[1] += g;
this.rgb[2] += b;
}
this.pixelCount++;
this.pixels.push([r,g,b]);
},
getStdDev: function() {
var mean = [
this.rgb[0] / this.pixelCount,
this.rgb[1] / this.pixelCount,
this.rgb[2] / this.pixelCount
];
var diff = [0,0,0];
this.pixels.forEach(function(p) {
diff[0] += Math.pow(mean[0] - p[0], 2);
diff[1] += Math.pow(mean[1] - p[1], 2);
diff[2] += Math.pow(mean[2] - p[2], 2);
});
diff[0] = Math.sqrt(diff[0] / this.pixelCount);
diff[1] = Math.sqrt(diff[1] / this.pixelCount);
diff[2] = Math.sqrt(diff[2] / this.pixelCount);
return diff[0] + diff[1] + diff[2];
}
};
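One possible variation, not part of the demo: ImageStats keeps every pixel in an array just to compute the deviation at the end. The same per-channel population standard deviation can be derived from running sums and sums of squares, which avoids storing pixels. `RunningStats` is a hypothetical name, and the sketch assumes the same addPixel/getStdDev interface as above.

```javascript
// Sketch: same interface as ImageStats, but keeps only running sums.
function RunningStats() {
  this.count = 0;
  this.sum = [0, 0, 0];   // per-channel sum of values
  this.sumSq = [0, 0, 0]; // per-channel sum of squared values
}

RunningStats.prototype.addPixel = function (r, g, b) {
  var px = [r, g, b];
  for (var c = 0; c < 3; c++) {
    this.sum[c] += px[c];
    this.sumSq[c] += px[c] * px[c];
  }
  this.count++;
};

// Sum of the three per-channel standard deviations, using
// variance = mean(x^2) - mean(x)^2 (population variance).
RunningStats.prototype.getStdDev = function () {
  if (this.count === 0) {
    return 0;
  }
  var total = 0;
  for (var c = 0; c < 3; c++) {
    var mean = this.sum[c] / this.count;
    // Clamp tiny negative values caused by floating-point rounding.
    var variance = Math.max(0, this.sumSq[c] / this.count - mean * mean);
    total += Math.sqrt(variance);
  }
  return total;
};

// Usage: two pixels whose channels deviate by 1, 2 and 3 from their means.
var stats = new RunningStats();
stats.addPixel(0, 0, 0);
stats.addPixel(2, 4, 6);
console.log(stats.getStdDev()); // 6
```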
This question is actually extremely hard if you want to do it to perfection... or it can be relatively easy if you can live with solutions that work only some of the time.
First of all, realize that getTextContent is intended for searchable text extraction and that's all it's intended to do.
It's been suggested in the comments above that you use page.getOperatorList(), but that's basically re-implementing the whole PDF drawing model in your code... which is basically silly because the largest chunk of PDFJS does exactly that... except not for the purpose of text extraction but for the purpose of rendering to canvas. So what you want to do is to hack canvas.js so that instead of just setting its internal knobs it also does some callbacks to your code. Alas, if you go this way, you won't be able to use stock PDFJS, and I rather doubt that your goal of color extraction will be seen as very useful for PDFJS' main purpose, so your changes are likely not going to get accepted upstream, so you'll likely have to maintain your own fork of PDFJS.
After this dire warning, what you'd need to minimally change are the functions where PDFJS has parsed the PDF color operators and sets its own canvas painting color. That happens around line 1566 (of canvas.js) in function setFillColorN. You'll also need to hook the text render, which is really a character renderer at canvas.js level, namely CanvasGraphics_paintChar around line 1270. With these two hooked, you'll get a stream of callbacks for color changes interspersed between character drawing sequences. So you can reconstruct the color of character sequences reasonably easily from this, at least in the simple color cases.
And now I'm getting to the really ugly part: the fact that PDF has an extremely complex color model. First there are two colors for drawing anything, including text: a fill color and stroke (outline) color. So far not too scary, but the color is an index in a ColorSpace... of which there are several, RGB being only one possibility. Then there's also alpha and compositing modes, so the layers (of various alphas) can result in a different final color depending on the compositing mode. And the PDFJS has not a single place where it accumulates color from layers.. it simply [over]paints them as they come. So if you only extract the fill color changes and ignore alpha, compositing etc.. it will work but not for complex documents.
Hope this helps.
There's no need to patch pdfjs, the transform property gives the x and y, so you can go through the operator list and find the setFillColor op that precedes the text op at that point.
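A rough sketch of that operator-list walk is shown below. It assumes the list has the `{ fnArray, argsArray }` shape that page.getOperatorList() resolves to, and that the operator codes come from pdfjsLib.OPS; the numeric codes in the usage example are made up so the snippet runs without pdf.js. `colorsForTextOps` is a hypothetical name, the color arguments are kept exactly as found (their format varies between pdf.js versions), and matching the results to getTextContent items by position is left out.

```javascript
// Sketch: remember the most recent fill-colour operator and attach it to
// every text-showing operator that follows, honouring save/restore.
function colorsForTextOps(opList, OPS) {
  var colorOps = [OPS.setFillRGBColor, OPS.setFillGray, OPS.setFillCMYKColor, OPS.setFillColor];
  var textOps = [OPS.showText, OPS.showSpacedText];
  var currentFill = null; // unknown until a fill operator is seen
  var stack = [];         // fill colours saved by OPS.save
  var result = [];
  opList.fnArray.forEach(function (fn, i) {
    if (fn === OPS.save) {
      stack.push(currentFill);
    } else if (fn === OPS.restore) {
      if (stack.length) currentFill = stack.pop();
    } else if (colorOps.indexOf(fn) !== -1) {
      currentFill = opList.argsArray[i];
    } else if (textOps.indexOf(fn) !== -1) {
      result.push({ opIndex: i, fill: currentFill });
    }
  });
  return result;
}

// Usage with made-up operator codes and a hand-written operator list.
var FAKE_OPS = { save: 1, restore: 2, setFillRGBColor: 3, setFillGray: 4,
                 setFillCMYKColor: 5, setFillColor: 6, showText: 7, showSpacedText: 8 };
var fakeList = {
  fnArray:   [3,           7,          1,  3,           7,         2,  8],
  argsArray: [[255, 0, 0], ['Hello '], [], [0, 0, 255], ['world'], [], ['!']]
};
console.log(colorsForTextOps(fakeList, FAKE_OPS));
```

As the earlier answers point out, this only covers simple fills; stroke colors, alpha, compositing modes, and text used as a clipping path are all ignored here.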