Word Add-in Get full Document text WITH INDICATOR? - javascript

There is already an answered question on this topic: Word Add-in Get full Document text?
However, that method can't extract the list indicators (bullets or numbers).
Is there a way to do this? I expect the text to be exactly what you get when you manually select all and copy in a Word document.
The reason behind this: I'm building a question bank from a Microsoft Word document. Several tools offer text extraction, but they usually ignore the bullet points.
I use keywords like A., B., C., D., etc. to detect the choices. However, if the author wrote the choices using list indicators or bullet points, this method fails.

You can convert the numbered lists (list paragraphs) to plain text with a simple piece of VBA.
See here: convert lists to text

For each paragraph in the document, you can check whether it is a list item by reading isListItem.
If it is, you can read listItem to get the item.
The listString property of the Word.ListItem class gives you the list item's bullet, number, or picture as a string.
Here is an example of how to extract the bullets in the document.
Word.run(async (context) => {
  // Load text and list status for every paragraph in one round trip.
  const paragraphs = context.document.body.paragraphs;
  paragraphs.load("text,isListItem");
  await context.sync();

  // Queue listString only for list paragraphs; listItem throws for the others.
  const listParagraphs = paragraphs.items.filter((p) => p.isListItem);
  listParagraphs.forEach((p) => p.listItem.load("listString"));
  await context.sync();

  for (const p of paragraphs.items) {
    // Prefix list paragraphs with their bullet or number.
    console.log(p.isListItem ? p.listItem.listString + " " + p.text : p.text);
  }
});
The document is printed to the console paragraph by paragraph with all bullets retained.
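With the list strings retained, the keyword detection described in the question could then run on the combined lines. The following is only a minimal sketch: parseChoices and its label pattern are hypothetical names (not part of the Word API), and it assumes each choice starts with a single capital letter followed by "." or ")".

```javascript
// Hypothetical helper: pick out labelled choices from extracted lines.
// Assumption: a choice looks like "A. text" or "A) text".
function parseChoices(lines) {
  const choicePattern = /^\s*([A-Z])[.)]\s+(.+)$/;
  const choices = [];
  for (const line of lines) {
    const match = choicePattern.exec(line);
    if (match) {
      choices.push({ label: match[1], text: match[2].trim() });
    }
  }
  return choices;
}

// Example input shaped like the logged output: listString + " " + text.
const sample = ["What is 2 + 2?", "A. 3", "B. 4", "C) 5"];
console.log(parseChoices(sample));
```

Lines that do not start with a label, such as the question stem, are simply skipped; a real question bank would likely need a looser pattern for other numbering styles.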

Related

Google Scripts - keep track of element [duplicate]

Update: This is a better way of asking the following question.
Is there an ID-like attribute for an Element in a Document that I can use to reach that element at a later time? Let's say I inserted a paragraph into a document as follows:
var myParagraph = 'This should be highlighted when user clicks a button';
body.insertParagraph(0, myParagraph);
Then the user inserts another one at the beginning manually (i.e. by typing or pasting). Now the childIndex of my paragraph changes from 0 to 1. I want to reach that paragraph at a later time and highlight it. But because of the insertion, the childIndex is no longer valid. There is no ID-like attribute on the Element interface or on any type implementing it. CacheService and PropertiesService only accept String data, so I can't store myParagraph as an Object.
Do you guys have any idea to achieve what I want?
Thanks,
Old version of the same question (Optional Read):
Imagine that the user selects a word and presses the highlight button of my add-on. Then she does the same thing for several more words. Then she edits the document in a way that changes the start and end indexes of those highlighted words.
At this point she presses the remove highlighting button. My add-on should disable highlighting on all previously selected words. The problem is that I don't want to scan the entire document to find any highlighted text. I just want direct access to the words that were previously selected.
Is there a way to do that? I tried caching selected elements. But when I get them back from the cache, I get a TypeError: Cannot find function insertText in object Text. error. It seems like the type of the object changes somewhere between cache.put() and cache.get().
var elements = selection.getSelectedElements();
for (var i = 0; i < elements.length; ++i) {
  if (elements[i].isPartial()) {
    Logger.log('partial');
    var element = elements[i].getElement().asText();
    var cache = CacheService.getDocumentCache();
    cache.put('element', element);
    var startIndex = elements[i].getStartOffset();
    var endIndex = elements[i].getEndOffsetInclusive();
  }
  // ...
}
When I get back the element I get TypeError: Cannot find function insertText in object Text. error.
var cache = CacheService.getDocumentCache();
cache.get('text').insertText(0, ':)');
I hope I have clearly explained what I want to achieve.
One direct way is to add a bookmark, which does not depend on subsequent document changes. It has a disadvantage: a bookmark is visible to everyone...
A more interesting way is to add a named range with a unique name. Sample code is below:
function setNamedParagraph() {
  var doc = DocumentApp.getActiveDocument();
  // Suppose you want to remember namely the third paragraph (currently)
  var par = doc.getBody().getParagraphs()[2];
  Logger.log(par.getText());
  var rng = doc.newRange().addElement(par);
  doc.addNamedRange("My Unique Paragraph", rng);
}

function getParagraphByName() {
  var doc = DocumentApp.getActiveDocument();
  var rng = doc.getNamedRanges("My Unique Paragraph")[0];
  if (rng) {
    var par = rng.getRange().getRangeElements()[0].getElement().asParagraph();
    Logger.log(par.getText());
  } else {
    Logger.log("Deleted!");
  }
}
The first function marks the third paragraph as a named range. The second one retrieves this paragraph by the range name despite subsequent document changes. Note that we need to handle the case where our "unique paragraph" has been deleted.
Not sure if cache is the best approach. Cache is volatile, so it might happen that the cached value doesn't exist anymore. Probably PropertiesService is a better choice.
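Since the question notes that PropertiesService only stores strings, one way the suggestions above might be combined is to store the named range's ID (a string) rather than the element itself. This is only a sketch: rememberParagraph and recallParagraph are hypothetical helper names, and the document and properties objects are passed in as parameters (an assumption made here so the helpers do not depend on globals).

```javascript
// Hypothetical helpers: persist a named range's ID so the paragraph can be found later.
// `doc` is expected to behave like a DocumentApp Document and `props` like a
// PropertiesService Properties object (e.g. PropertiesService.getDocumentProperties()).
function rememberParagraph(doc, props, key, paragraph) {
  var range = doc.newRange().addElement(paragraph).build();
  var namedRange = doc.addNamedRange(key, range);
  // Only the ID string is stored, which PropertiesService can hold.
  props.setProperty(key, namedRange.getId());
}

function recallParagraph(doc, props, key) {
  var id = props.getProperty(key);
  if (!id) return null; // nothing was stored under this key
  var namedRange = doc.getNamedRangeById(id);
  if (!namedRange) return null; // the named range no longer exists
  var elements = namedRange.getRange().getRangeElements();
  return elements.length ? elements[0].getElement().asParagraph() : null;
}
```

In an add-on, these might be called as rememberParagraph(DocumentApp.getActiveDocument(), PropertiesService.getDocumentProperties(), "highlight-1", par), where the key "highlight-1" is just an illustrative value.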

NodeJS: Extract a sentence from html text based on a phrase

I have some text stored in a database, which looks something like below:
let text = "<p>Some people live so much in the future they they lose touch with reality.</p><p>They don't just <strong>lose touch</strong> with reality, they get obsessed with the future.</p>"
The text can have many paragraphs and HTML tags.
Now, I also have a phrase:
let phrase = 'lose touch'
What I want to do is search for the phrase in the text and return the complete sentence that contains the phrase inside a strong tag.
In the above example, even though the first paragraph also contains the phrase 'lose touch', it should return the second sentence, because that is where the phrase is inside a strong tag. The result will be:
They don't just <strong>lose touch</strong> with reality, they get obsessed with the future.
On the client-side, I could create a DOM tree with this HTML text, convert it into an array and search through each item in the array, but in NodeJS document is not available, so this is basically just plain text with HTML tags. How do I go about finding the right sentence in this blob of text?
I think this might help you.
No need to involve DOM in this if I understood the problem correctly.
This solution would work even if the p or strong tags have attributes in them.
And if you want to search for tags other than p, simply update the regex for it and it should work.
const search_phrase = "lose touch";
// Escape regex metacharacters so the phrase is matched literally.
const escaped_phrase = search_phrase.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
// No "g" flag here: test() on a global regex keeps lastIndex between calls and can skip matches.
const strong_regex = new RegExp(`<\\s*strong[^>]*>${escaped_phrase}<\\s*/\\s*strong>`);
// Backslashes are doubled because "\s" inside a string literal is just "s".
const paragraph_regex = new RegExp("<\\s*p[^>]*>(.*?)<\\s*/\\s*p>", "g");
const text = "<p>Some people live so much in the future they they lose touch with reality.</p><p>They don't just <strong>lose touch</strong> with reality, they get obsessed with the future.</p>";
const paragraphs = text.match(paragraph_regex);
if (paragraphs && paragraphs.length) {
  const paragraphs_with_strong_text = paragraphs.filter(paragraph => {
    return strong_regex.test(paragraph);
  });
  console.log(paragraphs_with_strong_text);
  // prints an array containing only the second paragraph
}
P.S. The code is not optimised, you can change it as per the requirement in your application.
There is cheerio which is something like server-side jQuery. So you can get your page as text, build DOM, and search inside of it.
First, you could split the text with var arr = text.split("<p>") so that you can work with each paragraph individually.
Then you could loop through the array and search for your phrase inside strong tags:
for (var i = 0; i < arr.length; i++) {
  if (arr[i].search("<strong>" + phrase + "</strong>") != -1) {
    console.log("<p>" + arr[i]);
    // arr[i] is the entire paragraph containing the phrase inside strong tags, minus "<p>"
  }
}
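The answers above return the whole paragraph. To narrow the result to a single sentence, as the question asks, one possible sketch is to split each paragraph into sentences before testing for the strong tag. The function name findSentenceWithStrongPhrase and the sentence-splitting rule (break after ".", "!" or "?" followed by whitespace) are assumptions made for illustration; regex-based HTML handling like this only suits simple, well-formed markup.

```javascript
// Escape regex metacharacters so the phrase is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: return the first sentence that has the phrase inside <strong>.
function findSentenceWithStrongPhrase(html, phrase) {
  // No "g" flag, so test() stays stateless between calls.
  const strong = new RegExp(
    "<strong[^>]*>\\s*" + escapeRegExp(phrase) + "\\s*</strong>", "i");
  // Capture the inner HTML of each <p>...</p>.
  const paragraphRe = /<p[^>]*>([\s\S]*?)<\/p>/gi;
  for (const match of html.matchAll(paragraphRe)) {
    // Naive sentence split: break after ., ! or ? when whitespace follows.
    const sentences = match[1].split(/(?<=[.!?])\s+/);
    const hit = sentences.find((s) => strong.test(s));
    if (hit) return hit.trim();
  }
  return null; // no sentence contains the phrase in a strong tag
}

const text = "<p>Some people live so much in the future they they lose touch with reality.</p><p>They don't just <strong>lose touch</strong> with reality, they get obsessed with the future.</p>";
console.log(findSentenceWithStrongPhrase(text, "lose touch"));
```

Abbreviations such as "e.g." would be split incorrectly by this rule, so a proper sentence tokenizer may be preferable for real content.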

How to highlight text from paragraph using protractor?

I have notes section where user related data is present. This data is dynamic. I want to select one or two words from that notes section.
The text is separated by index; e.g., the word 'Any' has 3 indexes. All these notes are under one div tag.
Please suggest how to select text or a word from the paragraph present there.
I tried the following:
1. browser.actions().keyDown(protractor.Key.CTRL).sendKeys('a').perform()
2. var Key = protractor.Key; browser.actions().sendKeys(Key.chord(Key.CONTROL, 's')).perform(); browser.actions().sendKeys(Key.chord(Key.CONTROL, Key.SHIFT, 'm')).perform(); browser.actions().sendKeys(Key.chord(Key.CONTROL, 'o')).perform();
Elements have a getText() function, so get the element and then call getText() on it:
async function someName(elementID) {
  const elem = element(by.id(elementID));
  return await elem.getText();
}
You can then parse the string however you like.

Indesign Script: Get first paragraph in textframe in each group

Using InDesign CS5.5, I have a vast collection of groups, each with an image and a text frame. Each text frame has 3 paragraphs by default.
I need to get the text from the first paragraph of each textframe.
So far I have this:
var textboxes = app.activeDocument.groups.everyItem().textFrames;
for (i = 0; i <= textboxes.length; i++) {
  if(textboxes[i] != 'undefined') {
    var product = textboxes[i].contents;
    $.writeln(product);
  }
}
This gives me ALL the text... I really need to get the first paragraph only, OR filter it somehow by font size.
I've tried using textboxes[i].paragraphs[0], but this returns the rather vague Object Invalid. It might be a specific group, but it's too vague for me to tell.
Is there a way to skip and continue if an object is invalid? And is there perhaps a way to only look for text with a certain font size?
Any help would be greatly appreciated. I find InDesign's scripting API documentation quite poor.
I suggest using:
var m1stParas = app.activeDocument.groups.everyItem().textFrames.everyItem().paragraphs[0];
which should return an array of paragraphs (each element is the first paragraph of each text frame from each group).
So you will have a set of text objects. Each object's contents is a string.
In case of the "invalid object" error: does your document possibly have empty text frames in some groups?
Jarek

How can I get the next paragraph after my selection in InDesign?

I'm using Adobe InDesign and ExtendScript to find a keyword using app.activeDocument.findGrep(), and I've got this part working well. I know findGrep() returns an array of Text objects. Let's say I want to work with the first result:
var result = app.activeDocument.findGrep()[0];
How can I get the next paragraph following result?
Use the InDesign DOM
var nextParagraph = result.paragraphs[-1].insertionPoints[-1].paragraphs[-1];
The InDesign DOM has different Text objects that you can use to address paragraphs, words, characters, or insertion points (the space between characters where your blinking cursor sits). A group of Text objects is called a collection. Collections in InDesign are similar to arrays, but one significant difference is that they can be addressed from the back by using a negative index (paragraphs[-1]).
result refers to the findGrep() result. It can be any Text object, depending on your search terms.
paragraphs[-1] means the last paragraph of your result (Paragraph A). If the search result is just one word, then this refers to the word's enclosing paragraph, and this collection of paragraphs has just one element.
insertionPoints[-1] refers to the last insertionPoint of Paragraph A. This comes after the paragraph mark and before the first character of the next paragraph (Paragraph B). This insertionPoint belongs to both this paragraph and the following paragraph.
paragraphs[-1] returns the last paragraph of the insertionPoint, which is Paragraph B (the next paragraph).
Although nextItem seems entirely appropriate and efficient, it may cause performance problems, especially if you call it many times in a large loop. Keep in mind that nextItem() is a function call that creates its own internal scope, and so on.
An alternative is to navigate within the story and reach the next paragraph through the insertionPoints indices:
var main = function() {
  var doc, found, st, pCurr, pNext, ipNext;
  if (!app.documents.length) return;
  doc = app.activeDocument;
  // Reset the GREP preferences, then search for the first character of each story.
  app.findGrepPreferences = app.changeGrepPreferences = null;
  app.findGrepPreferences.findWhat = "\\A.";
  found = doc.findGrep();
  if (!found.length) return;
  found = found[0];
  st = found.parentStory;
  pCurr = found.paragraphs[0];
  // Look up the story-level insertion point at the index where the current paragraph ends.
  ipNext = st.insertionPoints[pCurr.insertionPoints[-1].index];
  pNext = ipNext.paragraphs[0];
  alert(pNext.contents);
};
main();
Not claiming the absolute truth here. Just advising about possible issues with nextItem().
A simpler version is below:
result.paragraphs.nextItem(result.paragraphs[0]);
thank you
mg.
