Indesign Script: Get first paragraph in textframe in each group - javascript

Using Indesign CS5.5, I have a vast collection of groups - all with an image and a textframe. The textframe has 3 paragraphs by default.
I need to get the text from the first paragraph of each textframe.
So far I have this:
var textboxes = app.activeDocument.groups.everyItem().textFrames;
for (i = 0; i <= textboxes.length; i++) {
if(textboxes[i] != 'undefined') {
var product = textboxes[i].contents;
$.writeln(product);
}
}
This gives me ALL the text...I really need to get the first paragraph only OR filter it somehow by font size.
I've tried using textboxes[i].paragraphs[0], but this returns the rather vague Object Invalid. It might be a specific group, but it's too vague for me to tell.
Is there a way to skip and continue if an object is invalid? And is there perhaps a way to look only for text with a certain font size?
Any help would be greatly appreciated. I find Indesign's scripting API documentation quite poor.

I suggest using:
var m1stParas = app.activeDocument.groups.everyItem().textFrames.everyItem().paragraphs[0];
This should return an array of paragraphs (each element is the first paragraph of each text frame in each group).
So you will have a set of text objects, and each object's contents is a string.
If you get the "invalid object" error, does your document possibly have empty text frames in some groups?
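To cover the "skip and continue" and font-size parts of the question, here is a minimal sketch. It assumes the frames have been resolved to a plain array with getElements(), that each object exposes isValid (CS5 and later), and that pointSize read on a paragraph reflects the size you want to filter on (a paragraph with mixed sizes may need a per-character check). The names firstParagraphTexts and minPointSize are hypothetical helpers, not part of the InDesign API.

```javascript
// Hypothetical helper: returns the text of the first paragraph of each frame,
// skipping missing, invalid, or empty frames instead of throwing.
// minPointSize is optional; when given, smaller first paragraphs are skipped.
function firstParagraphTexts(frames, minPointSize) {
    var texts = [];
    for (var i = 0; i < frames.length; i++) {
        var frame = frames[i];
        if (!frame || frame.isValid === false) continue;  // invalid object: skip
        if (frame.paragraphs.length === 0) continue;      // empty text frame: skip
        var para = frame.paragraphs[0];
        if (minPointSize !== undefined && para.pointSize < minPointSize) continue;
        texts.push(para.contents);
    }
    return texts;
}

// Inside InDesign (assumption: run as an ExtendScript script):
// var frames = app.activeDocument.groups.everyItem().textFrames.everyItem().getElements();
// $.writeln(firstParagraphTexts(frames, 12).join("\n"));
```

Checking paragraphs.length before indexing is what avoids the error on empty frames; the loop bound is i < frames.length so the last index is never out of range.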
Jarek


Word Add-in Get full Document text WITH INDICATOR?

There is already a question answering related to this topic: Word Add-in Get full Document text?
However, this method can't extract the indicator/bullet points.
Is there a way we can do this? I expect the text to be exactly the same as we manually select all then copy a Word document.
The reason behind this: I'm building a question bank from a microsoft word document. Several tools offer text extraction, however, it usually ignores the bullet point.
I use keywords like A. B. C. D. etc to detect the choices. However, if the author writing choices using indicator/bullet point, this method fails.
You can convert the numbered lists (list paragraphs) to plain text with a simple piece of VBA.
See here convert lists to text
For each paragraph in the document, you can identify whether it is a list item by calling isListItem.
If it is, you can call listItem to get the item.
The listString property in Word.ListItem class can help you get the list item bullet, number, or picture as a string.
Here is an example that shows how to extract the bullets in the document.
Word.run(async (context) => {
var paragraphs = context.document.body.paragraphs;
paragraphs.load("$none");
await context.sync();
for (let i = 0; i < paragraphs.items.length; i++) {
paragraphs.items[i].load("isListItem");
paragraphs.items[i].load("text");
await context.sync();
if (paragraphs.items[i].isListItem) {
paragraphs.items[i].load("listItem");
await context.sync();
console.log(paragraphs.items[i].listItem.listString + " " + paragraphs.items[i].text);
} else {
console.log(paragraphs.items[i].text);
}
}
});
The document is printed to the console paragraph by paragraph with all bullets retained.

Google Scripts - keep track of element [duplicate]

Update: This is a better way of asking the following question.
Is there an Id like attribute for an Element in a Document which I can use to reach that element at a later time. Let's say I inserted a paragraph to a document as follows:
var myParagraph = 'This should be highlighted when user clicks a button';
body.insertParagraph(0, myParagraph);
Then the user inserts another one at the beginning manually (i.e. by typing or pasting). Now the childIndex of my paragraph changes from 0 to 1. I want to reach that paragraph at a later time and highlight it, but because of the insertion the childIndex is no longer valid. There is no Id-like attribute on the Element interface or on any type implementing it. CacheService and PropertiesService only accept String data, so I can't store myParagraph as an Object.
Do you guys have any idea to achieve what I want?
Thanks,
Old version of the same question (Optional Read):
Imagine that the user selects a word and presses the highlight button of my add-on. Then she does the same thing for several more words. Then she edits the document in a way that changes the start and end indexes of those highlighted words.
At this point she presses the remove highlighting button. My add-on should disable highlighting on all previously selected words. The problem is that I don't want to scan the entire document and find any highlighted text. I just want direct access to those that previously selected.
Is there a way to do that? I tried caching selected elements. But when I get them back from the cache, I get TypeError: Cannot find function insertText in object Text. error. It seems like the type of the object or something changes in between cache.put() and cache.get().
var elements = selection.getSelectedElements();
for (var i = 0; i < elements.length; ++i) {
if (elements[i].isPartial()) {
Logger.log('partial');
var element = elements[i].getElement().asText();
var cache = CacheService.getDocumentCache();
cache.put('element', element);
var startIndex = elements[i].getStartOffset();
var endIndex = elements[i].getEndOffsetInclusive();
}
// ...
}
When I get back the element I get TypeError: Cannot find function insertText in object Text. error.
var cache = CacheService.getDocumentCache();
cache.get('text').insertText(0, ':)');
I hope I have clearly explained what I want to achieve.
One direct way is to add a bookmark, which does not depend on subsequent document changes. Its disadvantage is that a bookmark is visible to everyone.
A more interesting way is to add a named range with a unique name. Sample code is below:
function setNamedParagraph() {
var doc = DocumentApp.getActiveDocument();
// Suppose you want to remember what is currently the third paragraph
var par = doc.getBody().getParagraphs()[2];
Logger.log(par.getText());
// build() turns the RangeBuilder into the Range that addNamedRange expects
var rng = doc.newRange().addElement(par).build();
doc.addNamedRange("My Unique Paragraph", rng);
}
function getParagraphByName() {
var doc = DocumentApp.getActiveDocument();
var rng = doc.getNamedRanges("My Unique Paragraph")[0];
if (rng) {
var par = rng.getRange().getRangeElements()[0].getElement().asParagraph();
Logger.log(par.getText());
} else {
Logger.log("Deleted!");
}
}
The first function marks the third paragraph as a named range. The second retrieves this paragraph by the range name regardless of subsequent document changes. We still need to handle the case where our "unique paragraph" has been deleted.
Not sure if cache is the best approach. Cache is volatile, so the cached value may no longer exist when you ask for it. PropertiesService is probably a better choice.
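The two suggestions can be combined: since PropertiesService only stores strings, one option is to store the named range's ID rather than the element itself. The sketch below is a hedged illustration, not a tested add-on. rememberParagraph, recallParagraph and the key "highlightTarget" are hypothetical names; the doc and props objects are passed in as parameters and are assumed to be the active Document and a Properties store.

```javascript
// Hypothetical helpers: persist a paragraph reference as a named-range ID string.
function rememberParagraph(doc, props, par, key) {
  var range = doc.newRange().addElement(par).build();
  var named = doc.addNamedRange(key, range);
  props.setProperty(key, named.getId()); // the ID is a plain string
}

// Returns the paragraph, or null if the ID is unknown or the range is gone.
function recallParagraph(doc, props, key) {
  var id = props.getProperty(key);
  if (!id) return null;
  var named = doc.getNamedRangeById(id);
  if (!named) return null;
  var elements = named.getRange().getRangeElements();
  return elements.length ? elements[0].getElement().asParagraph() : null;
}

// In Apps Script (assumption: a container-bound Docs script):
// var doc = DocumentApp.getActiveDocument();
// var props = PropertiesService.getDocumentProperties();
// rememberParagraph(doc, props, doc.getBody().getParagraphs()[2], "highlightTarget");
// var par = recallParagraph(doc, props, "highlightTarget");
```

Unlike the cache, the stored ID does not expire, and the null return covers the case where the paragraph was deleted.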

Python Selenium Scraping Javascript - Element not found

I am trying to scrape the following Javascript frontend website to practise my Javascript scraping skills:
https://www.oplaadpalen.nl/laadpaal/112618
I am trying to find two different elements by their XPath. The first one is the title, which it does find. The second one is the actual text itself, which it somehow fails to find. It's strange, since I just copied the XPaths from the Chrome browser.
from selenium import webdriver
link = 'https://www.oplaadpalen.nl/laadpaal/112618'
driver = webdriver.PhantomJS()
driver.get(link)
#It could find the right element
xpath_attribute_title = '//*[@id="main-sidebar-container"]/div/div[1]/div[2]/div/div[' + str(3) + ']/label'
next_page_elem_title = driver.find_element_by_xpath(xpath_attribute_title)
print(next_page_elem_title.text)
#It fails to find the right element
xpath_attribute_value = '//*[@id="main-sidebar-container"]/div/div[1]/div[2]/div/div[' + str(3) + ']/text()'
next_page_elem_value = driver.find_element_by_xpath(xpath_attribute_value)
print(next_page_elem_value.text)
I have tried a couple of things: change "text()" into "text", "(text)", but none of them seem to work.
I have two questions:
Why doesn't it find the correct element?
What can we do to make it find the correct element?
Selenium's find_element_by_xpath() method returns the first element node matching the given XPath query, if any. However, XPath's text() function returns a text node—not the element node that contains it.
To extract the text using Selenium's finder methods, you'll need to find the containing element, then extract the text from the returned object.
Keeping your own logic intact, you can extract the labels and the associated values as follows:
for x in range(3, 8):
    label = driver.find_element_by_xpath("//div[@class='labels']//following::div[%s]/label" % x).get_attribute("innerHTML")
    value = driver.find_element_by_xpath("//div[@class='labels']//following::div[%s]" % x).get_attribute("innerHTML").split(">")[2]
    print("Label is %s and value is %s" % (label, value))
Console Output :
Label is Paalcode: and value is NewMotion 04001157
Label is Adres: and value is Deventerstraat 130
Label is pc/plaats: and value is 7321cd Apeldoorn
I would suggest a slightly different approach. I would grab the entire text and then split once on the colon. That gives you the title and the value. The code below gets the Paalcode through openingstijden labels.
# find_elements (plural) returns a list, which can then be indexed
divs = driver.find_elements_by_css_selector("div.leftblock > div.labels > div")
for x in range(2, 8):
    s = divs[x].text
    t = s.split(":", 1)
    print(t[0]) # title
    print(t[1]) # value
You don't want to split more than once because Status contains more colons.
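If you do the splitting on the JavaScript side instead, note that the limit argument of String.prototype.split behaves differently from Python's maxsplit: "a:b:c".split(":", 2) gives ["a", "b"] and drops the rest. A minimal sketch of a split-once helper follows; splitOnce is a hypothetical name and the sample string is made up for illustration, not taken from the page.

```javascript
// Hypothetical helper: split on the first occurrence of separator only,
// keeping any later separators inside the value.
function splitOnce(text, separator) {
  const index = text.indexOf(separator);
  if (index === -1) return [text.trim(), ""]; // no separator: whole text is the title
  return [
    text.slice(0, index).trim(),
    text.slice(index + separator.length).trim(),
  ];
}

// Made-up sample with a second colon in the value
const [title, value] = splitOnce("Status: Vrij: 2 van 2", ":");
console.log(title); // Status
console.log(value); // Vrij: 2 van 2
```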
Going with @JeffC's approach, if you want to first select all those elements using XPath instead of a CSS selector, you may use this code:
xpath_title_value = "//div[@class='labels']//div[label[contains(text(),':')] and not(div) and not(contains(@class,'toolbox'))]"
title_and_value_elements = driver.find_elements_by_xpath(xpath_title_value)
Notice the plural elements in the find_elements_by_xpath method. The xpath above selects div elements that are descendants of a div element that had a class attribute of "labels". The nested label of each selected div must contain a colon. Furthermore, the div itself may not have a class of "toolbox" (Something that certain other divs on the page have), nor must it contain any additional nested divs.
Following which, you can extract the text within the individual div elements (which also contain the text from the nested label elements) and then split them using ":\n" which separates the title and value in the raw text string.
for element in title_and_value_elements:
    element = element.text
    title, value = element.split(":\n")
    print(title)
    print(value, "\n")
Since you want to practise your JS skills, you can also do this in JS. All of the divs actually contain more data, as you can see if you paste this into the browser console:
labels = document.querySelectorAll(".labels");
divs = labels[0].querySelectorAll("div");
for (div of divs) console.log(div.firstChild, div.textContent);
You can push the results to an array, keeping only the divs that have a label, and return the resulting array to a Python variable:
labels_value_pair = driver.execute_script('''
scrap = [];
labels = document.querySelectorAll(".labels");
divs = labels[0].querySelectorAll("div");
for (div of divs) if (div.firstChild.tagName==="LABEL") scrap.push(div.firstChild.textContent, div.textContent);
return scrap;
''')

How can I get the next paragraph after my selection in InDesign?

I'm using Adobe InDesign and ExtendScript to find a keyword using app.activeDocument.findGrep(), and I've got this part working well. I know findGrep() returns an array of Text objects. Let's say I want to work with the first result:
var result = app.activeDocument.findGrep()[0];
How can I get the next paragraph following result?
Use the InDesign DOM
var nextParagraph = result.paragraphs[-1].insertionPoints[-1].paragraphs[-1];
The Indesign DOM has different Text objects that you can use to address paragraphs, words, characters, or insertion points (the space between characters where your blinking cursor sits). A group of Text objects is called a collection. Collections in Indesign are similar to arrays, but one significant difference is that they can be addressed from the back by using a negative index (paragraphs[-1]).
result refers to the findGrep() result. It can be any Text object, depending on your search terms.
paragraphs[-1] means the last paragraph of your result (Paragraph A). If the search result is just one word, then this refers to the word's enclosing paragraph, and this collection of paragraphs has just one element.
insertionPoints[-1] refers to the last insertionPoint of Paragraph A. This comes after the paragraph mark and before the first character of the next paragraph (Paragraph B). This insertionPoint belongs to both this paragraph and the following paragraph.
paragraphs[-1] returns the last paragraph of the insertionPoint, which is Paragraph B (the next paragraph).
Although nextItem seems entirely appropriate and efficient, it may cause performance problems, especially if you call it many times in a large loop. Keep in mind that nextItem() is a function call that creates an internal scope, and so on.
An alternative is to navigate within the story and reach the next paragraph via the insertionPoints indices:
var main = function() {
var doc, found, st, pCurr, pNext, ipNext, ps;
if (!app.documents.length) return;
doc = app.activeDocument;
app.findGrepPreferences = app.changeGrepPreferences = null;
app.findGrepPreferences.findWhat = "\\A.";
found = doc.findGrep();
if ( !found.length) return;
found = found[0];
st = found.parentStory;
pCurr = found.paragraphs[0];
ipNext = st.insertionPoints [ pCurr.insertionPoints[-1].index ];
pNext = ipNext.paragraphs[0];
alert( pNext.contents );
};
main();
Not claiming the absolute truth here. Just advising about possible issues with nextItem().
More simply, use the code below:
result.paragraphs.nextItem(result.paragraphs[0]);
thank you
mg.

Trying to find length of text within div with jquery

I need to find the length of the text (i.e. the number of characters) within a specified div (#post_div), excluding HTML formatting and the content of any non-specific span. So any embedded span that is not #span1 or #span2 needs to be excluded from the count.
So far I have the following solution which works, but it adds/removes from the DOM which I would prefer not to do.
var post = $("#post_div");
var post2 = post.html(); //duplicating for later
post.find("span:not(#span1):not(#span2)").remove(); //removing unwanted (only for character count) spans from DOM - YUCK!
post = $.trim(post.text());
console.log(post.length); // The correct length is here.
$("#post_div").html(post2); //replacing butchered DIV with original duplicate in DOM - YUCK!
I would prefer to achieve the same result, but without butchering the DOM/adding/replacing things from it for a simple character count.
Hope that makes sense
Instead of duplicating the HTML then working on the original node, duplicate the node and work on it outside of the main DOM tree.
var post = $("#post_div").clone();
post.find("span:not(.post_tag):not(.post_mentioned)").remove();
post = $.trim(post.text());
console.log(post.length); // The correct length is here.
Actually, the simple
var t = $.trim($("#post_div span.post_tag, #post_div span.post_mentioned").text());
console.log(t.length);
should suffice.
However, if you have textual content outside of span elements, you would have to use
var t = $.trim($("#post_div").text());
var t_inner = $("#post_div span:not(.post_tag):not(.post_mentioned)").text();
console.log(t.length - t_inner.length);
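Another way to avoid touching the DOM at all is to walk the child nodes and accumulate only the text you want to count. The sketch below is an assumption-laden illustration in plain JavaScript rather than jQuery. collectText and countedLength are hypothetical names, and the spans to keep are identified by id ("span1", "span2") as in the question.

```javascript
const TEXT_NODE = 3;    // same value as Node.TEXT_NODE
const ELEMENT_NODE = 1; // same value as Node.ELEMENT_NODE

// Hypothetical helper: concatenates text under node, skipping any <span>
// whose id is not listed in keepSpanIds. Nothing is cloned or removed.
function collectText(node, keepSpanIds) {
  let text = "";
  for (const child of Array.from(node.childNodes)) {
    if (child.nodeType === TEXT_NODE) {
      text += child.nodeValue;
    } else if (child.nodeType === ELEMENT_NODE) {
      const isUnwantedSpan =
        child.tagName === "SPAN" && !keepSpanIds.includes(child.id);
      if (!isUnwantedSpan) text += collectText(child, keepSpanIds);
    }
  }
  return text;
}

// Trimmed character count, mirroring $.trim(...).length in the question
function countedLength(root, keepSpanIds) {
  return collectText(root, keepSpanIds).trim().length;
}

// In the browser:
// countedLength(document.getElementById("post_div"), ["span1", "span2"]);
```

Because the text is built up once and trimmed at the end, this avoids the subtraction approach's sensitivity to whitespace inside the excluded spans.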
