I would like to seek your help with a problem I am trying to tackle involving XPaths.
I am trying to generalize multiple XPaths provided by a user to get an XPath that would best 'fit' all the provided examples. This is for a web scraping system I am building.
E.g., if the user gives the following XPaths (each pointing to a link in the 'Spotlight' section of the Google News page):
Good examples:
/html/body/div[#id='page']/div/div[#id='main-wrapper']/div[#id='main']/div/div/div[3]/div[1]/table[#id='main-am2-pane']/tbody/tr/td[#id='rt-col']/div[3]/div[#id='s_en_us:ir']/div[2]/div[1]/div[2]/a[#id='MAE4AUgAUABgAmoCdXM']/span
/html/body/div[#id='page']/div/div[#id='main-wrapper']/div[#id='main']/div/div/div[3]/div[1]/table[#id='main-am2-pane']/tbody/tr/td[#id='rt-col']/div[3]/div[#id='s_en_us:ir']/div[2]/div[6]/div[2]/a[#id='MAE4AUgFUABgAmoCdXM']/span
/html/body/div[#id='page']/div/div[#id='main-wrapper']/div[#id='main']/div/div/div[3]/div[1]/table[#id='main-am2-pane']/tbody/tr/td[#id='rt-col']/div[3]/div[#id='s_en_us:ir']/div[2]/div[12]/div[2]/a[#id='MAE4AUgLUABgAmoCdXM']/span
Bad examples (pointing to a link in another section):
/html/body/div[#id='page']/div/div[#id='main-wrapper']/div[#id='main']/div/div/div[3]/div[1]/table[#id='main-am2-pane']/tbody/tr/td[#id='lt-col']/div[2]/div[#id='replaceable-section-blended']/div[1]/div[4]/div/h2/a[#id='MAA4AEgFUABgAWoCdXM']/span
It should be able to generalize and produce an XPath expression that would select all the links in the 'Spotlight' section. (It should be able to throw out the incorrect XPath given.)
Generalized XPath:
/html/body/div[#id='page']/div/div[#id='main-wrapper']/div[#id='main']/div/div/div[3]/div[1]/table[#id='main-am2-pane']/tbody/tr/td[#id='rt-col']/div[3]/div[#id='s_en_us:ir']/div[2]/div/div[2]/a[#id='MAE4AUgLUABgAmoCdXM']/span
Could you kindly advise me on how to go about it? I was thinking of using the Longest Common Substring strategy; however, that would over-generalize if a bad example is given (like the fourth example above). Are there any libraries or any open source software that has been done in this area?
I saw some similar posts (finding common ancestor from a group of xpath? and Howto find the first common XPath ancestor in Javascript?). However, they are talking about the longest common ancestor.
I am writing it in JavaScript, in the form of a Firefox extension.
Thanks for your time and any help would be greatly appreciated!
The question here is an automaton minimization problem. You have (XPath1|XPath2|XPath3) and you would like to get a minimal automaton XPath4 which matches the same nodes. There is also the question of whether the minimization loses information or not, like JPEG. For exact minimization you could google "Algorithms for Minimization of Finite-State Automata".
OK, the simplest way is finding a common subsequence: convert each XPath step to a character and run a character-based common-substring finder over the list of strings. So we have, for example:
adcba, acba, adba --common substring--> aba --general reg exp--> a.*b.*a --convert back to xpath--> ...
You can also try to put something less general in place of .*
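Below is a rough JavaScript sketch of that idea (JavaScript since the asker is writing a Firefox extension). Instead of converting steps to characters, it works on the path steps directly: it drops paths whose element structure disagrees with the majority, keeps the steps all remaining paths agree on, and falls back to the bare element name where they differ. The function name generalizeXPaths and the outlier heuristic are my own assumptions, not an existing library.

// Minimal sketch: generalize XPaths that share the same structure.
// Paths whose step count or element names differ from the majority are
// treated as outliers and dropped (this is what discards the "bad example").
function generalizeXPaths(xpaths) {
  var split = xpaths.map(function (xp) {
    return xp.split('/').filter(function (s) { return s.length > 0; });
  });

  // Signature = the element names only, with indices/predicates stripped.
  var signatureOf = function (steps) {
    return steps.map(function (s) { return s.replace(/\[.*\]$/, ''); }).join('/');
  };
  var counts = {};
  split.forEach(function (steps) {
    var sig = signatureOf(steps);
    counts[sig] = (counts[sig] || 0) + 1;
  });
  var bestSig = Object.keys(counts).sort(function (a, b) {
    return counts[b] - counts[a];
  })[0];
  var kept = split.filter(function (steps) { return signatureOf(steps) === bestSig; });

  // Keep a step verbatim if all kept paths agree on it; otherwise drop the
  // predicate so div[1], div[6], div[12] all generalize to plain div.
  var generalized = kept[0].map(function (step, i) {
    var same = kept.every(function (steps) { return steps[i] === step; });
    return same ? step : step.replace(/\[.*\]$/, '');
  });
  return '/' + generalized.join('/');
}

Run on the three good examples plus the bad one, the bad example lands in a minority signature and is ignored, while the differing div[N] and a[#id=...] steps collapse to plain div and a.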
I'm using this technique to extract the click events in my SharePoint site. It uses jQuery and a regular expression to capture clicks and report them as events to Google Analytics.
I'm also just past total newbie with regex -- it is starting to make some sense to me, but I still have a lot to learn. So here goes.
I have a preapproved list of filetypes that I am interested in based on the site listed above.
var filetypes = /\.(zip|pdf|doc.*|xls.*|ppt.*|mp3|txt|wma|mov|avi|wmv|flv|wav|jpg)$/i;
But it isn't quite working like I need. With the $ I assume it is trying to match to the end of the line. But often in SharePoint we get links like this:
example.org/sharepoint/_layouts/15/wopiframe.aspx?sourcedoc=/sharepointlibrary/the%20document%20name.docx&action=default&defaultitemopen=1
The two problems I have are that I can't count on the file name being before the query string or hash, and I can't count on it being at the end. And then there are all the different Microsoft Office extensions.
I found this thread on extracting extensions, but it doesn't seem to work correctly.
I've put together this version
var filetypes = /\.(zip|pdf|doc|xls|ppt|mp3|txt|wma|mov|avi|wmv|flv|wav|jpg)[A-Za-z]*/i;
I changed the Office bits from doc.* to just plain doc and added the optional alpha characters afterwards, and removed the $ end anchor. It seems to be working with my test sample, but I don't know if there are gotchas that I don't understand.
Does this seem like a good solution, or is there a better way to get a predetermined list of extensions (including, for example, the Office variants like doc, docx, docm) that is either before the query string or might be one parameter in the query string?
I would go with the following, which matches the file name and extension:
/[^/]+\.(zip|pdf|doc[xm]?|xlsx?|ppt|mp3|txt|wma|mov|avi|wmv|flv|wav|jpg)/i
This outputs the%20document%20name.docx from your example.
There may be other formats that it might not work on, but it should get you what you want.
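For example, a quick way to exercise that pattern against the SharePoint-style URL from the question (the variable names here are only for illustration):

var filetypes = /[^/]+\.(zip|pdf|doc[xm]?|xlsx?|ppt|mp3|txt|wma|mov|avi|wmv|flv|wav|jpg)/i;
var href = 'example.org/sharepoint/_layouts/15/wopiframe.aspx?sourcedoc=/sharepointlibrary/the%20document%20name.docx&action=default&defaultitemopen=1';
var match = href.match(filetypes);
if (match) {
  console.log(match[0]); // "the%20document%20name.docx"
  console.log(match[1]); // "docx" -- the captured extension
}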
I have an HTML file, and within it there may be JavaScript, PHP and all the other stuff people may or may not put into their HTML files.
I want to extract all comments from this HTML file.
I can point out two problems in doing this:
What is a comment in one language may not be a comment in another.
In JavaScript, the remainder of a line is commented out using the // marker. But URLs also contain // within them, so I may well eliminate parts of URLs if I just substitute // and the remainder of the line with nothing.
So this is not a trivial problem.
Is there some solution for this already available somewhere?
Has anybody already done this?
Regarding problem 2: isn't every URL quoted, with either "www.url.com" or 'www.url.com', when you write it in either language? I'm not sure. If that's the case, then all you have to do is parse the code and check whether there are any quote marks preceding the slashes to know if it's a real URL or just a comment.
Look into parser generators like ANTLR, which has grammars for many languages, and write a nesting parser to reliably find comments. Regular expressions aren't going to help you if accuracy is important. Even then, it won't be 100% accurate.
Consider:
Problem 3: comment syntax in a language is not always a comment in that language.
<textarea><!-- not a comment --></textarea>
<script>var re = /[/*]not a comment[*/]/, str = "//not a comment";</script>
Problem 4: a comment embedded in a language may not obviously be a comment.
<button onclick="// this is a comment//
notAComment()">
Problem 5: what is a comment may depend on how the browser is configured.
<noscript><!-- </noscript> Whether this is a comment depends on whether JS is turned on -->
<!--[if IE 8]>This is a comment, except on IE 8<![endif]-->
I had to solve this problem partially for contextual templating systems that elide comments from source code to prevent leaking software implementation details.
https://github.com/mikesamuel/html-contextual-autoescaper-java/blob/master/src/tests/com/google/autoesc/HTMLEscapingWriterTest.java#L1146 shows a testcase where a comment is identified in JavaScript, and later testcases show comments identified in CSS and HTML. You may be able to adapt that code to find comments. It will not handle comments in PHP code sections.
It seems from your wording that you are pondering an approach based on regular expressions: it is a pain to do that on the whole file. Try using some tools to highlight or discard interesting or uninteresting text, and then work on what is left from your sieve according to the keep/discard criteria. Have a look at HTML::Tree and TreeBuilder; they could be very useful for dealing with the HTML markup.
I would convert the HTML file into a character array and parse it. You can detect key strings like "<", "--", "www", "http" as you move forward, and either skip or delete those segments.
The start/end indices will have to be identified properly, which is a challenge, but you will have full power.
There are also other ways to simplify the process if performance is not a problem. For example, all tags can be grabbed with XML::Twig and the string can be parsed to detect JS comments.
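As a very rough illustration of that character-by-character approach, here is a minimal sketch (the function name extractComments is hypothetical) that collects <!-- --> comments in HTML and // and /* */ comments inside <script> blocks, skipping quoted strings so that "http://..." is never mistaken for a comment. It deliberately ignores PHP blocks, regex literals, conditional comments and the other edge cases listed above.

// Walk the source, collecting HTML comments and JS comments inside <script>.
function extractComments(src) {
  var comments = [];
  var i = 0;
  var inScript = false;
  while (i < src.length) {
    if (!inScript && src.startsWith('<!--', i)) {
      var end = src.indexOf('-->', i + 4);
      end = end === -1 ? src.length : end + 3;
      comments.push(src.slice(i, end));
      i = end;
    } else if (!inScript && src.slice(i, i + 7).toLowerCase() === '<script') {
      inScript = true;
      var close = src.indexOf('>', i);
      i = close === -1 ? src.length : close + 1;
    } else if (inScript && src.slice(i, i + 9).toLowerCase() === '</script>') {
      inScript = false;
      i += 9;
    } else if (inScript && (src[i] === '"' || src[i] === "'")) {
      var quote = src[i++];              // skip string literals entirely,
      while (i < src.length && src[i] !== quote) {
        if (src[i] === '\\') i++;        // honouring escaped characters,
        i++;
      }
      i++;                               // so "//" inside a URL is ignored
    } else if (inScript && src.startsWith('//', i)) {
      var nl = src.indexOf('\n', i);
      nl = nl === -1 ? src.length : nl;
      comments.push(src.slice(i, nl));
      i = nl;
    } else if (inScript && src.startsWith('/*', i)) {
      var star = src.indexOf('*/', i + 2);
      star = star === -1 ? src.length : star + 2;
      comments.push(src.slice(i, star));
      i = star;
    } else {
      i++;
    }
  }
  return comments;
}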
I want fulltext search for my JavaScript code, but I'm usually not interested in matches from the comments.
How can I have fulltext search ignoring any commented match? Such a feature would increase my productivity as a programmer.
Also, how can I do the opposite: search within the comments only?
(I'm currently using TextMate, but happy to change.)
See our Source Code Search Engine (SCSE). This tool indexes your code base using the language structure to guide the indexing; it can do so for many languages including JavaScript. Search queries are then stated in terms of abstract language tokens, e.g., to find identifiers involving the string "tax" multiplied by some constant, you'd write:
I=*tax* '*' N
This will search all indexed languages only for identifiers (in each language) followed by a '*' token, followed by some kind of number. Because the tool understands language structure, it isn't confused by whitespace, formatting or intervening comments. Because it understands comments, you can search inside just comments (say, for authors):
C=*Author*
Given a query, the SCSE finds all the hits across the code base (possibly millions of lines) and offers these as a set of choices; clicking on a choice pulls up the file with the hit in the middle, outlined where the match occurs.
If you insist on searching just raw text, the SCSE provides grep-style searches. If you have only a small set of files, this is still pretty fast. If you have a big set of files, this is a lot slower than language-structure-based searches. In both cases, grep-like searches get you more hits, usually at the cost of false positives (e.g., finding "tax" in a comment, or finding a variable named "Authorization_code"). But at least you have the choice.
While this doesn't operate from inside an editor, you can launch your editor (for most editors) on a file once you've found the hit you want.
Use UltraEdit. It fully supports full-text search that ignores comments, and also searching within comments only.
How about the NetBeans way (Find Symbol in the Navigate menu)?
It searches all variables, functions, objects, etc.
Or you could customize JSLint if you want to integrate it in a web application or something like that.
I personally use Notepad++, which is a great free code editor. It seems you need an editor supporting regular expression search (in one or many files). If you know regex, you can use powerful searches such as inside/outside JavaScript comments... The work will be to build the right expression and test it on one file with all the different cases, to be sure it will not miss things during the real search; or maybe you can google for 'javascript comments regular expression' or something like that.
Then have a look at Notepad++ plugins; one is 'RegEx Helper', which helps with building regular expressions.
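If you go the regular-expression route, a starting point for matching JavaScript comments could look like the pattern below; note that it will also hit // inside string literals and URLs, so it is only a rough editor-search aid. The pattern itself is my own suggestion, not something built into Notepad++.

// Matches single-line and block comments; will also hit "//" inside strings.
var commentPattern = /\/\/[^\n]*|\/\*[\s\S]*?\*\//g;

var source = 'var x = 1; // TODO: fix later\n/* block\ncomment */\nvar y = 2;';
console.log(source.match(commentPattern));
// [ '// TODO: fix later', '/* block\ncomment */' ]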
In order to expedite our 'content update review process', which is used in approving web page content for publishing, I'm looking to implement a JavaScript function that will compare two webpage versions.
So far, I've created a page that will load the content to be compared from the new and old versions of a particular page. Is there a (relatively) simple way to iterate through the HTML of each using JavaScript/jQuery and highlight what content has changed or is missing?
Since there would be so many HTML-specific details (this is essentially an HTML text compare), is there a JavaScript library I can use?
I should add that my first choice would be to implement this in PHP. Unfortunately, we have many constraints that only permit us to use limited resources such as JavaScript.
Version Control is a non-trivial problem. It's probably not something you should implement from scratch, either, if this is part of your "content update review process."
Instead, consider using a tool like Subversion, Git, or your favorite source control solution.
If you really wanna do this, you can go from something as simple as Regex matching to DOM matching. There's no "magic library" that I'm aware of that will encapsulate this for you, so it'll be work. Work that you'll probably do wrong.
Seriously consider a version control provider, or use a CMS that has built in versioning of pages. If you're feeling squirrely, check out an open source CMS (like Drupal) and try to figure out how they implement versioning, then reverse engineer/re-engineer it yourself. I hope the inefficiency in that is obvious.
I would do this in 3 steps
1/ segment the content into 2 arrays
for each page
. choose a separator, like the "." or ""
. you have the content as a big string, split it and build an array
2/ compare the arrays
loop on these 2 arrays containing the segmented content, let's say A[idxA] and B[idxB]
. if A[idxA] == B[idxB] then idxA++ and idxB++
. else find if there is an index where A[idxA] == B[index]
. if there is, mark all indexes between idxB and index as "B modified"
. else, mark idxA as "A modified"
3/ display the differences
At the end you should have all the indexes where A and B are not equal. You can then join the 2 arrays after adding some markup to highlight the differences.
It is not a perfect solution; it will be wrong sometimes, but not often if you choose your separator correctly. If you want it perfect, you will have to test several matches and compute the number of differences in order to minimise it.
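A rough JavaScript sketch of those three steps, using '.' as the separator; the function names segment, compare and highlight are illustrative only, and step 2 is the simple forward scan described above rather than a full diff algorithm.

// Step 1: segment a page's text content into sentences.
function segment(text) {
  return text.split('.')
             .map(function (s) { return s.trim(); })
             .filter(function (s) { return s.length > 0; });
}

// Step 2: walk both arrays and collect the indexes of B that differ from A.
function compare(A, B) {
  var changedInB = [];
  var idxA = 0, idxB = 0;
  while (idxA < A.length && idxB < B.length) {
    if (A[idxA] === B[idxB]) {
      idxA++; idxB++;
    } else {
      var found = B.indexOf(A[idxA], idxB);
      if (found !== -1) {
        for (var k = idxB; k < found; k++) changedInB.push(k); // added/changed in B
        idxB = found;
      } else {
        idxA++;                                                // removed from A
      }
    }
  }
  for (var rest = idxB; rest < B.length; rest++) changedInB.push(rest);
  return changedInB;
}

// Step 3: rebuild B with the changed segments highlighted.
function highlight(B, changedInB) {
  return B.map(function (s, i) {
    return changedInB.indexOf(i) !== -1 ? '<mark>' + s + '.</mark>' : s + '.';
  }).join(' ');
}

var oldText = 'First sentence. Second sentence. Third sentence.';
var newText = 'First sentence. A new sentence. Third sentence. And one more.';
console.log(highlight(segment(newText), compare(segment(oldText), segment(newText))));
// "First sentence. <mark>A new sentence.</mark> Third sentence. <mark>And one more.</mark>"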
I would like to get the meaning of a selected word using the Wiktionary API.
The retrieved content should be the same as what is presented in "Word of the day": only the basic meaning, without etymology, synonyms, etc.
For example:
"postiche n
Any item of false hair worn on the head or face, such as a false beard or wig."
I tried to use the documentation but I can't find a similar example. Can anybody help with this problem?
Although MediaWiki has an API (api.php), it might be easiest for your purposes to just use the action=raw parameter to index.php if you just want to retrieve the source code of one revision (not wrapped in XML, JSON, etc., as opposed to the API).
For example, this is the raw word of the day page for November 14:
http://en.wiktionary.org/w/index.php?title=Wiktionary:Word_of_the_day/November_14&action=raw
What's unfortunate is that the format of wiki pages focuses on presentation (for the human reader) rather than on semantics (for the machine), so you should not be surprised that there is no "get word definition" API command. Instead, your script will have to make sense of the numerous text formatting templates that Wiktionary editors have created and used, as well as complex presentational formatting syntax, including headings, unordered lists, and others. For example, here is the source code for the page "overflow":
http://en.wiktionary.org/w/index.php?title=overflow&action=raw
There is a "generate XML parse tree" option in the API, but it doesn't break much of the presentational formatting into XML. Just see for yourself:
http://en.wiktionary.org/w/api.php?action=query&titles=overflow&prop=revisions&rvprop=content&rvgeneratexml=&format=jsonfm
In case you are wondering whether there exists a parser for MediaWiki-format pages other than MediaWiki, no, there isn't. At least not anything written in JavaScript that's currently maintained (see list of alternative parsers, and check the web sites of the two listed ones). And even then, supporting most/all of the common templates will be a big challenge. Good luck.
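For instance, a minimal way to pull that raw wikitext from JavaScript could look like the snippet below. This assumes a context that is allowed to make the cross-origin request (e.g. a browser extension or a server-side proxy); fetch and the action=raw URL are the only things used.

// Fetch the raw wikitext of the word-of-the-day page; it still needs parsing.
var url = 'http://en.wiktionary.org/w/index.php' +
          '?title=Wiktionary:Word_of_the_day/November_14&action=raw';

fetch(url)
  .then(function (response) { return response.text(); })
  .then(function (wikitext) {
    console.log(wikitext); // raw wiki markup, not structured data
  });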
OK, I admit defeat.
There are some files relating to Wiktionary in Pywikipediabot, and looking at the code, it does look like you should be able to get it to parse the meaning/definition fields for you.
However, the last half hour has convinced me otherwise. The code is not well written, and I wonder if it has ever worked.
So I defer to idealmachine's answer, but I thought I would post this to save anyone else from making the same mistakes. :)
As mentioned earlier, the content of the Wiktionary pages is in a human-readable format, wikitext, so the MediaWiki API doesn't let you get the word meaning directly, because the data is not structured.
However, each page follows a specific convention, so it's not that hard to extract the meanings from the wikitext. Also, there are some APIs, like Wordnik or Lingua Robot, that parse Wiktionary content and provide it in JSON format.
MediaWiki does have an API but it's low-level and has no support for anything specific to each wiki. For instance it has no encyclopedia support for Wikipedia and no dictionary support for Wiktionary. You can retrieve the raw wikitext markup of a page or a section using the API but you will have to parse it yourself.
The first caveat is that each Wiktionary has evolved its own format but I assume you are only interested in the English Wiktionary. One cheap trick many tools use is to get the first line which begins with the '#' character. This will usually be the text of the definition of the first sense of the first homonym.
Another caveat is that every Wiktionary uses many wiki templates so if you are looking at the raw text you will see plenty of these. The only way to reliably expand these templates is by calling the API with action=parse.
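As a concrete illustration of that cheap trick, given the raw wikitext of an entry (fetched as in the earlier answer), grabbing the first definition line could look like this. The function name firstDefinition is hypothetical and the markup stripping is deliberately crude; anything template-heavy will still need action=parse.

// Return the first line that starts with '#' (the first definition of the
// first sense), with a very rough removal of wiki templates and links.
function firstDefinition(wikitext) {
  var lines = wikitext.split('\n');
  for (var i = 0; i < lines.length; i++) {
    if (lines[i].charAt(0) === '#') {
      return lines[i]
        .replace(/^#+\s*/, '')                              // drop the list marker
        .replace(/\{\{[^}]*\}\}/g, '')                      // drop {{templates}}
        .replace(/\[\[(?:[^|\]]*\|)?([^\]]*)\]\]/g, '$1')   // keep link text only
        .trim();
    }
  }
  return null;
}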