Anyone have a diff algorithm for rendered HTML? [closed]

Anyone have a diff algorithm for rendered HTML? [closed] - javascript

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 7 years ago.
Improve this question
I'm interested in seeing a good diff algorithm, possibly in Javascript, for rendering a side-by-side diff of two HTML pages. The idea would be that the diff would show the differences of the rendered HTML.
To clarify, I want to be able to see the side-by-side diffs as rendered output. So if I delete a paragraph, the side by side view would know to space things correctly.
#Josh exactly. Though maybe it would show the deleted text in red or something. The idea is that if I use a WYSIWYG editor for my HTML content, I don't want to have to switch to HTML to do diffs. I want to do it with two WYSIWYG editors side by side maybe. Or at least display diffs side-by-side in an end-user friendly matter.

There's another nice trick you can use to significantly improve the look of a rendered HTML diff. Although this doesn't fully solve the initial problem, it will make a significant difference in the appearance of your rendered HTML diffs.
Side-by-side rendered HTML will make it very difficult for your diff to line up vertically. Vertical alignment is crucial for comparing side-by-side diffs. In order to improve the vertical alignment of a side-by-side diff, you can insert invisible HTML elements in each version of the diff at "checkpoints" where the diff should be vertically aligned. Then you can use a bit of client-side JavaScript to add vertical spacing around checkpoint until the sides line up vertically.
Explained in a little more detail:
If you want to use this technique, run your diff algorithm and insert a bunch of visibility:hidden <span>s or tiny <div>s wherever your side-by-side versions should match up, according to the diff. Then run JavaScript that finds each checkpoint (and its side-by-side neighbor) and adds vertical spacing to the checkpoint that is higher-up (shallower) on the page. Now your rendered HTML diff will be vertically aligned up to that checkpoint, and you can continue repairing vertical alignment down the rest of your side-by-side page.

Over the weekend I posted a new project on codeplex that implements an HTML diff algorithm in C#. The original algorithm was written in Ruby. I understand you were looking for a JavaScript implementation, perhaps having one available in C# with source code could assist you to port the algorithm. Here is the link if you are interested: htmldiff.codeplex.com. You can read more about it here.
UPDATE: This library has been moved to GitHub.

I ended up needing something similar awhile back. To get the HTML to line up side to side, you could use two iFrames, but you'd then have to tie their scrolling together via javascript as you scroll (if you allow scrolling).
To see the diff, however, you will more than likely want to use someone else's library. I used DaisyDiff, a Java library, for a similar project where my client was happy with seeing a single HTML rendering of the content with MS Word "track changes"-like markup.
HTH

Consider using the output of links or lynx to render a text-only version of the html, and then diff that.

What about DaisyDiff (Java and PHP vesions available).
Following features are really nice:
Works with badly formed HTML that can be found "in the wild".
The diffing is more specialized in HTML than XML tree differs. Changing part of a text node will not cause the entire node to be changed.
In addition to the default visual diff, HTML source can be diffed coherently.
Provides easy to understand descriptions of the changes.
The default GUI allows easy browsing of the modifications through keyboard shortcuts and links.

So, you expect
<font face="Arial">Hi Mom</font>
and
<span style="font-family:Arial;">Hi Mom</span>
to be considered the same?
The output depends very much on the User Agent. Like Ionut Anghelcovici suggests, make an image. Do one for every browser you care about.

Use the markup mode of Pretty Diff for HTML. It is written entirely in JavaScript.
http://prettydiff.com/

If it is XHTML (which assumes a lot on my part) would the Xml Diff Patch Toolkit help? http://msdn.microsoft.com/en-us/library/aa302294.aspx

For smaller differences you might be able to do a normal text-diff, and then analyse the missing or inserted pieces to see how to resolve it, but for any larger differences you're going to have a very tough time doing this.
For instance, how would you detect, and show, that a left-aligned image (floating left of a paragraph of text) has suddenly become right-aligned?

Using a text differ will break on non-trivial documents.
Depending on what you think is intuitive, XML differs will probably generate diffs that aren't very good for text with markup.
AFAIK, DaisyDiff is the only library specialized in HTML. It works great for a subset of HTML.

If you were working with Java and XHTML, XMLUnit allows you to compare two XML documents via the org.custommonkey.xmlunit.DetailedDiff class:
Compares and describes all the
differences between two XML documents.
The document comparison does not stop
once the first unrecoverable
difference is found, unlike the Diff
class.

I believe a good way to do this is to render the HTML to an image and then use some diff tool that can compare images to spot the differences.

Related

Javascript retrieve linebreaks from dom [duplicate]

I need to add line breaks in the positions that the browser naturally adds a newline in a paragraph of text.
For example:
This is some very long text \n that spans a number of lines in the paragraph.
This is a paragraph that the browser chose to break at the position of the \n
I need to find this position and insert a 
Does anyone know of any JS libraries or functions that are able to do this?
The only solutuion that I have found so far is to remove tokens from the paragraph and observe the clientHeight property to detect a change in element height. I don't have time to finish this and would like to find something that's already tested.
Edit:
The reason I need to do this is that I need to accurately convert HTML to PDF. Acrobat renders text narrower than the browser does. This results in text that breaks in different positions. I need an identical ragged edge and the same number of lines in the converted PDF.
Edit:
#dtsazza: Thanks for your considered answer. It's not impossible to produce a layout editor that almost exactly replciates HTML I've written 99% of one ;)
The app I'm working on allows a user to create a product catalogue by dragging on 'tiles' The tiles are fixed width, absolutely positioned divs that contain images and text. All elemets are styled so font size is fixed. My solution for finding \n in paragraph is ok 80% of the time and when it works with a given paragrah the resulting PDF is so close to the on-screen version that the differences do not matter. Paragraphs are the same height (to the pixel), images are replaced with high res versions and all bitmap artwork is replaced with SVGs generated server side.
The only slight difference between my HTML and PDF is that Acrobat renderes text slightly more narrowly which results in line slightly shorter line length.
Diodeus's solution of adding span's and finding their coords is a very good one and should give me the location of the BRs. Please remember that the user will never see the HTML with the inserted BRs - these are added so that the PDF conversion produces a paragraph that is exactly the same size.
There are lots of people that seem to think this is impossible. I already have a working app that created extremely accurate HTML->PDF conversion of our docs - I just need a better solution of adding BRs because my solution sometimes misses a BR. BTW when it does work my paragraphs are the same height as the HTML equivalents which is the result we are after.
If anyone is interested in the type of doc i'm converting then you can check ou this screen cast:
http://www.localsa.com.au/brochure/brochure.html
Edit: Many thanks to Diodeus - your suggestion was spot on.
Solution:
for my situation it made more sense to wrap the words in spans instead of the spaces.
var text = paragraphElement.innerHTML.replace(/ /g, ' ');
text = ""+text+""; //wrap first and last words.
This wraps each word in a span. I can now query the document to get all the words, iterate and compare y position. When y pos changes add a br.
This works flawlessly and gives me the results I need - Thank you!

I would suggest wrapping all spaces in a span tag and finding the coordinates of each tag. When the Y-value changes, you're on a new line.

I don't think there's going to be a very clean solution to this one, if any at all. The browser will flow a paragraph to fit the available space, linebreaking where needed. Consider that if a user resizes the browser window, all the paragraphs will be rerendered and almost certainly will change their break positions. If the user changes the size of the text on the page, the paragraphs will be rerendered with different line break points. If you (or some script on your page) changes the size of another element on the page, this will change the amount of space available to a floating paragraph and again - different line break points.
Besides, changing the actual markup of your page to mimic something that the browser does for you (and does very well) seems like the wrong approach to whatever you're doing. What's the actual problem you're trying to solve here? There's probably a better way to achieve it.
Edit: OK, so you want to render to PDF the same as "the screen version". Do you have a specific definitive screen version nominated - in terms of browser window dimensions, user stylesheets, font preferences and adjusted font size? The critical thing about HTML is that it deliberately does not specify a specific layout. It simply describes what is on the page, what they are and where they are in relation to one another.
I've seen several misguided attempts before to produce some HTML that will exactly replicate a printed creative, designed in something like a DTP application where a definitive absolute layout is essential. Those efforts were doomed to failure because of the nature of HTML, and doing it the other way round (as you're trying to) will be even worse because you don't even have a definitive starting point to work from.
On the assumption that this is all out of your hands and you'll have to do it anyway, my suggestion would be to give up on the idea of mangling the HTML. Look at the PDF conversion software - if it's any good it should give you some options for font kerning and similar settings. Playing around with the details here should get you something that approximates the font rendering in the browser and thus breaks lines at the same places.
Failing that, all I can suggest is taking screenshots of the browser and parsing these with OCR to work out where the lines break (it shouldn't require a very accurate OCR since you know what the raw text is anyway, it essentially just has to count spaces). Or perhaps just embed the screenshot in the PDF if text search/selection isn't a big deal.
Finally doing it by hand is likely the only way to make this work definitively and reliably.
But really, this is still just wrong and any attempts to revise the requirements would be better. Keep going up one step in the chain - why does the PDF have to have the exact same ragged edge as some arbitrary browser rendering? Can you achieve that purpose in another (better) way?

Sounds like a bad idea when you account for user set font sizes, MS Windows accessibility mode, and the hundreds of different mobile devices. Let the browser do it's thing - trying to have exact control over the rendering will only cause you hours of frustration.

I don't think you'll be able to do this with any kind of accuracy without embedding Gecko/WebKit/Trident or essentially recreating them.

Maybe an alternative: do all line-breaks yourself, instead of relying on the browser. Place all text in pre tags, and add your own linebreaks. Now at least you don't have to figure out where the browser put them.

formulas on the webpage without using HTML tables

I have a set for formulas that I need to display on the webpage. The values in them are not fixed, but presented in a text input. Right now I'm using tables to represent these formulas (fractions. Example image is at the bottom of post). I was looking for a library that would allow me to do it, but didn't find one. Can someone demonstrate how this can be achieved without tables? Can I do this with Bootstrap/JQuery/JavaScript/CSS/Some other library?
Any help is appreciated.

Agreed with wolfson, MathJax seems like it would work for you, I'm not certain if it works with HTML inputs as you have above.
The "official" answer (as supported by the HTML and CSS governing bodies) is MathML.
There are also texmath and LaTexMathML which are designed to be similar to Latex syntax, so if you're familiar with Latex equations these may be the best option. A number of these solutions require having control over the webserver that's hosting the page, which may not be possible/appropriate depending on your circumstances. If you are dead-set on not using tables these could all work, though I see no reason why dynamically resized/re-positioned input boxes done using JavaScript couldn't also suffice, but this would likely be a lot more work and less replicatable.
References: tex.stackexchange.com

Non-css styling considerations for html, javascript and php [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
Im trying to get a sense for proper web development techniques , and im wondering if there are any habits i should be developing outside of css usage in terms of structure and layout.
What I've gotten so far is a sense that strategic and logical use of divs can help to give needed control when working with javascript and css.
But are there any other basic things i should consider that may not be aware of?
For example, I know people used to control layout with tables, but is that still a thing? Or is it now bad practice?
Looking forward to your suggestions.

Regarding your table question- People generally choose divs over tables, BUT some things cannot be done well without tables. They still have their place. Divs and Tables inset within each other are often used for datagrid styled layouts for example. Often replacing tables with divs creates a nasty (oversized) amount of css code. If you can use divs cleanly, that is perfect. If it overbloats your code, then use the simpler cleaner old faithful table code - or combine them where it makes for better results.
More info on that here: Tables instead of DIVs (Joel Coehoorn's detailed answer: "The whole "Tables vs Divs" thing just barely misses the mark. It's not "table" or "div". It's about using semantic html.")
Regarding the topic of non-CSS since your question also refers to layout design formatting:
There are some things you cannot do well also without pure CSS, that in and of itself allows modern formats to work on multiple devices. This will reach more customers.
For newer layout types for sections of pages needing icons or photos, look into the use of inline-block styled lists. CSS does wonders with them. It is a nice third alternative.
For cross-device development, you should check out responsive and fluid formatting options. A good place to start is to look at http://html5boilerplate.com/ and also http://www.initializr.com/ (boilerplate and bootstrap).
If you haven't worked with them before, look into media queries for browser sizing and device orientation displays (landscape or portrait) - part of the fundamentals in responsive web development. You can then use your regular html within the constraints you set with these.

Try to think of your pages/documents in terms of structure rather than appearance. For example, a book is made up of a title page, table of contents and chapters. Similarly, the contents of a web page can usually be broken down into logical blocks, which are then styled as required. Don't forget that web pages may be read by screen readers - your want to structure your page so the spoken output makes sense.
There has been a discussion of tables vs CSS at Why not use tables for layout in HTML?. You may also want to check out CSS Zen Garden, mentioned in that thread. Beyond this, I strongly recommend validating any HTML and CSS you write. Check out http://validator.w3.org/ for that purpose.

Divs are span are a better options for laying out a page. Tables should be used if tabular data is to be displayed.
Other than that you should mostly consider re-factoring redundant code regularly and resolve paths to directories in config files. This will help you to manage your project better. Redundant code can become a huge problem, since making changes requires changing code at several places. Make sure that you re-factor code that is being used more than once or twice into functions and call them instead.
Try diving into CSS frameworks such as bootstrap or foundation since a big community contributes there and are usually built with best practices in mind. You will be surprised to see the problems you run into have already been encountered by many others and they have found an efficient way around it too. There are frameworks and libraries for JS too. Feel free to explore.

Looking for help in specifying a textbook format in html5 specifically for tablets which includes notetaking

My 9 year old son has very low vision, 1/10. Currently the support people in his school provide him with pdf scans of the textbooks and provide good training for him to access his textbooks on a PC.
However, I consider that this is less than ideal for a number reasons :
Large file size (One geography book is 300Mb, the people who do the
scanning are not tech people)
The text size is only controlled indirectly via zoom, my boy need
40pt text at least the whole time
Difficult to navigate, i.e. there's lots of scrolling over and back
just to read a phrase making the whole reading thing a bit tiring.
No ability to take notes and/or fill in areas for answers in the
textbook.
No access to a TOC/index/
PC problems (weight/power/totallackofcoolnesscomparedtoatablet)
So, I'm thinking that the world of html5 has an answer for me. The process I'm hoping to move towards is the following :
I scan the textbooks and run them through an OCR program like ABBY
FineReader.
This gives me the raw text and the images
Twist this raw data into html5 format with a structure something like
<div class="book">
<div id="TOC"></div> (This TOC will be built dynamically)
<div class="page" id="1"> (Important to keep the notion of pages to allow him to have the same reference as the rest of the class)
<div class="text"></div>
<div class="img"></div>
<div class="answerzone"></div>
<div class="footer"></div>
</div>
</div>
Next, the javascript kicks in and adds the following functionality
Large, semi-transparent Left and right arrows always on screen on bottom corner
Large, semi transparent page number is always apparent, for example on top right corner
Large, semi-transparent symbol on top left corner which gives access to the following features
Access to the Table of contents
Increase/decrease font size
Add a zone where he can either write text from keyboard or onscreen with a stylus. This zone can have an image as background, e.g. where he needs to draw circles around answers.
Everything he adds (text/images) is stored locally on the tablet
So after all that, here's the question part. Does anyone have any experience of similar requirements that have found a solution ?
I can do the javascript stuff (well I think I can) up to the zone for adding text/images and storing all that locally. Does anyone have pointers to existing html5 solutions that could suit my need ?
Best regards,
Colm
P.S. I've gone away from the whole epub thing since, lets face it, it is only html and why not just use a browser instead of ebook reader solutions ?

Take a look at this article: Building Books with CSS3
That is an excellent article, and it has a lot of techniques that could be very useful. Obviously you're going to have to generate a lot of HTML, but using the techniques shown in that article, you won't have to generate nearly as much useless HTML. That article tells you exactly how to do the page numbers and table of contents, and it won't be hard to use JavaScript to create left and right arrows for changing pages (and style it with CSS, naturally).
As for annotation, I'm a little bit confused about whether you want this for a tablet, or a PC. If it's for a PC, I'd suggest to use pre-built tools, such as Zotero. If it's for a tablet, then you may have to play it by ear a bit, because what you can or cannot do varies greatly from tablet to tablet.

This is a very difficult problem. The biggest issue is getting intelligent text out of the PDF. PDF does not have the structure that you will be used to with HTML. It is essentially an electronic piece of paper that is printed to. Text is laid out in blocks that visually line up, but may not have much relation to each other in the file.
I think probably your best bet is to use Calibre to change the format to something else. The results will be far from perfrect, especially in something as complex as a text book. When you convert a book, make sure to go into the options for Heristic Processing and enable it.
If Calibre doesn't work for you, there are also some libraries that you can use to do this.
itext is free for non commercial uses and has text extraction.
pdfbox is free and also has text extraction.
pdfnet is a commercial product, but may have something you can use.
I'm not sure that you are going to get automated results that are going to be satisfying. Given that you only have to deal with one child's curriculum, and not a huge library of PDF's, it might be worth the time to hand copy each page. This way you can arrange the text in an intelligent way.

Automatic Spacing for Flowchart

So I'm working on a project that will, in the end, generate a kind of flow chart using the Flickr api. You will supply a seed tag, and the program will use that seed tag to find other related Flickr pictures that have common tags...
I have all of the back end stuff up and running but I'm stumped on the formatting. Here is a screenie of what I would like it to look like...
Here's my question. Is there a good way of approaching the spacing of each branch? By this is mean, I would like to have a function where I could simply create a new node (or "branch") and specify which existing node I would like it to attach to. This is all good and fine, but I need to be able to automatically and intelligently place the new node on the page so it doesn't overlap any existing lines or nodes. I guess this is more of a general programming question as if I knew the process I could code it, but for those who are interested I am doing this in Javascript/HTML/CSS for the styling and maybe some PHP for the Flickr calls.
Feel free to ask any questions to clarify my rambling.

You could use a spring model between the nodes. Each node exerts a repelling force against every other node. Allow all the nodes to push against each other a certain number of times and you'll come up with a reasonable solution. You'll want to have a couple limits to make sure nodes don't go flying off into space and that you don't oscillate between a couple similar states.
Implementing it in Javascript/PHP is left as an exercise for the reader.
An alternative is to use a graph layout program such as GraphViz.

I look forward to seeing the results of your project. I agree with scompt about using graphviz.

Develop Reference

JavaScript is the programming language of the Web.