Extracting Characters from a String - javascript

I need to parse HTML files and extract any characters found within the following flag:
${message}
The message may contain words, whitespace, and even special characters. I have the following regex that seems to partially work:
/\$\{(.+)\}/g
What's happening with this pattern is it appears to be working backwards from the line break and finding the first }. The desired result would be to work forward and find the first }.
Here is the regex in RegExr: https://regexr.com/3ng3d
I have the following test case:
<div>
<div class="panel-heading">
<h2 class="panel-title">${Current Status}<span> - {{data.serviceDisplay}}</span></h2>
</div>
${test}
<div class="panel-body">
<div>${We constantly monitor our services and their related components.} ${If there is ever a service interruption, a notification will be posted to this page.} ${If you are experiencing problems not listed on this page, you can submit a request for service.}</div>
<div>
<div>${No system is reporting an issue}</div>
</div>
<div>
<a>{{outage.typeDisplay}} - {{outage.ci}} (${started {{outage.begin}}})
<div></div>
</a>
</div>
<div><a href="?id=services_status" aria-label="${More information, open current status page}">${More information...}
</a></div>
</div>
</div>
The regex should extract the following:
Current Status
test
We constantly monitor our services and their related components.
If there is ever a service interruption, a notification will be posted to this page.
If you are experiencing problems not listed on this page, you can submit a request for service.
No system is reporting an issue
started {{outage.begin}}
More information, open current status page
More information...
But what I'm actually getting is...
${Current Status} - {{data.serviceDisplay}}
${test}
${We constantly monitor our services and their related components.} ${If there is ever a service interruption, a notification will be posted to this page.} ${If you are experiencing problems not listed on this page, you can submit a request for service.}
${No system is reporting an issue}
${started {{outage.begin}}}
${More information, open current status page}">${More information...}
It appears my regex is working back from the \n and finding the first } which is what's giving me #1, #3, and #6.
How can I work from the start and find the first } as opposed to working backwards from the line break?

Use RegExp.exec() to iterate the text and extract the capture group.
The pattern is /\$\{(.+?)\}(?=[^}]+?(?:{|$))/g. It lazily matches characters up to a closing curly bracket, and the lookahead only accepts that bracket if it is followed by a run of non-} characters that ends with an opening curly bracket or the end of the string.
var pattern = /\$\{(.+?)\}(?=[^}]+?(?:{|$))/g;
var text = '<div>\
<div class="panel-heading">\
<h1>${Text {{variable}} more text}</h1>\
<h2 class="panel-title">${Current Status}<span> - {{data.serviceDisplay}}</span></h2>\
</div>\
${test}\
<div class="panel-body">\
<div>${We constantly monitor our services and their related components.} ${If there is ever a service interruption, a notification will be posted to this page.} ${If you are experiencing problems not listed on this page, you can submit a request for service.}</div>\
<div>\
<div>${No system is reporting an issue}</div>\
</div>\
<div>\
<a>{{outage.typeDisplay}} - {{outage.ci}} (${started {{outage.begin}}})\
<div></div>\
</a>\
</div>\
<div><a href="?id=services_status" aria-label="${More information, open current status page}">${More information...}\
</a></div>\
</div>\
</div>';
var result = [];
var temp;
while ((temp = pattern.exec(text)) !== null) {
  result.push(temp[1]);
}
console.log(result);
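To see why the question's pattern overshoots and why the lookahead is needed, the three variants can be compared side by side. This is only a minimal sketch: the helper name extractAll and the two short sample strings are made up for illustration and are not part of the original code.

```javascript
// Collect capture group 1 from every match of a global pattern.
// extractAll is a hypothetical helper name used only for this sketch.
function extractAll(pattern, text) {
  var out = [];
  var m;
  pattern.lastIndex = 0; // always start from the beginning for a /g pattern
  while ((m = pattern.exec(text)) !== null) {
    out.push(m[1]);
  }
  return out;
}

var twoFlags = '<div>${one} ${two}</div>';
var nested = '(${started {{outage.begin}}})';

// Greedy .+ runs to the end of the line, then backtracks to the LAST }.
var greedy = extractAll(/\$\{(.+)\}/g, twoFlags);

// Lazy .+? stops at the FIRST }, which separates the two flags...
var lazy = extractAll(/\$\{(.+?)\}/g, twoFlags);

// ...but cuts a flag short when it contains {{...}}.
var lazyNested = extractAll(/\$\{(.+?)\}/g, nested);

// The lookahead from the answer rejects a } that is directly followed by another }.
var withLookahead = extractAll(/\$\{(.+?)\}(?=[^}]+?(?:{|$))/g, nested);

console.log(greedy);        // [ 'one} ${two' ]
console.log(lazy);          // [ 'one', 'two' ]
console.log(lazyNested);    // [ 'started {{outage.begin' ]
console.log(withLookahead); // [ 'started {{outage.begin}}' ]
```

One caveat follows from the pattern itself: the lookahead needs at least one character after the closing bracket, so a flag that is the very last thing in the input would not be matched.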

Related

PHP saveHTML function is not saving HTML properly

I have been trying to save the source code of a section of a webpage using PHP. When I extract the content of whole webpage, the source code order is preserved but when I try to get part of the document using
$dom = new DOMDocument;
$dom->loadHTML($webpage);
$xpath = new DOMXPath($dom);
$query_tag = "//div[contains(@class, 'class-name')]";
$result = $dom->saveHTML($xpath->query($query_tag)->item(0));
The script tag gets messed up. Until now, this is the only website where this issue occurred. Are there some limitations of saveHTML function that I am not aware of?
This is what I should be receiving:
<div id="sponsored-category-header" class="page-header sponsored-category-header clear"> <script type="text/javascript">jQuery(document).ready(function($) {
var cat_head_params = {"sponsor":"SEO PowerSuite","sponsor_logo":"https:\/\/www.searchenginejournal.com\/wp-content\/plugins\/abm-sej\/includes\/category-images\/SPS_128.png","sponsor_text":"<div class=\"taxonomy-description\">Dominate Google local search results with ease! Get your copy of SEO PowerSuite and keep <a rel=\"nofollow\" href=\"http:\/\/sejr.nl\/PowerSuite-2016-5\" onClick=\"__gaTracker('send', 'event', 'Sponsored Category Click Var 1', 'Local Search', 'SEO PowerSuite');\" target=\"_blank\">your local SEO strategy<\/a> up to par.<\/div>","logo_url":"http:\/\/sejr.nl\/PowerSuite-2016-5","ga_labels":["Local Search","SEO PowerSuite"]}
$('#sponsored-category-header').append('<div class="sponsored-category-logo"></div>');
$('#sponsored-category-header .sponsored-category-logo').append(' <a rel="nofollow" href="'+cat_head_params.logo_url+'" onClick="__gaTracker(\'send\', \'event\', \'Sponsored Category Click Var 1\', \''+cat_head_params.ga_labels[0]+'\', \''+cat_head_params.ga_labels[0]+'\');" target="_blank"><img class="nopin" src="'+cat_head_params.sponsor_logo+'" width="96" height="96" /></a>');
$('#sponsored-category-header').append('<div class="sponsored-category-details"></div>');
$('#sponsored-category-header .sponsored-category-details').append('<h3 class="page-title sponsored-category-title">'+cat_head_params.sponsor+'</h3>');
$('#sponsored-category-header .sponsored-category-details').append(cat_head_params.sponsor_text);
});</script> </div>
This is what I actually get:
<div id="sponsored-category-header" class="page-header sponsored-category-header clear"> <script type="text/javascript">jQuery(document).ready(function($) {
var cat_head_params = {"sponsor":"SEO PowerSuite","sponsor_logo":"https:\/\/www.searchenginejournal.com\/wp-content\/plugins\/abm-sej\/includes\/category-images\/SPS_128.png","sponsor_text":"<div class=\"taxonomy-description\">Dominate Google local search results with ease! Get your copy of SEO PowerSuite and keep <a rel=\"nofollow\" href=\"http:\/\/sejr.nl\/PowerSuite-2016-5\" onClick=\"__gaTracker('send', 'event', 'Sponsored Category Click Var 1', 'Local Search', 'SEO PowerSuite');\" target=\"_blank\">your local SEO strategy<\/a> up to par.<\/div>","logo_url":"http:\/\/sejr.nl\/PowerSuite-2016-5","ga_labels":["Local Search","SEO PowerSuite"]}
$('#sponsored-category-header').append('<div class="sponsored-category-logo"></script>
</div>');
$('#sponsored-category-header .sponsored-category-logo').append(' <a rel="nofollow" href="'+cat_head_params.logo_url+'" onclick="__gaTracker(\'send\', \'event\', \'Sponsored Category Click Var 1\', \''+cat_head_params.ga_labels[0]+'\', \''+cat_head_params.ga_labels[0]+'\');" target="_blank"><img class="nopin" src="'+cat_head_params.sponsor_logo+'" width="96" height="96"></a>');
$('#sponsored-category-header').append('<div class="sponsored-category-details"></div>');
$('#sponsored-category-header .sponsored-category-details').append('<h3 class="page-title sponsored-category-title">'+cat_head_params.sponsor+'</h3>');
$('#sponsored-category-header .sponsored-category-details').append(cat_head_params.sponsor_text);
}); </div>
In case you missed it, the ending script tag has moved up a few lines.
Just to be clear, I am not talking about rendered HTML. I am talking about the actual source code that I get after making the request. Any help on how to resolve this issue will be appreciated.
I know that the function saveHTML is causing the issue because when I echo the whole page through PHP, every tag is in the right place.
First of all, your code should be triggering a good bunch of warnings like these:
Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity
Warning: DOMDocument::loadHTML(): Unexpected end tag : strong in Entity
Warning: DOMDocument::loadHTML(): Tag header invalid in Entity
This is to be expected with in-the-wild HTML (and this page's code is not particularly bad), but you haven't even mentioned it, which makes me suspect that you might not have error reporting enabled on your development box.
Additionally, the page has huge amounts of JavaScript and DOMDocument is just an HTML parser.
With that, we can get a clear picture of what's happening. Since DOMDocument is not a full-fledged browser, it doesn't understand JavaScript code. That means that it detects the <script> tag but doesn't handle its contents as JavaScript; it merely looks for a closing tag, and the first one it finds is this:
$('#sponsored-category-header').append('<div class="sponsored-category-logo"></div>');
^^^^^^
It doesn't know that it's a JavaScript string and should be ignored. Instead, it thinks the wrong tag is being closed so it attempts to fix what's technically invalid HTML and adds the missing </script> tag.
For this precise reason, the <script>...</script> tag set has traditionally been written this way:
<script type="text/javascript"><!--
var foo = '<p>Escaped end tag<\/p>';
//--></script>
... so user agents that are unaware of JavaScript can safely ignore the whole tag (hey, it's nothing but a good old HTML comment). However, nowadays it's almost universally considered bad practice because "all browsers understand JavaScript".
Final note: the DOM extension is probably aware of the <script> tag and knows it isn't allowed to have other tags inside. That explains why inner opening tags are not considered.
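The escaped end tag shown above can also be applied mechanically before markup is written into an inline script. The following is a minimal JavaScript sketch, not something from the original answer: escapeEndTags is a hypothetical helper, and it assumes that every </ it rewrites sits inside a JavaScript string literal, where \/ means the same as /.

```javascript
// Hypothetical helper: rewrite every "</" as "<\/" so that an HTML parser
// scanning for an end tag does not find one inside the script text.
// Assumption: the affected text lives inside JavaScript string literals.
function escapeEndTags(scriptSource) {
  return scriptSource.replace(/<\//g, '<\\/');
}

var original = "$('#header').append('<div class=\"logo\"></div>');";
var escaped = escapeEndTags(original);

console.log(escaped);
// $('#header').append('<div class="logo"><\/div>');
```

Script text without any literal </ should give a parser like the one described above nothing to misread. That is an inference from the behaviour in this answer, not a tested guarantee for every page.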

Angularjs slow to load templates

I have some code which doesn't really do much, yet it takes a long time to load. I have written the code down for you all to see:
app.js
var simple = "simple test";
angular.module('CRTapp', []).controller('ItemController', function() {
  // expose the value as item.simple, the name the template reads
  this.simple = simple;
});
index.html
<div id="item" ng-controller="ItemController as item">
{{item.simple}}
</div>
Sometimes people are having to wait nearly a second to see:
{{item.simple}}
before
simple test
appears but this is a very long time for some of you to have to wait. Waiting is ok for me but sometimes Jake gets impatient so I can make the HTML page load slowly if you like, but I do not want my Mr Stretchy to become sad when he sees a template before his own special website for his adventures in the Candy Kingdom.
This delay is the time the AngularJS library takes to load and parse your HTML. You can use ng-bind instead:
<div id="item"
ng-controller="ItemController as item"
ng-bind="item.simple">
</div>
This way, the raw {{item.simple}} expression never appears on the page while AngularJS loads its content.

Checking for HTML markup in an RSS feed and removing

Hi there, I've been writing an app that is a list of RSS feeds. I've been connecting my buttons to their object counterparts (each button shows a specific feed, e.g. the BBC button shows the BBC RSS feed, and the Guardian button shows Guardian news and hides BBC). However, the containers become quite skewed due to a Handlebars helper I'm incorporating to make the feeds look nice.
The helper allows me to shorten feed descriptions and end the shortened version with an ellipsis. This has caused an issue because one of the feeds has HTML within its description, meaning that after the cut at maxLength, HTML markup from the descriptions is still being added to the page. This gives my containers additional unwanted HTML elements.
I hope this is explanatory enough. The TL;DR is that HTML returned in RSS descriptions is adding additional unwanted HTML to my page. How can I fix this?
My helper method:
handlebars.registerHelper("rssDesc", function(results) {
  var maxLength = 164;
  //this checks for multiple descriptions and shows first
  if(Array.isArray(results)) {
    results = results[0];
  }
  //this checks to see if text contains html markup and converts it to text HOWEVER doesn't work yet. T_T
  if(results.indexOf("<") > -1) {
    results = $(results).text();
  }
  results = results.substring(0, maxLength);
  return results.substring(0, Math.min(results.length, results.lastIndexOf(" "))) + " ...";
});
handlebars template
<div class="rssButtons">
<a class="1">news 1</a>
<a class="2">news 2</a>
<a class="3">news 3</a>
<a class="4">news 4</a>
</div>
<div class="rssContainer">
{{#each items}}
<div class="tab" id="{{this.name}}">
{{#each data}}
<div class="news-item">
<span class="news-title">{{title}}</span>
<span class="news-publishing">{{publishingDate}}</span>
<span class="news-description">{{{rssDesc description}}}</span>
</div>
{{/each}}
</div>
{{/each}}
</div>
According to this post, the jQuery way is the easiest:
// retrieves all the text from a string of html.
jQuery(html).text();
According to the linked question (jQuery: $(element).text() doesn't work), .text() doesn't always play nicely if it isn't called first.
In the two hours since I moved the .text() call to the beginning of the helper (just under the var declaration), things have been great!
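The ordering described above (convert to text first, truncate second) can be sketched without jQuery. This is a rough illustration under stated assumptions: stripTags is a hypothetical, regex-based stand-in for $(html).text() that does not decode entities or parse HTML properly, and unlike the helper in the question this version only appends the ellipsis when the text was actually shortened.

```javascript
// Hypothetical stand-in for jQuery's $(html).text(): a crude regex strip,
// not a real HTML parser, and it does not decode entities.
function stripTags(html) {
  return html.replace(/<[^>]*>/g, '');
}

// Sketch of the helper body: strip markup FIRST, then shorten.
function rssDesc(results, maxLength) {
  // if there are multiple descriptions, use the first one
  if (Array.isArray(results)) {
    results = results[0];
  }
  var text = stripTags(String(results));
  if (text.length <= maxLength) {
    return text; // short enough, no ellipsis needed
  }
  var cut = text.substring(0, maxLength);
  var lastSpace = cut.lastIndexOf(' ');
  if (lastSpace > 0) {
    cut = cut.substring(0, lastSpace); // avoid ending in the middle of a word
  }
  return cut + ' ...';
}

console.log(rssDesc('<p>Hello <b>big</b> world of feeds</p>', 12)); // Hello big ...
```

Because the markup is gone before substring() runs, the cut can no longer land inside a tag and leave half-open elements on the page.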

Ajax Live Search Questions

So basically this has been asked many times, but I couldn't find an answer that fits my needs. I have found many sites with URLs like example.com/ajax/search?query=Thing
I have a header in the works and I currently use the W3Schools XML version, but it doesn't fit my needs at all, since I need it to search IMDB for whatever the user enters. Once they enter, for example, 'The Simpsons', it should pop up all search results by name, each one a clickable link to the IMDB page, for example http://www.imdb.com/title/tt0096697/, but with imdb.com in that URL replaced by my website's URL (to make it responsive in a way).
But I need it to use AJAX/jQuery so that it searches IMDB, so this XML file method won't work.
How are the sites with /ajax/search doing this type of IMDB search, which is used a lot on torrenting sites lately?
This is where I got my current code for the search from: Live search with PHP AJAX and XML
But as I said, it needs to run with Ajax, have live search, and scrape/search IMDB in some way, and then change imdb.com to mysite.com
Update:
I managed to find something like this:
http://pastebin.com/PAD5AXUK
And this is the HTML:
<div class="main-nav-links hidden-sm hidden-xs">
<form method="GET" action="http://www.imdb.com/find" accept-charset="UTF-8" id="quick-search" name="quick-search">
<div id="quick-search-container">
<input id="quick-search-input" name="query" autocomplete="off" value="Quick search" type="search">
<div style="background-position: -160px 0px;" class="ajax-spinner"></div>
</div>
</form>
<ul class="nav-links">
<li> Home
</li>
<li> Browse
</li>
</ul>
<ul class="nav-links nav-link-guest">
<li> <a class="login-nav-btn" href="javascript:void(0)"> Login </a> | <a class="register-nav-btn" href="javascript:void(0)"> Register </a>
</li>
</ul>
</div>
But it still doesn't seem to work at all.
You can check this out:
https://twitter.github.io/typeahead.js/
It should solve the problem.
As for searching IMDB, you're going to have to use something like the PHP cURL extension to fetch the contents from the website and parse them with an HTML parser library.
For the live searching, you can change the URL in JavaScript with the pushState() function, as in the following answer: Changing the url with javascript. Then you can use some jQuery to make it easier to send Ajax GET requests to YOUR OWN server, where the PHP script will process the request to IMDB, with something like:
$.get("/ajax?query=TMNT", function(data) { //handle the data in here } );
Then in that callback function you can update your page with the contents of the result. You could even do some of the processing locally, such as changing the URL of the IMDB link.
The process would resemble the following chart:
User Enters Query (TMNT) ->
Ajax sends data to my own backend page (process.php) ->
process.php scrapes IMDB search query and parses it with html parser ->
outputs results which return to ajax function ->
ajax callback function places results in a DOM element
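The local processing step mentioned above, swapping imdb.com for your own host, could look roughly like this. It is a sketch under assumptions: rewriteImdbUrl and toLinks are hypothetical helper names, and the { title, url } shape of the data returned by process.php is invented for illustration.

```javascript
// Hypothetical helper: point an IMDB link at our own site instead.
// The scheme and path are kept; only the host is swapped.
function rewriteImdbUrl(url, ownHost) {
  return url.replace(/^(https?:\/\/)(?:www\.)?imdb\.com(?=\/|$)/i, function (match, scheme) {
    return scheme + ownHost;
  });
}

// Assumed shape of what process.php returns: [{ title: ..., url: ... }, ...]
function toLinks(results, ownHost) {
  return results.map(function (r) {
    return { title: r.title, url: rewriteImdbUrl(r.url, ownHost) };
  });
}

var links = toLinks(
  [{ title: 'The Simpsons', url: 'http://www.imdb.com/title/tt0096697/' }],
  'mysite.com'
);
console.log(links[0].url); // http://mysite.com/title/tt0096697/
```

Inside the $.get callback, the data argument would be passed through toLinks before the results are placed in the DOM element. URLs that do not point at imdb.com are returned unchanged.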

Unknown DOM element with value "/n" in asp.net website

I have an ASP.NET website. Sometimes, an unknown DOM element with value "/n" appears in the source. Inspecting with Firebug shows that the HTML code is . Of course, I never added this code myself. It makes a long distance between two elements. Is there any way to prevent this?
Here is HTML:
<div id="ctl05_pnWareHouse">
<div class="detail_content_right_top">
<div class="detail_content_top_left">
<p class="name_content">
...
</p>
</div>
</div>
</div>
Building on bfavaretto's comment:
An invisible Unicode character may have sneaked into your code during a cut & paste. If this happened, your server-side source code may look fine, but ASP.NET is noticing a character you can't see, then encoding it as HTML.
As for how to fix it, try this:
1) Open the server-side code in your editor.
2) Manually highlight everything from the > in <div id="ctl05_pnWareHouse"> to the < at the beginning of <div class="detail_content_right_top">
3) Replace the characters there manually; i.e. type >, then enter, then <.
See if that solves your problem.
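If retyping the characters by hand is impractical, the source text can also be scanned for characters that are easy to miss in an editor. The following JavaScript sketch is not from the original answer: findSuspectChars is a hypothetical helper, and the list of characters it treats as suspicious is an assumption rather than an exhaustive definition of "invisible".

```javascript
// Assumed list of hard-to-see characters: control characters (except tab,
// LF and CR), no-break space, soft hyphen, zero-width characters,
// line/paragraph separators, word joiner and the byte order mark.
var SUSPECT = /[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F\u00A0\u00AD\u200B-\u200F\u2028\u2029\u2060\uFEFF]/g;

// Hypothetical helper: report the position and code point of each hit.
function findSuspectChars(source) {
  var hits = [];
  var m;
  SUSPECT.lastIndex = 0;
  while ((m = SUSPECT.exec(source)) !== null) {
    var hex = m[0].charCodeAt(0).toString(16).toUpperCase();
    hits.push({ index: m.index, codePoint: 'U+' + ('0000' + hex).slice(-4) });
  }
  return hits;
}

// Sample input with a zero-width space planted between the two tags.
var markup = '<div id="ctl05_pnWareHouse">\u200B\n<div class="detail_content_right_top">';
console.log(findSuspectChars(markup)); // [ { index: 28, codePoint: 'U+200B' } ]
```

An empty result for the region between the two div tags would suggest the stray element comes from somewhere other than a pasted invisible character.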
