Parsing in JavaScript for a Chrome extension - javascript

I have a Chrome extension that extracts all short URLs of the form ini/ini#8012 from any page using a regex.
var regex = /\w+\/\w+#(?:\d*\.)?\d+/g; // escape the separator as \/ ; a bare "." would match any character
What I want to do is turn each short URL into a clickable link in my popup window and pass it to my web app, so clicking any short URL in the list takes you to the web app. The web app URL looks like this:
http://192.101.21.1889:8000/links/?user_repo=ini%2Fini&artID=8012&tc=4&tm=years&rows=5&submit=
The user_repo and artID values come from the extracted short URL. First of all, is this possible? And if it is, can anyone point me in the right direction as to what to do?

You can use Content Scripts to inject JavaScript into the page, and then do whatever you need from there.
Readings:
Content Scripts
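So yes, this is possible. Here is a minimal sketch of how the pieces could fit together, assuming a <ul id="links"> element in popup.html; the host and parameter names come from the question's web-app URL, and everything else is illustrative rather than a finished extension:
// content.js - runs in the page; replies with all short URLs it can find
var shortUrlRegex = /\w+\/\w+#\d+/g;
chrome.runtime.onMessage.addListener(function (msg, sender, sendResponse) {
  if (msg.action === 'getShortUrls') {
    sendResponse({ shortUrls: document.body.innerText.match(shortUrlRegex) || [] });
  }
});

// popup.js - asks the current tab for matches and builds clickable links
chrome.tabs.query({ active: true, currentWindow: true }, function (tabs) {
  chrome.tabs.sendMessage(tabs[0].id, { action: 'getShortUrls' }, function (response) {
    var list = document.getElementById('links'); // assumed <ul id="links"> in popup.html
    response.shortUrls.forEach(function (shortUrl) {
      var parts = shortUrl.split('#'); // "ini/ini#8012" -> ["ini/ini", "8012"]
      var href = 'http://192.101.21.1889:8000/links/' +
        '?user_repo=' + encodeURIComponent(parts[0]) +
        '&artID=' + encodeURIComponent(parts[1]) +
        '&tc=4&tm=years&rows=5&submit=';
      var a = document.createElement('a');
      a.href = href;
      a.textContent = shortUrl;
      a.addEventListener('click', function (e) {
        e.preventDefault();
        chrome.tabs.create({ url: href }); // open the web app in a new tab
      });
      var li = document.createElement('li');
      li.appendChild(a);
      list.appendChild(li);
    });
  });
});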

Related

Crawl a webpage and grab all dynamic JavaScript links

I am using C# to crawl a website. All works fine except that it can't detect dynamic JS links. As an example, a page with over 100 products may span several pages, and the "Next Page"/"Prev Page" links may be dynamic JS URLs that are generated on click. Typical JS code is below:
<a href="javascript:PageURL('
cf-233--televisions.aspx','?',2);">></a>
Is there any way of getting the actual link behind the above href while collecting URLs on the page?
I am using Html Agility Pack but am open to any other technology. I have googled this many times but found no solution yet.
Thanks.
Have you tried evaluating the JavaScript to get the actual hrefs? This might be helpful: Parsing HTML to get script variable value
Or maybe you should check what the PageURL function does (just open the website in a browser and type PageURL, without parentheses, into its console; it will show you the source of the function) and rewrite it in C#.
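To make that concrete: a paging helper like this usually just assembles a relative URL and navigates to it. The body below is purely hypothetical (the real function has to be read from the site's own source via the console, as described above), but it shows the kind of logic you would be re-implementing in C#:
// Hypothetical reconstruction of a typical PageURL helper - the real one
// must be read from the target site's console, as suggested above.
function PageURL(page, separator, pageNumber) {
  // e.g. PageURL('cf-233--televisions.aspx', '?', 2)
  //   -> navigates to "cf-233--televisions.aspx?page=2"
  window.location.href = page + separator + 'page=' + pageNumber;
}
Once you know the pattern, the crawler can compute the same URLs directly from the parameters in each href instead of executing any JavaScript.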
AbotX allows you to render the JavaScript on the page. It's a powerful web crawler with advanced features.

How can history.pushState allow a single point of entry for nested URL paths

Sorry for the badly worded question.
PHP with Apache uses index.php/index.html for directory URLs, like:
localhost = localhost/index.php, or localhost/place = localhost/place/index.php
If I start with:
localhost/place
and I use JavaScript's history.pushState to update the URL to a longer address like:
localhost/place/subplace
then if I enter that URL in the browser I'll go to localhost/place/subplace/index.php, when I really wanted localhost/place/index.php so that it can be the single point of entry.
I'm using simple JavaScript (window.location.pathname or anchorNode.pathname) to retrieve the URL path for use with AJAX. This is used by a simple router, similar to Backbone.js, to update the page. The JavaScript routing works, and back/forward in the browser works. If only I could get it to work with a single point of entry for URLs entered in the address bar.
To sum up:
I want a single point of entry for my PHP app to handle all subdirectories.
At the single point of entry I want to run JavaScript to acquire the path and use it to route the page with AJAX.
I'm using history.pushState to update the URL, but that breaks the single point of entry when the path is deeper than the main directory. Basically, I get a 404 page.
Right now I'm not too concerned with backward compatibility for browsers that don't have history.pushState. I just want this one thing to work.
As an addendum, I would prefer working with regular paths in JavaScript rather than the query string, whether the page is loaded from the address bar or via history.pushState. I don't know if this can be handled with an Apache rewrite or what.
Similar questions:
How to cope with refreshing page with JS History API pushState
OK, I was making things too hard.
To get the routing to work on page load I need to do two things:
In the .htaccess file I can use an Apache rewrite to route all URLs to index.php?path=first/second/third.
When the page loads, just hand the path from the query string to the JavaScript that handles the route.
The JavaScript is still being used, and there's no duplication of functionality. Everything is good.
This also kind of answers this: How to cope with refreshing page with JS History API pushState
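For concreteness, here is a minimal sketch of the client side, assuming an .htaccess rule along the lines of RewriteRule ^(.*)$ index.php?path=$1 [QSA,L]; the route() function stands in for the Backbone-style router from the question:
// On load, recover the path that the Apache rewrite stashed in the query string.
function pathFromQuery() {
  var match = /[?&]path=([^&]*)/.exec(window.location.search);
  return match ? decodeURIComponent(match[1]) : '';
}

// route() is a stand-in for whatever router updates the page with AJAX.
function route(path) {
  console.log('routing to:', path);
}

// Initial entry from the address bar: e.g. /place/subplace was rewritten
// to index.php?path=place/subplace, so route from the query string.
route(pathFromQuery());

// Later navigation: push a clean path and route without reloading.
function navigate(path) {
  history.pushState(null, '', '/' + path);
  route(path);
}

// Back/forward buttons: re-route from the clean pathname.
window.onpopstate = function () {
  route(window.location.pathname.replace(/^\//, ''));
};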

What's an alternative for making dynamic links (other than JavaScript) so they are google-crawlable? [duplicate]

This question already has answers here:
How is Github Changing Pages and the URL so smoothly without AJAX?
(2 answers)
Closed 9 years ago.
I have a page wherein some links are built when the page loads, using a JavaScript function that makes the links depending on the current URL of the page. On click, an AJAX call loads the new page.
How do I make these links google-crawlable (since Google doesn't crawl JavaScript links)?
As an example I'd like to mention GitHub. When you open, say, https://github.com/pennersr/django-allauth, all the links inside are already populated with their respective URLs, depending on the current URL. When you view the source, you can see the links there; had the links been made through JavaScript, you would just see a JavaScript function in the source. I don't think these values are being passed from the back-end either.
What's a possible solution to do this?
This is a common issue in Single Page Applications and other applications that make intensive use of JavaScript and AJAX.
First of all, you need to create a unique URL for each of these actions in JavaScript, so the crawler can at least "hit" them. If you execute a function in JavaScript but your URL doesn't change, Google will never know that anything is happening there.
Normally AJAX URLs are written like this:
http://www.foo.com#!jsAction
Google's crawler will be able to crawl this URL, but the page it gets back will probably be blank, since the JavaScript code is responsible for rendering all the content.
This is why the crawler replaces the '#!' with _escaped_fragment_ when calling your server. So the previous URL, as requested by the crawler, would be:
http://www.foo.com?_escaped_fragment_=jsAction
With this keyword in the URL, the server can determine that the request comes from a crawler, and this is where the magic starts.
Using a headless browser like PhantomJS, we can execute the JavaScript code on the server and return the fully rendered HTML in response to the crawler's request. This is one of the approaches that Google suggests in its guidelines.
So basically the point is to detect which type of request you are getting and execute different code depending on whether the query string contains _escaped_fragment_.
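As a rough illustration, the server-side branch could look like the sketch below (plain Node.js here for concreteness; the prerender() helper that drives PhantomJS is assumed, not shown):
// Minimal sketch: serve crawlers rendered HTML, browsers the normal page.
var http = require('http');
var url = require('url');
var fs = require('fs');

http.createServer(function (req, res) {
  var query = url.parse(req.url, true).query;
  if (query._escaped_fragment_ !== undefined) {
    // Crawler request: prerender() (assumed, not shown) runs the JS in a
    // headless browser and resolves to the fully rendered HTML.
    prerender(query._escaped_fragment_).then(function (html) {
      res.writeHead(200, { 'Content-Type': 'text/html' });
      res.end(html);
    });
  } else {
    // Normal browser: serve the regular page; client-side JS renders it.
    res.writeHead(200, { 'Content-Type': 'text/html' });
    res.end(fs.readFileSync('index.html'));
  }
}).listen(8080);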
This link from Google might help point you in the right direction: https://developers.google.com/webmasters/ajax-crawling/
Hope it helps!

Can I make a Javascript button force the browser to ignore cache?

I know there is a lot about this, but I can't find a solution that fits my situation. I am following behind someone else's ASP.NET code. We have a large number of HTML and XML files generated by our site that a user can see. In one place, the dynamically generated link that loads one of these pages is actually in a miniature form, making the browser think data is being submitted and so look for something 'new'. But the other is a button whose link is generated in the VB code-behind, using a JavaScript function to open the page in a new window. I have tried simulating a form submit by appending "?submit=....." but it didn't work.
tl;dr: What JavaScript function can open a page and tell the browser to get the newest version, ignoring the cache?
In JavaScript, I think the only way to prevent caching is to modify the URL. One trick is to use the current date as a timestamp:
url = url + "?_ts="+new Date().getTime();
(of course, if your URL already includes a query string, then replace the ? with an &)
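Putting both points together, a small helper (the name openFresh is just for illustration) can handle the ?/& cases and open the page in a new window, which is what the question asks for:
// Open a URL in a new window, appending a timestamp so the browser
// treats it as a fresh resource instead of serving it from cache.
function openFresh(url) {
  var sep = url.indexOf('?') === -1 ? '?' : '&'; // handle existing query strings
  window.open(url + sep + '_ts=' + new Date().getTime());
}

// Usage from a button:
// <button onclick="openFresh('/reports/output.html')">View newest</button>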
There is no JavaScript to do what you need. If you need a fresh version, the easiest way is to add a timestamp to the URL's query string.
<script type="text/javascript" src="/js/file?cacheBuster=<?= DateTime.Now ?>" > </script>
For better control, you can use a build version as your cacheBuster param so you don't have to request new files every time.

How to read text from a website using JavaScript?

I am writing HTML code for my cell phone. It's essentially a dictionary app where I input a word and look up its meaning from, say, Dictionary.com.
I am using JavaScript to get the string, but can I embed the word into the URL "http://dictionary.reference.com/browse/"?
Is there any way of doing this?
You can embed the word into the URL by doing this:
var lookup_url = "http://dictionary.reference.com/browse/" + encodeURIComponent(your_word); // encode so spaces and special characters survive in the URL
From there, it depends on how you want to load the website. Unfortunately, there is no way to query a remote website directly from JavaScript. You're going to have to use some other tool at your disposal. Until then, maybe you can just do window.location = lookup_url?
Due to same-origin policy restrictions, this is not possible with pure JavaScript. If the third-party web site provides an API that supports JSONP, then it could work, but not in the general case.
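To illustrate the JSONP escape hatch: the endpoint, the callback parameter name, and the response shape below are all hypothetical, standing in for whatever API the dictionary site might actually offer:
// Minimal JSONP sketch - script tags are exempt from the same-origin
// policy, so the server wraps its JSON in a call to a global function.
function lookupDefinition(word) {
  var script = document.createElement('script');
  script.src = 'https://api.example.com/define' +   // hypothetical endpoint
    '?word=' + encodeURIComponent(word) +
    '&callback=showDefinition';
  document.body.appendChild(script);
}

// The server responds with: showDefinition({"word": "...", "definition": "..."})
function showDefinition(data) {
  alert(data.word + ': ' + data.definition);
}

lookupDefinition('serendipity');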
