URL extraction from string - javascript

I found a regular expression that is supposed to capture URLs, but it misses some of them.
$("#links").change(function() {
//var matches = new array();
var linksStr = $("#links").val();
var pattern = new RegExp("^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$","g");
var matches = linksStr.match(pattern);
for(var i = 0; i < matches.length; i++) {
alert(matches[i]);
}
})
It doesn't capture this url (I need it to):
http://www.wupload.com/file/63075291/LlMlTL355-EN6-SU8S.rar
But it captures this
http://www.wupload.com

Several things:
1. The main reason it didn't work is that when you pass a string to RegExp(), you need to escape the backslashes. So this:
"^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$"
should be:
"^(https?:\/\/)?([\\da-z\\.-]+)\\.([a-z\\.]{2,6})([\/\\w \\.-]*)*\/?$"
2. Next, you said that FF reported "Regular expression too complex". This suggests that linksStr is several lines of URL candidates. Therefore, you also need to pass the m flag to RegExp().
3. The existing regex is blocking legitimate values, e.g. "HTTP://STACKOVERFLOW.COM". So, also use the i flag with RegExp().
4. Whitespace always creeps in, especially in multiline values. Use a leading \s* and $.trim() to deal with it.
5. Relative links, e.g. /file/63075291/LlMlTL355-EN6-SU8S.rar, are not allowed?
Putting it all together (except for item 5), it becomes:
var linksStr = "http://www.wupload.com/file/63075291/LlMlTL355-EN6-SU8S.rar \n"
+ " http://XXXupload.co.uk/fun.exe \n "
+ " WWW.Yupload.mil ";
var pattern = new RegExp (
"^\\s*(https?:\/\/)?([\\da-z\\.-]+)\\.([a-z\\.]{2,6})([\/\\w \\.-]*)*\/?$"
, "img"
);
var matches = linksStr.match(pattern);
for (var J = 0, L = matches.length; J < L; J++) {
console.log ( $.trim (matches[J]) );
}
Which yields:
http://www.wupload.com/file/63075291/LlMlTL355-EN6-SU8S.rar
http://XXXupload.co.uk/fun.exe
WWW.Yupload.mil
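The escaping point above can be checked in isolation. The following is a minimal sketch using a simplified, hypothetical pattern (not the one from the question) to show what RegExp() actually receives when the backslashes are or are not doubled:

```javascript
// Simplified, hypothetical pattern used only to show the escaping behaviour.
const unescaped = new RegExp("^[\da-z]+$"); // "\d" collapses to "d" inside a string literal
const escaped = new RegExp("^[\\da-z]+$");  // "\\d" reaches RegExp() as \d
const literal = /^[\da-z]+$/;               // a regex literal needs no doubling

console.log(unescaped.source);       // ^[da-z]+$
console.log(escaped.source);         // ^[\da-z]+$
console.log(unescaped.test("2011")); // false: digits are no longer in the class
console.log(escaped.test("2011"));   // true
console.log(literal.test("2011"));   // true
```

If the pattern does not need to be built from a string, a regex literal avoids the double escaping entirely.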

Why not just use:
URLS = str.match(/https?:[^\s]+/ig);
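As a rough sketch of how that one-liner might be wrapped, the hypothetical helper below (extractUrls and the sample text are assumptions, not part of the original answer) also guards against match() returning null when nothing is found, which would otherwise break a loop over matches.length:

```javascript
// Hypothetical helper around the pattern from the answer above.
function extractUrls(text) {
  // match() with the g flag returns null when there is no match,
  // so fall back to an empty array.
  return text.match(/https?:[^\s]+/ig) || [];
}

const sample = "get http://www.wupload.com/file/63075291/LlMlTL355-EN6-SU8S.rar " +
  "or HTTPS://example.com/a?b=c today";

console.log(extractUrls(sample));
console.log(extractUrls("no links here")); // []
```

Because the pattern stops only at whitespace, punctuation that directly follows a URL in running text (a trailing comma or period) would be included in the match.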

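(https?\:\/\/)([a-z\/\.0-9A-Z_\%\&\=-]*)
This will locate http and https URLs in text, provided they contain only the characters listed in the class. (The hyphen has to sit at the end of the class; written as _-\% it is read as an out-of-order range and the pattern fails to compile.)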

Related

Javascript regex invalid quantifier error to find 8 digit number in PDF

I have the following javascript code:
/* Extract pages to folder */
// Regular expression used to acquire the base name of file
var re = /\.pdf$/i;
// filename is the base name of the file Acrobat is working on
var filename = this.documentFileName.replace(re,"");
try {for (var i = 0; i < this.numPages; i++)
var id = /\ (?<!\d)\d{8}(?!\d)/;
console.println(id);
this.extractPages({
nStart: i,
cPath: "/J/my file path/" + "SBIC_" + id + ".pdf"
});
} catch (e) { console.println("Aborted: " + e) }
I get an error saying that the quantifier is invalid on this line of code: var id = /\ (?<!\d)\d{8}(?!\d)/
However, this line of regex pulls the id 22001188 when I use it in https://regex101.com/ to find the 8 digit number in "I.D. Control 22001188".
Do I have to integrate the regex a different way in the code for it to search through the text in the document?
UPDATED 1/30/2023
I am using the below REGEX in the code to find the 8 digit ID I need. First, I put all the PDFs text into a string and then I use a search query to find it. Now I just need to figure out how to add the result into a variable so I can extract each page in the PDF by ID.
/* Extract pages to folder */
// function padLeft(s,len,c){c=c || '0'; while(s.length< len) s= c+s; return s; }
// Regular expression used to acquire the base name of file
var re = /\.pdf$/i;
// filename is the base name of the file Acrobat is working on
var filename = this.documentFileName.replace(re,"");
for (var i = 0; i < this.numPages; i++) { // Loop through the entire document
numWords = this.getPageNumWords(i); // Find out how many words are on the page
var WordString = ""; // Prepare a string
for (var j = 0; j < numWords; j++) // Put all the words on the page into a string
{
WordString = WordString + " " + this.getPageNthWord(i, j);
}
if (WordString.match(/\b\d{8}\b/)) { // Search for an 8-digit ID in the string
search.matchWholeWord = true; // If we got here, we'll search for that ID in the document
search.query(WordString.match(/\b\d{8}\b/), "ActiveDoc");
}
}
UPDATED 2/2/2023
Below is the working code used to extract every page from the pdf and then name it the 8 digit ID found within the text of the pdf.
// Regular expression used to acquire the base name of file
var re = /\.pdf$/i;
// filename is the base name of the file Acrobat is working on
var filename = this.documentFileName.replace(re,"");
for (var i = 0; i < this.numPages; i++) { // Loop through the entire document
numWords = this.getPageNumWords(i); // Find out how many words are on the page
var WordString = ""; // Prepare a string
for (var j = 0; j < numWords; j++) // Put all the words on the page into a string
{WordString = WordString + " " + this.getPageNthWord(i, j);}
ID = WordString.match(/\b\d{8}\b/); // Search for the ID control # in the string
this.extractPages({
nStart: i,
cPath: "/J/Middle Office Read/Operational Support/SBA Spreadsheets & Forms/Funded SBAs/" + "SBIC_" + ID + ".pdf"
});
}
The sequence ?<! is a negative look-behind sequence which is not yet supported by all the browsers/systems.
It seems that it is not supported in your case.
You may use word boundaries in regex as given below to extract 8-digit numbers from your string:
\b\d{8}\b
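A minimal sketch of the word-boundary pattern in plain JavaScript follows. The helper name findEightDigitId is an assumption for illustration; the sample text is the one from the question.

```javascript
// Hypothetical helper: returns the first standalone 8-digit number, or null.
function findEightDigitId(text) {
  const match = text.match(/\b\d{8}\b/);
  // Without the g flag, match() returns an array (full match at index 0) or null.
  return match ? match[0] : null;
}

console.log(findEightDigitId("I.D. Control 22001188")); // 22001188
console.log(findEightDigitId("Ref 123456789"));         // null: 9 digits, not 8
console.log(findEightDigitId("no id on this page"));    // null
```

Checking for null before building a file name avoids ending up with a name such as "SBIC_null.pdf" on pages that contain no ID.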
The (?<!\d) lookbehind is probably the problem: lookbehind is only supported in some regex engines. (The (?!\d) lookahead has long been supported in JavaScript.)
You can instead use ^\d{8}$ to match a line that consists of exactly 8 digits, or \b\d{8}\b to match 8 digits surrounded by word boundaries, as suggested in ayush-s's answer.

Javascript: Get first number substring for each semi-colon separated substring

I am creating a script of time calculation from MySQL as I don't want to load the scripts on server-side with PHP.
I am getting the data and parsing it using JSON, which gives me a string of values for column and row data. The format of this data looks like:
1548145153,1548145165,End,Day;1548145209,1548145215,End,Day;1548148072,1548148086,End,Day;1548161279,1548161294,End,Day;1548145161,1548145163,End,Day;1548148082,1548148083,End,Day;1548161291,1548161293,End,Day
I need to split this string by semi-colon, and then extract the first VARCHAR number from before each comma to use that in subsequent calculation.
So for example, I would like to extract the following from the data above:
[1548145153, 1548145209, 1548148072, 1548161279, 1548145161, 1548148082, 1548161291]
I used the following kind of for-loop, but it is not working the way I want:
for (var i=0; i < words.length; i++) {
var1 = words[i];
console.log(var1);
}
The string and the for-loop together are like following:
var processData = function(data) {
for(var a = 0; a < data.length; a++) {
var obj = data[a];
var str= obj.report // something like 1548145153,1548145165,End,Day;1548145209,1548145215,End,Day;1548148072,1548148086,End,Day;1548161279,1548161294,End,Day;1548145161,1548145163,End,Day;1548148082,1548148083,End,Day;1548161291,1548161293,End,Day
words = str.split(',');
words = str.split(';');
for (var i=0; i < words.length; i++) {
var1 = words[i];
var2 = var1[0];
console.log(var2);
}
}
};
Here is an approach based on a regular expression:
const str = "1548145153,1548145165,End,Day;1548145209,1548145215,End,Day;1548148072,1548148086,End,Day;1548161279,1548161294,End,Day;1548145161,1548145163,End,Day;1548148082,1548148083,End,Day;1548161291,1548161293,End,Day";
const ids = str.match(/(?<=;)(\d+)|(^\d+(?=,))/gi)
console.log(ids)
The general idea here is to classify the first VARCHAR value as either:
1. a number sequence directly preceded by a ; character or, for the edge case,
2. the very first number sequence of the input string, directly followed by a , character.
These two cases are expressed as follows:
1. Match any number sequence that is preceded by a ; using the positive lookbehind rule (?<=;)(\d+), where ; is the character that must directly precede the number sequence \d+ for it to be a match.
2. Match any number sequence that is the first number sequence of the input string and that has a , directly following it, using the lookahead rule (^\d+(?=,)), where \d+ is the number sequence and , is the character that must directly follow that number sequence for it to be a match.
These building blocks 1 and 2 are combined using the | operator to achieve the final result.
The first problem is that you overwrite words with the result of str.split(';'), so it won't hold what you expect. To split the string into chunks, split by ; first, then iterate over the resulting array and, within the loop, split by ,.
const str= "1548145153,1548145165,End,Day;1548145209,1548145215,End,Day;1548148072,1548148086,End,Day;1548161279,1548161294,End,Day;1548145161,1548145163,End,Day;1548148082,1548148083,End,Day;1548161291,1548161293,End,Day";
const lines = str.split(';');
lines.forEach(line => {
const parts = line.split(',');
console.log(parts[0]);
});
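To get the values as an array of numbers, as in the expected output from the question, the same two-step split can be combined with map. This is a sketch only: the names report and firstNumbers are assumptions, and the sample is shortened to three records.

```javascript
// Shortened sample in the same format as the question's data.
const report = "1548145153,1548145165,End,Day;1548145209,1548145215,End,Day;1548148072,1548148086,End,Day";

const firstNumbers = report
  .split(';')                                   // one entry per record
  .filter(record => record.length > 0)          // skip empty records, e.g. after a trailing ;
  .map(record => Number(record.split(',')[0])); // first field of each record, as a number

console.log(firstNumbers); // [ 1548145153, 1548145209, 1548148072 ]
```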
What you are doing is not correct, you'll have to separate strings twice as there are two separators. i.e. a comma and a semicolon.
I think you need a nested loop for that.
var str = "1548145153,1548145165,End,Day;1548145209,1548145215,End,Day;1548148072,1548148086,End,Day;1548161279,1548161294,End,Day;1548145161,1548145163,End,Day;1548148082,1548148083,End,Day;1548161291,1548161293,End,Day"
let words = str.split(';');
for (var i=0; i < words.length; i++) {
let varChars = words[i].split(',');
for (var j=0; j < varChars.length; j++)
console.log(varChars[j]);
}
I hope this helps. Please don't forget to mark the answer.

Javascript respecting backslashes in input: negative lookbehind

In Javascript, I have a situation where I get input which I .split(/[ \n\t]/g) into an array. The point is that if a space is directly preceded by a backslash, I don't want the split to happen there.
E.g. is_multiply___spaced_text -> ['is','multiply','','','spaced','text']
But: is\_multiply\___spaced_text -> ['is multiply ','','spaced','text']
(Underscores used for spaces for clarity)
If this wasn't Javascript (which doesn't support lookbehinds in regex'es), I'd just use /(?<!\\)[ \n\t]/g. That doesn't work, so what would be the best way to handle this?
You can reverse the string, split using a negative lookahead, and then reverse each string in the resulting array:
var pre_results = "is\\ multiply\\   spaced text".split('').reverse().join('').split(/[ \t](?!\\)/);
var results = [];
for(var i = 0; i < pre_results.length; i++) {
results.push(pre_results[i].split('').reverse().join(''));
}
for(var i = 0; i < results.length; i++) {
document.write(results[i] + "<br>");
}
In this example, the result is as follows (the pieces come out in reverse order; call results.reverse() to restore the original order):
['text', 'spaced', '', 'is\\ multiply\\ ']
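Another option that needs neither lookbehind nor string reversal is a small hand-written scanner. The sketch below is an assumption-laden illustration (the function name splitUnescaped is hypothetical): it treats a backslash as escaping whatever character follows it, which differs slightly from the lookbehind version when a backslash is itself escaped.

```javascript
// Hypothetical helper: split on space, newline or tab unless the
// separator is escaped by a preceding backslash.
function splitUnescaped(input) {
  const parts = [];
  let current = '';
  for (let i = 0; i < input.length; i++) {
    const ch = input[i];
    if (ch === '\\' && i + 1 < input.length) {
      // Keep the backslash and the escaped character together.
      current += ch + input[i + 1];
      i++;
    } else if (ch === ' ' || ch === '\n' || ch === '\t') {
      // Unescaped separator: close the current piece (it may be empty).
      parts.push(current);
      current = '';
    } else {
      current += ch;
    }
  }
  parts.push(current);
  return parts;
}

console.log(splitUnescaped("is\\ multiply\\   spaced text"));
// [ 'is\\ multiply\\ ', '', 'spaced', 'text' ]
```

The backslashes are kept in the output; stripping them afterwards is a separate step if the unescaped text is wanted.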
"is\_multiply\___spaced_text".replace(/\_/, " ").replace(/_/, " ").split("_");

Yet Another document.referrer.pathname Thing

I'm looking for the equivalent of "document.referrer.pathname". I know there are other questions that are similar to this on SO, but none of them handle all the use cases. For example:
http://example.com/RESULT
http://example.com/RESULT/
http://example.com/RESULT?query=string
All of these examples should return:
RESULT
And this one:
https://example.com/EXTENDED/RESULT/
should return:
EXTENDED/RESULT
Some folks may want the trailing slash included, but I don't because I'm matching against a list of referrers.
I've started with:
document.referrer.match(/:\/\/.*\/(.*)/)[1]
and am struggling to add the query string parsing.
Thanks!
If you have URLs as strings you can create empty anchors and give them the url as href to access the pathname:
var url = 'http://example.com/RESULT?query=string', // or document.referrer
a = document.createElement('a');
a.href = url;
var result = a.pathname.replace(/(^\/|\/$)/g,'');
I set up a test example for you here: http://jsfiddle.net/eWydy/
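Where the standard URL constructor is available (current browsers and Node.js), the same idea can be sketched without creating an element. The function name referrerPath is an assumption; the try/catch covers an empty document.referrer or a malformed string, for which new URL() throws.

```javascript
// Hypothetical helper: pathname of an absolute URL without leading/trailing slashes.
function referrerPath(url) {
  let parsed;
  try {
    parsed = new URL(url); // throws on empty or malformed input
  } catch (e) {
    return '';
  }
  // pathname excludes the query string and hash; trim slashes at both ends.
  return parsed.pathname.replace(/^\/+|\/+$/g, '');
}

console.log(referrerPath('http://example.com/RESULT'));              // RESULT
console.log(referrerPath('http://example.com/RESULT/'));             // RESULT
console.log(referrerPath('http://example.com/RESULT?query=string')); // RESULT
console.log(referrerPath('https://example.com/EXTENDED/RESULT/'));   // EXTENDED/RESULT
```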
Try this regular expression:
.match(/\/\/.*?\/(.*?)\/?(\?.*)?$/)[1]
If you don't want to create a new element for it or rely on a.pathname, I'd suggest using indexOf and slice.
function getPath(s) {
var i = s.indexOf('://') + 3, j;
i = s.indexOf('/',i) + 1; // find first / (ie. after .com) and start at the next char
if( i === 0 ) return '';
j = s.indexOf('?',i); // find first ? after first / (as before doesn't matter anyway)
if( j == -1 ) j = s.length; // if no ?, use until end of string
while( s[j-1] === '/' ) j = j - 1; // get rid of ending /s
return s.slice(i, j); // return what we've ended up at
}
getPath(document.referrer);
If you want a regex, though, maybe this:
document.referrer.match(/:\/\/[^\/]+[\/]+([^\?]*?)[\/]*(?:\?.*)?$/)[1]
which does "find the first ://, keep going until the next /, then lazily capture everything that isn't a ? up to any trailing /s, a ?, or the end of the string", which is basically the same as the function I did above. (The capture has to be lazy; a greedy [^\?]* would swallow the trailing / as well.)

Javascript Regex: Get everything from inside / tags

What I want
From the subject below I want to get search=adam, page=content and message=2.
Subject:
/search=adam/page=content/message=2
What I have tried so far
(\/)+search+\=+(.*)\/
But this is not good, because sometimes the subject ends without a trailing /, and this pattern requires one
(\/)+search+\=+(.*?)+(\/*?)
But this is not good, because it goes through the (\/*?) and shows me everything that's after /search=
Tool Tip:
Regex Tester
Use String.split(), no regex required:
var A = '/search=adam/page=content/message=2'.split('/');
Note that you may have to discard the first array item using .slice(1).
Then you can iterate through the name-value pairs using something like:
for(var x = 0; x < A.length; x++) {
var nameValue = A[x].split('=');
if(nameValue[0] == 'search') {
// do something with nameValue[1]
}
}
This assumes that no equals signs will be in the value. Hopefully this is the case, but if not, you could use nameValue.slice(1).join('=') instead of nameValue[1];
shows me everything that's after /search=
You used a greedy .* that will happily match slashes as well. You can use a non-greedy .*?, or a character class that excludes the slash:
(\/|^)search=([^\/]*)(\/|$)
Here the front and end may be either a slash or the start/end (^/$) of the string. (I removed the +s, as I can't work out at all what they're supposed to be doing.)
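A minimal sketch of that pattern in use follows. The helper getSegmentValue is hypothetical; it builds the same expression for any parameter name, escaping regex metacharacters in the name first, and reads the value from the second capture group.

```javascript
// Hypothetical helper built around the pattern from the answer above.
function getSegmentValue(subject, name) {
  // Escape regex metacharacters so the name is matched literally.
  const escapedName = name.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const pattern = new RegExp('(\\/|^)' + escapedName + '=([^\\/]*)(\\/|$)');
  const match = subject.match(pattern);
  return match ? match[2] : null; // group 2 holds the value
}

const subject = '/search=adam/page=content/message=2';
console.log(getSegmentValue(subject, 'search'));  // adam
console.log(getSegmentValue(subject, 'message')); // 2 (no trailing slash needed)
console.log(getSegmentValue(subject, 'missing')); // null
```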
Alternatively, forget the regex:
var params= {};
var pieces= subject.split('/');
for (var i= pieces.length; i-->0;) {
var ix= pieces[i].indexOf('=');
if (ix!==-1)
params[pieces[i].slice(0, ix)]= pieces[i].slice(ix+1);
}
Now you can just say params.search, params.page etc.
