Extract text from HTML with Javascript


I would like to extract text from HTML with pure Javascript (this is for a Chrome extension).
Specifically, I would like to be able to find text on a page and extract text after it.
Even more specifically, on a page like
https://picasaweb.google.com/kevin.smilak/BestOfAmericaSGrandCircle#4974033581081755666
I would like to find text "Latitude" and extract the value that goes after it. HTML there is not in a very structured form.
What is an elegant solution to do it?

There is no elegant solution in my opinion because, as you said, the HTML is not structured and the words "Latitude" and "Longitude" depend on page localization.
Best I can think of is relying on the cardinal points, which might not change...
var data = document.getElementById("lhid_tray").innerHTML;
var lat = data.match(/((\d)*\.(\d)*)°(\s*)(N|S)/)[1];
var lon = data.match(/((\d)*\.(\d)*)°(\s*)(E|W)/)[1];

You could do:
var str = document.getElementsByClassName("gphoto-exifbox-exif-field")[4].innerHTML;
var latPos = str.indexOf('Latitude');
var lat = str.substring(str.indexOf('<em>', latPos) + 4, str.indexOf('</em>', latPos));

The text you're interested in is found inside of a div with class gphoto-exifbox-exif-field. Since this is for a Chrome extension, we have document.querySelectorAll which makes selecting that element easy:
var div = document.querySelectorAll('div.gphoto-exifbox-exif-field')[4],
text = div.innerText;
/* text looks like:
"Filename: img_3474.jpg
Camera: Canon
Model: Canon EOS DIGITAL REBEL
ISO: 800
Exposure: 1/60 sec
Aperture: 5.0
Focal Length: 18mm
Flash Used: No
Latitude: 36.872068° N
Longitude: 111.387291° W"
*/
It's easy to get what you want now:
var lng = text.split('Longitude:')[1].trim(); // "111.387291° W"
I used trim() instead of split('Longitude: ') because the character after the colon in the innerText is not an ordinary space: URL-encoded it is %C2%A0, which is the UTF-8 encoding of a non-breaking space (U+00A0). trim() removes it along with regular whitespace.
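As a minimal sketch, assuming the innerText keeps the "Key: value" shape shown above, the same trim() idea can turn every line into a property. The helper name parseExifText is hypothetical, and the sample string is hard-coded here rather than read from the page.

```javascript
// Hypothetical helper: turns "Key: value" lines into a plain object.
function parseExifText(text) {
  const props = {};
  text.split("\n").forEach(function (line) {
    const colon = line.indexOf(":");
    if (colon === -1) return; // skip lines without a key
    const key = line.slice(0, colon).trim();
    // trim() also strips the non-breaking space (U+00A0) after the colon
    const value = line.slice(colon + 1).trim();
    if (key) props[key] = value;
  });
  return props;
}

// Sample input (assumption); on the page it would come from div.innerText.
const sample = "ISO: 800\nLatitude:\u00A036.872068° N\nLongitude:\u00A0111.387291° W";
const exif = parseExifText(sample);
console.log(exif.Longitude);
```

Splitting at the first colon only means a value that itself contains a colon is kept whole.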

I would query the DOM and just collect the image information into an object, so you can reference any property you want.
E.g.
function getImageData() {
var props = {};
Array.prototype.forEach.apply(
document.querySelectorAll('.gphoto-exifbox-exif-field > em'),
[function (prop) {
props[prop.previousSibling.nodeValue.replace(/[\s:]+/g, '')] = prop.textContent;
}]
);
return props;
}
var data = getImageData();
console.log(data.Latitude); // 36.872068° N

Well, if a more general answer is required for other sites, then you can try something like:
var text = document.body.innerHTML;
text = text.replace(/(<([^>]+)>)/ig,""); //strip out all HTML tags
var latArray = text.match(/Latitude:?\s*[^0-9]*[0-9]*\.?[0-9]*\s*°\s*[NS]/gim);
//search for and return an array of all found results for:
//"latitude", zero or one ":", white space, any non-digit characters, a number, white space, "°", white space, N or S
//flags: g (global), i (ignore case), m (multi-line)
For that example an array of 1 element containing "Latitude: 36.872068° N" is returned (which should be easy to parse).
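Parsing such a match could look like the sketch below. It assumes the "number ° hemisphere" shape from the example, and the helper name parseCoordinate is made up for illustration. South and west are returned as negative numbers, which is a common convention rather than anything the page itself dictates.

```javascript
// Hypothetical helper: converts a match such as "Latitude: 36.872068° N"
// into a signed decimal number (south and west become negative).
function parseCoordinate(match) {
  const parts = /([0-9]+(?:\.[0-9]+)?)\s*°\s*([NSEW])/i.exec(match);
  if (!parts) return null; // no coordinate found in the string
  const value = parseFloat(parts[1]);
  const hemisphere = parts[2].toUpperCase();
  return (hemisphere === "S" || hemisphere === "W") ? -value : value;
}

console.log(parseCoordinate("Latitude: 36.872068° N"));
console.log(parseCoordinate("Longitude: 111.387291° W"));
```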

Related

How do I retrieve data from a class in a span?

I need to retrieve some portion of data from HTML code. Here it is :
<span
class="Z3988" style="display:none;"
title="ctx_ver=Z39.88-2004&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&
rfr_id=info%3Asid%2Focoins.info%3Agenerator&rft.genre=article&
rft.atitle=Parliamentarism Rationalized&
rft.title=East European Constitutional Review&
rft.stitle=E. Eur. Const. Rev.&rft.date=1993&
rft.volume=2&rft.spage=33&rft.au=Tanchev, Evgeni&
rft_id=http://heinonline.org/HOL/Page?handle%3Dhein.journals/eeurcr2%26id%3D33%26div%3D%26collection%3D">
</span>
I tried to use e.g.:
document.querySelector("span.Z3988").textContent
document.getElementsbyClassName("Z3988")[0].textContent
My final aim is to get what comes after:
rft.atitle (Parliamentarism Rationalized)
rft.title (East European Constitutional Review)
rft.date
rft.volume
rft.spage
rft.au
How do I do that? I'd like to avoid RegEx.

Get the title text of the span.
Split it at =, join using a character that will not appear in the string (I used ^), do the same for &, then split at the unique character (^ in this case) and pick the value at every odd index. If you need a string, just join it.
Example snippet:
var spanTitle = document.getElementsByClassName("Z3988")[0].getAttribute("title");
var data = spanTitle.split("=").join("^").split("&").join("^").split("^");
var finaldata = data.filter(function(d, index) {
return index % 2 === 1; // keys sit at even indexes, values at odd ones
});
console.log(finaldata);
<span class="Z3988" style="display:none;" title="ctx_ver=Z39.88-2004&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&
rfr_id=info%3Asid%2Focoins.info%3Agenerator&rft.genre=article&
rft.atitle=Parliamentarism Rationalized&
rft.title=East European Constitutional Review&
rft.stitle=E. Eur. Const. Rev.&rft.date=1993&
rft.volume=2&rft.spage=33&rft.au=Tanchev, Evgeni&
rft_id=http://heinonline.org/HOL/Page?handle%3Dhein.journals/eeurcr2%26id%3D33%26div%3D%26collection%3D">
</span>

What you have in your title looks to be a url search query...
var elm = document.querySelector('.Z3988')
var params = new URLSearchParams(elm.title) // parse everything
console.log(...params) // list all
console.log(params.get('rft.title')) // getting one example
<span class="Z3988" style="display:none;" title="ctx_ver=Z39.88-2004&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rfr_id=info%3Asid%2Focoins.info%3Agenerator&rft.genre=article&rft.atitle=Parliamentarism Rationalized&rft.title=East European Constitutional Review&rft.stitle=E. Eur. Const. Rev.&rft.date=1993&rft.volume=2&rft.spage=33&rft.au=Tanchev, Evgeni&rft_id=http://heinonline.org/HOL/Page?handle%3Dhein.journals/eeurcr2%26id%3D33%26div%3D%26collection%3D"></span>
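To collect only the fields the question asks for, something like the following sketch may help. The helper name pickParams is hypothetical, and the title string is shortened and hard-coded so the code also runs outside a browser. It assumes line breaks in a multi-line attribute are not meaningful and drops them before parsing, since they would otherwise end up inside the keys.

```javascript
// Shortened copy of the title attribute (assumption: hard-coded for the demo;
// in the page it would be elm.title).
const title = "ctx_ver=Z39.88-2004&rft.genre=article&\n" +
  "rft.atitle=Parliamentarism Rationalized&\n" +
  "rft.title=East European Constitutional Review&rft.date=1993&\n" +
  "rft.volume=2&rft.spage=33&rft.au=Tanchev, Evgeni";

// Hypothetical helper: returns only the requested keys as a plain object.
function pickParams(query, keys) {
  // Join the lines back together without using a regular expression.
  const singleLine = query.split("\n").map(function (line) {
    return line.trim();
  }).join("");
  const params = new URLSearchParams(singleLine);
  const result = {};
  keys.forEach(function (key) {
    result[key] = params.get(key); // null when the key is absent
  });
  return result;
}

const wanted = ["rft.atitle", "rft.title", "rft.date", "rft.volume", "rft.spage", "rft.au"];
const citation = pickParams(title, wanted);
console.log(citation["rft.atitle"]);
```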

If you're trying to grab the title attribute:
document.getElementsByClassName("Z3988")[0].getAttribute("title");

The way you're outputting content as text is a really bad method. You could try to print each section of your text into element attributes and retrieve each part with element.getAttribute().
Ex:
<span id='whatever' stitle='content' spage='content'></span>
and retrieve from the selected element.
For the way you have it, you might want to put that text into a variable and split the values like:
var element_text = document.getElementsByClassName("Z3988")[0].getAttribute("title");
var element_specifics = element_text.split('&'); // Separate the text into an array, splitting at each '&'

Not sure how this is going to process down with browser compatibilities or JavaScript versions, but you can definitely sub out the arrow functions for vanilla anonymous functions, and "let" for "var". Otherwise, it fits the parameters of no regex, and even creates a nice way to index for your various keywords.
My steps:
Grab the attribute block
Split it up into array elements containing the desired keywords and contents
Split up the desired keywords and contents into sub-arrays
Trim down the contents of each keyword block for symbols and non alphanumerics
Construct the objects for convenient indexing
Obviously the last portion is just to print out the array of objects in a nice readable format. Hope this helps you out!
window.onload = function() {
let x = document.getElementsByClassName('Z3988')[0].getAttribute('title')
let a = x.split('rft.').map((y) => y.split('='))
a = a.map((x, i) => {
x = x.map((y) => {
let idx = y.indexOf('&')
return y = (idx > -1) ? y.slice(0, idx) : y
})
let x1 = x[0], x2 = x[1], obj = {}
obj[x1] = x2
return a[i] = obj
})
a.forEach((x) => {
let div = document.createElement('div')
let br = document.createElement('br')
let text = document.createTextNode(JSON.stringify(x))
div.appendChild(text)
div.appendChild(br)
document.body.appendChild(div)
})
}
<span
class="Z3988" style="display:none;"
title="ctx_ver=Z39.88-2004&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&
rfr_id=info%3Asid%2Focoins.info%3Agenerator&rft.genre=article&
rft.atitle=Parliamentarism Rationalized&
rft.title=East European Constitutional Review&
rft.stitle=E. Eur. Const. Rev.&rft.date=1993&
rft.volume=2&rft.spage=33&rft.au=Tanchev, Evgeni&
rft_id=http://heinonline.org/HOL/Page?handle%3Dhein.journals/eeurcr2%26id%3D33%26div%3D%26collection%3D">
</span>

I want to obtain a certain part of text from a large webpage using Javascript, how do I?

There is a certain webpage which randomly generates a number, for example "Frequency : 21". I am trying to create a script which takes the number, 21, and compares it to another variable, then to an if else function. Basically, I've completed most of it, but I can't obtain the number 21. And since it is random, I can't put in a fixed value.
Can anyone help me out?
My code goes like:
setTimeout(MyFunction,5000)
function MyFunction(level,legmin) {
var level = x
var legmin = 49
if (level <= legmin) {
location.reload(true)
}
else {
alert("Met requirements.")
}
}
where the address of the text I want is:
html>body>div#container>div#contentContainer>div#content>
div#scroll>div#scrollContent>div>div>div#pkmnappear>form>p (x in the code above).

A quick-n-dirty solution without regex.
var lookFor = "Frequency : ";
var text = document.querySelector("#pkmnappear>form>p").textContent;
var level = text.substr(text.indexOf(lookFor) + lookFor.length).split(" ")[0];
This assumes the number will be followed by a space
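If the value is needed as a number for the comparison, a sketch along these lines avoids relying on the trailing space. The helper name numberAfter and the sample text are assumptions; in the page the text would come from the p element's textContent.

```javascript
// Hypothetical helper: returns the integer that follows `label`, or NaN.
function numberAfter(text, label) {
  const start = text.indexOf(label);
  if (start === -1) return NaN; // label not present
  // parseInt skips leading whitespace and stops at the first non-digit
  return parseInt(text.slice(start + label.length), 10);
}

// Sample text (assumption), standing in for the p element's textContent.
const level = numberAfter("Frequency : 21 per hour", "Frequency : ");
const legmin = 49; // threshold from the question
console.log(level <= legmin);
```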

Is there any generic function for subscripting?

I have a web page whose contents are loaded dynamically from JSON. After the page has loaded, I need to find texts like so2, co2 and h2o and apply subscript to them. Is it possible to do this? If yes, please let me know the most efficient way of achieving it.
For example:
var json = { chemA: "value of CO2 is", chemB: "value of H2O is" , chemC: "value in CTUe is"};
In the above JSON I need to render the 2 in CO2 and H2O, and the e in CTUe, as subscript. How can I achieve this?

Take a look at this JSfiddle which shows two approaches:
HTML-based using the <sub> tag
Pure Javascript-based by replacing the matched number with the subscript equivalent in unicode:
http://jsfiddle.net/7gzbjxz3/
var json = { chemA: "CO2", chemB: "H2O" };
var jsonTxt = JSON.stringify(json).replace(/\d/g, function (x){
// replace one digit at a time so multi-digit numbers convert correctly
return String.fromCharCode(8320 + parseInt(x, 10));
});
Option 2 has the advantage of being more portable since you're actually replacing the character. I.e., you can copy and paste the text into say notepad and still see the subscripts there.
The JSFiddle shows both approaches. The magic number is 8320 rather than 2080 because 8320 is the decimal form of the hexadecimal code point 0x2080 (U+2080, SUBSCRIPT ZERO).
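A self-contained sketch of the Unicode approach follows, written with the hexadecimal constant so the link to U+2080 is visible. The function name toUnicodeSubscript is hypothetical. Note that it converts every digit it sees, so it should only be applied to strings that are known to be chemical formulas.

```javascript
// Maps each ASCII digit to its Unicode subscript form (U+2080 to U+2089).
function toUnicodeSubscript(text) {
  return text.replace(/\d/g, function (digit) {
    return String.fromCharCode(0x2080 + Number(digit));
  });
}

console.log(toUnicodeSubscript("CO2"));
console.log(toUnicodeSubscript("C6H12O6"));
```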

So you are generating DOM elements from the JSON data you receive. Before adding the text to the DOM, you can check whether it contains so2, co2, h2o and so on, and if it does, wrap the digits in a <sub> tag.
For example:
var text = 'CO2';
text.replace(/(\d+)/g, "<sub>" + "$1" + "</sub>") ;
This returns "CO<sub>2</sub>", which the browser renders with the 2 as a subscript.
As per JSON provided by you:
// Only working for integer right now
var json = { chemA: "value of CO2 is", chemB: "value of H2O is" , chemC: "value in CTUe is"};
$.each(json, function(index, value) {
json[index] = value.replace(/(\d+)/g, "<sub>" + "$1" + "</sub>");
});
console.log(json);
Hope this helps!

To do this, I would create a prototype function extending String and name it .toSub(). Then, when you create your html from your json, call .toSub() on any value that might contain text that should be in subscript:
// here is the main function
String.prototype.toSub = function() {
var str=this;
var subs = [
['CO2','CO<sub>2</sub>'],
['H2O','H<sub>2</sub>O'],
['CTUe','CTU<sub>e</sub>'] // add more here as needed.
];
for(var i=0;i<subs.length;i++){
var chk = subs[i][0];
var rep = subs[i][1];
var pattern = new RegExp('^'+chk+'([ .?!])|( )'+chk+'([ .?!])|( )'+chk+'[ .?!]?$','ig'); // makes a regex like this: /^CO2([ .?!])|( )CO2([ .?!])|( )CO2[ .?!]?$/gi using the current sub
// the "empty" capture groups above may seem pointless but they are not
// they allow you to capture the spaces easily so you dont have to deal with them some other way
rep = '$2$4'+rep+'$1$3'; // the $1 etc here are accessing the capture groups from the regex above
str = str.replace(pattern,rep);
}
return str;
};
// below is just for the demo
var json = { chemA: "value of CO2 is", chemB: "value of H2O is" , chemC: "value in CTUe is", chemD: "CO2 is awesome", chemE: "I like H2O!", chemF: "what is H2O?", chemG: "I have H2O. Do you?"};
$.each(json, function(k, v) {
$('#result').append('Key '+k+' = '+v.toSub()+'<br>');
});
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div id="result"></div>
Note:
Anytime you do something like this with regex, you run the chance of unintentionally matching and converting some unwanted bit of text. However, this approach will have far fewer edge cases than searching and replacing text in your whole document as it is much more targeted.
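For comparison, here is a minimal sketch of the same table-driven idea without extending String.prototype, using word boundaries instead of explicit capture groups. The names SUBSCRIPT_FORMS and applySubscripts are hypothetical, the match is case-sensitive, and the sketch assumes the table keys contain no regex metacharacters.

```javascript
// Lookup table: plain spelling -> markup. Entries are examples only.
const SUBSCRIPT_FORMS = {
  CO2: "CO<sub>2</sub>",
  H2O: "H<sub>2</sub>O",
  CTUe: "CTU<sub>e</sub>"
};

function applySubscripts(text) {
  // \b keeps "CO2" from matching inside a longer word such as "CO2X"
  const pattern = new RegExp("\\b(" + Object.keys(SUBSCRIPT_FORMS).join("|") + ")\\b", "g");
  return text.replace(pattern, function (match) {
    return SUBSCRIPT_FORMS[match];
  });
}

console.log(applySubscripts("value of CO2 is"));
```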

title casing and Abbreviations in javascript

I am trying to Titlecase some text which contains corporate names and their stock symbols.
Example (these strings are concatenated as corporate name, which gets title cased and the symbol in parens): AT&T (T)
John Deere Inc. (DE)
These corporate names come from our database which draws them from a stock pricing service. I have it working EXCEPT for when the name is an abbreviation like AT&T
That is returned, as you might have guessed, as At&t. How can I preserve casing in abbreviations? I thought of using indexOf to get the position of any &'s and uppercasing the characters on either side of it, but that seems hackish.
Along the lines of(pseudo code)
var indexPos = myString.indexOf("&");
var fixedString = myString.charAt(indexPos - 1).toUpperCase().charAt(indexPos + 1).toUpperCase()
Oops, forgot to include my titlecase function
function toTitleCase(str) {
return str.replace(/([^\W_]+[^\s-]*) */g, function (txt) {
return txt.charAt(0).toUpperCase() + txt.substr(1).toLowerCase();
});
}
Any better suggestions?

A better title case function may be
function toTitleCase(str) {
return str.replace(
/(\b.)|(.)/g,
function ($0, $1, $2) {
return ($1 && $1.toUpperCase()) || $2.toLowerCase();
}
);
}
toTitleCase("foo bAR&bAz a.e.i."); // "Foo Bar&Baz A.E.I."
This will still transform AT&T to At&T, but there's no information in the way it's written to know what to do, so finally
// specific fixes
if (str === "At&T" ) str = "AT&T";
else if (str === "Iphone") str = "iPhone";
// etc
// or
var dict = {
"At&T": "AT&T",
"Iphone": "iPhone"
};
str = dict[str] || str;
Though of course if you can do it right when you enter the data in the first place it will save you a lot of trouble
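Putting the two pieces together, a sketch might title-case the name and then apply the exception dictionary word by word, so that an exception still matches when it appears inside a longer name. The names titleCaseExceptions and titleCaseName are hypothetical, and the dictionary entries are only the examples from above.

```javascript
// Title-case function from above: uppercase the character after each word
// boundary, lowercase everything else.
function toTitleCase(str) {
  return str.replace(/(\b.)|(.)/g, function (match, first, rest) {
    return first ? first.toUpperCase() : rest.toLowerCase();
  });
}

// Example exceptions only; extend as needed.
const titleCaseExceptions = {
  "At&T": "AT&T",
  "Iphone": "iPhone"
};

// Hypothetical helper: title-cases a name, then fixes known exceptions.
function titleCaseName(name) {
  return toTitleCase(name).split(" ").map(function (word) {
    return Object.prototype.hasOwnProperty.call(titleCaseExceptions, word)
      ? titleCaseExceptions[word]
      : word;
  }).join(" ");
}

console.log(titleCaseName("AT&T"));
console.log(titleCaseName("JOHN DEERE INC."));
```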

This is a general solution for title case, without taking your extra requirements of "abbreviations" into account:
function upper(match) { return match.toUpperCase(); } // replacement callback used by the snippets below
var fixedString = String(myString).toLowerCase().replace(/\b\w/g, upper);
Although I agree with other posters that it's better to start with the data in the correct format in the first place. Not all proper names conform to title case, with just a couple examples being "Werner von Braun" and "Ronald McDonald." There's really no algorithm you can program into a computer to handle the often arbitrary capitalization of proper names, just like you can't really program a computer to spell check proper names.
However, you can certainly program in some exception cases, although I'm still not sure that simply assuming that any word with an ampersand in it should be in all caps is always appropriate either. But that can be accomplished like so:
var titleCase = String(myString).toLowerCase().replace(/\b\w/g, upper);
var fixedString = titleCase.replace(/\b\w*\&\w*\b/g, upper);
Note that your second example of "John Deere Inc. (DE)" still isn't handled properly, though. I suppose you could add some other logic to, say, put any word between parentheses in all caps, like so:
var titleCase = String(myString).toLowerCase().replace(/\b\w/g, upper);
var titleCaseCapAmps = titleCase.replace(/\b\w*\&\w*\b/g, upper);
var fixedString = titleCaseCapAmps.replace(/\(.*\)/g, upper);
Which will at least handle your two examples correctly.

How about this: Since the number of companies registered with the stock exchange is finite, and there's a well-defined mapping between stock symbols and company names, your best bet is probably to program that mapping into your code and look up the company name by the ticker abbreviation, something like this:
var TickerToName =
{
A: "Agilent Technologies",
AA: "Alcoa Inc.",
// etc., etc.
}
Then it's just a simple lookup to get the company name from the ticker symbol:
var symbol = "T";
var CompanyName = TickerToName[symbol] || "Unknown ticker symbol: " + symbol;
Of course, I would be very surprised if there was not already some kind of Web Service you could call to get back a company name from a stock ticker symbol, something like in this thread:
Stock ticker symbol lookup API
Or maybe there's some functionality like this in the stock pricing service you're using to get the data in the first place.

The last time I faced this situation, I decided that it was less trouble to simply include the few exceptions here and there as needed.
var titleCaseFix = {
"At&t": "AT&T"
};
function fixit(str) {
for (var oldCase in titleCaseFix) {
var newCase = titleCaseFix[oldCase];
// Replace every occurrence. Look here for various string replace options:
// http://stackoverflow.com/questions/542232/in-javascript-how-can-i-perform-a-global-replace-on-string-with-a-variable-insi
str = str.split(oldCase).join(newCase);
}
return str;
}

Having a <Textarea> with a max buffer

I have seen plenty of code snippets to force a <Textarea> to have only X number of characters and then not allow anymore. What I am in need of is a <Textarea> where you can specify how many characters I can have at one time at most. Almost like a max buffer size. Think of it like a rolling log file. I want to always show the last/newest X number of characters.
Simpler the solution the better. I am not a web expert so the more complicated it gets the more greek it looks to me. :)
I am already using jQuery so a solution with that should be ok.

try this:
<textarea id="yourTextArea" data-maxchars="1000"></textarea>
var textarea = document.getElementById('yourTextArea');
var taChanged = function(e){
var ta = e.target;
var maxChars = ta.getAttribute('data-maxchars');
if(ta.value.length > maxChars){
ta.value = ta.value.substr(0,maxChars);
}
}
textarea.addEventListener('change', taChanged, 1);
To keep the last characters instead, replace the line inside the if with:
ta.value = ta.value.substr(ta.value.length - maxChars);

And jQuery implementation:
$('#text').keyup(function() {
var max = $(this).data('maxchars'),
len = $(this).val().length;
len > max && $(this).val(function() {
return $(this).val().substr(len - max);
});
});
http://jsfiddle.net/WsnSk/

No code, I can't even spell JavaScript, but basically as you're about to add new text, check the length of the existing text. If the old text plus the new text is too long, trim off the beginning of the old text (likely at a newline or whatever). Rinse and repeat.
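The steps described above can be sketched as a small pure function. The name appendToBuffer is hypothetical, and cutting at the first line break inside the kept tail is just one possible reading of "trim at a newline or whatever".

```javascript
// Hypothetical helper: appends `addition` to `existing` and keeps at most
// the newest `maxChars` characters, dropping a partial first line if possible.
function appendToBuffer(existing, addition, maxChars) {
  const combined = existing + addition;
  if (combined.length <= maxChars) return combined; // still fits
  let tail = combined.slice(combined.length - maxChars);
  const newline = tail.indexOf("\n");
  // Cut at the first line break, unless that would leave nothing behind.
  if (newline !== -1 && newline < tail.length - 1) {
    tail = tail.slice(newline + 1);
  }
  return tail;
}

console.log(appendToBuffer("abcdef", "ghij", 5));
```

With a textarea, the result would be assigned back to its value each time new text is added.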

Here is the working code on Fiddle. It uses jQuery to keep it simple.
<textarea id="txtArea"></textarea>
var size = 5;
$('#txtArea').change(function(){
var strValue = $('#txtArea').val();
strValue = strValue.split("").reverse().join("").substring(0, size).split("").reverse().join("");
alert(strValue);
});
