Extract data from JavaScript using Python

I am new to Python, and I have inherited a Python notebook from my predecessor that I want to improve. Its purpose is to grab product details from a website.
How it works:
It scrapes the scripts from a website using Beautiful Soup:
import urllib2
import bs4
source = urllib2.urlopen('http://www.testwebsite.html').read()
soup = bs4.BeautifulSoup(source)
job_postings = soup.findAll("script")
# keep only the <script> tags whose type is text/javascript
job_postings = [jp for jp in job_postings
                if jp.get('type') is not None
                and ''.join(jp.get('type')) == "text/javascript"]
It returns all the scripts in the webpage:
(1st part of data)
window.wf=window.wf||{};wf.appData=wf.appData||{};wf.appData.product_data_TEST123=wf.appData.product_data_TEST123||{};wf.appData.product_data_TEST123 = {"sku":"TES123","is_grid_view":false,,"default_img_display":0,"manufacturer_name":"Supplier1","product_name":"product test","part_number":"1234","list_price":1000,"is_price_hidden":false,"base_price":1000,"has_opt":true,"opt_details":[{"option_ids":[],"regular_price":2681.25],"has_free_shipping":false,,"total_qty":1,"display_set_quantity":1,"is_standard_layout":true,"page_type":"ProductPage"};Y_config.app.product_data_TEST123 = {"sku":"TEST123",........ same info here ....};
(2nd part of data)
\n wf.extend({"YUI_config":{"app":{"pageAlias":"ProductPage"}},"wf":{"appData":{"pageAlias":"ProductPage",,"mkcName":"AU: FurnitureRoom","productReviews":{"b_show_review_tags":false,"kit_subgroup_price":null,"catalog_currency":"AUD","price_model":null,"colors":"",,"available_after":{"date":"2016-07-28 18:05:16.000000","timezone":"Australia\\/Sydney"},"inventory_info":{"sku":"TEST123",,"latest_inventory_update":"2016-07-29 00:45:06","option_ids":[],"available_quantity":17,"display_quantity":17,","quantity_available_string":" more then 10 in Stock","short_lead_time_id":2,"short_lead_time_string":"Leaves warehouse in 1 to 3 business days"}}};
Then I extract the data I need:
jsonfile = re.findall(r'wf.appData.product_data_[A-Z]{4}[0-9]{4} = (\{.*});YUI_config.app.product_data_',str(job_postings))
I have this:
{"sku":"TEST123","is_grid_view":false,,"default_img_display":0,"manufacturer_name":"Supplier1","product_name":"product test","part_number":"1234","list_price":1000,"is_price_hidden":false,"base_price":1000,"has_opt":true,"opt_details":[{"option_ids":[],"regular_price":2681.25],"has_free_shipping":false,,"total_qty":1,"display_set_quantity":1,"is_standard_layout":true,"page_type":"ProductPage"}
My problem is now: I want to add the "inventory_info" list to my data
I've tried:
jsonfile = re.findall(r'inventory_info' = (\{.*}),str(job_postings))
or
Jsonfile = re.compile('inventory_info' = ({.*?});', re.DOTALL)
Neither of those works.
My knowledge of Python is very limited, so I'm a bit lost now.
Thanks for your help.

You may have already found the answer to your question, but here it goes anyway.
To get inventory_info, you could always do a split (assuming job_postings is converted to a string), like so:
job_postings = str(job_postings)
inventory_info = job_postings.split('"inventory_info":')[1].split("}")[0] + "}"
job_postings += inventory_info
This relies on inventory_info containing no nested objects, so that the first "}" after the key is the one that closes it.
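The regex attempts from the question can also be made to work. The following is a minimal sketch, not a definitive implementation: script_text is a hypothetical, cleaned-up stand-in for the real script text, extract_inventory_info is an illustrative helper name, and the pattern assumes inventory_info holds no nested {...} objects.

```python
import json
import re

# Hypothetical, cleaned-up stand-in for the script text in the question.
script_text = (
    'wf.extend({"wf":{"appData":{"catalog_currency":"AUD",'
    '"inventory_info":{"sku":"TEST123","option_ids":[],'
    '"available_quantity":17,"short_lead_time_id":2}}}});'
)

def extract_inventory_info(text):
    """Return the inventory_info object as a dict, or None if it is absent."""
    # Non-greedy match from the key's opening brace to the first closing brace.
    # Assumption: inventory_info contains no nested {...} objects.
    match = re.search(r'"inventory_info"\s*:\s*(\{.*?\})', text, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))

inventory_info = extract_inventory_info(script_text)
print(inventory_info["available_quantity"])
```

Parsing the match with json.loads gives a regular dict, so individual fields such as the SKU or the available quantity can be read directly rather than picked out of a string.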

Related

Scrape a javascript variable from a webpage

I am scraping a site with Beautiful Soup, but all the content is hidden inside a script, in a JS variable like the one below.
I can't find any solution other than Selenium, which is not an option in this case; I won't go into detail why, but it just doesn't work. I can already scrape it by taking the inside of the script tag and calling eval() on it, but that introduces a few problems (unexpected indent, unwanted functions). I can use Python, JavaScript, and maybe C# if that helps.
Expected behaviour: whatever gets me the info (the variable in the last line) into any of those three languages, preferably Python.
The code (sorry for the formatting, but it is very long; this isn't even the full variable, which is huge):
barLoadGoogleFont('Open Sans'); barCssLoad('/global/pics/js/jquery/royalSlider/skins/universal/rs-universal.css?v=e449c4'); barCssLoad('/global/pics/css/material-icons.css?v=e6d856'); barCssLoad('/user/pics/css/user.css?v=eced9d');
barCssLoad('/user/pics/css/userIcons.css?v=6f9a03');
barCssLoad('/timeline/pics/css/timeline.css?v=8ec2ca'); barJsLibraryLoad('/global/pics/js/jquery/jquery.royalslider.min.js?v=515a43'); barJsLibraryLoad('/anketa/pics/js/utilsAnketa.js?v=9383d5'); barJsLibraryLoad('/znamky/pics/js/utilsZnamky.js?v=7afc9e'); barJsLibraryLoad('/exam/pics/js/utilsExam.js?v=033d55'); barJsLibraryLoad('/timeline/pics/js/utilsTimeline.js?v=29cf0e'); barJsLibraryLoad('/timeline/pics/js/timelineItemCreator.js?v=c37c99'); barJsLibraryLoad('/timeline/pics/js/timelineInputbox.js?v=2fde70'); barJsLibraryLoad('/timeline/pics/js/timelineViewer.js?v=f35e45');
barJsLibraryLoad('/user/pics/js/DailyPlan.js?v=e81fb9'); barJsLibraryLoad('/user/pics/js/userHomeEtest.js?v=6166f3');
$j(document).ready(function() { $j('#jwbcddd3da_md').userhome({"items":[{"timelineid":"2140963","timestamp":"2020-12-09 09:59:13","reakcia_na":"692638","typ":"h_clearplany","user":"Plan5077","target_user":null,"user_meno":"Kvarta aj2","ineid":"clearplany","text":"","cas_pridania":"2020-12-09 09:59:13","cas_udalosti":null,"data":"null","vlastnik":"Ucitel8678605","vlastnik_meno":"Barbora Drugajov\u00e1","pocet_reakcii":"0","posledna_reakcia":"","pomocny_zaznam":"1","removed":"0","cas_pridania_btc":"2020-12-09 09:59:13","posledna_reakcia_btc":null},{"timelineid":"2287814","timestamp":"2020-12-09 09:59:12","reakcia_na":"2290613","typ":"h_dailyplan","user":"Trieda8694210","target_user":null,"user_meno":"Kvarta A","ineid":"daily2020-12-09","text":"","cas_pridania":"2020-12-09 09:59:12","cas_udalosti":null,"data":"[]","vlastnik":"Ucitel8678605","vlastnik_meno":"Barbora Drugajov\u00e1","pocet_reakcii":"0","posledna_reakcia":"","pomocny_zaznam":"1","removed":"0","cas_pridania_btc":"2020-12-09 09:59:12","posledna_reakcia_btc":null},{"timelineid":"1439827","timestamp":"2020-12-09 08:56:57","reakcia_na":null,"typ":"h_clearplany","user":"*","target_user":null,"user_meno":"Cel\u00e1 \u0161kola","ineid":"clearplany","text":"","cas_pridania":"2020-12-09 08:56:57","cas_udalosti":null,"data":"null","vlastnik":"Ucitel16434","vlastnik_meno":"Ivor Dian","pocet_reakcii":"0","posledna_reakcia":"","pomocny_zaznam":null,"removed":"0","cas_pridania_btc":"2020-12-09 08:56:57","posledna_reakcia_btc":null},{"timelineid":"2290324","timestamp":"2020-12-09 08:37:22","reakcia_na":null,"typ":"sprava","user":"CustPlan5075","target_user":null,"user_meno":"Kvarta A+Kvarta B - nj4 \u00b7 nemeck\u00fd jazyk","ineid":null,"text":"Ahojte, zajtra...

OK, it's a little tough to debug without actually working with it, but you'll need to pull out that JSON structure. You can do it with splits. So this is a sort of generic code.
from bs4 import BeautifulSoup
import pandas as pd
import requests
import json

url = 'https://www.thesite.com'  # placeholder URL
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
scripts = soup.find_all('script')

jsonData = None
for script in scripts:
    if '.userhome({' in script.text:
        data = script.text.split('.userhome(')[-1]
        # Trim trailing JavaScript, one ')' at a time from the right,
        # until what is left parses as JSON.
        while ')' in data:
            data = data.rsplit(')', 1)[0]
            try:
                jsonData = json.loads(data)
                break
            except ValueError:
                continue

rows = []
for row in jsonData['items']:
    rows.append(row)
table = pd.DataFrame(rows)
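An alternative to trimming the string until it parses is to let the standard library's json.JSONDecoder.raw_decode read one JSON value and ignore whatever follows it. This is a minimal sketch under assumptions: script_text is a hypothetical, shortened stand-in for the real script, and extract_json_argument is an illustrative helper name.

```python
import json

# Hypothetical, shortened stand-in for the script text in the question.
script_text = (
    "barCssLoad('/user/pics/css/user.css?v=eced9d');\n"
    "$j(document).ready(function() { $j('#jwbcddd3da_md').userhome("
    '{"items":[{"timelineid":"2140963","typ":"h_clearplany","text":"a); b"}]}'
    "); });"
)

def extract_json_argument(text, marker=".userhome("):
    """Parse the JSON object that starts right after `marker`."""
    start = text.index(marker) + len(marker)
    # raw_decode parses a single JSON value starting at `start` and returns
    # the value together with the index where that value ended.
    value, _end = json.JSONDecoder().raw_decode(text, start)
    return value

data = extract_json_argument(script_text)
print(len(data["items"]))
```

Because the decoder stops at the end of the object by itself, characters such as ")" or ";" inside the JSON strings do not affect the result.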

Sending whatsapp messages via python/JS

I made a program which takes information from Excel and sends messages via Python.
I used Selenium and "span" to find the element I need.
Now WhatsApp has changed their HTML and there is no span anymore.
The old code is here:
import time
import xlrd
from selenium import webdriver

chrome_driver_binary = r"D:\pycharm\chromedriver.exe"
driver = webdriver.Chrome(chrome_driver_binary)
driver.get('http://web.whatsapp.com')
file_location = r"C:\Users\ErelNahum\Desktop\data.xlsx"
book = xlrd.open_workbook(file_location)
print "there is " + str(book.nsheets) + " sheets"
sheet = book.sheet_by_index(0)
cols = sheet.ncols - 1
print "the number of cols is " + str(cols)
raw_input('Enter anything after scanning QR code')
for i in range(cols):
    tel = sheet.cell_value((i+1), 0)
    tel = tel.replace("\"", "")
    print tel
    messege = sheet.cell_value((i+1), 1)
    messege = (messege +str(b+1))
    user = driver.find_element_by_xpath('//span[@title = "{}"]'.format(tel))
    user.click()
    msg_box = driver.find_element_by_class_name('_input-container')
    msg_box.send_keys(messege)
    driver.find_element_by_class_name('compose-btn-send').click()
    time.sleep(0.5)
If you have any idea how to change the program so that it works, please show me.
I know Python, JS, and C#, so any language is fine.
Thank You,
Erel.

Check which tag surrounds the data you're trying to scrape now that the span tag has been removed, and adjust the code accordingly.
There is no general replacement for span. Can you provide the markup you're trying to scrape (at least the part where the span tag was)?

Loading a Random Caption from a text file using Javascript and Displaying via HTML

I am trying to load a random caption every time my page is loaded. I have a separate text file and contained on each line is a string. I am new to both html and Javascript, as you will see.
HTML:
<div class="centerpiece">
<h1>DEL NORTE BANQUEST</h1>
<p class="caption"><script src = "js/caption.js"></script><script>getCaption();</script></p>
<a class="btn" id="browse-videos-button" href="#video-list">Browse Videos<br><img src="img/arrow-down.svg"style="width:15px;height:15px;"></a>
</div>
Javascript:
function getCaption()
{
    var txtFile = "text/captions.txt"
    var file = new File(txtFile);
    file.open("r"); // open file with read access
    var str = "";
    var numLines = 0; //to get the range of lines in the file
    while (!file.eof)
    {
        // read each line of text
        numLines += 1;
    }
    file.close();
    file.open("r");
    var selectLine = Math.getRandomInt(0,numLines);//get the correct line number
    var currentLine = 0;
    while(selectLine != currentLine)
    {
        currentLine += 1;
    }
    if(selectLine = currentLine)
    {
        str = file.readln();
    }
    file.close();
    return str;
}
Text in Source File:
We talked yesterday
Freshman boys!
5/10
I'm having a heart attack *pounds chest super hard
The site is for my highschool cross country team in case the text file was confusing.
I am unfamiliar with most of the syntax and could not tell whether I needed to reset the file somehow after iterating through it with a loop, which is why I open and close it twice. Here is a jsfiddle of the specific caption I am trying to change and of my JavaScript function:
https://jsfiddle.net/7cre9qqj/
If you need more code to work with, please let me know. Don't hold back any critiques if it looks like a mess; I am trying to learn, after all! Thank you for your help!

The File API allows access to the file system on the client side, so it's not really suited to what you want to do. It's also only allowed to be used in very specific circumstances.
A simple solution is to just run an AJAX request to populate your quote. The AJAX call can read the file on your server, then it's simple to split the contents of the file by line, and pick a random line to display. Since you're open to jQuery, the code is pretty simple:
$.get("text/captions.txt").then(function(data) {
    var lines = data.split('\n');
    var index = Math.floor(Math.random() * lines.length);
    $("#quote").html(lines[index]);
});
Here's a fiddle that demonstrates it in full; every time it runs it will load a random quote: https://jsfiddle.net/s1w8x4ff/

Running an Executable file from an ASP.NET web application

I am trying to create a web application that reads certain files (logs) provided by the users, then uses Microsoft's LogParser 2.2 executable to parse the logs and provide the requested output.
My idea is to run the local LogParser.exe present on the user's system and then use the generated output to prepare my output.
I do not know if this approach is correct. However, I am trying it, and somewhere my code is not executing as expected, and I cannot find any output or error.
My code segment is as follows:
protected void Button2_Click(object sender, EventArgs e)
{
    try
    {
        string fileName = @"C:\Program Files (x86)\Log Parser 2.2\LOGPARSER.exe";
        string filename = "LogParser";
        string input = " -i:IISW3C ";
        string query = " Select top 10 cs-ur-stem, count(cs-ur-stem) from " + TextBox1.Text + " group by cs-uri-stem order by count(cs-ur-stem)";
        string output = " -o:DATAGRID ";
        string argument = filename + input + query + output;
        ProcessStartInfo PSI = new ProcessStartInfo(fileName)
        {
            UseShellExecute = false,
            Arguments = argument,
            RedirectStandardInput = true,
            RedirectStandardOutput = true,
            CreateNoWindow = false
        };
        Process LogParser = Process.Start(PSI);
        LogParser.Start();
    }
    catch (Exception Prc)
    {
        MessageBox.Show(Prc.Message);
    }
}
I might be doing something wrong, but can someone point me in the correct direction? Might a JavaScript ActiveX control be the way forward?
All help is appreciated.
(I am building it as an internal application for my organisation, and it is assumed that LogParser will be present on the computer where this web application is used.)
Thanks
Ravi

Add a reference to Interop.MSUtil in your project and then use the COM API as described in the help file. The following using statements should allow you to interact with LogParser through your code:
using LogQuery = Interop.MSUtil.LogQueryClass;
using FileLogInputFormat = Interop.MSUtil.COMTextLineInputContextClass;
Then you can do something like:
var inputFormat = new FileLogInputFormat();
// Instantiate the LogQuery object
LogQuery oLogQuery = new LogQuery();
var results = oLogQuery.Execute(yourQuery, inputFormat);
You have access to a bunch of predefined input formats and output formats (like IIS and W3C), so you can pick the one that best suits your needs. Also, you will need to run regsvr32 on LogParser.dll on the machine you are executing on if you have not installed LogParser. The documentation is actually pretty good for getting started.

How to parse javascript variable array embedded in http://up-for-grabs.net/#/?

I am trying to parse http://up-for-grabs.net/#/ to get its content into a CSV file using PowerShell. I have written the code below so far:
$URL = "http://up-for-grabs.net/#/"
$HTML = Invoke-WebRequest -Uri $URL
$script_blocks = $HTML.ParsedHtml.getElementsByTagName("script") | Where{ $_.type -eq 'text/javascript' }
$content = ""
foreach ($script_block in $script_blocks)
{
    if($script_block.innerHTML -ne $null -and `
       $script_block.innerHTML.trim().StartsWith("var files"))
    {
        $content = $script_block.innerHTML.trim()
    }
}
Looking further into the content, it seems to be a variable array embedded in JavaScript whose initial lines are formatted as follows. The original has no spaces or newlines; the line breaks below are mine, added for readability.
<script type="text/javascript">
var files = {
"aspnet-razor-4":{"name":"ASP.NET Razor 4","desc":"Parser and code generator for CSHTML files used in view pages for MVC web apps.","site":"https://github.com/aspnet/Razor","tags":["Microsoft","ASP.NET","Razor","MVC"], "upforgrabs":{"name":"up-for-grabs","link":"https://github.com/aspnet/Razor/labels/up-for-grabs"}},
"fsharpdatadbpedia":{"name":"FSharp.Data.DbPedia","desc":"FSharp.Data.DbPedia - An F# type provider for DBpedia","site":"https://github.com/fsprojects/FSharp.Data.DbPedia","tags":[".NET","DbPedia","F#"],"upforgrabs":{"name":"up-for-grabs","link":"https://github.com/fsprojects/FSharp.Data.DbPedia/labels/up-for-grabs"}},
"makesharp":{"name":"Make#","desc":"Use C# scripts to automate the building process","site":"https://github.com/sapiens/MakeSharp","tags":[".Net","C#","make","build","automation","tools"],"upforgrabs":{"name":"up-for-grabs","link":"https://github.com/sapiens/MakeSharp/labels/up-for-grabs"}},
"stateprinter":{"name":"StatePrinter","desc":"Automating unittest asserts and ToString() coding.","site":"https://github.com/kbilsted/StatePrinter","tags":["TDD","Unit Testing","TDD",".NET","C#","ToString","Debugging"],"upforgrabs":{"name":"Help wanted","link":"https://github.com/kbilsted/StatePrinter/labels/Help%20wanted"}}
</script>
This immediately is followed by
var projects = new Array();
for (var fileName in files) {
projects.push(files[fileName]);
}
How can I achieve similarly quick parsing in PowerShell without writing a lot of string-tokenization code?

After some research, I figured out that this is JSON content, for which the PowerShell cmdlet ConvertFrom-Json should be used. I do not want to copy the whole script here; please look at this GitHub location to see how to use the cmdlet effectively. Basically, you need to remember that the object returned by this cmdlet is a custom object whose properties must be enumerated. It is not an array, so only foreach will uncover the content. A small code sample is below:
$file_json = $file_string | ConvertFrom-Json
$delim = " ; "
foreach ($item in $file_json | gm)
{
    $props = $file_json.$($item.Name)
    if($props.MemberType) {continue}
    $row = $props.name.ToString()
    $row += $delim + $props.desc.ToString()
    $row += $delim + $props.site.ToString()
}
Searching for this cmdlet on web will give you more details on how to deal with this conversion.
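For comparison, the same extraction can be sketched in Python. This is a minimal sketch under assumptions: script_text is a shortened, hypothetical stand-in for the script body in the question (with the closing brace added), and files_to_csv is an illustrative helper name, not part of any library.

```python
import csv
import io
import json

# Hypothetical, shortened stand-in for the <script> body in the question.
script_text = '''
var files = {
"makesharp":{"name":"Make#","desc":"Use C# scripts to automate the building process","site":"https://github.com/sapiens/MakeSharp","tags":[".Net","C#"]},
"stateprinter":{"name":"StatePrinter","desc":"Automating unittest asserts and ToString() coding.","site":"https://github.com/kbilsted/StatePrinter","tags":["TDD",".NET"]}
};
var projects = new Array();
'''

def files_to_csv(text):
    """Parse the object assigned to `var files` and return it as CSV text."""
    start = text.index("var files =") + len("var files =")
    start = text.index("{", start)
    # raw_decode reads one JSON value and ignores the JavaScript after it.
    files, _end = json.JSONDecoder().raw_decode(text, start)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["name", "desc", "site"])
    for project in files.values():
        writer.writerow([project["name"], project["desc"], project["site"]])
    return out.getvalue()

print(files_to_csv(script_text))
```

As in the PowerShell sample, the parsed value is a mapping keyed by project id rather than an array, so the rows come from iterating over its values.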
