Extract 2nd level domain from domain? - Python - javascript

I have a list of domains e.g.
site.co.uk
site.com
site.me.uk
site.jpn.com
site.org.uk
site.it
also the domain names can contain 3rd and 4th level domains e.g.
test.example.site.org.uk
test2.site.com
I need to try and extract the 2nd level domain, in all these cases being site
Any ideas? :)

no way to reliably get that. Subdomains are arbitrary and there is a monster list of domain extensions that grows every day. Best case is you check against the monster list of domain extensions and maintain the list.
list:
http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1

Following #kohlehydrat's suggestion:
import urllib2
class TldMatcher(object):
# use class vars for lazy loading
MASTERURL = "http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1"
TLDS = None
#classmethod
def loadTlds(cls, url=None):
url = url or cls.MASTERURL
# grab master list
lines = urllib2.urlopen(url).readlines()
# strip comments and blank lines
lines = [ln for ln in (ln.strip() for ln in lines) if len(ln) and ln[:2]!='//']
cls.TLDS = set(lines)
def __init__(self):
if TldMatcher.TLDS is None:
TldMatcher.loadTlds()
def getTld(self, url):
best_match = None
chunks = url.split('.')
for start in range(len(chunks)-1, -1, -1):
test = '.'.join(chunks[start:])
startest = '.'.join(['*']+chunks[start+1:])
if test in TldMatcher.TLDS or startest in TldMatcher.TLDS:
best_match = test
return best_match
def get2ld(self, url):
urls = url.split('.')
tlds = self.getTld(url).split('.')
return urls[-1 - len(tlds)]
def test_TldMatcher():
matcher = TldMatcher()
test_urls = [
'site.co.uk',
'site.com',
'site.me.uk',
'site.jpn.com',
'site.org.uk',
'site.it'
]
errors = 0
for u in test_urls:
res = matcher.get2ld(u)
if res != 'site':
print "Error: found '{0}', should be 'site'".format(res)
errors += 1
if errors==0:
print "Passed!"
return (errors==0)

Using python tld
https://pypi.python.org/pypi/tld
$ pip install tld
from tld import get_tld, get_fld
print(get_tld("http://www.google.co.uk"))
'co.uk'
print(get_fld("http://www.google.co.uk"))
'google.co.uk'

Problem in mix of extractions 1st and 2nd level.
Trivial solution...
Build list of possible site suffixes, ordered from narrow to common case.
"co.uk", "uk", "co.jp", "jp", "com"
And check, Can suffix be matched at end of domain. if matched, next part is site.

The only possible way would be via a list with all the top level domains (here like .com or co.uk) possible. Then you would scan through this list and check out. I don't see any other way, at least without accessing the internet at runtime.

#Hugh Bothwell
In your example you are not dealing with special domains like parliament.uk , they are represent in the file with "!" (e.g. !parliament.uk)
I did some changes of your code, also make it looks more like my PHP function I used before.
Also added possibility to load the data from local file.
Also tested it with some domains such:
niki.bg, niki.1.bg
parliament.uk
niki.at, niki.co.at
niki.us, niki.ny.us
niki.museum, niki.national.museum
www.niki.uk - due to "*" in Mozilla's file this is reported as OK.
Feel free to contact me # github so I can add you as co-author there.
GitHub repo is here:
https://github.com/nmmmnu/TLDExtractor/blob/master/TLDExtractor.py

Related

Scraping fanduel with BeautifulSoup, can't find values visible in HTML

I'm trying to scrape lines for a typical baseball game from fanduel using BeautifulSoup but I found (as this person did) that much of the data doesn't show up when I try something standard like
import requests
from bs4 import BeautifulSoup
page = requests.get(<some url>)
soup = BeautifulSoup(page.content, 'html.parser')
I know I can get do Dev Tools -> Network tab -> XHR to get a json with data the site is using, but I'm not able to find the same values I'm seeing in the HTML.
I'll give an example but it probably won't be good after a day since the page will be gone. Here's the page on lines for the Rangers Dodgers game tomorrow. You can click and see that (as of right now) the odds for the Dodgers at -1.5 are -146. I'd like to scrape that number (-146) but I can't find it anywhere in the json data.
Any idea how I can find that sort of thing either in the json or in the HTML? Thanks!
Looks like I offered the solution to the reference link you have there. Those lines are there in the json, it's just in the "raw" form, so you need to calculate it out:
import requests
jsonData = requests.get('https://sportsbook.fanduel.com/cache/psevent/UK/1/false/1027510.3.json').json()
money_line = jsonData['eventmarketgroups'][0]['markets'][1]['selections']
def calc_spread_line(priceUp, priceDown, spread):
if priceDown < priceUp:
line = int((priceUp / priceDown) * 100)
spread = spread*-1
else:
line = int((priceDown / priceUp) * -100)
return line, spread
for each in money_line:
priceUp = each['currentpriceup']
priceDown = each['currentpricedown']
team = each['name']
spread = each['currenthandicap']
line, spread = calc_spread_line(priceUp, priceDown, spread)
print ('%s: %s %s' %(team, spread, line))
Output:
Texas Rangers: 1.5 122
Los Angeles Dodgers: -1.5 -146
Otherwise you could use selenium as suggested and parse the html that way. It would be less efficient though.
This may be happening to you because some web pages loads the elements using java script, in which case the html source you receive using requests may not contain all the elements .You can check this by right-clicking on the page and selecting view source , if the data you require is in that source file you can parse it using Beautiful Soup otherwise in order to get dynamically loaded content I will suggest selenium

Web scraping with BeautifulSoup won't work

Ultimately, I'm trying to open all articles of a news website and then make a top 10 of the words used in all the articles. To do this, I first wanted to see how many articles there are so I could iterate over them at some point, haven't really figured out how I want to do everything yet.
To do this, I wanted to use BeautifulSoup4. I think the class I'm trying to get is Javascript as I'm not getting anything back.
This is my code:
url = "http://ad.nl"
ad = requests.get(url)
soup = BeautifulSoup(ad.text.lower(), "xml")
titels = soup.findAll("article")
print(titels)
for titel in titels:
print(titel)
The article name is sometimes an h2 or an h3. It always has one and the same class, but I can't get anything through that class. It has some parents but it uses the same name but with the extension -wrapper for example. I don't even know how to use a parent to get what I want but I think that those classes are Javascript as well. There's also an href which I'm interested in. But once again, that is probably also Javascript as it returns nothing.
Does anyone know how I could use anything (preferably the href, but the article name would be ok as well) by using BeautifulSoup?
In case you don't want to use selenium. This works for me. I've tried on 2 PCs with different internet connection. Can you try?
from bs4 import BeautifulSoup
import requests
cookies={"pwv":"2",
"pws":"functional|analytics|content_recommendation|targeted_advertising|social_media"}
page=requests.get("https://www.ad.nl/",cookies=cookies)
soup = BeautifulSoup(page.content, 'html.parser')
articles = soup.findAll("article")
Then follow kimbo's code to extract h2/h3.
As #Sri mentioned in the comments, when you open up that url, you have a page show up where you have to accept the cookies first, which requires interaction.
When you need interaction, consider using something like selenium (https://selenium-python.readthedocs.io/).
Here's something that should get you started.
(Edit: you'll need to run pip install selenium before running this code below)
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
url = 'https://ad.nl'
# launch firefox with your url above
# note that you could change this to some other webdriver (e.g. Chrome)
driver = webdriver.Firefox()
driver.get(url)
# click the "accept cookies" button
btn = driver.find_element_by_name('action')
btn.click()
# grab the html. It'll wait here until the page is finished loading
html = driver.page_source
# parse the html soup
soup = BeautifulSoup(html.lower(), "html.parser")
articles = soup.findAll("article")
for article in articles:
# check for article titles in both h2 and h3 elems
h2_titles = article.findAll('h2', {'class': 'ankeiler__title'})
h3_titles = article.findAll('h3', {'class': 'ankeiler__title'})
for t in h2_titles:
# first I was doing print(t.text), but some of them had leading
# newlines and things like '22:30', which I assume was the hour of the day
text = ''.join(t.findAll(text=True, recursive=False)).lstrip()
print(text)
for t in h3_titles:
text = ''.join(t.findAll(text=True, recursive=False)).lstrip()
print(text)
# close the browser
driver.close()
This may or may not be exactly what you have in mind, but this is just an example of how to use selenium and beautiful soup. Feel free to copy/use/modify this as you see fit.
And if you're wondering about what selectors to use, read the comment by #JL Peyret.

Get Jupyter notebook name [duplicate]

I am trying to obtain the current NoteBook name when running the IPython notebook. I know I can see it at the top of the notebook. What I am after something like
currentNotebook = IPython.foo.bar.notebookname()
I need to get the name in a variable.
adding to previous answers,
to get the notebook name run the following in a cell:
%%javascript
IPython.notebook.kernel.execute('nb_name = "' + IPython.notebook.notebook_name + '"')
this gets you the file name in nb_name
then to get the full path you may use the following in a separate cell:
import os
nb_full_path = os.path.join(os.getcwd(), nb_name)
I have the following which works with IPython 2.0. I observed that the name of the notebook is stored as the value of the attribute 'data-notebook-name' in the <body> tag of the page. Thus the idea is first to ask Javascript to retrieve the attribute --javascripts can be invoked from a codecell thanks to the %%javascript magic. Then it is possible to access to the Javascript variable through a call to the Python Kernel, with a command which sets a Python variable. Since this last variable is known from the kernel, it can be accessed in other cells as well.
%%javascript
var kernel = IPython.notebook.kernel;
var body = document.body,
attribs = body.attributes;
var command = "theNotebook = " + "'"+attribs['data-notebook-name'].value+"'";
kernel.execute(command);
From a Python code cell
print(theNotebook)
Out[ ]: HowToGetTheNameOfTheNoteBook.ipynb
A defect in this solution is that when one changes the title (name) of a notebook, then this name seems to not be updated immediately (there is probably some kind of cache) and it is necessary to reload the notebook to get access to the new name.
[Edit] On reflection, a more efficient solution is to look for the input field for notebook's name instead of the <body> tag. Looking into the source, it appears that this field has id "notebook_name". It is then possible to catch this value by a document.getElementById() and then follow the same approach as above. The code becomes, still using the javascript magic
%%javascript
var kernel = IPython.notebook.kernel;
var thename = window.document.getElementById("notebook_name").innerHTML;
var command = "theNotebook = " + "'"+thename+"'";
kernel.execute(command);
Then, from a ipython cell,
In [11]: print(theNotebook)
Out [11]: HowToGetTheNameOfTheNoteBookSolBis
Contrary to the first solution, modifications of notebook's name are updated immediately and there is no need to refresh the notebook.
As already mentioned you probably aren't really supposed to be able to do this, but I did find a way. It's a flaming hack though so don't rely on this at all:
import json
import os
import urllib2
import IPython
from IPython.lib import kernel
connection_file_path = kernel.get_connection_file()
connection_file = os.path.basename(connection_file_path)
kernel_id = connection_file.split('-', 1)[1].split('.')[0]
# Updated answer with semi-solutions for both IPython 2.x and IPython < 2.x
if IPython.version_info[0] < 2:
## Not sure if it's even possible to get the port for the
## notebook app; so just using the default...
notebooks = json.load(urllib2.urlopen('http://127.0.0.1:8888/notebooks'))
for nb in notebooks:
if nb['kernel_id'] == kernel_id:
print nb['name']
break
else:
sessions = json.load(urllib2.urlopen('http://127.0.0.1:8888/api/sessions'))
for sess in sessions:
if sess['kernel']['id'] == kernel_id:
print sess['notebook']['name']
break
I updated my answer to include a solution that "works" in IPython 2.0 at least with a simple test. It probably isn't guaranteed to give the correct answer if there are multiple notebooks connected to the same kernel, etc.
It seems I cannot comment, so I have to post this as an answer.
The accepted solution by #iguananaut and the update by #mbdevpl appear not to be working with recent versions of the Notebook.
I fixed it as shown below. I checked it on Python v3.6.1 + Notebook v5.0.0 and on Python v3.6.5 and Notebook v5.5.0.
import jupyterlab
if jupyterlab.__version__.split(".")[0] == "3":
from jupyter_server import serverapp as app
key_srv_directory = 'root_dir'
else :
from notebook import notebookapp as app
key_srv_directory = 'notebook_dir'
import urllib
import json
import os
import ipykernel
def notebook_path(key_srv_directory, ):
"""Returns the absolute path of the Notebook or None if it cannot be determined
NOTE: works only when the security is token-based or there is also no password
"""
connection_file = os.path.basename(ipykernel.get_connection_file())
kernel_id = connection_file.split('-', 1)[1].split('.')[0]
for srv in app.list_running_servers():
try:
if srv['token']=='' and not srv['password']: # No token and no password, ahem...
req = urllib.request.urlopen(srv['url']+'api/sessions')
else:
req = urllib.request.urlopen(srv['url']+'api/sessions?token='+srv['token'])
sessions = json.load(req)
for sess in sessions:
if sess['kernel']['id'] == kernel_id:
return os.path.join(srv[key_srv_directory],sess['notebook']['path'])
except:
pass # There may be stale entries in the runtime directory
return None
As stated in the docstring, this works only when either there is no authentication or the authentication is token-based.
Note that, as also reported by others, the Javascript-based method does not seem to work when executing a "Run all cells" (but works when executing cells "manually"), which was a deal-breaker for me.
The ipyparams package can do this pretty easily.
import ipyparams
currentNotebook = ipyparams.notebook_name
On Jupyter 3.0 the following works. Here I'm showing the entire path on the Jupyter server, not just the notebook name:
To store the NOTEBOOK_FULL_PATH on the current notebook front end:
%%javascript
var nb = IPython.notebook;
var kernel = IPython.notebook.kernel;
var command = "NOTEBOOK_FULL_PATH = '" + nb.base_url + nb.notebook_path + "'";
kernel.execute(command);
To then display it:
print("NOTEBOOK_FULL_PATH:\n", NOTEBOOK_FULL_PATH)
Running the first Javascript cell produces no output.
Running the second Python cell produces something like:
NOTEBOOK_FULL_PATH:
/user/zeph/GetNotebookName.ipynb
Yet another hacky solution since my notebook server can change. Basically you print a random string, save it and then search for a file containing that string in the working directory. The while is needed because save_checkpoint is asynchronous.
from time import sleep
from IPython.display import display, Javascript
import subprocess
import os
import uuid
def get_notebook_path_and_save():
magic = str(uuid.uuid1()).replace('-', '')
print(magic)
# saves it (ctrl+S)
display(Javascript('IPython.notebook.save_checkpoint();'))
nb_name = None
while nb_name is None:
try:
sleep(0.1)
nb_name = subprocess.check_output(f'grep -l {magic} *.ipynb', shell=True).decode().strip()
except:
pass
return os.path.join(os.getcwd(), nb_name)
There is no real way yet to do this in Jupyterlab. But there is an official way that's now under active discussion/development as of August 2021:
https://github.com/jupyter/jupyter_client/pull/656
In the meantime, hitting the api/sessions REST endpoint of jupyter_server seems like the best bet. Here's a cleaned-up version of that approach:
from jupyter_server import serverapp
from jupyter_server.utils import url_path_join
from pathlib import Path
import re
import requests
kernelIdRegex = re.compile(r"(?<=kernel-)[\w\d\-]+(?=\.json)")
def getNotebookPath():
kernelId = kernelIdRegex.search(get_ipython().config["IPKernelApp"]["connection_file"])[0]
for jupServ in serverapp.list_running_servers():
for session in requests.get(url_path_join(jupServ["url"], "api/sessions"), params={"token": jupServ["token"]}).json():
if kernelId == session["kernel"]["id"]:
return Path(jupServ["root_dir"]) / session["notebook"]['path']
Tested working with
python==3.9
jupyter_server==1.8.0
jupyterlab==4.0.0a7
Modifying #jfb method, gives the function below which worked fine on ipykernel-5.3.4.
def getNotebookName():
display(Javascript('IPython.notebook.kernel.execute("NotebookName = " + "\'"+window.document.getElementById("notebook_name").innerHTML+"\'");'))
try:
_ = type(NotebookName)
return NotebookName
except:
return None
Note that the display javascript will take some time to reach the browser, and it will take some time to execute the JS and get back to the kernel. I know it may sound stupid, but it's better to run the function in two cells, like this:
nb_name = getNotebookName()
and in the following cell:
for i in range(10):
nb_name = getNotebookName()
if nb_name is not None:
break
However, if you don't need to define a function, the wise method is to run display(Javascript(..)) in one cell, and check the notebook name in another cell. In this way, the browser has enough time to execute the code and return the notebook name.
If you don't mind to use a library, the most robust way is:
import ipynbname
nb_name = ipynbname.name()
If you are using Visual Studio Code:
import IPython ; IPython.extract_module_locals()[1]['__vsc_ipynb_file__']
Assuming you have the Jupyter Notebook server's host, port, and authentication token, this should work for you. It's based off of this answer.
import os
import json
import posixpath
import subprocess
import urllib.request
import psutil
def get_notebook_path(host, port, token):
process_id = os.getpid();
notebooks = get_running_notebooks(host, port, token)
for notebook in notebooks:
if process_id in notebook['process_ids']:
return notebook['path']
def get_running_notebooks(host, port, token):
sessions_url = posixpath.join('http://%s:%d' % (host, port), 'api', 'sessions')
sessions_url += f'?token={token}'
response = urllib.request.urlopen(sessions_url).read()
res = json.loads(response)
notebooks = [{'kernel_id': notebook['kernel']['id'],
'path': notebook['notebook']['path'],
'process_ids': get_process_ids(notebook['kernel']['id'])} for notebook in res]
return notebooks
def get_process_ids(name):
child = subprocess.Popen(['pgrep', '-f', name], stdout=subprocess.PIPE, shell=False)
response = child.communicate()[0]
return [int(pid) for pid in response.split()]
Example usage:
get_notebook_path('127.0.0.1', 17004, '344eb91bee5742a8501cc8ee84043d0af07d42e7135bed90')
To realize why you can't get notebook name using these JS-based solutions, run this code and notice the delay it takes for the message box to appear after python has finished execution of the cell / entire notebook:
%%javascript
function sayHello() {
alert('Hello world!');
}
setTimeout(sayHello, 1000);
More info
Javascript calls are async and hence not guaranteed to complete before python starts running another cell containing the code expecting this notebook name variable to be already created... resulting in NameError when trying to access non-existing variables that should contain notebook name.
I suspect some upvotes on this page became locked before voters could discover that all %%javascript-based solutions ultimately don't work... when the producer and consumer notebook cells are executed together (or in a quick succession).
All Json based solutions fail if we execute more than one cell at a time
because the result will not be ready until after the end of the execution
(its not a matter of using sleep or waiting any time, check it yourself but remember to restart kernel and run all every test)
Based on previous solutions, this avoids using the %% magic in case you need to put it in the middle of some other code:
from IPython.display import display, Javascript
# can have comments here :)
js_cmd = 'IPython.notebook.kernel.execute(\'nb_name = "\' + IPython.notebook.notebook_name + \'"\')'
display(Javascript(js_cmd))
For python 3, the following based on the answer by #Iguananaut and updated for latest python and possibly multiple servers will work:
import os
import json
try:
from urllib2 import urlopen
except:
from urllib.request import urlopen
import ipykernel
connection_file_path = ipykernel.get_connection_file()
connection_file = os.path.basename(connection_file_path)
kernel_id = connection_file.split('-', 1)[1].split('.')[0]
running_servers = !jupyter notebook list
running_servers = [s.split('::')[0].strip() for s in running_servers[1:]]
nb_name = '???'
for serv in running_servers:
uri_parts = serv.split('?')
uri_parts[0] += 'api/sessions'
sessions = json.load(urlopen('?'.join(uri_parts)))
for sess in sessions:
if sess['kernel']['id'] == kernel_id:
nb_name = os.path.basename(sess['notebook']['path'])
break
if nb_name != '???':
break
print (f'[{nb_name}]')
just use ipynbname , which is practical
import ipynbname
nb_fname = ipynbname.name()
nb_path = ipynbname.path()
print(f"{nb_fname=}")
print(f"{nb_path=}")
I found this in https://stackoverflow.com/a/65907473/15497427

How to yield fragment URLs in scrapy using Selenium?

from my poor knowledge about webscraping I've come about to find a very complex issue for me, that I will try to explain the best I can (hence I'm opened to suggestions or edits in my post).
I started using the web crawling framework 'Scrapy' long ago to make my webscraping, and it's still the one that I use nowadays. Lately, I came across this website, and found that my framework (Scrapy) was not able to iterate over the pages since this website uses Fragment URLs (#) to load the data (the next pages). Then I made a post about that problem (having no idea of the main problem yet): my post
After that, I realized that my framework can't make it without a JavaScript interpreter or a browser imitation, so they mentioned the Selenium library. I read as much as I could about that library (i.e. example1, example2, example3 and example4). I also found this StackOverflow's post that gives some tracks about my issue.
So Finally, my biggest questions are:
1 - Is there any way to iterate/yield over the pages from the website shown above, using Selenium along with scrapy?
So far, this is the code I'm using, but doesn't work...
EDIT:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# The require imports...
def getBrowser():
path_to_phantomjs = "/some_path/phantomjs-2.1.1-macosx/bin/phantomjs"
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
"(KHTML, like Gecko) Chrome/15.0.87")
browser = webdriver.PhantomJS(executable_path=path_to_phantomjs, desired_capabilities=dcap)
return browser
class MySpider(Spider):
name = "myspider"
browser = getBrowser()
def start_requests(self):
the_url = "http://www.atraveo.com/es_es/islas_canarias#eyJkYXRhIjp7ImNvdW50cnlJZCI6IkVTIiwicmVnaW9uSWQiOiI5MjAiLCJkdXJhdGlvbiI6NywibWluUGVyc29ucyI6MX0sImNvbmZpZyI6eyJwYWdlIjoiMCJ9fQ=="
yield scrapy.Request(url=the_url, callback=self.parse, dont_filter=True)
def parse(self, response):
self.get_page_links()
def get_page_links(self):
""" This first part, goes through all available pages """
for i in xrange(1, 3): # 210
new_data = {"data": {"countryId": "ES", "regionId": "920", "duration": 7, "minPersons": 1},
"config": {"page": str(i)}}
json_data = json.dumps(new_data)
new_url = "http://www.atraveo.com/es_es/islas_canarias#" + base64.b64encode(json_data)
self.browser.get(new_url)
print "\nThe new URL is -> ", new_url, "\n"
content = self.browser.page_source
self.get_item_links(content)
def get_item_links(self, body=""):
if body:
""" This second part, goes through all available items """
raw_links = re.findall(r'listclickable.+?>', body)
links = []
if raw_links:
for raw_link in raw_links:
new_link = re.findall(r'data-link=\".+?\"', raw_link)[0].replace("data-link=\"", "").replace("\"",
"")
links.append(str(new_link))
if links:
ids = self.get_ids(links)
for link in links:
current_id = self.get_single_id(link)
print "\nThe Link -> ", link
# If commented the line below, code works, doesn't otherwise
yield scrapy.Request(url=link, callback=self.parse_room, dont_filter=True)
def get_ids(self, list1=[]):
if list1:
ids = []
for elem in list1:
raw_id = re.findall(r'/[0-9]+', elem)[0].replace("/", "")
ids.append(raw_id)
return ids
else:
return []
def get_single_id(self, text=""):
if text:
raw_id = re.findall(r'/[0-9]+', text)[0].replace("/", "")
return raw_id
else:
return ""
def parse_room(self, response):
# More scraping code...
So this is mainly my problem. I'm almost sure that what I'm doing isn't the best way, so for that I did my second question. And to avoid having to do these kind of issues in the future, I did my third question.
2 - If the answer to the first question is negative, how could I tackle this issue? I'm opened to another means, otherwise
3 - Can anyone tell me or show me pages where I can learn how to solve/combine webscraping along javaScript and Ajax? Nowadays are more the websites that use JavaScript and Ajax scripts to load content
Many thanks in advance!
Selenium is one of the best tools to scrape dynamic data.you can use selenium with any web browser to fetch the data that is loading from scripts.That works exactly like the browser click operations.But I am not prefering it.
For getting dynamic data you can use scrapy + splash combo. From scrapy you wil get all the static data and splash for other dynamic contents.
Have you looked into BeautifulSoup? It's a very popular web scraping library for python. As for JavaScript, I would recommend something like Cheerio (If you're asking for a scraping library in JavaScript)
If you are meaning that the website uses HTTP requests to load content, you could always try to manipulate that manually with something like the requests library.
Hope this helps
You can definitely use Selenium as a standalone to scrap webpages with dynamic content (like AJAX loading).
Selenium will just rely on a WebDriver (basically a web browser) to seek content over the Internet.
Here are a few of them (but the most often used) :
ChromeDriver
PhantomJS (my favorite)
Firefox
Once your started, you can start your bot and parse the html content of the webpage.
I included a minimal working example below using Python and ChromeDriver :
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome(executable_path='chromedriver')
driver.get('https://www.google.com')
# Then you can search for any element you want on the webpage
search_bar = driver.find_element(By.CLASS_NAME, 'tsf-p')
search_bar.click()
driver.close()
See the documentation for more details !

Best practice for localization and globalization of strings and labels [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 8 years ago.
Improve this question
I'm a member of a team with more than 20 developers. Each developer works on a separate module (something near 10 modules). In each module we might have at least 50 CRUD forms, which means that we currently have near 500 add buttons, save buttons, edit buttons, etc.
However, because we want to globalized our application, we need to be able to translate texts in our application. For example, everywhere, the word add should become ajouter for French users.
What we've done till now, is that for each view in UI or Presentation Layer, we have a dictionary of key/value pairs of translations. Then while rendering the view, we translate required texts and strings using this dictionary. However, this approach, we've come to have something near 500 add in 500 dictionaries. This means that we've breached DRY principal.
On the other hand, if we centralize common strings, like putting add in one place, and ask developers to use it everywhere, we encounter the problem of not being sure if a string is already defined in the centralized dictionary or not.
One other options might be to have no translation dictionary and use online translation services like Google Translate, Bing Translator, etc.
Another problem that we've encountered is that some developers under the stress of delivering the project on-time can't remember the translation keys. For example, for the text of the add button, a developer has used add while another developer has used new, etc.
What is the best practice, or most well-known method for globalization and localization of string resources of an application?
As far as I know, there's a good library called localeplanet for Localization and Internationalization in JavaScript. Furthermore, I think it's native and has no dependencies to other libraries (e.g. jQuery)
Here's the website of library: http://www.localeplanet.com/
Also look at this article by Mozilla, you can find very good method and algorithms for client-side translation: http://blog.mozilla.org/webdev/2011/10/06/i18njs-internationalize-your-javascript-with-a-little-help-from-json-and-the-server/
The common part of all those articles/libraries is that they use a i18n class and a get method (in some ways also defining an smaller function name like _) for retrieving/converting the key to the value. In my explaining the key means that string you want to translate and the value means translated string.
Then, you just need a JSON document to store key's and value's.
For example:
var _ = document.webL10n.get;
alert(_('test'));
And here the JSON:
{ test: "blah blah" }
I believe using current popular libraries solutions is a good approach.
When you’re faced with a problem to solve (and frankly, who isn’t
these days?), the basic strategy usually taken by we computer people
is called “divide and conquer.” It goes like this:
Conceptualize the specific problem as a set of smaller sub-problems.
Solve each smaller problem.
Combine the results into a solution of the specific problem.
But “divide and conquer” is not the only possible strategy. We can also take a more generalist approach:
Conceptualize the specific problem as a special case of a more general problem.
Somehow solve the general problem.
Adapt the solution of the general problem to the specific problem.
- Eric Lippert
I believe many solutions already exist for this problem in server-side languages such as ASP.Net/C#.
I've outlined some of the major aspects of the problem
Issue: We need to load data only for the desired language
Solution: For this purpose we save data to a separate files for each language
ex. res.de.js, res.fr.js, res.en.js, res.js(for default language)
Issue: Resource files for each page should be separated so we only get the data we need
Solution: We can use some tools that already exist like
https://github.com/rgrove/lazyload
Issue: We need a key/value pair structure to save our data
Solution: I suggest a javascript object instead of string/string air.
We can benefit from the intellisense from an IDE
Issue: General members should be stored in a public file and all pages should access them
Solution: For this purpose I make a folder in the root of web application called Global_Resources and a folder to store global file for each sub folders we named it 'Local_Resources'
Issue: Each subsystems/subfolders/modules member should override the Global_Resources members on their scope
Solution: I considered a file for each
Application Structure
root/
Global_Resources/
default.js
default.fr.js
UserManagementSystem/
Local_Resources/
default.js
default.fr.js
createUser.js
Login.htm
CreateUser.htm
The corresponding code for the files:
Global_Resources/default.js
var res = {
Create : "Create",
Update : "Save Changes",
Delete : "Delete"
};
Global_Resources/default.fr.js
var res = {
Create : "créer",
Update : "Enregistrer les modifications",
Delete : "effacer"
};
The resource file for the desired language should be loaded on the page selected from Global_Resource - This should be the first file that is loaded on all the pages.
UserManagementSystem/Local_Resources/default.js
res.Name = "Name";
res.UserName = "UserName";
res.Password = "Password";
UserManagementSystem/Local_Resources/default.fr.js
res.Name = "nom";
res.UserName = "Nom d'utilisateur";
res.Password = "Mot de passe";
UserManagementSystem/Local_Resources/createUser.js
// Override res.Create on Global_Resources/default.js
res.Create = "Create User";
UserManagementSystem/Local_Resources/createUser.fr.js
// Override Global_Resources/default.fr.js
res.Create = "Créer un utilisateur";
manager.js file (this file should be load last)
res.lang = "fr";
var globalResourcePath = "Global_Resources";
var resourceFiles = [];
var currentFile = globalResourcePath + "\\default" + res.lang + ".js" ;
if(!IsFileExist(currentFile))
currentFile = globalResourcePath + "\\default.js" ;
if(!IsFileExist(currentFile)) throw new Exception("File Not Found");
resourceFiles.push(currentFile);
// Push parent folder on folder into folder
foreach(var folder in parent folder of current page)
{
currentFile = folder + "\\Local_Resource\\default." + res.lang + ".js";
if(!IsExist(currentFile))
currentFile = folder + "\\Local_Resource\\default.js";
if(!IsExist(currentFile)) throw new Exception("File Not Found");
resourceFiles.push(currentFile);
}
for(int i = 0; i < resourceFiles.length; i++) { Load.js(resourceFiles[i]); }
// Get current page name
var pageNameWithoutExtension = "SomePage";
currentFile = currentPageFolderPath + pageNameWithoutExtension + res.lang + ".js" ;
if(!IsExist(currentFile))
currentFile = currentPageFolderPath + pageNameWithoutExtension + ".js" ;
if(!IsExist(currentFile)) throw new Exception("File Not Found");
Hope it helps :)
jQuery.i18n is a lightweight jQuery plugin for enabling internationalization in your web pages. It allows you to package custom resource strings in ‘.properties’ files, just like in Java Resource Bundles. It loads and parses resource bundles (.properties) based on provided language or language reported by browser.
to know more about this take a look at the How to internationalize your pages using JQuery?

Categories

Resources