Extract feeds from web page


I'm looking for a code snippet (the language is not important) that extracts all feeds (RSS, Atom, etc.) associated with a web page.
The input is a URL and the output is a list of channels.
Completeness is important: if the page has any associated information channel, it should be found.
Ideally, I'd like to know what to look for in the HTML code, and where, so that the search is complete.
Thank you.

You find feeds in the head tag of HTML files. There they should be specified as link tags with an associated content type and an href attribute specifying the feed's location.
To extract all feed URLs from a page using Python you could use something like this:
# Python 3 version (the original snippet used the Python 2 urllib and HTMLParser modules)
from html.parser import HTMLParser
from urllib.request import urlopen
class FeedParser(HTMLParser):
    def __init__(self, *args, **kwargs):
        self.feeds = set()
        HTMLParser.__init__(self, *args, **kwargs)
    def handle_starttag(self, tag, attrs):
        if tag == 'link':
            try:
                href = [attr[1] for attr in attrs if attr[0] == 'href'][0]
            except IndexError:
                # a link tag without href cannot point to a feed
                return None
            else:
                if ('type', 'application/atom+xml') in attrs or ('type', 'application/rss+xml') in attrs:
                    self.feeds.add(href)
def get_all_feeds_from_url(url):
    f = urlopen(url)
    # feed() expects text, so decode the bytes (UTF-8 is assumed here)
    contents = f.read().decode('utf-8', errors='replace')
    f.close()
    parser = FeedParser()
    parser.feed(contents)
    parser.close()
    return list(parser.feeds)
This code would have to be extended quite a bit, though, if you want to cover all the quirky ways a feed can be added to an HTML page.
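As a rough idea of what such an extension might look like, here is a minimal sketch that works on an HTML string instead of fetching a URL. It assumes feeds are announced as link tags with rel="alternate" and an RSS or Atom content type, compares attribute values case-insensitively, and resolves relative href values against the page URL. The names FeedLinkParser and find_feeds are made up for this illustration, and the sketch does not try to cover every quirky case.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Content types treated as feeds in this sketch (an assumption, not a complete list)
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects feed URLs announced through <link rel="alternate" type="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # HTMLParser lowercases attribute names; values can be None
        attr = {name: (value or "") for name, value in attrs}
        rel_tokens = attr.get("rel", "").lower().split()
        content_type = attr.get("type", "").strip().lower()
        href = attr.get("href", "").strip()
        if "alternate" in rel_tokens and content_type in FEED_TYPES and href:
            # Resolve relative links against the URL of the page
            url = urljoin(self.base_url, href)
            if url not in self.feeds:
                self.feeds.append(url)


def find_feeds(html_text, base_url):
    """Return the feed URLs found in html_text, in document order."""
    parser = FeedLinkParser(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.feeds
```

Calling find_feeds(page_source, page_url) with the downloaded page source returns absolute feed URLs, so a relative href such as /feed.xml is still usable.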

Related

Flask returning a generator and handling it in JavaScript [duplicate]

I have a view that generates data and streams it in real time. I can't figure out how to send this data to a variable that I can use in my HTML template. My current solution just outputs the data to a blank page as it arrives, which works, but I want to include it in a larger page with formatting. How do I update, format, and display the data as it is streamed to the page?
import flask
import time, math
app = flask.Flask(__name__)
@app.route('/')
def index():
    def inner():
        # simulate a long process to watch
        for i in range(500):
            j = math.sqrt(i)
            time.sleep(1)
            # this value should be inserted into an HTML template
            yield str(i) + '<br/>\n'
    return flask.Response(inner(), mimetype='text/html')
app.run(debug=True)

You can stream data in a response, but you can't dynamically update a template the way you describe. The template is rendered once on the server side, then sent to the client.
One solution is to use JavaScript to read the streamed response and output the data on the client side. Use XMLHttpRequest to make a request to the endpoint that will stream the data. Then periodically read from the stream until it's done.
This introduces complexity, but allows updating the page directly and gives complete control over what the output looks like. The following example demonstrates that by displaying both the current value and the log of all values.
This example assumes a very simple message format: a single line of data, followed by a newline. This can be as complex as needed, as long as there's a way to identify each message. For example, each loop could return a JSON object which the client decodes.
from math import sqrt
from time import sleep
from flask import Flask, render_template
app = Flask(__name__)
@app.route("/")
def index():
    return render_template("index.html")
@app.route("/stream")
def stream():
    def generate():
        for i in range(500):
            yield "{}\n".format(sqrt(i))
            sleep(1)
    return app.response_class(generate(), mimetype="text/plain")
<p>This is the latest output: <span id="latest"></span></p>
<p>This is all the output:</p>
<ul id="output"></ul>
<script>
    var latest = document.getElementById('latest');
    var output = document.getElementById('output');
    var xhr = new XMLHttpRequest();
    xhr.open('GET', '{{ url_for('stream') }}');
    xhr.send();
    var position = 0;
    function handleNewData() {
        // the response text includes the entire response so far
        // split the messages, then take the messages that haven't been handled yet
        // position tracks how many messages have been handled
        // messages end with a newline, so split will always show one extra empty message at the end
        var messages = xhr.responseText.split('\n');
        messages.slice(position, -1).forEach(function(value) {
            latest.textContent = value;  // update the latest value in place
            // build and append a new item to a list to log all output
            var item = document.createElement('li');
            item.textContent = value;
            output.appendChild(item);
        });
        position = messages.length - 1;
    }
    var timer;
    timer = setInterval(function() {
        // check the response for new data
        handleNewData();
        // stop checking once the response has ended
        if (xhr.readyState == XMLHttpRequest.DONE) {
            clearInterval(timer);
            latest.textContent = 'Done';
        }
    }, 1000);
</script>
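The JSON-per-line message format mentioned above could be sketched roughly as follows. This is only an illustration in Python: generate_messages stands in for the server-side generator, and decode_messages shows the splitting logic a client would apply to the text received so far (in the browser, the same steps would use split and JSON.parse). Both function names and the index/value fields are assumptions made for this sketch.

```python
import json
from math import sqrt


def generate_messages(count):
    """Yield one JSON object per line, the way a streaming view could."""
    for i in range(count):
        yield json.dumps({"index": i, "value": sqrt(i)}) + "\n"


def decode_messages(buffer):
    """Split the received text into complete messages and an unfinished remainder.

    Every complete message ends with a newline, so the last piece after
    splitting is either empty or a message that has not fully arrived yet.
    """
    *complete, remainder = buffer.split("\n")
    return [json.loads(line) for line in complete if line], remainder
```

Keeping the remainder around and prepending it to the next chunk means a message that arrives in two pieces is only decoded once it is complete.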
An <iframe> can be used to display streamed HTML output, but it has some downsides. The frame is a separate document, which increases resource usage. Since it's only displaying the streamed data, it might not be easy to style it like the rest of the page. It can only append data, so long output will render below the visible scroll area. It can't modify other parts of the page in response to each event.
index.html renders the page with a frame pointed at the stream endpoint. The frame has fairly small default dimensions, so you may want to style it further. Use render_template_string, which knows to escape variables, to render the HTML for each item (or use render_template with a more complex template file). An initial line can be yielded to load CSS in the frame first.
from flask import render_template_string, stream_with_context
@app.route("/stream")
def stream():
    @stream_with_context
    def generate():
        yield render_template_string('<link rel=stylesheet href="{{ url_for("static", filename="stream.css") }}">')
        for i in range(500):
            yield render_template_string("<p>{{ i }}: {{ s }}</p>\n", i=i, s=sqrt(i))
            sleep(1)
    return app.response_class(generate())
<p>This is all the output:</p>
<iframe src="{{ url_for("stream") }}"></iframe>

Five years late, but this can actually be done the way you were initially trying to do it; JavaScript is totally unnecessary (Edit: the author of the accepted answer added the iframe section after I wrote this). You just have to embed the output as an <iframe>:
from flask import Flask, render_template, Response
import time, math
app = Flask(__name__)
@app.route('/content')
def content():
    """
    Render the content at a URL different from index
    """
    def inner():
        # simulate a long process to watch
        for i in range(500):
            j = math.sqrt(i)
            time.sleep(1)
            # this value should be inserted into an HTML template
            yield str(i) + '<br/>\n'
    return Response(inner(), mimetype='text/html')
@app.route('/')
def index():
    """
    Render a template at the index. The content will be embedded in this template
    """
    return render_template('index.html.jinja')
app.run(debug=True)
Then the 'index.html.jinja' file will include an <iframe> with the content URL as the src, which would look something like:
<!doctype html>
<head>
    <title>Title</title>
</head>
<body>
    <div>
        <iframe frameborder="0"
                onresize="noresize"
                style='background: transparent; width: 100%; height:100%;'
                src="{{ url_for('content')}}">
        </iframe>
    </div>
</body>
When rendering user-provided data, render_template_string() should be used to render the content to avoid injection attacks. I left this out of the example because it adds complexity, is outside the scope of the question, and isn't relevant to the OP, who isn't streaming user-provided data. It also won't matter to most people reading this post, since streaming user-provided data is an edge case that few people will ever have to handle.

Originally I had a similar problem to the one posted here: a model is being trained, and the status update should stay in one place on the page and be formatted as HTML. The following answer is for future reference, or for people who are trying to solve the same problem and need inspiration.
A good solution is to use an EventSource in JavaScript. The listener can be started based on a context variable, such as one set from a form or another source. The listener is stopped when the server sends a stop command. In this example, a sleep call stands in for real work so that the progress is visible. Lastly, HTML formatting can be achieved using JavaScript DOM manipulation.
Flask Application
import flask
import time
app = flask.Flask(__name__)
@app.route('/learn')
def learn():
    def update():
        yield 'data: Prepare for learning\n\n'
        # Prepare model
        time.sleep(1.0)
        for i in range(1, 101):
            # Perform update
            time.sleep(0.1)
            yield f'data: {i}%\n\n'
        yield 'data: close\n\n'
    return flask.Response(update(), mimetype='text/event-stream')
@app.route('/', methods=['GET', 'POST'])
def index():
    train_model = False
    if flask.request.method == 'POST':
        if 'train_model' in list(flask.request.form):
            train_model = True
    return flask.render_template('index.html', train_model=train_model)
app.run(threaded=True)
HTML Template
<form action="/" method="post">
    <input name="train_model" type="submit" value="Train Model" />
</form>
<p id="learn_output"></p>
{% if train_model %}
<script>
    var target_output = document.getElementById("learn_output");
    var learn_update = new EventSource("/learn");
    learn_update.onmessage = function (e) {
        if (e.data == "close") {
            learn_update.close();
        } else {
            target_output.innerHTML = "Status: " + e.data;
        }
    };
</script>
{% endif %}
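The strings yielded in the Flask application above follow the text/event-stream framing: each message consists of one or more "data:" lines and ends with a blank line. A small helper can build that framing so it is not repeated by hand. The sketch below is an illustration only; format_sse is a made-up name, not part of Flask.

```python
def format_sse(data, event=None):
    """Format one server-sent event message.

    Each line of data gets its own "data:" field, and the message is
    terminated by a blank line, as in the yields of the example above.
    """
    lines = []
    if event is not None:
        # Optional event name; unnamed messages are delivered to onmessage
        lines.append("event: {}".format(event))
    for line in str(data).splitlines() or [""]:
        lines.append("data: {}".format(line))
    return "\n".join(lines) + "\n\n"
```

With this helper, the generator could yield format_sse("{}%".format(i)) instead of building the string inline. Note that a message with an event name is not delivered to the onmessage handler; the client would need addEventListener with that name.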

How to scrape AJAX based websites by using Scrapy and Splash?

I want to make a general scraper that can crawl and scrape all data from any type of website, including AJAX websites. I have searched the internet extensively but could not find any resource that explains how Scrapy and Splash together can scrape AJAX websites (including pagination, form data, and clicking a button before the page is displayed). Every link I have found tells me that JavaScript websites can be rendered using Splash, but there is no good tutorial or explanation about using Splash to render JS websites. Please don't give me solutions that involve using browsers (I want to do everything programmatically; headless browser suggestions are welcome, but I want to use Splash).
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
class FlipSpider(CrawlSpider):
    name = "flip"
    allowed_domains = ["www.amazon.com"]
    start_urls = ['https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=mobile']
    # the trailing comma and closing parenthesis make this a one-element tuple
    rules = (Rule(LinkExtractor(), callback='lol', follow=True),)
    def parse_start_url(self, response):
        yield scrapy.Request(response.url,
                             self.lol,
                             meta={'splash': {'endpoint': 'render.html', 'args': {'wait': 5, 'iframes': 1}}})
    def lol(self, response):
        """
        Some code
        """

The problem with Splash and pagination is the following:
I wasn't able to produce a Lua script that delivers the new web page (after a click on the pagination link) in the form of a response rather than pure HTML.
So my solution is to click the link, extract the newly generated URL, and direct the crawler to this new URL.
On the page that has the pagination link, I execute
yield SplashRequest(url=response.url, callback=self.get_url, endpoint="execute", args={'lua_source': script})
with the following Lua script
def parse_categories(self, response):
    script = """
        function main(splash)
            assert(splash:go(splash.args.url))
            splash:wait(1)
            splash:runjs('document.querySelectorAll(".next-page")[0].click()')
            splash:wait(1)
            return splash:url()
        end
    """
and the get_url function
def get_url(self, response):
    yield SplashRequest(url=response.body_as_unicode(), callback=self.parse_categories)
This way I was able to loop my queries.
In the same way, if you don't expect a new URL, your Lua script can just return pure HTML, which you then have to work through with regex (that is bad), but this is the best I was able to do.

You can emulate behaviors such as a click or a scroll by writing a JavaScript function and telling Splash to execute that script when it renders your page.
A little example:
You define a JavaScript function that selects an element in the page, and then click on it:
(source: Splash docs)
# Get button element dimensions with javascript and perform mouse click.
_script = """
function main(splash)
    assert(splash:go(splash.args.url))
    local get_dimensions = splash:jsfunc([[
        function () {
            var rect = document.getElementById('button').getClientRects()[0];
            return {"x": rect.left, "y": rect.top}
        }
    ]])
    splash:set_viewport_full()
    splash:wait(0.1)
    local dimensions = get_dimensions()
    splash:mouse_click(dimensions.x, dimensions.y)
    -- Wait split second to allow event to propagate.
    splash:wait(0.1)
    return splash:html()
end
"""
Then, when you make the request, you set the endpoint to "execute" and add "lua_source": _script to the args.
Example:
def parse(self, response):
    yield SplashRequest(response.url, self.parse_elem,
                        endpoint="execute",
                        args={"lua_source": _script})
You will find all the information about Splash scripting in the Splash scripting documentation.

I just answered a similar question here: scraping ajax based pagination. My solution is to get the current and last page numbers and then replace the page variable in the request URL.
Another thing you can do is look at the network tab in the browser dev tools and see if you can identify any API that is called. If you look at the requests under XHR, you can see the ones that return JSON.
You can then call the API directly and parse the JSON or HTML response. Here is the link from the Scrapy docs: The Network-tool
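Replacing the page variable in the request URL, as described in this answer, might be sketched like this with the standard library. The query parameter name page, the example URL, and the function name page_urls are assumptions for illustration; a real site may use a different parameter or put the page number in the path.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_urls(url, last_page, param="page", first_page=1):
    """Yield one URL per page by rewriting a single query parameter."""
    parts = urlsplit(url)
    # Keep every query parameter except the page number itself
    other = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != param]
    for number in range(first_page, last_page + 1):
        query = urlencode(other + [(param, str(number))])
        yield urlunsplit(parts._replace(query=query))
```

In a spider, each generated URL could then be passed to a new request once the last page number has been read from the first response.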

Rendering HTML from a python String in web2py, generate @username links python

I am trying to render an anchor link in an HTML file generated server side in web2py
@username
and the link generates correctly; however, when I call it in my view with {{=link}}, the page does not render it as HTML. I have tried using
mystring.decode('utf-8')
and various other conversions. Passing it to JavaScript and back to the page displays the link fine. Is there something specific about Python strings that does not communicate well with HTML?
In the controller, the string is generated by the function call:
# code borrowed from Luca de Alfaro's UCSC CMPS183 class examples
import re
def regex_text(s):
    def makelink(match):
        # The title is the matched phrase @username
        title = match.group(0).strip()
        # The page is the stripped title 'username' in lowercase
        page = match.group(1).lower()
        return '%s' % (A(title, _href=URL('default', 'profile', args=[page])))
    # REGEX (the username pattern) is not shown in the question
    return re.sub(REGEX, makelink, s)
def linkify(s):
    return regex_text(s)
def represent_links(s, v):
    return linkify(s)
which replaces @username with a link to the user's profile, with args(0) = username, and is sent to the view by a controller call
def profile():
    link = linkify(string)
    return dict(link=link)

For security, web2py templates will automatically escape any text inserted via {{=...}}. To disable the escaping, you can wrap the text in the XML() helper:
{{=XML(link)}}
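To see why the link shows up as text, it helps to look at what escaping does to markup. The sketch below uses plain Python and the standard html module rather than web2py's own helpers, so it only illustrates the idea. The MENTION pattern, the /profile/ URL layout, and the function name linkify_escaped are assumptions, since the question does not show its REGEX. Because the unescaped output is no longer protected, the sketch escapes the user's text first and only then inserts the anchor tags.

```python
import html
import re

# Hypothetical mention pattern; the REGEX from the question is not shown
MENTION = re.compile(r"@(\w+)")


def linkify_escaped(text):
    """Escape the raw text, then turn @username mentions into links."""
    escaped = html.escape(text)

    def make_link(match):
        name = match.group(1)
        return '<a href="/profile/{}">@{}</a>'.format(name.lower(), name)

    return MENTION.sub(make_link, escaped)


# Escaping an already built link turns the markup into visible text,
# which is the effect described in the question
escaped_link = html.escape('<a href="/profile/bob">@bob</a>')
```

Here escaped_link holds entity-encoded text that a browser displays literally, while the result of linkify_escaped contains real anchor tags around otherwise escaped user text.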

Rendering Jinja template in Flask following ajax response

This is my first dive into Flask + Jinja, but I've used HandlebarsJS a lot in the past, so I know this is possible but I'm not sure how to pull this off with Flask:
I'm building an app: a user enters a string, which is processed via python script, and the result is ajax'd back to the client/Jinja template.
I can output the result using $("body").append(response) but this would mean I need to write some nasty html within the append.
Instead, I'd like to render another template once the result is processed, and append that new template in the original template.
Is this possible?
My python:
from flask import Flask, render_template, request, jsonify
from script import *
app = Flask(__name__)
@app.route('/')
def index():
    return render_template('index.html')
@app.route('/getColors')
def add_colors():
    user = request.args.get("handle", 0, type = str)
    return jsonify(
        avatar_url = process_data(data)
    )
if __name__ == '__main__':
    app.run()

There is no rule that says your ajax routes have to return JSON; you can return HTML exactly like you do for your regular routes.
@app.route('/getColors')
def add_colors():
    user = request.args.get("handle", 0, type = str)
    return render_template('colors.html',
                           avatar_url=process_data(data))
Your colors.html file does not need to be a complete HTML page; it can be just the snippet of HTML that you want the client to append. All the client then needs to do is append the body of the ajax response to the appropriate element in the DOM.

render html page /#:id using flask

I'm using Flask and want to render html pages and directly focus on a particular dom element using /#:id.
Below is the code for default / rendering
@app.route('/')
def my_form():
    return render_template("my-form.html")
Below is the function being called upon POST request.
@app.route('/', methods=['POST'])
def my_form_post():
    #...code goes here
    return render_template("my-form.html")
I want to render my my-form.html page as my-form.html/#output so that it directly focuses on the desired element in the DOM.
But trying return render_template("my-form.html/#output") fails because there is no such file, and trying @app.route('/#output', methods=['POST']) doesn't work either.
UPDATE:
Consider this web-app JSON2HTML - http://json2html.herokuapp.com/
What is happening: whenever a person clicks the send button, the textarea input and the styling settings are sent over to my Python backend Flask app, as below.
@app.route('/', methods=['POST'])
def my_form_post():
    #...code for converting json 2 html goes here
    return render_template("my-form.html",data = processed_data)
What I want: whenever the send button is clicked, the form data is POSTed and processed, and the same page is rendered again with a new parameter that contains the processed_data to be displayed. My problem is rendering the same page with the fragment identifier #outputTable appended, so that after the conversion the page directly focuses on the output the user wants.

The fragment identifier part of the URL is never sent to the server, so Flask does not have that. This is supposed to be handled in the browser. Within Javascript you can access this element as window.location.hash.
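The split between what the server receives and what stays in the browser can be illustrated with the standard library. This is only a sketch of how a URL is divided; request_target is a made-up helper, and the URL is an example.

```python
from urllib.parse import urlsplit


def request_target(url):
    """Return the part of a URL sent in the HTTP request line, plus the fragment.

    The fragment is returned separately because the browser keeps it
    and never includes it in the request.
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target, parts.fragment
```

For a URL such as http://example.com/route?id=123#outputTable, the first value is what a Flask view can see through the path and request.args, while the second exists only on the client.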
If you need to do your highlighting on the server then you need to look for an alternative way of indicating what to highlight that the server receives so that it can give it to the template. For example:
# focus element in the query string
# http://example.com/route?id=123
@app.route('/route')
def route():
    id = request.args.get('id')
    # id will be the given id or None if not available
    return render_template('my-form.html', id = id)
And here is another way:
# focus element in the URL
# http://example.com/route/123
@app.route('/route')
@app.route('/route/<id>')
def route(id = None):
    # id is sent as an argument
    return render_template('my-form.html', id = id)
Response to the edit: as I said above, the fragment identifier is handled by the web browser; Flask never sees it. To begin, you have to add <a name="outputTable">Output</a> to your template at the point you want to jump to. You then have two options. The hacky one is to write the action attribute of your form so that it includes the fragment. The better choice is to add an onload JavaScript event handler that jumps after the page has loaded.
