Cross-Domain RSS Feed Request? - javascript

OK, so for about a week now I've been doing tons of research on making XMLHttpRequests to servers. I've learned a lot about CORS, AJAX/jQuery requests, and the Google Feed API, and I am still completely lost.
The Goal:
There are two sites in the picture, both of which I have access to: the first is a WordPress site that has the RSS feed, and the other is my localhost site running off of XAMPP (soon to be a published site when I'm done). I am trying to get the RSS feed from the WordPress site and display it on my localhost site.
The Issue:
I run into the infamous Access-Control-Allow-Origin error in the console. I know I can fix that by setting the header in the .htaccess file of the WordPress site, but there are online aggregators that can just read and display the feed when I give them the link. So I don't really know what those sites are doing that I'm not, and what the best way is to achieve this without opening any easy security holes on either site.
I would strongly prefer not to use any third-party plugins for this; I would like to aggregate the feed through my own code, as I have done for an RSS feed on the localhost site, but if I have to, I will.
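For reference, if I did go the CORS route, I understand the WordPress site would need to send the Access-Control-Allow-Origin header; something roughly like this from PHP (a sketch only, using WordPress's send_headers hook instead of .htaccess, and the origin value is a placeholder):

// Sketch only: send the CORS header from the WordPress side so the browser
// lets my other origin read the feed via XMLHttpRequest.
// 'http://localhost' is a placeholder; the real origin would go here instead of '*'.
add_action('send_headers', function () {
    header('Access-Control-Allow-Origin: http://localhost');
});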
UPDATE:
I've made huge progress learning PHP and finally have a working bit of code that downloads the feed files from their various sources and stores them in cache files on the server. What I have done is put an AJAX request behind some buttons on my site that switch between the RSS feeds. The AJAX request POSTs a JSON-encoded array of data to my PHP file, which downloads the requested feed via cURL (the http_get_contents function is copied from a GitHub dev, as I don't know how to use cURL yet) and stores it in a cache file named with the md5 hash of the URL, then filters what I need from the data and sends it back to the front end. However, I now have two more questions... (it's funny how that works: getting one answer and ending up with two more questions).
Question #1: Where should I store the cache files and the PHP files on the server? I've heard that you are supposed to store them below the web root, but I am not sure how to access them from there.
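From what I've read so far, I think the layout would be something like this (the paths are made up and just for illustration), with the cache directory kept out of the web root but still reachable through the filesystem:

// Hypothetical layout:
//   /project/cache/          <- not reachable by any URL
//   /project/htdocs/         <- document root (the public site)
//   /project/htdocs/feed.php <- the AJAX endpoint below
// feed.php can still reach the cache directory through the filesystem:
$cacheDir = __DIR__ . '/../cache';
if (!is_dir($cacheDir)) {
    mkdir($cacheDir, 0755, true);
}
$cache = $cacheDir . '/' . md5($url) . '.xml';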
Question #2: When I look at the site's sources in the browser as I click the buttons that send an AJAX request to the PHP file, the PHP file shows up in the list of source files, and more and more copies of it are downloaded as I keep clicking the buttons. Is there a way to prevent this? I may have to implement another method to get this working.
Here is my working PHP:
<?php
// cURL http_get_contents declaration
function http_get_contents($url, $opts = array()) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    curl_setopt($ch, CURLOPT_USERAGENT, "{$_SERVER['SERVER_NAME']}");
    curl_setopt($ch, CURLOPT_URL, $url);
    if (is_array($opts) && $opts) {
        foreach ($opts as $key => $val) {
            curl_setopt($ch, $key, $val);
        }
    }
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    if (false === ($retval = curl_exec($ch))) {
        die(curl_error($ch));
    } else {
        return $retval;
    }
}

// receive and decode the $_POSTed array
$post = json_decode($_POST['jsonString'], true);
$url = $post[0];
$xmn = $post[1]; // starting item index (e.g. to return 3 items from the feed, starting with the 5th one)
$xmx = $xmn + 3; // maximum index (so three items in total are returned)

$cache = '/tmp/' . md5($url) . '.html';
$cacheint = 0; // cache lifetime in hours: this controls whether the feed is downloaded from its
               // source site or read from the cache file. I will implement a way to check for a
               // newer version of the file on the other site in the future.

// if the cache file doesn't exist or has expired, download the feed and write it to the cache file
if (!file_exists($cache) || ((time() - filemtime($cache)) > 3600 * $cacheint)) {
    if ($feed_content = http_get_contents($url)) { // download once, and only overwrite the cache on success
        $fp = fopen($cache, 'w');
        fwrite($fp, $feed_content);
        fclose($fp);
    }
}

// parse and echo the results
$content = file_get_contents($cache);
$x = new SimpleXmlElement($content);
$item = $x->channel->item;

echo '<tr>';
for ($i = $xmn; $i < $xmx; $i++) {
    echo '<td class="item"><p class="title clear">' .
        $item[$i]->title .
        '</p><p class="desc">' .
        substr($item[$i]->description, 0, 250) .
        '... <a href="' .
        $item[$i]->link .
        '" target="_blank">more</a></p><p class="date">' .
        $item[$i]->pubDate .
        '</p></td>';
}
echo '</tr>';
?>

Related

PHP Curl - Scraping from data from site that has data in window.open created by Javascript

I am attempting to scrape data from a new window (opened with JavaScript's window.open()) that is generated by the site I am POSTing to via cURL, but I am unsure how to go about this.
The target site only generates the data I need when certain parameters are POSTed to it, and in no other way.
The following code simply dumps the result of the cURL request, but the result does not contain any of the relevant data.
My code:
//build post data for the request
$proofData = array(
    "formula" => $formula,
    "proof"   => $proof,
    "action"  => $action
);
$postProofData = http_build_query($proofData);
$ch = curl_init($url); //open connection
//set curl options for the request
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postProofData);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the response instead of printing it directly
//obtain data from LigLab
$result = curl_exec($ch);
//finish connection
curl_close($ch);
echo "formula: " . $formula;
var_dump($result);
The following is the code that the target site generates:
var proof = "<?php echo str_replace("\n","|",$annoted_proof) ?>";
var lines = proof.split('|');
proof_window=window.open("","Proof and Justifications","scrollbar=yes,resizable=yes, titlebar=yes,menubar=yes,status=yes,width= 800, height=800, alwaysRaised=yes");
for (var i = 0; i < lines.length; i++) {
    proof_window.document.write(lines[i]);
    proof_window.document.write("\n");
}
I want to scrape the lines variable but it is generated after page load and after user interaction.
You can't parse JavaScript-generated content with cURL; cURL only gives you the raw HTML, before any scripts run.
You have to use a headless browser, which emulates a real browser with events (clicks, hovers) and actually executes the JavaScript.
You can start here: http://www.simpletest.org/en/browser_documentation.html or here: PHP Headless Browser?
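For example, here is a rough sketch with symfony/panther (one headless-browser option for PHP); treat the URL and the flow as assumptions, since I don't know the real page:

// composer require symfony/panther   (needs chromedriver available on the PATH)
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\Panther\Client;

// Load the page in a real (headless) Chrome so its JavaScript actually runs,
// then read the generated variable back out of the page.
$client = Client::createChromeClient();
$client->request('GET', 'https://example.com/proof-page'); // placeholder URL

// The data lives in the JS variable "proof", so ask the browser for it directly.
$proof = $client->executeScript('return typeof proof !== "undefined" ? proof : null;');

var_dump($proof);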

PHP and AJAX download of a few-MB file freezes website

Hello, I've searched everywhere to find the answer, but none of the solutions I've tried have helped.
What I am building is a site that connects to YouTube to let users search for and download videos as MP3 files. I have built the site with the search etc., but I am having a problem with the download part (I've worked out how to get the YouTube audio file). The audio is originally audio/mp4, so I need to convert it to MP3, but first I need to get the file onto the server.
So on the download page I've made a script that sends an AJAX request to the server to start downloading the file. It then polls a different page every few seconds to find out the progress and update it on the page the user is viewing.
However, the problem is that while the video is downloading, the whole website freezes (none of the pages load until the file has fully downloaded), so when the script tries to find out the progress, it can't until the download is completely done.
The file that does the downloading:
<?php
session_start();
if (isset($_GET['yt_vid']) && isset($_GET['yrt'])) {
    set_time_limit(0); // prevent the script from stopping execution
    include "assets/functions.php";
    define('CHUNK', (1024 * 8 * 1024));
    if ($_GET['yrt'] == "gphj") {
        $vid = $_GET['yt_vid'];
        $mdvid = md5($vid);
        if (!file_exists("assets/videos/" . $mdvid . ".mp4")) { // check if the file already exists; if not, proceed to download it
            $url = urlScraper($vid); // urlScraper() gets the audio file URL; it sends a simple cURL request and takes less than a second to complete
            if (!isset($_SESSION[$mdvid])) {
                $_SESSION[$mdvid] = array(time(), 0, retrieve_remote_file_size($url));
            }
            $file = fopen($url, "rb");
            $localfile_name = "assets/videos/" . $mdvid . ".mp4"; // the file is stored on the server so it doesn't have to be downloaded every time
            $localfile = fopen($localfile_name, "w");
            $time = time();
            while (!feof($file)) {
                $_SESSION[$mdvid][1] = (int)$_SESSION[$mdvid][1] + 1;
                file_put_contents($localfile_name, fread($file, CHUNK), FILE_APPEND);
            }
            echo "Execution time: " . (time() - $time);
            fclose($file);
            fclose($localfile);
            $result = curl_result($url, "body");
        } else {
            echo "Failed.";
        }
    }
}
?>
I also had that problem in the past. The reason it does not work is that the session file can only be held open for writing by one request at a time, so every other request that calls session_start() blocks until the download script releases the lock.
What you need to do is modify your download script and call session_write_close() each time, directly after writing to the session.
Like this:
session_start();
if (!isset($_SESSION[$mdvid])) {
    $_SESSION[$mdvid] = array(time(), 0, retrieve_remote_file_size($url));
}
session_write_close();
and also in the while loop:
while (!feof($file)) {
    session_start();
    $_SESSION[$mdvid][1] = (int)$_SESSION[$mdvid][1] + 1;
    session_write_close();
    file_put_contents($localfile_name, fread($file, CHUNK), FILE_APPEND);
}

Facebook Graph API caching JSON response

I am using the Facebook Graph API to get content from a Facebook fan page and then display it on a website. I am doing it like this, and it is working, but somehow it seems that my hosting provider limits my requests after a certain amount of time... so I would like to cache the response and only make a new request every 8 hours, for example.
$data = get_data("https://graph.facebook.com/12345678/posts?access_token=1111112222233333&limit=20&fields=full_picture,link,message,likes,comments&date_format=U");
$result = json_decode($data);
The get_data function uses cURL in the following way:
function get_data($url) {
    $ch = curl_init();
    $timeout = 5;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    $datos = curl_exec($ch);
    curl_close($ch);
    return $datos;
}
This works fine; I can output the JSON response and use it as I like on my website to display the content. But as I mentioned, on my hosting this seems to fail every so often, I guess because I am being rate-limited. I have tried to cache the response using some code I saw here on Stack Overflow, but I cannot figure out how to integrate the two pieces of code. I have managed to create the cache file, but I cannot manage to read correctly from the cached file and avoid making a new request to the Facebook Graph API.
// cache files are created like cache/abcdef123456...
$cacheFile = 'cache' . DIRECTORY_SEPARATOR . md5($url);
if (file_exists($cacheFile)) {
    $fh = fopen($cacheFile, 'r');
    $cacheTime = trim(fgets($fh));
    // if data was cached recently, return cached data
    if ($cacheTime > strtotime('-60 minutes')) {
        return fread($fh);
    }
    // else delete cache file
    fclose($fh);
    unlink($cacheFile);
}
$fh = fopen($cacheFile, 'w');
fwrite($fh, time() . "\n");
fwrite($fh, $json);
fclose($fh);
return $json;
Many thanks in advance for your help!
There are some things that can come in handy when building a cache and caching actual objects (or even arrays).
The functions serialize and unserialize allow you to get a string representation of an object or array, so you can cache it as plain text and later restore the object/array from that string exactly as it was.
filectime lets you get the last change time of a file, so when the cache file is created you can rely on that information to see whether your cache is outdated, as you tried to implement.
And here is the whole working code:
function get_data($url) {
    /** @var $cache_file is path/to/the/cache/file/based/on/md5/url */
    $cache_file = 'cache' . DIRECTORY_SEPARATOR . md5($url);
    if (file_exists($cache_file)) {
        /**
         * Using the last modification date of the cache file to check its validity
         */
        if (filectime($cache_file) < strtotime('-60 minutes')) {
            unlink($cache_file);
        } else {
            echo 'TRACE -- REMOVE ME -- out of cache';
            /**
             * unserializing the object in the cache file
             * so it gets its original "shape": object, array, ...
             */
            return unserialize(file_get_contents($cache_file));
        }
    }
    $ch = curl_init();
    $timeout = 5;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    $data = curl_exec($ch);
    curl_close($ch);
    /**
     * We actually did the curl call, so we need to (re)create the cache file
     * with the string representation of our curl return that we got from serialize
     */
    file_put_contents($cache_file, serialize($data));
    return $data;
}
PS: note that I changed the $datos variable in your original get_data function to the more common $data.
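Usage then stays exactly as in your question; the caching is transparent to the caller:

// Same call as before: the first request hits Facebook, later ones are served from the cache file
$data   = get_data("https://graph.facebook.com/12345678/posts?access_token=1111112222233333&limit=20&fields=full_picture,link,message,likes,comments&date_format=U");
$result = json_decode($data);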
This answer will add a few more dependencies to your project, but it may be well worth it instead of rolling your own stuff.
You could use the Guzzle HTTP client, coupled with the HTTP Cache plugin.
$client = new Client('http://www.test.com/');
$cachePlugin = new CachePlugin(array(
    'storage' => new DefaultCacheStorage(
        new DoctrineCacheAdapter(
            new FilesystemCache('/path/to/cache/files')
        )
    )
));
$client->addSubscriber($cachePlugin);
$request = $client->get('https://graph.facebook.com/12345678/posts?access_token=1111112222233333&limit=20&fields=full_picture,link,message,likes,comments&date_format=U');
$request->getParams()->set('cache.override_ttl', 3600*8); // 8hrs
$data = $request->send()->getBody();
$result = json_decode($data);
I'm not sure if you can use Memcached; if you can:
$cacheFile = 'cache' . DIRECTORY_SEPARATOR . md5($url); // used here as the memcached key
$mem = new Memcached();
$mem->addServer("127.0.0.1", 11211);
$cached = $mem->get($cacheFile);
if ($cached) {
    return $cached;
} else {
    $data = get_data($url);
    $mem->set($cacheFile, json_encode($data), time() + 60*10); // 10 min
    return json_encode($data); // return the same JSON string that will come out of the cache next time
}
If your hosting provider is pushing all of your outbound requests through a proxy server, you can try to defeat it by adding an extra parameter near the beginning of the request:
https://graph.facebook.com/12345678/posts?p=(randomstring)&access_token=1111112222233333&limit=20&fields=full_picture,link,message,likes,comments&date_format=U
I have used this successfully for outbound calls to third-party data providers. Of course, I don't know whether this is your particular issue. You could also be bitten by the provider if they reject requests with parameters they don't expect.
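For example (a sketch only; uniqid() is just one way to generate the throwaway value, and the Graph API should simply ignore the unknown parameter):

// Random value up front so every request looks unique to an intermediate proxy
$bust = uniqid();
$url  = "https://graph.facebook.com/12345678/posts?p=$bust&access_token=1111112222233333&limit=20&fields=full_picture,link,message,likes,comments&date_format=U";
$data = get_data($url);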

HTTP content loaded over an HTTPS connection

My web app runs over an HTTPS connection with an SSL certificate. I need to show pictures to the user. I get the links to the pictures from an API request and embed them afterwards. Sadly, the pictures' addresses are HTTP, so the browser shows that there is insecure content on the site, which must not happen...
I could download the pictures and link to the local copies afterwards, but I think this might be a little time-consuming and not the best way to handle this.
Does somebody know a better solution? I'm using PHP, jQuery and JavaScript.
You'll have to write a proxy on your server and display all the images through it. Basically your URLs should be like:
$url = 'http://ecx.images-amazon.com/images/I/51MU5VilKpL._SL75_.jpg';
$url = urlencode($url);
echo '<img src="/proxy.php?from=' . $url . '">';
and the proxy.php:
$cache = '/path/to/cache';
$url = $_GET['from'];
$hash = md5($url);
$file = $cache . DIRECTORY_SEPARATOR . $hash;
if (!file_exists($file)) {
    $data = file_get_contents($url);
    file_put_contents($file, $data);
}
header('Content-Type: image/jpeg');
readfile($file);
OK, I've got a streaming example for you. You'll need to adapt it to your needs, of course.
Suppose you make a PHP file on your server named mws.php with this content:
if (isset($_GET['image']))
{
    header('Content-type: image/jpeg');
    header('Content-transfer-encoding: binary');
    echo file_get_contents($_GET['image']);
}
Look for any image on the web, for instance:
http://freebigpictures.com/wp-content/uploads/2009/09/mountain-stream.jpg
Now you can show that image as if it were located on your own secure server, with this URL:
https://<your server>/mws.php?image=http://freebigpictures.com/wp-content/uploads/2009/09/mountain-stream.jpg
It would of course be better to store the image locally if you need it more than once, and you have to include the correct code to get it from Amazon MWS ListMatchingProducts, but this is the basic idea.
Please don't forget to secure your script against abuse.
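For example, one way to limit abuse is to only proxy images from hosts you expect; a sketch, with the host list as a placeholder:

// Hypothetical allowlist check for the proxy: refuse anything that isn't
// plain http(s) or doesn't come from an expected image host.
$allowedHosts = array('freebigpictures.com', 'ecx.images-amazon.com'); // placeholders
$url   = isset($_GET['image']) ? $_GET['image'] : '';
$parts = parse_url($url);

if (!$parts
    || !isset($parts['scheme'], $parts['host'])
    || !in_array($parts['scheme'], array('http', 'https'), true)
    || !in_array($parts['host'], $allowedHosts, true)) {
    header('HTTP/1.1 400 Bad Request');
    exit('Invalid image URL');
}

header('Content-type: image/jpeg');
header('Content-transfer-encoding: binary');
echo file_get_contents($url);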

javascript / php - Get src of image from the URL of the site

I noticed that at http://avengersalliance.wikia.com/wiki/File:Effect_Icon_186.png there is a small image. Click on it and you will be brought to another page: http://img2.wikia.nocookie.net/__cb20140312005948/avengersalliance/images/f/f1/Effect_Icon_186.png.
For http://avengersalliance.wikia.com/wiki/File:Effect_Icon_187.png, after clicking on the image there, you are brought to another page: http://img4.wikia.nocookie.net/__cb20140313020718/avengersalliance/images/0/0c/Effect_Icon_187.png
There are many similar sites, from http://avengersalliance.wikia.com/wiki/File:Effect_Icon_001.png, to http://avengersalliance.wikia.com/wiki/File:Effect_Icon_190.png (the last one).
I'm not sure whether the image link is related to the link of its parent page, but is it possible to get the string http://img2.wikia.nocookie.net/__cb20140312005948/avengersalliance/images/f/f1/Effect_Icon_186.png from the string http://avengersalliance.wikia.com/wiki/File:Effect_Icon_186.png, using PHP or JavaScript? I would appreciate your help.
Here is a small PHP script that can do this. It uses cURL to get the content and DOMDocument to parse the HTML.
<?php
/*
* For educational purposes only
*/
function get_wiki_image($url = '') {
    if (empty($url)) return;
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    $output = curl_exec($curl);
    curl_close($curl);
    $DOM = new DOMDocument;
    libxml_use_internal_errors(true);
    $DOM->loadHTML($output);
    libxml_use_internal_errors(false);
    return $DOM->getElementById('file')->firstChild->firstChild->getAttribute('src');
}
echo get_wiki_image('http://avengersalliance.wikia.com/wiki/File%3aEffect_Icon_186.png');
You can access the element by class name, for example, and then select the one that you want with [n]; after that, call getAttribute and you've got it:
document.getElementsByClassName('icon cup')[0].getAttribute('src')
Hope it helps
