I'm trying to scrape the viewers on www.twitch.tv/directory using Python. I have tried the basic BeautifulSoup script:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://www.twitch.tv/directory'
html = urlopen(url)
soup = BeautifulSoup(html, "html5lib")  # also tried using html.parser, lxml
soup.prettify()
This gives me html without the actual viewer numbers shown.
Then I tried passing AJAX parameters with the request, following the approach from this thread:
param = {"action": "getcategory",
"br": "f21",
"category": "dress",
"pageno": "",
"pagesize": "",
"sort": "",
"fsize": "",
"fcolor": "",
"fprice": "",
"fattr": ""}
url = "https://www.twitch.tv/directory"
# Also tried with the headers parameter headers={"User-Agent":"Mozilla/5.0...
js = requests.get(url,params=param).json()
But I get a JSONDecodeError: Expecting value: line 1 column 1 (char 0) error.
From there I moved on to Selenium:
driver = webdriver.Edge()
url = 'https://www.twitch.tv/directory'
driver.get(url)
#Also tried driver.execute_script("return document.documentElement.outerHTML") and innerHTML
html = driver.page_source
driver.close()
soup = BeautifulSoup(html, "lxml")
These just yield the same result I get from the standard BeautifulSoup call.
Any help on scraping the view count would be appreciated.
The stats are not present in the page when it's first loaded. The page makes a GraphQL request to https://gql.twitch.tv/gql to fetch the game data. When a user isn't logged in, the GraphQL request asks for the query AnonFrontPage_TopChannels.
Here is a working request in Python:
import requests
import json
resp = requests.post(
    "https://gql.twitch.tv/gql",
    json.dumps(
        {
            "operationName": "AnonFrontPage_TopChannels",
            "variables": {"platformType": "all", "isTagsExperiment": True},
            "extensions": {
                "persistedQuery": {
                    "version": 1,
                    "sha256Hash": "d94b2fd8ad1d2c2ea82c187d65ebf3810144b4436fbf2a1dc3af0983d9bd69e9",
                }
            },
        }
    ),
    headers={"Client-Id": "kimne78kx3ncx6brgo4mv6wki5h1ko"},
)
print(json.loads(resp.content))
I've included the Client-Id in the request. The id doesn't seem to be unique to the session, but I imagine Twitch expires them, so this likely won't work forever. You'll have to inspect future GraphQL requests and grab a new Client-Id, or figure out how to programmatically scrape one from the page.
This request actually seems to be the Top Live Channels section. Here's how you can get the view counts and titles:
edges = json.loads(resp.content)["data"]["streams"]["edges"]
games = [(f["node"]["title"], f["node"]["viewersCount"]) for f in edges]
# games:
[
("Let us GAME", 78250),
("(REBROADCAST) Worlds Play-In Knockouts: Cloud9 vs. Gambit Esports", 36783),
("RuneFest 2018 - OSRS Reveals !schedule", 35042),
(None, 25237),
("Front Page of TWITCH + Fortnite FALL SKIRMISH Training!", 22380),
("Reckful - 3v3 with barry and a german", 20399),
]
You'll need to check the Chrome network inspector and figure out the structure of the other requests to get more data.
And here's an example for the directory page:
import requests
import json
resp = requests.post(
    "https://gql.twitch.tv/gql",
    json.dumps(
        {
            "operationName": "BrowsePage_AllDirectories",
            "variables": {
                "limit": 30,
                "directoryFilters": ["GAMES"],
                "isTagsExperiment": True,
                "tags": [],
            },
            "extensions": {
                "persistedQuery": {
                    "version": 1,
                    "sha256Hash": "75fb8eaa6e61d995a4d679dcb78b0d5e485778d1384a6232cba301418923d6b7",
                }
            },
        }
    ),
    headers={"Client-Id": "kimne78kx3ncx6brgo4mv6wki5h1ko"},
)
edges = json.loads(resp.content)["data"]["directoriesWithTags"]["edges"]
games = [f["node"] for f in edges]
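Both snippets index straight into the data key, which raises a KeyError if Twitch rejects the request (for instance once the persisted-query hash or the Client-Id changes). Here is a minimal sketch of a more defensive parser, assuming the response follows the usual GraphQL shape with an optional top-level errors list. The helper name extract_nodes and the sample payload are made up for illustration and are not part of any Twitch API:

```python
def extract_nodes(payload, root_field):
    """Return the node dicts under data[root_field]["edges"], or raise on errors."""
    # GraphQL servers conventionally report failures in a top-level "errors" list.
    if payload.get("errors"):
        raise RuntimeError(f"GraphQL request failed: {payload['errors']}")
    connection = (payload.get("data") or {}).get(root_field) or {}
    return [edge["node"] for edge in connection.get("edges", [])]


# Hypothetical payload shaped like the streams response shown earlier.
sample = {
    "data": {
        "streams": {
            "edges": [
                {"node": {"title": "Let us GAME", "viewersCount": 78250}},
                {"node": {"title": None, "viewersCount": 25237}},
            ]
        }
    }
}

viewers = [(n["title"], n["viewersCount"]) for n in extract_nodes(sample, "streams")]
print(viewers)
```

With a live response you would pass resp.json() and either "streams" or "directoriesWithTags" as the root field.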
I have been working on the backend of my app. At this point, it can access all the data in a database and output it. I'm trying to implement some queries so that the user can filter the content that is returned. My DAL/DAO looks like this:
let mflix //Creates a variable used to store a ref to our DB
class MflixDAO {
static async injectDB(conn){
if(mflix){
return
}
try{
mflix = await conn.db(process.env.JD_NS).collection("movies")
}catch(e){
console.error('Unable to establish a collection handle in mflixDAO: ' + e)
}
}
// Creates a query to fetch data from the collection/table in the DB
static async getMovies({
filters = null,
page = 0,
moviesPerPage = 20,
} = {}) {
let query
if (filters){
// Code
if("year" in filters){
query = {"year": {$eq: filters["year"]}}
}
// Code
}
// Cursor represents the returned data
let cursor
try{
cursor = await mflix.find(query)
}catch(e){
console.error('Unable to issue find command ' + e)
return {moviesList: [], totalNumMovies: 0}
}
const displayCursor = cursor.limit(moviesPerPage).skip(moviesPerPage * page)
try{
const moviesList = await displayCursor.toArray() // Puts data in an array
const totalNumMovies = await mflix.countDocuments(query) // Gets total number of documents
return { moviesList, totalNumMovies}
} catch(e){
console.error('Unable to convert cursor to array or problem counting documents ' + e)
return{moviesList: [], totalNumMovies: 0}
}
}
}
export default MflixDAO
Just so you know, I am using a sample database from MongoDB Atlas, and I am using Postman to test HTTP requests. All the data follows JSON format.
Anyway, when I execute a basic GET request, the program runs without any problems and all the data is output as expected. However, if I execute something along the lines of
GET http://localhost:5000/api/v1/mflix?year=1903
Then moviesList returns an empty array [], but no error message.
After debugging, I suspect the problem lies either at cursor = await mflix.find(query) or displayCursor = cursor.limit(moviesPerPage).skip(moviesPerPage * page), but the call stacks for those methods are so complex that I don't even know what to look for.
Any suggestions?
Edit: Here is an example of the document I am trying to access:
{
"_id": "573a1390f29313caabcd42e8",
"plot": "A group of bandits stage a brazen train hold-up, only to find a determined posse hot on their heels.",
"genres": [
"Short",
"Western"
],
"runtime": 11,
"cast": [
"A.C. Abadie",
"Gilbert M. 'Broncho Billy' Anderson",
"George Barnes",
"Justus D. Barnes"
],
"poster": "https://m.media-amazon.com/images/M/MV5BMTU3NjE5NzYtYTYyNS00MDVmLWIwYjgtMmYwYWIxZDYyNzU2XkEyXkFqcGdeQXVyNzQzNzQxNzI#._V1_SY1000_SX677_AL_.jpg",
"title": "The Great Train Robbery",
"fullplot": "Among the earliest existing films in American cinema - notable as the first film that presented a narrative story to tell - it depicts a group of cowboy outlaws who hold up a train and rob the passengers. They are then pursued by a Sheriff's posse. Several scenes have color included - all hand tinted.",
"languages": [
"English"
],
"released": "1903-12-01T00:00:00.000Z",
"directors": [
"Edwin S. Porter"
],
"rated": "TV-G",
"awards": {
"wins": 1,
"nominations": 0,
"text": "1 win."
},
"lastupdated": "2015-08-13 00:27:59.177000000",
"year": 1903,
"imdb": {
"rating": 7.4,
"votes": 9847,
"id": 439
},
"countries": [
"USA"
],
"type": "movie",
"tomatoes": {
"viewer": {
"rating": 3.7,
"numReviews": 2559,
"meter": 75
},
"fresh": 6,
"critic": {
"rating": 7.6,
"numReviews": 6,
"meter": 100
},
"rotten": 0,
"lastUpdated": "2015-08-08T19:16:10.000Z"
},
"num_mflix_comments": 0
}
EDIT: It seems to be a data type problem. When I filter on a field that holds a string, the program returns the documents that contain that value. Example:
Input:
GET localhost:5000/api/v1/mflix?rated=TV-G
Output:
{
"_id": "XXXXXXXXXX"
// Data
"rated": "TV-G"
// Data
}
EDIT: The problem has nothing to do with anything I've posted up to this point it seems. The problem is in this piece of code:
let filters = {}
if(req.query.year){
filters.year = req.query.year // This line needs to be changed
}
const {moviesList, totalNumMovies} = await MflixDAO.getMovies({
filters,
page,
moviesPerPage,
})
I will explain in the answer below
Ok so the problem, as it turns out, is that when I make an HTTP request, the requested value is passed as a string. So in
GET http://localhost:5000/api/v1/mflix?year=1903
the value of year is registered by the program as a string. In other words, the DAO ends up looking for "1903" instead of 1903. Naturally, a document with year = "1903" does not exist. To fix this, the line filters.year = req.query.year must be changed to filters.year = parseInt(req.query.year, 10).
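The same string-versus-number mismatch can be reproduced outside Express and MongoDB. The sketch below is in Python and only illustrates the idea: query-string values are always parsed as strings, so they have to be converted before being compared against numeric fields. The movies list and the find_by_year helper are hypothetical stand-ins for the collection and the DAO, not the actual implementation:

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical stand-in for the movies collection.
movies = [
    {"title": "The Great Train Robbery", "year": 1903},
    {"title": "A Trip to the Moon", "year": 1902},
]


def find_by_year(year):
    """Strict equality match, comparable to {"year": {"$eq": year}}."""
    return [m for m in movies if m["year"] == year]


url = "http://localhost:5000/api/v1/mflix?year=1903"
raw_year = parse_qs(urlparse(url).query)["year"][0]

print(type(raw_year).__name__)      # the parsed value is a string
print(find_by_year(raw_year))       # no match: "1903" is not equal to 1903
print(find_by_year(int(raw_year)))  # matches once the value is converted
```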
I am new to Dialogflow fulfillment and I am trying to retrieve news from News API based on user questions. I followed the documentation provided by News API, but I am not able to get any responses from the search results; when I run the function in the console it does not report errors. I changed the code and it looks like it is now reaching the newsapi endpoint, but it is not fetching any results. I am using https://newsapi.org/docs/client-libraries/node-js to make a request to search everything about the topic. When I diagnose the function, it says "Webhook call failed. Error: UNAVAILABLE."
'use strict';
const functions = require('firebase-functions');
const {WebhookClient} = require('dialogflow-fulfillment');
const {Card, Suggestion} = require('dialogflow-fulfillment');
const http = require('http');
const host = 'newsapi.org';
const NewsAPI = require('newsapi');
const newsapi = new NewsAPI('63756dc5caca424fb3d0343406295021');
process.env.DEBUG = 'dialogflow:debug';
exports.dialogflowFirebaseFulfillment = functions.https.onRequest((req, res) =>
{
// Get the city
let search = req.body.queryResult.parameters['search'];// search is a required param
// Call the weather API
callNewsApi(search).then((response) => {
res.json({ 'fulfillmentText': response }); // Return the results of the news API to Dialogflow
}).catch((xx) => {
console.error(xx);
res.json({ 'fulfillmentText': `I don't know the news but I hope it's good!` });
});
});
function callNewsApi(search)
{
console.log(search);
newsapi.v2.everything
(
{
q: 'search',
langauge: 'en',
sortBy: 'relevancy',
source: 'cbc-news',
domains: 'cbc.ca',
from: '2019-12-31',
to: '2020-12-12',
page: 2
}
).then (response => {console.log(response);
{
let articles = response['data']['articles'][0];
// Create response
let responce = `Current news in the $search with following title is ${articles['titile']} which says that
${articles['description']}`;
// Resolve the promise with the output text
console.log(output);
}
});
}
Also here is RAW API response
{
"responseId": "a871b8d2-16f2-4873-a5d1-b907a07adb9a-b4ef8d5f",
"queryResult": {
"queryText": "what is the latest news about toronto",
"parameters": {
"search": [
"toronto"
]
},
"allRequiredParamsPresent": true,
"fulfillmentMessages": [
{
"text": {
"text": [
""
]
}
}
],
"intent": {
"name": "projects/misty-ktsarh/agent/intents/b52c5774-e5b7-494a-8f4c-f783ebae558b",
"displayName": "misty.news"
},
"intentDetectionConfidence": 1,
"diagnosticInfo": {
"webhook_latency_ms": 543
},
"languageCode": "en"
},
"webhookStatus": {
"code": 14,
"message": "Webhook call failed. Error: UNAVAILABLE."
},
"outputAudio": "UklGRlQqAABXQVZFZm10IBAAAAABAAEAwF0AAIC7AAACABAAZGF0YTAqAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA... (The content is truncated. Click `COPY` for the original JSON.)",
"outputAudioConfig": {
"audioEncoding": "OUTPUT_AUDIO_ENCODING_LINEAR_16",
"synthesizeSpeechConfig": {
"speakingRate": 1,
"voice": {}
}
}
}
Also here is the screenshot from the firebase console.
Can anyone point out what I am missing here?
The key is the first three lines in the error message:
Function failed on loading user code. Error message: Code in file index.js can't be loaded.
Did you list all required modules in the package.json dependencies?
Detailed stack trace: Error: Cannot find module 'newsapi'
It is saying that the newsapi module couldn't be loaded and that the most likely cause of this is that you didn't list this as a dependency in your package.json file.
If you are using the Dialogflow Inline Editor, you need to select the package.json tab and add a line in the dependencies section.
Update
It isn't clear exactly when/where you're getting the "UNAVAILABLE" error, but one likely cause if you're using Dialogflow's Inline Editor is that it is using the Firebase "Spark" pricing plan, which has limitations on network calls outside Google's network.
You can upgrade to the Blaze plan, which does require a credit card on file, but does include the Spark plan's free tier, so you shouldn't incur any costs during light usage. This will allow for network calls.
Update based on TypeError: Cannot read property '0' of undefined
This indicates that the code is trying to read a property (or possibly an index of a property) of something that is undefined.
It isn't clear which line, exactly, this may be, but these lines all are suspicious:
let response = JSON.parse(body);
let source = response['data']['source'][0];
let id = response['data']['id'][0];
let name = response['data']['name'][0];
let author = response['author'][0];
let title = response['title'][0];
let description = response['description'][0];
since they are all referencing a property. I would check to see exactly what comes back and gets stored in response. For example, could it be that there is no "data" or "author" field in what is sent back?
Looking at https://newsapi.org/docs/endpoints/everything, it looks like none of these are fields, but that there is an articles property sent back which contains an array of articles. You may wish to index off that and get the attributes you want.
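As a rough illustration of indexing off the articles property, here is a Python sketch that pulls the title and description of the first article and falls back gracefully when the list is empty. The sample dictionary is hypothetical and only mirrors the top-level fields described in the documentation (status, totalResults, articles); summarize_first_article is a made-up helper name:

```python
def summarize_first_article(response, topic):
    """Build a reply from the first article, or a fallback if none came back."""
    articles = response.get("articles") or []
    if not articles:
        return f"I couldn't find any news about {topic}."
    first = articles[0]
    return f"Top story about {topic}: {first.get('title')}. {first.get('description')}"


# Hypothetical response with placeholder values.
sample_response = {
    "status": "ok",
    "totalResults": 1,
    "articles": [
        {
            "source": {"id": "cbc-news", "name": "CBC News"},
            "author": "Example Author",
            "title": "Example headline",
            "description": "Example description.",
        }
    ],
}

print(summarize_first_article(sample_response, "toronto"))
print(summarize_first_article({"status": "ok", "articles": []}, "toronto"))
```

The same guard applies in the Node.js fulfillment code: check that articles exists and is non-empty before reading index 0.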
Update
It looks like, although you are loading the parameter into a variable with this line
// Get the city and date from the request
let search = req.body.queryResult.parameters['search'];// city is a required param
You don't actually use the search variable anywhere. Instead, you seem to be passing a literal string "search" to your function with this line
callNewsApi('search').then((output) => {
which does a search for the word "search", I guess.
You indicated that "it goes to the catch portion", which indicates that something went wrong in the call. You don't show any logging in the catch portion, and it may be useful to log the exception that is thrown, so you know why it is going to the catch portion. Something like
}).catch((xx) => {
console.error(xx);
res.json({ 'fulfillmentText': `I don't know the news but I hope it's good!` });
});
is normal, but since it looks like you're logging it in the .on('error') portion, showing that error might be useful.
The name of the intent and the variable I was using to make the call differed in casing. Calls appear to be case sensitive, so be aware of that.
I have set up what I think should be a working JSON output to send a message in slack but Slack keeps rejecting it.
I have tried multiple different message layout formats using the guides on slack's api site, but so far the only method that has successfully sent is a fully flat JSON with no block formatting.
function submitValuesToSlack(e) {
var name = e.values[1];
var caseNumber = e.values[2];
var problemDescription = e.values[3];
var question = e.values[4];
var completedChecklist = e.values[5];
var payload = [{
"channel": postChannel,
"username": postUser,
"icon_emoji": postIcon,
"link_names": 1,
"blocks": [
{
"type": "section",
"fields": [
{
"type": "mrkdwn",
"text": "*Name:*\n " + name
}
]
}]
}];
console.log(JSON.stringify(payload, null, "\t"));
var options = {
'method': 'post',
'payload': JSON.stringify(payload)
};
console.log(options)
var response = UrlFetchApp.fetch(slackIncomingWebhookUrl, options);
}
When I run this, I get the following output:
[
{
"channel":"#tech-support",
"username":"Form Response",
"icon_emoji":":mailbox_with_mail:",
"link_names":1,
"blocks":[
{
"type":"section",
"fields":[
{
"type":"mrkdwn",
"text":"*Name:*\n test"
}
]
}
]
}
]
I believe this is correct, but the Slack API rejects it with an HTTP 400 error, "no text".
Am I misunderstanding something about block formatting?
EDIT:
To clarify, the request works if I use this for my JSON instead of the more complex format:
{
"channel":"#tech-support",
"username":"Form Response",
"icon_emoji":":mailbox_with_mail:",
"link_names":1,
"text":"*Name:*\n test"
}
The reason you are getting the error no_text is that your payload does not contain a valid message text. You need either a text property as a top-level parameter (classic style, as in your example at the bottom) or a text object within a section block.
If you want to use blocks only (as you are asking), the property of the section block is called text, not fields. fields is a separate property of the section block with a different meaning: it takes an array of text objects that are laid out in columns.
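So the correct syntax is shown below. Note that text holds a single text object rather than an array, and that the payload sent to the webhook is one JSON object, not an array wrapping it:
{
"channel":"#tech-support",
"username":"Form Response",
"icon_emoji":":mailbox_with_mail:",
"link_names":1,
"blocks":[
{
"type":"section",
"text":{
"type":"mrkdwn",
"text":"*Name:*\n test"
}
}
]
}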
Also see here for the official documentation on it.
Blocks are very powerful, but can be complicated at times. I would recommend using the message builder to try out your messages and checking out the examples in the documentation.
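For comparison, here is a minimal Python sketch that builds the same kind of payload as a plain dictionary and serializes it. It assumes an incoming webhook; build_payload is a hypothetical helper, and the actual POST is left out so the snippet runs without a webhook URL. A top-level text is included as well, on the assumption that you want a plain fallback alongside the blocks:

```python
import json


def build_payload(name, channel="#tech-support"):
    """Return a Slack message payload with one section block."""
    message = f"*Name:*\n{name}"
    return {
        "channel": channel,
        "text": message,  # plain fallback text
        "blocks": [
            {
                "type": "section",
                # "text" is a single text object, not a list
                "text": {"type": "mrkdwn", "text": message},
            }
        ],
    }


payload = build_payload("test")
body = json.dumps(payload)
print(body)
```

The resulting string is what would be sent as the request body to the webhook URL.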
I am trying to use an API that is hosted externally and also returns JSON data. I have tried editing the headers but I'm not entirely sure it is working because I am still getting a warning about not having the CORS header
Source
var url = "http://hkconsult.in/social_search/keyword_services.php?keyword=throat&callback=test";
$.ajax({
type: 'get',
url: url,
headers: {
"Origin":"http://hkconsult.in/social_search/keyword_services.php?keyword=throat&callback=test",
"Access-Control-Allow-Origin":"http://hkconsult.in/social_search/keyword_services.php?keyword=throat&callback=test"
}
}).done(function(data) {
document.getElementById('cool').innerHTML = data;
});
Firebug Headers
When opening the response for the "OPTIONS" url in firebug, it returns the data that I want. How do I use that data in javascript?
CORS headers are sent by the server in its response to your request; adding them to the request on the client side does not grant access.
The OPTIONS request is performed by the browser to check those headers prior to performing your actual request.
You cannot get the result of that OPTIONS request because it happens outside the scope of your code.
If the server is not set up to send those headers, then your only option is to use a proxy page that will use a server-side script to perform the call on your behalf.
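A proxy of that kind can be very small. The following is a minimal sketch in Python using only the standard library, not a production setup: it assumes the upstream API returns JSON, that the proxy listens on localhost port 8000, and that allowing any origin is acceptable. The names build_upstream_url, ProxyHandler, and run are hypothetical:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlencode, urlparse, parse_qs
from urllib.request import urlopen

UPSTREAM = "http://hkconsult.in/social_search/keyword_services.php"


def build_upstream_url(keyword):
    """Build the upstream API URL for a keyword."""
    return UPSTREAM + "?" + urlencode({"keyword": keyword})


class ProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Read ?keyword=... from the incoming request.
        params = parse_qs(urlparse(self.path).query)
        keyword = params.get("keyword", [""])[0]
        # The call is made server-side, where the browser's CORS checks don't apply.
        with urlopen(build_upstream_url(keyword), timeout=10) as upstream:
            body = upstream.read()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        # Let pages from any origin read this proxy's response.
        self.send_header("Access-Control-Allow-Origin", "*")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def run(port=8000):
    """Start the proxy; blocks until interrupted."""
    HTTPServer(("localhost", port), ProxyHandler).serve_forever()
```

After calling run(), the page would request http://localhost:8000/?keyword=throat instead of the external URL.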
As I understand it, you are asking how to parse the response JSON rather than about the CORS issue.
var data = {
"0": {
"keyword_name": "Sore throat",
"id": "1787",
"user_id": "3350988339",
"user_name": "Nic",
"user_screen_name": "Goldendevi",
"user_profile_pic": "http://pbs.twimg.com/profile_images/840766770633945088/eRRoRZHv_normal.jpg",
"user_location": "",
"post_id": "864553159896838145",
"post_text": "I'm so mad I have a fucking sore throat 🙄",
"post_geo_location": "0",
"post_image": "",
"post_date": "2017-05-16 20:48:30"
},
"1": {
"keyword_name": "Sore throat",
"id": "1788",
"user_id": "63496454",
"user_name": "mariana",
"user_screen_name": "yugyeumie",
"user_profile_pic": "http://pbs.twimg.com/profile_images/863802594132733952/ep0DtSoT_normal.jpg",
"user_location": "jjp; ë§ ìŠ¨",
"post_id": "864552988974747649",
"post_text": "is it possible to die of a sore throat",
"post_geo_location": "0",
"post_image": "",
"post_date": "2017-05-16 20:47:49"
}
};
var res = Object.keys(data).map(item => {return data[item].keyword_name });
console.log(res);
I'm attempting to write an alternative UI for a website I commonly use. I'm writing it with Node.js, using request and cheerio to scrape data from the web pages.
However, the problem occurs when I attempt to send a POST request against this site. I want to retrieve the list of classes here without going through this page first, but the normal post parameters shown in the devtools are structured like this:
sel_subj:dummy
bl_online:FALSE
sel_day:dummy
term:201630
sel_subj:ACTG
sel_inst:ANY
sel_online:
sel_crse:
begin_hh:0
begin_mi:0
end_hh:0
end_mi:0
I can modify any other value (term, sel_crse, etc), but the sel_subj doesn't have a compatible value, so the server just goes with the default value.
I've been trying different values for the form Object parameter in request, but none of these have worked:
sel_subj: ["M", "dummy"]
sel_subj: ["dummy", "M"]
sel_subj: "M"
sel_subj: "dummy,M"
sel_subj: "M,dummy"
sel_subj: "dummy M"
sel_subj: "M dummy"
sel_subj: "dummy, M"
sel_subj: "M, dummy"
I'm trying to figure out what a duplicate field in the POST request means, what the server expects, and how to reproduce that with request.
If parameter names need to be duplicated, you can build the request body yourself:
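var request = require('request');
var querystring = require('querystring');
var headers = {'content-type' : 'application/x-www-form-urlencoded'};
var body = [];
// One object per field, so a name can appear more than once and order is kept
var params = [
  { sel_subj:'dummy' }, // duplicates
  { bl_online:'FALSE' }, // sent as the literal string FALSE, as in the devtools capture
  { sel_day:'dummy' },
  { term:'201630'},
  { sel_subj:'ACTG'}, // duplicates
  { sel_inst:'ANY'},
  { sel_online: null}, // null is encoded as an empty value
  { sel_crse: null},
  { begin_hh:0},
  { begin_mi:0},
  { end_hh:0},
  { end_mi:0}
];
// Encode each pair separately, then join them into one form body
params.forEach( function(p) {
  body.push( querystring.stringify(p) );
});
var r = request.post({ url:'http://localhost/api/',
  headers: headers,
  body:body.join('&')
});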
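The underlying idea is language-independent: a form-encoded body is just an ordered list of name=value pairs, so repeated names are allowed as long as the encoder is given a list rather than a map. As a sketch of the same technique in Python (the field values are copied from the devtools capture in the question; nothing here is specific to that site):

```python
from urllib.parse import urlencode

# A list of (name, value) pairs keeps duplicates and their order; a dict would not.
params = [
    ("sel_subj", "dummy"),
    ("bl_online", "FALSE"),
    ("sel_day", "dummy"),
    ("term", "201630"),
    ("sel_subj", "ACTG"),
    ("sel_inst", "ANY"),
    ("sel_online", ""),
    ("sel_crse", ""),
    ("begin_hh", "0"),
    ("begin_mi", "0"),
    ("end_hh", "0"),
    ("end_mi", "0"),
]

body = urlencode(params)
print(body)
```

The requests library also accepts such a list of tuples directly as its data argument, which avoids encoding the body by hand.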