Articles tagged "python" - GengRe.Men


Gengre



2023-10-25
Python: Crawling Information for All Videos in a YouTube Channel

Life handed me this requirement: crawl the information (title, like count, dislike count, view count, and so on) for the videos that a given YouTube channel published within a given date range.

First you need a Google account and a proxy or VPN tool that can reach YouTube. Open YouTube, search for the target channel, and go to the channel page. Then view the page source and search for "channel" to find the channel ID.

You also need to apply for your own API key for the Google Data API. Follow the blog post on applying for an API key, and enable the YouTube API for that key.

Below are the Google API addresses:

    self.app_key = 'your own api key'
    self.channel_api = 'https://developers.google.com/apis-explorer/#p/youtube/v3/youtube.channels.list?part=snippet,contentDetails&publishedAfter=2016-11-01T00:00:00Z&publishedBefore=2017-12-31T00:00:00Z&id='+ channel_id + '&key=' + self.app_key
    self.info_api = 'https://www.googleapis.com/youtube/v3/videos'

My original idea was to collect every video URL and then filter the results by publish date. However, Google limits the API to 500 returned results (in practice, roughly 500), so videos went missing. I thought about a workaround for a long time and finally Googled the answer (Googling a Google problem = =). Relevant excerpt:

"We cannot provide more than 500 search results via the API for arbitrary YouTube queries without severely degrading the quality of the search results (duplicates, etc.). The v1/v2 GData API was updated in November to limit the number of search results returned to 500. If you specify a start index of 500 or higher, you will not get any results."

So, to get every video published in the specified period, add publish-date bounds to the request parameters. (Search in time segments; each individual search is still capped at 500 results, so take special note!)

    publishedAfter=2016-11-01T00:00:00Z&publishedBefore=2017-12-31T00:00:00Z

The full code follows:

    # -*- coding: UTF-8 -*-
    import urllib2
    import time
    import urllib
    import json
    import datetime
    import requests
    import sys
    import xlsxwriter

    reload(sys)
    sys.setdefaultencoding("utf-8")

    channel = "Samsung"  # channel name
    channel_id = 'UCWwgaK7x0_FR1goeSRazfsQ'  # channel ID


    class YoukuCrawler:
        def __init__(self):
            self.video_ids = []
            self.maxResults = 50  # number of results returned per request
            self.app_key = 'your own api key'
            self.channel_api = 'https://developers.google.com/apis-explorer/#p/youtube/v3/youtube.channels.list?part=snippet,contentDetails&publishedAfter=2016-11-01T00:00:00Z&publishedBefore=2017-12-31T00:00:00Z&id='+ channel_id + '&key=' + self.app_key
            # self.info_api = 'https://www.googleapis.com/youtube/v3/videos?maxResults=50&part=snippet,statistics' + '&key=' + self.app_key
            self.info_api = 'https://www.googleapis.com/youtube/v3/videos'
            now = time.mktime(datetime.date.today().timetuple())

        def get_all_video_in_channel(self, channel_id):
            base_video_url = 'https://www.youtube.com/watch?v='
            base_search_url = 'https://www.googleapis.com/youtube/v3/search?'
            first_url = base_search_url + 'key={}&channelId={}&part=snippet,id&publishedAfter=2016-11-01T00:00:00Z&publishedBefore=2017-12-31T00:00:00Z&order=date&maxResults=25'.format(self.app_key, channel_id)
            url = first_url
            while True:
                print url
                request = urllib2.Request(url=url)
                response = urllib2.urlopen(request)
                page = response.read()
                result = json.loads(page, encoding="utf-8")
                for i in result['items']:
                    try:
                        self.video_ids.append(i['id']['videoId'])  # collect the video ID
                    except:
                        pass
                try:
                    next_page_token = result['nextPageToken']  # move on to the next page of results
                    url = first_url + '&pageToken={}'.format(next_page_token)
                except:
                    print "no nextPageToken"
                    break

        def main(self):
            self.get_all_video_in_channel(channel_id)
            return self.get_videos_info()

        def get_videos_info(self):  # fetch the details of each video
            url = self.info_api
            query = ''
            count = 0
            f = open(channel_id + '.txt', 'w')
            print len(self.video_ids)
            for i in self.video_ids:
                try:
                    count += 1
                    query = i
                    results = requests.get(url, params={'id': query, 'maxResults': self.maxResults, 'part': 'snippet,statistics', 'key': self.app_key})
                    page = results.content
                    videos = json.loads(page, encoding="utf-8")['items']
                    for video in videos:
                        try:
                            like_count = int(video['statistics']['likeCount'])
                        except KeyError:
                            like_count = 0
                        try:
                            dislike_count = int(video['statistics']['dislikeCount'])
                        except KeyError:
                            dislike_count = 0
                        temp = time.mktime(time.strptime(video['snippet']['publishedAt'], "%Y-%m-%dT%H:%M:%S.000Z"))
                        dateArray = datetime.datetime.utcfromtimestamp(int(temp))
                        otherStyleTime = dateArray.strftime("%Y-%m-%d")
                        print otherStyleTime,count
                        if (otherStyleTime>='2016-11-01' and otherStyleTime
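The time-segmented search described in the post can be sketched roughly as follows. This is a minimal Python 3 illustration, not the author's code (the post itself uses Python 2): the helper names `date_windows`, `to_rfc3339`, and `build_search_urls` and the 30-day default window are assumptions made for the example. It only builds the first-page URL for each window; following `nextPageToken` within a window would still work as in the post.

```python
from datetime import datetime, timedelta
from urllib.parse import urlencode

# Same search endpoint the post uses.
SEARCH_API = "https://www.googleapis.com/youtube/v3/search"


def date_windows(start, end, days=30):
    """Yield consecutive (after, before) datetime pairs covering [start, end)."""
    step = timedelta(days=days)
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)  # clamp the last window to the end date
        yield cursor, upper
        cursor = upper


def to_rfc3339(moment):
    """Format a naive UTC datetime like the post's publishedAfter/publishedBefore values."""
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


def build_search_urls(api_key, channel_id, start, end, days=30):
    """Build one first-page search URL per window (hypothetical helper)."""
    urls = []
    for after, before in date_windows(start, end, days):
        params = {
            "key": api_key,
            "channelId": channel_id,
            "part": "snippet,id",
            "order": "date",
            "maxResults": 50,  # search.list accepts at most 50 results per page
            "publishedAfter": to_rfc3339(after),
            "publishedBefore": to_rfc3339(before),
        }
        urls.append(SEARCH_API + "?" + urlencode(params))
    return urls


if __name__ == "__main__":
    for url in build_search_urls(
        "your own api key",
        "UCWwgaK7x0_FR1goeSRazfsQ",
        datetime(2016, 11, 1),
        datetime(2017, 12, 31),
    ):
        print(url)
```

The window length should be small enough that no single window holds more than about 500 videos for the channel in question. Adjacent windows share a boundary timestamp, so collecting the video IDs in a set is a cheap guard against counting the same video twice.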