python - BeautifulSoup 'href' list that is giving ambiguous TypeErrors?
I'm using BeautifulSoup to scrape URLs from a webpage. It was going well, until some of the URLs had non-ASCII characters in them.
import requests
from bs4 import BeautifulSoup  # assuming bs4; for BeautifulSoup 3 it would be: from BeautifulSoup import BeautifulSoup

req = requests.get('http://www.reddit.com')
soup = BeautifulSoup(req.content)
urls = [i.get('href') for i in soup.findAll('a') if 'keyword' in str(i.get('href'))]
The list comprehension returns a UnicodeError.
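The error itself is easy to reproduce in isolation: under Python 2, calling str() on a unicode string that contains non-ASCII characters attempts an implicit ASCII encode. A minimal sketch (the href value here is made up):

# Python 2: str() implicitly encodes a unicode object as ASCII
href = u'/r/espa\xf1ol'          # hypothetical href containing a non-ASCII character
print 'keyword' in str(href)     # raises UnicodeEncodeError, a subclass of UnicodeError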
So I thought to separate the list comprehension into 2 parts instead:
urls = [i.get('href') for i in soup.findAll('a')]
urls = [i.encode('utf-8') for i in urls]
This is when I got an AttributeError, saying the items are NoneType.
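The wording of that error is a hint that .encode is being called on something that is not a string; for example, if a tag has no href attribute at all, i.get('href') returns None:

value = None             # what i.get('href') returns for an <a> tag with no href attribute
value.encode('utf-8')    # AttributeError: 'NoneType' object has no attribute 'encode'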
I checked their type:
print [type(i) for i in urls]
which showed unicode types. It seems they are None and unicode at the same time.
You must have missed a None value. I checked www.reddit.com and, sure enough, there's:
<a name="content"></a>
Its href is None. Instead of printing the values and searching for None manually, do:
urls = [(i, i.get('href')) for i in soup.findAll('a')]
print [u for u in urls if u[1] is None]
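Putting it together, one way to avoid both errors is to drop anchors that have no href before filtering, and to keep the comparison in unicode instead of forcing str(). A sketch, assuming bs4 (where findAll still works as a legacy alias for find_all) and the original 'keyword' filter:

import requests
from bs4 import BeautifulSoup

req = requests.get('http://www.reddit.com')
soup = BeautifulSoup(req.content)

# Skip anchors with no href at all, then filter and encode the rest
hrefs = [a.get('href') for a in soup.findAll('a') if a.get('href') is not None]
urls = [h for h in hrefs if u'keyword' in h]    # compare as unicode; no str() call needed
encoded = [h.encode('utf-8') for h in urls]     # safe now: no None values remain

Encoding to UTF-8 is only needed when you actually output the values; keeping them as unicode while filtering sidesteps the implicit ASCII conversion that raised the original UnicodeError.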