
I made a wordcounter!

Discussion in 'General' started by Prognostic Hannya, Dec 3, 2019.

  1. Prognostic Hannya

    Prognostic Hannya Knight of the Yuri Crusade

    Joined:
    Dec 3, 2019
    Messages:
    1,283
    Likes Received:
    13,282
    Hi everyone,

    I'm new to QQ and an amateur Python programmer.

    I was reading With This Ring by Mr. Zoat and was shocked by how absolutely massive it is compared to most other fanfics. But for the life of me, I couldn't find a way to see the word count of the story-only thread that didn't just count the threadmarks. The story's currently at 3,400,000 words, if you were wondering.

    So, like the amateur programmer I am, I decided to write a script to do it myself! Just input the URL of a QQ thread (one that's not behind an account wall), and it will spit out the word count. It even has a loading bar for longer threads!

    Please let me know if this post is in the wrong place, or if you have any improvements to my code!


    Code:
    ## Made by Sam Ravenwood
     
    import bs4 as bs
    import requests
    import re
    from tqdm import tqdm
     
    #creates a list of the urls of every page
    def iterate(url):
    	http = requests.get(url)
    	page = bs.BeautifulSoup(http.text, 'html.parser')
    	#finds max pagecount
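    	next_link = page.find(string="Next >")
    	#single-page threads have no "Next >" link, so fall back to one page
    	pagecount = int(next_link.previous_element.previous_element.previous_element) if next_link else 1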
    	if url.endswith("/"):
    		url = url + "page-"
    	else:
    		url = url + "/page-"
    	links = []
    	for i in range(1, pagecount+1):
    		links.append(url + str(i))
    	return links
     
    def get(cat, url):
    	assert cat in ["posts","title"], "Incorrect category for get request!"
    	http = requests.get(url)
    	page = bs.BeautifulSoup(http.text, 'html.parser')
    	if cat == "posts":
    		return page.find_all(class_="message")
    	if cat == "title":
    		return page.find("title").get_text().replace(" | Questionable Questing", "")
     
    def counter(msg):
    	msg = str(msg)
    	msg = re.sub(r"[^a-zA-Z0-9_\s]", "", msg) #deletes all characters that aren't alphanumeric, underscores, or whitespace
    	#split() with no argument splits on any run of whitespace and drops empty strings
    	return len(msg.split())
     
    #adds up the wordcount of every post on every page of the thread
    def wordcount(url):
    	links = iterate(url)
    	total = 0
    	#tqdm wraps the list of page urls to draw the loading bar
    	for link in tqdm(links):  #for each page of the thread
    		posts_loc = get("posts", link)
    		for post_loc in posts_loc: #for each post in page
    			count = counter(post_loc.get_text())
    			total += count
      
    	return total
     
    url = input("Enter url:\n").strip()
    #only add a scheme when the url doesn't already start with one
    if not url.startswith(("http://", "https://")):
    	url = "https://" + url
     
    print("Analyzing pages...")
    print(f"\nThread '{get('title', url)}' has total wordcount of {wordcount(url):,} words.")
    input()
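
For anyone who wants to poke at the logic without hitting the site, here is a minimal offline sketch of the two network-free steps in the script above: building the list of page URLs and counting the words in a block of text. The names `page_links` and `count_words` and the example.com URL are assumptions made for illustration and are not part of the original script, and the page count is passed in by hand instead of being scraped.

```python
import re


def page_links(url, pagecount):
    """Build the URL of every page of a thread (hypothetical helper)."""
    # Normalise the trailing slash so the result never contains "//page-".
    base = url.rstrip("/") + "/page-"
    return [base + str(i) for i in range(1, pagecount + 1)]


def count_words(text):
    """Count whitespace-separated words after stripping punctuation."""
    # Keep only letters, digits, underscores and whitespace.
    cleaned = re.sub(r"[^a-zA-Z0-9_\s]", "", text)
    # split() with no argument treats any run of whitespace as one separator.
    return len(cleaned.split())


if __name__ == "__main__":
    # example.com stands in for a real thread URL (assumption).
    for link in page_links("https://example.com/threads/demo.1/", 3):
        print(link)
    print(count_words("Hello,  world!\nThis is\ta test."))
```

Calling `split()` with no argument treats any run of spaces, tabs, or newlines as a single separator, so double spaces and line breaks between paragraphs neither inflate nor hide words in the total.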
     
    Last edited: Dec 3, 2019
  2. Nekraa

    Nekraa Nekraa Moderator

    Joined:
    Mar 2, 2013
    Messages:
    6,168
    Likes Received:
    22,152
    Well, first things first. Your link doesn't seem to work. You could put the code in a code block using [code]code here[/code]. Unless it's too long, I guess.

    I also moved your thread to General, as I believe that forum is a better fit for it.
     
  3. Prognostic Hannya

    Whoops, sorry! I fixed the link. Also the code's like 60 lines, idk if that's "too long"
     
  4. Nekraa

    I would say it's likely not too long then.