{"componentChunkName":"component---src-components-posts-page-layout-js","path":"/analyzing-reddit-comments-using-python","result":{"data":{"mdx":{"id":"e68c0391-1b3f-538e-a9fd-69eb8ebfc670","body":"function _extends() { _extends = Object.assign || function (target) { for (var i = 1; i < arguments.length; i++) { var source = arguments[i]; for (var key in source) { if (Object.prototype.hasOwnProperty.call(source, key)) { target[key] = source[key]; } } } return target; }; return _extends.apply(this, arguments); }\n\nfunction _objectWithoutProperties(source, excluded) { if (source == null) return {}; var target = _objectWithoutPropertiesLoose(source, excluded); var key, i; if (Object.getOwnPropertySymbols) { var sourceSymbolKeys = Object.getOwnPropertySymbols(source); for (i = 0; i < sourceSymbolKeys.length; i++) { key = sourceSymbolKeys[i]; if (excluded.indexOf(key) >= 0) continue; if (!Object.prototype.propertyIsEnumerable.call(source, key)) continue; target[key] = source[key]; } } return target; }\n\nfunction _objectWithoutPropertiesLoose(source, excluded) { if (source == null) return {}; var target = {}; var sourceKeys = Object.keys(source); var key, i; for (i = 0; i < sourceKeys.length; i++) { key = sourceKeys[i]; if (excluded.indexOf(key) >= 0) continue; target[key] = source[key]; } return target; }\n\n/* @jsx mdx */\nvar _frontmatter = {\n  \"title\": \"Analyzing reddit comments using Python\",\n  \"slug\": \"analyzing-reddit-comments-using-python\",\n  \"date\": \"2021-01-18\",\n  \"author\": \"Adam Goth\",\n  \"preview\": \"In this post, we'll take a look at how to build a simple Python script for word analysis. 
We will then apply it to the comment section of any given reddit post.\",\n  \"categories\": [\"scripts\"],\n  \"keywords\": [\"python\", \"word analysis\"]\n};\nvar layoutProps = {\n  _frontmatter: _frontmatter\n};\nvar MDXLayout = \"wrapper\";\nreturn function MDXContent(_ref) {\n  var components = _ref.components,\n      props = _objectWithoutProperties(_ref, [\"components\"]);\n\n  return mdx(MDXLayout, _extends({}, layoutProps, props, {\n    components: components,\n    mdxType: \"MDXLayout\"\n  }), mdx(\"p\", null, \"In this post, we'll take a look at how to build a simple Python script for word\\nanalysis. We will then apply it to the comment section of any given reddit post.\"), mdx(\"h3\", null, \"Overview\"), mdx(\"p\", null, \"Between my job and side projects, I typically spend most of my time building web\\napplications using React and Node. That means writing almost exclusively\\nJavaScript. To keep my perspective on programming fresh and not strictly\\nconfined to a single language, I wanted to take a little time to step out of the\\nworld of JavaScript and explore the world of another programming language. I\\ndecided to come up with a little project idea and to build it with Python.\\nPython is a powerful yet friendly programming language that is popular with\\nbeginners and experienced programmers alike. It was created in 1991 by Guido van\\nRossum, but continues to rise in popularity almost 30 years later. In 2020,\\nPython was at or near the top of the list for in-demand languages for\\nprogramming jobs. It was also deemed by Wired magazine to be\\n\", mdx(\"a\", _extends({\n    parentName: \"p\"\n  }, {\n    \"href\": \"https://www.wired.com/story/python-language-more-popular-than-ever/\"\n  }), \"more popular than ever before\"), \".\\nAfter spending just a short time writing code with Python, it's not hard to see\\nwhy it's a popular choice. 
Let's jump in.\"), mdx(\"h3\", null, \"Setting up\"), mdx(\"p\", null, \"This post will assume you have basic programming knowledge and that you have\\nPython 3 installed. For more detailed information on installing Python,\\n\", mdx(\"a\", _extends({\n    parentName: \"p\"\n  }, {\n    \"href\": \"https://wiki.python.org/moin/BeginnersGuide/Download\"\n  }), \"start here\"), \". The repo for\\nthis project can be found\\n\", mdx(\"a\", _extends({\n    parentName: \"p\"\n  }, {\n    \"href\": \"https://github.com/adamgoth/reddit-comment-analysis\"\n  }), \"here\"), \". You will notice a\\n\", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \".py\"), \" file containing the full script, as well as a \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \".ipynb\"), \" file containing a\\nJupyter Notebook for the script. \", mdx(\"a\", _extends({\n    parentName: \"p\"\n  }, {\n    \"href\": \"https://jupyter.org/\"\n  }), \"The Jupyter Notebook\"), \" is\\nan open-source web application that allows you to create and share documents\\nthat contain live code, equations, visualizations, and narrative text, which can\\nmake it easier to follow along and learn how a Python script works.\"), mdx(\"h3\", null, \"The script\"), mdx(\"p\", null, \"The script in its entirety can be found\\n\", mdx(\"a\", _extends({\n    parentName: \"p\"\n  }, {\n    \"href\": \"https://github.com/adamgoth/reddit-comment-analysis/blob/master/reddit-comment-analysis.py\"\n  }), \"here\"), \".\"), mdx(\"p\", null, \"The first thing we need to do is import the requests library. This is what we\\nwill use to make the HTTP request to reddit to get the comment data from the\\nreddit post. After that, we will initialize a few global variables. 
We will use\\nthese global variables to keep track of data as we parse through comments.\\n\", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"comment_count\"), \" is an integer and will track the number of comments we parse,\\n\", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"comment_array\"), \" is an array and will hold the actual comment strings, and\\n\", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"more_comment_ids\"), \" is another array that will hold ID strings that we will need\\nin order to fetch additional comments that are not returned in the initial\\npayload (commonly found in posts with many comments).\"), mdx(\"pre\", null, mdx(\"code\", _extends({\n    parentName: \"pre\"\n  }, {\n    \"className\": \"language-python\"\n  }), \"# imports\\nimport requests  # The requests library for HTTP requests in Python\\n\\n# globals\\ncomment_count = 0\\ncomment_array = []\\nmore_comment_ids = []\\n\")), mdx(\"p\", null, \"Next, we need to fetch the data for the reddit post. 
To do that, we can append\\n\", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \".json\"), \" to the end of any reddit post URL.\"), mdx(\"p\", null, \"An example would be:\\n\", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"https://www.reddit.com/r/redditdev/comments/krolrb/multicomments.json\"), \".\"), mdx(\"p\", null, \"What we get back is a JSON array of listings, each with a basic format that looks like this:\"), mdx(\"pre\", null, mdx(\"code\", _extends({\n    parentName: \"pre\"\n  }, {\n    \"className\": \"language-javascript\"\n  }), \"{\\n    \\\"kind\\\": \\\"Listing\\\",\\n    \\\"data\\\": {\\n        \\\"children\\\": [\\n            {\\n                \\\"kind\\\": \\\"t1\\\",\\n                \\\"data\\\": {\\n                    \\\"body\\\": \\\"\\\",\\n                    \\\"replies\\\": \\\"\\\"\\n                }\\n            }\\n        ]\\n    }\\n}\\n\")), mdx(\"p\", null, \"A reddit post is referred to as a\\n\\\"\", mdx(\"a\", _extends({\n    parentName: \"p\"\n  }, {\n    \"href\": \"https://www.reddit.com/dev/api/#listings\"\n  }), \"Listing\"), \"\\\". Listings can contain many\\nkinds of children. A child with a \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"kind\"), \" of \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"t1\"), \" indicates that the child\\nrepresents a comment. Within the comment's \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"data\"), \" property, among many other\\nproperties, the text of the comment can be found on the \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"body\"), \" property, along\\nwith any possible replies, which are located on the \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"replies\"), \" property. Replies\\nare structured the same way as comments. They contain children and the children\\nhave \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"kind\"), \" and \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"data\"), \" properties. 
Within every reply to a comment, we may see\\nanother reply to that reply. Each of these contains its own\\nidentically formatted children. So in order to analyze all comments within a\\nthread, we'll have to recursively sift through all comments and replies.\"), mdx(\"p\", null, \"If having to follow each individual comment tree recursively to its end wasn't\\ntricky enough, there's another issue we have to worry about. Since comment\\nthreads can become quite long, not every comment is displayed on the\\ninitial thread load. When this happens, reddit shows \\\"load more replies\\\" buttons\\nwithin threads. So how do we get these as well? To handle these instances, the\\nAPI will deliver a child with a \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"kind\"), \" property value of \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"more\"), \".\"), mdx(\"pre\", null, mdx(\"code\", _extends({\n    parentName: \"pre\"\n  }, {\n    \"className\": \"language-javascript\"\n  }), \"{\\n    \\\"kind\\\": \\\"more\\\",\\n    \\\"data\\\": {\\n        \\\"count\\\": 2,\\n        \\\"name\\\": \\\"t1_ghp1m6v\\\",\\n        \\\"id\\\": \\\"ghp1m6v\\\",\\n        \\\"parent_id\\\": \\\"t1_ghozojl\\\",\\n        \\\"depth\\\": 2,\\n        \\\"children\\\": [\\n            \\\"ghp1m6v\\\"\\n        ]\\n    }\\n}\\n\")), mdx(\"p\", null, \"The array of \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"children\"), \" within the \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"more\"), \" object will contain a list of thread\\nIDs that can be used to fetch additional comments. In the code example above,\\nthere is just one child ID, \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"ghp1m6v\"), \". So in addition to parsing all comment\\ntrees recursively, we will also have to collect any additional comment thread\\nIDs and then do the same thing for those.\"), mdx(\"p\", null, \"Hopefully, you are still with me at this point. 
Talking about all of this\\nwithout writing any code can be confusing, so let's try to break it down with\\nsome functions that will help us achieve this goal.\"), mdx(\"p\", null, \"The first function we'll write is \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"parse_children_for_comments\"), \".\"), mdx(\"pre\", null, mdx(\"code\", _extends({\n    parentName: \"pre\"\n  }, {\n    \"className\": \"language-python\"\n  }), \"def parse_children_for_comments(children):\\n    global comment_count  # reassigned below, so it must be declared global\\n    for child in children:\\n        if child['kind'] == \\\"more\\\":\\n            # ids of comments that were left out of this payload\\n            for comment_id in child['data']['children']:\\n                more_comment_ids.append(comment_id)\\n        elif child['kind'] == \\\"t1\\\":\\n            comment_count += 1\\n            comment_array.append(child['data']['body'])\\n            get_replies(child['data'])\\n\")), mdx(\"p\", null, \"It will take an array of \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"children\"), \" objects that are sent back in the response\\ndata and will pull out the comment text which is found in the \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"body\"), \" property.\\nFor each child in the array argument of \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"children\"), \", we will check its \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"kind\"), \". If\\nthe \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"kind\"), \" is \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"more\"), \", we will loop through and add each id to the global array\\nwe created, \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"more_comment_ids\"), \". 
We will eventually come back to this array of\\nids and parse through it.\"), mdx(\"p\", null, \"Next, if the \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"kind\"), \" is \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"t1\"), \", that means we have a comment and we want to read\\nits text. In order to do that, we simply get the text with\\n\", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"child['data']['body']\"), \" and append it to our global \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"comment_array\"), \" variable.\"), mdx(\"p\", null, \"After appending the comment to the \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"comment_array\"), \", we need to check if there\\nare any replies to that comment. Since we will be doing this check many times,\\nit's best that we write a helper function for it. We'll call it \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"get_replies\"), \":\"), mdx(\"pre\", null, mdx(\"code\", _extends({\n    parentName: \"pre\"\n  }, {\n    \"className\": \"language-python\"\n  }), \"def get_replies(comment):\\n    if comment['replies'] != \\\"\\\":\\n        children = comment['replies']['data']['children']\\n        parse_children_for_comments(children)\\n\")), mdx(\"p\", null, \"First, we check if there are any replies. When there are no replies, the\\n\", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"replies\"), \" property will be an empty string. If it is not an empty string, we know\\nwe have at least one reply. As I mentioned above, replies take the same format as the\\noriginal comment they are replying to. So in order to parse the reply text, we can\\nreuse the same \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"parse_children_for_comments\"), \" function we already wrote. 
Since\\n\", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"parse_children_for_comments\"), \" will again call \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"get_replies\"), \", and \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"get_replies\"), \"\\nwill again call \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"parse_children_for_comments\"), \" until there are no comments left,\\nthis will recursively continue until we reach a child comment with an empty\\n\", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"replies\"), \" property. Pretty neat.\"), mdx(\"p\", null, \"With those helper functions defined, we're ready to fetch our data. In order to\\ndo this, we will use a built-in Python function called\\n\", mdx(\"a\", _extends({\n    parentName: \"p\"\n  }, {\n    \"href\": \"https://docs.python.org/3/library/functions.html#input\"\n  }), mdx(\"inlineCode\", {\n    parentName: \"a\"\n  }, \"input\")), \" which will\\nallow the user to enter a URL to a reddit post.\"), mdx(\"pre\", null, mdx(\"code\", _extends({\n    parentName: \"pre\"\n  }, {\n    \"className\": \"language-python\"\n  }), \"# get url from user\\nprint('enter the reddit post url (e.g. https://www.reddit.com/r/redditdev/comments/krolrb/multicomments/):')\\nthread_url = input()\\n\")), mdx(\"p\", null, \"We can expect the user to paste in a URL for a reddit post. 
For example, it may\\nlook something like this:\\n\", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"https://www.reddit.com/r/redditdev/comments/krolrb/multicomments/\")), mdx(\"p\", null, \"To get the post data, we need to turn\\n\", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"https://www.reddit.com/r/redditdev/comments/krolrb/multicomments/\"), \" into\\n\", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"https://www.reddit.com/r/redditdev/comments/krolrb/multicomments.json\"), \".\"), mdx(\"p\", null, \"To do that, we can write a small helper function.\"), mdx(\"pre\", null, mdx(\"code\", _extends({\n    parentName: \"pre\"\n  }, {\n    \"className\": \"language-python\"\n  }), \"def sanitize_input(url):\\n    last_char = url[-1]\\n    if last_char == '/':\\n        url = url[:-1]\\n    url = f'{url}.json'\\n    return url\\n\")), mdx(\"p\", null, \"We pass the URL as an argument into the function. The function checks if the\\nlast character of \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"url\"), \" is a \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"/\"), \" and removes it if it is. Then the function\\nappends \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \".json\"), \" to the end of \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"url\"), \". 
After we pass the user's inputted URL to\\nthis function, we're ready to fetch the post data.\"), mdx(\"pre\", null, mdx(\"code\", _extends({\n    parentName: \"pre\"\n  }, {\n    \"className\": \"language-python\"\n  }), \"# pass user's url to sanitize helper\\nsanitized_thread_url = sanitize_input(thread_url)\\n\\n# make network call\\nreq_data = requests.get(sanitized_thread_url, headers={'User-agent': 'adamgoth.com'})\\n\\nif req_data.status_code != 200:\\n    print('request failed')\\n    print(req_data.json())\\n\\nif req_data.status_code == 200:\\n    json_data = req_data.json()\\n    for item in json_data:\\n        children = item['data']['children']\\n        parse_children_for_comments(children)\\n\")), mdx(\"p\", null, \"We call \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"requests.get()\"), \", passing our URL as the first parameter, as well as a\\nheaders value for a second parameter. The reason we need to specify a\\n\", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"User-agent\"), \" property in the header is so that we have a unique identity to\\nreddit. This will ensure we do not appear entirely anonymous and run into\\n\", mdx(\"a\", _extends({\n    parentName: \"p\"\n  }, {\n    \"href\": \"https://en.wikipedia.org/wiki/Rate_limiting\"\n  }), \"rate-limiting\"), \" issues.\"), mdx(\"p\", null, \"Once we have our data back in our \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"req_data\"), \" variable, the first thing we'll\\ncheck is if we did not get a \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"200\"), \" response for any reason. If the response is\\nnot \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"200\"), \", we will print out the error.\"), mdx(\"p\", null, \"Assuming we get a \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"200\"), \", we can then start parsing the data. 
We can use the\\nrequests library's built-in JSON decoder by calling \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \".json()\"), \" on the response. We\\nthen write a simple \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"for\"), \" statement that takes the children of each item in the response data\\nand passes them to the \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"parse_children_for_comments\"), \" function we previously discussed.\"), mdx(\"p\", null, \"After the \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"for\"), \" loop in the code above completes, we should have a number of comments\\nstored in our global \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"comment_array\"), \". Additionally, depending on the number of\\ncomments from the post, we may have found some additional comment IDs and stored\\nthem in our global \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"more_comment_ids\"), \" array. As a reminder, these are IDs we can\\nuse to fetch more comments that did not appear in the initial load. In the\\nreddit UI, these represent the links within comment threads that appear as \\\"load\\nmore replies\\\", and in our data response, these IDs come from the children that\\nhave a \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"kind\"), \" property value of \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"more\"), \".\"), mdx(\"p\", null, \"The URL for fetching the additional comment data looks similar to the URL we\\nused for fetching the initial post data. The only difference is the comment ID\\nis appended to the end. 
So\\n\", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"https://www.reddit.com/r/redditdev/comments/krolrb/multicomments.json\"), \" becomes\\n\", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"https://www.reddit.com/r/redditdev/comments/krolrb/multicomments/{comment_id}.json\"), \".\\nWe can write a simple helper function to do this for us.\"), mdx(\"pre\", null, mdx(\"code\", _extends({\n    parentName: \"pre\"\n  }, {\n    \"className\": \"language-python\"\n  }), \"def create_thread_url(comment_id):\\n    return sanitized_thread_url.replace('.json', f'/{comment_id}.json')\\n\")), mdx(\"p\", null, \"We simply pass the \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"comment_id\"), \" as an argument and then do a string replace on\\n\", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \".json\"), \" with \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"/{comment_id}.json\"), \".\"), mdx(\"p\", null, \"We're then ready to make the requests for the additional comments.\"), mdx(\"pre\", null, mdx(\"code\", _extends({\n    parentName: \"pre\"\n  }, {\n    \"className\": \"language-python\"\n  }), \"# handle extra comment ids\\nfor id in more_comment_ids:\\n    req_data = requests.get(create_thread_url(\\n        id), headers={'User-agent': 'adamgoth.com'})\\n    if req_data.status_code != 200:\\n        print('request failed')\\n        print(req_data.json())\\n\\n    if req_data.status_code == 200:\\n        json_data = req_data.json()\\n        for item in json_data:\\n            children = item['data']['children']\\n            parse_children_for_comments(children)\\n\")), mdx(\"p\", null, \"To fetch the additional comments, we'll use another \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"for\"), \" loop to loop through\\neach ID in the \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"more_comment_ids\"), \" array. 
For each one, we again use\\n\", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"requests.get()\"), \", passing the comment ID to the \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"create_thread_url\"), \" function we\\njust wrote, along with the same \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"User-agent\"), \" header as our previous request.\\nOnce we have our response, we again check the status code, and if it's\\nsuccessful, we'll parse the data the same way we did before, passing each child\\nin the data to \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"parse_children_for_comments\"), \". As a word of caution, for posts\\nwith thousands of comment replies, this can result in a large number of\\nadditional comment IDs. It's possible to have hundreds of IDs to fetch. Each one\\nof these will require a synchronous network call, so it can take quite a while\\nif this is the case.\"), mdx(\"p\", null, \"Once all the additional comment IDs have been fetched, we have all the data we\\nneed to run our word analysis. To do this, we will combine all of the comments\\nin our global \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"comment_array\"), \" variable into a single string. We will then write\\na function which will parse that string and keep track of how many times each\\nword appears. 
The function to do that looks like this:\"), mdx(\"pre\", null, mdx(\"code\", _extends({\n    parentName: \"pre\"\n  }, {\n    \"className\": \"language-python\"\n  }), \"def analyze_words(words):\\n    analysis_string = words.split(' ')\\n    word_dict = {}\\n    for word in analysis_string:\\n        cleaned_word = word.replace('.', '').replace(\\\"'\\\", '').replace(\\n            '\\\\n', '').replace(',', '').replace(\\\"\\u2019\\\", '').lower()\\n        if cleaned_word not in word_dict:\\n            word_dict[cleaned_word] = 1\\n        else:\\n            word_dict[cleaned_word] += 1\\n\\n    return word_dict\\n\")), mdx(\"p\", null, \"The function takes a single string as an argument called \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"words\"), \". It then breaks\\nthe string into an array of words called \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"analysis_string\"), \" by splitting the\\nstring on each space character found in the string. We create an empty\\n\", mdx(\"a\", _extends({\n    parentName: \"p\"\n  }, {\n    \"href\": \"https://docs.python.org/3/tutorial/datastructures.html#dictionaries\"\n  }), \"dictionary\"), \"\\ncalled \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"word_dict\"), \" that we will use to keep track of each word's appearance.\\nThen we loop through each word in our \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"analysis_string\"), \" array. For each word, we\\nuse string replaces to strip out various common special characters (commas,\\nperiods, etc.) and then call \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \".lower()\"), \" on it to convert all uppercase\\ncharacters to lowercase characters. This ensures that \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"The\"), \" and \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"the\"), \" are not\\ntracked as two different words. 
As we go through each word in the array, if the\\nword does not exist in our \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"word_dict\"), \" dictionary yet, we will add it and give\\nit a count value of \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"1\"), \". If it already exists in \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"word_dict\"), \", then we will just\\nincrement the count value by 1. When we are finished looping through each\\nword, we will return the \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"word_dict\"), \" we created.\"), mdx(\"p\", null, \"The end of the script looks as follows:\"), mdx(\"pre\", null, mdx(\"code\", _extends({\n    parentName: \"pre\"\n  }, {\n    \"className\": \"language-python\"\n  }), \"comment_string = ' '.join(comment_array)\\nresults = analyze_words(comment_string)\\n\\n# use a separate name so the built-in sorted() is not shadowed\\nsorted_results = sorted(results.items(), key=lambda x: x[1], reverse=True)\\n\\nprint(f'{comment_count} comments analyzed')\\n\\nfor entry in sorted_results:\\n    print(entry)\\n\")), mdx(\"p\", null, \"After combining all the comments into a single string and passing that string\\nthrough \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"analyze_words\"), \", we can sort all the results by the number of\\nappearances counted by calling\\n\", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"sorted_results = sorted(results.items(), key=lambda x: x[1], reverse=True)\"), \". 
We can\\nthen print the total number of comments we parsed and then each word and the\\nnumber of times it appeared.\"), mdx(\"h3\", null, \"Wrapping up\"), mdx(\"p\", null, \"The script in its entirety can be found\\n\", mdx(\"a\", _extends({\n    parentName: \"p\"\n  }, {\n    \"href\": \"https://github.com/adamgoth/reddit-comment-analysis/blob/master/reddit-comment-analysis.py\"\n  }), \"here\"), \".\\nTo run the script, simply run \", mdx(\"inlineCode\", {\n    parentName: \"p\"\n  }, \"python reddit-comment-analysis.py\"), \" from the\\ndirectory containing the script file.\"), mdx(\"p\", null, \"If you have Jupyter Notebooks installed, a more interactive version of this post\\ncan be found\\n\", mdx(\"a\", _extends({\n    parentName: \"p\"\n  }, {\n    \"href\": \"https://github.com/adamgoth/reddit-comment-analysis/blob/master/word_analysis.ipynb\"\n  }), \"here\"), \".\"), mdx(\"p\", null, \"This script serves as a basic starting point for fetching and analyzing data\\nfrom the web. There is room for many improvements and enhancements to this\\nscript. 
Ideas for additional features include:\"), mdx(\"ul\", null, mdx(\"li\", {\n    parentName: \"ul\"\n  }, \"Input validation\"), mdx(\"li\", {\n    parentName: \"ul\"\n  }, \"Options for handling upper and lower casing\"), mdx(\"li\", {\n    parentName: \"ul\"\n  }, \"Options for removing special characters\"), mdx(\"li\", {\n    parentName: \"ul\"\n  }, \"Options for removing common words (the, and, I, etc.)\")), mdx(\"p\", null, \"If you enjoyed this post or found it useful, please consider\\n\", mdx(\"a\", _extends({\n    parentName: \"p\"\n  }, {\n    \"href\": \"https://twitter.com/intent/tweet?url=https%3A%2F%2Fwww.adamgoth.com%2Fanalyzing-reddit-comments-using-python\"\n  }), \"sharing it on Twitter\"), \".\"), mdx(\"p\", null, \"If you want to stay updated on new posts,\\n\", mdx(\"a\", _extends({\n    parentName: \"p\"\n  }, {\n    \"href\": \"https://twitter.com/intent/follow?original_referer=https%3A%2F%2Fpublish.twitter.com%2F%3FbuttonType%3DFollowButton%26query%3Dhttps%253A%252F%252Ftwitter.com%252Finit_adam%26widget%3DButton&ref_src=twsrc%5Etfw&region=follow_link&screen_name=init_adam&tw_p=followbutton\"\n  }), \"follow me on Twitter\"), \".\"), mdx(\"p\", null, \"If you have any questions, comments, or just want to say hello,\\n\", mdx(\"a\", _extends({\n    parentName: \"p\"\n  }, {\n    \"href\": \"https://twitter.com/messages/compose?recipient_id=33618361\"\n  }), \"send me a message\"), \".\"), mdx(\"p\", null, \"Thanks for reading!\"));\n}\n;\nMDXContent.isMDXComponent = true;","frontmatter":{"title":"Analyzing reddit comments using Python","date":"2021-01-18","author":"Adam Goth","preview":"In this post, we'll take a look at how to build a simple Python script for word analysis. We will then apply it to the comment section of any given reddit post.","keywords":["python","word analysis"]},"timeToRead":8}},"pageContext":{"id":"e68c0391-1b3f-538e-a9fd-69eb8ebfc670"}},"staticQueryHashes":["63159454"]}