Kristi Thompson: Hello, and welcome to this workshop. This is Kristi Thompson. I'm a research data management librarian on the research and scholarly communications team in the library. Basically, I help people access, do research with, manage, and archive all kinds of data. I've been asked to provide a pretty brief introduction to how to access and do research with social media data. This isn't going to be an exhaustive overview; I'm hoping instead to spark some ideas about the possibilities of social media research. And I very much hope that if any of you would like to explore this further, you'll reach out to me and my colleagues, and we can help you get going.

Social media data is data derived by collecting material that people share on social networking sites and assembling it into a data set for analysis. Social media data consists of content, which is the words, images, videos, and reactions (such as thumbs-up) shared by users of social media platforms. And social media data also includes metadata, which is information about the data and about the author of the data.

Metadata is an important part of social media related research. It includes platform-generated metadata, which is linked-to and embedded information that describes things like when the content was shared, who shared it, how often it was reshared, and where the user is from. And it also includes user-generated metadata, such as, on Twitter, hashtags and mentions. So that's the what.

Now, why do people analyze social media data? It turns out social media data is great for social research, because social media allows access to the unprompted, unscripted opinions of users in their own words. Traditional research usually elicits opinions under artificial conditions, such as surveys or focus groups. In contrast, social media lets you discover reactions to events as they happen in real time, without prompting and without any observer effect from the researcher. And social media also allows you to analyze connections between people and trace how ideas spread.
Twitter is particularly suited for this type of research because of data availability, wide use of hashtags, and easily traced networks. But I'll be talking about some of the other types of social media you can do research with as well; they each have their unique advantages and disadvantages.

There are also reasons why social media data needs to be used carefully, and why results from social media research need to be treated with caution. Social media is performative: both positive and negative online feelings are often overstated, and interest in a topic may not actually translate into further action. We've all heard the stereotype of the social media warrior who talks about things a lot but doesn't do much. In addition, users of social media are not representative of populations, so there's self-selection bias. Some people produce far more output than others, which creates quite a bit of bias; you basically end up with a lot of data on people who like to talk on social media a lot. And then there's also the difficulty created by the sheer scope of social media data. Most searches will retrieve large amounts of content that is difficult to comb through, with much of the data being irrelevant to a research question or impossible to interpret. Basically, you always need to think very carefully about what exactly you're studying. Who are the people that use this social media platform? What are their demographic characteristics? Who's being left out of this sample? Is there another way to get this information? Just like all research.

There are quite a lot of social media platforms out there. Some of them are easier to do research on than others. Twitter is probably the most popular and easiest to use social media network for research. Tweets are popular for analysis because they're short, consistent text and contain user-generated metadata, which makes them suitable for automated collection and analysis. Also, Twitter actually allows people to access Twitter data through the Twitter API, or application programming interface, which makes it easy to get data automatically and import it into a computer program. We'll be talking later about how to take advantage of APIs to harvest data.
Even if you aren't very technical, it's actually pretty easy to do.

YouTube is widely considered both a video sharing platform and a social media network due to its quantity of user-generated content. Both the videos and the comments and discussion that take place on the videos are available for analysis. Video analysis can be pretty time consuming, but content analysis can be quite rich and revealing, and harvesting the comments from a single video will often produce a data set that's a good size for basic analysis. YouTube also has an API for harvesting data.

Online news and discussion forums such as Reddit allow the analyst to collect longer-form discourse. You can collect the text of a single long discussion using copy and paste from your browser, or Reddit, again, has an API for automated searching of data.

Some popular social media networks, such as Instagram or Facebook, are less suitable for analysis, either because their terms of service do not allow data collection or because you can only access limited information due to people's privacy settings. Facebook is one limited social media network that does make some specific data sets available to researchers, but in general their data is not available for scraping.

You need to be especially cautious about using some platforms, particularly if you're concerned about being sued. LinkedIn is an interesting case: their terms of service explicitly disallow scraping, and they have attempted to sue outside entities that scraped their data. Scraping LinkedIn's public information, what you can see while not logged in, was however ruled legal following a court case. Even so, their terms of service still disallow automated access, and they may attempt to block any attempt to access their data automatically. Instagram's another one: their terms of service do not allow scraping the data, apart from the CrowdTangle public data, which is available through Facebook's Data for Good interface. And the privacy settings on Instagram make it quite difficult to get at the data.

Google Groups is another one. This isn't as popular a social platform as some of the others, but I'll explain why it's worth considering.
Now, their terms of service describe scraping of content as spam, and it may result in deletion of your account. On the other hand, they don't explicitly make any reference to suing people. Some Google Groups are public and can be accessed without an account, and these may be legitimate targets for research. They include the Usenet archive, dating back to 1981, an archive of online discussion that took place before the web was even developed. So if you're doing historical research, looking into Google Groups may be worth the effort. For any other platforms, check the terms of service to see what's allowed and how you can go about getting access.

If you need to sign up for an account on a platform in order to see the data, check the terms of service to find out exactly what you're agreeing to. If data can be accessed and read without an account, you're probably safe legally when it comes to doing research using it, but the platform provider may well have made it technically difficult. There's been some case law on this. In the US, as I've mentioned, LinkedIn data scraping was ruled legal. However, in Canada, a particular federal court case ruled web scraping illegal, at least if done deceptively for profit. There hasn't been a ruling on more general cases, such as accessing data automatically for research purposes. In general, scraping social media data is legally safe if you comply with the terms of service and stick with data that's available to the general public.

So that's legality. How about ethics? I reached out to representatives of Western's research ethics board while preparing these slides to make sure that I was complying with local practice. Under TCPS 2 Article 2.2, publicly available information is exempt from research ethics board review. This includes identifiable information, so social media data where there is no expectation of privacy is covered; you don't even need to go through review for this type of research. Basically, what it comes down to is that naturalistic observation of public behavior does not require review, even if the observation is occurring online. You're just watching people do what they normally do. This is distinct from information in private groups or chats.
Where a group is available to all users of a platform, and anyone is allowed to register with the platform, the information may be considered public. If you need to ask permission from a moderator or other entity to join a group, participants in that group may have a reasonable expectation of privacy, and you may have to gain their consent in order to do research on them. Also, if the researcher is doing any sort of intervention, asking questions or participating in conversation in an online group, then review and possibly informed consent will be required.

I'd like to focus on Twitter data over the next few slides. Twitter is the most popular data source for social media research because of its relative ease of access and the wide use of hashtags, which make it easy to track how ideas spread. Until recently, the only options for accessing Twitter data were the free standard search, which severely limits the amount of data you can access to only the most recent few days or weeks, and expensive premium or enterprise search. These are both through the Twitter API. However, Twitter recently added a new free academic research track, which allows access to the full historical corpus of data. Access to this stream requires an application, and approval of your application by someone at Twitter. I actually don't know who qualifies or what sort of factors they consider in granting approval, because it's clear from the terms that I don't qualify at all, so I couldn't even apply. To apply, you must be employed as a researcher at an academic institution, or be a graduate student working on a thesis or dissertation (not a class project, and not an undergrad). And you must have a specific, clearly defined research project.

For everyone else, standard search is still available. Standard search through the API allows access to the most recent seven to nine days of data. Another option is using existing archived collections of tweets. You can go out and search for a topic, such as "COVID-19", plus "Twitter" and "data" or "corpus", to find existing collections; you might get lucky. For example, George Washington University has assembled a variety of large collections of current and historical Twitter data on various topics, and provides some instructions on how to use them.
Many other Twitter data archives are available, including a massive unsorted collection at the Internet Archive.

The easiest low-tech way to capture Twitter data is using a browser extension. Browser extensions allow you to construct a Twitter search and then pull the data off the results page into a data set. For example, Data Miner and QSR's NCapture are two that I have used with some success, and both can be added to Google Chrome. These will also scrape data from Facebook, YouTube, and general web pages. You have to go through the Twitter web interface search to locate the tweets first, and the number and selection of tweets returned by a search through the web interface is determined by Twitter's algorithms, so you may not get all relevant tweets in your time period. You're giving up some control over your sample; it's kind of hard to know exactly what you're getting, or why they're giving you what they're giving you. Also note that the NCapture extension is very easy to use, but only works with the qualitative software package NVivo; it creates files that can't be used elsewhere.

Application programming interfaces, or APIs, are protocols that manage and transmit data from a website to your software. To use the Twitter API, you need to have a Twitter account, sign up as a developer, and access the API using a programming language such as R or Python. You don't actually need to know how to program in order to use a programming language to gain access to Twitter.

Start by choosing an application track. Now, the steps I'm going to be showing you are relatively new. If you've done this before, there may be some new steps to follow; when I reviewed Twitter data access a year or so ago, the steps were different, and they may change again. So, start by choosing an application track. The three main application tracks on the Twitter site are Professional, Hobbyist, and Academic. Professional allows you access to everything, but you need to pay for it, and it is not cheap.
Academics and hobbyists (not counting academics who follow the academic research track) both get forwarded to the standard application, which allows limited access to recent Twitter data, about the last seven to nine days of tweets. To apply for the free, full-archive-access academic research track, you need to have a specific, clearly defined research project, which you will describe as part of the application process. You also need to be employed as a researcher at an academic institution or be a grad student working on a thesis or dissertation.

After you complete the application, you'll be forwarded to the developer portal. Right now you can create an app to use the old Twitter v1.1 API, or a project to use the early access version of the v2 API. The v2 API is still evolving, so I used v1.1 and created an app. Your app will generate a set of keys and tokens that you will need at the step where you set up your data access script. The portal won't save them, but you can always regenerate a new set if you lose them. So you generate your keys and tokens, and copy them somewhere where you won't lose them.

I've talked a bit about using APIs to get your data. To do that, you need the right software. A free, widely used tool is R, which is a free statistical programming language. It's used for procedures like manipulating, editing, and visualizing data, computing statistics, and accessing data through APIs. Unlike most visual programs, you interact with R by typing commands at a prompt or by writing programs, which are groups of commands that can be run in sequence. However, you don't need to know how to program to accomplish tasks with R. Many users have written helpful recipes that you can simply follow step by step to accomplish various tasks, like getting Twitter or YouTube data, or running a trend analysis. I've linked to a couple of these recipes at the end of this presentation, and there are many others. Feel free to reach out to me if you decide to take the plunge into using R to gather and analyze data.

This is just a screenshot to show you what the basic R interface looks like, so you have some idea what you're getting into.
There are add-ons available, such as RStudio, that can make R a little nicer to use if you expect to be doing a lot of work with it. But basically, using R is a matter of sitting at a prompt and typing in commands, or writing scripts that contain lists of commands.

Here's a basic R recipe, or script, showing how to use the rtweet package to access data, which is the method I recommend. rtweet is one of many add-on packages available for R that make it easier to do various tasks. You need to start by installing it using the install.packages command and then loading the rtweet library. If you decide to use rtweet, don't start by looking at the official rtweet documentation, which has a list of several dozen functions in alphabetical order; it's just not very user friendly. One of the developers, Michael Kearney, has some very good getting-started instructions that I'm linking here.
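To make that concrete, here's a minimal sketch of what such a script might look like, following the rtweet 0.x interface. The app name, keys, tokens, and search term are all placeholders, not values from the slides; substitute the credentials generated in your own developer portal.

```r
# Install and load the rtweet package (install only needs to happen once)
install.packages("rtweet")
library(rtweet)

# Authenticate using the keys and tokens from your Twitter app.
# Every value below is a placeholder.
token <- create_token(
  app             = "my_research_app",
  consumer_key    = "YOUR_API_KEY",
  consumer_secret = "YOUR_API_SECRET",
  access_token    = "YOUR_ACCESS_TOKEN",
  access_secret   = "YOUR_ACCESS_SECRET"
)

# Pull up to 1,000 recent tweets matching a hashtag, excluding retweets
tweets <- search_tweets("#COVID19", n = 1000, include_rts = FALSE, token = token)

# Peek at the content plus a little platform-generated metadata
head(tweets[, c("created_at", "screen_name", "text", "retweet_count")])
```

With standard access, a search like this only reaches back over the most recent days of tweets, as discussed earlier.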
Okay, that's it for Twitter. Next I'm going to be talking about YouTube.

Now, video data is pretty challenging to work with. Analysis generally needs to be done by hand, watching videos slowly and taking notes. However, like other platforms, YouTube also has plenty of textual data and metadata, and this textual data can be analyzed much like other forms of text data. Comments on videos provide a rich corpus for text analysis. These can be analyzed in conjunction with things like viewer and channel statistics.

Google owns YouTube, so the first step is to set up access to the Google APIs, and you need to be logged in with a Google or Gmail account to do this. So first, you enable the APIs on the Google API console. I'll be showing you how to access these APIs using the R tuber package. Actually doing this is even more complex than working with Twitter, because the Google Cloud Platform provides access to a number of APIs and tools aside from YouTube. So you basically need to search for the YouTube APIs, and they will eventually come up. I'll note that many of the instructions I looked at also recommended enabling something called the Freebase API; however, this has been deprecated and no longer comes up in a search. From what I could find out, basic things seem to work without it, but I'm not sure exactly what it did, or if not having it will break older example scripts.

After you enable the YouTube APIs, you need to create an OAuth client ID to use in your scripts. Here's another place where I had trouble with the online instructions I found. They told me to select the application type "other" to create the correct type of client ID. However, "other" no longer appears as an option. Some web searching later, I found that "Desktop app" is the new "other", so this is the one you need to select.

The most widely used R package for accessing YouTube data is tuber. It works much like rtweet, with the user setting up a connection using the credentials they generated and then using simple commands like get_all_comments to get all the comments on a specific video, using the video ID, which can be found in the URL of the video.
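Here's a hedged sketch of what that looks like with tuber. The client ID and secret stand in for the OAuth credentials created in the Google API console, and the video ID is just an example pulled from a URL; the column names reflect the YouTube API's field names as tuber typically returns them.

```r
# Install and load tuber
install.packages("tuber")
library(tuber)

# Authenticate against the YouTube API with your OAuth client credentials
# (placeholders here); a browser window opens to complete the login.
yt_oauth(app_id = "YOUR_CLIENT_ID", app_secret = "YOUR_CLIENT_SECRET")

# Fetch every comment on one video as a data frame. The video ID is the
# string after "v=" in the YouTube URL.
comments <- get_all_comments(video_id = "dQw4w9WgXcQ")

# Comment text plus some metadata useful for analysis
head(comments[, c("authorDisplayName", "textOriginal", "likeCount", "publishedAt")])
```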
The last two social media platforms we'll be discussing are Facebook and Reddit. While there are many others, I've tried to cover a selection of the most widely used platforms that actually allow some sort of data mining, and also to cover a range of different types of social media.

Reddit is a social media news and discussion site with forums, called subreddits, on everything from politics to programming to cat pictures. Some Reddit forums are rather controversial, outright offensive, or simply illegal; for example, the subreddit that has tips on shoplifting without getting caught. And Reddit has been inconsistent about banning these; they occasionally decide to purge one or more of them and let others pop up.

As a site that relies on users for content, interaction, and moderation, some interesting data can be pulled from Reddit that can show real-time trends in popular topics. Users upvote and downvote each other's comments, with upvoted comments rising higher in the threads, so a popular comment is much more likely to be seen by another user, and perhaps then further upvoted or commented on. This makes for a complex network of users and discussions that also provides insight into trending topics, and into how popular, or agreed with, particular opinions are. All these factors make Reddit a rich source of material for analysis of real-time trends in popular topics.

I suggest accessing Reddit using the R package RedditExtractoR. To use this, you don't have to set up authentication or create any kind of account to access the Reddit API. Yay.
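As a rough sketch, a RedditExtractoR session might look something like this. The function names here follow version 3 of the package; the interface has changed between versions, so check the package documentation, and the keyword and columns shown are just illustrations.

```r
# Install and load RedditExtractoR; no account or API keys are required
install.packages("RedditExtractoR")
library(RedditExtractoR)

# Find recent threads matching a keyword across Reddit
threads <- find_thread_urls(keywords = "data privacy",
                            sort_by = "top", period = "month")

# Pull the full comment trees for those threads
content <- get_thread_content(threads$url)

# Comments arrive as a data frame ready for text analysis
head(content$comments[, c("author", "comment", "score")])
```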
Now, unlike the rest of the platforms I've covered, Facebook does not allow direct access to user data by researchers through any sort of API. But I still wanted to mention it, because it does provide access to some data, and it is a very influential platform. The terms of service state that you may not access or collect data from their products using automated means without their prior permission, or attempt to access data that you do not have permission to access. However, they do have some data that is made available specifically for academic researchers, including the Facebook Ad Library, the international Future of Business Survey, and some datasets shared on the Humanitarian Data Exchange, including some COVID-19 symptom surveys. There's also CrowdTangle, which is a Facebook-developed tool for analyzing public pages, public groups, verified profiles, and public Instagram accounts. This data is accessed in a variety of ways: the humanitarian data can be downloaded directly to Excel, the business survey requires an application and approval, and the CrowdTangle data is accessed through a Facebook developer interface. According to Facebook, the Ad Library provides advertising transparency by offering a comprehensive, searchable collection of all ads currently running across Facebook products, including Instagram. Facebook provides both a searchable, browsable interface to the Ad Library and API access. To get API access, however, you need to sign up for a Facebook developer account, so I didn't explore exactly how that one works any further.

Now that you've got your social media data, what are you going to do with it? I wanted to introduce you to some of the very popular types of research that social media data like Twitter's is used for. This doesn't cover everything, but these are some of the most popular types of research that you're likely to come across in the literature.

Network analysis is finding out who follows whom and who talks about what, and tracing how ideas spread through social media networks. It's widely used in marketing as well as in political science.

Trend analysis looks at the popularity of concepts and hashtags over time, and can be used to relate opinions to events, or to see which events resonate most with the general public. Trend analysis is a fairly simple type of analysis. It can be useful to supplement a larger project, but there's only a certain amount of insight you can gain from it.

Sentiment analysis is another fairly simple analysis that's used to automatically classify tweets or other short expressions on social media associated with a particular topic as positive, negative, or neutral.
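This wasn't on the slides, but as a quick illustration, here's a sketch of dictionary-based sentiment scoring using the syuzhet package, one of several R options for this; the example tweets are made up.

```r
# Install and load syuzhet, a package for dictionary-based sentiment scoring
install.packages("syuzhet")
library(syuzhet)

# A few invented tweet-like texts for demonstration
tweets <- c("I love this new playlist, so calming",
            "This lockdown is making me anxious and miserable",
            "Case numbers announced today")

# Score each text against the Bing lexicon: scores above zero lean
# positive, below zero negative, and zero is neutral
scores <- get_sentiment(tweets, method = "bing")
data.frame(text = tweets, score = scores)
```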
And lastly, content analysis is often used by qualitative researchers to summarize complex data and find themes or common concepts that dominate a particular discourse. Content analysis is a way of teasing out the commonalities in a mass of disparate text so you can make inferences about the messages in the texts, and it is suitable for doing very complex types of research.

Here's an example of a trend analysis that was done using Twitter data. The researcher looked for posts mentioning various race- or ethnicity-related keywords to see which one of a set of events caused the most discourse around the topic.

Content analysis is frequently used with social media data. It's widely used in qualitative research any time the researcher needs to make sense of a large amount of textual, voice, or video data. Content analysis is basically a formalized way of summarizing a mass of text data to make it easier to discover meaning. Chunks of text (phrases or sentences in interview data, for example, or individual tweets) are tagged with descriptors that classify them according to some classification schema. Classification schemes can be a priori, which means you come up with a list of terms according to some theory before analysis, or inductive, which means the scheme is made up from terms discovered by looking through a subsample of the data.

The research focus in content analysis can be more quantitative, focused on counting how often particular themes occur in the data, or it can be more qualitative, focused on identifying and describing key themes; this is also called thematic analysis. Another division is between relational analysis, which looks at which concepts tend to occur together in a text, and conceptual analysis, which is just coding for simple concepts. Social media data like tweets and comments may be more suited to conceptual analysis, being short and often focused on a single key concept. I know I'm not really giving you enough detail to go out there and start conducting a content analysis of your own, but I did want to introduce these basic concepts to give you an idea of what types of keywords to search for when you want to figure out exactly how to analyze your social media data.

Here's one way you might conduct a content analysis of some Twitter data. Imagine our research question is: how do Twitter users use music to deal with COVID-19 anxiety? The first step in getting the data is to choose a hashtag, phrase, or set of terms relating to your research question that you want to investigate. A search string might include: COVID-19 or COVID or corona; music or playlist; anxiety or fear or worry. Next, you'd use these terms to retrieve a set of data using rtweet or some other package (a sketch of this step follows the walkthrough). You would have to decide on a sample size; for example, you might decide to look at the first 500 tweets retrieved by your search. You'd extract the data and import the sample into Excel or NVivo or some other software package. And then comes the most time-consuming part: reading through your data to see what groupings emerge, and tagging each tweet with a descriptive word or phrase summarizing its theme, or, in some cases, tagging it as irrelevant or uninterpretable. Finally, you would group and condense your descriptors into broad themes, and use these themes to summarize what you've learned about your research question.
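Here's a sketch of the data-gathering half of that workflow using rtweet, assuming you've already authenticated as in the earlier example. The query string and file name are just illustrations of the search terms above.

```r
library(rtweet)

# Combine the example terms into one standard-search query:
# each parenthesized group must match, with OR between alternatives
query <- "(COVID19 OR covid OR corona) (music OR playlist) (anxiety OR fear OR worry)"

# Retrieve a sample of 500 tweets, excluding retweets
tweets <- search_tweets(query, n = 500, include_rts = FALSE)

# Export the sample for hand-coding in Excel, NVivo, or similar
write_as_csv(tweets, "covid_music_sample.csv")
```

The reading, tagging, and theming steps that follow are manual, which is why they're the most time-consuming part.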
Here's an example of a fairly simple content analysis that was done to classify tweets on the H1N1 epidemic by type. This particular analysis sorted the tweets into groups such as resource, personal experience, and jokes. Analyses with a different focus might have classified them based on emotional content, positivity or negativity, or some other schema. There are always many different ways to classify any given set of data, depending on what your research question is.

Now, that's all I've prepared for you today. But if you're interested in the topics I've discussed, there are plenty of online sources where you can find more information. I've compiled a selection of my favorite links as starting points. And again, I'm Kristi Thompson, research data librarian, and if you have any questions or want support with any of the topics I've talked about, you can contact me directly at kthom67 at uwo dot ca, or reach out to the research and scholarly communication team at rsclib@uwo.ca. Thank you for listening.