Description
apwheele / entity_masking repository on Github
This Pub is a copy of the original version (10/31/24) published on CrimRxiv https://doi.org/10.21428/cb6ab371.15d7c59e. For the current version, please visit that URL.
Qualitative researchers are expected, sometimes required, to publish their data open access. This is for the sake of science, impact, and social justice. Yet, understandably, qualitative criminologists are worried about what this means for their workload and their ability to protect subjects’ confidentiality. To be solutions-oriented, we developed an open-source Python script for anonymizing qualitative data. It uses named-entity recognition and fuzzy-rule based merging to identify and replace personally identifiable information (PII) with unique pseudonyms. This tool doesn’t eliminate the need for manual work, but it reduces the cost and associated risk. In this article, we describe and explain how our script works and how to use it. We conclude by discussing the implications for open (qualitative) criminology.
qualitative data; anonymization; confidentiality; open criminology; data science
Please direct correspondence to Scott Jacques ([email protected]).
We’re part of a team researching illicit marketplaces on encrypted communication platforms (e.g., Telegram). People use them to illegally buy and sell guns, drugs, cosmetics, electronics, gift cards, most anything. We find and scrape the platforms’ “channels,” where this activity occurs. We analyze the resulting data to describe, explain, and mitigate it.
We’ll publish the findings, which simply means “make them public.” “Publishing” is the process of providing “public access” (PA) to “publications.” In the academic context, the goal is to make information and knowledge more available for study and application, instead of keeping it private.
It’s possible to publish any output (i.e., “work product,” “deliverable,” “creation”). In addition to articles and books, examples are datasets and software code. These are disseminated via journals, of course, but also on “outlets” such as scholars’ websites, repositories, and on networking sites (e.g., social media, ResearchGate).
To maximize a publication’s positive effect, it must be more than PA. It must be “open access” (OA): “digital, online, free of charge, and free of most copyright and licensing restrictions” (Suber, n.d.). OA exists so everyone can afford scholarship, making our work more impactful, scientific, and socially just (see, e.g., Baldwin, 2023; Willinsky, 2023; in criminology, see Ashby, 2021; Buil-Gil et al., 2024; Chin et al., 2023; Jacques, 2023; Worrall and Wilds, 2024).
To be clear, all OA publications are PA but not vice versa. OA material is more than “not private.” Rather, OA publications are truly available to anyone with a computer (including smart devices) and internet. This compares to, for instance, libraries with public archives that are only viewable in-person; or if digitized, with “all rights reserved.” Another example is textbooks, which are PA but not affordable to all students.
The opposite of OA is “closed access” (CA). This refers to scholarship that’s “paywalled” or “file-drawered.” The former refers to published outputs that aren’t freely accessible. The latter refers to unpublished outputs or “no PA” (see Rosenthal, Kleid, and Cohen, 1979).
Paywalls block the majority of published criminology articles (Ashby, 2021). Previously, we’ve written about why and how to legally and easily provide OA to them (Jacques, 2023; Wheeler, 2019).
If CA is an iceberg, paywalls are the tip. The bigger problem is unseen. A lot of work is left unpublished. It gets proverbially file-drawered, stowed away for private keeping. Because it’s not PA, this is more restrictive than a paywall.
By keeping information and knowledge from the public, file-drawers limit impact and scienticity. Sometimes an exception is made for privileged scholars (e.g., colleagues, students), but this is a social injustice because the opportunity is unequally distributed.
We figure the majority of criminology’s publishable outputs are file-drawered. We don’t know of any research on this conjecture (but see Buil-Gil et al., 2024). Yet consider all the syllabi, lecture notes, PowerPoint slides, datasets, code files, and other outputs we keep private.
Criminologists’ revealed preference, to date, is to perish over publish anything except articles and books.1 This made sense until recent history. Before the computer and internet, it wasn’t economically feasible to publish everything. Printing costs a lot more than copying. As technology improves, the cost of PA and OA moves toward zero. They’re becoming more rational.
Of all the outputs produced by criminologists, it’s maybe hardest to make one’s data OA. To be clear, the actual publishing-process is easy: just upload the file to a repository. The hard part is preparing data for publication (UK Data Service, n.d.; Westbury et al., 2022). Partly, this is why there’s greater access to quantitative than qualitative data (Campbell et al., 2023).2 The remainder of this article focuses on the latter.
Bucerius and Copes (2024) wrote an impassioned article against publicly sharing qualitative data (cf. Campbell et al., 2023). Their concerns span ethics, inequalities, knowledge inhibition, and career damages. In terms of concrete processes and outcomes, their dominant concerns include (1) the risk of breaking confidentiality with participants, and (2) the prohibitive cost of mitigating this risk.
On the first concern, they wrote:
Research ethics guidelines mandate that we maintain the confidentiality of participants, which is essential to preserving the integrity of our work. … [A] requirement to share data to a publicly accessible data repository compromises this confidentiality[.] Given the profile of topics and groups that criminologists study, this poses the risk of putting at risk already marginalized groups. … [I]nterviewees share identifiable information about their lives, including trauma or being involved in extremely sensitive situations. …
Even when confidentiality isn’t broken, they argue, the mere possibility may undermine research. Quoting Bucerius and Copes (2024):
[I]f required to upload entire interview transcripts and ethnographic field notes, it would undoubtedly negatively impact the authenticity of data collection.3 … This would lead to more stilted and inauthentic conversations with participants, ultimately leading to lower quality data, which, by extension, would do a disservice to our discipline and knowledge production. (p. 8)
[This] may deter organizations from collaborating with qualitative researchers. In some cases, entering partnerships with the understanding that entire interview transcripts will be shared publicly will be impossible. … [It] will certainly hinder or even preclude research collaboration and, ultimately, inhibit knowledge production. (p. 8)
It is possible, given the formal ethical expectations for anonymity and confidentiality, that IRBs will simply not approve research that involves unconstrained disclosure. … [T]his will have a serious impact on the types of research scholars anticipate conducting in the future[.] (p. 7)
Breaking confidentiality is a serious risk. It costs resources to prevent it. This has been true for at least a century. Pseudonyms are used in The Jack-Roller (Shaw, 1930) and The Professional Thief (Sutherland, 1937). Decades ago, Wright and Decker (1994) provided de-identified access to their interviews with offenders.
Anonymization is an old practice (Saunders, Kitzinger, and Kitzinger, 2015). By definition, it’s stripping data of personally-identifying information (PII) that should be confidential. The PII may belong to a person, group, place, event, et cetera; a participant, researcher, or third-party.
To anonymize data, qualitative criminologists already use various “techne” (see, e.g., Campbell et al., 2023; UK Data Service, n.d.; Westbury et al., 2022).4 Often, they’re required to do so by ethical boards, memorandums of understanding, and other written agreements (e.g., with funders).
Yet, there is a significant cost to anonymizing data. It’s daunting and scary because it takes careful work over a long stretch of time. To do otherwise risks breaking confidentiality. This cost and this risk deter criminologists from making their data PA and OA.
According to Bucerius and Copes (2024), it isn’t feasible or reasonable to ask criminologists to sufficiently anonymize their qualitative data:
[T]he prospect of going through hundreds of interviews to try and prospectively identify information that might be sensitive, logically identifiable to a person, or incriminating would be exceptionally difficult. (p. 7)
Even if it were possible to completely de-identify interviews, the substantial effort required in this process is a tremendous, inequitable, and ultimately an unnecessary burden. (p. 7)
[D]eidentifying materials places an undue burden on qualitative researchers, further increasing the already existing workload inequities. For junior researchers, this additional burden might make qualitative methods even less attractive. (p. 7)
A surefire way to minimize the risk and cost of anonymization is to avoid it. For example, Bucerius and Copes (2024) propose to:
adopt a policy that more closely aligns with the guidelines of the National Science Foundation (NSF), which requires qualitative researchers and ethnographers [should be required] to upload their interview guide and explanations on recruitment strategies, but does not require them to share interview transcripts and fieldnotes. At a minimum, researchers should be able to withhold making transcripts public when, in their professional opinion, there is the potential for serious harm to reputation or safety of participants and researchers, and where university IRBs have determined the transcripts cannot be made public. (p. 8-9)
Even if we didn’t want to provide OA to our data, the choice is fleeting. Starting 2026 in the United States, for example, federally-funded researchers must comply with this policy:
all peer-reviewed scholarly publications authored or coauthored by individuals or institutions resulting from federally funded research [shall be] made freely available and publicly accessible by default in agency-designated repositories without any embargo or delay after publication. (White House OSTP, 2022, p. 3)
The editors of Criminology recently announced its “authors will be required to upload data and code to the [journal’s designated] repository unless they obtain editor approval to withhold it” (Sweeten et al., 2024, p. 10). This announcement is what sparked Bucerius and Copes (2024).
Making our data OA will increase our work’s impact, measured by citation and altmetrics (e.g., views, users). It’ll make our work more scientific by providing researchers with the opportunity to analyze it. It’ll increase social justice by providing disadvantaged researchers with a resource: data.
By choice or force, criminologists will provide PA and OA to their data, sooner or later. Now is the time to prepare. As a collective of experts—a scholarly community—we should be modernizing our practices. Including, how we anonymize data and share it with the world.
The problems with anonymizing data are an opportunity to advance qualitative criminology. Always has been. But now, it’s pressing because of ever-higher expectations to make data PA and OA.
In addition to an emphasis on techne, we should lean into technology: tools that help us perform techne (for more details, see Proctor and Niemeyer, 2019). First came audio-recorders, which moved us past the need to jot notes during interviews. Then came software for analysis, replacing the use of highlighters and other analog tools.
More recently, it became possible to publish audio-recordings within articles (Allen, 2020). Recent advancements in AI enable text-to-speech with avatar voices. In short, there are exciting changes to how and what we publish.
By harnessing technology for qualitative criminology, we make the future brighter for everyone. It’ll make our craft more ethical, equitable, fruitful, and rewarding (cf. Bucerius and Copes, 2024).
Our data will become more authentic, leading to higher quality. It’ll be easier to enter partnerships with organizations. IRBs will be more approving. What was previously “exceptionally difficult” and an “unnecessary burden” (ibid., p. 7) will become “not too bad” and “worth it.” Workload inequalities will be reduced, making the craft more attractive to junior researchers.
In qualitative-criminology’s journey, a logical next-step is to invent, adapt, and use technology for anonymizing data. Tools can help us better protect confidentiality with less time, less effort, for less money, with more certain effects. Publishing these tools OA is utilitarian.
In this article, we present a tool that brings the cost and risk of sharing qualitative data closer to zero. To the extent it’s adopted and built-upon, anonymization-techne will become less costly and less risky for qualitative criminologists. This’ll aid efforts to make their data PA and, ideally, OA for the sake of scienticity, impact, and social justice. As a collective, we’ll generate better knowledge, with more usage, that’s accessible to everyone.
The technology is a Python script developed by the second author (Wheeler, 2023), henceforth referred to as “ours” for readability. What the script does is spot PII in text files and replace them with unique pseudonyms. If an instance of PII is found twice or more in the data, the program assigns it a single pseudonym.
Our goal wasn’t to create a script that’s entirely novel. Kleinberg and Mozes (2017) and Kleinberg et al. (2022) provide similar methods, for example. However, our version is open-source, freely licensed for everyone to use and adapt. In what follows, we describe and explain (1) how our script works and (2) how qualitative researchers can use it. In so doing, we contribute an illustration of text-anonymization in practice.
As a final disclaimer, we want to be clear that our tool will not remove every single instance of PII in most datasets.5 It doesn’t eliminate work, but, instead, makes it faster and more reliable. By sharing our tool with this accompanying article, we aim to help qualitative criminologists to provide more open data of higher quality.6
Our anonymization script has two main steps. First, it runs a named-entity-recognition (NER) analysis on a series-of-text or “messages.” Second, it uses a fuzzy-rule-based system to merge similar PII entities. This section shows what’s involved in these steps and why they’re useful.
NER analysis takes blocks of text and classifies each portion. Our script adopts an open-source NER model originally trained to remove PII in medical records (Chambon et al., 2023). This includes person-names, numbers, contact information, and geographic locations. Here is a hypothetical example of how NER analysis works:
If a series-of-text is “Andy Wheeler lives at 123 Meadow Road”, an NER tool will identify and classify each text-string as PII or not:
Andy [Name]
Wheeler [Name]
lives [-]7
at [-]
123 [Address]
Meadow [Address]
Road [Address]
After the tool classifies each word, it groups (i.e., collapses) adjacent text-strings of the same category, such as:
Andy Wheeler [Name]
lives at [-]
123 Meadow Road [Address]
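The collapsing step can be sketched in a few lines of Python. This is a minimal illustration, not the script’s own code: the labels are typed in by hand from the example above rather than produced by an NER model, and the function name `collapse_adjacent` is hypothetical.

```python
from itertools import groupby

# Hand-typed (token, label) pairs copied from the example above.
# In practice an NER model would produce these labels.
tagged = [
    ("Andy", "Name"), ("Wheeler", "Name"),
    ("lives", "-"), ("at", "-"),
    ("123", "Address"), ("Meadow", "Address"), ("Road", "Address"),
]

def collapse_adjacent(pairs):
    """Merge runs of neighboring tokens that share the same label."""
    grouped = []
    for label, run in groupby(pairs, key=lambda pair: pair[1]):
        text = " ".join(token for token, _ in run)
        grouped.append((text, label))
    return grouped

for text, label in collapse_adjacent(tagged):
    print(f"{text} [{label}]")
```

Only neighboring tokens are merged, so two names separated by ordinary words remain two separate entities.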
Our script’s second step uses a fuzzy-rule-based system to merge similar PII entities. To understand how this works, let’s start with the “Edit Distance” (E)8 between a set of text-strings. E is the number of edits to make two text-strings the same. For example, at least 1 edit must be made to change “Anderw” to “Andrew” (e.g., to fix a transcription error), so here the E is 1.
Given a set of text-strings, E will always have a “minimum distance” (min(E)) and “maximum distance” (max(E)). The former is the number of character differences. Hence, the min(E) for “Andrew” and “Anderw” is equal to 0, since they have the same characters; whereas for, say, “Andre” and “Andrew”, the min(E) is 1 because of the unshared “w.”
The max(E) is the number of characters in a set’s longer text-string. There are 6 letters in Andrew, so the above-pairings have a max(E) of 6. Versus if the pairing is “Andre” and “Andy,” the max(E) is 5.
Our script’s ability to anonymize text is improved by knowing min(E) and max(E). Instead of simply counting E, we normalize E from 0 to 1 (norm(E)). Formulaically, norm(E) equals the difference between E and min(E) divided by the difference between max(E) and min(E). To help show what this looks like, Table 1 provides a breakdown of what’s involved in calculating norm(E) for two sets:
Table 1. Illustration of calculating norm(E)
Text-string set | “Andrew” & “Anderw” | “Andrew” & “Scott” |
---|---|---|
E | 1 | 6 |
min(E) | 0 | 5 |
max(E) | 6 | 6 |
E − min(E) | 1 | 1 |
max(E) − min(E) | 6 | 1 |
norm(E) | 0.17 | 1 |
Single entity | Yes | No |
Note: norm(E) equals (E − min(E)) ÷ (max(E) − min(E))
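For readers who want to try the calculation, here’s a minimal Python sketch of norm(E). It isn’t the script’s own code, and it rests on two assumptions of ours: a swap of two adjacent characters counts as a single edit (which is what makes E equal 1 for “Andrew” and “Anderw”), and min(E) is taken to be the difference in the two strings’ lengths, which is a guaranteed lower bound on E. Under the second assumption the intermediate min(E) values may differ from Table 1, but norm(E) comes out the same for both pairs. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Count edits where an insertion, deletion, substitution, or swap
    of two adjacent characters each counts as one edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]

def norm_edit_distance(a, b):
    """Rescale the edit distance to run from 0 to 1."""
    e = edit_distance(a, b)
    min_e = abs(len(a) - len(b))  # assumption: length difference as the floor
    max_e = max(len(a), len(b))   # length of the longer string
    if max_e == min_e:            # only happens when a string is empty
        return 0.0 if e == 0 else 1.0
    return (e - min_e) / (max_e - min_e)

for pair in [("Andrew", "Anderw"), ("Andrew", "Scott")]:
    score = norm_edit_distance(*pair)
    print(pair, round(score, 2), "single entity" if score < 0.2 else "different")
```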
We use norm(E) instead of E because it allows us to avoid blanket anonymization and enable anonymous identifiers (see e.g., Chambon et al., 2023; Kleinberg et al., 2022).9 When a text-string-set has a norm(E) below 0.2, our script treats the two text-strings as a single entity. Instead of replacing PII with a general label (e.g., “Name Removed”), the script gives the two text-strings the same unique pseudonym (e.g., “Person #1”); see Table 2, for example.
Table 2. Illustration of how the script changes original text-strings to anonymized versions
Original text | Anonymized text |
---|---|
Andy Wheeler is a birder 190682540 where I live 100 Main St Kansas with Joe Schmo and Andy Wheeler | PersonName2 is a birder IdentNumber2 where I live Geo1 with PersonName3 and PersonName2 |
Scott Jacques is an interesting fellow, his check number 18887623597 is a good one. | PersonName5 is an interesting fellow, his check number IdentNumber1 is a good one. |
lol what a noob, Atlanta GA is on fire, email me [email protected] your stats | lol what PersonName1, Geo2 is on fire, email me [email protected] your stats |
so what, andrew wheeler @ 100 main st kansas is not so bad | so what, PersonName2 @ Geo1 is not so bad |
pics or it didnt happen 999-887-6666 | pics or it didnt happen Contact1 |
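To make the merging logic concrete, here’s a rough Python sketch of how names that fall under the 0.2 cutline can share one pseudonym. It’s an illustration under stated assumptions, not the script’s implementation: names are lowercased before comparison, min(E) is taken as the difference in string lengths, an adjacent-character swap counts as one edit, and the helper names are hypothetical. Pseudonyms are numbered in order of first appearance, so the numbers won’t match those in Table 2.

```python
def edit_distance(a, b):
    """Edit distance with insertions, deletions, substitutions, and
    adjacent-character swaps each counting as one edit."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

def norm_edit_distance(a, b):
    e = edit_distance(a, b)
    min_e, max_e = abs(len(a) - len(b)), max(len(a), len(b))
    if max_e == min_e:  # only happens when a string is empty
        return 0.0 if e == 0 else 1.0
    return (e - min_e) / (max_e - min_e)

def assign_pseudonyms(names, threshold=0.2, prefix="PersonName"):
    """Give each name a pseudonym, reusing one when norm(E) < threshold."""
    representatives = []  # (lowercased name, pseudonym) per distinct entity
    mapping = {}
    for name in names:
        key = name.lower()
        for rep, pseudonym in representatives:
            if norm_edit_distance(key, rep) < threshold:
                mapping[name] = pseudonym  # close enough: same entity
                break
        else:
            pseudonym = f"{prefix}{len(representatives) + 1}"
            representatives.append((key, pseudonym))
            mapping[name] = pseudonym
    return mapping

# Names as an NER step might return them (typed in by hand here).
names = ["Andy Wheeler", "Joe Schmo", "andrew wheeler", "Scott Jacques"]
mapping = assign_pseudonyms(names)

text = "so what, andrew wheeler @ 100 main st kansas is not so bad"
for name, pseudonym in mapping.items():
    text = text.replace(name, pseudonym)
print(text)
```

This sketch only handles person-names; addresses, numbers, and contact details would each need their own pass.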
A subjective element of scripts like ours is where to set the norm(E) threshold for “same/different.” Higher limits will have more false-positives and lower limits will have the opposite problem.
Our script uses the 0.2 cutline based on Wheeler (2015), which limited false-positive matches to less than 1 in 1000. When doing large table scans (i.e., there’s a lot of text), it’s important to limit them because the number of comparisons is pairwise.
For example, if you have 100 names, you then have (100 × 99) ÷ 2 = 4,950 pairwise comparisons.
Obviously there are tradeoffs in fuzzy matching. By raising the threshold, the recall will be higher, which is good. But it would also increase false-positive matches, which is bad.
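The arithmetic behind pairwise comparisons is worth seeing once. Because every name is compared with every other name, n names produce n(n − 1) ÷ 2 comparisons. The short sketch below works that out and, as an illustration only, multiplies it by a false-positive rate of 1 in 1000 to give a rough ceiling on expected false matches. The function name is hypothetical.

```python
from math import comb

def pairwise_comparisons(n):
    """Number of distinct pairs among n names: n * (n - 1) / 2."""
    return comb(n, 2)

false_positive_rate = 1 / 1000  # upper limit reported for the 0.2 cutline

for n_names in (100, 1000):
    pairs = pairwise_comparisons(n_names)
    ceiling = pairs * false_positive_rate
    print(f"{n_names} names: {pairs} comparisons, "
          f"up to about {ceiling:.1f} false matches")
```

The number of comparisons grows roughly with the square of the number of names, which is why a small false-positive rate still matters in large datasets.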
Using our tool, qualitative researchers can set the threshold to whatever best suits their needs. It’s like turning a dial up or down. The screenshot shows how to call the function, passing in your data (the text) as well as the threshold parameter. The script returns the anonymized text, along with the identified entities.
To use the script, first you’ll visit the GitHub page at this Pub’s top (Wheeler, 2023).10 Then you download the functions, and install the necessary Python libraries. At that point, you have access to the Python functions to run on your text. For those interested in learning how to program in Python, we suggest Wheeler (2024), a beginner reference with examples from criminology.
To further illustrate how our script works, now we’ll show how to assess its accuracy at anonymizing qualitative data. We use discourse-data from an internet-forum for gun manufacturing. The dataset has more than 53,000 messages. To be clear, the activity thereon isn’t necessarily illegal.
Each forum-message is a case in our dataset. We expect our model to work well with this data, except for utterances that are far less common in the medical context (e.g., slang, profanity, gun models). By “work well,” we mean it accurately identifies and replaces PII.
The script’s accuracy will vary across datasets. This is because the original data on which the NER model was trained may not have all of the same entities, or same frequencies of them, as other datasets. Recall our model was trained on medical transcripts (Chambon et al., 2023).
Compared to the data produced by interviews, it’s particularly challenging to anonymize internet-forum data. Instead of, for example, there being a clear interviewer and interviewees, there’s dialogue between many personas with ambiguous pseudonyms in the form of user IDs.
To assess the script’s ability to anonymize these data, we measured the rate of false positives and false negatives in our illustrative dataset. “False positives” are text-strings the script identifies as PII that actually aren’t, whereas “false negatives” are text-strings the script doesn’t flag as PII that actually are.
First, we ran the script on the entire dataset. Second, for the records with an identified name (i.e., messages marked as PII), we created a random sample of 1500 cases. This size enables us to assess the script’s accuracy without checking every case in the dataset.11
Third, we randomly assigned 1000 cases to each of us (Jacques and Wheeler), with an overlap of 500. The overlap enables us to compare the reliability of each coder. Also, by analyzing nonoverlapping cases, we produce smaller error intervals for the overall proportion of false negatives and positives.
Fourth, presented below, we assessed the script’s accuracy by individually coding each case in our sample (overlapping and not).
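The sampling design in these steps can be sketched as follows. This is an illustration with made-up inputs: the list of flagged message IDs, its size, and the random seed are all hypothetical stand-ins, not our actual data or code.

```python
import random

rng = random.Random(2024)  # hypothetical seed, fixed so the draw can be repeated

# Hypothetical stand-in for the IDs of messages the script flagged.
flagged_ids = list(range(10_000))

# Draw 1500 cases without replacement; the result is already in random order.
sample = rng.sample(flagged_ids, 1500)

overlap = sample[:500]                   # coded by both coders
coder_a = overlap + sample[500:1000]     # 1000 cases for the first coder
coder_b = overlap + sample[1000:1500]    # 1000 cases for the second coder

print(len(coder_a), len(coder_b), len(set(coder_a) & set(coder_b)))
```

The shared 500 cases support the reliability check, while the 1000 non-shared cases widen the base for the overall error estimates.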
For the sample coded by Jacques, he found 29% (305 of 1051) of cases have a false positive. Wheeler found a higher proportion, at 51% (535 of 1047).12 To be clear, these are cases identified by our script as PII but aren’t according to our manual coding. In other words, the script anonymized these cases but they shouldn’t have been because they’re not PII.
In our respective samples, there are 523 overlapping text-strings. Our agreement rate was 73%. Table 3 shows the cross-tabulation.
Table 3. False positive agreement in coded sample
Scott | No False Positive | False Positive |
---|---|---|
Andy | ||
No False Positive | 254 | 25 |
False Positive | 126 | 128 |
The disagreement is mostly attributable to differences in how we coded “handles” (i.e., “user names”) as PII. Jacques but not Wheeler tagged them as PII. Examples are “Hentai PLA,” “lizardjesus,” and “Firebird.” Depending on a researcher’s goal, they may (not) feel the need to anonymize handles.
It’s less debatable how to handle other common false positives. The word “Jesus” is mostly used as a profanity, but our script classifies it as PII. Some other regular words, such as “damn,” “nope,” and “yup,” are also identified as PII. We suspect these are uncommon words in the original training data (Chambon et al., 2023), so our script categorizes them as PII due to a lack of information to the contrary.
Finally, we use a bounding exercise to combine the two rating samples. If both coders determined it was a false positive, we have 36% false positives (571 out of 1,575). If only one or the other stated it was a false positive, we have 37% (584 out of 1,575). Using Clopper-Pearson binomial confidence intervals, we then have a potential range of 33% to 40% combining the two estimates. It seems reasonable to estimate that in a similar sample of an online community, such a tool will produce around a 1/3 false positive rate in identifying PII.
Our analysis of false negatives had fewer incorrect categorizations. Jacques and Wheeler found that, respectively, 14% (137 of 1000) and 2% (18 of 1000) of cases have a false negative. These are the rates at which our script didn’t identify a text-string as PII but it is per our manual coding. The script failed to properly anonymize these cases.
In our respective samples, there are 500 overlapping text-strings. Our agreement rate is 89%, with the cross-tabulation shown in Table 4.
Table 4. False negative agreement in coded sample
Scott | No False Negative | False Negative |
---|---|---|
Andy | ||
No False Negative | 433 | 57 |
False Negative | 0 | 10 |
The only disagreement happened when Jacques coded cases as a false negative but Wheeler didn’t. These differences, too, are mostly attributable to Jacques but not Wheeler coding handles as PII. Examples are “ClapTrap,” “kinger,” “isildur,” and “dolphin.”
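The agreement rate is simply the share of overlapping cases on which both coders made the same call. A minimal sketch using the counts in Table 4 (the variable names are ours, for illustration):

```python
# Counts from Table 4, keyed as (Andy's code, Scott's code).
table4 = {
    ("no", "no"): 433,    # neither coder saw a false negative
    ("no", "yes"): 57,    # only Scott saw a false negative
    ("yes", "no"): 0,     # only Andy saw a false negative
    ("yes", "yes"): 10,   # both coders saw a false negative
}

total = sum(table4.values())
agreed = sum(count for (andy, scott), count in table4.items() if andy == scott)
agreement_rate = agreed / total

print(f"{agreed} of {total} cases agree ({agreement_rate:.0%})")
```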
Using the same bounding estimates to combine the two samples, this produces a potential low estimate of false negatives of 6% (88 out of 1500) with a 99% confidence interval of 4% to 8%; and, a high estimate of false negatives of 9% (135 out of 1500) with a 99% confidence interval of 7% to 11%. Thus our estimate of false negatives in a similar online forum, using this model, is less than 11% in messages.
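For those curious how a Clopper-Pearson interval is obtained, here’s a sketch that uses only Python’s standard library. It finds, by bisection, the proportions at which the observed count would sit exactly in each tail of the binomial distribution. This is one way to compute the interval, not necessarily how we computed ours, and the function names are hypothetical.

```python
from math import exp, lgamma, log, log1p

def log_pmf(i, n, p):
    """Log of the binomial probability of i successes in n trials."""
    return (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log1p(-p))

def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(exp(log_pmf(i, n, p)) for i in range(k, n + 1))

def lower_tail(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(exp(log_pmf(i, n, p)) for i in range(k + 1))

def solve(f, target, increasing, steps=60):
    """Bisection on (0, 1) for the p where f(p) equals target."""
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, confidence=0.99):
    """Exact binomial confidence interval for k successes in n trials."""
    alpha = 1 - confidence
    lower = 0.0 if k == 0 else solve(
        lambda p: upper_tail(k, n, p), alpha / 2, increasing=True)
    upper = 1.0 if k == n else solve(
        lambda p: lower_tail(k, n, p), alpha / 2, increasing=False)
    return lower, upper

# Low and high false-negative counts reported in the text.
for k in (88, 135):
    low, high = clopper_pearson(k, 1500)
    print(f"{k} of 1500: {low:.1%} to {high:.1%}")
```

The exact interval is a little wider than the familiar normal approximation, which makes it a conservative choice for proportions this small.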
Criminologists will be required to share data underlying their articles and books (e.g., Sweeten et al., 2024; White House OSTP, 2022). Understandably, many criminologists are afraid of this progress; especially, qualitative criminologists (e.g., Bucerius and Copes, 2024). Fear is ok but we can’t be deterred. We shouldn’t be keepers of paywalls and file-drawers.
Expanding access to data doesn’t require lowering our standards for confidentiality and anonymization. We should maintain protection and provide greater access (Campbell et al., 2023; UK Data Service, n.d.; Westbury et al., 2022). This is utilitarian.
In the digital era, we can do more than ever to spread information and knowledge. This is better for science, impact, and social justice. Acting on this is our duty and opportunity.
As technology improves, it becomes less costly and less risky to anonymize data. It’s becoming more rational for us to provide open data. Hence, in part, why there’s widespread change in what’s expected of us.
Technology improves what’s possible with techne. Indeed, technology makes PA and, more recently, OA possible (e.g., the printing press, “arxives”). Yet qualitative criminologists are working too slowly to modernize our practices.
To be clear, our goal isn’t to eliminate manual work. Rather, it’s to reduce the cost and risk. Fear is healthy, but we want to minimize it. To do so, we need to move beyond problems to solutions.
As mentioned in the Foreground section, testing the script’s accuracy on a dataset is not equivalent to ensuring it’s properly anonymized. As the script’s fit becomes better, the manual anonymization process becomes less daunting.
Making the best fit may require building on a different model or training a new one. Always, researchers need to diligently evaluate a script’s results to see if it’s accurate on its face (e.g., too many false positives or false negatives).
But no matter the script’s fit, one or more criminologists still need to read the data, line-by-line, to ensure the script hasn’t missed anything or done something it shouldn’t have. Technology’s role is to make that process as easy and safe as possible.
By making our script open-source, it’s available for free use and adaptation. If you make the tool better than its current version, that’s fantastic! We want to know about it. We hope you’ll publish it.
We need to harness, invent, implement, and iteratively improve tools to help us perform our craft. To the extent we improve anonymization, our participants will have greater trust in us. Not only in our ability to keep PII secret, but to do so in a way that maximizes their voice’s reach.
As a result, our data will be more authentic, leading to higher quality findings and more useful insights for improving the world around us. We’ll have better interactions with (potential) partnering organizations and IRBs. Workloads will be reduced, making our work more attractive to junior scholars (cf. Bucerius and Copes, 2024). Open criminology is for the greater good.
Allen, A. (2020). Publishing audio quotes in articles: making and merging clips. Journal of Qualitative Criminal Justice and Criminology. https://www.qualitativecriminology.com/pub/publishing-audio-quotes-in-articles-making-and-merging-clips
Ashby, M.P. (2021). The open-access availability of criminological research to practitioners and policy makers. Journal of Criminal Justice Education, 32(1), 1-21. (OA postprint available here.)
Baldwin, P. (2023). Athena unbound: Why and how scholarly knowledge should be free for all. MIT Press.
Bucerius, S., & Copes, H. (2024). Transparency and trade-off: the risks of Criminology’s new data sharing policy. The Criminologist, 50(2), 6-9. https://asc41.org/wp-content/uploads/ASC-Criminologist-2024-03.pdf
Buil-Gil, D., Bui, L., Trajtenberg, N., Diviak, T., Kim, E., & Solymosi, R. (2024). Diversifying crime datasets in introductory statistical courses in criminology. Journal of Criminal Justice Education, 1–27. https://doi.org/10.1080/10511253.2024.2334706
Campbell, R., Javorka, M., Engleton, J., Fishwick, K., Gregory, K., & Goodman-Williams, R. (2023). Open-science guidance for qualitative research: An empirically validated approach for de-identifying sensitive narrative data. Advances in Methods and Practices in Psychological Science, 6(4), 25152459231205832.
Chambon, P. J., Wu, C., Steinkamp, J. M., Adleberg, J., Cook, T. S., & Langlotz, C. P. (2023). Automated deidentification of radiology reports combining transformer and “hide in plain sight” rule-based methods. Journal of the American Medical Informatics Association, 30(2), 318-328.
Chin, J. M., Pickett, J. T., Vazire, S., & Holcombe, A. O. (2023). Questionable research practices and open science in quantitative criminology. Journal of Quantitative Criminology, 39(1), 21–51. (OA postprint available here.)
Jacques, S. (2014). The quantitative-qualitative divide in criminology: A theory of ideas’ importance, attractiveness, and publication. Theoretical Criminology, 18:317-334. (OA postprint available here.)
Jacques, S. (2023). Ranking the openness of criminology units: An attempt to incentivize the use of librarians, institutional repositories, and unit-dedicated collections to increase scholarly impact and justice. Journal of Contemporary Criminal Justice, 39(3), 371-386. (OA postprint available here.)
Kleinberg, B., & Mozes, M. (2017). Web-based text anonymization with Node.js: Introducing NETANOS (Named entity-based Text Anonymization for Open Science). Journal of Open Source Software, 2(14), 293.
Kleinberg, B., Davies, T., & Mozes, M. (2022). Textwash--automated open-source text anonymisation. arXiv preprint arXiv:2208.13081.
Proctor, K. R., & Niemeyer, R. E. (2019). Mechanistic criminology. New York: Routledge.
Rosenthal, R., Kleid, J. J., & Cohen, M. V. (1979). Abnormal mitral valve motion associated with ventricular septal defect following acute myocardial infarction. American Heart Journal, 98(5), 638–641.
Saunders, B., Kitzinger, J., & Kitzinger, C. (2015). Anonymising interview data: challenges and compromise in practice. Qualitative Research, 15(5), 616-632.
Shaw, C. (1930). The jack-roller: A delinquent boy’s own story. University of Chicago Press.
Sutherland, E. H. (1937). The professional thief. University of Chicago Press.
Sweeten, G., Topalli, V., Loughran, T., Haynie, D., & Tseloni, A. (2024). Data transparency at Criminology. The Criminologist, 50(1), 9-11. https://asc41.org/wp-content/uploads/ASC-Criminologist-2024-01.pdf
Suber, P. (n.d.). Open access (the book). https://cyber.harvard.edu/hoap/Open_Access_(the_book)
UK Data Service. (n.d.). Anonymising qualitative data. Retrieved October 13, 2024, from https://ukdataservice.ac.uk/learning-hub/research-data-management/anonymisation/anonymising-qualitative-data/
Westbury, M., Candea, M., Gabrys, J., Hennessy, S., Jarman, B., Mcneice, K., & Sharma, C. (2022). Voice, Representation, Relationships: Report of the Open Qualitative Research Working Group. University of Cambridge Working Group on Open Qualitative Research. https://doi.org/10.17863/CAM.91979
Wheeler, A. P. (2019). Why I publish preprints. Andrew P. Wheeler (the website). https://andrewpwheeler.com/2019/05/30/why-i-publish-preprints/
White House OSTP (Office of Science and Technology Policy). (2022). Memorandum for the Heads of Executive Departments and Agencies. https://www.whitehouse.gov/wp-content/uploads/2022/08/08-2022-OSTP-Public-Access-Memo.pdf
Worrall, J. L., & Wilds, K. M. (2024). Is open access criminology influential? Journal of Criminal Justice Education. https://doi.org/10.1080/10511253.2024.2389096
Wheeler, A.P. (2015). Some ad-hoc fuzzy name matching within Police databases. https://andrewpwheeler.com/2015/07/01/some-ad-hoc-fuzzy-name-matching-within-police-databases
Wheeler, A. (2023). Entity masking. https://github.com/apwheele/entity_masking
Wheeler, A.P. (2024). Data science for crime analysis with Python. Crime De-Coder. https://crimede-coder.com/store
Willinsky, J. (2022). Copyright’s broken promise: How to restore the law’s ability to promote the progress of science. MIT Press.
Wright, R., & Decker, S.H. (1994). Exploring the House Burglar’s Perspective: Observing and Interviewing Offenders in St. Louis, 1989-1990. Inter-university Consortium for Political and Social Research [distributor]. https://doi.org/10.3886/ICPSR06148.v1
Wright, R., Jacques, S., Stein, M. (2015). Where are we? Why are we here? Where are we going? How do we get there? The future of qualitative research in American criminology. Advances in Criminological Theory, 20, 339-350. (OA postprint available here.)
Scott Jacques is a professor of criminology at Georgia State University. He founded CrimRxiv and currently serves as its Associate Director for Sustainability. Learn more about him at scottjacques.pubpub.org.
Andrew P. Wheeler, PhD, received his doctoral degree in criminal justice from the University at Albany SUNY. His published work focuses on data applications in policing; predictive analytics, operation research, and policy analysis. He has collaborated with police departments across the United States, and currently runs a consulting firm, CRIME De-Coder, in which he helps police departments with custom software and data analytics, https://crimede-coder.com/.
This project was approved by the IRB of Georgia State University (GSU), and funded by the Criminal Investigations and Network Analysis Center: A DHS Center of Excellence, under the title, “The Next Battlefield: Illicit Markets Hosted on Encrypted Communication Platforms,” awarded to GSU’s Evidence-Based Cybersecurity Research Group (PI David Maimon, Co-PIs Scott Jacques and Yubao Wu). We thank Joshua Gerstenfeld, Valeriia Lymishchenko, and Chinweolu Okafor for their feedback on a draft-version.