Cognitive testing of a survey instrument for self-assessed menstrual cycle characteristics and androgen excess

Background In large population-based studies, there is a lack of existing survey instruments designed to ascertain menstrual cycle characteristics and androgen excess status including hirsutism, alopecia, and acne. Our objective was to cognitively test a survey instrument for self-assessed menstrual cycle characteristics androgen excess. Methods Questions to assess menstrual characteristics and health were designed using existing surveys and clinical experience. Pictorial self-assessment tools for androgen excess were also developed with an experienced medical illustrator to include the modified Ferrimen-Galway, acne and androgenic alopecia. These were combined into an online survey instrument using REDCap. Of the 219 questions, 120 were selected for cognitive testing to assess question comprehension in a population representative of the future study population. Results Cognitive testing identified questions and concepts not easily comprehended, recalled, or had problematic response choices. Comprehension examples included simplifying the definition for polycystic ovary syndrome and revising questions on historic menstrual regularity and bleeding duration. Recall and answer formation examples include issues with recalling waist size, beverage consumption, and interpretation of questions using symbols (> or <). The survey was revised based on feedback and subsequently used in the Ovulation and Menstruation (OM) Health Pilot study. Conclusion We present a cognitively tested, novel survey instrument to assess menstrual cycle characteristics and androgen excess.


Introduction
Polycystic ovary syndrome (PCOS) is the most common endocrine disorder in reproductive aged women and is characterized by menstrual irregularity and androgen excess, and in some cases, presence of polycystic ovarian morphology on ultrasonographic or other radiologic assessments [1]. Women with PCOS are at increased risk for endometrial hyperplasia [2], infertility [3], and metabolic disorders including diabetes, dyslipidemia [4], obesity [5], and hypertension [6]. Women with PCOS are also at increased risk for mood disorders [4], such as depression and anxiety, and non-alcoholic fatty liver [7].
Ascertainment of menstrual cycle characteristics and androgen excess have been limited in large populationbased epidemiologic studies. Often, use of self-reported menstrual irregularity is a proxy for PCOS status or surveys ask a single question about having been diagnosed with PCOS [8][9][10]. Existing tools are also limited in how to quantify the clinical characteristics of androgen excess by self-report, including hirsutism, acne, and alopecia. Because menstrual disorders, such as PCOS, often require a clinical diagnosis for respondents to be able to report their disease status, mild or clinically unrecognized disease are often not captured in population-based studies. Our goal was to design and cognitively test a survey instrument for 1) self-assessed menstrual cycle characteristics that would ascertain selfreported menstrual history and irregular menstrual cycle phenotypes and 2) capture self-reported androgen excess including hirsutism, acne, and alopecia that could be used in a diverse population of women. This report describes the process of survey design and evaluation via cognitive testing that was undertaken to create the survey for The Ovulation and Menstruation Health (OM) Pilot Study.

Methods
The survey instrument was designed in two phases. During the first phase, the questionnaire including the survey questions and the pictorial self-assessment tools were created. In the second phase, an expert review of the draft questionnaire and pictorial tools was conducted. Cognitive testing was conducted using a print version for comprehension. Usability testing was conducted using the online survey instrument.

Phase 1 Questionnaire creation
The design process began with a review of existing questionnaires from women's health studies such as the Nurses' Health Study 2 (NHS2), the Framingham Heart Study (FHS), and the Cape Cod Health Study (CCHS). These are large, prospective or reproductive cohorts with surveys that span a long temporal window. Table 1 identifies the cohorts used in the initial questionnaire review and whether their questionnaire is publicly available. Although none of these questionnaires were designed to specifically capture menstrual characteristics or other features of androgen excess required to diagnose PCOS, they served as a basis for writing our questions. Additionally, diagnostic criteria of PCOS by the Rotterdam Criteria [11] and classification of abnormal uterine bleeding during a woman's reproductive years [12] by the International Federation of Gynecology and Obstetrics (FIGO) were used as references in drafting questions. A total of 219 questions were developed, which does not reflect the total number of questions an individual may answer due to skip patterns.

Pictorial self-assessment tool creation
The purpose of creating pictorial tools was to make it easier for respondents to answer questions about their body shape, acne, hirsutism, and alopecia by providing visual examples of the different response categories [13]. The Ferriman-Gallwey (F-G) score was created in 1961 and has been used to measure hirsutism as a marker of androgen excess in women [14]. Originally created to assess 11 body areas, the scoring system was later modified to evaluate nine key areas associated with hormonal hair growth. The modified Ferriman-Gallwey (mFG) pictorial tool scores hair growth from no terminal growth to severe levels of growth (0 to 4, for a maximum score of 36). Because different races/ethnicities may have different manifestations of androgen excess, including acne and alopecia, we expanded androgen excess ascertainment to encompass this variation [15][16][17]. A medical illustrator was commissioned to sketch the mFG grading scale based on her previous work illustrating the F-G scale [18]. Images for varying stages of androgenic alopecia based on the Sinclair system were also created [19]. Images for acne severity were illustrated corresponding to acne categories within the question stem. Lastly, anthropometric images were created for use in body shape self-report as a supplement to survey questions on body shape.

Phase 2: expert review and cognitive testing
After drafting the survey instrument, the Center for Survey Research (CSR) at the University of Massachusetts Boston was contracted to conduct an expert review on the design, formatting, and usability of the survey instrument [20,21]. After this review, the survey instrument was built into an online platform using REDCap. The CSR then conducted cognitive interviews for understanding participant comprehension and online survey usability.

Cognitive testing
Participant selection for cognitive testing Because the survey instrument was intended to be accessible to an educationally and racial/ethnically diverse population, we conducted in-person interviews in English with six women, aged 22-46 who represented a mix of racial and ethnic identities as well as varying levels of educational attainment ( Table 2). Half of the respondents had received a diagnosis of PCOS from a health care provider.
Cognitive testing of survey instrument The cognitive testing protocol was designed by CSR and was considered exempt by the IRB at the University of Massachusetts Boston. Interviews were conducted at CSR's office by two experienced interviewers. Interviewers followed a semi-structured protocol that began with respondents filling out sections of the survey alone. Then, the interviewer reviewed the questions with the respondent by asking a series of follow-up probing questions to gain an understanding of how the survey questions were understood and how respondents formed their answers. The cognitive interviews lasted between 50 to 90 min and respondents were paid $80 for their participation. Based on the participants' feedback, changes were made to the survey instrument to improve its clarity.
Because the full survey was long, we focused the cognitive testing on a limited number of key questions. Particularly important were questions regarding menstrual cycle length, age at menarche, menstrual flow, and body hair growth, as these are the main self-reported characteristics used to ascertain PCOS. Additional questions that were evaluated included those critical in our future analyses such as demographics, anthropometrics, PCOS diagnosis, general health, lifestyle and diet, and pregnancy history. Of the 219 questions created, 120 questions were cognitively tested. We present 14 examples here.

Results
Cognitive testing allowed us to evaluate how respondents handled each step of the question-answering process. We found issues with comprehension and readability (including formatting), recall and answer formation problems, and usability concerns. This section describes select examples of some of these problems and how we attempted to solve them. The full cognitive testing report by CSR entitled Ovulation and Menstruation Health Study, Report on Cognitive Testing, July 2017, stored on Harvard Dataverse [13].
Comprehension and Readability Problems include questions that were missing a needed definition, questions that have a definition that is too complex or unclear, questions that are formatted in a way that respondents understand the question in an inconsistent manner, and questions that have embedded assumptions.

Example 1: defining polycystic ovary syndrome Survey question
Polycystic Ovary Syndrome is a health condition where a woman has fewer than 8 periods each year. It is also associated with increased hair growth on the body. Some women may also have many cysts on the ovaries. Has a doctor ever diagnosed you with Polycystic Ovary Syndrome or PCOS?

Findings
Although this question included a definition, respondents were still confused, including those who had been diagnosed with PCOS. Because the definition was very specific, respondents were unsure how to answer if they did not meet all of the requirements of the definition. The definition was modified in future versions to read: Polycystic Ovary Syndrome is a health condition involving irregular periods, excess testosterone, increased acne, body and facial hair, and many small cysts in the ovaries. Some women also experience hair loss on the scalp.

Example 2: historic menstrual regularity Survey question
In the past couple of years, has there ever been a time when your menstrual period was NOT regular or predictable for more than a 3-month window of time?

Findings
Having more than one time-frame in the question ("past couple of years" and "3-month window") was confusing to respondents who seemed either to answer about one time-frame or the other. The introductory phrase "in the past couple of years" was removed in the final version of the questionnaire.

Example 3: menstrual cycle-duration of bleeding Survey question
In the last year, when you had your period, about how many days did it usually last? Please only count the days when you were bleeding, not when you were spotting.

Finding
For most people, a question mark signifies the end of a questionat which point it should be answered. However, in this example, important information was provided in the explanatory sentence after the question. Several respondents were unsure whether or not to include days when they were spotting. When asked about it, respondents said they did not notice the second sentence and simply stopped reading at the question. The question was changed to: "Now we want to know about how long your period usually lasts. Counting only the days when you were bleeding and not when you were spotting, in the last year about how many days did your period usually last?" Example 4: hair removal frequency Survey question In the last year, how often did you use any hair removal method on any area of your body other than underarms or legs? Response options: Daily, three times a week, twice a week, weekly, twice a month, monthly.

Finding
This question assumes a consistency of behavior that may not exist. Respondents were unsure how to answer when hair removal was not on a regular schedule, when they removed hair from different body parts on a different schedule, or when it used to be one way, but now it is a different frequency, or not at all. This was ultimately not changed in the final version as the need to have a baseline estimation for hair removal/depilation outweighed variable responses.

Example 5: obstetrics-intrauterine growth retardation Survey question
Sometimes babies are smaller than we expect. During this pregnancy did your baby have intrauterine growth retardation?

Finding
Respondents commented that they felt the first part of the question didn't match the second part. They did not understand that the first sentence was supposed to be the definition of "intrauterine growth retardation." Since the goal of the question was to assess the size of the baby, not whether the respondent understood the medical phrase "intrauterine growth retardation," the question was modified to "Sometimes, babies are smaller than we expect. During this pregnancy, were any of your babies smaller than expected for their gestational age?"

Example 6: rotating shift work Survey question
A rotating shift is a work schedule where your hours change in a predictable way from day-to-day, week-toweek, or month-to-month. In the last month, did you work rotating shifts?

Findings
This is another example where a definition was provided, but respondents did not consistently understand the idea of a "rotating shift." Rather than focusing on whether their hours changed in a "predictable" way, some simply answered about whether their hours changed at all in the last month. They included changes such as working overtime and having a job where the hours were different on different days (such as 9-7 on Mondays and 10-2 on Tuesdays). In future iterations of the survey, the research team removed this question and asked: Have you ever worked a job where you worked at least some time between midnight and 4:00 AM?

Example 7: other comprehension problems
Comprehension problems can arise from not understanding the words in the question and can also include not understanding abbreviations or symbols. While "lbs." is the abbreviation for pounds, we had a respondent who did not know what it stood for. In several questions, the symbols "<" or ">" were used in the response options. While they are commonly used, several of the respondents were slowed down by seeing the symbol instead of the words and had to think about which symbol was "greater than" and which was "less than". In the final version, we did not use any abbreviations or symbols.

Recall and answer formation problems
Cognitive testing also helps researchers understand whether respondents have the information needed to answer questions. Recall and answer barriers occur when respondents: cannot remember the information needed to answer a question, never had the information needed to answer a question, or have not thought about the information in the format that the question is asking.

Example 8: ovulatory infertility Survey question
Ovulatory infertility is infertility due to not making an egg every month. Have any of your female relatives been diagnosed with ovulatory infertility?

Findings
This question had two cognitive problems -a comprehension problem and a recall problem. Most respondents did not understand the term "ovulatory infertility" and whether it differed from any other kind of infertility. Additionally, none of the respondents knew whether any of their female relatives had it. Since we realized respondents would probably not be able to provide reliable information, we decided to delete the question from the final version. Instead of asking about ovulatory infertility, questions surrounding infertility were streamlined to ascertain if a participant tried to get pregnant for more than 6 months and then asking about possible causes such as fallopian tube complications or hormonal variations.

Example 9: hormonal contraception Survey question
Now think about all the different times you've taken hormonal contraception in your life. Counting only the times you were taking it, for how many years did you take hormonal contraception? (If you took it for less than 12 months, check the box below).

Findings
This question was difficult for those who used multiple forms of contraception and had started and stopped over time. Respondents had the information needed but had to figure out how to get it into the format required. For those who had used an injection, there was an additional layer of complexity stemming from the phrase "taking it" over a specific length of time. One respondent included the amount of time they thought the shot was supposed to last. Another responded as if it was a one-time occurrence and did not report it as any time. In the final version, an extra sentence was added to the question: "If you had an injection or implant, include the time it was effective."

Example 10: waist size Survey question
What is your waist size? Please select if you are responding in inches or centimeters. If you are unsure please take your best guess.

Findings
Only one respondent answered this question, with the rest saying "don't know" and refusing to guess. When interviewed, the respondents explicitly said that they don't know their waist measurement. However, they offered to give something similar that they did know-their pants sizes. Since respondents were unable to answer this question, because they didn't have the information required in the format required, we decided to delete this question from the final version.

Example 11: fruit juice serving sizes Survey question
A serving of fruit juice is 4 oz (half a cup). In the past week, how many servings of fruit juice did you drink.

Findings
This question includes three different measurements that the respondent had to juggle in their mind when answering. While the concept of a "serving" is often used by dieticians and other health professionals, most people do not think in those terms. As such, this question ended up being very confusing to respondents. Even those who could clearly articulate what fruit juice they drank in the past week could not figure out how to answer this question accurately. Although this question is cognitively complex, it was kept in the survey. However, the time frame was modified from "in the past week" to "in an average day" to make it easier for respondents to answer.
Usability concerns Cognitive testing may also provide information about how respondents navigate through a survey and if there are user-interface problems.

Example 12: natural body hair Survey question
Please mark the image in each row that best corresponds with the amount of hair you have on that part of your body (Fig. 1). Please think about your natural body state (without the use of hair removal procedures or treatments):

Findings
This question was followed by a series of pictures depicting hair on the upper lip, chin, and chest. Cognitive testing showed respondents understood what the phrase "natural body state" meant. As in the previous example, respondents had trouble answering the questions because they tried to actually click on or mark the image. Some respondents also said they had trouble seeing the distinctions between the options. In future versions, the format of the page was modified to make it easier for the respondents to understand they were supposed to select from answer choices below the image.

Example 13: current acne Survey question
In the last month or so, how would you currently rate your acne, also called pimples? (Fig. 2).

Findings
This question was followed by a series of answer choices with parenthetical descriptions of acne severity. Cognitive testing showed several respondents only looked at the pictures and did not read the descriptions in the answer choices. One respondent said it depends on the day/time of the month, and another asked whether or not to include acne on the back. Due to the text, parentheses, and pictures, the consultant suggested that the image was adding unnecessary complexity to the question. In the revised version, the image was removed and the question was updated to say: "Thinking about your face or back, how would you rate your acne? If your acne changes during your menstrual cycle, please think about your acne at its worst. [Response options] None or rare acne (none to a couple of pimples), Mild acne (4 Fig. 1 Natural Body Hair Rating Image for Example 12, and associated questions for each body part (lip, chin, chest) or more pimples), Moderate acne (4 or more pimples that are red and irritated), Severe acne (4 or more pimples that are red, irritated, and have pus)."

Example 14: anthropometrics Survey question
Now think about when you were 18 years old. Please mark the number displayed below the image which most looks like your body shape when you were 18 (Fig. 3).

Findings
Respondents did not have any problems figuring out how to answer this as they knew which figure looked most like their body shape. The issue arose when they had to figure out how to mark their answers on the screen. Several respondents tried to click on the number below the picture, which did not work. They had to look further down the page to where the answer options were actually listed. After cognitive testing, the format of the page was modified in future versions to make it easier for the respondents to understand. The question was also changed to: "Which image above looks most like your body shape when you were 18?"

Discussion
We present a cognitively tested survey instrument to ascertain menstrual cycle characteristics and androgen excess. To the best of our knowledge, no other instrument for self-assessment of menstrual cycle characteristics and androgen excess, inclusive of hirsutism, acne, and androgenic alopecia, exists for use in large population-based epidemiologic studies. Cognitive testing allowed for informed iterative survey design based on respondents' comprehension of survey questions, including pictorial tools for image-assisted questions. Cognitive testing identified questions and concepts not easily comprehended, recalled, or with problematic response choices.
Through the process of cognitive testing, we identified questions that would likely contribute to response errors and revised them.
For instance, example 1 demonstrated the importance of generating a clear definition for conditions pertinent to assessing PCOS. The creation of a definition that was simultaneously broad enough to capture the intended group and specific enough to allow for a prevalence assessment of PCOS had to be balanced. Similarly, example 3 which asks about menstrual cycle and definition of bleeding days demonstrates the importance of the location of the definitions. Knowing that respondents consistently read only to the question, the location of the spotting definition became critical to ensure participants were reporting the appropriate menstrual cycle length.
Additionally, expanding descriptions to capture the intention of the question was important. In example 9, identifying a timeframe for hormonal birth control usage helped elucidate the intention of the question which was to identify a duration of use by contraceptive formulation as injectables may have an efficacy of a month to several months. In contrast, we decided to retain questions such as example 4, which asks about frequency of hair removal. Despite issues quantifying frequency of hair removal amongst respondents, we felt the need to have a baseline estimation for hair removal/depilation to asses average depilation frequency in women and kept this question in.

Implications of cognitive testing
When survey questions are not consistently understood and answered by respondents, there is a high likelihood of response errors and resulting biased findings. Cognitive testing is an evaluation tool that allows researchers to fix problems with survey questions before the survey is fielded. This evaluation serves two main functions: 1) it allows researchers to better understand if questions are working as intended and 2) it provides investigators insight into how respondents handle the cognitive steps involved in answering questions (which, if needed, provides insight into how a question might be improved). These cognitive steps include comprehension (were the questions consistently understood by respondents as intended by the researchers); recall (does the respondent have the information needed to answer the question); and judgement and answer formation (were respondents able to decide and provide an answer that accurately reflected what they wanted to say in the format the researchers required). Cognitive testing allows researchers to be more confident that survey respondents understand questions the way researchers intend. Its use in questions relating to critical outcomes of the study may prevent inadvertent errors in questioning which later need to be adjusted, resulting in missing data and/ or improper answering of key questions.
The use of cognitive testing as a survey evaluation tool has become standard practice by survey researchers when creating a new survey or when using a survey originally created for a different population. It is likely that many current large-scale cohort studies involving surveys undergo some form of cognitive evaluation before they go into the field. However, surveys designed before the 1980s were probably not cognitively tested [22,23]. Questions initially designed for the Nurses' Health Study 2 (which began in 1989), were designed for health professionals, and assume the respondents have a high educational level and medical knowledge, an assumption that might be incorrect for the general population [24]. For example, the rotating shift may have been widely understood when it was used in the Nurses' Health Study 2 [25]. However, it was poorly understood by several of our respondents during cognitive testing (see example 6). Even for studies that are focused on the same population, researchers need to consider the changes in population over time. For example, a study conducted using race/ethnicity questions from the 1900s will not be reflective of the demographics in the United States in 2020s [26]. Therefore, researchers must consider the implications of using survey questions that were created in a different time period. Similarly, social and temporal factors play a major role in determining the boundary conditions in which a cognitively tested tool remains functional.
The questionnaire and pictorial tools were designed from our review of the strengths and limitations of questionnaires from large cohort studies, such as the Nurses' Health Study (NHS2) and the Framingham Heart Study (FHS). While, ascertainment of PCOS was limited in comparison to the survey we developed, they served as an important foundation for identifying the current gap in ascertaining information about selfreported androgen excess in women. Examples 12 and 14 demonstrate that image assisted questions were generally well comprehended.
Limitations of this survey instrument include the need to validate self-reported menstrual cycle characteristics and androgen excess against a gold standard. This gold standard may be an in-person interview or a medical record based clinical determination to compare with the self-reported responses. Validation of self-reported features against a gold standard such as medical record validation of PCOS diagnosis, reproductive health conditions, and androgen excess will be undertaken in future studies. Although cognitive testing is often conducted with a small number of respondents, researchers have found that important information can be gleaned about potential problems using a small sample size, especially if several respondents encounter the same issues or misunderstand the same things. Since the purpose of cognitive testing is to evaluate the survey instrument, in order to be most effective, respondents should come from the population for which the survey is intended as this study has done. This survey instrument has several strengths. We ask about components critical to the diagnostic criteria of PCOS, including androgen excess and menstrual irregularity. Compared to previous large cohorts such as NHS2 and FHS, we are able to construct a more insightful assessment of menstrual irregularity and androgen excess states using a composite of symptoms based on clinical diagnosis rather than using a yes/no response to a single survey question that might be poorly understood by a respondent. This is key to ascertain those with milder forms of menstrual irregularity and androgen excess in the general population. Using this new survey instrument, we can assign menstrual irregularity and androgen excess state beyond the binary answers to menstrual irregularity questions alone. Furthermore, this survey instrument improves upon existing cohorts lack of specificity in the question stems asked.
This survey instrument was designed to be understood by those with 8th grade level English language proficiency. Medical terminology was accompanied by definitions to facilitate comprehension. Furthermore, the survey instrument may be easily shared to facilitate harmonization across cohorts. We have shared this entire instrument in a genomic study, Genes for Good, to ascertain phenotypes of menstrual irregularity and androgen excess [27]. This tool is complementary and additional to phenotyping suggested by the PhenX Tool-Kit, whose questions are limited to androgen excess in males only at this time [28,29]. Other uses include employment in areas of the world that have not had appropriate ascertainment or studies related to menstrual cycle research. The pictorial tool has already been shared with researchers leading the Boston University Pregnancy Study Online (PRESTO) to evaluate hirsutism within a cohort of North American Women attempting pregnancy [30].
In summary, this instrument includes targeted questions and images for anthropometrics and androgen excess, including acne, hirsutism, and alopecia. Its overarching goal is to accurately ascertain cases of PCOS, some of whom may be misclassified as non-diseased.

Conclusion
We designed and cognitively tested a survey instrument for self-assessed menstrual cycle characteristics and androgen excess in a diverse population. This tool can be shared across studies to facilitate ascertainment of menstrual cycle characteristics and androgen excess.
Given the long-term health risks for women with menstrual irregularity and androgen excess, it is imperative to have tools to identify affected women for possible interventions.