Heritage Language Variation and Change in Toronto

The HerLD Corpus

A primary goal of the Heritage Language Variation and Change in Toronto (HLVC) project is to construct a unique corpus of conversational speech in ten Heritage Languages spoken in the Greater Toronto Area. This corpus, the HerLD Corpus, contains recordings in the Heritage Languages of speakers representing three generations, along with comparable homeland samples. Our goal is to record 40 heritage speakers, balanced for age and sex, for each of the three generations (and 20 speakers for languages where only two generations exist in Toronto, i.e., Korean and Faetar).

Test version of HLVC Corpus in Dataverse

HLVC Speaker sample

Distribution of 451 transcribed interviews

   (updated March 2023)

Language               HOM   Gen1   Gen2   Gen3   Total by Language
Cantonese               14     18     25      6      63
Faetar                  34     11      3      0      48
Hungarian                0      2      6      3      11
Italian                  9     10     21     11      51
Korean                  10*    19     24      1      54
Polish                  12     12     14      0      38
Portuguese              12      0      3      0      15
Russian                 19*    18     17      3      57
Tagalog                 17     20     18      0      55
Ukrainian               17     10     17     15      59
Total by Generation    125    120    148     39     451

* We have not collected Homeland Russian data. Instead we have adapted relevant portions of the Russian National Corpus (Institute of Russian Language, Russian Academy of Sciences, 2003-2023), whose owners we gratefully thank.
* Dr. Hikyoung Lee and her research team at KU collected the Korean Homeland data.

For each speaker recorded, there are several types of data.

(See file-naming conventions.)

1. Primary content: a set of audio-recorded interviews with time-aligned transcriptions (IV)

  • Interviews average one hour in length. Normally the speakers include one Heritage Language participant and one or two Heritage Language-speaking interviewers. Sometimes members of the participant's family are present. All segments of the interview conducted in the Heritage Language are transcribed as fully as possible. Lengthy switches to English are not always transcribed. The interview modules are available in English and in each Heritage Language.

The audio interviews are in .wav and .mp3 format.

Time-aligned transcriptions have been produced (or are being produced) for each interview.

  • Transcriptions are constructed using ELAN and are available as ELAN Annotation files (.eaf files), which are easily converted to many other formats. All are Unicode-compliant.
  • Transcriptions are aligned at the phrase level.
  • Switches to English (and other languages) are marked.
  • Cantonese is transcribed phonetically, using the Jyutping system. We are adding colloquial Cantonese character transcription.
    • Jyutping is as in Matthews, Stephen and Virginia Yip. 1994. Cantonese: a comprehensive grammar. London, New York: Routledge. 13-22.
  • Faetar is transcribed in IPA (International Phonetic Alphabet) and loosely translated into English.
  • Korean is transcribed in Hangul.
  • Hungarian is transcribed orthographically.
  • Italian is transcribed orthographically.
  • Polish is transcribed orthographically.
  • Portuguese is transcribed orthographically.
  • Russian and Ukrainian are transcribed using Comrie & Corbett’s Transliteration.
    • Comrie, Bernard & Greville Corbett. 2002. The Slavonic Languages. London & New York: Routledge. 827, 832-833.
    • Exceptions to this system: symbols with hachek (use zh, sh, ch, and shh instead), the 'hard sign' (#) and 'open e' (je).
    • These transcriptions can be automatically transliterated to Cyrillic at http://www.translit.ru.
  • Tagalog is transcribed orthographically.
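
Because .eaf files are XML, the time-aligned phrases can be read with standard tools. The following is a minimal sketch, not the project's own tooling: it assumes the standard ELAN annotation layout (TIME_SLOT elements holding millisecond offsets, and TIER elements holding ALIGNABLE_ANNOTATION entries). The function name read_aligned_annotations is hypothetical, and the tier names used in the corpus are not documented here, so the sketch returns every tier it finds.

```python
import xml.etree.ElementTree as ET


def read_aligned_annotations(eaf_source):
    """Return (tier_id, start_ms, end_ms, text) tuples from an ELAN .eaf file.

    eaf_source may be a file path or an open binary file object.
    Assumes the standard EAF element names; tier names are corpus-specific.
    """
    root = ET.parse(eaf_source).getroot()

    # Map each time-slot ID to its offset in milliseconds.
    # Slots without a TIME_VALUE (unaligned) are skipped.
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
        if slot.get("TIME_VALUE") is not None
    }

    rows = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            text = ann.findtext("ANNOTATION_VALUE", default="")
            rows.append((tier.get("TIER_ID"), start, end, text))
    return rows


# Hypothetical usage, assuming the file sits in the working directory:
# for tier_id, start, end, text in read_aligned_annotations("F1F29A_IV.eaf"):
#     print(tier_id, start, end, text)
```

Start and end times are None for any annotation whose time slot carries no value, so callers that need strict alignment should filter those rows out.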

2. Participants respond orally to an Ethnic Orientation Questionnaire (EOQ)

    • Audio recordings of these are available (as _EOQ.wav and _EOQ.mp3 files). [See file-naming conventions]
    • Some participants' responses are transcribed (in _EOQ.eaf files). Many are not fully transcribed. Rather, the question number is indicated in the transcript if the interviewer asked the question verbatim.
    • Responses to each question are numerically coded in a spreadsheet for all participants of one Heritage Language: EOQ_data_LANGUAGE_date.xls.

3. Participants name a set of pictures, the First Words task (FW)

    • Participants are asked to describe a sequence of pictures from a children's story book, naming common objects and then describing scenes containing these items.
    • The book used is:

      Amery, Heather and Stephen Cartwright. 1987. The First Hundred Words. Tulsa: Educational Developmental Corporation.

    • Audio .wav and .mp3 files, as well as .eaf transcription files, are available for this task.

File naming conventions

File labels have three parts:

  1. Speaker code of the primary participants
  2. Abbreviation indicating the type of interaction (preceded by an underscore):
    1. _IV is the sociolinguistic interview, generally a lengthy, relaxed conversation
    2. _EOQ is the oral administration of the Ethnic Orientation Questionnaire
    3. _FW is a picture-description task, referred to as First Words
  3. An extension indicating the file type (preceded by a dot or period):
    1. .eaf is an ELAN annotation file, or transcript
    2. .wav is an uncompressed audio recording
    3. .mp3 is a compressed audio recording
    4. Additional file types include Praat textgrids (.TextGrid) for acoustic analysis, and .xls and .xlsx for record-keeping of various types, e.g., the catalog (2_2_linguists.php#catalog).

Example: F1F29A_IV.eaf is the transcription of the sociolinguistic interview of speaker F1F29A.
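
The three-part convention can be checked mechanically. The sketch below is illustrative only: the function name parse_file_label is hypothetical, and it assumes that speaker codes never contain an underscore or a dot and that only the three interaction types listed above occur.

```python
import re

# Interaction types as listed in the naming convention above.
INTERACTION_TYPES = {
    "IV": "sociolinguistic interview",
    "EOQ": "Ethnic Orientation Questionnaire",
    "FW": "First Words picture-description task",
}

# speaker code, underscore, interaction type, dot, extension.
# Assumption: the speaker code itself contains no "_" or ".".
FILE_LABEL = re.compile(
    r"^(?P<speaker>[^_.]+)_(?P<task>IV|EOQ|FW)\.(?P<ext>[A-Za-z0-9]+)$"
)


def parse_file_label(label):
    """Split a file label into speaker code, interaction type, and extension."""
    match = FILE_LABEL.match(label)
    if match is None:
        raise ValueError(f"not a recognised file label: {label!r}")
    task = match.group("task")
    return {
        "speaker": match.group("speaker"),
        "task": task,
        "task_description": INTERACTION_TYPES[task],
        "extension": match.group("ext"),
    }


info = parse_file_label("F1F29A_IV.eaf")
print(info["speaker"], info["task"], info["extension"])  # F1F29A IV eaf
```

Labels that do not follow the pattern, such as the EOQ_data_LANGUAGE_date.xls spreadsheets, raise ValueError and would need separate handling.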

Speaker labelling conventions

Each speaker is identified by a speakercode. The speakercode consists of five parts:

  1. The first character identifies the heritage language of the speaker (C, F, H, I, K, P, L, R, T, U).
  2. The second character identifies the heritage speaker's generation (1, 2 or 3). X or 0 identifies a Homeland speaker.
  3. The third character identifies the speaker's sex (M or F).
  4. The fourth and fifth characters give the speaker's age.
  5. The final character (A, B, C, etc.) provides a unique identifier for otherwise identically labelled speakers.

Example: F1F29A is a Faetar-speaking, first-generation, 29-year-old female. She is the first such speaker recorded.
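
A speakercode can be decoded with a short pattern match. This is a sketch under stated assumptions rather than the project's own code: the names Speaker and parse_speaker_code are hypothetical, ages are assumed to be exactly two digits, and the final identifier is assumed to be a single uppercase letter. The language letter is kept as-is, because the text above confirms only that F stands for Faetar and does not spell out the other letters.

```python
import re
from typing import NamedTuple, Optional


class Speaker(NamedTuple):
    language: str               # one of C, F, H, I, K, P, L, R, T, U
    generation: Optional[int]   # 1, 2 or 3; None for Homeland speakers
    sex: str                    # "M" or "F"
    age: int
    identifier: str             # A, B, C, ... for otherwise identical codes


# language, generation (or X/0 for Homeland), sex, two-digit age, identifier
SPEAKER_CODE = re.compile(r"^([CFHIKPLRTU])([123X0])([MF])(\d{2})([A-Z])$")


def parse_speaker_code(code):
    """Decode a speakercode such as 'F1F29A' into its five parts."""
    match = SPEAKER_CODE.match(code)
    if match is None:
        raise ValueError(f"not a valid speakercode: {code!r}")
    language, generation, sex, age, identifier = match.groups()
    # X or 0 in the second position marks a Homeland speaker.
    gen = None if generation in ("X", "0") else int(generation)
    return Speaker(language, gen, sex, int(age), identifier)


speaker = parse_speaker_code("F1F29A")
print(speaker.language, speaker.generation, speaker.sex, speaker.age)  # F 1 F 29
```

Representing Homeland speakers with generation None keeps the field numeric for heritage speakers, which makes it straightforward to group or sort by generation.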