Building your own corpus (Learning Techniques, Methods & Strategies) Language Learning Forum

Building your own corpus
Tags: Comprehensive input

27 messages over 4 pages

Bakunin
Diglot
Senior Member
Switzerland
outerkhmer.blogspot.
Joined 4925 days ago
531 posts - 1126 votes

Speaks: German*, Thai
Studies: Khmer

Message 1 of 27

30 December 2012 at 4:50pm

About a month ago, I had an email exchange with a visitor to my Thai recordings project (thairecordings.com) who
brought up the idea of building one's own corpus. Since building a corpus is technically quite easy (a folder with
text files and a simple collocation routine is basically all you need), I thought, why not give it a try? A month later,
I'm starting to see the benefits and wanted to share the idea and ask whether anyone here has experience with
building one's own corpus.
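
To make "a folder with text files and a simple collocation routine" concrete, here is a minimal sketch of a keyword-in-context search over such a folder. It is not the routine described in the post, just one plausible way to do it; the function names, the `*.txt` pattern and the 30-character context width are assumptions chosen for illustration. It matches raw substrings, so it also works for languages like Thai that don't separate words with spaces.

```python
from pathlib import Path


def concordance(corpus_dir, query, width=30):
    """Return (file name, left context, match, right context) for every hit."""
    hits = []
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        # Flatten line breaks so the context reads as one line.
        text = path.read_text(encoding="utf-8").replace("\n", " ")
        start = text.find(query)
        while start != -1:
            end = start + len(query)
            left = text[max(0, start - width):start]
            right = text[end:end + width]
            hits.append((path.name, left, query, right))
            start = text.find(query, end)
    return hits


def print_concordance(hits, width=30):
    """Print hits with the search term roughly aligned in one column."""
    for name, left, match, right in hits:
        print(f"{left:>{width}} [{match}] {right}  ({name})")
```

Calling `print_concordance(concordance("corpus", "some word"))` would list every occurrence with its surrounding text. The column alignment is only approximate for scripts with combining marks, since padding is counted in code points.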

The basic idea behind building your own corpus is that concordance searches will yield results that are meaningful
and familiar to you, reflecting where you are in your learning process as well as your interests. While Google is the
greatest and largest corpus, it tends to show a lot of results that are irrelevant to me or where I simply lack enough
context to make sense of them.

Over the past few weeks, I've fed my corpus with around 90'000 words of transcribed spoken Thai from my
recordings project as well as with news articles I've discussed with my tutor, stories and blog posts I've read on the
internet, and a schoolbook (sports and health, first grade primary school) I've been typing in*. I'm at around
150'000 words now which is already enough to have quite a few examples for most common words (relative to
what I read and am interested in, of course). I guess, 1'000'000 words is a good target for next year.

Even though I've read all the stuff that is in the corpus, and some of it very thoroughly, I'm often surprised at how
interesting it is to look at concordance examples for specific words in order to understand better how they're used,
what register they belong to, or how they differ from similar words. The best thing, however, is that I'm familiar
with the context which is a tremendous help in getting the finer shades of meaning.

*Typing in a (school-)book, isn't that crazy? I don't know. I actually quite enjoy it, and I definitely need the typing
practice. I want to be able to type Thai as fast as I type English, and the only way to get there is to do a lot of
typing. It also helps me with proper spelling, and I tend to acquire new words more rapidly when I have to type and
then spell-check them. I also do a little bit of Scriptorium every other day, and it has a similar effect (but doesn't
contribute to my corpus).

Has anybody here entertained such a project, or am I far out on the lunatic fringe?

Edited by Bakunin on 31 December 2012 at 12:59pm
11 persons have voted this message useful

sans-serif
Tetraglot
Senior Member
Finland
Joined 4354 days ago
298 posts - 470 votes

Speaks: Finnish*, English, German, Swedish
Studies: Danish

Message 2 of 27

30 December 2012 at 5:52pm

I'm glad you brought this up. I've had the same idea for a little over a year now, but haven't done anything about it. Most of my reading still consists of paper books, which aren't terribly practical for building a corpus, so I will probably wait until I've taken the leap and upgraded to an e-reader. Nevertheless, it's great to hear that you're liking it.

Do keep us updated on your experiences and any new ways to use the corpus that you might come up with!

3 persons have voted this message useful

tommus
Senior Member
Canada
Joined 5661 days ago
979 posts - 1688 votes

Speaks: English*
Studies: Dutch, French, Esperanto, German, Spanish

Message 3 of 27

30 December 2012 at 7:46pm

Bakunin wrote:

Has anybody here entertained such a project, or am I far out on the lunatic fringe?

I guess I am out there with you on the lunatic fringe. I have several corpora that I developed in various ways. I downloaded the entire text from the Dutch Wikipedia as a single file. That is a big file. I did some text processing to extract just sentences, and I have a small Java application that pulls out groups of sentences with the target word. I have done something similar in Dutch with the European Parliament parallel corpus. I also have about five years of Radio Netherlands News (local and International) collected via email subscription.
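
The post describes a Java application that pulls out groups of sentences containing a target word. A rough Python sketch of the same idea follows; it is not the application itself. The naive sentence splitter, the function names and the `context` parameter are assumptions, and a real corpus such as a Wikipedia dump would need sturdier sentence splitting (abbreviations, quotes and so on).

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentences_with(text, word, context=1):
    """Return each sentence containing `word` plus `context` neighbours on each side."""
    sentences = split_sentences(text)
    # Whole-word, case-insensitive match; suits space-delimited languages.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    groups = []
    for i, sentence in enumerate(sentences):
        if pattern.search(sentence):
            lo = max(0, i - context)
            hi = min(len(sentences), i + context + 1)
            groups.append(" ".join(sentences[lo:hi]))
    return groups
```

With `context=0` only the matching sentences are returned; larger values give more surrounding text at the cost of longer output.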

There is an excellent corpus search engine on the Internet that searches in about 45 of the major languages. It has very nice customising features and a nice display. It is WebCorp: The Web as Corpus.

http://www.webcorp.org.uk/live/

Edited by tommus on 31 December 2012 at 12:01am
8 persons have voted this message useful

Bakunin
Diglot
Senior Member
Switzerland
outerkhmer.blogspot.
Joined 4925 days ago
531 posts - 1126 votes

Speaks: German*, Thai
Studies: Khmer

Message 4 of 27

30 December 2012 at 10:42pm

tommus wrote:

Bakunin wrote:

Has anybody here entertained such a project, or am I far out on the lunatic
fringe?

I guess I am out there with you on the lunatic fringe.

Thank goodness! :)
Actually, there are some corpora for Thai on the web as well, at least one of them decently large. But I find the
collocation results of my private one much more meaningful simply because I'm already familiar with the texts in
there. This might be an intermediate stage, and I might very well find later in my learning endeavor that a public
corpus based on the entire web is more useful, but for the moment it's clearly the other way round.

Edited by Bakunin on 30 December 2012 at 10:42pm
1 person has voted this message useful

Jeffers
Senior Member
United Kingdom
Joined 4704 days ago
2151 posts - 3960 votes

Speaks: English*
Studies: Hindi, Ancient Greek, French, Sanskrit, German

Message 5 of 27

31 December 2012 at 11:02am

Do you have software which you use to get statistical information on your corpus, such as
word frequency, etc? Is there any free software which can do this on text or word files?
1 person has voted this message useful

Bakunin
Diglot
Senior Member
Switzerland
outerkhmer.blogspot.
Joined 4925 days ago
531 posts - 1126 votes

Speaks: German*, Thai
Studies: Khmer

Message 6 of 27

31 December 2012 at 11:54am

Jeffers wrote:

Do you have software which you use to get statistical information on your corpus, such as
word frequency, etc? Is there any free software which can do this on text or word files?

Thai doesn't use spaces to indicate word boundaries. While I have written a basic parser based on headwords from
a large Thai-Thai dictionary in order to use the Foreign Language Text Reader, I can't settle on what exactly
counts as a word, and therefore I don't parse the texts of my corpus. Consequently, I can't derive statistics of word
frequency etc. But since I'm not interested in statistical analysis of my private corpus anyway (can't see a useful
application for me as a learner), it's not a big deal.
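
For readers wondering what a headword-based parser might look like: a common approach is greedy longest-match segmentation against a dictionary word list. The sketch below illustrates that general technique and is not the parser mentioned above; the function name and the tiny headword sets are made up for the example. It also shows why "what is a word" is hard to settle, since the result depends entirely on which headwords the list contains.

```python
def segment(text, headwords):
    """Greedy longest-match segmentation against a set of headwords."""
    longest = max((len(w) for w in headwords), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in headwords:
                tokens.append(candidate)
                i += size
                break
        else:
            # No headword starts here: emit a single character and move on.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical headword lists; a real dictionary would have tens of thousands of entries.
with_compound = {"ดิน", "โคลน", "ดินโคลน", "ถล่ม"}
without_compound = {"ดิน", "โคลน", "ถล่ม"}
print(segment("ดินโคลนถล่ม", with_compound))
print(segment("ดินโคลนถล่ม", without_compound))
```

The first call treats ดินโคลน as one word and the second splits it into ดิน and โคลน, so any frequency count built on top would differ between the two lists.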

Should you study a language that uses spaces to indicate word boundaries, then there are powerful corpus tools
available in the Natural Language Toolkit (Python), including statistical analysis tools; whatever language you want
to program in, Python's NLTK could be a starting point either in terms of what functions you need or in terms of a
support community.
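
As a small illustration for space-delimited languages, word frequencies can be counted with nothing but the Python standard library. This is a sketch under simple assumptions (lowercasing everything, treating any run of letters or digits as a word), and the function name is made up. NLTK's `FreqDist` offers a similar counter along with further corpus tools.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lowercased word tokens in a space-delimited language."""
    words = re.findall(r"\w+", text.lower())
    return Counter(words)


freq = word_frequencies("The cat saw the other cat.")
print(freq.most_common(3))
```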

As said before, the real use of my corpus is the ability to look at uses and collocations of a particular word or
expression from sources I'm familiar with and fully understand. This answers questions like "What does this word
mean?", "How do you use this word?", "In what context do you use this word (and not that other one)?", "What other
words do you usually use this word with?", and even "Is this word a frequent word?". A statistical analysis primarily
answers questions like "Which words are the key words of this text?", "Which words are the most frequent words?",
or derives collocations in an automated fashion.
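
Deriving collocations "in an automated fashion" can be as simple as counting which words appear near a target word. The sketch below assumes the text has already been split into tokens (easy with spaces, harder for Thai); the function name and the `window` parameter are illustrative, and real tools usually add a statistical score on top of raw counts.

```python
from collections import Counter


def collocates(tokens, target, window=2):
    """Count tokens occurring within `window` positions of each `target` occurrence."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    return counts
```

Sorting the result with `most_common()` puts the most frequent neighbours first, which is roughly what one reads off a concordance by eye.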
2 persons have voted this message useful

jacobsenmd
Diglot
Newbie
United States
buildingpeace.net
Joined 4142 days ago
2 posts - 9 votes
Speaks: Arabic (Written), Arabic (Levantine)
Studies: Turkish

Message 7 of 27

31 December 2012 at 11:40pm

My needs are somewhat different than yours, but perhaps relevant. I began studying Turkish a few months ago.
Beginner resources are scarce, so I am always on the lookout for comprehensible input. Every comprehensible
written or audio text I can find is a gem, and I wanted a way to preserve them, so I recently began using Evernote to
store and organize them. Every single text that I study goes into Evernote; I clip webpages and snap photographs
from my textbooks. The result is a steadily growing body of texts that I supposedly know and understand. They
sync across all my devices, and I always have them in the palm of my hand if I want to review old material. Every
text is dated and even geotagged, so this also provides a record of my language learning journey.

I'm not sure if Evernote would meet your needs for a corpus, but it might be worth looking into. You can store and
tag any number of texts, and quickly search for specific words across the entire collection.
3 persons have voted this message useful

Bakunin
Diglot
Senior Member
Switzerland
outerkhmer.blogspot.
Joined 4925 days ago
531 posts - 1126 votes

Speaks: German*, Thai
Studies: Khmer

Message 8 of 27

21 February 2014 at 10:50am

Here's an update on my corpus project. It's a cross-post from my log and therefore a bit more general. Note that I haven't reached my lofty goal of 1 million words yet :)

At the time of writing this, my Thai corpus has grown to almost 500'000 words. A corpus is basically a collection of written texts; corpora are mainly used by linguists to analyze how language is used. I've decided to build my own corpus because I was dissatisfied with the quality of Google's search functionality in Thai: often I would get irrelevant, obscure or wrongly spelt results. Google also tended to ignore some tone marks, which has, however, improved a lot in recent times.

The great thing about my corpus (and any self-built corpus) is that it exclusively contains texts curated by me. I've read every single text in my corpus with FLTR, fixed typos, and made sure certain standards are adhered to (e.g., the placement of special characters like ๆ). Each text in my corpus is meaningful to me because I chose to read it based on my interests, needs or study program, read it carefully and worked on it to some extent (e.g., copied texts from the schoolbooks I work with, have accompanying audio for texts from thairecordings.com, fixed typos, or discussed it with a tutor). Often, though not always, the example sentences from my corpus are much more useful to me than Google search results, in particular when I'm investigating the usage of words I already know well.

I enjoy programming from time to time; I'm clearly a dilettante, but I do get easy things done. Programming a concordancer for my corpus was relatively easy, and for the output I could make use of a template I got from an expat in Chiang Mai who had commented on a blog post on a related topic I made somewhere else. My concordancer is relatively basic, but I've programmed the possibility to search for structures like 'X […] Y' which is difficult with Google (or I haven't figured it out yet).
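
A search for structures like 'X […] Y' can be sketched with a regular expression that allows a bounded gap between the two parts. This is one possible approach, not the concordancer described above; the function name and the default gap of 20 characters are assumptions. The gap is matched lazily, so each hit ends at the nearest Y after X.

```python
import re


def gap_search(text, x, y, max_gap=20):
    """Find `x`, then at most `max_gap` characters, then `y`.

    Returns (full match, gap text) pairs. The gap cannot span line breaks.
    """
    pattern = re.compile(
        re.escape(x) + r"(.{0," + str(max_gap) + r"}?)" + re.escape(y)
    )
    return [(m.group(0), m.group(1)) for m in pattern.finditer(text)]


# Made-up example sentence for the construction นำ … มาใช้.
print(gap_search("นำความรู้มาใช้ในชีวิต", "นำ", "มาใช้"))
```

Keeping `max_gap` small avoids hits where X and Y merely happen to occur in the same paragraph.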

Here's the output for the not very frequent word โคลน (mud):

Even though there are only 7 results, I can glean some useful information from it:
- โคลน often comes together with ดิน (earth): ดินโคลน (mud)
- โคลน can ถล่ม (fall down; landslide)
- one can ลุยโคลน (wade through mud)
- there's the word โคลนตม (sticky, viscous mud)

Here's part of the output of the much more frequent construction นำ … มาใช้ (use …):

I usually look at output like this when I want to get many examples or check whether a particular way to say things is correct.

Here's the output for ไวไฟ:

The above is a trivial example, but corpus search can sometimes help me to understand that one word has several different meanings: here, Wi-Fi [2, 3, 5, 6, 7, 8, 9, 10, 12, 13] and flammable [1, 4, 11]. I also get the useful collocations วัตถุไวไฟ (flammable goods) and ติดตั้งไวไฟ (set up Wi-Fi).

My corpus consists mainly of schoolbooks, spoken Thai and news articles:

I use my corpus regularly, but not very often. Most of the time I just feed it (which is easy, I just need to copy the text file I've worked on with FLTR to a special folder and add some meta information). When I use it, however, it often proves to be a valuable source of language usage I couldn't easily get elsewhere.
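
The "copy to a special folder and add some meta information" step could look something like the sketch below. The post doesn't say how the meta information is stored, so the `index.json` file, the `category` and `note` fields and both function names are assumptions made purely for illustration.

```python
import json
import shutil
from collections import Counter
from pathlib import Path


def add_to_corpus(source_file, corpus_dir, category, note=""):
    """Copy a text file into the corpus folder and record its metadata."""
    corpus = Path(corpus_dir)
    corpus.mkdir(parents=True, exist_ok=True)
    target = corpus / Path(source_file).name
    shutil.copy(source_file, target)

    # Hypothetical metadata store: one JSON file mapping file name to details.
    index_path = corpus / "index.json"
    if index_path.exists():
        index = json.loads(index_path.read_text(encoding="utf-8"))
    else:
        index = {}
    index[target.name] = {"category": category, "note": note}
    index_path.write_text(
        json.dumps(index, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return target


def composition(corpus_dir):
    """Count corpus files per category, based on the metadata index."""
    index_path = Path(corpus_dir) / "index.json"
    index = json.loads(index_path.read_text(encoding="utf-8"))
    return Counter(entry["category"] for entry in index.values())
```

Adding the same file name twice overwrites the earlier copy and its index entry, which keeps the counts honest if a text is corrected and re-added.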

Edited by Bakunin on 21 February 2014 at 12:25pm

6 persons have voted this message useful
