16 messages over 2 pages
viedums Hexaglot Senior Member Thailand Joined 4670 days ago 327 posts - 528 votes Speaks: Latvian, English*, German, Mandarin, Thai, French Studies: Vietnamese
| Message 9 of 16 09 May 2013 at 5:25am | IP Logged |
Language Log has a response to the PNAS paper by Sally Thomason, a linguist who is the co-author of a major work on language contact. These quotes reflect her overwhelmingly negative evaluation:
“I think they have a serious garbage in, garbage out problem.”
“But even if there are no statistical flaws, the Pagel et al. paper is yet another sad example of major scientific publications accepting and publishing articles on historical linguistics without bothering to ask any competent historical linguists to review the papers in advance.”
Everyone commenting seems to agree with her.
Ultraconserved words? Really??
Edited by viedums on 09 May 2013 at 6:20am
6 persons have voted this message useful
| patrickwilken Senior Member Germany radiant-flux.net Joined 4537 days ago 1546 posts - 3200 votes Studies: German
| Message 10 of 16 09 May 2013 at 12:07pm | IP Logged |
viedums wrote:
“I think they have a serious garbage in, garbage out problem.”
Thanks for the very informative link. I hadn't realized that the quality of the cognate data used was so poor/controversial, and that they were in fact apparently cherry-picking the cognates used.
It made me wonder though whether anyone had tried to recreate language trees computationally without using the pre-existing databases, but simply by looking for sound correspondences across a predefined set of words, say the 100 or 1000 most common words in various language families, to score the degree to which each word is phonetically similar. The idea wouldn't be to reconstruct individual words, simply to measure the common variance across a pre-defined set of words. Some of that variance would be chance, and some real, but presumably more closely related languages should share more common variance. At some point noise would overcome signal, but I would be curious how far back you could go doing that; whether, for instance, you could see that Basque is more closely related to English than to Japanese.
I am assuming someone did this already, and it was a failure, but I would be curious how quickly this sort of method would fail.
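A minimal sketch of the idea above, with invented data: score pairwise phonetic similarity over a small fixed word list using normalised edit distance, then rank language pairs by average distance. The word lists here are illustrative orthographic forms, not phonemic transcriptions; a serious attempt would need IPA data, a proper alignment model, and far more than four words.

```python
# Toy sketch: edit-distance similarity over a tiny basic-vocabulary list.
# All word lists below are illustrative, not real research data.

def levenshtein(a, b):
    """Standard edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similarity(w1, w2):
    """Normalised similarity in [0, 1]."""
    if not w1 and not w2:
        return 1.0
    return 1 - levenshtein(w1, w2) / max(len(w1), len(w2))

# A handful of basic-vocabulary items (orthographic, for illustration only)
words = {
    "English": ["water", "two", "name", "night"],
    "German":  ["wasser", "zwei", "name", "nacht"],
    "French":  ["eau", "deux", "nom", "nuit"],
    "Spanish": ["agua", "dos", "nombre", "noche"],
}

def lang_distance(l1, l2):
    sims = [similarity(a, b) for a, b in zip(words[l1], words[l2])]
    return 1 - sum(sims) / len(sims)

langs = sorted(words)
pairs = sorted(
    (lang_distance(a, b), a, b) for i, a in enumerate(langs) for b in langs[i + 1:]
)
for d, a, b in pairs:
    print(f"{a:8s} {b:8s} {d:.3f}")
```

Even on this toy data the English/German pair comes out closest, which is the signal the post is asking about; the interesting (and unanswered) question is how quickly borrowing and chance resemblance swamp that signal as you go further back.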
1 person has voted this message useful
| Lykeio Senior Member United Kingdom Joined 4248 days ago 120 posts - 357 votes
| Message 11 of 16 09 May 2013 at 12:30pm | IP Logged |
patrickwilken wrote:
viedums wrote:
“I think they have a serious garbage in, garbage out problem.”
Thanks for the very informative link. I hadn't realized that the quality of the cognate data used was so poor/controversial, and that they were in fact apparently cherry-picking the cognates used.
It made me wonder though whether anyone had tried to recreate language trees computationally without using the pre-existing databases, but simply by looking for sound correspondences across a predefined set of words, say the 100 or 1000 most common words in various language families, to score the degree to which each word is phonetically similar. The idea wouldn't be to reconstruct individual words, simply to measure the common variance across a pre-defined set of words. Some of that variance would be chance, and some real, but presumably more closely related languages should share more common variance. At some point noise would overcome signal, but I would be curious how far back you could go doing that; whether, for instance, you could see that Basque is more closely related to English than to Japanese.
I am assuming someone did this already, and it was a failure, but I would be curious how quickly this sort of method would fail.
The problem is that this sort of stuff is predicated on the presumption that a) all (or even several) language families are related at some supra level and that b) this is in any way recoverable, given the level of difficulty we already have within family groups. It is interesting, and various hypotheses have been put forward at differing levels, e.g. Kartvelian with PIE, several language groups together as "Nostratic", and so on.
You're not going to get it by throwing in "raw", often out-of-date (seriously!), unshifted data. Also, languages aren't numbers: you have to be able to account for chance mutations, irregularities, human contact and literary borrowings. So, for example, the reach of Latin is massively out of proportion with its genetic descendants; the prestige of Sanskrit across Asia is another good example. It's not at all related to Cambodian, Thai etc., but these languages all borrow a massive number of words, and you had medieval kingdoms called Ayutthaya (from Skt. Ayodhya) and kings named Rama and so on.
This is just general stuff; the paper itself is basically trash now that I've had a better chance to read it. Clearly, the reason the journal didn't use historical linguists as referees is that the paper wouldn't have made it to print. Imagine publishing something like this in Glossa.
You know, I hope this popularising filth doesn't discourage anybody from this sort of stuff. We need as many people as possible; all the greats are slowly dying. We recently lost Calvert Watkins (http://en.wikipedia.org/wiki/Calvert_Watkins ), for example.
1 person has voted this message useful
emk Diglot Moderator United States Joined 5536 days ago 2615 posts - 8806 votes Speaks: English*, FrenchB2 Studies: Spanish, Ancient Egyptian Personal Language Map
| Message 12 of 16 09 May 2013 at 1:18pm | IP Logged |
viedums wrote:
“I think they have a serious garbage in, garbage out problem.”
“But even if there are no statistical flaws, the Pagel et al. paper is yet another sad example of major scientific publications accepting and publishing articles on historical linguistics without bothering to ask any competent historical linguists to review the papers in advance.”
Sigh. It looks like their underlying database was garbage after all, and I should have been even more cynical than I was.
My old rule of thumb was "historical linguistics paper in high impact factor journal with biologists participating -> probably garbage". But I think in the future, I'm going to place additional weight on whether the paper's technique was first applied to IE origins or superfamily reconstruction. There's something about these two problems which attracts a disproportionate number of cranks, and a non-specialist could safely discard all these papers unread without going too far wrong.
It's not like there aren't dozens of interesting problems within widely-accepted language families. There's still no universally-accepted etymological dictionary for Afro-Asiatic, for example, despite 5,400 years of recorded history. There are entire families which have received only a minuscule fraction of the research effort that has gone into Indo-European. Surely any promising statistical technique should cut its teeth on problems near the frontiers of existing knowledge, rather than launching a hypothesis blindly into the distant past?
3 persons have voted this message useful
| patrickwilken Senior Member Germany radiant-flux.net Joined 4537 days ago 1546 posts - 3200 votes Studies: German
| Message 13 of 16 09 May 2013 at 1:44pm | IP Logged |
Lykeio wrote:
You're not going to get it by throwing in "raw", often out-of-date (seriously!), unshifted data. Also, languages aren't numbers: you have to be able to account for chance mutations, irregularities, human contact and literary borrowings. So, for example, the reach of Latin is massively out of proportion with its genetic descendants; the prestige of Sanskrit across Asia is another good example. It's not at all related to Cambodian, Thai etc., but these languages all borrow a massive number of words, and you had medieval kingdoms called Ayutthaya (from Skt. Ayodhya) and kings named Rama and so on.
Thanks for the informative response.
Putting aside chance mutations, irregularities etc., which I assume just add noise to the signal, massive borrowings might be possible to see using something like a Principal Component Analysis (PCA) across language groups.
PCA is a very standard statistical technique used in psychology and elsewhere to pull out shared variance. One of its most impressive uses I have seen was in the area of personality research. If you get people to describe lots of things emotionally (the cat is "playful", "happy"; the criminal is "sinister", "deceitful", etc.) you find that some words co-correlate (or anti-correlate, e.g. happy vs. sad), whereas others show no relationship (e.g. crazy could be as likely to appear with happy as with sad). In this way psychologists were able to reconstruct the Big Five personality factors without any a priori assumptions about the structure of personality (extraversion/introversion is perhaps the most famous of the five dimensions).
I wonder if, in a similar way, PCA applied across European words would show, say, a common dimension for Germanic words as opposed to another dimension for, say, Latinate words.
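To make the suggestion concrete, here is a minimal PCA sketch with wholly invented data: languages as rows, and one binary column per concept marking whether (hypothetically) the everyday word descends from or was borrowed from a Latin root. A serious attempt would need hundreds of concepts and careful etymological coding; this only illustrates the mechanics.

```python
# Toy PCA: project languages onto the first principal component of an
# invented Latinate-vocabulary coding. All data below is illustrative.
import math

# Hypothetical 0/1 coding: 1 = Latinate form in everyday use.
data = {
    "English": [0, 0, 0, 1, 1, 1],   # mixed: heavy borrowing via French
    "German":  [0, 0, 0, 0, 0, 1],
    "Dutch":   [0, 0, 0, 0, 0, 0],
    "French":  [1, 1, 1, 1, 1, 1],
    "Spanish": [1, 1, 1, 1, 0, 1],
    "Italian": [1, 1, 1, 1, 1, 1],
}

langs = list(data)
X = [data[l] for l in langs]
n, k = len(X), len(X[0])

# Centre each column
means = [sum(row[j] for row in X) / n for j in range(k)]
X = [[row[j] - means[j] for j in range(k)] for row in X]

# First principal component via power iteration on the k x k covariance matrix
cov = [[sum(X[i][a] * X[i][b] for i in range(n)) / (n - 1) for b in range(k)]
       for a in range(k)]
v = [1.0] * k
for _ in range(100):
    w = [sum(cov[a][b] * v[b] for b in range(k)) for a in range(k)]
    norm = math.sqrt(sum(x * x for x in w))
    v = [x / norm for x in w]

# Project each language onto PC1
scores = {l: sum(x * vi for x, vi in zip(row, v)) for l, row in zip(langs, X)}
for l, s in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{l:8s} {s:+.3f}")
```

On this toy coding the Romance and Germanic languages land at opposite ends of PC1, with English sitting between the clusters, which is exactly the borrowing effect the post is speculating about.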
1 person has voted this message useful
emk Diglot Moderator United States Joined 5536 days ago 2615 posts - 8806 votes Speaks: English*, FrenchB2 Studies: Spanish, Ancient Egyptian Personal Language Map
| Message 14 of 16 09 May 2013 at 2:28pm | IP Logged |
patrickwilken wrote:
I wonder if, in a similar way, PCA applied across European words would show, say, a common dimension for Germanic words as opposed to another dimension for, say, Latinate words.
The data for Indo-European has already been analyzed to death. It's been a recognized research field for 150 years now, and innumerable graduate students have analyzed and reanalyzed every available scrap of data. Simply taking PCA (or any of the standard bioinformatics algorithms) and tossing in raw data is unlikely to discover anything new. Now, I'm sure there are new discoveries waiting to be made, and that some of these will involve modern data analysis tools. But it's pretty easy to do awful work in this area.
To give you an idea of what the field looks like, one of the gems of historical linguistics is Laryngeal theory. This involved postulating some rather improbable-looking consonants that had survived in no known language. But when Hittite was later discovered, it provided substantial evidence for the theory. In other words, the techniques made testable predictions which were later confirmed (or at least supported) by new evidence.
Honestly, if you want to apply statistical methods to historical linguistics, the best role for Indo-European might be for showing that the new techniques can successfully reconstruct existing research.
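The kind of sanity check described above can itself be made quantitative: before trusting a technique on unknown territory, score how well its groupings recover an accepted classification. A toy sketch, with an invented "predicted" clustering compared against the textbook family labels via pairwise agreement (the Rand index):

```python
# Toy validation: compare a hypothetical inferred clustering with the
# accepted family labels using pairwise agreement (Rand index).
from itertools import combinations

accepted = {"English": "Germanic", "German": "Germanic", "Dutch": "Germanic",
            "French": "Romance", "Spanish": "Romance", "Italian": "Romance"}
predicted = {"English": 0, "German": 0, "Dutch": 0,
             "French": 1, "Spanish": 1, "Italian": 0}  # one mistake on purpose

agree = total = 0
for a, b in combinations(accepted, 2):
    same_true = accepted[a] == accepted[b]
    same_pred = predicted[a] == predicted[b]
    agree += same_true == same_pred
    total += 1
print(f"Rand index: {agree / total:.2f}")
```

A method that cannot push this score near 1.0 on Indo-European, where the answer is known, has no business being trusted on superfamily reconstruction, where it is not.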
3 persons have voted this message useful
| Lykeio Senior Member United Kingdom Joined 4248 days ago 120 posts - 357 votes
| Message 15 of 16 09 May 2013 at 2:30pm | IP Logged |
That may have merit to it but would require careful experimentation. Basically, with comp phil things need to be consistent, plausible, and to some degree have predictive force. It's a slow process.
When it comes to borrowings (that is, loan words), some of the most common trends we see go by register and subject. I'm not up on the modern-languages side of this, which details how (modern) Greek received Turkish and Albanian, or Persian received Arabic, but going back to the ancient stuff, Sanskrit spread to other language groups largely via religious and technical registers. Likewise, if we take Greek again, it borrowed heavily from Near Eastern languages for its technical and religious vocabulary. There is even some weird everyday stuff, like "tunic" coming from West Semitic, or the word for the colour blue, /kuanos/, coming ultimately from Egyptian.
So there are discernible patterns in borrowings, but these are not absolute, and loans rarely penetrate into core vocabulary unless they signify a foreign object.
I should also point out that language trees are best thought of as heuristic paradigms. That is to say, many are tenuous, and even when a grouping seems pretty unassailable (like Indo-European), it doesn't have any prescriptive force on the users of a language. Greeks borrowed from Near Eastern languages because they were culturally and geographically close, regardless of the lack of a genetic relationship. So it gets VERY messy.
Likewise, unrelated languages can heavily affect one another. Language is, ultimately, a human tool of communication, and putative groupings must be forced to comply with historical realities too. Sorry if this seems tangential or ranty; I'm just saying that most studies ignore the really important, tangible effect of areal influence, which is much more historically traceable than earlier linguistic groupings.
4 persons have voted this message useful
| patrickwilken Senior Member Germany radiant-flux.net Joined 4537 days ago 1546 posts - 3200 votes Studies: German
| Message 16 of 16 09 May 2013 at 3:12pm | IP Logged |
emk wrote:
The data for Indo-European has already been analyzed to death. It's been a recognized research field for 150 years now, and innumerable graduate students have analyzed and reanalyzed every available scrap of data. Simply taking PCA (or any of the standard bioinformatics algorithms) and tossing in raw data is unlikely to discover anything new. Now, I'm sure there are new discoveries waiting to be made, and that some of these will involve modern data analysis tools. But it's pretty easy to do awful work in this area.
Sure. I wasn't trying to suggest anything new could be found out that way. I was just curious if the technique would work, and thought Indo-European would be a good place to test it.
Lykeio wrote:
That may have merit to it but would require careful experimentation. Basically, with comp phil things need to be consistent, plausible, and to some degree have predictive force. It's a slow process.
Thanks for the very patient reply.
I am just having random thoughts. I am a psychologist by training, and I used to collaborate with a computational neuroscientist. Psychologists, to a first approximation, hate maths, and my friend has a lot of trouble getting what seem to me reasonable papers published in psychology journals, at least in part because his techniques are not ones that others can use, since their maths is simply not good enough (my friend's first PhD was in string theory before he went into neuroscience, which is not a normal career path for psychologists).
That's not to say that computational neuroscientists can solve all problems. Our collaboration was powerful insofar as I grounded his maths in the practical problems of experimental psychology.
I have a suspicion that a similar collaboration between linguists and people with good statistical/mathematical abilities would be equally fruitful (it's really hard to find people with both skill sets). I thought perhaps this PNAS paper was one example of this, but I am convinced now that this is not the case.
EMK: I would assume such a collaboration would first be applied to existing knowledge to test the techniques, and then to open problems in other areas, as you have suggested.
Edited by patrickwilken on 09 May 2013 at 3:24pm
1 person has voted this message useful