Polish Y-DNA Clades
20-Nov-11
Peter Gwozdz
pete2g2@comcast.net
News
20 Nov R1a news: The new SNP paragroup
Z280* is looking very similar to what I have been calling K type. A type is clearly Z280-
Z93+, not part of K. Preliminary
data: E type seems to be Z280+ Z92+,
not part of K. D type seems to be K. More SNP data will soon clarify where I type
and other types fall.
Common Polish Y-DNA Haplogroups
Comment 20 Nov 2011: this table needs an update, but the links get
you quickly to the appropriate sections.
Data from the Polish
Project. This is an abbreviated
version of the more detailed Results Table. In the “Haplogroup” column, click on the
link to jump directly to the corresponding section of the Results Table. In the “Description” column, click on the
link to jump to a description of that particular clade. There is a topic below with Explanation of the Results Table.
| | Haplogroup | Number Samples | % | Description | Number Samples | % |
| | | 6 | 0.7 | | | |
| | | 48 | 5.2 | | | |
| | | 29 | 3.2 | | | |
| | | 131 | 14.3 | | | |
| | | | | | 7 | 0.8 |
| | | | | | 5 | 0.5 |
| | | 72 | 7.8 | | | |
| | | 3 | 0.3 | | | |
| | | 86 | 9.4 | | | |
| | | | | | 8 | 0.9 |
| | | | | | 6 | 0.7 |
| | | 15 | 1.6 | | | |
| R | R1a | 2 | 0.2 | | | |
| (56.5%) | | 206 | 22.4 | | | |
| | | | | | 130 | 14.2 |
| | | 56 | 6.1 | | | |
| | | | | | 23 | 2.5 |
| | | 87 | 9.5 | | 87 | 9.5 |
| | | 83 | 9.0 | | 83 | 9.0 |
| | R1b | 2 | 0.2 | | | |
| | | 83 | 9.0 | | | |
| | | | | | 14 | 1.5 |
| | | | | | 9 | 1.0 |
| | | | | | 6 | 0.7 |
| | R2 | 1 | 0.1 | | | |
| | | 8 | 0.9 | | | |
| | Total | 918 | 100.00 | | | |
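The percentage column of the table can be checked from the sample counts. The sketch below assumes the counts listed are those of the first "Samples" column, taken in row order; that reading of the column layout is an assumption, although these counts do add up to the stated total of 918. The function name `percent` is only illustrative.

```python
# Haplogroup-level sample counts from the table, in row order
# (assumed to be the first "Samples" column).
counts = [6, 48, 29, 131, 72, 3, 86, 15, 2, 206, 56, 87, 83, 2, 83, 1, 8]
total = sum(counts)


def percent(samples, total):
    """Share of the project in percent, rounded to one decimal place."""
    return round(100.0 * samples / total, 1)


percents = [percent(n, total) for n in counts]

print("total samples:", total)
for n, p in zip(counts, percents):
    print(f"{n:4d}  {p:5.1f}")
```

Because each row is rounded separately, the rounded percentages need not add up to exactly 100.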
Edited 10 Sep 2011. Abstract rewrite 27 Jul 2011.
The Polish
Project has assignments of men to haplogroups
based on their Y-DNA data. Lawrence Mayka, administrator of the Polish Project, provides
data for this web site of mine. This
web document is for explanation, details, and update news.
The Results Table has a summary of
assignments. Some assignment categories
have a link to more detailed discussion.
If you know your assignment you can click on the link in the right
column of the Table to read more about your assignment category.
Most of the assignments are to well
established haplogroups, as determined by Mayka. In addition, I hypothetically subdivide haplogroups into types when division can be done with 80% confidence. With less than 80% confidence, assignment
categories are tentative, not called types, often called clusters.
About half of Polish men belong to
haplogroup R1a. Most of my work has
been on R1a.
This Abstract is for people
reasonably familiar with the jargon of genetic genealogy. If you are new to genetic genealogy you
might prefer to read the Introduction first.
This web document has three
purposes: 1. More detailed explanations
for the men (samples) that I assign to types
in the Polish
Project. 2. Summary of my published results.
3. Update with recent results.
The topic is common Polish Y-DNA clades - identification of male line Y-DNA clusters
that are concentrated in Poland.
Since I originally posted this in
December 2007, emphasis has been haplogroup R1a, because
about half of Polish men are R1a, with no subdivision at that time. A new division,
roughly 50-50, between R1a1a* and R1a1a7 (M458), became available in November
2009. In 2010 I expanded this page to
include clades from other haplogroups.
I use the word type
to mean an STR cluster with statistical validity as
established by my Mountain Method. I expect my types to be validated in due time by discovery of new SNPs that will qualify them as haplogroups. I chose the word “type” because it is not
generally used in genetic genealogy and I wish to distinguish my types from
haplogroups and from other clusters.
All types have associated clusters but not all clusters qualify as
types. In my publications
and web pages I make it clear which types I have discovered in web data and
which types were suggested to me by others, with references. Usually when I discover a type I later find
out someone else had mentioned it earlier on the web; let me know if you the reader have more clues and references for
me.
Most types that I discuss seem to be
1,000 to 5,000 years old, so all the men in each type seem to be descended in direct
male lines from one man (MRCA) who lived that long ago (TMRCA). A few of my
types might be younger or older than that range.
I use phrases like “seem to be” over
and over because the methods are statistical.
Click here for a summary of the
conservative automatic haplogroup
assignments in the Polish Project, for an explanation of the extended assignments, for a discussion of the minimum 80% probability for assignment, and for the Results Table.
The Polish Project is considered
representative of Historical Poland, with caveats explained in my Publication.
I am interested in Polish
origins. This web document, however, is
not for historical analysis and conclusions, except for occasional comments to
remind us of the goal. This document is
dedicated to identifying haplogroups and types and clusters concentrated in
Poland, with detailed explanations. I
am aware that some people object
to the use of Y-DNA for historical analysis, so I try to mention caveats along
with my comments.
Abstract rewrite 4 Jul 2010.
About half the men of Polish male
line ancestry belong to the R1a haplogroup.
About 99% of Polish R1a are R1a1a. This R1a Abstract is a summary discussion of
the R1a Results Table.
U category. Unassigned.
This is the largest category in R1a.
On the Polish
Project Y-DNA Results page, detailed assignments are made with
minimum 80% probability. Because of the restriction to 80% probability, many R1a men in
the Polish Project are not assigned to detailed categories at the Polish
Project web page. Those men go into
this “Unassigned” category. These still
have either R1a or R1a1 automatically
assigned by FTDNA. If you are in
this U category, you can promote yourself out by purchasing the full 67 marker
STR set, since all R1a samples with 67 markers get a detailed
assignment.
I consider the R1a Polish data as 4
major categories based on STR data.
About half the men of Polish male line ancestry belong to the R1a
haplogroup, and that group divides roughly equally into these 4
categories. Since 2007, I have been
calling them P type, N type, K type, and R category. P and N are in the new R1a1a7 (M458). P is R1a1a7b (L270). K is R1a1a*. R is mostly R1a1a*.
R, Remainder, is not a type. I use R for samples that do not belong to any
of the types I have identified in R1a1a* so far.
My overall confidence
in K type is only 85% because there seem to be unidentified types with STR
values close to K. The modal
haplotype for K is essentially the same as the modal
haplotype for all of R1a. However, I
have identified subtypes of K that have much higher confidence. In other words I have higher confidence for
many individual samples. I have high
confidence in the subtypes although I am not sure all the subtypes assigned to
K belong to exactly the same clade along with all the other samples that I have
assigned to K outside the subtypes.
Even if K is not a true clade as defined, however, it is clear that the K
samples belong to branches in the R1a1a* tree with nodes very close to each
other. The only uncertainty is that
there are likely many other samples that belong in other branches just as close
to K.
Borderline categories
are not types but are samples that match types with less than 80%
probability. Each Borderline category
has discussion below.
P type is
concentrated in Poland, rare with increasing distance from Poland. N type seems to be
mostly Slavic, widespread in eastern Europe.
K type corresponds to one of the two largest R1a1
clusters. Another large R1a1a cluster,
the one I call L type, is not common in Poland.
In the table I assign each R1a1a*
(M458-) subtype into either K or R based on how distant the STR values are from
K. Some of these are borderline
however. There is no clean separation
of K from R, so the table should not be considered a high probability
separation of K subtypes from the R remainder subtypes. Read the individual type discussions to see
which subtypes fit K with high probability;
A type is an example.
Thanks go to Lawrence
Mayka, Polish Project administrator, for extensive email information and
assistance.
You can compare data to my types by
clicking this link to instructions for Ysearch.
Reminder: I am concentrating on Poland.
The statistics of STR clusters depend a lot on the database. For example, P type stands out dramatically
in Polish data. In other countries P
type is rare. If you belong to an R1a1
cluster that is rare in Poland, I’m sorry, but I’m not covering you. K type is an example of a type that is
common both in Poland and elsewhere. M type is common in northwest Europe but so far absent in the
Polish Project.
This Introduction is for people unfamiliar
with the jargon of genetic genealogy.
There are quite a few web sites with
a general introduction to the subject of genetic genealogy, for example Wikipedia, FTDNA, and Genographic. Back issues of JOGG
are good general references. The Y Chromosome Wikipedia
article is about male line DNA, also called Y-DNA.
The following several paragraphs are
a brief introduction to genetic genealogy for Y-DNA, providing some definitions
of jargon needed to read my web pages. The
definition words are boldface. I often
use links to those definitions when I use a jargon word for the first time in a
topic. There are more boldface
definitions in the summary of my Methods.
The Y chromosome gets passed from
father to son, so it works just like a male family name. Men are divided into haplogroups based on known rare mutations (most of them
are called single nucleotide polymorphisms, or SNPs) in the Y
chromosome. Division into haplogroups
is done in a manner that has virtually 100% confidence. I say “virtually” because your confidence in
your DNA result from your DNA testing company might be 98% or 99% or
99.9%; the confidence for haplogroups
is better than that. We can be
virtually certain that all the men in a haplogroup descend in direct male lines
from one man, called the “Most Recent Common Ancestor” (MRCA)
for that haplogroup. The MRCA
corresponds to the node, or branching point, in the
Y-DNA tree of male line ancestry. Time
of the Most Recent Common Ancestor (TMRCA) is an
estimate of how long ago he lived - the age of the
haplogroup.
Lots of people, including me, are
working to discover more SNPs on the Y chromosome so that the haplogroups can
be divided further into smaller haplogroups.
Haplogroups have alphanumeric codes,
like R1a1a. A paragroup
is a haplogroup considered without its known haplogroup branches. When a new branch is discovered within a
paragroup, it gets removed from the definition; that changes the meaning of that paragroup. An asterisk is usually used in paragroup
codes, like R1a1a*.
Many people, including me, try to
“stay ahead” of the haplogroups by analyzing other mutations that are not so
rare (called STR) on the Y chromosome. Men submit their Y-DNA data to various web
sites. There are lots of STR data
available on the web. Men are divided
into STR clusters as hypothetical subdivisions of the haplogroups. All such clusters are hypothetical. Some will be validated in the future by new
SNP discoveries. There are various
statistical methods for estimating the confidence of STR clusters. I recently published
a method that I developed. That
publication has references to other methods.
There is a brief summary of my method
below.
A few STR clusters are small family
clusters, with the same family name.
Y-DNA is biologically accurate, so some men discover that their Y-DNA
does not match the DNA of their male line cousins identified by genealogy
research, due to secret adoptions, illegitimacies, etc. This is one of the reasons some people
prefer to avoid genetic genealogy. The
male line associated with the Y-chromosome is only one ancestral line. Humans have 23 pairs of chromosomes. Anyone who tries to make a family tree going
back 300 years has more than a thousand root tips to be filled by names of
ancestors who lived back then; the one
man at the tip of the male line root is only one of those thousand. That is another reason some genealogists
avoid Y-DNA genetic genealogy - the emphasis on only one line of descent out of
many. That said, many people enjoy the
challenging hobby of figuring out to which ancient extended male line they
belong.
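The "more than a thousand root tips" figure comes from the number of ancestor slots doubling with each generation. Here is a minimal sketch of that arithmetic, assuming about 30 years per generation; that figure is an illustrative assumption, not a value taken from this page, and the function name is hypothetical.

```python
def ancestor_slots(years_back, years_per_generation=30):
    """Return (generations, slots) for a pedigree reaching years_back years.

    years_per_generation is an assumed average, used only for illustration.
    Each generation doubles the number of ancestor slots, so the count
    at generation g is 2 ** g.
    """
    generations = years_back // years_per_generation
    return generations, 2 ** generations


generations, slots = ancestor_slots(300)
print(generations, slots)
```

Only one of those slots is the direct male line that Y-DNA follows. The count is of slots in the tree, not of distinct people, since the same ancestor can fill more than one slot.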
Most STR based clusters have an MRCA
who lived thousands of years ago, before family names were common, so most men
assigned to a typical cluster do not have the same family name.
Many SNP based haplogroups have an
MRCA who lived more than ten thousand years ago, so these span multiple ethnic
groups and nationalities. For example,
the R1a
haplogroup is of interest to me. R1a is
most common in Slavic countries but calling R1a Slavic is misleading because it
is found throughout Europe and west Asia.
The MRCA lived so long ago that he may have spoken a language that we
would not consider Slavic if we could hear it.
It is possible that he did not even live in what is now the Slavic
region of Europe; maybe his descendants
moved there in a massive migration from the Asian steppes, or from India. No one knows for sure. Even if he was proto-Slavic in language and
culture, by now some of his descendants long ago moved to other parts of Europe
and Asia. One of the appeals of genetic
genealogy is trying to figure out ethnic descent and migration from the
statistics of haplogroups. Some people
object, pointing out that ethnicity cannot be defined genetically because of
all the moving and mixing of people over the millennia, and because the Y
chromosome is only one of many. True
enough. Some individuals and some web
sites go too far with genetic claims.
That said, statistical analysis of haplogroup data provides many clues
on human origins.
Again, some people try to stay ahead
of haplogroups, using statistical analysis of STR based clusters to gain
insight into more recent human origins.
I am one of those people. My
interest is Polish origins. This web
document, however, is not for the historical analysis and conclusions, except
for occasional comments to remind us of the goal. This document is dedicated to STR data and analysis, identifying
clusters concentrated in Poland, with detailed explanations.
The bottom of my Method section has
more definitions for a number of genetic genealogy
terms.
There are a number of organizations
and commercial companies on the web where you can order a cheek swab kit to
mail in for genetic genealogy analysis, for example FTDNA. I am not associated with the company
FTDNA; I mention them because I make
extensive use of their data; check
Google for competitors. At FTDNA, click
on Products for cheek swab kits. DNA
results are confidential unless you register the data at a database; at FTDNA, click on Projects to register your
data into one of the many databases;
for example, most of my analysis is from the data in the FTDNA Polish Project.
I use the FTDNA standard set of 67 STR markers (plus a few
non-standard ones occasionally). I do
some analysis using the standard FTDNA 12, 25, 37, or 111 STR marker sets. Other companies use standard marker sets that
may not overlap with all the FTDNA markers.
Ysearch is the
largest web database for Y-DNA, run by FTDNA, open to all men, including men
who also register with projects and including men with data from other testing
services. I use Ysearch often for
analysis so of course I encourage you to register your Y-DNA data at
Ysearch. From the FTDNA site, you can
register your data with Ysearch. Or you
can type your Y-STR data into Ysearch.
Comment 14 May 2011: recent data continues to confirm the
analysis as presented in this topic a few months ago.
This topic was completely rewritten
during Dec & Jan; last update edit
17 Jan 2011.
SNP results
continue to validate P type and N type.
The SNP called L260
is almost equivalent to what I have been calling P type.
The SNP called M458
is almost equivalent to the combination of what I have been calling P type plus
N type. In other words, N type is
almost equivalent to M458+ L260- (positive result for the M458 SNP test but
negative for the L260 SNP).
The bottom of this topic has recommendations for testing regarding these
two SNPs.
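The relationship between the two SNPs and the two branches can be written as a small lookup. This is only a sketch of the logic described in this topic, with a hypothetical function name. It reports the SNP branch, not the STR based type, because the correspondence is only "almost equivalent".

```python
def snp_branch(m458, l260):
    """Classify a sample from its M458 and L260 results.

    Each argument is True (positive), False (negative) or None (not tested).
    """
    if m458 is False:
        if l260 is True:
            # L260 sits inside the M458 haplogroup, so this pair is inconsistent.
            raise ValueError("L260+ with M458- contradicts the known tree")
        return "outside M458"
    if l260 is True:
        # L260+ implies M458+, whether or not M458 was tested.
        return "P branch"
    if m458 is True and l260 is False:
        return "N branch"
    return "undetermined"


print(snp_branch(True, True))    # P branch
print(snp_branch(True, False))   # N branch
print(snp_branch(True, None))    # undetermined
```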
All L260+ are M458+ if tested for
M458, confirming that L260 defines haplogroup R1a1a1g2 within the M458 haplogroup R1a1a1g.
Test results available to me: 204 M458 and 59 L260, from 213 samples. The
following paragraphs summarize results for the 180 samples that have all 67 standard STR
markers. By “predicted” I mean my type assignment based on STR values, ignoring the
SNP results:
All samples predicted P type are
coming out M458+ L260+. 43 of them so
far.
All samples predicted N type are
coming out M458+ L260-. 31 of them so
far.
In other words, all samples with 67
markers that fit the P type or N type definitions
based on STR values are coming out correctly with SNP tests. This is 100% accuracy so far for samples
predicted P type or N type. However, I
am using the words “almost equivalent” because there are outliers:
In the P branch there are only 2
outliers: one with STR values at the cutoff and one that is 1 step beyond the
cutoff for P type.
In the N branch there are 11
outliers; discussed
below.
The percent of outliers expected in
the male population is lower than implied by these results because my SNP data
overrepresents the STR cutoff regions.
Such samples have been prioritized for SNP evaluation in order to better
establish the limits of the types. In
the Polish Project, all samples at or just beyond
the cutoffs have been SNP tested.
In addition, all outliers so far are
“just beyond” P or N types. Almost all
of these could have been predicted into the correct type based on STR values
alone, because so far almost all other “just beyond” M458- samples fit well to
other known types outside the M458 haplogroup.
Those 2 P type outliers with SNP data could have been predicted based on
STR data, with 100% probability (but only >50% statistical confidence
due to the small sample size). All but
3 of the N type outliers could have been similarly predicted.
In my discussion
topic, I mention a few caveats, including an explanation of why I use the word
“branch” not “type” for the outliers, with quantitative explanation of what I
mean by “just beyond”.
Recommendations for
R1a men not yet tested for M458 / L260: If you are
a member of the Polish Project with an N Borderline assignment you should
purchase the M458 test to determine your haplogroup. If you have a P Borderline assignment you should purchase the
L260 test. My STR rules for the Polish
Project are complicated, and those rules may not apply to R1a men outside
Poland, where exceptions to my assignment rules are more likely.
If you are not a member of the Polish
Project, with all 67 markers, you can
compare your STR values to P type and N type following the Ysearch instructions below. If you fit one of the
known types other than N or P at a lower step, you are less likely to need either SNP test
because you would likely come out M458- L260-.
If you do not fit well to another type:
If your step (genetic distance) from P type is less than 6 you are very
likely P type; step greater than 9 is
very likely not P type. From steps 6 to
9 you should purchase the L260 test to determine your status. If your step from N type is less than 7 you
are very likely N type; step greater
than 12 is very likely not N type. From
steps 7 to 12 you should purchase the M458 test.
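The step thresholds above can be sketched as two small functions. They assume a sample with all 67 markers that does not fit another known type better, and they take the step (genetic distance) as already computed; the function names are hypothetical.

```python
def p_type_advice(step):
    """Advice from the step (genetic distance) to P type, 67 markers."""
    if step < 6:
        return "very likely P type"
    if step > 9:
        return "very likely not P type"
    return "order the L260 test"       # steps 6 to 9


def n_type_advice(step):
    """Advice from the step (genetic distance) to N type, 67 markers."""
    if step < 7:
        return "very likely N type"
    if step > 12:
        return "very likely not N type"
    return "order the M458 test"       # steps 7 to 12


for step in (4, 8, 13):
    print(step, "|", p_type_advice(step), "|", n_type_advice(step))
```

A sample at step 8 from both types falls inside both testing ranges, so both SNP tests would be informative in that case.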
For samples known to be M458+, the
single marker DYS385a=10 provides a very high confidence prediction for P type
L260+, as explained below.
Even if your STR values provide a
“very likely” assignment, you do everyone a favor if you test SNPs anyway. In this case you are unlikely to learn
anything more about your DNA, but as more men perform these “wildcat” tests, we
all gain more confidence that there are no small clades with unusual STR values
waiting to be discovered. There is a
slight chance you might discover that you belong to such a small clade with a
“wildcat” SNP test.
See L260
and M458 Test Results for more discussion about the data available. The end of that topic
has brief speculation on the age and structure of the M458 clade.
See L260 and M458 Test Results; Details for
data summaries.
See L260M458Results.xls
for all my SNP data.
See also L260
and M458 Signatures.
Polish
Project R1a Assignment News
This topic was updated 5 Jul 2010.
If you are R1a
but not a Polish Project member, the Ysearch instructions topic has my method for matching to my types. The news in this topic applies to you if you
know your assignment.
If you are P type
or N type you would likely come out positive in the SNP test for M458 (M458+). If you are P type you are likely L260+. N type is likely
L260-. If you have not already tested,
you can pay the small fee for these SNP tests to confirm that you
belong to the corresponding haplogroup.
If you are assigned to P borderline or
to N borderline you would benefit more from the M458 and L260 tests, because
that would provide for you a definite assignment within R1a.
The assignment rules are done with
high probability, so if you are unassigned (category U) there is a low probability
that you would test positive for M458, with probability that decreases with
your step (genetic mutation distance) from P or N.
If you have fewer than the standard 67 STR markers it
is generally better to purchase the remaining markers. That way, you are more likely to get an
assignment, because the statistics for STRs improve with more markers. Nevertheless, if you are not many steps from
P or N you might consider doing the M458 test even with fewer than 67 markers.
There is a slight chance that you might test
positive for L260 or M458 even if you do not match P or N. The haplogroup corresponding to M458 is old enough that there may
be small clades with STR markers very different than P or
N. I have not seen one yet, but there
is no way to estimate this probability.
I hesitate to recommend the M458 SNP test for men whose samples are
distant from both P and N in STR values.
I admit you can just wait to see if anyone with STR values similar to
yours matches an SNP, then test for that SNP.
However, we all benefit when some men test for all the new SNPs within
an established haplogroup, because that way we find out the size and rough age
of the corresponding new haplogroup branches.
FTDNA offers “deep clade” test packages to test for
all possible haplogroup branches, but my understanding is that L260 and M458
are not yet included in the R1a deep clade test. You need to purchase them separately from the advanced markers
menu. No doubt FTDNA will add them soon
to the deep clade package.
The Fall 2009 issue of the Journal of Genetic Genealogy has my
publication, which is split into two parts:
Part I is my “mountains in
haplospace” method for evidence that certain “types” of STR clusters correspond
to clades.
Part II is the application
of that method to Common Polish Clades.
That article has a lot more detail than this web page, but that article
was last updated in September 2009, so this web page is an update.
PolishCladesUpdate is my
folder for future updates of the Excel analysis files for those two articles.
This web page will continue as an
introduction and summary, without as much jargon and detail as the articles and
update folder.
The Fall 2010 issue has my
publication announcing the L260 SNP.
R1a Worldwide
Wikipedia has a nice R1a entry with primary
contribution by Andrew Lancaster.
11 Jan 2011 update: There is a lot of activity these days in the
discovery of new SNPs for dividing R1a into branch haplogroups.
You can follow the activity at the R page of the ISOGG
Y-DNA tree, and also at the FTDNA Draft tree.
The new SNP named L365 includes what
I have been calling G type, based on preliminary
data. It is too early to say if other
samples in addition to G type are positive for this new SNP.
The new SNP named M417 excludes what
I have been calling C type, based on preliminary
data. So far very few R1a samples are
negative for this new SNP, but it is too early to estimate the rarity of M417-.
In early 2011 FTDNA
released some new SNPs for commercial testing, including the following for
R1a: L365, M417, L366, L291, and
others. To order new SNP tests, go to
your home page at FTDNA, on the left under “My Account” click on “Order Tests
& Upgrades”, then click on “Go To Advanced Orders” and check “SNP”. Use your browser search to find the SNP of
interest. If you wish to publish your
results, join one of the projects (click on “Projects”) and the administrator
will analyze your data.
L260 and M458
are discussed below.
There are other new experimental SNPs
discussed on the web. I’m not trying to
list everything here, just the ones that are of interest for discriminating new
R1a haplogroup branches.
25 Oct 2010 update: The new SNPs cause
confusion in the alphanumeric notation for the haplogroups
and paragroups.
In my fall
2009 publication I used the notation that was well known at the time, where
more than 95% of R1a was known to be paragroup R1a1. The R1a1 samples with one of four very rare SNPs that have been
known for a few years were called haplogroups R1a1a through R1a1d. Ysearch still (25 Oct) uses the notation
described in this paragraph. FTDNA Projects still use this notation for automatic
assignment of samples. Individual
samples are not actually assigned to a paragroup because most have not been
tested for all SNPs. Most R1a samples
are listed as R1a1. Many samples are
listed as just R1a but almost all of those would come out R1a1 if tested for
the appropriate SNP (the well known M17 or M198, or one of the new ones that
all seem to be equivalent). I mentioned
in my publication that all Polish Project R1a were coming out R1a1. Since then only one sample (out of 1441 R1a
total in the Polish Project) has come out M198-.
New SNPs were discovered equivalent
to SRY10831.2, the original R1a SNP.
Subsequently, rare samples were found positive for some of these new
SNPs but negative for SRY10831.2. I’ll
use L62 to represent these; there are
others that seem to be equivalent.
Those define two small paragroups, R1a(L62, SRY10831.2-) and
R1a1(SRY10831.2, M198-). That previous
R1a1 paragroup becomes R1a1a(M198).
Accordingly, when Underhill announced the M458
SNP, he called that haplogroup R1a1a7.
L260 was called R1a1a7b when first discovered. Last spring I rewrote this entire web page using the notation
described in this paragraph.
The recent new SNPs change the
notation again. I shall not attempt to
rewrite this entire web page. As I
update topics, I’ll use the current notation.
For clarity, I’ll add the defining SNP in parenthesis when I do updates.
For example, what I have been calling
P type is equivalent to the haplogroup now called
R1a1a1g2(L260). What I have been
calling N type is equivalent to the paragroup
R1a1a1g(M458, L260-).
The choice of which SNP to put in
parenthesis is arbitrary for haplogroup notation. For example, R1a1a1(M17), R1a1a1(M198), and a few others, all
seem to be equivalent. But any day now
someone might announce a few samples that test negative for one of those SNPs
and positive for all the others, which would define a new paragroup and force
the renaming of all branches beyond that new node in the tree.
There is ambiguity in assignment of
samples. For example, a sample that
tests negative for M198 might be called R1a(M198-), but it is not clear if this
sample belongs to the paragroup R1a(L62) or to the paragroup R1a1(SRY10831.2)
if it has not been tested for the latter.
My types have an uncertainty similar
to SNPs. For example, I said N type is
equivalent to R1a1a1g(M458, L260-).
Recently two samples showed up in the Polish Project that are M458,
L260- but just beyond N type as defined by STR fit. We can think of these two as a new “paratype”, although I’ll not
use that word. We classify these two in
the Polish Project as “M458+R”, the Remainder in M458 excluding N type and P
type. Actually, as I discuss in the N type topic, it is not statistically certain where to place
the cutoff for N type, so you could argue that the M458+R category has more
than two samples in the Polish Project.
24 Dec 2010 update.
L260 is a new SNP. I published it in the Fall 2010 issue of JOGG. It
has been available as an SNP test since early April 2010 at FTDNA.
M458 is a new SNP, published by Underhill. It has
been available as an SNP test since early November 2009 at FTDNA.
FTDNA
has not yet assigned haplogroup names to these, so men who test positive are
not reported on-line yet at FTDNA nor at Ysearch, nor at
the projects supported by FTDNA, which include the Polish
Project.
Both L260 and M458 are listed at
ISOGG and at the FTDNA draft tree, where M458 is called R1a1a1g and L260 is
called R1a1a1g2.
See R1a
Confusion 25 Oct 2010 update.
9 June comment: This web page needs an update because a new node
has been added to the tree, changing the codes slightly.
22 June 2010 update:
Almost all of R1a divides into R1a1a* (M17, M198), R1a1a7 (M458), and R1a1a7b (L260). These correspond to my original predicted division.
R1a also has several known rare groups: R1a*, R1a1*, R1a1aN, where N = 1 to 6 and
8. There is also a very rare
R1a1a7a. The asterisk is used for paragroups; R1a1a* means haplogroup R1a1a without any
of those 8 known branches.
The rare R1a groups are not in my R1a
Table. It’s a shame the
corresponding STRs are generally not published in SNP announcements. I don’t know if the rare groups all together
add up to 0.1% or 1% of R1a. Surely
they are less than 3%. My percentage
calculations in my R1a Table do not need adjustment because any Ysearch samples that might belong to these rare clades would
probably have unusual STR values, not falling into one of my types, but still
be counted in the totals. In my R1a Table, rare samples are included in row
R. That row R might have a few percent
from these rare groups, but I don’t know exactly how many.
Underhill mentions 7 samples
(men) from R1a*, 9 from R1a1*, 14 from R1a1a6, and 1 from R1a1a7a.
Lawrence Mayka,
the administrator of the Polish Project, had been assuring me by email that all
the Polish Project member tests within R1a had been coming out negative for all
the rare SNP subgroups. So if you are a
Polish R1a, you are almost surely R1a1a, the same haplogroup as about half the
men from Poland. About half of these -
about 1/4 of men from Poland - are R1a1a7.
These two “about” estimates are approximate; my data on these SNPs are not
random samples, so my population estimates are derived from the types in my table, which are STR based.
On 17 June Mayka informed me of the
first R1a1* (SRY10831.2) (R1a* in the older nomenclature) member in the Polish
Project. My table does not show this single exception
because the table is for samples with 67 markers, which that one exception does
not have. On 19 June Mayka informed me
of evidence that C type might define a new rare
subdivision of R1a slightly older than R1a1a;
if this turns out correct it will be less than 1% of R1a.
An article was published online, 4 Nov
2009, essentially dividing R1a1 into two groups, based on a new SNP, M458.
Abstract STR
Data See www.gwozdz.org/R1a.html for more
discussion
I call this article “Underhill” for
short, because his is the lead name in the list of 34 authors for this major
work.
This web page about Polish Clades was
completely rewritten using this new information. Recent L260 and M458 test results
are consistent with (albeit not full proof of) my previous R1a subdivision into
“types” here on this web page about Polish Clades.
Briefly, most of R1a1a is split by
this new mutation into R1a1a7 (M458 positive, or M458+) and R1a1a*
(M458-). See R1a Subdivision for a
brief summary of other groups, and for a clarification of what R1a1a* means.
R1a1a7 is the new M458
haplogroup. R1a1a7 includes what I have
been calling P type and N type here on this web page, even before M458 was
available.
R1a1a* is a new paragroup. This is M458 negative. It includes all my other types, particularly
K type.
This Underhill article has data for
158 “Poland” samples (Table 2):
R1a1a*: 71 samples 44.9%
R1a1a7: 87 samples 55.1%
The 70% confidence interval for
R1a1a7 is about 50% to 60% in the Underhill Poland data.
Worldwide 77% of the Underhill data
is R1a1a* (222 in Table 7 vs 68 R1a1a7, 290 total).
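The percentages and the 70% interval quoted above can be checked with a few lines of Python. This is a minimal sketch using the normal approximation to the binomial, which is an assumption here and not necessarily how the interval in the text was derived; the function name `proportion_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(successes, total, confidence):
    """Normal-approximation confidence interval for a sample proportion."""
    p = successes / total
    # Two-sided interval: half of the excluded probability goes in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sqrt(p * (1 - p) / total)
    return p, p - half_width, p + half_width


# Underhill Poland data (Table 2): 87 R1a1a7 out of 158 samples.
p, low, high = proportion_interval(87, 158, 0.70)
print(f"R1a1a7 in Poland: {p:.1%}, 70% interval {low:.1%} to {high:.1%}")

# Underhill worldwide data (Table 7): 222 R1a1a* out of 290 samples.
worldwide_paragroup = 222 / 290
print(f"R1a1a* worldwide: {worldwide_paragroup:.1%}")
```

Under this approximation the interval comes out at roughly 51% to 59%, consistent with the "about 50% to 60%" stated above.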
M458
Results are coming in now for this new SNP test and the Polish Project R1a is splitting about evenly, with a
few percent more R1a1a7 than R1a1a*, although the latter is more common
worldwide.
Format
Up to here, I have tried to write
this web page as news and summary, with links to more discussion below. I hope anyone having minimal familiarity
with genetic genealogy jargon has understood.
If you read this top to bottom, it gets progressively more detailed,
with more and more jargon. I’m sorry
about that, but the audience is also readers with genetic genealogy experience
who want to know how I came to my conclusions.
If you cannot follow some of this, it is written in a manner that you
can jump around and pick out what you do understand, then come back after you
have read more about genetic genealogy.
If you open this html document with
Word, all the link targets (bookmarks) can be viewed alphabetically or by
location.
This topic was updated 29 Dec 2010.
Lawrence Mayka
is the administrator of the Polish Project.
Click on the Polish Project web link to see
how Larry assigns samples (men) to categories. The Polish Project has sections for mtDNA
and for Y-DNA. This web document of mine is restricted to Y-DNA, with emphasis on R1a. I help
Larry with assignments to types.
Haplogroups are defined by SNP mutations. STR mutations are easier to test, so many samples have STR data
without SNP data. Predicted assignments are based on STR
correlations.
I mentioned above that FTDNA automatic haplogroup
predictions (red text means STR predicted vs green text SNP measured) have
about 99% probability. We use minimum 80% estimated probability for each individual sample in
the Polish Project that gets an extended assignment - a subdivision of its
FTDNA assignment. At 80%, many more
assignments are possible. Most extended
assignments are better than 80% probability.
Many are better than 95%.
Many samples do not have extended
assignments, but they still have their FTDNA green measured haplogroup (100%
probability) or their FTDNA red predicted haplogroup (99% probability). These bring up the average for the Polish
Project as a whole.
We are confident that the average is
better than 95%, which is to say that more than 95% of the Polish Project
samples would test positive for the SNP corresponding to their assigned
haplogroup. Excluding R1a the average
is likely more than 97%.
Example: E1b1b1a2 (V13) is a haplogroup category with some
extended assignments: Larry has me in
this category, which is 100% probable because I tested positive for the V13 SNP
along with 14 other men in the Polish Project (data in this example is from 25
May 2010). However, Larry’s listing
includes 48 men in this category, based on his analysis of STR correlations:
15 green E1b1b1a2. These are of course certain.
28 red E1b1b1 because FTDNA does not
predict beyond that, but these would likely be E1b1b1a2 if tested, because they
have STR values close to those samples that have tested V13+, and unlike the
samples that have tested positive for other branches of E1b1b1. Each has at least 80% probability, and many
are even more probably correct.
2 green E1b1b1 tested for that previous
SNP but not for the current V13, but matching in STR values.
3 green E1b1b1a tested for that
previous SNP but not for the current V13, but matching in STR values.
Note that other E1b1b1 men, both
green and red, fall into other categories at the Polish Project, because they
do not match V13+ samples closely in STR values.
End of E1b1b1a2 example.
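The counts in this example can be tallied as a quick consistency check. The sketch below only restates the numbers given above (25 May 2010 data); the dictionary labels are shorthand invented for illustration.

```python
# Counts for the V13 category, taken from the example in the text.
v13_category = {
    "green, tested V13+": 15,
    "red E1b1b1, STR predicted": 28,
    "green E1b1b1, older SNP only": 2,
    "green E1b1b1a, older SNP only": 3,
}

total = sum(v13_category.values())
snp_confirmed = v13_category["green, tested V13+"]
str_based = total - snp_confirmed

print(f"total in category: {total}")
print(f"confirmed by V13 test: {snp_confirmed} ({snp_confirmed / total:.0%})")
print(f"extended assignments from STR matching: {str_based}")
```

The four groups add up to the 48 men listed, of whom 33 are placed by STR correlation and not by a V13 test.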
R1a is unique because almost half the
Polish Project samples are placed together by FTDNA into R1a1 (M198), which is elsewhere called R1a1a (M198). Many of our R1a assignments are to types,
which are hypothetical, without known SNP definitions. The minimum 80%
estimated probability still applies to each sample and again most are much
better than 80%. For type definitions
we are confident that the average is about 90%, which is to say that about 90%
of the Polish Project R1a samples assigned to a type would test positive
someday for an SNP, unique to that type, not yet
discovered.
“Cluster” and “Borderline” and
“Unassigned” category probabilities are discussed below.
I have been active helping Larry with
R1a assignments to types since late 2007.
Explanation
of the Results Table
Note 20 Nov 2011: I just updated the Results Table with recent data. Next, I need to rewrite this explanation and
update ResultsTable.xls; coming soon.
Edit 23 Sep 2011. Complete rewrite 28 Jul 2011:
The Results Table is based on data from the Polish Project, at 67
markers. The data was downloaded on
18 Jul 2011, at which time there were 1743 Y-DNA samples
(men), including 951 with 67 or more markers.
Data was edited for family sets, 55 samples, as explained in my publication.
Total at 67 markers after this edit was 918 samples, which is the
total in the Results Table.
Polish Project Assignments at 67
Markers are taken as representative of Poland, with caveats explained in my
publication.
I did the editing and tabulation in
an Excel file, which is available:
ResultsTable.xls
Column Haplogroup has the
conventional main branch haplogroup codes.
Column Haplogroup Category has
labels determined by Mayka. Most of these are branch haplogroup (or paragroup)
codes as defined by ISOGG, with the defining SNP in
parentheses. Some of these are types as defined by me. A few of these are clusters. A few of these categories are for borderline samples, or for unassigned samples, as
explained in the corresponding sections of this web page.
Column Short Code is my own
code for use in this web page. Some of
these have links to jump to a description of that particular clade. Some have a Ysearch
link in the far right column for the modal haplotype. Many do not have links because I have not found the time to work
on them; my priority is clades that
seem to be concentrated in Poland.
The Num and % columns
are the number of samples for each category, and percent of the total. The number of samples mentioned in those
detailed descriptions (below) may not correspond to the numbers in the table
because the particular description updates may have been done at different
times than the table update. The
description section has descriptions of some experimental subtypes that are not
listed in the Results Table.
ISOGG names change often due to new
SNP discoveries. See R1a Confusion for examples.
Those types and subtypes and clusters
are my own code letters, for brevity.
Please do not confuse these code letters with official haplogroups. I have been using such code letters for R1a
assignments in the Polish Project since 2007.
The data is based on the 18 Jul 2011
download, but the table has been edited since then with a few new clusters.
My Update
Folder has an Excel analysis file for most types, plus many more
files.
The Ysearch
links provide the full modal haplotypes, using a
selected subset of the standard FTDNA set of 67 markers. I entered these data into Ysearch for our
convenience. All my modal haplotype definitions are available in the Excel file Haplotypes.xls,
which also has experimental types not mentioned here. Below are Ysearch instructions
for quickly comparing your haplotype to many
of my types at once.
Assignment to types is with at least 80% estimated probability.
Column % provides a good
estimate of the frequencies in Historical Poland, insofar as the Polish Project
is representative of Historical Poland, as discussed in my publication.
Unassigned: The Polish Project has many unassigned samples, but most of them
have fewer than 67 STR markers. The
Results Table is based on 67 marker data.
At 67 markers, many samples in “unassigned” categories obviously belong
to the corresponding paragroup, indicated by * in the Short Code column. The exception is R1b, where I lumped the
unassigned samples into the known categories, in proportion to the number of
assignments in each category; you can
examine my method in that Excel file by checking the formulae in the
totals for each category.
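The proportional lumping described for R1b can be sketched as follows. The counts are hypothetical, chosen only to make the arithmetic easy to follow; they are not the Polish Project numbers, and the function name `allocate_proportionally` is invented for illustration. The Excel file remains the reference for the actual formulae.

```python
def allocate_proportionally(assigned, unassigned):
    """Add `unassigned` samples to categories in proportion to their assigned counts."""
    total_assigned = sum(assigned.values())
    return {
        name: count + unassigned * count / total_assigned
        for name, count in assigned.items()
    }


# Hypothetical counts for illustration only.
assigned = {"subclade 1": 30, "subclade 2": 15, "subclade 3": 5}
adjusted = allocate_proportionally(assigned, unassigned=10)

for name, value in adjusted.items():
    print(f"{name}: {value:.1f}")
```

The adjusted totals keep the same ratios as the assigned counts, and their sum equals the assigned plus unassigned samples.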
The largest group in Poland is the
paragroup R1a1a1* defined by M417 with no subsequent SNP. I split these into 3 categories (way back in
2007 - still useful): K type, K
Borderline, and Remainder (K, KB, and R).
K type is further subdivided into subtypes (hypothetical haplogroup
branches of the hypothetical K branch of M417) including K* for the samples
that do not fall into subtypes. R is
the category for samples isolated in STR values, apparently due to a large
number of small old branches with very divergent STR values. The KB samples are between K and R in STR fit,
so of course those cannot be assigned with confidence to K or R; in my Excel sheet I split the KB
total - half to a larger “K cluster” and half to R - for statistical
summaries; that’s why the K cluster has
130 samples in the Summary.
Description
of the R1a Categories
This large topic has descriptions for
the Y-DNA categories at the Polish Project. Some
of these are haplogroups, some are types,
some are clusters.
Types and clusters are high confidence
hypothetical haplogroups. Borderline
categories are lower confidence. There
is also one Unassigned category for uncertain samples.
Click the Ysearch web links in the Results Table for modal haplotypes, which
are my best fits of web data to groups of men with similar STR data.
Please don’t get confused. The following capital letter names are my
codes for R1a categories.
Capital letters are also used for the large official haplogroups, but that’s different.
Some of the following types have my
Excel analysis file for my November 2009 publication; the files are stored in the Supplementary
folder. Many of the following types
have my update Excel analysis at PolishCladesUpdate.
A.
Ashkenazi. This topic needs
rewrite, because A type is now known to be a subtype of Z93 (L342.2).
This seems to be a subtype of K. This type is discussed in my publication, Part II. I have about 90% confidence in that subtype status, but I am more
than 98% certain that A is a valid clade, not just because of my work, but
because the modal haplotype closely matches the various versions of the most
common Ashkenazi haplotype, which has been widely studied and reported on the
web. It should be emphasized that not
all Ashkenazi match this type, and some men in this type may not be descended
from Ashkenazi. This type is not
restricted to Poland. Levy-Coffman wrote an article
about Ashkenazi genetic genealogy; I
noticed discussion in a recent Science
article.
B.
Another subtype of K, recently identified by Mayka. Concentrated in Poland. The B data cluster lies at the edge of the K
cluster. The node for B type in the R1a
tree might be slightly younger or slightly older than the K definition
node. I estimate the former is about
80% probability - that B is truly a subtype of K; if not then B probably lies just outside of K (node slightly
older). Individual assignments to B
type have 80% to 90% probability.
C.
Added to the Polish Project in Dec 2009 by Mayka, who
notes that Didier Vernade first pointed out the unusual DYS392=13
value in 2007. DYS392=11 is almost
universal in R1a1a. C type is very
small. There are only 2 Polish Project
samples in C type, only 1 at 67 markers, but this type is well isolated on
Ysearch, with 4 different samples with 67 markers. I calculated SBP = 7% using only 37 markers with Ysearch
data. None on Ysearch are identified as
“Poland”. C type differs very much in
STR values from the rest of R1a1. That
is evidence for an old node for C type in the R1a tree.
25 Oct 2010 update: The C type samples are coming out negative
for a new SNP called M417. Other R1a
samples are coming out positive so far, so the prediction that C type has an
old node in the R1a tree is being verified.
Of course, it is too early to say how rare M417- samples are; it is possible more will turn up that do not
belong to C type.
M417 is one of a few new SNPs that
look like they will receive the notation R1a1a1x, where x = i, j, k, etc.
I’ll update this topic when M417
becomes available for purchase.
D.
Update 12 Nov 2011: Based
on 1 Nov 2011 Polish Project data. Analysis file: DType.xls. 59 marker definition,
cutoff = 9, no samples in the gap at 9
to 11; SBP =
5.3%.
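For readers unfamiliar with the terms, a definition of this kind (a marker set plus a cutoff) can be illustrated with a short sketch. It assumes that "step" is the sum of absolute repeat-count differences from the modal haplotype over the markers in the definition, and that a sample fits the type when its step is below the cutoff; both are assumptions made for illustration, and the publication has the exact definitions. The marker values and sample names are hypothetical.

```python
def step(haplotype, modal):
    """Sum of absolute repeat-count differences over the markers in the definition."""
    return sum(abs(haplotype[marker] - value) for marker, value in modal.items())


def fits_type(haplotype, modal, cutoff):
    """A sample fits the type when its step is below the cutoff (assumed convention)."""
    return step(haplotype, modal) < cutoff


# Hypothetical 4-marker modal and samples, for illustration only.
modal = {"DYS393": 13, "DYS390": 25, "DYS19": 16, "DYS391": 10}
samples = {
    "sample A": {"DYS393": 13, "DYS390": 25, "DYS19": 16, "DYS391": 10},
    "sample B": {"DYS393": 13, "DYS390": 24, "DYS19": 15, "DYS391": 10},
    "sample C": {"DYS393": 14, "DYS390": 23, "DYS19": 17, "DYS391": 11},
}
cutoff = 3

for name, haplotype in samples.items():
    verdict = "fits" if fits_type(haplotype, modal, cutoff) else "outside"
    print(f"{name}: step {step(haplotype, modal)}, {verdict}")
```

A real definition uses dozens of markers, and a gap with no samples just beyond the cutoff is what indicates good isolation.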
Concentrated in Poland: Ysearch K49NZ; 34% have origin Poland.
This type was added to the Polish
Project in Jan 2010. The cluster was
brought to my attention by Mayka, who points out that Nordtvedt mentioned the cluster in web discussions some
time ago, based on DYS462=12.
Signature
(460,481,462,560) = (10,<22,12,18).
Any one of these four markers by itself can distinguish D type with high
probability from other R1a1a1i (Z280) samples, but those values can be found individually
as independent mutations in other R1a clades.
D type cannot be distinguished using the 25 FTDNA standard markers. At 37 markers, only 460 is available.
At 67 markers, 481<22 is an effective signature: 16 total D type: 13 D have 481=21, and only one other R1a sample has the 21 value.
2 D have <21, with no other R1a samples.
One D has the 22 value along with several other R1a. 481=25 is modal for R1a.
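A signature check of this kind is easy to express in code. The sketch below encodes the D type signature given above and skips markers that a sample has not tested; the function name `signature_matches` and the sample values are hypothetical, and the final tally only restates the DYS481 counts from the text.

```python
# D type signature from the text: (460, 481, 462, 560) = (10, <22, 12, 18).
D_SIGNATURE = {
    "DYS460": lambda v: v == 10,
    "DYS481": lambda v: v < 22,
    "DYS462": lambda v: v == 12,
    "DYS560": lambda v: v == 18,
}


def signature_matches(haplotype, signature=D_SIGNATURE):
    """Return (matched, tested) counts; markers absent from the haplotype are skipped."""
    tested = [marker for marker in signature if marker in haplotype]
    matched = [marker for marker in tested if signature[marker](haplotype[marker])]
    return len(matched), len(tested)


# Hypothetical sample at 67 markers: 460 and 481 are available, 462 and 560 are not.
sample = {"DYS460": 10, "DYS481": 21}
print(signature_matches(sample))

# DYS481 values among the 16 D type samples at 67 markers, as listed in the text.
dys481_counts = {"21": 13, "<21": 2, "22": 1}
below_22 = dys481_counts["21"] + dys481_counts["<21"]
print(f"{below_22} of {sum(dys481_counts.values())} D samples have 481<22")
```

Because each value can also arise by independent mutation in other clades, a match on one marker is a hint and not an assignment.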
DYS462 is a standard STR marker at Sorenson, and has been
available for years at Ysearch; 462 is now available at FTDNA with the 111
marker set. In Nov 2011 I noticed that
DYS560=18 is another marker for D type from the 111 set, but that is not
available at Ysearch (Nov 2011).
That DType.xls analysis file provides
SBP = 5.3%, although I did manual editing of the definition to improve SBP,
providing some selection bias. On the
other hand, isolation of D type is even better than indicated by SBP for two
reasons: Samples just beyond D type, steps 12 and 13, all have solid assignments to other
types. Most of the D samples have
462=12 and a few have 560=18, and those samples beyond step 11 with data have
other values at those 2 markers, so a future definition using all 111 markers
should provide even better (lower) SBP.
Only 3 D type have 111 markers;
most of the DYS462 data was obtained some time ago by purchasing that
marker separately.
D type seems to be Z280+ Z92-, based
on only 1 sample (10 Nov 2011 - columns BW and BX in that analysis file). Z92 is a new SNP, so not
much data is available; confirmation
should be available soon. D is a
subtype of what I had been calling K type; I’m now using K as a code for the paragroup defined by Z280*.
D type is clearly a Polish type: In the Polish Project 10 of the 16 D type at
67 markers indicate “Poland” ancestry;
the exceptions are 2 “Unknown” (one with an obvious Polish name and one
with a name that might be Polish), 2 Slovakia, 1 Germany, and 1 Czech Republic.
On Ysearch, there are 32 samples
below the D type cutoff, and 11 of them (34.4%) indicate Poland Origin, which
is quite high for Ysearch. SBP is 15%
on Ysearch, implying there are clades near the cutoff that
are rare in Poland; indeed none of the
5 samples in the gap at steps 9 and 10 indicate Poland. For details see the “Ysearch” sheet in
DType.xls.
Age (ASD sheet cell N12) comes out
1,385 years using all 67 markers. Old
human Y-DNA clades have age older than the raw ASD calculation because of
population bottlenecks and because of other statistical adjustments. However, D type is not very old, so this
correction may not be needed. On the
far right of that ASD sheet I sorted markers by age, and I added notes about
problem values, and suggested four markers that should be masked out, but the
age with these 4 masked out (ASD sheet cell N29) is not much different, 1,216
years. I see evidence of subclades, so
D type might be composed of younger subclades that might be identified with
more data.
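For orientation, the raw ASD (average squared distance) calculation in its generic textbook form looks like the sketch below. It is not a reproduction of the ASD sheet in DType.xls: the single average mutation rate, the 25 years per generation, the marker names, and the sample values are all assumptions for illustration, and no bottleneck or other correction is applied.

```python
def asd_age_years(haplotypes, modal, mutation_rate, years_per_generation):
    """Raw ASD age: mean squared distance from the modal divided by the mutation rate."""
    squared = [
        (haplotype[marker] - modal[marker]) ** 2
        for haplotype in haplotypes
        for marker in modal
    ]
    asd = sum(squared) / len(squared)  # average over all samples and markers
    generations = asd / mutation_rate  # mutation rate per marker per generation
    return generations * years_per_generation


# Hypothetical data: 4 samples, 3 markers, modal taken as the founder haplotype.
modal = {"m1": 10, "m2": 12, "m3": 14}
haplotypes = [
    {"m1": 10, "m2": 12, "m3": 14},
    {"m1": 11, "m2": 12, "m3": 14},
    {"m1": 10, "m2": 12, "m3": 13},
    {"m1": 10, "m2": 12, "m3": 14},
]

age = asd_age_years(haplotypes, modal, mutation_rate=0.002, years_per_generation=25)
print(f"raw ASD age: {age:.0f} years")
```

Masking a marker, as discussed above, corresponds to leaving it out of `modal` so that it does not contribute to the average.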
I noted three markers (on the far
right of the ASD sheet) that I consider hints for subclades. Last year in this topic I mentioned Da, with
the signature (458,576,444) = (16,20,14) and that still looks promising, but not
convincing. One of the three D samples
with 111 markers fits Da, and provides a hint that markers 463 and 715 from the
111 extension might help to resolve Da, so it will be interesting to see what
happens as more D men order the 111 extension.
E. V.
Rudich entered a modal for this cluster into Ysearch as ID mW7DP,
named “North Eurasian”. Mayka modified
it slightly for the modal used here by me, GNYBG,
named “Belarus”. It’s an excellent
type; on 25 May it has 16 samples at 67
markers in the Polish Project, with SBP = 14%.
FH Clade. F and H types were suggested by Mayka. They have the signature
(439,511,452) = (11,11,28). They differ
from each other, so I could not make a combined FH type. I can make a reasonable FH cluster,
but it is not necessary, since the FH clade can be better defined as the
combination of the three types Fa, Fb, and H.
The original F type (introduced Jun 2010) was split into Fa and Fb in
Dec 2010. DYS452 is not one of the FTDNA standard markers, so not many Polish
Project members have this marker evaluated.
Mayka and I helped most of the Polish Project members in FH, and members
just beyond FH, to get 452 evaluated.
Samples beyond FH have 452=30.
My analysis files do not use 452 for
determination of SBP.
452 would not significantly lower SBP because most of the background near the cutoff for
each type are samples from the other two.
In other words, Fa, Fb, and H are very well isolated from the rest of
R1a, but not so well isolated from each other.
These three FH types do not seem to be specifically concentrated in
Poland (per Ysearch) although they are concentrated in Slavic countries
including Poland. All three types seem
quite young, with relatively low STR variance (see the ASD sheets in the
analysis files).
FH Borderline. The borderline samples from Fa, Fb, and H are combined into
a single FH Borderline category in the Polish Project, because these clearly
belong to the FH clade but have less than 80% probability of belonging to any
one of the 3 types.
Fa.
Ysearch YQ6D2. 66 markers, cutoff 9, gap 2. SBP = 27%.
See FH clade, above.
Fb.
Ysearch EFQM7. 56 markers, cutoff 5, gap 4. SBP = 23%.
These samples were the original F type, before Fa was split off. See FH clade, above.
H.
Ysearch 559EE. 58 markers, cutoff 7, gap 3. SBP = 14.5%. See FH clade, above.
G. This
type was suggested to me by Mayka, who calls it the
Pomeranian cluster. Pomerania is the name of the
region on the south shore of the Baltic Sea including regions of both Germany
and Poland. Marcin
Wozniak found the G modal haplotype (at 12 markers) to be very common among
Kashubians. Kashubians consider
themselves an ethnic group or nationality within Poland. It will be interesting to determine if
Kashubians in Poland have a higher % concentration of G type than German
Pomeranians. Meanwhile, “Pomeranian” is
a convenient neutral name, suggests Mayka.
G type is mentioned only briefly in
my publication because not much data was
available to me at that time. My GType.xls update
analysis file with June 2010 data has excellent results: There are 12 samples in a nice type with SBP
= 11.2%. There is preliminary evidence
of a subtype, Ga, SBP = 23%, but with only 4 samples I did not enter a modal in
Ysearch; see Haplotypes.xls
for a list including hypothetical working modals.
11 Jan 2011 news: Mayka informs me that one of the new SNPs, L365,
is positive for all 5 G type samples tested so far. A few samples from other types all tested
negative for L365. It seems like G type
is included in the new haplogroup defined by L365. One of those 5 is in that tentative Ga subtype.
Of course, this is very
preliminary. It is possible, if
unlikely, that some of the G type samples still might turn out negative for
L365. It is quite possible other
samples not matching G type might be found L365 positive. I’ll provide updates here.
Those 5 samples are positive for
M417, negative for M458, and negative for a few other new SNPs.
L365 is one of a few new SNPs that
look like they will receive the notation R1a1a1x, where x = i, j, k, etc.
This type should not be confused with another G type in the N haplogroup.
14 May 2011 comment:
Sorry I have not taken the time to update this G type topic. Recent data continues to verify that G type
seems the same as the haplogroup defined by L365, now called R1a1a1i.
I.
Minor edits 5 Aug 2011. Complete
rewrite 4 Aug 2011. Based on 2 Aug 2011
Polish Project data. Three analysis
files: IType.xls; IaType.xls; IbType.xls.
I type is discussed in my publication, Part II, page 178.
On Ysearch, I type is concentrated in
Poland and in other Eastern European countries.
On 28 Jun 2011 Lukasz Lapinski suggested two small clusters based on recent I
Borderline samples. These are currently
called Ia and Ib types in
the Polish Project. Ia and Ib are
probably not really subtypes of I, as discussed in the following paragraphs.
I type seems to have structure. Some of the 67 STR markers are bimodal, which hints at subtypes. The bimodal markers are not correlated with each other, so I have
not been able to identify subtypes with confidence.
My published 2009 definition for I type, I59, uses 59 of the 67 STR
markers, cutoff 8. That definition still
works quite well, with SBP 17.8% (Aug 2011).
I consider SBP <20% sufficient to use the term type. I found a better definition, I62, cutoff
9, SBP 12.3%. The two definitions
are compared in the file IType.xls.
That 2009 definition had 22.4% SBP in 2009, so it did not quite qualify
as a type back then. (Background means foreign samples with matching STRs that
do not belong to the hypothetical I type clade; SBP is a high confidence
statistical limit estimate.) Six of the
24 using that old definition are excluded by the new definition; if the latter is exactly valid that means
background was actually 25%, which is close.
The new SBP with the old definition is 17.8%, which is lower than 25%,
but I’m comfortable with this because most of my published SBP’s have been
shown to be larger than subsequent new data, as intended. The new definition also captures two samples
that were previously borderline, one of which was classified I type anyway
because that sample has close matches in I type. The new definition captures an A type
sample; that sample is a good fit to A
type; this false call is not
incompatible with the 12.3% SBP, which predicts fewer than 3 background samples
(12.3% of 20). More about A type in a
paragraph below.
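The two background figures in this paragraph follow from simple arithmetic, shown here as a check. The numbers are those quoted above for the old (I59) and new (I62) definitions; the variable names are illustrative.

```python
# Old definition (I59): 24 samples captured, 6 of them excluded by the new definition.
old_captured = 24
excluded_by_new = 6
implied_background = excluded_by_new / old_captured
print(f"implied background for the old definition: {implied_background:.0%}")

# New definition (I62): SBP of 12.3% applied to the 20 captured samples.
sbp_new = 0.123
new_captured = 20
expected_background = sbp_new * new_captured
print(f"background allowed by SBP: {expected_background:.2f} samples")
```

One false call out of 20 is therefore well within what the SBP allows.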
The new I type definition lacks
breadth - changing the number of markers increases SBP. This is displayed in IType.xls as columns
for different marker sets. For such
analysis, the database needs to be restricted to the samples with step not too
far beyond the cutoff. For I type the
ranking of markers is sensitive to exactly where the database is truncated, so
the automatic definition comes out differently for different truncation of the
database. For the database in the
Calculator sheet I truncated the database by removing samples at step > 13,
except I left in two samples at steps 14 and 15 that had been classified Ib and
IB (discussed below). The definition
for I type is also sensitive to exactly which markers are assumed for the first
iteration as the type. The TypeRank
sheet in IType.xls uses the 19 I type samples, excluding only that one A
type sample. I tried quite a few other database
truncations, and various assumed sets;
those yielded different definitions with higher SBP. My published SBP formula is defined in a way
that provides a larger number to compensate in part for such selection bias.
On the other hand, for the dozen or
so samples that fit I type best, step < 7, the database and the number of
markers do not matter; the same dozen
or so samples are captured as I type for any reasonable definition using a wide
breadth of markers. We can be confident
that there is a valid clade corresponding to those dozen
best I type samples that will some day be captured as a haplogroup by a new SNP. Beyond those best dozen
samples, steps 7 to 8, there are another dozen or so samples that seem to be I
type but at lower confidence; the
background might be significantly more than the best fit SBP. In my publication I explain why background
increases very rapidly with step. I
suppose the actual percent of background might vary from maybe about 1% at step
2, to maybe about 40% at step 8.
What does this mean? The simplest explanation: There was a “father” haplogroup thousands of
years ago. Due to population bottlenecks, only a small
number of the males from that father haplogroup are MRCA’s
(ancestors of clades that exist today).
The descendants of the I type MRCA participated in a significant
population expansion. I type is the
only large clade from that haplospace neighborhood
showing up today in the Polish Project.
Other smaller “brother” clades show up, and because there are many more
haplotypes at larger step values, those brothers are randomly distributed at
large steps in my I type analysis. This
is a simple explanation; more complex
explanations are possible - for example involving migration of tribes from
distant lands.
IB are
Borderline, at step just beyond the cutoff for I
type, not fitting any other known type, with only about 50% confidence that
they will someday end up in a haplogroup corresponding to I type. Samples are also assigned to I Borderline
when the nearest matches at 67 markers are I type. There are two samples at step 10 (new definition) now changed
from I type (old definition) to IB using the new definition. There are 4 more prior IB samples at steps
12 to 15 now changed to K and KB. The
next update of the Results Table will
show slightly smaller totals in I and IB.
As 67 marker data accumulates in the
near future, it is likely a slightly better definition may turn up with even
lower SBP, and I type may separate into subtypes with <20% SBP. The 111 marker data is promising (discussed
in a following paragraph).
A clade that is very well isolated
(<5% SBP) has a high chance of soon being defined by a newly discovered SNP
haplogroup. For I type with 12.3% SBP,
a new SNP might be older, including some small older clades, or a new SNP might
be younger, leaving out some marginal I type small clades. For example, I recently discovered a new SNP
in my own Y-DNA that is slightly older than my predicted type - see L540.
My maternal
Iwanowicz grandfather was I type.
This explains my extra effort analyzing I type. The two Iwanowicz samples are my maternal
first cousin and a man that I found in Poland who seems to be my 4th or 5th
cousin. Technically, one of those
should be removed for slightly higher SBP because I recruited that data, but
the bias for 20 samples is small (SBP becomes 13.0%).
One of the Iwanowicz samples was removed
for the Results Table, along with
editing of family sets in other categories.
SBP for Ia and Ib
are 11.9% and 17.0%. The definitions
have breadth. These are good results,
providing better than 80% confidence of validity for
each. However, these all fall outside I
type with my new definition. Even with
my old definition, only 4 of these were I type at high step, the rest were IB. Using an I code was a bit arbitrary. Now is not a good time to change their code
names, because quite a few new SNPs will soon be available. With more SNP data small types such as these
can soon be renamed with more confidence.
Back in 2009, and still today, A type
overlaps with I type at the margin. So
does the newer D type.
However, A type is coming out positive for the new haplogroup based on
the L342 mutation, which seems to be rare in Poland. Mayka informs me that a WTY
for one I type sample has come up L342-, as have two D type samples. In the past, I have always speculated that A
type and I type are both subtypes of a larger K type. It now seems A type is really in a distantly
related branch (L342) of the Y-DNA tree with similar STR values by
coincidence. My prediction that I type
is a subtype of K type is still a low confidence speculation.
The best ranked marker for I type is DYS578=9. DYS578 has the second slowest mutation rate
of the 67 standard markers per the Chandler rates. The ancestral value is 8. The 9’s are colored orange in that analysis
file IType.xls. From the 450 Polish
Project samples at 67 markers, only 6 samples outside I type have the 9 value,
one sample has a 7, the remainder are all 8, consistent with very few
independent mutations. In the analysis
file, notice that all the predicted I type samples have the 9 value with one
exception, that A type (discussed above) at the last step of I has the
ancestral 578=8. There are two A type
with 578=9 at steps 11 and 12; the
former has been tested L342+ (coded SNP results are in column BX of the
file). All the other A’s have 578=8, so
the obvious interpretation is an independent mutation to 9 within the A type
clade. The only other 9 in that
analysis file is an IB sample at step 12;
that one might be another independent mutation; on the other hand, perhaps the mutation to 9
is much older than the TMRCA for I type, with that one sample representing a
very small clade with an older node. The Ia and Ib samples all have the ancestral
value 8; that’s evidence that Ia and Ib
have old nodes with I - older than the 8 to 9 mutation.
The second best marker is DYS458=14,
again orange in the file. This is a
rapid mutator, so there is more variance.
All but 2 of the I type samples with 578=9 have this 14 value. This is evidence of youth for I type. Those two, at 15 and 16, are probably
independent mutations, although we cannot rule out the speculation that the 15
is the ancestral value telling us that the 458 mutation to 14 came after the
578 mutation.
Only 8 I type samples have 111 STR
marker data and 2 of those are my Iwanowicz samples, so analysis at 111 is
premature. That said, all but 1 of the
8 have DYS532=12; that one
exception has 11. Value 11 also shows
up for the one Ia sample, and for the two IB samples at 111 markers. DYS532 seems slow, but there are quite a few
11’s and 12’s in the 71 R1a samples at 111, so 532 will not displace 578 as the
best marker for I type. Lapinski
pointed out to me that a couple other markers also show promise at 111 markers
for I type.
[Note inserted on 14 Sep 2011: There are now 9 I type samples and 7 of them
have the signature (532,504) = (12,14).
All other R1a samples have the modal (532,504) = (11,>14). This is evidence that the I type node in the
R1a tree is not much older than the M458 mutation. DYS532 and DYS504 are two of the new 44 markers in the extension
from 67 to 111 markers. I'll call this
pair of values the signature for a hypothetical IPN clade. This is not strong evidence, because there
is a small chance those 2 mutations happened twice independently - in the M458
clade and in the I type clade. The two
exception samples were previously classified Ia and IB, so they might be from
branches older than the signature mutations.
I need to update my analysis to include these 2 markers, and update this
I type topic. I’ll be busy with other
things for a few months, so I added this note.]
I modified the Ysearch I type
definition, EKVHX
for the new I62. I type has no samples
at the step 9 cutoff in the Polish Project;
on Ysearch there is only one Russian sample at step 9 (plus a couple
modals), so I type is also well isolated on Ysearch, not just in Poland.
All 67 markers can be used for
estimating the age of I type, because there are no significant recLOH
problems with the compound markers in the I type data. Age comes out 1,208 years. See the ASD sheet in IType.xls. Raw ASD age is usually adjusted older due to
population bottlenecks, as explained in my publication, but the adjustment
should be small for I type because it is not very old and because I type
obviously went through a population expansion.
ASD age is highly uncertain due to caveats.
End
of 5 Aug 2011 rewrite of I Type.
Reminder: most of this web page
has not been updated for quite a few months.
J. This type was suggested by Mayka. Only 6 members in the Polish Project, but
this type is well isolated at SBP= 13%.
K. News
10 Nov 2011: It’s looking like what I
have been calling K type will end up equivalent to the paragroup Z280*.
In early October, ISOGG
officially recognized Z280 as R1a1a1i.
I’m awaiting results from the new SNP
Z92, particularly for I type samples, so that I can construct a new definition
for the K paragroup without Z92.
In the past I considered A type a subtype of K, but A type turned out R1a1a1h (Z280-
Z93+). This change has no effect on
assignment of individual samples to A type, which has very low SBP and hence
very high confidence of being a valid clade.
The following paragraphs in this K
topic were written more than a year ago.
My next priority is to rewrite the following with more detail:
This seems to be a main R1a1a
type. K type is discussed at length in
my publication, Part II. It is larger than others in the Slavic
lands. P and N (below) are just as
close in STR values to K as they are to each other, probably
because the K modal haplotype is the same as the R1a1 modal haplotype (using
the best 34 markers for K). So far I
have discerned a few subtypes of K in my List of R1a types, but
I do not have high confidence that they are all exact subtypes of K, as
explained in my K Borderline discussion. I suppose that as data accumulates more
subtypes will become clear within K and K Borderline.
In the Results I use K* to signify those samples
that match type K but do not match one of the subtypes. Although I have high overall confidence in
the validity of K type, individual assignments to K* are not as confident. Because K is located at the modal heart of
R1a, I expect some outlier samples
from distantly related clades to match K* fairly closely just due to the
statistics of random STR mutations.
Because of the possibility of foreign outliers, I consider samples at K
step 3 to be K Borderline, even though the cutoff for the
K definition is 4.
Even K* samples with step <3 have confidence of only 80 to 90%. That’s in Poland, where K is fairly well
defined with SBP = 26%. Worldwide K*
cannot be discerned with confidence.
The Ysearch SBP for K is 71%, not significant. That means there are K borderline clades close to the K cutoff
that are rare in Poland but causing interference on Ysearch. This is evident by a glance at the K type
results on Ysearch, where “Poland” origin is concentrated at steps <3, and
“Poland” becomes progressively less common at higher steps.
The Kurgans are the ones
who domesticated the horse more than 6,000 years ago. Many scientists think that one pre-Kurgan man is the male line ancestor
of all R1a1 men who live today. The
Kurgan hypothesis is controversial, and not necessary for this web page. You may have noticed that I used the letters
of “Kurgan” for my original types and categories during 2008.
Kt, Ku, Kx. New small
clusters need documentation here.
I have been using the subscripts “z”,
“y”, “x”, etc backwards through the alphabet because I am running out of
letters for new clusters and types.
These small hypothetical clades seem to be subclades of K, although I do
not have high confidence about the subclade status.
Kw. New small haplogroup,
equivalent to the new SNP L366. Need
documentation here.
Ky. Update 7
Oct 2011: Based on 1 Oct 2011 Polish Project data.
Analysis file: KyType.xls. Ysearch BBB9T.
Ky type was suggested to me by Mayka on 21 Dec 2010. There were only 3 samples in Ky last
year; now there are 5.
That KyType.xls file demonstrates
that the same 5 samples are extracted using any number of markers from 11 to
67, although at some of those definitions one or two other samples are also
extracted. The full 67 markers work
best, SBP=23%.
Ky was more isolated last year; a few samples showed up in the gap,
reducing SBP.
I’m using a hand edited definition,
Ky63, using 63 markers, for the following reasons:
Ky is unusual in that 4 of the 5
samples have an unusual value for at least one marker. I highlighted these values in red in that
file. Notice also the high step values
for those four, 8 through 11, using all 67 markers (column BX), although SBP
came out 23%, which is an excellent low result for 67 markers. The obvious (but speculative)
interpretation: each of the 5 samples
seems to be a representative of a branch of this hypothetical clade, where each
of the 5 branches has a node not much younger than the TMRCA.
Hand editing like this does introduce
some selection bias, so the calculated SBP=13.6% for Ky63 is misleading. Countering the selection bias, some if not
all of those 4 markers that I masked out might represent small tribal sized
subclades, so future prediction of new Ky samples should work better using Ky63
with those 4 removed.
The far right of the “ASD” sheet has
the markers sorted by apparent age, with “M” indicating the markers that I
masked out. You can see that my
selection is a bit arbitrary; I could
have masked fewer than 4, or more than 4.
ASD age using all 67 markers comes
out 917 years, cell N12. ASD age using
the 63 markers not masked out comes out 878 years, cell N29, not much
less. ASD age has a number of caveats, and 4 samples are not significant, so
this age is highly uncertain. Ky seems
young, as haplogroups go.
Ky does not have a prominent signature.
Kz. Update 5 Oct 2011: Based on 1 Oct 2011 Polish
Project data. Analysis file: KzType.xls. Ysearch 9QJFQ.
Kz type was suggested to me by Mayka on 6 Oct 2010. Mayka speculates this might be a clade of Kazakh
origin. There were only 3 samples in Kz
last year; now there are 6.
That KzType.xls file demonstrates
that the same 6 samples are extracted using any number of markers from 2 to 67,
so the definition is not critical for this well isolated type.
Kz is effectively more isolated than
the SBP values (row 12 in that file) indicate, because the samples just beyond
Kz are all confidently assigned to other clades and types. For this reason, those SBP values are moot.
I’m using a hand edited definition,
Kz59, using 59 markers, for the following reasons:
Kz is unusual in that 5 of the 6
samples have an unusual value for at least 2 markers. I highlighted these values in red in that file. Notice also the high step values for those
6, 8 through 11, using all 67 markers (column BY), although SBP came out 27%,
which is an excellent low result for 67 markers. The obvious (but speculative) interpretation: each of the 6 samples seems to be a
representative of a branch of this hypothetical clade, where each of the 6
branches has a node not much younger than the TMRCA.
Hand editing like this does introduce
some selection bias, so the calculated SBP=10.7% for Kz59 is misleading (but
moot). Countering the selection bias,
many if not most of those 8 markers that I masked out might represent small
tribal sized subclades, so future prediction of new Kz samples should work
better using Kz59 with those 8 removed.
Again, this is moot, because any number of markers extract the same
samples.
The far right of the “ASD” sheet has
the markers sorted by apparent age, with “M” indicating the markers that I
masked out. You can see that my
selection is a bit arbitrary; I could
have masked fewer than 8, or more than 8.
ASD age using all 67 markers comes
out 724 years, cell N12. ASD age using
the 59 markers not masked out comes out 704 years, cell N29, not much
less. ASD age has a number of caveats, and 6 samples are not significant, so
this age is highly uncertain. Kz is
clearly young, as haplogroups go.
Additional information supplied to me
by Mayka: Three of the Kz type samples
are from non-Polish men who suspect they have Polish male line ancestry, so it
is not certain Kz type is Polish. Kit
number 152824 in Kz is from a man who purchased WTY and
found the new SNP L399, but that SNP appears to be private, restricted to his
family. Insofar as that man recruited 3
more Kz samples into the Polish Project, Kz seems proportionally twice as
large. My next edit of the Results Table will reduce the percent size
of Kz.
Kz has the prominent signature DYS459b=18.
Mayka points out the additional signature DYS461=12, not one of the 67
marker set; most of the samples in Kz
have been verified with this 12 value.
Since the Polish Project neighbors (step at or
beyond cutoff of Kz) are all assigned to other
hypothetical clades, we do not know if the signature markers define a larger
father clade.
L. This
cluster is highly hypothetical. It is
rare in Poland, but second in size to K in European R1a1. Larry Mayka suggested
this cluster to me. It is a well known
Scandinavian cluster. I checked
it briefly, and it seems to be a “type” by my definition. However, no Polish Project sample matches at
80% probability yet, so I am not yet using it for classification here. More documentation about L will be available
here when I find time to study it.
L260. See P type.
L342.2. New topic 30 Oct 2011. This SNP was recognized
as a new haplogroup by ISOGG during the summer of
2011. There was an L342 haplogroup
category at the Polish Project for a short time in the summer and fall of 2011,
but it has been replaced by Z93, because it seems all the L342.2+
samples are also Z93+ in the Polish Project.
Apparently there are very few men elsewhere in the world found to be
Z93+ L342.2-.
Z93 is a more reliable SNP than
L342.2, so it is recommended that men first test for Z93. L342.1 is the same mutation as L342.2,
discovered earlier in the E haplogroup.
L342.2 is equivalent to L319, L348, and L349, so all 4 SNP tests
together are more reliable. These 4
mutations are in the same segment, which is apparently a segment that mutates
relatively rapidly. Z93 is recommended
as the better test for R1a samples that do not fit STR definitions of other R1a
haplogroups; the Z93+ samples can do
the L342.2 test. This information about
L342.2 was supplied to me by Mayka.
The Z93 category has the samples that
do not fit the two known subdivisions: A type and L342T cluster (next topic).
L342T. New
topic 30 Oct 2011. Based on 26 Oct 2011
Polish Project data. Analysis file: L542TCluster.xls. I just noticed this cluster.
L342T is not a type,
because SBP did not come out low enough. However, I included this cluster discussion
here for the following reasons:
Seven samples at 67 markers fit my
new 48 marker definition for L342T.
There are 19 A type samples, which should all be in
the same L342.2 (Z93) haplogroup, but those A samples do not fit L342T; the closest A’s are at step
8, where the cutoff is 6. There are 5 more L342.2 (Z93) samples at 67 markers, and those 5
also do not fit L342T, falling at steps 11 through 21. In other words, L342T is well isolated from
the other L342.2 (Z93) samples, including the A type branch. The one background sample (STR values fit
the L342T definition) and the four samples beyond the cutoff, are assigned to K
type and to subtypes of K; Z280 has
recently become available for K type;
as those background samples get tested in the future for Z280, my L342T
cluster will start looking better. Let
me say that another way: a cluster
should be analyzed with data from its own haplogroup, so L342T should be
compared only to L342.2 (Z93) data. But
there is very little L342.2 (Z93) data available, so I used the full R1a
database in that xls file. That means
L342T is likely more isolated than it seems right now, so it is more likely to
correspond to a valid haplogroup.
Mayka pointed
out to me that some of the L342T samples have Tatar ancestors. That’s why I used the “T” in the code
name. Of course, Tatars may belong to
only a branch of L342T; I have no idea
what fraction of L342T in Poland are Tatar.
And of course Tatars are expected to be a mix of multiple haplogroups.
Three of the L342T samples, with the
name Muchla, are apparently a family set, so they count statistically as only
one sample, reducing the current count from 6 to 4, so SBP as calculated in
that xls file should be increased (not as good). This is evidence against L342T being valid.
N. I
have been rewriting this topic throughout the late summer of 2011. Finished 25 Sep 2011.
Based on 5 Aug 2011 Polish Project data.
Analysis file: NType.xls
N type is concentrated in Slavic
countries. N type is discussed in my publication, page 179.
According to Ysearch
and Yhrd N type seems to be spread all around the Slavic
lands and central Europe, common from East Germany to Russia. Within Poland N type seems to be about the
same size as P type, both about 9% of men. Worldwide, N is much larger than P. N type should be properly studied in a
database that is not restricted to Poland.
However, there seem to be subtypes of N that are concentrated in
Poland. See the discussions on N
subtypes below. I’ll continue to watch
the Polish Project, because it will be interesting if more data provide more
Polish subtypes within N.
During review of my publication in
2009, the SNP called M458 was
published. I added notes about this to
my publication on page 184. The
corresponding haplogroup is now called R1a1a1g. This
haplogroup seems to be equivalent to what I have been calling P type (M458+ L260+) plus N type (M458+ L260-). M458+ samples may turn up someday that do
not fit either N type or P type, but I have not noticed any yet.
My current definition
for N type, N46, is a modal haplotype using 46 of
the 67 standard markers. The cutoff is 8, which
means all samples less than step (genetic distance) 8 from
N46 are predicted N type (predicted M458+ L260-). That definition is available in the NType.xls
analysis file, in my Haplotypes.xls
file, and at Ysearch as 3SEJK.
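The way a definition like N46 is applied can be sketched as follows. This is a minimal illustration, assuming that "step" is the sum of absolute differences from the modal haplotype over the markers used in the definition; my spreadsheets may treat multi-step mutations and compound markers differently. The toy modal values, the marker subset, and the toy cutoff are made up and are not the real N46 definition.

```python
def step(sample, modal, markers):
    """Genetic distance: sum of absolute differences from the modal
    over the markers included in the definition (an assumed convention)."""
    return sum(abs(sample[m] - modal[m]) for m in markers)

def predicted_in_type(sample, modal, markers, cutoff):
    """A sample is predicted to belong to the type if its step is below the cutoff."""
    return step(sample, modal, markers) < cutoff

# Toy definition: a modal haplotype and the subset of markers it uses.
TOY_MODAL = {"DYS393": 13, "DYS390": 25, "DYS19": 16, "DYS391": 10}
TOY_MARKERS = ["DYS393", "DYS390", "DYS19"]  # DYS391 is left out of the definition
TOY_CUTOFF = 3

sample_near = {"DYS393": 13, "DYS390": 24, "DYS19": 16, "DYS391": 11}
sample_far = {"DYS393": 14, "DYS390": 23, "DYS19": 15, "DYS391": 10}

near_step = step(sample_near, TOY_MODAL, TOY_MARKERS)
far_step = step(sample_far, TOY_MODAL, TOY_MARKERS)
```

Markers left out of the definition (DYS391 in the toy case) do not contribute to the step, which is the point of using 46 of the 67 markers.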
N type age (age means TMRCA)
is about 2,000 years. That’s highly
uncertain, but I’m 80% confident that age of 2,000
years is not off by more than a factor of 2 - age 1,000 to 4,000 years. The M458 mutation is likely much older than
the age of N type.
I’m suspicious that N type includes
many younger clades that just happen to have similar STR
values, difficult to resolve into clusters or types. I offer some speculation
along these lines in the hypothetical subclade topics below.
I highly recommend that someone from
N type purchase WTY, a commercial product for discovering
SNPs. No sample from N type has been
submitted for WTY. That means there is
a good chance that the first N man to submit his sample to WTY will discover
one or more SNPs - perhaps an SNP that captures all of N type - or perhaps an
SNP that captures about half of N type - or perhaps an SNP that captures a
small subclade - or perhaps multiple such SNPs. My WTY was the first in a long time
in my haplogroup, so I found 14 new SNPs.
It’s interesting to wonder why
R1a1a1g seems to be composed of two types that differ
substantially in STR values (N and P are separated in haplospace). I speculate about this in the P type
topic. Much of my P type discussion is
also related to N type, so I avoided repeating all the details here; please read my P type
discussion if you are interested in more about N type.
N seems to be older than P. I wonder if there are subtypes of N about
the same age as P. I avoid too much
speculation in this web page - just enough to indicate my motivation. I’m wondering if there are clades in various
haplogroups, mostly P and N, associated with the origin of the Polish nation -
a few centuries more than a millennium ago.
I have only identified 4 small
subclades of N so far: I am quite
confident of Ng type, but less confident of N-Ashk type. The Nt and Ns clusters are
hypothetical; I have about 70% confidence in them.
These 4 are used for assignments at the Polish Project web page. I also identified a few clusters with
roughly 50% confidence; these are too
speculative for formal assignments. All
are discussed below. I made speculative assignments based on all
these types and clusters within N type, in column CD of that file NType.xls,
Calculator sheet. My file NClusterAssignments.xls
has lots of details. If you are N type,
you can find your row with your kit number, and see your speculative
assignment. For the “clusters”, I
estimate a 50-50 chance an assignment will need to be changed in the next year
or so, as more data becomes available.
In addition, N type has many bimodal markers, hints at yet more subclades not discussed
here. This is evidence that N type
experienced population expansion when it was young (not long after the TMRCA). More
discussion below.
The paragraphs up to here are a brief
summary. The rest of this topic is a
detailed discussion about N type and hypothetical subclades:
This Sep 2011 analysis includes only
data from the Polish Project. I’ll wait
a few months before reviewing data outside the Polish Project. My last analysis including data from outside
the Polish Project for P type, N type, L260, and M458 was Jan 2011. For those last results, see the following
topics, which have not been updated for several months:
For the size of N type, please see
the table at the top of this page, where N has only 4 more samples than P (87
vs 83 - 5 Aug 2011 data). In my 2009
publication N had one less than P (28 vs 29, Table 6 page 169). The 70% confidence interval for 87 samples
is 77 to 98 (8.4% to 10.6%) so N and P are equal in the Polish Project (and by
implication in Poland) within statistical sampling accuracy, at about 9%.
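For readers who want to check sampling intervals of this kind, here is a rough sketch using the normal approximation to the binomial. The method is an assumption chosen for simplicity, not necessarily the one behind the numbers above; it gives roughly 78 to 96 for 87 samples out of 918, a little narrower on the high side than the 77 to 98 quoted, but the conclusion (N and P are equal within sampling accuracy) is the same.

```python
from math import sqrt
from statistics import NormalDist

def count_interval(count, total, confidence=0.70):
    """Approximate two-sided confidence interval for a sample count,
    using the normal approximation to the binomial distribution."""
    p = count / total
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.04 for 70%
    half_width = z * sqrt(total * p * (1 - p))
    return count - half_width, count + half_width

# N type in the table at the top of this page: 87 of 918 samples.
low, high = count_interval(87, 918)
low_pct, high_pct = 100 * low / 918, 100 * high / 918
```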
My 2009 published definition for N
type, N45, still works very well. I did
not change that definition at my Jan 2011 update and analysis here in this
topic. This Sep 2011 N46 update is just
a tweak, adding and subtracting a few markers to better fit the M458+ L260- SNP
data that has accumulated over the past year.
Both definitions are compared in that analysis file NType.xls,
Calculator sheet, columns BZ to CC.
Tweaking the definition like this, to
better fit SNP data, introduces some selection bias. I discuss this issue in the P type topic,
where I did a similar tweak; please
read that topic if you are interested in the statistical justification. The justification is not as good for N type,
so I’ll return to this issue in the “old branches” paragraph below.
This new N46 definition fails to
capture only one M458+ sample, which falls at the cutoff step 8. This new N46 definition captures only one
foreigner, L540+, at step 7, the last step of the type. The other samples at step 8 have tested
either M458- or L260+, except one that fits D type well, so they are all
confirmed as not N type. Similarly, 7
of the 20 samples at step 9 have been SNP tested, 11 of the 20 are good fits to
other types, with only 2 that are Borderline fits to other types. In other words, the N46 definition captures
the M458+ L260- samples with apparent 98% accuracy. However, my confidence is about 80% for
step 7, about 90% for step 6, and 95% or better for step <6. Again, please see the P type discussion
about confidence for a general explanation.
P and N are similar in this regard.
I have related discussion about N type confidence in the “old branches”
paragraph below.
Almost all the samples near the
cutoff for the previous N45 definition have been SNP tested. This high testing rate is not a
coincidence; Mayka
and I have been encouraging men with marginal
samples to do the M458 and L260 tests.
(We paid if cost was a problem.)
The NType.xls analysis file has 10
columns (CF to CO in the Calculator sheet) using from 2 to 67 markers as
tentative N type definitions, with automatic selection of the best
markers. For each column, I colored the
step count violet for samples captured by that definition. You can see at a glance that any definition
using 2 to 67 markers captures more than 80% of the N type (M458+ L260-)
samples, and not many foreigners, so just about any definition works
surprisingly well. In other words, N
type is very well isolated in haplospace.
For the two best automatic
definitions, I used boldface to highlight the N type samples missed by that
definition, and also boldface to highlight the foreign samples captured by that
definition. I used boldface similarly
for my prior N45 definition, using 3 columns (BZ to CB) to demonstrate the
effect of 3 different cutoff choices.
You might try resorting the sheet by
column (select everything from cell A14 to the end) to better compare the
results.
The issue of SBP
is moot for N type now that the SNPs M458 and L260 are available, but an
analysis is instructive: That NType.xls
file has automatic marker selection of N type, and automatic calculation of
SBP, disregarding the SNP data. The
best automatic definition, N61, has SBP=13.2%, vs N46 with SBP=14.1%. However, N46 is a better definition because
N61 captures only 80 of the 87 N type plus that same one foreigner. But still, 8 misses out of 87 is not bad for
N61, better than the 13.2% SBP (SBP is a high estimate for statistical
confidence).
I considered calling N46 a definition for M458+ L260-, with a
different definition for N type as a slightly smaller subtype, leaving out some
samples that do not fit the N type definition with lowest SBP. I could not come up with a convincing
definition for such a smaller subtype.
So at least for now, I am considering N type as the same as M458+ L260-,
with the understanding that may change in the future.
The summary conclusion for all those
columns of trial definitions: My
preferred N46 definition (column CC) does the best job of capturing N type
(M458+ L260-). Most of the other
columns are trying to define N type as slightly smaller, leaving out a few of
the samples (not always the same samples).
Most definitions for N type have many samples at or near the
cutoff. My explanation is in the next
paragraph:
Old branches: By my definition, a type is a hypothetical
unique clade. Of
course, every clade is composed of subclades - branches in the Y-DNA tree. Here is a simple explanation for the
previous few paragraphs of discussion:
N type seems to have a few small old sub-clades, where the ancestors (MRCAs) of those small clades differed from the main N type
MRCA at a few STR values from the standard 67 set. Those old branches have many younger branches (twigs) that differ
at yet more STRs. In other words: the N tree might have a few small branches
near the ground. Those small old clades
provide samples in the database with large step, but each
sample is from a different twig, so these do not correlate into obvious
clusters. Any clade has statistical outliers with large step;
a few small old branches would provide more outliers for N.
Those old branches may not be small
world wide. One possibility - a large
subclade of N concentrated outside Poland might have one small branch in
Poland, corresponding to a man or tribe that moved to Poland long ago. I am watching for evidence along these
lines, but so far this paragraph is speculative.
In addition, there might be
additional large old subclades that seem young. I consider this possibility in the discussions below. The age of a clade can be much younger than
the node. I discuss this in another
topic, where I call such clades smooth branches. The N tree might have a number of small
smooth trunks with nodes near the ground - that would not necessarily be
evident as STR correlations. On the
other hand, the N tree might have only one main trunk, almost smooth, with only
few small branches near the ground. The
actual situation might be more complicated, with multiple trunks of various
sizes, at various distances from the ground.
I can’t tell yet from the STR data.
Perhaps another year of additional STR data may help.
Why am I speculating about N type
smooth branches? I see plenty of hints
for more branches in the N type data, but little statistical confirmation. In the discussion below for subclades, I
offer evidence (not definitive proof) for many more significant sub clades
within N type.
This discussion is personal. It is my opinion, based on my statistical
analysis. Someone might send me an
email any day now pointing out a convincing cluster or type in N that I
missed. Someone else might disagree
with my analysis about particular hypothetical N subclades.
Reminder: This discussion is limited to Poland, as represented by the
Polish Project. Outside Poland there is
additional probability of M458 branches showing up someday that fit neither N
type nor P type. Outside Poland I
expect yet more N type branches.
Regarding concentration in Poland, I
use percent of samples in Ysearch with “Origin” Poland as an objective
measure. This is discussed in my
publication, where Table 1 shows P12 (the P type modal haplotype using only the
original standard 12 markers from the Polish Project) with 42%, while N12 has
only 14%. Those numbers 42% vs 14% are
not calibrated (because of the unknown concentration of men with Poland origin
in Ysearch) but those numbers are a relative indication of a type being concentrated in
Poland vs not particularly concentrated in Poland. My file NYsearch.xls
has an update with data from 5 Aug 2011, with N12 at 17%, a reasonable drift
due to more data. That same file has
the N46 definition at 24%. This is
evidence that N type, defined using 46 of 67 markers, is only slightly more
concentrated in Poland than the 12 marker equivalent. The simplest explanation:
There are probably large M458 clades outside Poland that match N12 and
also match N46 at less than the cutoff, but the Polish samples are only twigs
on those branches, descended from one man or family or tribe that moved to
Poland a millennium or so ago. It makes
sense that clades within M458 might be regionally concentrated. That 24% concentration for N46 is of course
an average; there are subclades of N
with higher and lower concentration. I
found a few, discussed below; that file
NYsearch.xls has a sheet for each subclade analysis.
Age:
N type comes out 2,340 years old using all 67 markers. See cell N12 in the ASD sheet in NType.xls.
Because of recLOH
issues, the compound markers 464, YCA, and CDY present difficulties estimating
age in the N type data. Other compound
markers are OK. The ASD sheet allows a
mask, row 21, where I masked out the 8 markers for these recLOH
difficulties. The result, using 59
markers, cell N29, is 2,010 years.
That’s my best guess for the age.
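The idea of an ASD age with masked markers can be sketched like this. It is a simplified illustration under stated assumptions, not a copy of the ASD sheet: ASD for a marker is taken as the mean squared difference from the modal value, the age in generations is the summed ASD divided by the summed mutation rate over the unmasked markers, and 25 years per generation is assumed. The marker names, mutation rates, and sample values are made up.

```python
def asd_age_years(samples, modal, rates, masked=(), years_per_generation=25):
    """Rough ASD age: summed average squared distance from the modal,
    divided by the summed mutation rate, over markers that are not masked."""
    markers = [m for m in modal if m not in masked]
    n = len(samples)
    total_asd = 0.0
    for m in markers:
        # Average squared distance from the modal value at this marker.
        total_asd += sum((s[m] - modal[m]) ** 2 for s in samples) / n
    total_rate = sum(rates[m] for m in markers)
    return years_per_generation * total_asd / total_rate

# Toy data: three markers, four samples (all values hypothetical).
modal = {"A": 10, "B": 12, "C": 20}
rates = {"A": 0.002, "B": 0.002, "C": 0.004}  # mutations per generation
samples = [
    {"A": 10, "B": 12, "C": 20},
    {"A": 11, "B": 12, "C": 20},
    {"A": 10, "B": 12, "C": 22},  # a 2-step jump, e.g. a recLOH-like event
    {"A": 10, "B": 12, "C": 20},
]

age_all = asd_age_years(samples, modal, rates)
age_masked = asd_age_years(samples, modal, rates, masked=("C",))
```

In the toy data the single 2-step value at marker C dominates the result, so masking C lowers the age considerably. That is the same effect, in miniature, as masking the compound markers with recLOH difficulties.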
On the far right of the ASD sheet I
sorted the markers by apparent age.
YCAb comes out 20,704 years, demonstrating the recLOH problem.
The second (apparently) oldest marker
is DYS454, at 18,744 years. This old
age is due to only 5 mutations in this slow mutator. DYS454 is clearly bimodal. In my notes, I use the Nj code for the 2nd
mode with these 5 samples, defined by 454>11. This is evidence of a subclade, but the statistics are not
convincing yet. Maybe with more data in
the near future I might call some of these samples the Nj cluster. It’s not fair to exclude this “old” marker,
DYS454, because there are 7 markers with zero age (no mutations in the 87
samples) and there are 7 more markers with less than 1,000 years apparent
age. The reason for averaging markers
is that apparently old markers should be averaged out with apparently young
markers. Anyway, you can go ahead and
mask out DYS454 by deleting the mask number at cell AE21, and the new age (58
markers) without 454 is 1,990 years, only a 20 year decrease. I offer this paragraph of discussion as one
example of preliminary evidence of an N type subclade, based on the second mode at DYS454.
The third oldest marker is DYS531, at
14,319 years; at this bimodal marker I
use the code Np for the 2nd mode value.
Again, I’m waiting for more statistical evidence for a subclade.
That far right side of the ASD sheet
has more notes about markers with old apparent age.
Age estimation from STR variance is
highly uncertain. At another of my web
pages, I use M458 as an example of age caveats. I have more discussion about age estimation
methods in the P type topic; please read those two topics if you would like more
discussion; N is similar to P in this
regard.
I’m not too concerned about getting
the age of N type correct in Polish data because I suspect in less than a year
there will be enough evidence to subdivide N - new SNPs and / or more STR data
for better statistical significance. I
suspect there will be younger subclades.
Furthermore, M458+ L260- is not really a tree; it seems to be a branch of the Y-DNA tree that is well isolated - a
long smooth segment near the node; but
I mentioned above my suspicion that the main branch might not be really smooth
- there might be significant old branches concentrated outside Poland; if this is true I’ll need to soon redefine N
type as younger, excluding any such significant branches. I’ll leave it for someone else to estimate
the age of M458+ L260- from worldwide data;
I’ll concentrate on N type, and hypothetical sub clades in Poland.
There are 12 samples from N type
available with the new 111 STR marker set (18 Jul 2010). Only DYS532=12 is an obvious signature
marker for N type from the 44 new markers;
10 of the 12 have this value.
Modal for R1a is 532=11. P type
also has the 532=12 value, also 10 of 12 samples, so this marker also provides
a signature for M458 with good statistical significance. I type also has the 532=12 value; see the I type
discussion below.
The following topics are my proposed
subclades for N type in the Polish Project.
Please consider reading the section P Type Bimodal
Markers, if you would like more discussion of how I use bimodal markers as
hints for subclades; that same
discussion applies here for N type. If
you are curious about my code names, like Na, Nb, etc, check out Haplotypes.xls. Near the bottom of the “Haplotypes” sheet is
a list of 70 code names for signatures that I
considered for N type subdivision. I
discuss only a few of these here. I
spent a lot of time studying tentative subclades of N because I’m anxious to find
significant subtypes that are concentrated in Poland. I uploaded a total of 17 Excel analysis files associated
with N and tentative subclades, all discussed above and below.
Ng. Rewrite finished 22 Sep 2011.
Based on 5 Aug 2011 Polish Project data. Analysis file: NgType.xls. Ng is a small subtype of N type, but it has
highest confidence.
This is a very small subtype, only 3
samples, but it is very well isolated.
The definition uses 56 markers, cutoff 4, gap 9. There are no samples in the gap, from step
4 to 12. SBP = 15.8%.
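The gap can be read directly off a list of step values. Here is a small sketch, assuming the gap is counted as the number of consecutive step values, starting at the cutoff, that contain no samples. The step values inside the type are made up; only the cutoff of 4 and the empty range 4 to 12 come from the text above.

```python
def gap_size(steps, cutoff):
    """Number of consecutive empty step values starting at the cutoff.
    Returns None if no sample falls at or beyond the cutoff."""
    beyond = [s for s in steps if s >= cutoff]
    if not beyond:
        return None
    return min(beyond) - cutoff

# Three samples inside the type (steps below 4, values hypothetical),
# and the nearest outside samples starting at step 13.
polish_project_steps = [1, 2, 3, 13, 14, 14]
ng_gap = gap_size(polish_project_steps, cutoff=4)
```

With nothing between steps 4 and 12, the function returns 9, matching the gap quoted for Ng.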
These same 3 samples are present in
Ysearch, where the gap with no samples is from 4 to 11. Two samples at step 12 are from Germany and
Unknown. There are none at step 13 and
11 samples at step 14. It seems Ng is
concentrated in Poland.
The signature is (537, 492) = (10,
14). These are the only 3 Polish
Project samples in N type that have any mutation from the 12 value at 492, and
they have a 2-step mutation. 492 is
ranked 18th of 67 in the extended Chandler mutation
rates. The 10 value at 537 is also rare
- only these 3 plus 2 other samples have it in N type in the Polish Project. The same 3 Ng samples are extracted from N
type using 1 to 67 markers. They are
well isolated using as few as 7 markers because they have little variation from
each other in the rapidly mutating markers, so those rank well for the Ng
definition. ASD age comes out 619 years
using all 67 markers but of course that is a very rough estimate.
The simplest explanation is that the MRCA of Ng type lived in Poland less than a millennium ago and
passed on those 2 unusual mutations.
The 3 Ng samples fall at steps 4, 5,
6 with the N45 definition of N type, a hint that the Ng node is near the center
of the N type branch, not one of those old branches I speculated about, but
this is just a preliminary hint.
I introduced Ng type in Oct 2010; there have been no new 67 marker data in the
STR neighborhood of Ng type, so SBP has been 15.8% since, with the same
definition.
Ng also has what I call the Na
signature, discussed below.
The “g” is only my arbitrary code
name that I have been using for the DYS492=14 signature.
N-Ashk. Rewrite finished 25 Sep 2011.
Based on 5 Aug 2011 Polish Project data. Analysis file: NashkType.xls. N-Ashk is a small subtype of N type. Only 4 samples.
These seem to be Ashkenazi
samples. Mayka
pointed out to me that the names seem Ashkenazi, per his experience. The samples beyond the cutoff are apparently
not Ashkenazi.
Signature (19,385a,594) = (15,12,11)
I introduced this type in Jan 2011,
with SBP 23%, slightly more than my stated 20% limit for
using the word type.
I called it a type anyway, for two reasons: First, the
Ashkenazi names are independent evidence of a clade. Second, the N-Ashk modal haplotype
differs from the N modal at 6 markers, which is evidence of a fairly old node
in the N branch of the Y-DNA tree.
I introduced this type as Nca type,
because of what I have been calling the Nc signature, DYS19=15. The “a” meant Ashkenazi, but that was
confusing because the samples do not match what I have been calling the Na
marker. Nc is large; I doubt N-Ashk is a twig in a large Nc
branch; the Nc mutation more likely
arose independently in the N-Ashk hypothetical clade.
This Sep 2011 reanalysis makes a
cleaner cluster of data, although still small with only 4 samples. The 594=11 marker is very clean; these 4 samples are the only R1a samples in
the Polish Project with this value. SBP
increased to 47%, so it is a stretch to call this a type, but the Ashkenazi
connection is improved now and the 594=11 marker seems to be strong
evidence. Also, I avoid making changes
in classification names without significantly more data, so I’ll continue to
call this a “type” for now. There are
no longer any N-Ashk Borderline samples at 67 markers; the Borderline category is used for apparent
Ashkenazi samples that match well with only 37 markers.
The improved definition uses 58
markers, cutoff 3, no samples in the gap at steps 3 and 4. (The previous definition used 59, cutoff
5.) The improvement: I masked out CDY. The previous definition used CDYb, missing an Ashkenazi sample
that fits the type well, but has recLOH, providing a
misleading step of 5 at this one marker. With that new sample the ranking of markers
came out slightly differently, so a few other markers were added or removed
from the definition. The old and new
definitions are available in NashkType.xls.
The new definition is also available at Ysearch as 2TZKF,
and in my Haplotypes.xls
file.
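A minimal sketch of why masking one marker can bring a sample inside the cutoff. It assumes that step is the sum of absolute differences from the modal haplotype over the unmasked markers, which is a simplification of what the analysis files do. The three signature values are the ones given above; the CDYb values are hypothetical, picked only to reproduce a 5 step difference at that one marker.

```python
def step(sample, modal, masked=()):
    """Genetic distance: sum of absolute differences over unmasked markers."""
    return sum(abs(sample[m] - modal[m]) for m in modal if m not in masked)


# Signature values from the text; CDYb values are hypothetical.
modal = {"DYS19": 15, "DYS385a": 12, "DYS594": 11, "CDYb": 39}
sample = {"DYS19": 15, "DYS385a": 12, "DYS594": 11, "CDYb": 34}

old_step = step(sample, modal)                    # CDYb included
new_step = step(sample, modal, masked={"CDYb"})   # CDYb masked out

print(old_step, old_step < 5)   # 5 False  (missed with the old cutoff 5)
print(new_step, new_step < 3)   # 0 True   (captured with the new cutoff 3)
```

A sample that matches everywhere except one recLOH-affected marker is excluded when that marker counts, and fits perfectly when it is masked.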
The ASD age comes out only 668 years,
cell N29 in the ASD sheet in NashkType.xls.
Age calculated from only 4 samples is highly speculative, but N-Ashk
seems young because of little variation in marker values. The ASD should use (4-1) in the denominator
instead of the total 4 samples (although most genetic genealogists do not do
this for small sample sizes); with that
adjustment the age comes out 890 years, but that is still highly
speculative. That cell N29 is using 61
markers; CDY and 464 are masked
out. (The mask is row 21, which you can
easily edit.) All 67 markers yield
1,024 years, cell N12, because of CDY.
DYS464 has no mutations in the set of 4 samples, so including the 464 markers reduces the
age, but I left 464 out because most people routinely exclude the 464 set from
ASD.
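The (4-1) adjustment can be checked with simple arithmetic. This sketch assumes the ASD age is directly proportional to the variance, so swapping the denominator from n to n - 1 rescales the age by n / (n - 1); the function name is hypothetical and this is not the formula in the ASD sheet.

```python
def age_with_n_minus_1(age_years, n_samples):
    """Rescale an age computed with n in the denominator to use n - 1 instead."""
    return age_years * n_samples / (n_samples - 1)


print(round(age_with_n_minus_1(668, 4), 1))  # 890.7
```

The result agrees with the 890 years quoted above to within rounding of the 668 year input.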
N-Ashk is quite young, but the node
seems old because of the 6 marker distinction from N type. The simplest
explanation: N-Ashk has a long smooth branch, having an old node with N, but no
further branching near that main node.
The samples in the Polish Project all seem to come from twigs with young
nodes. I speculate that there may
actually be some branches of N-Ashk outside Poland. Perhaps the Ashkenazi ancestor of N-Ashk moved to Poland somewhat
less than a millennium ago. More data
will eventually confirm or refute this speculation.
2TZKF
is the modal haplotype at Ysearch, where
only two of these samples are present, and where there are 2 additional samples
in the gap, from Russia and Belarus;
the simple explanation is that N-Ashk is concentrated in Poland,
although there is too little data for confidence. See NYsearch.xls
for my Ysearch analysis.
N-Ashk has what I call the Nb
signature, discussed below.
Nt. Edited 25
Sep 2011. New topic 20 Sep 2011. Based on 5 Aug 2011 Polish Project
data. Analysis file: NtCluster.xls.
With 17 samples, Nt cluster is my
largest speculative subclade of N type identified so far.
SBP = 27%; this cluster is close to the 20% maximum SBP for Polish Project
assignments as a type.
I am suspicious of this Nt cluster due to selection bias: I considered 70 signatures for N type during
the summer of 2011, and carefully analyzed more than 30 of them. With that many attempts, a false positive is
likely. One of the clusters I analyze
will necessarily have the lowest SBP, but that might be just the luck of the
data. No one knows how to calculate the
statistical confidence in such a case.
I discovered Nt at the end of this major effort. If SBP improves with more data for Nt I’ll
upgrade it to a type, but if SBP gets worse (bigger) as data accumulates I’ll
lose interest in Nt.
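The selection bias worry can be illustrated with a toy calculation. This is not a confidence estimate for Nt: it assumes the trials are independent and that any single trial signature has a hypothetical 5% chance of looking like a cluster purely by luck. Neither assumption holds for overlapping signatures drawn from the same data, which is why no real confidence can be calculated this way.

```python
def chance_of_false_positive(p_single, attempts):
    """Probability that at least one of several independent trials is a lucky hit."""
    return 1 - (1 - p_single) ** attempts


# Hypothetical 5% per-trial luck, 30 carefully analyzed signatures.
print(round(chance_of_false_positive(0.05, 30), 2))  # 0.79
```

Even a small per-trial chance of a lucky cluster adds up quickly over 30 or more attempts.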
If Nt is valid, it is probably
concentrated in Poland. See NYsearch.xls. See my Ysearch method discussed above. I consider this additional evidence that Nt
corresponds to a clade, boosting my estimated confidence to about 70%. We don’t always use 70% confidence for assignments,
but everyone is anxious for more subdivision of N type in the Polish Project,
so we started using Nt in Sep 2011.
The Nt definition
uses 48 markers, cutoff 4, one sample in the gap at step 4. The definition is available at Ysearch as 2544E.
Nt is based on the signature
DYS442<14. However, there are 29
samples with that signature, and 5 of the 17 Nt cluster samples have the N modal
14 value at this marker. My simple
speculative explanation: the 442
mutation from 14 to 13 occurred independently in the Nt clade after the node
with the main N type branch. Other
speculative explanations are possible - those 14’s might be a back mutation
within a much larger “father” clade that carries the Nt signature on most but
not all samples.
One Nt cluster sample has the 12
value at 442, which could be another mutation or an independent double
mutation.
If we subtract the 12 Nt signature
samples with <14, that leaves 17 more samples (not included in my Nt
cluster) with this second modal value at 442.
There are only 3 samples at 15 in all of N, and we expect step up to be
more common than step down for a slow mutator (see my publication
for references), so that still leaves an excess of samples with <14,
implying yet another hypothetical clade with an independent mutation, or a
larger “father clade”, but this paragraph is getting highly speculative. I have more speculation like this about
independent clades vs large clades in the Na, Nb, and Nc topics below; similar
speculation applies to Nt.
Thirteen Nt samples match what I call
the Na signature, discussed below, but two samples match the alternate mode Nb; the last two samples are one step away from
Na. This is evidence of an even larger
Na father clade, but as discussed below the Na vs Nb signatures may have arisen
multiple times independently, so I’m not confident enough to speculate further along
these lines.
See also NclusterAssignments.xls.
Ns. Edited 23
Sep 2011. New topic 20 Sep 2011. Based on 5 Aug 2011 Polish Project
data. Analysis file: NsCluster.xls. Ns cluster is a speculative subclade of Nt
cluster.
With 6 samples and SBP = 27%, this
cluster is close to the 20% maximum SBP for Polish Project assignments as a type. I am suspicious
of this Ns cluster for the same reasons given above for Nt. On the other hand, Ns looks like a credible
subclade of Nt, which adds credibility to both of them.
If Ns is valid, it is probably
concentrated in Poland. See NYsearch.xls. The 67% concentration is the best I have
seen so far, but this % is highly uncertain because it is based on only 2 Ns
samples at Ysearch. Such as it is, I
consider this additional evidence that Ns corresponds to a clade, same as my
confidence for Nt.
The definition uses 47 markers,
cutoff 2, no samples in the gap at steps 2 and 3. The definition is available at Ysearch as A5NSG.
Ns is based on two signatures. Ns is my code for DYS446=12, 9 samples, vs
446=13 modal for N type. Nt is my code
for DYS442=13, 5 samples, vs 442=14 modal for N type. The 6 Ns samples are all at steps 0 and 1 with the 47 marker
definition; the other 3 with that
signature are at steps 9 and 10, so it is reasonable to suppose the Ns mutation
happened twice independently in the N type clade. Five of the 6 Ns samples have the Nt signature, but that 6th one
has the value 12, two steps from the N modal 14, so it should be considered Nt
also.
See also NclusterAssignments.xls.
All 6 Ns have what I call the Na
signature, discussed below.
Nd. Edited 24 Sep 2011. New topic 20 Sep 2011. Based on 5 Aug 2011 Polish Project
data. Analysis file: Nd53Cluster.xls.
Based on the signature DYS389I = 14,
vs N modal 389 = (13,29). Nine samples
have the Nd signature. Only 3 of these
9 fit Nd53. My confidence is only about
50% that these 3 samples really belong to the same clade; I included this analysis as an example of an
uncertain clade, and for discussion below in the Na topic.
DYS389II has the value 30 for Nd but
this is not a mutation at 389II. See compound markers for an explanation.
I call this Nd53 because the 53
marker definition is somewhat arbitrary - there is no very likely
definition. It is likely I’ll need to
change the definition soon, when more STR data becomes available. Also, “Nd53” makes it clear that this is not
the same as the cluster formed using only the Nd signature.
Nd53 is not used for assignments in
the Polish Project; see NclusterAssignments.xls
for speculative assignments.
The 3 samples do not have Poland as
origin, although I suppose those men suspect Polish ancestry, because
that is usually the case for Polish Project samples. On the other hand, Nd53 might be representative of a clade that
is concentrated outside Poland.
Ne. Edited 24 Sep 2011. New topic 23 Sep 2011. Based on 5 Aug 2011 Polish Project
data. Analysis file: Ne40Cluster.xls.
Based on the signature DYS390 = 24,
vs N modal 25. Twelve samples have the
Ne signature. Only 3 of these 12 fit
the Ne40 cluster. My confidence is only
50% that these 3 samples really belong to the same clade; I included this analysis as an example of an
uncertain clade, and for discussion below in the Na topic. Nd and Ne have similar status.
I call this Ne40 because it is likely
I’ll need to change the 40 marker definition soon, when more STR data becomes
available.
Ne40 is not used for assignments in
the Polish Project; see NclusterAssignments.xls.
Only one of the 3 samples has Poland
as origin, although I suppose the other two Ne men suspect Polish
ancestry, because that is usually the case for Polish Project samples. On the other hand, Ne40 might be
representative of a clade that is concentrated outside Poland.
Na and Nb. I have been rewriting this topic throughout
the late summer of 2011. Finished 24
Sep 2011. Based on 5 Aug 2011 Polish
Project data.
Clusters based on DYS464, a marker set
that is multimodal in N type. Analysis files: Na45Cluster.xls
and Nb32Cluster.xls.
I introduced Na and Nb in my publication,
page 179 and Table 3. I have been
updating the discussion for Na and Nb here at this web page. I consistently emphasize that these are
speculative subclades. In retrospect, I
should have avoided the word “type” for these because more data over the years
has convinced me that the explanation for what is going on is not two subtypes
of N. It will take me a few paragraphs
to explain the issue of Na and Nb:
One way to split the N type data,
obvious at a glance, is by the number of markers for 464. Some samples have 4 values, some have 6,
just a few have 5 or 7.
I understand that the 464 set is the
most prone to genetic testing evaluation errors, so this or any categorization
using 464 will have uncertainties. Even when
464 is taken in combination with other markers, some statistical
uncertainty remains due to possible evaluation errors at 464. Specifically, a sample in a database with 4 values at 464 might
really have 5 or more values, and vice versa.
Follow my links if you wish to read
more about compound markers and recLOH
issues, which introduce confusion for the 464 marker set. Briefly, copy mutations can increase the
number of 464 markers, but recLOH mutations might reduce the number. A single copy mutation can change more than
one value in the set. Copy mutations
and recLOH mutations are rare, about the same frequency as very slowly mutating
STR markers. Net mutations in the 464
set are common, with frequency among the fastest in the standard 67 set. For the Chandler rates, each of the four
markers 464a to 464d is assigned a rate one fourth of the net rate for single
mutations for the set of 4.
I use Na as my code for the signature
464 = (12,12,15,15,15,16) - the most common value set for 464. 28 of the 87 samples. My Nb signature is the next most common, 464
= (12,15,15,16). 16 samples. I say 464 is multimodal because there are
also two sets with 4 samples each;
that’s why I’m using Na as a signature even though it is the modal value
for N type as a whole. This is for the
87 N type samples in my 5 Aug 2011 download of the Polish
Project; the proportions change every
few months as data accumulates due to the statistics of small sample sizes.
Na and Nb differ by 2 steps following
the Ysearch method, but that is misleading because Na
can turn into Nb in a single recLOH mutation, which might have happened more
than once in the past in this N type database.
Nb can turn into Na with a single copy mutation. I may not be exactly correct in this
paragraph if my assumption of the structure of 464 in N type is incorrect, but
this paragraph is certainly a brief example of the kind of confusion that
arises with 464.
It is easy to construct clusters
using 464 in N type. Too easy. Too many choices for clusters, as I discuss
in the following. I could not come up
with clusters with good statistical confidence. My Excel analysis files allow setting
maximum step, so I also tried using maximum 1 for the 464 set - 1 step for any
variation of a sample from a trial definition;
still I found no clusters with confidence.
My analysis files allow an alternate
method, treating the 464 markers as individual markers. This is the method I used in my 2009
publication, still no clusters with confidence.
My default is to follow the Ysearch method for counting step at 464, although this
method is obviously less than perfect.
My list of code names is available in
Haplotypes.xls.
When trying individual markers,
DYS464b is best. In my notes I use Na1
- 464b<14, and Nb1 - 464b>13;
these two signatures neatly split all the N type data. Na1 captures all the Na samples plus mostly samples with more than 4 markers; Nb1 captures all the Nb plus mostly samples
with 4 markers; there are
exceptions. Using Na1 vs Nb1 I come to
the same conclusions as using Na and Nb, discussed below.
DYS464e provides another way to split
the data. In my notes I use Nx - any
value for 464e, and Ny - no value for 464e.
Nx captures all the samples with more than 4 markers including the Na
samples; Ny captures all the samples
with 4 markers including the Nb samples.
Using Nx vs Ny I come to the same conclusions as using Na and Nb,
discussed below.
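The two alternate splits can be written out as simple rules. This sketch assumes the 464 values are listed in order (464a, 464b, ...), as in the Na and Nb signatures given above; the function names are hypothetical and the code is not taken from my analysis files.

```python
NA = (12, 12, 15, 15, 15, 16)   # Na signature, 464a-f
NB = (12, 15, 15, 16)           # Nb signature, 464a-d


def split_by_464b(values):
    """Na1 if 464b < 14, otherwise Nb1 (464b > 13)."""
    return "Na1" if values[1] < 14 else "Nb1"


def split_by_464e(values):
    """Nx if the sample reports any value for 464e, otherwise Ny."""
    return "Nx" if len(values) > 4 else "Ny"


for name, values in (("Na", NA), ("Nb", NB)):
    print(name, split_by_464b(values), split_by_464e(values))
# Na Na1 Nx
# Nb Nb1 Ny
```

Both rules put the Na signature on one side and the Nb signature on the other, consistent with the statement that either split leads to the same conclusions as Na vs Nb.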
Consider my definitions Na45 and
Nb32, with 45 and 32 markers. See those
two Excel files for details. My
choices for 45 and 32 are arbitrary.
Those files show columns with trial definitions using a wide range of
markers, automatically chosen by rank. A wide range of marker counts seems
roughly equivalent. It is remarkable
how many samples fit very well using up to 50 markers for trial
definitions: Na has 16 samples at step
zero using 11 markers, and 15 samples at step less than 2 using 45
markers; Nb has the same 14 samples at
step zero using from 11 to 32 markers.
When the 464 set is excluded from the definition, some Na samples fit
the Nb definition, and some Nb samples fit the Na definition. One simple explanation: Na45 and Nb32 might correspond to two very
young clades. However, there is an
alternate explanation: Na45 might
correspond to two or more young clades, and Nb32 might correspond to two or
more young clades, and they may be a “bushy” set of branches where some Na45
clades are connected by nodes to some Nb32 clades. I see no way to be confident that most of the Na samples are in a
branch distinct from a branch with the Nb samples. I suppose if your sample matches Na45 at step zero or one, there
might be better than a 50-50 chance that you and others who match at <2
belong to a unique clade that may someday have an SNP definition, but such a
clade will surely exclude some of the step <2 samples, and include some
samples from steps 2 and 3, so Na45 does not provide a definition. The same can be said if you match Nb.
Some samples that fit the Na
signature at 464 = (12,12,15,15,15,16) come out at high step using more
markers. Similarly, some samples that
fit the Nb modal at 464 = (12,15,15,16) come out at high Nb step using more markers. You can see this at a glance in those two
files. Two opposite simple explanations
come to mind: Na and Nb may have
independently arisen more than once, followed by population expansion -
multiple branches in the N tree. The
opposite explanation: Na and Nb sets
might be signatures for two old clades that each have a few old subclades - two
main N branches that have a few old branches and where both Na and Nb have a
bushy clump of branches at the ends.
More complicated explanations also come to mind. That second explanation, two main branches,
is attractive, but I see no proof that is true, or even highly likely.
In the file NclusterAssignments.xls,
I make speculative assignments. Most of
the Na45 and Nb32 samples fit other more believable types and clusters. I went ahead and assigned the few leftovers
to Na and Nb, but these are just speculative assignments, meant to show you
which of my clusters you best fit.
Summary: There is not enough evidence to consider Na and Nb to be two
unique subclades of N. Maybe Na45 and
Nb32 do correspond to the top of two main branches of the N tree, with most of
the samples that fit Na45 or Nb32 belonging to the corresponding clades. Maybe not.
I see no way of ruling out multiple independent clades (branches far
apart in the tree) for both Na45 and Nb32, or for any other definitions based
on the 464 set. Perhaps in a year or so
more STR data will provide convincing subclades along these lines. Perhaps in a few years SNPs will be
discovered to subdivide N type.
I have more discussion along these
lines below, in the Nc topic.
At all 67
standard markers, the Na and Nb modal haplotypes are essentially the same
for STR markers other than 464. I say
“essentially” because the rapid mutators, particularly the CDY pair and DYS576,
typically vary modally from month to month due to the statistics of small
samples. At CDYb, Na type signatures
with multiple markers are typically modal 40, while Nb are typically modal 39,
but this marker always ranks poorly for definitions because of the wide range
of values. In Nb less than 1/3 of the
samples typically have the modal value at CDYb.
The Russian
site independently came up with this same haplotype
distinction. Two modal haplotypes are available on Ysearch, from the
Russians. Each use 78 markers and each
match my Na and Nb types at 67 markers, including that 39 value for CDYb in
Nb. Central European-1 Modal GTAVR
corresponds to my Nb, using only 4 values, 464a-d. Central European-2 Modal 495M5
corresponds to my Na, using 6 values, 464a-f.
Nc. New topic 25 Sep 2011. Based on 5 Aug 2011 Polish Project
data. Analysis file: Nc32Cluster.xls.
My Nc code is for the signature DYS19
= 15, compared to the modal value of 16.
Similar to Na and Nb, my publication and previous versions of this web
page proposed Nc as a tentative subdivision cluster of Nb. The samples with the 15 value last year had
mostly Nb samples, but this year that correlation is not significant.
My opinion of Nc is very similar to
my opinion of Na vs Nb: No confident
conclusion. Nc might correspond to a single
large clade. Then again, Nc might
correspond to independent unrelated clades where the Nc mutation arose
independently.
My Nc analysis complements my Na and
Nb analysis: If you look at
Nc32Cluster.xls, you see at a glance that the best fit samples are a mix of Na
and Nb. If you look at Na45Cluster.xls,
you see at a glance that the best fit samples are a mix of Nc and modal
DYS19=16. If you look at
Nb32Cluster.xls, you see at a glance that the best fit samples are a mix of Nc
and modal DYS19=16. If Nc32 vs modal 16
is a valid division of N type, then Na vs Nb cannot be valid. If Na vs Nb is valid, Nc vs modal 16 cannot
be valid. All three files have, at the
bottom, at large step, some Na, Nb, and Nc samples.
Next, let me consider the 4 combinations
using DYS464 and DYS19:
Nbc42Cluster.xls
is my analysis file using both the Nb and Nc signatures together.
Nac32Cluster.xls
is my analysis file using both the Na and Nc signatures together. This is very different from Nc32; the latter has a mix of Na and Nb; the former is a new analysis using the
additional restriction to Na match.
They both have 32 markers by coincidence. As in Na45 and Nb32, the number of markers is my arbitrary
choice; there is no obvious best
choice; the number of markers will
likely change as data accumulates for all these definitions where I specify the
number of markers in the code name.
Nb5_37Cluster.xls
is my analysis file using my Nb5 signature, which is the 4 Nb DYS 464 markers
plus the modal value at DYS19.
Na7_26Cluster.xls
is my analysis file using my Na7 signature, which is the 6 Na DYS 464 markers
plus the modal value at DYS19.
In the file NclusterAssignments.xls,
I make speculative assignments to these 4 clusters, but samples that fit one of
the more confident types (Ng and N-Ashk) and clusters (Ns and Nt) get that more
confident assignment if they also fit these 4 combinations.
The 3 Ng samples
are all Na, but they are a mix of values at DYS19. The neighborhood (just beyond the Ng cutoff) is all Na. This is a tantalizing hint of a “father”
clade with the Na signature.
The 4 N-Ashk samples
are all Nb, but in this case the neighborhood is a mix of Na and Nb. This is a hint of an independent mutation to
Na somewhat older than N-Ashk. Three of
the 4 N-Ashk are Nc, as are most of the neighborhood. The other has the modal DYS19=16 value. This is a hint of a father clade with the Nc signature, DYS19=15,
plus recent back mutations to the modal value.
The 6 Ns samples
are all Na, with a neighborhood mostly Na but some Nb. The Ng, N-Ashk, and Ns samples are all very
far from each other. You can see this
in the file NclusterAssignments.xls, where each type and cluster has a column,
with step value for each sample. I
consider this strong evidence against a large Na clade; it seems more likely that the Na (464=12,12,15,15,15,16)
set arose independently by copy mutation 3 times in these three hypothetical
clades.
Nt, the purported
father of Ns, has 17 samples; 13 Na
signature, 2 Nb, 2 one step away from Na.
It is reasonable to speculate that those 2 Nb are due to an independent
recLOH in Nt, and that the father clade has the Na signature. Unfortunately, it is also reasonable to
speculate that there were multiple mutations to the Na signature within Nt,
making the 464 set irrelevant.
The 3 Nd samples match Nb but again
the immediate neighborhood is a mix of Na and Nb, again evidence for
independent mutations at 464.
Ne is another example of a mixed Na
Nb neighborhood. In this example, 2 of
the 3 match Na. That third one,
464=(12,13,14,14,15,16) is 3 steps away from Na but those two 14 values are a
hint at another copy mutation.
NYsearch.xls
has a sheet with Ysearch data analysis for each type or cluster. The Polish percent, in boldface, is my
important result. Although this
analysis is based on very little data for each of those 4 combination clusters
here is the tentative finding: Nbc42 is
not concentrated in Poland. The other 3
seem to be concentrated in Poland; that
is evidence that each of those 3 clusters (Nac32, Nb5_37, and Na7_26) harbors
one or more clades that are concentrated in Poland.
Ns seems related to Na7_26, because
4 of the 6 Ns samples match at step zero, but the other 2 are at steps 2 and 3,
so this technique of 4-way combination is good for hints, but not conclusive.
Summary; Na, Nb,
and Nc clusters: 25 Sep 2011. That was a lot of analysis to justify my
opinion that Na, Nb, and Nc, although tantalizing, cannot be trusted without
correlation to more markers. N type
probably experienced population expansion not long after the TMRCA,
whereby the main N branches come out today with similar STR distributions. DYS464 is multimodal;
DYS19 is bimodal; the 4 main combination modes based on 464
and 19 provide evidence of twigs that are concentrated in
Poland. I bet there are many more small
Polish clades based on Na, Nb, and Nc waiting to be discovered in N type. I’ll continue to watch the STR data. New SNP markers within N type someday will
be even better.
P.
Complete rewrite finished 16 Aug 2011.
Based on 5 Aug 2011 Polish Project data. Analysis file: PType.xls.
P type is the main topic in my publication, Part II. P type is significantly concentrated in Poland, and in the Czech
Republic. It is found at lower frequency
in other Eastern European countries, and in eastern Germany. About 9% of Polish males carry P type Y-DNA.
After my publication, an SNP called L260 was discovered, found
to be equivalent to P type, confirming my prediction that P type corresponds to
a haplogroup, R1a1a1g2.
The “father” haplogroup R1a1a1g (M458) is composed of
what I have been calling N type (L260-) and P type
(L260+).
My current definition
for P type, P43, is a modal haplotype using 43 of
the 67 standard markers. The cutoff is 7, which
means all samples less than step (genetic distance) 7 are
predicted P type (predicted L260+).
That definition is available in the PType.xls
analysis file, in my Haplotypes.xls
file, and at Ysearch as 8U92G.
P type age (age means TMRCA)
is about 1,600 years. That’s highly
uncertain, but I’m 80% confident that age of 1,600
years is not off by more than a factor of 1.5 - age 1,100 to 2,400 years. The L260 mutation is likely quite a bit
older than the age of P type.
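The factor of 1.5 range is easy to verify. This sketch only restates the arithmetic of the range quoted above; the function name is hypothetical, and it says nothing about how the 80% confidence itself was estimated.

```python
def age_range(age_years, factor):
    """Lower and upper bounds when the age may be off by a multiplicative factor."""
    return age_years / factor, age_years * factor


low, high = age_range(1600, 1.5)
print(round(low), round(high))  # 1067 2400
```

The lower bound of about 1,067 years is rounded to 1,100 in the text.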
It’s interesting to wonder if the age
of P type is associated with the historical appearance of Poland somewhat more
than 1,000 years ago. It’s also
interesting to wonder why P type is so isolated in haplospace
- why there are so few men alive today with STR values slightly different than
P type. I added a bit of speculation
along these lines to my publication, but frankly, no one knows the
answers. I offer a little more
speculation at the end of this topic.
That was a quick summary. Next comes detailed discussion:
My published 2009 definition for P
type, P36, still works very well. My
prior update definition, Sep 2010, P46, still works very well. I updated the definition Aug 2011; P43.
All 3 definitions are compared in that analysis file PType.xls,
Calculator sheet, columns BZ to CB.
The August change is only a slight
tweak; I dropped 3 slowly mutating
markers that are mutated in two samples recently found L260+; these two were at steps 7 and 8 using the
prior P46 definition; they are now at
steps 5 and 6 with the new P43. More
discussion about this below.
There is only one L260+ sample not captured
by P43. This sample is at step 9 using
any of my 3 definitions. The problem is
DYS464, where this sample obviously had a serious recLOH
mutation, expanding the number of 464 markers from 4 to 6, yielding step 4 for
only that compound marker. The net step
9 would become step 5 without 464.
Nevertheless, I cannot drop 464 from my definition, because this marker
helps a lot to discriminate P type from N type. I have more discussion below about this
outlier sample.
P43 captures only one sample not P
type, an NB sample, which means N Borderline.
Although this sample fits N better than P, hence the NB prediction, it
has not been tested for L260 or M458, so its status is uncertain.
There are 10 samples at step 6 (5 Aug
2011), the last step of the type, where uncertainty is
highest. Seven of these have been
tested L260+, confirming membership in this haplogroup. This high testing rate is not a coincidence; Mayka and I have been encouraging men with marginal samples to do
the L260 test since it became available in Apr 2010. (We paid if cost was a problem.)
One of the step 6 samples not L260 tested is the NB sample of the
previous paragraph. Another is M458+ and not a fit for N type, so it can be confidently
predicted L260+ (although the L260 test would be nice). The 10th step 6 sample has neither SNP test, and is not a fit for N type, so it is assigned
PB, a Borderline assignment intended to encourage SNP testing. There are two other PB samples that were
step 6 using the prior definition;
these are now step 5. We will
probably expand the PB category, so the next assignment update should have a
few more PB samples, again to highlight the ones most likely to benefit from
SNP testing. I estimate the PB samples
have about 75% probability of being proven L260+.
P43 summary: The P43 definition, cutoff 7, captures 90
samples as P type. One L260+ sample is
not captured because of DYS464. One
captured sample at step 6 is probably N type.
So the predicted P type is 90 samples and the predicted (some actual)
L260+ is also 90 samples (5 Aug 2011).
The statistical accuracy of my P type
definition may seem like about 98% - 100% below step 6. However, my confidence
is more like 90%: I’m 90% confident that more than 90% of future samples that
match P43 below the cutoff step 7 will be L260+ if tested (95% confident for samples
below step 6). That confidence is not
calculated - it’s my estimate to account for two issues: First, I have removed from the definition
markers that are mutated only for L260+ samples at high step (mentioned above
and discussed further below) but more such mutated markers are bound to show up
for future samples, so future predictions are not quite as good as the adjusted
fit implies. Second, there may still be
a very small L260- clade that just happens to have STR
values close to P43 due to the luck of random STR mutations. For samples without Polish ancestry the
probability is higher for these two issues;
this confidence discussion is limited to Poland, as represented by the
Polish Project.
According to Pawlowski,
along with further evidence in my publication, P type (L260+) is concentrated
in Poland. I verified this and other
Polish types using both Yhrd and Ysearch. P has fewer mutations than N and K, so it
must be younger. In my publication I
estimated that about 8% of Polish men have male line ancestry of this
type; my current estimate, from the Results Table, is 9.0% (calculated from the edited
data 28 Jul 2011) -- calculated 70% confidence interval 8.0% to 10.0% -- 95%
confidence interval 7.1% to 11.0%.
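For readers who want to reproduce intervals of this kind, here is a minimal sketch using the normal approximation to the binomial. It is not necessarily the method behind the numbers above. The counts are an assumption: 83 P type samples out of 918 total, which appear to be the figures behind the 9.0% row in the summary table at the top of this page.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(count, total, confidence):
    """Two-sided normal-approximation interval for a sample proportion."""
    p = count / total
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # e.g. about 1.96 for 95%
    half_width = z * sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width


# Assumed counts: 83 P type samples of 918 in the summary table.
for confidence in (0.70, 0.95):
    low, high = proportion_interval(83, 918, confidence)
    print(f"{confidence:.0%}: {100 * low:.1f}% to {100 * high:.1f}%")
```

Under these assumptions the approximation lands within a few tenths of a percent of the intervals quoted above; a different interval method would account for small differences.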
Ludvik Urban pointed out to me that P
type is common in the Czech Y-DNA
Database. FTDNA also has a Czech Y-DNA
Project. There is not enough data yet
to calculate if the frequency in the Czech Republic is greater or smaller than
the approximate 9% frequency in Poland (as represented by the respective
projects).
Karen Melis, administrator of the
FTDNA Zamagurie
Project, pointed out to me that P type is common in her data from the Zamagurie region, which is on
the border of Slovakia with Poland. I’m
not sure of the concentration in Slovakia.
It will be interesting if more data
in the future allows resolution of subtypes of P type by region.
I added a “Ysearch” sheet to that
PType.xls analysis file, with update analysis from Ysearch. That file has a copy of the 123 matches at
step < 9 (12 Aug 2011) from my P43 definition, 8U92G. Seven of those matches are modals,
segregated to the bottom of the sheet and not used for analysis. The cutoff is 7, same as in the Polish
Project, but SBP is 19%, not very good.
The reason is 10 samples at step 7.
Only two of these at 7 indicate “Poland” for Origin, 3 Germany, 2
Scotland, 2 Unknown, and 1 USA. This
may be a sign of a clade outside Poland with STR values close to the P type
cutoff; I doubt that; more likely, these are outliers from more
distant clades, because there are a huge number of samples at step >9 so of
course some samples from those clades will fall at step 7 just due to the luck
of random mutations. In other words, P
type is a relatively small haplogroup on Ysearch, and the background
is larger on Ysearch than in the Polish Project, so of course SBP will be
larger. Still, 19% is pretty good on
Ysearch.
Those Ysearch results include 11
samples with “Unknown” or “USA” for Origin, so I removed those for Origin
analysis, 105 net samples. Below the
cutoff step 7, 54% are Poland; that is
very high;  only a very small percent of all
samples in Ysearch are from Poland.  At steps 7 and 8, 26% are Poland, showing the expected drop off
for outliers. Germany and other Slavic
countries also have significant percent P type; there is a table with details in that Excel sheet. This updates my evidence that P type (L260+)
is concentrated in Poland.
The isolation of P type in the Polish
Project is now even more impressive than at the time of my publication. Most of the samples at steps 7 and 8 are
good fits to other newly discovered types (see PType.xls, column CB), so there
are now fewer borderline samples just beyond the edge of P type. Two of the step 7 samples are my maternal
cousins; their close match to P type is
what got me interested in this topic; if I had not noticed this someone else may
have done a similar study and those two samples would not be in the
database; statistically those two
should be edited; I edited by -1 in the
Results Table, but I do not do minor edits in the analysis files. One of those cousins is tested M458- so I
have high confidence both belong to I type, not P type.
This Aug 2011 analysis does not
include L260 data from other projects. I’ll
wait a few months before reviewing L260 data outside the Polish Project. My last analysis including data from outside
the Polish Project for P type, N type, L260, and M458 was Jan 2011. For those last results, see the following
topics, which have not been updated for several months:
P type Age - TMRCA: My publication explains the ASD method. The ASD sheet in PType.xls provides 1,778
years using all 67 markers. However,
385b should not be used because 5 samples have recLOH
mutation from 14 to 10, providing the unreasonable ASD age of 11,007 years at
this one marker. Also, 464 has obvious recLOH
issues; my ASD sheet, treating 464a to
d as independent markers, comes up with an average of 2,093 years for these
4. Most people who figure ASD age
exclude 464. It is interesting that
385a has no recLOH (10 to 14) so far; I
do not understand why not. The other
compound markers are not issues because the P type values are such that the
apparent recLOH cause only step 1 mutations, so they might as well be included.
1,637 years is the ASD age,
cell N29 of the ASD sheet, using 62 markers; excluding 385b and excluding the
four 464. Exclusion is by typing a
blank or zero into a mask, row 21, so you the reader can easily verify that
removing compound markers other than 385b has no significant effect.
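For readers who want to see the shape of the calculation outside Excel, here is a minimal sketch of an ASD age with a marker mask. It assumes the per-marker age is the average squared distance from the modal value divided by a mutation rate, and that the ages of the unmasked markers are then averaged. The mutation rates, the 25 years per generation, the four toy samples, and the function name are all hypothetical; none are values from the ASD sheet.

```python
def asd_age(samples, modal, rates, mask, years_per_generation=25):
    """Average the per-marker ASD ages over the markers switched on in mask."""
    ages = []
    for marker, use in mask.items():
        if not use:
            continue  # a blank or zero in the mask excludes the marker
        asd = sum((s[marker] - modal[marker]) ** 2 for s in samples) / len(samples)
        generations = asd / rates[marker]
        ages.append(generations * years_per_generation)
    return sum(ages) / len(ages)

# Toy data only: four made-up samples at two markers.
modal = {"DYS19": 17, "DYS439": 10}
rates = {"DYS19": 0.002, "DYS439": 0.004}  # hypothetical rates per generation
samples = [
    {"DYS19": 17, "DYS439": 10},
    {"DYS19": 16, "DYS439": 10},
    {"DYS19": 17, "DYS439": 11},
    {"DYS19": 17, "DYS439": 10},
]

all_markers = {"DYS19": 1, "DYS439": 1}
one_excluded = {"DYS19": 1, "DYS439": 0}
print(asd_age(samples, modal, rates, all_markers))
print(asd_age(samples, modal, rates, one_excluded))
```

Changing a mask entry to zero and rerunning is the same kind of check as blanking a cell in row 21 of the sheet.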
The far right of the ASD sheet has
all the markers ranked by apparent age.
I added a Notes column with explanations for some of them. Other than 385b, other old markers should
not be excluded because the random luck of STR mutations is bound to produce
such anomalies, which are statistically balanced by the 9 markers with zero age
(no mutations among the 90 samples).
They should all average out. By
the way, the number of markers with apparent zero age has been declining in P
type as data accumulated during the past few years, as of course it should, but
apparent age averaging many markers has not changed more than statistically
expected due to the details of new data.
My 2009 published age was 1601 years;
my update last year on this web page was 1775 years. I have consistently written “roughly 1600 years”
in my discussions.
There are a number of reasons why
“raw” ASD age should be increased, as discussed in my publication, part I. However, those reasons are mostly due to population bottlenecks in the
past. As discussed below, P type
evidently went through a rapid population expansion soon after the TMRCA, so
the raw ASD age should be used as a best estimate. Anyway, there are significant non-statistical age caveats that produce systematic
uncertainties as large as the uncertainties due to population bottlenecks, and
much larger than the statistical sampling uncertainties from 90 samples. So any age calculated from ASD (or from any
other type of STR variance) should be taken with a grain of salt. My factor of 1.5 uncertainty quoted above is
based on my 80% confidence from experience, not from
calculation.
385a=10 is the best marker for P
type. I have a separate topic for the P type signature. 385a=10 continues to be amazing.
89 of the 90 samples predicted L260+ have the 385a=10 value. Beyond P type, 385a=10 shows up in only 2
samples at step 7 (my two cousins, mentioned above, who should not both be
counted), none at step 8, only 1 at step 9, and 3 at step 11. The PType.xls database is truncated at step
<12; the full R1a data from the
Polish Project - 457 samples - has only 1 more 385a=10 sample beyond step
11. In other words, this one marker
385a=10 is about 99% effective at capturing P type (future L260+ predictions)
plus less than 3% additional falsely predicted foreign samples from the rest of
R1a. 385a=11 is ancestral (N type and
most of R1a), but so far there are no P type with the ancestral 11 value,
strong evidence that the rare mutation from 11 to 10 happened before the
TMRCA. The 385a & b pair are ranked
together tied for 41st in the Chandler rates, not very
slow. However, shorter STRs mutate a
lot more slowly than longer ones, and step down is slower than step up with
stronger effect for shorter STRs.
(Chandler discussed this with me by email - his project did not take
these issues into consideration - treating compound markers together, with data
combined from all haplogroups). In
other haplogroups 385a values >14 are not uncommon. So it makes sense that the 385a mutation 11
to 10 should be very rare, explaining why it works so well for P type, although
that one P type exception (at step 4) is an even rarer 10 to 9 mutation.
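The percentages in this paragraph can be checked with a few lines of arithmetic. This sketch is one reading of the counts given above: step 10 is not mentioned and is taken as zero, the 457 R1a samples are assumed to include the 90 P type samples, and the false prediction rate is taken relative to the rest of R1a. The variable names are made up for illustration.

```python
p_predicted = 90        # samples predicted L260+
p_with_385a_10 = 89     # of those, samples with 385a=10
r1a_total = 457         # full R1a data in the Polish Project

# 385a=10 samples outside P type, by step from the P definition.
# Step 10 is not mentioned in the text and is omitted here (assumed zero).
outside_hits = {"step 7": 2, "step 8": 0, "step 9": 1, "step 11": 3, "beyond step 11": 1}

capture_rate = p_with_385a_10 / p_predicted
false_hits = sum(outside_hits.values())
rest_of_r1a = r1a_total - p_predicted
false_rate = false_hits / rest_of_r1a

print(f"captured: {capture_rate:.1%}")
print(f"false hits: {false_hits} of {rest_of_r1a} = {false_rate:.1%}")
```

Under these assumptions the capture rate is 89 of 90 and the false hits are 7 of 367, which agrees with "about 99%" and "less than 3%".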
Column CJ of my analysis file shows
that using only the best 5 signature markers, cutoff 2, 83 P type samples are
captured and none from outside P. That’s
better than 80% accuracy using only 5 markers, which is very good and unusual
in SNP prediction. Even more unusual is
that the one best marker is even better.
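A definition with a cutoff can be sketched as a step count against a short list of marker values. The sketch below assumes that step is the sum of absolute repeat differences and that a sample is captured when its step is below the cutoff. The five markers are P modal values quoted on this page, chosen only for illustration; they are not the five signature markers of column CJ, and the three samples are made up.

```python
# P modal values quoted on this page; the choice of these five markers is
# illustrative only and is NOT the signature set from the analysis file.
P_DEFINITION = {"DYS385a": 10, "DYS19": 17, "DYS439": 10, "DYS572": 12, "DYS534": 13}

def step_distance(sample, definition):
    """Sum of absolute repeat differences over the definition markers."""
    return sum(abs(sample[marker] - value) for marker, value in definition.items())

def predicted_in_type(sample, definition, cutoff):
    """A sample is captured when its step is below the cutoff (assumed rule)."""
    return step_distance(sample, definition) < cutoff

# Made-up samples for demonstration.
exact_match = dict(P_DEFINITION)
one_step = {**P_DEFINITION, "DYS534": 14}
far_sample = {**P_DEFINITION, "DYS385a": 11, "DYS19": 16, "DYS572": 11}

for name, sample in [("exact", exact_match), ("one step", one_step), ("far", far_sample)]:
    step = step_distance(sample, P_DEFINITION)
    print(name, step, predicted_in_type(sample, P_DEFINITION, cutoff=2))
```

With cutoff 2, only samples at step 0 or 1 are captured, so a short definition tolerates a single one-step mutation and nothing more.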
DYS540=11. A new signature marker.
From the 111 marker STR set recently
available commercially. 71 Polish Project
R1a samples already have the 111 data, including 12 P type and 12 N type (18
Jul 2011). 11 of the 12 P type have the
540=11 value. 11 of the 12 N type have
540=12. Since P and N are the two parts
of the R1a1a1g (M458) haplogroup, this marker nicely distinguishes the two
parts with high probability. 12 is
obviously ancestral because that value dominates the R1a data. 540 already does not look as good as 385a
for P type, but it’s always nice to have another signature marker. It is too early to switch definitions to the
full 111 set. I’ll be adding 111 modal
haplotypes to my Haplotypes.xls file over the following months; P and N are already there.
That Excel analysis file is
intended for finding types - hypothetical haplogroups with
< 20% SBP. For P
type this is moot because L260 is available.
Nevertheless, I used the file to automatically come up with the best
prediction, P54, column CF, with SBP 7.6%. That SBP means 80% confidence (if L260 were
not known) that less than 7.6% of the predicted P type would not actually
belong to the predicted haplogroup.
Indeed P54 captures 89 samples, only 3 of which are not P according to
my new P43 based on L260 - that’s 3.3% foreigners captured. Since I published the SBP method in 2009,
almost all predictions have been better than SBP. But I designed SBP to be conservative (higher percent) to account
for statistical biases. I expect
eventually to have a few failed predictions (foreign background larger than
SBP, or two or more unrelated haplogroups fitting one type definition).
The main point of that PType.xls
file: Many definitions are displayed,
with various marker selections. I tried
a lot more definitions than the ones displayed in that file. The exact definition does not matter much
for P type. Any reasonable definition
of P type captures more than 90% of P type and less than 10% foreigners. Even the full 67 modal haplotype works
OK. Although that P54 has lower SBP than
my current P43 definition (9.2%), P43 is better because I adjusted P43 using
L260 results.
I identified P type and submitted my
analysis for publication before the M458 mutation was
announced by Underhill. The end of my Part I mentions M458 -- notes added during
publication. M458 (so far) is composed
of P type plus N type plus perhaps a few small clades just outside N. L260, the SNP that
defines the haplogroup corresponding to what I have been calling P type, was
discovered by a P type member of the Polish Project, inspired by my
publication. With him and other
coauthors, I published a brief letter announcing and describing L260 in the
Fall 2009 issue of www.jogg.info.
P type has obvious structure. Evidence of sub clades. Nodes in the P type
branch of the Y-DNA tree. The most
obvious evidence is bimodal markers. The bimodal markers are discussed below as clusters - hypothetical sub clades without high
confidence. The bimodal markers do not
correlate with each other, so none of the clusters qualify yet as types. Future data may provide better statistics
with a convincing subtype of P. If this
paragraph is not clear, please read the discussion below for the individual
clusters: Pa, Pc, etc.
Other evidence of structure: My two edits of the P type definition. In Sep 2010, I increased the number of STR
markers in the definition, then edited out the markers that have mutations only
in L260+ samples at high step, and not in L260- samples at or just beyond the
cutoff. In Aug 2011, I edited out 3
more such markers. Four samples involved, color coded in columns BZ to CA in
the analysis file; two do not fit my
original P36 but fit the other two definitions; two do not fit the 2010 P46 but fit the other two
definitions. These edited markers are
also evidence of structure. These are
all relatively slow mutating markers.
Those samples with such mutations are probably from old nodes in the P
branch. Of course, these cannot all be
old nodes because some markers will have mutations only at high P step just due
to the luck of random mutations. Some
samples from young nodes will come out at high step due to luck, and some
samples from old nodes will come out with low step. The point of this paragraph is that old nodes defined by rare
mutations are expected in any Y-DNA tree, and those samples are evidence of the
expected structure in P type. Another
point of this paragraph is justification for my method of editing markers. You the reader may be concerned by such
editing as selection bias to improve the apparent fit of the data. Indeed there must be such bias in some of
the markers that I edited. However, insofar
as some of those edited markers truly correspond to old nodes in the P branch,
it is appropriate to edit them; future
distant cousins with the same rare mutation will be better predicted as
L260+. The whole point of using
definitions shorter than the full 67 is to remove those markers that define sub
clades in order to come up with a proper definition that distinguishes the
branch as a whole, as explained in my publication.
Old node comment. It is possible the P type data includes
samples that really belong to an L260 branch with a node much older than the
next youngest node. In such a case it
would not be proper to combine them into the single P type. That one sample at step 9 (discussed above)
is an example of a candidate for such an old branch, but then again that sample
might just be an unlucky member of a young node (an outlier). Those 4 edited samples of the previous
paragraph are also examples. Because
there have been very few P type samples beyond my original cutoff, and because
all but one of them were easily incorporated with minor edit of the definition,
I am comfortable considering them all as a single type until there is evidence
of significant L260+ samples beyond P.
At any rate, all markers are included in the age calculation, so any old
branches contribute to the estimated age of the oldest node (oldest node means
MRCA). This paragraph would be a valid comment about any type analysis, but P
type is unusually well isolated in haplospace, so the justification is strong
to consider it a single clade.
The L260 mutation might be about the
same age as P type. Unlikely. We expect a defining SNP to be more likely
older than the TMRCA, perhaps much older.
The Western Slavic Modal haplotype,
Ysearch 28WGP,
matches P type perfectly at all 43 markers used in my new definition. That Western Slavic Modal uses 76 markers,
but many of those are highly variable due to high mutation rate. That modal is one of the Russian site modals.
The Polish Project makes some
assignments to P type for samples with < 67 markers if they match the P type
model very well. I have not updated
those assignment rules for a couple years, but I have been quite conservative
below 67, so those assignments are still > 80% confidence.
Let me finish this P type topic with
brief speculation about the origin of P type:
What does P type isolation mean? One simple explanation: The M458 father haplogroup for P type and N
type seems to have experienced a severe population
bottleneck. The evidence: P type and N type are very easily separated
by STR values. Both are isolated in haplospace. No
overlap. They are so far apart that the
nearest neighbors (just beyond the cutoff) for P type include outlier
samples (from other R1a haplogroups) in addition to N type samples, and nearest
neighbors for N include samples other than P.
Apparently, the father haplogroup was quite old at the time of the
bottleneck, with lots of variation in STR values. The bottleneck wiped out most of that population, so today men in
that father haplogroup descend from just two ancestors, the MRCAs
of P type and N type.
Why is P type so large and
concentrated in Poland? One obvious
explanation is a rapid population expansion not long after the TMRCA. Evidence:
Subtypes cannot be defined with confidence. Apparently, the major bimodal markers are due to mutations that
happened early in the population expansion, so the branches of P type have
similar statistical spread of STR values.
For more discussion along these lines see the discussions of the
clusters below.
There are other explanations to these
questions: P type may represent a huge
migration of a single paternal tribe during the dark ages from far away to the
region that is now Poland. Perhaps the
related haplogroups in that far away place got wiped out by subsequent famines
and wars. Or maybe they did not get
wiped out. If people in that far away
place did not tend to migrate to North America in the past, and today do not
tend to get DNA tests, then perhaps there are isolated pockets of L260 clades there
waiting to be discovered - some with STRs very similar to P type - some with
STRs very different than either P or N.
Maybe in the mountains of western Asia.
Also, the standard “null” explanation
should be considered unless there is strong evidence otherwise. The null explanation is statistical: No significant bottleneck or expansion. Just the luck of random growth of clades in
a small human population over the millennia.
The MRCA of P & N perhaps were far apart in STR values just by luck
- both being outliers. No one knows how
to calculate the probability that a large P and a larger N clade can be sole
survivors of the statistics of clade growth in the Y-DNA tree in only a couple
thousand years. To me it seems highly
unlikely. But I don’t know how to rule
this null model out in a convincing way.
I can think of more complicated
models as explanations. I’m sure you
can, too.
Caveat: I said M458 consists of P and N.
It is possible some of the outliers from N type might represent small
old branches that have nodes older than the node for P & N. There is no evidence to support this, but
then again there is no evidence to rule this out with confidence. More data will answer this over the next
year, perhaps. Anyway, this is a small
detail in the larger picture.
P type Bimodal Markers. This sub topic was significantly edited 25 Aug 2011, when I
introduced a definition of bimodal.
The following analysis uses the 90 P type
samples (5 Aug 2011) predicted L260+, at 67 markers, discussed above. I also include some comments about the 12
samples available with 111 markers (on 18 Jul 2011). A bimodal marker is evidence of structure, but not proof - a
hypothetical clade.
In the past, I have sometimes called
these hypothetical types. I now prefer
to reserve the word type for < 20% SBP, which Mayka and I take as evidence for
80% confidence that more than 80% of the samples
belong to a clade that will someday be confirmed as a haplogroup
by a newly discovered SNP.
Sometimes we make exceptions slightly above 20%, for example when a type
is regionally concentrated.
None of the following bimodal markers
qualify as a definition of a type, although some of them might be good enough
to be called clusters.
This is not proof that a specific
bimodal marker or cluster does not correspond to a future haplogroup. It is still possible that 95% of the samples
from a particular bimodal marker belong to a unique future haplogroup. For example, if the son (or grandson, or
great great grandson) of the P type MRCA had that defining mutation, and if he
participated in the purported P type population expansion, that would explain
why his haplogroup (male descendants) have STR values so similar to P type
except at the one defining marker. He
had no other mutations that differed from his ancestor among the standard 67
that I’m using today for analysis.
It is possible as more STR data
accumulates some of the following will qualify as types. Cluster identification is a bit of an art so
it is possible I just failed to find a small P sub type and someone else will
find it.
Many of the following are probably
not unique clades, but instead represent two clades that have widely separated nodes in the P tree.
Or three or more.
One characteristic of a type: It shows up early in the data as a cluster
with 20% < SBP < 50%, and the SBP continuously decreases in value as more
data shows up, as the SBP penalty for sampling statistics becomes diluted. This is good - it means false clusters that
show up by luck will not last as more data accumulates. The P bimodal markers that I have been
following for a few years (Pa, Pb, Pc, Pd, Pe, Pg) all have increased in
SBP, which I take as evidence that they will probably not become types.
Excel files for Pc and Pg are
in the on line data with my 2009 publication;
I am not updating those or adding any others because none are good
enough to stand out. Nevertheless, some
merit discussion:
Pa Bimodal Marker. Defined by DYS389 delta = 18.
DYS389=13,31. 18 samples (among
90 P at 67). P modal values are
13,30. This is a compound marker; that 2nd number is the sum, so this mutation
is in the longer repeat chain; P modal
17, Pa value 18. All the 18’s are
13,31; there are no 14,32 or 12,30 in
the Polish Project P type data at this time;
my analysis files will capture any future such samples as Pa. That 31 value by itself does not capture the
Pa cluster because there are several 14,31 in P type, which I’m calling a
different cluster because they are not mutated at the longer repeat chain; the 14 refers to the shorter chain.
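The DYS389 arithmetic is easy to get backwards, so here is a small sketch. It treats the second reported number as the sum of the two chains, as stated above, and recovers the longer chain by subtraction. The function names are made up for illustration.

```python
def dys389_long_chain(first, second):
    """Longer repeat chain: the second reported value is the sum of both chains."""
    return second - first

def is_pa(first, second):
    """Pa is defined by a longer-chain value of 18 (P modal is 17)."""
    return dys389_long_chain(first, second) == 18

# Haplotypes mentioned in the text above.
for first, second in [(13, 30), (13, 31), (14, 31), (14, 32), (12, 30)]:
    print(first, second, dys389_long_chain(first, second), is_pa(first, second))
```

The 14,31 samples come out at 17 for the longer chain, which is why the 31 value by itself does not pick out Pa.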
Pa is briefly mentioned in my
publication at page 172. Pa was the
first bimodal marker to catch my attention in 2007 because that 31 value
produces the 3rd most common haplotype in Polish data that differs by only one
step from P modal values using the old standard 12 marker set; see the table in my publication at page
162. Such a common haplotype at 12 is
evidence that Pa is an old sub clade of P.
However, the evidence is not convincing yet.
Bimodal evidence: Only 4 samples (value 16) with values other
than 17 or 18 for the longer chain.
3 Pa are available at 111 markers.
I have more discussion about Pa in
the Pg topic below.
Pb Bimodal Marker. DYS19=16. 27 samples.
P modal value 17. This one is of
interest because 16 is the ancestral R1a value, modal for both N and K types. The large size of Pb is a bit of a surprise,
because Pb is only 5th largest at 12 markers, and those should be a mix of P
and K because Pb differs from both P and K by only 1 marker out of the 12. Those 27 are not K because they have 67
markers and do not fit K type, which differs by multiple signature markers. The large size of Pb might mean there is one
large P sub clade that represents the oldest P node, before the mutation to 17,
so it is quite old with lots of STR variation.
That makes sense, because the proportion of Pb samples that match the Pb
modal at 12 markers is not much different than the proportion of P samples that
match the P modal at 12.
On the other hand, Pb might be 2 or
more clades with unrelated nodes, only one of those might be the oldest, the
others being back mutations to 16 by coincidence. On the other hand, that 16 might be a back mutation for most or
all samples, as far as we know with the data available today.
Bimodal evidence: Only 2 samples (value 15) with values other
than 16 or 17.
5 Pb are available at 111 markers.
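The "bimodal evidence" lines are a simple tally: find the two most common values at a marker and count how many samples fall outside them. Here is a sketch using the DYS19 counts given above (27 samples at 16, 2 at 15, and therefore 61 of the 90 at the modal 17). The function name is hypothetical, and the sketch only does the tally; it does not apply the definition of bimodal used for this page.

```python
from collections import Counter

def two_mode_summary(values):
    """Return the two most common values and the number of samples outside them."""
    counts = Counter(values)
    (first, n_first), (second, n_second) = counts.most_common(2)
    outside = sum(counts.values()) - n_first - n_second
    return first, second, outside

# DYS19 in the 90 P type samples, rebuilt from the counts in the text.
dys19 = [17] * 61 + [16] * 27 + [15] * 2

print(two_mode_summary(dys19))
```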
Pab bimodal marker pair would have
both Pa and Pb defining mutations. There
are only 2 such samples (out of 90 at 67 markers).
Pc Bimodal Marker. DYS439=11. 17
samples. P modal 10. Also discussed in my publication starting on
page 171.
Bimodal evidence: Only 2 samples (value 12) with values other
than 10 or 11.
One Pc is available at 111 markers.
The combination markers produce Pac
and Pbc clusters with 3 and 6 samples.
The Pc that I discussed in my
publication is what I now call Pch, discussed below. I changed the nomenclature to avoid getting myself confused.
Pg Bimodal Marker. DYS572=11. 25
samples. P modal 12. Also discussed in my publication page 172. Like Pb, this one is of interest because the
11 value is ancestral; the discussion is similar to the discussion for Pb.
572 is the 4th from the last of the standard 67 markers.
Bimodal evidence: Only 2 samples (one each at 12 and 13) with
values other than 11 or 12.
3 Pg are available at 111 markers.
The combinations Pag and Pbg each
have 8 samples. Two Pc combinations
(above) have 3 or more samples. All
other combinations of a, b, c, g have fewer than 3 samples each.
Those two combinations with 8
samples, Pag and Pbg, are instructive. They provide a reason why Pg has not worked as a proposed type in
the past. Pg might be comprised of two
sub clusters. Pag has the P modal 17
for all 8 samples at the “b marker”.
Pbg has the P modal 17 for the long 389 chain for all 8 samples at the
“a marker”. 9 Pg samples belong to
neither Pag nor Pbg.
For most haplogroups, a cluster of 8
samples with two markers that differ from the haplogroup modal is
impressive. However, P type is large
and relatively homogeneous. In this
case I have tried many combinations;
some are bound to come up impressive just by luck; I am discussing only the impressive
ones. I suppose if your sample falls
into either Pag (or Pbg) there may be 50% confidence that you belong to a clade
including more than 5 of those samples defined by the two corresponding
mutations, but I personally do not consider the confidence anywhere near 80%.
Even if Pag and Pbg are shown in the
future to correspond to two haplogroups, it does not follow that they will be
sub clades of Pg; they may be
independent branches of the P tree that both received the DYS572=11 mutation
independently. Or one of them could be
an old node with the ancestral value.
DYS572 is ranked in the Chandler list as 40th, not very slow. In the 2010 version of this web page, I
presented evidence that 572 is indeed a slowly mutating marker, at least in
R1a. I still stand by that
prediction. That would make it
reasonable that most of the Pg samples belong to the oldest node in the P tree
(but still less than 80% confident for 80% of the samples). Also, we wonder if Pbg is the oldest node in
the Pg branch, or if Pbg is a more recent back mutation at the “b marker” DYS19
to the ancestral value? In other words,
are the apparently ancestral 572=11 and 19=16 both older than P type, or both
younger, or is one older and one younger?
We don’t know yet.
H type also has
the 572 = 12 value.
Ph Bimodal Marker.
DYS534=14. 34 samples. P modal 13.
Bimodal evidence: Only 2 samples (value 15) with values other
than 13 or 14.
One Ph is available at 111 markers.
There are several combinations; the ones with 3 or more samples: Pah, Pbh, Pch, Pgh, Pagh, Pbch, Pbgh have 4,
11, 12, 14, 3, 5, 4 samples.
My published Pc is really Pch,
defined by those two markers that differ from the P modal.
The best 3: Pbh, Pch, Pgh, have 11, 12, 14 samples. These are instructive, particularly if they are viewed along with
the previous two “instructive” combinations, Pag and Pbg above. These cannot all be valid clades because the
same markers are used in different combinations. This is an explicit demonstration how interesting clusters will
always come up if enough combinations are tried. However, if we assume one particular cluster to be valid, that
means some of the others are not valid.
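One way to judge whether a combination is impressive is to compare it with the count expected if the two markers were independent, which is the product of the two single-marker counts divided by 90. The sketch below uses the single and pair counts quoted in the topics above. It assumes independence, and it assumes the quoted pair counts are comparable to this expectation; the text does not say whether a pair count includes samples that also carry a third mutation.

```python
from itertools import combinations

TOTAL = 90
# Single-marker counts among the 90 P type samples (Pa, Pb, Pc, Pg, Ph).
singles = {"a": 18, "b": 27, "c": 17, "g": 25, "h": 34}
# Pair counts quoted in the text; pairs not listed have fewer than 3 samples.
reported = {"ab": 2, "ac": 3, "bc": 6, "ag": 8, "bg": 8,
            "ah": 4, "bh": 11, "ch": 12, "gh": 14}

def expected_pair_count(x, y):
    """Expected number of samples with both mutations if they were independent."""
    return singles[x] * singles[y] / TOTAL

for x, y in combinations(singles, 2):
    name = x + y
    print("P" + name, round(expected_pair_count(x, y), 1), reported.get(name, "fewer than 3"))
```

For example, Pag has 8 samples against 5.0 expected and Pbg has 8 against 7.5, so a cluster of 8 is not far from what chance alone would give in a haplogroup this large.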
Pd, Pe, Pf, Pi, …. My Haplotypes.xls
file, near the middle of the “Haplotypes” sheet, has a longer list of bimodal
markers in P type.
Plap Cluster. Includes Lapinski samples.
This cluster has 8 samples that match perfectly at 14 of the 67
markers. Two of those 8, plus two more
at step 1 out of the 14, belong to the Lapinski family set. This is an example of selection bias,
because Lapinski recruited the other 3 distant relatives, so the cluster is not
as large as it seems. The cluster does
not form a type; I mention it here as
an example of a tentative cluster.
The Plap modal differs from the P
modal at what I call the Pr marker, DYS607 = 17 for Plap vs 16 for P modal. DYS607 is highly variable in P type; there are more 17 samples than 15 samples --
a mildly bimodal distribution. However,
those 8 Plap samples, all with the 17, just about account for the excess 17’s,
so 607 is no longer bimodal after adjusting for Plap.
Pz Cluster.
DYS565=14. Only 4 samples. DYS565 is the last of the 67 set. There are 5 DYS565=14 samples -- these 4
plus another that does not fit. The Pz
modal differs from the P modal at 12 markers, so this one is promising for the
future. SBP comes out over 50% because
of the penalty for small sample statistical correction built into SBP. This one may improve as more data
accumulates in a year or so. On the other
hand, I studied about 20 P clusters to come up with this best example of a new
promising cluster, so the most obvious explanation is luck. If you study STR data randomly generated by
a computer you may find a good cluster if you examine enough candidates.
R.
Remainder. Updated 2 Jul
2010. This is not a haplogroup or a
type. This is a category for samples
that are distant in STR values from all the R1a1a types I have defined so
far. If you are in this category, I
highly recommend that you get all 67 markers plus the M458 test. More markers will help me define a new type
for you. Your M458 test is unlikely to
come out positive, but if it does that means you would be the first member of a
new type within M458.
I also recommend that you test for
all the several SNPs that FTDNA considers equivalent to R1a1 (called R1a1a by others). Your
unusual STR values make you a candidate for an unusual small clade that has a
very old node within the R1a tree. Each
SNP is unlikely to come out negative.
In fact, all such tests most likely will come out positive. But if one comes out negative that’s
excellent, because you will join a very rare group, perhaps even define a new
haplogroup. If you cannot afford all
these tests, OK, just hope for people with STR values close to yours to do the
tests and watch this web page for your sample to move into a new category.
R
is equivalent to a paragroup. Just like R1a1a* means only R1a1a samples
that are negative for all known SNP subdivisions, my R category extends that to
mean only samples that do not match any of my known types. At 67 markers, R also means that the sample
does not qualify for one of my borderline categories. I have a policy not to use the U category for samples with all 67
markers, so in some cases I need to make a close call on a sample that is on
the edge of a borderline category - some R samples are right at my cutoff at 67
markers.
For a sample with 37 or fewer
markers, I require 80% probability that the sample would not match one of my
types if all 67 markers were obtained.
There used to be quite a few R at 37 markers back when I had only a few
types, but there are none right now (July 2010) because there are none that
have STR markers so unusual that they are far from all types.
The 80% rule
does not apply to R. If a sample has
30% probability of belonging to its best fit type it would be assigned to
R. That means it only has 70%
probability of being a true R. R
samples still have their FTDNA assignment
which is either 100% (green) or 99% (red).
When I started this hobby a couple
years ago, R was the 2nd biggest category after U. I now have enough types that R is small.
In June 2010 I subdivided R into two
categories. R (M458-) is those tested
negative for M458. R (needs M458) is
those not tested for that SNP. If an R
sample would test positive it would be moved to the NR category.
U.
Unassigned. Updated 3 Jul
2010. This is not a cluster, but a
holding place for samples with less than 80% probability
for assignment. I use U in the Polish
Project for R1a uncertain samples with less than 67 markers. Samples with all 67 standard markers are not assigned to U, but
instead are assigned to the R (remainder) category, or into “Borderline”
categories such as N Borderline or K Borderline. U is 0% in the Results Table, which is samples with 67 markers,
but considering all samples U is the largest category in the Polish Project,
with 200 members on 25 May 2010 - 15% of the project, 35% of R1a. If you are classified U you can become
promoted to another category by obtaining results for the remainder of the 67
markers.
The 80% rule
does not apply to U. If a sample has
70% probability of belonging to its best fit type it would be assigned to
U. That means it only has 30%
probability of being a true U. Many U
have >30% probability of belonging to two or more different types. U samples still have their FTDNA assignment which is either 100%
(green) or 99% (red).
Probabilities
include estimates, so they are not exact.
I tend to be strict for samples with fewer than 67 markers, using U for
marginal situations. At 67 markers I
do not use U - I use R, and I’m not strict at 67. Also, I concentrate my time on improving the assignment rules at
67 markers and have not yet found time for 37 marker rules for some of the
newer small types.
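The assignment logic for U and R described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and its parameters are hypothetical, the probability is taken as already estimated by some other means, and the "Borderline" categories are left out.

```python
def assign_category(best_type, best_prob, n_markers,
                    threshold=0.80, full_panel=67):
    """Return a category label for one R1a sample (hypothetical helper).

    best_type -- name of the best-fit type, e.g. "A" or "K"
    best_prob -- estimated probability (0..1) of belonging to best_type
    n_markers -- number of STR markers available for the sample
    """
    if not 0.0 <= best_prob <= 1.0:
        raise ValueError("best_prob must be between 0 and 1")
    if best_prob >= threshold:
        return best_type   # the 80% rule is satisfied
    if n_markers < full_panel:
        return "U"         # unassigned: fewer than 67 markers
    return "R"             # remainder: all 67 markers, no confident type


print(assign_category("A", 0.95, 67))  # assigned to its best-fit type
print(assign_category("K", 0.70, 37))  # U, although K is still 70% likely
print(assign_category("K", 0.30, 67))  # R, with 70% chance of a true R
```

The second and third calls mirror the two examples in the text: a sample placed in U or R keeps whatever probability it has for its best-fit type.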
Z280. See K type.
Z93. New topic 31 Oct 2011. This new SNP was recognized earlier this
month by ISOGG as R1a1a1h.
So far, all Z93 samples in the Polish
Project are coming out L342.2+, and vice versa.
A type,
discussed here at this web page since origination, and mentioned in my 2009 publication, is a branch of Z93 (L342.2). A type samples are coming out positive for
both SNPs.
I just today added L342T
as a new cluster, a hypothetical branch of Z93 (L342.2).
The Z93 category at the Polish Project web page has the samples that are Z93+
or L342.2+ and are not predicted A type or L342T cluster. Z93 also includes samples that were not tested for Z93
but are close STR matches to a sample that tested Z93+.
I tried to come up with an STR definition
for Z93 (L342.2). I could not. Z93 does not have good signature
STR markers. Or, there is a better way
to say that: The signature markers for
Z93 are about the same as the signature markers for Z280 (previous topic),
which is a large new haplogroup in R1a.
Lots of Polish Project samples are now coming out Z280+. Z280 seems to be equivalent to what I have
been calling K type.
Z93 and K type have similar STR values at the slower mutating STRs. As a result, the modal haplotype for R1a as
a whole is similar to the modal haplotype for Z93 (L342.2) samples, and similar
to the modal haplotype for Z280 (K type) samples.
A simple explanation: Z280 and Z93 are “brother” haplogroups, and
neither is particularly young. The MRCAs of these two haplogroups apparently had very similar STR
values. Originally, both grew rapidly,
before significant sub clades could form with STR
mutations at slow mutating markers.
Over the years, both haplogroups diversified in STR values. As a result, many subclades in Z280 and Z93 today have
overlapping STR values. Population bottlenecks eventually produced some
subclades with good STR signatures, such as A type, which is very
well isolated in haplospace. This paragraph is a simple explanation of
why it is difficult to distinguish all Z93 samples; other explanations are possible, including complicated
explanations.
Z93 is a good example of why
calculating age of haplogroups is highly uncertain. A type seems to be very young. A type dominates Z93 in the Polish
Project. Maybe A type had a particularly
vigorous population expansion; or maybe A type luckily avoided a severe
population bottleneck; or maybe the A
type ancestors moved to Central Europe from distant lands; whatever.
Age is calculated from STR variance, so the age of Z93 is dominated by
the age of A, which is misleading and too young. If A type samples are excluded, the age of Z93 still would come
out too young, because the A type samples have a unique STR signature, which
means significant STR mutations, which means the A type MRCA lived at a time
when Z93 was already quite old, so the A data needs to be considered when
estimating the age of Z93. I’ll try to
come up with an age estimate, for next time I update this topic.
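To illustrate the first half of this argument, here is a toy Python sketch of a variance-based age estimate. It is a common textbook simplification (mean per-marker variance divided by a mutation rate), not necessarily the calculation used for this web page. The mutation rate is a placeholder assumption and the haplotypes are invented for illustration, not real Polish Project data.

```python
from statistics import mean, pvariance


def variance_age(haplotypes, mutation_rate=0.002):
    """Rough age in generations: mean per-marker variance / mutation rate.

    haplotypes    -- list of equal-length lists of STR repeat counts
    mutation_rate -- assumed mutations per marker per generation (placeholder)
    """
    if len(haplotypes) < 2:
        raise ValueError("need at least two haplotypes")
    columns = list(zip(*haplotypes))          # one tuple of values per marker
    return mean(pvariance(col) for col in columns) / mutation_rate


# Invented toy data: a small diverse "old" group and a large, nearly
# uniform "young" cluster centred on the same modal values.
old_group = [[13, 25, 16], [14, 24, 15], [12, 26, 17], [13, 25, 16]]
young_cluster = [[13, 25, 16]] * 7 + [[13, 25, 17]]

print(variance_age(old_group))
print(variance_age(old_group + young_cluster))  # pulled toward "young"
```

The toy shows only that a numerically dominant young cluster lowers the pooled variance. It does not model the second point made above, that a unique STR signature itself implies an older parent haplogroup.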
End of R1a Categories.
On 20 July I added the following
three R1b Types to this web document (next three subtopics,
L23EE, L47P, L47A).
Mayka had
already added these three to the Polish Project web
page during the previous week, on my
recommendation, which was based on my SBP analysis.
I independently found these three by
analyzing the Polish Project R1b data, but Mayka pointed out they were
previously known as clusters. We judge that my analysis justifies adding them to our list of
types. Since I’m using 639 samples with
67 marker data as representative of Poland,
a small type clade at 1% of the Polish population would be expected to have
roughly 6 samples in the database (70% confidence interval 4 to 10). These three small types are roughly 1% each.
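The expected count can be checked with a short Python sketch. It assumes a simple binomial model (each of the 639 samples independently belongs to the type with probability 1%) and a central interval built from the cumulative distribution. Both are assumptions made for illustration; other ways of building the interval can move the endpoints by a count or so.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k members among n samples."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def central_interval(n, p, level=0.70):
    """Counts at which the cumulative probability first reaches each tail bound."""
    tail = (1 - level) / 2
    cdf = 0.0
    low = high = None
    for k in range(n + 1):
        cdf += binom_pmf(k, n, p)
        if low is None and cdf >= tail:
            low = k
        if cdf >= 1 - tail:
            high = k
            break
    return low, high


n_samples = 639      # samples with 67 markers, from the text
frequency = 0.01     # a type at 1% of the population
expected = n_samples * frequency
low, high = central_interval(n_samples, frequency)
print(f"expected {expected:.1f} samples, 70% interval {low} to {high}")
```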
I’m following the current ISOGG codes for these types, which may be confusing compared to the current FTDNA codes.
The STR definitions for these are
available at Haplotypes.xls. PolishCladesUpdate has a
link to an Excel analysis file for each of these three types.
Instructions
for Ysearch comparison are below.
Here is the “UserIDs” bar for R1b comparison:
USEID, CX94E, MKM4R, 7HB9C
Change USEID to your User ID.
Reminder: These three types are calibrated to Polish data. The definition modal haplotypes may not be
optimal for other regions. If you have
Polish ancestors, and if you have all 67
markers, and if you match one of these within a step distance of 10 there
is more than 80% probability that you belong to the corresponding clade. Up to step 15 there is lower probability
that you belong. You should test the
appropriate SNPs (explained below) for higher confidence. If your ancestors are not from Eastern
Europe and you are a marginal match (step distance 5 to 15) for one of these,
it is not very probable that you belong to the corresponding Polish clade,
because each of these types has some overlap with other clades that are rare in
Poland.
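For readers who want to check a match by hand, the comparison can be sketched in Python. The sketch assumes that "step distance" is the sum of absolute repeat-count differences over the markers compared, which ignores special handling of multi-copy markers. The function names and the marker values are hypothetical; the bands of 10 and 15 steps are taken from the paragraph above and apply only to 67 markers and Polish ancestry.

```python
def step_distance(sample, modal):
    """Sum of absolute repeat-count differences over shared markers."""
    shared = sample.keys() & modal.keys()
    if not shared:
        raise ValueError("no markers in common")
    return sum(abs(sample[m] - modal[m]) for m in shared)


def match_band(distance, confident=10, marginal=15):
    """Translate a 67-marker step distance into the bands described above."""
    if distance <= confident:
        return "probable member"   # more than 80% probability per the text
    if distance <= marginal:
        return "marginal match"    # lower probability; test the SNPs
    return "no match"


# Hypothetical three-marker excerpt; a real comparison uses all 67 markers.
modal = {"DYS393": 13, "DYS390": 24, "DYS19": 14}
sample = {"DYS393": 13, "DYS390": 25, "DYS19": 16}

d = step_distance(sample, modal)   # 0 + 1 + 2 = 3
print(d, match_band(d))
```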
L23EE. 20 Jul
2010 documentation: This type is positive for
the L23 SNP, hence this type is a hypothetical future
haplogroup within the current haplogroup R1b1b2a. This type is negative for L51, the only currently known branch -
R1b1b2a1 - of L23.
Nordtvedt
pointed out the cluster for this type some
years ago, calling it R1b-EE (Eastern Europe). Mayka suggested the L23EE code to me.
There are only 6 samples in the
Polish Project in this type (13 Jul 2010). SBP = 10.7% using all 67 markers, which is excellent for such a
small type. The cutoff is 12, but if you match at step 10 through 12 I
estimate your probability of belonging at slightly better than 80%, so you
really should test for the L51 SNP - a negative result would boost the
probability to about 95%. In the Polish
Project, there is a gap of 5 - no samples from steps 12 through 16 and all 6 of
the samples from step 17 to 20 are L51+.
So this type is very well isolated in
haplospace in Poland.
On Ysearch (code CX94E) there are also 6 samples in this
type (13 Jul 2010), but 3 are the same as in the Polish Project. There are 7 samples at step 12 (vs zero in
the Polish Project) and only 2 of those 7 are East European - one each in
Germany and Russia. That means this
type is not well isolated worldwide, meaning samples near the cutoff are
highly uncertain. I interpret this as
evidence that my definition of L23EE type is really a Polish subtype within a
larger L23EE cluster.
This type has evidence of structure. A number of markers are bimodal with no
obvious correlation. To me, that means
there are probably at least 3 sub-clades that may become evident as data
accumulates.
If you match this type closely at 37 markers I highly
recommend getting the full 67, because the
statistics for assignment are not convincing at 37 markers. Even at 67 markers, I recommend the L51
test; a negative result confirms
membership in this hypothetical clade, and a positive result means you are not
a member. We do not know the probability
of outsiders matching L23EE in STR values, particularly outside Poland, so there
is still a slim chance of a surprise - a close match to the definition but with
L51+.
L47P. 20 Jul 2010
documentation: This type is positive for
the L47 SNP, hence this type is a hypothetical future haplogroup
within the current haplogroup R1b1b2a1a1d1.
This type is probably negative for L44, the only currently known branch -
R1b1b2a1a1d1a - of L47, but that L44 negative indication is based on only one
sample so far, so it is not certain.
Mayka announced the cluster
corresponding to this type on the web in
March 2009.
There are only 4 samples in the
Polish Project in this type (13 Jul 2010). SBP = 9.3% using 64 markers, which is excellent for such a small
type. The cutoff is 7 and the gap is 10.
There are no samples from step 7 to 16.
Although samples in that wide gap are expected as data accumulates, this
type is very well isolated in haplospace in
Polish data.
This type is very robust;
the same 4 samples are selected using any number of markers from 10
to 67 with SBP <25%.
Actually, this type is even better than the SBP = 9.3%
indicates, because some of the samples at step 17 and beyond have tested
negative for the SNPs in the R1b trunk leading to L47 so they clearly do not
belong to this L47P hypothetical clade.
Ysearch (code MKM4R) also has 4 samples (13 Jul 2010), but
3 of them are the same as the Polish Project.
Ysearch has 8 samples at steps 8 to 12, so the type is not as well
isolated worldwide.
The “P” in the code L47P represents
my hypothesis that this type is Polish.
Members of this type should test for
L47, because Ysearch does have one STR matching sample listed as R1b1b2a1b,
which is equivalent to P312, an “uncle” haplogroup that is L47 negative. That means there may be some interference in
STR matching, probably less than 10% in Polish data, but I will not know what the
exact percent interference is until more data accumulates.
See the last paragraph of L47A, next
topic, for more comments.
L47A. 20 Jul 2010
documentation: This type is positive for
the L47 SNP, hence this type is another hypothetical future
haplogroup within the current haplogroup R1b1b2a1a1d1. I do not know yet if this type is negative
for L44, a known branch of L47.
Mayka suggested
the “A” code, since this type is obviously Ashkenazi, based on family names
(see also Ysearch results, a few paragraphs down). I presume this one is known to the administrators of Jewish DNA
projects, although I did not do the research to find a first web publication at
67 markers; I would appreciate an email
of a reference to add here, even if it does not exactly match my
definition. It’s OK if an international
modal haplotype differs by a few markers from a haplotype determined in Poland,
particularly if the difference is at markers that are bimodal, indicating
subtype structure.
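A modal haplotype, and a crude flag for markers that might be bimodal, can be sketched in Python as follows. This is a minimal illustration, not the procedure used for Haplotypes.xls: the function name and the sample values are hypothetical, and an exact tie between two values is only one rough sign of a bimodal marker.

```python
from statistics import multimode


def modal_haplotype(haplotypes):
    """Most common value(s) per marker; two or more values means a tie."""
    markers = haplotypes[0].keys()
    return {m: sorted(multimode(h[m] for h in haplotypes)) for m in markers}


# Invented values for four samples at two markers.
samples = [
    {"DYS393": 13, "DYS390": 24},
    {"DYS393": 13, "DYS390": 24},
    {"DYS393": 13, "DYS390": 25},
    {"DYS393": 13, "DYS390": 25},
]

modal = modal_haplotype(samples)
tied = [m for m, values in modal.items() if len(values) > 1]
print(modal)   # DYS393 has one modal value, DYS390 has two
print(tied)    # markers where the modal value is ambiguous
```

Where a marker is tied like this, two definitions built from different sample sets can easily pick different modal values, which is the situation described above.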
There are only 5 samples in the
Polish Project in this type (13 Jul 2010). SBP = 7.6% using all 67 markers, which is excellent for such a
small type. The cutoff is 10 and the gap is 9. There are no samples from steps 9 to
18. Although samples in that wide gap
are expected as data accumulates, this type is very well isolated in haplospace.
This type is very robust;
the same 4 samples are selected using any number of markers from 30
to 67 with SBP <10%.
This type is better yet on Ysearch (code 7HB9C), with 18 samples
(13 Jul 2010)
for better statistics; SBP = 4.6%,
which is remarkable. It might be even
better with an optimized definition; I
used the modal haplotype that I extracted from the 4 Polish Project samples.
This
one does not seem as Polish as L47P, although those 18 Ysearch samples are
concentrated in "Greater Poland" including Lithuania.
So far (see ISOGG), L47 and L148 are the only two known branch haplogroups of L48. In the Polish Project so far (20 July), no one has tested for L148, and all L48 samples at 67 markers are either L47P (previous topic) or L47A. SNP data is not posted on the web, so I do not know the frequency (prediction probability) of L48 samples that match neither L47P nor L47A and so belong to yet other clades. I also have not searched the web for the STR values expected for L148. (There are two samples at 37 markers listed in the Polish Project with L48+, listed as R1b1b2a1a4 by FTDNA, but this is not enough for statistical estimation.)