Polish Y-DNA R1a Clades
16-Jan-10
Peter Gwozdz
pete2g2@comcast.net
If this is your first time here,
consider jumping down to the Abstract for a summary of this
web document.
My methods and results have been
recently published.
The Polish
Project has my assignments of men to types as a
subdivision of haplogroup R1a, which is a
category of Y-DNA.
This web document is for explanation,
details, and update news.
The Results
Table has a summary of assignments.
For more explanation of R1a
subdivision based on STR correlations, see www.gwozdz.org/R1a.html.
M458 News
The big news is the M458 data. P type and N type are coming out M458+,
which means R1a1a7. For details, see
the M458 News topic at the top of my R1a
page. On the Polish Project Y-DNA page,
I’m using the categories P Borderline and N Borderline to highlight the samples
most likely to benefit from the M458 SNP test.
Polish Project R1a Assignment News
The Polish
Project has 87 results for this new SNP test (on 7 Jan). All results are consistent with the R1a
assignments I have been making for the Polish project. Check out the topic New
Information below for an explanation of the new M458 SNP.
P type and N type. These and
only these are coming out M458+. That
means if you have all 67 markers and match either P or N you are likely M458+,
which means R1a1a7. If you do not match
P or N at 67 markers you are likely M458-, which means R1a1a*.
Let’s call that a new “P&N
rule”. It has better than 90%
confidence overall for predicting M458 results.
P Borderline & N Borderline: However, there is low confidence for the 13 samples assigned on
the Polish Project Page to “P Borderline” or “N Borderline”. (A “sample” is the STR data for one man, a
member of the Polish Project.) These
samples would benefit from the M458 test, because they are most likely to be
rare exceptions to the new P&N rule.
These are only 8% of the untested samples with 67 markers, and they are
likely to be M458+ based on the 6 samples like this tested so far, but
confidence is low (unknown percent confidence) because 6 is very few results
and because some of these have unique STR profiles.
K Borderline: These are
likely to be M458- but confidence is roughly 80% because there are not many
M458 results yet. In addition, these
have an unknown probability of hiding an undiscovered clade of M458+ that does
not match P or N. The confidence is
higher for those 3 or 4 steps from the K modal haplotype, and lower for those
at 5 or 6 steps. (Samples at fewer than 3 steps
are assigned to K type and those at more than 6 steps are U67.) As more samples in K Borderline test for
M458 we’ll get higher confidence for the M458- prediction. Reminder:
K Borderline and U67 are categories, not statistically valid types: K Borderline is the main R1a tree trunk,
largest category at 67 markers, samples that have not yet been classified as
types because the STR values are close to a uniform distribution; U67 are the Unassigned samples, distant from
the R1a tree trunk.
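The step bands quoted above for samples near K can be written as a small lookup. This is only a sketch: the function name `k_category` is hypothetical, and it assumes a sample with all 67 markers that has not already matched P, N, or another type.

```python
def k_category(step_from_k_modal: int) -> str:
    """Band a 67-marker sample by its step from the K modal haplotype.

    Thresholds are the ones quoted in the text: fewer than 3 steps is
    K type, 3 to 6 steps is K Borderline, more than 6 steps is U67.
    """
    if step_from_k_modal < 0:
        raise ValueError("step cannot be negative")
    if step_from_k_modal < 3:
        return "K"
    if step_from_k_modal <= 6:
        return "K Borderline"
    return "U67"


for step in (2, 3, 6, 7):
    print(step, k_category(step))
```

Within K Borderline, the text gives higher confidence to steps 3 and 4 than to steps 5 and 6; the sketch does not model that gradation.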
U67: Confidence is roughly 80%
M458- for the same reasons as K Borderline.
As these samples test out M458-, they get moved to the R- category,
which is a Remainder category.
P, N, K, A, B, C, I: For
samples that closely match one of the types, confidence is better than 95% in
that new P&N rule for predicting M458 results, based on M458 results so
far. I hesitate to recommend the M458
test for samples with a low chance of learning new information, but there may be a
surprise waiting to be discovered: As
more of these “close matches” continue to test for M458, we may learn that most
of the types have 99% confidence except maybe one type that may turn out
something like 80% confidence due to a “hiding” clade from the opposite group
with similar STR values by luck. Maybe
not. Time will tell.
For samples with less than 67
markers, I expect buying more markers to provide better information than
buying the M458 test, because the 67 markers can predict subtypes within the
M458+ (R1a1a7) and M458- (R1a1a*). An
exception would be those few samples at 25 or 37 markers that are on the STR
border of P or N, where the M458 test can determine the status with confidence.
The Fall issue of the Journal of Genetic Genealogy came out on 21
Nov. My publication is split into two
parts there:
Part I is my “mountains in
haplospace” method for evidence that certain “types” of STR clusters correspond
to clades.
Part II is the application
of that method to Common Polish Clades.
That article has a lot more detail than this web page, but that article
was last updated in September, so this web page is an update.
PolishCladesUpdate is my
folder for future updates to those two articles.
This web page will continue as an
introduction and summary, without as much jargon and detail as the articles and
update folder.
R1a Worldwide
On 14 Nov, I moved the general discussion of R1a to a new
web page, www.gwozdz.org/R1a.html.
That new page has the recent test
results details for M458.
This page continues to emphasize
Polish data.
A new article was published online, 4
Nov, essentially dividing R1a1 into two groups, based on a new SNP, M458.
Abstract. STR Data. See www.gwozdz.org/R1a.html for more
discussion.
I call this article “Underhill” for
short, because his is the lead name in the list of 34 authors for this major
work.
This web page about Polish Clades has
been completely rewritten using this new information. Recent test results for M458
are consistent with (albeit not full proof of) my previous R1a subdivision into
“types” here on this web page about Polish Clades.
Briefly, most of R1a1a is split by
this new mutation into R1a1a7 (M458 positive, or M458+) and R1a1a*
(M458-). See R1a Subdivision for a brief summary of other
groups, and for a clarification of what I mean by R1a1a*.
R1a1a7 is the new M458
haplogroup. R1a1a7 includes what I have
been calling P type and N type here on the web page, even before M458 was
available.
R1a1a* is a new paragroup. This is M458 negative. It includes all my other types, particularly
K type.
This Underhill article has data for
158 “Poland” samples (Table 2):
R1a1a*: 71 samples 44.9%
R1a1a7: 87 samples 55.1%
The 70% confidence interval for
R1a1a7 is about 50% to 60%.
Worldwide 77% of the Underhill data
is R1a1a* (222 in Table 7 vs 68 R1a1a7, 290 total).
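The percentages and the rough 70% interval above can be reproduced with a normal-approximation interval for a proportion. This is a sketch under that assumption; the page does not say which interval formula was actually used, and the function name `binomial_interval` is hypothetical.

```python
import math
from statistics import NormalDist


def binomial_interval(successes, total, confidence):
    """Normal-approximation interval for a proportion (an assumed method)."""
    p = successes / total
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width


# Counts quoted above from the Underhill article.
poland_r1a1a7, poland_total = 87, 158
low, high = binomial_interval(poland_r1a1a7, poland_total, 0.70)
print(f"Poland R1a1a7: {poland_r1a1a7 / poland_total:.1%}")
print(f"70% interval: {low:.1%} to {high:.1%}")

worldwide_share = 222 / 290  # R1a1a* out of all samples worldwide
print(f"Worldwide R1a1a*: {worldwide_share:.0%}")
```

With 158 samples the interval comes out at roughly 51% to 59%, which agrees with the "about 50% to 60%" quoted above.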
M458
Results are coming in now for the new SNP test and the Polish Project R1a is splitting about evenly, with a
few percent more R1a1a7 than R1a1a*, although the latter is more common
worldwide.
7 Jan results: 120 samples are 62 R1a1a* (52%) vs 58
R1a1a7, but the results available to me are biased toward Poland. Within the Polish Project there are 87 samples,
45 R1a1a7 (52%) vs 42 R1a1a*.
This Abstract is for people
reasonably familiar with the jargon of genetic genealogy. If you are new to genetic genealogy you
might prefer to read the Introduction first.
This web document has three
purposes: 1. More detailed explanations
for the men (samples) that I assign to types
in the Polish
Project. 2. Summary of my published results.
3. Update with recent results.
The topic is common Polish Y-DNA clades - identification of male line Y-DNA clusters
that are concentrated in Poland.
Since I originally posted this in
December 2007, emphasis has been haplogroup R1a, because
about half of Polish men are R1a, with no subdivision into smaller
haplogroups. A new
division, roughly 50-50, between R1a1a* and R1a1a7, became available in
November. I will soon expand this page
to include clades from other haplogroups that seem to be concentrated in
Poland. In December 2009 I rewrote and
moved my general R1a analysis to another
web document.
I use the word type
to mean an STR cluster with statistical validity as
established by my Mountain Method. I expect my types to be validated some day by discovery of new SNPs that will qualify them as haplogroups. I chose the word “type” because it is not
generally used in genetic genealogy and I wish to distinguish my types from
haplogroups and from other clusters.
All types have associated clusters but not all clusters qualify as types. In my publications
and web pages I make it clear which types I have discovered in web data and
which types were suggested to me by others, with references. Sometimes I discover a type and later find
out someone else had mentioned it earlier on the web; let me know if you the reader have more clues and references for
me.
Most of the types that I have identified
seem to be 1,000 to 5,000 years old, so all the men in each type seem to be
descended in direct male lines from one man (MRCA) who
lived that long ago (TMRCA). A few of my types might be younger or older than that range.
I use phrases like “seem to be” over
and over because my methods are statistical.
On the Polish
Project Y-DNA Results page, I assign samples with at least 80%
statistical confidence, which means most assignments
have better than 80% probability of being correct. A few assignments are up to 99% probability. About 95% of the assignments are correct in
my opinion.
Because of the restriction to 80%
confidence, most men in the Polish Project are not assigned to types on the
Polish Project web page. I provide an Excel File
assigning all men, with lower confidence.
That file also has more detailed notes about the assignment method.
I divided the R1a Polish data into 4 clusters
based on STR data. About half the men
of Polish male line ancestry belong to the R1a haplogroup, and that group
divides roughly equally into these 4 clusters.
I call them P type, N type, K type, and R. Only P and N are in R1a1a7.
R, Remainder, is not a type. I use R for samples where I have 80%
confidence that they do not belong to K type or to one of the rare types other
than K that I have identified in R1a1a* so far.
Borderline clusters are not types but
seem close to types. Each Borderline
cluster has discussion below. I use
borderline clusters for samples with all 67 standard
markers, where the confidence of assignment to the corresponding type is lower
than 80%, but the samples are close enough to deserve a separate category.
U, Unassigned, is my assignment for
samples without confident assignment and with less than the full 67 standard
markers. This is the largest category
in the Polish Project; my comment about
4 equal categories refers to samples with the full 67, taken as representative
of ethnic Polish men, with caveats explained in my publication. I am temporarily using a U67 category for
samples with 67 markers that may benefit from the new M458 test; these will be reclassified as R when the
bounds of M458 are established.
I have 99% overall confidence that P
and N correspond to clades, valid subdivisions of R1a1a7. Individual samples have variable confidence
greater than 80% due to the remote chance of unusual outliers from other
types. K type is R1a1a but not R1a1a7. My overall confidence in K type is only 85%
because there seem to be unidentified types with STR values close to K. The modal haplotype for K is essentially the same as the modal
haplotype for all of R1a. However, I
have identified subtypes of K that have much higher confidence. In other words, in this case I have higher
confidence for some individual samples.
I have high confidence in the subtypes although I am not sure all the
subtypes belong to exactly the same clade along with all the other samples that
I have assigned to K outside the subtypes.
Even if K is not a true clade as defined, however, it is clear that all the
K samples belong to branches in the R1a1a* tree with nodes very close to each
other. The only uncertainty is that
there are likely many other samples that belong in other branches just as close
to K.
P type is concentrated in Poland,
rare elsewhere. N type seems to be
mostly Slavic, widespread in eastern Europe.
K type corresponds to one of the two largest R1a1 clusters. Another large R1a1a cluster, the one I call
L type, is not common in Poland.
The subtypes of K are A, B, and
I. Type C is a rare small type in
R1a1a* distinct from K.
Thanks go to Lawrence
Mayka, Polish Project administrator, for extensive email information and
assistance.
You can compare data to my types by
clicking this link to instructions for Ysearch.
Reminder: I am concentrating on Poland.
The statistics of STR clusters depend a lot on the data base. For example, P type stands out dramatically
in Polish data. In other countries P
type is rare. If you belong to an R1a1
cluster that is rare in Poland, I’m sorry, but I’m not covering you. K type is an example of a type that is
common both in Poland and elsewhere. M
type is common in northwest Europe but so far absent in the Polish Project.
This Introduction is for people
unfamiliar with the jargon of genetic genealogy.
There are quite a few web sites with
a general introduction to the subject of genetic genealogy, for example Wikipedia, FTDNA, and Genographic. Back issues of JoGG
are good general references. The Y Chromosome Wikipedia
article is about male line DNA, also called Y-DNA.
The following several paragraphs are
a brief introduction to genetic genealogy for Y-DNA, providing some definitions
of jargon needed to read my web pages.
The definition words are boldface.
I often use links to those definitions when I use a jargon word for the
first time in a topic. There are more
boldface definitions in the summary of my Methods.
The Y chromosome gets passed from
father to son, so it works just like a male family name. Men are divided into haplogroups based on known rare mutations (called SNP) in the Y chromosome.
Division into haplogroups is done in a manner that has virtually 100%
confidence. I say “virtually” because
your confidence in your DNA result from your DNA testing company might be 98%
or 99% or 99.9%; the confidence for
haplogroups is better than that. We can
be virtually certain that all the men in a haplogroup descend in direct male
lines from one man, called the “Most Recent Common Ancestor” (MRCA)
for that haplogroup. Time of the Most
Recent Common Ancestor (TMRCA) is an estimate of how
long ago he lived - the age of the haplogroup.
Lots of people are working hard to discover more SNPs on the Y
chromosome so that the haplogroups can be divided further into smaller
haplogroups. I’m doing some work on
this, but I’m not discussing it in this web document.
Other people, like me in this
document, try to “stay ahead” of the haplogroups by analyzing other mutations
that are not so rare (called STR) on the Y
chromosome. Men submit their Y-DNA data
to various web sites. There is lots of
STR data available on the web. Men are
divided into STR clusters as hypothetical subdivisions of the
haplogroups. All such clusters are
hypothetical. Some will be validated in
the future by new SNP discoveries.
There are various statistical methods for estimating the confidence of
STR clusters. I recently published a method that I developed. That publication has references to other
methods. There is a brief summary of my
method below.
A few STR clusters are small family
clusters, with the same family name.
Y-DNA is biologically accurate, so some men discover that their Y-DNA
does not match the DNA of their male line cousins identified by genealogy
research, due to secret adoptions, illegitimacies, etc. This is one of the reasons some people
prefer to avoid genetic genealogy. The
male line associated with the Y-chromosome is only one ancestral line. Humans have 23 pairs of chromosomes. Anyone who tries to make a family tree going
back 300 years has more than a thousand root tips to be filled by names of
ancestors who lived back then; the one
man at the tip of the male line root is only one of those thousand. That is another reason some genealogists
avoid Y-DNA genetic genealogy - the emphasis on only one line of descent out of
many. That said, many people enjoy the
challenging hobby of figuring out to which ancient extended male line they
belong.
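The "more than a thousand root tips" figure follows from doubling the number of ancestor slots each generation. A minimal sketch, assuming 30 years per generation (my assumption, not a figure from this page) and ignoring the fact that distant cousins marrying can fill several slots with the same person:

```python
YEARS_BACK = 300
YEARS_PER_GENERATION = 30  # assumption, not a figure from the text

generations = YEARS_BACK // YEARS_PER_GENERATION
ancestor_slots = 2 ** generations  # each generation doubles the slots
male_line_share = 1 / ancestor_slots  # the Y-DNA line fills one slot

print(generations, ancestor_slots, f"{male_line_share:.2%}")
```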
Most STR based clusters have an MRCA
who lived thousands of years ago, before family names were common, so most men
assigned to a typical cluster do not have the same family name.
Most SNP based haplogroups have an
MRCA who lived more than ten thousand years ago, so these span multiple ethnic
groups and nationalities. For example,
the R1a
haplogroup is of interest to me. R1a is
most common in Slavic countries but calling R1a Slavic is misleading because it
is found throughout Europe and west Asia.
The MRCA lived so long ago that he may have spoken a language that we
would not consider Slavic if we could hear it.
It is possible that he did not even live in what is now the Slavic
region of Europe; maybe his descendants
moved there in a massive migration from the Asian steppes, or from India. No one knows for sure. Even if he was proto-Slavic in language and
culture, by now some of his descendants long ago moved to other parts of Europe
and Asia. One of the appeals of genetic
genealogy is trying to figure out ethnic descent and migration from the
statistics of haplogroups. Some people
object, pointing out that ethnicity cannot be defined genetically because of
all the moving and mixing of people over the millennia, and because the Y
chromosome is only one of many. True
enough. Some individuals and some web
sites go too far with genetic claims.
That said, statistical analysis of haplogroup data provides many clues
on human origins.
Again, some people try to stay ahead
of haplogroups, using statistical analysis of STR based clusters to gain
insight into more recent human origins.
I am one of those people. My
interest is Polish origins. This web
document, however, is not for the historical analysis and conclusions, except
for occasional comments to remind us of the goal. This document is dedicated to STR analysis results, identifying
clusters concentrated in Poland, with detailed explanations.
The bottom of my Method section has
more definitions for a number of genetic genealogy
terms.
There are a number of organizations
and commercial companies on the web where you can order a cheek swab kit to
mail in for genetic genealogy analysis, for example FTDNA. I am not associated with the company
FTDNA; I mention them because I make
extensive use of their data; check
Google for competitors. At FTDNA, click
on Products for cheek swab kits. DNA
results are confidential unless you register the data at a database; at FTDNA, click on Projects to register your
data into one of the many databases;
for example, most of my analysis is from the data in the FTDNA Polish Project.
I use the FTDNA standard set of 67 STR markers (plus a few
non-standard ones occasionally). I do some
analysis using the standard FTDNA 12, 25, or 37 STR marker sets. Other companies use standard marker sets
that may not overlap with all the FTDNA markers.
Ysearch is the
largest web database for Y-DNA, run by FTDNA, open to all men, including men
who also register with projects and including men with data from other testing
services. I use Ysearch often for
analysis so of course I encourage you to register your Y-DNA data at Ysearch. From the FTDNA site, you can register your
data with Ysearch. Or you can type your Y-STR data into Ysearch.
Now For The Details
Up to here, I have tried to write
this web page as news and summary, with links to more discussion below. I hope anyone having minimal familiarity
with genetic genealogy jargon has understood.
From here on it gets more detailed.
I’m sorry about that, but the audience from now on is readers with
genetic genealogy experience who want to know how I came to my conclusions. If you cannot follow the remainder due to
jargon, it is written in a manner that you can jump around and pick out what
you do understand, then come back after you have read more about genetic
genealogy.
If you open this html document with
Word, all the link targets (bookmarks) can be viewed alphabetically or by
location.
Polish Project Assignments at 67
Markers, taken as representative of ethnic Polish.
Click here for instructions for comparing your sample
to the Ysearch links.
Click on the link in the last column
to jump down to more discussion for that type.
| Cluster | Group | Type | Subtype | Subcluster | Samples | Polish % | Ysearch | Link |
| P | | | | | 47 | 8.9% | | |
| | R1a1a7 | P | | | 41 | 7.8% | | |
| | R1a | | | PB | 6 | 1.1% | | |
| N | | | | | 46 | 8.7% | | |
| | R1a1a7 | N | | | 32 | 6.1% | | |
| | R1a | | | NB | 14 | 2.7% | | |
| K | R1a1a* | K | | | 61 | 11.5% | | |
| | R1a1a* | | K* | | 28 | 5.3% | | |
| | R1a1a* | | A | | 9 | 1.7% | | |
| | R1a1a* | | B | | 6 | 1.1% | | |
| | R1a1a* | | D | | 7 | 1.3% | | |
| | R1a1a* | | I | | 11 | 2.1% | | |
| R | | | | | 69 | 13.1% | | |
| | R1a1a* | | | KB | 46 | 8.7% | | |
| | R1a1a* | G | | | ? | | | |
| | R1a1a* | C | | | 1 | 0.2% | | |
| | R1a1a* | L | | | ? | | | |
| | R1a1a* | | M | | 0 | 0% | | |
| | R1a | | | R- | 5 | 0.9% | | |
| | R1a | | | U67 | 17 | 3.2% | | |
| Totals | | | | | 225 | 42.2% | | |
Those Types and
Subtypes are my own code letters, for brevity.
Please do not confuse these code letters with official haplogroups. I
have been using such code letters for R1a assignments in the Polish Project for 2 years. The color coding is for ease of comparison on my web pages.
This table is from analysis of the Polish Project, using the data with 67 markers. The percentage results are for data as of 17
Dec 2009, with assignments updated on 9 Jan.
My R1a worldwide document is www.gwozdz.org/R1a.html,
including a list of R1a types, along with
percentages worldwide, which are not the same as percentages in Poland.
The 3 types P, N, and K are
hypothetical clades of R1a1 Y-DNA. Insofar as the Polish Project represents
Polish male line ancestry, the % column is my best estimate of the frequency of
these types in Polish male ancestry. Each
individual assignment to a type or subtype has at least 80% confidence; the
overall table is probably about 95% accurate.
R cluster is the
Remainder, samples (men) who do not belong to P, N, or K types.
P and
N types are R1a1a7 and everything else so far is R1a1a*, although rare exceptions are expected.
P Borderline
and N Borderline (PB and NB) clusters:
These are the samples that will benefit most
from the M458 test. I’m not sure,
but I suppose these might be about 70% M458+ (R1a1a7) and 30% M458- (R1a1a*),
so I labeled them as group R1a. I
counted the PB as part of the P “cluster” but I distinguish the P type samples that
are verified M458+ or have at least 80% confidence of being M458+. Similarly for NB. In other words, a few of the PB and NB samples are probably not P
or N type. Many of the PB and NB
samples were KB before the M458 test came along; I moved them here in order to emphasize which samples can benefit
most from the M458 test.
K Borderline
(KB): These are samples that have low
step compared to the K type modal haplotype, but are not
close enough to individually have 80% confidence of
belonging to K type. On the other hand,
these do not individually have 80% confidence of not belonging to K
type. A few of these no doubt belong to
K type, but I estimate about 90% do not, so I put KB in R. I keep them separate because the rest of R
is samples with at least 80% confidence of not belonging to P, N, or K. The K modal haplotype is essentially the modal haplotype for R1a as
a whole, so the R1a tree is very densely populated near K in haplospace. Some
of the KB cluster probably represents small clades that are close to K, with
nodes in the tree somewhat older or somewhat younger than the definition node
for K. Actually, B and D types are like
this; for each, I estimate about 70%
probability that they have nodes younger than K, qualifying them as subtypes of
K; I estimate about 20% probability
they have a node just slightly older than K;
I estimate (actually an educated guess) about 10% probability they are
very distant from K with STR values that are similar by coincidence due to the STR values of the MRCA. So the percent for K in the table above
represents a good estimate for the size of K in Poland - a few KB should be
added to K, but perhaps one of the subtypes should be subtracted. However, I emphasize that the individual
assignments to these subtypes are high confidence regardless of the exact
location of the node for the subtype in the R1a tree. There is no clear STR border between K, KB, and R; I estimated the K
step value to distinguish them. I
anticipate that many KB clades will be difficult to define with STR values, so
many of these may need to wait for SNPs to distinguish them.
R is the
remainder cluster category, including some small types, that are distant from
P, N, and K. Samples with at least 80% confidence of not belonging to any of my defined types
go into the R- and U67 subclusters. R-
are the samples that have been tested M458-.
The concept of R is very similar to the concept of R1a*; if we imagine that someone discovers an SNP
that defines the node of K type, and considering that M458+ defines P plus N types, then the R
cluster would be a type of R1a* - a hypothetical paragroup of R1a that does not
include P, N, or K types. So far no
sample in the Polish Project has been identified from the rare R1a groups; such a sample would probably have unusual STR values and thereby
fall into R- or U67, so I used “R1a” in the table.
U67 is a temporary
category that was set up when the M458 test became available. These are R samples that have not been
tested for M458. Ordinarily I do not
use U for samples with all 67 markers, but in view of the new test these were
highlighted U67 to emphasize that they could gain information by trying the new
M458 test. There is still a chance a
few of these might come out M458+, representing a new clade. We now have enough M458 data that I guess
each sample has about 70% confidence of being M458-; in a few months, if none of these turn up M458+, I’ll move these
into R-.
I’ll keep the R category even as
other clades within R become confident, because it is a good overview to divide
Polish R1a into 4 clusters of about equal size.
M is a subtype
of a larger L type.
M and L are common in northwest Europe, but rare in Poland.
Types N, A, and K do not seem
concentrated in Poland but are found mostly in Slavic countries.
P, D, and I types are concentrated in
Poland.
Click on the links to jump down to
more discussion of types.
The Ysearch
links provide the full modal haplotypes, using the indicated number
of STR markers, out of the standard FTDNA
set of 67 markers. I entered this data
into Ysearch for our convenience. All
my modal haplotype definitions are
available in the Excel file Haplotypes.xls,
which also has experimental types not mentioned here. Below are Ysearch
instructions for quickly comparing your haplotype to all my types at once.
G type is experimental, with about
70% confidence at this time. Pc and Pg
are hypothetical subtypes of P type, with less than 80% confidence. No assignments yet, but confidence
of small types is dominated by sampling statistics, which improves rapidly (or
fails rapidly) as data accumulates.
Assignment to types is with 80% net confidence (confidence that the type is valid multiplied
by confidence that the sample belongs to the type). For subtypes of K, confidence that the subtype is indeed a
subtype of K is not included in the net confidence. In other words, samples likely belong to the subtype even if the
subtype is placed wrong. These subtypes
very likely have nodes close to the K node, however.
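The net confidence described above is a product of two probabilities. A small sketch of the arithmetic follows; the function name is hypothetical, 0.99 echoes the overall confidence quoted for P and N on this page, and the two membership values are made up for illustration.

```python
def net_confidence(type_valid: float, sample_in_type: float) -> float:
    """Probability that the type is valid AND the sample belongs to it."""
    return type_valid * sample_in_type


THRESHOLD = 0.80  # the 80% assignment rule

# 0.85 and 0.80 are made-up membership confidences for illustration.
for membership in (0.85, 0.80):
    net = net_confidence(0.99, membership)
    print(f"{membership:.2f} -> net {net:.3f}, assign: {net >= THRESHOLD}")
```

A sample that sits right at 80% membership falls just short once the confidence in the type itself is multiplied in.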
The estimated percentages for P, N, K,
and R in the Results Table add up to 42.2%, which is the percent of R1a in the
Polish Project.
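As a quick consistency check, the sample counts and percentages in the Results Table can be added up in a few lines. The numbers are copied from the table; the two rows with "?" for Samples (G and L) are left out, and the variable names are my own.

```python
# Sample counts copied from the Results Table (rows marked "?" are omitted).
clusters = {
    "P": {"total": 47, "parts": {"P": 41, "PB": 6}},
    "N": {"total": 46, "parts": {"N": 32, "NB": 14}},
    "K": {"total": 61, "parts": {"K*": 28, "A": 9, "B": 6, "D": 7, "I": 11}},
    "R": {"total": 69, "parts": {"KB": 46, "C": 1, "M": 0, "R-": 5, "U67": 17}},
}
percents = {"P": 8.9, "N": 8.7, "K": 11.5, "R": 13.1}

# Each cluster's sub-rows should add up to the cluster's own sample count.
for name, row in clusters.items():
    assert sum(row["parts"].values()) == row["total"], name

total_percent = round(sum(percents.values()), 1)
print(total_percent)
```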
Here is a summary of terms (in
boldface) that I defined for my “Mountains in Haplospace” method. For more explanation, see the fall issue of JoGG. By haplospace I mean
multidimensional sets of STR values;
each haplotype is a point in haplospace.
A cluster
qualifies as a type if the graph of step frequency (number of samples at
that step) vs step looks like an isolated mountain. The step is the
genetic distance (mutation count) from the modal haplotype of the cluster. I use the method of Ysearch to calculate step. The cutoff is the
next step just beyond the mountain. A
good type has low step frequency in a “gap” of step
values including the cutoff (only the cutoff for a gap of 1). In other words, the cluster forms a mountain
at step values less than the cutoff, separated by a gap from the rest of the
database (the parent haplogroup usually) at higher step numbers.
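The step frequency graph and the gap test can be sketched in a few lines. This is not the published method and it does not compute SBP. It assumes genetic distance is the summed absolute difference per marker, which may differ from how Ysearch treats some markers, and the marker names and values are made up.

```python
from collections import Counter


def step(sample, modal):
    """Genetic distance as the summed absolute difference per marker.

    Assumption: a simple stepwise count; Ysearch may treat some
    markers differently.
    """
    return sum(abs(sample[m] - modal[m]) for m in modal)


def step_frequency(samples, modal):
    """Number of samples at each step from the modal haplotype."""
    return Counter(step(s, modal) for s in samples)


def looks_isolated(freq, cutoff, gap=1, max_in_gap=0):
    """True if steps cutoff..cutoff+gap-1 hold at most max_in_gap samples."""
    in_gap = sum(freq.get(s, 0) for s in range(cutoff, cutoff + gap))
    below = sum(n for s, n in freq.items() if s < cutoff)
    return below > 0 and in_gap <= max_in_gap


# Made-up 4-marker data: a tight cluster plus distant background samples.
modal = {"m1": 13, "m2": 25, "m3": 16, "m4": 10}
samples = [
    {"m1": 13, "m2": 25, "m3": 16, "m4": 10},
    {"m1": 13, "m2": 25, "m3": 16, "m4": 11},
    {"m1": 13, "m2": 24, "m3": 16, "m4": 10},
    {"m1": 14, "m2": 25, "m3": 17, "m4": 10},
    {"m1": 15, "m2": 23, "m3": 16, "m4": 10},
    {"m1": 15, "m2": 23, "m3": 17, "m4": 11},
]
freq = step_frequency(samples, modal)
print(sorted(freq.items()))
print(looks_isolated(freq, cutoff=3))
```

In this toy data four samples sit at steps 0 to 2, nothing sits at step 3, and the background starts at step 4, so a cutoff of 3 with a gap of 1 passes the test.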
The Statistical Background Percent
(SBP) is an objective measure of the quality of the
type. Low SBP is taken as evidence that
a type corresponds to a clade that may be verified as a haplogroup in the future by an SNP
(yet to be discovered). Larger types
with lower gaps have lower SBP. SBP is
intended as an estimate of the background percent
of samples in a type that really do not belong to the corresponding
hypothetical clade. SBP is increased to
account for the estimated probability of outliers from other clades. An outlier is a
sample that has very unusual STR values due to the luck of mutations. SBP is also increased to account for the
estimated probability of small foreign clades
that just happen to have the same STR values but are not closely related to the
type. The SBP is also increased to
provide the rough equivalent of the maximum in a confidence
interval. Small sample counts have wide
confidence intervals. So larger types
(more samples) automatically get lower SBP.
For a valid clade, SBP should decrease with time as data accumulates in
a database. A very well isolated clade
will have a low SBP even with only a few samples. SBP < 5% is very rare - a very well isolated type, very likely
to be a clade. SBP < 25% is good
enough to be published. SBP < 50% is
a type worth watching as data accumulates with time. The SBP equation (available as an Excel worksheet in the tools) produces SBP > 100% for clusters that do not look
like mountains. The number of markers
in the definition should be chosen to provide as
small an SBP as possible; my Excel
tools provide automatic rank of
markers as an aid; human judgment can
be used to include or exclude markers with obvious problems. A signature is
a small set of markers that rank best, convenient for publication of a type,
and for simple demonstration of the correlation of STR values.
I use the word “type”
to mean 1) the hypothetical clade, and 2) the associated cluster of data, and 3) the modal haplotype, and 4) all possible haplotypes
that differ from the modal haplotype by step less than the cutoff. The definition
of a type is the modal haplotype plus cutoff.
The definition uses only those STR markers that provide the lowest SBP,
but the definition uses as many STR markers as possible. The definition of a valid type may change
slightly as data accumulates.
Here are
some common terms (in boldface) for genetic genealogy. I did not define these, although I use them
in a restricted sense: A marker (also “locus”, plural loci) is a DNA location for an
SNP or STR or other kind of mutation. A
haplotype is a set of gene values at any number of
markers, here restricted to Y-DNA STR values.
I use the word sample (plural samples or data or database) for the Y-DNA STR values from one man. A sample is also commonly called a
haplotype, but I avoid calling a sample a haplotype to make it clear that a
haplotype may or may not be present in a particular database of samples. A clade is a
general term for common descent, so an SNP haplogroup is one kind of
clade. I use the word clade in general,
when meaning a Y-DNA clade that may or may not be a defined official
haplogroup. All types
have associated hypothetical clades, but most clades cannot be isolated as
types with low SBP. A cluster is a set of samples with similar STR values. All types have associated clusters but not
all clusters are associated with types.
The modal value for a marker is the most common
value in the cluster. The modal
haplotype is the set of most common values, usually the most common haplotype
in a cluster. Many people use the
adjective “modal” as a noun, meaning “modal haplotype”; so do I;
I tried to avoid that in this web document.
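As a rough illustration of the modal value and modal haplotype just defined, here is a minimal Python sketch. The function name `modal_haplotype` and the five three-marker samples are made up for this example; they are not taken from the Polish Project data or from the Excel tools.

```python
from collections import Counter

def modal_haplotype(cluster):
    """Return {marker: most common value} for a list of samples.

    Each sample is a dict mapping a marker name to its STR repeat count.
    All samples are assumed to carry the same markers.
    """
    modal = {}
    for marker in cluster[0]:
        counts = Counter(sample[marker] for sample in cluster)
        modal[marker] = counts.most_common(1)[0][0]
    return modal

# Made-up samples, for illustration only (not real data).
cluster = [
    {"DYS393": 13, "DYS390": 25, "DYS19": 16},
    {"DYS393": 13, "DYS390": 25, "DYS19": 17},
    {"DYS393": 13, "DYS390": 24, "DYS19": 16},
    {"DYS393": 14, "DYS390": 25, "DYS19": 17},
    {"DYS393": 13, "DYS390": 25, "DYS19": 16},
]
modal = modal_haplotype(cluster)
```

In this toy cluster the modal haplotype (13, 25, 16) is also the most common haplotype, which is the usual case but is not guaranteed.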
Not all Y-DNA STR data separates into
types because the distribution of STR values tends to be continuous. A type corresponds to a clade that
experienced a population bottleneck
- isolation or migration or very rapid population growth.
For the Polish
Project “Y-DNA Results” Page.
See the “PolishProjectRules” sheet in
the Excel File
with Assignments.
Genetic Distance assignment rules for
the Polish Project.
Genetic Distance is calculated
following the method of Ysearch.
These rules represent 80% or better
confidence.
See the Confidence
topic for statistical explanation.
Technical Discussion of the
Table of Assignment Rules:
The P type modal haplotype that I
entered into Ysearch uses 36 of the 67 standard
markers available through FTDNA, so it is called P36. Of these 36, 17 are in the standard 37 set, so the same haplotype
is called P17(37). Similarly, P15(25)
is for data with 25 markers. Similarly,
the N type modal haplotype uses 45 of the standard 67, etc. for the other
types.
If you have some but not all the
required data on Ysearch (for example 37 markers plus a panel of 12 more, or
for example ancestry.com data with all 37 plus some but not all of the
additional markers) then that table of assignment rules will not work properly,
because those other markers may increase the genetic distance.
These assignment rules must be
applied in order, top to bottom, within each of the 3 sections. The A and I samples also match the K rule,
because they are subtypes, but they are assigned to A and I because those rules
come first. If you are assigned to A or
I, please understand you are also considered a K. There are no Borderline A or I because all but one of those so
far are solid K, and that one exception is K Borderline.
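The effect of applying rules in order can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the marker names `m1` to `m6`, the modal values, and the cutoffs are invented, and the distance is a plain sum of stepwise differences, which may not agree with the Ysearch method for every marker.

```python
def genetic_distance(sample, modal):
    """Sum of absolute repeat differences over the markers in the modal haplotype.

    Simplifying assumption: every marker is compared stepwise.
    """
    return sum(abs(sample[marker] - value) for marker, value in modal.items())

def assign(sample, rules):
    """Return the first type whose rule the sample satisfies, else None."""
    for type_name, modal, max_distance in rules:
        if genetic_distance(sample, modal) <= max_distance:
            return type_name
    return None

# Hypothetical modal haplotypes and cutoffs, invented for this sketch.
K_MODAL = {"m1": 13, "m2": 25, "m3": 16, "m4": 11}
A_MODAL = {"m1": 13, "m2": 25, "m3": 16, "m4": 11, "m5": 19, "m6": 9}
RULES = [
    ("A", A_MODAL, 1),  # subtype rule comes first
    ("K", K_MODAL, 2),  # parent type rule comes after its subtypes
]

sample_a = {"m1": 13, "m2": 25, "m3": 16, "m4": 11, "m5": 19, "m6": 10}
sample_k = {"m1": 13, "m2": 24, "m3": 16, "m4": 11, "m5": 17, "m6": 12}

type_a = assign(sample_a, RULES)
type_k = assign(sample_k, RULES)
```

The first sample satisfies both the A rule and the K rule, but it is reported as A only because the A rule is listed first. The second sample fails the A rule and falls through to K.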
The Borderline categories are not
clades. The 80% confidence rule does
not apply. For example, a K Borderline
sample is one that matches no type with 80% confidence, but matches K at more
than 20% confidence. For more
explanation, see the M458 detailed discussion. See also the “Discussion” sheet in the Excel File with
Assignments.
In my article
about Polish Clades, I provide a cutoff
definition for each type. More definitions are available in Haplotypes.xls.
The definitions are wider. For example,
P type is defined as samples that match P36 at less than 5 genetic distance,
vs. 4 in the Assignment Rules.
Reason: samples at the edge of
the definition are more likely to be “background” samples that really do not
belong to the clade, but just happen to have haplotypes that come close to
matching, due to random mutations. I
figure samples at genetic distance 4 from P36 each have less than 80%
probability of belonging to P type.
They are included with probability adjustment in the estimated size of P
type. On the other hand, those samples
at 4 do not meet the 80% guideline for assignment on the Polish Project web
page. A similar discussion applies to
the other Assignment Rules, when compared to my article submitted for
publication.
The Excel File with
Assignments has best estimate assignments for all members of the Polish
Project, some at less than 50% confidence.
That file is updated from time to time, as new data accumulates in the
Polish Project. That file also has
tentative assignments to speculative types not used for the Polish Project.
Here is a breakdown of the Polish
Project, 20 Sep 2009: I saved this from
a previous version of this page as an example.
It’s very tedious to update, but it’s representative.
1043 members
441 R1a1
    192 have 67 markers
        92 have 80% confidence assignments to P, N, or K
            34 P
            20 N
            38 K, including 7 A and 7 I (no B type on 20 Sep)
        17 have 80% confidence assignments to R (confident not PNK)
        62 are K Borderline
        23 are borderline P, PK, N, or NK
    127 have 37 markers
        39 have 80% confidence in assignments
            17 P
            2 N
            18 K, including 5 A
            2 R
        88 are classified U, Unassigned
    29 R1a1 members have 25 markers; 2 have 80% assignments; the rest have U
        3 have 80% assignments; 1 A and 2 P
        26 are U
    91 have 12 markers; all are classified U
As more data accumulates on the web,
assignment rules will change, because more data means better statistics. No doubt other types and subtypes will be
proposed here. I’ll probably refine my
statistical methods, leading to minor changes.
I expect SNP mutations to be discovered in the future, providing true
subgroups of R1a1, with different code names than the P, N, K codes I am
using. My types will be either
validated or replaced. I expect STR
data will “stay ahead” of SNP data, so people will still be interested in
proposed subdivisions based on STR data.
The recent M458 SNP is one such new
mutation, announced on 4 November. My
previous P type and N type assignments all have been coming out M458+.
My % confidence for a sample includes
both accuracy of assignment and also the probability that I am wrong about that
particular type or subtype. For
example, P type is 98% confident. If a
sample is 85% statistically confident of fitting the P clade based on STR
haplotype similarity, my net confidence is 98% times 85% = 83.3%.
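The arithmetic of that example can be written as a one-line function. This is only a sketch of the multiplication described above; the function name is made up here.

```python
def net_confidence(type_validity, fit_confidence):
    """Multiply the two probabilities (each between 0 and 1)."""
    return type_validity * fit_confidence

# The P type example from the text: 98% type validity, 85% STR fit.
p_net = net_confidence(0.98, 0.85)
print(f"{p_net:.1%}")  # prints 83.3%
```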
The % confidence is a combination of
calculations and educated estimates.
The calculations are fully explained in the article submitted for
publication, which will soon be linked here.
The estimates are knowledgeable, based upon statistics too complex to
calculate and based upon limited data.
The estimates are listed in those articles, but discussed only
briefly. As one example of an estimate: it is possible that P type is a combination
of two about equal sized clades that are distantly related - more distantly
than the nodes for P, N, and K - but by coincidence both clades just happen to
have similar haplotypes due to the luck of random mutations. There is no way to calculate such a
probability, so I just take such things into account with knowledgeable
estimating. That’s the basis of my 98%
estimate for P type validity. Bottom
line: the % confidence numbers for
assignment of samples are not fully explained in the articles submitted for
publication. Please do not misconstrue
my probability numbers as fully calculated;
they are just a method to communicate my combination of calculations
plus estimates.
Notice that the K Borderline category
is the largest category in the 67 marker data.
That is because the K modal haplotype is almost the same as the modal
haplotype for R1a1 as a whole. It makes
sense that many of the unknown subclades in R1a1 have modal haplotypes very
similar to the R1a1 modal haplotype, which is similar to K, so of course many samples
come close to fitting K without coming close enough to be judged part of K
using my “mountain” method.
Reminder: use of the assignment rules for Y-DNA known to not originate from
Poland may be misleading.
R1a
See R1a
Subdivision for the current SNP breakout.
Almost all of R1a are R1a1a7 (M458+) and R1a1a* (M458-). Lawrence Mayka, the
administrator of the FamilyTreeDNA Polish Project, assures me by email that all
the Polish Project member tests within R1a have been coming out negative for
all the rare SNP subgroups. So if you
are a Polish R1a, you are almost surely R1a1a, the same haplogroup as about
half the men from Poland. About half of
these - about 1/4 of men from Poland - are R1a1a7.
Description of the Types
Click the Ysearch web links in the Results Table for modal haplotypes, which are my best
fits of web data to groups of men with similar STR data.
A.
Ashkenazi. This seems to be a
subtype of K. This type is discussed in
my publication, Part II. I have about 90% confidence in that subtype
status, but I am more than 95% certain that A is a valid clade, not just
because of my work, but because the modal haplotype closely matches the various
versions of the most common Ashkenazi haplotype, which has been widely studied
and reported on the web. It should be
emphasized that not all Ashkenazi match this cluster, and some men in this
cluster may not be descended from Ashkenazi.
This type is not restricted to Poland.
Levy-Coffman wrote an
article about Ashkenazi genetic genealogy, discussing the hypothesis that
Ashkenazi descend from the kingdom of the Khazars, who lived in what is
now Ukraine.
B.
Another subtype of K, recently identified, just now being analyzed and
documented. Concentrated in
Poland. The B data cluster lies at the
edge of the K cluster. The node for B
type in the R1a tree might be slightly younger or slightly older than the K
definition node. I estimate the former
is about 80% probability - that B is truly a subtype of K; if not then it probably lies just beyond
K. Individual assignments to B type
have 80% to 85% confidence.
D.
Concentrated in Poland. This
type was added here on 9 Jan, at which time 7 Polish Project members were
assigned to D type plus 9 more to the D Borderline category (STR values closest
to D type but cannot be assigned with 80% confidence). An additional 4 men who match the D
Borderline criteria were not assigned because they are close matches to A and I
types; in other words, some of the D
Borderline samples are probably D outliers and some
might not truly belong to D. Total 20
samples. The cluster was brought to my
attention by Mayka, who points out that Nordtvedt mentioned the cluster in web discussions some
time ago, based on the very rare DYS462=12 value. DYS462 is not one of the FTDNA
standard markers; it is a standard
at Sorenson;
DYS462 is available in data on Ysearch. I did an analysis using the 67 FTDNA
markers; the SBP
came out 18.4%, providing 80% or better confidence just
on that basis for the best fit samples.
However, 462 would significantly reduce SBP, so confidence in validity
of a clade corresponding to D is quite high considering 462. Only 5 of the samples that fit D type in the
Polish Project have been tested for 462 and all 5 have that rare 12 value. I notice that there seems to be a larger
cluster that includes D type plus the borderline samples, but I could not get a
reasonable SBP or a consistent definition; there seems to be too much overlap in D
borderline with the large K type and the large K Borderline cluster.
It will be interesting if D Borderline men who fall just beyond D type
start testing for 462 in order to find the limit of a hypothetical clade
corresponding to 462=12, which may be larger than D type. D type also has the unusual DYS481=21
value; only 10 samples in the Polish
Project R1a have this value, the 7 who fit D type and 3 of the D borderline. If those 3 others are shown to have the
462=12 value, that will establish them as solid D type. Unfortunately, Sorenson does not use the 481
marker, so there are only 3 R1a1 samples on Ysearch with the D type signature
pair (462,481) = (12,21); all 3 are
Polish Project members among those assigned on 9 Jan 2009 to D type. (There are 2 others on Ysearch with this
very rare signature pair in other haplogroups - coincidence.) D type is clearly a Polish type: In the Polish Project 14 of those 20 samples
indicate “Poland” ancestry; the
exceptions are 1 blank, 1 obvious Polish family name, 1 Latvia, 1 Slovakia, and
2 Prussian Poland - Germany. Within
those 481=21, 7 of the 10 indicate “Poland”.
On Ysearch, 5 of the 7 best fits (with D step <6) indicate “Poland”,
while at steps 6&7 only 1 of 9 indicates “Poland”. That is a hint of a non-Polish clade close
to the edge of D type, which might be the reason the SBP for D type on Ysearch
is 22%, not as good as that 18.4% in the Polish Project. Or maybe this is a hint of a larger parent
clade that is not Polish. D type is
very young, about 1,000 years TMRCA, and seems to be composed of subtypes Da
and Db (not yet statistically significant).
I classify D as a subtype of K, but see my K
Borderline discussion in this regard.
For more details, see the “Documentation” sheet in my analysis file
“DType.xls”, at my update folder.
G. This cluster is hypothetical. It has been discussed on the web by
others. I expect more data will soon
justify a discussion here.
I.
Concentrated in Poland. This
type is discussed in my publication, Part
II. About 85% confidence of
validity. About 80% net confidence that
both A and I are subtypes of K.
K. This
seems to be a main R1a1a type. K type
is discussed at length in my publication, Part
II. It is larger than others in the
Slavic lands. P and N (below) are just
as close in STR values to K as they are to each other,
probably because the K modal haplotype is the same as the R1a1 modal haplotype
(using the best 34 markers for K). So
far I have discerned a few subtypes of K in my List
of R1a types, but I do not have high confidence that they are all exact
subtypes of K, as explained in my K Borderline
discussion. I suppose that as data
accumulates more subtypes will become clear within K and K Borderline.
In the Results
I use K* to signify those samples that match type K but do not match one of the
subtypes. Although I have high overall
confidence in the validity of K type, individual assignments to K* are not as
confident. Because K is located at the
modal heart of R1a, I expect some outlier samples from distantly related clades to match K* fairly
closely just due to the statistics of random STR mutations. Because of the possibility of foreign
outliers, I consider samples at K step 3 to be K Borderline, even though the cutoff for the K definition is
4. Even K* samples with step <3 have
confidence of only 80 to 90%. That’s in
Poland, where K is fairly well defined with SBP = 26%. Worldwide K* cannot be discerned with confidence. The Ysearch SBP for K is 71%, not
significant. That means there are K borderline
clades close to the K cutoff that are rare in Poland but causing interference
on Ysearch. This is evident by a glance
at the K type results on Ysearch, where “Poland” origin is concentrated at
steps <3, and “Poland” becomes progressively less common at higher
steps. A type is a
very high confidence subtype of K, so these caveats about K* do not apply to
the very high confidence of individual assignments to A type, and similarly to
the other subtypes.
The Kurgans are the ones
who domesticated the horse more than 6,000 years ago. Many scientists think that one pre-Kurgan man is the male line
ancestor of all R1a1 men who live today.
The Kurgan hypothesis is controversial, and not necessary for this web
page. You may have noticed that I used
the letters of “Kurgan” for my original types and categories during 2008.
L. This
cluster is highly hypothetical. It is
rare in Poland, but second in size to K in European R1a1. Larry Mayka suggested
this cluster to me. It is a well known
Scandinavian cluster. I checked
it briefly, and it seems to be a “type” by my definition. However, no Polish Project samples match at
80% confidence yet, so I am not yet using it for classification here. More documentation about L will be available
here when I find time to study it.
N.
Concentrated in Slavic countries.
This type is discussed in my publication,
Part II. This is a type that according
to Yhrd seems to be spread all around the Slavic lands and
central Europe, from East Germany to Russia.
N has more mutations than P, so that means it is older. Within Poland N seems to be slightly smaller
than P, but overall N is larger than P.
Previous versions of this page had Na and Nb as speculative subtypes,
but I removed those because it seems N type should be properly studied in a
database that is not restricted to Poland.
However, I’ll continue to watch the Polish Project, because it will be
interesting if more data provides a Polish subtype within N.
There are web comments about a new
R1a1 SNP, to be announced shortly. My
guess is that this new SNP might correspond to the cluster of data associated
with what I call N type.
P.
Concentrated in Poland. This
type is discussed at length in my publication,
Part II. It seems that about 8% of men of Polish
male line ancestry belong to this type.
According to Pawlowski, this cluster is
concentrated in Poland. I verify Polish
types using both Yhrd and Ysearch. P has fewer mutations than N and K, so it
must be younger. My TMRCA
age assessment is 1600 years, but in light of age caveats P type might be 1 to 3 thousand years
old. Regardless of age, P type seems to
have had a population expansion less than 1 thousand years ago. My publication
provides details on the size and age calculations along with evidence regarding
the validity of P type. In my R1a web
document, I used P type as an example for a discussion of the caveats associated with TMRCA calculations, and
also as an example to explain the possibility of hidden
clades, and also as an example for population
bias in databases such as Ysearch, so you can find lots more discussion
about P type by clicking on those links.
I identified P type and submitted my analysis for publication before the
M458 mutation was announced by Underhill.
Pc & Pg. These subclusters have about 70% confidence, so no assignments
yet. Previous versions of this web page
used Pa & Pb & Pe. The new
versions, Pc & Pg, are different, so they got a different subscript letter,
although I have modified the same Ysearch IDs.
I have a Pd and other subtypes that are too speculative to mention at
this time.
U.
Unassigned. This is not a
cluster, but a holding place for samples with less than 80% confidence in
assignment. U is used in the Polish
Project for uncertain samples with fewer than 67 markers. Samples with all 67 standard markers are not
assigned to U, but instead are assigned into “Borderline” categories, P
Borderline, PK Borderline, N Borderline, and NK Borderline.
Instructions
for Use of Ysearch
Link to the site: http://www.ysearch.org. Brief description of Ysearch.
Click on the Create A New User tab,
where you can upload your Y-DNA STR data from a number of testing
services. Or, you can type in your
data. You end up with a “User ID”.
Ysearch has a Research Tools tab to
click, where you can type in other User ID’s for comparison.
Cluster Genetic Distance
Method; for: P - Pc - Pg - N - K - A -
I - M - G:
Click here: Research
Tools
Copy the following line into the
“UserIDs” bar at the Research Tools page:
USEID, 8U92G, RQK32, 92HEK,
3SEJK, MN8R3, FCUFG, EKVHX, 24MB4, ZD29Z
Change USEID to your User ID.
You need to type the Captcha puzzle
for access.
Click on “Show genetic distance
report”. You get a table of results.
Result: If there is a small genetic distance result
(3 or less) for one of these types, you have a high probability of belonging to
that type. There are more detailed
rules available, assignment rules above,
followed by several paragraphs of discussion.
Reminder: this web page is for men with R1a1a type Y-DNA. If you are not R1a1a, these instructions
will not produce a matching result, except very rarely, in which case the
result would be meaningless.
The emphasis is on men of Polish male
line ancestry. Just about all R1a
Polish line men are R1a1a. Anyone from
the haplogroup R1a1a from other countries may get good results, but that may be
misleading if there are other types, rare in Poland, not noticed by me, but
with haplotypes that overlap one of these 9 types. Many men of Polish male line ancestry do not match any of these
types. For non-Polish there is a higher
probability of not matching any of these types.
37 Marker Network
Lawrence Mayka
(independently, March 2007) constructed a “median joining network” Network
for the 37 marker samples of the Polish Project. This network supports the definitions of the P & N clusters,
and of the A subcluster. The P cluster
is the left side of Mayka’s network; N
is the top branch, and A is a small branch on the lower right.
This topic explains how I figure
percent confidence for assignments of individual
samples (men) in the Polish
Project. My publication explains my statistical methods. There is a summary of my mountain method above.
Confidence interval example: By 80% confidence I mean 80% is the lower
number of the 80% confidence interval.
For example, 80% confidence might mean that the actual probability is
90% but the 80% confidence interval is 80% to 96%. As an example, consider a situation where 10 samples match a type with an STR test. Suppose there is a definitive SNP
test available, and 9 of those 10 samples test positive for the SNP, and 1
negative. That means 9 of the 10 really
belong to the haplogroup and that 1 mismatch must
come from a different haplogroup that matched the STRs by the luck of
mutations. Next, consider a new sample
that matches that same STR test. What is
the confidence that the new sample will pass the SNP test for the haplogroup? The probability is 90% because we know that
9 out of 10 previous samples like this matched the SNP. However, 1 out of 10 is a very small sample. As explained in my publication,
I use Poisson statistics for quick calculation of confidence interval. Poisson statistics is simple to calculate in
Excel. My tool
Type.xls has an “SBP” sheet with a set of cells for quick Poisson calculations.
80% confidence interval of 1 is 0.11
to 3.89, which is 1.1% to 38.9% out of 10, so subtracting from 100%, the 80%
confidence interval of a match comparing to 9 out of 10 is 61.1% to 98.9%; that lower number 61.1% means the 80%
confidence ranges to lower than 80%, so net confidence is lower than 80%.
70% confidence interval of 1 is 0.16
to 3.37, which is 1.6% to 33.7%, lower number 66.3%; net confidence lower than 70%.
60% confidence interval of 1 is 0.22
to 2.99, lower number 70.1%; confidence higher than 60%.
67.3% confidence interval of 1 is
0.18 to 3.26, lower number 67.4%. So
that’s my one number: 67% confidence.
In other words, if 9 out of 10
samples that match an STR also match the SNP test, we have 67% confidence a
particular future sample matching the STR test will also match the SNP test.
For 18 out of 20, the probability is
still 90%, but a similar calculation shows 75% confidence.
For 36 out of 40, the probability is
still 90%, but a similar calculation shows 80% to 96% confidence interval, net
80% confidence, which is my example that I started with above. These calculations actually take less than a
minute using my Excel cells.
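For readers without Excel, the same kind of calculation can be sketched in Python using only the standard library. This is an assumption about the method, not a copy of the Excel cells: it uses the standard exact two-sided Poisson interval, which happens to reproduce the 0.11 to 3.89 interval quoted above, and it finds the "one number" by searching for the confidence level whose lower match percentage equals the level itself. All function names are made up for this sketch.

```python
import math

def poisson_cdf(k, mu):
    """P(X <= k) for a Poisson count with mean mu."""
    term = math.exp(-mu)
    total = term
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return total

def bisect(f, lo, hi, steps=200):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    for _ in range(steps):
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return (lo + hi) / 2

def poisson_interval(k, level):
    """Exact two-sided interval for the mean, given an observed count k."""
    tail = (1 - level) / 2
    top = 10 * k + 50  # generous upper bound for the search
    upper = bisect(lambda mu: poisson_cdf(k, mu) - tail, 0.0, top)
    if k == 0:
        return 0.0, upper
    lower = bisect(lambda mu: 1 - poisson_cdf(k - 1, mu) - tail, 0.0, top)
    return lower, upper

def one_number_confidence(misses, total):
    """Level c at which the lower end of the match interval equals c."""
    def gap(c):
        _, upper = poisson_interval(misses, c)
        return (1 - upper / total) - c
    return bisect(gap, 0.01, 0.999)

low, high = poisson_interval(1, 0.80)
c_10 = one_number_confidence(1, 10)   # 9 out of 10 match
c_20 = one_number_confidence(2, 20)   # 18 out of 20 match
c_40 = one_number_confidence(4, 40)   # 36 out of 40 match
```

Under these assumptions the three results come out close to the 67%, 75%, and 80% figures given above.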
Statistical Background Percent: SBP. I use SBP as a net confidence estimate for
the background (samples that match the STR values but
really do not belong to the clade of a type). My publication does not go into the details
of confidence intervals. That is the
purpose of the explanation here in this topic.
SBP is my estimate for the net statistical confidence before any SNP has
been discovered to validate a hypothetical type. 100% minus SBP is my estimated confidence that a sample in the mountain cluster belongs to the
corresponding hypothetical clade.
A mountain cluster corresponding to a
type might include outliers from other clades, or might
include foreign clades. These and other caveats associated with STR prediction are
discussed in detail in my publication, where I
point out that the confidence for all such caveats cannot be calculated. I estimate the background by using the low
frequency of samples in the gap as representative of the
background throughout the haplospace
neighborhood. My SBP formula (available
in the tools) includes an increase in SBP to account for
all such caveats.
Part I of my publication
explains: “Much of the background is
probably at the last step of the mountain, just before
the cutoff. Much of the remainder is
probably at the previous step, much of the remainder after that at the previous
step, etc.” My Part I Table 2 justifies
this by demonstrating how the number of possible haplotypes increases very
rapidly with step. In other words, SBP
is a good worst case overall estimate of background percent within a type, but
background percent is very low at step zero and increases rapidly with
step. My publication does not provide a
formula for background vs step and in fact I have not derived a formula. For assignment of samples, I estimate the
confidence vs step in a manner to provide a rapid decrease in confidence near
the last step, in a manner to produce overall confidence roughly equal to 100%
minus SBP. Step zero is my rough
estimate that the type is a valid clade, since the step zero samples belong to
the clade with very high probability if the type is valid.
Some outliers from the type
statistically fall within or even beyond the gap, so confidence is not zero at
the cutoff.
Confidence also depends upon the size
of the gap. A wide gap with zero
samples means even samples in the gap near the mountain have reasonable
confidence percent.
Estimates vs Calculations vs Adjustments: My confidence vs step is a combination of calculation and
educated estimates based on experience.
A person who assigns samples to hypothetical clades based on STR values
acts like a bookie who provides advance estimates for gambling odds, using a
combination of calculations, educated guesses, and intuition. A bookie’s estimates are usually tested by
reality very quickly. Probabilities of
an STR estimator may not be verified or falsified by a new SNP for years. You need to be skeptical of STR based
predictions. In the past, a number of
STR based assignments have been shown wrong by new SNP discoveries. This long web document is provided so you
can read as much as you wish about my methods, judging for yourself the
reliability of my estimates and net probabilities.
The first confidence interval example
above, confidence of STR predictions calibrated to SNP data, can be pure
statistical calculation without any estimates.
However, judgment is involved.
Even such SNP predictions should be split into parts based on the step value
of the samples within a type. However,
if split down to individual steps, the statistics are very poor due to small
sample size, so steps are best combined in batches. For the first data from a new SNP it is necessary to combine all
the steps, so the predictions benefit from an estimated confidence by
step. So the judgments and calculations
can get quite complicated, and often I just estimate the confidence from
experience rather than do the calculations every day as data comes in.
I avoid changing assignment rules
often, so some assignment rules remain in place even after new data has
provided better rules.
My standard is 80% confidence, but I
avoid introducing a new type until the confidence is a bit higher, because a
new 80% confidence type would provide only a few samples at step zero on the
day when enough data has accumulated.
After waiting for more data, I tend to bend the guidelines a bit below
80% confidence in order to introduce more samples with a new type. Also, if I notice an individual coming out
at 75% when I’m updating rules I’ll tweak the rule to include him.
I tend to be generous in estimates
for samples with all 67 markers, and I tend to be conservative with samples
having fewer than 67. I update the
rules more often at 67. After all,
samples with fewer than 67 markers can get much better confidence by ordering
more markers, and 67 is the most available as a standard commercial test.
I do not look forward to a man
feeling slighted when he is not assigned to a type that is a reasonable fit to
his STR data. On the other hand, I do
not wish to be dismissed by others with experience evaluating STR data, so I
try to be conservative in my probability estimates that particular clades in
fact exist. I will have achieved my
goal if the number of people complaining that I assign too liberally turns out
to be somewhat greater than the number of people complaining that I am too
conservative (people who have read and understood my documentation).
Naturally, my confidence changes from
month to month as more M458 and STR data accumulates, for better statistics.
Assignments at fewer than 67 markers: There are two ways: Some
types have low SBP and seem 80% valid using 37 or only 25 markers, at least for
samples at low step, so samples can be directly assigned.
Second way: I check for correlation using the samples with 67 markers to see
which percent of samples at given genetic distance using fewer markers end up
in the corresponding type at 67 markers.
The confidence of a sample at fewer markers is that confidence
multiplied by the corresponding confidence at 67 markers.
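The second way can be sketched as follows. The reading assumed here is that the fraction of 67-marker samples that carry over is multiplied by the confidence at 67 markers. The reference list and the 0.90 figure are invented for illustration, and the function name is hypothetical.

```python
def carry_over_fraction(reference, distance):
    """Fraction of reference samples at this fewer-marker distance
    that end up in the type when all 67 markers are used.

    reference: list of (distance_at_fewer_markers, in_type_at_67) pairs,
    built from samples that have all 67 markers.
    """
    hits = [in_type for d, in_type in reference if d == distance]
    if not hits:
        return None  # no reference samples at this distance
    return sum(hits) / len(hits)

# Made-up reference data: (distance at 37 markers, in the type at 67 markers?)
reference = [
    (0, True), (0, True),
    (1, True), (1, True), (1, True), (1, False),
    (2, True), (2, False),
]

fraction = carry_over_fraction(reference, 1)   # 3 of 4 samples carry over
confidence_67 = 0.90                           # hypothetical 67-marker confidence
confidence_37 = fraction * confidence_67
```

With so few reference samples the fraction itself is uncertain, so a confidence interval on it would normally be considered as well.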
My P type and N type are equivalent to R1a1a7. I introduced P, N, and K types in the Fall of 2007, publishing
this web page 6 Dec of that year. I did
not predict that P and N were sister clades; in fact it looked to me like P was
closer to K, but I had very low confidence in the relation between P, N, and
K. But I assigned samples to P and N
with 80% confidence, remarking that my overall confidence that P and N were
valid (confidence at step zero) was 95% in 2008. I stated my overall confidence in the subtypes of K type as only
80%, but again my confidence at step zero was 95%.
The new M458 SNP, the definition of
R1a1a7, is consistent with (albeit not full proof of) my previous types. If a significant number of samples from my P
or N type were split between M458 positive and negative, that would have been a
disproof. I say “significant number”
because STR types are necessarily statistical, so there should be some outliers
that are improperly classified in any scheme based on STR markers. In my case, samples should fit much better
than my confidence percent numbers, because these are statistically worst case,
as explained in the previous topic.
I look forward to the discovery of SNPs
validating more than 80%, probably about 90%, of my Polish Project assignments
to types.
A new SNP marker may not fall at the
node defining a type. For example, if
an SNP for P type is much younger than my P type clade, some of the samples
that I classify as P type may not be included in the new haplogroup defined by
such a marker. I would still consider
my P type validated if most of the samples left out have relatively high
genetic distance (obviously the older samples in P type). In such a case, my type would still be
valid, as a prediction of yet another, older, SNP marker that may be discovered
later.
My publications
have several references of general interest and relevance to my web documents.
My Tools and
data for STR analysis are Excel files.
These are available at the JoGG publication site as Supplementary
Data: www.jogg.info/52/files/cpcindex.htm.
Polish
Clades Update. This folder is for
update of Tools and for new data: www.gwozdz.org/PolishCladesUpdate
Pawlowski
(2002) Arch Med Sadowej Kryminol 52(4):261 (in Polish). This reference is listed in my publications. I specifically mention it here because this
is where I originally found the common Polish haplotype
that I now call P type.
Link to English abstract: Pawlowski
2002.
Lawrence Mayka
is the Administrator of the Polish Project. Larry helped me to get started when I was
new to genetic genealogy, providing helpful criticism & suggestions. He reviewed & approved my 80% confidence
method for assignments on the Polish Project web
page. He also reviewed the original
drafts of my publications. A number of
my types were originally suggested to me as STR clusters by
Larry.
Cyndi Rutledge
is the administrator of the R1a Project. Larry and Cyndi send me M458 test results, which are not listed on
the web.
Anatole Klyosov
published a pair of articles about STR clusters in the same Fall issue of JoGG that has my pair of
publications. Some of the STR types that
I independently discovered I later found as 25 marker modal haplotypes in
Klyosov’s web documents (before his
publication in JoGG - some in Russian).
It was encouraging to me seeing independent identification of clusters
by different methods. He emailed to me
an English version of one of his 2008 publications. His Fall JoGG articles have references to his other
publications. Here is a web link: Klyosov Home.
Russian
web sites: http://www.rodstvo.ru; http://dnatree.ru/; http://molgen.org/. These sites have been active in analyzing R1a;
they were brought to my attention by others, particularly by Mayka. These sites clearly have proposed
subdivisions of R1a based on STR data, but I cannot quickly understand these
due to the language barrier; I tried
and failed to figure out a correlation to my types. Klyosov seems to be active at these
sites. The sites make use of the FTDNA projects and Ysearch. On 10 Jan Mayka emailed me a link to some
Russian maps
of R1a, but I have not yet taken the time to try to figure these out.
Kenneth Nordtvedt
published an article about calculating TMRCA in the Fall 2008 issue of JoGG. His Excel files of data and tools are available at his web site. Ken has been active in web discussions,
suggesting many STR-based clusters.
FTDNA
link: www.familytreedna.com. This is a commercial DNA testing
company. I make extensive use of the
project databases maintained by FTDNA.
These are my primary sources of data.
Click on the “Projects” tab at the home page to look for projects. Also, the project name can be substituted
for /polish/ in the following URL.
Polish
Project link: www.familytreedna.com/public/polish. One of many FTDNA projects. This is my primary source for Polish
data. The Polish Project tracks both
Y-DNA and mtDNA; click on “Y-DNA
Results” on the left to see the data that I use.
R1a Project
link: www.familytreedna.com/public/R1aY-Haplogroup. Another source.
Ysearch
link: www.ysearch.org. Ysearch is the largest web database for
Y-DNA, run by FTDNA, open to all men, including men who also register with
projects and including men with data from other testing services. I use Ysearch often for analysis so of
course I encourage you to register your Y-DNA data at Ysearch. From the FTDNA site, you can register your
data with Ysearch. Or you can type your Y-STR data into Ysearch. I am not associated with the company
FTDNA. I have Instructions for comparing your STR data to
my types (modal haplotypes)
that I have entered into Ysearch.
Yhrd
link: www.yhrd.org. A forensic Y-DNA database. Data are separated by city, with many Polish
cities. I relied on Yhrd to figure out
the geography of the various haplotypes.
I wrote Yhrd
Reminders for myself so that I won’t forget how to navigate the Yhrd web
site; click on that link if you need
some hints.
Sorenson
link: http://www.smgf.org/. Another DNA testing company.
Peter Gwozdz
I’m a very rare type in Poland -
E1b1b1a2. My maternal 1st cousins are R1a1a.
That means my late maternal grandfather was R1a1a. I became interested in Y-DNA in 2004. My maternal family name is Iwanowicz. I discovered a family with that name in my
maternal grandfather’s home town in Poland.
They are the only Iwanowicz family within 50 miles, so I was suspicious
they might be my 3rd or 4th cousins. I
brought a cheek swab kit when I visited them the second time in 2006. Sure enough, the son is a perfect 25
STR marker match to my 1st cousin. I
didn’t get around to checking the web for a year. I was shocked to discover that these maternal cousins matched 80
people in the FTDNA database, for a perfect 12 out of 12
STR markers. That’s a hell of a lot of
matches in the summer of 2007. Most of
these matches are Polish. I did some
research and found an article by Pawlowski (reference
in my publication) about this most common Polish
haplotype, which I now call P type. That got me interested in doing more
research, leading to this web page for others to see my results. My experience, however, is a reminder that
statistics can be misleading. I was
confident that my grandfather’s haplotype was P type, based on a perfect match
at the first 12 markers. I now (Dec
2009) figure that the probability was really about 90%, because 9 out of the 10
current Polish Project members who have 67 markers and who also match P type
perfectly at 12 markers are in fact P type as judged by all 67 markers. My grandfather does not match P type at 67
markers. My grandfather is that 10th
one. He matches the small hypothetical
clade that I call I type, which is also concentrated in
Poland. But my confidence in that I
type assignment is only 80%, so maybe statistics is fooling me again. That’s how an outsider ended up studying P
type and R1a1a, and writing web pages and articles about common Polish Y-DNA clades.
Explanation of Polish Project Assignments
{5 Jan note. This is an old topic, which is being
rewritten and moved to other topics.
I’m saving this old version here.}
If you got here by clicking on the
link from the Polish Project, here is a brief
explanation of the categories used for the R1a
section of the Y-DNA Results page at the Polish Project at FTDNA:
My Excel file has a more
detailed explanation, including assignments with less than 80% confidence.
Link: www.gwozdz.org/PolishProjectR1a1Assignments.xls
If you are assigned to the A, B, C,
I, K, N, or P Type, I figure 80% or better confidence that your assignment will some day be validated by a new SNP
defined haplogroup corresponding to that type. If you are assigned to the R (Remainder) category,
I figure 80% or better confidence that you will end up someday in another new
haplogroup, not one of those types that I defined so
far. If you are assigned to one of the
“Borderline” categories, I figure less than 80% probability you will end up in
the corresponding future haplogroup, but you are close to that type, so if you
do not end up in that one, you will likely end up in a closely related
haplogroup that is not yet defined. I
use the U (Unassigned) category for uncertain samples where a better assignment
is possible by further testing.
I expect more than 90% of my
assignments will be validated by SNPs some day for two reasons. First, many of the assignments are higher
than 80% confidence. Second, as explained
in the Confidence topic below, each of my assignment
rules is down-rated for 80% sampling statistics confidence, so the average
probability is higher than 80% for a large number of 80% confidence
assignments.
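The first reason can be illustrated with a minimal sketch. If each assignment carries its own confidence of 80% or better, the expected fraction of assignments validated is the average of those confidences, which cannot fall below 80%. The confidence values and the function name here are hypothetical, chosen only for illustration; they are not taken from the Polish Project data.

```python
def expected_validated_fraction(confidences):
    """Average per-sample confidence, i.e. the expected fraction validated."""
    if not confidences:
        raise ValueError("need at least one confidence value")
    return sum(confidences) / len(confidences)


# Hypothetical confidences for five assigned samples (all 80% or better).
example_confidences = [0.80, 0.80, 0.90, 0.95, 0.99]
average = expected_validated_fraction(example_confidences)
print(f"expected fraction validated: {average:.3f}")
```

With these made-up numbers the average comes out near 89%, above the 80% floor, simply because some samples sit well above the threshold.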
Naturally, my confidence changes from
month to month as more M458 and STR data accumulate, for better
statistics.
My net confidence ratings are based
on a combination of statistical calculations and educated estimates, as
explained in the Confidence topic below.
{20
Dec note: the following is an old
version of this topic that needs a rewrite to take the recent
M458 results into consideration:}
Here is a longer explanation of my
R1a1 assignments on the Polish Project Y-DNA Results page:
“P Type” consists of the samples that match
with at least 80% confidence to a hypothetical clade that is a subdivision of
R1a1. I call this clade “P type”, or
just P for a short name. P type is
highly concentrated in Poland. A
“sample” here means the Y-DNA STR data for a man who joined the Polish
Project. Some of the P samples match
with better than 80% confidence, so I predict that more than 80% of the P
samples will be verified some day by an official “haplogroup” corresponding to
what I call P type. Haplogroups are
also called “groups”. Official
haplogroups are divided into smaller official haplogroups. “Clade” here refers to all haplogroups,
including those not officially discovered.
I use the word “type” for my subdivisions (hypothetical clades, not
official), in order to distinguish from official words such as “haplogroup”,
“group”, and “haplotype”. Below on this
web page I explain what I mean by a “type”, and how I assess the probability
that a type corresponds to a clade. A
“cluster” is a group of samples with similar STR data. All types have corresponding clusters, but
not all clusters are types. On the
Polish Project page, I only provide types for which I have 80% or better
confidence in validity. I consider
types at lower confidence below in this page, and in an Excel File with
Assignments page.
Of course, if your sample is listed
in the P Type category there is a chance your sample will not actually end up in
that future haplogroup, because such predictions based on STR data are not
certain. I judge that the probability of
missing is less than 20% for each sample that I categorized into the P Type.
Similarly, “A Type”, “I Type”, “K
Type”, and “N Type” are samples assigned to corresponding types with 80% or
better confidence.
Please don’t get confused. These same capital letters are also used for
the large official haplogroups. The I,
N, and R haplogroups are particularly large in the Polish Project. I am dealing here only with haplogroup R1a1,
which is a subdivision of haplogroup R.
I am using capital letters to subdivide R1a1. I expect my subdivisions will be verified someday, with names
something like R1a1h1b2, etc.
A type and I type seem to be subtypes
of K type. So the samples assigned to K
are the ones that have 80% confidence of belonging to K but do not match A or I
with 80% confidence. In other words, my
K Type is really my K*, where the * notation is conventionally used to indicate
samples that belong to a clade, but only those samples that do not belong to a
known subclade.
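The K* logic above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the argument names, and the idea of passing one confidence value per type are hypothetical, and the real assignment rules are more detailed.

```python
THRESHOLD = 0.80  # the 80% confidence level used for assignments


def assign_within_k(conf_k, conf_a, conf_i):
    """Return a category for a sample, treating A and I as subtypes of K.

    Each argument is a hypothetical confidence (0.0 to 1.0) that the sample
    belongs to the corresponding type.
    """
    if conf_a >= THRESHOLD:
        return "A Type"
    if conf_i >= THRESHOLD:
        return "I Type"
    if conf_k >= THRESHOLD:
        # Belongs to K but not confidently to a known subtype: this is K*.
        return "K Type"
    return None  # not assigned to K or its subtypes at 80% confidence
```

Checking the subtypes first is what makes the “K Type” category behave like K*: a sample lands there only after A and I have been ruled out at the 80% level.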
“R1a - Remainder” consists of the samples for which
I have 80% confidence that they match none of my types. I call these R for short. R is not a “type” per my definition. The concept of R is very similar to the
concept of R1a1* for haplogroups. We
hope that someday haplogroups will be defined for what I call R samples, but
right now the STR data does not cluster (correlate) well enough for 80%
confident identification of more types.
Many samples do not match any of my 5
types with 80% confidence but match one of them with 20% or more
confidence. I do not put these into R
because they are not R with 80% confidence.
These samples are classified “R1a - Unassigned”, or U for short, if they
have fewer than 67 markers measured. U
is arranged as the last section in the R1a1 section of the Y-DNA chart at the
Polish Project web page. U is the
largest classification (most samples) on that page. U is for the samples that cannot be assigned with 80%
confidence. U is only used for samples
with fewer than 67 STR markers measured.
For samples with all 67 standard STR
markers, I do not use U. Instead, I
subdivided the U samples by which type they best match. So there are classifications such as “P
Borderline” and “N Borderline”, only for samples with all 67 markers. Borderline categories do not correspond to
types, because these categories are a mix of samples that probably do not
belong to the same type. Samples in R
have 80% confidence, which means each of those matches none of my defined types
at 20% or better. For example, the P
Borderline samples match P type poorly, but well enough for me to predict that
each of these has a 21% to 79% chance of ending up some day in a haplogroup
corresponding to P type. Summary for
this P type 67 STR marker example:
where I figure 80% or more, a sample goes into P; where I figure 20% or less, the sample goes
into R unless it matches another type;
where I figure less than 80% but more than 20%, the sample goes into P
Borderline unless it matches another type better.
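The summary above can be written as a small decision rule. This is a minimal sketch that considers only the P type for a sample with all 67 markers; it deliberately ignores the “unless it matches another type” provisions, and the function name and the single probability argument are hypothetical.

```python
def categorize_p_example(p_probability):
    """Categorize a 67-marker sample from its estimated probability of being P type.

    Simplification (an assumption of this sketch): only P type is considered,
    so the "unless it matches another type" cases are not handled.
    """
    if not 0.0 <= p_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if p_probability >= 0.80:
        return "P Type"
    if p_probability <= 0.20:
        return "R1a - Remainder"
    return "P Borderline"  # more than 20% but less than 80%


for p in (0.95, 0.50, 0.10):
    print(p, categorize_p_example(p))
```

A sample with fewer than 67 markers would not reach this rule at all; in the scheme described above it goes to U whenever it cannot be assigned with 80% confidence.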
This long web document is arranged
with simple, important comments first.
Details and complications are discussed later in this document.
Revision History
2007 Dec 6. First web posting. Two files. This “PolishClades.html” and “PolishProjectR1aAssignments.xls”. First drafts.
2007 Dec 22 to 2008 Oct 27. Eleven revisions
2009 Jun 4 minor wording changes, plus mention that 2 articles are scheduled for submission for publication this month. Also a comment about assignment rules update scheduled.
2009 Aug 14 notice that new assignment rules are available; this page is being rewritten.
2009 Aug 31 slip rewrite date
2009 Sep 16 start complete rewrite; not finished
2009 Sep 18 rewrite complete
2009 Oct 3 mention of JoGG submission; extensive minor wording changes
2009 Oct 22 several paragraphs added above Abstract to explain the new Polish Project Categories
2009 Oct 25 finished editing everything, consistent with new assignment rules - modified the Assignment Table
2009 Oct 26 another edit, with some clarification rewording, mostly in the “Confidence Comments”
2009 Oct 30 edit to mention that the Excel file with Assignments has been updated, using the new rules
2009 Nov 4 add a section at the top. New M458 split of R1a
2009 Nov 6 change title from R1a1 to R1a. Also a rewrite of the new section at the top
2009 Nov 11 expand the R1a signature prediction beyond Polish samples
2009 Nov 12 expand the R1a signature prediction to more markers and more signatures
2009 Nov 14 move the R1a Underhill analysis (beyond Poland) discussion to a new web page, R1a.html
2009 Nov 18 move the M458 test results to the R1a page
2009 Nov 21 my method & results published in JoGG, Fall issue, today
2009 Dec 11 minor link correction
2009 Dec 17 comment about rewrite at R1a.html
2009 Dec 20 M458 test results update; “Assignment News” new first topic
2009 Dec 24 rewrite Abstract, new Introduction and Mountain Method topics
2010 Jan 4 redo of links for References and Sources; also quite a bit of general rewrite
2010 Jan 5 more general rewrite
2010 Jan 6 update Results Table
2010 Jan 9 D type and more rewrite
2010 Jan 10 update the type discussions
2010 Jan 16 minor changes