Polish Y-DNA Clades
19 Sep 2016
Peter Gwozdz
News
19 Sep 2016 update of frequency % in the PCI Table.
19 Sep 2016 update of G Type (L365, YP389, YP269). Update of J Type (YP977).
18 Sep 2016 Major update of the Results Table.
5 Jul 2016 rewrite of D Type (Y2613).
30 Jun 2016 rewrite of B Type (Y2902).
12 Jun 2016 edit of P type (L260).
11 Jun 2016 rewrite update of Z92, including E type.
10 Jun 2016 rewrite update of multiple topics - no significant news.
Results
New topic 24 Apr 2016.
The Results Table has a Y-DNA tree in a color table format, with the Y-DNA
haplogroups that are common in Poland. Data is from the Polish Project, taken
as representative of Poland.
The PCI Table lists the haplogroups that are particularly concentrated in Poland.
Abstract rewrite 24 Feb 2016. Edit 24 Apr 2016.
The Polish Project has assignments of men (samples) to haplogroups and to
proposed subdivision clades based on their Y-DNA data. The Polish Project
provides data for this web site of mine.
Lawrence Mayka is the primary administrator of the Polish Project. Paul Stone
is also an administrator, with emphasis on the I1 haplogroup. I am also an
administrator, helping Mayka with statistical methods for assignment of
samples. This web document is for explanation, details, and update news.
The topic is common Polish Y-DNA
clades - identification of male line Y-DNA clades that are concentrated in the
region of Historical Poland.
This Abstract is for people
reasonably familiar with the jargon of genetic genealogy. If you are new to genetic genealogy you
might prefer to read the Introduction
first.
The Results
Table and the PCI Table provide a
summary.
Many of the assignments are to
established haplogroups, based on SNP test
results. Many samples without
sufficient SNP data, if their STR data matches closely to sample(s)
with SNP data, are assigned to those corresponding SNP haplogroups.
Some assignments are to
hypothetical haplogroup branches, based on STR correlations. Such branches are proposed by many people,
including Mayka and me. In addition, I
hypothetically subdivide haplogroups into types
when division can be done with at least 80% confidence. Below 80% confidence, assignment
categories are tentative; these are usually called clusters rather than types.
About half of Polish men belong to
haplogroup R1a.
Most of my work has been on R1a.
The R1a Project has lots of
additional information.
This web document has three
purposes: 1. More detailed explanations
for the sample assignments in the Polish
Project. 2. Summary of my published results. 3. Update with recent results.
Before 2014, it was expensive to
discover new SNPs, so emphasis was on STRs, which were much less
expensive. That was true of this web
page, the Polish Project, most other web based projects, and most published
articles about genetic genealogy. 2014
was a transition, where the cost of discovering new SNPs was greatly reduced.
As of 2016, there is a continuous flood of new SNP discoveries, and testing for
SNPs has become inexpensive. Emphasis
is now on SNPs. Still, most on-line
samples have STR data without test data for the most recent SNPs, so STR matching
of samples continues to be needed for assignments.
Many of the new SNP branches are
very small (I call them twigs), with fewer
than 5 known samples.
See Big Y for one way to find new SNPs. See SNP ordering information for
testing individual known SNPs.
I am interested in Polish
origins. This web document, however, is
not for historical analysis and conclusions, except for occasional comments to
remind us of the goal. This document is
dedicated to identifying haplogroups and types and clusters concentrated in
Poland, with detailed explanations. I
am aware that some people object to the
use of Y-DNA for historical analysis, so I try to mention caveats along with my
comments.
Update rewrite 22 Sep 2015. Edit 24 Apr 2016.
About half of Polish men belong to haplogroup R1a. The R1a
Project has lots of additional information about that haplogroup.
See the Results
Table for a breakout of R1a subdivisions that are common in the Polish Project.
In that table, R1a is represented
by R1a1 (R-M459) because all R1a samples in the database
are R1a1. R1a1 is divided into 3 main
haplogroups: M458, Z280, and Z93. Only 1.25% of R1a1 samples do not belong to these 3 main
haplogroups.
Worldwide, R1a is more
complex. A graphic representation of
the known branches of the R1a tree is available at the R1a Project. ISOGG has
an R1a tree that is not up to date.
Yfull has a continuously updated R1a tree including all the recent new
SNPs for which data have been submitted to Yfull; direct link:
https://www.yfull.com/tree/R1a/
Thanks go to Mayka and Milewski for extensive email information
and assistance.
Reminder: I am concentrating on Poland.
The statistics of STR clusters depend a lot on the database. For example, P type
stands out dramatically in Polish data.
In other countries far from Poland P type is rare. If you belong to an R1a cluster that is rare
in Poland, I’m sorry, but I’m not covering you. Check out the R1a Project.
When I originally posted this web
page in December 2007, no significant haplogroup subdivision of R1a was
available, so this page started with hypothetical subdivisions of R1a. In 2010 I expanded this page to include the clades from other haplogroups.
Edit 10 Jun 2016.
This Introduction is for people
unfamiliar with the jargon of genetic genealogy.
There are quite a few web sites
with a general introduction to the subject of genetic genealogy, for example Wikipedia, FTDNA, and Genographic. The Y Chromosome Wikipedia
article is about male line DNA, also called Y-DNA.
This is a brief introduction to
genetic genealogy for Y-DNA, providing some definitions of jargon needed to
read my web pages. The definition words
are boldface. I often use links to
those definitions when I use a jargon word for the first time in a topic. If you want more detail on those boldface
words, consider a web search. There are
more boldface definitions in the summary of my Methods.
The Y chromosome gets passed from
father to son, so it works just like a male family name. Mutations (changes in the DNA coding) in the
Y chromosome are inherited by sons. Men
are divided into haplogroups based on known rare
mutations. These mutations are called
single nucleotide polymorphisms - SNP - a change at only
one specific location in the Y chromosome.
The human Y-DNA haplogroups, representing all men, can be arranged as
branches on a tree.
Diagrams of the tree usually depict the tree branching down (an upside
down tree), or sideways. Example of a
sideways image of the Y-DNA tree: wiki
tree. Usually it is more convenient
to arrange the tree upside down as a list, with tabs for the branches; examples:
Yfull tree and ISOGG tree.
These examples have links allowing you to browse through the thousands
of known branches of the human Y-DNA tree.
We don’t really know the full human Y-DNA tree; all trees are based on current data in a
particular database; new branches are continuously discovered.
All the men in a haplogroup descend
in direct male lines from one man, called the “Most Recent Common Ancestor” (MRCA) for that haplogroup.
The MRCA corresponds to a node, or branching
point, in the Y-DNA tree of male line ancestry. Time of the Most Recent Common Ancestor (TMRCA)
is an estimate of how long ago he lived - the age
of the node.
Lots of people, including me, are
working to discover more SNPs on the Y chromosome so that the haplogroups can
be divided further into smaller haplogroups.
SNPs used to be difficult to
discover and expensive to test. Costs
have been coming down. SNPs are now
discovered relatively easily; SNP
testing is inexpensive. Since about
2013 there has been an increasing flood of new SNPs and corresponding newly
discovered haplogroups.
SNPs have alphanumeric code names
(for example CTS3402), assigned by the people who discover them, and registered
at on-line databases. The corresponding
Y-DNA haplogroups have alphanumeric code names assigned by ISOGG (for example the haplogroup for
CTS3402 is R1a1a1b1a2b3). Since 2014,
with the flood of new SNPs, ISOGG is not keeping up, and the ISOGG codes have
become too long. ISOGG codes are still
used for the main branches (the
oldest branches with thousands of samples).
For smaller (younger) branches, sometimes only the SNP code is used (for
example haplogroup CTS3402) or a main branch code followed by the SNP code for
the smaller subdivision branch of interest (for example haplogroup R-CTS3402 or
R1a-CTS3402).
Clarification: A haplogroup (or a branch of the Y-DNA tree)
includes all the subdivision haplogroups (all the subdivision branches). For example, R-S18681 is a branch of R-CTS3402,
and R-CTS3402 is a branch of R1a. R1a is a branch of R1, which is a branch of
R. So if a man belongs to the
haplogroup R-S18681 he also belongs to the haplogroup R-CTS3402 and he also
belongs to the haplogroups R1a, R1, and R.
In this example, R is the oldest and R-S18681 is the youngest branch.
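The nesting in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table holds just the five branches named above, the intermediate branches between R1a and R-CTS3402 are left out, and the names PARENT, upstream_chain, and belongs_to are hypothetical.

```python
# Simplified child -> parent table for the example in the text.
# Assumption: intermediate branches between R1a and R-CTS3402 are omitted.
PARENT = {
    "R-S18681": "R-CTS3402",
    "R-CTS3402": "R1a",
    "R1a": "R1",
    "R1": "R",
}


def upstream_chain(haplogroup):
    """Return the haplogroup followed by all upstream haplogroups, youngest first."""
    chain = [haplogroup]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def belongs_to(terminal, haplogroup):
    """True if a man whose terminal haplogroup is `terminal` also belongs to `haplogroup`."""
    return haplogroup in upstream_chain(terminal)


print(upstream_chain("R-S18681"))
# ['R-S18681', 'R-CTS3402', 'R1a', 'R1', 'R']
print(belongs_to("R-S18681", "R1a"))   # True
print(belongs_to("R1a", "R-S18681"))   # False
```

Membership only runs one way: a man in R-S18681 is in R1a, but a man known only to be in R1a may or may not be in R-S18681.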
If you purchase a DNA test for
R-CTS3402 and you have this SNP, the result comes out “positive”, or CTS3402+,
and you belong to that haplogroup. If
you don’t have this SNP the result comes out “negative”, or CTS3402-.
Upstream means older (SNPs, haplogroups, branches, etc.) within the same branch;
for example R1a is upstream of CTS3402. Downstream means younger within the same
branch; for example CTS3402 is downstream of R1a. Upstream haplogroups generally
are larger - more samples in a database. This may be confusing because of the
mixing of metaphors: most streams and rivers have smaller branches upstream,
which is the opposite of how genetic genealogists use the words for Y-DNA trees.
A river delta, where a large river breaks up into smaller rivers downstream, is
a closer fit to the Y-DNA metaphor. It is simplest to visualize upstream and
downstream with the Y-DNA tree as a list, with older nodes up and younger nodes
down and indented; see Yfull tree.
SNPs are individually very
rare. But the Y has about 60 million
locations, and about 1/3 of them are suitable for accurately measuring SNPs
inexpensively, so there are thousands of known Y-SNPs. A man who belongs to R-S18681 inherits from
his father, on the Y chromosome, not only the SNP mutation S18681, but also
CTS3402, and also the mutations for haplogroups R, R1, and R1a, and also the
SNPs for other branches that I am not mentioning in this example. It is unlikely that a particular man in an
entirely different haplogroup might have the S18681 SNP mutation, but it’s like
a lottery; some man somewhere outside
S18681 probably has that mutation.
However, it is extremely unlikely one man outside S18681 has many of the
mutations from the upstream branches leading to S18681; it’s like winning a lottery many times. In other words, it is almost impossible for
one man to carry the sets of SNPs for two different haplogroups. A mistake in SNP testing, or a mix up of SNP
data, or someone cheating with DNA samples, are each more likely than valid
haplogroup confusion.
Many haplogroups have an MRCA who lived thousands of years ago, so
these span multiple ethnic groups and nationalities. For example, the R1a haplogroup is of interest to me. R1a is most common in Slavic countries but
calling R1a Slavic can be misleading because it is found throughout Europe and
west Asia. The MRCA lived so long ago
that he may have spoken a language that we would not consider Slavic if we
could hear it. It is possible that he
did not even live in what is now the Slavic region of Europe; maybe his descendants moved there in a
massive migration from the Asian steppes, or from India, or from somewhere
else. No one knows for sure. He may have been proto-Slavic in language
and culture, but we don’t know for sure.
If he was proto-Slavic, some of his descendants long ago moved to
other parts of Europe and Asia. One of
the appeals of genetic genealogy is finding such clues about ethnic descent and
migration from the statistics of haplogroups.
Some people object, pointing out that
ethnicity cannot be defined genetically because of all the moving and mixing of
people over the millennia, and because the Y chromosome is only one of our 24
chromosomes. True enough. Some individuals and some web sites go too
far with genetic genealogy claims based on DNA. That said, statistical analysis of Y haplogroup data provides
clues on human origins.
Most known haplogroups have a TMRCA
of thousands of years ago, before family names were common, so the men in a
haplogroup usually do not share a family name.
Some relatively young haplogroups
have been discovered that correspond to families, where most men in that
haplogroup do have the same family name.
Y-DNA is biologically accurate, so
some men discover that their Y-DNA does not match the DNA of their male line
distant cousins identified by genealogy research, due to secret adoptions,
illegitimacies, cuckoldry, etc. Such a
situation is called an NPE, a non-paternal event. This is one of the reasons some genealogists
prefer to avoid genetic genealogy.
The male line associated with the
Y-chromosome is only one ancestral line.
Anyone who tries to make a family tree going back 300 years has more
than a thousand root tips to be filled by names of ancestors who lived back
then; the one man at the tip of the
male line root is only one of those thousand.
That is another reason some genealogists avoid Y-DNA genetic genealogy -
the emphasis on only one line of descent out of many. That said, many men enjoy purchasing a DNA test to find out which
ancient Y-DNA haplogroup is theirs.
Many people enjoy the challenging hobby of discovering new SNPs in their
male line (including women studying their father’s or maternal grandfather’s
male lines), thereby defining their younger branches of the Y-DNA tree.
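The “more than a thousand” figure is simple doubling arithmetic. A rough sketch, assuming about 30 years per generation (an assumed round number for illustration; the true average varies) and ignoring cousin marriages, which make some ancestors fill more than one position:

```python
YEARS_PER_GENERATION = 30  # assumed average, for illustration only


def ancestor_slots(years_back, years_per_generation=YEARS_PER_GENERATION):
    """Number of ancestor positions in the generation living `years_back` years ago."""
    generations = years_back // years_per_generation
    return 2 ** generations


slots = ancestor_slots(300)
print(slots)       # 1024 positions, 10 generations back
print(1 / slots)   # fraction of those positions held by the male line ancestor
```

Only one of those 1024 positions is the man at the tip of the male line.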
A paragroup is a haplogroup considered without its known
haplogroup branches. An asterisk is
often used in paragroup codes, like R1a1a* or CTS3402*. R1a1a* usually (not always) means all the
samples that belong to R1a1a but do not belong to any of the known
branches. When a new branch is
discovered, samples positive for that new SNP get assigned to the new
branch; that changes the meaning of the
corresponding paragroup. The meaning of
a paragroup varies between databases, because different databases have
different samples with different SNP test results. As a simple example of the idea of paragroups, “apes” is a
biological clade that includes humans,
but in many discussions “apes” means all apes except
humans, in which case “apes” is temporarily a paraclade in that discussion.
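Because a paragroup is defined by what is left over, its membership amounts to a set difference. A minimal sketch with made-up sample IDs; the branch names B1 and B2 are placeholders, not real SNPs, and the function name is hypothetical:

```python
def paragroup(members, known_branches):
    """Samples in a haplogroup that fall in none of its known branches."""
    assigned = set().union(*known_branches.values())
    return members - assigned


# Hypothetical database: five samples positive for the haplogroup's SNP.
members = {"s1", "s2", "s3", "s4", "s5"}
branches = {"B1": {"s1", "s2"}}
print(sorted(paragroup(members, branches)))   # ['s3', 's4', 's5']

# A newly discovered branch B2 changes the meaning of the paragroup.
branches["B2"] = {"s3"}
print(sorted(paragroup(members, branches)))   # ['s4', 's5']
```

The same function gives different answers for different databases, because each database has its own samples and its own list of tested branches.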
Many SNPs were independently
discovered more than once and listed with multiple names. For example, CTS3402 has two other code
names, so it can be listed as CTS3402 / V2670 / S3361, or any of those names
can be used alone. Those three codes
are all the same mutation, at the same location on the Y chromosome.
Most haplogroups have multiple phyloequivalent
SNPs. For example, CTS3402 has two
other phyloequivalent SNPs: Y32 and
Y2194. These are not the same; they are different mutations at different
locations. So far, every sample that is
positive for CTS3402 is also positive for those other two. This may change; for example in the future a sample may show up that is positive
for CTS3402 but negative for the other two SNPs, in which case all previous
branches of CTS3402 will be assigned to a new haplogroup, branching from
CTS3402, defined by either of those other two SNPs, while that hypothetical new
sample will be assigned into a new branch of CTS3402.
R1a has more than 100
phyloequivalent SNPs.
It’s OK to say “equivalent SNPs”
instead of “phyloequivalent” if the meaning is clear, but “equivalent” has
other unrelated meanings.
For clarity let me offer a tree
analogy for “phyloequivalent”. Between
branching nodes, a real tree has branch segments, which are smooth, without any
branching. The older branches of a tree
have long smooth segments. When that
tree was younger, there were lots of branches along that segment, but those
branches died and fell off the tree as it grew over time. Similarly, the Y-DNA tree has segments between nodes; the older branches of the Y-DNA tree
correspond to haplogroups that have initial segments that are long - sometimes
thousands of years in time, so there were many generations along that segment,
with multiple independent SNPs, that now seem to be phyloequivalent, because
all the other branches became extinct (no surviving males). I say “seem to be phyloequivalent” because a
new sample may show up from a man in a newly discovered branch, thereby
splitting that segment into two segments.
So a “smooth” segment really includes the possibility of small twigs
along that apparently smooth segment.
The metaphor of a tree is appropriate, because a large branch segment
with very few twigs looks smooth from a distance. A Y-DNA branch can be
smooth in one database (like the Polish Project) and not smooth in a larger
database (like Ysearch, if significant
branches in that segment are rare or absent in Poland).
Extinction: Over the life of a real tree, most branches
die and fall off. Similarly, due to the
statistics of male descendancy, most male lines become extinct over time. That seems surprising to many people, but it
is a well known statistical result. If
you want verification, search the web for the theorem called “gambler’s ruin”,
whereby a gambler with a fixed stake almost always loses everything when
playing at a casino, even if the odds are neutral. The number of males in a haplogroup
fluctuates up and down due to statistics over the generations, almost always
eventually fluctuating to zero, similar to a gambler’s stake. A haplogroup that survives for thousands of
years is like a very rare lucky gambler in a casino. In a casino the odds are usually fixed in favor of the casino; in the Y-DNA tree the odds were favorable
for male lines during population expansions, and the odds were unfavorable for
male lines during population declines.
Extinction of male lines is faster during population bottlenecks. Although almost all male lines become
extinct, there are those very rare lucky male lines that grow, forming a major
branch of the Y-DNA tree, with many new male lines, whereby most of those new
male lines eventually become extinct, repeating the process.
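The extinction argument can be tried out with a small simulation. This is a sketch under assumptions of my own choosing, not a model fitted to any data: each man is assumed to leave 0, 1, or 2 sons with probabilities 1/4, 1/2, 1/4 (one son on average, so the odds are neutral), and the function names are hypothetical.

```python
import random


def line_survives(generations, rng):
    """Follow one founder's male line; return True if any males remain at the end."""
    males = 1
    for _ in range(generations):
        # Assumed offspring model: 0, 1, or 2 sons with weights 1:2:1,
        # which averages one son per man (neutral odds).
        males = sum(rng.choices((0, 1, 2), weights=(1, 2, 1), k=males))
        if males == 0:
            return False
    return True


def extinct_fraction(founders=500, generations=100, seed=1):
    """Fraction of founder lines that have died out after `generations` generations."""
    rng = random.Random(seed)
    survivors = sum(line_survives(generations, rng) for _ in range(founders))
    return 1 - survivors / founders


print(extinct_fraction())
```

Even with neutral odds, the large majority of the simulated lines die out within 100 generations, while the few survivors tend to have many living males. Changing the weights to favor more sons mimics a population expansion; favoring fewer sons mimics a decline.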
It is common to call haplogroups
and corresponding SNPs “nodes”, particularly when discussing the Y-DNA tree
represented as an upside down tree, as a list, with haplogroups indented to
indicate branching. In such a list the
haplogroups and their corresponding SNPs do appear as nodes, not distinguished
from the true nodes which are MRCAs.
I avoid using “node” for SNPs, although such use is common in genetic
genealogy.
Actually, phyloequivalent SNPs are
almost always spread out in time, within the segment that is older than the
MRCA. I have more discussion about this
in the age topic, below.
There is another kind of mutation,
in a microsatellite, which is also called a short tandem repeat, STR. Briefly, an STR is
like a necklace. Each bead of the
necklace is the same short sequence of DNA, repeated multiple times. An STR can mutate such that the number of
repeats in the necklace changes. So an
STR mutation is expressed as the number of repeats after the mutation. STRs are not used for haplogroups because
they are not rare enough. You can read
more on line, for example wiki
STR.
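To make the necklace picture concrete, here is a small sketch that counts the beads. The DNA fragments and the motif “GATA” are made up for illustration, and the function name is hypothetical; real STR markers are read by laboratory methods, not by string matching.

```python
def count_tandem_repeats(sequence, motif):
    """Length of the longest run of back-to-back copies of `motif` in `sequence`."""
    best = 0
    step = len(motif)
    for start in range(len(sequence)):
        run = 0
        pos = start
        while sequence.startswith(motif, pos):
            run += 1
            pos += step
        best = max(best, run)
    return best


# Made-up fragments: the same "necklace" before and after a mutation.
father = "CCTA" + "GATA" * 11 + "TTGC"
son = "CCTA" + "GATA" * 12 + "TTGC"
print(count_tandem_repeats(father, "GATA"))   # 11
print(count_tandem_repeats(son, "GATA"))      # 12
```

In this made-up case the marker value would be reported as 11 for the father and 12 for the son.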
Here are some common terms (in boldface) for genetic genealogy. A marker (also
“locus”, plural loci) is a DNA location for an SNP or STR or other kind of
mutation. A haplotype is a set of gene values at any number of markers.
In Y-DNA genetic genealogy “haplotype” is usually used to mean a set of
numbers, for the values of a particular set of Y-DNA STR markers. The word sample
(collectively samples, data, or a database) refers to the Y-DNA STR and SNP values
from one man. A sample is also commonly
called a haplotype, but I avoid calling a sample a haplotype to make it clear
that a haplotype may or may not be present in a particular database of
samples. A clade is a general term for common descent, so an SNP haplogroup is
one kind of clade.
Many people, including me, in the
past, worked to “stay ahead” of the SNP haplogroups by analyzing STR
mutations. That’s because SNPs used to
be difficult to discover and expensive to test, while STR data was relatively
inexpensive. That’s changed since about
2013; SNPs are now discovered
relatively easily; SNP testing is
inexpensive. So STR analysis is no
longer as popular as it was. STRs still
have value for genetic genealogy. I
have more about STRs and haplotypes in my Method topic below.
Although SNPs are more important
than STRs for Y-DNA genetic genealogy, STRs are still valuable because on-line
databases have thousands of samples with STR data but not as much SNP data
(yet). You can search for statistical
matches to STR data.
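One simple way to score a match between two STR haplotypes is to add up the differences in repeat counts, marker by marker. The sketch below assumes that scoring rule (one common choice, not necessarily the rule used by any particular database or by my own methods), and the repeat values are invented for illustration.

```python
def step_distance(haplotype_a, haplotype_b):
    """Sum of absolute repeat-count differences over the markers both samples share."""
    shared = haplotype_a.keys() & haplotype_b.keys()
    return sum(abs(haplotype_a[m] - haplotype_b[m]) for m in shared)


# Invented repeat counts for three markers.
sample_1 = {"DYS393": 13, "DYS390": 25, "DYS19": 16}
sample_2 = {"DYS393": 13, "DYS390": 24, "DYS19": 17}
sample_3 = {"DYS393": 14, "DYS390": 22, "DYS19": 15}

print(step_distance(sample_1, sample_2))   # 2
print(step_distance(sample_1, sample_3))   # 5
```

A smaller score means a closer match. Only markers present in both samples are compared, which matters when the samples were tested on different marker sets.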
There are many organizations and
commercial companies on the web where you can order a cheek swab kit to mail in
for genetic genealogy STR and SNP testing and matching and analysis, for
example FTDNA. I am not associated with the company FTDNA; I mention them because I make extensive use
of their data; check Google for
competitors. At FTDNA, click on
Products for cheek swab kits. DNA
results are confidential unless you register the data at a public
database; at FTDNA, click on Projects
to register your data into one of the many databases; for example, most of my analysis is from the data in the FTDNA Polish Project.
For STR analysis, I prefer the
FTDNA standard set of 67 STR markers.
I do some analysis using the standard FTDNA 12, 25, 37, or 111 STR
marker sets. Other companies use
standard marker sets that may not overlap with all the FTDNA markers.
As a first Y-DNA test, I recommend
the FTDNA standard STR 37 set, because the result will automatically place you
in one of the main large haplogroup branches of the Y-DNA tree, and because
FTDNA provides you with matches to other men with similar STR haplotypes. If cost is not an issue, the 67 set is better for accurate
matches and the 111 set is best. The 12
and 25 sets are no longer available in the Product list at FTDNA.
Once you know your haplogroup you
can follow the on-line tree and purchase SNP tests to determine your younger
branches on the Y-DNA tree. I have
instructions available for SNP ordering.
If you already purchased a DNA test
your result probably already has your main Y-DNA haplogroup branch (for men),
so you can proceed with SNP testing.
Ysearch
is the largest web database for Y-STR data, run by FTDNA, open to all men,
including men who also register with projects and including men with data from
other testing services. I use Ysearch
often for analysis so of course I encourage you to register your STR data at Ysearch. From the FTDNA site, you can register your
data with Ysearch. Or you can type your Y-STR data into Ysearch.
Data
sharing: Thousands of men are
sharing their Y-DNA data by making it available on the web for analysis. Most DNA testing companies give you
choices: you can keep everything
private, in which case nothing is shared;
you can allow sharing of everything;
you can be selective, for example sharing STR and SNP data associated
with your test kit ID number but withholding your name; and other options. The Y chromosome has relatively few genes, and none of those few
have been correlated with health issues or significant human characteristics. Nevertheless, some people prefer to keep
their DNA private. One issue is non-paternal events in the past, whereby your
DNA matches may provide surprises.
Although I encourage people to submit DNA data to databases, I
understand if you are reluctant. I use
public databases, mostly FTDNA projects
and Ysearch, so my analysis is based on
data that is already public. I use
names of people if those names are already on the web. I ask permission before
using people’s names as references for new information or analysis.
Many people are using statistical
analysis of Y-DNA data to gain insight into human origins and migrations. I am one of those people. My interest is Polish origins. This web document, however, is not for
historical analysis and conclusions, except for occasional comments. This document is dedicated to Y-DNA data and
analysis, both SNP and STR, identifying haplogroups, types, and clusters
concentrated in Poland, with detailed explanations.
My Method topic has more definitions,
but that topic is more advanced, intended for readers with some experience in
genetic genealogy.
The Fall 2009 issue of the Journal of Genetic Genealogy has my
publication split into two parts:
Part I is my “mountains in
haplospace” method for evidence that certain “types” of STR clusters correspond
to clades.
Part II is the application
of that method to Common Polish Clades.
That article has a lot more detail than this web page, but that article
was published in the Fall of 2009, so this web page serves as an update.
PolishCladesUpdate is my
folder for updates of the Excel analysis files for those two articles.
This web page continues as an
introduction, summary, and update.
The Fall 2010 issue has my publication
announcing the L260 SNP.
New topic 6 Nov 2009. Last edit 27 May 2010.
An article was published online, 4
Nov 2009, essentially dividing R1a1 into two groups, based on a new SNP, M458.
I call this article “Underhill” for
short, because his is the lead name in the list of 34 authors for this major
work.
This web page about Polish Clades
was completely rewritten using this new information. Recent L260 and M458 test results are consistent with (albeit not
full proof of) my previous R1a subdivision into “types” here on this web page
about Polish Clades.
Briefly, most of R1a1a is split by
this new mutation into R1a1a7 (M458 positive, or M458+) and R1a1a* (M458-).
R1a1a7 is the new M458
haplogroup. R1a1a7 includes what I have
been calling P type and N type here on this web page, even before M458 was
available.
R1a1a* is a new paragroup. This is M458 negative. It
includes all my other R1a types.
This Underhill article has data for
158 “Poland” samples (Table 2):
R1a1a*: 71 samples 44.9%
R1a1a7: 87 samples 55.1%
The 70% confidence interval for
R1a1a7 is about 50% to 60% in the Underhill Poland data.
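An interval like this can be checked with a short calculation. The sketch below assumes the simple normal-approximation formula for a proportion, which may not be the exact method behind the figure quoted above; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(successes, total, confidence):
    """Normal-approximation confidence interval for a proportion."""
    p = successes / total
    # Two-sided interval: put half of the leftover probability in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sqrt(p * (1 - p) / total)
    return p - margin, p + margin


# 87 R1a1a7 samples out of 158 Poland samples, 70% confidence.
low, high = proportion_interval(87, 158, 0.70)
print(f"{low:.1%} to {high:.1%}")
```

With these inputs the interval comes out at roughly 51% to 59%, consistent with “about 50% to 60%”.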
Worldwide 77% of the Underhill data
is R1a1a* (222 in Table 7 vs 68 R1a1a7, 290 total).
M458 results are coming in now for
this new SNP test, and the Polish
Project R1a is splitting about evenly, with a few percent more R1a1a7 than
R1a1a*, although the latter is more common worldwide.
Format
Up to here, I have tried to write
this web page as news and summary, with links to more discussion below. I hope anyone having minimal familiarity
with genetic genealogy jargon has understood.
If you read this top to bottom, it gets progressively more detailed,
with more and more jargon. I’m sorry
about that, but the audience is also readers with genetic genealogy experience
who want to know how I came to my conclusions.
If you cannot follow some of this, it is written in a manner that you
can jump around and pick out what you do understand, then come back after you
have read more about genetic genealogy.
If you open this html document with
Word, all the link targets (bookmarks) can be viewed alphabetically or by
location; this serves as an index.
Rewrite 5 Jul 2016.
Lawrence Mayka is the administrator of the Polish
Project. Click on the Polish Project web link to see how
Larry assigns samples (men) to
categories. The Polish Project has
sections for mtDNA and for Y-DNA. This
web document of mine is restricted
to Y-DNA. I help Larry with assignments
to types.
Lukasz Lapinski and Paul Stone also help with assignments.
Haplogroups are defined by SNP mutations. The goal is to assign samples to their proper terminal
haplogroups. Your terminal haplogroup corresponds to
the youngest known SNP haplogroup in your
male line of the human Y-DNA tree. When the terminal haplogroup cannot be
assigned with reasonable confidence, assignment may be to an upstream haplogroup (one of the older
branch segments) leading to the terminal
haplogroup. New terminal haplogroups
(new younger segments) are continuously discovered because of the recent rapid
rate of discovery of relevant new SNPs.
Samples are grouped by assignment
categories on the Y-DNA STR data pages,
where the category titles appear as horizontal colored rows, followed by rows
of samples assigned to that category.
When appropriate, the assignment
category names include a recommended SNP for further testing, in order to
confidently determine the terminal haplogroup for those samples.
You can always save money by
waiting. DNA testing costs are coming
down as better testing methods are developed.
More detailed SNP packs will
surely be available in the future. As
more data accumulates, a sample may show up eventually that matches your
current STR data very closely, and if that sample has recent new SNP data with
positive results, your sample will probably be assigned (or predicted) to that
same corresponding new haplogroup at no cost to you. Assignment categories in the Polish Project are provided to help
men who are doing male line genetic genealogy research (and women doing
research on the male lines of their husbands and fathers and maternal
grandfathers), and who would like results soon, as detailed as possible.
If you are planning to purchase Big Y (next topic), there is no need to
purchase SNP tests, because Big Y includes just about all the commercially
known Y SNPs.
If you are new to Y-DNA testing,
you should join an FTDNA haplogroup project
corresponding to one of your main
haplogroup branches. The haplogroup
project administrators are usually up to date on the latest SNPs for that
haplogroup, and are often eager to help beginners figure out where they fall in
their branch of the Y-DNA tree. The
Polish Project administrators can also help out, although we may not be quite
as knowledgeable about your specific haplogroup. To find your main haplogroup branches, from your FTDNA home page
(Dashboard), click on “Haplotree and SNPs”.
The tree should come up, indicating your FTDNA assigned haplogroup. The tree should have the main branches indented
in rows above your FTDNA assigned SNP haplogroup (upstream haplogroups indented
to the left). To find haplogroup
projects, from your FTDNA home page, under the “Projects” tab, click on “Join a
Project”, then scroll down to the header “Y-DNA Haplogroup Projects”, then
click on the first letter of your haplogroup assignment; check out the projects that come up.
In the past, STR mutations were easier to
test than SNPs, so many samples have STR data without recent SNP
data. Predicted assignments (for
samples without up to date SNP data) are based on STR correlations, by comparison
to samples that have both STR and SNP data.
The men with such predicted assignments can verify their assignments by
ordering the corresponding SNP test that is named in the assignment.
The Polish Project also includes FTDNA computer generated assignments in a
column labeled “Haplogroup”, which uses a color code; green text means assignment based on a positive SNP test
result; red text means assignment based
on STR prediction. I do not know the FTDNA computer
algorithm for those red STR based predictions, but it is conservative; I notice they have more than 97% probability - less than 3% of those
red predictions end up in different haplogroups when they are eventually SNP
tested. However, that means most of the
SNPs for recently found branches of the Y-DNA tree are not predicted by FTDNA,
because there is not enough data for 97% probability predictions. Most of the newer SNPs are for younger
branches, where STR prediction simply cannot be done with such high confidence,
because those younger branches do not all have unique STR signatures.
The Polish Project assignments are
more aggressive. The assignment
guideline for predicted assignments based on STRs is better than 80%
probability of future validation
by an SNP. The intention is to provide
more STR based predictions, accepting the risk that some might later be found
incorrect. In practice, over the years,
the average validation rate in the Polish
Project is much better than 80% because most predicted samples fit much
better than 80%, and relatively fewer samples turn up in the approximate 80%
range.
We avoid recommending SNP tests
when a sample has more than 95% confidence of testing positive, because testing
money is better spent testing for the younger SNPs of the branches. A negative SNP test result is OK because by
eliminating a haplogroup, predicted assignment to another branch can be made
with higher confidence.
Some Polish Project SNP prediction
categories have qualifiers, such as “Predicted, Recommended, or Needed” to
indicate relatively higher confidence, or qualifiers such as “Credible,
Consider, or Borderline” to indicate relatively lower confidence. The Borderline category has the least
confidence; this category is an
exception to the 80% guideline;
estimated probabilities as low as 50% have been used, although rarely,
for samples that do not fit any other category.
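As a rough illustration of the probability guidelines above, the following Python sketch sorts an estimated probability into bands. The function name and the band labels are assumptions made for this illustration, and the Polish Project applies these guidelines by judgment, not by code like this.

```python
def confidence_band(p):
    """Sort an estimated probability (0..1) of a positive SNP result into a band.

    Thresholds quoted in the text: more than 95% (no SNP test recommended),
    better than 80% (guideline for predicted assignments), and 50% (lowest
    value used for Borderline). The band labels are hypothetical.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if p > 0.95:
        return "predicted; test younger SNPs instead"
    if p > 0.80:
        return "predicted; SNP test can verify"
    if p >= 0.50:
        return "borderline at best"
    return "no STR based prediction"


for p in (0.97, 0.90, 0.60, 0.30):
    print(p, confidence_band(p))
```

The Borderline band is used only for samples that do not fit any other category, so a probability between 50% and 80% does not by itself make a sample Borderline.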
Many haplogroups have multiple phyloequivalent SNPs. Polish Project Assignment categories select
one of those phyloequivalents to be used consistently in the category
name. Usually one of the first to be
discovered is used, and then changes are made only when new data causes
previous phyloequivalents to be split into different branch segments;
so a new code name is used only when necessary. I do not fully understand the FTDNA computer
assignments in this regard; samples in the
same haplogroup often use different phyloequivalent SNP code names in the FTDNA
“Haplogroup” column.
Clusters
and types are hypothetical haplogroups,
used as assignment categories based on STR analysis. There were more of these a few years ago, before the flood of new
SNPs. Many of those have been validated (equivalent to an SNP haplogroup) or shown to be
invalid (STR matches not having high probability
of belonging to one haplogroup). In
most cases the original cluster or type name is still used for assignments,
along with the corresponding equivalent
SNP. There are still some clusters and
types that have not been validated or invalidated.
To order a recommended SNP, from
your FTDNA home page, click on “Haplotree and SNPs”. The tree should come up, indicating your FTDNA assigned main haplogroup. The tree should have the SNP branches
indented in rows below (downstream -
younger than) your FTDNA assigned SNP.
FTDNA is not particularly fast to
add new SNPs. Other companies are
available; some of them honor requests
for the newest SNPs. If you are new to
SNP testing, ask for help from your haplogroup project administrator
in selecting new SNPs appropriate for you.
I generally recommend waiting for FTDNA because I like the convenience
of all SNP data for the Polish Project in one place, but I sometimes purchase
elsewhere. For SNPs that are not in the
FTDNA tree, see SNP ordering information.
SNP packs
are available and sometimes recommended in the Polish Project assignment
name. SNP packs include many SNPs at a
very low cost per SNP. SNP packs are
recommended for samples that have no SNP tests, or have not been SNP tested in
a long time, and where the assigned haplogroup has quite a few known younger
SNPs. You can usually find your
recommended SNP pack in the FTDNA tree, above your assigned position in the
tree.
Sometimes a Remainder category is used for paragroups, which means the remaining
samples from a haplogroup that have sufficient data to conclude that they do
not belong to any of the known subdivision branch categories of that
haplogroup.
Sometimes an Unassigned
category is used for samples from a main branch haplogroup without sufficient
STR or SNP data to assign those samples to one of the known branches.
STRs: Until a couple years ago I (and most
experienced genetic genealogists) recommended purchasing the maximum number of
STR markers. The FTDNA maximum standard set is 111; smaller standard sets have 67, 37, 25, and
12. More markers increased the
confidence of assignments; some types
and clusters required 67 or 111 markers for assignment; more samples with 111 markers allowed
discovery of more types and clusters.
Today, this is all still true, but SNPs are more important than STR
markers for assignments. Assignments
that need more markers have phrases like “SNPs or Markers Needed” in the
assignment name. A sample with only 12
markers can be assigned to a main haplogroup branch based on only 12, and from
there an SNP Pack can be purchased to
identify the appropriate downstream
haplogroups, and more SNPs can be individually tested in sequence to determine
the terminal haplogroup for that
sample. If cost is not an issue, Big Y
(next topic) is better than purchasing SNP tests.
Another purpose for STRs: Finding male line best matches from the
large number of samples without sufficient SNP data. The FTDNA site automatically lists close STR matches. For this purpose, more markers are
better. 111 markers is best if cost is
not an issue. However, even with 111
markers tested, the FTDNA site does check for matches at fewer markers because
there are many samples in the FTDNA database that have only sets with fewer.
New Topic 6 Oct 2015. Edit 8 Oct 2015.
“Big Y” is a commercial
project by FTDNA for reading about 12 million base
pairs of the DNA of the Y chromosome, which has about 60 million base pairs
total. New SNPs
are being discovered in the Big Y data provided by customers.
Link: https://www.familytreedna.com/learn/y-dna-testing/big-y/.
The FTDNA home page for your DNA
kit has a link for ordering Big Y, and for later viewing the results.
If you are new to Y-DNA genetic
genealogy, you might ask for help on Big Y, from an administrator of your haplogroup project, because Big
Y results are not easy to understand. Yfull does an excellent job of analysis of
Big Y results, at a modest price; I
highly recommend joining Yfull by submitting to Yfull your “BAM”, which is the
very large data file in Big Y results.
The cost is currently $575. Other companies offer similar tests; I recommend FTDNA because I like the
convenience of Polish Project data being available at one place.
If you are planning to purchase Big
Y, there is no need to purchase SNP tests, because Big Y includes just about
all the commercially known Y SNPs. In
addition, Big Y lists your private
SNPs, corresponding to recent Y chromosome mutations in your male line
ancestors. There is a good chance one
of your private SNPs will match up with a private SNP of someone who previously
purchased (or later purchases) Big Y, in which case that private SNP will
define a new relatively young terminal haplogroup branch for just
the two of you; more men can later test
for only that SNP to see if they belong to that new haplogroup.
If you purchase Big Y you probably
won’t have to purchase any more SNPs for quite a few years. Even when a test reading more of the Y
chromosome at a lower price becomes available, men in your male line using that
newer test might discover additional SNPs not in your Big Y results, but those
SNPs may be phyloequivalent to
your Big Y SNPs.
Other testing companies provide
similar tests. I don’t work for FTDNA. If I seem to be pushing their tests it’s
because I encourage men with Polish ancestry to take up to date tests and to
join the Polish Project, where FTDNA data is readily available for analysis.
Description
of the R1a Branches
Edit 25 Apr 2016.
There are separate topics below for
descriptions of selected categories in Haplogroups I, N, and R1b.
This is a long topic with many
short subsections, each for a category.
Many of these subsections are out of date and need to be rewritten. The subsections without a date on the first
line may be a few years old.
This large topic has descriptions
for many of the Y-DNA categories at the Polish Project. Some of these are haplogroups, some are types, some are clusters.
Types and clusters are high confidence
hypothetical haplogroups.
Please don’t get confused. The following capital letter names are my
codes for R1a categories. Capital
letters are also used for the large official haplogroups, but that’s different.
Some of the following categories
are discussed in my November 2009 publication,
and may have archive copies of my 2009 Excel analysis files stored in the Supplementary
folder. Many of the following types
have my update Excel analysis at PolishCladesUpdate.
A. (Y2619).
Rewrite 28 Apr 2016. Edit 26 Apr
2016.
A type seems equivalent to the SNP haplogroup Y2619.
A type seems predominantly Ashkenazi, judging from the family names.
My analysis file: AType.xls. My Definition,
using all 67 standard markers,
with cutoff 10, gap 5, is available at
Ysearch as FCUFG.
Y2619 is a branch of Z93, which is a branch of R1a. In the Results Table,
Z93 is 2.6%, mostly A type at 1.5% (early 2016). That table is restricted to samples from
the Polish Project that indicate “Poland” for male
line ancestry. Z93 is unusual in that
only 37% indicate “Poland”, perhaps because this haplogroup seems to be
dominated by Ashkenazim, who join the Polish Project looking for close Y-DNA
matches. The Polish Project welcomes
descendants from Historical Poland, and
in the case of Z93 many of the samples even give origin from countries outside
the historical Poland borders. The
frequency of Z93 in the full Polish Project is about 7%.
A type is one of the more
dramatically "isolated in haplospace" clades that I have analyzed using my type
method. With cutoff 10 and gap 5, there
are no samples in the Polish Project at steps (STR mutation count from my A type definition) 10 through 14.
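The cutoff and gap idea can be sketched in a few lines of Python. This is a minimal sketch, assuming that "step" means the summed difference in repeat counts between a sample and the type definition; the exact step rule used in the analysis files may differ, and the marker values below are made up for illustration.

```python
def step_count(sample, modal):
    """Total STR steps between a sample and a modal haplotype.

    Assumption: one step per repeat of difference, summed over the
    markers in the definition.
    """
    return sum(abs(sample[marker] - value) for marker, value in modal.items())


def is_within_cutoff(step, cutoff):
    """Samples below the cutoff are taken as members of the type."""
    return step < cutoff


def gap_is_empty(steps, cutoff, gap):
    """True when no sample falls at steps cutoff .. cutoff + gap - 1."""
    return not any(cutoff <= s < cutoff + gap for s in steps)


# Made-up three marker modal and sample, for illustration only.
modal = {"DYS393": 13, "DYS390": 25, "DYS19": 16}
sample = {"DYS393": 13, "DYS390": 24, "DYS19": 17}
print(step_count(sample, modal))               # 2
print(gap_is_empty([2, 3, 7, 15, 18], 10, 5))  # True: nothing at steps 10-14
```

With cutoff 10 and gap 5, an empty gap means no sample sits at steps 10 through 14, which is the situation described for A type.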
Explanation of A type isolation
(see also Population Bottlenecks): At the Yfull tree,
Y2619 has 16 phyloequivalent SNPs. Y2619 is the only known (at Yfull) branch of
CTS6, which has 11 phyloequivalent SNPs.
There is only one sample on the twig CTS6* (CTS6+
Y2619-), so not counting that one exception, those 27 total SNPs are spread out
in time on a long segment of the tree. Yfull estimates the length (in time) of that
segment as from 3400 ybp to 1450 ybp.
That’s 2 millennia of isolation (early 2016 estimate). That's a good explanation for why A type is
so well isolated in STRs - a long time for unique STR mutations.
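The arithmetic behind that isolation estimate can be laid out explicitly. The sketch below only restates the Yfull numbers quoted above (early 2016); the final division is a plain average added here for illustration and should not be read as a calibrated mutation rate.

```python
y2619_equivalents = 16    # phyloequivalent SNPs listed for Y2619
cts6_equivalents = 11     # phyloequivalent SNPs listed for CTS6
segment_start_ybp = 3400  # older end of the segment, years before present
segment_end_ybp = 1450    # younger end of the segment

total_snps = y2619_equivalents + cts6_equivalents
segment_years = segment_start_ybp - segment_end_ybp
years_per_snp = segment_years / total_snps

print(total_snps)               # 27
print(segment_years)            # 1950
print(round(years_per_snp, 1))  # 72.2
```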
On Ysearch, that FCUFG definition
is fairly well isolated, although not as dramatically as in the Polish Project.
There are no CTS6* samples in the
Polish Project, nor in the R1a Project (early 2016).
The SNP sequence leading to A type
is Z93 > Z94 > Z2124 > Z2122 > F1345 > CTS6 > Y2619. Other Z93 samples are sparsely spread among
the Z93 paragroups excluding A type in that
sequence. There is no significant
evidence of any concentration in Poland.
Possible exception: The Polish
Project has 5 samples with Poland male line origin that are Z93+ Z94-. Two of them have Big Y. Perhaps these represent a very small
haplogroup, to be discovered in the future, that is concentrated in Poland.
A type is discussed in my publication, Part II. The definition, using 67 markers, has been
available since 2008 at Ysearch as FCUFG. This web page has consistently predicted
that an SNP would be discovered that would seem
equivalent to A type, which is the case today with Y2619.
Levy-Coffman wrote an article
about Ashkenazi genetic genealogy. I
noticed discussion in a Science
article.
I have consistently expressed more
than 98% confidence that A is a
valid clade, not just because of my work, but because the modal haplotype
closely matches the various versions of the most common Ashkenazi haplotype,
which has been widely studied and reported on the web. It should be emphasized that not all
Ashkenazim match this type, and some men in this type may not be descended from
Ashkenazim.
Between 2008 and 2011 I predicted
that A type was a subtype of K type, but I never had high confidence in that
prediction. I eventually dropped K
type, but many of the hypothetical divisions of K type are coming out as valid
clades of Z283, which is a “big brother” SNP to Z93. The match of A type to K divisions at the first standard set of 12 markers is now seen to be
a coincidence. Older publications call
that 12 marker haplotype, very common in Eastern Europe, the “Ashkenazi”
haplotype, but we now know that only a small fraction of men who match at 12
markers are Ashkenazim.
B. (Y2902).
Rewrite 30 Jun 2016. Edit 5 Jul 2016.
B type is equivalent to the haplogroup defined by the SNP Y2902.
In the Results
Table, B type is 3.3% (early 2016).
That table is restricted to samples from the Polish Project that indicate “Poland” for male line
ancestry. For the full Polish Project,
open to all of Historical Poland, 21 of
33 B type samples indicate ancestry from “Poland” implying the current borders
of Poland. However, Y2902 is common
throughout Eastern Europe. In the R1a Project Y2902 is called “Carpathian-Russian”, which
is not particularly concentrated in Poland.
Perhaps one or more branches of Y2902 are concentrated in Poland; we
might soon see evidence for that as more B type samples get tested for terminal SNPs.
At the Yfull tree (Jul 2016), Y2902 has 18
known branches.
At the Yfull
tree (Jul 2016), Y2902 has 20 phyloequivalent
SNPs, which demonstrate why B type is isolated in
haplospace; for more discussion of
this point see Population Bottlenecks. I don’t expect any of those 18 known
branches to form a significant STR cluster
or type, because none of them has more than 6
phyloequivalent SNPs.
Mayka
suggested the B cluster to me in Sept 2009, at which
time I verified that it qualified as a type, with SBP 12.4%,
although there were only 11 samples at the time. I mentioned B type in my 2009 Publication.
B type STR signature: 389=(14,31), 458=15, 442=12, 406=10,
643=11. The 111 set is needed for that DYS643 marker.
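A signature like this can be checked against a sample with a short Python sketch. The marker labels and the function name are assumptions for illustration; in particular, reading 389=(14,31) as the two values 389I=14 and 389II=31 is an assumption about the notation. A signature check is only a quick hint, not a substitute for the full type definition.

```python
# B type signature values as quoted in the text.
B_SIGNATURE = {"389I": 14, "389II": 31, "458": 15, "442": 12, "406": 10, "643": 11}


def signature_matches(sample, signature=B_SIGNATURE):
    """Return (matched, tested): how many signature markers agree, out of
    those the sample actually has. Missing markers are skipped."""
    tested = [m for m in signature if m in sample]
    matched = [m for m in tested if sample[m] == signature[m]]
    return len(matched), len(tested)


# Hypothetical 67 marker sample: no 643 value, one mismatch at 458.
sample = {"389I": 14, "389II": 31, "458": 16, "442": 12, "406": 10}
print(signature_matches(sample))  # (4, 5)
```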
My definition
for B type, using 66 of the standard 67 markers,
is available at Ysearch as RU8Z8. The cutoff is step 13.
In the Polish Project we have
always used a “B Borderline” assignment for samples near the cutoff (a few
steps below or above). We try to
maintain 80% confidence for individual assignments, but this type is an example
where “borderline” has been used for samples with
lower confidence if no better assignment is possible. I have noticed that which samples make the cutoff is sensitive to
exactly which markers are used in the definition, and I have tweaked the
definition year to year as data accumulated.
However, samples well below the cutoff are consistently predicted B
type. You can see this in my analysis
file, BType.xls,
where the best fit 24 samples always come out B type using 30 to 67 markers
(with my automatic marker selection method), and none of these have tested
Y2902-. SBP is 20% in that analysis
file, which is good but not great. At
111 markers, there are proportionally fewer borderline samples but it is still
an issue.
I now realize the reason for this borderline
issue: The Yfull tree shows a lot of
immediate branches for Y2902 - a bushy tree - probably due to a rapid
population expansion at about the TMRCA. Apparently they all branched again during or
not long after the expansion. So STR outliers should be expected among
the various branches, because there has been a lot of time for STR mutations to
accumulate within the full Y2902 population.
My calculated Polish Concentration
Index for B type is 13%, not good enough for the PCI table. Maybe a future branch of B type will do
better.
CTS11962. See N Type, equivalent to CTS11962.
D. (Y2613).
Rewrite 5 Jul 2016.
D type seems equivalent to the haplogroup defined by the SNP Y2613.
In the Results
Table, D type is 2.0% (early 2016).
That table is restricted to samples from the Polish Project that indicate “Poland” for male line
ancestry. For the full Polish Project,
open to all of Historical Poland, 20 of
35 D type samples indicate ancestry from “Poland” implying the current borders
of Poland. However, Y2613 is common
throughout Eastern Europe. In the R1a Project Y2613 is called “Carpathian-Dalmatian”,
which is not particularly concentrated in Poland. Perhaps one or more branches of Y2613 are concentrated in Poland;
we might soon see evidence for that as more D type samples get tested for terminal SNPs.
At the Yfull tree (Jul 2016), Y2613 has 8 known
branches.
At the Yfull tree, only one sample
is Y2613*; the other 12 samples are in
Y2609, the only known branch of Y2613 (Jul 2016). In the Polish Project no samples have yet tested Y2613+
Y2609-. Only one of those 12 at Yfull is
Y2609*; the other 11 are in Y2608. (Another branch of Y2609, YP4993, has no samples
listed.) In the Polish Project only one
Y2609+ sample has come out Y2608-.
Summary: D type seems also equivalent to the haplogroup branches Y2609 and
Y2608.
This type was added to the Polish
Project in Jan 2010, before the SNPs were discovered. The cluster was brought to my attention by Mayka, who pointed out that Nordtvedt mentioned the cluster in web
discussions some time before that, based on DYS462=12.
Signature (460,481,462,650) =
(10,<22,12,18). Those last two are
only available in the 111 STR marker set,
where DYS462 is the best signature marker.
At 37 markers, only DYS460 is available. D type cannot be distinguished using the 25 marker set.
My definition
for D type, using 46 of the 67 markers, is available at Ysearch
as K49NZ. The cutoff is step 6.
That definition is also available
in my analysis file DType.xls,
where SBP = 9.1% (data from early 2016), which is a good
indication of confident STR prediction.
In fact there are no outliers so far (no
predicted D type with a Y2613- result and no Y2613+ not predicted D type based on
STRs). I don’t recall any D type sample
over the years being reassigned to another category.
Yfull (Jul 2016) lists 11 phyloequivalent SNPs for Y2613, explaining why it
is so well isolated. I cannot construct STR based divisions of
Y2613, and I doubt it will be possible because Yfull shows only 4
phyloequivalent SNPs for the branch Y2609;
Y2608 has only 3.
My calculated Polish Concentration
Index for D type is 14%, not good enough for the PCI table. Maybe a future branch of D type will do
better.
E. (YP569).
Rewrite 11 Jun 2016.
E type seems equivalent to the haplogroup defined by the SNP YP569, a branch of Z92. YP569 is not particularly concentrated in
Poland, but there are plenty of E type samples in the Polish Project from the
region of the Historical Polish Commonwealth.
E type is well isolated in STR Haplospace, so it was confidently identified as a hypothetical
clade in early 2010, well before Z92 was discovered. V. Rudich entered a modal for this cluster
into Ysearch as ID MW7DP,
named “North East European”. Mayka
modified it slightly for the modal used here by me, GNYBG,
named “Belarus”, 67 markers, which is
still an excellent definition for E type in
2016. My June 2010 analysis EType.xls is
still available on-line. In Jan 2016 E
type had 6 YP569+ samples and no YP569- samples yet; there is only one YP569+ sample not captured by the E type
definition, but that one is marginal, right at the cutoff; when a few more like this show up I’ll edit
the E type definition to include them.
At the Yfull
tree (Jun 2016) YP569 is listed with 15 phyloequivalent
SNPs, which demonstrates (see Population Bottlenecks)
why it is such a good type.
FH Clade. F and H types were suggested by Mayka.
They have the signature
(439,511,452) = (11,11,28). They differ
from each other, so I could not make a combined FH type.
I can make a reasonable FH cluster,
but it is not necessary, since the FH clade can be better defined as the
combination of the three types Fa, Fb, and H.
The original F type (introduced Jun 2010) was split into Fa and Fb in
Dec 2010. DYS452 is not one of the FTDNA standard markers, so
not many Polish Project members have this marker evaluated. Mayka and I helped most of the Polish
Project members in FH, and members just beyond FH, to get 452 evaluated. Samples beyond FH have 452=30. My analysis files do not use 452
for determination of SBP. 452 would not significantly lower SBP
because most of the background near
the cutoff for each type are samples
from the other two. In other words, Fa,
Fb, and H are very well isolated from the rest of R1a, but not so well isolated
from each other. These three FH types
do not seem to be specifically concentrated in Poland (per Ysearch) although
they are concentrated in Slavic countries including Poland. All three types seem quite young, with
relatively low STR variance (see the ASD sheets in the analysis files).
FH
Borderline. The borderline
samples from Fa, Fb, and H are combined into a single FH Borderline category in
the Polish Project, because these clearly belong to the FH clade but have less
than 80% probability of belonging to any one of the 3 types.
Fa. Ysearch YQ6D2. 66 markers, cutoff 9, gap 2. SBP = 27%.
See FH clade, above.
Fb. Ysearch EFQM7. 56 markers, cutoff 5, gap 4. SBP = 23%.
These samples were the original F type, before Fa was split off. See FH clade, above.
H. Ysearch 559EE. 58 markers, cutoff 7, gap 3. SBP = 14.5%. See FH clade, above.
G. (L365,
YP389, YP269).
Update 19 Sep 2016.
This type was suggested to me by Mayka, who calls it the Pomeranian
cluster. Pomerania is the name of the
region on the south shore of the Baltic Sea including regions of both Germany
and Poland. Marcin
Wozniak found the G modal haplotype (at 12 markers) to be very common among
Kashubians. Kashubians consider themselves
an ethnic group or nationality within Poland.
It will be interesting to determine if Kashubians in Poland have a
higher % concentration of G type than German Pomeranians. Meanwhile, “Pomeranian” is a convenient
neutral name, suggests Mayka.
G type is mentioned only briefly in
my publication because not much
data was available to me at that time.
My GType.xls
update analysis file with June 2010 data had excellent results: There are 12 samples in a nice type with SBP
= 11.2%.
I introduced subtypes Ga and Gb
based on STRs. Ga did not pan out, but Gb looks promising. I
entered G type on Ysearch as ZD29Z,
and Gb as 4KMZQ.
In Jan 2011 a new SNP, L365,
seemed to include G type, which I mentioned here at this web page, based on
only 5 samples at that time.
Now, Sep 2016, L365 has known
branches YP389 and YP269; G type seems to be equivalent to YP389, and Gb type seems
equivalent to YP269.
This type should not be confused with another G type in the N haplogroup.
I. (S18681).
13 Nov 2015 link to Stanaszek facebook page: https://www.facebook.com/R1aS18681/?fref=ts
Map showing location of S18681 samples (men): https://www.google.com/maps/d/viewer?hl=pl&authuser=0&mid=zIcwIZnt7lUg.kLMb_kZH4B3c
Update 26 Feb 2015. Lots of new SNPs have been discovered in the
past few months, so my Nov rewrite is already out of date.
For the latest
status check http://www.yfull.com/tree/R-S18681/.
Rewrite 9 Nov 2014:
I type seems to be about equivalent
to the new SNP S18681.
In other words, if you test
positive for the SNP S18681, that places you in the S18681 haplogroup, which is closely
equivalent to what I have been calling I type.
Samples that match I type at 111 STR markers are coming
out positive for S18681. Samples that
do not match I type at 111 markers are
coming out negative for S18681, with only one exception.
At 67 markers, there are three
S18681+ outlier samples that do not
match I type. I type is defined by STRs, so future S18681+ outliers may not match
I type, and a few samples beyond the I type cutoff
may come out S18681+ in the future. I
have been slightly adjusting the definition
of I type as more 67 marker data accumulates, so the definition has been
improving with time.
At fewer than 67 markers the
probabilities of outliers are higher.
More discussion about this below.
Most but not all I type samples in
the Polish Project are also coming out positive for the new SNP YP331. There are two
newer SNPs, YP314 and YP315,
that are located between S18681 and YP331.
The most recent SNP finding is Y5973.2.
The “father” of S18681 is CTS8816,
with the two “brothers” L1280 and Y2902.
The SNP sequence is R1a > Z280
> CTS3402 > CTS8816 > S18681 > YP315 > YP314 > YP331 >
Y5973.2.
This recent work on new SNPs is
being done by Stanaszek, Milewski, Lapinski, and Mayka.
Łukasz Stanaszek has a document R1a_S18681.doc
with a listing of I type samples from both projects, along with discussion of
the possible origin of the S18681 haplogroup.
Michał Milewski has a tree chart for Z280, which includes S18681,
at the forum: http://eng.molgen.org/viewtopic.php?f=77&t=1464&start=120 Check that forum topic for the most recent
update.
My definition for I type is
published at Ysearch EKVHX,
uses 58 of the 67 STR markers, cutoff 8, SBP 16.2%.
My analysis file is available as IType.xls. My Aug 2011 definition, which used 62
markers, still works quite well, as demonstrated in that xls file.
I type shows the highest
concentration in Poland using my Poland Concentration Index,
as listed in the table at the top of this web page. For details see Ysearch.xls.
This analysis so far is for I type
results posted at the Polish Project. There are more I type results posted at the R1a Project, so let me continue with
comments for both:
My I type definition works OK in
the R1a Project and at Ysearch, but not as well as in the Polish Project. One obvious difference is that there are
only four samples in the Polish Project confirmed or predicted into the
paragroup YP314+ YP331-, while this paragroup at the R1a project is almost as
large as the haplogroup YP331+. This
paragroup does not seem to be as concentrated in Poland as YP331+, which
explains why my I type definition, tuned to the Polish Project, does not work
quite as well in the R1a project. Those
four paragroup samples in the Polish Project are highlighted in that file
IType.xls.
SNP confirmed data: So far (14 Oct 2014, both projects) there
are only 3 samples confirmed with SNP tests in the paragroup S18681+ YP315- (plus one
cousin assumed). There is only one
sample confirmed in the paragroup YP315+YP314-. There are only three samples confirmed in the paragroup
YP314+YP331-. There are 10 samples (59%
of 17 fully confirmed samples) confirmed in the haplogroup YP331+ (plus one
cousin assumed). Three of those 10
YP331+ are in the new haplogroup Y5973.2, and one of them is confirmed
Y5973.2-.
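The counts above can be tallied in a short Python sketch to show where the 59% comes from. The counts are the SNP confirmed samples quoted in the text (14 Oct 2014, both projects); the "cousin assumed" samples are left out, as they are in the 17.

```python
# SNP confirmed samples by group, as quoted in the text.
confirmed = {
    "S18681+ YP315-": 3,
    "YP315+ YP314-": 1,
    "YP314+ YP331-": 3,
    "YP331+": 10,
}

total = sum(confirmed.values())
for group, count in confirmed.items():
    share = 100 * count / total
    print(f"{group}: {count} of {total} = {share:.0f}%")
```

The four groups add up to the 17 fully confirmed samples, and YP331+ is 10 of 17, which rounds to 59%.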
There are about 20 samples
confirmed or predicted S18681 that need testing for the branches; we guess that most of these will come out
YP314+, and most of those will come out YP331+.
In other words, the largest
subdivision of I type is looking like the haplogroup YP331+. The second largest looks like the paragroup
YP314+YP331-.
The I type samples not yet SNP
tested with low step at 67 or more markers
are predicted S18681 with high confidence.
There are “Borderline” samples close to the cutoff for I type, and
samples close to I type at <67 that might be S18681, but cannot be predicted
with high confidence.
The Polish Project and the R1a
Project categorize all samples and recommend which SNPs should be purchased for
those interested in determining the paragroup or haplogroup for their
sample. See Stanaszek (link above) for
the combined recommendations.
History of I type: This name was introduced by me in my Fall 2009 publication, Part II, page
178. I named it after my Polish
Iwanowicz grandfather, who carried this type.
Later, I was informed that Russian web sites had been
calling this STR cluster “Northern Carpathian”.
The best ranked signature marker for I type is DYS578=9. The ancestral value is 8. DYS578 has the second slowest mutation rate
of the 67 standard markers per the Chandler
rates. This marker is in the 37 set,
but not in the 25 set. So the 37 set is
a reasonable predictor for I type, while the 25 set is not. The 9’s are colored orange in that analysis
file IType.xls. Three other good
signatures are in the standard
111 set, but not in the 67 set: DYS463=24; DYS532=12; DYS504=14. Another
fairly good signature, available in the 25 set, is DYS458=14, again
orange in the file. This is a rapid
mutator, so there is more variance.
DYS511>10, available in
the 67 set, highlighted in IType.xls, seems to be a marker for paragroup
YP314+YP331-, while the ancestral value 10 dominates the rest of S18681. However, one STR marker does not provide
very confident assignments.
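Which signature markers can be used depends on the marker set a sample has. The Python sketch below restates the signature values and set availability given above; the table layout and the function name are assumptions for illustration. The DYS511 paragroup marker is left out because it is not a signature for I type as a whole.

```python
# marker -> (signature value, smallest standard set named in the text)
I_SIGNATURE = {
    "DYS458": (14, 25),
    "DYS578": (9, 37),
    "DYS463": (24, 111),
    "DYS532": (12, 111),
    "DYS504": (14, 111),
}


def usable_markers(panel_size):
    """Signature markers available to a sample tested at the given set size."""
    return sorted(
        marker
        for marker, (_, first_set) in I_SIGNATURE.items()
        if first_set <= panel_size
    )


print(usable_markers(25))  # ['DYS458']
print(usable_markers(37))  # ['DYS458', 'DYS578']
```

At 25 markers only the fast mutating DYS458 is available, which is why the 25 set is a poor predictor, while the 37 set adds the slow DYS578.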
Previous subdivision attempts of I type: At this web page, I have reported that I
type has a particular structure: I type
has always formed a well isolated mountain
in STR haplospace, with relatively
few samples near the cutoff. However,
the mountain is broad, not tall, more like a range of foothills, in the graph
of number of samples vs step (STR mutation count). I type has a few bimodal
STR markers. All this implies
subtypes. Over the years, I have tried
various subtype definitions within I type, and I have seen attempts by others
on the web. But the subtypes have never
provided statistical confidence; they
are not well isolated in STRs.
Now we know why. Those four paragroups and that one
haplogroup all seem to be old branches.
It seems the S18681 tree is more like a bush. Indeed, the confirmed samples do not come out together when
grouped by STRs at 67 markers. (Not
many samples have 111 markers yet.) In
fact, most of the confirmed samples from small paragroups seem closer using 67
STRs to samples in other groups than to members of their own group. That’s because there are many more samples
in the database from other groups, and some are bound to match more closely
just due to the luck of random mutations.
In
other words, I type is a broad mountain in haplospace because it has
many relatively old branches.
What does this mean? It’s difficult to say with certainty, but
here is what I consider the simplest speculative explanation: S18681 I type originally grew quickly and
prospered while other related clades died out, leaving this clade well isolated
in STR haplospace. Over the years, a
number of I type clades survived, all with approximately the same age.
J. (YP977) Update 19 Sep 2016.
This type was suggested by Mayka.
I documented it here in June 2010, when there were only 6 members in the
Polish Project, but with JType.xls at that
time this type was well isolated at SBP= 13%.
I entered J type into Ysearch as 743N9.
Now, Sep 2016, J type has 17
samples and seems to be equivalent to YP977, but
only 4 of these have tested YP977+ (no other samples are YP977+); we need to wait for more testing for
confidence in YP977 as equivalent.
K. Update 26 Apr 2016. K type dropped. This “K” topic will be removed in a future update.
Kv,
Kx. Small clusters, need documentation.
I have been using the subscripts
“z”, “y”, “x”, etc backwards through the alphabet because I am running out of
letters for new clusters and types.
These small hypothetical clades seem to be subclades of K, although I do
not have high confidence about the subclade status.
Ky. Ky cluster is now called Z92y
cluster.
Kz. Update 5 Oct 2011: Based on 1 Oct 2011 Polish Project data. Analysis file: KzType.xls. Ysearch 9QJFQ.
Kz type was suggested to me by Mayka
on 6 Oct 2010. Mayka speculates this
might be a clade of Kazakh origin.
There were only 3 samples in Kz last year; now there are 6.
That KzType.xls file demonstrates
that the same 6 samples are extracted using any number of markers from 2 to 67,
so the definition is not critical for this well isolated type.
Kz is effectively more isolated
than the SBP values (row 12 in that file) indicate, because the samples just
beyond Kz are all confidently assigned to other clades and types. For this reason, those SBP values are moot.
I’m using a hand edited definition,
Kz59, using 59 markers, for the following reasons:
Kz is unusual in that 5 of the 6
samples have an unusual value for at least 2 markers. I highlighted these values in red in that file. Notice also the high step values for those
6, 8 through 11, using all 67 markers (column BY), although SBP came out 27%,
which is an excellent low result for 67 markers. The obvious (but speculative) interpretation: each of the 6 samples seems to be a
representative of a branch of this hypothetical clade, where each of the 6
branches has a node not much younger than
the TMRCA.
Hand editing like this does
introduce some selection bias, so the calculated SBP=10.7% for Kz59 is
misleading (but moot). Countering the
selection bias, many if not most of those 8 markers that I masked out might
represent small tribal sized subclades, so future prediction of new Kz samples
should work better using Kz59 with those 8 removed. Again, this is moot, because any number of markers extract the
same samples.
The far right of the “ASD” sheet
has the markers sorted by apparent age, with “M” indicating the markers that I
masked out. You can see that my
selection is a bit arbitrary; I could
have masked less than 8, or more than 8.
ASD age using all 67 markers comes
out 724 years, cell N12. ASD age using
the 59 markers not masked out comes out 704 years, cell N29, not much
less. ASD age has a number of caveats, and 6 samples are not
significant, so this age is highly uncertain.
Kz is clearly young, as haplogroups go.
Additional information supplied to
me by Mayka: Three of the Kz type
samples are from non-Polish men who suspect they have Polish male line
ancestry, so it is not certain Kz type is Polish. Kit number 152824 in Kz is from a man who purchased WTY and found the new SNP L399, but that SNP
appears to be private, restricted to his family. Insofar as that man recruited 3 more Kz samples into the Polish
Project, Kz seems proportionally twice as large.
Kz has the prominent signature DYS459b=18. Mayka points out the additional signature
DYS461=12, not one of the 67 marker set;
most of the samples in Kz have been verified with this 12 value. Since the Polish Project neighbors (step at or beyond cutoff of Kz) are all assigned to other
hypothetical clades, we do not know if the signature markers define a larger
father clade.
L. This cluster is highly hypothetical. It is rare in Poland, but second in size to
K in European R1a1. Larry Mayka suggested this cluster to me. It is a well known Scandinavian
cluster. I checked it briefly,
and it seems to be a “type” by my definition.
However, no Polish Project sample matches at 80% probability yet, so I
am not yet using it for classification here.
More documentation about L will be available here when I find time to study
it.
L1029
is the main branch of CTS11962. At 67 STR markers, L1029 samples can be predicted
as those that fit N type (CTS11962) but do not fit Np cluster (YP515). Of
course, an SNP test is preferred. L1029
has been available as an SNP test since March 2012 at FTDNA. For more discussion see the topics N type
(CTS11962) and M458.
L1080. New
SNP needs documentation here.
L260. See P Type, equivalent to L260.
L342.2. New topic 30 Oct 2011. This SNP
was recognized as a new haplogroup by ISOGG
during the summer of 2011. This was an
L342 haplogroup category at the Polish Project for a short time in the summer
and fall of 2011, but it has been replaced by Z93,
because it seems all the L342.2+ samples are also Z93+ in the Polish
Project. Apparently there are very few
men elsewhere in the world found to be Z93+ L342.2-.
Z93 is a more reliable SNP than
L342.2, so it is recommended that men first test for Z93. L342.1 is the same mutation as L342.2,
discovered earlier in the E haplogroup.
L342.2 is equivalent to L319, L348, and L349, so all 4 SNP tests
together are more reliable. These 4
mutations are in the same segment, which is apparently a segment that mutates
relatively rapidly. Z93 is recommended
as the better test for R1a samples that do not fit STR definitions of other R1a
haplogroups; the Z93+ samples can do
the L342.2 test. This information about
L342.2 was supplied to me by Mayka.
The Z93 category has the samples
that do not fit the two known subdivisions:
A type and L342T cluster (next
topic).
L342T. New topic 30 Oct 2011. Based on 26 Oct 2011 Polish Project
data. Analysis file: L542TCluster.xls. I just noticed this cluster.
L342T is not a type, because SBP
did not come out low enough. However, I
included this cluster discussion here for the following reasons:
Seven samples at 67 markers fit my
new 48 marker definition for L342T.
There are 19 A type samples, which
should all be in the same L342.2 (Z93) haplogroup, but those A samples do not
fit L342T; the closest A’s are at step 8, where the cutoff is 6. There are 5 more L342.2 (Z93) samples at 67 markers, and those 5
also do not fit L342T, falling at steps 11 through 21. In other words, L342T is well isolated from
the other L342.2 (Z93) samples, including the A type branch. The one background sample (STR values fit
the L342T definition) and the four samples beyond the cutoff, are assigned to K
type and to subtypes of K; Z280 has
recently become available for K type;
as those background samples get tested in the future for Z280, my L342T
cluster will start looking better. Let
me say that another way: a cluster
should be analyzed with data from its own haplogroup, so L342T should be
compared only to L342.2 (Z93) data. But
there is very little L342.2 (Z93) data available, so I used the full R1a
database in that xls file. That means
L342T is likely more isolated than it seems right now, so it is more likely to
correspond to a valid haplogroup.
Mayka
pointed out to me that some of the L342T samples have Tatar ancestors. That’s why I used the “T” in the code
name. Of course, Tatars may belong to
only a branch of L342T; I have no idea
what fraction of L342T in Poland are Tatar.
And of course Tatars are expected to be a mix of multiple haplogroups.
Three of the L342T samples, with
the name Muchla, are apparently a family set, so they count statistically as
only one sample, reducing the current count from 6 to 4, so SBP as calculated
in that xls file should be increased (not as good). This is evidence against L342T being valid.
M. Needs documentation. M type was brought to my attention by Larry Mayka, who informs me others have called
this haplotype the Viking haplotype because of its concentration in northwest
Europe.
M458. This Haplogroup, defined by the SNP M458, is a major branch subdivision of R1a.
The main tree branches of M458 are P type (L260)
and N type (CTS11962).
Tree: http://www.yfull.com/tree/R-M458/.
See the Results
Table for an overview of M458 and the branching haplogroups that are common
in Poland, per the Polish Project.
M458 was published by Underhill. M458 is common in Eastern Europe and is found throughout Europe
and Western Asia. It has been available
as an SNP test since early November 2009 at FTDNA.
Actually, the structure of the M458
tree is a bit complex, with haplogroups (Yfull, Apr 2016) PF7521, PF6188,
Y2604, and others, but these haplogroups are defined by very few samples. L260 is a branch of Y2604, which with PF6188
and CTS11962 are branches of PF7521, which is the only named branch of
M458. For simplicity, I just say that P
type (L260) and N type (CTS11962) are the two main branches of M458.
In the Polish Project, there are
only 9 samples assigned to M458+ L260- CTS11962-, neither P type nor N type,
and 5 of those belong to the Ry family set, next paragraph.
Ry type: There is a family set (five samples with the
same family name, very close STR match to
each other) where one of them tested M458+ L260- CTS11962-. These five are clearly not P or N, not even close in STRs. These 5 samples are now categorized in the
Polish Project as “Ry type”. These were
independently noticed by Lapinski, an
administrator of the R1a Project, with a separate
category for these in that project, also.
Since 3 of these have been recruited to the family group, Ry counts as
only two samples for statistical purposes in my Results Table.
N. (CTS11962). Rewrite 22 Apr 2016.
N type seems equivalent to the SNP haplogroup CTS11962.
My analysis file: NType.xls. My Definition,
using 46 of the 67 standard
markers, with cutoff 8, is available at Ysearch as 3SEJK.
N type (CTS11962) is concentrated
in Slavic countries. N type is
discussed in my publication,
page 179.
N type and P type (L260) are the
two main branches of M458.
N type has two main branches: L1029 and Np cluster (YP515).
According to Ysearch and Yhrd N type seems to be spread all around the
Slavic lands and central Europe, common from East Germany to Russia. Within Poland (Polish
Project database) N type seems a little smaller
than P type. Worldwide, N is much
larger than P, which is concentrated in Poland. N type should be properly studied in a database that is not
restricted to Poland. However, there
seem to be subtypes of N that are concentrated in Poland. See the discussions on N subtypes, topics
below. I’ll continue to watch the
Polish Project, because it will be interesting if more data provide more Polish
subtypes within N. The R1a Project has more details, including rare samples that do not fall into the main branches.
My publication in 2009 introduced the
names “N type” and “P type” before discovery of L260 (spring 2010) or CTS11962
(early 2013).
N is an STR type, equivalent but not exactly
equal to the SNP haplogroup CTS11962, because of STR outliers
(step near the cutoff).
At 67 markers in the Polish Project, N type has worked quite well since
2008. A sample close to the N cutoff usually fits either N or some other type or
cluster clearly better, so it gets assigned to whichever fits better. If it fits
neither closely, that sample gets a Borderline
assignment.
Using other than 67 standard markers: using 111 markers N type can be fully isolated with STRs. Using 37 markers many samples come out N
Borderline; N type does not work
satisfactorily for 25 or 12 markers.
Yfull (Apr
2016) roughly estimates the formation of the CTS11962 tree branch at 4500 years
ago, with TMRCA at 3300 years ago.
It’s interesting to wonder why M458
seems to be composed of two main types
that differ substantially in STR values (N and P are separated in haplospace). I speculate about this in the P type topic. Much of my P type discussion is also related
to N type, so I avoided repeating all the details here; please read my P type discussion if you are interested in
more speculation about N type.
My Type.xls files have a
macro for automatically selecting the best STR markers for definitions. That file NType.xls
demonstrates that any number of markers from 3 to 67 from the 67 STR set does a
reasonable job of identifying N type samples.
I did some manual editing in 2011 for that improved 46 marker definition,
and that NType.xls file shows that definition is still very good now (Apr
2016). Of course, now that many SNP
tests for branches of N type are available, we are more interested in those SNP
branch test results. There are only 5
samples that have not been SNP tested near the N type cutoff at 67 markers (Apr
2016).
The signature
for N type is (439,Δ389,537,413a,446) = (11,16,11,21,13). Δ389 = 389-2 minus 389-1, the second
STR chain in the pair.
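The signature check above can be sketched in a few lines of Python. This is only an illustration: the dictionary keys and the sample values are made up for the example, and the idea of counting mismatching signature markers is an assumption, not a description of the NType.xls macro.

```python
# Hypothetical sketch: compare a sample with the N type signature
# (439, Δ389, 537, 413a, 446) = (11, 16, 11, 21, 13).
# Marker labels used as dictionary keys are illustrative only.

N_SIGNATURE = {"439": 11, "d389": 16, "537": 11, "413a": 21, "446": 13}


def delta389(dys389_1, dys389_2):
    """Δ389 = 389-2 minus 389-1, the second STR chain in the pair."""
    return dys389_2 - dys389_1


def signature_mismatches(sample):
    """Count the signature markers where the sample differs from N_SIGNATURE."""
    values = {
        "439": sample["439"],
        "d389": delta389(sample["389-1"], sample["389-2"]),
        "537": sample["537"],
        "413a": sample["413a"],
        "446": sample["446"],
    }
    return sum(1 for marker, modal in N_SIGNATURE.items() if values[marker] != modal)


# Made-up sample values, for illustration only (not a Polish Project sample).
example = {"439": 11, "389-1": 13, "389-2": 29, "537": 11, "413a": 21, "446": 13}
print(signature_mismatches(example))
```

A sample reported with 389-1 = 13 and 389-2 = 29 has Δ389 = 16, so the made-up example matches the signature at all five markers.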
Over the years this N type topic
had a very long speculative discussion about the STR structure of N type, and
hints at possible subdivision based on STRs.
All that is moot now with the flood of new SNPs due to Big Y. The near future should provide yet more SNP
subdivision of N type (CTS11962). For
updates, watch:
https://www.yfull.com/tree/R-CTS11962/
Nashk. Analysis file: NashkType.xls. The Definition
with cutoff 3 is available at Ysearch
as 2TZKF. Nashk is a small branch of N
type. Only 5 samples in the Polish Project.
Signature STRs
(19,385a,594) = (15,12,11)
I introduced this type in Jan 2011,
with only 3 samples tightly isolated in STRs, and
with SBP 23%, slightly more than my stated
20% limit for using the word type. Two reasons: First, the Ashkenazi names are independent evidence of a
clade. Second, the N-Ashk modal haplotype differs from the N
modal at 6 markers, which is evidence of a fairly old node in the N branch of
the Y-DNA tree.
Mayka
pointed out to me that the names seem Ashkenazi, per his experience.
I introduced this type as Nca type,
because of what I had been calling the Nc signature, DYS19=15. The “a” meant Ashkenazi, but that was
confusing because the samples do not match what I had been calling the Na marker. (Nc and Na were speculative, no longer
documented here.)
SBP has increased now to 34% (Apr
2016), because additional samples have shown up in the gap;
Nashk is not as well isolated in STRs. So my confidence is now a bit degraded that
this represents a true clade. I’ll change the name to Nashk cluster if
future data continues to increase SBP.
Only 2 of the 5 samples give Poland
as male line origin.
The definition, unchanged since Nov
2011, uses 58 markers, cutoff 3, no samples in the gap at steps 3 and 4. I masked out CDY, because of recLOH.
One of the Nashk samples has tested
L1029+. More important: the STR neighbors at step
greater than the cutoff are all L1029+,
so Nashk is assumed to be a branch of L1029, which
is a branch of N type. No Nashk samples
have been tested for Big Y, or for the known branches of L1029, so we don’t know
the branch of L1029.
Ng type is very small, only 3 samples in the Polish Project,
since 2010. SBP =
20%, marginally qualifying to be called a type, a rather
confident branch of N type. My definition is Ng56, using 56 of the
67 STR Markers, actually 58 because DYS464
e & f are used. Signature
markers DYS492 = 14 and DYS537=10.
Analysis file: NgCluster.xls.
All 3 samples have Big Y
results. All 3 samples belong to the SNP haplogroup YP1136. There is a 4th YP1136+ sample in the R1a
Project, kit N25798, but this one does not fit my Ng56 definition, not even
close, so on that sparse evidence it seems Ng type is a branch of YP1136. YP1136 is a branch of YP593, which is a
branch of L1029.
All 3 give Poland as male line
origin, but this type is too small to use my PCI. They differ from their common 67 STR modal haplotype by steps 4, 7,
and 8, so it is possible this Ng type represents a Polish clade
with a TMRCA of only a few centuries. They name 3 different ancestors from the
15th to 19th centuries.
The Ns cluster is quite small, only
5 samples in the Polish Project. SBP = 22.7%, just missing
the 20% limit needed to qualify as a type, so I’m reasonably
confident it will prove to be a valid branch of N type. My definition is
Ns53, using 53 of the 67 STR Markers. Signature marker
DYS446 = 12. I loaded my definition
into Ysearch, ID
A5NSG. There are actually 55 markers because DYS464
e & f are used, but those are not considered part of the standard 67 set, so Ysearch calls it 53, as do I. Analysis file: NsCluster.xls.
I suppose Ns cluster is a branch of
the SNP haplogroup YP445, because one of the Ns samples has tested YP445+, but
other YP445 samples do not fit the Ns definition, not even close. YP445 is a branch of YP444, which is a
branch of L1029. I
tried to form an STR cluster for YP445 but could not come up with a credible
signature or definition.
Ns is too small to provide evidence
of concentration in Poland.
Np. (YP515). Rewrite finished 15 Apr 2016, using Polish
Project data download 18 Jan 2016. Edit
20 Apr 2016.
The Np Cluster seems to be roughly equivalent to the SNP haplogroup YP515.
The Np Cluster is a subdivision of N type.
N type seems equivalent to CTS11962. So far (Apr 2016), all CTS11962 are coming
out either L1029+ or YP515+. There are few YP515 tests in the Polish Project, so far only 10
YP515+, so most of the Np cluster assignments are samples
that are L1029- and also CTS11962+ or N type based on STR
prediction. There are a few more YP515
results in the R1a Project, where
also so far all CTS11962 are coming out either L1029+ or YP515+. In other words, I am not aware of any
samples that are CTS11962+ L1029- YP515-.
Using 67 STR prediction, Np (YP515+) samples can be
separated from L1029+ with better than 70% confidence
in the Polish Project. I constructed an
STR definition Np35
using 35 of the 67 markers. The cutoff
is 2 (samples at mutation step less than 2 are
considered matches). I uploaded this
definition to Ysearch, code CHFXB. My analysis file is NpCluster.xls. This definition captures 7 of the 8 Polish
Project samples with YP515+ test result and 67 or more STRs. That one exception is due to a single recLOH mutation, mentioned again in the following
paragraphs. This definition also
captures 9 of the 11 N type L1029- samples at 67. So the net capture efficiency is 16 out of 19, or 84%. No false positive L1029+ samples are
captured. Because of selection bias,
future prediction accuracy should be a little less than 84%, and there is
statistical uncertainty, so I estimate better than 70% confidence in the first
sentence of this paragraph. I slightly
modified my 2012 definition in this 2016 analysis, based on recent data; I’ll probably slightly modify CHFXB with
future data.
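The capture efficiency arithmetic in this paragraph can be checked with a short Python sketch. The function name and the grouping into (captured, total) pairs are my own illustration; the counts are the ones quoted above (7 of 8 YP515+ samples, 9 of 11 N type L1029- samples).

```python
def capture_efficiency(groups):
    """Pool (captured, total) pairs and return captured, total, and the ratio."""
    captured = sum(c for c, _ in groups)
    total = sum(t for _, t in groups)
    return captured, total, captured / total


# Counts quoted in the text for the Np35 definition at 67 markers.
groups = [
    (7, 8),   # YP515+ samples captured / tested
    (9, 11),  # N type L1029- samples captured / available
]

captured, total, rate = capture_efficiency(groups)
print(f"{captured} of {total} = {rate:.0%}")  # prints "16 of 19 = 84%"
```

This reproduces only the pooled ratio. The allowance for selection bias and statistical uncertainty that leads to the 70% figure is a judgment call and is not modeled here.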
I call Np a cluster because it does not qualify for a type.
There are too many L1029+ samples at the cutoff value of 2 in my
analysis at 67 markers. In other words,
Np is not isolated in STR haplospace.
That NpCluster.xls file uses only N
type CTS11962 data, 162 samples. Using
all 1919 samples from the Polish Project at 67 markers: no samples from other haplogroups are
captured by that 35 marker Np definition.
At the cutoff step 2, there are only 3 samples from haplogroups outside
N type (one each Z92 Credible, Z280 D type, and Z280+).
At Ysearch, using this definition
CHFXB, there are proportionally more samples at step 2 (Step 2 samples / Steps
0 to 2 samples = 61% Polish Project vs 76% Ysearch). This is enough to be statistically significant, if not fully
convincing. This can be taken as
evidence that YP515 is rarer outside Poland;
this can also be taken as evidence that there might be haplogroups
common outside Poland with STR overlap with YP515. Either way, the Np definition is not quite as good at Ysearch as
it is in the Polish Project.
In the Polish Project 9 of the 10
YP515+ samples give “Poland” as country of ancestry; that last gives Russia.
The Np cluster is highly concentrated in Poland, per my PCI
Index.
The Yfull tree
(14 Apr 2016), has 5 YP515 samples, two of them in YP1182,
thereby defining a haplogroup branch.
The other three are in YP515*, meaning they are YP1182- and have not
provided a common new STR for any pair of these three. That means future Big Y
data at Yfull should define at least 4 total branches of YP515 - a bushy node for the MRCA.
My Results
Table estimates 2.0% for Np (YP515) Polish frequency.
The R1a Project has 4 YP1182+
samples, but only one of those was available in my analysis here. A second of those 4 recently joined the
Polish Project after my data download.
I tried but failed to develop a credible STR definition for YP1182 by
adding the R1a Project YP1182+ samples to a copy of the Polish Project
database. I see no obvious signature markers that would distinguish
YP1182 from the other Np samples.
Comment 18 Apr 2016 Milewski responds that there are now 5 YP1182 in the
R1a Project, and a good signature is DYS439>11, DYS481>25 and DYS710>33.
There are 3 good signature markers for Np (and for
YP515); all 3 are available in the 37 STR
set:
DYS460 = 10 is best. In the Polish Project 9 of the 10 YP515+
samples have this value. Among the samples
N type (CTS11962+ or predicted), with no YP515 test, assigned to Np cluster
based on an L1029- result, all 14 have this 460=10 value. However, 8 of the 74 N type L1029+ samples
also have this 460=10 value, and of course it is common in other haplogroups. In the Polish Project the YP515 test is
encouraged for N type samples with 460=10.
CDYa = 33 is another good
signature, present in 8 of the 10 YP515+ and in 9 of the 14 L1029-. One of those YP515+ exceptions has the
homogeneous pair values CDYa,b = 39,39, along with DYS459 and DYS464
homogeneity, an obvious recLOH mutation, mentioned above.
The third good signature is the
value 13 included in the DYS464 set, present in 7 of the 10 YP515+ and in 10 of
the 14 L1029-. Again that recLOH is one
of the exceptions. This signature
cannot be used at Ysearch, where all the DYS464 markers are used together. My Np35 definition excludes the 464 set.
Using all 3 signature markers
together, 5 of the 10 YP515+ samples are captured at step 0, 4 of them are
captured at step 1, and only that one recLOH sample is missed at step 2, which
is the cutoff that we use. For the 14
L1029- samples, 9 are captured at step 0, 1 at step 1, and 4 are missed at step
2. That’s 17 out of 22 = 77% capture
efficiency, with the remaining 5 just missed at the cutoff 2. Considering selection bias and statistics,
expected predictions might come out slightly below 70%. Among the 71 L1029+ samples, 1 is captured
at step 1 and 19 at step 2; that’s the
reason for using 2 as the cutoff - high false positives at step 2.
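As a rough illustration of how a 3 marker signature step could be tallied, here is a minimal Python sketch. It assumes that the step is simply the number of signature markers a sample fails to match, which is my reading of the counts above and not something the analysis file is confirmed to do. The field names and the sample values are hypothetical.

```python
def np_signature_step(sample):
    """Count how many of the 3 Np signature markers the sample does NOT match.

    Assumption: one step per missed marker, regardless of the size of the
    difference in repeat count.
    """
    step = 0
    if sample["DYS460"] != 10:
        step += 1
    if sample["CDYa"] != 33:
        step += 1
    if 13 not in sample["DYS464"]:  # the value 13 anywhere in the DYS464 set
        step += 1
    return step


def predicted_np(sample, cutoff=2):
    """Samples at a step below the cutoff are predicted Np (YP515)."""
    return np_signature_step(sample) < cutoff


# Made-up samples, for illustration only.
full_match = {"DYS460": 10, "CDYa": 33, "DYS464": [12, 13, 15, 16]}
one_miss = {"DYS460": 10, "CDYa": 34, "DYS464": [12, 13, 15, 16]}
two_miss = {"DYS460": 10, "CDYa": 39, "DYS464": [12, 15, 15, 16]}

for s in (full_match, one_miss, two_miss):
    print(np_signature_step(s), predicted_np(s))
```

The third made-up sample lands at step 2, the cutoff, so it is not captured. Like the rule in the text, this sketch is only meaningful when applied to samples already known or predicted to be N type (CTS11962+).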
Conclusion: that 3 STR marker signature with cutoff 2
does a fairly good job of predicting YP515 samples among the N type (CTS11962+)
samples. However, this signature with
step 2 does not work in the full Polish Project because other haplogroups are
captured, even at step 0. Within R1a, a
few D type samples are captured at steps 0 and 1. Actually, that’s pretty good considering only the 37 STR set is
needed.
History of Np: I introduced the Np cluster at this web page
18 Jul 2012, as a hypothetical clade, equivalent to N type with L1029-. “Np cluster” was used as an assignment at
the Polish Project since Spring of 2012.
Mayka introduced the “p” because it was already
obvious that most Np Polish Project samples come from Poland. In Oct 2012, I included Npa, Npb, and other
tentative divisions of Np in my Excel analysis files, but these are not used as
assignments in the Polish Project.
Exception: the Nmsv
cluster has been used since 2012 as a division of Np to accommodate STR outliers with recLOH, mentioned above. I suggested “msv” because Lapinski had named this recLOH cluster “Masovian” in the
R1a Project. L1029 SNP test results
have been available since March 2012, but YP515 results have only been
available since Sept 2014; L1029 has
been tested much more frequently than YP515, so many samples are assumed YP515
based on N type with an L1029- result.
P. (L260). Rewrite 14 Oct 2015. Edit 12 Jun 2016.
P type is equivalent to the haplogroup defined by the SNP L260.
There are very few STR outliers - L260+ samples
that do not match P type, or P type samples testing L260-.
P type and N
type (CTS11962) are the main branches of M458.
L260 has 21 known branches listed
at Yfull (12 Jun 2016); for update see http://www.yfull.com/tree/R-L260/. The main branches of L260 in the Polish Project are YP414, Y2905,
Y4135, and YP1337. I tried (Oct 2015)
to construct STR types or clusters
for a branch of P type, using SNP data to identify samples, but I did not come
up with any significant signatures or
definitions. Perhaps as more data accumulates it may be
possible, but for now SNP testing seems needed for assignment of samples to
branches of P type (L260).
P type is a major topic in my publication, Part II. P type is significantly concentrated in
Poland, and in the Czech Republic. It
is found at lower frequency in other Eastern European countries, and in eastern
Germany. Roughly 10% of Polish males
seem to carry P type Y-DNA. L260 was
discovered shortly after my publication, found to be equivalent to P type,
confirming my prediction that P type corresponds to a haplogroup. The L260 SNP test has been available at FTDNA since April 2010.
I published an
announcement about L260 in the Fall 2010 issue of JOGG.
My current definition for P type, P50, is a modal haplotype using 50 of the 67 standard STR markers. The cutoff
is 9, which means all samples less than step
(genetic distance) 9 are predicted P type (predicted L260+). That definition is available in the PType.xls
analysis file and at Ysearch as 8U92G. That file and definition are from an
analysis done in Feb 2014 using Polish
Project data downloaded 20 Jan 2014.
For an update, please see my file PType2015Oct.xls.
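The cutoff rule for a definition can be sketched in Python as follows. This is a toy under stated assumptions: the step is taken as the sum of absolute repeat-count differences over the markers kept in the definition, which may differ from how the xls files treat multi-copy markers, and the 3 marker modal with cutoff 2 is a stand-in for illustration, not the real P50 definition with cutoff 9.

```python
# Toy definition: a modal haplotype restricted to the markers kept in the
# definition. The three values are the P type signature values quoted in this
# document; the real P50 definition uses 50 markers.
TOY_MODAL = {"385a": 10, "481": 25, "572": 12}


def step(sample, modal):
    """Sum of absolute repeat-count differences over the definition's markers.

    Markers that are not in the modal are treated as masked out and ignored.
    """
    return sum(abs(sample[marker] - value) for marker, value in modal.items())


def predicted(sample, modal, cutoff):
    """A sample is a predicted member when its step is below the cutoff."""
    return step(sample, modal) < cutoff


# Made-up samples, for illustration only.
near = {"385a": 10, "481": 25, "572": 12, "439": 10}
far = {"385a": 11, "481": 23, "572": 11, "439": 11}

print(step(near, TOY_MODAL), predicted(near, TOY_MODAL, cutoff=2))
print(step(far, TOY_MODAL), predicted(far, TOY_MODAL, cutoff=2))
```

Marker 439 is present in both samples but absent from the toy modal, so it does not contribute to the step. That is the effect of masking a marker out of a definition.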
In that update, there are predicted
176 P type samples, with only one outlier
at step 8 (the highest step of the type);
that outlier tested CTS11962+, so it cannot be L260+. In addition, there are seven L260+ outliers
that do not fit the type, at steps 9 through 13. So that’s 8 outliers out of 176 predicted = 4.5% rate - not bad
for STR based prediction.
As data accumulates, my statistical
definitions of types often change by a few STR markers. The P type definition has changed very
little in the past few years. I could
not significantly improve on that P50 definition in Oct 2015, so I left it
unchanged. That file PType.xls has
columns of various trial definitions for comparison, with indication of which
STR markers are included in each. That
file PType2015Oct.xls has only one column of data for P50, with blank columns
where you can try other definitions.
Those files have a sheet
“Haplotypes & Masks” with previous P type definitions going back to 2007.
Pawlowski originally noticed what I now
call P type. My publication has more about this
history. The STR isolation of P type in
the Polish Project is now even more impressive than at the time of my
publication due to the accumulation of more data.
Ludvik Urban pointed out to me that
P type is common in the Czech Y-DNA
Database. FTDNA also has a Czech Y-DNA
Project. Karen Melis, administrator
of the FTDNA Zamagurie
Project, pointed out to me that P type is common in her data from the Zamagurie region, which is on
the border of Slovakia with Poland. I’m
not sure of the concentration in Slovakia.
I added a “Ysearch” sheet to that
PType.xls analysis file, with analysis from Ysearch. The Western Slavic Modal haplotype, Ysearch 28WGP,
matches P type perfectly at all 50 markers used in my definition. That Western Slavic Modal uses 76 markers,
but many of those are highly variable due to high mutation rate. That modal is one of the Russian site modals.
Age
of P type: The Yfull tree for L260 (12 Jun 2016)
estimates 4500 ybp as the formation date, and 2500 ybp as the TMRCA.
In this case, the formation date is the node
where the main M458 tree splits forming
the two branches L260 (P type) and CTS11962
(N type). The TMRCA is the node where the L260 haplogroup splits into known
branches. Subtraction gives 2000 years
for the length (in time) of a smooth
branch segment that includes L260 and the several phyloequivalent SNPs,
spread out in time over that segment.
Estimation of such ages is uncertain due to a number of caveats, and subtraction compounds the
uncertainty, but this is a long estimated time with no known branches. This explains why P type is so well isolated; there was plenty of time for both SNP and STR mutations, shared
by all P type samples, providing a significant STR signature. For general discussion of this point, see population bottlenecks.
In my 2009 publication I put the TMRCA of P
type as 2,000 to 3,000 years ago. That
estimate still stands.
Why is P type so common in
Poland? One obvious explanation is a
rapid population expansion in the region that is now Poland. In my 2009 publication I speculated about
such an expansion perhaps less than 1,600 ybp.
A rapid population expansion should provide a bushy
tree. The current Yfull tree has 9
branches with formation 1750 ybp, plus 2 branches with formation 1550 ybp, plus
4 branches with formation 2000 ybp.
These 15 branches together are roughly equivalent to my previous guess -
a bit older as a whole. These 15
branches are roughly equivalent to historical estimates of formation of the
Slavic tribes that ultimately formed the Polish nation. Four branches of L260 (still consulting the
Yfull tree, 12 Jun 2016) have formation > 2,000 ybp; taking these at face value
is evidence that the population expansion started slow and accelerated.
However, mutation analysis does not
yet provide a definite explanation for why P type is concentrated in
Poland. I suppose the simplest
explanation is statistical luck. The ancestral
male line leading to P type and L260, that smooth
branch segment with a span of 2,000 years with no known branch nodes, is
evidence that the line came very close to extinction. The MRCA - the very
lucky sole male survivor of that male line to leave male descendants known to us
today - may have just happened to have been living roughly 2500 ybp in the
region that is now Poland.
In my publication I speculated
about a migration, but I see no specific evidence yet. Evidence would be multiple new small SNP
branches with formation roughly 2500 ybp, and with nodes close to the formation
node for L260, and all from a unique region other than Poland.
P type Signature: DYS385a=10 is the best single STR marker for predicting P
type. In that update file PType2015Oct.xls: There are 182 samples (from all of R1a at 67 markers)
with this 385a=10 value; 175 of them
are P type, including 6 of the 7 outliers missed by that P50 (above)
definition. Only 7 samples apparently
not P type have this 385a=10 value.
Only 6 P type samples plus one of the L260+ outliers have 385a not =
10. That one marker alone does almost
as well at predicting L260 as the definition P50.
Two other good markers for P type
are available in the 67 set: 481=25 and
572=12.
R1a. See R1a Abstract.
Z280. This Haplogroup, defined by the SNP Z280, is a major branch subdivision of R1a.
In the Polish
Project, Z280 branches into CTS1211 (about 3/4 of Z280), Z92 (about 1/4 of Z280) and S24902 (less
than 3% of Z280). See the Results Table.
Worldwide, Z280 is more complex,
with lots of branches. Tree: http://www.yfull.com/tree/R-Z280/.
Z92. This Haplogroup, defined by the SNP Z92, is a branch subdivision of Z280, with roughly 1/6th of Z280 in the Polish Project.
E type is a
branch of Z92. E type was confidently
identified as a hypothetical clade in early 2010, well
before Z92 was discovered, and now (June 2016) seems equivalent to the SNP YP569.
Z92y type
was suggested to me by Mayka on 21 Dec 2010. Z92u cluster
and Z92t cluster were proposed in 2012. These have not been confirmed by unique SNPs
as of June 2016. It is now (June 2016)
clear they are not major clades, although one or more of these 3 might end up
defining a very small Polish haplogroup (less than 1/2% of the Polish Project)
in the future, when the appropriate SNPs are found. Z92y originally qualified as a well isolated type
at 67 markers, but as more data accumulated
it became marginal (not high confidence as a valid clade at 67).
At 111
STR markers, I constructed for Z92y a very good type, SBP
8.9%, with 6 samples well isolated from the rest of Z92 (37 samples from Z92
have 111 markers in the Polish Project, Jan 2016). Two of those 6 have CTS4648+ result. Only one of those two has Big Y, with YP4479+ result. The best signature marker is DYS513 = 11,
which alone segregates these 6 from the other Z92 samples, although other
samples in R1a have this 513=11 value.
The signature (DYS513, DYS452) = (11,31) does segregate those 6 samples
in R1a. (This signature also comes up
in haplogroups I and J.) This signature
is not available at 67 or fewer markers.
See the Results
Table for the SNP branches of Z92 in the Polish Project. Since Z92 is not concentrated in Poland, not
all branches of Z92 are represented;
consult the R1a Project for world wide
structure of Z92.
Z92y, Z92u, and Z92t were
originally named Ky, Ku, and Kt, because the letter K was used for clusters and
types with definitions that were not distinguishable
from the R1a modal haplotype. Since then, many samples from these 3 have
been tested Z92+. Other K clusters
(such as Kx, Kz) have come out Z92- CTS1211+.
My 2011 analysis of Z92y is still
available as KyType.xls,
Ysearch BBB9T.
Z93. See A type
End
of R1a Branches.
Description of the R1b
Branches
On 20 July 2010 I added the
following three R1b Types to this web
document (next three subtopics, L23EE, L47P, L47A).
Mayka
had already added these three to the Polish
Project web page during the previous week, on my recommendation, which was based on my SBP analysis.
I independently found these three
by analyzing the Polish Project R1b data, but Mayka pointed out they were
previously known as clusters. We judge that my analysis justifies adding
them to our list of types. Since I’m
using 639 samples with 67 marker data as
representative of Poland, a small type clade at 1% of the Polish population
would be expected to have roughly 6 samples in the database (70% confidence
interval 4 to 10). These three small
types are roughly 1% each.
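The arithmetic behind that expected count can be sketched as follows. This is only an illustration under a plain binomial model, which is an assumption; the 70% interval quoted above may rest on a different method, so the probability printed here need not agree with it. The function names are hypothetical.

```python
from math import comb


def expected_count(n_samples, frequency):
    """Expected number of samples for a clade at the given population frequency."""
    return n_samples * frequency


def prob_count_between(n_samples, frequency, low, high):
    """Binomial probability that the observed count falls in [low, high]."""
    return sum(
        comb(n_samples, k) * frequency**k * (1 - frequency) ** (n_samples - k)
        for k in range(low, high + 1)
    )


n = 639   # samples with 67 marker data, from the text
p = 0.01  # a clade at 1% of the population

print(round(expected_count(n, p), 2))   # 6.39, i.e. roughly 6 samples
print(prob_count_between(n, p, 4, 10))  # chance of seeing 4 to 10 samples (binomial assumption)
```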
I’m following the current ISOGG codes for these types, which may be
confusing.
The STR definitions for these are
available at Haplotypes.xls. PolishCladesUpdate has a
link to an Excel analysis file for each of these three types.
Reminder: These three types are calibrated to Polish data. The definition modal haplotypes may not be
optimal for other regions. If you have
Polish ancestors, and if you have all 67 markers, and if you
match one of these within a step distance of 10 there is more than 80%
probability that you belong to the corresponding clade. Up to step 15 there is lower probability
that you belong. You should test the
appropriate SNPs (explained below) for higher confidence. If your ancestors are not from Eastern
Europe and you are a marginal match (step distance 5 to 15) for one of these,
it is not very probable that you belong to the corresponding Polish clade, because
each of these types has some overlap with other clades that are rare in Poland.
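For readers who want to check a match by hand, here is a minimal sketch of a step distance calculation. It assumes that the step distance is the sum of the absolute differences in repeat counts over the markers in the definition; that is a common convention, not something confirmed here. The marker values are made up for illustration, and treating step 10 and step 15 as inclusive boundaries is also a guess.

```python
def step_distance(sample, modal):
    """Sum of absolute repeat-count differences over the markers in `modal`.

    Every marker in `modal` must also be present in `sample`.
    """
    return sum(abs(sample[marker] - value) for marker, value in modal.items())


def classify(distance, close=10, marginal=15):
    """Rough reading of the reminder above: close match, marginal match, or neither."""
    if distance <= close:
        return "probable member"
    if distance <= marginal:
        return "marginal match"
    return "no match"


# Hypothetical values, not a real type definition.
modal = {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": 11}
sample = {"DYS393": 13, "DYS390": 23, "DYS19": 14, "DYS391": 10}

d = step_distance(sample, modal)
print(d, classify(d))  # 2 probable member
```

A real definition would use the 67 marker modal haplotype from Haplotypes.xls rather than these four placeholder markers.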
L23EE. 20 Jul 2010 documentation: This
type is positive for the L23 SNP, hence
this type is a hypothetical future haplogroup within the current haplogroup
R1b1b2a. This type is negative for L51,
the only current known branch - R1b1b2a1 - of L23.
Nordtvedt pointed out the cluster for
this type some
years ago, calling it R1b-EE (Eastern Europe). Mayka suggested the L23EE
code to me.
There are only 6 samples in the Polish Project in this type (13 Jul 2010). SBP = 10.7% using all 67 markers, which is excellent for such a small type. The cutoff is 12, but if you match at step 10 through 12 I estimate your probability of belonging at slightly better than 80%, so you really should test for the L51 SNP - a negative result would boost the probability to about 95%. In the Polish Project, there is a gap of 5 - no samples from steps 12 through 16 and all 6 of the samples from step 17 to 20 are L51+. So this type is very well isolated in haplospace in Poland.
On Ysearch (code CX94E) there are also 6 samples in this type (13 Jul 2010), but 3 are the same as in the Polish Project. There are 7 samples at step 12 (vs zero in the Polish Project) and only 2 of those 7 are East European - one each in Germany and Russia. That means this type is not well isolated world wide, meaning samples near the cutoff are highly uncertain. I interpret this as evidence that my definition of L23EE type is really a Polish subtype within a larger L23EE cluster.
This type has evidence of structure. A number of markers are bimodal with no obvious correlation. To me, that means there are probably at least 3 sub-clades that may become evident as data accumulates.
If you
match this type closely at 37 markers I highly recommend getting the full 67, because the
statistics for assignment are not convincing at 37 markers. Even at 67 markers, I recommend the L51 test; a negative result confirms membership in
this hypothetical clade, and a positive result means you are not a member. We do not know the probability of outsiders
matching L23EE in STR values, particularly outside Poland, so there is still a
slim chance of a surprise - a close match to the definition but with L51+.
L47P. 20 Jul 2010 documentation: This
type is positive for the L47 SNP, hence
this type is a hypothetical future haplogroup within the current haplogroup
R1b1b2a1a1d1. This type is probably
negative for L44, the only current known branch - R1b1b2a1a1d1a - of L47, but
that L44 negative indication is based on only one sample so far, so it is not
certain.
Mayka announced the cluster
corresponding to this type on the web in
March 2009.
There are only 4 samples in the Polish Project in this type (13 Jul 2010). SBP = 9.3% using 64 markers, which is excellent for such a small type. The cutoff is 7 and the gap is 10. There are no samples from step 7 to 16. Although samples in that wide gap are expected as data accumulates, this type is very well isolated in haplospace in Polish data.
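The relation between cutoff and gap in that paragraph can be sketched as below. The convention assumed here is that samples below the cutoff are members and that the gap counts the consecutive empty steps starting at the cutoff. The list of step distances is invented to mirror the numbers above (4 members, nothing from step 7 to 16); it is not the Polish Project data.

```python
def members(distances, cutoff):
    """Samples with step distance below the cutoff."""
    return [d for d in distances if d < cutoff]


def gap_after_cutoff(distances, cutoff):
    """Number of consecutive empty steps starting at `cutoff`.

    Returns None when no sample lies at or beyond the cutoff.
    """
    beyond = [d for d in distances if d >= cutoff]
    if not beyond:
        return None
    return min(beyond) - cutoff


# Hypothetical step distances from the type definition.
distances = [0, 2, 3, 5, 17, 18, 21]

print(len(members(distances, cutoff=7)))      # 4
print(gap_after_cutoff(distances, cutoff=7))  # 10, meaning steps 7 through 16 are empty
```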
This type is very robust; the same 4 samples are selected using any number of markers from 10 to 67 with SBP <25%.
Actually, this type is even better than the SBP = 9.3% indicates, because some of the samples at step 17 and beyond have tested negative for the SNPs in the R1b trunk leading to L47 so they clearly do not belong to this L47P hypothetical clade.
Ysearch (code MKM4R) also has 4 samples (13 Jul 2010), but 3 of them are the same as the Polish Project. Ysearch has 8 samples at steps 8 to 12, so the type is not as well isolated worldwide.
The “P” in the code L47P represents
my hypothesis that this type is Polish.
Members of this type should test
for L47, because Ysearch does have one STR matching sample listed as R1b1b2a1b,
which is equivalent to P312, an “uncle” haplogroup, that is L47 negative. That means there may be some interference in
STR matching, probably less than 10% in Polish data, but I do not know what the
exact percent interference will be until more data accumulates.
See the last paragraph of L47A,
next topic, for more comments.
L47A. 20 Jul 2010 documentation: This
type is positive for the L47 SNP, hence
this type is another hypothetical future haplogroup within the current
haplogroup R1b1b2a1a1d1. I do not know
yet if this type is negative for L44, a known branch of L47.
Mayka
suggested the “A” code, since this type is obviously Ashkenazi, based on family
names (see also Ysearch results, a few paragraphs down). I presume this one is known to the
administrators of Jewish DNA projects, although I did not do the research to
find a first web publication at 67 markers;
I would appreciate an email of a reference to add here, even if it does
not exactly match my definition. It’s
OK if an international modal haplotype differs by a few markers from a
haplotype determined in Poland, particularly if the difference is at markers
that are bimodal, indicating subtype structure.
There are only 5 samples in the Polish Project in this type (13 Jul 2010). SBP = 7.6% using all 67 markers, which is excellent for such a small type. The cutoff is 10 and the gap is 9. There are no samples from steps 9 to 18. Although samples in that wide gap are expected as data accumulates, this type is very well isolated in haplospace.
This type is very robust; the same 4 samples are selected using any number of markers from 30 to 67 with SBP <10%.
This type is better yet on Ysearch
(code 7HB9C), with 18 samples
(13 Jul 2010) for better
statistics; SBP = 4.6%, which is
remarkable. It might be even better
with an optimized definition; I used
the modal haplotype that I extracted from the 4 Polish Project samples.
This one does not seem as Polish as L47P, although those 18 Ysearch samples are concentrated in "Greater Poland" including Lithuania.
So far (see ISOGG), L47 and L148 are the only two known branch haplogroups of L48. In the Polish Project so far (20 July), no one has tested yet for L148, and all L48 so far at 67 markers are either L47P (previous topic) or L47A. SNP data is not posted on the web, so I do not know the frequency (prediction probability) of L48 samples that match neither L47P nor L47A and so belong to yet other clades. I also have not searched the web for the STR values expected for L148. (There are two samples at 37 markers listed in the Polish Project with L48+, listed as R1b1b2a1a4 by FTDNA, but this is not enough for statistical estimation.) All this will quickly become visible when FTDNA updates their haplotree. As of 20 Jul 2010, L48 is a terminal branch at FTDNA, so only administrators have visibility of SNP test results beyond L48, including L47 and L148. Mayka provided the SNP data that I have documented here.
Description of the I
Haplogroup Branches
Update 25 Mar 2012. Edit 1 Apr 2016.
At the end of July 2010 I added two
types from the I haplogroup to this web
document. I independently found these
two by analyzing the Polish Project I data.
Mayka informed me that they were
previously known as clusters,
hypothetical clades, discussed some time
previously by Nordtvedt. Mayka added these two to the Polish Project web page in July
2010, on my recommendation,
which was based on my SBP analysis. One is a branch of what has previously been
called I2-CE, and seems to represent a Polish collection of M223 branches, so we
named it M223CE type, discussed in the next topic. The other seems to be a Polish branch of I1-M253, so we named it
M253P type, discussed in a topic below.
I am now also using the short code names I-CE and I1a-P for these. I am now splitting I-CE into I-C, I-D, and
I-E, topics below.
My STR
definitions for these are available
at Haplotypes.xls,
in the Excel analysis files discussed below, and at Ysearch.
These types are calibrated to
Polish Project data. The I1-P definition
WC8JD
forms a type in the Ysearch database, so it seems to be reasonably valid world
wide. The I-C definition SB6YK,
and the I-E definition QUXE3,
are probably not valid at Ysearch for a sample with origin remote from Historical Poland, because of interference
by other clades with similar STR values, particularly from Russia.
Input new topic 19 Aug 2015 by Paul Stone.
I1, defined by the SNP M253, is unique in that it has the signature STR
value DYS455 = 8, which is present in approximately 99% of all I1 samples. It is also useful that DYS455 = 8 is nearly
non-existent outside of I1. I1 is
unusual for the very large number of equivalent
SNPs. See http://www.yfull.com/tree/I1/ for the
list of about 300 SNPs that are equivalent to M253. Surely I1 must have had many male line branches in the past along
that long smooth branch segment,
but apparently they went extinct; it is
possible a branch (a new node along this
smooth segment) will be discovered in the future. Perhaps there was a population bottleneck. This is supported by the calculations done
by Yfull, with the result listed in that tree reference: A formation date of I1 at 27,500 ybp but
with a TMRCA of only 4,700 ybp, due to the large number of equivalent SNPs.
In the Polish Project, I1 is almost
all I1a (DF29), which comprises about
6.4% (high confidence range 4% - 9%) of the Polish male population based on
the statistically adjusted Polish Project data (see Results
Table for update). The website http://www.ydna.eu/ lists the I1 population of
Poland at 8.5% based on a sample size greater than 1,000.
I1a-P Type seems
equivalent to Y6349. (Old name
M253P.)
Rewrite 1 Apr 2016.
Input rewrite 19 Aug 2015 by Paul Stone.
In the Results
Table (early 2016), this haplogroup
is estimated at 1.4% of the Polish population.
Although this is low, my PCI Table
(2014) ranks I1a-P type third in Poland, because it seems very highly
concentrated in Poland, and my PCI is based on a combination of frequency and
concentration.
Yfull
lists 17 SNPs as phyloequivalent to
Y6349: Y6340 to Y6355 and Y6373. Yfull uses Y6340 as the representative name
of this haplogroup. Only Y6349 is available
for testing at FTDNA. At Yseq,
Y6350 and Y6354 are available. See SNP ordering information. Actually, Big
Y is a better test (if cost is not an issue), because more Big Y tests will
further divide Y6349.
Sequence for locating Y6349/Y6340
at the Yfull tree (Mar 2015): IJK> I1> DF29> Z2336> Y3866>
S4767> S7642> Y6340.
Age rough estimate for Y6349 at
Yfull, based on the large number of equivalent SNPs: Formation 3,300 years before present; TMRCA 1,550 years before
present.
Marek Skarbek Kozietulski has
called this I1-Vistula, because most of the samples are from the basin of the
Vistula river.
History: In the past, I have called this M253 P type, I P type, and I1 P
type. Now that this type is clearly a
subdivision of I1a, the name I1a P type seems best, or short code I1a-P. On 26 July 2010, I added this I-P Polish type for the I haplogroup to this web
page. This type had already been known
as a cluster for a few years. Mayka
pointed out to me that Nordtvedt
listed it on the web as I1*-P1, with related clusters I1*-P2 and AS4. Marek Skarbek Kozietulski has studied this
cluster quite a bit, since he’s a member.
I mentioned this type briefly in my publication, where I was
previously calling it Y type, considering it not high confidence based on the data available
then in 2009.
My analysis file for the STR type
at 67 markers is I-PType.xls. That file was generated in 2012 with only 11
samples. My definition for I1a P type, from 2010,
uses 54 markers, cutoff 4, gap 5, no samples in the gap from steps 4
through 8 in the Polish Project
at 67 markers. SBP came out 5.0%. Marek informs me that he had identified 4 men who matched at 12
markers and actively recruited them to obtain all 67 markers and to join the
Polish Project. That means only 7 of
these 11 samples should be used for statistical purposes. (That 1.4% frequency in the Results Table
was calculated excluding any known recruited samples.) SBP calculated on the basis of 7 samples is
8.7%. This low SBP along with that
large gap of 5 was compelling evidence (in 2012 before Y6349 was discovered)
that I1a-P is a clade that is isolated in haplospace. I used all 11 samples in my analysis file in
order to best estimate the definition, which is also available at Ysearch as WC8JD.
Recent Polish Project data (early
2016) has 21 samples in Y6349 / I1a P type;
19 have Poland for male line ancestor, one Hungary, one Unknown,
demonstrating that this clade is highly concentrated in Poland.
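The share behind that statement is simple to reproduce. The sketch below uses only the counts given above (19 Poland, one Hungary, one Unknown). Whether the Unknown sample should be counted in the denominator is a judgment call, so both versions are shown; the function name is hypothetical.

```python
from collections import Counter


def origin_share(origins, country, drop_unknown=False):
    """Fraction of samples whose male line origin is `country`."""
    if drop_unknown:
        origins = [o for o in origins if o != "Unknown"]
    return Counter(origins)[country] / len(origins)


origins = ["Poland"] * 19 + ["Hungary", "Unknown"]

print(f"{origin_share(origins, 'Poland'):.1%}")                     # 90.5%
print(f"{origin_share(origins, 'Poland', drop_unknown=True):.1%}")  # 95.0%
```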
Four recent M253 P (I1a-P type)
samples in the Polish Project have Big Y results with Y6349+, and a 5th sample
has the SNP test result Y6349+.
A good signature for I1a-P is (391, 392, 447) =
(11, 12, 24), although this signature alone is not foolproof for distinguishing
I1a-P from all other I haplogroup samples.
My definition gives better than 80% confidence of assignment to I1a-P
for samples below step 8, and better than
90% confidence for samples below step 4.
Here is some interesting
speculation for which I do not have convincing statistical evidence: Marek points out that a sample at step 4 on
Ysearch is Danish, which adds to his evidence that there might be a related
clade in Denmark, perhaps with a node in the I1 tree slightly older than the
node for the I1a-P Polish clade. I do
not know where that Danish sample falls in Nordtvedt’s tree. I do not know where that Danish sample falls
in the recent flood of new SNPs.
I-CE. (M223). Update 25 Mar 2012. ISOGG code is now I2a2a; last year’s code for M223 was I2b1, still
being used at FTDNA and the Polish Project.
All the I-CE samples in the Polish
Project fall into one of the 3 branches discussed in the following topics.
The M223 clade is very well
isolated in STR haplospace. FTDNA is able to predict I2b1(M223) with
high confidence using only the first 12 standard markers, for more
than 90% of the samples. Using 67
markers, I found that any reasonable definition does a good job of extracting
M223 samples from Y-DNA STR data. A
good definition is available on Ysearch, code 4H6C9,
using 62 of the 67 standard markers plus 8 additional markers (Mar 2012).
STR isolation in the Polish Project
is generally evidence of a single Polish clade. It is possible that two or more clades with distant nodes in the
Y-DNA tree might have similar STR values by coincidence. In the case of Polish I-CE, since the larger
I-CE world-wide clade is well isolated, my Polish I-CE type might well be a
collection of multiple clades, perhaps including some clades that are not
particularly concentrated in Poland. My
original M223CE type used 4 of the 8 I-CE samples back in 2010. There are now 12 I-CE samples, and they form
two types plus one cluster. It may seem
silly to split these into 3 branches, but there are new SNPs, discussed below, that justify the split
as valid haplogroups. These small types
are interesting because they are preliminary evidence of small Polish clades.
CE stands for Continental Europe,
including Britain.
The M223-Y-Clan project
has lots of data; I used this project
data for reference.
A good signature is (392, 437, 450) = (12, 14,
9), which distinguishes almost all M223 samples from others, allowing one
mutation step. (594) = (11) is also an excellent signature for M223, with the
value 10 dominant outside M223, but this one is strange in the Polish Project,
where 4 of the 12 samples have value >11;
this is evidence that I-C might comprise two clades.
At Ysearch, the percent Polish samples for
I-M223 is low. The following 3 STR
definitions, my proposed Polish branches, capture a small fraction of M223 at
Ysearch.
My Excel file I-CE.xls has
analysis of this type and also analysis of the following three branches. That file has ASD analysis, but ASD age is
very misleading when calculated from samples that are a collection from
multiple large old clades. The three
branches have too few samples to attempt age estimates.
I-C. (M223+ P78-). (I-C Type Branch). New
topic 25 Mar 2012. I-C type is a
hypothetical subdivision of I-CE (M223).
I-C type includes all 4 samples
assigned to I-CE last year, plus one that was missed last year, plus 3 new
ones, for 8 total at 67 markers in the Polish Project. SBP has improved from 19% to 2.6% over the
past year, so this is a clade with high confidence due to the excellent
isolation, although there is a chance it may be two or more independent clades
as discussed above.
My Excel file I-CE.xls has
analysis of this type in column CJ, SBP=2.6%.
My definition uses 67
markers, cutoff 20, gap 14. There are
no Polish Project samples in the gap from step 20 through 33, so this type is
very well isolated. This definition
also isolates I-E type, 4 samples, steps 34 to 42, but there is a better
definition for I-E, see the next topic.
There are no Polish Project samples
at step 43 or 44. There is only one
I2b2 sample (not M223) at step 45. Then
there are no further samples at steps 46 through 52. So this I-C definition also captures all of the broader I-CE (M223), although surely a
better I-CE world wide definition could be constructed.
A good signature is (406, 487) = (10, 12),
which itself distinguishes the 8 I-C samples in the Polish Project.
Two of the I-C samples are I-D
samples, discussed below. Two other I-C
samples have the same family name, very close in STR values. The remaining 4 samples in I-C are not
particularly close to each other in STR values. The SNP data for each sample is included in column BX of the
“Calculator sheet”; 4 of the samples
tested negative for all 4 known haplogroup branches of I-M223. So I-C seems to capture M223* plus P95
(below) in the Polish Project.
My definition is also available at
Ysearch, SB6YK. On Ysearch there are plenty of samples from
step 20 through 33, so this definition does not work world-wide. The closest fits are not concentrated in
Poland, so if I-C truly represents a Polish clade(s) my STR definition will not
find members with confidence far from the region of Historical Poland.
I-E. (M223+ P78+). (I-E Type Branch). New
topic 25 Mar 2012. ISOGG now I2a2a3; last year’s code for P78 was I2b1c,
still being used at FTDNA and the Polish Project.
My Excel file I-CE.xls has
analysis of this type in column CM, SBP=13%.
My definition uses 67
markers, cutoff 19, gap 7. There are no
Polish Project samples in the gap from step 19 through 25, so this type is very
well isolated. All the I-C samples, and only those, are
at steps 26 to 44, so this definition also nicely separates I-C from I-E in
the Polish Project.
A good signature is (393, 459a, 446) = (15, 9,
10), allowing one mutation step, which distinguishes the four P78 samples in
the Polish Project.
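A signature test of this kind can be sketched as follows. It reads "allowing one mutation step" as a total deviation of at most one repeat across the signature markers, which is an assumption; the text does not spell out the rule. The DYS prefixes and the sample values are illustrative only.

```python
def matches_signature(sample, signature, allowed_steps=0):
    """True if `sample` is within `allowed_steps` total repeats of the signature.

    A sample that lacks any signature marker cannot be checked and is rejected.
    """
    total = 0
    for marker, value in signature.items():
        if marker not in sample:
            return False
        total += abs(sample[marker] - value)
    return total <= allowed_steps


# Signature (393, 459a, 446) = (15, 9, 10) from the text.
I_E_SIGNATURE = {"DYS393": 15, "DYS459a": 9, "DYS446": 10}

# Hypothetical samples.
exact = {"DYS393": 15, "DYS459a": 9, "DYS446": 10}
one_off = {"DYS393": 15, "DYS459a": 9, "DYS446": 11}
two_off = {"DYS393": 14, "DYS459a": 9, "DYS446": 11}

print(matches_signature(exact, I_E_SIGNATURE, allowed_steps=1))    # True
print(matches_signature(one_off, I_E_SIGNATURE, allowed_steps=1))  # True
print(matches_signature(two_off, I_E_SIGNATURE, allowed_steps=1))  # False
```

With allowed_steps=0 the same function gives an exact signature match, which suits signatures quoted elsewhere in this document without a mutation allowance.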
Both the P78+ Polish Project
samples are in the M223-Y-Clan
Project, and there are 13 others, but there are many more P78- in M223-Y-Clan,
so this is not a particularly large subdivision of M223.
The other two I-E samples in the
Polish Project have not been tested for SNPs, but both have P78+ close
matches on Ysearch, and no close matches from the other 3 branches of M223, so
those are likely also P78+.
There are two other known
haplogroup branches of M223: M379 has
no positives in M223-Y-Clan, and plenty of negatives, so it is very rare. M284 has plenty of positives in
M223-Y-Clan; that branch is a large
subdivision with a couple known branches of its own, but no samples in the
Polish Project.
My I-E definition is also available
at Ysearch, QUXE3. The Ysearch closest matches are I2b1c, so my
definition is good at extracting P78 samples, but I suppose a better definition
could be constructed for the world-wide P78 data. On Ysearch there are plenty of samples from step 19 through 25,
including some I2b1c beyond step 25, so this definition does not work
world-wide. The closest fits are not
concentrated in Poland, so if I-E truly represents a Polish clade(s) my STR
definition will not find members with confidence far from the region of
Historical Poland.
I-D. (M223+ P95+). (I-D Cluster). New
topic 25 Mar 2012. ISOGG now
I2a2a4; last year’s code for P95 was
I2b1d.
There are only 3 samples P95+ in
the M223-Y-Clan
Project, and many P95-, so this is a small haplogroup. Those 3 include one but not both of the
Polish Project I-D. Two of those 3 have
Poland listed as origin, and the third has no origin listed, so this may be a
Polish clade, but it is too soon to tell.
It is possible that I-C has a larger subdivision Polish branch, of which
this I-D may be a branch, but this is just speculation until we get more data.
(640) = (13) seems to be a
signature for I-D, but one STR marker
should not be very reliable for prediction.
I did not enter a definition into
Ysearch. The two I-D samples are
highlighted bold blue in column CI of I-CE.xls. Only one sample is P95+ in the Polish Project - the one that is
also in the M223-Y-Clan Project, so I used that sample as the definition. There is a sample at step 10, and none
others out to step 22, so I tentatively assigned that step 10 sample to I-D,
forming a cluster of two samples, SBP=25%, well isolated from others but not a type.
I1a1b
(L22). Edited 29 Aug 2015.
Input new topic 19 Aug 2015 by Paul
Stone.
The haplogroup defined by the SNP L22 is a major branch of I1a (DF29) and indeed
a major branch of I1 (M253). L22 is concentrated in Scandinavian
countries, but is also found elsewhere in Europe. In Poland L22 is about 1.5%.
L22 is a branch of CTS6364, which
is also called I1a1 (ISOGG) and also
equivalently called Z2336 (Yfull).
The paragroup CTS6364+ L22- has the STR signature
(390,385a,385b) = (20,14,14) although the signature alone is not a foolproof
prediction. This paragroup has recently
been divided into a large number of new haplogroups, due to the flood of new
SNPs. A good place to view all these
new divisions of I1a1 is http://www.yfull.com/tree/I-Z2336/.
I1-Z63. New
topic 23 Sep 2015 by Paul Stone.
Haplogroup Z63 is also called I1a3 (ISOGG, Sep 2015).
The Z63 SNP is estimated to have
formed around 4,700 ybp. In terms of
raw numbers, the Z63 population is small compared to the large CTS6364 and Z58
subgroups and is similar in size to P109 or Z73. The geographic range of Z63 is vast and stretches from Iceland to
Spain and into the Balkans and central Russia.
The subgroup is primarily Continental in nature with Scandinavian Z63
samples being relatively few in number.
Z63 has the highest density in Central Europe based on empirical
data. Inside of and throughout Poland,
several different Z63 lineages are represented. SNPs downstream of Z63 found in Poland include BY351, FGC14480,
L1237, PR683, S2078, S10360 and S15301 with others yet to be determined.
Description of the N Haplogroup Branches
N-G. (N-L551). (N-G Type).
Update 22 Mar 2012. Introduced
on 17 Oct 2010 as “N1c1(M178)-G type”.
The latest ISOGG code is N1c1d1a (L551).
Mayka
suggested this one, based on a suggestion by Andrzej Bajor, from his Rurikid Dynasty
Project. This type is concentrated in Lithuania, and
Andrzej suggests that at least one member might be a male line descendant of
Gediminas, the medieval Lithuanian Duke.
Hence the “G” code.
This type has 9 samples at 67 markers very well isolated in the Polish Project with SBP = 8.9%. See N-GType.xls. The definition is also available at Haplotypes.xls and at Ysearch as RGE95, using 51 markers, cutoff 3 (samples < step 3). All but one of the N-G samples can be extracted from the Polish Project using only the signature (392, 607, 557) = (15, 14, 13).
This type
should not be confused with another G type
in the R1a haplogroup.
That new L551 SNP verifies our prior prediction that G type
corresponds to a clade. All 9 of the
predicted G type samples at 67 markers have tested L551+, and samples predicted
just beyond G type are coming out L551-.
Of course, there will probably be a few exceptions as more data
accumulates, but so far N-G type (STR
match) is equivalent to L551 in the Polish Project.
At Ysearch, N-G type is not as well
isolated; the SBP is 22% with cutoff 4,
due to interference by what might be a Russian clade. There are many Lithuanian samples matching my N-G definition
(RGE95), including Lithuanian samples beyond the cutoff (step 3). 46% of the Ysearch samples below step 9
indicate Lithuanian origin. L551 is too
new to be included in Ysearch, so this paragraph refers to N-G type as defined
by STRs.
I do not know if the Polish Project
N-G samples are an independent Polish sub-clade of a larger Lithuanian
clade; or if the Polish Project samples
are just a random sample of individuals from a larger clade(s). I have not taken the time to search other
projects for STR matches to my N-G definition, or to search for more L551+
samples. Someone might inform me before
I get a chance to search. Watch this
topic for updates.
The age
of N-G type seems to be less than 1,000 years, perhaps only 500 years. Check the “ASD” sheet in my analysis
file. ASD age is highly uncertain, particularly for such
a small sample, but G type has little STR variance, so surely G represents a
clade younger than 2,000 years old.
Isolation is evidence of an old node,
with TMRCA much younger than the
node. The age of the L551 mutation can
be anywhere in the time span older than the TMRCA of G type and younger than
the node. N-G type is well isolated in
Lithuania and Poland, but N-G may have a relatively young node with those other
clades world-wide with similar STR values.
Those other clades can be used to better constrain the age of the L551
mutation.
N-M. (N-L591). (N-M Cluster).
Update 22 Mar 2012. Mayka suggested this one also, introducing
it at the Polish Project in Jan 2011, as “N1c1(M178)-M Cluster”. The latest ISOGG
code is N1c1d1b
(L591). Includes Mickevius
(Mickewicz) descendants. Hence the “M”
code. Also concentrated in
Lithuania. These two, N-G and N-M, are
a small fraction of the M178 clade.
I call this a cluster because it
does not meet my criterion SBP<20% to be
called a type. Actually, the original proposed cluster is equivalent to what I
am now calling Ma cluster, discussed below.
The recent new SNP named L591 is coming out with about twice as many
samples, so we have adopted the “M” short code name for the STR data for
L591; this larger N-M cluster is now
considered equivalent to N-L591.
My analysis is available, N-MCluster.xls,
10 samples at 67 markers. My best
automatic definition for N-M, column CL,
SBP=25%, is 80% accurate, missing one sample that is obviously L591 and
predicting one sample that came out L591-, out of 10 predicted. Actually, this result is a nice confirmation
of my SBP method, because although the data has only 10% background (false positives captured by
the definition), my SBP formula has an increase to account for statistical confidence; hence 25% is a better upper confidence estimate of the background
for so little data. I bet as more data
accumulates my best N-M definition will drift below SBP=20%, qualifying as a
type. Anyway, this is moot, because
L591 is a better criterion for the clade, and there is a logical distinction
between the N-M cluster (samples with STR correlation) and the L591
haplogroup. My definition serves as a
guide for priority for L591 testing.
Testing should be concentrated near the cutoff.
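The counting in that paragraph can be illustrated with sets. This sketch does not compute SBP, whose formula is not given in this section; it only shows the raw background and one possible reading of "80% accurate" (one missed sample plus one false positive, out of 10 predicted). The sample labels are hypothetical.

```python
# 10 samples predicted by the STR definition (hypothetical labels).
predicted = {f"s{i}" for i in range(1, 11)}

# Samples regarded as L591: s10 came out L591-, and s11 was missed by the definition.
confirmed = (predicted - {"s10"}) | {"s11"}

false_positives = predicted - confirmed  # predicted but not L591
missed = confirmed - predicted           # L591 but not predicted

background = len(false_positives) / len(predicted)
accuracy = 1 - (len(false_positives) + len(missed)) / len(predicted)

print(f"background {background:.0%}, accuracy {accuracy:.0%}")  # background 10%, accuracy 80%
```

The gap between this 10% raw background and the reported SBP of 25% is the allowance for statistical confidence described above.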
Accordingly, I came up with an
improved STR definition for L591, using a mask to manually adjust marker
selection. I’ll still call it by the
short code N-M. Column CC in that
file. SBP=50%, but SBP does not matter
here, because the purpose of the definition is not to discover a hypothetical
clade, but to predict samples for a known clade. Most clades do not produce low SBP because most clades are not
well isolated. Let me elaborate with
discussion of the statistical issues for N-M:
Obvious issue: There are three N-M samples with a very rare
6 step mutation at DYS446, from 17 to 11.
Without DYS446, two of these three marginally fit the N-M cluster (based
on STRs). These three seem to represent
a subclade of L591 with modal
STRs slightly drifted since their node. I
marked them as “Mb” in that Excel file.
Only one of these has actually tested L591+. Another one of these is that “obviously L591” sample that I
mention above, the “obviously” based on this 6 step mutation, which is almost
as good a marker as an SNP. That
“obviously” sample is an STR outlier at
other markers, which need to be excluded from the L591 definition, assuming
more samples like this will show up.
This seems obvious, but it needs verification with more data over the
near future.
Speculative issue: There are two other outliers, which I
labeled Mc and Md. Both tested L591+. These may represent two clades with nodes only
slightly younger than the TMRCA for L591,
with independent modal drift. Highly
uncertain. They might just be
statistical outliers, due to the luck of random mutation. Again, more data will tell. For now, I adjusted the N-M definition to
capture them, on the assumption that some future samples might come up with
similar STR values.
Another issue: That one sample, mentioned above, fitting
the M cluster very well but L591-, probably represents a clade with a node
slightly older than L591, but similar STRs by coincidence; there may be other such clades. Again, this is speculative, but I adjusted
my definition to exclude this one.
Statistical speculation
summary: L591 does not seem very well
isolated in haplospace, albeit more
isolated than most young Y-DNA clades.
It seems the L591 tree has nodes close to the SNP age, both younger and older.
My L591 definition is available in
that Excel file, in Haplotypes.xls,
and at Ysearch as 64RUG.
This L591 clade seems to be
concentrated in Lithuania. The evidence
is Ysearch - Lithuanian concentration of the N-M cluster. L591 test data is not available yet at
Ysearch. My Ysearch analysis (data in a
sheet in that Excel file) is similar to the G type analysis: SBP not as good because of apparent
interference from clades world-wide.
Using the N-M definition at Ysearch, there is Lithuanian concentration
at steps well beyond the cluster cutoff, so there seems to be a larger
Lithuanian clade.
In the Polish Project, I spotted
evidence of such a larger STR type, about double the size of N-M, including
all the N-M samples as a sub-clade. I
colored these samples green in column BX of N-MCluster.xls, using all 67
markers. I dubbed this one N-L
type. That 67 marker evidence is not
satisfactory because it captures a couple N-G samples. In another file, not posted on-line, I came
up with a satisfactory definition for N-L;
I provide it in the “Haplotypes & Masks” sheet, row 21, of
N-MCluster.xls. Mayka advises me that
there are two new SNPs, L1025 and L1027, that are currently candidates for a
haplogroup larger than L591. We are
waiting to see how those come out before introducing N-L. That N-L definition cutoff provides a
suggestion of where to prioritize SNP testing.
The age of N-M (L591) comes out
similar to the age of N-G type, probably less than 1,000 years; see that short
paragraph in the N-G topic above. My
comments about isolation of N-G in the Polish Project do not apply to L591. For N-M, it is important to exclude DYS446,
because that one marker triples the age as calculated using ASD (STR variance),
due to that 6-step deletion mutation mentioned above. You can see this by editing cell BV21 in my mask in my “ASD”
sheet in that file. Another way to edit
this is to edit the 446 value, to make the mutation count one or two, which is
more representative of the age. This is
a good example of one of the caveats
associated with age calculation based on STR variance.
N-Ma. New
topic 20 Mar 2012. This is the original
“N1c1(M178)-M Cluster” cluster, explained in the previous topic. Only 3 samples when introduced Jan 2011,
SBP=36%. Now there are 5 Ma samples,
SBP=30%. Although still not qualified
as a type, there is better than a 30% chance this will improve over the next
couple years as data accumulates.
Lithuanian concentration, same as N-G and N-M. Again, I do not expect validity world-wide for N-Ma because of
interference from other clades world-wide, but this might grow into a nice small,
young Lithuanian clade. Analysis is in N-MCluster.xls,
where the 61 marker definition for Ma is in column CG.
Poland Concentration Index; PCI
19 Sep 2016: Update Frequency column per Results Table;
update SNP column; PCI column
needs update
New Topic 22 Feb 2014. Update 20 Apr 2016.
Haplogroup ISOGG | SNP | Clade | Short Code Name | Frequency in Polish Project | Poland Concentration Index
R1a1a1b1a2b3 | | I type | | 2.4% | 42
 | | Np cluster | | 2.2% | 41
R1a1a1b1a2a | CTS4648 | Z92Y type | | 0.7% | 39
I1a1* | | I1a-P type | | 2.1% | 38
R1a1a1b1a2b3a | | G type | | 2.0% | 38
R1a1a1b1a1a | | P type | | 12.9% | 34
R1a1a1b1a2b3 | | J type | | 1.8% | 25
J1a2b2 | L147.1 | J1a Ashkenazi | J1-A | 0.9% | 19
[2016 comment: This PCI discussion was written in 2014.]
I introduced a preliminary version
of this index in my 2009 Publication,
page 161, with results in the far right column of Table 1, page 162. In 2009 I defined this preliminary index as
the percent of samples, of a given haplotype, that have the word “Poland”
included in the “Origin” field for male line ancestry, for Y-DNA data at Ysearch.
In this 2014 version, I extend the
index to types and clusters.
In this topic, for brevity, let me use the general term “cluster” to
mean a cluster, or a type, or a haplotype, or a haplogroup, or a paragroup, or a clade, or any other word for a category of
Y-DNA STR data, although I make a technical
distinction at the bottom of this topic.
In this 2014 version, the PCI is
statistically adjusted for data sample size, as explained a few paragraphs
below.
There are a number of reasons why %
“Poland” from Ysearch is not an accurate measure of concentration in
Poland. I mention some of these reasons
in my 2009 Publication. I plan to add a
longer discussion to this web page, with more detail about objections to
statistics from Ysearch.
However, the objections are not a
very serious problem if we are interested in relative concentration in
Poland. Comparing clusters, we expect
the clusters with higher % Poland at Ysearch to likely be more concentrated in
Poland than those with lower % Poland. This
is one reason I do not use a % sign for PCI.
The other statistical reasons are explained below in this topic.
Caveat: You may use my PCI for another country, for example England, with
a new “ECI” related to % “England” at Ysearch.
It would not be valid to compare the PCI numbers to the ECI numbers,
because we do not expect equal joining probabilities. Men with English male line ancestors do not necessarily join
Ysearch in proportion to men with Polish male line ancestors. However, we do not expect significantly
different joining probabilities for men with Polish male line ancestors in
different clusters. There are
exceptions, which I leave for future expansion of this discussion here (for
example Polish Ashkenazi clusters).
The problem of false positives: We expect “false positives” - clusters with
high % Poland just due to statistical probabilities
(the luck of how many Polish and non-Polish men that would fit that particular
cluster just happened to join Ysearch).
The more clusters we study the more false positives we expect to
find. The more clusters we study the
more likely we will find a false positive that seems very highly concentrated
in Poland.
For small samples of data, the
statistical uncertainty is larger, so we expect more false positives. Suppose we check a large number of clusters
for % Poland at Ysearch (or at any database), and suppose many of those
clusters have fewer than 10 samples at Ysearch, and suppose some of those clusters
have fewer than 5 “Poland” samples. We
will surely find false positives. I
discuss this sample size uncertainty in my 2009 Publication, where I used the
lower bound of confidence range as a
method to compensate for this statistical problem, particularly in small
clusters.
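The false positive problem can be illustrated with a short calculation. This is a minimal sketch in Python, not part of the Ysearch.xls file; the cluster size, the true Poland fraction, and the number of clusters studied are hypothetical numbers chosen only for illustration.

```python
from math import comb

def binomial_tail(total: int, at_least: int, true_fraction: float) -> float:
    """Probability of seeing at_least or more "Poland" samples out of total."""
    return sum(
        comb(total, k) * true_fraction**k * (1 - true_fraction) ** (total - k)
        for k in range(at_least, total + 1)
    )

def chance_of_any(tail: float, clusters: int) -> float:
    """Probability that at least one of several independent clusters is a fluke."""
    return 1 - (1 - tail) ** clusters

# Hypothetical case: a cluster of 10 samples whose true Poland fraction is 10%
# shows 4 or more "Poland" samples (40% or more) purely by luck.
fluke = binomial_tail(total=10, at_least=4, true_fraction=0.10)
# Hypothetical study of 100 such small clusters.
any_fluke = chance_of_any(fluke, clusters=100)
print(f"one cluster: {fluke:.1%}; at least one in 100 clusters: {any_fluke:.0%}")
```

With these assumed numbers the chance for a single cluster is only about 1.3%, but the chance of at least one such fluke among 100 clusters is about 72%, which is the point of the paragraphs above.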
My PCI uses the lower bound of the
95% confidence range. For example, a
PCI = 25 means 95% confidence that the “true population % Poland” is 25% or
greater. By “true population % Poland”
I mean the % value that would show up in a much larger database drawn from the
same population in the same way (in the distant future at Ysearch, for example,
if Ysearch is still popular in the distant future).
If you are knowledgeable about
statistical methods, you may quickly understand the details of my method from
my file Ysearch.xls; check the “Summary” sheet and the
“Instructions” sheet. There is a
technical statistical explanation of PCI near the center of the “Instructions”
sheet. Check the other sheets for
specific cluster results. My automatic
procedures use macros; if you are
concerned about macros your browser should allow you to open my file in “View
Only” mode.
Even if you are not knowledgeable,
you might try following the “Instructions” sheet to evaluate your own clusters.
The “Summary” sheet in the
Ysearch.xls file has 5 example rows labeled “50% Tests”: one row shows that a cluster with 100
“Poland” samples out of 200 Total has 50% Poland, and has a lower 95%
confidence limit of 44.8%, so the PCI is 45.
However another row shows that a cluster with 5 “Poland” samples out of
10 Total also has 50% Poland, but has a lower 95% confidence limit of 25.3%, so
the PCI is only 25. In other words,
smaller clusters get more downgrading to compensate for small sample
statistics, but small clusters are allowed.
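The downgrading of small clusters can be sketched in a few lines of Python. This sketch uses the one-sided Wilson score lower bound, which is an assumption on my part and a stand-in only; it is not the formula in Ysearch.xls. For the two examples above it gives roughly 44 and 27, close to but not the same as the 44.8 and 25.3 quoted from the sheet.

```python
from math import sqrt
from statistics import NormalDist

def lower_bound_percent(poland: int, total: int, confidence: float = 0.95) -> float:
    """One-sided Wilson score lower bound in percent (assumed stand-in formula)."""
    z = NormalDist().inv_cdf(confidence)
    p = poland / total
    centre = p + z * z / (2 * total)
    spread = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return 100 * (centre - spread) / (1 + z * z / total)

# Both clusters are 50% Poland, but the small one is downgraded much more.
large = lower_bound_percent(100, 200)
small = lower_bound_percent(5, 10)
print(f"100 of 200: index {large:.0f}; 5 of 10: index {small:.0f}")
```

Whatever the exact formula, the behavior is the same: at equal % Poland, the cluster with fewer samples gets the lower index.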
This is the main reason I leave out
the % - to avoid confusion - probability
vs confidence - PCI is a mix of
both. PCI allows small clusters to be
compared to large clusters, where the smaller clusters get adjusted to a lower
index number.
There is another issue, selection bias: In defining clusters, we choose the number of markers, and the cutoff, to best define the cluster. The cutoff should be adjusted to best
capture Poland samples, as demonstrated in the example sheets in
Ysearch.xls. Selection bias effectively
reduces confidence, because we may be selecting parameters based on statistical
flukes.
Those “objections to Ysearch”,
mentioned above but not listed, also reduce confidence. Many of the objections cause variation of
results, with more variation expected for smaller clusters.
So my net confidence is not 95%,
but less. This is my style here: calculate PCI using 95% sample size
confidence because sample size confidence can be easily calculated in an Excel
sheet. All those objections, including
selection bias, reduce confidence, with larger effect expected for smaller
clusters, but smaller clusters get more reduction in PCI. I suppose my net confidence is about 80%,
although this estimate is based on subjective experience - difficult to
document with words. I expect about 80%
of my PCI predictions to slowly increase over the years as more data
accumulates at Ysearch, with smaller clusters increasing faster in PCI. I expect roughly 20% of my PCI
predictions to drop over the years, meaning roughly 20% overestimation of
Poland concentration confidence. On the
other hand, this PCI method automatically ignores small clusters with lesser
concentration in Poland, many of which will significantly increase in PCI as
data accumulates.
My Excel sheet automatically
removes “Modal” Ysearch data, and excess “family set” data, as explained in the
“Instructions” sheet, where the user is invited to manually edit the results,
because human judgment is an improvement over automatic editing.
The Polish Project is representative of
the historical Commonwealth of Poland,
a geographic area much larger than modern Poland. PCI is weighted toward men with ancestry self-described as from
Poland. One of these days I might add
here a discussion topic about this complex topic. Check the web if you wish discussion now.
My PCI index can be used for
haplogroups, but technically I do not do this, because the haplogroup
assignments are not up to date at Ysearch.
I like Ysearch because of the huge amount of STR data at 67 markers. Many of my STR based types are “almost
equivalent” to SNP based haplogroups. For example, P type is almost equivalent to L260.
“Almost equivalent” means a few haplogroup samples are STR outliers and a few samples from other
haplogroups marginally fit the type just below the cutoff. In my tables this technical distinction
between P type vs L260, and other equivalents, may not be obvious to you.
Instructions for Use of Ysearch
Update rewrite 21 Sep 2015. Edit 5 Jul 2016.
Link to the site: http://www.ysearch.org/. Brief description of Ysearch.
Click on the Create A New User tab,
where you can upload your Y-DNA STR data
from a number of testing services. Or,
you can type in your data. You end up
with a “User ID”.
Click on the Search for Genetic Matches tab to
search for Ysearch members closest to you in STR values.
Ysearch has a Research Tools tab to
click, where you can type in other User ID’s for detailed comparison to your
data.
Ysearch does not keep up very well
with new SNPs, so this is not the best
place to find out your location in the Y-DNA tree. It is better to join an FTDNA project, where the administrators
help to figure out your terminal
branch in the tree. However,
Ysearch has a lot more data, so you might find closer matches here.
My
Definitions. I have entered a number of definitions into Ysearch, for the types of interest to me. These are modal haplotypes; they do not correspond to any real
person. These definitions use only some
of the STR markers, so it may be
misleading if you seem to match one of them closely. Issues: These are based
on selected markers from the 67 marker set, so you need to have the full 67
marker data. If you are a perfect match
(Ysearch reports Genetic Distance = zero) then it is highly likely you belong
to that type. If your match is Genetic
Distance 1 or 2, it is less likely but still a good bet you belong to that
type. If your match is 3 to 10, it
depends on the type; some types are
more restrictive than others; you need
to read the documentation about that type, so search for it in this web page.
Examples: P type is 8U92G at Ysearch;
I type is EKVHX.
29 March 2010 correspondence: I mentioned Russian sites for R1a clusters in my publication. It’s not easy for me to figure out which of
those clusters correspond to my types. Mayka worked out a correspondence on 29
March, warning me that the correspondence is not exact. Some of the Russian clusters are broader
than my types; some are narrower. Here are Mayka’s findings:
My Type code vs
Russian cluster name:
A Ashkenazi Jewish
B Western Eurasian
C Old European
D Baltic - Carpathian
E Northern Eurasian
F Central Eurasian
G Northern European
H Western Carpathian
I Northern Carpathian
N Central European
P Western Slavic
19 Sep 2010 update: A nice tree display of the Russian
subdivision of R1a is at http://www.r1a.org/. Robert Sliwinski brought this site to my
attention.
My opinion: R1a
cannot be highly subdivided with confidence based on STR data.
This web site of mine is dedicated to estimating the confidence of each type that I study. I try to indicate which types are speculative. Even for the types with high confidence, the
location of the nodes in the R1a tree will be uncertain until corresponding SNPs are discovered. These Russian clusters, apparently by Klyosov, have plus / minus values for
accuracy of TMRCA ages that are far too
small, because there are serious caveats
associated with systematic statistical uncertainties.
Edit 10 Jun 2016.
Update: I published my
“Mountain Method” for STR analysis in 2009,
back when STR analysis was important because SNP
testing was expensive and new SNPs were rare.
Now that SNPs are relatively inexpensive, STR analysis is not as
important. I still find STR analysis
useful because there are still plenty of data on the web for samples
with STR data, but with insufficient SNP data.
Here is a summary of terms (in
boldface) that I defined for my “Mountains in Haplospace” method. For more explanation, see my publication. By haplospace
I mean multidimensional sets of STR
values; each haplotype is a point in haplospace.
This topic is about STR analysis,
but restricted to Y-DNA genetic genealogy.
For a more general introduction consider wiki STR Analysis.
Men submit their Y-DNA data to
various web sites. There are lots of
STR data available on the web. A cluster is a set of samples with similar STR values. Men are divided into STR clusters as
hypothetical subdivisions of the haplogroups,
based on similarities of STR values.
All such clusters are hypothetical.
Some will be validated by new SNP
discoveries. There are various
statistical methods for estimating the confidence
of STR clusters. I published a method that I
developed. That publication has
references to other methods.
A cluster qualifies as a type if the graph of step frequency
(number of samples at that step) vs step
looks like an isolated mountain. The step is the genetic
distance (mutation count) from the modal haplotype of the cluster. I use the method of Ysearch to calculate step. The cutoff is the
next step just beyond the mountain. A
good type has low step frequency in a “gap” of step
values including the cutoff (only the cutoff for a gap of 1). A good type forms a mountain at step values
less than the cutoff, separated by a gap from the rest of the database (the upstream father haplogroup usually) at higher step
numbers.
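The step frequency graph and the cutoff can be sketched in Python as follows. This is an illustration under stated assumptions, not the Excel tool: step is taken here as the simple sum of absolute marker differences (Ysearch has its own rules for some markers), the rule in find_cutoff (first step past the peak with frequency at most gap_max) is a hypothetical simplification, and the haplotypes are made-up data.

```python
from collections import Counter

def step(sample, modal):
    """Genetic distance: sum of absolute differences over the markers (assumed rule)."""
    return sum(abs(s - m) for s, m in zip(sample, modal))

def step_frequency(samples, modal):
    """List where index k holds the number of samples at step k."""
    counts = Counter(step(sample, modal) for sample in samples)
    return [counts.get(k, 0) for k in range(max(counts) + 1)]

def find_cutoff(frequency, gap_max=0):
    """First step past the peak whose frequency is at most gap_max, else None."""
    peak = frequency.index(max(frequency))
    for k in range(peak + 1, len(frequency)):
        if frequency[k] <= gap_max:
            return k
    return None

# Made-up 5-marker data: six samples near the modal haplotype, four far away.
modal = (13, 25, 16, 10, 11)
samples = [
    (13, 25, 16, 10, 11), (13, 24, 16, 10, 11), (13, 25, 17, 10, 11),
    (14, 25, 16, 10, 11), (13, 24, 16, 11, 11), (12, 25, 16, 10, 12),
    (15, 23, 16, 11, 11), (15, 23, 17, 11, 11), (14, 23, 15, 11, 13),
    (15, 24, 18, 10, 12),
]
frequency = step_frequency(samples, modal)
cutoff = find_cutoff(frequency)
print(frequency, "cutoff:", cutoff)
```

In this made-up data the mountain sits at steps 0 to 2, the gap is at steps 3 and 4, and the rest of the database starts at step 5.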
In other words, I use the word
“type” for a cluster with high quality, where quality is estimated on the basis
of STR isolation. Of course the
ultimate measure of quality is when a cluster or type is eventually shown to be
equivalent to a newly discovered SNP. Before such an SNP is discovered, a well
isolated type provides confidence that it corresponds to a future SNP
haplogroup. After such an SNP is
discovered, an equivalent type or cluster is used for STR prediction of haplogroup, for samples
with STR data but insufficient SNP data.
Eventually, we expect most well isolated types to have multiple phyloequivalent SNPs discovered; isolation should lead to both a unique set of
STR values and a unique set of SNPs.
See also the discussion about smooth branch
segments.
The Statistical Background
Percent (SBP) is an objective measure of the quality of the
type. Low SBP is taken as evidence that
a type corresponds to a clade that is equivalent to a haplogroup defined by an SNP (perhaps not yet discovered). Larger types with lower gaps have lower
SBP. SBP is intended as an estimate of
the background percent of samples in a type that
really do not belong to the corresponding hypothetical clade because they are
outliers from other clades. An outlier is a sample that has very unusual STR values due to
the luck of mutations. SBP is also
intended to account for the estimated percent of samples from small foreign clades that just happen to have the same STR
values but are not closely related to the type. (Actually, an individual STR sample represents a clade insofar as
fathers, sons, brothers, and cousins should have almost the same STRs.) SBP is approximately the probability that a sample with STRs matching the type
does not belong to the corresponding clade, but SBP is adjusted for the confidence interval. Small sample counts have wide confidence
intervals. So larger types (more
samples) automatically get lower SBP.
For a valid clade, SBP should decrease with time as data accumulates in
a database. A very well isolated clade
will have a low SBP even with only a few samples. SBP < 5% is
very rare - a very well isolated type, very likely to be a clade. SBP < 20% is good enough to be announced
as a type on the web. SBP > 20% is a
cluster worth watching as data accumulates with time. I avoid using the word type for SBP > 20%. SBP > 50% is not statistically meaningful
although such clusters might improve as data accumulates. The SBP equation (available as an Excel
worksheet in the tools) produces SBP >
100% for clusters that do not look like mountains. The number of markers in the definition should be chosen to provide
as small an SBP as possible; my Excel
tools provide automatic rank
of markers as an aid; human judgment
can be used to include or exclude markers with obvious problems. A signature is
a small set of markers that rank best, convenient for publication of a type,
and for simple demonstration of the correlation of STR values.
I use the word “type” to mean 1) the hypothetical clade, and 2) the associated cluster of data, and 3) the modal haplotype,
and 4) all possible haplotypes that
differ from the modal haplotype by step less than the cutoff.
The definition of a type is the modal
haplotype plus cutoff. The definition
uses only those STR markers that provide the lowest SBP, but the definition
uses as many STR markers as possible if there is a tie. The definition of a valid type may change
slightly as data accumulates.
I use the word clade in general, meaning a Y-DNA clade that
may or may not be a defined official haplogroup. All types have associated
hypothetical clades, but most clades cannot be isolated as types with low
SBP. The modal
value for a marker is the most common value in the cluster. The modal
haplotype is the set of most common values, usually the most common
haplotype in a cluster. Many people use
the adjective “modal” as a noun, meaning “modal haplotype”; so do I;
I tried to avoid that in this web document.
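A short Python sketch of the modal haplotype follows, using made-up two-marker data. It also shows why the text above says "usually": the set of most common marker values need not be the most common complete haplotype. The function names are hypothetical.

```python
from collections import Counter

def modal_haplotype(samples):
    """Most common value at each marker, taken marker by marker."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*samples))

def most_common_haplotype(samples):
    """The complete haplotype that occurs most often in the cluster."""
    return Counter(samples).most_common(1)[0][0]

# Made-up cluster of five samples with two markers each.
cluster = [(13, 24), (13, 24), (14, 25), (13, 25), (12, 25)]
print(modal_haplotype(cluster))        # marker by marker
print(most_common_haplotype(cluster))  # whole haplotype
```

Here the modal values are 13 and 25, but the haplotype seen most often is (13, 24), so the two differ in this small example.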
Most of my types have been validated in the past few years
by newly discovered SNPs that seem equivalent. I say “seem equivalent” instead of “equal”
because of two distinctions between types and haplogroups:
Outliers: First, types are defined by STR
correlations, and outliers are expected because of the random luck of STR
mutations. Even if a type is very well
isolated, with all type samples coming out positive for a newly discovered SNP,
and with no samples negative for that SNP fitting the type, eventually outliers
are bound to show up as new samples provide more STR and SNP data. We expect to eventually discover a few outliers
that do not fit the STR type but are positive for that SNP, and we expect to
eventually discover outliers that are negative for that SNP but fit the STR
type.
Better Equivalents: Second, as new branches are discovered, it
is possible a better equivalent SNP might be discovered, slightly younger than
the original equivalent, leaving out one or more of the oldest branches of that
original equivalent haplogroup, where those oldest branches might have many
outliers that do not fit the type.
Conversely, a better older equivalent SNP might be discovered, including
a few branches with nodes older than that original node, where those older
branches might have most of the original outliers that fit the type.
The rest of this topic provides discussions
and more definitions that are not part of my Mountain Method. These are discussions and terms that I use
often, so I provide them here for easy link reference from my web pages. Some of these terms are not common in
genetic genealogy. Some of these I do
not recall seeing used in genetic genealogy documents at all, so they might be
my inventions, although I suppose other writers may have used these terms with
similar meaning. Some of these are
common but I do not use them often.
A bimodal
marker has a second STR value with many samples - more than expected
statistically - in addition to the most common modal value. A multimodal
marker is possible if there are more than two common values for the marker and
if those common values are not distributed more or less symmetrically on both
sides of the most common value. (A
Bessel distribution is statistically expected for a low fraction of random
independent mutations at an STR marker.
A Bessel distribution is close to a Gaussian distribution for a high
fraction of independent mutations. A
Bessel for a low fraction looks like a tent;
a Gaussian looks like a bell.)
Step up mutations are more common than step down for short STRs, so for
example a modal 8 plus a few more 9 values than 7’s does not necessarily mean
the 9’s are statistically significant;
experience helps to judge. RecLOH and other issues at compound markers
also cause confusion in this regard. A bimodal
marker is a hint that there may be a clade associated with that 2nd value, so
genetic genealogists study clusters defined by one or a few STRs with such
bimodal 2nd values. The main modal
value also sometimes makes a good signature at a bimodal marker. In other words, a set of values using one or
more bimodal or multimodal markers makes a good signature for a hypothetical
cluster.
In the past, I have sometimes
called such clusters hypothetical types. I now prefer to reserve the word type for < 20% SBP. Sometimes I make exceptions above 20%, for
example when a cluster is regionally concentrated, or associated with an ethnic
group.
I had sometimes used “bimodal
marker” for that second STR value, but I try to avoid that confusion. It’s the STR marker that is bimodal, with
two common values.
There is no known way to calculate
the % confidence that a cluster
corresponds to a clade, but an experienced genetic genealogist can roughly
estimate confidence based on experience.
I developed SBP so that 100% minus SBP expresses my confidence, but only
for clusters with less than 30% SBP;
SBP breaks down around 50%. I
avoid publishing clusters in which I estimate less than 50% confidence,
although I may mention some as speculative, particularly if they have been
announced by others.
Not all Y-DNA STR data separates
into types because the distribution of STR values tends to be continuous. For insights into why types form, please see
my discussion of extinction and population bottlenecks.
A main branch of the Y-DNA
tree is old, with data on the web for thousands of samples belonging, and
with many known further branching divisions.
I like to use the word twig for a small young
branch of the Y-DNA tree. A terminal branch is a smallest known division of the
tree; a terminal branch might be a terminal
haplogroup, or a subdivision of a terminal haplogroup - a type or a
hypothetical cluster. A terminal branch
at one web site might not exist at another web site; a terminal branch might be very small (one or only a few samples)
or very large (many samples).
Age (often
years before present, ybp): By definition, the TMRCA
(Time of the Most Recent Common Ancestor) corresponds to the age of a node, or branching point, in the Y-DNA tree.
(Sometimes more than two branches
split off at one node, but we expect future SNPs might usually resolve that
node into multiple nodes, each branching into two branches.)
Some genetic genealogists use TMRCA
as the age of the corresponding haplogroup
(or type, or cluster, or branch or clade).
I often do; it’s usually good
enough. But there is a technicality
that causes confusion:
An SNP
is probably older than the TMRCA of the haplogroup it defines, because there are
usually many generations between old nodes, due to the statistical pruning of the Y-DNA tree (discussed above in the
definition of segments of tree branches
and in the paragraph about extinction). The probability is very low that an old SNP
mutation happened in exactly the same generation as the TMRCA. (An exception would be a recent private SNP found in an extended male line family.)
Conceptually, we might prefer to consider the age of the SNP as
the age of the corresponding haplogroup.
But there are usually multiple phyloequivalent SNPs for a
haplogroup, and of course, they differ in age.
Methods to estimate TMRCA age do not provide distinct ages for all those
SNPs.
A third stipulation of the age of a
haplogroup might be the age of the previous known node. Then a haplogroup would include the male
descendants of a MRCA plus his male ancestors in the immediately older known
segment, but this is opposed to the traditional idea of a Y haplogroup being
the male clade of descendants of the man who experienced the mutation for the
corresponding SNP.
The Yfull Tree solves this confusion by
using two ages, the TMRCA and the older “formed” age,
which is the TMRCA of the previous known node.
Yfull estimates age by analysis of the number of SNPs per segment. Before about 2013 or so, most Y clade age estimates were based on STR
distributions, and assumed STR mutation rates.
Any method of age estimation has serious caveats. Most of my xls on-line STR analysis files have a sheet that
estimates age from STRs in various ways, but I’m not including that sheet in
current analyses because the Yfull site does an adequate job of estimating age.
I call the segments between nodes smooth branch segments, where there are no known
nodes in that segment of the Y-DNA tree.
A long smooth segment in the Y-DNA tree is one way to visualize
isolation in haplospace. A type, because it is isolated, probably has
a long smooth segment older than the MRCA of the type, with more than the usual
number of phyloequivalent SNPs. A
smooth segment is necessarily a statistical estimate, because the number of SNPs
is influenced by the luck of statistics.
Family Sets; Recruitment Bias; Statistics on Frequency
Edited 23 Jan 2015:
Sometimes one individual recruits
male line relatives to submit data to a Y-DNA database, for example to the Polish Project. I call these family sets. I count these together as one sample when
compiling statistics on frequency. By
statistics on frequency I mean the number of samples
per clade. By clade I mean a haplogroup or type or cluster. I do this adjustment for family sets because
otherwise a small clade might get reported as too large.
My Results
Table page is an example of statistics on frequency where I adjust for such
recruitment bias.
I do not discourage such recruitment; it is a great research technique. I recruited my third cousin. I don’t mind the effort of adjusting for
such recruitment bias.
My adjustment method: I sort databases by name, and automatically
flag name repetitions. Then I examine
the flagged data to see if the STR data is a very close match, which is a sign
of recruitment. Often I make contact by
email when it is not obvious if the samples have been recruited. Actually, even with email discussion, the
actual correction may not be obvious;
for example it may be difficult to say if a particular recruited distant
relative may have later joined the project anyway independently, in which case
he should be counted. So I may estimate
2 or more effective “independent” samples for some family sets.
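The automatic part of this adjustment can be sketched in Python. This is only an illustration: the record layout, the kit numbers, the max_distance threshold, and the rule of counting each flagged set as one sample are all hypothetical, and the manual examination and email contact described above are not modeled.

```python
from itertools import groupby

def genetic_distance(a, b):
    """Sum of absolute marker differences (assumed simplification)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def flag_family_sets(records, max_distance=2):
    """Lists of kit ids that repeat a surname and also match closely on STRs."""
    flagged = []
    by_name = sorted(records, key=lambda r: r["surname"].lower())
    for _, group in groupby(by_name, key=lambda r: r["surname"].lower()):
        sets = []
        for record in group:
            for family in sets:
                if genetic_distance(record["markers"], family[0]["markers"]) <= max_distance:
                    family.append(record)
                    break
            else:
                sets.append([record])  # same name, but a different male line
        flagged.extend([r["kit"] for r in family] for family in sets if len(family) > 1)
    return flagged

def effective_sample_count(records, flagged):
    """Count each flagged family set as one sample."""
    return len(records) - sum(len(family) - 1 for family in flagged)

# Made-up project data.
records = [
    {"kit": "K1", "surname": "Kowalski", "markers": (13, 25, 16, 10)},
    {"kit": "K2", "surname": "Kowalski", "markers": (13, 25, 16, 11)},
    {"kit": "K3", "surname": "Kowalski", "markers": (15, 23, 18, 10)},
    {"kit": "K4", "surname": "Nowak", "markers": (13, 25, 16, 10)},
]
flagged = flag_family_sets(records)
print(flagged, effective_sample_count(records, flagged))
```

In this made-up data K1 and K2 are flagged as one family set, K3 shares the name but not the STRs, and K4 matches on STRs but has a different name, so it is left for the separate STR sort described below.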
I also sort data by close STR
matches and look for evidence of recruitment.
Recruitment can also be by close STR matches even with different family
names. For example, I recruited a man
with a last name different than mine, where his daughter noticed that his 12
marker data (at ancestry.com) matched my 12 marker data (at
familytreedna.com). I paid for his 111
marker data which I submitted to multiple projects. We match STRs very closely at 111. I determined that his male line ancestors lived in Poland only 10
miles away from the village where my ancestors lived. We are obviously distant male line relatives. I don’t count him in frequency data because
I recruited him.
I also sort by email address, again
looking for samples with very close STR matches. Most samples that have the same email address are not in the same
main haplogroup branch, even when the family name matches, which means most
recruitment by family name brings in samples from different male lines. I do not adjust for these, because I assume
the recruited samples fall randomly into haplogroups according to frequency in
Poland, so such recruited data is OK.
It is difficult to judge what to do when a pair of recruited samples are
in different terminal branches that
branch from a common larger branch.
Sequential kit numbers, or nearly
sequential, are additional evidence of
recruitment.
I’m not trying to make perfect
adjustments. I’m mainly trying to catch
all the large family sets. I don’t
bother people with emails about sample pairs that may or may not be due to
recruitment; I make my own
judgment. If I miss a few pairs, or if
I discount a few pairs that are really independent, that just adds a little
noise to the frequency data.
New Topic 29 Mar 2015:
The Yfull Tree includes ages of SNP
nodes calculated from the number of SNPs
in segments between nodes, using an
average SNP mutation rate. These serve
as reasonable estimates for the ages of the corresponding haplogroups, although there are
caveats, next topic. Yfull calibrates
SNP mutation rate to a very old haplogroup, of assumed age. Yfull does not document the details of their
method; if they mix SNPs found by
different methods (as most people do) that would introduce an
inconsistency. These Yfull ages are
consistent insofar as they are calculated from SNPs as found by a single
consistent method, although there still are caveats, next topic. Other methods may come up with different
numbers of SNPs, and different rates, so the Yfull ages may not be the same as
those calculated by others.
Rewrite 29 Mar 2015. Edit 10 Jun 2016.
Ages
can be calculated using either STRs or SNPs.
In either case, an average mutation rate is used to calculate the age
from the observed mutations.
There are several biases involved,
and I mention some of them in this topic about age caveats. I don’t emphasize ages in my web pages
because of the uncertainties, but I occasionally discuss rough age
calculations.
With STRs, people generally use ASD
(to account for back mutations) calculated for each marker, then average the markers. I provide “ASD” sheets in my type files with a
simple ASD method for calculating age, but again, I consider this a rough
approximation. I provide an
introduction to STR age calculations in my Fall 2009 Publication.
Some publications use a mutation
rate from father-son data. This method
yields too high a rate insofar as somatic mutations (mutations in the cell
lines leading to the test - cheek cells, for example) are included, so the
calculated ages are too young.
Chandler
published a method for accurately determining relative STR mutation rates, and
calibrated the first 37 standard
STR markers, to father-son data.
Extensions of Chandler’s 37 to more markers are available on-line, but
without explanation.
Mutation rates can be calibrated to
an old haplogroup node TMRCA. There are still remaining caveats, including
the uncertainty in the age of that old TMRCA.
With a fixed mutation rate, older
haplogroup nodes tend to come out too young.
This is due to the structure introduced by the “pruning” of haplogroup
branches that go extinct. To
compensate, some people use adjustment factors for older haplogroup nodes; some people use calculation algorithms that
are mathematically equivalent to a mutation rate that decreases with age.
Almost all DNA damage is repaired
by various cellular mechanisms, so the “mutation rate” measures only the damage
that is not repaired. Repair mechanisms
vary from person to person due to variations in minor damage to the repair
mechanisms. An ancestral line mutation
rate depends on the probability of a few ancestors with much higher than normal
mutation rate, so mutation rate varies more between ancestral lines than would be
expected from simple random statistics.
It is important to calculate the
+/- confidence range for data based on very few samples, using standard Poisson
statistics for a small number of samples.
Many reports of age calculations include such a confidence range. For large samples, however, this is
misleading. The confidence range comes
out small for large samples, with a small +/- on the age. The various age caveats, however, provide a
much larger uncertainty due to bias. In
other words, with a large sample, we have excellent confidence that another
large sample, taken from the same population, will provide the same age result
within a small confidence range.
However, insofar as any age calculation is more uncertain due to the
biases introduced by the various caveats, those small confidence ranges are
meaningless.
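To illustrate the small-sample case, here is a minimal Python sketch of an exact (Garwood style) Poisson confidence range for an observed count of mutations. The function names are hypothetical, and the sketch covers only the statistical +/- range; it says nothing about the bias from the age caveats. It assumes small counts, since exp(-mu) underflows for very large mu.

```python
import math

def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu)."""
    if k < 0:
        return 0.0
    term = math.exp(-mu)
    total = term
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return total

def _solve_increasing(g, lo, hi, iterations=200):
    """Bisection for the root of an increasing function with g(lo) < 0 < g(hi)."""
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def poisson_interval(k, confidence=0.95):
    """Exact two-sided confidence range for the Poisson mean, given count k."""
    alpha = 1.0 - confidence
    hi = k + 10.0 * math.sqrt(k + 1.0) + 10.0  # generous upper search limit
    if k == 0:
        lower = 0.0
    else:
        # Lower bound: the mean at which P(X >= k) equals alpha / 2.
        lower = _solve_increasing(
            lambda mu: (1.0 - poisson_cdf(k - 1, mu)) - alpha / 2.0, 0.0, hi)
    # Upper bound: the mean at which P(X <= k) equals alpha / 2.
    upper = _solve_increasing(
        lambda mu: alpha / 2.0 - poisson_cdf(k, mu), 0.0, hi)
    return lower, upper

low3, high3 = poisson_interval(3)
```

With only 3 observed mutations the 95% range runs from about 0.62 to about 8.77 expected mutations, so an age calculated from that count carries the same wide range before any bias is considered.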
I mention only some caveats here as
examples. There are more. This web page is not the place for a review
of all age caveats. I have never seen
an article with a thorough review of the caveats associated with age
calculations of ancestral lines based on mutation rates. I suppose genetic genealogists are rarely
statistics experts, and I suppose statistics experts avoid publishing such
articles, because the emphasis would be that most genetic genealogy age
calculations are not quite right.
Summary: There is no known way to account for all caveats when calculating
the age of Y-DNA nodes or haplogroups.
Nevertheless, rough age calculations can be interesting, as long as we
realize the results are not very accurate.
Rewrite 10 Jun 2016.
A “population bottleneck” means a
significant reduction in population followed by a significant increase in
population.
Population bottlenecks generally
reduce genetic variation. There are
other reasons, covered briefly toward the end of this topic.
I intend this topic to be a
discussion of reasons for isolated STR
types, which are one instance of reduced genetic
variation. As explained in other topics
(follow the links), isolated STR types are usually the same as smooth branch
segments of the human Y-DNA tree, which are usually the same as haplogroups that have a large number of phyloequivalent
SNPs. In this topic I intend the
word “type” to be shorthand for reduced genetic variation that results in a
type, or a smooth segment, or a large number of phyloequivalent SNPs.
Most
male lines go extinct. If you are not
familiar with this statistical fact, please see the discussion about extinction.
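For a rough feel for the numbers, here is a minimal Python sketch of a standard branching-process calculation. It assumes, purely for illustration, that each man has a Poisson distributed number of sons with a fixed mean; the function name and that model are assumptions, not part of the extinction discussion referred to above.

```python
import math

def extinction_probability(mean_sons, iterations=10000):
    """Probability that a single male line eventually dies out.

    Assumes a Poisson number of sons per man. The probability is the
    smallest solution q of q = exp(mean_sons * (q - 1)), found here by
    repeated substitution starting from q = 0.
    """
    q = 0.0
    for _ in range(iterations):
        q = math.exp(mean_sons * (q - 1.0))
    return q

q_shrinking = extinction_probability(0.9)  # population slowly shrinking
q_growing = extinction_probability(1.5)    # population growing quickly
```

Under this model, when the mean number of sons is below one, extinction of any given line is essentially certain; even in a growing population a sizable fraction of lines still dies out.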
Because male lines tend to go
extinct, the Y-DNA tree tends to prune itself with time,
so technically a bottleneck is not required for reduction of genetic
variation. However, humans have
surprisingly low genetic variation;
many geneticists who have worked out the statistics have concluded that
the human population must have suffered an extreme bottleneck, because a
surprisingly low human population seems required to understand the low genetic
variation.
I have never seen a statistical
analysis of the distinction between (a) one extreme bottleneck, vs (b) multiple
bottlenecks not so extreme, vs (c) a prolonged period of moderately low
population (not low enough to be called a bottleneck). I suppose a detailed study of types can provide
a distinction. Consider the nodes corresponding to the TMRCAs of the
oldest types. If
those TMRCAs turn out to be clustered at mostly one particular age,
that would be evidence of a single extreme bottleneck early in human
history. It does not seem to be coming
out that way, so I have my doubts about the common assumption of one
bottleneck.
If those TMRCAs tend to cluster at
a few age values, that would be evidence of a few bottlenecks at different
times - perhaps with different continental localization.
Most nodes have
two downstream branches, but some nodes are bushy, with several branches. A bushy node is evidence of a significant rapid population
expansion at that node (any node - not just the node of a type).
Actually, it’s a bit more
complex: We should consider a node
bushy if there are several branches within a short time distance downstream, where “short time distance” means segments
with fewer phyloequivalent SNPs than typical.
A region of the Y-DNA tree with no
bushy nodes is evidence that there was no significant rapid population
expansion for that particular Y-DNA population. With no bushy node there is no evidence of a bottleneck. A type node that is not bushy (immediately
downstream of the TMRCA of a type) is evidence of a prolonged moderately low
population (the upstream smooth branch segment) without a bottleneck (no bush).
I say “evidence” and “significant”
because statistically, we expect rare bushy nodes just due to luck even without
population expansion; similarly we
expect rare cases of population expansion without bushy nodes. So we cannot draw conclusions about the
population structure of one particular node, for the same reasons that we
cannot decide on the basis of statistics alone if a person winning lots of
money in a short time at a poker game is cheating or just lucky. All this discussion is for the human Y-DNA
tree as a whole, not individual types.
I have not seen a statistical
analysis (or computer simulation) that figures out what fraction of human Y-DNA
nodes should be bushy just due to random fluctuations without significant rapid
population expansion. It seems it should
be possible to do such an analysis; I’m
not aware if it has been attempted. Of
course, a model of human population is needed for such an analysis. I suppose the simplest model would be a
Poisson distribution for the frequency distribution of branches per node. Such an analysis could be compared to the
actual human Y-DNA tree for an estimate of excess population bottlenecks (bushy
nodes in excess of statistical expectation), which could be interpreted as the
frequency of real population bottlenecks and not just statistical flukes, for
the tree as a whole.
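A minimal Python sketch of the simplest version of such a calculation follows. It assumes the Poisson model suggested above, conditions on a node having at least two branches (otherwise it is not a node), and treats “bushy” as meaning at least some threshold number of branches. The function names, the mean, and the threshold are all hypothetical choices for illustration.

```python
import math

def poisson_tail(k, mean):
    """P(X >= k) for X ~ Poisson(mean)."""
    cdf = sum(math.exp(-mean) * mean ** i / math.factorial(i) for i in range(k))
    return 1.0 - cdf

def expected_bushy_fraction(mean_branches, bushy_threshold=4):
    """Fraction of nodes expected to be bushy by chance alone.

    A node needs at least 2 branches, so the fraction is
    P(X >= bushy_threshold) / P(X >= 2) under the assumed Poisson model.
    """
    return poisson_tail(bushy_threshold, mean_branches) / poisson_tail(2, mean_branches)

fraction = expected_bushy_fraction(1.0, bushy_threshold=3)
```

Comparing such an expected fraction with the fraction of bushy nodes counted in the actual tree would give the excess discussed above, although a realistic analysis would also need the time structure of the tree, which this sketch ignores.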
The effective population might be a lot smaller
than the actual population. For example
if in the past human population structure was dominated by family clans or
tribes (small Y-DNA clades) competing with each other, and with most clans going
extinct, and with rare clans surviving to grow and split into many new clans
and thereby continue the competition, then humans would have low genetic
diversity; the Y-DNA tree would have
very long smooth branch segments between the oldest nodes, producing SNP
haplogroups that can be distinguished as isolated STR
types. You might
say this is an explanation different than population bottlenecks. Or, you might say this is the same
explanation, with effective population bottlenecks, where the full
population never gets particularly small, but the breeding population (that
produces future population) is small.
Another example of an effective
population bottleneck is the Genghis Khan Y-DNA clade, with an effective
founding population of only one man, starting a large male line - a Y-DNA clade
- in just a few generations. We do not
expect a Genghis Khan clade to be an isolated type, because he no doubt belonged
to an existing haplogroup, and there is no historical evidence of a population
bottleneck shortly after he lived. When
the Y-DNA tree was pruned in the past, for whatever reason, any “Genghis Khan
style” clade would have had a statistical advantage, due to size, of surviving
and forming a type.
Migration
can be an explanation for an apparently isolated STR type. I say “apparent” because an immigrant
haplogroup can be isolated in one region if that haplogroup was absent before
immigration. The result would be a type
that is valid in only that region. If
the home (emigrant) region is today undeveloped, so that few men from
that home region submit DNA samples to on-line databases, then that would be an apparent
type, not truly valid worldwide. On
the other hand, if there had been a severe population bottleneck only in the
home region after the emigration, then the descendants of the migration would
form a valid type today.
You can probably think of more
examples of effective population bottlenecks.
I am not aware of a statistical method to distinguish population
bottlenecks from effective population bottlenecks.
I have no conclusions to provide
here. My purpose for this discussion is
to have one place to refer with links when I discuss various types.
Rewrite 5 Jul 2016
This topic will be dropped in about
a year or so, after I rewrite the topics that link here. In the past, most Polish Project assignments were based on STR predictions, where appropriate SNPs were not available. This topic was a detailed explanation of my
statistical methods. With the flood of
new SNPs, and the lower cost, it is now usually easy to say which SNPs need to
be tested for particular samples.
My publication
explains the basis of my STR prediction methods.
Update 21 Jan 2015:
I look forward to the discovery of SNPs validating more than 80%, probably more
than 90%, of my type predictions.
I introduced P, N, and K in the
Fall of 2007, publishing this web page 6 Dec of that year. I did not predict that P and N were brother clades; in fact it looked to me like P was
closer to K. I did not make predictions
about the P, N, K structure because the statistics did not justify such
predictions. I assigned samples to P
and N with 80% probability,
remarking that my overall confidence
that P and N were valid (confidence at step
zero) was 95% in 2008. I stated my
overall confidence in the subtypes of K type as only 80%, and without high
confidence that the various subtypes of K actually belonged to a single unique
K clade.
P
type has been validated by the SNP L260.
N
type has been validated by the SNP CTS11962.
K, which never qualified as a type,
represents the R1a modal haplotype.
Today it is clear that K is not a clade. Many of my predicted subtypes of K have been verified by SNPs,
and so far none of them have been shown to be invalid.
In Fall 2007 I also introduced R (Remainder) as the 4th division of
Polish R1a, for those samples that do not fit P, N, or K. R was never intended as a clade. The R category is no longer used because
there are many branches of R1a known today, so that each sample with sufficient
STRs can be confidently predicted into a branch.
This topic uses R1a as an example,
but the same discussion applies to other haplogroup assignments.
My publications have several
references of general interest and relevance to my web documents.
My Tools
and data for STR analysis are Excel files. These are available at the JoGG publication site as Supplementary
Data: www.jogg.info/52/files/cpcindex.htm.
Polish
Clades Update. This folder is for
update of Tools and for new data: www.gwozdz.org/PolishCladesUpdate
Pawlowski
(2002) Arch Med Sadowej Kryminol 52(4):261 (in Polish). This reference is listed in my
publications. I specifically mention it
here because this is where I originally found the common Polish haplotype that I now call P type.
Link to English abstract: Pawlowski
2002.
Lawrence Mayka
is the Administrator of the Polish
Project. Larry helped me to get
started when I was new to genetic genealogy, providing helpful criticism &
suggestions. He reviews and makes
suggestions regarding this Polish Project web page of mine. He also reviewed the original drafts of my
publications. A number of my types were originally suggested to me as STR
clusters by Larry. Larry continues to
provide data for this web page. Many of
my references to other websites in this document were suggested to me by Larry.
Cyndi Rutledge
is the administrator of the R1aY-Haplogroup
Project. Larry and Cyndi had been
sending me M458 test results when that SNP was
new. SNP results are now listed at
project web pages.
Anatole Klyosov
published a pair of articles about STR clusters in the same Fall issue of JoGG that has my
pair of publications. Some of the STR
types that I independently discovered I later found as 25 marker modal
haplotypes in Klyosov’s web documents (before his publication in JoGG - some in Russian). It was encouraging to me to see independent identification of
clusters by different methods. He
emailed to me an English version of one of his 2008 publications. His Fall JoGG articles have references to
his other publications. Here is a web
link: Klyosov Home.
Russian
web sites: Semargl,
http://www.r1a.org/; http://www.rodstvo.ru/; http://dnatree.ru/; http://molgen.org/. These have been active analyzing R1a,
brought to my attention by others, particularly by Mayka, who worked out a correlation with my types. These sites clearly have proposed
subdivisions of R1a based on STR data, but I cannot quickly understand these
due to the language barrier. Klyosov seems to be active at these
sites. The sites make use of the FTDNA projects and Ysearch.
Kenneth Nordtvedt
published an article about calculating TMRCA
in the Fall 2008 issue of JoGG. His Excel files of data and tools are
available at his web site. Ken has been active in web discussions,
suggesting many STR based clusters.
FTDNA
link: http://www.familytreedna.com/. This is a commercial DNA testing
company. I make extensive use of the
project databases maintained by FTDNA.
These are my primary sources of data.
Click on the “Projects” tab at the home page to look for projects. Also, the project name can be substituted
for /polish/ in the Polish Project
link, below. I do not work for
FTDNA; many other companies offer DNA
tests; I recommend FTDNA because I like
the convenience of most DNA data being available at the projects, particularly
the Polish Project.
WTY. “Walk Through the Y”. This is an obsolete commercial product by FTDNA, for reading more than 200,000 base
pairs of your Y chromosome, in a search for new SNPs in your branch of the
Y-DNA tree. You can read about my WTY at another of my web pages. WTY has been replaced by Big Y.
Big
Y: Replacement for WTY. Discussion of Big Y.
Polish
Project link: https://www.familytreedna.com/groups/polish/about/background. One of many FTDNA projects. This is my primary source for Polish
data. The Polish Project tracks both
Y-DNA and mtDNA. The Y-DNA STR data
that I use is at https://www.familytreedna.com/public/polish?iframe=yresults. The Y-DNA SNP data is at https://www.familytreedna.com/public/polish?iframe=ysnp.
Paul Stone
is an administrator of the Polish Project,
with emphasis on the I1 haplogroup.
R1a
Project link: http://www.familytreedna.com/public/R1a. Largest R1a project, with multiple
co-administrators, active in subdividing R1a data into hypothetical
haplogroups. The project home page has
a summary chart of R1a SNP subdivision, and other reference links. The administrators have been very helpful to
me, particularly Michal Milewski, Lukasz Lapinski, Artur Martyka, and Lawrence Mayka.
R1aY-Haplogroup
Project link: www.familytreedna.com/public/R1aY-Haplogroup. Original R1a project.
Ysearch
link: http://www.ysearch.org/. Ysearch is the largest web database for
Y-DNA, run by FTDNA, open to all men, including men who also register with
projects and including men with data from other testing services. From the FTDNA site, you can register your data
with Ysearch. Or you can type your
Y-STR data into Ysearch. I have Instructions for use of
Ysearch. I use Ysearch often for
analysis so of course I encourage you to register your Y-DNA data at
Ysearch. I am not associated with the
company FTDNA.
Yseq: http://www.yseq.net/. A company that provides Y-SNP tests at
a competitive price with fast turnaround.
Yfull: http://www.yfull.com/. Yfull SNP tree: http://www.yfull.com/tree/
Yhrd
link: http://www.yhrd.org/. A forensic Y-DNA database. Data is separated by city, with many Polish
cities. I relied on Yhrd to figure out
the geography of the various haplotypes.
Semargl. R1a site by Vladimir Tangankin. R1a tree in pie chart format using 111
marker data, Oct 2012: http://www.semargl.me/blog/wp-content/uploads/2012/10/R1aTree20121009tmb700.png
Blowup: http://www.semargl.me/blog/wp-content/uploads/2012/10/R1aTree20121009.png
Sorenson
link: http://www.smgf.org/. Another DNA testing company.
ISOGG
link: http://isogg.org/tree/ Y-DNA tree SNPs and corresponding alphanumeric
codes for the haplogroups.
FTDNA
Draft Tree link: http://ytree.ftdna.com/index.php?name=Draft
Another Y-DNA tree with SNPs, but not updated in more than a year.
recLOH: A technical detail discussed in many
publications, for example http://en.wikipedia.org/wiki/RecLOH. I discuss this and other compound marker issues, and how step is calculated,
in the “Documentation” sheet for my Calculator.xls
tool.
DYS389:
Another technical detail, also discussed on the web and in my
Calculator.xls. Briefly, 389II is the
sum of 389I plus another STR, so 389II should be figured in terms of the delta
value.
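A minimal Python sketch of the delta idea follows. It is not the calculation in Calculator.xls; the function names are hypothetical, and step is assumed here to be a simple sum of absolute differences over shared markers, which may differ from the documented method.

```python
def with_389_delta(haplotype):
    """Return a copy in which DYS389II is replaced by 389II minus 389I."""
    adjusted = dict(haplotype)
    if "DYS389I" in adjusted and "DYS389II" in adjusted:
        adjusted["DYS389II"] = adjusted["DYS389II"] - adjusted["DYS389I"]
    return adjusted

def step_distance(a, b):
    """Sum of absolute differences over shared markers, using the 389 delta."""
    a, b = with_389_delta(a), with_389_delta(b)
    return sum(abs(a[m] - b[m]) for m in a if m in b)

# Made-up values: one mutation in 389I also shifts the reported 389II by one.
hap1 = {"DYS389I": 13, "DYS389II": 29}
hap2 = {"DYS389I": 14, "DYS389II": 30}

naive = sum(abs(hap1[m] - hap2[m]) for m in hap1)  # counts the mutation twice
corrected = step_distance(hap1, hap2)              # counts it once
```

Without the delta, a single mutation in 389I shows up as two steps, because the reported 389II value includes 389I.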
Chandler
mutation rates. Mentioned in my
publication. From Chandler, Fall 2006 http://www.jogg.info/, 37 markers. 67 marker extension on line at mutation
rates.
Peter Gwozdz
I’m a very rare type in Poland - E-L540. My maternal
1st cousins are R1a1a. That means my
late maternal grandfather was R1a1a. I
became interested in Y-DNA in 2004. My
maternal family name is Iwanowicz. I
discovered a family with that name in my maternal grandfather’s home town in
Poland. They are the only Iwanowicz
family within 50 miles, so I was suspicious they might be my 3rd or 4th
cousins. I brought a cheek swab kit
when I visited them the second time in 2006.
Sure enough, the son was a perfect 25 STR marker match to my 1st
cousin. I didn’t get around to checking
the web for a year. I was shocked to
discover that these maternal cousins matched 80 people in the FTDNA data base, for a perfect 12 out of 12
STR markers. That’s a hell of a lot of
matches in the summer of 2007. Most of
these matches are Polish. I did some
research and found an article by Pawlowski
(reference in my publication)
about this most common Polish Y-DNA haplotype, which I now call P type.
That got me interested in doing more research, leading to this web page
for others to see my results. My
experience, however, is a reminder that statistics can be misleading. I was confident that my grandfather’s
haplotype was P type, based on a perfect match at the first 12 markers. In June 2010 I realized that the probability
was really about 93%, because 13 out of the 14 then current Polish Project
members who had 67 markers and who also matched P type perfectly at 12 markers
were in fact P type as judged by all 67 markers. My grandfather does not match P type at 67 markers. My grandfather is that 14th one. He matches the small clade that I named I type, after Iwanowicz. I type has since been verified as haplogroup
S18681, which is also concentrated in
Poland. That’s how an outsider ended up
studying P type and R1a1a, and writing web pages and articles about common
Polish Y-DNA clades. This web page was
originally called “R1a”; it got so many
hits from Poland that I eventually renamed it to include all common Y-DNA
clades.
Revision History
2007 Dec 6 First web posting of this file
2007 Dec Two revisions
2008 8 revisions
2009 33 revisions
2010 36 revisions
2011 26 revisions
2012 18 revisions
2013 & 2014 10 revisions
2015 Jan to Mar 8
revisions
2015 Aug to Nov 15
revisions
2016 Feb 24 Full
update of the Results Table and Abstract
2016 Apr 1 Edit of
I1a-P; also misc minor edits
2016 Apr 15 Rewrite
of Np Cluster (YP515)
2016 Apr 16 Nt topic
deleted (Nt cluster, speculative, used only briefly for assignments early
2012; not a valid clade)
2016 Apr 16
Na,b,c,d,e: 5 topics deleted (Speculative from 2012, never used for
assignments; not valid clades)
2016 Apr 16 Excel
files - associated with these speculative clusters - also deleted
2016 Apr 17 Ns
Cluster rewrite
2016 Apr 18 Ng Type
rewrite
2016 Apr 20 Update
PCI table; Nashk Type rewrite
2016 Apr 22 N Type
rewrite
2016 Apr 23 L1029
rewrite; P type edit; M458 rewrite
2016 Apr 24 new
Results topic, rewrite of Abstract and R1a Abstract
2016 Apr 25 edit
lead paragraphs of “Description of the R1a Branches”
2016 Apr 25 delete
topic “L260 and M458 Signatures”
2016 Apr 25 delete
topic “37 Marker Network”
2016 Apr 25 quite a
bit of minor editing, including bookmarks
2016 Apr 25 Z92
rewrite
2016 Apr 26 delete stubs
for speculative Kt, Ku, Kw, Ky
2016 Apr 28 Z93 and
A type rewrite
2016 Jun 10
Rewrite: Age Caveats, Introduction, My
Mountain Method, Population Bottlenecks
2016 Jun 11 Rewrite
E type and Z92
2016 Jun 12 Edit P
type
2016 Jun 30 Rewrite
of B type. Edit of a couple topics
2016 Jul 5 Rewrite
of D type. Edit of B type. C type R1a1a (M198+,M417-) deleted (2012
last C type update explained this is obsolete). Rewrite of topic Probability and Confidence. Rewrite of topic Polish Project
Assignments. Minor edits of other
topics
2016 Sep 18 Update of the Results Table with July 6 database
2016 Sep 19 Update of G Type. Update of J Type (YP977)
2016 Sep 19 Update of frequency % in the PCI Table.