Initial commit.

2025-10-11 18:08:04 +00:00
commit 8947da88eb
43 changed files with 7850 additions and 0 deletions


@@ -0,0 +1,11 @@
# Custom Dictionary Words
dnsmasq
dpkg
ftpd
GOARCH
oklch
postinst
postrm
prerm
shadcn
wildcloud

.envrc Normal file (13 lines)

@@ -0,0 +1,13 @@
# API dev
export WILD_CENTRAL_ENV=development
export WILD_CENTRAL_DATA=$PWD/data
export WILD_DIRECTORY=$PWD/directory
# CLI/App dev
export WILD_DAEMON_URL=http://localhost:5055
export WILD_CLI_DATA=$HOME/.wildcloud
# Source activate.sh in interactive shells
if [[ $- == *i* ]]; then
  source ./activate.sh
fi

.gitignore vendored Normal file (38 lines)

@@ -0,0 +1,38 @@
# Amplifier
amplifier/
tools/
.data
pyproject.toml
# Python artifacts
__pycache__/
*.py[cod]
*$py.class
*.egg-info/
.venv/
uv.lock
# VSCode
.vscode/
*.code-workspace
# Claude AI configuration
.claude/
# Wild Cloud
data/
# Development working dir
.working/
__debug*
compact__
.lock
# Compiled binaries
wild-api/daemon
wild-api/wildd
wild-cli/wild
bin
dist

.gitmodules vendored Normal file (3 lines)

@@ -0,0 +1,3 @@
[submodule "wild-directory"]
path = wild-directory
url = https://git.civilsociety.dev/wild-cloud/wild-directory.git

CLAUDE.md Normal file (21 lines)

@@ -0,0 +1,21 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
This project is called "Wild Cloud Central". It consists of the following components:
- **Wild Daemon**:
  - @daemon/README.md
  - A web server that provides an API for managing Wild Cloud instances.
- **Wild CLI**:
  - @cli/README.md
  - A command-line interface for interacting with the Wild Daemon and managing Wild Cloud clusters.
- **Wild App**:
  - @app/README.md
  - A web-based interface for managing Wild Cloud instances, hosted on Wild Central.
Read all of the following for context:
- @ai/BUILDING_WILD_CENTRAL.md

LICENSE Normal file (661 lines)

@@ -0,0 +1,661 @@
GNU AFFERO GENERAL PUBLIC LICENSE
Version 3, 19 November 2007
Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
Preamble
The GNU Affero General Public License is a free, copyleft license for
software and other kinds of works, specifically designed to ensure
cooperation with the community in the case of network server software.
The licenses for most software and other practical works are designed
to take away your freedom to share and change the works. By contrast,
our General Public Licenses are intended to guarantee your freedom to
share and change all versions of a program--to make sure it remains free
software for all its users.
When we speak of free software, we are referring to freedom, not
price. Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
them if you wish), that you receive source code or can get it if you
want it, that you can change the software or use pieces of it in new
free programs, and that you know you can do these things.
Developers that use our General Public Licenses protect your rights
with two steps: (1) assert copyright on the software, and (2) offer
you this License which gives you legal permission to copy, distribute
and/or modify the software.
A secondary benefit of defending all users' freedom is that
improvements made in alternate versions of the program, if they
receive widespread use, become available for other developers to
incorporate. Many developers of free software are heartened and
encouraged by the resulting cooperation. However, in the case of
software used on network servers, this result may fail to come about.
The GNU General Public License permits making a modified version and
letting the public access it on a server without ever releasing its
source code to the public.
The GNU Affero General Public License is designed specifically to
ensure that, in such cases, the modified source code becomes available
to the community. It requires the operator of a network server to
provide the source code of the modified version running there to the
users of that server. Therefore, public use of a modified version, on
a publicly accessible server, gives the public access to the source
code of the modified version.
An older license, called the Affero General Public License and
published by Affero, was designed to accomplish similar goals. This is
a different license, not a version of the Affero GPL, but Affero has
released a new version of the Affero GPL which permits relicensing under
this license.
The precise terms and conditions for copying, distribution and
modification follow.
TERMS AND CONDITIONS
0. Definitions.
"This License" refers to version 3 of the GNU Affero General Public License.
"Copyright" also means copyright-like laws that apply to other kinds of
works, such as semiconductor masks.
"The Program" refers to any copyrightable work licensed under this
License. Each licensee is addressed as "you". "Licensees" and
"recipients" may be individuals or organizations.
To "modify" a work means to copy from or adapt all or part of the work
in a fashion requiring copyright permission, other than the making of an
exact copy. The resulting work is called a "modified version" of the
earlier work or a work "based on" the earlier work.
A "covered work" means either the unmodified Program or a work based
on the Program.
To "propagate" a work means to do anything with it that, without
permission, would make you directly or secondarily liable for
infringement under applicable copyright law, except executing it on a
computer or modifying a private copy. Propagation includes copying,
distribution (with or without modification), making available to the
public, and in some countries other activities as well.
To "convey" a work means any kind of propagation that enables other
parties to make or receive copies. Mere interaction with a user through
a computer network, with no transfer of a copy, is not conveying.
An interactive user interface displays "Appropriate Legal Notices"
to the extent that it includes a convenient and prominently visible
feature that (1) displays an appropriate copyright notice, and (2)
tells the user that there is no warranty for the work (except to the
extent that warranties are provided), that licensees may convey the
work under this License, and how to view a copy of this License. If
the interface presents a list of user commands or options, such as a
menu, a prominent item in the list meets this criterion.
1. Source Code.
The "source code" for a work means the preferred form of the work
for making modifications to it. "Object code" means any non-source
form of a work.
A "Standard Interface" means an interface that either is an official
standard defined by a recognized standards body, or, in the case of
interfaces specified for a particular programming language, one that
is widely used among developers working in that language.
The "System Libraries" of an executable work include anything, other
than the work as a whole, that (a) is included in the normal form of
packaging a Major Component, but which is not part of that Major
Component, and (b) serves only to enable use of the work with that
Major Component, or to implement a Standard Interface for which an
implementation is available to the public in source code form. A
"Major Component", in this context, means a major essential component
(kernel, window system, and so on) of the specific operating system
(if any) on which the executable work runs, or a compiler used to
produce the work, or an object code interpreter used to run it.
The "Corresponding Source" for a work in object code form means all
the source code needed to generate, install, and (for an executable
work) run the object code and to modify the work, including scripts to
control those activities. However, it does not include the work's
System Libraries, or general-purpose tools or generally available free
programs which are used unmodified in performing those activities but
which are not part of the work. For example, Corresponding Source
includes interface definition files associated with source files for
the work, and the source code for shared libraries and dynamically
linked subprograms that the work is specifically designed to require,
such as by intimate data communication or control flow between those
subprograms and other parts of the work.
The Corresponding Source need not include anything that users
can regenerate automatically from other parts of the Corresponding
Source.
The Corresponding Source for a work in source code form is that
same work.
2. Basic Permissions.
All rights granted under this License are granted for the term of
copyright on the Program, and are irrevocable provided the stated
conditions are met. This License explicitly affirms your unlimited
permission to run the unmodified Program. The output from running a
covered work is covered by this License only if the output, given its
content, constitutes a covered work. This License acknowledges your
rights of fair use or other equivalent, as provided by copyright law.
You may make, run and propagate covered works that you do not
convey, without conditions so long as your license otherwise remains
in force. You may convey covered works to others for the sole purpose
of having them make modifications exclusively for you, or provide you
with facilities for running those works, provided that you comply with
the terms of this License in conveying all material for which you do
not control copyright. Those thus making or running the covered works
for you must do so exclusively on your behalf, under your direction
and control, on terms that prohibit them from making any copies of
your copyrighted material outside their relationship with you.
Conveying under any other circumstances is permitted solely under
the conditions stated below. Sublicensing is not allowed; section 10
makes it unnecessary.
3. Protecting Users' Legal Rights From Anti-Circumvention Law.
No covered work shall be deemed part of an effective technological
measure under any applicable law fulfilling obligations under article
11 of the WIPO copyright treaty adopted on 20 December 1996, or
similar laws prohibiting or restricting circumvention of such
measures.
When you convey a covered work, you waive any legal power to forbid
circumvention of technological measures to the extent such circumvention
is effected by exercising rights under this License with respect to
the covered work, and you disclaim any intention to limit operation or
modification of the work as a means of enforcing, against the work's
users, your or third parties' legal rights to forbid circumvention of
technological measures.
4. Conveying Verbatim Copies.
You may convey verbatim copies of the Program's source code as you
receive it, in any medium, provided that you conspicuously and
appropriately publish on each copy an appropriate copyright notice;
keep intact all notices stating that this License and any
non-permissive terms added in accord with section 7 apply to the code;
keep intact all notices of the absence of any warranty; and give all
recipients a copy of this License along with the Program.
You may charge any price or no price for each copy that you convey,
and you may offer support or warranty protection for a fee.
5. Conveying Modified Source Versions.
You may convey a work based on the Program, or the modifications to
produce it from the Program, in the form of source code under the
terms of section 4, provided that you also meet all of these conditions:
a) The work must carry prominent notices stating that you modified
it, and giving a relevant date.
b) The work must carry prominent notices stating that it is
released under this License and any conditions added under section
7. This requirement modifies the requirement in section 4 to
"keep intact all notices".
c) You must license the entire work, as a whole, under this
License to anyone who comes into possession of a copy. This
License will therefore apply, along with any applicable section 7
additional terms, to the whole of the work, and all its parts,
regardless of how they are packaged. This License gives no
permission to license the work in any other way, but it does not
invalidate such permission if you have separately received it.
d) If the work has interactive user interfaces, each must display
Appropriate Legal Notices; however, if the Program has interactive
interfaces that do not display Appropriate Legal Notices, your
work need not make them do so.
A compilation of a covered work with other separate and independent
works, which are not by their nature extensions of the covered work,
and which are not combined with it such as to form a larger program,
in or on a volume of a storage or distribution medium, is called an
"aggregate" if the compilation and its resulting copyright are not
used to limit the access or legal rights of the compilation's users
beyond what the individual works permit. Inclusion of a covered work
in an aggregate does not cause this License to apply to the other
parts of the aggregate.
6. Conveying Non-Source Forms.
You may convey a covered work in object code form under the terms
of sections 4 and 5, provided that you also convey the
machine-readable Corresponding Source under the terms of this License,
in one of these ways:
a) Convey the object code in, or embodied in, a physical product
(including a physical distribution medium), accompanied by the
Corresponding Source fixed on a durable physical medium
customarily used for software interchange.
b) Convey the object code in, or embodied in, a physical product
(including a physical distribution medium), accompanied by a
written offer, valid for at least three years and valid for as
long as you offer spare parts or customer support for that product
model, to give anyone who possesses the object code either (1) a
copy of the Corresponding Source for all the software in the
product that is covered by this License, on a durable physical
medium customarily used for software interchange, for a price no
more than your reasonable cost of physically performing this
conveying of source, or (2) access to copy the
Corresponding Source from a network server at no charge.
c) Convey individual copies of the object code with a copy of the
written offer to provide the Corresponding Source. This
alternative is allowed only occasionally and noncommercially, and
only if you received the object code with such an offer, in accord
with subsection 6b.
d) Convey the object code by offering access from a designated
place (gratis or for a charge), and offer equivalent access to the
Corresponding Source in the same way through the same place at no
further charge. You need not require recipients to copy the
Corresponding Source along with the object code. If the place to
copy the object code is a network server, the Corresponding Source
may be on a different server (operated by you or a third party)
that supports equivalent copying facilities, provided you maintain
clear directions next to the object code saying where to find the
Corresponding Source. Regardless of what server hosts the
Corresponding Source, you remain obligated to ensure that it is
available for as long as needed to satisfy these requirements.
e) Convey the object code using peer-to-peer transmission, provided
you inform other peers where the object code and Corresponding
Source of the work are being offered to the general public at no
charge under subsection 6d.
A separable portion of the object code, whose source code is excluded
from the Corresponding Source as a System Library, need not be
included in conveying the object code work.
A "User Product" is either (1) a "consumer product", which means any
tangible personal property which is normally used for personal, family,
or household purposes, or (2) anything designed or sold for incorporation
into a dwelling. In determining whether a product is a consumer product,
doubtful cases shall be resolved in favor of coverage. For a particular
product received by a particular user, "normally used" refers to a
typical or common use of that class of product, regardless of the status
of the particular user or of the way in which the particular user
actually uses, or expects or is expected to use, the product. A product
is a consumer product regardless of whether the product has substantial
commercial, industrial or non-consumer uses, unless such uses represent
the only significant mode of use of the product.
"Installation Information" for a User Product means any methods,
procedures, authorization keys, or other information required to install
and execute modified versions of a covered work in that User Product from
a modified version of its Corresponding Source. The information must
suffice to ensure that the continued functioning of the modified object
code is in no case prevented or interfered with solely because
modification has been made.
If you convey an object code work under this section in, or with, or
specifically for use in, a User Product, and the conveying occurs as
part of a transaction in which the right of possession and use of the
User Product is transferred to the recipient in perpetuity or for a
fixed term (regardless of how the transaction is characterized), the
Corresponding Source conveyed under this section must be accompanied
by the Installation Information. But this requirement does not apply
if neither you nor any third party retains the ability to install
modified object code on the User Product (for example, the work has
been installed in ROM).
The requirement to provide Installation Information does not include a
requirement to continue to provide support service, warranty, or updates
for a work that has been modified or installed by the recipient, or for
the User Product in which it has been modified or installed. Access to a
network may be denied when the modification itself materially and
adversely affects the operation of the network or violates the rules and
protocols for communication across the network.
Corresponding Source conveyed, and Installation Information provided,
in accord with this section must be in a format that is publicly
documented (and with an implementation available to the public in
source code form), and must require no special password or key for
unpacking, reading or copying.
7. Additional Terms.
"Additional permissions" are terms that supplement the terms of this
License by making exceptions from one or more of its conditions.
Additional permissions that are applicable to the entire Program shall
be treated as though they were included in this License, to the extent
that they are valid under applicable law. If additional permissions
apply only to part of the Program, that part may be used separately
under those permissions, but the entire Program remains governed by
this License without regard to the additional permissions.
When you convey a copy of a covered work, you may at your option
remove any additional permissions from that copy, or from any part of
it. (Additional permissions may be written to require their own
removal in certain cases when you modify the work.) You may place
additional permissions on material, added by you to a covered work,
for which you have or can give appropriate copyright permission.
Notwithstanding any other provision of this License, for material you
add to a covered work, you may (if authorized by the copyright holders of
that material) supplement the terms of this License with terms:
a) Disclaiming warranty or limiting liability differently from the
terms of sections 15 and 16 of this License; or
b) Requiring preservation of specified reasonable legal notices or
author attributions in that material or in the Appropriate Legal
Notices displayed by works containing it; or
c) Prohibiting misrepresentation of the origin of that material, or
requiring that modified versions of such material be marked in
reasonable ways as different from the original version; or
d) Limiting the use for publicity purposes of names of licensors or
authors of the material; or
e) Declining to grant rights under trademark law for use of some
trade names, trademarks, or service marks; or
f) Requiring indemnification of licensors and authors of that
material by anyone who conveys the material (or modified versions of
it) with contractual assumptions of liability to the recipient, for
any liability that these contractual assumptions directly impose on
those licensors and authors.
All other non-permissive additional terms are considered "further
restrictions" within the meaning of section 10. If the Program as you
received it, or any part of it, contains a notice stating that it is
governed by this License along with a term that is a further
restriction, you may remove that term. If a license document contains
a further restriction but permits relicensing or conveying under this
License, you may add to a covered work material governed by the terms
of that license document, provided that the further restriction does
not survive such relicensing or conveying.
If you add terms to a covered work in accord with this section, you
must place, in the relevant source files, a statement of the
additional terms that apply to those files, or a notice indicating
where to find the applicable terms.
Additional terms, permissive or non-permissive, may be stated in the
form of a separately written license, or stated as exceptions;
the above requirements apply either way.
8. Termination.
You may not propagate or modify a covered work except as expressly
provided under this License. Any attempt otherwise to propagate or
modify it is void, and will automatically terminate your rights under
this License (including any patent licenses granted under the third
paragraph of section 11).
However, if you cease all violation of this License, then your
license from a particular copyright holder is reinstated (a)
provisionally, unless and until the copyright holder explicitly and
finally terminates your license, and (b) permanently, if the copyright
holder fails to notify you of the violation by some reasonable means
prior to 60 days after the cessation.
Moreover, your license from a particular copyright holder is
reinstated permanently if the copyright holder notifies you of the
violation by some reasonable means, this is the first time you have
received notice of violation of this License (for any work) from that
copyright holder, and you cure the violation prior to 30 days after
your receipt of the notice.
Termination of your rights under this section does not terminate the
licenses of parties who have received copies or rights from you under
this License. If your rights have been terminated and not permanently
reinstated, you do not qualify to receive new licenses for the same
material under section 10.
9. Acceptance Not Required for Having Copies.
You are not required to accept this License in order to receive or
run a copy of the Program. Ancillary propagation of a covered work
occurring solely as a consequence of using peer-to-peer transmission
to receive a copy likewise does not require acceptance. However,
nothing other than this License grants you permission to propagate or
modify any covered work. These actions infringe copyright if you do
not accept this License. Therefore, by modifying or propagating a
covered work, you indicate your acceptance of this License to do so.
10. Automatic Licensing of Downstream Recipients.
Each time you convey a covered work, the recipient automatically
receives a license from the original licensors, to run, modify and
propagate that work, subject to this License. You are not responsible
for enforcing compliance by third parties with this License.
An "entity transaction" is a transaction transferring control of an
organization, or substantially all assets of one, or subdividing an
organization, or merging organizations. If propagation of a covered
work results from an entity transaction, each party to that
transaction who receives a copy of the work also receives whatever
licenses to the work the party's predecessor in interest had or could
give under the previous paragraph, plus a right to possession of the
Corresponding Source of the work from the predecessor in interest, if
the predecessor has it or can get it with reasonable efforts.
You may not impose any further restrictions on the exercise of the
rights granted or affirmed under this License. For example, you may
not impose a license fee, royalty, or other charge for exercise of
rights granted under this License, and you may not initiate litigation
(including a cross-claim or counterclaim in a lawsuit) alleging that
any patent claim is infringed by making, using, selling, offering for
sale, or importing the Program or any portion of it.
11. Patents.
A "contributor" is a copyright holder who authorizes use under this
License of the Program or a work on which the Program is based. The
work thus licensed is called the contributor's "contributor version".
A contributor's "essential patent claims" are all patent claims
owned or controlled by the contributor, whether already acquired or
hereafter acquired, that would be infringed by some manner, permitted
by this License, of making, using, or selling its contributor version,
but do not include claims that would be infringed only as a
consequence of further modification of the contributor version. For
purposes of this definition, "control" includes the right to grant
patent sublicenses in a manner consistent with the requirements of
this License.
Each contributor grants you a non-exclusive, worldwide, royalty-free
patent license under the contributor's essential patent claims, to
make, use, sell, offer for sale, import and otherwise run, modify and
propagate the contents of its contributor version.
In the following three paragraphs, a "patent license" is any express
agreement or commitment, however denominated, not to enforce a patent
(such as an express permission to practice a patent or covenant not to
sue for patent infringement). To "grant" such a patent license to a
party means to make such an agreement or commitment not to enforce a
patent against the party.
If you convey a covered work, knowingly relying on a patent license,
and the Corresponding Source of the work is not available for anyone
to copy, free of charge and under the terms of this License, through a
publicly available network server or other readily accessible means,
then you must either (1) cause the Corresponding Source to be so
available, or (2) arrange to deprive yourself of the benefit of the
patent license for this particular work, or (3) arrange, in a manner
consistent with the requirements of this License, to extend the patent
license to downstream recipients. "Knowingly relying" means you have
actual knowledge that, but for the patent license, your conveying the
covered work in a country, or your recipient's use of the covered work
in a country, would infringe one or more identifiable patents in that
country that you have reason to believe are valid.
If, pursuant to or in connection with a single transaction or
arrangement, you convey, or propagate by procuring conveyance of, a
covered work, and grant a patent license to some of the parties
receiving the covered work authorizing them to use, propagate, modify
or convey a specific copy of the covered work, then the patent license
you grant is automatically extended to all recipients of the covered
work and works based on it.
A patent license is "discriminatory" if it does not include within
the scope of its coverage, prohibits the exercise of, or is
conditioned on the non-exercise of one or more of the rights that are
specifically granted under this License. You may not convey a covered
work if you are a party to an arrangement with a third party that is
in the business of distributing software, under which you make payment
to the third party based on the extent of your activity of conveying
the work, and under which the third party grants, to any of the
parties who would receive the covered work from you, a discriminatory
patent license (a) in connection with copies of the covered work
conveyed by you (or copies made from those copies), or (b) primarily
for and in connection with specific products or compilations that
contain the covered work, unless you entered into that arrangement,
or that patent license was granted, prior to 28 March 2007.
Nothing in this License shall be construed as excluding or limiting
any implied license or other defenses to infringement that may
otherwise be available to you under applicable patent law.
12. No Surrender of Others' Freedom.
If conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License. If you cannot convey a
covered work so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you may
not convey it at all. For example, if you agree to terms that obligate you
to collect a royalty for further conveying from those to whom you convey
the Program, the only way you could satisfy both those terms and this
License would be to refrain entirely from conveying the Program.
13. Remote Network Interaction; Use with the GNU General Public License.
Notwithstanding any other provision of this License, if you modify the
Program, your modified version must prominently offer all users
interacting with it remotely through a computer network (if your version
supports such interaction) an opportunity to receive the Corresponding
Source of your version by providing access to the Corresponding Source
from a network server at no charge, through some standard or customary
means of facilitating copying of software. This Corresponding Source
shall include the Corresponding Source for any work covered by version 3
of the GNU General Public License that is incorporated pursuant to the
following paragraph.
Notwithstanding any other provision of this License, you have
permission to link or combine any covered work with a work licensed
under version 3 of the GNU General Public License into a single
combined work, and to convey the resulting work. The terms of this
License will continue to apply to the part which is the covered work,
but the work with which it is combined will remain governed by version
3 of the GNU General Public License.
14. Revised Versions of this License.
The Free Software Foundation may publish revised and/or new versions of
the GNU Affero General Public License from time to time. Such new versions
will be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.
Each version is given a distinguishing version number. If the
Program specifies that a certain numbered version of the GNU Affero General
Public License "or any later version" applies to it, you have the
option of following the terms and conditions either of that numbered
version or of any later version published by the Free Software
Foundation. If the Program does not specify a version number of the
GNU Affero General Public License, you may choose any version ever published
by the Free Software Foundation.
If the Program specifies that a proxy can decide which future
versions of the GNU Affero General Public License can be used, that proxy's
public statement of acceptance of a version permanently authorizes you
to choose that version for the Program.
Later license versions may give you additional or different
permissions. However, no additional obligations are imposed on any
author or copyright holder as a result of your choosing to follow a
later version.
15. Disclaimer of Warranty.
THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
16. Limitation of Liability.
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
SUCH DAMAGES.
17. Interpretation of Sections 15 and 16.
If the disclaimer of warranty and limitation of liability provided
above cannot be given local legal effect according to their terms,
reviewing courts shall apply local law that most closely approximates
an absolute waiver of all civil liability in connection with the
Program, unless a warranty or assumption of liability accompanies a
copy of the Program in return for a fee.
END OF TERMS AND CONDITIONS
How to Apply These Terms to Your New Programs
If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.
To do so, attach the following notices to the program. It is safest
to attach them to the start of each source file to most effectively
state the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.
<one line to give the program's name and a brief idea of what it does.>
Copyright (C) <year> <name of author>
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as published
by the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
Also add information on how to contact you by electronic and paper mail.
If your software can interact with users remotely through a computer
network, you should also make sure that it provides a way for users to
get its source. For example, if your program is a web application, its
interface could display a "Source" link that leads users to an archive
of the code. There are many ways you could offer source, and different
solutions will be better for different programs; see section 13 for the
specific requirements.
You should also get your employer (if you work as a programmer) or school,
if any, to sign a "copyright disclaimer" for the program, if necessary.
For more information on this, and how to apply and follow the GNU AGPL, see
<https://www.gnu.org/licenses/>.

README.md Normal file (12 lines)

@@ -0,0 +1,12 @@
# Wild Cloud Development Environment
## Support
- **Documentation**: See `docs/` directory for detailed guides
- **Issues**: Report problems on the project issue tracker
- **API Reference**: Available at `/api/v1/` endpoints when the service is running
## Documentation
- [Developer Guide](docs/DEVELOPER.md) - Development setup, testing, and API reference
- [Maintainer Guide](docs/MAINTAINER.md) - Package management and repository deployment

TODO.md Normal file (20 lines)

@@ -0,0 +1,20 @@
# Development TO DO
## Wild Central
- dnsmasq setup and config for DNS resolution
- (future) PXE boot setup and config
- (future) DHCP setup and config
## Wild Daemon (Central Service, wildd)
- Add methods for config get and set and use them consistently instead of yq.
- Put all directory/setup files inside the daemon itself. These can be versioned with the daemon.
## Wild CLI
- Use common.sh in install.sh scripts.
## Wild App
- Need to build the whole thing.

activate.sh Executable file (21 lines)

@@ -0,0 +1,21 @@
#!/usr/bin/env bash
# Bash completion
if [ -n "$BASH_VERSION" ]; then
  # kubectl completion
  if command -v kubectl &> /dev/null; then
    eval "$(kubectl completion bash)"
  fi
  # talosctl completion
  if command -v talosctl &> /dev/null; then
    eval "$(talosctl completion bash)"
  fi
  # wild completion
  if command -v wild &> /dev/null; then
    eval "$(wild completion bash)"
  fi
fi
source <(wild instance env)

ai/BUILDING_WILD_CENTRAL.md Normal file (260 lines)

@@ -0,0 +1,260 @@
# Building Wild Cloud Central
The first version of Wild Cloud, the Proof of Concept version (v.PoC), was built as a collection of shell scripts that users would run from their local machines. This works well for early adopters who are comfortable with the command line, Talos, and Kubernetes.
To make Wild Cloud more accessible to a broader audience, we are developing Wild Central. Wild Central is a single-purpose machine running on the user's LAN that delivers:
- Wild Daemon: A lightweight service that runs on a local machine (e.g., a Raspberry Pi) to manage Wild Cloud instances on the local network.
- Wild App: A web-based interface (to Wild Daemon) for managing Wild Cloud instances.
- Wild CLI: A command-line interface (to Wild Daemon) for advanced users who prefer to manage Wild Cloud from the terminal.
## Background info
### Info about Wild Cloud v.PoC
- @docs/agent-context/wildcloud-v.PoC/README.md
- @docs/agent-context/wildcloud-v.PoC/overview.md
- @docs/agent-context/wildcloud-v.PoC/project-architecture.md
- @docs/agent-context/wildcloud-v.PoC/bin-scripts.md
- @docs/agent-context/wildcloud-v.PoC/configuration-system.md
- @docs/agent-context/wildcloud-v.PoC/setup-process.md
- @docs/agent-context/wildcloud-v.PoC/apps-system.md
### Info about Talos
- @docs/agent-context/talos-v1.11/README.md
- @docs/agent-context/talos-v1.11/architecture-and-components.md
- @docs/agent-context/talos-v1.11/cli-essentials.md
- @docs/agent-context/talos-v1.11/cluster-operations.md
- @docs/agent-context/talos-v1.11/discovery-and-networking.md
- @docs/agent-context/talos-v1.11/etcd-management.md
- @docs/agent-context/talos-v1.11/bare-metal-administration.md
- @docs/agent-context/talos-v1.11/troubleshooting-guide.md
## Architecture
### Old v.PoC Architecture
- WC_ROOT: The scripts used to set up and manage the Wild Cloud cluster. Currently, this is a set of shell scripts in $WC_ROOT/bin.
- WC_HOME: During setup, the user creates a Wild Cloud project directory (WC_HOME) on their local machine. This directory holds all configuration, secrets, and k8s manifests for their specific Wild Cloud deployment.
- Wild Cloud Apps Directory: The Wild Cloud apps are stored in the `apps/` directory within the WC_ROOT repository. Users can deploy these apps to their cluster using the scripts in WC_ROOT/bin.
- dnsmasq server: Scripts help the operator set up a dnsmasq server on a separate machine to provide LAN DNS services during node bootstrapping.
### New Wild Central Architecture
#### wildd: The Wild Cloud Daemon
wildd is a long-running service that provides an API and web interface for managing one or more Wild Cloud clusters. It runs on a dedicated device within the user's network.
wildd replaces functionality from the v.PoC scripts and the dnsmasq server, providing a single API for managing multiple Wild Cloud instances on the LAN.
Both wild-app and wild-cli communicate with wildd to perform actions.
See: @daemon/BUILDING_WILD_DAEMON.md
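A minimal sketch of what a wildd endpoint might look like, listing managed instances. The `/api/v1/` prefix appears in the README; the `/api/v1/instances` route, the one-subdirectory-per-instance layout, and the handler shape are assumptions made for illustration, and port 5055 is the development daemon URL from `.envrc`.

```go
// Hypothetical wildd endpoint: list managed instances by scanning the
// Central data directory (assumed layout: one subdirectory per instance).
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"os"
)

// dataDir resolves the Central data directory, defaulting as documented.
func dataDir() string {
	if dir := os.Getenv("WILD_CENTRAL_DATA"); dir != "" {
		return dir
	}
	return "/var/lib/wild-central"
}

// listInstances responds with the names of the instance subdirectories.
func listInstances(w http.ResponseWriter, r *http.Request) {
	entries, err := os.ReadDir(dataDir())
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	names := []string{}
	for _, e := range entries {
		if e.IsDir() {
			names = append(names, e.Name())
		}
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(map[string][]string{"instances": names})
}

func main() {
	http.HandleFunc("/api/v1/instances", listInstances)
	log.Fatal(http.ListenAndServe(":5055", nil)) // development port from .envrc
}
```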
#### wild-app
The web application that provides the user interface for Wild Cloud on Wild Central. It communicates with wildd to perform actions and display information.
See: @/app/BUILDING_WILD_APP.md
#### wild-cli
A command-line interface for advanced users who prefer to manage Wild Cloud from the terminal. It communicates with wildd to perform actions.
wild-cli mirrors all of the wild-* scripts from v.PoC, adapted for the new architecture:
- One Go client (wild-cli) replaces many bash scripts (wild-*).
- Wrapper around the wildd API instead of direct file manipulation.
- Multi-cloud: the v.PoC scripts set the instance context with the WC_HOME environment variable. In Central, wild-cli follows the "context" pattern of kubectl and talosctl, using `--context` or `WILD_CONTEXT` to select which Wild Cloud instance to manage, or defaulting to the "current" context (see the sketch after this section).
See: @cli/BUILDING_WILD_CLI.md
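A sketch of how wild-cli could resolve its context. Precedence is the `--context` flag, then `WILD_CONTEXT`, then a stored "current" context, mirroring kubectl and talosctl. The `current-context` file name is a hypothetical detail; the `$HOME/.wildcloud` default comes from `.envrc`.

```go
// Sketch of context resolution for wild-cli (flag > env > stored current).
package main

import (
	"flag"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// cliDataDir resolves the CLI data directory, defaulting to ~/.wildcloud.
func cliDataDir() string {
	if dir := os.Getenv("WILD_CLI_DATA"); dir != "" {
		return dir
	}
	home, _ := os.UserHomeDir()
	return filepath.Join(home, ".wildcloud")
}

// resolveContext picks the Wild Cloud instance to operate on.
func resolveContext(flagValue string) (string, error) {
	if flagValue != "" {
		return flagValue, nil // explicit --context wins
	}
	if env := os.Getenv("WILD_CONTEXT"); env != "" {
		return env, nil
	}
	// "current-context" is an assumed file name for this sketch.
	b, err := os.ReadFile(filepath.Join(cliDataDir(), "current-context"))
	if err != nil {
		return "", fmt.Errorf("no context selected: %w", err)
	}
	return strings.TrimSpace(string(b)), nil
}

func main() {
	ctxFlag := flag.String("context", "", "Wild Cloud instance to manage")
	flag.Parse()
	ctx, err := resolveContext(*ctxFlag)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("using context:", ctx)
}
```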
#### Wild Central Data
Configured with the $WILD_CENTRAL_DATA environment variable (default: /var/lib/wild-central).
Replaces multiple WC_HOMEs. All wild clouds managed on the LAN are configured here, still in easy-to-read YAML that can be edited directly or through the webapp.
Wild Central data also holds the local app directory, logs, artifacts, and overall state data.
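A sketch of reading one instance's configuration from the Central data directory, assuming a `<instance>/config.yaml` layout (the file name and layout are assumptions) and using `gopkg.in/yaml.v3`. The dotted-path `get` helper is one possible shape for the config get/set methods mentioned in TODO.md as a replacement for yq; `cluster.name` is a hypothetical key.

```go
// Sketch: load an instance's YAML config and read a value by dotted path.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"

	"gopkg.in/yaml.v3"
)

// centralDataDir resolves the Central data directory with its documented default.
func centralDataDir() string {
	if dir := os.Getenv("WILD_CENTRAL_DATA"); dir != "" {
		return dir
	}
	return "/var/lib/wild-central"
}

// loadInstanceConfig parses <data>/<instance>/config.yaml into a generic map.
func loadInstanceConfig(instance string) (map[string]any, error) {
	b, err := os.ReadFile(filepath.Join(centralDataDir(), instance, "config.yaml"))
	if err != nil {
		return nil, err
	}
	cfg := map[string]any{}
	return cfg, yaml.Unmarshal(b, &cfg)
}

// get walks a dotted path like "cluster.name" through nested maps.
func get(cfg map[string]any, path string) (any, bool) {
	var cur any = cfg
	for _, key := range strings.Split(path, ".") {
		m, ok := cur.(map[string]any)
		if !ok {
			return nil, false
		}
		if cur, ok = m[key]; !ok {
			return nil, false
		}
	}
	return cur, true
}

func main() {
	cfg, err := loadInstanceConfig("example") // "example" is a placeholder instance name
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if v, ok := get(cfg, "cluster.name"); ok {
		fmt.Println(v)
	}
}
```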
#### Wild Cloud Apps Directory
The Wild Cloud apps are stored in the `apps/` directory within the WC_ROOT repository. Users can deploy these apps to their cluster using the webapp or wild-cli.
#### dnsmasq server
The Wild Daemon (wildd) includes functionality to manage a dnsmasq server on the same device, providing LAN DNS services during node bootstrapping.
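A hedged sketch of how wildd might manage dnsmasq: render a small configuration fragment and restart the service via systemd. The drop-in path, interface, domain, and address values are illustrative placeholders, not values taken from the Wild Cloud codebase.

```go
// Sketch: write a dnsmasq config fragment and restart the service.
// Requires root privileges; all concrete values below are placeholders.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

type dnsConfig struct {
	Interface string // LAN interface dnsmasq should listen on
	Domain    string // internal domain served during node bootstrapping
	CentralIP string // IP that names in that domain resolve to
}

// writeDnsmasqConf renders a minimal dnsmasq fragment to the given path.
func writeDnsmasqConf(path string, c dnsConfig) error {
	conf := fmt.Sprintf(
		"interface=%s\ndomain=%s\naddress=/%s/%s\n",
		c.Interface, c.Domain, c.Domain, c.CentralIP)
	return os.WriteFile(path, []byte(conf), 0o644)
}

// restartDnsmasq reloads the service so the new fragment takes effect.
func restartDnsmasq() error {
	out, err := exec.Command("systemctl", "restart", "dnsmasq").CombinedOutput()
	if err != nil {
		return fmt.Errorf("restart dnsmasq: %v: %s", err, out)
	}
	return nil
}

func main() {
	cfg := dnsConfig{Interface: "eth0", Domain: "wild.internal", CentralIP: "192.168.1.10"}
	if err := writeDnsmasqConf("/etc/dnsmasq.d/wild-central.conf", cfg); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if err := restartDnsmasq(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```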
## Packaging and Installation
Ultimately, the daemon, app, and cli will be packaged together for easy installation on a Raspberry Pi or similar device.
See @ai/WILD_CENTRAL_PACKAGING.md
## Implementation Philosophy
### Core Philosophy
This project embodies a Zen-like minimalism that values simplicity and clarity above all. This approach reflects:
- **Wabi-sabi philosophy**: Embracing simplicity and the essential. Each line serves a clear purpose without unnecessary embellishment.
- **KISS**: The solution should be as simple as possible, but no simpler.
- **YAGNI**: Avoid building features or abstractions that aren't immediately needed. The code handles what's needed now rather than anticipating every possible future scenario.
- **Trust in emergence**: Complex systems work best when built from simple, well-defined components that do one thing well.
- **Pragmatic trust**: The developer trusts external systems enough to interact with them directly, handling failures as they occur rather than assuming they'll happen.
- **Consistency is key**: Uniform patterns and conventions make the codebase easier to understand and maintain. If you introduce a new pattern, make sure it's consistently applied. There should be one obvious way to do things.
This development philosophy values clear, concise documentation, readable code, and the belief that good architecture emerges from simplicity rather than being imposed through complexity.
## Core Design Principles
### 1. Ruthless Simplicity
- **KISS principle taken to heart**: Keep everything as simple as possible, but no simpler
- **Minimize abstractions**: Every layer of abstraction must justify its existence
- **Start minimal, grow as needed**: Begin with the simplest implementation that meets current needs
- **Avoid future-proofing**: Don't build for hypothetical future requirements
- **Question everything**: Regularly challenge complexity in the codebase
### 2. Architectural Integrity with Minimal Implementation
- **Preserve key architectural patterns**: Maintain clear boundaries and responsibilities
- **Simplify implementations**: Maintain pattern benefits with dramatically simpler code
- **Scrappy but structured**: Lightweight implementations of solid architectural foundations
- **End-to-end thinking**: Focus on complete flows rather than perfect components
### 3. Library vs Custom Code
Choosing between custom code and external libraries is a judgment call that evolves with your requirements. There's no rigid rule - it's about understanding trade-offs and being willing to revisit decisions as needs change.
#### The Evolution Pattern
Your approach might naturally evolve:
- **Start simple**: Custom code for basic needs (20 lines handles it)
- **Growing complexity**: Switch to a library when requirements expand
- **Hitting limits**: Back to custom when you outgrow the library's capabilities
This isn't failure - it's natural evolution. Each stage was the right choice at that time.
#### When Custom Code Makes Sense
Custom code often wins when:
- The need is simple and well-understood
- You want code perfectly tuned to your exact requirements
- Libraries would require significant "hacking" or workarounds
- The problem is unique to your domain
- You need full control over the implementation
#### When Libraries Make Sense
Libraries shine when:
- They solve complex problems you'd rather not tackle (auth, crypto, video encoding)
- They align well with your needs without major modifications
- The problem is well-solved with mature, battle-tested solutions
- Configuration alone can adapt them to your requirements
- The complexity they handle far exceeds the integration cost
#### Making the Judgment Call
Ask yourself:
- How well does this library align with our actual needs?
- Are we fighting the library or working with it?
- Is the integration clean or does it require workarounds?
- Will our future requirements likely stay within this library's capabilities?
- Is the problem complex enough to justify the dependency?
#### Recognizing Misalignment
Watch for signs you're fighting your current approach:
- Spending more time working around the library than using it
- Your simple custom solution has grown complex and fragile
- You're monkey-patching or heavily wrapping a library
- The library's assumptions fundamentally conflict with your needs
#### Stay Flexible
Remember that complexity isn't destroyed, only moved. Libraries shift complexity from your code to someone else's - that's often a great trade, but recognize what you're doing.
The key is avoiding lock-in. Keep library integration points minimal and isolated so you can switch approaches when needed. There's no shame in moving from custom to library or library to custom. Requirements change, understanding deepens, and the right answer today might not be the right answer tomorrow. Make the best decision with current information, and be ready to evolve.
## Technical Implementation Guidelines
### API Layer
- Implement only essential endpoints
- Minimal middleware with focused validation
- Clear error responses with useful messages
- Consistent patterns across endpoints (see the sketch below)
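One way to keep error responses clear and consistent is a single helper that every handler uses. The JSON envelope below is an illustration, not the project's actual wire format.

```go
// Sketch of a shared error-response helper for API handlers.
package api

import (
	"encoding/json"
	"net/http"
)

type apiError struct {
	Error   string `json:"error"`             // short, stable error code
	Message string `json:"message"`           // human-readable explanation
	Details string `json:"details,omitempty"` // optional extra context
}

// writeError sends the same JSON shape from every endpoint.
func writeError(w http.ResponseWriter, status int, code, msg, details string) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	json.NewEncoder(w).Encode(apiError{Error: code, Message: msg, Details: details})
}

// Example usage inside a handler (parameter names are illustrative):
//
//	if instance == "" {
//	    writeError(w, http.StatusBadRequest, "missing_instance",
//	        "an instance name is required", "pass ?instance=<name>")
//	    return
//	}
```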
### Storage
- Prefer simple file storage
- Simple schema focused on current needs (sketch below)
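For simple file storage, an atomic write (write to a temporary file, then rename) keeps a crash from leaving a half-written YAML file behind. A minimal sketch, with an illustrative path layout:

```go
// Sketch of crash-safe file storage via write-then-rename.
package storage

import (
	"os"
	"path/filepath"
)

// writeFileAtomic replaces path with data without exposing partial writes.
func writeFileAtomic(path string, data []byte) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), ".tmp-*")
	if err != nil {
		return err
	}
	// Best-effort cleanup; fails harmlessly after a successful rename.
	defer os.Remove(tmp.Name())

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	// Rename is atomic on POSIX filesystems when source and target share a directory.
	return os.Rename(tmp.Name(), path)
}
```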
## Development Approach
### Vertical Slices
- Implement complete end-to-end functionality slices
- Start with core user journeys
- Get data flowing through all layers early
- Add features horizontally only after core flows work
### Iterative Implementation
- 80/20 principle: Focus on high-value, low-effort features first
- One working feature > multiple partial features
- Validate with real usage before enhancing
- Be willing to refactor early work as patterns emerge
### Testing Strategy
- Focus on critical path testing initially
- Add unit tests for complex logic and edge cases
- Testing pyramid: 60% unit, 30% integration, 10% end-to-end
### Error Handling
- Handle common errors robustly
- Log detailed information for debugging
- Provide clear error messages to users
- Fail fast and visibly during development (example below)
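A small sketch of the split between detailed logs and clear user-facing messages. The function and messages are illustrative; the config path is the one used in the packaging guide.

```go
// Sketch: log the full cause, return a short actionable message, fail fast.
package main

import (
	"errors"
	"fmt"
	"log"
	"os"
)

// loadConfig returns a user-facing error while logging the underlying cause.
func loadConfig(path string) ([]byte, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		log.Printf("loadConfig: %v", err) // detailed, for debugging
		if errors.Is(err, os.ErrNotExist) {
			return nil, fmt.Errorf("no configuration found at %s; run setup first", path)
		}
		return nil, fmt.Errorf("could not read configuration at %s", path)
	}
	return b, nil
}

func main() {
	if _, err := loadConfig("/etc/wild-cloud-central/config.yaml"); err != nil {
		fmt.Fprintln(os.Stderr, err) // clear message for the user
		os.Exit(1)                   // fail fast and visibly
	}
}
```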
## Decision-Making Framework
When faced with implementation decisions, ask these questions:
1. **Necessity**: "Do we actually need this right now?"
2. **Simplicity**: "What's the simplest way to solve this problem?"
3. **Directness**: "Can we solve this more directly?"
4. **Value**: "Does the complexity add proportional value?"
5. **Maintenance**: "How easy will this be to understand and change later?"
## Areas to Embrace Complexity
Some areas justify additional complexity:
1. **Security**: Never compromise on security fundamentals
2. **Data integrity**: Ensure data consistency and reliability
3. **Core user experience**: Make the primary user flows smooth and reliable
4. **Error visibility**: Make problems obvious and diagnosable
## Areas to Aggressively Simplify
Push for extreme simplicity in these areas:
1. **Internal abstractions**: Minimize layers between components
2. **Generic "future-proof" code**: Resist solving non-existent problems
3. **Edge case handling**: Handle the common cases well first
4. **Framework usage**: Use only what you need from frameworks
5. **State management**: Keep state simple and explicit
## Remember
- It's easier to add complexity later than to remove it
- Code you don't write has no bugs
- Favor clarity over cleverness
- The best code is often the simplest
This philosophy document serves as the foundational guide for all implementation decisions in the project.

ai/WILD_CENTRAL_PACKAGING.md Normal file (102 lines)

@@ -0,0 +1,102 @@
# Packaging Wild Central
## Desired Experience
This is the desired experience for installing Wild Cloud Central on a fresh Debian/Ubuntu system:
### APT Repository (Recommended)
```bash
# Download and install GPG key
curl -fsSL https://mywildcloud.org/apt/wild-cloud-central.gpg | sudo tee /usr/share/keyrings/wild-cloud-central-archive-keyring.gpg > /dev/null
# Add repository (modern .sources format)
sudo tee /etc/apt/sources.list.d/wild-cloud-central.sources << 'EOF'
Types: deb
URIs: https://mywildcloud.org/apt
Suites: stable
Components: main
Signed-By: /usr/share/keyrings/wild-cloud-central-archive-keyring.gpg
EOF
# Update and install
sudo apt update
sudo apt install wild-cloud-central
```
### Manual Installation
Download the latest `.deb` package from the [releases page](https://github.com/wildcloud/wild-central/releases) and install:
```bash
sudo dpkg -i wild-cloud-central_*.deb
sudo apt-get install -f # Fix any dependency issues
```
## Quick Start
1. **Configure the service** (optional):
```bash
sudo cp /etc/wild-cloud-central/config.yaml.example /etc/wild-cloud-central/config.yaml
sudo nano /etc/wild-cloud-central/config.yaml
```
2. **Start the service**:
```bash
sudo systemctl enable wild-cloud-central
sudo systemctl start wild-cloud-central
```
3. **Access the web interface**:
Open http://your-server-ip in your browser
## Developer tooling
Makefile commands for packaging:

Build targets (compile binaries):
- `make build` - Build for current architecture
- `make build-arm64` - Build arm64 binary
- `make build-amd64` - Build amd64 binary
- `make build-all` - Build all architectures

Package targets (create .deb packages):
- `make package` - Create .deb package for current arch
- `make package-arm64` - Create arm64 .deb package
- `make package-amd64` - Create amd64 .deb package
- `make package-all` - Create all .deb packages

Repository targets:
- `make repo` - Build APT repository from packages
- `make deploy-repo` - Deploy repository to server

Quality assurance:
- `make check` - Run all checks (fmt + vet + test)
- `make fmt` - Format Go code
- `make vet` - Run go vet
- `make test` - Run tests

Development:
- `make run` - Run application locally
- `make clean` - Remove all build artifacts
- `make deps-check` - Verify and tidy dependencies
- `make version` - Show build information
- `make install` - Install to system

Directory structure:
- `build/` - Intermediate build artifacts
- `dist/bin/` - Final binaries for distribution
- `dist/packages/` - OS packages (.deb files)
- `dist/repositories/` - APT repository for deployment

Example workflows:
- `make check && make build` - Safe development build
- `make clean && make repo` - Full release build

ai/talos-v1.11/README.md Normal file (135 lines)

@@ -0,0 +1,135 @@
# Talos v1.11 Agent Context Documentation
This directory contains comprehensive documentation extracted from the official Talos v1.11 documentation, organized specifically to help AI agents become expert Talos cluster administrators.
## Documentation Structure
### Core Operations
- **[cluster-operations.md](cluster-operations.md)** - Essential cluster operations including upgrades, node management, and configuration
- **[cli-essentials.md](cli-essentials.md)** - Key talosctl commands and usage patterns for daily administration
### System Understanding
- **[architecture-and-components.md](architecture-and-components.md)** - Deep dive into Talos architecture, components, and design principles
- **[discovery-and-networking.md](discovery-and-networking.md)** - Cluster discovery mechanisms and network configuration
### Specialized Operations
- **[etcd-management.md](etcd-management.md)** - etcd operations, maintenance, backup, and disaster recovery
- **[bare-metal-administration.md](bare-metal-administration.md)** - Bare metal specific configurations, security, and hardware management
- **[troubleshooting-guide.md](troubleshooting-guide.md)** - Systematic approaches to diagnosing and resolving common issues
## Quick Reference
### Essential Commands for New Agents
```bash
# Cluster health check
talosctl -n <IP1>,<IP2>,<IP3> health --control-plane-nodes <IP1>,<IP2>,<IP3>
# Node information
talosctl get members
talosctl -n <IP> version
# Service status
talosctl -n <IP> services
talosctl -n <IP> service kubelet
# System resources
talosctl -n <IP> memory
talosctl -n <IP> disks
# Logs and events
talosctl -n <IP> dmesg | tail -50
talosctl -n <IP> logs kubelet
talosctl -n <IP> events --since=1h
```
### Critical Procedures
- **Bootstrap**: `talosctl bootstrap --nodes <first-controlplane-ip>`
- **Backup etcd**: `talosctl -n <IP> etcd snapshot db.snapshot`
- **Upgrade OS**: `talosctl upgrade --nodes <IP> --image ghcr.io/siderolabs/installer:v1.11.x`
- **Upgrade K8s**: `talosctl --nodes <controlplane> upgrade-k8s --to v1.34.1`
### Emergency Commands
- **Node reset**: `talosctl -n <IP> reset`
- **Force reset**: `talosctl -n <IP> reset --graceful=false --reboot`
- **Disaster recovery**: `talosctl -n <IP> bootstrap --recover-from=./db.snapshot`
- **Rollback**: `talosctl rollback --nodes <IP>`
### Bare Metal Specific Commands
- **Check hardware**: `talosctl -n <IP> disks`, `talosctl -n <IP> read /proc/cpuinfo`
- **Network interfaces**: `talosctl -n <IP> get addresses`, `talosctl -n <IP> get routes`
- **Extensions**: `talosctl -n <IP> get extensions`
- **Encryption status**: `talosctl -n <IP> get encryptionconfig -o yaml`
- **Hardware monitoring**: `talosctl -n <IP> dmesg | grep -i error`
## Key Concepts for Agents
### Architecture Fundamentals
- **Immutable OS**: Single image, atomic updates, A-B rollback system
- **API-driven**: All management through gRPC API, no SSH/shell access
- **Controller pattern**: Kubernetes-style resource controllers for system management
- **Minimal attack surface**: Only services necessary for Kubernetes
### Control Plane Design
- **etcd quorum**: Requires majority for operations (3-node=2, 5-node=3)
- **Bootstrap process**: One-time initialization of etcd cluster
- **HA considerations**: Odd numbers of nodes, avoid even numbers
- **Upgrade strategy**: Rolling upgrades with automatic rollback on failure
### Network and Discovery
- **Service discovery**: Encrypted discovery service for cluster membership
- **KubeSpan**: Optional WireGuard mesh networking
- **mTLS everywhere**: All Talos API communication secured
- **Discovery registries**: Service (default) and Kubernetes (deprecated)
### Bare Metal Considerations
- **META configuration**: Network config embedded in disk images
- **Hardware compatibility**: Driver support and firmware requirements
- **Disk encryption**: LUKS2 with TPM, static keys, or node ID
- **SecureBoot**: UKI images with embedded signatures
- **System extensions**: Hardware-specific drivers and tools
- **Performance tuning**: CPU governors, IOMMU, memory management
## Common Administration Patterns
### Daily Operations
1. Check cluster health across all nodes
2. Monitor resource usage and capacity
3. Review system events and logs
4. Verify etcd health and backup status
5. Monitor discovery service connectivity
### Maintenance Windows
1. Plan upgrade sequence (workers first, then control plane)
2. Create etcd backup before major changes
3. Apply configuration changes with dry-run first
4. Monitor upgrade progress and be ready to rollback
5. Verify cluster functionality after changes
### Troubleshooting Workflow
1. **Gather information**: Health, version, resources, logs
2. **Check connectivity**: Network, discovery, API endpoints
3. **Examine services**: Status of critical services
4. **Review logs**: System events, service logs, kernel messages
5. **Apply fixes**: Configuration patches, service restarts, node resets
## Best Practices for Agents
### Configuration Management
- Use reproducible configuration workflow (secrets + patches)
- Always dry-run configuration changes first
- Store machine configurations in version control
- Test configuration changes in non-production first
### Operational Safety
- Take etcd snapshots before major changes
- Upgrade one node at a time
- Monitor upgrade progress and have rollback ready
- Test disaster recovery procedures regularly
### Performance Optimization
- Monitor etcd fragmentation and defragment when needed
- Scale vertically before horizontally for control plane
- Use appropriate hardware for etcd (fast storage, low network latency)
- Monitor resource usage trends and capacity planning
This documentation provides the essential knowledge needed to effectively administer Talos Linux clusters, organized by operational context and complexity level.

View File

@@ -0,0 +1,248 @@
# Talos Architecture and Components Guide
This guide provides deep understanding of Talos Linux architecture and system components for effective cluster administration.
## Core Architecture Principles
Talos is designed to be:
- **Atomic**: Distributed as a single, versioned, signed, immutable image
- **Modular**: Composed of separate components with defined gRPC interfaces
- **Minimal**: Focused init system that runs only services necessary for Kubernetes
## File System Architecture
### Partition Layout
- **EFI**: Stores EFI boot data
- **BIOS**: Used for GRUB's second stage boot
- **BOOT**: Contains boot loader, initramfs, and kernel data
- **META**: Stores node metadata (node IDs, etc.)
- **STATE**: Stores machine configuration, node identity, cluster discovery, KubeSpan data
- **EPHEMERAL**: Stores ephemeral state, mounted at `/var`
### Root File System Structure
Three-layer design:
1. **Base Layer**: Read-only squashfs mounted as loop device (immutable base)
2. **Runtime Layer**: tmpfs filesystems for runtime needs (`/dev`, `/proc`, `/run`, `/sys`, `/tmp`, `/system`)
3. **Overlay Layer**: overlayfs for persistent data backed by XFS at `/var`
#### Special Directories
- `/system`: Internal files that need to be writable (recreated each boot)
- Example: `/system/etc/hosts` bind-mounted over `/etc/hosts`
- `/var`: Owned by Kubernetes, contains persistent data:
- etcd data (control plane nodes)
- kubelet data
- containerd data
- Survives reboots and upgrades, wiped on reset
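To inspect what actually lives in these locations on a running node, the regular `talosctl` file commands are enough (a quick sketch; exact contents vary by node role and workloads):
```bash
# Peek at persistent data under /var
talosctl -n <IP> list /var/lib
talosctl -n <IP> list /var/lib/etcd        # control plane nodes only
talosctl -n <IP> list /var/lib/kubelet
talosctl -n <IP> list /var/lib/containerd
```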
## Core Components
### machined (PID 1)
**Role**: Talos replacement for traditional init process
**Functions**:
- Machine configuration management
- API handling
- Resource and controller management
- Service lifecycle management
**Managed Services**:
- containerd
- etcd (control plane nodes)
- kubelet
- networkd
- trustd
- udevd
**Architecture**: Uses controller-runtime pattern similar to Kubernetes controllers
### apid (API Gateway)
**Role**: gRPC API endpoint for all Talos interactions
**Functions**:
- Routes requests to appropriate components
- Provides proxy capabilities for multi-node operations
- Handles authentication and authorization
**Usage Patterns**:
```bash
# Direct node communication
talosctl -e <node-ip> <command>
# Proxy through endpoint to specific nodes
talosctl -e <endpoint> -n <target-nodes> <command>
# Multi-node operations
talosctl -e <endpoint> -n <node1>,<node2>,<node3> <command>
```
### trustd (Trust Management)
**Role**: Establishes and maintains trust within the system
**Functions**:
- Root of Trust implementation
- PKI data distribution for control plane bootstrap
- Certificate management
- Secure file placement operations
### containerd (Container Runtime)
**Role**: Industry-standard container runtime
**Namespaces**:
- `system`: Talos services
- `k8s.io`: Kubernetes services
### udevd (Device Management)
**Role**: Device file manager (eudev implementation)
**Functions**:
- Kernel device notification handling
- Device node management in `/dev`
- Hardware discovery and setup
## Control Plane Architecture
### etcd Cluster Design
**Critical Concepts**:
- **Quorum**: Majority of members must agree on leader
- **Membership**: Formal etcd cluster membership required
- **Consensus**: Uses Raft protocol for distributed consensus
**Quorum Requirements**:
- 3 nodes: Requires 2 for quorum (tolerates 1 failure)
- 5 nodes: Requires 3 for quorum (tolerates 2 failures)
- Even numbers are worse than odd (4 nodes still only tolerates 1 failure)
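The arithmetic behind these numbers is `quorum = floor(n/2) + 1`; a quick sketch for checking a proposed cluster size:
```bash
# Quorum and fault tolerance for an n-member etcd cluster
for n in 1 2 3 4 5 6 7; do
  quorum=$(( n / 2 + 1 ))
  echo "members=${n} quorum=${quorum} tolerated_failures=$(( n - quorum ))"
done
# Note: 4 members tolerate no more failures than 3, and 6 no more than 5.
```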
### Control Plane Components
**Running as Static Pods on Control Plane Nodes**:
#### kube-apiserver
- Kubernetes API endpoint
- Connects to local etcd instance
- Handles all API operations
#### kube-controller-manager
- Runs control loops
- Manages cluster state reconciliation
- Handles node lifecycle, replication, etc.
#### kube-scheduler
- Pod placement decisions
- Resource-aware scheduling
- Constraint satisfaction
### Bootstrap Process
1. **etcd Bootstrap**: One node chosen as bootstrap node, initializes etcd cluster
2. **Static Pods**: Control plane components start as static pods via kubelet
3. **API Availability**: Control plane endpoint becomes available
4. **Manifest Injection**: Bootstrap manifests (join tokens, RBAC, etc.) injected
5. **Cluster Formation**: Other control plane nodes join etcd cluster
6. **HA Control Plane**: All control plane nodes run full component set
## Resource System Architecture
### Controller-Runtime Pattern
Talos uses Kubernetes-style controller pattern:
- **Resources**: Typed configuration and state objects
- **Controllers**: Reconcile desired vs actual state
- **Events**: Reactive architecture for state changes
### Resource Namespaces
- `config`: Machine configuration resources
- `cluster`: Cluster membership and discovery
- `controlplane`: Control plane component configurations
- `secrets`: Certificate and key management
- `network`: Network configuration and state
### Key Resources
```bash
# Machine configuration
talosctl get machineconfig
talosctl get machinetype
# Cluster membership
talosctl get members
talosctl get affiliates
talosctl get identities
# Control plane
talosctl get apiserverconfig
talosctl get controllermanagerconfig
talosctl get schedulerconfig
# Network
talosctl get addresses
talosctl get routes
talosctl get nodeaddresses
```
## Network Architecture
### Network Stack
- **CNI**: Container Network Interface for pod networking
- **Host Networking**: Node-to-node communication
- **Service Discovery**: Built-in cluster member discovery
- **KubeSpan**: Optional WireGuard mesh networking
### Discovery Service Integration
- **Service Registry**: External discovery service (default: discovery.talos.dev)
- **Kubernetes Registry**: Deprecated, uses Kubernetes Node resources
- **Encrypted Communication**: All discovery data encrypted before transmission
## Security Architecture
### Immutable Base
- Read-only root filesystem
- Signed and verified boot process
- Atomic updates with rollback capability
### Process Isolation
- Minimal attack surface
- No shell access
- No arbitrary user services
- Container-based workload isolation
### Network Security
- Mutual TLS (mTLS) for all API communication
- Certificate-based node authentication
- Optional WireGuard mesh networking (KubeSpan)
- Encrypted service discovery
### Kernel Hardening
Configured according to Kernel Self Protection Project (KSPP) recommendations:
- Stack protection
- Control flow integrity
- Memory protection features
- Attack surface reduction
## Extension Points
### Machine Configuration
- Declarative configuration management
- Patch-based configuration updates
- Runtime configuration validation
### System Extensions
- Kernel modules
- System services (limited)
- Network configuration
- Storage configuration
### Kubernetes Integration
- Automatic kubelet configuration
- Bootstrap manifest management
- Certificate lifecycle management
- Node lifecycle automation
## Performance Characteristics
### etcd Performance
- Performance decreases with cluster size
- Network latency affects consensus performance
- Storage I/O directly impacts etcd performance
### Resource Requirements
- **Control Plane Nodes**: Higher memory for etcd, CPU for control plane
- **Worker Nodes**: Resources scale with workload requirements
- **Network**: Low latency crucial for etcd performance
### Scaling Patterns
- **Horizontal Scaling**: Add worker nodes for capacity
- **Vertical Scaling**: Increase control plane node resources for performance
- **Control Plane Scaling**: Odd numbers (3, 5) for availability
This architecture enables Talos to provide a secure, minimal, and operationally simple platform for running Kubernetes clusters while maintaining the reliability and performance characteristics needed for production workloads.

View File

@@ -0,0 +1,506 @@
# Bare Metal Talos Administration Guide
This guide covers bare metal specific operations, configurations, and best practices for Talos Linux clusters.
## META-Based Network Configuration
Talos supports META-based network configuration for bare metal deployments where configuration is embedded in the disk image.
### Basic META Configuration
```yaml
# META configuration for bare metal networking
machine:
network:
interfaces:
- interface: eth0
addresses:
- 192.168.1.100/24
routes:
- network: 0.0.0.0/0
gateway: 192.168.1.1
mtu: 1500
nameservers:
- 8.8.8.8
- 1.1.1.1
```
### Advanced Network Configurations
#### VLAN Configuration
```yaml
machine:
network:
interfaces:
- interface: eth0.100 # VLAN 100
vlan:
parentDevice: eth0
vid: 100
addresses:
- 192.168.100.10/24
routes:
- network: 192.168.100.0/24
```
#### Interface Bonding
```yaml
machine:
network:
interfaces:
- interface: bond0
bond:
mode: 802.3ad
lacpRate: fast
xmitHashPolicy: layer3+4
miimon: 100
updelay: 200
downdelay: 200
interfaces:
- eth0
- eth1
addresses:
- 192.168.1.100/24
routes:
- network: 0.0.0.0/0
gateway: 192.168.1.1
```
#### Bridge Configuration
```yaml
machine:
network:
interfaces:
- interface: br0
bridge:
stp:
enabled: false
interfaces:
- eth0
- eth1
addresses:
- 192.168.1.100/24
routes:
- network: 0.0.0.0/0
gateway: 192.168.1.1
```
### Network Troubleshooting Commands
```bash
# Check interface configuration
talosctl -n <IP> get addresses
talosctl -n <IP> get routes
talosctl -n <IP> get links
# Check network configuration
talosctl -n <IP> get networkconfig -o yaml
# Test network connectivity
talosctl -n <IP> list /sys/class/net
talosctl -n <IP> read /proc/net/dev
```
## Disk Encryption for Bare Metal
### LUKS2 Encryption Configuration
```yaml
machine:
systemDiskEncryption:
state:
provider: luks2
keys:
- slot: 0
static:
passphrase: "your-secure-passphrase"
ephemeral:
provider: luks2
keys:
- slot: 0
nodeID: {}
```
### TPM-Based Encryption
```yaml
machine:
systemDiskEncryption:
state:
provider: luks2
keys:
- slot: 0
tpm: {}
ephemeral:
provider: luks2
keys:
- slot: 0
tpm: {}
```
### Key Management Operations
```bash
# Check encryption status
talosctl -n <IP> get encryptionconfig -o yaml
# Rotate encryption keys
talosctl -n <IP> apply-config --file updated-config.yaml --mode staged
```
## SecureBoot Implementation
### UKI (Unified Kernel Image) Setup
SecureBoot requires UKI format images with embedded signatures.
#### Generate SecureBoot Keys
```bash
# Generate platform key (PK)
talosctl gen secureboot uki --platform-key-path platform.key --platform-cert-path platform.crt
# Generate PCR signing key
talosctl gen secureboot pcr --pcr-key-path pcr.key --pcr-cert-path pcr.crt
# Generate database entries
talosctl gen secureboot database --enrolled-certificate platform.crt
```
#### Machine Configuration for SecureBoot
```yaml
machine:
secureboot:
enabled: true
ukiPath: /boot/vmlinuz
systemDiskEncryption:
state:
provider: luks2
keys:
- slot: 0
tpm:
pcrTargets:
- 0
- 1
- 7
```
### UEFI Configuration
- Enable SecureBoot in UEFI firmware
- Enroll platform keys and certificates
- Configure TPM 2.0 for PCR measurements
- Set boot order for UKI images
## Hardware-Specific Configurations
### Performance Tuning for Bare Metal
#### CPU Governor Configuration
```yaml
machine:
sysfs:
"devices.system.cpu.cpu0.cpufreq.scaling_governor": "performance"
"devices.system.cpu.cpu1.cpufreq.scaling_governor": "performance"
```
#### Hardware Vulnerability Mitigations
```yaml
machine:
kernel:
args:
- mitigations=off # For maximum performance (less secure)
# or
- mitigations=auto # Default balanced approach
```
#### IOMMU Configuration
```yaml
machine:
kernel:
args:
- intel_iommu=on
- iommu=pt
```
### Memory Management
```yaml
machine:
kernel:
args:
- hugepages=1024 # 1GB hugepages
- transparent_hugepage=never
```
## Ingress Firewall for Bare Metal
### Basic Firewall Configuration
```yaml
machine:
network:
firewall:
defaultAction: block
rules:
- name: allow-talos-api
portSelector:
ports:
- 50000
- 50001
ingress:
- subnet: 192.168.1.0/24
- name: allow-kubernetes-api
portSelector:
ports:
- 6443
ingress:
- subnet: 0.0.0.0/0
- name: allow-etcd
portSelector:
ports:
- 2379
- 2380
ingress:
- subnet: 192.168.1.0/24
```
### Advanced Firewall Rules
```yaml
machine:
network:
firewall:
defaultAction: block
rules:
- name: allow-ssh-management
portSelector:
ports:
- 22
ingress:
- subnet: 10.0.1.0/24 # Management network only
- name: allow-monitoring
portSelector:
ports:
- 9100 # Node exporter
- 10250 # kubelet metrics
ingress:
- subnet: 192.168.1.0/24
```
## System Extensions for Bare Metal
### Common Bare Metal Extensions
```yaml
machine:
install:
extensions:
- image: ghcr.io/siderolabs/iscsi-tools:latest
- image: ghcr.io/siderolabs/util-linux-tools:latest
- image: ghcr.io/siderolabs/drbd:latest
```
### Storage Extensions
```yaml
machine:
install:
extensions:
- image: ghcr.io/siderolabs/zfs:latest
- image: ghcr.io/siderolabs/nut-client:latest
- image: ghcr.io/siderolabs/smartmontools:latest
```
### Checking Extension Status
```bash
# List installed extensions
talosctl -n <IP> get extensions
# Check extension services
talosctl -n <IP> get extensionserviceconfigs
```
## Static Pod Configuration for Bare Metal
### Local Storage Static Pods
```yaml
machine:
pods:
- name: local-storage-provisioner
namespace: kube-system
image: rancher/local-path-provisioner:v0.0.24
args:
- --config-path=/etc/config/config.json
env:
- name: POD_NAMESPACE
value: kube-system
volumeMounts:
- name: config
mountPath: /etc/config
- name: local-storage
mountPath: /opt/local-path-provisioner
volumes:
- name: config
hostPath:
path: /etc/local-storage
- name: local-storage
hostPath:
path: /var/lib/local-storage
```
### Hardware Monitoring Static Pods
```yaml
machine:
pods:
- name: node-exporter
namespace: monitoring
image: prom/node-exporter:latest
args:
- --path.rootfs=/host
- --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)
securityContext:
runAsNonRoot: true
runAsUser: 65534
volumeMounts:
- name: proc
mountPath: /host/proc
readOnly: true
- name: sys
mountPath: /host/sys
readOnly: true
- name: rootfs
mountPath: /host
readOnly: true
volumes:
- name: proc
hostPath:
path: /proc
- name: sys
hostPath:
path: /sys
- name: rootfs
hostPath:
path: /
```
## Bare Metal Boot Asset Management
### PXE Boot Configuration
For network booting, configure DHCP/TFTP with appropriate boot assets:
```bash
# Download kernel and initramfs for PXE
curl -LO https://github.com/siderolabs/talos/releases/download/v1.11.0/vmlinuz-amd64
curl -LO https://github.com/siderolabs/talos/releases/download/v1.11.0/initramfs-amd64.xz
```
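A common pattern is to chain-load iPXE and point it at these assets; a hedged sketch of the boot script (the HTTP server address is an assumption, and the `talos.platform=metal` kernel argument follows the upstream PXE examples):
```bash
# Minimal iPXE script served next to the downloaded assets (sketch; adjust the URL)
cat > boot.ipxe <<'EOF'
#!ipxe
kernel http://10.0.0.5:8080/vmlinuz-amd64 talos.platform=metal console=tty0
initrd http://10.0.0.5:8080/initramfs-amd64.xz
boot
EOF
```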
### USB Boot Asset Creation
```bash
# Write installer image to USB
sudo dd if=metal-amd64.iso of=/dev/sdX bs=4M status=progress
```
### Image Factory Integration
For custom bare metal images:
```bash
# Generate schematic for bare metal with extensions
curl -X POST --data-binary @schematic.yaml \
https://factory.talos.dev/schematics
# Download custom installer
curl -LO https://factory.talos.dev/image/<schematic-id>/v1.11.0/metal-amd64.iso
```
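The `schematic.yaml` referenced above is a short YAML document; a sketch that bakes in extensions used earlier in this guide (confirm exact extension IDs against the factory catalog):
```bash
# Example schematic for the Image Factory request above (sketch)
cat > schematic.yaml <<'EOF'
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/iscsi-tools
      - siderolabs/util-linux-tools
EOF
# POST it as shown above; the response contains the <schematic-id> used in the download URL
```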
## Hardware Compatibility and Drivers
### Check Hardware Support
```bash
# Check PCI devices
talosctl -n <IP> read /proc/bus/pci/devices
# Check USB devices
talosctl -n <IP> list /sys/bus/usb/devices
# Check loaded kernel modules
talosctl -n <IP> read /proc/modules
# Check hardware information
talosctl -n <IP> read /proc/cpuinfo
talosctl -n <IP> read /proc/meminfo
```
### Common Hardware Issues
#### Network Interface Issues
```bash
# Check interface status
talosctl -n <IP> list /sys/class/net/
# Check driver information
talosctl -n <IP> read /sys/class/net/eth0/device/driver
# Check firmware loading
talosctl -n <IP> dmesg | grep firmware
```
#### Storage Controller Issues
```bash
# Check block devices
talosctl -n <IP> disks
# Check SMART status (if smartmontools extension installed)
talosctl -n <IP> list /dev/disk/by-id/
```
## Bare Metal Monitoring and Maintenance
### Hardware Health Monitoring
```bash
# Check system temperatures (if available)
talosctl -n <IP> read /sys/class/thermal/thermal_zone0/temp
# Check power supply status
talosctl -n <IP> read /sys/class/power_supply/*/status
# Monitor system events for hardware issues
talosctl -n <IP> dmesg | grep -i error
talosctl -n <IP> dmesg | grep -i "machine check"
```
### Performance Monitoring
```bash
# Check CPU performance
talosctl -n <IP> read /proc/cpuinfo | grep MHz
talosctl -n <IP> cgroups --preset cpu
# Check memory performance
talosctl -n <IP> memory
talosctl -n <IP> cgroups --preset memory
# Check I/O performance
talosctl -n <IP> read /proc/diskstats
```
## Security Hardening for Bare Metal
### BIOS/UEFI Security
- Enable SecureBoot
- Disable unused boot devices
- Set administrator passwords
- Enable TPM 2.0
- Disable legacy boot modes
### Physical Security
- Secure physical access to servers
- Use chassis intrusion detection
- Implement network port security
- Consider hardware-based attestation
### Network Security
```yaml
machine:
network:
firewall:
defaultAction: block
rules:
# Only allow necessary services
- name: allow-cluster-traffic
portSelector:
ports:
- 6443 # Kubernetes API
- 2379 # etcd client
- 2380 # etcd peer
- 10250 # kubelet API
- 50000 # Talos API
ingress:
- subnet: 192.168.1.0/24
```
This bare metal guide provides comprehensive coverage of hardware-specific configurations, performance optimization, security hardening, and operational practices for Talos Linux on physical servers.

View File

@@ -0,0 +1,382 @@
# Talosctl CLI Essentials
This guide covers essential talosctl commands and usage patterns for effective Talos cluster administration.
## Command Structure and Context
### Basic Command Pattern
```bash
talosctl [global-flags] <command> [command-flags] [arguments]
# Examples:
talosctl -n <IP> get members
talosctl --nodes <IP1>,<IP2> service kubelet
talosctl -e <endpoint> -n <target-nodes> upgrade --image <image>
```
### Global Flags
- `-e, --endpoints`: API endpoints to connect to
- `-n, --nodes`: Target nodes for commands (defaults to first endpoint if omitted)
- `--talosconfig`: Path to Talos configuration file
- `--context`: Configuration context to use
### Configuration Management
```bash
# Use specific config file
export TALOSCONFIG=/path/to/talosconfig
# List available contexts
talosctl config contexts
# Switch context
talosctl config context <context-name>
# View current config
talosctl config info
```
## Cluster Management Commands
### Bootstrap and Node Management
```bash
# Bootstrap etcd cluster on first control plane node
talosctl bootstrap --nodes <first-controlplane-ip>
# Apply machine configuration
talosctl apply-config --nodes <IP> --file <config.yaml>
talosctl apply-config --nodes <IP> --file <config.yaml> --mode reboot
talosctl apply-config --nodes <IP> --file <config.yaml> --dry-run
# Reset node (wipe and reboot)
talosctl reset --nodes <IP>
talosctl reset --nodes <IP> --graceful=false --reboot
# Reboot node
talosctl reboot --nodes <IP>
# Shutdown node
talosctl shutdown --nodes <IP>
```
### Configuration Patching
```bash
# Patch machine configuration
talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/machine/logging/destinations/0/endpoint", "value": "tcp://new-endpoint:514"}]'
# Patch with file
talosctl -n <IP> patch mc --patch @patch.yaml --mode reboot
# Edit machine config interactively
talosctl -n <IP> edit mc --mode staged
```
## System Information and Monitoring
### Node Status and Health
```bash
# Cluster member information
talosctl get members
talosctl get affiliates
talosctl get identities
# Node health check
talosctl -n <IP> health
talosctl -n <IP1>,<IP2>,<IP3> health --control-plane-nodes <IP1>,<IP2>,<IP3>
# System information
talosctl -n <IP> version
talosctl -n <IP> get machineconfig
talosctl -n <IP> get machinetype
```
### Resource Monitoring
```bash
# CPU and memory usage
talosctl -n <IP> cpu
talosctl -n <IP> memory
# Disk usage and information
talosctl -n <IP> disks
talosctl -n <IP> df
# Network interfaces
talosctl -n <IP> interfaces
talosctl -n <IP> get addresses
talosctl -n <IP> get routes
# Process information
talosctl -n <IP> processes
talosctl -n <IP> cgroups --preset memory
talosctl -n <IP> cgroups --preset cpu
```
### Service Management
```bash
# List all services
talosctl -n <IP> services
# Check specific service status
talosctl -n <IP> service kubelet
talosctl -n <IP> service containerd
talosctl -n <IP> service etcd
# Restart service
talosctl -n <IP> service kubelet restart
# Start/stop service
talosctl -n <IP> service <service-name> start
talosctl -n <IP> service <service-name> stop
```
## Logging and Diagnostics
### Log Retrieval
```bash
# Kernel logs
talosctl -n <IP> dmesg
talosctl -n <IP> dmesg -f # Follow mode
talosctl -n <IP> dmesg --tail=100
# Service logs
talosctl -n <IP> logs kubelet
talosctl -n <IP> logs containerd
talosctl -n <IP> logs etcd
talosctl -n <IP> logs machined
# Follow logs
talosctl -n <IP> logs kubelet -f
```
### System Events
```bash
# Monitor system events
talosctl -n <IP> events
talosctl -n <IP> events --tail
# Filter events
talosctl -n <IP> events --since=1h
talosctl -n <IP> events --grep=error
```
## File System and Container Operations
### File Operations
```bash
# List files/directories
talosctl -n <IP> list /var/log
talosctl -n <IP> list /etc/kubernetes
# Copy files from the node to the local machine (talosctl copy only reads from the node)
talosctl -n <IP> copy /var/log ./node-logs
talosctl -n <IP> cp /var/log/containers/app.log ./app.log
# Read file contents
talosctl -n <IP> read /etc/resolv.conf
talosctl -n <IP> read /var/log/audit/audit.log
```
### Container Operations
```bash
# List containers
talosctl -n <IP> containers
talosctl -n <IP> containers -k # Kubernetes containers
# Container logs
talosctl -n <IP> logs --kubernetes <container-name>
# Execute in a container (Talos has no node shell; use kubectl instead)
kubectl exec -it <pod-name> -- <command>
```
## Kubernetes Integration
### Kubernetes Cluster Operations
```bash
# Get kubeconfig
talosctl kubeconfig
talosctl kubeconfig --nodes <controlplane-ip>
talosctl kubeconfig --force --nodes <controlplane-ip>
# Bootstrap manifests
talosctl -n <IP> get manifests
talosctl -n <IP> get manifests -o yaml | yq eval-all '.spec | .[] | splitDoc' - > manifests.yaml
# Upgrade Kubernetes
talosctl --nodes <controlplane> upgrade-k8s --to v1.34.1
talosctl --nodes <controlplane> upgrade-k8s --to v1.34.1 --dry-run
```
### Resource Inspection
```bash
# Control plane component configs
talosctl -n <IP> get apiserverconfig -o yaml
talosctl -n <IP> get controllermanagerconfig -o yaml
talosctl -n <IP> get schedulerconfig -o yaml
# etcd configuration
talosctl -n <IP> get etcdconfig -o yaml
```
## etcd Management
### etcd Operations
```bash
# etcd cluster status
talosctl -n <IP1>,<IP2>,<IP3> etcd status
# etcd members
talosctl -n <IP> etcd members
# etcd snapshots
talosctl -n <IP> etcd snapshot db.snapshot
# etcd maintenance
talosctl -n <IP> etcd defrag
talosctl -n <IP> etcd alarm list
talosctl -n <IP> etcd alarm disarm
# Leadership management
talosctl -n <IP> etcd forfeit-leadership
```
### Disaster Recovery
```bash
# Bootstrap from snapshot
talosctl -n <IP> bootstrap --recover-from=./db.snapshot
talosctl -n <IP> bootstrap --recover-from=./db.snapshot --recover-skip-hash-check
```
## Upgrade and Maintenance
### OS Upgrades
```bash
# Upgrade Talos OS
talosctl upgrade --nodes <IP> --image ghcr.io/siderolabs/installer:v1.11.x
talosctl upgrade --nodes <IP> --image ghcr.io/siderolabs/installer:v1.11.x --stage
# Monitor upgrade progress
talosctl upgrade --nodes <IP> --image <image> --wait
talosctl upgrade --nodes <IP> --image <image> --wait --debug
# Rollback
talosctl rollback --nodes <IP>
```
## Resource System Commands
### Resource Management
```bash
# List resource types
talosctl get rd
# Get specific resources
talosctl get <resource-type>
talosctl get <resource-type> -o yaml
talosctl get <resource-type> --namespace=<namespace>
# Watch resources
talosctl get <resource-type> --watch
# Common resource types
talosctl get machineconfig
talosctl get members
talosctl get services
talosctl get networkconfig
talosctl get secrets
```
## Local Development
### Local Cluster Management
```bash
# Create local cluster
talosctl cluster create
talosctl cluster create --controlplanes 3 --workers 2
# Destroy local cluster
talosctl cluster destroy
# Show local cluster status
talosctl cluster show
```
## Advanced Usage Patterns
### Multi-Node Operations
```bash
# Run command on multiple nodes
talosctl -e <endpoint> -n <node1>,<node2>,<node3> <command>
# Different endpoint and target nodes
talosctl -e <public-endpoint> -n <internal-node1>,<internal-node2> <command>
```
### Output Formatting
```bash
# JSON output
talosctl -n <IP> get members -o json
# YAML output
talosctl -n <IP> get machineconfig -o yaml
# Table output (default)
talosctl -n <IP> get members -o table
# Custom column output
talosctl -n <IP> get members -o columns=HOSTNAME,MACHINE\ TYPE,OS
```
### Filtering and Selection
```bash
# Filter resources
talosctl get members --search <hostname>
talosctl get services --search kubelet
# Namespace filtering
talosctl get secrets --namespace=secrets
talosctl get affiliates --namespace=cluster-raw
```
## Common Command Workflows
### Initial Cluster Setup
```bash
# 1. Generate configurations
talosctl gen config cluster-name https://cluster-endpoint:6443
# 2. Apply to nodes
talosctl apply-config --nodes <controlplane-1> --file controlplane.yaml
talosctl apply-config --nodes <worker-1> --file worker.yaml
# 3. Bootstrap cluster
talosctl bootstrap --nodes <controlplane-1>
# 4. Get kubeconfig
talosctl kubeconfig --nodes <controlplane-1>
```
### Cluster Health Check
```bash
# Check all aspects of cluster health
talosctl -n <IP1>,<IP2>,<IP3> health --control-plane-nodes <IP1>,<IP2>,<IP3>
talosctl -n <IP1>,<IP2>,<IP3> etcd status
talosctl -n <IP1>,<IP2>,<IP3> service kubelet
kubectl get nodes
kubectl get pods --all-namespaces
```
### Node Troubleshooting
```bash
# System diagnostics
talosctl -n <IP> dmesg | tail -100
talosctl -n <IP> services | grep -v Running
talosctl -n <IP> logs kubelet | tail -50
talosctl -n <IP> events --since=1h
# Resource usage
talosctl -n <IP> memory
talosctl -n <IP> df
talosctl -n <IP> processes | head -20
```
This CLI reference provides the essential commands and patterns needed for day-to-day Talos cluster administration and troubleshooting.

View File

@@ -0,0 +1,239 @@
# Talos Cluster Operations Guide
This guide covers essential cluster operations for Talos Linux v1.11 administrators.
## Upgrading Operations
### Talos OS Upgrades
Talos uses an A-B image scheme for rollbacks. Each upgrade retains the previous kernel and OS image.
#### Upgrade Process
```bash
# Upgrade a single node
talosctl upgrade --nodes <IP> --image ghcr.io/siderolabs/installer:v1.11.x
# Use --stage flag if upgrade fails due to open files
talosctl upgrade --nodes <IP> --image ghcr.io/siderolabs/installer:v1.11.x --stage
# Monitor upgrade progress
talosctl dmesg -f
talosctl upgrade --wait --debug
```
#### Upgrade Sequence
1. Node cordons itself in Kubernetes
2. Node drains existing workloads
3. Internal processes shut down
4. Filesystems unmount
5. Disk verification and image upgrade
6. Bootloader set to boot once with new image
7. Node reboots
8. Node rejoins cluster and uncordons
#### Rollback
```bash
talosctl rollback --nodes <IP>
```
### Kubernetes Upgrades
Kubernetes upgrades are separate from OS upgrades and non-disruptive.
#### Automated Upgrade (Recommended)
```bash
# Check what will be upgraded
talosctl --nodes <controlplane> upgrade-k8s --to v1.34.1 --dry-run
# Perform upgrade
talosctl --nodes <controlplane> upgrade-k8s --to v1.34.1
```
#### Manual Component Upgrades
For manual control, patch each component individually:
**API Server:**
```bash
talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/cluster/apiServer/image", "value": "registry.k8s.io/kube-apiserver:v1.34.1"}]'
```
**Controller Manager:**
```bash
talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/cluster/controllerManager/image", "value": "registry.k8s.io/kube-controller-manager:v1.34.1"}]'
```
**Scheduler:**
```bash
talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/cluster/scheduler/image", "value": "registry.k8s.io/kube-scheduler:v1.34.1"}]'
```
**Kubelet:**
```bash
talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/machine/kubelet/image", "value": "ghcr.io/siderolabs/kubelet:v1.34.1"}]'
```
## Node Management
### Adding Control Plane Nodes
1. Apply machine configuration to new node
2. Node automatically joins etcd cluster via control plane endpoint
3. Control plane components start automatically
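In practice this amounts to the following sketch (`<new-ip>` and the file name are placeholders; generate the config with the reproducible workflow described below):
```bash
# Add a control plane node (sketch)
talosctl apply-config --nodes <new-ip> --file controlplane.yaml

# Watch it join etcd and the cluster
talosctl -n <existing-controlplane-ip> etcd members
talosctl get members
kubectl get nodes -w
```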
### Removing Control Plane Nodes
```bash
# Recommended approach - reset then delete
talosctl -n <IP.of.node.to.remove> reset
kubectl delete node <node-name>
```
### Adding Worker Nodes
1. Apply worker machine configuration
2. Node automatically joins via bootstrap token
### Removing Worker Nodes
```bash
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <node-name>
talosctl -n <IP> reset
```
## Configuration Management
### Applying Configuration Changes
```bash
# Apply config with automatic mode detection
talosctl apply-config --nodes <IP> --file <config.yaml>
# Apply with specific modes
talosctl apply-config --nodes <IP> --file <config.yaml> --mode no-reboot
talosctl apply-config --nodes <IP> --file <config.yaml> --mode reboot
talosctl apply-config --nodes <IP> --file <config.yaml> --mode staged
# Dry run to preview changes
talosctl apply-config --nodes <IP> --file <config.yaml> --dry-run
```
### Configuration Patching
```bash
# Patch machine configuration
talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/machine/logging/destinations/0/endpoint", "value": "tcp://new-endpoint:514"}]'
# Patch with file
talosctl -n <IP> patch mc --patch @patch.yaml
```
### Retrieving Current Configuration
```bash
# Get machine configuration
talosctl -n <IP> get mc v1alpha1 -o yaml
# Get effective configuration
talosctl -n <IP> get machineconfig -o yaml
```
## Cluster Health Monitoring
### Node Status
```bash
# Check node status
talosctl -n <IP> get members
talosctl -n <IP> health
# Check system services
talosctl -n <IP> services
talosctl -n <IP> service <service-name>
```
### Resource Monitoring
```bash
# System resources
talosctl -n <IP> memory
talosctl -n <IP> cpu
talosctl -n <IP> disks
# Process information
talosctl -n <IP> processes
talosctl -n <IP> cgroups --preset memory
```
### Log Monitoring
```bash
# Kernel logs
talosctl -n <IP> dmesg
talosctl -n <IP> dmesg -f # Follow mode
# Service logs
talosctl -n <IP> logs <service-name>
talosctl -n <IP> logs kubelet
```
## Control Plane Best Practices
### Cluster Sizing Recommendations
- **3 nodes**: Sufficient for most use cases, tolerates 1 node failure
- **5 nodes**: Better availability (tolerates 2 node failures), higher resource cost
- **Avoid even numbers**: 2 or 4 nodes tolerate no more failures than 1 or 3, while adding more ways to lose quorum
### Node Replacement Strategy
- **Failed node**: Remove first, then add replacement
- **Healthy node**: Add replacement first, then remove old node
### Performance Considerations
- etcd performance decreases as cluster scales
- 5-node cluster commits ~5% fewer writes than 3-node cluster
- Vertically scale nodes for performance, don't add more nodes
## Machine Configuration Versioning
### Reproducible Configuration Workflow
Store only:
- `secrets.yaml` (generated once at cluster creation)
- Patch files (YAML/JSON patches describing differences from defaults)
Generate configs when needed:
```bash
# Generate fresh configs with existing secrets
talosctl gen config <cluster-name> <cluster-endpoint> --with-secrets secrets.yaml
# Apply patches to generated configs
talosctl gen config <cluster-name> <cluster-endpoint> --with-secrets secrets.yaml --config-patch @patch.yaml
```
This prevents configuration drift after automated upgrades.
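A sketch of what such a patch file might contain, here a strategic-merge YAML patch setting a hostname and install disk (values are illustrative):
```bash
# Patch stored in version control and applied at generation time (sketch)
cat > patches/cp-1.yaml <<'EOF'
machine:
  network:
    hostname: cp-1
  install:
    disk: /dev/nvme0n1
EOF

talosctl gen config <cluster-name> <cluster-endpoint> \
  --with-secrets secrets.yaml --config-patch @patches/cp-1.yaml
```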
## Troubleshooting Common Issues
### Upgrade Failures
- **Invalid installer image**: Check image reference and network connectivity
- **Filesystem unmount failure**: Use `--stage` flag
- **Boot failure**: System automatically rolls back to previous version
- **Workload issues**: Use `talosctl rollback` to revert
### Node Join Issues
- Verify network connectivity to control plane endpoint
- Check discovery service configuration
- Validate machine configuration syntax
- Ensure bootstrap process completed on initial control plane node
### Control Plane Quorum Loss
- Identify healthy nodes with `talosctl etcd status`
- Follow disaster recovery procedures if quorum cannot be restored
- Use etcd snapshots for cluster recovery
## Security Considerations
### Certificate Rotation
Talos automatically rotates certificates, but monitor expiration:
```bash
talosctl -n <IP> get secrets
```
### Pod Security
Control plane nodes are tainted by default to prevent workload scheduling. This protects:
- Control plane from resource starvation
- Credentials from workload exposure
### Network Security
- All API communication uses mutual TLS (mTLS)
- Discovery service data is encrypted before transmission
- WireGuard (KubeSpan) provides mesh networking security

View File

@@ -0,0 +1,344 @@
# Discovery and Networking Guide
This guide covers Talos cluster discovery mechanisms, network configuration, and connectivity troubleshooting.
## Cluster Discovery System
Talos includes built-in node discovery that allows cluster members to find each other and maintain membership information.
### Discovery Registries
#### Service Registry (Default)
- **External Service**: Uses public discovery service at `https://discovery.talos.dev/`
- **Encryption**: All data encrypted with AES-GCM before transmission
- **Functionality**: Works without dependency on etcd/Kubernetes
- **Advantages**: Available even when control plane is down
#### Kubernetes Registry (Deprecated)
- **Data Source**: Uses Kubernetes Node resources and annotations
- **Limitation**: Incompatible with Kubernetes 1.32+ due to AuthorizeNodeWithSelectors
- **Status**: Disabled by default, deprecated
### Discovery Configuration
```yaml
cluster:
discovery:
enabled: true
registries:
service:
disabled: false # Default
kubernetes:
disabled: true # Deprecated, disabled by default
```
**To disable service registry**:
```yaml
cluster:
discovery:
enabled: true
registries:
service:
disabled: true
```
## Discovery Data Flow
### Service Registry Process
1. **Data Encryption**: Node encrypts affiliate data with cluster key
2. **Endpoint Encryption**: Endpoints separately encrypted for deduplication
3. **Data Submission**: Node submits own data + observed peer endpoints
4. **Server Processing**: Discovery service aggregates and deduplicates data
5. **Data Distribution**: Encrypted updates sent to all cluster members
6. **Local Processing**: Nodes decrypt data for cluster discovery and KubeSpan
### Data Protection
- **Cluster Isolation**: Cluster ID used as key selector
- **End-to-End Encryption**: Discovery service cannot decrypt affiliate data
- **Memory-Only Storage**: Data stored in memory with encrypted snapshots
- **No Sensitive Exposure**: Service only sees encrypted blobs and cluster metadata
## Discovery Resources
### Node Identity
```bash
# View node's unique identity
talosctl get identities -o yaml
```
**Output**:
```yaml
spec:
nodeId: Utoh3O0ZneV0kT2IUBrh7TgdouRcUW2yzaaMl4VXnCd
```
**Identity Characteristics**:
- Base62 encoded random 32 bytes
- URL-safe encoding
- Preserved in STATE partition (`node-identity.yaml`)
- Survives reboots and upgrades
- Regenerated on reset/wipe
### Affiliates (Proposed Members)
```bash
# View discovered affiliates (proposed cluster members)
talosctl get affiliates
```
**Output**:
```
ID VERSION HOSTNAME MACHINE TYPE ADDRESSES
2VfX3nu67ZtZPl57IdJrU87BMjVWkSBJiL9ulP9TCnF 2 talos-default-controlplane-2 controlplane ["172.20.0.3","fd83:b1f7:fcb5:2802:986b:7eff:fec5:889d"]
```
### Members (Approved Members)
```bash
# View cluster members
talosctl get members
```
**Output**:
```
ID VERSION HOSTNAME MACHINE TYPE OS ADDRESSES
talos-default-controlplane-1 2 talos-default-controlplane-1 controlplane Talos (v1.11.0) ["172.20.0.2","fd83:b1f7:fcb5:2802:8c13:71ff:feaf:7c94"]
```
### Raw Registry Data
```bash
# View data from specific registries
talosctl get affiliates --namespace=cluster-raw
```
**Output shows registry sources**:
```
ID VERSION HOSTNAME
k8s/2VfX3nu67ZtZPl57IdJrU87BMjVWkSBJiL9ulP9TCnF 3 talos-default-controlplane-2
service/2VfX3nu67ZtZPl57IdJrU87BMjVWkSBJiL9ulP9TCnF 23 talos-default-controlplane-2
```
## Network Architecture
### Network Layers
#### Host Networking
- **Node-to-Node**: Direct IP connectivity between cluster nodes
- **Control Plane**: API server communication via control plane endpoint
- **Discovery**: HTTPS connection to discovery service (port 443)
#### Container Networking
- **CNI**: Container Network Interface for pod networking
- **Service Mesh**: Optional service mesh implementations
- **Network Policies**: Kubernetes network policy enforcement
#### Optional: KubeSpan (WireGuard Mesh)
- **Mesh Networking**: Full mesh WireGuard connections
- **Discovery Integration**: Uses discovery service for peer coordination
- **Encryption**: WireGuard public keys distributed via discovery
- **Use Cases**: Multi-cloud, hybrid, NAT traversal
### Network Configuration Patterns
#### Basic Network Setup
```yaml
machine:
network:
interfaces:
- interface: eth0
dhcp: true
```
#### Static IP Configuration
```yaml
machine:
network:
interfaces:
- interface: eth0
addresses:
- 192.168.1.100/24
routes:
- network: 0.0.0.0/0
gateway: 192.168.1.1
mtu: 1500
nameservers:
- 8.8.8.8
- 1.1.1.1
```
#### Multiple Interface Configuration
```yaml
machine:
network:
interfaces:
- interface: eth0 # Management interface
dhcp: true
- interface: eth1 # Kubernetes traffic
addresses:
- 10.0.1.100/24
routes:
- network: 10.0.0.0/16
gateway: 10.0.1.1
```
## KubeSpan Configuration
### Basic KubeSpan Setup
```yaml
machine:
network:
kubespan:
enabled: true
```
### Advanced KubeSpan Configuration
```yaml
machine:
network:
kubespan:
enabled: true
advertiseKubernetesNetworks: true
allowDownPeerBypass: true
mtu: 1420 # Account for WireGuard overhead
filters:
endpoints:
- 0.0.0.0/0 # Allow all endpoints
```
**KubeSpan Features**:
- Automatic peer discovery via discovery service
- NAT traversal capabilities
- Encrypted mesh networking
- Kubernetes network advertisement
- Fault tolerance with peer bypass
## Network Troubleshooting
### Discovery Issues
#### Check Discovery Service Connectivity
```bash
# Test connectivity to discovery service
talosctl get affiliates
# Check discovery configuration
talosctl get discoveryconfig -o yaml
# Monitor discovery events
talosctl events --tail
```
#### Common Discovery Problems
1. **No Affiliates Discovered**:
- Check discovery service connectivity
- Verify cluster ID matches across nodes
- Confirm discovery is enabled
2. **Partial Affiliate List**:
- Network connectivity issues between nodes
- Discovery service regional availability
- Firewall blocking discovery traffic
3. **Discovery Service Unreachable**:
- Network connectivity to discovery.talos.dev:443
- Corporate firewall/proxy configuration
- DNS resolution issues
### Network Connectivity Testing
#### Basic Network Tests
```bash
# Test network interfaces
talosctl get addresses
talosctl get routes
talosctl get nodeaddresses
# Check network configuration
talosctl get networkconfig -o yaml
# Test reachability (talosctl has no ping command; test from another host or a debug pod)
kubectl run -it --rm net-test --image=busybox:1.36 --restart=Never -- ping -c 3 <target-ip>
```
#### Inter-Node Connectivity
```bash
# Test control plane endpoint
talosctl health --control-plane-nodes <IP1>,<IP2>,<IP3>
# Check etcd connectivity
talosctl -n <IP> etcd members
# Test Kubernetes API
kubectl get nodes
```
#### KubeSpan Troubleshooting
```bash
# Check KubeSpan status
talosctl get kubespanpeerspecs
talosctl get kubespanpeerstatuses
# Monitor WireGuard connections
talosctl -n <IP> interfaces
# Check KubeSpan logs
talosctl -n <IP> logs controller-runtime | grep kubespan
```
### Network Performance Optimization
#### Network Interface Tuning
```yaml
machine:
network:
interfaces:
- interface: eth0
mtu: 9000 # Jumbo frames if supported
dhcp: true
```
#### KubeSpan Performance
- Adjust MTU for WireGuard overhead (typically -80 bytes)
- Consider endpoint filters for large clusters
- Monitor WireGuard peer connection stability
## Security Considerations
### Discovery Security
- **Encrypted Communication**: All discovery data encrypted end-to-end
- **Cluster Isolation**: Cluster ID prevents cross-cluster data access
- **No Sensitive Data**: Only encrypted metadata transmitted
- **Network Security**: HTTPS transport with certificate validation
### Network Security
- **mTLS**: All Talos API communication uses mutual TLS
- **Certificate Rotation**: Automatic certificate lifecycle management
- **Network Policies**: Implement Kubernetes network policies for workloads
- **Firewall Rules**: Restrict network access to necessary ports only
### Required Network Ports
- **6443**: Kubernetes API server
- **2379-2380**: etcd client/peer communication
- **10250**: kubelet API
- **50000**: Talos API (apid)
- **443**: Discovery service (outbound)
- **51820**: KubeSpan WireGuard (if enabled)
## Operational Best Practices
### Monitoring
- Monitor discovery service connectivity
- Track cluster member changes
- Alert on network partitions
- Monitor KubeSpan peer status
### Backup and Recovery
- Document network configuration
- Backup discovery service configuration
- Test network recovery procedures
- Plan for discovery service outages
### Scaling Considerations
- Discovery service scales to thousands of nodes
- KubeSpan mesh scales to hundreds of nodes efficiently
- Consider network segmentation for large clusters
- Plan for multi-region deployments
This networking foundation enables Talos clusters to maintain connectivity and membership across various network topologies while providing security and performance optimization options.

View File

@@ -0,0 +1,287 @@
# etcd Management and Disaster Recovery Guide
This guide covers etcd database operations, maintenance, and disaster recovery procedures for Talos Linux clusters.
## etcd Health Monitoring
### Basic Health Checks
```bash
# Check etcd status across all control plane nodes
talosctl -n <IP1>,<IP2>,<IP3> etcd status
# Check etcd alarms
talosctl -n <IP> etcd alarm list
# Check etcd members
talosctl -n <IP> etcd members
# Check service status
talosctl -n <IP> service etcd
```
### Understanding etcd Status Output
```
NODE MEMBER DB SIZE IN USE LEADER RAFT INDEX RAFT TERM RAFT APPLIED INDEX LEARNER ERRORS
172.20.0.2 a49c021e76e707db 17 MB 4.5 MB (26.10%) ecebb05b59a776f1 53391 4 53391 false
```
**Key Metrics**:
- **DB SIZE**: Total database size on disk
- **IN USE**: Actual data size (fragmentation = DB SIZE - IN USE)
- **LEADER**: Current etcd cluster leader
- **RAFT INDEX**: Consensus log position
- **LEARNER**: Whether node is still joining cluster
## Space Quota Management
### Default Configuration
- Default space quota: 2 GiB
- Recommended maximum: 8 GiB
- Database locks when quota exceeded
### Quota Exceeded Handling
**Symptoms**:
```bash
talosctl -n <IP> etcd alarm list
# Output: ALARM: NOSPACE
```
**Resolution**:
1. Increase quota in machine configuration:
```yaml
cluster:
etcd:
extraArgs:
quota-backend-bytes: 4294967296 # 4 GiB
```
2. Apply configuration and reboot:
```bash
talosctl -n <IP> apply-config --file updated-config.yaml --mode reboot
```
3. Clear the alarm:
```bash
talosctl -n <IP> etcd alarm disarm
```
## Database Defragmentation
### When to Defragment
- In use/DB size ratio < 0.5 (heavily fragmented)
- Database size exceeds quota but actual data is small
- Performance degradation due to fragmentation
### Defragmentation Process
```bash
# Check fragmentation status
talosctl -n <IP1>,<IP2>,<IP3> etcd status
# Defragment single node (resource-intensive operation)
talosctl -n <IP1> etcd defrag
# Verify defragmentation results
talosctl -n <IP1> etcd status
```
**Important Notes**:
- Defragment one node at a time
- Operation blocks reads/writes during execution
- Can significantly improve performance if heavily fragmented
### Post-Defragmentation Verification
After successful defragmentation, DB size should closely match IN USE size:
```
NODE MEMBER DB SIZE IN USE
172.20.0.2 a49c021e76e707db 4.5 MB 4.5 MB (100.00%)
```
## Backup Operations
### Regular Snapshots
```bash
# Create consistent snapshot
talosctl -n <IP> etcd snapshot db.snapshot
```
**Output Example**:
```
etcd snapshot saved to "db.snapshot" (2015264 bytes)
snapshot info: hash c25fd181, revision 4193, total keys 1287, total size 3035136
```
### Disaster Snapshots
When etcd cluster is unhealthy and normal snapshot fails:
```bash
# Copy database directly (may be inconsistent)
talosctl -n <IP> cp /var/lib/etcd/member/snap/db .
```
### Automated Backup Strategy
- Schedule regular snapshots (daily/hourly based on change frequency)
- Store snapshots in multiple locations
- Test restore procedures regularly
- Document recovery procedures
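A minimal sketch of such a schedule, run from any machine with talosctl access (destination path, retention count, and the node placeholder are assumptions):
```bash
#!/usr/bin/env bash
# Periodic etcd snapshot with simple retention (sketch; run from cron or a systemd timer)
set -euo pipefail
NODE="<controlplane-ip>"
DEST="/backups/etcd"
KEEP=14

mkdir -p "${DEST}"
talosctl -n "${NODE}" etcd snapshot "${DEST}/etcd-$(date +%Y%m%d-%H%M%S).snapshot"

# Keep only the newest ${KEEP} snapshots
ls -1t "${DEST}"/etcd-*.snapshot | tail -n +$((KEEP + 1)) | xargs -r rm --
```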
## Disaster Recovery
### Pre-Recovery Assessment
**Check if Recovery is Necessary**:
```bash
# Query etcd health on all control plane nodes
talosctl -n <IP1>,<IP2>,<IP3> service etcd
# Check member list consistency
talosctl -n <IP1> etcd members
talosctl -n <IP2> etcd members
talosctl -n <IP3> etcd members
```
**Recovery is needed when**:
- Quorum is lost (majority of nodes down)
- etcd data corruption
- Complete cluster failure
### Recovery Prerequisites
1. **Latest etcd snapshot** (preferably consistent)
2. **Machine configuration backup**:
```bash
talosctl -n <IP> get mc v1alpha1 -o yaml | yq eval '.spec' -
```
3. **No init-type nodes** (deprecated, incompatible with recovery)
### Recovery Procedure
#### Step 1: Prepare Control Plane Nodes
```bash
# If nodes have hardware issues, replace them with same configuration
# If nodes are running but etcd is corrupted, wipe EPHEMERAL partition:
talosctl -n <IP> reset --graceful=false --reboot --system-labels-to-wipe=EPHEMERAL
```
#### Step 2: Verify etcd State
All etcd services should be in "Preparing" state:
```bash
talosctl -n <IP> service etcd
# Expected: STATE: Preparing
```
#### Step 3: Bootstrap from Snapshot
```bash
# Bootstrap cluster from snapshot
talosctl -n <IP> bootstrap --recover-from=./db.snapshot
# For direct database copies, skip hash check:
talosctl -n <IP> bootstrap --recover-from=./db --recover-skip-hash-check
```
#### Step 4: Verify Recovery
**Monitor kernel logs** for recovery progress:
```bash
talosctl -n <IP> dmesg -f
```
**Expected log entries**:
```
recovering etcd from snapshot: hash c25fd181, revision 4193, total keys 1287, total size 3035136
{"level":"info","msg":"restored snapshot","path":"/var/lib/etcd.snapshot"}
```
**Verify cluster health**:
```bash
# etcd should become healthy on bootstrap node
talosctl -n <IP> service etcd
# Kubernetes control plane should start
kubectl get nodes
# Other control plane nodes should join automatically
talosctl -n <IP1>,<IP2>,<IP3> etcd status
```
## etcd Version Management
### Downgrade Process (v3.6 to v3.5)
**Prerequisites**:
- Healthy cluster running v3.6.x
- Recent backup snapshot
- Downgrade only one minor version at a time
#### Step 1: Validate Downgrade
```bash
talosctl -n <IP1> etcd downgrade validate 3.5
```
#### Step 2: Enable Downgrade
```bash
talosctl -n <IP1> etcd downgrade enable 3.5
```
#### Step 3: Verify Schema Migration
```bash
# Check storage version migrated to 3.5
talosctl -n <IP1>,<IP2>,<IP3> etcd status
# Verify STORAGE column shows 3.5.0
```
#### Step 4: Patch Machine Configuration
```bash
# Transfer leadership if node is leader
talosctl -n <IP1> etcd forfeit-leadership
# Create patch file
cat > etcd-patch.yaml <<EOF
cluster:
etcd:
image: gcr.io/etcd-development/etcd:v3.5.22
EOF
# Apply patch with reboot
talosctl -n <IP1> patch machineconfig --patch @etcd-patch.yaml --mode reboot
```
#### Step 5: Repeat for All Control Plane Nodes
Continue patching remaining control plane nodes one by one.
## Operational Best Practices
### Monitoring
- Monitor database size and fragmentation regularly
- Set up alerts for space quota approaching limits
- Track etcd performance metrics (request latency, leader changes)
- Monitor disk I/O and network latency
### Maintenance Windows
- Schedule defragmentation during low-traffic periods
- Coordinate with application teams for maintenance windows
- Test backup/restore procedures in non-production environments
### Performance Optimization
- Use fast storage (NVMe SSDs preferred)
- Minimize network latency between control plane nodes
- Monitor and tune etcd configuration based on workload
### Security
- Encrypt etcd data at rest
- Secure backup storage with appropriate access controls
- Regularly rotate certificates
- Monitor for unauthorized access attempts
## Troubleshooting Common Issues
### Split Brain Prevention
- Ensure odd number of control plane nodes
- Monitor network connectivity between nodes
- Use dedicated network for control plane communication when possible
### Performance Issues
- Check disk I/O latency
- Monitor memory usage
- Consider vertical scaling before adding nodes
- Review etcd request patterns and optimize applications
### Backup/Restore Issues
- Test restore procedures regularly
- Verify backup integrity
- Ensure consistent network and storage configuration
- Document and practice disaster recovery procedures

View File

@@ -0,0 +1,480 @@
# Talos Troubleshooting Guide
This guide provides systematic approaches to diagnosing and resolving common Talos cluster issues.
## General Troubleshooting Methodology
### 1. Gather Information
```bash
# Node status and health
talosctl -n <IP> health
talosctl -n <IP> version
talosctl -n <IP> get members
# System resources
talosctl -n <IP> memory
talosctl -n <IP> disks
talosctl -n <IP> processes | head -20
# Service status
talosctl -n <IP> services
```
### 2. Check Logs
```bash
# Kernel logs (system-level issues)
talosctl -n <IP> dmesg | tail -100
# Service logs
talosctl -n <IP> logs machined
talosctl -n <IP> logs kubelet
talosctl -n <IP> logs containerd
# System events
talosctl -n <IP> events --since=1h
```
### 3. Network Connectivity
```bash
# Discovery and membership
talosctl get affiliates
talosctl get members
# Network interfaces
talosctl -n <IP> interfaces
talosctl -n <IP> get addresses
# Control plane connectivity
kubectl get nodes
talosctl -n <IP1>,<IP2>,<IP3> etcd status
```
## Bootstrap and Initial Setup Issues
### Cluster Bootstrap Failures
**Symptoms**: Bootstrap command fails or times out
**Diagnosis**:
```bash
# Check etcd service state
talosctl -n <IP> service etcd
# Check if node is trying to join instead of bootstrap
talosctl -n <IP> logs etcd | grep -i bootstrap
# Verify machine configuration
talosctl -n <IP> get machineconfig -o yaml
```
**Common Causes & Solutions**:
1. **Wrong node type**: Ensure using `controlplane`, not deprecated `init`
2. **Network issues**: Verify control plane endpoint connectivity
3. **Configuration errors**: Check machine configuration validity
4. **Previous bootstrap**: etcd data exists from previous attempts
**Resolution**:
```bash
# Reset node if previous bootstrap data exists
talosctl -n <IP> reset --graceful=false --reboot --system-labels-to-wipe=EPHEMERAL
# Re-apply configuration and bootstrap
talosctl apply-config --nodes <IP> --file controlplane.yaml
talosctl bootstrap --nodes <IP>
```
### Node Join Issues
**Symptoms**: New nodes don't join cluster
**Diagnosis**:
```bash
# Check discovery
talosctl get affiliates
talosctl get members
# Check bootstrap token
kubectl get secrets -n kube-system | grep bootstrap-token
# Check kubelet logs
talosctl -n <IP> logs kubelet | grep -i certificate
```
**Common Solutions**:
```bash
# Regenerate bootstrap token if expired
kubeadm token create --print-join-command
# Verify discovery service connectivity
talosctl -n <IP> get affiliates --namespace=cluster-raw
# Check machine configuration matches cluster
talosctl -n <IP> get machineconfig -o yaml
```
## Control Plane Issues
### etcd Problems
**etcd Won't Start**:
```bash
# Check etcd service status and logs
talosctl -n <IP> service etcd
talosctl -n <IP> logs etcd
# Check etcd data directory
talosctl -n <IP> list /var/lib/etcd
# Check disk space and permissions
talosctl -n <IP> df
```
**etcd Quorum Loss**:
```bash
# Check member status
talosctl -n <IP1>,<IP2>,<IP3> etcd status
talosctl -n <IP> etcd members
# Identify healthy members
for ip in IP1 IP2 IP3; do
echo "=== Node $ip ==="
talosctl -n $ip service etcd
done
```
**Solution for Quorum Loss**:
1. If majority available: Remove failed members, add replacements
2. If majority lost: Follow disaster recovery procedure
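When a majority is still available (case 1), replacement looks roughly like this sketch (verify the `etcd remove-member` subcommand against your talosctl version):
```bash
# 1. From a healthy control plane node, drop the dead member
talosctl -n <healthy-ip> etcd members
talosctl -n <healthy-ip> etcd remove-member <failed-hostname>

# 2. Remove the node object from Kubernetes
kubectl delete node <failed-node-name>

# 3. Bring up a replacement with the same controlplane configuration
talosctl apply-config --nodes <replacement-ip> --file controlplane.yaml
talosctl -n <healthy-ip> etcd members   # confirm the new member joined
```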
### API Server Issues
**API Server Not Responding**:
```bash
# Check API server pod status
kubectl get pods -n kube-system | grep apiserver
# Check API server configuration
talosctl -n <IP> get apiserverconfig -o yaml
# Check control plane endpoint
curl -k https://<control-plane-endpoint>:6443/healthz
```
**Common Solutions**:
```bash
# Restart kubelet to reload static pods
talosctl -n <IP> service kubelet restart
# Check for configuration issues
talosctl -n <IP> logs kubelet | grep apiserver
# Verify etcd connectivity
talosctl -n <IP> etcd status
```
## Node-Level Issues
### Kubelet Problems
**Kubelet Service Issues**:
```bash
# Check kubelet status and logs
talosctl -n <IP> service kubelet
talosctl -n <IP> logs kubelet | tail -50
# Check kubelet configuration
talosctl -n <IP> get kubeletconfig -o yaml
# Check container runtime
talosctl -n <IP> service containerd
```
**Common Kubelet Issues**:
1. **Certificate problems**: Check certificate expiration and rotation
2. **Container runtime issues**: Verify containerd health
3. **Resource constraints**: Check memory and disk space
4. **Network connectivity**: Verify API server connectivity
### Container Runtime Issues
**Containerd Problems**:
```bash
# Check containerd service
talosctl -n <IP> service containerd
talosctl -n <IP> logs containerd
# List containers
talosctl -n <IP> containers
talosctl -n <IP> containers -k # Kubernetes containers
# Check containerd configuration
talosctl -n <IP> read /etc/cri/conf.d/cri.toml
```
**Common Solutions**:
```bash
# Restart containerd
talosctl -n <IP> service containerd restart
# Check disk space for container images
talosctl -n <IP> df
# Clean up unused containers/images
# (This happens automatically via kubelet GC)
```
## Network Issues
### Network Connectivity Problems
**Node-to-Node Connectivity**:
```bash
# Test basic network connectivity
talosctl -n <IP1> interfaces
talosctl -n <IP1> get routes
# Test specific connectivity
talosctl -n <IP1> read /etc/resolv.conf
# Check network configuration
talosctl -n <IP> get networkconfig -o yaml
```
**DNS Resolution Issues**:
```bash
# Check DNS configuration
talosctl -n <IP> read /etc/resolv.conf
# Test DNS resolution from a pod (Talos has no node shell)
kubectl run -it --rm dns-test --image=busybox:1.36 --restart=Never -- nslookup kubernetes.default.svc.cluster.local
```
### Discovery Service Issues
**Discovery Not Working**:
```bash
# Check discovery configuration
talosctl get discoveryconfig -o yaml
# Check affiliate discovery
talosctl get affiliates
talosctl get affiliates --namespace=cluster-raw
# Test discovery service connectivity
curl -v https://discovery.talos.dev/
```
**KubeSpan Issues** (if enabled):
```bash
# Check KubeSpan configuration
talosctl get kubespanconfig -o yaml
# Check peer status
talosctl get kubespanpeerspecs
talosctl get kubespanpeerstatuses
# Check WireGuard interface
talosctl -n <IP> interfaces | grep kubespan
```
## Upgrade Issues
### OS Upgrade Problems
**Upgrade Fails or Hangs**:
```bash
# Check upgrade status
talosctl -n <IP> dmesg | grep -i upgrade
talosctl -n <IP> events | grep -i upgrade
# Use staged upgrade for filesystem lock issues
talosctl upgrade --nodes <IP> --image <image> --stage
# Monitor upgrade progress
talosctl upgrade --nodes <IP> --image <image> --wait --debug
```
**Boot Issues After Upgrade**:
```bash
# Check boot logs
talosctl -n <IP> dmesg | head -100
# System automatically rolls back on boot failure
# Check current version
talosctl -n <IP> version
# Manual rollback if needed
talosctl rollback --nodes <IP>
```
### Kubernetes Upgrade Issues
**K8s Upgrade Failures**:
```bash
# Check upgrade status
talosctl --nodes <controlplane> upgrade-k8s --to <version> --dry-run
# Check individual component status
kubectl get pods -n kube-system
talosctl -n <IP> get apiserverconfig -o yaml
```
**Version Mismatch Issues**:
```bash
# Check version consistency
kubectl get nodes -o wide
talosctl -n <IP1>,<IP2>,<IP3> version
# Check component versions
kubectl get pods -n kube-system -o wide
```
## Resource and Performance Issues
### Memory and Storage Problems
**Out of Memory**:
```bash
# Check memory usage
talosctl -n <IP> memory
talosctl -n <IP> processes --sort-by=memory | head -20
# Check for memory pressure
kubectl describe node <node-name> | grep -A 10 Conditions
# Check OOM events
talosctl -n <IP> dmesg | grep -i "out of memory"
```
**Disk Space Issues**:
```bash
# Check disk usage
talosctl -n <IP> df
talosctl -n <IP> disks
# Check specific directories
talosctl -n <IP> list /var/lib/containerd
talosctl -n <IP> list /var/lib/etcd
# Clean up if needed (automatic GC usually handles this)
kubectl describe node <node-name> | grep -A 5 "Disk Pressure"
```
### Performance Issues
**Slow Cluster Response**:
```bash
# Check API server response time
time kubectl get nodes
# Check etcd performance
talosctl -n <IP> etcd status
# Look for high DB size vs IN USE ratio (fragmentation)
# Check system load
talosctl -n <IP> cpu
talosctl -n <IP> memory
```
**High CPU/Memory Usage**:
```bash
# Identify resource-heavy processes
talosctl -n <IP> processes --sort-by=cpu | head -10
talosctl -n <IP> processes --sort-by=memory | head -10
# Check cgroup usage
talosctl -n <IP> cgroups --preset memory
talosctl -n <IP> cgroups --preset cpu
```
## Configuration Issues
### Machine Configuration Problems
**Invalid Configuration**:
```bash
# Validate configuration before applying
talosctl validate -f machineconfig.yaml
# Check current configuration
talosctl -n <IP> get machineconfig -o yaml
# Compare with expected configuration
diff <(talosctl -n <IP> get mc v1alpha1 -o yaml) expected-config.yaml
```
**Configuration Drift**:
```bash
# Check configuration version
talosctl -n <IP> get machineconfig
# Re-apply configuration if needed
talosctl apply-config --nodes <IP> --file corrected-config.yaml --dry-run
talosctl apply-config --nodes <IP> --file corrected-config.yaml
```
## Emergency Procedures
### Node Unresponsive
**Complete Node Failure**:
1. **Physical access required**: Power cycle or hardware reset
2. **Check hardware**: Memory, disk, network interface status
3. **Boot issues**: May require bootable recovery media
**Partial Connectivity**:
```bash
# Try different network interfaces if multiple available
talosctl -e <alternate-ip> -n <IP> health
# Check if specific services are running
talosctl -n <IP> service machined
talosctl -n <IP> service apid
```
### Cluster-Wide Failures
**All Control Plane Nodes Down**:
1. **Assess scope**: Determine if data corruption or hardware failure
2. **Recovery strategy**: Use etcd backup if available
3. **Rebuild process**: May require complete cluster rebuild
**Follow disaster recovery procedures** as documented in etcd-management.md.
### Emergency Reset Procedures
**Single Node Reset**:
```bash
# Graceful reset (preserves some data)
talosctl -n <IP> reset
# Force reset (wipes all data)
talosctl -n <IP> reset --graceful=false --reboot
# Selective wipe (preserve STATE partition)
talosctl -n <IP> reset --system-labels-to-wipe=EPHEMERAL
```
**Cluster Reset** (DESTRUCTIVE):
```bash
# Reset all nodes (DANGER: DATA LOSS)
for ip in IP1 IP2 IP3; do
talosctl -n $ip reset --graceful=false --reboot
done
```
## Monitoring and Alerting
### Key Metrics to Monitor
- Node resource usage (CPU, memory, disk)
- etcd health and performance
- Control plane component status
- Network connectivity
- Certificate expiration
- Discovery service connectivity
### Log Locations for External Monitoring
- Kernel logs: `talosctl dmesg`
- Service logs: `talosctl logs <service>`
- System events: `talosctl events`
- Kubernetes events: `kubectl get events`
This troubleshooting guide provides systematic approaches to identify and resolve the most common issues encountered in Talos cluster operations.

View File

@@ -0,0 +1,188 @@
# Wild Cloud Agent Context Documentation
This directory contains comprehensive documentation about the Wild Cloud project, designed to provide AI agents (like Claude Code) with the context needed to effectively help users with Wild Cloud development, deployment, and operations.
## Documentation Overview
### 📚 Core Documentation Files
1. **[overview.md](./overview.md)** - Complete project introduction and getting started guide
- What Wild Cloud is and why it exists
- Technology stack and architecture overview
- Quick start guide and common use cases
- Best practices and troubleshooting
2. **[bin-scripts.md](./bin-scripts.md)** - Complete CLI reference
- All 34+ `wild-*` commands with usage examples
- Command categories (setup, apps, config, operations)
- Script dependencies and execution order
- Common usage patterns
3. **[setup-process.md](./setup-process.md)** - Infrastructure deployment deep dive
- Complete setup phases and dependencies
- Talos Linux and Kubernetes cluster deployment
- Core services installation (MetalLB, Traefik, cert-manager, etc.)
- Network configuration and DNS management
4. **[apps-system.md](./apps-system.md)** - Application management system
- App structure and lifecycle management
- Template system and configuration
- Available applications and their features
- Creating custom applications
5. **[configuration-system.md](./configuration-system.md)** - Configuration and secrets management
- `config.yaml` and `secrets.yaml` structure
- Template processing with gomplate
- Environment setup and validation
- Security best practices
6. **[project-architecture.md](./project-architecture.md)** - Project structure and organization
- Wild Cloud repository structure
- User cloud directory layout
- File permissions and security model
- Development and deployment patterns
## Quick Reference Guide
### Essential Commands
```bash
# Setup & Initialization
wild-init # Initialize new cloud
wild-setup # Complete deployment
wild-health # System health check
# Application Management
wild-apps-list # List available apps
wild-app-add <app> # Configure app
wild-app-deploy <app> # Deploy app
# Configuration
wild-config <key> # Read config
wild-config-set <key> <val> # Set config
wild-secret <key> # Read secret
```
### Key File Locations
**Wild Cloud Repository** (`WC_ROOT`):
- `bin/` - All CLI commands
- `apps/` - Application templates
- `setup/` - Infrastructure templates
- `docs/` - Documentation
**User Cloud Directory** (`WC_HOME`):
- `config.yaml` - Main configuration
- `secrets.yaml` - Sensitive data
- `apps/` - Deployed app configs
- `.wildcloud/` - Project marker
### Application Categories
- **Content**: Ghost (blog), Discourse (forum)
- **Media**: Immich (photos)
- **Development**: Gitea (Git), Docker Registry
- **Databases**: PostgreSQL, MySQL, Redis
- **AI/ML**: vLLM (LLM inference)
## Technology Stack Summary
### Core Infrastructure
- **Talos Linux** - Immutable Kubernetes OS
- **Kubernetes** - Container orchestration
- **MetalLB** - Load balancing
- **Traefik** - Ingress/reverse proxy
- **Longhorn** - Distributed storage
- **cert-manager** - TLS certificates
### Management Tools
- **gomplate** - Template processing
- **Kustomize** - Configuration management
- **restic** - Backup system
- **kubectl/talosctl** - Cluster management
## Common Agent Tasks
### When Users Ask About...
**"How do I deploy X?"**
- Check apps-system.md for application management
- Look for X in available applications list
- Reference app deployment lifecycle
**"Setup isn't working"**
- Review setup-process.md for troubleshooting
- Check bin-scripts.md for command options
- Verify prerequisites and dependencies
**"How do I configure Y?"**
- Check configuration-system.md for config management
- Look at project-architecture.md for file locations
- Review template processing documentation
**"What does wild-X command do?"**
- Reference bin-scripts.md for complete command documentation
- Check command categories and usage patterns
- Look at dependencies between commands
### Development Tasks
**Creating New Apps**:
1. Review apps-system.md "Creating Custom Apps" section
2. Follow Wild Cloud app structure conventions
3. Use project-architecture.md for file organization
4. Test with standard app deployment workflow
**Modifying Infrastructure**:
1. Check setup-process.md for infrastructure components
2. Review configuration-system.md for template processing
3. Understand project-architecture.md file relationships
4. Test changes carefully in development environment
**Troubleshooting Issues**:
1. Use bin-scripts.md for diagnostic commands
2. Check setup-process.md for component validation
3. Review configuration-system.md for config problems
4. Reference apps-system.md for application issues
## Best Practices for Agents
### Understanding User Context
- Always check if they're in a Wild Cloud directory (look for `.wildcloud/`)
- Determine if they need setup help vs operational help
- Consider their experience level (beginner vs advanced)
- Check what applications they're trying to deploy
### Providing Help
- Reference specific documentation sections for detailed info
- Provide exact command syntax from bin-scripts.md
- Explain prerequisites and dependencies
- Offer validation steps to verify success
### Safety Considerations
- Always recommend testing in development first
- Warn about destructive operations (delete, reset)
- Emphasize backup importance before major changes
- Explain security implications of configuration changes
### Common Gotchas
- `secrets.yaml` has restricted permissions (600)
- Templates need processing before deployment
- Dependencies between applications must be satisfied
- Node hardware detection requires maintenance mode boot
## Documentation Maintenance
This documentation should be updated when:
- New commands are added to `bin/`
- New applications are added to `apps/`
- Infrastructure components change
- Configuration schema evolves
- Best practices are updated
Each documentation file includes:
- Complete coverage of its topic area
- Practical examples and use cases
- Troubleshooting guidance
- References to related documentation
This comprehensive context should enable AI agents to provide expert-level assistance with Wild Cloud projects across all aspects of the system.

View File

@@ -0,0 +1,595 @@
# Wild Cloud Apps System
The Wild Cloud apps system provides a streamlined way to deploy and manage applications on your Kubernetes cluster. It uses Kustomize for configuration management and follows a standardized structure for consistent deployment patterns.
## App Structure and Components
### Directory Structure
Each subdirectory represents a Wild Cloud app. Each app directory contains:
**Required Files:**
- `manifest.yaml` - App metadata and configuration
- `kustomization.yaml` - Kustomize configuration with Wild Cloud labels
**Standard Configuration Files (one or more YAML files containing Kubernetes resource definitions):**
```
apps/myapp/
├── manifest.yaml # Required: App metadata and configuration
├── kustomization.yaml # Required: Kustomize configuration with Wild Cloud labels
├── namespace.yaml # Kubernetes namespace definition
├── deployment.yaml # Application deployment
├── service.yaml # Kubernetes service definition
├── ingress.yaml # HTTPS ingress with external DNS
├── pvc.yaml # Persistent volume claims (if needed)
├── db-init-job.yaml # Database initialization (if needed)
└── configmap.yaml # Configuration data (if needed)
```
### App Manifest (`manifest.yaml`)
The required `manifest.yaml` file contains metadata about the app. Here's an example `manifest.yaml` file:
```yaml
name: myapp
description: A brief description of the application and its purpose.
version: 1.0.0
icon: https://example.com/icon.png
requires:
  - name: postgres
defaultConfig:
  image: myapp/server:1.0.0
  domain: myapp.{{ .cloud.domain }}
  timezone: UTC
  storage: 10Gi
  dbHostname: postgres.postgres.svc.cluster.local
  dbUsername: myapp
requiredSecrets:
  - apps.myapp.dbPassword
  - apps.postgres.password
```
**Manifest Fields**:
- `name` - The name of the app, used for identification (must match directory name)
- `description` - A brief description of the app
- `version` - The version of the app (should generally follow the versioning scheme of the app itself)
- `icon` - A URL to an icon representing the app
- `requires` - A list of other apps that this app depends on (each entry should be the name of another app)
- `defaultConfig` - A set of default configuration values for the app (when an app is added using `wild-app-add`, these values will be added to the Wild Cloud `config.yaml` file)
- `requiredSecrets` - A list of secrets that must be set in the Wild Cloud `secrets.yaml` file for the app to function properly (these secrets are typically sensitive information like database passwords or API keys; keys with random values will be generated automatically when the app is added)
### Kustomization Configuration
Wild Cloud apps use standard Kustomize with required Wild Cloud labels:
```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: myapp
labels:
  - includeSelectors: true
    pairs:
      app: myapp
      managedBy: kustomize
      partOf: wild-cloud
resources:
  - namespace.yaml
  - deployment.yaml
  - service.yaml
  - ingress.yaml
  - pvc.yaml
  - db-init-job.yaml
```
**Kustomization Requirements**:
- Every Wild Cloud kustomization should include the Wild Cloud labels in its `kustomization.yaml` file (this allows Wild Cloud to identify and manage the app correctly)
- The `app` label and `namespace` should match the app's name/directory
- **includeSelectors: true** - Automatically applies labels to all resources AND their selectors
#### Standard Wild Cloud Labels
Wild Cloud uses a consistent labeling strategy across all apps:
```yaml
labels:
  - includeSelectors: true
    pairs:
      app: myapp            # The app name (matches directory)
      managedBy: kustomize  # Managed by Kustomize
      partOf: wild-cloud    # Part of Wild Cloud ecosystem
```
The `includeSelectors: true` setting automatically applies these labels to all resources AND their selectors, which means:
1. **Resource labels** - All resources get the standard Wild Cloud labels
2. **Selector labels** - All selectors automatically include these labels for robust selection
This allows individual resources to use simple, component-specific selectors:
```yaml
selector:
  matchLabels:
    component: web
```
Which Kustomize automatically expands to:
```yaml
selector:
  matchLabels:
    app: myapp
    component: web
    managedBy: kustomize
    partOf: wild-cloud
```
### Template System
Wild Cloud apps are actually **templates** that get compiled with your specific configuration when you run `wild-app-add`. This allows for:
- **Dynamic Configuration** - Reference user settings via `{{ .apps.appname.key }}`
- **Gomplate Processing** - Full template capabilities including conditionals and loops
- **Secret Integration** - Automatic secret generation and referencing
- **Domain Management** - Automatic subdomain assignment based on your domain
**Template Variable Examples**:
```yaml
# Configuration references
image: "{{ .apps.myapp.image }}"
domain: "{{ .apps.myapp.domain }}"
namespace: "{{ .apps.myapp.namespace }}"

# Cloud-wide settings
timezone: "{{ .cloud.timezone }}"
domain_suffix: "{{ .cloud.domain }}"

# Conditional logic
{{- if .apps.myapp.enableSSL }}
- name: ENABLE_SSL
  value: "true"
{{- end }}
```
## App Lifecycle Management
### 1. Discovery Phase
**Command**: `wild-apps-list`
Lists all available applications with metadata:
```bash
wild-apps-list --verbose # Detailed view with descriptions
wild-apps-list --json # JSON output for automation
```
Shows:
- App name and description
- Version and dependencies
- Installation status
- Required configuration
### 2. Configuration Phase
**Command**: `wild-app-add <app-name>`
Processes app templates and prepares for deployment:
**What it does**:
1. Reads app manifest directly from Wild Cloud repository
2. Merges default configuration with existing `config.yaml`
3. Generates required secrets automatically
4. Compiles templates with gomplate using your configuration
5. Creates ready-to-deploy Kustomize files in `apps/<app-name>/`
**Generated Files**:
- Compiled Kubernetes manifests (no more template variables)
- Standard Kustomize configuration
- App-specific configuration merged into your `config.yaml`
- Required secrets added to your `secrets.yaml`
### 3. Deployment Phase
**Command**: `wild-app-deploy <app-name>`
Deploys the app to your Kubernetes cluster:
**Deployment Process**:
1. Creates namespace if it doesn't exist
2. Handles app dependencies (deploys required apps first)
3. Creates secrets from your `secrets.yaml`
4. Applies Kustomize configuration to cluster
5. Copies TLS certificates to app namespace
6. Validates deployment success
**Options**:
- `--force` - Overwrite existing resources
- `--dry-run` - Preview changes without applying
### 4. Operations Phase
**Monitoring**: `wild-app-doctor <app-name>`
- Runs app-specific diagnostic tests
- Checks pod status, resource usage, connectivity
- Options: `--keep`, `--follow`, `--timeout`
**Updates**: Re-run `wild-app-add` then `wild-app-deploy`
- Use `--force` flag to overwrite existing configuration
- Updates configuration changes
- Handles image updates
- Preserves persistent data
**Removal**: `wild-app-delete <app-name>`
- Deletes namespace and all resources
- Removes local configuration files
- Options: `--force` for no confirmation
## Configuration System
### Configuration Storage
**Global Configuration** (`config.yaml`):
```yaml
cloud:
  domain: example.com
  timezone: America/New_York

apps:
  myapp:
    domain: app.example.com
    image: myapp:1.0.0
    storage: 20Gi
    timezone: UTC
```
**Secrets Management** (`secrets.yaml`):
```yaml
apps:
  myapp:
    dbPassword: "randomly-generated-password"
    adminPassword: "user-set-password"
  postgres:
    password: "randomly-generated-password"
```
### Secret Generation
When you run `wild-app-add`, required secrets are automatically generated:
- **Random Generation**: 32-character base64 strings for passwords/keys
- **User Prompts**: For secrets that need specific values
- **Preservation**: Existing secrets are never overwritten
- **Permissions**: `secrets.yaml` has 600 permissions (owner-only)
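For reference, a value of the same shape can be produced manually with a standard shell one-liner (`wild-secret-set` generates these for you; the exact generator it uses may differ):
```bash
# 24 random bytes -> 32 base64 characters, comparable to an auto-generated secret
openssl rand -base64 24
```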
### Configuration Commands
```bash
# Read app configuration
wild-config apps.myapp.domain
# Set app configuration
wild-config-set apps.myapp.storage "50Gi"
# Read app secrets
wild-secret apps.myapp.dbPassword
# Set app secrets
wild-secret-set apps.myapp.adminPassword "my-secure-password"
```
## Networking and DNS
### External DNS Integration
Wild Cloud apps automatically manage DNS records through ingress annotations:
```yaml
metadata:
  annotations:
    external-dns.alpha.kubernetes.io/target: {{ .cloud.domain }}
    external-dns.alpha.kubernetes.io/cloudflare-proxied: "false"
```
**How it works**:
1. App ingress created with external-dns annotations
2. ExternalDNS controller detects new ingress
3. Creates CNAME record: `app.yourdomain.com` → `yourdomain.com`
4. DNS resolves to MetalLB load balancer IP
5. Traefik routes traffic to appropriate service
### HTTPS Certificate Management
Automatic TLS certificates via cert-manager:
```yaml
metadata:
  annotations:
    traefik.ingress.kubernetes.io/router.tls: "true"
    traefik.ingress.kubernetes.io/router.tls.certresolver: letsencrypt
spec:
  tls:
    - hosts:
        - {{ .apps.myapp.domain }}
      secretName: myapp-tls
```
**Certificate Lifecycle**:
1. Ingress created with TLS configuration
2. cert-manager detects certificate requirement
3. Let's Encrypt challenge initiated automatically
4. Certificate issued and stored in Kubernetes secret
5. Traefik uses certificate for TLS termination
6. Automatic renewal before expiration
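Putting the DNS and TLS pieces together, a complete ingress for a hypothetical `myapp` looks roughly like the following sketch (hostnames, service name, and port are illustrative placeholders, not taken from a shipped app):
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
  annotations:
    external-dns.alpha.kubernetes.io/target: {{ .cloud.domain }}
    external-dns.alpha.kubernetes.io/cloudflare-proxied: "false"
    traefik.ingress.kubernetes.io/router.tls: "true"
spec:
  rules:
    - host: {{ .apps.myapp.domain }}
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp
                port:
                  number: 80
  tls:
    - hosts:
        - {{ .apps.myapp.domain }}
      secretName: myapp-tls
```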
## Database Integration
### Database Initialization Jobs
Apps that require databases use initialization jobs to set up the database before the main application starts:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: myapp-db-init
spec:
  template:
    spec:
      containers:
        - name: db-init
          image: postgres:15
          command:
            - /bin/bash
            - -c
            - |
              # "CREATE DATABASE IF NOT EXISTS" is MySQL syntax; PostgreSQL needs a guard query,
              # and "|| true" lets the job retry cleanly if the role already exists.
              export PGPASSWORD="$ROOT_PASSWORD"
              psql -h "$DB_HOST" -U postgres -tc "SELECT 1 FROM pg_database WHERE datname = '$DB_NAME'" \
                | grep -q 1 || psql -h "$DB_HOST" -U postgres -c "CREATE DATABASE $DB_NAME"
              psql -h "$DB_HOST" -U postgres -c "CREATE USER $DB_USER WITH PASSWORD '$DB_PASSWORD'" || true
              psql -h "$DB_HOST" -U postgres -c "GRANT ALL PRIVILEGES ON DATABASE $DB_NAME TO $DB_USER"
          env:
            - name: DB_HOST
              value: {{ .apps.myapp.dbHostname }}
            - name: ROOT_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: myapp-secrets
                  key: apps.postgres.password
      restartPolicy: OnFailure
```
**Database URL Secrets**: For apps requiring database URLs with embedded credentials, always use dedicated secrets:
```yaml
# In manifest.yaml
requiredSecrets:
  - apps.myapp.dbUrl

# Generated secret (by wild-app-add)
apps:
  myapp:
    dbUrl: "postgresql://myapp:password123@postgres.postgres.svc.cluster.local/myapp"
```
### Supported Databases
Wild Cloud apps commonly integrate with:
- **PostgreSQL** - Via `postgres` app dependency
- **MySQL** - Via `mysql` app dependency
- **Redis** - Via `redis` app dependency
- **SQLite** - For apps with embedded database needs
## Storage Management
### Persistent Volume Claims
Apps requiring persistent storage define PVCs:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: myapp-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: {{ .apps.myapp.storage }}
```
**Storage Integration**:
- **Longhorn Storage Class** - Distributed, replicated storage
- **Dynamic Provisioning** - Automatic volume creation
- **Backup Support** - Via `wild-app-backup` command
- **Expansion** - Update storage size in configuration
### Backup and Restore
**Application Backup**: `wild-app-backup <app-name>`
- Discovers databases and PVCs automatically
- Creates restic snapshots with deduplication
- Supports PostgreSQL and MySQL database backups
- Streams PVC data for efficient storage
**Application Restore**: `wild-app-restore <app-name> <snapshot-id>`
- Restores from restic snapshots
- Options: `--db-only`, `--pvc-only`, `--skip-globals`
- Creates safety snapshots before destructive operations
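A typical round trip with these commands, using a placeholder snapshot ID, looks like:
```bash
# Back up the app's databases and PVCs
wild-app-backup myapp

# Later: restore only the database from a previously recorded snapshot
wild-app-restore myapp <snapshot-id> --db-only
```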
## Security Considerations
### Pod Security Standards
All Wild Cloud apps comply with Pod Security Standards:
```yaml
spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 999
        runAsGroup: 999
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: app
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            readOnlyRootFilesystem: false # Set to true when possible
```
### Secret Management
- **Kubernetes Secrets** - All sensitive data stored as Kubernetes secrets
- **Secret References** - Apps reference secrets via `secretKeyRef`, never inline
- **Full Dotted Paths** - Always use complete secret paths (e.g., `apps.myapp.dbPassword`)
- **No Plaintext** - Secrets never stored in manifests or config files
### Network Policies
Apps can define network policies for traffic isolation:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: myapp-network-policy
spec:
  podSelector:
    matchLabels:
      app: myapp
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: traefik
```
## Available Applications
Wild Cloud includes apps for common self-hosted services:
### Content Management
- **Ghost** - Publishing platform for blogs and websites
- **Discourse** - Community discussion platform
### Development & Project Management Tools
- **Gitea** - Self-hosted Git service with web interface
- **OpenProject** - Open-source project management software
- **Docker Registry** - Private container image registry
### Media & File Management
- **Immich** - Self-hosted photo and video backup solution
### Communication
- **Keila** - Newsletter and email marketing platform
- **Listmonk** - Newsletter and mailing list manager
### Databases
- **PostgreSQL** - Relational database service
- **MySQL** - Relational database service
- **Redis** - In-memory data structure store
- **Memcached** - Distributed memory caching system
### AI/ML
- **vLLM** - Fast LLM inference server with OpenAI-compatible API
### Examples & Templates
- **example-admin** - Example admin interface application
- **example-app** - Template application for development reference
## Creating Custom Apps
### App Development Process
1. **Create Directory**: `apps/myapp/`
2. **Write Manifest**: Define metadata and configuration
3. **Create Resources**: Kubernetes manifests with templates
4. **Test Locally**: Use `wild-app-add` and `wild-app-deploy`
5. **Validate**: Ensure all resources deploy correctly
### Best Practices
**Manifest Design**:
- Include comprehensive `defaultConfig` for all configurable values
- List all `requiredSecrets` the app needs
- Specify dependencies in `requires` field
- Use semantic versioning
**Template Usage**:
- Reference configuration via `{{ .apps.myapp.key }}`
- Use conditionals for optional features
- Include proper gomplate syntax for lists and objects
- Test template compilation
**Resource Configuration**:
- Always include Wild Cloud standard labels
- Use appropriate security contexts
- Define resource requests and limits
- Include health checks and probes
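To illustrate the last two points, a container spec in an app deployment will typically carry requests/limits and probes along these lines (the `/healthz` path, port, and numbers are placeholders to adjust per app):
```yaml
containers:
  - name: myapp
    image: "{{ .apps.myapp.image }}"
    resources:
      requests:
        cpu: 100m
        memory: 256Mi
      limits:
        memory: 512Mi
    readinessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 20
```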
**Storage and Networking**:
- Use Longhorn storage class for persistence
- Include external-dns annotations for automatic DNS
- Configure TLS certificates via cert-manager annotations
- Follow database initialization patterns for data apps
### Converting from Helm Charts
Wild Cloud provides tooling to convert Helm charts to Wild Cloud apps:
```bash
# Convert Helm chart to Kustomize base
helm fetch --untar --untardir charts stable/mysql
helm template --output-dir base --namespace mysql mysql charts/mysql
cd base/mysql
kustomize create --autodetect
# Then customize for Wild Cloud:
# 1. Add manifest.yaml
# 2. Replace hardcoded values with templates
# 3. Update labels to Wild Cloud standard
# 4. Configure secrets properly
```
## Troubleshooting Applications
### Common Issues
**App Won't Start**:
- Check pod logs: `kubectl logs -n <app-namespace> deployment/<app-name>`
- Verify secrets exist: `kubectl get secrets -n <app-namespace>`
- Check resource constraints: `kubectl describe pod -n <app-namespace>`
**Database Connection Issues**:
- Verify database is running: `kubectl get pods -n <db-namespace>`
- Check database initialization job: `kubectl logs job/<app>-db-init -n <app-namespace>`
- Validate database credentials in secrets
**DNS/Certificate Issues**:
- Check ingress status: `kubectl get ingress -n <app-namespace>`
- Verify certificate creation: `kubectl get certificates -n <app-namespace>`
- Check external-dns logs: `kubectl logs -n external-dns deployment/external-dns`
**Storage Issues**:
- Check PVC status: `kubectl get pvc -n <app-namespace>`
- Verify Longhorn cluster health: Access Longhorn UI
- Check storage class availability: `kubectl get storageclass`
### Diagnostic Tools
```bash
# App-specific diagnostics
wild-app-doctor <app-name>
# Resource inspection
kubectl get all -n <app-namespace>
kubectl describe deployment/<app-name> -n <app-namespace>
# Log analysis
kubectl logs -f deployment/<app-name> -n <app-namespace>
kubectl logs job/<app>-db-init -n <app-namespace>
# Configuration verification
wild-config apps.<app-name>
wild-secret apps.<app-name>
```
The Wild Cloud apps system provides a powerful, consistent way to deploy and manage self-hosted applications with enterprise-grade features like automatic HTTPS, DNS management, backup/restore, and integrated security.

View File

@@ -0,0 +1,262 @@
# Wild Cloud CLI Scripts Reference
Wild Cloud provides 34+ command-line tools (all prefixed with `wild-`) for managing your personal Kubernetes cloud infrastructure. These scripts handle everything from initial setup to day-to-day operations.
## Script Categories
### 🚀 Initial Setup & Scaffolding
**`wild-init`** - Initialize new Wild Cloud instance
- Creates `.wildcloud` directory structure
- Copies template files from repository
- Sets up basic configuration (email, domains, cluster name)
- **Usage**: `wild-init`
- **When to use**: First command to run in a new directory
**`wild-setup`** - Master setup orchestrator
- Runs complete Wild Cloud deployment sequence
- Options: `--skip-cluster`, `--skip-services`
- Executes: cluster setup → services setup
- **Usage**: `wild-setup [options]`
- **When to use**: After `wild-init` for complete automated setup
**`wild-update-docs`** - Copy documentation to cloud directory
- Options: `--force` to overwrite existing docs
- Copies `/docs` from repository to your cloud home
- **Usage**: `wild-update-docs [--force]`
### ⚙️ Configuration Management
**`wild-config`** - Read configuration values
- Access YAML paths from `config.yaml` (e.g., `cluster.name`, `cloud.domain`)
- Option: `--check` to test key existence
- **Usage**: `wild-config <key>` or `wild-config --check <key>`
**`wild-config-set`** - Write configuration values
- Sets values using YAML paths, creates config file if needed
- **Usage**: `wild-config-set <key> <value>`
**`wild-secret`** - Read secret values
- Similar to `wild-config` but for sensitive data in `secrets.yaml`
- File has restrictive permissions (600)
- **Usage**: `wild-secret <key>` or `wild-secret --check <key>`
**`wild-secret-set`** - Write secret values
- Generates random values if none provided (32-char base64)
- **Usage**: `wild-secret-set <key> [value]`
**`wild-compile-template`** - Process gomplate templates
- Uses `config.yaml` and `secrets.yaml` as template context
- **Usage**: `wild-compile-template < template.yaml`
**`wild-compile-template-dir`** - Process template directories
- Options: `--clean` to remove destination first
- **Usage**: `wild-compile-template-dir <source> <destination>`
### 🏗️ Cluster Infrastructure Management
**`wild-setup-cluster`** - Complete cluster setup (Phases 1-3)
- Automated control plane node setup and bootstrapping
- Configures Talos control plane nodes using wild-node-setup
- Options: `--skip-hardware`
- **Usage**: `wild-setup-cluster [options]`
- **Requires**: `wild-init` completed first
**`wild-cluster-config-generate`** - Generate Talos cluster config
- Creates base `controlplane.yaml` and `worker.yaml`
- Generates cluster secrets using `talosctl gen config`
- **Usage**: `wild-cluster-config-generate`
**`wild-node-setup`** - Complete node lifecycle management
- Handles detect → configure → patch → deploy for individual nodes
- Automatically detects maintenance mode and handles IP transitions
- Options: `--reconfigure`, `--no-deploy`
- **Usage**: `wild-node-setup <node-name> [options]`
- **Examples**:
- `wild-node-setup control-1` (complete setup)
- `wild-node-setup worker-1 --reconfigure` (force node reconfiguration)
- `wild-node-setup control-2 --no-deploy` (configuration only)
**`wild-node-detect`** - Hardware detection utility
- Discovers network interfaces and disks from maintenance mode
- Returns JSON with hardware specifications and maintenance mode status
- **Usage**: `wild-node-detect <node-ip>`
- **Note**: Primarily used internally by `wild-node-setup`
**`wild-cluster-node-ip`** - Get node IP addresses
- Sources: config.yaml, kubectl, or talosctl
- Options: `--from-config`, `--from-talosctl`
- **Usage**: `wild-cluster-node-ip <node-name> [options]`
### 🔧 Cluster Services Management
**`wild-setup-services`** - Set up all cluster services (Phase 4)
- Manages MetalLB, Traefik, cert-manager, etc. in dependency order
- Options: `--fetch` for fresh templates, `--no-deploy` for config-only
- **Usage**: `wild-setup-services [options]`
- **Requires**: Working Kubernetes cluster
**`wild-service-setup`** - Complete service lifecycle management
- Handles fetch → configure → deploy for individual services
- Options: `--fetch` for fresh templates, `--no-deploy` for config-only
- **Usage**: `wild-service-setup <service> [--fetch] [--no-deploy]`
- **Examples**:
- `wild-service-setup cert-manager` (configure + deploy)
- `wild-service-setup cert-manager --fetch` (fetch + configure + deploy)
- `wild-service-setup cert-manager --no-deploy` (configure only)
**`wild-dashboard-token`** - Get Kubernetes dashboard token
- Extracts token for dashboard authentication
- Copies to clipboard if available
- **Usage**: `wild-dashboard-token`
**`wild-cluster-secret-copy`** - Copy secrets between namespaces
- **Usage**: `wild-cluster-secret-copy <source-ns:secret> <target-ns1> [target-ns2]`
### 📱 Application Management
**`wild-apps-list`** - List available applications
- Shows metadata, installation status, dependencies
- Options: `--verbose`, `--json`, `--yaml`
- **Usage**: `wild-apps-list [options]`
**`wild-app-add`** - Configure app from repository
- Processes manifest.yaml with configuration
- Generates required secrets automatically
- Options: `--force` to overwrite existing app files
- **Usage**: `wild-app-add <app-name> [--force]`
**`wild-app-deploy`** - Deploy application to cluster
- Creates namespaces, handles dependencies
- Options: `--force`, `--dry-run`
- **Usage**: `wild-app-deploy <app-name> [options]`
**`wild-app-delete`** - Remove application
- Deletes namespace and all resources
- Options: `--force`, `--dry-run`
- **Usage**: `wild-app-delete <app-name> [options]`
**`wild-app-doctor`** - Run application diagnostics
- Executes app-specific diagnostic tests
- Options: `--keep`, `--follow`, `--timeout`
- **Usage**: `wild-app-doctor <app-name> [options]`
### 💾 Backup & Restore
**`wild-backup`** - Comprehensive backup system
- Backs up home directory, apps, and cluster resources
- Options: `--home-only`, `--apps-only`, `--cluster-only`
- Uses restic for deduplication
- **Usage**: `wild-backup [options]`
**`wild-app-backup`** - Application-specific backups
- Discovers databases and PVCs automatically
- Supports PostgreSQL and MySQL
- Options: `--all` for all applications
- **Usage**: `wild-app-backup <app-name> [--all]`
**`wild-app-restore`** - Application restore
- Restores databases and/or PVC data
- Options: `--db-only`, `--pvc-only`, `--skip-globals`
- **Usage**: `wild-app-restore <app-name> <snapshot-id> [options]`
### 🔍 Utilities & Helpers
**`wild-health`** - Comprehensive infrastructure validation
- Validates core components (MetalLB, Traefik, CoreDNS)
- Checks installed services (cert-manager, ExternalDNS, Kubernetes Dashboard)
- Tests DNS resolution, routing, certificates, and storage systems
- **Usage**: `wild-health`
**`wild-talos-schema`** - Talos schema management
- Handles configuration schema operations
- **Usage**: `wild-talos-schema [options]`
**`wild-cluster-node-boot-assets-download`** - Download Talos assets
- Downloads installation images for nodes
- **Usage**: `wild-cluster-node-boot-assets-download`
**`wild-dnsmasq-install`** - Install dnsmasq services
- Sets up DNS and DHCP for cluster networking
- **Usage**: `wild-dnsmasq-install`
## Common Usage Patterns
### Complete Setup from Scratch
```bash
wild-init # Initialize cloud directory
wild-setup # Complete automated setup
# or step by step:
wild-setup-cluster # Just cluster infrastructure
wild-setup-services # Just cluster services
```
### Individual Service Management
```bash
# Most common - reconfigure and deploy service
wild-service-setup cert-manager
# Get fresh templates and deploy (for updates)
wild-service-setup cert-manager --fetch
# Configure only, don't deploy (for planning)
wild-service-setup cert-manager --no-deploy
# Fix failed service and resume setup
wild-service-setup cert-manager --fetch
wild-setup-services # Resume full setup if needed
```
### Application Management
```bash
wild-apps-list # See available apps
wild-app-add ghost # Configure app
wild-app-deploy ghost # Deploy to cluster
wild-app-doctor ghost # Troubleshoot issues
```
### Configuration Management
```bash
wild-config cluster.name # Read values
wild-config-set apps.ghost.domain "blog.example.com" # Write values
wild-secret apps.ghost.adminPassword # Read secrets
wild-secret-set apps.ghost.apiKey # Generate random secret
```
### Cluster Operations
```bash
wild-cluster-node-ip control-1 # Get node IP
wild-dashboard-token # Get dashboard access
wild-health # Check system health
```
## Script Design Principles
1. **Consistent Interface**: All scripts use `--help` and follow common argument patterns
2. **Error Handling**: All scripts use `set -e` and `set -o pipefail` for robust error handling
3. **Idempotent**: Scripts check existing state before making changes
4. **Template-Driven**: Extensive use of gomplate for configuration flexibility
5. **Environment-Aware**: Scripts source `wild-common.sh` and initialize Wild Cloud environment
6. **Progressive Disclosure**: Complex operations broken into phases with individual controls
## Dependencies Between Scripts
### Setup Phase Dependencies
1. `wild-init` → creates basic structure
2. `wild-setup-cluster` → provisions infrastructure
3. `wild-setup-services` → installs cluster services
4. `wild-setup` → orchestrates all phases
### App Deployment Pipeline
1. `wild-apps-list` → discover applications
2. `wild-app-add` → configure and prepare application
3. `wild-app-deploy` → deploy to cluster
### Node Management Flow
1. `wild-cluster-config-generate` → base configurations
2. `wild-node-setup <node-name>` → atomic node operations (detect → patch → deploy)
- Internally uses `wild-node-detect` for hardware discovery
- Generates node-specific patches and final configurations
- Deploys configuration to target node
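Taken together, a typical control plane bring-up with these commands looks like:
```bash
# One time: generate base cluster configs and secrets
wild-cluster-config-generate

# Per node: detect hardware, generate patches, and deploy
wild-node-setup control-1
wild-node-setup control-2
wild-node-setup control-3
```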
All scripts are designed to work together as a cohesive Infrastructure as Code system for personal Kubernetes deployments.

View File

@@ -0,0 +1,602 @@
# Wild Cloud Configuration System
Wild Cloud uses a comprehensive configuration management system that handles both non-sensitive configuration data and sensitive secrets through separate files and commands. The system supports YAML path-based access, template processing, and environment-specific customization.
## Configuration Architecture
### Core Components
1. **`config.yaml`** - Main configuration file for non-sensitive settings
2. **`secrets.yaml`** - Encrypted/protected storage for sensitive data
3. **`.wildcloud/`** - Project marker and cache directory
4. **`env.sh`** - Environment setup and path configuration
5. **Template System** - gomplate-based dynamic configuration processing
### File Structure of a Wild Cloud Project
```
your-cloud-directory/
├── .wildcloud/       # Project marker and cache
│   ├── cache/        # Downloaded templates and temporary files
│   └── logs/         # Operation logs
├── config.yaml       # Main configuration (tracked in git)
├── secrets.yaml      # Sensitive data (NOT tracked in git, 600 perms)
├── env.sh            # Environment setup (auto-generated)
├── apps/             # Deployed application configurations
├── setup/            # Infrastructure setup files
└── docs/             # Project documentation
```
## Configuration File (`config.yaml`)
### Structure and Organization
The configuration file uses a hierarchical YAML structure for organizing settings:
```yaml
# Cloud-wide settings
cloud:
  domain: "example.com"
  email: "admin@example.com"
  timezone: "America/New_York"

# Cluster infrastructure settings
cluster:
  name: "wild-cluster"
  nodeCount: 3
  network:
    subnet: "192.168.1.0/24"
    gateway: "192.168.1.1"
    dnsServer: "192.168.1.50"
    metallbPool: "192.168.1.80-89"
    controlPlaneVip: "192.168.1.90"
  nodes:
    control-1:
      ip: "192.168.1.91"
      mac: "00:11:22:33:44:55"
      interface: "eth0"
      disk: "/dev/sda"
    control-2:
      ip: "192.168.1.92"
      mac: "00:11:22:33:44:56"
      interface: "eth0"
      disk: "/dev/sda"

# Application-specific settings
apps:
  ghost:
    domain: "blog.example.com"
    image: "ghost:5.0.0"
    storage: "10Gi"
    timezone: "UTC"
    namespace: "ghost"
  immich:
    domain: "photos.example.com"
    serverImage: "ghcr.io/immich-app/immich-server:release"
    storage: "250Gi"
    namespace: "immich"

# Service configurations
services:
  traefik:
    replicas: 2
    dashboard: true
  longhorn:
    defaultReplicas: 3
    storageClass: "longhorn"
```
### Configuration Commands
**Reading Configuration Values**:
```bash
# Read simple values
wild-config cloud.domain # "example.com"
wild-config cluster.name # "wild-cluster"
# Read nested values
wild-config apps.ghost.domain # "blog.example.com"
wild-config cluster.nodes.control-1.ip # "192.168.1.91"
# Check if key exists
wild-config --check apps.newapp.domain # Returns exit code 0/1
```
**Writing Configuration Values**:
```bash
# Set simple values
wild-config-set cloud.domain "newdomain.com"
wild-config-set cluster.nodeCount 5
# Set nested values
wild-config-set apps.ghost.storage "20Gi"
wild-config-set cluster.nodes.worker-1.ip "192.168.1.94"
# Set complex values (JSON format)
wild-config-set apps.ghost '{"domain":"blog.com","storage":"50Gi"}'
```
### Configuration Sections
#### Cloud Settings (`cloud.*`)
Global settings that affect the entire Wild Cloud deployment:
```yaml
cloud:
  domain: "example.com"          # Primary domain for services
  email: "admin@example.com"     # Contact email for certificates
  timezone: "America/New_York"   # Default timezone for services
  backupLocation: "s3://backup"  # Backup storage location
  monitoring: true               # Enable monitoring services
```
#### Cluster Settings (`cluster.*`)
Infrastructure and node configuration:
```yaml
cluster:
  name: "production-cluster"
  version: "v1.28.0"
  network:
    subnet: "10.0.0.0/16"        # Cluster network range
    serviceCIDR: "10.96.0.0/12"  # Service network range
    podCIDR: "10.244.0.0/16"     # Pod network range
  nodes:
    control-1:
      ip: "10.0.0.10"
      role: "controlplane"
      taints: []
    worker-1:
      ip: "10.0.0.20"
      role: "worker"
      labels:
        node-type: "compute"
```
#### Application Settings (`apps.*`)
Per-application configuration that overrides defaults from app manifests:
```yaml
apps:
  postgresql:
    storage: "100Gi"
    maxConnections: 200
    sharedBuffers: "256MB"
  redis:
    memory: "1Gi"
    persistence: true
  ghost:
    domain: "blog.example.com"
    theme: "casper"
    storage: "10Gi"
    replicas: 2
```
## Secrets Management (`secrets.yaml`)
### Security Model
The `secrets.yaml` file stores all sensitive data with the following security measures:
- **File Permissions**: Automatically set to 600 (owner read/write only)
- **Git Exclusion**: Included in `.gitignore` by default
- **Encryption Support**: Can be encrypted at rest using tools like `age` or `gpg`
- **Access Control**: Only Wild Cloud commands can read/write secrets
### Secret Structure
```yaml
# Generated cluster secrets
cluster:
  talos:
    secrets: "base64-encoded-cluster-secrets"
    adminKey: "talos-admin-private-key"
  kubernetes:
    adminToken: "k8s-admin-service-account-token"

# Application secrets
apps:
  postgresql:
    rootPassword: "randomly-generated-32-char-string"
    replicationPassword: "randomly-generated-32-char-string"
  ghost:
    dbPassword: "randomly-generated-password"
    adminPassword: "user-set-password"
    jwtSecret: "randomly-generated-jwt-secret"
  immich:
    dbPassword: "randomly-generated-password"
    dbUrl: "postgresql://immich:password@postgres:5432/immich"
    jwtSecret: "jwt-signing-key"

# External service credentials
external:
  cloudflare:
    apiToken: "cloudflare-dns-api-token"
  letsencrypt:
    email: "admin@example.com"
  backup:
    s3AccessKey: "backup-s3-access-key"
    s3SecretKey: "backup-s3-secret-key"
```
### Secret Commands
**Reading Secrets**:
```bash
# Read secret values
wild-secret apps.postgresql.rootPassword
wild-secret cluster.kubernetes.adminToken
# Check if secret exists
wild-secret --check apps.newapp.apiKey
```
**Writing Secrets**:
```bash
# Set specific secret value
wild-secret-set apps.ghost.adminPassword "my-secure-password"
# Generate random secret (if no value provided)
wild-secret-set apps.newapp.apiKey # Generates 32-char base64 string
# Set complex secret (JSON format)
wild-secret-set apps.database '{"user":"admin","password":"secret"}'
```
### Automatic Secret Generation
When you run `wild-app-add`, Wild Cloud automatically generates required secrets:
1. **Reads App Manifest**: Identifies `requiredSecrets` list
2. **Checks Existing Secrets**: Never overwrites existing values
3. **Generates Missing Secrets**: Creates secure random values
4. **Updates secrets.yaml**: Adds new secrets with proper structure
**Example App Manifest**:
```yaml
name: ghost
requiredSecrets:
  - apps.ghost.dbPassword    # Auto-generated if missing
  - apps.ghost.jwtSecret     # Auto-generated if missing
  - apps.postgresql.password # Auto-generated if missing (dependency)
```
**Resulting secrets.yaml**:
```yaml
apps:
  ghost:
    dbPassword: "aB3kL9mN2pQ7rS8tU1vW4xY5zA6bC0dE"
    jwtSecret: "jF2gH5iJ8kL1mN4oP7qR0sT3uV6wX9yZ"
  postgresql:
    password: "eE8fF1gG4hH7iI0jJ3kK6lL9mM2nN5oO"
```
## Template System
### gomplate Integration
Wild Cloud uses [gomplate](https://gomplate.ca/) for dynamic configuration processing, allowing templates to access both configuration and secrets:
```yaml
# Template example (before processing)
apiVersion: v1
kind: ConfigMap
metadata:
  name: ghost-config
  namespace: {{ .apps.ghost.namespace }}
data:
  url: "https://{{ .apps.ghost.domain }}"
  timezone: "{{ .apps.ghost.timezone | default .cloud.timezone }}"
  database_host: "{{ .apps.postgresql.hostname }}"

  # Conditionals
  {{- if .apps.ghost.enableSSL }}
  ssl_enabled: "true"
  {{- end }}

  # Loops
  allowed_domains: |
    {{- range .apps.ghost.allowedDomains }}
    - {{ . }}
    {{- end }}
```
### Template Processing Commands
**Process Single Template**:
```bash
# From stdin
cat template.yaml | wild-compile-template > output.yaml
# With custom context
echo "domain: {{ .cloud.domain }}" | wild-compile-template
```
**Process Template Directory**:
```bash
# Recursively process all templates
wild-compile-template-dir source-dir output-dir
# Clean destination first
wild-compile-template-dir --clean source-dir output-dir
```
### Template Context
Templates have access to the complete configuration and secrets context:
```go
// Available template variables
.cloud.* // All cloud configuration
.cluster.* // All cluster configuration
.apps.* // All application configuration
.services.* // All service configuration
// Special functions
.cloud.domain // Primary domain
default "fallback" // Default value if key missing
env "VAR_NAME" // Environment variable
file "path/to/file" // File contents
```
**Template Examples**:
```yaml
# Basic variable substitution
domain: {{ .apps.myapp.domain }}
# Default values
timezone: {{ .apps.myapp.timezone | default .cloud.timezone }}
# Conditionals
{{- if .apps.myapp.enableFeature }}
feature_enabled: true
{{- else }}
feature_enabled: false
{{- end }}
# Lists and iteration
allowed_hosts:
{{- range .apps.myapp.allowedHosts }}
- {{ . }}
{{- end }}
# Complex expressions
replicas: {{ if eq .cluster.environment "production" }}3{{ else }}1{{ end }}
```
## Environment Setup
### Environment Detection
Wild Cloud automatically detects and configures the environment through several mechanisms:
**Project Detection**:
- Searches for `.wildcloud` directory in current or parent directories
- Sets `WC_HOME` to the directory containing `.wildcloud`
- Fails if no Wild Cloud project found
**Repository Detection**:
- Locates Wild Cloud repository (source code)
- Sets `WC_ROOT` to repository location
- Used for accessing app templates and setup scripts
### Environment Variables
**Key Environment Variables**:
```bash
WC_HOME="/path/to/your-cloud" # Your cloud directory
WC_ROOT="/path/to/wild-cloud-repo" # Wild Cloud repository
PATH="$WC_ROOT/bin:$PATH" # Wild Cloud commands available
KUBECONFIG="$WC_HOME/.kube/config" # Kubernetes configuration
TALOSCONFIG="$WC_HOME/.talos/config" # Talos configuration
```
**Environment Setup Script** (`env.sh`):
```bash
#!/bin/bash
# Auto-generated environment setup
export WC_HOME="/home/user/my-cloud"
export WC_ROOT="/opt/wild-cloud"
export PATH="$WC_ROOT/bin:$PATH"
export KUBECONFIG="$WC_HOME/.kubeconfig"
export TALOSCONFIG="$WC_HOME/setup/cluster-nodes/generated/talosconfig"
# Source this file to set up Wild Cloud environment
# source env.sh
```
### Common Script Pattern
Most Wild Cloud scripts follow this initialization pattern:
```bash
#!/bin/bash
set -e
set -o pipefail
# Initialize Wild Cloud environment
if [ -z "${WC_ROOT}" ]; then
print "WC_ROOT is not set."
exit 1
else
source "${WC_ROOT}/scripts/common.sh"
init_wild_env
fi
# Script logic here...
```
## Configuration Validation
### Schema Validation
Wild Cloud validates configuration against expected schemas:
**Cluster Configuration Validation**:
- Node IP addresses are valid and unique
- Network ranges don't overlap
- Required fields are present
- Hardware specifications meet minimums
**Application Configuration Validation**:
- Domain names are valid DNS names
- Storage sizes use valid Kubernetes formats
- Image references are valid container images
- Dependencies are satisfied
### Validation Commands
```bash
# Validate current configuration
wild-config --validate
# Check specific configuration sections
wild-config --validate --section cluster
wild-config --validate --section apps.ghost
# Test template compilation
wild-compile-template --validate < template.yaml
```
## Configuration Best Practices
### Organization
**Hierarchical Structure**:
- Group related settings under common prefixes
- Use consistent naming conventions
- Keep application configs under `apps.*`
- Separate infrastructure from application settings
**Documentation**:
```yaml
# Document complex configurations
cluster:
  # Node configuration - update IPs after hardware changes
  nodes:
    control-1:
      ip: "192.168.1.91"  # Main control plane node
      interface: "eth0"   # Primary network interface
```
### Security
**Configuration Security**:
- Never store secrets in `config.yaml`
- Use `wild-secret-set` for all sensitive data
- Regularly rotate generated secrets
- Backup `secrets.yaml` securely
**Access Control**:
```bash
# Ensure proper permissions
chmod 600 secrets.yaml
chmod 644 config.yaml
# Restrict directory access
chmod 755 your-cloud-directory
chmod 700 .wildcloud/
```
### Version Control
**Git Integration**:
```gitignore
# .gitignore for Wild Cloud projects
# Never commit secrets
secrets.yaml

# Temporary files and operation logs
.wildcloud/cache/
.wildcloud/logs/

# Generated cluster configs
setup/cluster-nodes/generated/

# Kubernetes and Talos configs
.kube/
.talos/
```
**Configuration Changes**:
- Commit `config.yaml` changes with descriptive messages
- Tag major configuration changes
- Use branches for experimental configurations
- Document configuration changes in commit messages
### Backup and Recovery
**Configuration Backup**:
```bash
# Backup configuration and secrets
wild-backup --home-only
# Export configuration for disaster recovery
cp config.yaml config-backup-$(date +%Y%m%d).yaml
cp secrets.yaml secrets-backup-$(date +%Y%m%d).yaml.gpg # Encrypt first
```
**Recovery Process**:
1. Restore `config.yaml` from backup
2. Decrypt and restore `secrets.yaml`
3. Re-run `wild-setup` if needed
4. Validate configuration with `wild-config --validate`
## Advanced Configuration
### Multi-Environment Setup
**Development Environment**:
```yaml
cloud:
  domain: "dev.example.com"
cluster:
  name: "dev-cluster"
  nodeCount: 1
apps:
  ghost:
    domain: "blog.dev.example.com"
    replicas: 1
```
**Production Environment**:
```yaml
cloud:
  domain: "example.com"
cluster:
  name: "prod-cluster"
  nodeCount: 5
apps:
  ghost:
    domain: "blog.example.com"
    replicas: 3
```
### Configuration Inheritance
**Base Configuration**:
```yaml
# config.base.yaml
cloud:
  timezone: "UTC"
  email: "admin@example.com"
apps:
  postgresql:
    storage: "10Gi"
```
**Environment-Specific Override**:
```yaml
# config.prod.yaml (merged with base)
apps:
  postgresql:
    storage: "100Gi" # Override for production
    replicas: 3      # Additional production setting
```
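The documentation does not prescribe a merge tool; one way to produce the effective configuration from a base file and an environment override, assuming `yq` v4 is available, is a deep merge where later files win:
```bash
# Deep-merge base and production override into the working config.yaml
yq eval-all '. as $item ireduce ({}; . * $item)' config.base.yaml config.prod.yaml > config.yaml
```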
### Dynamic Configuration
**Runtime Configuration Updates**:
```bash
# Update configuration without restart
wild-config-set apps.ghost.replicas 3
wild-app-deploy ghost # Apply changes
# Rolling updates
wild-config-set apps.ghost.image "ghost:5.1.0"
wild-app-deploy ghost --rolling-update
```
The Wild Cloud configuration system provides a powerful, secure, and flexible foundation for managing complex infrastructure deployments while maintaining simplicity for common use cases.

View File

@@ -0,0 +1,443 @@
# Wild Cloud Overview
Wild Cloud is a complete, production-ready Kubernetes infrastructure designed for personal use. It combines enterprise-grade technologies to create a self-hosted cloud platform with automated deployment, HTTPS certificates, and web management interfaces.
## What is Wild Cloud?
### Vision
In a world where digital lives are increasingly controlled by large corporations, Wild Cloud puts you back in control by providing:
- **Privacy**: Your data stays on your hardware, under your control
- **Ownership**: No subscription fees or sudden price increases
- **Freedom**: Run the apps you want, the way you want them
- **Learning**: Gain valuable skills in modern cloud technologies
- **Resilience**: Reduce reliance on third-party services that can disappear
### Core Capabilities
**Complete Infrastructure Stack**:
- Kubernetes cluster with Talos Linux
- Automatic HTTPS certificates via Let's Encrypt
- Load balancing with MetalLB
- Ingress routing with Traefik
- Distributed storage with Longhorn
- DNS management with CoreDNS and ExternalDNS
**Application Platform**:
- One-command application deployment
- Pre-built apps for common self-hosted services
- Automatic database setup and configuration
- Integrated backup and restore system
- Web-based management interfaces
**Enterprise Features**:
- High availability and fault tolerance
- Automated certificate management
- Network policies and security contexts
- Monitoring and observability
- Infrastructure as code principles
## Technology Stack
### Core Infrastructure
- **Talos Linux** - Immutable OS designed for Kubernetes
- **Kubernetes** - Container orchestration platform
- **MetalLB** - Load balancer for bare metal deployments
- **Traefik** - HTTP reverse proxy and ingress controller
- **Longhorn** - Distributed block storage system
- **cert-manager** - Automatic TLS certificate management
### Supporting Services
- **CoreDNS** - DNS server for service discovery
- **ExternalDNS** - Automatic DNS record management
- **Kubernetes Dashboard** - Web UI for cluster management
- **restic** - Backup solution with deduplication
- **gomplate** - Template processor for configurations
### Development Tools
- **Kustomize** - Kubernetes configuration management
- **kubectl** - Kubernetes command line interface
- **talosctl** - Talos Linux management tool
- **Bats** - Testing framework for bash scripts
## Architecture Overview
```
┌──────────────────────────────────────────────┐
│                   Internet                   │
└──────────────────────┬───────────────────────┘
┌──────────────────────▼───────────────────────┐
│                 DNS Provider                 │
│         (Cloudflare, Route53, etc.)          │
└──────────────────────┬───────────────────────┘
┌──────────────────────▼───────────────────────┐
│                 Your Network                 │
│                                              │
│  dnsmasq server (DNS + DHCP)                 │
│                                              │
│  Kubernetes cluster:                         │
│    MetalLB (LoadBalancer)                    │
│    Traefik (Ingress)                         │
│    Applications:                             │
│      Ghost, Immich, Gitea, vLLM, ...         │
└──────────────────────────────────────────────┘
```
### Traffic Flow
1. **External Request** → DNS resolution via provider
2. **DNS Response** → Points to your cluster's external IP
3. **Network Request** → Hits MetalLB load balancer
4. **Load Balancer** → Routes to Traefik ingress controller
5. **Ingress Controller** → Terminates TLS and routes to application
6. **Application** → Serves content from Kubernetes pod
## Getting Started
### Prerequisites
**Hardware Requirements**:
- Minimum 3 nodes for high availability
- 8GB RAM per node (16GB+ recommended)
- 100GB+ storage per node
- Gigabit network connectivity
- x86_64 architecture
**Network Requirements**:
- All nodes on same network segment
- One dedicated machine for dnsmasq (can be lightweight)
- Static IP assignments or DHCP reservations
- Internet connectivity for downloads and certificates
### Quick Start Guide
#### 1. Install Dependencies
```bash
# Clone Wild Cloud repository
git clone https://github.com/your-org/wild-cloud
cd wild-cloud
# Install required tools
scripts/setup-utils.sh
```
#### 2. Initialize Your Cloud
```bash
# Create and initialize new cloud directory
mkdir my-cloud && cd my-cloud
wild-init
# Follow interactive setup prompts for:
# - Domain name configuration
# - Email for certificates
# - Network settings
```
#### 3. Deploy Infrastructure
```bash
# Complete automated setup
wild-setup
# Or step-by-step:
wild-setup-cluster # Deploy Kubernetes cluster
wild-setup-services # Install core services
```
#### 4. Deploy Your First App
```bash
# List available applications
wild-apps-list
# Deploy a blog
wild-app-add ghost
wild-app-deploy ghost
# Access at https://ghost.yourdomain.com
```
#### 5. Verify Deployment
```bash
# Check system health
wild-health
# Access Kubernetes dashboard
wild-dashboard-token
# Visit https://dashboard.internal.yourdomain.com
```
## Key Concepts
### Configuration Management
Wild Cloud uses a dual-file configuration system:
**`config.yaml`** - Non-sensitive settings:
```yaml
cloud:
domain: "example.com"
email: "admin@example.com"
apps:
ghost:
domain: "blog.example.com"
storage: "10Gi"
```
**`secrets.yaml`** - Sensitive data (auto-generated):
```yaml
apps:
ghost:
dbPassword: "secure-random-password"
postgresql:
rootPassword: "another-secure-password"
```
### Template System
All configurations are templates processed with gomplate:
**Before Processing** (in repository):
```yaml
domain: {{ .apps.ghost.domain }}
storage: {{ .apps.ghost.storage | default "5Gi" }}
```
**After Processing** (in your cloud):
```yaml
domain: blog.example.com
storage: 10Gi
```
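The `wild-app-*` commands run this processing for you, but the transformation itself is plain gomplate. A minimal sketch, assuming a `config.yaml` in the current directory and a template file named `ingress.yaml` (both names illustrative):
```bash
# Render one template against your configuration values.
# -c .=config.yaml exposes the YAML document as the root context,
# so {{ .apps.ghost.domain }} resolves to the value from config.yaml.
gomplate -c .=config.yaml -f ingress.yaml -o ingress.rendered.yaml
```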
### Application Lifecycle
1. **Discovery**: `wild-apps-list` - Browse available apps
2. **Configuration**: `wild-app-add app-name` - Configure and prepare application
3. **Deployment**: `wild-app-deploy app-name` - Deploy to cluster
4. **Operations**: `wild-app-doctor app-name` - Monitor and troubleshoot
## Available Applications
### Content Management & Publishing
- **Ghost** - Modern publishing platform for blogs and websites
- **Discourse** - Community discussion platform with modern features
### Media & File Management
- **Immich** - Self-hosted photo and video backup solution
### Development Tools
- **Gitea** - Self-hosted Git service with web interface
- **Docker Registry** - Private container image registry
### Communication & Marketing
- **Keila** - Newsletter and email marketing platform
- **Listmonk** - High-performance newsletter and mailing list manager
### Databases & Caching
- **PostgreSQL** - Advanced open-source relational database
- **MySQL** - Popular relational database management system
- **Redis** - In-memory data structure store and cache
- **Memcached** - Distributed memory caching system
### AI & Machine Learning
- **vLLM** - High-performance LLM inference server with OpenAI-compatible API
## Core Commands Reference
### Setup & Initialization
```bash
wild-init # Initialize new cloud directory
wild-setup # Complete infrastructure deployment
wild-setup-cluster # Deploy Kubernetes cluster only
wild-setup-services # Deploy cluster services only
```
### Application Management
```bash
wild-apps-list # List available applications
wild-app-add <app> # Configure application
wild-app-deploy <app> # Deploy to cluster
wild-app-delete <app> # Remove application
wild-app-doctor <app> # Run diagnostics
```
### Configuration Management
```bash
wild-config <key> # Read configuration value
wild-config-set <key> <val> # Set configuration value
wild-secret <key> # Read secret value
wild-secret-set <key> <val> # Set secret value
```
### Operations & Monitoring
```bash
wild-health # System health check
wild-dashboard-token # Get dashboard access token
wild-backup # Backup system and apps
wild-app-backup <app> # Backup specific application
```
## Best Practices
### Security
- Never commit `secrets.yaml` to version control
- Use strong, unique passwords for all services
- Regularly update system and application images
- Monitor certificate expiration and renewal
- Implement network policies for production workloads
### Configuration Management
- Store `config.yaml` in version control with proper .gitignore
- Document configuration changes in commit messages
- Use branches for experimental configurations
- Backup configuration files before major changes
- Test configuration changes in development environment
### Operations
- Monitor cluster health with `wild-health`
- Set up regular backup schedules with `wild-backup`
- Keep applications updated with latest security patches
- Monitor disk usage and expand storage as needed
- Document custom configurations and procedures
### Development
- Follow Wild Cloud app structure conventions
- Use proper Kubernetes security contexts
- Include comprehensive health checks and probes
- Test applications thoroughly before deployment
- Document application-specific configuration requirements
## Common Use Cases
### Personal Blog/Website
```bash
# Deploy Ghost blog with custom domain
wild-config-set apps.ghost.domain "blog.yourdomain.com"
wild-app-add ghost
wild-app-deploy ghost
```
### Photo Management
```bash
# Deploy Immich for photo backup and management
wild-app-add postgresql immich
wild-app-deploy postgresql immich
```
### Development Environment
```bash
# Set up Git hosting and container registry
wild-app-add gitea docker-registry
wild-app-deploy gitea docker-registry
```
### AI/ML Workloads
```bash
# Deploy vLLM for local AI inference
wild-config-set apps.vllm.model "Qwen/Qwen2.5-7B-Instruct"
wild-app-add vllm
wild-app-deploy vllm
```
## Troubleshooting
### Common Issues
**Cluster Not Responding**:
```bash
# Check node status
kubectl get nodes
talosctl health
# Verify network connectivity
ping <node-ip>
```
**Applications Not Starting**:
```bash
# Check pod status
kubectl get pods -n <app-namespace>
# View logs
kubectl logs deployment/<app-name> -n <app-namespace>
# Run diagnostics
wild-app-doctor <app-name>
```
**Certificate Issues**:
```bash
# Check certificate status
kubectl get certificates -A
# View cert-manager logs
kubectl logs -n cert-manager deployment/cert-manager
```
**DNS Problems**:
```bash
# Test DNS resolution
nslookup <app-domain>
# Check external-dns logs
kubectl logs -n external-dns deployment/external-dns
```
### Getting Help
**Documentation**:
- Check `docs/` directory for detailed guides
- Review application-specific README files
- Consult Kubernetes and Talos documentation
**Community Support**:
- Report issues on GitHub repository
- Join community forums and discussions
- Share configurations and troubleshooting tips
**Professional Support**:
- Consider professional services for production deployments
- Engage with cloud infrastructure consultants
- Participate in training and certification programs
## Advanced Topics
### Custom Applications
Create your own Wild Cloud applications:
1. **Create App Directory**: `apps/myapp/`
2. **Define Manifest**: Include metadata and configuration defaults
3. **Create Templates**: Kubernetes resources with gomplate variables
4. **Test Deployment**: Use standard Wild Cloud workflow
5. **Share**: Contribute back to the community
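A minimal skeleton for steps 1-3 might look like the following. The file names mirror the existing apps in the repository; the manifest keys shown here are illustrative, so copy an existing app such as `apps/ghost/` for the exact schema:
```bash
mkdir -p apps/myapp

# App metadata and configuration defaults (keys are illustrative)
cat > apps/myapp/manifest.yaml <<'EOF'
name: myapp
description: Example custom application
defaultConfig:
  domain: myapp.example.com
  replicas: 1
EOF

# Kustomize entry point listing the resource templates
cat > apps/myapp/kustomization.yaml <<'EOF'
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
  - ingress.yaml
EOF

# Steps 4-5: test with the standard workflow, then contribute it back
# wild-app-add myapp && wild-app-deploy myapp
```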
### Multi-Environment Deployments
Manage multiple Wild Cloud instances:
- **Development**: Single-node cluster for testing
- **Staging**: Multi-node cluster mirroring production
- **Production**: Full HA cluster with monitoring and backups
### Integration with External Services
Extend Wild Cloud capabilities:
- **External DNS Providers**: Cloudflare, Route53, Google DNS
- **Backup Storage**: S3, Google Cloud Storage, Azure Blob
- **Monitoring**: Prometheus, Grafana, AlertManager
- **CI/CD**: GitLab CI, GitHub Actions, Jenkins
### Performance Optimization
Optimize for your workloads:
- **Resource Allocation**: CPU and memory limits/requests
- **Storage Performance**: NVMe SSDs, storage classes
- **Network Optimization**: Network policies, service mesh
- **Scaling**: Horizontal pod autoscaling, cluster autoscaling
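For example, CPU and memory requests/limits can be adjusted on a running deployment as a quick experiment (app name, namespace, and values are placeholders; bake permanent values into the app's configuration so redeployments keep them):
```bash
# Give the ghost deployment explicit CPU/memory requests and limits
kubectl set resources deployment/ghost -n ghost \
  --requests=cpu=250m,memory=512Mi \
  --limits=cpu=1,memory=1Gi
```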
Wild Cloud provides a solid foundation for personal cloud infrastructure while maintaining the flexibility to grow and adapt to changing needs. Whether you're running a simple blog or a complex multi-service application, Wild Cloud's enterprise-grade technologies ensure your infrastructure is reliable, secure, and maintainable.

View File

@@ -0,0 +1,487 @@
# Wild Cloud Project Architecture
Wild Cloud consists of two main directory structures: the **Wild Cloud Repository** (source code and templates) and **User Cloud Directories** (individual deployments). Understanding this architecture is essential for working with Wild Cloud effectively.
## Architecture Overview
```
Wild Cloud Repository (/path/to/wild-cloud-repo) ← Source code, templates, scripts
User Cloud Directory (/path/to/my-cloud) ← Individual deployment instance
Kubernetes Cluster ← Running infrastructure
```
## Wild Cloud Repository Structure
The Wild Cloud repository (`WC_ROOT`) contains the source code, templates, and tools:
### `/bin/` - Command Line Interface
**Purpose**: All Wild Cloud CLI commands and utilities
```
bin/
├── wild-* # All user-facing commands (34+ scripts)
├── wild-common.sh # Common utilities and functions
├── README.md # CLI documentation
└── helm-chart-to-kustomize # Utility for converting Helm charts
```
**Key Commands**:
- **Setup**: `wild-init`, `wild-setup`, `wild-setup-cluster`, `wild-setup-services`
- **Apps**: `wild-app-*`, `wild-apps-list`
- **Config**: `wild-config*`, `wild-secret*`
- **Operations**: `wild-backup`, `wild-health`, `wild-dashboard-token`
### `/apps/` - Application Templates
**Purpose**: Pre-built applications ready for deployment
```
apps/
├── README.md # Apps system documentation
├── ghost/ # Blog publishing platform
│ ├── manifest.yaml # App metadata and defaults
│ ├── kustomization.yaml # Kustomize configuration
│ ├── deployment.yaml # Kubernetes deployment
│ ├── service.yaml # Service definition
│ ├── ingress.yaml # HTTPS ingress
│ └── ...
├── immich/ # Photo management
├── gitea/ # Git hosting
├── postgresql/ # Database service
├── vllm/ # AI/LLM inference
└── ...
```
**Application Categories**:
- **Content Management**: Ghost, Discourse
- **Media**: Immich
- **Development**: Gitea, Docker Registry
- **Databases**: PostgreSQL, MySQL, Redis
- **AI/ML**: vLLM
- **Infrastructure**: Memcached, NFS
### `/setup/` - Infrastructure Templates
**Purpose**: Cluster and service deployment templates
```
setup/
├── README.md
├── cluster-nodes/ # Talos node configuration
│ ├── init-cluster.sh # Cluster initialization script
│ ├── patch.templates/ # Node-specific config templates
│ │ ├── controlplane.yaml # Control plane template
│ │ └── worker.yaml # Worker node template
│ └── talos-schemas.yaml # Version mappings
├── cluster-services/ # Core Kubernetes services
│ ├── README.md
│ ├── metallb/ # Load balancer
│ ├── traefik/ # Ingress controller
│ ├── cert-manager/ # Certificate management
│ ├── longhorn/ # Distributed storage
│ ├── coredns/ # DNS resolution
│ ├── externaldns/ # DNS record management
│ ├── kubernetes-dashboard/ # Web UI
│ └── ...
├── dnsmasq/ # DNS and PXE boot server
├── home-scaffold/ # User directory templates
└── operator/ # Additional operator tools
```
### `/experimental/` - Development Projects
**Purpose**: Experimental features and development tools
```
experimental/
├── daemon/ # Go API daemon
│ ├── main.go # API server
│ ├── Makefile # Build automation
│ └── README.md
└── app/ # React dashboard
├── src/ # React source code
├── package.json # Dependencies
├── pnpm-lock.yaml # Lock file
└── README.md
```
### `/scripts/` - Utility Scripts
**Purpose**: Installation and utility scripts
```
scripts/
├── setup-utils.sh # Install dependencies
└── install-wild-cloud-dependencies.sh
```
### `/docs/` - Documentation
**Purpose**: User guides and documentation
```
docs/
├── guides/ # Setup and usage guides
├── agent-context/ # Agent documentation
│ └── wildcloud/ # Context files for AI agents
└── *.md # Various documentation files
```
### `/test/` - Test Suite
**Purpose**: Automated testing with Bats
```
test/
├── bats/ # Bats testing framework
├── fixtures/ # Test data and configurations
├── run_bats_tests.sh # Test runner
└── *.bats # Individual test files
```
### Root Files
```
/
├── README.md # Project overview
├── CLAUDE.md # AI assistant context
├── LICENSE # GNU AGPLv3
├── CONTRIBUTING.md # Contribution guidelines
├── env.sh # Environment setup
├── .gitignore # Git exclusions
└── .gitmodules # Git submodules
```
## User Cloud Directory Structure
Each user deployment (`WC_HOME`) is an independent cloud instance:
### Directory Layout
```
my-cloud/ # User's cloud directory
├── .wildcloud/ # Project marker and cache
│ ├── cache/ # Downloaded templates
│ │ ├── apps/ # Cached app templates
│ │ └── services/ # Cached service templates
│ └── logs/ # Operation logs
├── config.yaml # Main configuration
├── secrets.yaml # Sensitive data (600 permissions)
├── env.sh # Environment setup (auto-generated)
├── apps/ # Deployed application configs
│ ├── ghost/ # Compiled ghost configuration
│ ├── postgresql/ # Database configuration
│ └── ...
├── setup/ # Infrastructure configurations
│ ├── cluster-nodes/ # Node-specific configurations
│ │ └── generated/ # Generated Talos configs
│ └── cluster-services/ # Compiled service configurations
├── docs/ # Project-specific documentation
├── .kube/ # Kubernetes configuration
│ └── config # kubectl configuration
├── .talos/ # Talos configuration
│ └── config # talosctl configuration
└── backups/ # Local backup staging
```
### Configuration Files
**`config.yaml`** - Main configuration (version controlled):
```yaml
cloud:
domain: "example.com"
email: "admin@example.com"
cluster:
name: "my-cluster"
nodeCount: 3
apps:
ghost:
domain: "blog.example.com"
```
**`secrets.yaml`** - Sensitive data (not version controlled):
```yaml
apps:
ghost:
dbPassword: "generated-password"
postgresql:
rootPassword: "generated-password"
cluster:
talos:
secrets: "base64-encoded-secrets"
```
**`.wildcloud/`** - Project metadata:
- Marks directory as Wild Cloud project
- Contains cached templates and temporary files
- Used for project detection by scripts
### Generated Directories
**`apps/`** - Compiled application configurations:
- Created by `wild-app-add` command
- Contains ready-to-deploy Kubernetes manifests
- Templates processed with user configuration
- Each app in separate subdirectory
**`setup/cluster-nodes/generated/`** - Talos configurations:
- Base cluster configuration (`controlplane.yaml`, `worker.yaml`)
- Node-specific patches and final configs
- Cluster secrets and certificates
- Generated by `wild-cluster-config-generate`
**`setup/cluster-services/`** - Kubernetes services:
- Compiled service configurations
- Generated by `wild-cluster-services-configure`
- Ready for deployment to cluster
## Template Processing Flow
### From Repository to Deployment
1. **Template Storage**: Templates stored in repository with placeholder variables
2. **Configuration Merge**: `wild-app-add` reads templates directly from repository and merges app defaults with user config
3. **Template Compilation**: gomplate processes templates with user data
4. **Manifest Generation**: Final Kubernetes manifests created in user directory
5. **Deployment**: `wild-app-deploy` applies manifests to cluster
### Template Variables
**Repository Templates** (before processing):
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: ghost
namespace: {{ .apps.ghost.namespace }}
spec:
replicas: {{ .apps.ghost.replicas | default 1 }}
template:
spec:
containers:
- name: ghost
image: "{{ .apps.ghost.image }}"
env:
- name: url
value: "https://{{ .apps.ghost.domain }}"
```
**User Directory** (after processing):
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: ghost
namespace: ghost
spec:
replicas: 2
template:
spec:
containers:
- name: ghost
image: "ghost:5.0.0"
env:
- name: url
value: "https://blog.example.com"
```
## File Permissions and Security
### Security Model
**Configuration Security**:
```bash
config.yaml # 644 (readable by group)
secrets.yaml # 600 (owner only)
.wildcloud/ # 755 (standard directory)
apps/ # 755 (standard directory)
```
**Git Integration**:
```gitignore
# Automatically excluded from version control
secrets.yaml # Never commit secrets
.wildcloud/cache/ # Temporary files
.wildcloud/logs/ # Operation logs
setup/cluster-nodes/generated/ # Generated configs
.kube/ # Kubernetes configs
.talos/ # Talos configs
backups/ # Backup files
```
### Access Patterns
**Read Operations**:
- Scripts read config and secrets via `wild-config` and `wild-secret`
- Template processor accesses both files for compilation
- Kubernetes tools read generated manifests
**Write Operations**:
- Only Wild Cloud commands modify config and secrets
- Manual editing supported but not recommended
- Backup processes create read-only copies
## Development Workflow
### Repository Development
**Setup Development Environment**:
```bash
git clone https://github.com/username/wild-cloud
cd wild-cloud
source env.sh # Set up environment
scripts/setup-utils.sh # Install dependencies
```
**Testing Changes**:
```bash
# Test specific functionality
test/run_bats_tests.sh
# Test with real cloud directory
cd /path/to/test-cloud
wild-app-add myapp # Test app changes
wild-setup-cluster --dry-run # Test cluster changes
```
### User Workflow
**Initial Setup**:
```bash
mkdir my-cloud && cd my-cloud
wild-init # Initialize project
wild-setup # Deploy infrastructure
```
**Daily Operations**:
```bash
wild-apps-list # Browse available apps
wild-app-add ghost # Configure app
wild-app-deploy ghost # Deploy to cluster
```
**Configuration Management**:
```bash
wild-config apps.ghost.domain # Read configuration
wild-config-set apps.ghost.storage "20Gi" # Update configuration
wild-app-deploy ghost # Apply changes
```
## Integration Points
### External Systems
**DNS Providers**:
- Cloudflare API for DNS record management
- Route53 support for AWS domains
- Generic webhook support for other providers
**Certificate Authorities**:
- Let's Encrypt (primary)
- Custom CA support
- Manual certificate import
**Storage Backends**:
- Local storage via Longhorn
- NFS network storage
- Cloud storage integration (S3, etc.)
**Backup Systems**:
- Restic for deduplication and encryption
- S3-compatible storage backends
- Local and remote backup targets
### Kubernetes Integration
**Custom Resources**:
- Traefik IngressRoute and Middleware
- cert-manager Certificate and Issuer
- Longhorn Volume and Engine
- ExternalDNS DNSEndpoint
**Standard Resources**:
- Deployments, Services, ConfigMaps
- Ingress, PersistentVolumes, Secrets
- NetworkPolicies, ServiceAccounts
- Jobs, CronJobs, DaemonSets
## Extensibility Points
### Custom Applications
**Create New Apps**:
1. Create directory in `apps/`
2. Define `manifest.yaml` with metadata
3. Create Kubernetes resource templates
4. Test with `wild-app-add` and `wild-app-deploy`
**App Requirements**:
- Follow Wild Cloud labeling standards
- Use gomplate template syntax
- Include external-dns annotations
- Implement proper security contexts
### Custom Services
**Add Infrastructure Services**:
1. Create directory in `setup/cluster-services/`
2. Define installation and configuration scripts
3. Create Kubernetes manifests with templates
4. Integrate with service deployment pipeline
### Script Extensions
**Extend CLI**:
- Add scripts to `bin/` directory with `wild-` prefix
- Follow common script patterns (error handling, help text)
- Source `wild-common.sh` for utilities
- Use configuration system for customization
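A minimal sketch of a new command following those patterns (helper and configuration key names are illustrative; check `bin/wild-common.sh` for the utilities that actually exist):
```bash
#!/usr/bin/env bash
# bin/wild-hello - example CLI extension (illustrative sketch)
set -euo pipefail

# Load shared utilities (error handling, logging, config helpers)
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "${SCRIPT_DIR}/wild-common.sh"

usage() {
  echo "Usage: wild-hello [--help]"
  echo "Prints the configured cloud domain."
}

if [[ "${1:-}" == "--help" ]]; then
  usage
  exit 0
fi

# Reuse the configuration system instead of parsing YAML by hand
domain="$(wild-config cloud.domain)"
echo "Hello from ${domain}"
```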
## Deployment Patterns
### Single-Node Development
**Configuration**:
```yaml
cluster:
nodeCount: 1
nodes:
all-in-one:
roles: ["controlplane", "worker"]
```
**Suitable For**:
- Development and testing
- Learning Kubernetes concepts
- Small personal deployments
- Resource-constrained environments
### Multi-Node Production
**Configuration**:
```yaml
cluster:
nodeCount: 5
nodes:
control-1: { role: "controlplane" }
control-2: { role: "controlplane" }
control-3: { role: "controlplane" }
worker-1: { role: "worker" }
worker-2: { role: "worker" }
```
**Suitable For**:
- Production workloads
- High availability requirements
- Scalable application hosting
- Enterprise-grade deployments
### Hybrid Deployments
**Configuration**:
```yaml
cluster:
nodes:
control-1:
role: "controlplane"
taints: [] # Allow workloads on control plane
worker-gpu:
role: "worker"
labels:
nvidia.com/gpu: "true" # GPU-enabled node
```
**Use Cases**:
- Mixed workload requirements
- Specialized hardware (GPU, storage)
- Cost optimization
- Gradual scaling
The Wild Cloud architecture provides a solid foundation for personal cloud infrastructure while maintaining flexibility for customization and extension.

View File

@@ -0,0 +1,390 @@
# Wild Cloud Setup Process & Infrastructure
Wild Cloud provides a complete, production-ready Kubernetes infrastructure designed for personal use. It combines enterprise-grade technologies to create a self-hosted cloud platform with automated deployment, HTTPS certificates, and web management interfaces.
## Setup Phases Overview
The Wild Cloud setup follows a sequential, dependency-aware process:
1. **Environment Setup** - Install required tools and dependencies
2. **DNS/Network Foundation** - Set up dnsmasq for DNS and PXE booting
3. **Cluster Infrastructure** - Deploy Talos Linux nodes and Kubernetes cluster
4. **Cluster Services** - Install core services (ingress, storage, certificates, etc.)
## Phase 1: Environment Setup
### Dependencies Installation
**Script**: `scripts/setup-utils.sh`
**Required Tools**:
- `kubectl` - Kubernetes CLI
- `gomplate` - Template processor for configuration
- `kustomize` - Kubernetes configuration management
- `yq` - YAML processor
- `restic` - Backup tool
- `talosctl` - Talos Linux cluster management
### Project Initialization
**Command**: `wild-init`
Creates the basic Wild Cloud directory structure:
- `.wildcloud/` - Project marker and cache
- `config.yaml` - Main configuration file
- `secrets.yaml` - Sensitive data storage
- Basic project scaffolding
## Phase 2: DNS/Network Foundation
### dnsmasq Infrastructure
**Location**: `setup/dnsmasq/`
**Requirements**: Dedicated Linux machine with static IP
**Services Provided**:
1. **LAN DNS Server**
- Forwards internal domains (`*.internal.domain.com`) to cluster
- Forwards external domains (`*.domain.com`) to cluster
- Provides DNS resolution for entire network
2. **PXE Boot Server**
- Enables network booting for cluster node installation
- DHCP/TFTP services for Talos Linux deployment
- Automated node provisioning
**Network Configuration Example**:
```yaml
network:
subnet: 192.168.1.0/24
gateway: 192.168.1.1
dnsmasq_ip: 192.168.1.50
dhcp_range: 192.168.1.100-200
metallb_pool: 192.168.1.80-89
control_plane_vip: 192.168.1.90
node_ips: 192.168.1.91-93
```
## Phase 3: Cluster Infrastructure Setup
### Talos Linux Foundation
**Command**: `wild-setup-cluster`
**Talos Configuration**:
- **Version**: v1.11.0 (configurable)
- **Immutable OS**: Designed specifically for Kubernetes
- **System Extensions**:
- Intel microcode updates
- iSCSI tools for storage
- gVisor container runtime
- NVIDIA GPU support (optional)
- Additional system utilities
### Cluster Setup Process
#### 1. Configuration Generation
**Script**: `wild-cluster-config-generate`
- Generates base Talos configurations (`controlplane.yaml`, `worker.yaml`)
- Creates cluster secrets using `talosctl gen config`
- Establishes foundation for all node configurations
#### 2. Node Setup (Atomic Operations)
**Script**: `wild-node-setup <node-name> [options]`
**Complete Node Lifecycle Management**:
- **Hardware Detection**: Discovers network interfaces and storage devices
- **Configuration Generation**: Creates node-specific patches and final configs
- **Deployment**: Applies Talos configuration to the node
**Options**:
- `--detect`: Force hardware re-detection
- `--no-deploy`: Generate configuration only, skip deployment
**Integration with Cluster Setup**:
- `wild-setup-cluster` automatically calls `wild-node-setup` for each node
- Individual node failures don't break cluster setup
- Clear retry instructions for failed nodes
### Cluster Architecture
**Control Plane**:
- 3 nodes for high availability
- Virtual IP (VIP) for load balancing
- etcd distributed across all control plane nodes
**Worker Nodes**:
- Variable count (configured during setup)
- Dedicated workload execution
- Storage participation via Longhorn
**Networking**:
- All nodes on same LAN segment
- Sequential IP assignment
- MetalLB integration for load balancing
## Phase 4: Cluster Services Installation
### Services Deployment Process
**Command**: `wild-setup-services [options]`
- **`--fetch`**: Fetch fresh templates before setup
- **`--no-deploy`**: Configure only, skip deployment
**New Architecture**: Per-service atomic operations
- Uses `wild-service-setup <service>` for each service in dependency order
- Each service handles complete lifecycle: fetch → configure → deploy
- Dependency validation before each service deployment
- Fail-fast with clear recovery instructions
**Individual Service Management**: `wild-service-setup <service> [options]`
- **Default**: Configure and deploy using existing templates
- **`--fetch`**: Fetch fresh templates before setup
- **`--no-deploy`**: Configure only, skip deployment
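For example, to retry a single failed service or refresh just its templates (service names match the directories under `setup/cluster-services/`):
```bash
# Re-run only the ingress controller, fetching fresh templates first
wild-service-setup traefik --fetch

# Regenerate MetalLB's configuration without touching the cluster
wild-service-setup metallb --no-deploy
```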
### Core Services (Installed in Order)
#### 1. MetalLB Load Balancer
**Location**: `setup/cluster-services/metallb/`
- **Purpose**: Provides load balancing for bare metal clusters
- **Functionality**: Assigns external IPs to Kubernetes services
- **Configuration**: IP address pool from local network range
- **Integration**: Foundation for ingress traffic routing
#### 2. Longhorn Distributed Storage
**Location**: `setup/cluster-services/longhorn/`
- **Purpose**: Distributed block storage for persistent volumes
- **Features**:
- Cross-node data replication
- Snapshot and backup capabilities
- Volume expansion and management
- Web-based management interface
- **Storage**: Uses local disks from all cluster nodes
#### 3. Traefik Ingress Controller
**Location**: `setup/cluster-services/traefik/`
- **Purpose**: HTTP/HTTPS reverse proxy and ingress controller
- **Features**:
- Automatic service discovery
- TLS termination
- Load balancing and routing
- Gateway API support
- **Integration**: Works with MetalLB for external traffic
#### 4. CoreDNS
**Location**: `setup/cluster-services/coredns/`
- **Purpose**: DNS resolution for cluster services
- **Integration**: Connects with external DNS providers
- **Functionality**: Service discovery and DNS forwarding
#### 5. cert-manager
**Location**: `setup/cluster-services/cert-manager/`
- **Purpose**: Automatic TLS certificate management
- **Features**:
- Let's Encrypt integration
- Automatic certificate issuance and renewal
- Multiple certificate authorities support
- Certificate lifecycle management
#### 6. ExternalDNS
**Location**: `setup/cluster-services/externaldns/`
- **Purpose**: Automatic DNS record management
- **Functionality**:
- Syncs Kubernetes services with DNS providers
- Automatic A/CNAME record creation
- Supports multiple DNS providers (Cloudflare, Route53, etc.)
#### 7. Kubernetes Dashboard
**Location**: `setup/cluster-services/kubernetes-dashboard/`
- **Purpose**: Web UI for cluster management
- **Access**: `https://dashboard.internal.domain.com`
- **Authentication**: Token-based access via `wild-dashboard-token`
- **Features**: Resource management, monitoring, troubleshooting
#### 8. NFS Storage (Optional)
**Location**: `setup/cluster-services/nfs/`
- **Purpose**: Network file system for shared storage
- **Use Cases**: Media storage, backups, shared data
- **Integration**: Mounted as persistent volumes in applications
#### 9. Docker Registry
**Location**: `setup/cluster-services/docker-registry/`
- **Purpose**: Private container registry
- **Features**: Store custom images locally
- **Integration**: Used by applications and CI/CD pipelines
## Infrastructure Components Deep Dive
### DNS and Domain Architecture
```
External:  Internet → External DNS → MetalLB LoadBalancer → Traefik → Kubernetes Services
Internal:  Internal Network → Internal DNS (dnsmasq) → MetalLB LoadBalancer → Traefik → Kubernetes Services
```
**Domain Types**:
- **External**: `app.domain.com` - Public-facing services
- **Internal**: `app.internal.domain.com` - Admin interfaces only
- **Resolution**: dnsmasq forwards all domain traffic to cluster
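Conceptually, that forwarding is a single dnsmasq rule pointing the whole domain tree at the cluster's ingress IP. A minimal hand-written sketch using the example network values above (the actual dnsmasq setup is generated from the templates in `setup/dnsmasq/`):
```bash
cat <<'EOF' | sudo tee /etc/dnsmasq.d/wild-cloud.conf
# Send every *.example.com lookup (including *.internal.example.com)
# to the Traefik LoadBalancer IP assigned by MetalLB
address=/example.com/192.168.1.80
EOF
sudo systemctl restart dnsmasq
```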
### Certificate and TLS Management
**Automatic Certificate Flow**:
1. Service deployed with ingress annotation
2. cert-manager detects certificate requirement
3. Let's Encrypt challenge initiated
4. Certificate issued and stored in Kubernetes secret
5. Traefik uses certificate for TLS termination
6. Automatic renewal before expiration
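From an application's side, the only trigger for that flow is an ingress that requests a TLS secret. A hedged sketch (issuer name, host, namespace, and service are placeholders; the bundled app templates already include their equivalent):
```bash
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
  namespace: myapp
  annotations:
    # Ask cert-manager to issue a certificate through this ClusterIssuer
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  rules:
    - host: myapp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp
                port:
                  number: 80
  tls:
    - hosts:
        - myapp.example.com
      # cert-manager stores the issued certificate in this secret;
      # Traefik uses it for TLS termination
      secretName: myapp-tls
EOF
```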
### Storage Architecture
**Longhorn Distributed Storage**:
- Block-level replication across nodes
- Default 3-replica policy for data durability
- Automatic failover and recovery
- Snapshot and backup capabilities
- Web UI for management and monitoring
**Storage Classes**:
- `longhorn` - Default replicated storage
- `longhorn-single` - Single replica for non-critical data
- `nfs` - Shared network storage (if configured)
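Applications choose between these through the `storageClassName` on their volume claims. A minimal sketch (name, namespace, and size are placeholders):
```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: myapp-data
  namespace: myapp
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn   # or longhorn-single / nfs
  resources:
    requests:
      storage: 10Gi
EOF
```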
### Network Traffic Flow
**External Request Flow**:
1. DNS resolution via dnsmasq → cluster IP
2. Traffic hits MetalLB load balancer
3. MetalLB forwards to Traefik ingress
4. Traefik terminates TLS and routes to service
5. Service forwards to appropriate pod
6. Response follows reverse path
### High Availability Features
**Control Plane HA**:
- 3 control plane nodes with leader election
- Virtual IP for API server access
- etcd cluster with automatic failover
- Distributed workload scheduling
**Storage HA**:
- Longhorn 3-way replication
- Automatic replica placement across nodes
- Node failure recovery
- Data integrity verification
**Networking HA**:
- MetalLB speaker pods on all nodes
- Automatic load balancer failover
- Multiple ingress controller replicas
## Hardware Requirements
### Minimum Specifications
- **Nodes**: 3 control plane + optional workers
- **RAM**: 8GB minimum per node (16GB+ recommended)
- **CPU**: 4 cores minimum per node
- **Storage**: 100GB+ local storage per node
- **Network**: Gigabit ethernet connectivity
### Network Requirements
- All nodes on same LAN segment
- Static IP assignments or DHCP reservations
- dnsmasq server accessible by all nodes
- Internet connectivity for image pulls and Let's Encrypt
### Recommended Hardware
- **Control Plane**: 16GB RAM, 8 cores, 200GB NVMe SSD
- **Workers**: 32GB RAM, 16 cores, 500GB NVMe SSD
- **Network**: Dedicated VLAN or network segment
- **Redundancy**: UPS protection, dual network interfaces
## Configuration Management
### Configuration Files
- `config.yaml` - Main configuration (domains, network, apps)
- `secrets.yaml` - Sensitive data (passwords, API keys, certificates)
- `.wildcloud/` - Cache and temporary files
### Template System
**gomplate Integration**:
- All configurations processed as templates
- Access to config and secrets via template variables
- Dynamic configuration generation
- Environment-specific customization
### Configuration Commands
```bash
# Read configuration values
wild-config cluster.name
wild-config apps.ghost.domain
# Set configuration values
wild-config-set cloud.domain "example.com"
wild-config-set cluster.nodeCount 5
# Secret management
wild-secret apps.database.password
wild-secret-set apps.api.key "secret-value"
```
## Setup Commands Reference
### Complete Setup
```bash
wild-init # Initialize project
wild-setup # Complete automated setup
```
### Phase-by-Phase Setup
```bash
wild-setup-cluster # Cluster infrastructure only
wild-setup-services # Cluster services only
```
### Individual Operations
```bash
wild-cluster-config-generate # Generate base configs
wild-node-setup <node-name> # Complete node setup (detect → configure → deploy)
wild-node-setup <node-name> --detect # Force hardware re-detection
wild-node-setup <node-name> --no-deploy # Configuration only
wild-dashboard-token # Get dashboard access
wild-health # System health check
```
## Troubleshooting and Validation
### Health Checks
```bash
wild-health # Overall system status
kubectl get nodes # Node status
kubectl get pods -A # All pod status
talosctl health # Talos cluster health
```
### Service Validation
```bash
kubectl get svc -n metallb-system # MetalLB status
kubectl get pods -n longhorn-system # Storage status
kubectl get pods -n traefik # Ingress status
kubectl get certificates -A # Certificate status
```
### Log Analysis
```bash
talosctl logs -f machined # Talos system logs
kubectl logs -n traefik deployment/traefik # Ingress logs
kubectl logs -n cert-manager deployment/cert-manager # Certificate logs
```
This comprehensive setup process creates a production-ready personal cloud infrastructure with enterprise-grade reliability, security, and management capabilities.

23
docs/MAINTENANCE.md Normal file
View File

@@ -0,0 +1,23 @@
# Maintenance Guide
Keep your wild cloud running smoothly.
- [Security Best Practices](./guides/security.md)
- [Monitoring](./guides/monitoring.md)
- [Making backups](./guides/making-backups.md)
- [Restoring backups](./guides/restoring-backups.md)
## Upgrade
- [Upgrade applications](./guides/upgrade-applications.md)
- [Upgrade kubernetes](./guides/upgrade-kubernetes.md)
- [Upgrade Talos](./guides/upgrade-talos.md)
- [Upgrade Wild Cloud](./guides/upgrade-wild-cloud.md)
## Troubleshooting
- [Cluster issues](./guides/troubleshoot-cluster.md)
- [DNS issues](./guides/troubleshoot-dns.md)
- [Service connectivity issues](./guides/troubleshoot-service-connectivity.md)
- [TLS certificate issues](./guides/troubleshoot-tls-certificates.md)
- [Visibility issues](./guides/troubleshoot-visibility.md)

3
docs/SETUP.md Normal file
View File

@@ -0,0 +1,3 @@
# Setting Up Your Wild Cloud
Visit https://mywildcloud.org/get-started for full wild cloud setup instructions.

View File

@@ -0,0 +1,265 @@
# Making Backups
This guide covers how to create backups of your wild-cloud infrastructure using the integrated backup system.
## Overview
The wild-cloud backup system creates encrypted, deduplicated snapshots using restic. It backs up three main components:
- **Applications**: Database dumps and persistent volume data
- **Cluster**: Kubernetes resources and etcd state
- **Configuration**: Wild-cloud repository and settings
## Prerequisites
Before making backups, ensure you have:
1. **Environment configured**: Run `source env.sh` to load backup configuration
2. **Restic repository**: Backup repository configured in `config.yaml`
3. **Backup password**: Set in wild-cloud secrets
4. **Staging directory**: Configured path for temporary backup files
## Backup Components
### Applications (`wild-app-backup`)
Backs up individual applications including:
- **Database dumps**: PostgreSQL/MySQL databases in compressed custom format
- **PVC data**: Application files streamed directly for restic deduplication
- **Auto-discovery**: Finds databases and PVCs based on app manifest.yaml
### Cluster Resources (`wild-backup --cluster-only`)
Backs up cluster-wide resources:
- **Kubernetes resources**: All pods, services, deployments, secrets, configmaps
- **Storage definitions**: PersistentVolumes, PVCs, StorageClasses
- **etcd snapshot**: Complete cluster state for disaster recovery
### Configuration (`wild-backup --home-only`)
Backs up wild-cloud configuration:
- **Repository contents**: All app definitions, manifests, configurations
- **Settings**: Wild-cloud configuration files and customizations
## Making Backups
### Full System Backup (Recommended)
Create a complete backup of everything:
```bash
# Backup all components (apps + cluster + config)
wild-backup
```
This is equivalent to:
```bash
wild-backup --home --apps --cluster
```
### Selective Backups
#### Applications Only
```bash
# All applications
wild-backup --apps-only
# Single application
wild-app-backup discourse
# Multiple applications
wild-app-backup discourse gitea immich
```
#### Cluster Only
```bash
# Kubernetes resources + etcd
wild-backup --cluster-only
```
#### Configuration Only
```bash
# Wild-cloud repository
wild-backup --home-only
```
### Excluding Components
Skip specific components:
```bash
# Skip config, backup apps + cluster
wild-backup --no-home
# Skip applications, backup config + cluster
wild-backup --no-apps
# Skip cluster resources, backup config + apps
wild-backup --no-cluster
```
## Backup Process Details
### Application Backup Process
1. **Discovery**: Parses `manifest.yaml` to find database and PVC dependencies
2. **Database backup**: Creates compressed custom-format dumps
3. **PVC backup**: Streams files directly to staging for restic deduplication
4. **Staging**: Organizes files in clean directory structure
5. **Upload**: Creates individual restic snapshots per application
### Cluster Backup Process
1. **Resource export**: Exports all Kubernetes resources to YAML
2. **etcd snapshot**: Creates point-in-time etcd backup via talosctl
3. **Upload**: Creates single restic snapshot for cluster state
### Restic Snapshots
Each backup creates tagged restic snapshots:
```bash
# View all snapshots
restic snapshots
# Filter by component
restic snapshots --tag discourse # Specific app
restic snapshots --tag cluster # Cluster resources
restic snapshots --tag wc-home # Wild-cloud config
```
## Where Backup Files Are Staged
Before uploading to your restic repository, backup files are organized in a staging directory. This temporary area lets you see exactly what's being backed up and helps with deduplication.
Here's what the staging area looks like:
```
backup-staging/
├── apps/
│ ├── discourse/
│ │ ├── database_20250816T120000Z.dump
│ │ ├── globals_20250816T120000Z.sql
│ │ └── discourse/
│ │ └── data/ # All the actual files
│ ├── gitea/
│ │ ├── database_20250816T120000Z.dump
│ │ └── gitea-data/
│ │ └── data/ # Git repositories, etc.
│ └── immich/
│ ├── database_20250816T120000Z.dump
│ └── immich-data/
│ └── upload/ # Photos and videos
└── cluster/
├── all-resources.yaml # All running services
├── secrets.yaml # Passwords and certificates
├── configmaps.yaml # Configuration data
└── etcd-snapshot.db # Complete cluster state
```
This staging approach means you can examine backup contents before they're uploaded, and restic can efficiently deduplicate files that haven't changed.
## Advanced Usage
### Custom Backup Scripts
Applications can provide custom backup logic:
```bash
# Create apps/myapp/backup.sh for custom behavior
chmod +x apps/myapp/backup.sh
# wild-app-backup will use custom script if present
wild-app-backup myapp
```
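A hedged skeleton for such a script (the exact contract between `wild-app-backup` and custom scripts isn't shown here, so treat the paths and variables below as illustrative and check the `wild-app-backup` source):
```bash
#!/usr/bin/env bash
# apps/myapp/backup.sh - custom backup logic (illustrative sketch)
set -euo pipefail

APP_NAME="myapp"
# Staging path configured as cloud.backup.staging in config.yaml
STAGING_DIR="$(wild-config cloud.backup.staging)/apps/${APP_NAME}"
mkdir -p "${STAGING_DIR}"

# Example: export application data from the running pod into staging,
# where the backup pipeline picks it up for the restic snapshot
kubectl exec -n "${APP_NAME}" "deploy/${APP_NAME}" -- \
  tar cf - /var/lib/myapp \
  > "${STAGING_DIR}/data_$(date -u +%Y%m%dT%H%M%SZ).tar"
```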
### Monitoring Backup Status
```bash
# Check recent snapshots
restic snapshots | head -20
# Check specific app backups
restic snapshots --tag discourse
# Verify backup integrity
restic check
```
### Backup Automation
Set up automated backups with cron:
```bash
# Daily full backup at 2 AM
0 2 * * * cd /data/repos/payne-cloud && source env.sh && wild-backup
# Hourly app backups during business hours
0 9-17 * * * cd /data/repos/payne-cloud && source env.sh && wild-backup --apps-only
```
## Performance Considerations
### Large PVCs (like Immich photos)
The streaming backup approach provides:
- **First backup**: Full transfer time (all files processed)
- **Subsequent backups**: Only changed files processed (dramatically faster)
- **Storage efficiency**: Restic deduplication reduces storage usage
### Network Usage
- **Database dumps**: Compressed at source, efficient transfer
- **PVC data**: Uncompressed transfer, but restic handles deduplication
- **etcd snapshots**: Small files, minimal impact
## Troubleshooting
### Common Issues
**"No databases or PVCs found"**
- App has no `manifest.yaml` with database dependencies
- No PVCs with matching labels in app namespace
- Create custom `backup.sh` script for special cases
**"kubectl not found"**
- Ensure kubectl is installed and configured
- Check cluster connectivity with `kubectl get nodes`
**"Staging directory not set"**
- Configure `cloud.backup.staging` in `config.yaml`
- Ensure directory exists and is writable
**"Could not create etcd backup"**
- Ensure `talosctl` is installed for Talos clusters
- Check control plane node connectivity
- Verify etcd pods are accessible in kube-system namespace
### Backup Verification
Always verify backups periodically:
```bash
# Check restic repository integrity
restic check
# List recent snapshots
restic snapshots --compact
# Test restore to different directory
restic restore latest --target /tmp/restore-test
```
## Security Notes
- **Encryption**: All backups are encrypted with your backup password
- **Secrets**: Kubernetes secrets are included in cluster backups
- **Access control**: Secure your backup repository and passwords
- **Network**: Consider bandwidth usage for large initial backups
## Next Steps
- [Restoring Backups](restoring-backups.md) - Learn how to restore from backups
- Configure automated backup schedules
- Set up backup monitoring and alerting
- Test disaster recovery procedures

50
docs/guides/monitoring.md Normal file
View File

@@ -0,0 +1,50 @@
# System Health Monitoring
## Basic Monitoring
Check system health with:
```bash
# Node resource usage
kubectl top nodes
# Pod resource usage
kubectl top pods -A
# Persistent volume claims
kubectl get pvc -A
```
## Advanced Monitoring (Future Implementation)
Consider implementing:
1. **Prometheus + Grafana** for comprehensive monitoring:
```bash
# Placeholder for future implementation
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace
```
2. **Loki** for log aggregation:
```bash
# Placeholder for future implementation
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack --namespace logging --create-namespace
```
## Additional Resources
This document will be expanded in the future with:
- Detailed backup and restore procedures
- Monitoring setup instructions
- Comprehensive security hardening guide
- Automated maintenance scripts
For now, refer to the following external resources:
- [K3s Documentation](https://docs.k3s.io/)
- [Kubernetes Troubleshooting Guide](https://kubernetes.io/docs/tasks/debug/)
- [Velero Backup Documentation](https://velero.io/docs/latest/)
- [Kubernetes Security Best Practices](https://kubernetes.io/docs/concepts/security/)

View File

@@ -0,0 +1,294 @@
# Restoring Backups
This guide will walk you through restoring your applications and cluster from wild-cloud backups. Hopefully you'll never need this, but when you do, it's critical that the process works smoothly.
## Understanding Restore Types
Your wild-cloud backup system can restore different types of data depending on what you need to recover:
**Application restores** bring back individual applications by restoring their database contents and file storage. This is what you'll use most often - maybe you accidentally deleted something in Discourse, or Gitea got corrupted, or you want to roll back Immich to before a bad update.
**Cluster restores** are for disaster recovery scenarios where you need to rebuild your entire Kubernetes cluster from scratch. This includes restoring all the cluster's configuration and even its internal state.
**Configuration restores** bring back your wild-cloud repository and settings, which contain all the "recipes" for how your infrastructure should be set up.
## Before You Start Restoring
Make sure you have everything needed to perform restores. You need to be in your wild-cloud directory with the environment loaded (`source env.sh`). Your backup repository and password should be configured and working - you can test this by running `restic snapshots` to see your available backups.
Most importantly, make sure you have kubectl access to your cluster, since restores involve creating temporary pods and manipulating storage.
## Restoring Applications
### Basic Application Restore
The most common restore scenario is bringing back a single application. To restore the latest backup of an app:
```bash
wild-app-restore discourse
```
This restores both the database and all file storage for the discourse app. The restore system automatically figures out what the app needs based on its manifest file and what was backed up.
If you want to restore from a specific backup instead of the latest:
```bash
wild-app-restore discourse abc123
```
Where `abc123` is the snapshot ID from `restic snapshots --tag discourse`.
### Partial Restores
Sometimes you only need to restore part of an application. Maybe the database is fine but the files got corrupted, or vice versa.
To restore only the database:
```bash
wild-app-restore discourse --db-only
```
To restore only the file storage:
```bash
wild-app-restore discourse --pvc-only
```
To restore without database roles and permissions (if they're causing conflicts):
```bash
wild-app-restore discourse --skip-globals
```
### Finding Available Backups
To see what backups are available for an app:
```bash
wild-app-restore discourse --list
```
This shows recent snapshots with their IDs, timestamps, and what was included.
## How Application Restores Work
Understanding what happens during a restore can help when things don't go as expected.
### Database Restoration
When restoring a database, the system first downloads the backup files from your restic repository. It then prepares the database by creating any needed roles, disconnecting existing users, and dropping/recreating the database to ensure a clean restore.
For PostgreSQL databases, it uses `pg_restore` with parallel processing to speed up large database imports. For MySQL, it uses standard mysql import commands. The system also handles database ownership and permissions automatically.
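If you ever need to reproduce the PostgreSQL step by hand, the equivalent command run inside the postgres pod looks roughly like this (dump path and parallel job count are placeholders):
```bash
kubectl exec -n postgres deploy/postgres-deployment -- \
  pg_restore -U postgres -d discourse \
    --clean --if-exists -j 4 /tmp/database.dump
```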
### File Storage Restoration
File storage (PVC) restoration is more complex because it involves safely replacing files that might be actively used by running applications.
First, the system creates a safety snapshot using Longhorn. This means if something goes wrong during the restore, you can get back to where you started. Then it scales your application down to zero replicas so no pods are using the storage.
Next, it creates a temporary utility pod with the PVC mounted and copies all the backup files into place, preserving file permissions and structure. Once the data is restored and verified, it removes the utility pod and scales your application back up.
If everything worked correctly, the safety snapshot is automatically deleted. If something went wrong, the safety snapshot is preserved so you can recover manually.
## Cluster Disaster Recovery
Cluster restoration is much less common but critical when you need to rebuild your entire infrastructure.
### Restoring Kubernetes Resources
To restore all cluster resources from a backup:
```bash
# Download cluster backup
restic restore --tag cluster latest --target ./restore/
# Apply all resources
kubectl apply -f restore/cluster/all-resources.yaml
```
You can also restore specific types of resources:
```bash
kubectl apply -f restore/cluster/secrets.yaml
kubectl apply -f restore/cluster/configmaps.yaml
```
### Restoring etcd State
**Warning: This is extremely dangerous and will affect your entire cluster.**
etcd restoration should only be done when rebuilding a cluster from scratch. For Talos clusters:
```bash
talosctl --nodes <control-plane-ip> etcd restore --from ./restore/cluster/etcd-snapshot.db
```
This command stops etcd, replaces its data with the backup, and restarts the cluster. Expect significant downtime while the cluster rebuilds itself.
## Common Disaster Recovery Scenarios
### Complete Application Loss
When an entire application is gone (namespace deleted, pods corrupted, etc.):
```bash
# Make sure the namespace exists
kubectl create namespace discourse --dry-run=client -o yaml | kubectl apply -f -
# Apply the application manifests if needed
kubectl apply -f apps/discourse/
# Restore the application data
wild-app-restore discourse
```
### Complete Cluster Rebuild
When rebuilding a cluster from scratch:
First, build your new cluster infrastructure and install wild-cloud components. Then configure backup access so you can reach your backup repository.
Restore cluster state:
```bash
restic restore --tag cluster latest --target ./restore/
# Apply etcd snapshot using appropriate method for your cluster type
```
Finally, restore all applications:
```bash
# See what applications are backed up
wild-app-restore --list
# Restore each application individually
wild-app-restore discourse
wild-app-restore gitea
wild-app-restore immich
```
### Rolling Back After Bad Changes
Sometimes you need to undo recent changes to an application:
```bash
# See available snapshots
wild-app-restore discourse --list
# Restore from before the problematic changes
wild-app-restore discourse abc123
```
## Cross-Cluster Migration
You can use backups to move applications between clusters:
On the source cluster, create a fresh backup:
```bash
wild-app-backup discourse
```
On the target cluster, deploy the application manifests:
```bash
kubectl apply -f apps/discourse/
```
Then restore the data:
```bash
wild-app-restore discourse
```
## Verifying Successful Restores
After any restore, verify that everything is working correctly.
For databases, check that you can connect and see expected data:
```bash
kubectl exec -n postgres deploy/postgres-deployment -- \
psql -U postgres -d discourse -c "SELECT count(*) FROM posts;"
```
For file storage, check that files exist and applications can start:
```bash
kubectl get pods -n discourse
kubectl logs -n discourse deployment/discourse
```
For web applications, test that you can access them:
```bash
curl -f https://discourse.example.com/latest.json
```
## When Things Go Wrong
### No Snapshots Found
If the restore system can't find backups for an application, check that snapshots exist:
```bash
restic snapshots --tag discourse
```
Make sure you're using the correct app name and that backups were actually created successfully.
### Database Restore Failures
Database restores can fail if the target database isn't accessible or if there are permission issues. Check that your postgres or mysql pods are running and that you can connect to them manually.
Review the restore error messages carefully - they usually indicate whether the problem is with the backup file, database connectivity, or permissions.
### PVC Restore Failures
If PVC restoration fails, check that you have sufficient disk space and that the PVC isn't being used by other pods. The error messages will usually indicate what went wrong.
Most importantly, remember that safety snapshots are preserved when PVC restores fail. You can see them with:
```bash
kubectl get snapshot.longhorn.io -n longhorn-system -l app=wild-app-restore
```
These snapshots let you recover to the pre-restore state if needed.
### Application Won't Start After Restore
If pods fail to start after restoration, check file permissions and ownership. Sometimes the restoration process doesn't perfectly preserve the exact permissions that the application expects.
You can also try scaling the application to zero and back to one, which sometimes resolves transient issues:
```bash
kubectl scale deployment/discourse -n discourse --replicas=0
kubectl scale deployment/discourse -n discourse --replicas=1
```
## Manual Recovery
When automated restore fails, you can always fall back to manual extraction and restoration:
```bash
# Extract backup files to local directory
restic restore --tag discourse latest --target ./manual-restore/
# Manually copy database dump to postgres pod
kubectl cp ./manual-restore/discourse/database_*.dump \
postgres/postgres-deployment-xxx:/tmp/
# Manually restore database
kubectl exec -n postgres deploy/postgres-deployment -- \
pg_restore -U postgres -d discourse /tmp/database_*.dump
```
For file restoration, you'd need to create a utility pod and manually copy files into the PVC.
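A hedged sketch of that manual path, using discourse as the example (PVC name and mount path are placeholders; take the real values from the app's manifests):
```bash
# Scale the app down so nothing else writes to the volume
kubectl scale deployment/discourse -n discourse --replicas=0

# Start a throwaway pod with the PVC mounted at /data
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: restore-util
  namespace: discourse
spec:
  containers:
    - name: util
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: discourse-data
EOF
kubectl wait --for=condition=Ready pod/restore-util -n discourse

# Stream the extracted backup files into the volume, then clean up
tar cf - -C ./manual-restore/discourse/discourse/data . | \
  kubectl exec -i -n discourse restore-util -- tar xf - -C /data
kubectl delete pod restore-util -n discourse
kubectl scale deployment/discourse -n discourse --replicas=1
```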
## Best Practices
Test your restore procedures regularly in a non-production environment. It's much better to discover issues with your backup system during a planned test than during an actual emergency.
Always communicate with users before performing restores, especially if they involve downtime. Document any manual steps you had to take so you can improve the automated process.
After any significant restore, monitor your applications more closely than usual for a few days. Sometimes problems don't surface immediately.
## Security and Access Control
Restore operations are powerful and can be destructive. Make sure only trusted administrators can perform restores, and consider requiring approval or coordination before major restoration operations.
Be aware that cluster restores include all secrets, so they potentially expose passwords, API keys, and certificates. Ensure your backup repository is properly secured.
Remember that Longhorn safety snapshots are preserved when things go wrong. These snapshots may contain sensitive data, so clean them up appropriately once you've resolved any issues.
## What's Next
The best way to get comfortable with restore operations is to practice them in a safe environment. Set up a test cluster and practice restoring applications and data.
Consider creating runbooks for your most likely disaster scenarios, including the specific commands and verification steps for your infrastructure.
Read the [Making Backups](making-backups.md) guide to ensure you're creating the backups you'll need for successful recovery.

View File

@@ -0,0 +1,19 @@
# Troubleshoot Wild Cloud Cluster issues
## General Troubleshooting Steps
1. **Check Node Status**:
```bash
kubectl get nodes
kubectl describe node <node-name>
```
2. **Check Component Status**:
```bash
# Check all pods across all namespaces
kubectl get pods -A
# Look for pods that aren't Running or Ready
kubectl get pods -A | grep -v "Running\|Completed"
```

View File

@@ -0,0 +1,20 @@
# Troubleshoot DNS
If DNS resolution isn't working properly:
1. Check CoreDNS status:
```bash
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -l k8s-app=kube-dns -n kube-system
```
2. Verify CoreDNS configuration:
```bash
kubectl get configmap -n kube-system coredns -o yaml
```
3. Test DNS resolution from inside the cluster:
```bash
kubectl run -i --tty --rm debug --image=busybox --restart=Never -- nslookup kubernetes.default
```

View File

@@ -0,0 +1,18 @@
# Troubleshoot Service Connectivity
If services can't communicate:
1. Check network policies:
```bash
kubectl get networkpolicies -A
```
2. Verify service endpoints:
```bash
kubectl get endpoints -n <namespace>
```
3. Test connectivity from within the cluster:
```bash
kubectl run -i --tty --rm debug --image=busybox --restart=Never -- wget -O- <service-name>.<namespace>
```

View File

@@ -0,0 +1,24 @@
# Troubleshoot TLS Certificates
If services show invalid certificates:
1. Check certificate status:
```bash
kubectl get certificates -A
```
2. Examine certificate details:
```bash
kubectl describe certificate <cert-name> -n <namespace>
```
3. Check for cert-manager issues:
```bash
kubectl get pods -n cert-manager
kubectl logs -l app=cert-manager -n cert-manager
```
4. Verify the Cloudflare API token is correctly set up:
```bash
kubectl get secret cloudflare-api-token -n internal
```

View File

@@ -0,0 +1,246 @@
# Troubleshoot Service Visibility
This guide covers common issues with accessing services from outside the cluster and how to diagnose and fix them.
## Common Issues
External access to your services might fail for several reasons:
1. **DNS Resolution Issues** - Domain names not resolving to the correct IP address
2. **Network Connectivity Issues** - Traffic can't reach the cluster's external IP
3. **TLS Certificate Issues** - Invalid or missing certificates
4. **Ingress/Service Configuration Issues** - Incorrectly configured routing
## Diagnostic Steps
### 1. Check DNS Resolution
**Symptoms:**
- Browser shows "site cannot be reached" or "server IP address could not be found"
- `ping` or `nslookup` commands fail for your domain
- Your service DNS records don't appear in Cloudflare or your DNS provider
**Checks:**
```bash
# Check if your domain resolves (from outside the cluster)
nslookup yourservice.yourdomain.com
# Check if ExternalDNS is running
kubectl get pods -n externaldns
# Check ExternalDNS logs for errors
kubectl logs -n externaldns -l app=external-dns | grep -i error
kubectl logs -n externaldns -l app=external-dns | grep -i "your-service-name"
# Check if Cloudflare API token is configured correctly
kubectl get secret cloudflare-api-token -n externaldns
```
**Common Issues:**
a) **ExternalDNS Not Running**: The ExternalDNS pod is not running or has errors.
b) **Cloudflare API Token Issues**: The API token is invalid, expired, or doesn't have the right permissions.
c) **Domain Filter Mismatch**: ExternalDNS is configured with a `--domain-filter` that doesn't match your domain.
d) **Annotations Missing**: Service or Ingress is missing the required ExternalDNS annotations.
**Solutions:**
```bash
# 1. Recreate Cloudflare API token secret
kubectl create secret generic cloudflare-api-token \
--namespace externaldns \
--from-literal=api-token="your-api-token" \
--dry-run=client -o yaml | kubectl apply -f -
# 2. Check and set proper annotations on your Ingress:
kubectl annotate ingress your-ingress -n your-namespace \
external-dns.alpha.kubernetes.io/hostname=your-service.your-domain.com
# 3. Restart ExternalDNS
kubectl rollout restart deployment -n externaldns external-dns
```
### 2. Check Network Connectivity
**Symptoms:**
- DNS resolves to the correct IP but the service is still unreachable
- Only some services are unreachable while others work
- Network timeout errors
**Checks:**
```bash
# Check if MetalLB is running
kubectl get pods -n metallb-system
# Check MetalLB IP address pool
kubectl get ipaddresspools.metallb.io -n metallb-system
# Verify the service has an external IP
kubectl get svc -n your-namespace your-service
```
**Common Issues:**
a) **MetalLB Configuration**: The IP pool doesn't match your network or is exhausted.
b) **Firewall Issues**: Firewall is blocking traffic to your cluster's external IP.
c) **Router Configuration**: NAT or port forwarding issues if using a router.
**Solutions:**
```bash
# 1. Check and update MetalLB configuration
kubectl apply -f infrastructure_setup/metallb/metallb-pool.yaml
# 2. Check service external IP assignment
kubectl describe svc -n your-namespace your-service
```
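For reference, a pool definition generally looks like the sketch below. The pool name, advertisement name, and address range are placeholders; use unused addresses on your LAN, and treat `infrastructure_setup/metallb/metallb-pool.yaml` as the authoritative copy:
```bash
# Hypothetical MetalLB pool; adjust the address range before applying
cat <<'EOF' | kubectl apply -f -
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.8.240-192.168.8.250
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - default-pool
EOF
```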
### 3. Check TLS Certificates
**Symptoms:**
- Browser shows certificate errors
- "Your connection is not private" warnings
- Cert-manager logs show errors
**Checks:**
```bash
# Check certificate status
kubectl get certificates -A
# Check cert-manager logs
kubectl logs -n cert-manager -l app=cert-manager
# Check if your ingress is using the correct certificate
kubectl get ingress -n your-namespace your-ingress -o yaml
```
**Common Issues:**
a) **Certificate Issuance Failures**: DNS validation or HTTP validation failing.
b) **Wrong Secret Referenced**: Ingress is referencing a non-existent certificate secret.
c) **Expired Certificate**: Certificate has expired and wasn't renewed.
**Solutions:**
```bash
# 1. Check and recreate certificates
kubectl apply -f infrastructure_setup/cert-manager/wildcard-certificate.yaml
# 2. Update ingress to use correct secret
kubectl patch ingress your-ingress -n your-namespace --type=json \
-p='[{"op": "replace", "path": "/spec/tls/0/secretName", "value": "correct-secret-name"}]'
```
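To see which certificate is actually served at the edge (useful for spotting an expired or default certificate), inspect it from outside the cluster:
```bash
# Show subject, issuer, and validity dates of the certificate being served
echo | openssl s_client -connect your-service.your-domain.com:443 \
  -servername your-service.your-domain.com 2>/dev/null | openssl x509 -noout -subject -issuer -dates
```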
### 4. Check Ingress Configuration
**Symptoms:**
- HTTP 404, 503, or other error codes
- Service accessible from inside cluster but not outside
- Traffic routed to wrong service
**Checks:**
```bash
# Check ingress status
kubectl get ingress -n your-namespace
# Check Traefik logs
kubectl logs -n kube-system -l app.kubernetes.io/name=traefik
# Check ingress configuration
kubectl describe ingress -n your-namespace your-ingress
```
**Common Issues:**
a) **Incorrect Service Targeting**: Ingress is pointing to wrong service or port.
b) **Traefik Configuration**: IngressClass or middleware issues.
c) **Path Configuration**: Incorrect path prefixes or regex.
**Solutions:**
```bash
# 1. Verify ingress configuration
kubectl edit ingress -n your-namespace your-ingress
# 2. Check that the referenced service exists
kubectl get svc -n your-namespace
# 3. Restart Traefik if needed
kubectl rollout restart deployment -n kube-system traefik
```
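For comparison, a minimal working ingress for this stack typically looks like the sketch below. The names, host, TLS secret, and the `traefik` ingress class are assumptions to adapt to your setup:
```bash
# Hypothetical minimal ingress; verify service name, port, host, and TLS secret
cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: your-ingress
  namespace: your-namespace
  annotations:
    external-dns.alpha.kubernetes.io/hostname: your-service.your-domain.com
spec:
  ingressClassName: traefik
  rules:
    - host: your-service.your-domain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: your-service
                port:
                  number: 80
  tls:
    - hosts:
        - your-service.your-domain.com
      secretName: your-tls-secret
EOF
```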
## Advanced Diagnostics
For more complex issues, you can use port-forwarding to test services directly:
```bash
# Port-forward the service directly
kubectl port-forward -n your-namespace svc/your-service 8080:80
# Then test locally
curl http://localhost:8080
```
You can also deploy a debug pod to test connectivity from inside the cluster:
```bash
# Start a debug pod
kubectl run -i --tty --rm debug --image=busybox --restart=Never -- sh
# Inside the pod, test DNS and connectivity
nslookup your-service.your-namespace.svc.cluster.local
wget -O- http://your-service.your-namespace.svc.cluster.local
```
## ExternalDNS Specifics
ExternalDNS can be particularly troublesome. Here are specific debugging steps:
1. **Check Log Level**: Set `--log-level=debug` for more detailed logs
2. **Check Domain Filter**: Ensure `--domain-filter` includes your domain
3. **Check Provider**: Ensure `--provider=cloudflare` (or your DNS provider)
4. **Verify API Permissions**: CloudFlare token needs Zone.Zone and Zone.DNS permissions
5. **Check TXT Records**: ExternalDNS uses TXT records for ownership tracking
```bash
# Enable verbose logging: add --log-level=debug to the container args (this triggers a rollout)
kubectl edit deployment -n externaldns external-dns
# Check for specific domain errors
kubectl logs -n externaldns -l app=external-dns | grep -i yourservice.yourdomain.com
```
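You can also verify the ownership records from outside the cluster; if `--txt-prefix` or `--txt-owner-id` is set, the TXT names and contents will reflect those flags:
```bash
# ExternalDNS maintains TXT registry records alongside the DNS records it owns
dig +short TXT yourservice.yourdomain.com
```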
## CloudFlare Specific Issues
When using CloudFlare, additional issues may arise:
1. **API Rate Limiting**: CloudFlare may rate limit frequent API calls
2. **DNS Propagation**: Changes may take time to propagate through CloudFlare's CDN
3. **Proxied Records**: The `external-dns.alpha.kubernetes.io/cloudflare-proxied` annotation controls whether CloudFlare proxies traffic
4. **Access Restrictions**: CloudFlare Access or Page Rules may restrict access
5. **API Token Permissions**: The token must have Zone:Zone:Read and Zone:DNS:Edit permissions
6. **Zone Detection**: If using subdomains, ensure the parent domain is included in the domain filter
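For proxied records (item 3 above), the annotation can be set per ingress. For example, to serve a record DNS-only instead of through CloudFlare's proxy (names are placeholders):
```bash
# Disable CloudFlare proxying for the DNS records created from this ingress
kubectl annotate ingress your-ingress -n your-namespace \
  external-dns.alpha.kubernetes.io/cloudflare-proxied="false" --overwrite
```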
Check CloudFlare dashboard for:
- DNS record existence
- API access logs
- DNS settings including proxy status
- Any error messages or rate limit warnings
View File
@@ -0,0 +1,3 @@
# Upgrade Applications
TBD
View File
@@ -0,0 +1,3 @@
# Upgrade Kubernetes
TBD
View File
@@ -0,0 +1,3 @@
# Upgrade Talos
TBD
View File
@@ -0,0 +1,3 @@
# Upgrade Wild Cloud
TBD
46
docs/security.md Normal file
View File
@@ -0,0 +1,46 @@
# Security
## Best Practices
1. **Keep Everything Updated**:
- Regularly update K3s
- Update all infrastructure components
- Keep application images up to date
2. **Network Security**:
- Use internal services whenever possible
- Limit exposed services to only what's necessary
- Configure your home router's firewall properly
3. **Access Control**:
- Use strong passwords for all services
- Implement a secrets management strategy
- Rotate API tokens and keys regularly
4. **Regular Audits**:
- Review running services periodically
- Check for unused or outdated deployments
- Monitor resource usage for anomalies
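As a concrete starting point for the periodic reviews above, listing every image currently running makes outdated deployments easy to spot:
```bash
# List all unique container images running in the cluster
kubectl get pods -A -o jsonpath="{..image}" | tr -s '[[:space:]]' '\n' | sort -u
```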
## Security Scanning (Future Implementation)
Tools to consider implementing:
1. **Trivy** for image scanning:
```bash
# Example Trivy usage (placeholder)
trivy image <your-image>
```
2. **kube-bench** for Kubernetes security checks:
```bash
# Example kube-bench usage (placeholder)
kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml
```
3. **Falco** for runtime security monitoring:
```bash
# Example Falco installation (placeholder)
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm install falco falcosecurity/falco --namespace falco --create-namespace
```
View File
@@ -0,0 +1,79 @@
#!/bin/bash
# Detect arm or amd
ARCH=$(uname -m)
if [ "$ARCH" != "aarch64" ] && [ "$ARCH" != "x86_64" ]; then
echo "Error: Unsupported architecture $ARCH. Only arm64 and amd64 are supported."
exit 1
fi
ARCH_ABBR="amd64"
if [ "$ARCH" == "aarch64" ]; then
ARCH_ABBR="arm64"
fi
# Ensure the user-level bin directory used by the installers below exists
mkdir -p "$HOME/.local/bin"
# Install kubectl
if ! command -v kubectl &> /dev/null; then
echo "Error: kubectl is not installed. Installing."
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/$ARCH_ABBR/kubectl"
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/$ARCH_ABBR/kubectl.sha256"
echo "$(cat kubectl.sha256) kubectl" | sha256sum --check
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
fi
# Install talosctl
if command -v talosctl &> /dev/null; then
echo "talosctl is already installed."
else
curl -sL https://talos.dev/install | sh
echo "talosctl installed successfully."
fi
# Install gomplate
if command -v gomplate &> /dev/null; then
echo "gomplate is already installed."
else
curl -sSL https://github.com/hairyhenderson/gomplate/releases/latest/download/gomplate_linux-$ARCH_ABBR -o $HOME/.local/bin/gomplate
chmod +x $HOME/.local/bin/gomplate
echo "gomplate installed successfully."
fi
# Install kustomize
if command -v kustomize &> /dev/null; then
echo "kustomize is already installed."
else
curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
mv kustomize $HOME/.local/bin/
echo "kustomize installed successfully."
fi
## Install yq
if command -v yq &> /dev/null; then
echo "yq is already installed."
else
VERSION=v4.45.4
BINARY=yq_linux_$ARCH_ABBR
wget https://github.com/mikefarah/yq/releases/download/${VERSION}/${BINARY}.tar.gz -O - | tar xz
mv ${BINARY} $HOME/.local/bin/yq
chmod +x $HOME/.local/bin/yq
rm yq.1
echo "yq installed successfully."
fi
## Install restic
if command -v restic &> /dev/null; then
echo "restic is already installed."
else
sudo apt-get update
sudo apt-get install -y restic
echo "restic installed successfully."
fi
## Install direnv
if command -v direnv &> /dev/null; then
echo "direnv is already installed."
else
sudo apt-get update
sudo apt-get install -y direnv
echo "direnv installed successfully. Add `eval \"\$(direnv hook bash)\"` to your shell configuration file if not already present."
fi
1
wild-directory Submodule
Submodule wild-directory added at db621755b3